I was just wondering why is that primes are used in a class's hashCode()
method? For example, when using Eclipse to generate my hashCode()
method there is always the prime number 31
used:
我想知道为什么在类的hashCode()方法中使用了质数?例如,当使用Eclipse生成hashCode()方法时,总是使用质数31:
public int hashCode() {
final int prime = 31;
//...
}
References:
引用:
Here is a good primer on Hashcode and article on how hashing works that I found (C# but the concepts are transferrable): Eric Lippert's Guidelines and rules for GetHashCode()
这里有一篇关于哈希代码的入门文章和关于哈希如何工作的文章(c#但是概念是可转移的):Eric Lippert的GetHashCode()的指导方针和规则
83
Because you want the number you are multiplying by and the number of buckets you are inserting into to have orthogonal prime factorizations.
因为你想要你乘以的数和你要插入的桶数来得到正交的质因数分解。
Suppose there are 8 buckets to insert into. If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all). Similar entries will collide. Not good for a hash function.
假设有8个桶要插入。如果要相乘的数字是8的某个倍数,那么插入的桶将只由最不重要的条目(根本没有相乘的条目)决定。类似的条目将碰撞。不适合哈希函数。
31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern java HashMap implementations keep the number of buckets to a power of 2).
31是一个足够大的素数,所以桶的数量不太可能被它整除(实际上,现代java HashMap实现将桶的数量保持为2的幂)。
113
Prime numbers are chosen to best distribute data among hash buckets. If the distribution of inputs is random and evenly spread, then the choice of the hash code/modulus does not matter. It only has an impact when there is a certain pattern to the inputs.
选择素数以便在哈希桶之间最好地分配数据。如果输入的分布是随机的、均匀分布的,那么选择散列码/模数并不重要。只有当输入有特定的模式时才会产生影响。
This is often the case when dealing with memory locations. For example, all 32-bit integers are aligned to addresses divisible by 4. Check out the table below to visualize the effects of using a prime vs. non-prime modulus:
在处理内存位置时通常是这样。例如,所有32位整数都对齐到可被4整除的地址。查看下面的表格,以可视化使用prime和非prime模数的效果:
Input Modulo 8 Modulo 7
0 0 0
4 4 4
8 0 1
12 4 5
16 0 2
20 4 6
24 0 3
28 4 0
Notice the almost-perfect distribution when using a prime modulus vs. a non-prime modulus.
注意,当使用素数模和非素数模时,几乎完美的分布。
However, although the above example is largely contrived, the general principle is that when dealing with a pattern of inputs, using a prime number modulus will yield the best distribution.
然而,尽管上面的例子很大程度上是人为设计的,但一般的原则是,在处理输入模式时,使用素数模量将产生最佳的分布。
21
For what it's worth, Effective Java 2nd Edition hand-waives around the mathematics issue and just say that the reason to choose 31 is:
值得注意的是,有效的Java第二版围绕着数学问题手把手地说,选择31的原因是:
Here's the full quote, from Item 9: Always override hashCode
when you override equals
:
下面是第9项的完整引用:当你重写=时,总是重写hashCode:
The value 31 was chosen because it's an odd prime. If it were even and multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional.
之所以选择31是因为它是一个奇数素数。如果它是偶数并且乘法溢出,信息就会丢失,因为乘法2等同于移位。使用素数的优势不那么明显,但它是传统的。
A nice property of 31 is that the multiplication can be replaced by a shift (§15.19) and subtraction for better performance:
31日是一个很好的属性,乘法可以被转变为更好的性能(§15.19)和减法:
31 * i == (i <<5) - i
Modern VMs do this sort of optimization automatically.
现代虚拟机自动执行这种优化。
While the recipe in this item yields reasonably good hash functions, it does not yield state-of-the-art hash functions, nor do Java platform libraries provide such hash functions as of release 1.6. Writing such hash functions is a research topic, best left to mathematicians and theoretical computer scientists.
虽然本项目中的菜谱产生了相当好的散列函数,但它并没有生成最先进的散列函数,Java平台库也没有提供1.6版那样的散列函数。写这样的哈希函数是一个研究课题,最好留给数学家和理论计算机科学家。
Perhaps a later release of the platform will provide state-of-the-art hash functions for its classes and utility methods to allow average programmers to construct such hash functions. In the meantime, the techniques described in this item should be adequate for most applications.
平台的后续版本可能会为它的类和实用方法提供最先进的哈希函数,以允许普通程序员构建此类哈希函数。与此同时,本项目中所描述的技术应该适用于大多数应用程序。
Rather simplistically, it can be said that using a multiplier with numerous divisors will result in more hash collisions. Since for effective hashing we want to minimize the number of collisions, we try to use a multiplier that has fewer divisors. A prime number by definition has exactly two distinct, positive divisors.
简单地说,使用带有多个除数的乘数会导致更多的散列冲突。因为为了有效的散列,我们希望最小化冲突的数量,所以我们尝试使用一个具有更少的因数的乘数。一个素数根据定义有两个截然不同的正因数。
4
I heard that 31 was chosen so that the compiler can optimize the multiplication to left-shift 5 bits then subtract the value.
我听说31被选择了,这样编译器就可以将乘法优化为左移5位,然后减去这个值。
2
Here's a citation a little closer to the source.
这里有一个引文更接近源。
It boils down to:
它可以归结为:
2
First you compute the hash value modulo 2^32 (the size of an int
), so you want something relatively prime to 2^32 (relatively prime means that there are no common divisors). Any odd number would do for that.
首先计算出散列值模2 ^ 32(int)的大小,所以你想要相对' 2 ^ 32(互质意味着没有公共因子)。任何奇数都可以。
Then for a given hash table the index is usually computed from the hash value modulo the size of the hash table, so you want something that is relatively prime to the size of the hash table. Often the sizes of hash tables are chosen as prime numbers for that reason. In the case of Java the Sun implementation makes sure that the size is always a power of two, so an odd number would suffice here, too. There is also some additional massaging of the hash keys to limit collisions further.
然后对于给定的哈希表,索引通常由哈希值模块计算哈希表的大小,因此您需要一个相对于哈希表大小的素数。通常由于这个原因,哈希表的大小被选择为素数。在Java的例子中,Sun实现确保大小总是2的乘方,所以在这里一个奇数也足够了。还有一些哈希键的附加修改,以进一步限制冲突。
The bad effect if the hash table and the multiplier had a common factor n
could be that in certain circumstances only 1/n entries in the hash table would be used.
如果哈希表和乘数有一个公因式n,那么不好的影响可能是在某些情况下,在哈希表中只使用1/n项。
0
It generally helps achieve a more even spread of your data among the hash buckets, especially for low-entropy keys.
它通常有助于在hash bucket中实现更均匀的数据分布,特别是对于低熵的键。
0
31 is also specific to Java HashMap which uses a int as hash data type. Thus the max capacity of 2^32. There is no point in using larger Fermat or Mersenne primes.
31也是特定于Java HashMap的,它使用int作为散列数据类型。因此,最大容量为2 ^ 32。使用更大的Fermat或Mersenne素数是没有意义的。