Another question asks if there are some basic low-level things that all programmers should know, and I think hash lookups are one of those. So here goes.
另一个问题是,是否存在所有程序员都应该知道的基本低级事物,我认为哈希查找就是其中之一。所以这里。
A hash table (note that I'm not using an actual classname) is basically an array of linked lists. To find something in the table, you first compute the hashcode of that something, then mod it by the size of the table. This is an index into the array, and you get a linked list at that index. You then traverse the list until you find your object.
哈希表(注意我没有使用实际的类名)基本上是一个链表的数组。要在表中查找内容,首先要计算该内容的哈希码,然后根据表的大小对其进行修改。这是数组的索引,您将获得该索引的链接列表。然后,您遍历列表,直到找到您的对象。
Since array retrieval is O(1), and linked list traversal is O(n), you want a hash function that creates as random a distribution as possible, so that objects will be hashed to different lists. Every object could return the value 0 as its hashcode, and a hash table would still work, but it would essentially be a long linked-list at element 0 of the array.
由于数组检索是O(1),并且链表遍历是O(n),因此您需要一个尽可能随机创建分布的哈希函数,以便将对象散列到不同的列表。每个对象都可以返回值0作为其哈希码,并且哈希表仍然可以工作,但它本质上是数组元素0处的长链表。
You also generally want the array to be large, which increases the chances that the object will be in a list of length 1. The Java HashMap, for example, increases the size of the array when the number of entries in the map is > 75% of the size of the array. There's a tradeoff here: you can have a huge array with very few entries and waste memory, or a smaller array where each element in the array is a list with > 1 entries, and waste time traversing. A perfect hash would assign each object to a unique location in the array, with no wasted space.
您通常也希望数组很大,这会增加对象在长度为1的列表中的可能性。例如,当地图中的条目数大于75时,Java HashMap会增加数组的大小。数组大小的百分比。这里有一个权衡:你可以拥有一个包含极少数条目和浪费内存的巨大数组,或者一个较小的数组,其中数组中的每个元素都是一个包含> 1个条目的列表,并且浪费时间遍历。完美的哈希会将每个对象分配给数组中的唯一位置,而不会浪费空间。
The term "perfect hash" is a real term, and in some cases you can create a hash function that provides a unique number for each object. This is only possible when you know the set of all possible values. In the general case, you can't achieve this, and there will be some values that return the same hashcode. This is simple mathematics: if you have a string that's more than 4 bytes long, you can't create a unique 4-byte hashcode.
术语“完美散列”是一个实际术语,在某些情况下,您可以创建一个散列函数,为每个对象提供唯一的数字。只有在知道所有可能值的集合时才可以执行此操作。在一般情况下,您无法实现此目的,并且会有一些值返回相同的哈希码。这是简单的数学:如果您的字符串长度超过4个字节,则无法创建唯一的4字节哈希码。
One interesting tidbit: hash arrays are generally sized based on prime numbers, to give the best chance for random allocation when you mod the results, regardless of how random the hashcodes really are.
一个有趣的花絮:哈希数组通常根据素数来确定大小,以便在修改结果时为随机分配提供最佳机会,无论哈希码实际上是多么随机。
Edit based on comments:
根据评论进行编辑:
1) A linked list is not the only way to represent the objects that have the same hashcode, although that is the method used by the JDK 1.5 HashMap. Although less memory-efficient than a simple array, it does arguably create less churn when rehashing (because the entries can be unlinked from one bucket and relinked to another).
1)链表不是表示具有相同哈希码的对象的唯一方法,尽管这是JDK 1.5 HashMap使用的方法。虽然比简单数组的内存效率更低,但在重新散列时可能会产生更少的流失(因为条目可以从一个存储桶取消链接并重新链接到另一个存储桶)。
2) As of JDK 1.4, the HashMap class uses an array sized as a power of 2; prior to that it used 2^N+1, which I believe is prime for N <= 32. This does not speed up array indexing per se, but does allow the array index to be computed with a bitwise AND rather than a division, as noted by Neil Coffey. Personally, I'd question this as premature optimization, but given the list of authors on HashMap, I'll assume there is some real benefit.
2)从JDK 1.4开始,HashMap类使用一个大小为2的幂的数组;在此之前,它使用2 ^ N + 1,我认为这是N <= 32的素数。这不会加速数组索引本身,但允许使用按位AND而不是除法来计算数组索引,如Neil Coffey所述。就个人而言,我认为这是过早的优化,但鉴于HashMap的作者列表,我会假设有一些真正的好处。