Hash Table

2015-11-24 15:31

Concepts

The process of finding a record using some computation to map its key value to a position in the array is called hashing.

The function that maps key values to positions is called a hash function and will be denoted by h.

The array that holds the records is called the hash table and will be denoted by HT.

A position in the hash table is also known as a slot. The number of slots in hash table HT will be denoted by the variable M, with slots numbered from 0 to M - 1.

Hashing is suitable for both in-memory and disk-based searching and is one of the two most widely used methods for organizing large databases stored on disk (the other is the B-tree).

Because the possible key range is larger than the size of the table, at least some of the slots must be mapped to from multiple key values. Given a hash function h and two keys k1 and k2, if h(k1) = B = h(k2), where B is a slot in the table, then we say that k1 and k2 have a collision at slot B under hash function h.

Finding a record with key value K in a database organized by hashing follows a two-step procedure:

Compute the table location h(K).

Starting with slot h(K), locate the record containing key K using (if necessary) a collision resolution policy.

Hash Functions

Requirements

A hash function must return a value within the hash table range.

To be practical, a hash function should evenly distribute the records stored among the hash table slots.

When designing hash functions, we are generally faced with one of two situations.

We know nothing about the distribution of the incoming keys. In this case, we wish to select a hash function that evenly distributes the key range across the hash table, while avoiding obvious opportunities for clustering such as hash functions that are sensitive to the high- or low-order bits of the key value.

We know something about the distribution of the incoming keys. In this case, we should use a distribution-dependent hash function that avoids assigning clusters of related key values to the same hash table slot. For example, if hashing English words, we should not hash on the value of the first character because this is likely to be unevenly distributed.

Common Hash Functions

Direct addressing: H(key) = key, or H(key) = a × key + b.

Division method: H(key) = key MOD p, or H(key) = key MOD p + c.

Here p is a prime number with p ≤ M, the table size.
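The division method can be sketched in a few lines of C++. This is only an illustration: the function name `divisionHash` and the choice of p = 97 in the usage note are assumptions, not part of the original notes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Division-method hash: H(key) = key MOD p.
// p is assumed to be a prime no larger than the table size M.
std::size_t divisionHash(std::uint64_t key, std::size_t p) {
    assert(p > 0);  // a modulus of zero is undefined
    return static_cast<std::size_t>(key % p);
}
```

With p = 97, `divisionHash(4731, 97)` returns 75, because 4731 = 48 × 97 + 75.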

Digit analysis: choose the most evenly distributed digits of the key as H(key).

Middle-square method: square the key and take digits from the middle of the result. For example, 4731 × 4731 = 22,382,361.

How many digits to select depends on the size of the array. If the size is 100, select the 2 middle digits, so the position of key 4731 is 82.

This method suits keys that are well distributed but whose range is much larger than the size of the array: we select some digits (normally the middle ones) of the key's square.
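The middle-square example can be checked with a short sketch. It works on decimal digits, as the example does; the helper name `middleSquareHash` and the rule for picking the middle (round down when the leftover digit count is odd) are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Middle-square hash on decimal digits: square the key, then keep
// `digits` digits taken from the middle of the square.
unsigned middleSquareHash(std::uint32_t key, std::size_t digits) {
    const std::string sq =
        std::to_string(static_cast<std::uint64_t>(key) * key);
    assert(digits > 0 && digits <= sq.size());
    // Index of the first kept digit; rounds down if the split is uneven.
    const std::size_t start = (sq.size() - digits) / 2;
    return static_cast<unsigned>(std::stoul(sq.substr(start, digits)));
}
```

For key 4731 the square is 22382361, and `middleSquareHash(4731, 2)` keeps the digits "82", matching the position given in the text.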

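Folding method: split the key into groups of digits and add the groups. For example, if the key is 542242241, then 542 + 242 + 241 = 1025. Keeping only the low-order digits of the sum, the position of this key is 25.

Folding is useful when the key is long.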
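A minimal sketch of folding follows. Two details are assumptions that the text does not spell out: the key is split into groups starting from the low-order end, and the sum is reduced modulo the table size (1000 in the usage note). The name `foldingHash` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Folding hash: split the decimal key into groups of `groupDigits`
// digits (starting at the low-order end), add the groups, and reduce
// the sum modulo the table size.
std::size_t foldingHash(std::uint64_t key, unsigned groupDigits,
                        std::size_t tableSize) {
    assert(groupDigits > 0 && groupDigits <= 9 && tableSize > 0);
    std::uint64_t base = 1;
    for (unsigned d = 0; d < groupDigits; ++d) base *= 10;  // 10^groupDigits
    std::uint64_t sum = 0;
    while (key > 0) {
        sum += key % base;  // lowest remaining group
        key /= base;
    }
    return static_cast<std::size_t>(sum % tableSize);
}
```

`foldingHash(542242241, 3, 1000)` adds 241 + 242 + 542 = 1025 and returns 25. Because the key length is a multiple of three, splitting from either end yields the same groups as in the example.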

Collision Resolution Policy

Classes

Open hashing (separate chaining): collisions are stored outside the table.

Closed hashing (open addressing): collisions result in storing one of the records at another slot in the table.

Open Hashing

The simplest form of open hashing defines each slot in the hash table to be the head of a linked list. All records that hash to a particular slot are placed on that slot’s linked list.

Records within a slot’s list can be ordered in several ways: by insertion order, by key value order, or by frequency-of-access order.

Open hashing is most appropriate when the hash table is kept in main memory, with the lists implemented by a standard in-memory linked list.
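As a rough illustration of open hashing, the sketch below keeps one singly linked list per slot. The class name `ChainedTable`, the integer keys, the string values, and the `key MOD M` hash function are all assumptions chosen to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <string>
#include <utility>
#include <vector>

// Open hashing sketch: each slot is the head of a linked list of
// (key, value) records that hash to that slot.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t slots) : table_(slots) {
        assert(slots > 0);
    }

    // Insert a record, or overwrite the value if the key already exists.
    // New records go to the front of the slot's list.
    void insert(int key, std::string value) {
        auto& chain = table_[home(key)];
        for (auto& rec : chain) {
            if (rec.first == key) {
                rec.second = std::move(value);
                return;
            }
        }
        chain.emplace_front(key, std::move(value));
    }

    // Returns a pointer to the stored value, or nullptr if absent.
    const std::string* find(int key) const {
        for (const auto& rec : table_[home(key)]) {
            if (rec.first == key) return &rec.second;
        }
        return nullptr;
    }

private:
    std::size_t home(int key) const {
        return static_cast<std::size_t>(static_cast<unsigned>(key)) %
               table_.size();
    }

    std::vector<std::forward_list<std::pair<int, std::string>>> table_;
};
```

With 7 slots, keys 3 and 10 collide at slot 3 and simply share that slot's list, so both remain findable.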

Closed Hashing

Closed hashing stores all records directly in the hash table. Each record R with key value $k_R$ has a home position $h(k_R)$, the slot computed by the hash function.

If R is to be inserted and another record already occupies R’s home position, then R will be stored at some other slot in the table. It is the business of the collision resolution policy to determine which slot that will be.

Naturally, the same policy must be followed during search as during insertion, so that any record not found in its home position can be recovered by repeating the collision resolution process.

Bucket Hashing

One implementation for closed hashing groups hash table slots into buckets. The M slots of the hash table are divided into B buckets, with each bucket consisting of M/B slots.

The hash function assigns each record to the first slot within one of the buckets. If this slot is already occupied, then the bucket slots are searched sequentially until an open slot is found. If a bucket is entirely full, then the record is stored in an overflow bucket of infinite capacity at the end of the table. All buckets share the same overflow bucket.

A good implementation will use a hash function that distributes the records evenly among the buckets so that as few records as possible go into the overflow bucket.

Bucket methods are good for implementing hash tables stored on disk, because the bucket size can be set to the size of a disk block. Whenever search or insertion occurs, the entire bucket is read into memory.
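The bucket scheme described above might look roughly like this in memory. The sketch makes several assumptions: keys are non-negative integers, the bucket is chosen by `key MOD B`, the overflow bucket is a growable vector, and deletion is not supported. `BucketTable` and `kEmptyBucketSlot` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kEmptyBucketSlot = -1;  // assumes keys are non-negative

// Bucket hashing sketch: M = B * slotsPerBucket slots plus one shared
// overflow bucket.
class BucketTable {
public:
    BucketTable(std::size_t buckets, std::size_t slotsPerBucket)
        : buckets_(buckets),
          perBucket_(slotsPerBucket),
          slots_(buckets * slotsPerBucket, kEmptyBucketSlot) {
        assert(buckets > 0 && slotsPerBucket > 0);
    }

    void insert(int key) {
        assert(key >= 0);
        const std::size_t first = bucketOf(key) * perBucket_;
        for (std::size_t i = 0; i < perBucket_; ++i) {
            if (slots_[first + i] == kEmptyBucketSlot) {
                slots_[first + i] = key;
                return;
            }
        }
        overflow_.push_back(key);  // bucket is full
    }

    bool contains(int key) const {
        const std::size_t first = bucketOf(key) * perBucket_;
        for (std::size_t i = 0; i < perBucket_; ++i) {
            if (slots_[first + i] == key) return true;
            // Without deletions, an empty slot ends the bucket search.
            if (slots_[first + i] == kEmptyBucketSlot) return false;
        }
        return std::find(overflow_.begin(), overflow_.end(), key) !=
               overflow_.end();
    }

    std::size_t overflowCount() const { return overflow_.size(); }

private:
    std::size_t bucketOf(int key) const {
        return static_cast<std::size_t>(key) % buckets_;
    }

    std::size_t buckets_;
    std::size_t perBucket_;
    std::vector<int> slots_;
    std::vector<int> overflow_;
};
```

With 2 buckets of 2 slots each, inserting keys 0, 2, and 4 fills bucket 0 and sends key 4 to the overflow bucket.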

Linear Probing

The most commonly used form of hashing is closed hashing with no bucketing and a collision resolution policy that can potentially use any slot in the hash table. Linear probing is the simplest such policy.

We can view any collision resolution method as generating a sequence of hash table slots that can potentially hold the record. The first slot in the sequence will be the home position for the key. If the home position is occupied, then the collision resolution policy goes to the next slot in the sequence. If this is occupied as well, then another slot must be found, and so on. This sequence of slots is known as the probe sequence.

In linear probing, if the home position for the record is occupied, we move down the table, wrapping around at the end, until a free slot is found. In general, the ith slot in the probe sequence is (h(k) + P(k, i)) mod M, where P is the probe function. For linear probing, the probe function is P(k, i) = i.

Primary clustering is the tendency of certain open-addressing collision resolution schemes to create long runs of filled slots. It is most commonly discussed as a problem with linear probing.
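A minimal sketch of insertion and search under linear probing, where the ith probe lands on slot (h(K) + i) mod M. It assumes non-negative integer keys, h(K) = K mod M, and no deletions; the names `linearInsert`, `linearSearch`, and `kFreeSlot` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kFreeSlot = -1;  // assumes keys are non-negative

// Stores `key` and returns its slot, or ht.size() if the table is full.
std::size_t linearInsert(std::vector<int>& ht, int key) {
    assert(key >= 0 && !ht.empty());
    const std::size_t m = ht.size();
    const std::size_t home = static_cast<std::size_t>(key) % m;
    for (std::size_t i = 0; i < m; ++i) {
        const std::size_t pos = (home + i) % m;  // P(k, i) = i
        if (ht[pos] == kFreeSlot || ht[pos] == key) {
            ht[pos] = key;
            return pos;
        }
    }
    return m;
}

// Follows the same probe sequence as insertion. Returns the slot
// holding `key`, or ht.size() if the key is not in the table.
std::size_t linearSearch(const std::vector<int>& ht, int key) {
    assert(key >= 0 && !ht.empty());
    const std::size_t m = ht.size();
    const std::size_t home = static_cast<std::size_t>(key) % m;
    for (std::size_t i = 0; i < m; ++i) {
        const std::size_t pos = (home + i) % m;
        if (ht[pos] == key) return pos;
        if (ht[pos] == kFreeSlot) return m;  // free slot ends the search
    }
    return m;
}
```

In a table of 10 slots, keys 9, 19, and 29 all have home position 9 and end up in slots 9, 0, and 1. That run of adjacent filled slots is a small instance of primary clustering.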

Pseudo-Random Probing

The ideal probe function would select the next position on the probe sequence at random (in practice, pseudo-randomly) from among the unvisited slots; that is, the probe sequence should be a random permutation of the hash table positions.

In pseudo-random probing, the $i$th slot in the probe sequence is $(h(K) + r_i) \bmod M$, where $r_i$ is the $i$th value in a random permutation of the numbers from 1 to $M - 1$. The probe function is $P(k, i) = \mathrm{Perm}[i - 1]$, where Perm is an array holding that permutation. All insertion and search operations use the same random permutation.
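One way to sketch pseudo-random probing is to build the permutation once with a fixed seed, so every insertion and search sees the same sequence. The use of `std::mt19937` and the function names below are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Builds a pseudo-random permutation of 1 .. M-1. The seed is fixed by
// the caller so that all operations share the same permutation.
std::vector<std::size_t> makeProbePermutation(std::size_t m, unsigned seed) {
    assert(m >= 2);
    std::vector<std::size_t> perm(m - 1);
    std::iota(perm.begin(), perm.end(), std::size_t{1});
    std::mt19937 gen(seed);
    std::shuffle(perm.begin(), perm.end(), gen);
    return perm;
}

// Slot visited at step i: step 0 is the home position, and step i >= 1
// is (home + Perm[i - 1]) mod M.
std::size_t pseudoRandomSlot(std::size_t home, std::size_t i,
                             const std::vector<std::size_t>& perm) {
    const std::size_t m = perm.size() + 1;
    assert(home < m && i < m);
    return i == 0 ? home : (home + perm[i - 1]) % m;
}
```

Because the offsets are a permutation of 1 to M - 1, the M steps of any probe sequence visit every slot exactly once, whatever the seed.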

Quadratic Probing

The probe function in quadratic probing is a quadratic function of $i$: $P(k, i) = c_1 i^2 + c_2 i + c_3$ for some choice of constants $c_1$, $c_2$, and $c_3$.

Under quadratic probing, two keys with different home positions will have diverging probe sequences.

Unfortunately, quadratic probing has the disadvantage that typically not all hash table slots will be on the probe sequence. For many hash table sizes, this probe function will cycle through a relatively small number of slots. If all slots on that cycle happen to be full, then the record cannot be inserted at all!

Fortunately, it is possible to get good results from quadratic probing at low cost. The right combination of probe function and table size will visit many slots in the table.

In particular, if the hash table size is a prime number and the probe function is $P(k, i) = i^2$, then at least half the slots in the table will be visited. Thus, if the table is less than half full, we can be certain that a free slot will be found.

Alternatively, if the hash table size is a power of two and the probe function is $P(k, i) = (i^2 + i)/2$, then every slot in the table will be visited by the probe function.
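Both claims about slot coverage can be checked empirically for small tables. The sketch below counts how many distinct slots the first M probes reach, taking the home position as 0; `countDistinct`, `squareProbe`, and `triangularProbe` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// P(k, i) = i^2
std::size_t squareProbe(std::size_t i) { return i * i; }

// P(k, i) = (i^2 + i) / 2
std::size_t triangularProbe(std::size_t i) { return (i * i + i) / 2; }

// Counts the distinct slots reached by probes i = 0 .. m-1 in a table
// of m slots, with the home position taken as slot 0.
std::size_t countDistinct(std::size_t m, std::size_t (*probe)(std::size_t)) {
    assert(m > 0);
    std::vector<bool> seen(m, false);
    std::size_t count = 0;
    for (std::size_t i = 0; i < m; ++i) {
        const std::size_t slot = probe(i) % m;
        if (!seen[slot]) {
            seen[slot] = true;
            ++count;
        }
    }
    return count;
}
```

For the prime size 11, `countDistinct(11, squareProbe)` is 6, just over half the table. For the power-of-two sizes 8 and 16, `triangularProbe` reaches all 8 and all 16 slots.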

Both pseudo-random probing and quadratic probing eliminate primary clustering, which is the problem of keys sharing substantial segments of a probe sequence.

If two keys hash to the same home position, however, then they will always follow the same probe sequence for every collision resolution method that we have seen so far. The probe sequences generated by pseudo-random and quadratic probing (for example) are entirely a function of the home position, not the original key value.

If the hash function generates a cluster at a particular home position, then the cluster remains under pseudo-random and quadratic probing. This problem is called secondary clustering.

Double Hashing

To avoid secondary clustering, we need to have the probe sequence make use of the original key value in its decision-making process.

A simple technique for doing this is to return to linear probing by a constant step size for the probe function, but to have that constant be determined by a second hash function, $h_2$. The probe function is $P(k, i) = i \cdot h_2(k)$.

Pseudo-random or quadratic probing can be combined with double hashing to solve this problem.

A good implementation of double hashing should ensure that all of the probe sequence constants are relatively prime to the table size M. This can be achieved easily.

One way is to select M to be a prime number, and have $h_2$ return a value in the range $1 \le h_2(k) \le M - 1$.

Another way is to set $M = 2^m$ for some value $m$ and have $h_2$ return an odd value between 1 and $2^m$.
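A small sketch of double hashing for a prime table size follows. The second hash function used here, 1 + (k mod (M - 1)), is one common choice that stays in the range 1 to M - 1. It is an assumption, not something the text prescribes, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed second hash: always in 1 .. M-1, so the step is never zero.
std::size_t secondHash(std::size_t key, std::size_t m) {
    assert(m >= 2);
    return 1 + key % (m - 1);
}

// Slot visited at step i: (h(k) + i * h2(k)) mod M, with h(k) = k mod M.
std::size_t doubleHashSlot(std::size_t key, std::size_t i, std::size_t m) {
    return (key % m + i * secondHash(key, m)) % m;
}
```

With M = 11, keys 5 and 16 share home position 5, but their step sizes are 6 and 7, so their next probes land on slots 0 and 1. The two probe sequences diverge immediately, which is what avoids secondary clustering.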

Make Closed Hashing More Efficient

Recall the 80/20 rule: 80% of the accesses will come to 20% of the data. In other words, some records are accessed more frequently.

According to the 80/20 rule, the record with higher frequency of access should be placed in the home position, because this will reduce the total number of record accesses. Ideally, records along a probe sequence will be ordered by their frequency of access.

One approach to approximating this goal is to modify the order of records along the probe sequence whenever a record is accessed. If a search is made to a record that is not in its home position, a self-organizing list heuristic can be used.

Another approach is to keep access counts for records and periodically rehash the entire table. The records should be inserted into the hash table in frequency order, ensuring that records that were frequently accessed during the last series of requests have the best chance of being near their home positions.

Deletion

When deleting records from a hash table, there are two important considerations.

Deleting a record must not hinder later searches. In other words, the search process must still pass through the newly emptied slot to reach records whose probe sequence passed through this slot. Thus, the delete process cannot simply mark the slot as empty, because this will isolate records further down the probe sequence.

We do not want to make positions in the hash table unusable because of deletion. The freed slot should be available to a future insertion.

Both of these problems can be resolved by placing a special mark in place of the deleted record, called a tombstone.

The tombstone indicates that a record once occupied the slot but does so no longer.

If a tombstone is encountered when searching along a probe sequence, the search continues past it.

When a tombstone is encountered during insertion, that slot can be used to store the new record. To avoid inserting duplicate keys, however, the insertion must still follow the probe sequence until a truly empty position is found, simply to verify that a duplicate is not in the table. The new record is then inserted into the slot of the first tombstone encountered.
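The tombstone rules can be sketched on top of linear probing. The three-state slot, the class name `TombstoneTable`, and the `key MOD M` hash function are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class SlotState { Empty, Tombstone, Full };

struct ProbeSlot {
    SlotState state = SlotState::Empty;
    int key = 0;
};

class TombstoneTable {
public:
    explicit TombstoneTable(std::size_t m) : slots_(m) { assert(m > 0); }

    bool contains(int key) const { return locate(key) < slots_.size(); }

    // Returns false if the key is already present or no slot is usable.
    bool insert(int key) {
        const std::size_t m = slots_.size();
        std::size_t target = m;  // first tombstone or empty slot seen
        for (std::size_t i = 0; i < m; ++i) {
            const std::size_t pos = (home(key) + i) % m;
            const ProbeSlot& s = slots_[pos];
            if (s.state == SlotState::Full) {
                if (s.key == key) return false;  // duplicate found
                continue;
            }
            if (target == m) target = pos;
            // A truly empty slot proves the key is not further along.
            if (s.state == SlotState::Empty) break;
        }
        if (target == m) return false;
        slots_[target].state = SlotState::Full;
        slots_[target].key = key;
        return true;
    }

    // Replaces the record with a tombstone instead of emptying the slot.
    bool erase(int key) {
        const std::size_t pos = locate(key);
        if (pos == slots_.size()) return false;
        slots_[pos].state = SlotState::Tombstone;
        return true;
    }

private:
    std::size_t home(int key) const {
        return static_cast<std::size_t>(static_cast<unsigned>(key)) %
               slots_.size();
    }

    // Searches past tombstones and stops only at a truly empty slot.
    std::size_t locate(int key) const {
        const std::size_t m = slots_.size();
        for (std::size_t i = 0; i < m; ++i) {
            const std::size_t pos = (home(key) + i) % m;
            const ProbeSlot& s = slots_[pos];
            if (s.state == SlotState::Empty) return m;
            if (s.state == SlotState::Full && s.key == key) return pos;
        }
        return m;
    }

    std::vector<ProbeSlot> slots_;
};
```

In a table of 10 slots holding 9, 19, and 29 (slots 9, 0, and 1), erasing 19 leaves a tombstone in slot 0. A search for 29 still passes through it, and a later insertion of 39 reuses slot 0 after confirming that 39 is not already further along the probe sequence.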

The use of tombstones allows searches to work correctly and allows reuse of deleted slots. However, after a series of intermixed insertion and deletion operations, some slots will contain tombstones. This will tend to lengthen the average distance from a record’s home position to the record itself, beyond where it could be if the tombstones did not exist.

Two possible solutions to this problem are:

Do a local reorganization upon deletion to try to shorten the average path length. For example, after deleting a key, continue to follow the probe sequence of that key and swap records further down the probe sequence into the slot of the recently deleted record (being careful not to remove any key from its probe sequence). This will not work for all collision resolution policies.

Periodically rehash the table by reinserting all records into a new hash table. Not only will this remove the tombstones, but it also provides an opportunity to place the most frequently accessed records into their home positions.
Tags: C++, data structures