Data Structures and Algorithms

Course slides: Hashing (www.mif.vu.lt/~algis)


Data Structures for Sets
- Many applications deal with sets:
  - Compilers have symbol tables (a set of variables, classes).
  - A dictionary is a set of words.
  - Routers have sets of forwarding rules.
  - Web servers have sets of clients, etc.
- A set is a collection of members:
  - No repetition of members.
  - Members may themselves be sets.

Data Structures for Sets - Examples
- The set of the first 5 natural numbers: {1, 2, 3, 4, 5}
- {x | x is a positive integer and x < 100}
- {x | x is a CA driver with > 10 years of driving experience and 0 accidents in the last 3 years}

Set Operations
- Unary operations: min, max, sort, makenull, ...
- Binary operations, classified by the types of the two operands:
  - member and member: order (=, <)
  - member and set: find, insert, delete, split, ...
  - set and set: union, intersection, difference, equal, ...

Observations
- A set plus a choice of operations defines an ADT:
  - a set + insert, delete, find
  - a set + an ordering
  - multiple sets + union, insert, delete
  - multiple sets + merge
  - etc.
- Depending on the type of the members and the choice of operations, different implementations can have different asymptotic complexity.

Dictionary ADT
Maintain a set of items with distinct keys, supporting:
- find(k): find the item with key k
- insert(x): insert item x into the dictionary
- remove(k): delete the item with key k
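As a concrete illustration, here is a minimal sketch of the Dictionary ADT as a C++ interface. The names Dictionary, Item and Key are placeholders invented for this example, not part of the slides.

#include <optional>
#include <string>

// Hypothetical key and item types, chosen only for illustration.
using Key = int;
struct Item {
    Key key;
    std::string value;
};

// The Dictionary ADT: exactly the three operations listed above.
class Dictionary {
public:
    virtual ~Dictionary() = default;
    virtual std::optional<Item> find(Key k) const = 0;  // find item with key k
    virtual void insert(const Item& x) = 0;             // insert item x
    virtual void remove(Key k) = 0;                     // delete item with key k
};

The rest of these slides are, in effect, different ways of implementing such an interface at different costs.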

Where do we use them?
- Symbol tables for a compiler
- Customer records (access by name)
- Games (positions, configurations)
- Spell checkers
- Peer-to-peer systems (access songs by name), etc.

Naive Implementations
- The simplest possible scheme to implement a dictionary is a log file or audit trail: maintain the elements in a linked list, with insertions occurring at the head.
- The search and delete operations require searching the entire list in the worst case: insertion is O(1), but find and delete are O(n).
- A sorted array does not help, even with ordered keys: search becomes fast, but insert/delete take O(n).

Hash Tables: Intuition
- Hashing is a function that maps each key to a location in memory.
- A key's location does not depend on other elements, and does not change after insertion (unlike a sorted list).
- A good hash function should be easy to compute.

- With such a hash function, the dictionary operations can be implemented in O(1) time.

Hash Tables: Intuition (direct-access tables)
Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements are assigned keys from the set of small natural numbers; that is, U ⊆ Z+ and |U| is relatively small. If no two elements have the same key, then this dictionary can be implemented by storing its elements in the array T[0, ..., |U| - 1]. This implementation is referred to as a direct-access table, since each of the requisite Dictionary ADT operations - Search, Insert, and Delete - can always be performed in Θ(1) time by using a given key value to index directly into T.

The obvious shortcoming of direct-access tables is that the set U rarely has such "nice" properties. In practice, U can be quite large, which leads to wasted memory if the number of elements actually stored in the table is small relative to |U|. Furthermore, it may be difficult to ensure that all keys are unique. Finally, a specific application may require key values that are real numbers, or symbols that cannot be used directly to index into the table. An effective alternative to the direct-access table is the hash table. A hash table is a sequentially mapped data structure that is similar to a direct-access table in that both attempt to make use of the random-access capability afforded by sequential mapping.
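A minimal sketch of a direct-access table over a small integer universe, assuming unique keys drawn from {0, ..., U-1}; the class name, value type, and member names are invented for this example.

#include <array>
#include <cstddef>
#include <optional>
#include <string>

// Direct-access table for keys drawn from the small universe {0, ..., U-1}.
// Each operation indexes straight into T, so each runs in Theta(1) time.
template <std::size_t U>
class DirectAccessTable {
public:
    void insert(std::size_t key, const std::string& value) { T[key] = value; }
    void remove(std::size_t key) { T[key].reset(); }
    std::optional<std::string> search(std::size_t key) const { return T[key]; }
private:
    std::array<std::optional<std::string>, U> T;  // one slot per possible key in U
};

// Usage sketch:
//   DirectAccessTable<100> t;          // universe U = {0, ..., 99}
//   t.insert(42, "record with key 42");
//   auto hit = t.search(42);           // constant-time lookup
//   t.remove(42);

Note that the memory cost is proportional to |U| rather than to the number of stored elements, which is exactly the shortcoming discussed above.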


Hash Tables
- All the search structures so far relied on a comparison operation, giving performance O(n) or O(log n).
- Assume I have a function f(key) -> integer, i.e. one that maps a key to an integer. What performance might I expect now?

Hash Tables - Structure
- Simplest case: assume items have integer keys in the range 1..m.
- Use the value of the key itself to select a slot in a direct-access table in which to store the item.
- To search for an item with key k, just look in slot k: if there's an item there, you've found it; if the tag is 0, it's missing.
- Constant time, O(1).

Hashing: the basic idea
- Map key values to hash table addresses: key -> hash table address.
- This applies to find, insert, and remove.
- Usually: integers -> {0, 1, 2, ..., Hsize-1}; a typical example is f(n) = n mod Hsize.
- Non-numeric keys are converted to numbers, for example strings converted to numbers as the sum of their ASCII values, or from their first three characters.

Hashing: the basic idea
[Figure: student records placed into a table by Perm # (mod 9)]

Hash Tables - Choosing the Hash Function
- Uniform hashing is the ideal.
- Let P(k) be the probability that key k occurs. If there are m slots in our hash table, a uniform hashing function h(k) would ensure:

    Σ_{k : h(k)=0} P(k) = Σ_{k : h(k)=1} P(k) = ... = Σ_{k : h(k)=m-1} P(k) = 1/m

  (each sum is read as "the sum over all k such that h(k) = i")

or, in plain English, the number of keys that map to each slot is equal.

Hash Tables - A Uniform Hash Function
- If the keys are integers randomly distributed in [0, r), then

    h(k) = ⌊mk/r⌋,  for 0 ≤ k < r,

  is a uniform hash function.
- Most hashing functions can be made to map the keys to [0, r) for some r: e.g. adding the ASCII codes of the characters mod 256 gives values in [0, 256), i.e. 0..255.
- Replacing + by xor gives the same range without the mod operation.

Hash Tables - Reducing the range to [0, m)
- We have mapped the keys to a range of integers 0 ≤ k < r; now we must reduce this range to [0, m), where m is a reasonable size for the hash table.
- Strategies:
  - Division: use a mod function
  - Multiplication
  - Universal hashing

Hash Tables - Reducing the range to [0, m): Division
- Use a mod function: h(k) = k mod m.
- Choice of m? Powers of 2 are generally not good: h(k) = k mod 2^n just selects the last n bits of k, and all bit combinations are not generally equally likely.
- Prime numbers close to 2^n seem to be good choices, e.g. for a table of ~4000 entries, choose m = 4093.
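A minimal sketch of the division method, assuming string keys are first folded into an unsigned integer by summing character codes; the function names are invented, and the prime 4093 is the table size suggested above.

#include <cstdint>
#include <string>

// Fold a string key into an unsigned integer (here, the sum of character codes).
std::uint64_t fold(const std::string& key) {
    std::uint64_t sum = 0;
    for (unsigned char c : key) sum += c;
    return sum;
}

// Division method: reduce the folded key to a slot index in [0, m),
// with m a prime close to a power of two (4093 is just below 2^12).
constexpr std::uint64_t m = 4093;

std::uint64_t hash_division(const std::string& key) {
    return fold(key) % m;
}

// Usage sketch:
//   std::uint64_t slot = hash_division("some key");   // slot in 0..4092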

[Figure from the Division slide: the bit pattern 0110010111000011010, illustrating that k mod 2^8 simply selects the last 8 bits]

Hash Tables - Reducing the range to [0, m): Multiplication method
- Multiply the key by a constant A, 0 < A < 1.
- Extract the fractional part of the product: kA - ⌊kA⌋.
- Multiply this by m: h(k) = ⌊m (kA - ⌊kA⌋)⌋.
- Now m is not critical, and a power of 2 can be chosen, so the procedure is fast on a typical digital computer:
  - Set m = 2^p.
  - Multiply k (w bits) by ⌊A·2^w⌋, giving a 2w-bit product.
  - Extract the p most significant bits of the lower half.
  (A sketch of this follows after this group of slides.)
- A = (√5 - 1)/2 ≈ 0.618 seems to be a good choice (see Knuth).

Hash Tables - Reducing the range to [0, m): Universal hashing
- A determined adversary can always find a set of data that will defeat any fixed hash function: hash all keys to the same slot and search becomes O(n).
- So select the hash function randomly, at run time, from a set of hash functions:
  - Each run can give different results, even with the same data.
  - This reduces the probability of poor performance; good average performance is obtainable.
- A set of functions H, each mapping keys to [0, m), is universal if, for each pair of distinct keys x and y, the number of functions h in H for which h(x) = h(y) is |H|/m.

Hash Tables - Universal hashing: constructing a universal set
- Can we design a set of universal hash functions? Quite easily.
- Write the key as x = <x0, x1, x2, ..., xr>.
- Choose a = <a0, a1, ..., ar>, a sequence of elements chosen randomly from {0, ..., m-1}.
- Define h_a(x) = (Σ_i a_i x_i) mod m.
- There are m^(r+1) sequences a, so there are m^(r+1) functions h_a(x).
- Theorem: the h_a form a set of universal hash functions. Proof: see Cormen.

Hash Tables - Constraints
- Keys must be unique.
- Keys must lie in a small range.
- For storage efficiency, keys must be dense in the range: if they are sparse (lots of gaps between values), a lot of space is used to obtain speed. This is a space-for-speed trade-off.

Hash Tables - Relaxing the constraints
- Keys must be unique?
  - Construct a linked list of duplicates attached to each slot.
  - If a search can be satisfied by any item with key k, performance is still O(1); but if the item has some other distinguishing feature which must be matched, we get O(n_max), where n_max is the largest number of duplicates (the length of the longest chain).
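A minimal sketch of the multiplication method described above, assuming 32-bit keys (w = 32) and a table of m = 2^p slots; the constant 2654435769 is ⌊((√5 - 1)/2) · 2^32⌋, and the function name is invented.

#include <cstdint>

// Multiplication method (after Knuth), with w = 32 key bits and m = 2^p slots.
// 2654435769 = floor(A * 2^32) for A = (sqrt(5) - 1) / 2.
constexpr std::uint32_t A_TIMES_2_32 = 2654435769u;

// p must be between 1 and 32; the result is a slot index in [0, 2^p).
std::uint32_t hash_multiplication(std::uint32_t k, unsigned p) {
    // 2w-bit product of the key and floor(A * 2^w).
    std::uint64_t product = static_cast<std::uint64_t>(k) * A_TIMES_2_32;
    // The lower half of the product is the fractional part of k*A, scaled by 2^32;
    // its p most significant bits give the slot index.
    std::uint32_t lower_half = static_cast<std::uint32_t>(product);  // product mod 2^32
    return lower_half >> (32 - p);
}

// Usage sketch: hash_multiplication(123456789u, 10) gives a slot in 0..1023.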

Hash Tables - Relaxing the constraints
- Keys are integers?
  - Need a hash function h(key) -> integer, i.e. one that maps a key to an integer.
  - Applying this function to the key produces an address.
  - If h maps each key to a unique integer in the range 0..m-1, then search is O(1).

Hash Tables - Hash functions
Form of the hash function: an example using an n-character key:

int hash( char *s, int n ) {
    int sum = 0;
    while ( n-- )
        sum = sum + *s++;   // add up the character codes
    return sum % 256;       // returns a value in 0..255
}

- The xor function is also commonly used: sum = sum ^ *s++;
- But any function that generates integers in 0..m-1 for some suitable (not too large) m will do, as long as the hash function itself is O(1)!

Hash Tables - Collisions
With this hash function,

int hash( char *s, int n ) {
    int sum = 0;
    while ( n-- )
        sum = sum + *s++;
    return sum % 256;
}

hash( "AB", 2 ) and hash( "BA", 2 ) return the same value! This is called a collision. A variety of techniques are used for resolving collisions.
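A quick check of that collision, wrapping the slide's hash function in a small program; the main function is added only for this illustration, and const is added to the parameter so that string literals can be passed in C++.

#include <cstdio>

int hash( const char *s, int n ) {
    int sum = 0;
    while ( n-- )
        sum = sum + *s++;
    return sum % 256;
}

int main() {
    // 'A' + 'B' == 'B' + 'A', so both keys produce the same value: a collision.
    std::printf( "hash(\"AB\", 2) = %d\n", hash( "AB", 2 ) );
    std::printf( "hash(\"BA\", 2) = %d\n", hash( "BA", 2 ) );
    return 0;
}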

Hash Tables - Collision handling
- Collisions occur when the hash function maps two different keys to the same address.
- The table must be able to recognise and resolve this.
- Recognise:
  - Store the actual key with the item in the hash table.
  - Compute the address: k = h( key ).
  - Check for a hit: if ( table[k].key == key ) then hit, else try the next entry.
- Resolution: a variety of techniques, described below.

Hash Tables - Linked lists
Collision resolution: a linked list attached to each primary table slot.
- Example: h(i) == h(i1), and h(k) == h(k1) == h(k2).
- Searching for i1:
  - Calculate h(i1).
  - The item in the table, i, doesn't match.
  - Follow the linked list to i1.
  - If NULL is found, the key isn't in the table.

Hash Tables - Overflow area
- A linked list is constructed in a special area of the table called the overflow area.
- Example: h(k) == h(j), and k is stored first. Adding j:
  - Calculate h(j).
  - Find k.
  - Get the first free slot in the overflow area and put j in it.
  - k's pointer points to this slot.
- Searching: same as the linked-list method.
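A minimal sketch of the overflow-area scheme, assuming integer keys and string values: the table is one array whose first m slots form the primary area and whose remaining slots form the overflow area, with chains linked by slot index. All names and sizes are invented for this example.

#include <cstddef>
#include <string>
#include <vector>

// Overflow-area hash table: colliding items are stored in the overflow region
// of the same array and chained to their primary slot by index.
class OverflowHashTable {
public:
    OverflowHashTable(std::size_t primary, std::size_t overflow)
        : slots(primary + overflow), m(primary), nextFree(primary) {}

    bool insert(int key, const std::string& value) {
        std::size_t i = h(key);
        if (slots[i].used) {
            // Walk the chain to its last element, then append in the overflow area.
            while (slots[i].next != -1) i = static_cast<std::size_t>(slots[i].next);
            if (nextFree == slots.size()) return false;   // overflow area exhausted
            slots[i].next = static_cast<int>(nextFree);
            i = nextFree++;
        }
        slots[i].used = true;
        slots[i].key = key;
        slots[i].value = value;
        slots[i].next = -1;
        return true;
    }

    const std::string* find(int key) const {
        std::size_t i = h(key);
        if (!slots[i].used) return nullptr;
        for (;;) {
            if (slots[i].key == key) return &slots[i].value;
            if (slots[i].next == -1) return nullptr;       // end of chain: not found
            i = static_cast<std::size_t>(slots[i].next);
        }
    }

private:
    struct Slot {
        bool used = false;
        int key = 0;
        std::string value;
        int next = -1;        // index of the next chained slot, or -1
    };
    std::size_t h(int key) const {
        return static_cast<std::size_t>(static_cast<unsigned int>(key)) % m;
    }
    std::vector<Slot> slots;  // primary area [0, m) followed by the overflow area
    std::size_t m;            // number of primary slots
    std::size_t nextFree;     // next unused slot in the overflow area
};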

Hash Tables - Re-hashing
- Use a second hash function, h'(x); many variations exist. The general term is re-hashing.
- Example: h(k) == h(j), and k is stored first. Adding j:
  - Calculate h(j); find k there.
  - Repeat until we find an empty slot: calculate h'(j); put j in it.
- Searching: use h(x), then h'(x).

Hash Tables - Re-hash functions
- Many variations of the re-hash function exist.
- Linear probing: h'(x) is +1, i.e. go to the next slot until you find an empty one.

- Linear probing can lead to bad clustering: re-hashed keys fill in the gaps between other keys and exacerbate the collision problem.

Hash Tables - Re-hash functions
- Quadratic probing: h'(x) is c*i^2 on the i-th probe.
  - Avoids primary clustering, but secondary clustering occurs: all keys which collide on h(x) follow the same sequence.
  - First a = h(j) = h(k), then a + c, a + 4c, a + 9c, ...
  - Secondary clustering is generally less of a problem.

Hash Tables - Collision Resolution Summary
- Chaining: unlimited number of elements and of collisions, but the overhead of multiple linked lists.
- Re-hashing: fast re-hashing and fast access through use of the main table space, but the maximum number of elements must be known and multiple collisions become probable.
- Overflow area: fast access, and collisions don't use primary table space, but two parameters which govern performance need to be estimated.
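The linear and quadratic probe sequences above can be captured in one small open-addressing sketch, parameterised by a probe function so the same table works for both. All names are invented for this example.

#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Open addressing: the i-th probe for key k inspects slot (h(k) + probe(i)) mod m.
class ProbingTable {
public:
    ProbingTable(std::size_t m, std::function<std::size_t(std::size_t)> probe)
        : table(m), probeFn(std::move(probe)) {}

    bool insert(int key) {
        for (std::size_t i = 0; i < table.size(); ++i) {
            std::size_t slot = index(key, i);
            if (!table[slot]) { table[slot] = key; return true; }
        }
        return false;                       // table full or probe sequence exhausted
    }

    bool find(int key) const {
        for (std::size_t i = 0; i < table.size(); ++i) {
            std::size_t slot = index(key, i);
            if (!table[slot]) return false; // an empty slot terminates the search
            if (*table[slot] == key) return true;
        }
        return false;
    }

private:
    std::size_t index(int key, std::size_t i) const {
        std::size_t h = static_cast<std::size_t>(static_cast<unsigned int>(key)) % table.size();
        return (h + probeFn(i)) % table.size();
    }
    std::vector<std::optional<int>> table;  // empty optional = empty slot
    std::function<std::size_t(std::size_t)> probeFn;
};

// Usage sketch:
//   ProbingTable linear(11, [](std::size_t i) { return i; });             // h' is +1 per probe
//   ProbingTable quadratic(11, [](std::size_t i) { return 3 * i * i; });  // c*i^2 with c = 3
//   linear.insert(12);  linear.find(12);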


Hash Tables - Summary so far ...
- Potential O(1) search time, if a suitable function h(key) -> integer can be found.
- Space-for-speed trade-off; full hash tables don't work (more later!).
- Collisions are inevitable: the hash function reduces the amount of information in the key.
- Various resolution strategies:
  - Linked lists
  - Overflow areas
  - Re-hash functions: linear probing (h' is +1), quadratic probing (h' is +c*i^2), any other hash function, or even a sequence of functions!

Hashing
- Choose a hash function h; it also determines the hash table size.
- Given an item x with key k, put x at location h(k).
- To find whether x is in the set, check location h(k).
- What to do if more than one key hashes to the same value? This is called a collision.
- We will discuss two methods to handle collisions: separate chaining and open addressing.

Separate chaining
- Maintain a list of all elements that hash to the same value.
- Search: use the hash function to determine which list to traverse.
- Insert/deletion: once the bucket is found through the hash, insert and delete are list operations.

class HashTable {
  private:
    unsigned int Hsize;       // number of buckets
    List *TheList;            // array of Hsize linked lists

    find( k, e ) {
        HashVal = Hash( k, Hsize );
        if ( TheList[HashVal].Search( k, e ) )
            return true;
        else
            return false;
    }
}

[Figure: a chained hash table with 11 buckets (0..10), keys chained within each bucket; the hash function is x mod 11]
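A self-contained modern C++ version of the chained table above, assuming integer keys and using std::list for the buckets; Hsize and the x mod Hsize hash follow the slides, while the class and method bodies are invented for this sketch.

#include <cstddef>
#include <list>
#include <vector>

// Separate chaining: one linked list per bucket, bucket chosen by key mod Hsize.
class ChainedHashTable {
public:
    explicit ChainedHashTable(std::size_t Hsize) : TheList(Hsize) {}

    std::size_t Hash(int k) const {
        return static_cast<std::size_t>(static_cast<unsigned int>(k)) % TheList.size();
    }

    bool find(int k) const {
        for (int x : TheList[Hash(k)])    // traverse only the bucket that k hashes to
            if (x == k) return true;
        return false;
    }

    void insert(int k) { TheList[Hash(k)].push_front(k); }  // O(1) list operation

    void remove(int k) { TheList[Hash(k)].remove(k); }      // list operation

private:
    std::vector<std::list<int>> TheList;  // Hsize buckets
};

// Usage sketch, mirroring the figures with Hsize = 11:
//   ChainedHashTable t(11);
//   t.insert(53);              // 53 mod 11 = 9, so 53 goes into bucket 9
//   bool hit = t.find(53);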

Insertion example: insert 53
- 53 = 4 x 11 + 9, so 53 mod 11 = 9: 53 is added to the chain in bucket 9.
[Figure: the table with 11 buckets before and after inserting 53 into bucket 9]

Analysis of Hashing with Chaining
- Worst case: all keys hash into the same bucket, giving a single linked list; insert, delete, and find take O(n) time.
- Average case: keys are uniformly distributed into buckets, and operations take O(1 + N/B), where N is the number of elements in the hash table and B is the number of buckets. If N = O(B), this is O(1) time per operation.
- N/B is called the load factor of the hash table.

Open addressing
- If a collision happens, alternative cells are tried until an empty cell is found.

Linear probing: try the next available position.
[Figure: a table with 11 slots (0..10) filled by linear probing]

Linear Probing (insert 12)
- 12 = 1 x 11 + 1, so 12 mod 11 = 1; slot 1 is occupied, so 12 is placed in the next available slot.
[Figure: the table before and after inserting 12]

Search with linear probing (search for 15)
- 15 = 1 x 11 + 4, so 15 mod 11 = 4; probing forward from slot 4 reaches an empty slot: NOT FOUND!
[Figure: the probe sequence starting at slot 4]

Deletion in Hashing with Linear Probing
- Since empty buckets are used to terminate a search, standard deletion does not work.
- One simple idea is to not delete, but mark:
  - Insert: put the item in the first empty or marked bucket.
  - Search: continue past marked buckets.
  - Delete: just mark the bucket as deleted.
- Advantage: easy and correct. Disadvantage: the table can become full of dead items.

Deletion with linear probing: LAZY (delete 9)
- 9 = 0 x 11 + 9, so 9 mod 11 = 9; the key is FOUND in slot 9, and the slot is marked D (deleted).
[Figure: the table before and after marking slot 9 as D]

remove(j), an alternative deletion routine (the fragment is cut off in the source):

remove(j) {
    i = j;
    empty[i] = true;
    i = (i + 1) % D;        // candidate for swapping
    while ((not empty[i]) and i != j) {
        r = Hash(ht[i]);    // where should it go without collision?
                            // can we still find it based on the rehashing strategy?
        if not ((j
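Since the remove(j) fragment above is incomplete, here is a minimal sketch of the lazy-deletion scheme that the slides do describe in full: each slot is EMPTY, OCCUPIED, or DELETED; insert reuses the first empty or deleted slot, search skips deleted slots, and delete only marks. All names are invented for this example.

#include <cstddef>
#include <vector>

// Linear probing with lazy deletion: DELETED marks keep probe chains intact.
class LazyProbingTable {
public:
    explicit LazyProbingTable(std::size_t m) : state(m, EMPTY), keys(m, 0) {}

    bool insert(int key) {
        for (std::size_t i = 0; i < state.size(); ++i) {
            std::size_t slot = (h(key) + i) % state.size();
            if (state[slot] != OCCUPIED) {             // first empty or marked bucket
                state[slot] = OCCUPIED;
                keys[slot] = key;
                return true;
            }
        }
        return false;                                  // table full
    }

    bool find(int key) const {
        for (std::size_t i = 0; i < state.size(); ++i) {
            std::size_t slot = (h(key) + i) % state.size();
            if (state[slot] == EMPTY) return false;    // only EMPTY terminates the search
            if (state[slot] == OCCUPIED && keys[slot] == key) return true;
        }                                              // DELETED slots are probed past
        return false;
    }

    bool remove(int key) {
        for (std::size_t i = 0; i < state.size(); ++i) {
            std::size_t slot = (h(key) + i) % state.size();
            if (state[slot] == EMPTY) return false;
            if (state[slot] == OCCUPIED && keys[slot] == key) {
                state[slot] = DELETED;                 // just mark the bucket as deleted
                return true;
            }
        }
        return false;
    }

private:
    enum State { EMPTY, OCCUPIED, DELETED };
    std::size_t h(int key) const {
        return static_cast<std::size_t>(static_cast<unsigned int>(key)) % state.size();
    }
    std::vector<State> state;
    std::vector<int> keys;
};

// Usage sketch with 11 slots, as in the figures:
//   LazyProbingTable t(11);
//   t.insert(42);  t.insert(9);  t.remove(9);   // the slot holding 9 becomes DELETED
//   t.find(42);                                 // still found; probes skip DELETED slots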