Hash Functions FTW

Date posted: 08-May-2015
Category: Technology

Presentation on Hash Functions, Bloom Filters, and Hash-Oriented Storage

Transcript of Hash Functions FTW

1. Hash Functions FTW*
  Fast Hashing, Bloom Filters & Hash-Oriented Storage
  Sunny Gleason
  * "For the win" (see urbandictionary, FTW [1]); this expression has nothing to do with hash functions

2. What's in this Presentation
  • Hash Function Survey
  • Hash Performance
  • Bloom Filters
  • HashFile: Hash Storage

3. Hash Functions

    int getIntHash(byte[] data);    // 32-bit
    long getLongHash(byte[] data);  // 64-bit

    int v1 = hash("foo".getBytes());
    int v2 = hash("goo".getBytes());

    int hash(byte[] value) {
        // a simple hash: rotate left by 5 bits, then XOR in the next byte
        int h = 0;
        for (byte b : value) {
            h = ((h << 5) | (h >>> 27)) ^ b;
        }
        return h % PRIME;
    }

4. Hash Functions
  • Goal: v1 has many bit differences from v2
  • Desirable properties:
      • Uniform distribution (few collisions)
      • Very fast computation

5. Hash Applications
  • Goal: O(1) access
  • Hash Table
  • Hash Set
  • Bloom Filter

6. Popular Hash Functions
  • FNV Hash
  • DJB Hash
  • Jenkins Hash
  • Murmur2
  • New (promising?): CrapWow
  • Awesome & slow: SHA-1, MD5, etc.

7. Evaluating Hash Functions
  • Hash Function Zoo
  • Quality of: CRC32, DJB, Jenkins, FNV, Murmur2, SHA1
  • Performance: [chart: throughput per hash function, in MM ops/s]

8. A Strawman Set
  • N keys, K bytes per key
  • Allocate an array of size K * N bytes
  • Utilize array storage as:
      • a heap or tree: O(lg N) insert/delete/remove
      • a hash: O(1) insert/delete/remove
  • What if we don't have room for K * N bytes?

9. Bloom Filter
  • Key point: give up on storing all the keys
  • Store r bits per key instead of K bytes
  • Allocate a bit vector of size M = r * N, where N is the expected number of entries
  • Use multiple hash functions of the key to determine which bits to set
  • Premise: if the hash functions are well-distributed, there are few collisions and accuracy is high

10. Bloom Filter

11. Tuning Bloom Filters
  • Let r = M bits / N keys (r: number of bits per key)
  • Let k = 0.7 * r (k: number of hashes to use)
  • Let p = 0.6185 ** r (p: probability of a false positive)
  • Working backwards, we can use the desired false positive rate p to tune the space consumption of the data structure:

      r (bits/key)    p (false positive rate)
      8               2.1e-2
      16              4.5e-4
      24              9.8e-6
      32              2.1e-7
      40              4.5e-9
      48              9.6e-11

12. Bloom Filter Performance
  • 100MM entries, 8 bits/key: 833k ops/s
  • 100MM entries, 32 bits/key: 256k ops/s
  • 1BN entries, 8 bits/key: 714k ops/s
  • 1BN entries, 32 bits/key: 185k ops/s
  • Hypothesis: the difference between 100MM and 1BN is due to locality of memory access in the smaller bit vector

13. Hash-Oriented Storage
  • HashFile: 64-bit clone of djb's constant db "CDB"
  • Plain ol' key/value storage: add(byte[] k, byte[] v), byte[] lookup(byte[] k)
  • Constant, aka immutable, data store: create(), add(k, v) ..., build() ... before lookup(k)
  • Use properties of the hash table to achieve O(1) disk seeks per lookup

14. HashFile Structure
  • Header (fixed width): table pointers; contains the offsets of the hash tables and the count of elements per table
  • Body (variable width): contains the concatenation of all keys and values (with data lengths)
  • Footer (fixed width): hash tables containing long hash values of keys alongside long offsets into the body

15. HashFile Diagram

      HEADER: p1 s3 | p2 s4 | p3 s2 | p4 s1
      BODY:   k1 v1 | k2 v2 | k3 v3 | k4 v4 | k5 v5 | k6 v6 | k7 v7
      FOOTER: hk7 o7 | hk3 o3 | hk4 o4 | hk1 o1

  • Create: initialize an empty header, then start appending keys/values while recording the offsets and hash values of the keys
  • Build: take the list of hash values and offsets and turn them into hash tables, then backfill the header with values
  • Lookup: compute hash(key), compute the offset into the table (hash modulo size of table), use the table to find the offset into the body, return the value from the body

16. HashFile Performance
  • Spec: 2 disk seeks per lookup
  • Number of seeks is independent of the number of entries
  • X25E SSD, 1BN 8-byte keys and values (41GB): 650µs lookup with cold cache, up to 700x faster as the filesystem cache warms, 0.9µs when in-memory
  • With 100MM entries (4GB), cold cache is ~600µs (from locality), 0.6µs warm

17. Conclusions
  • Be aware of different hash functions and their collision/performance tradeoffs
  • Bloom Filters are extremely useful for fast, large-scale set membership
  • HashFile provides excellent performance in cases where a static K/V store suffices

18. Future Work
  • Implement cWow hash in Java
  • Extend HashFile with configurable hash, pointer, and key/value lengths to conserve space (reduce the 24 bytes-per-KV overhead)
  • Implement a read-write (non-constant) version of HashFile
  • Bloom Filter that spills to SSD

19. Thank You!
  ...Any questions? :)

20. References
  • GitHub Project: g414-hash (hash function, bloom filter, HashFile implementations)
  • Wikipedia: Hash Function, Bloom Filter
  • Non-Cryptographic Hash Function Zoo
  • DJB CDB, sg-cdb (java implementation)
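
The Bloom filter sizing rules from the "Bloom Filter" and "Tuning Bloom Filters" slides (M = r * N bits, k = 0.7 * r hashes, p = 0.6185 ** r) can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the g414-hash implementation: the class name BloomFilterSketch, the use of java.util.BitSet (which limits M to fewer than 2^31 bits), and the double-hashing scheme that derives the k bit positions from a 32-bit FNV-1a hash and a DJB-style hash are all choices made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical sketch of a Bloom filter sized by bits per key (r). */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;    // M = r * N
    private final int numHashes;  // k = 0.7 * r, rounded, at least 1

    BloomFilterSketch(int expectedEntries, int bitsPerKey) {
        long m = (long) expectedEntries * bitsPerKey;
        if (expectedEntries <= 0 || bitsPerKey <= 0 || m > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("bit vector must have 1..2^31-1 bits");
        }
        numBits = (int) m;
        numHashes = Math.max(1, (int) Math.round(0.7 * bitsPerKey));
        bits = new BitSet(numBits);
    }

    int numHashes() {
        return numHashes;
    }

    /** 32-bit FNV-1a. */
    static int fnv1a(byte[] data) {
        int h = 0x811C9DC5;            // FNV offset basis
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;           // FNV prime
        }
        return h;
    }

    /** DJB-style hash (multiply by 33, XOR in each byte). */
    static int djb(byte[] data) {
        int h = 5381;
        for (byte b : data) {
            h = (h * 33) ^ (b & 0xFF);
        }
        return h;
    }

    /** Assumption: the i-th bit position is (h1 + i * h2) mod M (double hashing). */
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(byte[] key) {
        int h1 = fnv1a(key);
        int h2 = djb(key);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    boolean mightContain(byte[] key) {
        int h1 = fnv1a(key);
        int h2 = djb(key);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }

    /** Estimated false positive rate from the tuning slide: p = 0.6185 ** r. */
    static double falsePositiveRate(int bitsPerKey) {
        return Math.pow(0.6185, bitsPerKey);
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1_000, 16);
        filter.add("foo".getBytes(StandardCharsets.UTF_8));
        System.out.println("foo: " + filter.mightContain("foo".getBytes(StandardCharsets.UTF_8)));
        System.out.println("goo: " + filter.mightContain("goo".getBytes(StandardCharsets.UTF_8)));
        System.out.printf("k=%d, estimated p=%.1e%n", filter.numHashes(), falsePositiveRate(16));
    }
}
```

A key that was added always reports true, so the only error mode is a false positive. With weak or correlated hash functions the observed false positive rate can exceed the estimate from the formula, which is why the quality of the hash functions matters here.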