The Simigle Image Search Engine

Wei Dong

2010-09-23

http://www.simigle.com/

Challenges

• Large dataset
– ~100 million images w/ single server

• High confidence
– False positive rate < 10^-6

• High recall
– Recall ~80%

• Online search

• High throughput
– Still a long way to go

System Overview

[Diagram: system components]

• Search servers: loosely coupled, easy to replicate

• Read-only database, images

• A cluster for crawling and indexing images

• Clients w/ various browsers

• JSON, JPEG, HTML

Software techniques:

• C++, boost, poco

• JavaScript, jQuery

• C++, Java, Hadoop

Search Server Architecture

[Diagram: search server data flow]

• query

• Session Cache (by UUID)

• Retrieval Cache (by SHA1)

• Search Process (on cache miss): Feature Extraction, Feature Search, Query Expansion

• Thumbnail Database

• Feature Index (x4)
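The cache flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the Simigle code: the class name `SearchServer`, the `search_fn` callback, and the idea that the session cache serves result paging are all assumptions made for the example. Only the key choices (SHA1 for retrieval, UUID for sessions) and the search-on-miss path come from the slide.

```python
import hashlib
import uuid


class SearchServer:
    """Hypothetical sketch of the two caches in front of the search process."""

    def __init__(self, search_fn):
        self.search_fn = search_fn   # stands in for extraction + search + expansion
        self.retrieval_cache = {}    # SHA1 of query image -> result list
        self.session_cache = {}      # session UUID -> result list
        self.misses = 0

    def query(self, image_bytes):
        # Identical query images hash to the same key, so they are searched once.
        key = hashlib.sha1(image_bytes).hexdigest()
        if key not in self.retrieval_cache:
            self.misses += 1
            self.retrieval_cache[key] = self.search_fn(image_bytes)
        # Assumption: each query opens a session used to page through results.
        session_id = str(uuid.uuid4())
        self.session_cache[session_id] = self.retrieval_cache[key]
        return session_id

    def page(self, session_id, start, count):
        return self.session_cache[session_id][start:start + count]
```

Submitting the same image twice runs `search_fn` only once; each call still returns a fresh session ID.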

Main Techniques

• Entropy-filtered local image features
– High confidence

• Graph-based query expansion
– High recall

• Compact sketch representation
– Smaller database, faster search

• Flexible bit-vector indexing
– Online search

• Content-aware disk layout
– High throughput thumbnail retrieval

Entropy-Filtered Local Feature

• Feature detection w/ Difference-of-Gaussian (DoG)

• Entropy-based filtering for high confidence

• DoG detects more regions than needed.

• Some plain regions can cause false positives (like A, D).

• We only keep regions with high entropy (rich content, like B, C).

• 10x reduction of error rate

• Fewer features have to be indexed

[ Unpublished ]
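Since the filtering method itself is unpublished, the following is only a generic sketch of the idea: compute the Shannon entropy of a region's intensity histogram and drop regions that fall below a threshold. The function names, the 16-bin histogram, and the 2.0-bit threshold are assumptions chosen for illustration.

```python
import math
from collections import Counter


def patch_entropy(pixels, bins=16):
    """Shannon entropy (in bits) of the intensity histogram of 8-bit pixels."""
    counts = Counter(min(p * bins // 256, bins - 1) for p in pixels)
    n = len(pixels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def filter_regions(regions, min_entropy=2.0):
    """Keep only the (name, pixels) regions whose entropy reaches the threshold."""
    return [name for name, px in regions if patch_entropy(px) >= min_entropy]


# Hypothetical regions: a flat patch (one gray level) and a varied one.
plain = ("plain", [128] * 64)
rich = ("rich", list(range(0, 256, 4)))
kept = filter_regions([plain, rich])
```

A flat patch puts all pixels in one histogram bin and scores zero, so it is discarded; a patch that spreads evenly over the bins scores the maximum of log2(bins).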

Graph-Based Query Expansion

• We can find more results if we use the initial results to search again

• Keep searching until we find no more

• Problem: hit a lot of false positives

• We use a graph-partitioning method[1] to cut off the expansion intelligently.

• Recall improves from 43% to ~80% w/ the same false positive rate[2].

[1] Andersen, et al. Local graph partitioning using PageRank vectors. FOCS '06.

[2] Unpublished.
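How the partitioning is applied to query expansion is unpublished, so the sketch below only illustrates the cited technique [1] on its own: an approximate personalized PageRank computed by local "push" steps from a seed, followed by a sweep that picks the lowest-conductance prefix. Treating images as nodes and near-duplicate matches as undirected edges is an assumption, as are the function names and the `alpha` and `eps` values.

```python
from collections import deque


def approximate_pagerank(graph, seed, alpha=0.15, eps=1e-4):
    """Push-style approximate personalized PageRank on an adjacency-list graph."""
    p, r = {}, {seed: 1.0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg_u = len(graph[u])
        ru = r.get(u, 0.0)
        if ru < eps * deg_u:
            continue                      # residual too small to push
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1 - alpha) * ru / 2       # lazy walk keeps half at u
        share = (1 - alpha) * ru / (2 * deg_u)
        for v in graph[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= eps * len(graph[v]):
                queue.append(v)
        if r[u] >= eps * deg_u:
            queue.append(u)
    return p


def sweep_cut(graph, p):
    """Return the prefix (by p[u]/deg(u)) with the smallest conductance."""
    total_vol = sum(len(nbrs) for nbrs in graph.values())
    order = sorted(p, key=lambda u: p[u] / len(graph[u]), reverse=True)
    in_set, vol, cut = set(), 0, 0
    best, best_phi = set(), float("inf")
    for u in order:
        inside = sum(1 for v in graph[u] if v in in_set)
        cut += len(graph[u]) - 2 * inside  # edges to the set stop being cut
        vol += len(graph[u])
        in_set.add(u)
        denom = min(vol, total_vol - vol)
        if denom == 0:
            continue
        phi = cut / denom
        if phi < best_phi:
            best, best_phi = set(in_set), phi
    return best, best_phi
```

On a graph where a tight group of matching images is joined to an unrelated group by a single spurious edge, seeding from one image and taking the sweep cut returns the seed's own group, which is the kind of cut-off the slide describes.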

Compact Sketch Representation

• Raw features are large, 5~10KB/image
– About 80 features / image
– 128 bytes / feature (SIFT), or 64 bytes / feature (SURF) with lower quality
– Encodes all information about a region

• We only need to tell if two features are extremely similar

• 128-bit sketch with random space partitioning techniques

Dong, et al. Asymmetric Distance Estimation with Sketches for Similarity Search in High-Dimensional Spaces. SIGIR ’08.
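As a rough illustration of sketching by random space partitioning, the code below maps a 128-dimensional feature to a 128-bit integer using random hyperplanes (one bit per hyperplane side) and compares sketches by Hamming distance. This is a common stand-in, not necessarily the construction used in the cited paper; the function names and the fixed seed are assumptions. By the slide's figures, a 128-bit sketch takes 16 bytes versus 128 bytes for a raw SIFT feature, so about 80 features per image shrink from roughly 10KB to roughly 1.3KB.

```python
import random

DIM = 128    # SIFT descriptor length
BITS = 128   # sketch length from the slide


def make_hyperplanes(dim=DIM, bits=BITS, seed=0):
    """One random Gaussian normal vector per sketch bit (seed is arbitrary)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def sketch(feature, hyperplanes):
    """Set bit i when the feature lies on the non-negative side of plane i."""
    value = 0
    for i, plane in enumerate(hyperplanes):
        dot = sum(x * w for x, w in zip(feature, plane))
        if dot >= 0:
            value |= 1 << i
    return value


def hamming(a, b):
    """Number of differing bits between two sketches."""
    return bin(a ^ b).count("1")
```

Two features count as "extremely similar" when the Hamming distance between their sketches is small, which is the test the indexing slide relies on.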

Flexible Bit-Vector Indexing

• Search for sketches w/ <=3 bits different.

• Divide the 128-bit sketch into 4 blocks; with at most 3 differing bits, at least one block must be identical.

• State of the art[1] is equal partitioning.

• We find the optimal partitioning with dynamic programming[2].
– Faster
– More flexible

[1] Manku, et al. Detecting near-duplicates for web crawling. WWW '07.

[2] Unpublished.
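The pigeonhole argument above can be sketched with the equal-partitioning baseline: four 32-bit blocks, one hash table per block, and a Hamming check on every candidate that shares a block with the query. The optimized partitioning found by dynamic programming is unpublished and is not reproduced here; the class and function names are assumptions.

```python
from collections import defaultdict

BITS = 128
BLOCKS = 4
BLOCK_BITS = BITS // BLOCKS
MASK = (1 << BLOCK_BITS) - 1


def blocks(sketch):
    """Split a 128-bit integer into four 32-bit block values."""
    return [(sketch >> (i * BLOCK_BITS)) & MASK for i in range(BLOCKS)]


class BlockIndex:
    """One exact-match table per block; candidates are verified by Hamming distance."""

    def __init__(self):
        self.tables = [defaultdict(list) for _ in range(BLOCKS)]

    def add(self, sketch):
        for table, key in zip(self.tables, blocks(sketch)):
            table[key].append(sketch)

    def search(self, query, max_diff=3):
        hits = set()
        for table, key in zip(self.tables, blocks(query)):
            for cand in table.get(key, ()):
                if bin(cand ^ query).count("1") <= max_diff:
                    hits.add(cand)
        return hits
```

Any stored sketch within 3 bits of the query agrees with it on at least one block, so it appears in at least one table lookup and no match is missed.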

Content-Aware Disk Layout

• Query results range from a few to 1000s

• 20~100 thumbnails / page

• If thumbnails are randomly stored on disk, throughput will be limited by disk seeks

• We store similar images together on disk and load a bunch with one disk seek

• Results of a single query can be covered with a few disk seeks.

[ Unpublished ]
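The effect of the layout can be illustrated with a toy model, assuming each disk read fetches a fixed run of consecutive thumbnails and counting how many distinct reads a result set touches. The actual layout algorithm is unpublished; the function name, the cluster sizes, and the read size of 8 thumbnails are made up for the example.

```python
def reads_needed(layout, wanted, per_read):
    """Count distinct fixed-size disk reads that cover the wanted thumbnails."""
    position = {image: i for i, image in enumerate(layout)}
    return len({position[image] // per_read for image in wanted})


# Hypothetical data: 4 groups of 8 similar images each.
clusters = {c: ["c%d_img%d" % (c, j) for j in range(8)] for c in range(4)}

# Content-aware layout: similar images are stored next to each other.
clustered = [img for c in range(4) for img in clusters[c]]

# Content-oblivious layout: groups are interleaved across the disk.
scattered = [clusters[c][j] for j in range(8) for c in range(4)]

results = clusters[2]  # a query whose results are one group of similar images
clustered_reads = reads_needed(clustered, results, per_read=8)
scattered_reads = reads_needed(scattered, results, per_read=8)
```

In this toy setup the clustered layout serves the whole result set from a single read, while the interleaved layout spreads the same results over every read on the disk.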

Conclusion

• We present a system for similar web image retrieval
– High capacity (~100 million images / server)
– High confidence (10^-6 error rate)
– High recall (~80% recall)
– Online search (searches return in seconds)

• Future work: further improve responsiveness and throughput.
