How to build a small distributed search engine using open source software

How to build a small search engine using Hadoop (HDFS) and Lucene.

Page 1: How to build_a_search_engine

How to build a small distributed search engine using open source software

Page 2: How to build_a_search_engine

Building a distributed search engine

Search engine subsystems:

● Page database
● List of the pages to retrieve
● Page retrieval and storage
● Page content parsing
● Full-text indexing of the contents
● Graph database of the links for ranking
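As a rough illustration of how these subsystems fit together, here is a minimal single-process sketch in Python. Everything in it is an assumption made for the example: the `WEB` dictionary stands in for real HTTP retrieval, and plain dictionaries stand in for the page database, the full-text index, and the link graph.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
WEB = {
    "a": '<a href="b">JUG Lugano</a> meets monthly',
    "b": '<a href="a">back</a> <a href="c">Lucene</a> talk',
    "c": "Hadoop and Lucene",
}

def crawl(seed):
    frontier = deque([seed])  # list of the pages to retrieve
    page_db = {}              # page database: URL -> raw content
    index = {}                # full-text index: term -> set of URLs
    links = {}                # link graph: URL -> outgoing URLs
    while frontier:
        url = frontier.popleft()
        if url in page_db or url not in WEB:
            continue                                   # already stored, or unknown
        html = WEB[url]                                # retrieval
        page_db[url] = html                            # storage
        links[url] = re.findall(r'href="([^"]+)"', html)  # parsing: extract links
        text = re.sub(r"<[^>]+>", " ", html)           # parsing: strip tags
        for term in re.findall(r"\w+", text.lower()):
            index.setdefault(term, set()).add(url)     # full-text indexing
        frontier.extend(links[url])                    # schedule linked pages
    return page_db, index, links

page_db, index, links = crawl("a")
print(sorted(index["lucene"]))
print(links)
```

In a distributed setup each of these dictionaries would become a separate service or store; the sketch only shows the order in which the subsystems hand data to one another.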

Page 3: How to build_a_search_engine

Open Source Software

• Apache Hadoop
  • MapReduce
  • HDFS
  • HBase
• Apache Lucene

Page 4: How to build_a_search_engine

HDFS

Hadoop Distributed File System

Page 5: How to build_a_search_engine

HDFS – Assumptions and goals

● Hardware failure
● Big data
● Write once / read many
● Moving computation, not data
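The "hardware failure" and "moving computation, not data" points can be illustrated with a toy task scheduler. This is only a sketch under stated assumptions, not Hadoop's scheduler: the `REPLICAS` map, the node names, and the `schedule` function are hypothetical, and each block is assumed to have three replicas, matching HDFS's default replication factor.

```python
# Hypothetical block placement: block id -> nodes holding a replica.
REPLICAS = {
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node2", "node3", "node4"],
    "blk_3": ["node1", "node3", "node4"],
}

def schedule(replicas, failed=()):
    """Run each block's task on the least-loaded live node that stores the block."""
    load = {}   # node -> number of tasks assigned so far
    plan = {}   # block -> node chosen to process it
    for block, nodes in sorted(replicas.items()):
        live = [n for n in nodes if n not in failed]
        if not live:
            raise RuntimeError(f"all replicas of {block} are lost")
        # Ties on load are broken by node name to keep the result deterministic.
        node = min(live, key=lambda n: (load.get(n, 0), n))
        plan[block] = node
        load[node] = load.get(node, 0) + 1
    return plan

print(schedule(REPLICAS))
print(schedule(REPLICAS, failed={"node1"}))
```

Because every task lands on a node that already holds the data, no block has to cross the network, and losing one node only changes which replica is used.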

Page 8: How to build_a_search_engine

Lucene

Page 9: How to build_a_search_engine

Lucene - Inverted index

Term     Doc Id   Weight
JUG      301      0.97
         198      0.65
         120      0.43
Lugano   301      0.94
         278      0.15
         451      0.87
         103      0.45
         763      0.77
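To make the lookup concrete, here is a small Python sketch that stores the postings listed on this slide in a dictionary and answers a query requiring every term. The `search_all` function is a hypothetical helper, and summing the weights is an assumption made for illustration, not Lucene's scoring formula.

```python
# Postings from this slide: term -> {doc id: weight}.
INDEX = {
    "JUG":    {301: 0.97, 198: 0.65, 120: 0.43},
    "Lugano": {301: 0.94, 278: 0.15, 451: 0.87, 103: 0.45, 763: 0.77},
}

def search_all(index, terms):
    """Return (doc id, score) for documents containing every term, best first."""
    postings = [index.get(t, {}) for t in terms]
    if not postings:
        return []
    common = set(postings[0])
    for p in postings[1:]:
        common &= set(p)          # keep only documents present in every posting list
    # Assumption: the combined score is simply the sum of the per-term weights.
    scored = [(doc, sum(p[doc] for p in postings)) for doc in common]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print([doc for doc, _ in search_all(INDEX, ["JUG"])])
print([doc for doc, _ in search_all(INDEX, ["JUG", "Lugano"])])
```

The point of the structure is that a query touches only the posting lists of its own terms, never the full document collection.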

Page 10: How to build_a_search_engine

Lucene - Indexing main classes

IndexWriter
Directory
Analyzer
Document
Field
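The way these five roles cooperate can be sketched with toy Python stand-ins. The class names mirror the slide, but the methods and behaviour below are written for this example only and are not Lucene's Java API.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Field:
    name: str    # e.g. "title" or "body"
    value: str   # the text to index

@dataclass
class Document:
    fields: list = field(default_factory=list)

class Analyzer:
    """Turns field text into index terms (here: lowercase word split)."""
    def tokens(self, text):
        return re.findall(r"\w+", text.lower())

class Directory:
    """Where the index lives; here just in-memory structures."""
    def __init__(self):
        self.postings = {}  # (field name, term) -> list of doc ids
        self.stored = []    # position in the list is the doc id

class IndexWriter:
    def __init__(self, directory, analyzer):
        self.directory, self.analyzer = directory, analyzer

    def add_document(self, doc):
        doc_id = len(self.directory.stored)
        self.directory.stored.append(doc)
        for f in doc.fields:
            # set() so a term repeated in one field is recorded once per document
            for term in set(self.analyzer.tokens(f.value)):
                self.directory.postings.setdefault((f.name, term), []).append(doc_id)
        return doc_id

directory = Directory()
writer = IndexWriter(directory, Analyzer())
writer.add_document(Document([Field("title", "JUG"), Field("body", "JUG Lugano meeting")]))
writer.add_document(Document([Field("title", "Lucene"), Field("body", "Indexing in Lugano")]))
print(directory.postings[("body", "lugano")])
```

The split of responsibilities is the part worth noting: the writer drives the process, the analyzer decides what counts as a term, and the directory decides where the result is stored.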

Page 11: How to build_a_search_engine

Lucene - Searching main classes

IndexSearcher
Collector
Query
TopDocs
ScoreDoc
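A toy Python sketch of the search side follows. As before, the class names follow the slide while the signatures are assumptions for illustration, not Lucene's Java API. The Collector role (gathering hits while the search runs) is folded into the sorting step here, `TermQuery` is a hypothetical single-term query, and the weights reuse the "Lugano" postings from the inverted index slide.

```python
from dataclasses import dataclass

@dataclass
class ScoreDoc:
    doc: int      # document id
    score: float  # relevance score of that document

@dataclass
class TopDocs:
    total_hits: int   # how many documents matched in total
    score_docs: list  # the best n matches as ScoreDoc objects

class TermQuery:
    """Hypothetical query type: matches documents containing one term."""
    def __init__(self, term):
        self.term = term

class IndexSearcher:
    def __init__(self, postings):
        self.postings = postings  # term -> {doc id: weight}

    def search(self, query, n):
        hits = self.postings.get(query.term, {})
        ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
        return TopDocs(len(hits), [ScoreDoc(d, s) for d, s in ranked[:n]])

searcher = IndexSearcher({"Lugano": {301: 0.94, 278: 0.15, 451: 0.87, 103: 0.45, 763: 0.77}})
top = searcher.search(TermQuery("Lugano"), 2)
print(top.total_hits, [sd.doc for sd in top.score_docs])
```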

Page 12: How to build_a_search_engine

Lucene - Analyzers

Stop words: "the book is on the table" → [book, table]
Stemming: [paint, paints, painted, …] → paint
Synonyms: [cat, feline] → cat
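The three steps can be chained into one small analysis function. The sketch below is deliberately naive: the stop-word list, the synonym map, and the suffix-stripping `stem` function are all assumptions chosen so that the slide's examples work, and real stemmers are far more careful.

```python
import re

STOP_WORDS = {"the", "is", "on", "a", "an", "of"}  # tiny illustrative list
SYNONYMS = {"feline": "cat"}                       # map variants to one canonical term

def stem(word):
    """Very naive suffix stripping, enough for the slide's 'paint' example."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def analyze(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    tokens = [stem(t) for t in tokens]                   # reduce to stems
    return [SYNONYMS.get(t, t) for t in tokens]          # unify synonyms

print(analyze("the book is on the table"))
print(analyze("paint paints painted"))
print(analyze("cat feline"))
```

The same analysis has to be applied to both the indexed text and the query text, otherwise a query for "painted" would never meet the indexed term "paint".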

Page 13: How to build_a_search_engine

Lucene - Search options

Fields
  title:JUG
  body:"JUG Lugano"

Wildcards
  J?G → [JUG, JAG, ...]
  J*G → [JUG, JEEG, JUNG, …]

Fuzzy (dictionary-based)
  JUG~[n] → [MUG, JAG, …]
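Wildcard and fuzzy matching both expand a pattern against the terms already in the index. The Python sketch below imitates that with a hypothetical `VOCABULARY` list: wildcards use the standard `fnmatch` module, and fuzzy matching uses Levenshtein edit distance. Treating the `n` in `JUG~[n]` as a maximum number of edits is an assumption made for this example.

```python
from fnmatch import fnmatchcase

VOCABULARY = ["JUG", "JAG", "MUG", "JEEG", "JUNG", "LUGANO"]  # hypothetical index terms

def wildcard(pattern, vocabulary):
    """'?' matches exactly one character, '*' matches any run of characters."""
    return [t for t in vocabulary if fnmatchcase(t, pattern)]

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def fuzzy(term, vocabulary, max_edits=1):
    """Vocabulary terms within max_edits of the query term, excluding the term itself."""
    return [t for t in vocabulary if t != term and edit_distance(term, t) <= max_edits]

print(wildcard("J?G", VOCABULARY))
print(wildcard("J*G", VOCABULARY))
print(fuzzy("JUG", VOCABULARY))
```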

Page 14: How to build_a_search_engine

Lucene - Search options

Range
  year:[2002 TO 2012]
  name:{Alberto TO Andrea}

Boost
  JUG^5 Lugano
  "JUG Lugano"^5

Proximity
  "JUG Lugano"~5

Boolean and existence
  AND, OR, NOT, (), +, -
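Two of these options are easy to misread: square brackets make a range inclusive while curly braces make it exclusive, and `+` / `-` mark terms that must or must not appear. A minimal Python sketch of both semantics, using hypothetical helper functions rather than Lucene's query parser:

```python
def in_range(value, low, high, inclusive=True):
    """[low TO high] includes both bounds; {low TO high} excludes them."""
    return low <= value <= high if inclusive else low < value < high

def matches(doc_terms, required=(), prohibited=()):
    """'+term' must be present in the document, '-term' must be absent."""
    terms = set(doc_terms)
    return all(t in terms for t in required) and not any(t in terms for t in prohibited)

# year:[2002 TO 2012] accepts the upper bound itself
print(in_range(2012, 2002, 2012))
# name:{Alberto TO Andrea} rejects the bounds but accepts names between them
print(in_range("Alberto", "Alberto", "Andrea", inclusive=False))
print(in_range("Alessandro", "Alberto", "Andrea", inclusive=False))
# +jug -zurich
print(matches(["jug", "lugano"], required=["jug"], prohibited=["zurich"]))
```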

Page 15: How to build_a_search_engine

HDFS - Lucene Integration

File copy from/to HDFS
Patch IndexWriter/Directory
Rewrite of IndexWriter on RAM
Lucene 4
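The first option, copying files to and from HDFS, amounts to building the index on local disk and then moving the finished directory with the Hadoop file system shell. The Python sketch below only assembles the `hadoop fs -put` and `hadoop fs -get` commands; the paths and the helper names are hypothetical, and actually running the commands assumes a configured Hadoop client on the machine.

```python
def hdfs_put(local_dir, hdfs_dir):
    """Command that uploads a locally built index directory to HDFS."""
    return ["hadoop", "fs", "-put", local_dir, hdfs_dir]

def hdfs_get(hdfs_dir, local_dir):
    """Command that downloads an index directory from HDFS for local searching."""
    return ["hadoop", "fs", "-get", hdfs_dir, local_dir]

# Hypothetical paths, chosen only for the example.
upload = hdfs_put("/tmp/index", "/search/index")
download = hdfs_get("/search/index", "/tmp/index-copy")
print(" ".join(upload))
print(" ".join(download))

# To execute on a machine with a Hadoop client installed:
# import subprocess
# subprocess.run(upload, check=True)
```

This approach leaves Lucene untouched, at the cost of copying the whole index each time it changes.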

Page 16: How to build_a_search_engine

And now...

Hands on!