Page 1

A Full Text Search Engine For BBS Lily

Speaker: Gu Rong (顾荣)
Advisor: Huang Yihua (黄宜华)

Email: gurongwalker@gmail.com

Page 2

Contents

1. Background
2. Brief Intro to the Principle of a Full Text Search Engine
3. Implementation of FTSE for BBS Lily
4. Maybe Google & Baidu Have Done These...
5. Conclusion

Page 3

1. Background

1.1 What is a full text search engine?
1.2 Why do we need it?

Page 4

What is a full text search engine?

In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user. (From Wikipedia)

Page 5

Why do we need a FTSE for BBS Lily?

Data in BBS Lily:
- Base capacity: around 3 million posts in total.
- Growth rate: over a thousand new posts every day.
- Post granularity: each post is 1K~4K in size.

Page 7

2. Brief Intro to the Principle of a Full Text Search Engine

What happens after you press Enter?

Page 8

Abstract IR Architecture

[Figure: abstract IR architecture.
- Online: Query -> Representation Function -> Query Representation -> Comparison Function -> Hits.
- Offline: document acquisition (e.g., web crawling) -> Documents -> Representation Function -> Document Representation -> Index, which feeds the Comparison Function.]

Page 9

About Representation Function

[Figure: Documents -> Bag of Words -> Inverted Index.
- Steps applied: case folding, tokenization, stopword removal, stemming.
- Other label on the slide: syntax, semantics, word knowledge, etc.]
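The four steps named on this slide can be sketched in a few lines of Python. This is only an illustration for English text: the stopword list and the `naive_stem` helper are assumptions made up for the example, and real BBS Lily posts are mostly Chinese, where tokenization would need a word segmentation step instead of a regular expression.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "in", "the", "of", "to"}


def naive_stem(token: str) -> str:
    """Very crude stemmer for illustration: strip one trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def bag_of_words(text: str) -> Counter:
    """Turn raw text into a bag of words (term -> count)."""
    # Case folding + tokenization: lowercase, then keep runs of letters/digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stopword removal.
    kept = [t for t in tokens if t not in STOPWORDS]
    # Stemming, then counting.
    return Counter(naive_stem(t) for t in kept)


bag = bag_of_words("Green eggs and ham")
```

Word order is thrown away on purpose: only the terms and their counts survive, which is exactly what the inverted index needs.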

Page 10

A Simple Inverted Index Demo

Doc 1: one fish, two fish
Doc 2: red fish, blue hat
Doc 3: cat in the hat
Doc 4: green eggs and ham

term    df   postings (doc: tf)
blue    1    2: 1
cat     1    3: 1
egg     1    4: 1
fish    2    1: 2, 2: 1
green   1    4: 1
ham     1    4: 1
hat     2    2: 1, 3: 1
one     1    1: 1
red     1    2: 1
two     1    1: 1
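The demo above can be reproduced with a short Python sketch. It assumes the same four documents, a three-word stopword list, and a crude plural-stripping rule (so that "eggs" becomes "egg"); these helpers are illustrative and not taken from the project code.

```python
import re
from collections import defaultdict

DOCS = {
    1: "one fish, two fish",
    2: "red fish, blue hat",
    3: "cat in the hat",
    4: "green eggs and ham",
}
STOPWORDS = {"in", "the", "and"}  # assumption: just enough for this demo


def normalize(token):
    """Crude stemming for the demo: drop a trailing plural 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def build_inverted_index(docs):
    """Return {term: {doc_id: term_frequency}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in STOPWORDS:
                continue
            term = normalize(token)
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


index = build_inverted_index(DOCS)
for term in sorted(index):
    print(term, index[term])
```

Answering a query for "fish" is then a dictionary lookup on `index["fish"]` instead of a scan over every document.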

Page 11

Map/Reduce's Role…

Indexing Problem
- Characteristics: 1. scalability; 2. relatively fast; 3. batch operation; 4. updates may not be important; 5. crawling is a challenge in itself
- Suitable? Perfect!

Retrieval Problem
- Characteristics: 1. must have sub-second response time; 2. for the web, only need relatively few results
- Suitable? Not so good…

Page 13

3. Implementation of FTSE for Lily BBS

3.1 Outline of Work Flow
3.2 Crawl Web Pages & Mine Info
3.3 Indexing Process
3.4 Set up Web Retrieval Interface
3.5 Optimization

Page 14

3.1 Outline of Work Flow

[Figure: work flow of the system, in three stages.
- Crawler: Web Page 0, Web Page 1, ..., Web Page n -> Crawl & Info Mining -> formatted files (content and side info).
- Map/Reduce: formatted files -> Inverted Index & Ranking -> token 0, token 1, ..., token n, each with a list of <DID, Rank> entries, plus an Index for Indices.
- Web Retrieval: JSP page -> query string -> Split into Term 0, Term 1, ..., Term n -> Search & Merge -> target DIDs -> result list (title, context, author, URL, hot) -> response.]

Page 15

3.2 Crawl Web Pages & Mine Info

3.2.1 Target
3.2.2 Framework of Lily BBS
3.2.3 Strategy of Crawler
3.2.4 Strategy of Miner

Page 16

Target of Crawler & Miner

A. Crawler
- Crawl every post from BBS Lily continuously.
- Fault tolerance.

B. Miner
- Mine the wanted info from each post that the Crawler has fetched from the web, and store it in a designed pattern.

Page 17

Framework of BBS Lily (1)

[Figure: tree structure of BBS Lily. BBS Lily -> section 0, section 1, section 2, ..., section 12 -> Board 0, Board 1, ..., Board n -> Post 0, Post 1, ..., Post n.]

Page 18

Framework of BBS Lily (2)

Page 19

Strategy of Crawler: DFS

[Figure: the same tree as on Page 17 (BBS Lily -> sections -> boards -> posts), traversed depth-first.]

Tips:
- Traverse the catalog links to get the content;
- Automatically follow the link to the Next Page and repeat the routine job.
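A minimal sketch of the depth-first strategy, run against an in-memory stand-in for the site rather than the real BBS. The `SITE` dictionary, its page names, and the `fetch` callback are all hypothetical; a real crawler would download and parse HTML at that point and wrap the fetch in error handling for fault tolerance.

```python
# Hypothetical in-memory stand-in for the site: each catalog page lists its
# child links and, optionally, a "next" page that continues the same listing.
SITE = {
    "bbs": {"links": ["section0", "section1"], "next": None},
    "section0": {"links": ["board0"], "next": None},
    "section1": {"links": ["board1"], "next": None},
    "board0": {"links": ["post0", "post1"], "next": "board0?page=2"},
    "board0?page=2": {"links": ["post2"], "next": None},
    "board1": {"links": ["post3"], "next": None},
}


def crawl(start, fetch):
    """Depth-first traversal. Pages for which fetch() returns None are posts."""
    posts, seen, stack = [], set(), [start]
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        if page is None:  # leaf of the tree: an individual post
            posts.append(url)
            continue
        # Push the Next Page first so the current page's links are visited
        # before the crawler moves on to the continuation of the listing.
        if page["next"]:
            stack.append(page["next"])
        stack.extend(reversed(page["links"]))
    return posts


posts = crawl("bbs", SITE.get)
```

The explicit stack keeps the traversal depth-first without recursion, and the `seen` set prevents fetching the same page twice when links repeat.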

Page 20

Strategy of Miner: Regex

1. Use HtmlParser to get the tags' content.
2. Extract info by regex.
3. Store it in a designed pattern.

[Each post will be stored in a line following the pattern below]

URL'\007'hot'\007'author'\007'title'\007'content

See Demo
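The storage pattern can be illustrated with a small Python sketch. The HTML snippet and the regular expressions are invented for the example, since the real BBS Lily markup is not shown in the slides; writing "null" for a missing field is likewise an assumption. Only the field order and the `\007` separator come from the slide.

```python
import re

SEP = "\007"  # ASCII BEL, the field separator shown on the slide
FIELDS = ("url", "hot", "author", "title", "content")

# Hypothetical markup, for illustration only.
SAMPLE_HTML = (
    '<div class="post" data-hot="12"><span class="author">alice</span>'
    "<h1>Hello</h1><p>First\npost</p></div>"
)


def mine(url, html):
    """Extract the five fields with regexes; missing fields become 'null'."""
    def grab(pattern):
        match = re.search(pattern, html, re.S)
        return match.group(1).strip() if match else "null"

    return {
        "url": url,
        "hot": grab(r'data-hot="(\d+)"'),
        "author": grab(r'<span class="author">(.*?)</span>'),
        "title": grab(r"<h1>(.*?)</h1>"),
        "content": grab(r"<p>(.*?)</p>"),
    }


def to_line(record):
    """One post per line: whitespace (including newlines) inside a field is collapsed."""
    return SEP.join(" ".join(str(record[f]).split()) for f in FIELDS)


def from_line(line):
    """Inverse of to_line: split a stored line back into named fields."""
    return dict(zip(FIELDS, line.rstrip("\n").split(SEP)))


record = mine("http://example.invalid/post/1", SAMPLE_HTML)
line = to_line(record)
```

A control character such as `\007` is a convenient separator because it practically never occurs in post text, so no escaping is needed.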

Page 21

3.3 Indexing Process

3.3.1 Target
3.3.2 Filter Source File
3.3.3 Build Inverted Index
3.3.4 Partition Inverted Index File
3.3.5 Second-Level Index (Index for Indices)

Page 22

Target of Indexing Process

Run a series of Map/Reduce operations to generate inverted indices with rank and position info.

[Figure: indexing process pipeline: Txt_Filter -> Inverted Index -> Partition Index Table -> Index For Indices.]
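To make the Map/Reduce idea concrete, here is a single-process Python simulation of one indexing pass that records positions. It is a sketch of the programming model only: the function names are hypothetical, tokens are split on whitespace, and no real Map/Reduce framework API is used.

```python
from collections import defaultdict
from itertools import chain


def map_post(did, text):
    """Map: emit (term, (did, position)) for each whitespace-separated token."""
    for pos, term in enumerate(text.lower().split()):
        yield term, (did, pos)


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_term(term, values):
    """Reduce: collect the positions of one term per DID."""
    postings = defaultdict(list)
    for did, pos in values:
        postings[did].append(pos)
    return term, {did: sorted(p) for did, p in sorted(postings.items())}


def run_job(posts):
    """posts: {did: text}. Returns {term: {did: [positions]}}."""
    pairs = chain.from_iterable(map_post(did, text) for did, text in posts.items())
    return dict(reduce_term(term, values) for term, values in shuffle(pairs).items())


index = run_job({0: "map reduce map", 19: "reduce"})
```

Each post is mapped independently and each term is reduced independently, which is why the indexing job scales out so easily.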

Page 23

Filter Source File

Although the source file stores posts in a well-designed pattern, we still need to filter it before we build the inverted indices.

Reasons:
1. Examine and eliminate noise and duplicates.
   - Noise, e.g. "http://bbs.nju.edu.cn/...'\007'null'\007'null'\007'null'\007'null"
   - About duplications…
2. It is natural to pre-process the data before we really handle it.
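A filter along these lines might look like the following sketch. It assumes five `\007`-separated fields per line, treats a record whose non-URL fields are all "null" as noise, and defines a duplicate as a repeated URL; the slides do not say how duplicates were actually detected, so that last rule is an assumption.

```python
SEP = "\007"  # field separator used in the source file


def filter_lines(lines):
    """Yield only well-formed, non-noise, first-seen records."""
    seen_urls = set()
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split(SEP)
        if len(fields) != 5:
            continue  # malformed record
        url, *rest = fields
        if all(field.strip() == "null" for field in rest):
            continue  # noise: the page yielded no usable info
        if url in seen_urls:
            continue  # duplicate (assumption: same URL means same post)
        seen_urls.add(url)
        yield line


sample = [
    SEP.join(["u1", "3", "bob", "t", "c"]),
    SEP.join(["u2", "null", "null", "null", "null"]),
    SEP.join(["u1", "3", "bob", "t", "c"]),
    "broken line",
    SEP.join(["u3", "0", "eve", "t2", "c2"]),
]
kept = list(filter_lines(sample))
```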

Page 24

Build Inverted Index

The process of building an inverted index is smart; it will be smarter if we can calculate and record some side info properly at the same time.

The side info includes rank, positions, etc.

Details…

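Page 25

Build Inverted Index: Side Info

Side Info

1. TF-IDF (Term Frequency-Inverse Document Frequency):

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}, \qquad \mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$

- $n_{i,j}$: number of occurrences of the term $t_i$ in document $d_j$
- $|D|$: total number of documents in the corpus
- $|\{d : t_i \in d\}|$: number of documents where the term $t_i$ appears (that is, $n_{i,j} \neq 0$)

2. Position info does not need any calculation; it can be recorded as an integer pair like (StartIndex, EndIndex).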
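As a worked example of the TF-IDF definition, the sketch below computes the score with the common textbook form: term count divided by document length, times the natural log of |D| over the document frequency. The log base and the exact weighting used in the project are not given in the slides, so both are assumptions here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: {did: list of terms}. Returns {did: {term: tf-idf score}}."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for terms in docs.values() for term in set(terms))
    scores = {}
    for did, terms in docs.items():
        counts = Counter(terms)
        total = len(terms)
        scores[did] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


docs = {
    1: ["one", "fish", "two", "fish"],
    2: ["red", "fish", "blue", "hat"],
    3: ["cat", "hat"],
    4: ["green", "egg", "ham"],
}
scores = tf_idf(docs)
```

For "fish" in document 1, tf is 2/4 and idf is log(4/2), so a term that is frequent in the post but also common across the corpus is scored down.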

Page 26

Build Inverted Index: Structure

1. For each post in the filtered source file, its offset in the file can be considered as its DID.

2. Each line of the inverted index file stores a term with its info. The details are as below:

term info
info = SingleDIDInfo;SingleDIDInfo;SingleDIDInfo...
SingleDIDInfo = DID:rank:positions
positions = position%position%position%position...
position = IsTitle|start|end

E.g. for the term 黑莓 (BlackBerry):
黑莓 48522292:162.6:1|2|4%0|804|806;42910773:106.26:0|456|458%0|560|562
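The line format above is simple enough to parse by splitting on its four separators. The sketch below follows the grammar on the slide; the class and function names are hypothetical, and it assumes the term and its info are separated by whitespace, as in the example line.

```python
from typing import NamedTuple


class Position(NamedTuple):
    is_title: bool
    start: int
    end: int


class Posting(NamedTuple):
    did: int
    rank: float
    positions: list


def parse_index_line(line):
    """Parse 'term info' where info = DID:rank:positions;DID:rank:positions..."""
    term, info = line.rstrip("\n").split(None, 1)
    postings = []
    for single in info.split(";"):
        did, rank, positions = single.split(":")
        parsed = []
        for position in positions.split("%"):
            is_title, start, end = position.split("|")
            parsed.append(Position(is_title == "1", int(start), int(end)))
        postings.append(Posting(int(did), float(rank), parsed))
    return term, postings


term, postings = parse_index_line(
    "黑莓 48522292:162.6:1|2|4%0|804|806;42910773:106.26:0|456|458%0|560|562"
)
```

In the example, the first posting says the term occurs in the title at offsets 2 to 4 and in the body at 804 to 806 of the post whose DID is 48522292.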

Page 27

Partition Inverted Index File

After the last step, we got the inverted index file.

However, the file is so big…

Source file size    Inverted index file size
48M                 72.5M
182M                240M
703M                828M

Page 28

Second-Level Index (Index for Indices)

In the last step, we partitioned the inverted index file into a certain number of parts, for example 16. Each part file contains some term-info pairs.

So, when a term is given, how can we know which part file it is in, and which line it is on?

We need an Index for Indices.

P.S. This really works. The second-level index file's size is less than 10% of the source file.
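One way to picture the two levels is the sketch below. It keeps the part files as in-memory lists and assigns terms to parts with a CRC32 hash; the slides do not say how terms were assigned to parts, so the hashing scheme, the function names, and the `(part, line number)` layout of the second-level index are all assumptions. With real files, the line number would more likely be a byte offset.

```python
import zlib

NUM_PARTS = 16  # the example value from the slide


def part_of(term, num_parts=NUM_PARTS):
    """Stable hash partitioning (Python's built-in hash() is salted per process)."""
    return zlib.crc32(term.encode("utf-8")) % num_parts


def partition(index_lines, num_parts=NUM_PARTS):
    """Split index lines into parts and build the second-level index."""
    parts = [[] for _ in range(num_parts)]
    second_level = {}
    for line in index_lines:
        term = line.split(None, 1)[0]
        part = part_of(term, num_parts)
        second_level[term] = (part, len(parts[part]))  # (part file, line number)
        parts[part].append(line)
    return parts, second_level


def lookup(term, parts, second_level):
    """Find a term's index line by consulting the second-level index first."""
    location = second_level.get(term)
    if location is None:
        return None
    part, line_no = location
    return parts[part][line_no]


lines = ["fish 1:0.35:0|4|8", "hat 2:0.17:0|15|18", "cat 3:0.69:1|0|3"]
parts, second_level = partition(lines)
```

The second-level index holds only a term and a location per entry, which is why it stays small compared with the posting lists it points to.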

Page 29

Second-Level Index (Index for Indices)

Source file size    Inverted index file size    Second-level index file size
48.1M               72.5M                       2.375M
182M                240M                        5.17M
703M                828M                        10.5M

Page 30

3.4 Set up Web Retrieval Interface

3.4.1 Target
3.4.2 Sort Pages

Page 31

Target of Web Retrieval Interface

Make an interface which accepts the user's query and responds with search results.

1. Restrict the query string;
2. Sort search results dynamically;
3. Return results page by page.

Page 32

Sort Pages

Here is a demo.

Query String -> Word Segment -> Term 0, Term 1, Term 2

Each term returns its own list of <Doc, rank> pairs:
- Doc1 10, Doc3 90, Doc7 20
- Doc2 20, Doc7 80, Doc5 15
- Doc3 05, Doc2 40, Doc6 40

Merge -> Rank Again:
Doc7 100
Doc3 95
Doc6 40
Doc2 40
Doc5 15
Doc1 10

Merge the ranks and rank again~
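A minimal sketch of the merge step, assuming the per-term ranks are simply summed per document and ties are broken by document name. The slides do not define the merge rule; summation reproduces the slide's totals for Doc7 (100) and Doc3 (95), but it gives Doc2 a total of 60 rather than the 40 shown, so treat the rule as an assumption.

```python
from collections import defaultdict


def merge_and_rank(per_term_results):
    """Sum each document's ranks over all terms, then sort by total, descending."""
    totals = defaultdict(float)
    for results in per_term_results:
        for did, rank in results:
            totals[did] += rank
    # Ties are broken by document id so the order is deterministic.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


per_term = [
    [("Doc1", 10), ("Doc3", 90), ("Doc7", 20)],
    [("Doc2", 20), ("Doc7", 80), ("Doc5", 15)],
    [("Doc3", 5), ("Doc2", 40), ("Doc6", 40)],
]
merged = merge_and_rank(per_term)
```

Documents that match several query terms accumulate rank from each of them, so they naturally rise above documents that match only one term.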

Page 33

3.5 Optimization

Optimization measures in different areas.

Reduce Sort Time
a) For each term, at most the top 1500 DIDs are kept.
b) Use TreeMap to sort.

Reduce I/O operations
a) The response page is created dynamically.
b) Each time, 10 records are returned.

Cache Strategy
a) Put some hot inverted index files in memory.
b) Cache replacement: LRU

……
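Two of these measures are easy to sketch in Python: keeping only the top-ranked DIDs per term, and an LRU cache for hot index data. The slide mentions Java's TreeMap; `heapq.nlargest` is used here as a stand-in technique for the top-k selection, and the `LRUCache` class with its `load` callback is a hypothetical illustration rather than the project's code.

```python
import heapq
from collections import OrderedDict

TOP_K = 1500  # per-term cap from the slide


def top_postings(postings, k=TOP_K):
    """Keep only the k highest-ranked (did, rank) pairs, best first."""
    return heapq.nlargest(k, postings, key=lambda posting: posting[1])


class LRUCache:
    """Least-recently-used cache for hot inverted index entries."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def __contains__(self, key):
        return key in self._items

    def get(self, key, load):
        """Return the cached value, calling load(key) on a miss."""
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        value = load(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
        return value


cache = LRUCache(capacity=2)
loads = []


def load_from_disk(key):
    # Stand-in for reading an index part file from disk.
    loads.append(key)
    return key.upper()


for key in ("a", "b", "a", "c"):
    cache.get(key, load_from_disk)
```

After the four lookups, "b" has been evicted because "a" was touched more recently, and only three loads hit the disk stand-in.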

Page 35

4. Maybe Google & Baidu Have Done...

1. Parallel search: search stuff in parallel.
2. Word segmentation: an outstanding word segmentation algorithm.
3. Rank: a better rank strategy, to describe the relationship between a token and a DID precisely.
4. User's query: record each user's query string;
   a) feed it back to word segmentation
   b) provide a reminder (suggestion) function, triggered by the input change event
……

Page 37

5. Conclusion

5.1 Summary
5.2 Highlights
5.3 About Map/Reduce…

Page 38

Summary

Related work has three parts, as below:
- Crawler: a hard-coded crawler and miner, aimed at getting the data of BBS Lily.
- Indexing process: the indexing process runs as a sequence of MapReduce operations.
- Web retrieval: set up a web interface for users to retrieve info.

Page 39

Highlights

View of Application
This stuff is COOL~ It can provide a friendly user experience when we want to search for something in our BBS Lily.

View of Technology
Use Map/Reduce to process data offline. It has provided several benefits, such as:
1. The indexing code is simpler, smaller, and easier to understand.
2. We can keep conceptually unrelated computations separate.
3. The indexing process has become much easier to operate and maintain.

Page 40

The system

Page 41

About Map/Reduce

Map/Reduce is not just a Programming Model, actually it's also a Life Model…

Page 42

Many thanks to…

Teacher Huang; Yang Xiaoliang; Xiao Tao; Liu Yulong; Zhang Lu; NUAA & NJU…

Page 43

Email: gurongwalker@gmail.com