Search Engine - How to Make it
Uploaded by andreas-yunanto
Search Engine
How To Make it
Wednesday, December 12, 12
Search Engine
• Search Quality Measurement
[Venn diagram: within the set of all documents, the retrieved documents (RET) and the relevant documents (REL) overlap in RET ∩ REL]
• Database search: low recall, high precision
• Web search: high recall, low precision
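The slide names the sets RET and REL but does not spell out the two measures. Under the standard definitions, precision is |RET ∩ REL| / |RET| and recall is |RET ∩ REL| / |REL|. The sketch below computes both for one query; the function name, the document ids, and the choice to return 0.0 for an empty set are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents the engine returned (RET)
    relevant:  ids of the documents that are actually relevant (REL)
    """
    ret, rel = set(retrieved), set(relevant)
    hits = ret & rel  # RET ∩ REL
    # Assumed convention: an empty denominator yields 0.0 instead of an error.
    precision = len(hits) / len(ret) if ret else 0.0
    recall = len(hits) / len(rel) if rel else 0.0
    return precision, recall


# Hypothetical query: four documents retrieved, three relevant, two in common.
print(precision_recall({"a", "b", "c", "d"}, {"b", "d", "e"}))
```

A system that returns very few, carefully matched results tends toward the "high precision, low recall" corner; one that returns almost everything tends toward the opposite corner.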
Search Engine
[Architecture diagram: File System, 3rd party apps, Database → File System Crawler, Crawler API, Database Crawler → raw content (PDF, Text, HTML, Document, Image, ...) → Text Parser, HTML Parser, PDF Parser → Documents (title, summary, author, datetime) → Document Enhancing → Documents (Categorized, Taxonomized) → Indexer (Stop Analyzer, Language Analyzer) → Index → Index Searcher → Mobile Client, Web Client, Document Landing Page]
Search Engine
• Process in Search Engine
• Crawling
• Parsing
• Indexing
• Searching
Search Engine
• Process in Search Engine
• Crawling
• Parsing
• Duplicate Content Detection
• Document Enhancement
• Indexing
• Searching
• Document Serving
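One way to picture the build-time part of this list is as a chain of functions, each consuming the previous stage's output. The sketch below is only a toy: every stage body is a placeholder invented for illustration (for example, the duplicate check here is an exact text match), and only the ordering follows the slide. Searching and document serving happen at query time, so they are left out of the chain.

```python
def crawl(sources):
    # Collect raw content; here the "sources" are already in-memory strings.
    return list(sources)


def parse(raw_items):
    # Turn raw content into structured documents.
    return [{"id": i, "text": raw.strip()} for i, raw in enumerate(raw_items)]


def deduplicate(docs):
    # Placeholder: keep the first document for each distinct lower-cased text.
    seen, unique = set(), []
    for doc in docs:
        key = doc["text"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def enhance(docs):
    # Placeholder: attach an empty list of taxonomy tags to each document.
    return [dict(doc, tags=[]) for doc in docs]


def index(docs):
    # Map each lower-cased word to the set of ids of documents containing it.
    inverted = {}
    for doc in docs:
        for word in doc["text"].lower().split():
            inverted.setdefault(word, set()).add(doc["id"])
    return inverted


PIPELINE = [crawl, parse, deduplicate, enhance, index]


def run_pipeline(sources):
    result = sources
    for stage in PIPELINE:
        result = stage(result)
    return result


inverted_index = run_pipeline(["Hello world", "hello world ", "Search engines"])
print(sorted(inverted_index))
```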
Search Engine
• Crawling
• Collecting Data
• Input : The data content to be searched
• Output : Raw Content Data in its original format
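As a minimal sketch of this stage for the file-system case, the generator below walks a directory tree and yields each file's bytes untouched, which matches "raw content in its original format". The function name and the decision to skip unreadable files are assumptions, not something the slides specify.

```python
import os


def crawl_file_system(root):
    """Yield (path, raw_bytes) for every regular file under `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    # Content is returned as-is; parsing is a later stage.
                    yield path, handle.read()
            except OSError:
                # Assumed policy: skip unreadable files and keep crawling.
                continue


# Usage (hypothetical directory name):
# for path, raw in crawl_file_system("./documents"):
#     print(path, len(raw), "bytes")
```

A database crawler or a crawler API would expose the same shape (an identifier plus raw content), so the later stages would not need to know where a document came from.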
Search Engine
• Crawling
[Diagram: File System, 3rd party apps, Database → File System Crawler, Crawler API, Database Crawler → raw content (PDF, Text, HTML, Document, Image, ...)]
Search Engine
• Parsing
• Process to extract elements from crawled documents
• Input : Raw Contents
• Output : Textual Structured Documents
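For the HTML case, a parser of this kind can be sketched with Python's standard `html.parser` module. The class and function names are hypothetical, and the sketch extracts only a title and the visible text; fields such as summary, author, and datetime would need format-specific rules that the slides do not describe.

```python
from html.parser import HTMLParser


class SimpleHTMLDocumentParser(HTMLParser):
    """Collect the <title> and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_html(raw_bytes, encoding="utf-8"):
    """Turn raw HTML bytes into a structured document (a dict)."""
    parser = SimpleHTMLDocumentParser()
    parser.feed(raw_bytes.decode(encoding, errors="replace"))
    parser.close()
    return {
        "title": "".join(parser.title_parts).strip(),
        "text": " ".join(parser.text_parts),
    }
```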
Search Engine
• Parsing
[Diagram: raw content (PDF, Text, HTML, Document, Image, ...) → Text Parser, HTML Parser, PDF Parser → Documents (title, summary, author, datetime)]
Search Engine
• Duplicate Content Detection
• More data means more duplicated data
• The search engine implements similar-document detection
Search Engine
• Document Representation
Model: Term Frequency (Tf)
Example:
Document 1 (d1) = "andi likes to watch movie. His wife likes it too"
Document 2 (d2) = "andi also likes to watch soccer game."
Dictionary = {1:andi, 2:likes, 3:watch, 4:movie, 5:wife, 6:too, 7:soccer}
Document representation in the Tf model:
d1 = {1, 2, 1, 1, 1, 1, 0}
d2 = {1, 1, 1, 0, 0, 0, 1}
Search Engine
• Document Similarity
Similarity between documents d1 and d2: S(d1, d2)
S(d1, d2) = |d1 - d2|, the sum of the absolute differences between corresponding entries
d1 = {1, 2, 1, 1, 1, 1, 0}
d2 = {1, 1, 1, 0, 0, 0, 1}
Example:
S(d1, d2) = |1-1|+|2-1|+|1-1|+|1-0|+|1-0|+|1-0|+|0-1|
S(d1, d2) = 5
With this definition, a smaller value means the two documents are more similar.
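The two steps above (build a Tf vector over the dictionary, then sum the absolute differences) can be checked mechanically. The sketch below reuses the slide's dictionary and sentences; the tokenization rule (lower-case, letters only) and the function names are assumptions.

```python
import re

# Dictionary from the slide; list positions follow its numbering 1..7.
DICTIONARY = ["andi", "likes", "watch", "movie", "wife", "too", "soccer"]


def term_frequency(text, dictionary=DICTIONARY):
    """Count how often each dictionary term occurs in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(term) for term in dictionary]


def distance(v1, v2):
    """S(d1, d2): sum of absolute differences between two Tf vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


d1 = term_frequency("andi likes to watch movie. His wife likes it too")
d2 = term_frequency("andi also likes to watch soccer game.")
print(d1)
print(d2)
print(distance(d1, d2))
```

Words that are not in the dictionary (such as "to" or "game") are simply ignored, which is why the vectors have exactly seven entries.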
Search Engine
• Algorithm
1. Count Tf for every document.
2. For a document d, find the smallest value of S(d, di) over the whole document collection to get the document most similar to d.
3. If S(d, di) < threshold, compare the creation dates of d and di, then erase the older document.
4. Repeat steps 2 and 3 until no value of S is less than the threshold.
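A minimal sketch of these steps follows. It is a variant rather than a literal transcription: instead of fixing one document d, each round looks for the closest pair in the whole collection, which reaches the same stopping condition. The dict keys `text` and `created`, the whitespace tokenization, and the rule for equal creation dates are assumptions.

```python
from collections import Counter
from itertools import combinations


def tf(text):
    """Step 1: term frequencies of a document (whitespace tokens, lower-cased)."""
    return Counter(text.lower().split())


def distance(tf_a, tf_b):
    """S(d, di): sum of absolute differences between two Tf counters."""
    return sum(abs(tf_a[t] - tf_b[t]) for t in set(tf_a) | set(tf_b))


def remove_near_duplicates(docs, threshold):
    """Drop the older document of every pair whose distance is below `threshold`.

    docs: list of dicts with the hypothetical keys 'text' and 'created'.
    """
    remaining = [(doc, tf(doc["text"])) for doc in docs]
    while len(remaining) > 1:
        # Step 2: find the closest pair among the remaining documents.
        i, j = min(
            combinations(range(len(remaining)), 2),
            key=lambda pair: distance(remaining[pair[0]][1], remaining[pair[1]][1]),
        )
        # Step 4: stop once even the closest pair is not below the threshold.
        if distance(remaining[i][1], remaining[j][1]) >= threshold:
            break
        # Step 3: compare creation dates and erase the older document
        # (on a tie, the one that appears first is treated as older).
        older = i if remaining[i][0]["created"] <= remaining[j][0]["created"] else j
        del remaining[older]
    return [doc for doc, _ in remaining]


docs = [
    {"text": "andi likes to watch movie", "created": "2012-12-01"},
    {"text": "andi likes to watch movie too", "created": "2012-12-10"},
    {"text": "weather report for jakarta today", "created": "2012-12-11"},
]
kept = remove_near_duplicates(docs, threshold=2)
print([doc["created"] for doc in kept])
```

Comparing every pair in every round is quadratic in the number of documents, so this only illustrates the logic; a large collection would need a cheaper way to find candidate pairs.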
Search Engine
• Document Enhancement
• Tags documents based on a taxonomy
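The slides do not say how tags are assigned. One simple possibility, sketched below, is keyword matching: a document receives a category when its title or summary contains one of that category's keywords. The taxonomy contents, field names, and function name are all invented for illustration.

```python
import re

# Hypothetical taxonomy: category -> keywords that signal it.
TAXONOMY = {
    "sports": {"soccer", "football", "match"},
    "film": {"movie", "cinema", "actor"},
}


def tag_document(doc, taxonomy=TAXONOMY):
    """Return a copy of `doc` with a sorted 'categories' list added."""
    text = (doc.get("title", "") + " " + doc.get("summary", "")).lower()
    words = set(re.findall(r"[a-z]+", text))
    categories = sorted(
        category for category, keywords in taxonomy.items() if words & keywords
    )
    return dict(doc, categories=categories)


tagged = tag_document({"title": "Andi likes to watch soccer game", "summary": "A movie night"})
print(tagged["categories"])
```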
Search Engine
• Document Enhancement
[Diagram: Documents (title, summary, author, datetime) → Document Enhancing → Documents (Categorized, Taxonomized)]
Search Engine
• Indexing
• Indexes all the information that has been gathered for each document
• Makes the searching process faster
• Allows searching by a specific field
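A common structure that gives both properties is an inverted index keyed by field and term: lookups avoid scanning every document, and the field is part of the key. The sketch below is a toy in-memory version with invented names; it is not how Lucene stores its index.

```python
import re
from collections import defaultdict


class FieldedIndex:
    """Toy inverted index: (field, term) -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, document):
        """Index every word of every field of `document` (a dict)."""
        for field, value in document.items():
            for term in re.findall(r"\w+", str(value).lower()):
                self._postings[(field, term)].add(doc_id)

    def search(self, field, term):
        """Return the sorted ids of documents whose `field` contains `term`."""
        return sorted(self._postings.get((field, term.lower()), set()))


idx = FieldedIndex()
idx.add(1, {"title": "Search Engine", "author": "Andi"})
idx.add(2, {"title": "Database Basics", "author": "Budi"})
idx.add(3, {"title": "Engine Tuning", "author": "Andi"})
print(idx.search("title", "engine"))
print(idx.search("author", "andi"))
```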
Search Engine
• Indexing
[Diagram: Documents (Categorized, Taxonomized) → Indexer (Stop Analyzer, Language Analyzer) → Index]
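The slides name a stop analyzer and a language analyzer without defining them. Assuming the first drops very common words and the second applies language-specific normalization, the pair could be sketched as below. The stop-word list is illustrative only, and the "normalizer" is a deliberately crude stand-in (it strips a trailing "s"), not a real stemmer.

```python
import re

# Illustrative stop-word list, not a complete one.
STOP_WORDS = {"a", "an", "and", "also", "his", "it", "the", "to"}


def stop_analyzer(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


def crude_english_normalizer(tokens):
    """Stand-in for a language analyzer: strip a trailing 's' from longer words."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text):
    """Run both analysis steps in order; the result is what gets indexed."""
    return crude_english_normalizer(stop_analyzer(text))


print(analyze("andi likes to watch movie. His wife likes it too"))
```

The same analysis should be applied to queries at search time, otherwise a query term such as "likes" would never match the indexed form.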
Search Engine
• Searching
[Diagram: Index, Index Searcher, Mobile Client, Web Client]
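A minimal sketch of what an index searcher does with a query, assuming the index maps each term to the set of ids of the documents that contain it. The AND semantics (every query term must match) and the hand-built index are assumptions made for illustration; ranking is left out.

```python
def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # One posting set per term; an unknown term contributes an empty set.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical index: term -> ids of the documents containing it.
index = {
    "search": {1, 2},
    "engine": {2, 3},
    "lucene": {3},
}
print(search(index, "search engine"))
```

A mobile client and a web client would both call the same function (typically behind an HTTP endpoint) and differ only in how they render the returned ids.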
Search Engine
• Document Serving
• The search engine is also responsible for displaying the results
Search Engine
[Diagram: Index, Index Searcher, Mobile Client, Web Client, Document Landing Page]
Search Engine
• Recommended Open Source Technology
• Search Engine : Lucene, Nutch
• Programming Library : Hadoop, Scala Actor
• Database : MongoDB, PostgreSQL
• Programming Language : Java, Scala, PHP
Thank You