Introduction to Crawlers
Speaker: Jinglun
Target
• Understand the concept of a crawler
• Design and implement a crawler with good performance and flexibility
Agenda
• Crawler introduction
• Design
• More challenges
• Source code (if time permits)
What Is a Crawler?
• Web crawler
• Vertical crawler
Example
• Get mappings from source code to yicu
  – Step 1: Find product links from codesearch
  – Step 2: Find source code links from svn
  – Step 3: Find mappings from the source code
Requirements
• Get mappings from source code to yicu
Format Requirements
            Features                                Qualities              Constraints
Business    Find which source code uses our libs    1. High performance    1. Thousands of pages
                                                    2. Flexibility         2. One machine
                                                    3. Easy maintenance    3. Several days
Users       Only me for now
Developers  Only me
Architecture
• Two directions
• Two dimensions
[Diagram labels: Process, Components, Layers (View, Control, Module, General Components, Business Bus)]
Layers
• Crawler (View and Control)
• Downloader, Extractor (Module)
• Storage (General Components)
• Business Bus: not used
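The layering above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the class names (DictDownloader, LinkExtractor, MemoryStorage) are hypothetical and not from the original source, and the downloader reads from an in-memory dict instead of the network so the example stays self-contained.

```python
import re


class DictDownloader:
    """Module layer: returns page bodies (here from a dict, not the network)."""

    def __init__(self, pages):
        self.pages = pages

    def get_page(self, url):
        return self.pages[url]


class LinkExtractor:
    """Module layer: pulls double-quoted href values out of a page."""

    HREF = re.compile(r'href="([^"]+)"')

    def parse(self, page):
        return self.HREF.findall(page)


class MemoryStorage:
    """General component: keeps pages and meta info keyed by URL."""

    def __init__(self):
        self.pages = {}
        self.meta = {}

    def save_page(self, url, page):
        self.pages[url] = page

    def save_meta(self, url, meta):
        self.meta[url] = meta


class Crawler:
    """View and Control: wires the modules together and drives the loop."""

    def __init__(self, downloader, extractor, storage):
        self.downloader = downloader
        self.extractor = extractor
        self.storage = storage

    def run(self, seed_urls):
        for url in seed_urls:
            page = self.downloader.get_page(url)
            self.storage.save_page(url, page)
            self.storage.save_meta(url, self.extractor.parse(page))


# Hypothetical page set standing in for real downloads.
pages = {"http://example.test/a": '<a href="http://example.test/b">b</a>'}
storage = MemoryStorage()
Crawler(DictDownloader(pages), LinkExtractor(), storage).run(list(pages))
print(storage.meta)
```

Because the Crawler only talks to its three collaborators through small methods, a real HTTP downloader or a file-based storage could be swapped in without touching the control loop.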
Other views
• Running view
• Deploy view
• Data view
• Develop view
• …
Develop View
• Trunk/
  – Src/
  – Test/
  – Bin/
Review Design
            Features                                Qualities              Constraints
Business    Find which source code uses our libs    1. High performance    1. Thousands of pages
                                                    2. Flexibility         2. One machine
                                                    3. Easy maintenance    3. Several days
Users       Only me for now
Developers  Only me
Solutions
• Crawler
• Downloader (one API)
• Storage
• Extractor
Crawler
DFS or BFS?
BFS, because:
1. The deeper a page is, the less important it tends to be
2. Many paths lead to a given page
3. Simpler to extend to a distributed crawler
4. More efficient to develop
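A breadth-first crawl can be sketched in Python with a queue and a seen-set. The link graph, the `get_links` callback, and the `max_depth` cut-off are assumptions made for illustration; they are not taken from the original crawler.

```python
from collections import deque


def bfs_crawl(seed_urls, get_links, max_depth):
    """Visit URLs breadth-first; return (url, depth) pairs in visit order."""
    seen = set()
    queue = deque()
    for url in seed_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))

    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # deeper pages are treated as unimportant and skipped
        for link in get_links(url):
            if link not in seen:  # many paths can reach the same page
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical link graph standing in for real pages.
graph = {
    "seed": ["a", "b"],
    "a": ["c", "b"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}
visited = bfs_crawl(["seed"], lambda u: graph.get(u, []), max_depth=2)
print(visited)
```

Pages are visited level by level, so the shallow (more important) pages are fetched first, and the queue is the single piece of state that would have to be shared if the crawl were later distributed.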
Crawler
foreach ($seed_urls as $url) {
    $page = GetPage($url);      // download or read from file
    Save($page);
    $meta_info = Parse($page);
    Save($meta_info);
}
Storage
• What should be stored?
  – URLs?
  – Meta info?
  – HTTP response (header, body, curl_info)?
• How should it be stored?
  – MySQL?
  – File system?
Storage
Hash table
• Index
  – md5
• Physical store
  – Header file, body file, curl_info file
Storage
• Basedir/
  – header_093e3575e895287cf471e6d5f5028446
    body_093e3575e895287cf471e6d5f5028446
    info_093e3575e895287cf471e6d5f5028446
  – header_66e7c612cf23049fa731c831bcee9048
    body_66e7c612cf23049fa731c831bcee9048
    info_66e7c612cf23049fa731c831bcee9048
  – …
  – meta_info.txt
  – failed_download_url.txt
  – failed_extract_page.txt
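The directory layout above could be produced by a small file-backed store like the Python sketch below. Two points are assumptions rather than facts from the slides: that the md5 key is computed from the URL string, and that the info file holds JSON. The class and method names are hypothetical.

```python
import hashlib
import json
import os
import tempfile


class FileStorage:
    """Stores each response as header_<md5>, body_<md5>, info_<md5> files."""

    def __init__(self, basedir):
        self.basedir = basedir
        os.makedirs(basedir, exist_ok=True)

    @staticmethod
    def key(url):
        # Assumption: the index key is the md5 of the URL string.
        return hashlib.md5(url.encode("utf-8")).hexdigest()

    def path(self, kind, url):
        return os.path.join(self.basedir, f"{kind}_{self.key(url)}")

    def save(self, url, header, body, info):
        with open(self.path("header", url), "w", encoding="utf-8") as f:
            f.write(header)
        with open(self.path("body", url), "wb") as f:
            f.write(body)
        # Assumption: transfer info is serialized as JSON.
        with open(self.path("info", url), "w", encoding="utf-8") as f:
            json.dump(info, f)

    def has(self, url):
        return os.path.exists(self.path("body", url))

    def load_body(self, url):
        with open(self.path("body", url), "rb") as f:
            return f.read()

    def log_failed_download(self, url):
        log_path = os.path.join(self.basedir, "failed_download_url.txt")
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(url + "\n")


basedir = tempfile.mkdtemp()
store = FileStorage(basedir)
store.save("http://example.test/", "HTTP/1.1 200 OK", b"<html></html>", {"http_code": 200})
print(sorted(os.listdir(basedir)))
```

The `has` check is what lets the crawler "download or read from file": a page already on disk can be re-parsed without hitting the network again.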
Extractor
• DOM tree
• Regular expressions
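The two extraction styles can be compared in a short Python sketch. As an assumption for the sake of a standard-library-only example, the page is well-formed XHTML so that `xml.etree.ElementTree` can build the tree; real-world HTML would likely need a more tolerant parser. The function names and the sample page are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def links_by_tree(xhtml):
    """Parse the page into an element tree and read href from every <a>."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.iter("a") if a.get("href")]


def links_by_regex(html):
    """Quick but brittle: only matches double-quoted href attributes."""
    return re.findall(r'<a\s[^>]*href="([^"]+)"', html)


# Hypothetical page; the second link uses single quotes on purpose.
page = (
    "<html><body>"
    '<a href="/src/main.c">main</a>'
    "<a class='x' href='/src/util.c'>util</a>"
    "</body></html>"
)

print(links_by_tree(page))
print(links_by_regex(page))
```

The tree-based version finds both links, while this particular regex misses the single-quoted one, which illustrates the usual trade-off: regular expressions are fast to write but sensitive to markup variations.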
Review Design
            Features                                Qualities              Constraints
Business    Find which source code uses our libs    1. High performance    1. Thousands of pages
                                                    2. Flexibility         2. One machine
                                                    3. Easy maintenance    3. Several days
Users       Only me for now
Developers  Only me
Performance Issues?
• Multiple threads?
• Multiple processes?
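One way to explore the multi-thread option is a thread pool, sketched here in Python. The `fake_download` function is a hypothetical stand-in that sleeps instead of touching the network, and the worker count is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Stand-in for a network fetch: sleeps to simulate I/O wait."""
    time.sleep(0.05)
    return url, f"<html>{url}</html>"


def download_all(urls, workers):
    # Threads suit downloading because the time is spent waiting on I/O.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fake_download, urls))


urls = [f"http://example.test/page/{i}" for i in range(20)]

start = time.perf_counter()
pages = download_all(urls, workers=10)
elapsed = time.perf_counter() - start
print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

In Python specifically, CPU-heavy parsing would be a better fit for multiple processes (for example `concurrent.futures.ProcessPoolExecutor`), since threads there mainly help with I/O-bound work.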
Data Analysis
• Vim
• Shell (awk, sed)
• Regular expressions
More Challenges
• Distributed crawling
• Noise
• Duplication
• Quick updates
• Concurrency and performance
• …
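As one illustration of the duplication challenge, the Python sketch below filters repeats at two levels: by normalized URL and by a hash of the page body. The normalization rules and the `Deduplicator` class are assumptions chosen for the example; a production crawler would probably need more rules (and near-duplicate detection).

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lower-case scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",  # the fragment never reaches the server, so it is dropped
    ))


class Deduplicator:
    """Remembers normalized URLs and body hashes that were already seen."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_bodies = set()

    def is_new_url(self, url):
        key = normalize(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def is_new_body(self, body):
        digest = hashlib.md5(body).hexdigest()
        if digest in self.seen_bodies:
            return False
        self.seen_bodies.add(digest)
        return True


dedup = Deduplicator()
print(dedup.is_new_url("HTTP://Example.test"))
print(dedup.is_new_url("http://example.test/#top"))
```

The second URL is rejected because both forms normalize to the same string, so the page would be fetched only once.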