Introduction of vertical crawler

Post on 27-May-2015

Introduction of Crawler

Speaker: Jinglun

Target

• Know the concept of a crawler
• Design/implement a crawler with good performance and flexibility

Agenda

• Crawler Introduction
• Design
• More Challenges
• Source Code (if time permits)

What’s Crawler

• Web crawler
• Vertical crawler

Example

• Get mappings from source code to yicu
  – Step 1: Find product links from codesearch
  – Step 2: Find source code links from svn
  – Step 3: Find mappings from the source code

Requirements

• Get mappings from source code to yicu

Format Requirements

            Features                               Qualities             Constraints
Business    Find which source code uses our libs   1. High performance   1. Thousands of pages
                                                   2. Flexibility        2. One machine
                                                   3. Easy to maintain   3. Several days
Users       Only me for now
Developers  Only me

Architecture

• Two directions
• Two dimensions

[Diagram labels: Process, Components]

Layers

• View
• Control
• Module
• General Components
• Business Bus

Layers

• Crawler (View and Control)
• Downloader, Extractor (Module)
• Storage (General Components)
• Business Bus is not used
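
The layer assignment above can be sketched as a few interfaces. This is only an illustration: the interface names follow the slide, but every method name and signature below is a hypothetical choice, not taken from the talk's source code.

```php
<?php
// Hypothetical interfaces; method names and signatures are illustrative only.

interface Downloader            // Module layer
{
    public function fetch(string $url): string;
}

interface Extractor             // Module layer
{
    /** @return string[] links found in the page */
    public function links(string $page): array;
}

interface Storage               // General Components layer
{
    public function save(string $key, string $data): void;
}

// View and Control layer: drives the modules and holds no
// download, parsing, or persistence logic of its own.
final class Crawler
{
    private $downloader;
    private $extractor;
    private $storage;

    public function __construct(Downloader $downloader, Extractor $extractor, Storage $storage)
    {
        $this->downloader = $downloader;
        $this->extractor = $extractor;
        $this->storage = $storage;
    }

    /** Process one URL and return the links extracted from it. */
    public function visit(string $url): array
    {
        $page = $this->downloader->fetch($url);
        $this->storage->save($url, $page);
        return $this->extractor->links($page);
    }
}
```

Because the Crawler depends only on the interfaces, a module can be swapped (for example, a downloader that reads from disk instead of the network) without touching the control code.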

Other views

• Running view
• Deployment view
• Data view
• Development view
• …

Develop View

• Trunk/
  – Src/
  – Test/
  – Bin/

Review Design

            Features                               Qualities             Constraints
Business    Find which source code uses our libs   1. High performance   1. Thousands of pages
                                                   2. Flexibility        2. One machine
                                                   3. Easy to maintain   3. Several days
Users       Only me for now
Developers  Only me

Solutions

• Crawler
• Downloader (one API)
• Storage
• Extractor

Crawler

DFS or BFS? BFS:
1. The deeper a page is, the less important it is
2. Many paths lead to any given page
3. Simple to extend to a distributed crawler
4. More efficient to develop
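
A minimal sketch of the breadth-first order argued for above. The function name `bfsCrawl`, the `$getLinks` callback, and the `$maxDepth` cutoff are hypothetical; an in-memory graph stands in for real pages so the traversal can be followed without a network.

```php
<?php
/**
 * Breadth-first crawl over a link graph.
 * $getLinks is a hypothetical callback: given a URL, return the URLs it links to.
 * Returns URL => depth, in discovery order.
 */
function bfsCrawl(array $seedUrls, callable $getLinks, int $maxDepth): array
{
    $queue = new SplQueue();
    $seen = [];                          // URL => depth; also prevents revisits
    foreach ($seedUrls as $url) {
        if (!isset($seen[$url])) {
            $seen[$url] = 0;
            $queue->enqueue($url);
        }
    }
    while (!$queue->isEmpty()) {
        $url = $queue->dequeue();
        $depth = $seen[$url];
        if ($depth >= $maxDepth) {
            continue;                    // do not expand pages at the depth limit
        }
        foreach ($getLinks($url) as $link) {
            if (!isset($seen[$link])) {  // many paths may lead to the same page
                $seen[$link] = $depth + 1;
                $queue->enqueue($link);
            }
        }
    }
    return $seen;
}

// Usage with an in-memory graph standing in for real pages.
$graph = [
    'a' => ['b', 'c'],
    'b' => ['d'],
    'c' => ['d'],
    'd' => ['e'],
];
$order = bfsCrawl(['a'], fn($u) => $graph[$u] ?? [], 2);
// array_keys($order) is ['a', 'b', 'c', 'd']; 'e' lies beyond the depth limit.
```

Because the queue is the only shared state, splitting it across workers is one plausible route to the distributed crawler mentioned in point 3.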

Crawler

foreach ($seed_urls as $url) {
    $page = GetPage($url);       // download or read file
    Save($page);
    $meta_info = Parse($page);
    Save($meta_info);
}

Storage

• What should be stored?
  – URLs?
  – Meta info?
  – HTTP response (header, body, curl_info)?
• How should it be stored?
  – MySQL?
  – File system?

Storage

Hash table
• Index
  – md5
• Physical store
  – Header file, body file, curl_info file

Storage

• Basedir/
  – header_093e3575e895287cf471e6d5f5028446
    body_093e3575e895287cf471e6d5f5028446
    info_093e3575e895287cf471e6d5f5028446
  – header_66e7c612cf23049fa731c831bcee9048
    body_66e7c612cf23049fa731c831bcee9048
    info_66e7c612cf23049fa731c831bcee9048
  – …
  – meta_info.txt
  – failed_download_url.txt
  – failed_extract_page.txt
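
A rough sketch of how the header_/body_/info_ layout above could be written and read. Two points are assumptions rather than facts from the slides: that the md5 is taken over the URL, and that curl_info is serialized as JSON. The class name `PageStore` and its methods are hypothetical.

```php
<?php
/**
 * File-backed store using the header_/body_/info_ naming scheme.
 * Assumption: the md5 index is computed from the URL.
 */
final class PageStore
{
    private $baseDir;

    public function __construct(string $baseDir)
    {
        if (!is_dir($baseDir) && !mkdir($baseDir, 0777, true)) {
            throw new RuntimeException("Cannot create $baseDir");
        }
        $this->baseDir = rtrim($baseDir, '/');
    }

    /** Index: the hash acts as the key of the hash table. */
    public function key(string $url): string
    {
        return md5($url);
    }

    /** Physical store: write the three files for one response. */
    public function save(string $url, string $header, string $body, array $curlInfo): string
    {
        $key = $this->key($url);
        file_put_contents("{$this->baseDir}/header_$key", $header);
        file_put_contents("{$this->baseDir}/body_$key", $body);
        // JSON is an assumed serialization for the curl_info array.
        file_put_contents("{$this->baseDir}/info_$key", json_encode($curlInfo));
        return $key;
    }

    /** Return the stored body, or null if the URL was never saved. */
    public function loadBody(string $url): ?string
    {
        $path = "{$this->baseDir}/body_" . $this->key($url);
        $data = is_file($path) ? file_get_contents($path) : false;
        return $data === false ? null : $data;
    }
}
```

With such a store, a crawler can check `loadBody($url)` before downloading, so a restarted run reads pages from disk instead of fetching them again.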

Extractor

• DOM tree
• Regular expressions
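
The two extraction approaches can be compared side by side. The sketch below pulls link targets out of a page both ways; the function names and the choice of `href` extraction as the task are illustrative assumptions, not the extractor from the talk.

```php
<?php
/** Collect href values by walking the DOM tree. */
function extractLinksDom(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser errors silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

/** Collect quoted href values with a regular expression. */
function extractLinksRegex(string $html): array
{
    preg_match_all('/<a\s[^>]*href=["\']([^"\']*)["\']/i', $html, $matches);
    return $matches[1];
}
```

The regular expression is quicker to write but only handles quoted attributes on a single tag; the DOM version tolerates more markup variation at the cost of parsing the whole page.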

Review Design

            Features                               Qualities             Constraints
Business    Find which source code uses our libs   1. High performance   1. Thousands of pages
                                                   2. Flexibility        2. One machine
                                                   3. Easy to maintain   3. Several days
Users       Only me for now
Developers  Only me

Performance Issue?

• Multiple threads?
• Multiple processes?

Data Analysis

• Vim
• Shell (awk, sed)
• Regular expressions

More Challenges

• Distributed
• Noise
• Duplication
• Quick updates
• Concurrency and performance
• …