iRobot: An Intelligent Crawler for Web Forums
Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang
Microsoft Research, Asia
Outline
• Motivation & Challenge
• iRobot – Our Solution
– System Overview
– Module Details
• Evaluation
Why Web Forums Are Important
• Forums are a huge repository of human knowledge
– Popular all over the world
– Cover virtually any conceivable topic or issue
• Forum data can benefit many applications
– Improving the quality of search results
– Various kinds of data mining on forum content
• Collecting forum data
– Is the basis of all forum-related research
– Is not a trivial task
Why Forum Crawling Is Difficult
• Duplicate pages
– Forums have complex in-site structures
– Many shortcuts exist for browsing
• Invalid pages
– Most forums enforce access control
– Some pages can only be visited after registration
• Page-flipping
– A long thread is split across multiple pages
– Deep navigation levels
The Limitations of Generic Crawlers
• In generic crawling, each page is treated independently
– Fixed crawling depth
– Duplicates cannot be avoided before downloading
– Many invalid pages are fetched, such as login prompts
– The relationships between pages of the same thread are ignored
• Forum crawling needs a site-level perspective!
Statistics on Some Forums
• Around 50% of crawled pages are useless
• A waste of both bandwidth and storage
What Is a Site-Level Perspective?
• Understand the site's organizational structure
• Find out an optimal crawling strategy

[Figure: the site-level perspective of "forums.asp.net" — a sitemap whose nodes include Entry, List-of-Board, List-of-Thread, Browse-by-Tag, Search Result, Post-of-Thread, Login Portal, and Digest]
iRobot: An Intelligent Forum Crawler

[System diagram: the crawler first performs General Web Crawling; Sitemap Construction and then Traversal Path Selection are applied, and the learned strategy drives Forum Crawling, followed by Segmentation & Archiving into Raw Pages and Meta; a Restart loop feeds back to the crawler]
Outline
• Motivation & Challenge
• Our Solution – iRobot
– System Overview
– Module Details (Sitemap Construction, Traversal Path Selection)
• How many kinds of pages are there?
• How do these pages link to each other?
• Which pages are valuable?
• Which links should be followed?
• Evaluation
Page Clustering
• Forum pages are generated from a database and templates
• Layout is a robust way to characterize a template
– Repetitive regions are everywhere on forum pages
– A layout can be characterized by its repetitive regions

[Figure: four example forum pages (a)–(d) with their repetitive regions highlighted]
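The clustering step above can be sketched roughly as follows — a minimal illustration under my own assumptions (the tag-path signatures, the Jaccard measure, and the 0.8 threshold are placeholders, not the paper's actual algorithm):

```python
# Hypothetical sketch: each page is reduced to the set of tag-path
# signatures of its repetitive regions; pages with similar enough
# signature sets are greedily merged into one cluster (template).

def jaccard(a, b):
    """Jaccard similarity of two signature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_by_layout(pages, threshold=0.8):
    """Greedy single-pass clustering: assign each page to the first
    cluster whose representative layout is similar enough."""
    clusters = []  # list of (representative_signatures, [page_ids])
    for page_id, sigs in pages.items():
        for rep, members in clusters:
            if jaccard(sigs, rep) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((set(sigs), [page_id]))
    return [members for _, members in clusters]

# Toy example: two list-of-thread pages share the same repetitive
# regions, the login page does not.
pages = {
    "thread_list_1": {"html/body/table/tr", "html/body/div.pager"},
    "thread_list_2": {"html/body/table/tr", "html/body/div.pager"},
    "login":         {"html/body/form/input"},
}
print(cluster_by_layout(pages))
```

A single greedy pass like this is order-dependent; a real system would more likely use a proper (e.g. hierarchical) clustering over the layout features.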
Link Analysis
• URL patterns can distinguish links, but are not reliable on all sites
• Location on the page can also distinguish links
• A link = URL pattern + location

[Figure: an annotated forum page with numbered link regions, e.g. 1. Login, 4. Thread List, 5. Thread]
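A minimal sketch of this link representation (the function names and the pattern-normalization rules are my own illustrative assumptions, not the paper's implementation):

```python
import re

def url_pattern(url):
    """Collapse numeric IDs and query values into placeholders,
    so URLs generated from the same template map to one pattern."""
    p = re.sub(r"\d+", "*", url)
    p = re.sub(r"=[^&]*", "=*", p)
    return p

def link_key(url, dom_path):
    """A link = URL pattern + location (DOM path of the anchor)."""
    return (url_pattern(url), dom_path)

# Two thread links from the same list region share one key,
# even though their concrete URLs differ.
print(link_key("showthread.php?t=12345", "body/table[2]/tr/td/a"))
print(link_key("showthread.php?t=99",    "body/table[2]/tr/td/a"))
```

Combining the two signals is the point: on sites where the URL pattern alone is ambiguous, the page location still separates, say, a "Thread" link from a "Login" link.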
Informativeness Evaluation
• Which kinds of pages (nodes) are valuable?
• Some heuristic criteria
– A larger node (one containing more pages) is more likely to be valuable
– Pages with a larger size are more likely to be valuable
– A more diverse node is more likely to be valuable
• Diversity is measured based on content de-duplication
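These heuristics can be sketched as a single score — the weighting (a plain product) and the diversity measure are my own illustration, not the paper's formula:

```python
# Hypothetical informativeness score for one node (page cluster):
# more pages, larger pages, and more diverse content after
# de-duplication all push the score up.

def informativeness(pages):
    """pages: list of (size_in_bytes, content_shingles) for one node."""
    if not pages:
        return 0.0
    n = len(pages)
    avg_size = sum(size for size, _ in pages) / n
    distinct = set().union(*(sh for _, sh in pages))
    total = sum(len(sh) for _, sh in pages) or 1
    diversity = len(distinct) / total  # 1.0 means no duplicate content
    return n * avg_size * diversity

# A post-of-thread node (many, large, diverse pages) should outscore
# a login-portal node (few, small, near-identical pages).
posts  = [(9000, {"a", "b"}), (8000, {"c", "d"}), (9500, {"e", "f"})]
logins = [(1200, {"x"}), (1100, {"x"})]
print(informativeness(posts) > informativeness(logins))
```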
Traversal Path Selection
• Clean the sitemap
– Remove valueless nodes
– Remove duplicate nodes
– Remove links to valueless / duplicate nodes
• Find an optimal path
– Construct a spanning tree
– Use browsing depth as the cost, mimicking user browsing behavior
– Identify page-flipping links (page numbers, Prev/Next anchors)
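The spanning-tree step can be sketched as a shortest-path tree over the cleaned sitemap — a simplification under my assumptions (unit edge cost and plain BFS; the paper's cost model is richer, and page-flipping links would be kept as an exception so multi-page threads stay whole):

```python
from collections import deque

def spanning_tree(graph, entry):
    """BFS shortest-path tree (child -> parent), mimicking the minimum
    number of clicks a user needs from the entry page; off-tree links
    (shortcuts, back-links) are simply not followed."""
    parent = {entry: None}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent

# Toy cleaned sitemap with a shortcut edge that the tree ignores.
graph = {
    "Entry": ["List-of-Board"],
    "List-of-Board": ["List-of-Thread"],
    "List-of-Thread": ["Post-of-Thread"],
    "Post-of-Thread": ["List-of-Board"],  # shortcut back up, off-tree
}
tree = spanning_tree(graph, "Entry")
print(tree["Post-of-Thread"])
```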
Evaluation Criteria
• Duplicate ratio
• Invalid ratio
• Coverage ratio

[Charts: bar charts over Biketo, Asp, Baidu, Douban, CQZG, Tripadvisor, and Hoopchina comparing Mirrored Pages vs. iRobot — duplicate ratio (y-axis 0–70%), invalid ratio (0–25%), and iRobot's coverage ratio (0–100%)]
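The three criteria can be written down directly; the formulas below match their usual definitions (the function and label names are mine):

```python
def crawl_metrics(fetched, site_valuable):
    """fetched: list of labels ('valuable' | 'duplicate' | 'invalid')
    for the pages a crawler downloaded; site_valuable: how many
    valuable pages the site actually contains."""
    n = len(fetched)
    return {
        "duplicate_ratio": fetched.count("duplicate") / n,
        "invalid_ratio":   fetched.count("invalid") / n,
        "coverage_ratio":  fetched.count("valuable") / site_valuable,
    }

# 10 fetched pages, 8 of them valuable, against a site with 10
# valuable pages in total.
m = crawl_metrics(["valuable"] * 8 + ["duplicate", "invalid"], 10)
print(m)
```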
Effectiveness and Efficiency
• Effectiveness

[Charts: pages fetched (0–6,000) by (a) a generic crawler vs. (b) iRobot on Biketo, Asp, Baidu, Douban, CQZG, Tripadvisor, and Hoopchina (one panel lists Gentoo), broken down into Invalid, Duplicate, and Valuable]

• Efficiency

[Charts: pages fetched (0–20,000) by (a) a generic crawler vs. (b) iRobot on the same forums, with the same Invalid / Duplicate / Valuable breakdown]
Performance vs. Number of Sampled Pages

[Chart: coverage ratio, duplicate ratio, and invalid ratio (y-axis 0–90%) as the number of sampled pages grows from 10 to 1,000]
Preserved Discussion Threads

Forums        Mirrored   Crawled by iRobot   Correctly Recovered
Biketo        1584       1313                1293
Asp           600        536                 536
Baidu         −          −                   −
Douban        62         60                  37
CQZG          1393       1384                1311
Tripadvisor   326        272                 272
Hoopchina     2935       2829                2593

Overall, 94.5% of the threads crawled by iRobot are correctly recovered (6042 of 6394), i.e., 87.6% of all mirrored threads (6042 of 6900).
Conclusions
• An intelligent forum crawler based on site-level structure analysis
– Page-template identification / valuable-page evaluation / link analysis / traversal path selection
• Some modules can still be improved
– More automated and mature algorithms in the SIGIR'08 follow-up work
• More future-work directions
– Queue management
– Refresh strategies
Thanks!