Crawling the world
Author: marc-morera
Category: Technology

Transcript of "Crawling the world"
![Page 1: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/1.jpg)
Crawling the world
@mmoreram
![Page 2: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/2.jpg)
Apparently...
![Page 3: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/3.jpg)
Nobody uses parsing in their applications
![Page 4: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/4.jpg)
Not even Chuck Norris
![Page 5: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/5.jpg)
Many businesses
need crawling
![Page 6: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/6.jpg)
Crawling brings you knowledge
Knowledge is
power
![Page 7: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/7.jpg)
And power is
Money
![Page 8: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/8.jpg)
What is crawling? Or parsing?
![Page 9: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/9.jpg)
Crawling
We download a URL with a single request (HTML, XML, …)
We inspect the response, searching for the desired data: links, headers, or any kind of text or label
Once we have the content we need, we can update our database and decide what to do next, for example parsing some of the links we found.
and that’s it!
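The three steps above (download, search the response, act on what you find) can be sketched in a few lines. This is a hypothetical Python illustration, not code from the talk: the sample HTML and the `extract_links` helper are made up for the example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect every href found in <a> tags of a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML document."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# In a real crawler the HTML would come from a request, e.g.
# urllib.request.urlopen(url).read().decode(); a fixed sample keeps
# the sketch self-contained.
sample = (
    '<html><body>'
    '<a href="/category/toys">Toys</a>'
    '<a href="/product/42">A product</a>'
    '</body></html>'
)

print(extract_links(sample))  # ['/category/toys', '/product/42']
```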
![Page 10: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/10.jpg)
–Marc Morera, yesterday
“Machines will do what humans do before they realize”
![Page 11: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/11.jpg)
Let’s see an example
Step by step
![Page 12: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/12.jpg)
![Page 13: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/13.jpg)
chicplace.com
Our goal is to parse all available products, saving name, description, price, shop and categories
There are several possible strategies when a site must be parsed; the linear strategy is the starting point
Let's see all the available strategies
![Page 14: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/14.jpg)
Parsing Strategies
Linear. Just one script. If any page fails (crawling error, server timeout, …), some kind of exception can be thrown and caught.
Advantages: only one script is needed. Easier? Not even close…
Problems: cannot be distributed. One script for 1M requests. Memory problems?
![Page 15: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/15.jpg)
Parsing Strategies
Distributed. One script for each case. If any page fails, it can be recovered by simply executing the script again.
Advantages: each case is encapsulated in an individual script, with low memory usage. Can be easily distributed using queues.
Problems: none
![Page 16: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/16.jpg)
Crawling steps
Analyzing. Think like Google does: find the fastest way through the labyrinth
Scripting. Build scripts using queues for the distributed strategy. Each queue corresponds to one kind of page
Running. Keep in mind the impact of your actions: DDoS-like load, copyright
![Page 17: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/17.jpg)
Analyzing
Every parsing process should be evaluated the way a simple crawler would see it, Google's for example
How do we access all the needed pages with the lowest server impact?
Usually, serious websites are designed so that every page is reachable within 3 clicks
![Page 18: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/18.jpg)
Analyzing
We will use the category map to access all available products
![Page 19: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/19.jpg)
Analyzing
Each category will list all available products
![Page 20: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/20.jpg)
Analyzing
Do we also need to parse the product page?
In fact, we do. We already have name, price and category, but we also need description and shop
So we have the main page to parse all category links, the category page listing all products (which can be paginated), and we also need the product page to get the full information
The product page is where all data gets saved to the database
![Page 21: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/21.jpg)
Scripting
We will use the distributed strategy, with queues and supervisord
Supervisord is responsible for keeping X instances of a process running at the same time.
Using a distributed queue system, we will have 3 workers.
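As a sketch, a supervisord program section that keeps several instances of one worker alive could look like this (the program name, command, and instance count are assumptions for illustration, not taken from the talk):

```ini
; hypothetical supervisord section: run 10 copies of the category worker
[program:category-worker]
command=python category_worker.py
numprocs=10
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
```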
![Page 22: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/22.jpg)
Worker?
Yep, worker. In a queue system, a worker is like a box (a script) with parameters (input values) that just does something.
We have 3 kinds of workers. One of them, the CategoryWorker, receives a category URL, parses the related content (HTML) and detects all the products. Each product generates a new job for the ProductWorker
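A CategoryWorker of this shape can be sketched in Python. Everything here is a hypothetical stand-in: the in-process queues replace whatever distributed queue system the real scripts used, and `find_product_urls` is a toy parser, not the real one.

```python
import queue

# Hypothetical in-process stand-ins for a real distributed queue
# (e.g. RabbitMQ or Gearman): each queue holds URLs waiting to be processed.
categories_queue = queue.Queue()
products_queue = queue.Queue()


def find_product_urls(html):
    """Toy parser: treat every line starting with '/product' as a link."""
    return [line for line in html.splitlines() if line.startswith("/product")]


def category_worker(fetch):
    """Consume one category URL and enqueue every product it lists."""
    url = categories_queue.get()
    html = fetch(url)                      # download the category page
    for product_url in find_product_urls(html):
        products_queue.put(product_url)    # each product becomes a new job


# Usage with a fake fetcher instead of a real HTTP request:
fake_pages = {"/category/toys": "/product/1\n/product/2"}
categories_queue.put("/category/toys")
category_worker(fake_pages.get)
print(list(products_queue.queue))  # ['/product/1', '/product/2']
```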
![Page 23: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/23.jpg)
Running
We enable all workers and force the first one to run.
The first worker finds all category URLs and enqueues them into a queue named categories-queue
The second worker (for example, 10 instances) consumes categories-queue, fetching each URL and parsing its content.
Here, "content" means just product URLs
![Page 24: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/24.jpg)
Running
Each URL is enqueued into another queue named products-queue
The third and last worker (50 instances) consumes this queue, parses each page's content and extracts the needed data (name, description, shop, category and price)
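The whole three-worker pipeline described above can be run as a toy end-to-end sketch. The site content is a hypothetical in-memory dict and the "database" is a plain dict; instance counts are only mentioned in comments, since a single-threaded sketch cannot show real parallelism.

```python
import queue

# A toy end-to-end run of the three workers; the site content below is
# a hypothetical in-memory stand-in, not real pages.
site = {
    "/": ["/category/toys"],                      # main page lists categories
    "/category/toys": ["/product/1"],             # category lists products
    "/product/1": {"name": "Bear", "price": 10},  # product page holds the data
}

categories_queue = queue.Queue()
products_queue = queue.Queue()
database = {}

# Worker 1: find all category URLs and enqueue them.
for category_url in site["/"]:
    categories_queue.put(category_url)

# Worker 2 (10 instances in the talk): categories -> product URLs.
while not categories_queue.empty():
    for product_url in site[categories_queue.get()]:
        products_queue.put(product_url)

# Worker 3 (50 instances in the talk): product pages -> database.
while not products_queue.empty():
    url = products_queue.get()
    database[url] = site[url]

print(database)  # {'/product/1': {'name': 'Bear', 'price': 10}}
```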
![Page 25: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/25.jpg)
OK. Call me God
![Page 26: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/26.jpg)
but…
![Page 27: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/27.jpg)
–Some bored man
“Don't shoot the messenger”
![Page 28: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/28.jpg)
warning!
50 workers requesting chicplace in parallel. This is a big problem
@Gonzalo (the CTO) will be angry, and he will detect that something is happening
So we must be careful not to alert him, or we will simply be discovered
![Page 29: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/29.jpg)
Warning: do not try this at home
![Page 30: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/30.jpg)
Be invisible
To be invisible, we can parse the whole site slowly (over days)
To be faster, we can mask our IP using proxies (how about a different proxy for every request?)
To be faster still, we can route requests through an anonymizing network like TOR
To be stupid, we can just parse chicplace with our own IP (most companies will not even notice)
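The "parse slowly" option boils down to throttling. A minimal sketch, where the delay bounds are arbitrary choices, not figures from the talk:

```python
import random
import time


def polite_fetch(url, fetch, min_delay=1.0, max_delay=5.0):
    """Fetch a URL, then sleep a random interval so requests arrive
    at an irregular, human-looking pace instead of a fixed rhythm."""
    result = fetch(url)
    time.sleep(random.uniform(min_delay, max_delay))
    return result


# Usage with a fake fetcher and tiny delays so the example runs fast:
pages = {"/product/1": "<html>toy page</html>"}
html = polite_fetch("/product/1", pages.get, min_delay=0.0, max_delay=0.01)
print(html)  # <html>toy page</html>
```

Randomizing the interval matters as much as the interval itself: a perfectly regular request rate is exactly the kind of pattern a server can recognize.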
![Page 31: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/31.jpg)
They are attacking me!
![Page 32: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/32.jpg)
–Matthew 21:22
“And whatever you ask in prayer, you will receive, if you have faith”
![Page 33: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/33.jpg)
My prayer!
A good crawling implementation is infallible
The server will receive dozens of requests per second and will not recognize any pattern that distinguishes crawler requests from ordinary user requests
So…?
![Page 34: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/34.jpg)
Welcome to the amazing world of
Crawling
![Page 35: Crawling the world](https://reader034.fdocuments.net/reader034/viewer/2022051412/5492ef3dac795959288b48ee/html5/thumbnails/35.jpg)
Where no one is SAFE