從零開始的爬蟲之旅 Crawler from zero
Uploaded by shi-ken-don · Category: Internet
Transcript of 從零開始的爬蟲之旅 Crawler from zero
[Page 1]
Crawler from zero - Taiwan 2016
[Page 2]
About Me
• a.k.a. Shi-Ken Don
• 2009 - 2014
• Ruby Developer since 2013
• 2014 - 2015
• Web Developer at Backer-Founder since 2015
[Page 4]
Agenda
• What I did
• Why Ruby?
• Comparison
• Know-how
[Page 5]
Crawler Architecture
(Image: Wikipedia, File:WebCrawlerArchitecture.svg)
[Page 7]
CrowdTrail
:D
[Page 8]
CrowdTrail
• 14
• 2
[Page 9]
DEMO
CrowdTrail :D
>>>> https://goo.gl/sWfDBc <<<<
[Page 10]
• 5000
• Web
• 0
[Page 11]
Kickstarter 2 T T
[Page 12]
•
• JRuby
• Ruby
[Pages 13-15]
Ruby: Why, or Why not?
[Page 16]
Crawler: Python or Ruby?
[Pages 17-18]
Ruby (Scrape web content)
[Page 19]
Network flow
(Google Chrome Developer Tools screenshot)
1.75 1.34 76%
[Page 20]
Ruby vs. Python

|                  | user      | system   | total     | real        |
|------------------|-----------|----------|-----------|-------------|
| Ruby Thread.new  | 16.090000 | 1.010000 | 17.100000 | (17.254499) |
| Ruby Parallel    | 16.900000 | 1.100000 | 18.000000 | (18.080813) |
| Python threading | 0.000000  | 0.000000 | 16.360000 | (16.532583) |

Benchmark: 1000 threads fetching www.facebook.com
(ruby_thread_tests.rb / threading_test.py)
Python and Ruby come out roughly even.
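The Ruby side of that benchmark can be sketched roughly like this. To stay self-contained, the HTTP call is replaced with a stubbed `fetch` that sleeps (a hypothetical stand-in; the talk hit www.facebook.com), and the thread count is reduced:

```ruby
require "benchmark"

# Stand-in for the real HTTP fetch: Ruby's GIL is released during
# sleep/I/O, which is why threads help a crawler at all.
def fetch(url)
  sleep 0.01
  "response for #{url}"
end

urls = Array.new(50) { |i| "https://example.com/page/#{i}" }
results = []
mutex = Mutex.new

elapsed = Benchmark.realtime do
  threads = urls.map do |url|
    Thread.new do
      body = fetch(url)
      mutex.synchronize { results << body } # protect shared state
    end
  end
  threads.each(&:join)
end

# The sleeps overlap, so wall time is far below 50 * 0.01s run serially.
puts format("fetched %d pages in %.3fs", results.size, elapsed)
```

Because the work is I/O-bound, the overlapping waits dominate, which matches the slide's result that Ruby threads and Python threads land in the same ballpark.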
[Page 21]

|                               | user     | system   | total    | real       |
|-------------------------------|----------|----------|----------|------------|
| Thread.new (parse and create) | 5.640000 | 0.580000 | 6.220000 | (6.073333) |
| Thread.new (parse only)       | 5.060000 | 0.440000 | 5.500000 | (5.455837) |
| Thread.new (create directly)  | 3.340000 | 0.520000 | 3.860000 | (5.169519) |

(ruby_thread_sidekiq_test.rb)
Benchmark: 100 threads fetching www.facebook.com
Most of the time is spent in parse only.
[Page 22]
• ActiveRecord::ConnectionTimeoutError: could not obtain a database connection within 5.000 seconds (waited 5.009 seconds)
[Page 23]
database.yml

```yaml
default: &default
  adapter: postgresql
  encoding: unicode
  pool: 25 # Increase this
```
[Page 24]
Thread

```ruby
begin
  # …
  ProjectLog.create!(title: title)
ensure
  ActiveRecord::Base.clear_active_connections!
end
```
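The point of that `ensure`: a thread that raises mid-job and never returns its connection drains the pool until `ConnectionTimeoutError`. A toy pool (illustrative only, not ActiveRecord's real pool) makes the pattern easy to see:

```ruby
# Minimal stand-in for a connection pool.
class ToyPool
  def initialize(size)
    @available = Array.new(size) { |i| "conn-#{i}" }
    @mutex = Mutex.new
  end

  def checkout
    @mutex.synchronize { @available.pop } or raise "pool exhausted"
  end

  def checkin(conn)
    @mutex.synchronize { @available.push(conn) }
  end

  def available_count
    @mutex.synchronize { @available.size }
  end
end

pool = ToyPool.new(2)

# The slide's pattern: release in ensure, even when the job raises.
10.times do
  conn = pool.checkout
  begin
    raise "boom" if rand < 0.5 # simulated flaky job
  rescue RuntimeError
    # the job failed, but the connection still goes back below
  ensure
    pool.checkin(conn)
  end
end

puts pool.available_count # => 2, nothing leaked
```

Without the `ensure`, any raising iteration would leak a connection and the tiny pool would be exhausted within a few jobs.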
[Page 25]
Sidekiq
• 200 8
• Ruby Thread
• Auto scaling
[Page 26]
Heroku Auto Scaling
heroku.rake

```ruby
task :auto_scaling, [:WORKER_NAME] => :environment do |_t, args|
  args.with_defaults(WORKER_NAME: "worker")
  APP_NAME = ENV["HEROKU_APP_NAME"]
  WORKER_NAME = args[:WORKER_NAME]

  heroku = Heroku::API.new
  queues = Sidekiq::Queue.all
  queues_size = queues.map { |queue| Sidekiq::Queue.new(queue.name).size }.inject(0, :+)

  # 2X dyno 600 jobs
  # 50 parse project_log
  # jobs 45
  now_minutes = Time.now.strftime("%M").to_i
  left_minutes = now_minutes.between?(0, 45) ? 45 - now_minutes : 0
  workers_size = queues_size / 500 / [left_minutes, 1].max
  workers_size = 1 if workers_size < 1
  workers_size = 10 if workers_size > 10 # 10 worker
  puts "Scaling #{WORKER_NAME} dyno count to #{workers_size}"
  heroku.post_ps_scale(APP_NAME, WORKER_NAME, workers_size)
end
```
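The sizing arithmetic from that task, pulled out as a pure function so the numbers are easy to check. The 500-jobs-per-worker divisor, the :45 cutoff, and the 10-worker cap come from the slide; the queue sizes fed in are made up:

```ruby
# Pure version of the dyno-count calculation in heroku.rake.
def workers_for(queues_size, now_minutes)
  left_minutes = now_minutes.between?(0, 45) ? 45 - now_minutes : 0
  workers_size = queues_size / 500 / [left_minutes, 1].max
  workers_size = 1  if workers_size < 1
  workers_size = 10 if workers_size > 10
  workers_size
end

puts workers_for(12_000, 15) # => 1  (24 job-units over 30 min rounds to 0, clamped up)
puts workers_for(60_000, 35) # => 10 (120 over 10 min gives 12, capped at 10)
puts workers_for(60_000, 50) # => 10 (past :45, divisor clamps to 1; 120 capped at 10)
```

Integer division means a small backlog with plenty of time left scales to the 1-worker floor, while a large backlog near the deadline slams into the 10-worker cap.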
[Page 27]
Sidekiq
• PostgreSQL > Redis > Sidekiq connections
Every Sidekiq thread holds its own Redis connection, so the Redis limit has to cover the total thread count.
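One way to read that inequality: each layer's connection limit must cover everything below it, since every Sidekiq thread may hold both a Redis and a PostgreSQL connection. A rough budget check (all numbers here are hypothetical, not Heroku's):

```ruby
# Rough connection budgeting for a Sidekiq-based crawler:
# PostgreSQL limit >= Redis limit >= connections Sidekiq opens.
sidekiq_processes   = 8
threads_per_process = 25
sidekiq_connections = sidekiq_processes * threads_per_process # 200

redis_limit    = 256
postgres_limit = 400

ok = postgres_limit >= redis_limit && redis_limit >= sidekiq_connections
puts "sidekiq needs #{sidekiq_connections} connections, budget ok? #{ok}"
```

If the ordering is violated, the lowest-capacity layer becomes the one that throws connection errors under load, which is exactly the `ConnectionTimeoutError` shown earlier.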
[Page 28]
Heroku PostgreSQL Pricing
[Page 29]
PostgreSQL Standard 2
• Rails App DB
• Rake Task DB
• Restarting
• Sidekiq MAX connections = 200
[Page 30]
• File descriptor limits
✦ macOS: 256 (default)
✦ Linux: 1024 (default)
✦ Windows: who cares
Linux File descriptor 1024
CPU RAM
[Page 31]
Linux

```shell
# show the system-wide limit
cat /proc/sys/fs/file-max
# raise the system-wide limit
sysctl -w fs.file-max=100000
```
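Besides the system-wide `fs.file-max`, each process has its own descriptor limit (what `ulimit -n` reports), and that is the one a crawler usually hits first. From Ruby it can be inspected with `Process.getrlimit` (and raised up to the hard limit with `Process.setrlimit`):

```ruby
# Per-process file-descriptor limit: [soft, hard].
soft, hard = Process.getrlimit(:NOFILE)
puts "per-process fd limit: soft=#{soft}, hard=#{hard}"

# Each in-flight request holds at least one socket, so keep crawler
# concurrency below the soft limit, with headroom for files and DB.
max_threads = (soft * 0.8).to_i
puts "safe thread ceiling: ~#{max_threads}"
```

The 0.8 headroom factor is an arbitrary illustration; the point is that thread count, not just `fs.file-max`, is bounded by this per-process number.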
[Page 32]
Know-how
[Page 33]
•
• Heroku dyno 500 Thread
• dynos#process-thread-limits
[Page 34]
Queue (Weight): Sidekiq uses each queue's weight to decide which Queue to pull the next Job from.
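Queue weights are declared where Sidekiq boots; in a `sidekiq.yml` they look like this (the queue names are made-up examples, not the talk's):

```yaml
# sidekiq.yml: a queue's weight is its chance of being checked next,
# so with 3/2/1 "kickstarter" is polled 3x as often as "default".
:queues:
  - [kickstarter, 3]
  - [indiegogo, 2]
  - [default, 1]
```

Weighted polling keeps a huge backlog in one platform's queue from starving the others, which matters when each crawl target gets its own queue as on the next slide.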
[Page 35]
Queue

```ruby
Project.find_each do |project|
  # Sidekiq queue
  name = project.platform.name
  SnapshotWorker.sidekiq_options_hash["queue"] = name
  SnapshotWorker.perform_async(…)
end
```
[Page 36]
Retry

```ruby
sidekiq_options_hash["retry"] += 1
self.class.sidekiq_retry_in do |count|
  Random.rand(retry_after + count)
end
```
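The `sidekiq_retry_in` block returns a delay in seconds, and the random term spreads retries out so a burst of failures does not stampede back at the same instant. The same idea as a pure function (`retry_after` is an assumed base delay, as in the slide):

```ruby
# Jittered retry delay: the window grows with the attempt count,
# and randomization de-synchronizes failed jobs.
def retry_delay(count, retry_after: 60)
  Random.rand(retry_after + count)
end

delays = (0..4).map { |count| retry_delay(count) }
puts delays.inspect # five delays, each in 0...(60 + count)
```

Plain exponential backoff without jitter would have every job that failed together retry together; the random window avoids that thundering-herd pattern against the crawled site.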
[Pages 37-38]
• User-Agent
• Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)
• Google Bot
• Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Because of SEO, sites rarely dare to block Googlebot ψ( ∇´)ψ
[Page 39]
User-Agent

```ruby
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; CrowdTrail/1.0; +https://crowdwatch.tw/)"

def random_user_agent_string
  format(
    "%s Random/0.%d.%d",
    DEFAULT_USER_AGENT,
    Random.rand(100),
    Random.rand(100)
  )
end

HTTParty.get("https://www.facebook.com",
             headers: { "User-Agent" => random_user_agent_string })
```
[Page 40]
• Faking the client IP via headers
• X-Forward-For
• X-Real-IP
• CF-Connecting-IP
[Page 41]
X-Forward-For

```ruby
def random_x_forward_for
  format(
    "140.118.%d.%d",
    Random.rand(255),
    Random.rand(255)
  )
end

HTTParty.get("https://www.facebook.com",
             headers: { "X-Forward-For" => random_x_forward_for })
```
[Page 43]
CAPTCHA
• Ruby OCR, e.g. ruby-tesseract-ocr
• antigate.com
reCAPTCHA antigate
[Page 44]
Parser Know-how
[Page 45]
Don't pick elements by a hard-coded index

```ruby
# Bad
doc.css('.tab')[2].text

# Good
doc.css('.tab').text[/ (\d+) /, 1]
```

Combine the DOM parser with a Regular Expression.
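The `# Good` line leans on Ruby's `String#[]` with a regex and a capture group: it returns just the captured text, and returns `nil` when the page changes shape instead of blowing up on a missing index. Standalone, with sample text invented for illustration:

```ruby
# String#[regexp, capture] returns the capture group's text, or nil.
text = "Pledged 1024 of 5000 backers"

amount = text[/Pledged (\d+)/, 1] # => "1024"
puts amount

# A pattern that doesn't match yields nil, not an exception:
puts text[/Raised (\d+)/, 1].inspect
```

Anchoring on nearby words rather than element position makes the scraper survive cosmetic markup reshuffles, which is the slide's point.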
[Page 46]
Prefer Integer()/Float() over to_i

```ruby
# Bad
doc.css('.tab .pledged').text.to_i

# Good
Integer(doc.css('.tab .pledged').text)
```

to_i quietly turns nil or non-numeric text into 0; Integer() raises instead.
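The difference in one runnable snippet: `to_i` degrades silently, so a changed page layout writes zeros into the database, while `Integer()` fails loudly at the parse site:

```ruby
# to_i never raises; it truncates or falls back to 0:
puts "".to_i           # => 0
puts "12 backers".to_i # => 12 (trailing junk dropped)
puts nil.to_i          # => 0

# Integer() is strict; malformed input raises ArgumentError:
puts Integer("1024")   # => 1024
begin
  Integer("12 backers")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

For a crawler, the exception is the feature: a selector that stops matching surfaces as a retryable failure instead of corrupt data.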
[Page 47]
THANK YOU
Follow me on:
https://github.com/shikendon
https://medium.com/@shikendon
https://www.facebook.com/zxuandon