Web Scraping in Python with Scrapy
Kota Kato (@orangain), 2015-09-08, 鮨会
Who am I?
• Kota Kato
• @orangain
• Software Engineer
• Interested in automation tools such as Jenkins, Chef, and Docker.
Definition: Web Scraping
• Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.
Web scraping - Wikipedia, the free encyclopedia https://en.wikipedia.org/wiki/Web_scraping
eBook-1
• Cross-store search engine for ebooks.
• Retrieves ebook data from 9 ebook stores.
http://ebook-1.com/
QB Meter
• Visualizes how crowded QB HOUSE, a 10-minute barbershop, is.
• Retrieves crowdedness data from QB HOUSE's website every 5 minutes.
http://qbmeter.capybala.com/
Prototype of Glance
• Prototype of a simple, newspaper-like news app.
• Retrieves news from NHK NEWS WEB 4 times a day.
Pokedos
• Web app that finds the nearest bus stops and shows bus arrival information.
• Retrieves the locations of all bus stops in Kyoto City.
http://bus.capybala.com/
Why Web Scraping?
• For web developers:
  • Develop mash-up applications.
• For data analysts:
  • Retrieve data to analyze.
• For everybody:
  • Automate operations on websites.
Why Use Python?
• Easy to use
• Powerful libraries, especially Scrapy
• Seamless move from data processing to application development
Web Scraping in Python
• Combination of lightweight libraries:
  • Retrieving: Requests
  • Scraping: lxml, Beautiful Soup
• Full-stack framework:
  • Scrapy (today's topic)
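To make the "lightweight libraries" route concrete, here is a minimal sketch of the scraping half: pulling links out of an HTML document. It uses the Python 3 standard library's `html.parser` as a stand-in so that it runs without installing anything; in practice Requests would fetch the page and lxml or Beautiful Soup would replace the parser. The class name `LinkCollector`, the sample HTML, and the `example.com` URL are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href:
            # Turn relative links into absolute ones.
            self.links.append(urljoin(self.base_url, href))


# Hypothetical page content; a real script would fetch it over HTTP first.
SAMPLE_HTML = (
    '<ul>'
    '<li><a href="/2015/09/">September</a></li>'
    '<li><a href="about.html">About</a></li>'
    '</ul>'
)

collector = LinkCollector('http://example.com/blog/')
collector.feed(SAMPLE_HTML)
print(collector.links)
```

Everything beyond this (following the links, waiting between requests, skipping pages already seen) has to be written by hand in this approach, which is the gap a full-stack framework fills.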
Scrapy
• Fast, simple and extensible Web scraping framework in Python
• Currently compatible only with Python 2.7
  • Python 3 support is in progress
• Maintained by Scrapinghub
• BSD License
http://scrapy.org/
Why Use Scrapy?
• Scrapy takes care of the tedious parts of crawling and scraping:
  • Extracting links
  • Throttling
  • Concurrency
  • robots.txt and <meta> tags
  • XML sitemaps
  • Filtering duplicated URLs
  • Retry on error
  • Job control
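Several of these features are switched on or tuned through project settings. The fragment below is a sketch of a `settings.py` that touches a few of them; the setting names are Scrapy's, but every value (and the `crawls/sushi-1` directory) is an illustrative assumption, not a recommendation from the talk.

```python
# settings.py fragment (all values are illustrative assumptions)

# robots.txt: skip URLs that the site's robots.txt disallows.
ROBOTSTXT_OBEY = True

# Throttling: wait between requests and let Scrapy adapt the delay.
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True

# Concurrency: cap parallel requests overall and per domain.
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Retry on error: retry failed requests a limited number of times.
RETRY_ENABLED = True
RETRY_TIMES = 2

# Job control: persist crawl state in this directory so that a crawl
# can be paused and resumed later.
JOBDIR = 'crawls/sushi-1'

# Filtering of duplicated URLs needs no setting here; it is on by default.
```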
Getting Started with Scrapy
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
$ scrapy runspider myspider.py

http://scrapy.org/
Requirements: Python 2.7, libxml2 and libxslt
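The `parse` method above keeps only links that look like monthly archive pages and resolves them against the page URL. The sketch below reproduces that filtering step outside Scrapy, using Python 3's `re.match` and `urljoin` as stand-ins for the selector's `.re()` and `response.urljoin()`. The helper name `archive_urls`, the sample hrefs, and the `blog.example.com` host are assumptions for illustration.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the spider: a path ending in /YYYY/MM/.
ARCHIVE_LINK = re.compile(r'.*/\d\d\d\d/\d\d/$')


def archive_urls(base_url, hrefs):
    """Keep hrefs that end in /YYYY/MM/ and make them absolute."""
    return [urljoin(base_url, href) for href in hrefs if ARCHIVE_LINK.match(href)]


# Hypothetical href values as they might appear in a blog's sidebar.
hrefs = [
    '/2015/09/',
    '/about/',
    'http://blog.example.com/2014/12/',
    '/2015/09/some-post/',
]

result = archive_urls('http://blog.example.com', hrefs)
print(result)
# → ['http://blog.example.com/2015/09/', 'http://blog.example.com/2014/12/']
```

The trailing `$` in the pattern is what rejects individual post URLs such as `/2015/09/some-post/`.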
Let's Collect Sushi Images
Create a Scrapy Project
$ scrapy startproject sushibot
$ tree sushibot/
sushibot/
├── scrapy.cfg
└── sushibot
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 6 files
Generate a Spider
$ cd sushibot
$ scrapy genspider sushi api.flickr.com
$ cat sushibot/spiders/sushi.py
# -*- coding: utf-8 -*-
import scrapy

class SushiSpider(scrapy.Spider):
    name = "sushi"
    allowed_domains = ["api.flickr.com"]
    start_urls = (
        'http://www.api.flickr.com/',
    )

    def parse(self, response):
        pass
Flickr API to Search Photos
$ curl 'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=******&text=sushi&sort=relevance' > photos.xml
$ cat photos.xml
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<photos page="1" pages="871" perpage="100" total="87088">
  <photo id="4794344495" owner="38553162@N00" secret="d907790937" server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" />
  <photo id="8486536177" owner="78779574@N00" secret="f77b824ebb" server="8382" farm="9" title="Best Salmon Sushi" ispublic="1" isfriend="0" isfamily="0" />
  ...
https://www.flickr.com/services/api/flickr.photos.search.html
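The search URL can also be assembled in code rather than typed by hand. The following is a small sketch using Python 3's `urllib.parse.urlencode`, which escapes each parameter value; the function name `search_url` and the placeholder key `YOUR_API_KEY` are assumptions, and the parameters are only the ones that appear in the curl command.

```python
from urllib.parse import urlencode

FLICKR_REST_ENDPOINT = 'https://api.flickr.com/services/rest/'


def search_url(api_key, text, sort='relevance'):
    """Build a flickr.photos.search URL with properly escaped parameters."""
    params = [
        ('method', 'flickr.photos.search'),
        ('api_key', api_key),
        ('text', text),
        ('sort', sort),
    ]
    return FLICKR_REST_ENDPOINT + '?' + urlencode(params)


# 'YOUR_API_KEY' is a placeholder, not a working key.
url = search_url('YOUR_API_KEY', 'sushi')
print(url)
```

Compared with plain string concatenation, this keeps search text containing spaces or `&` from breaking the query string.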
Construct Photo's URL
Photo element:
<photo id="4794344495" owner="38553162@N00" secret="d907790937" server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" />

Photo's URL template:
https://farm{farm-id}.staticflickr.com/{server-id}/{id}_{secret}_[mstzb].jpg

Result:
https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg

https://www.flickr.com/services/api/misc.urls.html
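The mapping from a photo element to its image URL can be checked without Scrapy. This is a minimal sketch that parses the sample element with Python 3's standard `xml.etree.ElementTree` and fills in the template; the constant name `URL_TEMPLATE` and the `size` default are choices made for this example.

```python
import xml.etree.ElementTree as ET

URL_TEMPLATE = 'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_{size}.jpg'


def photo_url(photo, size='b'):
    """Fill the URL template from a <photo> element's attributes."""
    return URL_TEMPLATE.format(
        farm=photo.get('farm'),
        server=photo.get('server'),
        id=photo.get('id'),
        secret=photo.get('secret'),
        size=size,  # one of the size suffixes m, s, t, z, b from the template
    )


photo = ET.fromstring(
    '<photo id="4794344495" owner="38553162@N00" secret="d907790937" '
    'server="4093" farm="5" title="Sushi!" ispublic="1" isfriend="0" isfamily="0" />'
)
print(photo_url(photo))
# → https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg
```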
spiders/sushi.py (Modified)
# -*- coding: utf-8 -*-
import os

import scrapy

from sushibot.items import SushibotItem


class SushiSpider(scrapy.Spider):
    name = "sushi"
    # The API host serves the search results; staticflickr.com serves the images.
    allowed_domains = ["api.flickr.com", "staticflickr.com"]
    start_urls = (
        'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=' +
        os.environ['FLICKR_KEY'] + '&text=sushi&sort=relevance',
    )

    def parse(self, response):
        # Request the image file for every <photo> element in the search result.
        for photo in response.css('photo'):
            yield scrapy.Request(photo_url(photo), self.handle_image)

    def handle_image(self, response):
        # Wrap the downloaded image in an item for the pipeline to save.
        return SushibotItem(url=response.url, body=response.body)


def photo_url(photo):
    # Build the image URL from the attributes of a <photo> element.
    return 'https://farm{farm}.staticflickr.com/{server}/{id}_{secret}_{size}.jpg'.format(
        farm=photo.xpath('@farm').extract_first(),
        server=photo.xpath('@server').extract_first(),
        id=photo.xpath('@id').extract_first(),
        secret=photo.xpath('@secret').extract_first(),
        size='b',
    )
Scrapy's Architecture
http://doc.scrapy.org/en/1.0/topics/architecture.html
items.py
# -*- coding: utf-8 -*-
from pprint import pformat

import scrapy


class SushibotItem(scrapy.Item):
    url = scrapy.Field()
    body = scrapy.Field()

    def __str__(self):
        # Show only the start of the image body so that logs stay readable.
        return pformat({
            'url': self['url'],
            'body': self['body'][:10] + '...',
        })
pipelines.py
# -*- coding: utf-8 -*-
import os


class SaveImagePipeline(object):

    def process_item(self, item, spider):
        # Create the output directory on first use.
        output_dir = 'images'
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)

        # Use the last path segment of the URL as the file name.
        filename = item['url'].split('/')[-1]
        with open(os.path.join(output_dir, filename), 'wb') as f:
            f.write(item['body'])

        return item
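The saving logic of the pipeline can be tried on its own, without running a crawl. The sketch below restates it as a plain Python 3 function and writes into a temporary directory; the function name `save_image` and the fake image bytes are assumptions for illustration, and this is not the pipeline class itself.

```python
import os
import tempfile


def save_image(url, body, output_dir):
    """Write body to output_dir, named after the last path segment of url."""
    os.makedirs(output_dir, exist_ok=True)
    filename = url.split('/')[-1]
    path = os.path.join(output_dir, filename)
    with open(path, 'wb') as f:
        f.write(body)
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(
        'https://farm5.staticflickr.com/4093/4794344495_d907790937_b.jpg',
        b'fake image bytes',  # placeholder for response.body
        os.path.join(tmp, 'images'),
    )
    saved_name = os.path.basename(saved)
    saved_exists = os.path.exists(saved)

print(saved_name)
# → 4794344495_d907790937_b.jpg
```

Because the file name comes only from the URL, two photos that share the same final path segment would overwrite each other; Flickr's `{id}_{secret}_{size}.jpg` scheme makes that unlikely here.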
settings.py
• Appended settings:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'sushibot ([email protected])'

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'sushibot.pipelines.SaveImagePipeline': 300,
}
Run Spider
$ FLICKR_KEY=********** scrapy crawl sushi
NOTE: Pass your Flickr API key through the FLICKR_KEY environment variable.
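If FLICKR_KEY is not set, a direct `os.environ['FLICKR_KEY']` lookup fails with a bare `KeyError`. One possible way to give a clearer message is sketched below; the helper name `flickr_key` and the wording of the error are assumptions, not part of the sushibot source.

```python
import os


def flickr_key(environ=os.environ):
    """Return the Flickr API key, or fail with a readable hint."""
    key = environ.get('FLICKR_KEY')
    if not key:
        raise RuntimeError(
            'FLICKR_KEY is not set; run: FLICKR_KEY=<your key> scrapy crawl sushi'
        )
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(flickr_key({'FLICKR_KEY': 'dummy-key'}))
# → dummy-key
```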
Thank you!
• Web scraping gives you the power to propose improvements.
• Source code is available at https://github.com/orangain/sushibot
@orangain