Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
sammy-fung -
![Page 1: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/1.jpg)
Web Scraping 1-2-3 with Python + Scrapy
Sammy Fung
sammy.hk, gownjob.com
![Page 2: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/2.jpg)
Today's Agenda
● Some Cases
● Python and Scrapy
![Page 3: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/3.jpg)
Web Scraping
● a computer software technique of extracting information from websites. (Wikipedia)
● for business, hobbies, research.......
● We will NOT talk about business cases today.
![Page 4: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/4.jpg)
CableTV & NOWTV Programme (Past)
● 2004.
● slow, slow, slow, or worse, can't connect.
● use Flash.
![Page 5: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/5.jpg)
HK Observatory and Joint Typhoon Warning Center
● no easy data exchange format, e.g. RSS/Atom.
● We won't check websites every day.
![Page 6: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/6.jpg)
Transportation - KMB, PTES
● no map view on the KMB website for a bus route in the past.
● Extremely poor, ugly (or much worse) map UI on PTES.
![Page 7: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/7.jpg)
My experience with web scraping
● 2004: PHP
● year after: Python
● recent year: Python with Scrapy
![Page 8: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/8.jpg)
Document Types
● HTML, XML,......
● Text
● Others, e.g. pictures, videos,......
![Page 9: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/9.jpg)
Web Scraping
● Look for the right URLs to scrape.
● Look for the right content in the webpages.
● Save the data into a data store.
● Decide when to run the web scraping program.
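The steps above can be sketched with only the Python standard library. This is a minimal illustration, not the speaker's code: the HTML snippet, the `TitleParser` class, the `scrape` function, and the `pages` table are all hypothetical, and the page is an inline string so that nothing is fetched from the network.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this from a URL.
SAMPLE_HTML = (
    "<html><head><title>Bus route 1A</title></head>"
    "<body><p>Star Ferry</p></body></html>"
)


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def scrape(url, html, conn):
    """Extract the title from html and save it together with its URL."""
    parser = TitleParser()
    parser.feed(html)
    conn.execute(
        "INSERT INTO pages (url, title) VALUES (?, ?)", (url, parser.title)
    )
    conn.commit()
    return parser.title


# In-memory database as a stand-in for the data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT)")
title = scrape("http://example.com/route/1A", SAMPLE_HTML, conn)
rows = conn.execute("SELECT url, title FROM pages").fetchall()
print(rows)
```

The last step, deciding when to run the program, is left out of the sketch; in practice it is usually handled outside the scraper, for example by a scheduler.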
![Page 10: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/10.jpg)
What is Scrapy?
● An open source web scraping framework for Python.
● Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
![Page 11: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/11.jpg)
Features of Scrapy
● define the data you want to scrape
● write a spider to extract the data
● Built-in: selecting and extracting data from HTML and XML
● Built-in: JSON, CSV, XML output
● Interactive shell console
● Built-in: web service, telnet console, logging
● Others
![Page 12: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/12.jpg)
Installation of Scrapy
● pip
● APT repo
● RPM
● tarball (binary/source)
![Page 13: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/13.jpg)
Create a new Scrapy project
$ scrapy startproject mybot
mybot/scrapy.cfg
mybot/mybot/items.py
mybot/mybot/pipelines.py
mybot/mybot/settings.py
mybot/mybot/spiders/myspider.py
etc.......
![Page 14: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/14.jpg)
items.py
from scrapy.item import Item, Field

class HKOCurrentItem(Item):
    time = Field()
    station = Field()
    temperature = Field()
    humidity = Field()
    #......
![Page 15: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/15.jpg)
spiders/hko_spider.py (1/5)
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from weatherhk.items import HKOCurrentItem
import datetime, re
![Page 16: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/16.jpg)
spiders/hko_spider.py (2/5)
class HKOCurrentSpider(BaseSpider):
    name = "HKOCurrentSpider"
    #allowed_domains = ["www.weather.gov.hk"]
    start_urls = [
        "http://www.weather.gov.hk/textonly/forecast/chinesewx.htm"
    ]
![Page 17: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/17.jpg)
spiders/hko_spider.py (3/5)
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        stations = []
        # Get the weather data for each station:
        # split the text of the first <pre> element into lines.
        tx = hxs.select("//pre[1]/text()").re('[^\n]*\n')
![Page 18: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/18.jpg)
spiders/hko_spider.py (4/5)
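        for i in tx:
            # Keep only lines that contain a temperature reading (度 = degrees).
            if re.search(u'\d 度',i):
                data = HKOCurrentItem()
                # dt and self.station are defined in code not shown on these slides.
                data['time'] = int(dt)
                data['station'] = self.station.code(i)
                data['temperature'] = int(re.findall(u'\d+',i)[0])
                stations.append(data)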
![Page 19: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/19.jpg)
spiders/hko_spider.py (5/5)
        return stations
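The line-splitting and temperature-matching logic in the spider can be tried without Scrapy, using only the `re` module. This is a minimal sketch under assumptions: the sample text stands in for the real `<pre>` content, the `parse_temperatures` helper is hypothetical, and the station name is taken naively as the first whitespace-separated token instead of going through the `self.station.code()` lookup, which is not shown on the slides. The pattern also tolerates optional whitespace before 度, which is a small departure from the pattern on the slide.

```python
import re

# Hypothetical stand-in for the text inside the first <pre> element.
SAMPLE_TEXT = (
    "氣溫\n"          # heading line, no reading
    "天文台 28 度\n"   # a station name followed by a temperature
    "京士柏 27 度\n"
    "沒有數據\n"       # line without a reading
)


def parse_temperatures(text):
    """Return (station, temperature) pairs for lines that contain a reading."""
    readings = []
    # Same idea as .re('[^\n]*\n'): one match per newline-terminated line.
    for line in re.findall(r"[^\n]*\n", text):
        # Digits, optional whitespace, then the 度 (degree) character.
        if re.search(r"\d+\s*度", line):
            station = line.split()[0]
            temperature = int(re.findall(r"\d+", line)[0])
            readings.append((station, temperature))
    return readings


print(parse_temperatures(SAMPLE_TEXT))
```

Lines without a digit followed by 度 are skipped, which is how the heading and the "no data" line drop out.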
![Page 20: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/20.jpg)
pipelines.py (1/2)
class HKOCurrentPipeline(object):
    def process_item(self, item, spider):
        # One collection per station; self.db is set up in code not shown here.
        station = self.db[item['station']]
        storeditem = dict(item.__dict__)['_values']
![Page 21: Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)](https://reader034.fdocuments.net/reader034/viewer/2022052619/5552bfbeb4c90581158b46a3/html5/thumbnails/21.jpg)
pipelines.py (2/2)
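        try:
            if 'temperature' in storeditem:
                # Most recent stored reading for this station.
                lasttime = station.find({'temperature': {'$gt': 0}}).sort('time', -1).limit(1)
                # Insert only when the timestamp has changed.
                if lasttime[0]['time'] != storeditem['time']:
                    id = self.insert(storeditem)
        # (the matching except clause is not shown on the slides)
        return item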
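The pipeline stores a reading only when its timestamp differs from the most recent one already stored. The same check can be sketched with the standard-library `sqlite3` module in place of the MongoDB-style collection that the slides appear to use. The `readings` table, the `store_if_new` function, and the sample values are hypothetical.

```python
import sqlite3


def store_if_new(conn, item):
    """Insert item unless the latest stored row for its station has the same time."""
    latest = conn.execute(
        "SELECT time FROM readings WHERE station = ? AND temperature > 0 "
        "ORDER BY time DESC LIMIT 1",
        (item["station"],),
    ).fetchone()
    if latest is not None and latest[0] == item["time"]:
        return False  # a reading with this timestamp is already stored
    conn.execute(
        "INSERT INTO readings (station, time, temperature) VALUES (?, ?, ?)",
        (item["station"], item["time"], item["temperature"]),
    )
    conn.commit()
    return True


# In-memory database as a stand-in for the real data store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (station TEXT, time INTEGER, temperature INTEGER)"
)

first = store_if_new(conn, {"station": "HKO", "time": 201208251400, "temperature": 28})
repeat = store_if_new(conn, {"station": "HKO", "time": 201208251400, "temperature": 28})
later = store_if_new(conn, {"station": "HKO", "time": 201208251500, "temperature": 29})
count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(first, repeat, later, count)
```

One design choice in this sketch: when nothing has been stored yet for a station, the reading is inserted directly. How the original pipeline handles that case is not visible on the slides.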