Scraping in 60 minutes

Post on 06-May-2015

Presentation at Data Harvest 2014

Paul Bradshaw
Leanpub.com/scrapingforjournalists

Scraping in 60 mins

Saturday, 10 May 14

https://www.youtube.com/watch?v=Efr-VEkwWoM

Function (arguments)
(aka parameters)

Function (arguments)
=SUM(A2:A50)
=AVERAGE(B2:B300)
=COUNTIF(A10:A3000,"Smith")

Function (parameters)
=SUM(range of cells to be summed)
=AVERAGE(range of cells to be averaged)
=COUNTIF(range of cells to be counted, what to count)
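
The same function-and-arguments pattern carries over to code. As a rough sketch in Python (the sample numbers and surnames are invented for illustration), each call below names a function and passes it the data to work on, just as the spreadsheet formulas do:

```python
from statistics import mean

# Invented sample data standing in for spreadsheet columns
values = [4, 8, 15, 16, 23, 42]
surnames = ["Smith", "Jones", "Smith", "Patel"]

total = sum(values)                # like =SUM(range of cells)
average = mean(values)             # like =AVERAGE(range of cells)
smiths = surnames.count("Smith")   # like =COUNTIF(range, "Smith")

print(total, average, smiths)
```

In each case the part in parentheses is the argument: what the function should act on, or what it should look for.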

("string", index)

Tip: search for documentation

Variable

Variables

Jargon checklist:
Function
Arguments
Parameters
String
Index
Variable
Documentation

Vote: =importXML or Python?

Another function?

Search for documentation!
https://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/

Query (XPath)

XPath is a path through XML (or HTML)
<table> = //table
<table><tr> = //table//tr
<table><tr><td> = //table//tr//td
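
To try these paths outside a spreadsheet, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which understands a limited subset of XPath. The sample markup is invented for illustration. Note that ElementTree expects a leading "." before "//", meaning "start from this element".

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page (ElementTree cannot parse messy HTML)
SAMPLE = """<html><body>
<table>
<tr><td>Alice</td><td>Reporter</td></tr>
<tr><td>Bob</td><td>Editor</td></tr>
</table>
</body></html>"""

root = ET.fromstring(SAMPLE)

tables = root.findall(".//table")          # //table
rows = root.findall(".//table//tr")        # //table//tr
cells = root.findall(".//table//tr//td")   # //table//tr//td

cell_text = [td.text for td in cells]
print(len(tables), len(rows), cell_text)
```

Each extra step in the path narrows the selection: every table, then every row inside a table, then every cell inside those rows.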

Search for documentation!
http://www.w3schools.com/XPath/xpath_syntax.asp

Tip: search for structure around data

http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs

"//div[@class= 'leftcolumn']"

//div[starts-with(@class, 'jobWrap')]
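
The two queries above, an exact class match and a "starts with" match, can be sketched in Python too. This assumes an invented sample page: the class values "jobWrap odd", "jobWrap even" and "advert" are hypothetical, not taken from the real site. The standard library's ElementTree supports the exact match but not starts-with(), so the second query is approximated with Python's own startswith().

```python
import xml.etree.ElementTree as ET

# Invented sample markup; the real page's structure may differ
SAMPLE = """<html><body>
<div class="leftcolumn">
<div class="jobWrap odd"><a href="/job/1">Researcher</a></div>
<div class="jobWrap even"><a href="/job/2">Press officer</a></div>
<div class="advert">Not a job</div>
</div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Equivalent of //div[@class='leftcolumn']
left_columns = root.findall(".//div[@class='leftcolumn']")

# Stand-in for //div[starts-with(@class, 'jobWrap')]:
# visit every div and keep those whose class begins with 'jobWrap'
jobs = [div for div in root.iter("div")
        if div.get("class", "").startswith("jobWrap")]

titles = [div.find("a").text for div in jobs]
print(len(left_columns), titles)
```

The "starts with" form is useful when the class varies slightly between items, as with the odd and even rows here, because an exact match would catch only one variant.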

A crib sheet:

Chrome extension:

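#!/usr/bin/env python

# scraperwiki is a library: ready-made code that we import
import scraperwiki

# Fetch the page and store its HTML in a variable called html
html = scraperwiki.scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')

# The parentheses let this line run in both Python 2 and Python 3
print(html)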

html: variable (assigned with = sign)

print: statement used to show variable
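
If the scraperwiki library is not installed, the fetching step can be approximated with Python's standard library. This is a rough stand-in under stated assumptions, not scraperwiki's own implementation: the function name scrape is reused here only for familiarity, and the fallback to UTF-8 when the server gives no encoding is an assumption.

```python
from urllib.request import urlopen


def scrape(url):
    """Fetch a URL and return its body as text (a rough stand-in for scraperwiki.scrape)."""
    with urlopen(url) as response:
        # Use the encoding the server reports, or assume UTF-8 if it gives none
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (needs a network connection):
# html = scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')
# print(html[:200])
```

As in the original script, the result is assigned to a variable so that later lines can work with it.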

Jargon checklist:
Library
Shebang
List

Paul Bradshaw
Leanpub.com/scrapingforjournalists

Thank you.
