Scraping in 20 mins
Paul Bradshaw
Leanpub.com/scrapingforjournalists
Friday, 13 July 2012
Function (Parameters)
=SUM(A2:A50)
=AVERAGE(B2:B300)
=COUNTIF(A10:A3000,"Smith")
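The same function(parameters) pattern carries over from spreadsheets to code. As a rough sketch, here are Python equivalents of the three formulas above. The two lists are made-up values standing in for the spreadsheet ranges, not data from the talk.

```python
# Hypothetical stand-ins for spreadsheet ranges such as A2:A50.
amounts = [10, 20, 30, 40]
names = ["Smith", "Jones", "Smith", "Patel"]

# Like =SUM(range): the function name, then its parameter in brackets.
total = sum(amounts)

# Like =AVERAGE(range): divide the sum by the number of items.
average = sum(amounts) / len(amounts)

# Like =COUNTIF(range, "Smith"): count the items equal to "Smith".
smith_count = names.count("Smith")

print(total, average, smith_count)
```

In each case the name says what to do and the parameters in brackets say what to do it to.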
("string", index)
Tip: search for documentation
Tip: search for structure around data
//div[starts-with(@class, 'jobWrap')]
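To see what that XPath expression selects, here is a minimal Python sketch using only the standard library. The HTML fragment and its class names other than jobWrap are invented for illustration. The standard library's ElementTree supports only a limited subset of XPath and has no starts-with(), so the same test is written as an ordinary Python filter. Real pages are often not well-formed, so a more tolerant parser would likely be needed in practice.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment standing in for a real jobs page.
html = """
<html><body>
  <div class="jobWrap odd"><h2>Reporter</h2></div>
  <div class="sidebar"><h2>Adverts</h2></div>
  <div class="jobWrap even"><h2>Sub-editor</h2></div>
</body></html>
""".strip()

root = ET.fromstring(html)

# Same idea as //div[starts-with(@class, 'jobWrap')]:
# every <div> anywhere in the page whose class attribute begins with "jobWrap".
jobs = [div for div in root.iter("div")
        if div.get("class", "").startswith("jobWrap")]

# Pull the heading text out of each matching <div>.
titles = [div.find("h2").text for div in jobs]
print(titles)
```

The sidebar div is skipped because its class does not begin with jobWrap, which is the point of searching for the structure around the data.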
bit.ly/nrwscraper2
excelnotes.posterous.com/tag/importxml/tag/importhtml
https://scraperwiki.com/scrapers/basic_twitter_scraper/
https://scraperwiki.com/docs/python/tutorials/ - Screen Scraper 2
Things to know
• Libraries
• Functions
• Variables
• Lists or arrays ['Bob', 'Jane']
• Index
• String, integer, float
• If/Else
• For loops
• Operators
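A short Python sketch that touches each item on that list may help. The values other than the 'Bob' and 'Jane' list are invented for illustration.

```python
import math  # a library: ready-made code you import before using it

names = ["Bob", "Jane"]    # a variable holding a list (array) of strings
first = names[0]           # index: counting starts at 0, so this is "Bob"

count = 3                  # an integer
price = 2.5                # a float
cost = count * price       # an operator: * multiplies

rounded = math.ceil(cost)  # a function from the library, with one parameter

# if/else: choose between two branches
if rounded > 5:
    size = "big"
else:
    size = "small"

# for loop: run the indented line once for each item in the list
greetings = []
for name in names:
    greetings.append("Hello " + name)

print(first, cost, rounded, size, greetings)
```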
Following the data
• From String (URL) ->
• Variable (html) ->
• Variable (root) ->
• Variable containing a list (tds) ->
• Variable (td)
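The chain above can be sketched in a few lines of Python. This is an illustration using only the standard library, not the ScraperWiki tutorial's own code. The URL is a hypothetical placeholder and is not actually fetched. The page source is hard-coded so the example runs offline.

```python
import xml.etree.ElementTree as ET

# String (URL): where the page lives. Hypothetical address, not fetched here.
url = "http://example.com/table.html"

# Variable (html): the page source as one long string.
# Hard-coded here in place of downloading the URL.
html = ("<html><body><table><tr>"
        "<td>Duarte</td><td>Sihl</td>"
        "</tr></table></body></html>")

# Variable (root): the string parsed into a tree that can be searched.
root = ET.fromstring(html)

# Variable containing a list (tds): every <td> element in the tree.
tds = list(root.iter("td"))

# Variable (td): one element at a time; .text is the data inside it.
cells = []
for td in tds:
    cells.append(td.text)

print(cells)
```

Each step takes the previous variable as its input, so when a scraper breaks it helps to print each variable in turn and see where the data stops looking right.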
Looping through a list
• tds = ['Duarte', 'Sihl', 'Franzi', 'Paul']
• for td in tds:
• The first time, td = Duarte
• The second time, td = Sihl
• Then td = Franzi
• Then td = Paul
• Then it has finished the loop!
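Written out as runnable Python, the loop looks like this. The seen list is an addition for illustration, to record each value td takes.

```python
tds = ["Duarte", "Sihl", "Franzi", "Paul"]

seen = []
for td in tds:
    # td takes each value in turn: Duarte, Sihl, Franzi, Paul
    seen.append(td)
    print(td)

# This line runs only after the last item: the loop has finished.
print("Finished after", len(seen), "items")
```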
Leanpub.com/scrapingforjournalists
@paulbradshaw
onlinejournalismblog.com
helpmeinvestigate.com
slideshare.net/onlinejournalist
linkedin.com/in/onlinejournalist