Scraping Multiple Pages in Python - Charlotte J. Lloyd · Scraping Process // Battle Plan u 1....

Scraping Multiple Pages in PythonWORKSHOP 3 | CREATOR: CHARLOTTE LLOYD

Outline

I. RecapII. Workshop ExampleIII. Verify DataIV. Celebration, Back-slapping

RecapPART I

Three Major Ways to Use Python

1. Command Line

2. “IDE”

3. Notebook

Scraping Process // Battle Plan

u 1. Surveillanceu Evaluate the page, learn the terrain.

u 2. Plan of Attack

u Brainstorm ways to approach the enemy.

u 3. Write codeu Be willing to change your strategy if you encounter obstacles or see another

“weakness” to exploit.

u 4. Emerge bloodied, yet victorious.

u Verify the data before all that syntax evaporates from your short term memory.

Workshop ExamplePART IV

u http://www.bfi.org.uk/films-tv-people/sightandsoundpoll2012/voters

u Scrape all information about all voters

u Scrape “film details” (except ”featuring”) for all films chosen by voters in their “top ten"

u Save data as 2 different csv files

1. Surveillance

u Voter: http://www.bfi.org.uk/films-tv-people/sightandsoundpoll2012/voter/94

u special case: http://www.bfi.org.uk/films-tv-people/sightandsoundpoll2012/voter/6

u Film: http://www.bfi.org.uk/films-tv-people/4ce2b6a7a801b

u special case: http://www.bfi.org.uk/films-tv-people/4ce2b8bb6b693

u special case: http://www.bfi.org.uk/films-tv-people/4ce2b7d2993a2

2. Plan of Attack: Voters

u What is our strategy to get the judge URLs? u exploit the “class=sas-poll” feature to scrape URLs from each of 25 tables

u What is our strategy to get the data for each judge?u scrape the name, type, info and country from the main page

u scrape the 10 films and comment from the judge’s individual page

u How can we handle the special cases? u manually create filmIDs for films without webpages

2. Plan of Attack: Films

u What is our strategy for getting the film URLs? u save them to a list while we’re scraping the judges

u What is our strategy to get the data for each film? Why do we have to incorporate the special cases directly into the strategy?

u we need to separately search for cells containing the director, country, year, genre, type, and category info

u the number of cells in the table varies, so we have to know what they are based on their content and not their position

3. Let’s look at the code together

u available at: https://github.com/charlloyd/film-gaze

u First let’s run it in Spyder.

u Then let’s download the jupyter notebook.

Scraping Multiple Pages in Python - Charlotte J. Lloyd · Scraping Process // Battle Plan u 1....

Documents

Transcript of Scraping Multiple Pages in Python - Charlotte J. Lloyd · Scraping Process // Battle Plan u 1....

Account Aggregation A Break Through Technology · 2. Screen scraping What is screen scraping? • Screen scraping or web scraping is the ability to automatically scrape or retrieve

Onlineinfo2012 - Scraping

Scraping HTML with XPathbooks.pharo.org/booklet-Scraping/pdf/2020-02-04-scraping... · 2020. 2. 4. · Illustrations IcamewiththeideaofthisbookletthanktoPeterthatkindlyanswereda questiononthePharomailing-list.TohelpPetershowedtoaPharoerhow

Acceptance & Scraping

Scraping Handout

Scraping by examples

Detailed Scraping Instructions

Web scraping and social media scraping { handling JScoin.wne.uw.edu.pl/dcelinska/resources/webscraping/...Web scraping and social media scraping {handling JS Jacek Lewkowicz, Dorota

Odd2015 scraping

Web Scraping Services

Scraping web pages

Presentacion web scraping

Scraping for Stories

Scraping with Geb

Scraping talk public

Website Scraping

COMP 4971C Independent Project Web Scraping …Web Scraping Website with Python for Database Construction HALIM, Kevin 8 2.1.2. Scraping the website Scraping the website GSMArena can

Object Oriented Design Goals OOD meets input from the Web Design workshop u Form teams u Brainstorm projects.

SCREEN SCRAPING USING “IBM PERSONAL COMMUNICATION” scraping IBM 5250.pdf · SCREEN SCRAPING USING “IBM PERSONAL COMMUNICATION” ... Meaning the use of screen scrapper program

Python Scraping Showdown