Scraping recalcitrant web sites with Python & Selenium

Roger Barnes

SyPy July 2012

Some sites suck

Some sites suck - "for your own good"

For security reasons, each button is an image, dynamically generated by a hash, wrapped in a mess of JavaScript, and randomly placed

...but they work in a web browser!

Let's use the web browser to scrape them

Enter Selenium

Selenium automates browsers

That's it

Selenium can...

● navigate (windows, frames, links)
● find elements and parse attributes
● interact and trigger events (click, type, ...)
● capture screenshots
● run javascript
● let the browser take care of the hard stuff (cookies, javascript, sessions, profiles, DOM)
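
A minimal sketch of a few of these capabilities in one helper, assuming the Selenium 4 Python bindings. The name `summarise_page`, the default screenshot path and the example URL are made up for illustration; the locator string "tag name" is the value of `By.TAG_NAME`.

```python
def summarise_page(driver, url, screenshot_path="page.png"):
    """Load a page with an already-started WebDriver and collect a few facts."""
    driver.get(url)                                           # navigate
    links = driver.find_elements("tag name", "a")             # find elements
    hrefs = [link.get_attribute("href") for link in links]    # parse attributes
    title = driver.execute_script("return document.title;")   # run javascript
    driver.save_screenshot(screenshot_path)                   # capture a screenshot
    return {"title": title, "links": [h for h in hrefs if h]}


def demo():
    """Hypothetical usage; needs `pip install selenium` and a local Firefox."""
    from selenium import webdriver  # imported lazily so the helper stands alone

    browser = webdriver.Firefox()
    try:
        return summarise_page(browser, "https://example.com/")
    finally:
        browser.quit()
```

The helper only touches the driver it is given, so the same code should work with Firefox or Chrome.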

Comes with various components and bindings... including python

General Recipe

Ingredients:
● firefox (or chrome)
● firebug (or chrome dev tools)
● Selenium IDE
  ○ record a session, write less code
● python and its batteries
● python-selenium
● xvfb and pyvirtualdisplay (optional)
● other libraries to taste
  ○ eg image manipulation, database access, DOM parsing, OCR
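
The optional xvfb and pyvirtualdisplay ingredient lets the browser run on a machine with no screen. A rough sketch, assuming pyvirtualdisplay's `Display(visible=0, size=...)` with `start()` and `stop()`; the wrapper name `virtual_display`, the `enabled` switch and the default size are made up here.

```python
from contextlib import contextmanager


@contextmanager
def virtual_display(enabled=True, size=(1280, 1024)):
    """Run the enclosed code inside an Xvfb display, or do nothing if disabled."""
    if not enabled:
        # Useful on a desktop, where watching the real browser helps debugging.
        yield None
        return

    # Imported lazily so a desktop run does not need xvfb/pyvirtualdisplay.
    from pyvirtualdisplay import Display

    display = Display(visible=0, size=size)
    display.start()
    try:
        yield display
    finally:
        display.stop()  # always tear the virtual screen down
```

Starting the browser inside `with virtual_display():` keeps it off the real screen; pass `enabled=False` to watch it work.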

General Recipe

Method:
● Install requirements (apt-get, pip etc)
  ○ sudo apt-get install xvfb firefox
  ○ pip install selenium pyvirtualdisplay
● Start up Firefox and Selenium IDE
● Record a "test" run through site
  ○ Add in some assertions along the way
● Export test as Python script
● Hack from there
  ○ Loops
  ○ Image/data extraction
  ○ Wrangling data into a database
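
The "wrangling data into a database" step might look something like this with the sqlite3 battery. The table name, columns and sample values are invented for illustration and are not taken from the talk's code.

```python
import sqlite3
from datetime import date
from decimal import Decimal


def parse_balance(text):
    """Turn scraped text such as '$1,234.56' into integer cents."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return int(Decimal(cleaned) * 100)


def save_balance(conn, account, balance_cents, day=None):
    """Store one balance per account per day, overwriting a repeated run."""
    day = day or date.today().isoformat()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS balances ("
        " day TEXT NOT NULL,"
        " account TEXT NOT NULL,"
        " balance_cents INTEGER NOT NULL,"
        " PRIMARY KEY (day, account))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO balances (day, account, balance_cents)"
        " VALUES (?, ?, ?)",
        (day, account, balance_cents),
    )
    conn.commit()
```

Storing cents as integers avoids floating-point rounding, and the (day, account) key means re-running the scraper on the same day replaces the row instead of duplicating it.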

Example from Selenium IDE

    class Ingdirect2(unittest.TestCase):
        def setUp(self):
            self.driver = webdriver.Firefox()
            self.driver.implicitly_wait(30)
            self.base_url = "https://www.ingdirect.com.au"
            self.verificationErrors = []

        def test_ingdirect2(self):
            driver = self.driver
            driver.get(self.base_url + "/client/index.aspx")
            driver.switch_to_frame('body')  # Had to add this
            driver.find_element_by_id("txtCIF").clear()
            driver.find_element_by_id("txtCIF").send_keys("12345678")
            driver.find_element_by_id("objKeypad_B1").click()
            driver.find_element_by_id("objKeypad_B2").click()
            driver.find_element_by_id("objKeypad_B3").click()
            driver.find_element_by_id("objKeypad_B4").click()
            driver.find_element_by_id("btnLogin").click()
            self.assertTrue(self.is_element_present(By.ID, "ctl2_lblBalance"))
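
The exported script uses the 2012-era `find_element_by_*` and `switch_to_frame` helpers, which newer Selenium 4 releases no longer provide. A sketch of the same steps with `find_element(by, value)` and `switch_to.frame(...)` follows. The function name `log_in` and its parameters are made up, and `BY_ID` simply holds the string that `By.ID` stands for so the snippet has no imports.

```python
BY_ID = "id"  # same string as selenium.webdriver.common.by.By.ID


def log_in(driver, base_url, client_number, keypad_ids):
    """Replay the recorded login and report whether the balance label appears."""
    driver.get(base_url + "/client/index.aspx")
    driver.switch_to.frame("body")  # the form lives inside a frame

    field = driver.find_element(BY_ID, "txtCIF")
    field.clear()
    field.send_keys(client_number)

    # Click the keypad buttons in the order given by the caller.
    for element_id in keypad_ids:
        driver.find_element(BY_ID, element_id).click()

    driver.find_element(BY_ID, "btnLogin").click()

    # find_elements returns an empty list instead of raising when nothing matches.
    return len(driver.find_elements(BY_ID, "ctl2_lblBalance")) > 0
```

Passing the keypad ids in as a list keeps the recorded clicks out of the function, which matters once the button order is worked out at run time.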

But what about that dang keypad? ...

PIL saves the day

    # Get screenshot for extraction of button images
    screenshot = driver.get_screenshot_as_base64()
    im = Image.open(StringIO.StringIO(base64.decodestring(screenshot)))

    table = driver.find_element_by_xpath('//*[@id="objKeypad_divShowAll"]/table')
    all_buttons = table.find_elements_by_tag_name("input")

    # Determine md5sum of each button by cropping based on element positions
    for button in all_buttons:
        button_image = im.crop(getcropbox(button))
        hexid = hashlib.md5(button_image.tostring()).hexdigest()
        button_mapping[hexid] = button.get_attribute("id")

    # Now we know which button is which (based on previous lookup), enter the PIN
    for char in self.pin:
        driver.find_element_by_id(button_mapping[hex_mapping[char]]).click()

    driver.find_element_by_id("btnLogin").click()

    # We're in!!!11one
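
The slide does not show `getcropbox`, `hex_mapping` or `button_mapping`. Below is one guess at how they could fit together, using only the standard library. It assumes a WebElement's `location` and `size` dictionaries line up with screenshot pixels (page not scrolled, no display scaling); `digest` and `pin_to_button_ids` are invented names.

```python
import hashlib


def getcropbox(element):
    """Return (left, upper, right, lower), the box shape PIL's Image.crop expects."""
    left = int(element.location["x"])
    upper = int(element.location["y"])
    right = left + int(element.size["width"])
    lower = upper + int(element.size["height"])
    return (left, upper, right, lower)


def digest(pixels):
    """Fingerprint the raw bytes of a cropped button image."""
    return hashlib.md5(pixels).hexdigest()


def pin_to_button_ids(pin, hex_mapping, button_mapping):
    """Translate a PIN into the element ids to click, in order.

    hex_mapping:    digit character -> fingerprint recorded on an earlier run
    button_mapping: fingerprint -> element id seen on the current page
    """
    return [button_mapping[hex_mapping[char]] for char in pin]
```

The two-step lookup is what beats the shuffled keypad: the digit images stay the same between visits, so their fingerprints can be recorded once, while the element ids behind them are rediscovered on every page load.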

But why do all this?

It's my data! ... and I'll graph if I want to

* Actual results may vary. Graph indicates open inodes, not high-roller gambling problem

That's all folks

Slides
● http://bit.ly/scrapium

Code
● https://gist.github.com/3015852

Me
● https://twitter.com/mindsocket
● https://github.com/mindsocket
