Scrape the Web: Strategies for programming websites that don’t
Transcript of Scrape the Web: Strategies for programming websites that don’t
Scrape the Web: Strategies for programmingwebsites that don’t expect it
Presenter: Asheesh Laroia, @asheeshlaroia([email protected], +1-585-506-8865)
February 18, 2010
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
Intro
Meta
Hello
I You will learn neat tricks
I DO NOT BECOME AN EVIL COMMENT SPAMMER
I Theory, practice, and iterative development
I Brittle? Sometimes.
I The comics aren’t mine; ask me for references.
Hello
I You will learn neat tricks
I DO NOT BECOME AN EVIL COMMENT SPAMMER
I Theory, practice, and iterative development
I Brittle? Sometimes.
I The comics aren’t mine; ask me for references.
Hello
I You will learn neat tricks
I DO NOT BECOME AN EVIL COMMENT SPAMMER
I Theory, practice, and iterative development
I Brittle? Sometimes.
I The comics aren’t mine; ask me for references.
Hello
I You will learn neat tricks
I DO NOT BECOME AN EVIL COMMENT SPAMMER
I Theory, practice, and iterative development
I Brittle? Sometimes.
I The comics aren’t mine; ask me for references.
Hello
I You will learn neat tricks
I DO NOT BECOME AN EVIL COMMENT SPAMMER
I Theory, practice, and iterative development
I Brittle? Sometimes.
I The comics aren’t mine; ask me for references.
Hello
I You will learn neat tricks
I DO NOT BECOME AN EVIL COMMENT SPAMMER
I Theory, practice, and iterative development
I Brittle? Sometimes.
I The comics aren’t mine; ask me for references.
Format introduction
I I’ll stand up here and talk about things.
I You’ll ask me questions.
Format introduction
I I’ll stand up here and talk about things.
I You’ll ask me questions.
Format introduction
I I’ll stand up here and talk about things.
I You’ll ask me questions.
You know what sucks?
I It sucks when everyone’s thinking something and nobody’ssaying it.
I If I am incoherent, stop me.
You know what sucks?
I It sucks when everyone’s thinking something and nobody’ssaying it.
I If I am incoherent, stop me.
You know what sucks?
I It sucks when everyone’s thinking something and nobody’ssaying it.
I If I am incoherent, stop me.
“Only” three hours
I Slow me down,
I or speed me up.
I Do this with your voice or by raising your hand.
I Don’t try to do it via Twitter.
“Only” three hours
I Slow me down,
I or speed me up.
I Do this with your voice or by raising your hand.
I Don’t try to do it via Twitter.
“Only” three hours
I Slow me down,
I or speed me up.
I Do this with your voice or by raising your hand.
I Don’t try to do it via Twitter.
“Only” three hours
I Slow me down,
I or speed me up.
I Do this with your voice or by raising your hand.
I Don’t try to do it via Twitter.
“Only” three hours
I Slow me down,
I or speed me up.
I Do this with your voice or by raising your hand.
I Don’t try to do it via Twitter.
What is screen scraping?
Photo
Photo
Brittle?
Remote procedure call
I Every time you press a key, you cause the remote computer toexecute code.
I Every keypress causes a remote procedure call.
I If you understand this, you can document it as an API.
Remote procedure call
I Every time you press a key, you cause the remote computer toexecute code.
I Every keypress causes a remote procedure call.
I If you understand this, you can document it as an API.
Remote procedure call
I Every time you press a key, you cause the remote computer toexecute code.
I Every keypress causes a remote procedure call.
I If you understand this, you can document it as an API.
Remote procedure call
I Every time you press a key, you cause the remote computer toexecute code.
I Every keypress causes a remote procedure call.
I If you understand this, you can document it as an API.
Power
I We get to interact with the raw data.
I We could write our own interface.
I We get to programmatically interact with a system that onlyexpect humans at the door.
Power
I We get to interact with the raw data.
I We could write our own interface.
I We get to programmatically interact with a system that onlyexpect humans at the door.
Power
I We get to interact with the raw data.
I We could write our own interface.
I We get to programmatically interact with a system that onlyexpect humans at the door.
Power
I We get to interact with the raw data.
I We could write our own interface.
I We get to programmatically interact with a system that onlyexpect humans at the door.
Independence
I Design choices and restrictions fall away.
Independence
I Design choices and restrictions fall away.
Power, too much
I WE CAN SEND SPAM!
I Don’t do that.
Power, too much
I WE CAN SEND SPAM!
I Don’t do that.
Power, too much
I WE CAN SEND SPAM!
I Don’t do that.
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
Programming the web
Say
The Web
I It’s the twenty-first century.
I The Web is a massive, mostly-unrestricted remote procedurecall system.
The Web
I It’s the twenty-first century.
I The Web is a massive, mostly-unrestricted remote procedurecall system.
The Web
I It’s the twenty-first century.
I The Web is a massive, mostly-unrestricted remote procedurecall system.
Mac OS “say”
I I’m not hip enough to have “say”
I but I do have the Web
Mac OS “say”
I I’m not hip enough to have “say”
I but I do have the Web
Mac OS “say”
I I’m not hip enough to have “say”
I but I do have the Web
Cepstral demo
Curry
Delicious
Curry on the web
http://mehfilindian.com/LunchMenuTakeOut.htm
Beneath the covers...
I FrontPage 6.0 is from 2003
I Some really ugly HTML...
I I like to call this 1998-style HTML
Beneath the covers...
I FrontPage 6.0 is from 2003
I Some really ugly HTML...
I I like to call this 1998-style HTML
Beneath the covers...
I FrontPage 6.0 is from 2003
I Some really ugly HTML...
I I like to call this 1998-style HTML
Beneath the covers...
I FrontPage 6.0 is from 2003
I Some really ugly HTML...
I I like to call this 1998-style HTML
The easy way
examples/curry/trivial.py
I urllib2.urlopen() gives you a file descriptor
I Now you can read() it... (and you get a big ol’ byte string)
I Test its contents for squash, and you’re done.
The easy way
examples/curry/trivial.py
I urllib2.urlopen() gives you a file descriptor
I Now you can read() it... (and you get a big ol’ byte string)
I Test its contents for squash, and you’re done.
The easy way
examples/curry/trivial.py
I urllib2.urlopen() gives you a file descriptor
I Now you can read() it... (and you get a big ol’ byte string)
I Test its contents for squash, and you’re done.
The easy way
examples/curry/trivial.py
I urllib2.urlopen() gives you a file descriptor
I Now you can read() it... (and you get a big ol’ byte string)
I Test its contents for squash, and you’re done.
The Web and standards
I We don’t have to resort to visual screen scraping.
I The web has a standard data format for marking up pagecontent.
I What is it called?
The Web and standards
I We don’t have to resort to visual screen scraping.
I The web has a standard data format for marking up pagecontent.
I What is it called?
The Web and standards
I We don’t have to resort to visual screen scraping.
I The web has a standard data format for marking up pagecontent.
I What is it called?
The Web and standards
I We don’t have to resort to visual screen scraping.
I The web has a standard data format for marking up pagecontent.
I What is it called?
XHTML and HTML
I It’s 2010.
I Surely XHTML has won by now.
XHTML and HTML
I It’s 2010.
I Surely XHTML has won by now.
XHTML and HTML
I It’s 2010.
I Surely XHTML has won by now.
“Extract some information”
I HTML
I vs. XHTML (2000)
I Both are trees of tags; both can be visualized in FireBug.
I ...did XHTML win?
“Extract some information”
I HTML
I vs. XHTML (2000)
I Both are trees of tags; both can be visualized in FireBug.
I ...did XHTML win?
“Extract some information”
I HTML
I vs. XHTML (2000)
I Both are trees of tags; both can be visualized in FireBug.
I ...did XHTML win?
“Extract some information”
I HTML
I vs. XHTML (2000)
I Both are trees of tags; both can be visualized in FireBug.
I ...did XHTML win?
“Extract some information”
I HTML
I vs. XHTML (2000)
I Both are trees of tags; both can be visualized in FireBug.
I ...did XHTML win?
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?
I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?
I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:
I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?
I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?
I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?
I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Stats pop quiz(Stats from the MAMA survey published by Opera<http://dev.opera.com/articles/view/mama-key-findings/>.)
I Average page size?I 16.5K
I HTML to XHTML ratio?I 2:1
I Transitional vs. Strict/Frameset:I 10:1
I How many in ”Quirks” mode?I 85%
I What’s more popular? TITLE or BODY?I TITLE
I What percent validate in general?I ca. 4.13%
I What percent of web pages that have validation badgesvalidate?
I ca. 12
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
The web: Round one
Parsing considerations
A showcase of some of your options
I An example of valid HTML (written by hand)(examples/parsing/)
I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)
I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidom
I Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in Firefox
I In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidom
I in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
A showcase of some of your optionsI An example of valid HTML (written by hand)
(examples/parsing/)I Parsed with HTMLParser
I An example of invalid HTML (cooked by hand)(examples/parsing/invalid-html/)
I Parsed with HTMLParser
I An example of valid XHTML (written by hand)(examples/parsing/valid-xhtml/)
I Parsed with xml.dom.minidomI Parsed with HTMLParser
I An example of invalid XHTML<http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm>(examples/parsing/invalid-xhtml/)
I in FirefoxI In xml.dom.minidomI in HTMLParser
I If web HTML is not always parseable, we need a differentapproach.
Other ways to get information out of web pages?
I “squash” in page contents.lower()
I re.search(“squash”, page contents, re.IGNORECASE)
Other ways to get information out of web pages?
I “squash” in page contents.lower()
I re.search(“squash”, page contents, re.IGNORECASE)
Other ways to get information out of web pages?
I “squash” in page contents.lower()
I re.search(“squash”, page contents, re.IGNORECASE)
Inspirational quote: JWZ
Some people, when confronted with a problem, think“Iknow, I’ll use regular expressions.” Now they have twoproblems.– Jamie Zawinski
What’s wrong with regular expressions for scraping
I <a href=”/whatever/”>
I <a href=’whatever’>
I <a href=‘whatever”>
I Okay for “Reviews 1-10 of 430”
I Kodos: Regular expression GUI (since redemo.py seemsunmaintained)
What’s wrong with regular expressions for scraping
I <a href=”/whatever/”>
I <a href=’whatever’>
I <a href=‘whatever”>
I Okay for “Reviews 1-10 of 430”
I Kodos: Regular expression GUI (since redemo.py seemsunmaintained)
What’s wrong with regular expressions for scraping
I <a href=”/whatever/”>
I <a href=’whatever’>
I <a href=‘whatever”>
I Okay for “Reviews 1-10 of 430”
I Kodos: Regular expression GUI (since redemo.py seemsunmaintained)
What’s wrong with regular expressions for scraping
I <a href=”/whatever/”>
I <a href=’whatever’>
I <a href=‘whatever”>
I Okay for “Reviews 1-10 of 430”
I Kodos: Regular expression GUI (since redemo.py seemsunmaintained)
What’s wrong with regular expressions for scraping
I <a href=”/whatever/”>
I <a href=’whatever’>
I <a href=‘whatever”>
I Okay for “Reviews 1-10 of 430”
I Kodos: Regular expression GUI (since redemo.py seemsunmaintained)
What’s wrong with regular expressions for scraping
I <a href=”/whatever/”>
I <a href=’whatever’>
I <a href=‘whatever”>
I Okay for “Reviews 1-10 of 430”
I Kodos: Regular expression GUI (since redemo.py seemsunmaintained)
Inspirational quote: Jon Postel
Robustness principle: “Be conservative in what you do, be liberal inwhat you accept from others.”– Jon Postel, Transmission Control Protocol, RFC 793
Inspirational quote: Leonard Richardson
“You didn’t write that awful page. You’re just trying to get somedata out of it. Right now, you don’t really care what HTML issupposed to look like.“– Leonard Richardson, author of BeautifulSoup
Back to curry
New goal for curry: Objectify
Map the menu to Python objects
I play with the source in BeautifulSoup
I ...this is a text processing problem, not tag processing.
New goal for curry: Objectify
Map the menu to Python objects
I play with the source in BeautifulSoup
I ...this is a text processing problem, not tag processing.
New goal for curry: Objectify
Map the menu to Python objects
I play with the source in BeautifulSoup
I ...this is a text processing problem, not tag processing.
Model the data
examples/curry/menu.pyclass Entree:
I index
I name
I description
I long winded description
I price
Model the data
examples/curry/menu.pyclass Entree:
I index
I name
I description
I long winded description
I price
Model the data
examples/curry/menu.pyclass Entree:
I index
I name
I description
I long winded description
I price
Model the data
examples/curry/menu.pyclass Entree:
I index
I name
I description
I long winded description
I price
Model the data
examples/curry/menu.pyclass Entree:
I index
I name
I description
I long winded description
I price
Model the data
examples/curry/menu.pyclass Entree:
I index
I name
I description
I long winded description
I price
Mini-lesson
I hand-written pages vs.
I machine-written pages
Mini-lesson
I hand-written pages vs.
I machine-written pages
Mini-lesson
I hand-written pages vs.
I machine-written pages
New goal: Scrape Yahoo! finance
I examples/tree-builders/beautifulsoup yfinance.py
New goal: Scrape Yahoo! finance
I examples/tree-builders/beautifulsoup yfinance.py
We’re done!
Right?
Trees of tags
What defines how HTML gets parsed?
Web browsers
Surfing tag trees in FireBug
I Or Opera Dragonfly
I Or Chrome’s Inspector
Surfing tag trees in FireBug
I Or Opera Dragonfly
I Or Chrome’s Inspector
Surfing tag trees in FireBug
I Or Opera Dragonfly
I Or Chrome’s Inspector
Parsing trees and finding elements
Early history
I 1998: HTML::TokeParser for Perl
I $p->get tag(“title”)
I 1999: W3C XPath standard
I xmlDoc.selectNodes(“//title”)
I 2004: BeautifulSoup for Python, Release 1.0, “So rich andgreen”
I soup(“title”)
I 2006: scrAPI for Ruby
I CSS Selectors...I titleI span.title
Early history
I 1998: HTML::TokeParser for Perl
I $p->get tag(“title”)
I 1999: W3C XPath standard
I xmlDoc.selectNodes(“//title”)
I 2004: BeautifulSoup for Python, Release 1.0, “So rich andgreen”
I soup(“title”)
I 2006: scrAPI for Ruby
I CSS Selectors...I titleI span.title
Early history
I 1998: HTML::TokeParser for Perl
I $p->get tag(“title”)
I 1999: W3C XPath standard
I xmlDoc.selectNodes(“//title”)
I 2004: BeautifulSoup for Python, Release 1.0, “So rich andgreen”
I soup(“title”)
I 2006: scrAPI for Ruby
I CSS Selectors...I titleI span.title
Early history
I 1998: HTML::TokeParser for Perl
I $p->get tag(“title”)
I 1999: W3C XPath standard
I xmlDoc.selectNodes(“//title”)
I 2004: BeautifulSoup for Python, Release 1.0, “So rich andgreen”
I soup(“title”)
I 2006: scrAPI for Ruby
I CSS Selectors...I titleI span.title
Early history
I 1998: HTML::TokeParser for Perl
I $p->get tag(“title”)
I 1999: W3C XPath standard
I xmlDoc.selectNodes(“//title”)
I 2004: BeautifulSoup for Python, Release 1.0, “So rich andgreen”
I soup(“title”)
I 2006: scrAPI for Ruby
I CSS Selectors...I titleI span.title
Early history
I 1998: HTML::TokeParser for Perl
I $p->get tag(“title”)
I 1999: W3C XPath standard
I xmlDoc.selectNodes(“//title”)
I 2004: BeautifulSoup for Python, Release 1.0, “So rich andgreen”
I soup(“title”)
I 2006: scrAPI for Ruby
I CSS Selectors...I titleI span.title
Early history
I 1998: HTML::TokeParser for Perl
I $p->get tag(“title”)
I 1999: W3C XPath standard
I xmlDoc.selectNodes(“//title”)
I 2004: BeautifulSoup for Python, Release 1.0, “So rich andgreen”
I soup(“title”)
I 2006: scrAPI for Ruby
I CSS Selectors...I titleI span.title
Early history
I 1998: HTML::TokeParser for Perl
I $p->get tag(“title”)
I 1999: W3C XPath standard
I xmlDoc.selectNodes(“//title”)
I 2004: BeautifulSoup for Python, Release 1.0, “So rich andgreen”
I soup(“title”)
I 2006: scrAPI for Ruby
I CSS Selectors...I titleI span.title
Early history
I 1998: HTML::TokeParser for Perl
I $p->get tag(“title”)
I 1999: W3C XPath standard
I xmlDoc.selectNodes(“//title”)
I 2004: BeautifulSoup for Python, Release 1.0, “So rich andgreen”
I soup(“title”)
I 2006: scrAPI for Ruby
I CSS Selectors...
I titleI span.title
Early history
I 1998: HTML::TokeParser for Perl
I $p->get tag(“title”)
I 1999: W3C XPath standard
I xmlDoc.selectNodes(“//title”)
I 2004: BeautifulSoup for Python, Release 1.0, “So rich andgreen”
I soup(“title”)
I 2006: scrAPI for Ruby
I CSS Selectors...I title
I span.title
Early history
I 1998: HTML::TokeParser for Perl
I $p->get tag(“title”)
I 1999: W3C XPath standard
I xmlDoc.selectNodes(“//title”)
I 2004: BeautifulSoup for Python, Release 1.0, “So rich andgreen”
I soup(“title”)
I 2006: scrAPI for Ruby
I CSS Selectors...I titleI span.title
Recent history
I 2007: lxml.html improved, publicized by Ian Bicking
I CSS selectors for Pythonistas
I 2007: html5lib: Parse web pages like a browser
I 2008: BeautifulSoup 3.1.0, the end of an era
I 2010: html5lib deprecates BeautifulSoup
I “cannot correctly represent any HTML 5 tree (for lack ofnamespace support), and cannot represent at all anycontaining MathML or SVG”
Recent history
I 2007: lxml.html improved, publicized by Ian Bicking
I CSS selectors for Pythonistas
I 2007: html5lib: Parse web pages like a browser
I 2008: BeautifulSoup 3.1.0, the end of an era
I 2010: html5lib deprecates BeautifulSoup
I “cannot correctly represent any HTML 5 tree (for lack ofnamespace support), and cannot represent at all anycontaining MathML or SVG”
Recent history
I 2007: lxml.html improved, publicized by Ian Bicking
I CSS selectors for Pythonistas
I 2007: html5lib: Parse web pages like a browser
I 2008: BeautifulSoup 3.1.0, the end of an era
I 2010: html5lib deprecates BeautifulSoup
I “cannot correctly represent any HTML 5 tree (for lack ofnamespace support), and cannot represent at all anycontaining MathML or SVG”
Recent history
I 2007: lxml.html improved, publicized by Ian Bicking
I CSS selectors for Pythonistas
I 2007: html5lib: Parse web pages like a browser
I 2008: BeautifulSoup 3.1.0, the end of an era
I 2010: html5lib deprecates BeautifulSoup
I “cannot correctly represent any HTML 5 tree (for lack ofnamespace support), and cannot represent at all anycontaining MathML or SVG”
Recent history
I 2007: lxml.html improved, publicized by Ian Bicking
I CSS selectors for Pythonistas
I 2007: html5lib: Parse web pages like a browser
I 2008: BeautifulSoup 3.1.0, the end of an era
I 2010: html5lib deprecates BeautifulSoup
I “cannot correctly represent any HTML 5 tree (for lack ofnamespace support), and cannot represent at all anycontaining MathML or SVG”
Recent history
I 2007: lxml.html improved, publicized by Ian Bicking
I CSS selectors for Pythonistas
I 2007: html5lib: Parse web pages like a browser
I 2008: BeautifulSoup 3.1.0, the end of an era
I 2010: html5lib deprecates BeautifulSoup
I “cannot correctly represent any HTML 5 tree (for lack ofnamespace support), and cannot represent at all anycontaining MathML or SVG”
Recent history
I 2007: lxml.html improved, publicized by Ian Bicking
I CSS selectors for Pythonistas
I 2007: html5lib: Parse web pages like a browser
I 2008: BeautifulSoup 3.1.0, the end of an era
I 2010: html5lib deprecates BeautifulSoup
I “cannot correctly represent any HTML 5 tree (for lack ofnamespace support), and cannot represent at all anycontaining MathML or SVG”
Searching tag trees
I BeautifulSoup API(examples/tree-builders/beautifulsoup/search.py)
I html5lib creates BeautifulSoup objects (or others)(examples/tree-builders/html5lib/search.py)
I lxml provides XPath(examples/tree-builders/lxml/search xpath.py)
I “minimal stable XPath”
I lxml provides CSSSelect(examples/tree-builders/lxml/search css.py)
Searching tag trees
I BeautifulSoup API(examples/tree-builders/beautifulsoup/search.py)
I html5lib creates BeautifulSoup objects (or others)(examples/tree-builders/html5lib/search.py)
I lxml provides XPath(examples/tree-builders/lxml/search xpath.py)
I “minimal stable XPath”
I lxml provides CSSSelect(examples/tree-builders/lxml/search css.py)
Searching tag trees
I BeautifulSoup API(examples/tree-builders/beautifulsoup/search.py)
I html5lib creates BeautifulSoup objects (or others)(examples/tree-builders/html5lib/search.py)
I lxml provides XPath(examples/tree-builders/lxml/search xpath.py)
I “minimal stable XPath”
I lxml provides CSSSelect(examples/tree-builders/lxml/search css.py)
Searching tag trees
I BeautifulSoup API(examples/tree-builders/beautifulsoup/search.py)
I html5lib creates BeautifulSoup objects (or others)(examples/tree-builders/html5lib/search.py)
I lxml provides XPath(examples/tree-builders/lxml/search xpath.py)
I “minimal stable XPath”
I lxml provides CSSSelect(examples/tree-builders/lxml/search css.py)
Searching tag trees
I BeautifulSoup API(examples/tree-builders/beautifulsoup/search.py)
I html5lib creates BeautifulSoup objects (or others)(examples/tree-builders/html5lib/search.py)
I lxml provides XPath(examples/tree-builders/lxml/search xpath.py)
I “minimal stable XPath”
I lxml provides CSSSelect(examples/tree-builders/lxml/search css.py)
Searching tag trees
I BeautifulSoup API(examples/tree-builders/beautifulsoup/search.py)
I html5lib creates BeautifulSoup objects (or others)(examples/tree-builders/html5lib/search.py)
I lxml provides XPath(examples/tree-builders/lxml/search xpath.py)
I “minimal stable XPath”
I lxml provides CSSSelect(examples/tree-builders/lxml/search css.py)
Interacting with the web
Basic Yahoo! search (hard-coded)
examples/search/yahoo.py
Basic Google! search (hard-coded)
examples/search/google.py
I Great code, but broken due to ?
Basic Google! search (hard-coded)
examples/search/google.py
I Great code, but broken due to ?
Something’s wrong...
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
The web: HTTP and you
A network trace of an HTTP conversation
User-Agent, and other headers the client sends
Status codes
I 2xx: Success
I 3xx: Redirection
I 4xx: Error
I 402: Payment Required
I 404 Not Found
I 410 Gone
I 418 I’m a teapot
Status codes
I 2xx: Success
I 3xx: Redirection
I 4xx: Error
I 402: Payment Required
I 404 Not Found
I 410 Gone
I 418 I’m a teapot
Status codes
I 2xx: Success
I 3xx: Redirection
I 4xx: Error
I 402: Payment Required
I 404 Not Found
I 410 Gone
I 418 I’m a teapot
Status codes
I 2xx: Success
I 3xx: Redirection
I 4xx: Error
I 402: Payment Required
I 404 Not Found
I 410 Gone
I 418 I’m a teapot
Status codes
I 2xx: Success
I 3xx: Redirection
I 4xx: Error
I 402: Payment Required
I 404 Not Found
I 410 Gone
I 418 I’m a teapot
Status codes
I 2xx: Success
I 3xx: Redirection
I 4xx: Error
I 402: Payment Required
I 404 Not Found
I 410 Gone
I 418 I’m a teapot
Status codes
I 2xx: Success
I 3xx: Redirection
I 4xx: Error
I 402: Payment Required
I 404 Not Found
I 410 Gone
I 418 I’m a teapot
Status codes
I 2xx: Success
I 3xx: Redirection
I 4xx: Error
I 402: Payment Required
I 404 Not Found
I 410 Gone
I 418 I’m a teapot
HTTP methods
I GET
I POST
I PUT
I BREW
HTTP methods
I GET
I POST
I PUT
I BREW
HTTP methods
I GET
I POST
I PUT
I BREW
HTTP methods
I GET
I POST
I PUT
I BREW
HTTP methods
I GET
I POST
I PUT
I BREW
Once we set User-Agent, are we just like Firefox?
I JavaScript behavior
I Image download behavior
I Cookie behavior
I Invalid HTML handling behavior (?)
I Accept: headers
Once we set User-Agent, are we just like Firefox?
I JavaScript behavior
I Image download behavior
I Cookie behavior
I Invalid HTML handling behavior (?)
I Accept: headers
Once we set User-Agent, are we just like Firefox?
I JavaScript behavior
I Image download behavior
I Cookie behavior
I Invalid HTML handling behavior (?)
I Accept: headers
Once we set User-Agent, are we just like Firefox?
I JavaScript behavior
I Image download behavior
I Cookie behavior
I Invalid HTML handling behavior (?)
I Accept: headers
Once we set User-Agent, are we just like Firefox?
I JavaScript behavior
I Image download behavior
I Cookie behavior
I Invalid HTML handling behavior (?)
I Accept: headers
Once we set User-Agent, are we just like Firefox?
I JavaScript behavior
I Image download behavior
I Cookie behavior
I Invalid HTML handling behavior (?)
I Accept: headers
What if we settle for approximate emulation?
Re-do of Google search with a cooked user-agent
examples/search/urllib2-user-agent/google as ie.py
Favorite User-Agent headers
I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)
I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;(compatible;))
I I can’t believe it’s not Googlebot/2.1
Favorite User-Agent headers
I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)
I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;(compatible;))
I I can’t believe it’s not Googlebot/2.1
Favorite User-Agent headers
I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)
I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;(compatible;))
I I can’t believe it’s not Googlebot/2.1
Favorite User-Agent headers
I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)
I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;(compatible;))
I I can’t believe it’s not Googlebot/2.1
HTTP: State via cookies
I HTTP implements state on top of TCP
HTTP: State via cookies
I HTTP implements state on top of TCP
robots.txt
I User-agent: *
I Disallow: /
I Allow: /crawlme.html
I http://www.robotstxt.org/
robots.txt
I User-agent: *
I Disallow: /
I Allow: /crawlme.html
I http://www.robotstxt.org/
robots.txt
I User-agent: *
I Disallow: /
I Allow: /crawlme.html
I http://www.robotstxt.org/
robots.txt
I User-agent: *
I Disallow: /
I Allow: /crawlme.html
I http://www.robotstxt.org/
robots.txt
I User-agent: *
I Disallow: /
I Allow: /crawlme.html
I http://www.robotstxt.org/
robots.txt and detectability
I “How does the server know you’re a robot?”
I Well, if you GET /robots.txt...
robots.txt and detectability
I “How does the server know you’re a robot?”
I Well, if you GET /robots.txt...
robots.txt and detectability
I “How does the server know you’re a robot?”
I Well, if you GET /robots.txt...
Filling out more forms: POST and GET
(Be sure to pay attention to the clock; minute 90 is when snackbreak starts.)
POST: Cepstral Weather demo (by hand)
http://cepstral.com/cgi-bin/demos/weather
Note the URL we POST to
I from FireBug
Note the URL we POST to
I from FireBug
Note the data we POST
I from FireBug
Note the data we POST
I from FireBug
Write simple Python that also POSTs
examples/cepstral/just post.py
Pull out the .wav file and play it with mplayer
examples/cepstral/play wav.py
POST: Cepstral weather demo (via mechanize)
examples/cepstral/just post via mechanize.py
Basic Yahoo! search (via mechanize)
examples/search/yahoo mechanize.py
I Great code, but broken due to robots.txt
Basic Yahoo! search (via mechanize)
examples/search/yahoo mechanize.py
I Great code, but broken due to robots.txt
Basic Yahoo! search (via mechanize, handle robots=False)
examples/search/yahoo mechanize norobots.py
Basic Google! search (via mechanize,handle robots=False, changeuser-agent)
examples/search/google mechanize.py
Cookies
emusic: Log in and verify that we logged in successfully(with cookielib)(optional)
examples/cookies/emusic login byhand.py
emusic: Log in and verify that we logged in successfully(with mechanize)
examples/cookies/emusic login mechanize.py
emusic: Check how many downloads we have left (withmechanize)
examples/cookies/emusic check downloads.py
Now we’re done, right?
Whew.
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
Recap and philosophy
Recap
We’ve seen:
I Loading web pages from the network with urllib2
I Parsing web pages (even broken ones)
I Scraping that page into a set of structured Python objects
I HTTP status codes
I Faking the user agent header
I Submitting forms
I Keeping a session with cookies
Recap
We’ve seen:
I Loading web pages from the network with urllib2
I Parsing web pages (even broken ones)
I Scraping that page into a set of structured Python objects
I HTTP status codes
I Faking the user agent header
I Submitting forms
I Keeping a session with cookies
Recap
We’ve seen:
I Loading web pages from the network with urllib2
I Parsing web pages (even broken ones)
I Scraping that page into a set of structured Python objects
I HTTP status codes
I Faking the user agent header
I Submitting forms
I Keeping a session with cookies
Recap
We’ve seen:
I Loading web pages from the network with urllib2
I Parsing web pages (even broken ones)
I Scraping that page into a set of structured Python objects
I HTTP status codes
I Faking the user agent header
I Submitting forms
I Keeping a session with cookies
Recap
We’ve seen:
I Loading web pages from the network with urllib2
I Parsing web pages (even broken ones)
I Scraping that page into a set of structured Python objects
I HTTP status codes
I Faking the user agent header
I Submitting forms
I Keeping a session with cookies
Recap
We’ve seen:
I Loading web pages from the network with urllib2
I Parsing web pages (even broken ones)
I Scraping that page into a set of structured Python objects
I HTTP status codes
I Faking the user agent header
I Submitting forms
I Keeping a session with cookies
Recap
We’ve seen:
I Loading web pages from the network with urllib2
I Parsing web pages (even broken ones)
I Scraping that page into a set of structured Python objects
I HTTP status codes
I Faking the user agent header
I Submitting forms
I Keeping a session with cookies
Recap
We’ve seen:
I Loading web pages from the network with urllib2
I Parsing web pages (even broken ones)
I Scraping that page into a set of structured Python objects
I HTTP status codes
I Faking the user agent header
I Submitting forms
I Keeping a session with cookies
“Play nice” on the web
I Ignore Terms of Service at your own peril
I robots.txt
I DO NOT BECOME AN EVIL COMMENT SPAMMER
“Play nice” on the web
I Ignore Terms of Service at your own peril
I robots.txt
I DO NOT BECOME AN EVIL COMMENT SPAMMER
“Play nice” on the web
I Ignore Terms of Service at your own peril
I robots.txt
I DO NOT BECOME AN EVIL COMMENT SPAMMER
“Play nice” on the web
I Ignore Terms of Service at your own peril
I robots.txt
I DO NOT BECOME AN EVIL COMMENT SPAMMER
Why scrape the web?
I Anger
I Interoperation with unmaintained systems
I “Rogue interoperability”
Why scrape the web?
I Anger
I Interoperation with unmaintained systems
I “Rogue interoperability”
Why scrape the web?
I Anger
I Interoperation with unmaintained systems
I “Rogue interoperability”
Why scrape the web?
I Anger
I Interoperation with unmaintained systems
I “Rogue interoperability”
Web APIs
Facebook uses standards!
I XMPP chat doesn’t support:
I support grouping contactsI status messagesI large profile imagesI notifications
I What’s the point?
Facebook uses standards!
I XMPP chat doesn’t support:
I support grouping contactsI status messagesI large profile imagesI notifications
I What’s the point?
Facebook uses standards!
I XMPP chat doesn’t support:
I support grouping contacts
I status messagesI large profile imagesI notifications
I What’s the point?
Facebook uses standards!
I XMPP chat doesn’t support:
I support grouping contactsI status messages
I large profile imagesI notifications
I What’s the point?
Facebook uses standards!
I XMPP chat doesn’t support:
I support grouping contactsI status messagesI large profile images
I notifications
I What’s the point?
Facebook uses standards!
I XMPP chat doesn’t support:
I support grouping contactsI status messagesI large profile imagesI notifications
I What’s the point?
Facebook uses standards!
I XMPP chat doesn’t support:
I support grouping contactsI status messagesI large profile imagesI notifications
I What’s the point?
“Sorry”
I Ohloh: “Sorry, it is not currently possible to get the list ofcommits through the API.”
I Flickr: No way to get a user avatar via the API.
I API keys are evidence of submission.
I Where is the love?
I Why even play this game?
“Sorry”
I Ohloh: “Sorry, it is not currently possible to get the list ofcommits through the API.”
I Flickr: No way to get a user avatar via the API.
I API keys are evidence of submission.
I Where is the love?
I Why even play this game?
“Sorry”
I Ohloh: “Sorry, it is not currently possible to get the list ofcommits through the API.”
I Flickr: No way to get a user avatar via the API.
I API keys are evidence of submission.
I Where is the love?
I Why even play this game?
“Sorry”
I Ohloh: “Sorry, it is not currently possible to get the list ofcommits through the API.”
I Flickr: No way to get a user avatar via the API.
I API keys are evidence of submission.
I Where is the love?
I Why even play this game?
“Sorry”
I Ohloh: “Sorry, it is not currently possible to get the list ofcommits through the API.”
I Flickr: No way to get a user avatar via the API.
I API keys are evidence of submission.
I Where is the love?
I Why even play this game?
“Sorry”
I Ohloh: “Sorry, it is not currently possible to get the list ofcommits through the API.”
I Flickr: No way to get a user avatar via the API.
I API keys are evidence of submission.
I Where is the love?
I Why even play this game?
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
Parser redux
Choosing a parser
I Performance
I Ease-of-use
I Quality
I Especially as relates to cleaning broken HTMLI HTML: 1998-style, or 2003-style?
Choosing a parser
I Performance
I Ease-of-use
I Quality
I Especially as relates to cleaning broken HTMLI HTML: 1998-style, or 2003-style?
Choosing a parser
I Performance
I Ease-of-use
I Quality
I Especially as relates to cleaning broken HTMLI HTML: 1998-style, or 2003-style?
Choosing a parser
I Performance
I Ease-of-use
I Quality
I Especially as relates to cleaning broken HTMLI HTML: 1998-style, or 2003-style?
Choosing a parser
I Performance
I Ease-of-use
I Quality
I Especially as relates to cleaning broken HTML
I HTML: 1998-style, or 2003-style?
Choosing a parser
I Performance
I Ease-of-use
I Quality
I Especially as relates to cleaning broken HTMLI HTML: 1998-style, or 2003-style?
Benchmarks by Ian Bicking
I Benchmarks run by me this morning
I same results as Ian
Benchmarks by Ian BickingI Benchmarks run by me this morning
I same results as Ian
Benchmarks by Ian BickingI Benchmarks run by me this morning
I same results as Ian
Ease of use
Tree fixups
I lxml ≈ BeautifulSoup
I lxml ≈ html5lib
I BeautifulSoup 3.0.7 > BeautifulSoup 3.1.0
Tree fixups
I lxml ≈ BeautifulSoup
I lxml ≈ html5lib
I BeautifulSoup 3.0.7 > BeautifulSoup 3.1.0
Tree fixups
I lxml ≈ BeautifulSoup
I lxml ≈ html5lib
I BeautifulSoup 3.0.7 > BeautifulSoup 3.1.0
Tree fixups
I lxml ≈ BeautifulSoup
I lxml ≈ html5lib
I BeautifulSoup 3.0.7 > BeautifulSoup 3.1.0
A winner
I lxml!
I ...?
A winner
I lxml!
I ...?
A winner
I lxml!
I ...?
More about CSS selectors
I FireQuark
I http://www.imdb.com/title/tt0111161/
I h5:contains(“Release”)
I CSS...
More about CSS selectors
I FireQuark
I http://www.imdb.com/title/tt0111161/
I h5:contains(“Release”)
I CSS...
More about CSS selectors
I FireQuark
I http://www.imdb.com/title/tt0111161/
I h5:contains(“Release”)
I CSS...
More about CSS selectors
I FireQuark
I http://www.imdb.com/title/tt0111161/
I h5:contains(“Release”)
I CSS...
More about CSS selectors
I FireQuark
I http://www.imdb.com/title/tt0111161/
I h5:contains(“Release”)
I CSS...
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
Countermeasures
Easy
Imagine a really stupid bot
Check Referer header
I mechanize solves this
Check Referer header
I mechanize solves this
Extra hidden form fields
I mechanize solves this
Extra hidden form fields
I mechanize solves this
Requiring cookies
I mechanize solves this
Requiring cookies
I mechanize solves this
Countermeasures: hard
Per-IP address query limits
Example: Yahoo web search API
I Use more IPs
I Tor, orI your own machines
I Use SOCKS (plus SSH) to make this easy
Per-IP address query limits
Example: Yahoo web search API
I Use more IPs
I Tor, orI your own machines
I Use SOCKS (plus SSH) to make this easy
Per-IP address query limits
Example: Yahoo web search API
I Use more IPs
I Tor, or
I your own machines
I Use SOCKS (plus SSH) to make this easy
Per-IP address query limits
Example: Yahoo web search API
I Use more IPs
I Tor, orI your own machines
I Use SOCKS (plus SSH) to make this easy
Per-IP address query limits
Example: Yahoo web search API
I Use more IPs
I Tor, orI your own machines
I Use SOCKS (plus SSH) to make this easy
CAPTCHAs
Example: Google web search (when you exceed undeclared querylimits).
I uh-oh
CAPTCHAs
Example: Google web search (when you exceed undeclared querylimits).
I uh-oh
JavaScript
Example: “Hash cash” system for avoiding comment spam.
I uh-oh
JavaScript
Example: “Hash cash” system for avoiding comment spam.
I uh-oh
Invisible countermeasures
Behavior profiling
I Time-based?
Behavior profiling
I Time-based?
Inserting false link visible only to bots
I “Tarpits”
Inserting false link visible only to bots
I “Tarpits”
robots.txt access
I As soon as you access it, you lose.
robots.txt access
I As soon as you access it, you lose.
Getting around IP address limits
Understand
I We still have to stay within the limits. We can just takeadvantage of IPs we do have.
Understand
I We still have to stay within the limits. We can just takeadvantage of IPs we do have.
ssh -D
I Borrow the IP of any machine you can log in to
I ssh -D 1080 asheesh.org
ssh -D
I Borrow the IP of any machine you can log in to
I ssh -D 1080 asheesh.org
ssh -D
I Borrow the IP of any machine you can log in to
I ssh -D 1080 asheesh.org
socks monkey
I SOCKSify Python from within Python
I examples/ip-limits/socks monkey.py
socks monkey
I SOCKSify Python from within Python
I examples/ip-limits/socks monkey.py
socks monkey
I SOCKSify Python from within Python
I examples/ip-limits/socks monkey.py
tsocks
I SOCKSify Python via LD PRELOAD
I examples/ip-limits/tsocks/
tsocks
I SOCKSify Python via LD PRELOAD
I examples/ip-limits/tsocks/
tsocks
I SOCKSify Python via LD PRELOAD
I examples/ip-limits/tsocks/
tor
“The onion router”
I SOCKSify but borrow someone else’s IP
I (play nice...)
tor
“The onion router”
I SOCKSify but borrow someone else’s IP
I (play nice...)
tor
“The onion router”
I SOCKSify but borrow someone else’s IP
I (play nice...)
Cycling strategies
I Drain it dry
I easy to implement first
I Round-robin
I generally preferable
Cycling strategies
I Drain it dry
I easy to implement first
I Round-robin
I generally preferable
Cycling strategies
I Drain it dry
I easy to implement first
I Round-robin
I generally preferable
Cycling strategies
I Drain it dry
I easy to implement first
I Round-robin
I generally preferable
Cycling strategies
I Drain it dry
I easy to implement first
I Round-robin
I generally preferable
Return to JavaScript: breaking Hash Cash
Detecting its presence
I Attempt to submit a comment with JS disabled
I Attempt to submit a comment with JS enabled
I Trace the second in FireBug
Detecting its presence
I Attempt to submit a comment with JS disabled
I Attempt to submit a comment with JS enabled
I Trace the second in FireBug
Detecting its presence
I Attempt to submit a comment with JS disabled
I Attempt to submit a comment with JS enabled
I Trace the second in FireBug
Detecting its presence
I Attempt to submit a comment with JS disabled
I Attempt to submit a comment with JS enabled
I Trace the second in FireBug
Rewriting the JavaScript as Python
I You may think I’m joking, but this is a common strategy.
Rewriting the JavaScript as Python
I You may think I’m joking, but this is a common strategy.
DOMForm
I Good news
“DOMForm is a Python module for web scraping and web testing.It knows how to evaluate embedded JavaScript code in response toappropriate events.”– John J. Lee of mechanize
I Bad news
“This module is unmaintained. Maybe someday...”Also, it does not execute page-global JavaScript, which is whereHashCash is implemented.
DOMForm
I Good news
“DOMForm is a Python module for web scraping and web testing.It knows how to evaluate embedded JavaScript code in response toappropriate events.”– John J. Lee of mechanize
I Bad news
“This module is unmaintained. Maybe someday...”Also, it does not execute page-global JavaScript, which is whereHashCash is implemented.
DOMForm
I Good news
“DOMForm is a Python module for web scraping and web testing.It knows how to evaluate embedded JavaScript code in response toappropriate events.”– John J. Lee of mechanize
I Bad news
“This module is unmaintained. Maybe someday...”Also, it does not execute page-global JavaScript, which is whereHashCash is implemented.
python-spidermonkey
I Good news
I “Python/JavaScript bridge module, making use of Mozilla’sspidermonkey JavaScript implementation.”
I Bad news
I ...do you really want to parse the web page for JavaScript andexecute it?
I examples/javascript/hashcash.py
python-spidermonkey
I Good news
I “Python/JavaScript bridge module, making use of Mozilla’sspidermonkey JavaScript implementation.”
I Bad news
I ...do you really want to parse the web page for JavaScript andexecute it?
I examples/javascript/hashcash.py
python-spidermonkey
I Good news
I “Python/JavaScript bridge module, making use of Mozilla’sspidermonkey JavaScript implementation.”
I Bad news
I ...do you really want to parse the web page for JavaScript andexecute it?
I examples/javascript/hashcash.py
python-spidermonkey
I Good news
I “Python/JavaScript bridge module, making use of Mozilla’sspidermonkey JavaScript implementation.”
I Bad news
I ...do you really want to parse the web page for JavaScript andexecute it?
I examples/javascript/hashcash.py
python-spidermonkey
I Good news
I “Python/JavaScript bridge module, making use of Mozilla’sspidermonkey JavaScript implementation.”
I Bad news
I ...do you really want to parse the web page for JavaScript andexecute it?
I examples/javascript/hashcash.py
python-spidermonkey
I Good news
I “Python/JavaScript bridge module, making use of Mozilla’sspidermonkey JavaScript implementation.”
I Bad news
I ...do you really want to parse the web page for JavaScript andexecute it?
I examples/javascript/hashcash.py
Ick
I None of this is as clean and automated as mechanize.
Ick
I None of this is as clean and automated as mechanize.
“Breaking” CAPTCHAs
Fallback: yourself
I Can always just prompt the operator to figure it out and enterit
Fallback: yourself
I Can always just prompt the operator to figure it out and enterit
Mailinator: “Enter these words to delete the email”
I Only so many different images
I So build a look-up table
I ...indexed by URL?
I ...indexed by image contents?
I ...indexed by fuzzy image contents?
(I don’t have a good tool for the last one.)
Mailinator: “Enter these words to delete the email”
I Only so many different images
I So build a look-up table
I ...indexed by URL?
I ...indexed by image contents?
I ...indexed by fuzzy image contents?
(I don’t have a good tool for the last one.)
Mailinator: “Enter these words to delete the email”
I Only so many different images
I So build a look-up table
I ...indexed by URL?
I ...indexed by image contents?
I ...indexed by fuzzy image contents?
(I don’t have a good tool for the last one.)
Mailinator: “Enter these words to delete the email”
I Only so many different images
I So build a look-up table
I ...indexed by URL?
I ...indexed by image contents?
I ...indexed by fuzzy image contents?
(I don’t have a good tool for the last one.)
Mailinator: “Enter these words to delete the email”
I Only so many different images
I So build a look-up table
I ...indexed by URL?
I ...indexed by image contents?
I ...indexed by fuzzy image contents?
(I don’t have a good tool for the last one.)
Mailinator: “Enter these words to delete the email”
I Only so many different images
I So build a look-up table
I ...indexed by URL?
I ...indexed by image contents?
I ...indexed by fuzzy image contents?
(I don’t have a good tool for the last one.)
Audio captchas: “Simple” signal analysis
I Should be doable in pylab/matplotlib with fast Fouriertransforms
Audio captchas: “Simple” signal analysis
I Should be doable in pylab/matplotlib with fast Fouriertransforms
JavaScript CAPTCHAs (like reCAPTCHA)
I re-implement CAPTCHA-downloading logic in Python
I ...or execute the JavaScript with spidermonkey
JavaScript CAPTCHAs (like reCAPTCHA)
I re-implement CAPTCHA-downloading logic in Python
I ...or execute the JavaScript with spidermonkey
JavaScript CAPTCHAs (like reCAPTCHA)
I re-implement CAPTCHA-downloading logic in Python
I ...or execute the JavaScript with spidermonkey
...JDownloader
I “Again, our captcha team did a great job and implementedmany new captcha methods.”
...JDownloader
I “Again, our captcha team did a great job and implementedmany new captcha methods.”
The website from Hell: US PTO Public PAIR
http://portal.uspto.gov/external/portal/pair
Start with a CAPTCHA
Solve it and move on to...
I document.write()
Solve it and move on to...
I document.write()
The page is invisible.
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
Automating the web browser
Selenium Remote Control
examples/seleniumrc/start.py
Selenium IDE
I Our friend, XPath
I FireBug
Selenium IDE
I Our friend, XPath
I FireBug
Selenium IDE
I Our friend, XPath
I FireBug
Why don’t we just do this all the time?
I Firefox memory footprint
I Flexibility
Why don’t we just do this all the time?
I Firefox memory footprint
I Flexibility
Why don’t we just do this all the time?
I Firefox memory footprint
I Flexibility
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
Other tricks
Your parser may fail
Text encoding
I Look in the HTTP header!
I Try UTF-8!
I ...chardet, if you must
Text encoding
I Look in the HTTP header!
I Try UTF-8!
I ...chardet, if you must
Text encoding
I Look in the HTTP header!
I Try UTF-8!
I ...chardet, if you must
Text encoding
I Look in the HTTP header!
I Try UTF-8!
I ...chardet, if you must
Automatically reverse-engineer templates
I templatemaker by Adrian Holovaty
I everyblock templatemaker
Automatically reverse-engineer templates
I templatemaker by Adrian Holovaty
I everyblock templatemaker
Automatically reverse-engineer templates
I templatemaker by Adrian Holovaty
I everyblock templatemaker
table2dict
I Python bug tracker
table2dict
I Python bug tracker
Outline
Intro
Programming the web
Stats pop quiz
The web: Round one
The web: HTTP and you
Recap and philosophy
Parser redux
Countermeasures
Automating the web browser
Other tricks
Conclusions
Conclusions
Scaling and stability
I Choosing reliable queries from web pages
I Expanding to more IP addresses when necessary using SSH(and Python 2.6 multiprocessing for a plausible model of howto rotate SOCKS proxies)
I Tor (and other proxy considerations)
I registrar.py: was seven years stable...
Scaling and stability
I Choosing reliable queries from web pages
I Expanding to more IP addresses when necessary using SSH(and Python 2.6 multiprocessing for a plausible model of howto rotate SOCKS proxies)
I Tor (and other proxy considerations)
I registrar.py: was seven years stable...
Scaling and stability
I Choosing reliable queries from web pages
I Expanding to more IP addresses when necessary using SSH(and Python 2.6 multiprocessing for a plausible model of howto rotate SOCKS proxies)
I Tor (and other proxy considerations)
I registrar.py: was seven years stable...
Scaling and stability
I Choosing reliable queries from web pages
I Expanding to more IP addresses when necessary using SSH(and Python 2.6 multiprocessing for a plausible model of howto rotate SOCKS proxies)
I Tor (and other proxy considerations)
I registrar.py: was seven years stable...
Scaling and stability
I Choosing reliable queries from web pages
I Expanding to more IP addresses when necessary using SSH(and Python 2.6 multiprocessing for a plausible model of howto rotate SOCKS proxies)
I Tor (and other proxy considerations)
I registrar.py: was seven years stable...
Summary
I If it’s on a web page, you can scrape it out.
I “Now you have an API for everything.”
Summary
I If it’s on a web page, you can scrape it out.
I “Now you have an API for everything.”
Summary
I If it’s on a web page, you can scrape it out.
I “Now you have an API for everything.”
Future directions
I More automation
I Using cssselect everywhere, geez it’s cool
Future directions
I More automation
I Using cssselect everywhere, geez it’s cool
Future directions
I More automation
I Using cssselect everywhere, geez it’s cool
Bonus time
If we have time:
I Greasemonkey demo: scraping in the browser
I Audience-suggested scraping lab
I Workshopping on queries or regular expressions
Bonus time
If we have time:
I Greasemonkey demo: scraping in the browser
I Audience-suggested scraping lab
I Workshopping on queries or regular expressions
Bonus time
If we have time:
I Greasemonkey demo: scraping in the browser
I Audience-suggested scraping lab
I Workshopping on queries or regular expressions
Bonus time
If we have time:
I Greasemonkey demo: scraping in the browser
I Audience-suggested scraping lab
I Workshopping on queries or regular expressions