Art of the scrape!!!!
Show the internet who’s boss. Scrape it!
Before we begin!
You might want….
A computer!
Server space!
Processing!
What’s going on here?
Today we’re going to…
• Examine our data resources!
• Try some scraping!
• Try some pulling!
• Mess around with an API!
• Say hello to visualization!
Data? I hardly knew‐a!
• Data: Any discrete unit and its meta information
• Useful data: More than one record of data...but that second record can be in your head!
• Everything is numbers!
Internal Use
External Use
Tell me more of this data of which you speak!
Real-time
• Blogs
• Twitter feed
• News feeds…
• Etc.
Static data sets
• Gov’t census data
• EPA data
• National League salaries
• Etc.
Data is Powerful!
The act of measuring something solidifies its state.
Ahh, the power!!!
Data is misleading!
• Choosing one source over another
• Only portraying parts of the statistic
• Choosing a biased method of portrayal
Information Overload: don’t believe the hype
Flavors of data
• Indexed data ‐ documents, weblogs, images, videos, shopping articles, jobs ...
• Cartographic and geographic data - geolocation software, geovisualization
• News Aggregators ‐ Feeds, podcasts:
DATATYPE!
• Straight text
• CSV/ tab delimited
• XML/RSS/ATOM
• JSON
Would it fit in here? Then it’s data!
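To make the datatypes above a little more concrete, here is a minimal sketch of reading the same record from CSV, JSON and XML in PHP. The sample strings and variable names are made up for illustration; only the built-in functions (`str_getcsv`, `json_decode`, `SimpleXMLElement`) are real.

```php
<?php
// Hypothetical sample records, one per datatype from the slide.
$csv  = "city,temp\nNew York,71";
$json = '{"city": "New York", "temp": 71}';
$xml  = '<reading><city>New York</city><temp>71</temp></reading>';

// CSV: split into lines, then let str_getcsv() split each line into fields.
$rows = [];
foreach (explode("\n", $csv) as $line) {
    $rows[] = str_getcsv($line, ",", "\"", "\\");
}

// JSON: json_decode() with true as the second argument returns an associative array.
$fromJson = json_decode($json, true);

// XML: SimpleXMLElement exposes child tags as properties.
$fromXml = new SimpleXMLElement($xml);

// Same record, three ways in.
echo $rows[1][0] . ": " . $rows[1][1] . "\n";             // from CSV
echo $fromJson['city'] . ": " . $fromJson['temp'] . "\n"; // from JSON
echo $fromXml->city . ": " . $fromXml->temp . "\n";       // from XML
```

Note that CSV and XML hand everything back as text, while JSON keeps the number as a number, which may matter once you start doing math on it.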
VIA
• Text file
• Data feed
• Scraping html
• API
• Some combination
Could it potentially be transferred by this? Then it’s grabbable!
DESTINATION
• Spreadsheet (by hand)
• Browser (direct, javascript, php, perl...)
• Database (via sql using php, perl, etc....)
• Application (Processing, java, python)
• A second API
Mom, where does data come from?
HTML for scraping: Anywhere you can see text online
• Weather.com
• Yahoo trending topics
Preformatted data sets: Anywhere it’s available
• Amazon data sets
• opendata.gov
Realtime rss feeds: Anywhere there’s a data feed
• Any blog feed
• Any news feed
Personalized Awesome targeted data: Anywhere with an API.
• New York Times API
• Twitter API
Choose wisely!
DATATYPE | VIA | DESTINATION
• xml/rss | browser | Excel
• csv | text file | php: database
• xml | api | php: browser
• xml | api | javascript: browser
• html | scraping | php: browser
• csv | text file | Processing
• html | scraping | Processing
• xml | browser | Processing
(through php)
Example 1 and 2
Datatype | VIA | Destination
HTML | SCRAPING | BROWSER
(Weather info) | (PHP) | (Firefox, or whatever)
• Step one: Get to know your data: http://www.weather.com/weather/today/New+York+NY+10010?lswe=10010
• Step two: Set up the code
Example 1: Straight scrapin’
<?php
// Get the data!
$url = 'http://www.weather.com/weather/today/New+York+NY+10010?lswe=10010';
$output = file_get_contents($url); // comes back false if the fetch fails
// Do something with it!
echo $output;
?>
Example 2: Scraping with a purpose
<?php
// Get everything ready
$currentTerm = NULL; //we'll use this to hold the words!
$myUrl = "http://www.google.com/trends/hottrends/atom/hourly";
$searchForStart = "sa=X\">";
$searchForEnd = "</a>";
// Get the data!
$rawPage = file_get_contents($myUrl);
// Do something with it!
echo "<B>These are this hour's trending topics on Google!</b><BR><BR>";
while (($startPos = strpos($rawPage, $searchForStart)) !== false) { //as long as there's more stuff to find, find it!
    $termStart = $startPos + strlen($searchForStart); //skip past the start marker itself
    $endPos = strpos($rawPage, $searchForEnd, $termStart); //And then find where it ends!
    if ($endPos === false) { break; } //no closing tag? Then we're done
    $currentTerm = substr($rawPage, $termStart, $endPos - $termStart); //grab what sits between the markers
    echo $currentTerm . "<BR>";
    $rawPage = substr($rawPage, $endPos + strlen($searchForEnd)); //chop off what we've already read
} //end while
?>
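Since a live page can change its layout (or vanish) at any time, it may help to try the marker-scanning idea on a small inline string first. The sketch below wraps the same start-marker/end-marker search in a helper; `extractBetween` and the sample markup are hypothetical names and data invented for this illustration, not part of the original example.

```php
<?php
// Hypothetical helper: collect every piece of text found between two markers.
function extractBetween($haystack, $startMarker, $endMarker) {
    $found = [];
    $offset = 0;
    while (($startPos = strpos($haystack, $startMarker, $offset)) !== false) {
        $termStart = $startPos + strlen($startMarker); // first character after the start marker
        $endPos = strpos($haystack, $endMarker, $termStart);
        if ($endPos === false) {
            break; // a start marker with no end marker: stop looking
        }
        $found[] = substr($haystack, $termStart, $endPos - $termStart);
        $offset = $endPos + strlen($endMarker); // carry on after the end marker
    }
    return $found;
}

// Made-up sample markup, shaped like the links the scraper looks for.
$samplePage = '<a href="/search?q=world+cup&sa=X">world cup</a>'
            . '<a href="/search?q=solstice&sa=X">summer solstice</a>';

foreach (extractBetween($samplePage, "sa=X\">", "</a>") as $term) {
    echo $term . "<BR>";
}
```

Moving an offset forward instead of chopping the page down each time gives the same result; it just avoids copying the string on every pass.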
Example 3
Datatype | VIA | Destination
XML | RSS FEED | BROWSER
(Huffington Post) | (PHP) | (Firefox, or whatever)
• Step one: Get to know your data: http://feeds.huffingtonpost.com/huffingtonpost/raw_feed
• Step two: Set up the code
What’s this xml stuff?
<introductory tags>
<entry>
<title></title>
<id></id>
<published></published>
<updated>2010-06-19T15:50:45Z</updated>
<summary></summary>
<author>
<name></name>
<uri>http://www.huffingtonpost.com/anne-naylor/</uri>
</author>
<content></content>
</entry>
Example 3: XML makes things awesome
<?php
// Get the data!
$url = "http://feeds.huffingtonpost.com/huffingtonpost/raw_feed";
$data = file_get_contents($url);
// Get it in a form we can use
$xml = new SimpleXMLElement($data);
// Do something with it!
echo "<b>Here are the current popular posts from Huffington Post without the ads!</b><BR><BR><ul>";
foreach ($xml->entry as $item) { //navigate to the tag we want
    $myTitle = "unknown"; //initialize the variable so it's all set!
    $myTitle = trim((string) $item->title); //pull the title out as plain text
    echo "<LI>" . $myTitle . "<br>"; //now print it!
}//end foreach
echo "</ul>";
?>
Example 4
Datatype | VIA | Destination
XML data | FEED | API and BROWSER
(US Exchange rates) | (PHP) | (Google Charts API, Firefox, or whatever)
• Step one: Get to know your data: http://rss.timegenie.com/forex.xml
• Step two: Set up the code
APIs? Eh?
• Data – all the types of data we discussed before
• Functionality
– Data converters: language translators, speech processing, URL shorteners
– Communication: email, IM, notifications
– Visual data rendering: information visualization, diagrams, maps
– Security related: electronic payment systems, ID identification...
Example 4: Doing the two-step
Get the data!
Run it through a second Process
Get it in a form we can use
Do something with it (like displaying that baby!)
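The slide lists the steps without code, so here is a rough sketch of what the two-step might look like in PHP. Everything specific in it is an assumption: the inline XML stands in for the exchange-rate feed (the real feed's tag names may differ), `buildChartUrl` is a hypothetical helper, and the chart address follows the URL-parameter style of Google's image chart service, which may no longer be served.

```php
<?php
// Get the data! (A made-up sample standing in for the feed; tag names are assumed.)
$data = '<forex>'
      . '<data><code>EUR</code><rate>0.81</rate></data>'
      . '<data><code>GBP</code><rate>0.67</rate></data>'
      . '<data><code>CAD</code><rate>1.03</rate></data>'
      . '</forex>';

// Get it in a form we can use
$xml = new SimpleXMLElement($data);
$codes = [];
$rates = [];
foreach ($xml->data as $item) {
    $codes[] = (string) $item->code;
    $rates[] = (float) $item->rate;
}

// Run it through a second process: describe a bar chart as a URL.
// Hypothetical helper; the parameter names follow the image chart URL format.
function buildChartUrl(array $labels, array $values) {
    $params = [
        'cht'  => 'bvs',                         // vertical bar chart
        'chs'  => '400x200',                     // width x height in pixels
        'chd'  => 't:' . implode(',', $values),  // the data series, as text
        'chds' => '0,' . max($values),           // scale the bars to the largest value
        'chxt' => 'x',                           // show an x axis
        'chxl' => '0:|' . implode('|', $labels), // label it with the currency codes
    ];
    return 'http://chart.apis.google.com/chart?' . http_build_query($params);
}

// Do something with it (like displaying that baby!)
$chartUrl = buildChartUrl($codes, $rates);
echo '<img src="' . htmlspecialchars($chartUrl) . '" alt="Exchange rates">';
```

The point of the sketch is the hand-off: the first source gives you numbers, and the second service turns a URL full of numbers into a picture.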
Bringing data into a higher-level application…like Processing!
• Install the simpleML library:
http://www.learningprocessing.com/tutorials/simpleml/
• Inspect your data for structure
• Write some code!
– Declare your XML intent!
– Make the request!
– Process the request!
– Do fun stuff with it!