Mining Social Web APIs with IPython Notebook (PyCon 2014)
![Page 1: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/1.jpg)
Mining Social Web APIs with IPython Notebook
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
Montréal - 9 April 2014
1
![Page 2: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/2.jpg)
Intro
2
![Page 3: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/3.jpg)
Hello, My Name Is ... Matthew
3
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
![Page 4: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/4.jpg)
4
The only easy day was yesterday
-- Motto of the U.S. Navy SEALs
![Page 5: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/5.jpg)
5
It pays to be a winner
-- Motto of the U.S. Navy SEALs
![Page 6: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/6.jpg)
Transforming Curiosity Into Insight
6
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding templates for data science experiments
Think of the book as "premium" support for the OSS project
![Page 7: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/7.jpg)
Table of Contents (1/2)
Chapter 1 - Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More
Chapter 2 - Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More
Chapter 3 - Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More
Chapter 4 - Mining Google+: Computing Document Similarity, Extracting Collocations, and More
Chapter 5 - Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More
Chapter 6 - Mining Mailboxes: Analyzing Who's Talking to Whom About What, How Often, and More
7
![Page 8: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/8.jpg)
Table of Contents (2/2)
Chapter 7 - Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More
Chapter 8 - Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More
Chapter 9 - Twitter Cookbook
Appendix A - Information About This Book's Virtual Machine Experience
Appendix B - OAuth Primer
Appendix C - Python and IPython Notebook Tips & Tricks
8
![Page 9: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/9.jpg)
Designed for Pedagogy
Brief Intro
Objectives
API Primer
Analysis Technique(s)
Data Visualization
Recap
Suggested Exercises
Recommended Resources
9
![Page 10: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/10.jpg)
The Social Web Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
10
![Page 11: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/11.jpg)
Overview
Intro (5 mins)
Module 1 - Virtual Machine Setup (10 mins)
Module 2 - Mining Twitter (45 mins)
Module 3 - Mining Facebook (30 mins)
BREAK (20 mins)
Module 4 - Mining LinkedIn (30 mins)
Module 5 - Choice: Open Hack (30 mins)
Module 6 - Privacy & Ethics (20 mins)
Module 7 - Final Q&A; Surveys (10 mins)
11
![Page 12: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/12.jpg)
Module Format
~10-15 minutes of exposition
I talk; you listen
~15 minutes of independent (or collaborative) work
You hack while I walk around and help you
~5 minutes of recap with Q&A
You ask; I try to answer
12
![Page 13: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/13.jpg)
Workshop Objective
To send you away as a social web hacker
Broad working knowledge of popular social web APIs
Hands-on experience hacking on social web data with a common toolkit
Not for me to talk to you for 3 straight hours
13
![Page 14: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/14.jpg)
Just a Few More Things
This workshop is...
An adaptation of Mining the Social Web, 2nd Edition
More of a guided hacking session where you follow along (vs a preso)
Wider than it is deep
There's only so much you can do in a few hours
I'm available 24/7 this week (and beyond) to help you be successful
14
![Page 15: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/15.jpg)
Assumptions
At some point in your life, you have
Programmed with Python
Worked with JSON
Made requests and processed responses to/from web servers
Or you want to learn to do these things now...
And you're a quick learner
15
![Page 16: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/16.jpg)
Module 1: Virtual Machine Setup
16
![Page 17: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/17.jpg)
Why do you need a VM?
17
To save time
Because installation and configuration management is harder than it first appears
So that you can focus on the task at hand instead
So that I can support you regardless of your hardware and operating system
![Page 18: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/18.jpg)
But I can do all of that myself...
True...
If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand
At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages
Including scientific computing tools that require underlying C/C++ code to be compiled
Which requires specific versions of developer libraries to be installed
You get the idea...
18
![Page 19: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/19.jpg)
The Virtual Machine Experience
Vagrant
A nice abstraction around virtual machine providers
One ring to rule them all
Virtualbox, VMWare, AWS, ...
IPython Notebook
The easiest way to program with Python
A better REPL (interpreter)
Great for hacking
19
![Page 20: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/20.jpg)
What happens when you vagrant up?
Vagrant follows the instructions in your Vagrantfile
Starts up a Virtualbox instance
Uses Chef to provision it
Installs OS patches/updates
Installs MTSW software dependencies
Starts IPython Notebook server on port 8888
20
![Page 21: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/21.jpg)
Why Should I Use IPython Notebook?
Because it's great for hacking
And hacking is usually the first step
Because it's great for collaboration
Sharing/publishing results is trivial
Because the UX is as easy as working in a notepad
Think of it as "executable paper"
21
![Page 22: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/22.jpg)
22
![Page 23: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/23.jpg)
23
![Page 24: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/24.jpg)
VM Quick Start Instructions
Go to http://MiningTheSocialWeb.com/quick-start/
Follow the instructions
And watch the screencasts!
Basically:
Install Virtualbox & Vagrant
Run "vagrant up" in a terminal to start a guest VM
Then, go to http://localhost:8888 on your host machine's web browser
24
![Page 25: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/25.jpg)
What Could Be Easier?
A hosted version of the VM!
But only for a few hours during this workshop
Because it costs money to run these servers
Go to [See Live Slides for URL] and pick a machine
Do not share the URLs outside of this workshop!
Please don't try to hack the machines
Learn how I arrived at this setup at http://MiningTheSocialWeb.com
25
![Page 26: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/26.jpg)
Module 2: Mining Twitter
26
![Page 27: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/27.jpg)
Objectives
27
Be able to identify Twitter primitives
Understand tweet metadata and how to use it
Learn how to extract entities such as user mentions, hashtags, and URLs from tweets
Apply techniques for performing frequency analysis with Python
Be able to plot histograms of Twitter data with IPython Notebook
![Page 28: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/28.jpg)
Twitter Primitives
28
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls
![Page 29: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/29.jpg)
API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"
Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining
Streaming API filters
JSON responses
Cursors (not quite pagination)
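As a rough illustration of "everything is a resource", the example GET request above can be assembled from a resource path and query parameters. This is only a sketch: `build_resource_url` is a hypothetical helper rather than part of any Twitter library, and a real request also needs OAuth credentials (see Appendix B), which are omitted here.

```python
from urllib.parse import urlencode

# Base URL taken from the example request on this slide.
API_BASE = "https://api.twitter.com/1.1"

def build_resource_url(resource, **params):
    """Return the GET URL for a resource path such as 'statuses/user_timeline'.

    Hypothetical helper for illustration; it does not sign or send the request.
    """
    url = f"{API_BASE}/{resource}.json"
    query = urlencode(sorted(params.items()))
    return f"{url}?{query}" if query else url

timeline_url = build_resource_url("statuses/user_timeline",
                                  screen_name="SocialWebMining")
print(timeline_url)
```

The same helper works for any other resource path, which is the practical payoff of a uniform resource naming scheme.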
29
![Page 30: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/30.jpg)
Twitter is an Interest Graph
30
[Diagram node labels: Roberto Mercedes, Jorge, Ana, Nina, Johnny Araya, Rodolfo Hernández]
![Page 31: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/31.jpg)
What's in a Tweet?
31
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.
![Page 32: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/32.jpg)
What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations
(financial) symbols
stock tickers
media
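A minimal sketch of pulling entities out of a tweet follows. The `sample_tweet` dictionary is made up and trimmed to the few fields the code reads; its layout is assumed to follow the "entities" field of a Twitter API v1.1 status object, and `extract_entities` is a name chosen here for illustration.

```python
# Hypothetical tweet, reduced to the fields this sketch uses.
sample_tweet = {
    "text": "Hacking with @SocialWebMining at #PyCon http://t.co/abc",
    "entities": {
        "user_mentions": [{"screen_name": "SocialWebMining"}],
        "hashtags": [{"text": "PyCon"}],
        "urls": [{"expanded_url": "http://MiningTheSocialWeb.com"}],
    },
}

def extract_entities(tweet):
    """Return (screen_names, hashtags, urls) for one tweet dictionary."""
    entities = tweet.get("entities", {})
    screen_names = [m["screen_name"] for m in entities.get("user_mentions", [])]
    hashtags = [h["text"] for h in entities.get("hashtags", [])]
    urls = [u["expanded_url"] for u in entities.get("urls", [])]
    return screen_names, hashtags, urls

screen_names, hashtags, urls = extract_entities(sample_tweet)
print(screen_names, hashtags, urls)
```

Because the entities arrive already parsed in the metadata, no regular expressions over the 140 characters are needed.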
32
![Page 33: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/33.jpg)
Data Mining = Curiosity + Stats
Curiosity
Interests, desires, and intuitions
Statistics
Counting
Comparing
Filtering
Ranking
Hypothesis testing; knowledge discovery
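Counting, filtering, and ranking can all be sketched with the standard library's `collections.Counter`. The hashtag list below is made-up data standing in for entities gathered from many tweets.

```python
from collections import Counter

# Made-up hashtags standing in for entities pulled from many tweets.
hashtags = ["pycon", "python", "pycon", "ipython", "python", "pycon"]

counts = Counter(hashtags)                                   # counting
top_two = counts.most_common(2)                              # ranking
repeated = {tag: n for tag, n in counts.items() if n > 1}    # filtering

print(top_two)
print(repeated)
```

Comparing is then a matter of setting two such counters side by side, for example the hashtags of two different search queries.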
33
![Page 34: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/34.jpg)
Histograms
A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
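To make the idea of bins concrete, here is a small sketch that groups values into fixed-width ranges and prints a text histogram. The retweet counts and the bin width of 10 are arbitrary assumptions; in the notebooks a plotting library would normally draw the chart instead.

```python
BIN_WIDTH = 10  # arbitrary bin size chosen for this sketch

def bin_counts(values, bin_width):
    """Map each bin's lower edge to the number of values that fall in it."""
    bins = {}
    for value in values:
        lower_edge = (value // bin_width) * bin_width
        bins[lower_edge] = bins.get(lower_edge, 0) + 1
    return dict(sorted(bins.items()))

# Made-up retweet counts for a handful of tweets.
retweet_counts = [0, 3, 7, 12, 15, 18, 25, 41, 44, 47]

histogram = bin_counts(retweet_counts, BIN_WIDTH)
for lower_edge, frequency in histogram.items():
    upper_edge = lower_edge + BIN_WIDTH - 1
    print(f"{lower_edge:>2}-{upper_edge:<2} {'#' * frequency}")
```

Note that empty bins (30-39 in this data) are simply absent from the dictionary; a plotted histogram would show them as gaps.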
34
![Page 35: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/35.jpg)
35
Example: Histogram of Retweets
![Page 36: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/36.jpg)
Social Media Analysis Framework
A memorable four step process to guide data science experiments:
Aspire
To test a hypothesis (answer a question)
Acquire
Get the data
Analyze
Count things
Summarize
Plot the results
36
![Page 37: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/37.jpg)
Exercises
Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
Fill in Example 1-1 with credentials and begin work
Execute each example sequentially
Customize queries
Explore tweet metadata; count tweet entities; plot histograms of results
Explore the "Chapter 9 (Twitter Cookbook)" notebook
Think of it as a collection of building blocks
37
![Page 38: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/38.jpg)
Module 3: Mining Facebook
38
![Page 39: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/39.jpg)
Objectives
39
Be able to identify Facebook primitives
Learn about Facebook’s Social Graph API and how to make API requests
Understand how Open Graph protocol extends Facebook's Social Graph API
Be able to analyze likes from Facebook pages and friends
![Page 40: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/40.jpg)
Facebook Primitives
Account Types: People & Pages
Mutual Connections
Likes
Shares
Comments
Extensive Privacy Controls
40
![Page 41: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/41.jpg)
API Requests
Social Graph API requests
Not RESTful but easy to learn and use
Special "field expansion" syntax
Example: GET http://graph.facebook.com/ptwobrussell/?fields=id,name,friends.fields(likes.limit(10))
JSON responses
Traditional pagination
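The field expansion syntax nests one connection inside another, which is easier to see when the string is built in pieces. The sketch below reconstructs the example URL from this slide; `expand` is a hypothetical helper, not part of any Facebook SDK, and a real request would also need an access token.

```python
GRAPH_BASE = "http://graph.facebook.com"  # as written on the slide

def expand(connection, subfields=None, limit=None):
    """Render one connection in field expansion syntax (illustrative helper)."""
    rendered = connection
    if subfields:
        rendered += ".fields(" + ",".join(subfields) + ")"
    if limit is not None:
        rendered += f".limit({limit})"
    return rendered

likes = expand("likes", limit=10)              # likes.limit(10)
friends = expand("friends", subfields=[likes])  # friends.fields(likes.limit(10))
fields = ",".join(["id", "name", friends])
url = f"{GRAPH_BASE}/ptwobrussell/?fields={fields}"
print(url)
```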
41
![Page 42: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/42.jpg)
Facebook is an Interest Graph
42
[Diagram node labels: Roberto Mercedes, Jorge, Ana, Nina, Johnny Araya, Rodolfo Hernández]
![Page 43: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/43.jpg)
Facebook API Explorer
43
Go to https://developers.facebook.com/tools/explorer
Really, go there right now...
![Page 44: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/44.jpg)
44
Retrieve Your Likes
![Page 45: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/45.jpg)
Facebook Permissions
45
![Page 46: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/46.jpg)
Facebook Permissions
46
![Page 47: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/47.jpg)
Explore Facebook Pages
47
Names of pages
MiningTheSocialWeb
CrossFit
OReilly
Web URLs (OGP extensions to Facebook's Social Graph)
http://www.imdb.com/title/tt0117500
![Page 48: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/48.jpg)
Social Media Analysis Framework
Recall the same four step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
48
![Page 49: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/49.jpg)
Social Network Diagram with D3
49
![Page 50: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/50.jpg)
Exercises
Copy/paste your access token from the Graph API Explorer into the "Chapter 2 (Mining Facebook)" notebook
Paste the value and execute the cell just before Example 2-1
Execute examples sequentially (try to at least make it to Example 2-10)
Analyze your likes, your friends and likes from pages of interest
If you have time...
Remaining examples
50
![Page 51: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/51.jpg)
Module 4: Mining LinkedIn
51
![Page 52: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/52.jpg)
Objectives
52
Learn about LinkedIn’s Developer Platform
Understand how clustering works
A fundamental type of machine learning
Be able to employ geocoding services to arrive at a set of coordinates from a textual reference to a location
Visualize geographic data with cartograms
![Page 53: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/53.jpg)
LinkedIn Primitives
Account Types: People, Companies
The data seems "more closely held" than Facebook or Twitter
No FOAF visibility
Richest data source
Profile descriptions from mutual connections
A little messier than it first appears
Not necessarily a bad thing
53
![Page 54: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/54.jpg)
API Requests
(Strangely) RESTful Requests
Not really RESTful
Field selector syntax
http://api.linkedin.com/v1/people/~:(first-name,last-name,headline,picture-url)
XML responses
CSV address book download
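The field selector syntax shown above is a comma-separated list of fields in parentheses after the resource. A small sketch of assembling it, where `people_url` is a hypothetical helper and authentication is left out:

```python
LINKEDIN_BASE = "http://api.linkedin.com/v1"  # as written on the slide

def people_url(fields, member="~"):
    """Build a people URL with a field selector; '~' means the current user."""
    return f"{LINKEDIN_BASE}/people/{member}:({','.join(fields)})"

profile_url = people_url(["first-name", "last-name", "headline", "picture-url"])
print(profile_url)
```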
54
![Page 55: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/55.jpg)
Is LinkedIn an Interest Graph?
Fundamentally: yes. But not so much at the developer API level
Less trivial to find some of the "pivots"
No Skills API (yet?)
But the data is there (mostly in profile descriptions) for your direct connections
Companies, job titles, job descriptions
Lots of richness is tucked away in human language data
55
![Page 56: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/56.jpg)
Clustering
An unsupervised machine learning technique
Think: an algorithm that organizes the data into partitions
56
![Page 57: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/57.jpg)
Example: Clustered Job Titles
57
![Page 58: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/58.jpg)
3 Steps to Clustering Your Data
Normalization
Compare (similarity/distance measurement)
n-grams, edit distance, and Jaccard are common, but your imagination is the limit
Why can't you just compare everything to everything?
Dimensionality Reduction
Ideally, your clustering algorithm will mitigate the pain
k-means is among the most common clustering techniques in use
58
![Page 59: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/59.jpg)
Jaccard Similarity
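Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to job titles after a simple normalization step (lowercasing and splitting into words); the titles, the `tokens` helper, and the choice to score two empty sets as 1.0 are assumptions made for illustration.

```python
def jaccard_similarity(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

def tokens(title):
    """Normalize a job title into a set of lowercase words."""
    return set(title.lower().replace(",", " ").split())

cto = tokens("Chief Technology Officer")
ceo = tokens("Chief Executive Officer")
similarity = jaccard_similarity(cto, ceo)
print(similarity)
```

The two titles share two of four distinct words, so they score 0.5; unrelated titles score 0.0 and identical ones 1.0.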
59
![Page 60: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/60.jpg)
k-Means Explained
1. Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K1, K2, ..., Kk.
2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating k clusters and requiring k*n comparisons.
3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and reassign its Ki value to be that value. (Hence, you’re computing “k-means” during each iteration of the algorithm.)
4. Repeat steps 2–3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence.
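The four steps above can be sketched in plain Python. To keep the result reproducible, this version takes the initial centroids as an argument instead of picking them at random (step 1); the sample points and starting centroids are made up, and the code is a teaching sketch rather than the implementation used in the notebooks.

```python
import math

def kmeans(points, initial_centroids, max_iterations=100):
    """Cluster points around k centroids; return (centroids, labels)."""
    centroids = list(initial_centroids)  # step 1 is supplied by the caller
    labels = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (k*n comparisons).
        new_labels = [
            min(range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]))
            for point in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, labels

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, [(0, 0), (10, 10)])
print(labels)
print(centroids)
```

With random initialization, different runs can converge to different clusterings, which is one reason to try several starting points in practice.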
60
![Page 61: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/61.jpg)
k-Means: Initialize
61
![Page 62: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/62.jpg)
k-Means: Step 1
62
![Page 63: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/63.jpg)
k-Means: Step 2
63
![Page 64: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/64.jpg)
k-Means: Step 3
64
![Page 65: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/65.jpg)
k-Means: (Fast-Forward) Step 9
65
![Page 66: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/66.jpg)
Geocoding
Transforming a location to a set of coordinates
Nashville, TN => (36.16783905029297, -86.77816009521484)
A harder problem than it first appears
The Bing API is especially generous
Requires an account sign up: http://bingmapsportal.com
Use the API key with the geopy package
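A hedged sketch of the geopy workflow follows. It assumes the geopy package is installed and that you have your own Bing Maps key; `make_bing_geocoder` and `to_coordinates` are names invented here, and the key string in the usage comment is a placeholder.

```python
def make_bing_geocoder(api_key):
    """Create a geopy Bing geocoder (requires the geopy package)."""
    # Imported lazily so the rest of the sketch loads without geopy installed.
    from geopy.geocoders import Bing
    return Bing(api_key)

def to_coordinates(geocoder, place):
    """Return (latitude, longitude) for a place name, or None if not found."""
    location = geocoder.geocode(place)
    if location is None:
        return None
    return (location.latitude, location.longitude)

# Usage (needs network access and a real key from the Bing Maps portal):
# geocoder = make_bing_geocoder("YOUR_BING_MAPS_KEY")
# print(to_coordinates(geocoder, "Nashville, TN"))
```

Keeping the geocoder as a parameter makes it easy to swap in another geopy provider or a stub for testing.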
66
![Page 67: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/67.jpg)
Introducing: The Dorling Cartogram
67
![Page 68: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/68.jpg)
Social Media Analysis Framework
Remember: Use the same four step process to guide data science experiments:
Aspire
Acquire
Analyze
Summarize
68
![Page 69: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/69.jpg)
Exercises
Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API connection and follow along with the first few examples
Download your connections as a CSV file from http://www.linkedin.com/people/export-settings and save them to your VM
A deviation from instructions in Example 3-6 is necessary for remote VMs
See http://bit.ly/mtsw-ch03-helper-code
Create a Bing Maps portal account and get your API key for Examples 3-8 and beyond
Try clustering your contacts in Example 3-12
Try Example 3-13 (visualizing data in Google Earth) at home...
69
![Page 70: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/70.jpg)
Module 5: Choice
70
![Page 71: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/71.jpg)
Objectives
71
To work on "loose ends" or areas of interest from previous modules
To hack on code in notebooks not yet encountered
To set up the virtual machine on your own box if you haven't yet
To collaborate/talk and otherwise make the most of our togetherness
![Page 72: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/72.jpg)
Social Media Analysis Framework
Remember:
Aspire
Acquire
Analyze
Summarize
72
![Page 73: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/73.jpg)
Recommendations
Set up your own development environment if you haven't already
Appendix A
Text Mining & Natural Language Processing
Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages)
Graph Mining
Chapter 7 (Mining GitHub)
Analyzing Semantic Markup
Chapter 8 (Mining the Semantically Marked-Up Web)
73
![Page 74: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/74.jpg)
Module 6: Privacy & Ethics
74
![Page 75: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/75.jpg)
75
Know thy data, and know thyself
--Matthew A. Russell
![Page 76: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/76.jpg)
76
If we have data, let’s look at data. If we have opinions, let’s go with mine
--Jim Barksdale
![Page 77: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/77.jpg)
77
In God we trust. All others must bring data
--W. Edwards Deming
![Page 78: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/78.jpg)
Communication => Data
Communication
Senders
humans & machines
Messages
natural language, images, videos, etc.
Recipients
humans & machines
78
![Page 79: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/79.jpg)
Data Alchemy
Data: Documents & document fragments (text messages, etc.)
Information: "Assertions", summaries, tags, etc.
Knowledge: Aggregated, queryable information
Wisdom: “Compressed” knowledge
Gold: Money
79
![Page 80: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/80.jpg)
Machine Learning
80
A program that learns (improves) from experience (data) according to some objective
Supervised learning
Unsupervised learning
Reinforcement learning
How to do it
Program mathematical models and hope for the best...
How to do it well
Program state-of-the-art mathematical models with sufficient representative data
![Page 81: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/81.jpg)
81
Knowledge is a process of piling up facts; wisdom lies in their simplification
--Martin Fischer
![Page 82: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/82.jpg)
82
Any sufficiently advanced technology is indistinguishable from magic
--Arthur C. Clarke
![Page 83: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/83.jpg)
Is Privacy Already an Illusion?
83
Digital happenings circa 2014
The Cloud
Social Media
Deep Learning
The Internet of Things
Internet.org
![Page 84: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/84.jpg)
84
Civilization is the progress toward a society of privacy...
-- Ayn Rand
![Page 85: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/85.jpg)
85
If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place.
-- Eric Schmidt, (former) CEO of Google
![Page 86: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/86.jpg)
Influences on Ethics
Capitalism, economics, & marketing
A for-profit corporation's fiduciary duty: To maximize the common stock's value
How to do it? By transacting commerce
How to do it well? By advertising more effectively than competitors
How to do it really well? With highly relevant personalized ads (recommenders)
Terms of Service (ToS) - The legal extent of ethical obligations?
86
![Page 87: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/87.jpg)
Module 7: Final Q&A; Survey
87
Survey Link:
https://www.surveymonkey.com/s/pycon2014_tutorials
![Page 88: Mining Social Web APIs with IPython Notebook (PyCon 2014)](https://reader033.fdocuments.net/reader033/viewer/2022042623/53fde7cc8d7f72a81c8b4bbb/html5/thumbnails/88.jpg)
Free Stuff
http://MiningTheSocialWeb.com
Mining the Social Web 2E Chapter 1 (Chimera)
http://bit.ly/13XgNWR
Source Code (GitHub)
http://bit.ly/MiningTheSocialWeb2E
http://bit.ly/1fVf5ej (numbered examples)
Screencasts (Vimeo)
http://bit.ly/mtsw2e-screencasts
88