Harvesting, Processing and Visualising “Jožef Stefan” Institute, … · 2017-07-28 · API...
Harvesting, Processing and Visualising Geo-Encoded Data from Social Media
Nikola Ljubešić
Department of Knowledge Technologies
“Jožef Stefan” Institute, Ljubljana
Aims of this tutorial
1. Understand APIs
2. Learn how to harvest data via APIs (hands-on)
3. Process harvested content (hands-on)
4. Perform downstream analyses, inferences and visualisations
Data harvesting from social media
All social media sites have open APIs
APIs (application programming interfaces) - interfaces through which various processes / programs interact
● Twitter ↔ mobile app (Instagram) used for posting on Twitter
● Twitter ↔ your program for data harvesting (TweetCat)
API examples
Twitter API
● Search API - enables your program to query for specific keywords
● Streaming API - sends part of the currently published content to your program
Facebook Graph API
● Enables updating or reading the “social graph” via object IDs
  ○ Nodes - User, Photo, Page, Post, Comment
  ○ Edges - connections between nodes (Post and Comment)
  ○ Fields - attributes of nodes, such as User’s birthday
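As a rough illustration of how nodes, edges and fields map onto a request, the Python 3 sketch below only builds Graph API URLs with the standard library and sends nothing. The node ID, field names and token are placeholders, and the unversioned `graph.facebook.com` root is an assumption; real requests usually carry an API version and a valid access token.

```python
from urllib.parse import urlencode

GRAPH_ROOT = "https://graph.facebook.com"  # assumption: unversioned root


def graph_url(node_id, edge=None, fields=None, access_token="PLACEHOLDER_TOKEN"):
    """Build a read URL for a node, optionally following one of its edges."""
    path = f"{GRAPH_ROOT}/{node_id}"
    if edge:
        path += f"/{edge}"          # edge: connection from this node to others
    params = {}
    if fields:
        params["fields"] = ",".join(fields)  # fields: attributes of the node
    params["access_token"] = access_token
    return f"{path}?{urlencode(params)}"


# Hypothetical Post node with ID 1234567890
post_url = graph_url("1234567890", fields=["message", "created_time"])
comments_url = graph_url("1234567890", edge="comments")
print(post_url)
print(comments_url)
```

The same helper covers both cases on the slide: reading a node's fields, and following an edge such as the comments attached to a post.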
Using APIs
Communication
1. Via HTTP with cURL (command line tool), urllib (Python library)
2. Libraries that wrap the HTTP communication
  ○ tweepy for Python and the Twitter API
  ○ facebook-sdk for Python and the Facebook Graph API
3. Tools that perform specific tasks
  ○ TweetCat for gathering tweets written in low-frequency languages or published at specific locations
Authentication
● Each API requires you to authenticate with a series of tokens
● Those tokens can be obtained from the social media platform itself (for Twitter: https://apps.twitter.com)
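The first communication option, plain HTTP through urllib, can be sketched as follows in Python 3. The request is prepared but never sent. The endpoint is the v1.1 Search URL that was current around the time of this tutorial and may since have changed or been retired, and the bearer-token header is an assumption about how the application authenticates; the token itself is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the v1.1 Search endpoint in use when this tutorial was given.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_request(query, bearer_token, count=100):
    """Prepare (but do not send) an authenticated keyword search request."""
    params = urlencode({"q": query, "count": count})
    request = Request(f"{SEARCH_URL}?{params}")
    # The token obtained from the platform travels in an HTTP header.
    request.add_header("Authorization", f"Bearer {bearer_token}")
    return request


req = build_search_request("lijep OR lep", "PLACEHOLDER_TOKEN")
print(req.full_url)
# Sending it would be urllib.request.urlopen(req), which needs a valid token.
```

Wrapper libraries such as tweepy hide exactly this URL building and header handling behind method calls.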
TweetCat
Modular command-line tool / set of scripts written (mostly) in Python
● Harvesting Twitter data either via seed terms or from geographical perimeters
  ○ Python
  ○ output arrays of JSON objects
● Extracting variables from the harvested data (text, metadata, variables from text)
  ○ Python
  ○ output CSV file
● Analysis, inference, visualisation, performed in R (or other tool of choice)
https://github.com/clarinsi/tweetcat
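To make the JSON-to-CSV step concrete, here is a minimal Python 3 sketch that is not TweetCat's own code. It assumes tweets in the v1.1 JSON shape (`id_str`, `text`, `user.screen_name`, GeoJSON `coordinates` ordered longitude, latitude); the two tweets and the column layout are made up for illustration.

```python
import csv
import io
import json

# Assumption: a tiny made-up harvest in the "array of JSON objects" shape.
harvested = json.loads("""
[
  {"id_str": "1", "text": "Lijep dan", "user": {"screen_name": "a"},
   "coordinates": {"type": "Point", "coordinates": [15.98, 45.81]}},
  {"id_str": "2", "text": "Lep dan", "user": {"screen_name": "b"},
   "coordinates": null}
]
""")


def to_rows(tweets):
    """Flatten each tweet into one dict of metadata variables."""
    for tweet in tweets:
        point = (tweet.get("coordinates") or {}).get("coordinates")
        lon, lat = point if point else ("", "")  # empty cells if not geo-encoded
        yield {
            "id": tweet["id_str"],
            "user": tweet["user"]["screen_name"],
            "lon": lon,
            "lat": lat,
            "text": tweet["text"],
        }


buffer = io.StringIO()  # stands in for the output CSV file
writer = csv.DictWriter(buffer, fieldnames=["id", "user", "lon", "lat", "text"])
writer.writeheader()
writer.writerows(to_rows(harvested))
csv_text = buffer.getvalue()
print(csv_text)
```

A flat CSV like this is what makes the later hand-off to R (or another analysis tool) straightforward.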
Data harvesting
Data harvesting
Two basic modes for data harvesting
1. LANG mode
  a. Want to collect data published in a specific language
  b. User input
    i. Seed words (used for querying the Search API)
    ii. Languages of interest (langid.py dependency)
2. GEO mode
  a. Want to collect geo-encoded data published in a geographical perimeter
  b. User input
    i. Geographical perimeter (used for listening on the Streaming API)
    ii. Languages of interest for potential filtering
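The GEO-mode inputs above can be pictured as a filter over incoming tweets. The Python 3 sketch below is an illustration, not TweetCat's implementation: the bounding box values, the sample tweets and the toy `identify` function are all hypothetical. With the langid module installed, `identify` could instead be `lambda text: langid.classify(text)[0]`.

```python
def in_perimeter(tweet, bbox):
    """True if the tweet's point lies inside (west, south, east, north)."""
    point = (tweet.get("coordinates") or {}).get("coordinates")
    if not point:
        return False                      # tweet is not geo-encoded
    lon, lat = point                      # GeoJSON order: longitude, latitude
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def keep(tweet, bbox, languages, identify):
    """GEO-mode style filter: inside the perimeter and in a wanted language."""
    return in_perimeter(tweet, bbox) and identify(tweet["text"]) in languages


# Hypothetical perimeter, roughly covering the western Balkans.
BBOX = (13.0, 41.8, 23.1, 46.9)


def identify(text):
    """Toy stand-in for a real language identifier."""
    return "hr" if "lijep" in text.lower() else "en"


zagreb = {"text": "Lijep dan", "coordinates": {"type": "Point", "coordinates": [15.98, 45.81]}}
london = {"text": "Nice day", "coordinates": {"type": "Point", "coordinates": [-0.13, 51.51]}}
no_geo = {"text": "Lijep dan", "coordinates": None}

kept = [t for t in (zagreb, london, no_geo) if keep(t, BBOX, {"hr", "sr", "bs"}, identify)]
print(len(kept))
```

Passing the language identifier in as a parameter keeps the geographic check testable without any third-party dependency.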
Hands on…
Prerequisites
● Python 2.7
● tweepy module
● langid module
● access tokens
Data sharing
Defined? by the Developer Agreement / Terms of Service.
You can share user and tweet IDs that can be used to recollect data from the API.
You can publish up to 50k public tweets directly.
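Sharing IDs means the recipient has to re-request the tweets in batches. A small Python 3 sketch of that batching step follows; the batch size of 100 is an assumption based on the limit of the v1.1 lookup endpoint, the IDs are placeholders, and no request is sent.

```python
def batches(ids, size=100):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


tweet_ids = [str(n) for n in range(1, 251)]  # placeholder IDs, not real tweets
chunks = list(batches(tweet_ids))
print([len(chunk) for chunk in chunks])  # [100, 100, 50]

# Each chunk would become one comma-separated `id` parameter of a lookup call.
id_params = [",".join(chunk) for chunk in chunks]
```

Tweets that were deleted or made private in the meantime simply do not come back, which is the point of sharing IDs rather than content.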
What to do when the data is linguistically annotated? https://github.com/clarinsi/tweetpub
Facebook?
As long as you do not sell the data and the data is public, you can share the harvested data.
Variable extraction
Variable extraction
Four variable extraction levels
1. Extraction from the Status object (metadata)
2. Extraction from original text
3. Extraction from lowercased text
4. Extraction from normalised text
Two text extraction principles
1. Lexicon-based (list of words mapped to variable values)
2. Regex-based (regular expressions mapped to variable values)
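The two text extraction principles can be sketched side by side in Python 3, here applied at the lowercased-text level. The lexicon entries, regular expressions, variable names (`yat`, `final_r`) and values are hypothetical examples chosen for illustration, not TweetCat's actual configuration.

```python
import re

# Lexicon-based: word forms mapped to a variable value (hypothetical entries).
YAT_LEXICON = {"lep": "e", "mleko": "e", "lijep": "ije", "mlijeko": "ije"}

# Regex-based: patterns mapped to a variable value (hypothetical rules).
FINAL_R_RULES = [
    (re.compile(r"\b(juče|takođe)\b"), "no_r"),
    (re.compile(r"\b(jučer|također)\b"), "r"),
]

TOKEN = re.compile(r"\w+")


def extract(text):
    """Return the value of each variable found in the lowercased text."""
    lowered = text.lower()
    variables = {"yat": None, "final_r": None}
    for token in TOKEN.findall(lowered):
        if token in YAT_LEXICON:              # first lexicon hit wins
            variables["yat"] = YAT_LEXICON[token]
            break
    for pattern, value in FINAL_R_RULES:
        if pattern.search(lowered):           # first matching rule wins
            variables["final_r"] = value
            break
    return variables


print(extract("Juče je bio LEP dan"))
print(extract("Jučer sam pio mlijeko"))
```

A lexicon is easiest when the variants are a closed list of word forms; regular expressions pay off when one pattern has to cover many inflected forms.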
Hands on...
Data analysis
R
Language variation analysis in BCMS
Bosnian, Croatian, Montenegrin, Serbian
Analyse 16 linguistic variables known to vary between variants
Focus on the linguistic strength of administrative borders - has the continuum been interrupted?
Tweets collected since 2013 (200 million tweets, 10 million geo-encoded, 1 million with relevant linguistic variables)
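One simple way to look at a border effect is to compare the share of each variant per country. The tutorial does this kind of analysis in R; the sketch below uses Python 3 only to stay consistent with the harvesting code. The rows are made up for illustration and are not results from the study, and the variable values are hypothetical labels.

```python
from collections import Counter, defaultdict

# Made-up illustrative rows: (country code, value of one linguistic variable).
rows = [
    ("RS", "e"), ("RS", "e"), ("RS", "ije"),
    ("HR", "ije"), ("HR", "ije"),
    ("BA", "ije"), ("BA", "e"),
]


def variant_shares(rows):
    """Relative frequency of each variant within each region."""
    counts = defaultdict(Counter)
    for region, value in rows:
        counts[region][value] += 1
    return {
        region: {value: n / sum(c.values()) for value, n in c.items()}
        for region, c in counts.items()
    }


shares = variant_shares(rows)
for region, dist in sorted(shares.items()):
    print(region, dist)
```

A sharp jump in these shares across an administrative border, rather than a gradual change, would suggest that the dialect continuum has been interrupted there.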
lep / lijep
mleko / mlijeko
smeh / smijeh
devojka / djevojka
jučer / juče
večer / veče
također / takođe
navečer / naveče