Flux of MEME - final report
flux of meme - final report
telecom italia, milan 30.9.11
thomas alisi @grudelsud
Friday, September 30, 11
the basics
the idea
Meme: a postulated unit or element of cultural ideas transmitted from one mind to another through speech or similar phenomena.
Zeitgeist: German language expression referring to "the spirit of the times"
Semantic Web: an evolving development of the World Wide Web in which the meaning (semantics) of information on the web is defined, making it possible for machines to process it
Flux of MEME: analysis of the web Zeitgeist through geo-localized Memes, updated and shared on social media mainly via mobile networks
background
yahoo research
WWW2011 - Who Says What to Whom on Twitter - Wu, Hofman, Mason, Watts
WSDM2011 - Who Uses Web Search for What? And How? - Weber, Jaimes
CSCW2011 - Peaks and Persistence: Modeling the Shape of Microblog Conversations - Shamma, Kennedy, Churchill
others
WWW2010 - What is Twitter, a Social Network or a News Media? - Kwak, Lee, Park, Moon
Tech report 2009 (Princeton / Carnegie Mellon) - Topic Models - Blei, Lafferty
Tech report 2009 (Facebook / Maryland / Princeton) - Reading Tea Leaves: How Humans Interpret Topic Models - Chang, Boyd-Graber, Gerrish, Wang, Blei
algorithm steps
1. fetch data
2. create clusters
3. extract topics
4. analyze stats
implementation
step 1. fetch data!
using the free Spritzer access to the Twitter streaming API (~1% of total tweets)
defined set of location boxes (Italy, UK, France, Spain)
reinforcing locations with geonames didn’t prove to be efficient (origin: from a galaxy far far away)
enrich content through web scraping, also carrying meta & opengraph keywords
blacklist of noisy sources
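The location-box filter can be sketched as a simple bounding-box test. This is a minimal illustration, not the project's code: the box coordinates below are rough, hypothetical values (the slides do not give the actual boxes), and `Box`, `LOCATION_BOXES` and `locate` are names made up for this example.

```python
from typing import NamedTuple, Optional, Sequence


class Box(NamedTuple):
    """A rectangular location box defined by its south-west and north-east corners."""
    name: str
    sw_lat: float
    sw_lon: float
    ne_lat: float
    ne_lon: float


# Hypothetical, approximate boxes; the values used in the project are not stated.
LOCATION_BOXES = [
    Box("Italy", 35.5, 6.6, 47.1, 18.5),
    Box("UK", 49.9, -8.2, 60.9, 1.8),
    Box("France", 41.3, -5.2, 51.1, 9.6),
    Box("Spain", 36.0, -9.3, 43.8, 3.3),
]


def locate(lat: float, lon: float,
           boxes: Sequence[Box] = LOCATION_BOXES) -> Optional[str]:
    """Return the name of the first box containing the point, or None.

    Rectangular boxes overlap near borders, so the order of `boxes`
    decides which name wins for points in an overlap.
    """
    for box in boxes:
        if box.sw_lat <= lat <= box.ne_lat and box.sw_lon <= lon <= box.ne_lon:
            return box.name
    return None
```

A geotagged post would be kept only when `locate(lat, lon)` is not `None`; posts without coordinates fall outside this filter entirely.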
step 2. create geo-clusters
create time slices
select all the posts within a time slice
choose geo-granularity (radius of clusters)
agglomerate posts with Hierarchical Agglomerative Clustering (HAC)
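The clustering step above can be sketched in a few lines. This is an illustration under stated assumptions: the slides do not say which linkage criterion was used, so the sketch uses centroid linkage with great-circle distance, and it treats the geo-granularity as a distance threshold in kilometres (`max_km`). The posts are assumed to belong to one already-selected time slice. All function names are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def centroid(points):
    """Arithmetic mean of the coordinates (adequate for small clusters)."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def hac(points, max_km):
    """Merge the two closest clusters until the closest pair is farther than max_km."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Find the pair of clusters with the closest centroids.
        (i, j), dist = min(
            (((i, j), haversine_km(centroid(clusters[i]), centroid(clusters[j])))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if dist > max_km:
            break
        # i < j, so removing cluster j does not shift index i.
        clusters[i].extend(clusters.pop(j))
    return clusters
```

The pairwise search makes each merge quadratic in the number of clusters, which is fine for a sketch but would need an optimized implementation for a full day of posts.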
step 3. extract topics
a geo-cluster represents the whole bag of words used to define a document
topic extraction is implemented with LDA
α: Dirichlet prior parameter on the per-document topic distributions (frontend output: weight)
β: Dirichlet prior parameter on the per-topic word distributions
θ_i is the topic distribution for document i,
z_ij is the topic for the jth word in document i, and
w_ij is the specific word.
user defined params: number of topics, number of words per topic, min followers
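To make the roles of α, β, θ_i, z_ij and w_ij concrete, here is a small sketch of the LDA generative process (how the model assumes a document is produced), not of the inference that extracts topics from real posts. The vocabulary, the prior values and the function names are invented for illustration; the libraries slide lists MALLET, and this sketch does not reproduce its API.

```python
import random


def dirichlet(alphas, rng):
    """Sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, topic_word, vocab, rng):
    """Generate one document: theta_i ~ Dir(alpha), then (z_ij, w_ij) per word."""
    n_topics = len(topic_word)
    theta = dirichlet([alpha] * n_topics, rng)  # per-document topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]   # topic z_ij
        w = rng.choices(vocab, weights=topic_word[z])[0]      # word w_ij
        words.append((z, w))
    return theta, words


rng = random.Random(0)
vocab = ["calcio", "goal", "sciopero", "treno"]  # hypothetical vocabulary
beta = 0.5                                        # hypothetical prior value
# One word distribution per topic, each drawn from Dir(beta).
topic_word = [dirichlet([beta] * len(vocab), rng) for _ in range(2)]
theta, words = generate_document(8, alpha=0.5, topic_word=topic_word,
                                 vocab=vocab, rng=rng)
```

Topic extraction runs this story in reverse: given only the words of each geo-cluster, it estimates the topic distributions and the per-topic word distributions that most plausibly generated them.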
step 4. analyze data
define search context: topics or keywords
perform live search with TF-IDF indicators
display time-lapse of clusters’ analytics evolution (log-scale count and average size)
quick and easy interface: toggle visibility of clusters
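As a reminder of what a TF-IDF indicator measures, here is a minimal sketch. It assumes each document is a list of tokens (for instance the bag of words of one geo-cluster) and uses the plain variant tf = count / document length, idf = log(N / document frequency); the slides do not state which weighting variant the project uses.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document (each document is a list of tokens)."""
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

With this variant a term that appears in every document scores zero, which is what makes the indicator useful for surfacing terms specific to one cluster.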
step 4. analyze data
drag and zoom on specific location boxes
select time interval
display aggregated stats of clusters (count and size) within location box
show and export breakdown of posts’ languages
step 4. analyze data
show stats and content of specific clusters
lat-lon of centroids, std. deviation, surface and radius
display weighted topics, TF-IDF of terms within topics, TF-IDF of meta keywords
show / export list of posts
show related links
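The geometric stats listed for a cluster can be sketched as follows. The definitions are assumptions, since the slides only name the quantities: the radius is taken as the largest great-circle distance from the centroid to a post, and the surface as the area of a circle with that radius. Function and key names are hypothetical.

```python
import math
from statistics import mean, pstdev

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cluster_stats(points):
    """Centroid, per-axis standard deviation, radius and surface of a cluster."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    centroid = (mean(lats), mean(lons))
    # Assumed definition: farthest post from the centroid.
    radius_km = max(haversine_km(centroid, p) for p in points)
    return {
        "centroid": centroid,
        "std_dev": (pstdev(lats), pstdev(lons)),  # in degrees
        "radius_km": radius_km,
        "surface_km2": math.pi * radius_km ** 2,  # assumed: circle of that radius
    }
```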
step 4. analyze data
show query metrics and parameters
display overall TF-IDF for the selected query
demo
http://fom.londondroids.com/fom/
sorry guys, now the boring stuff...
backend, front-end API, cron jobs
Backend
Streaming API
a batch process is constantly running and saving data on the db
options: fetch by search query, expand terms with wikiminer, access all the stream, filter geotagged, filter location box, fetch related content
Clustering and Topic extraction
define geo granularity time/size of geo clusters
followers and retweets
number of topics / keywords
language mapping
API
search clusters containing specific topics / keywords
returns lists of clusters ordered by topic weight
all data extraction APIs conform to a RESTful model and return JSON-structured data
API
read list of geographic clusters
usually called after a search topic has been raised
API
read semantic content of a geographic cluster
topics grouped by score (alpha parameter in LDA) and words weighted with TF-IDF with respect to the whole cluster content
API
read meta / opengraph content of a geographic cluster
API
export list of posts
exports all the posts contained in a cluster
example request: /cluster/export_posts/1026/csv
read post content
reads the content of a post
example request: /cluster/read_post/560951
read related link
reads the content of a link related to a post (the id is usually fetched through the variable “links” returned by the function above)
example request: /cluster/read_link/16268
execute cluster stats within a location box
reads the list of clusters contained within a location box and creates stat charts (in the form of Google Chart images)
example request: /cluster/dzstat/c_since=2011-05-07/c_until=2011-05-10/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
execute post stats within a location box
reads the list of posts contained within a location box and computes stats on languages
example request: /search/dzstat/p_since=2011-05-07/p_until=2011-05-10/p_timespan=daily/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33
read query content
reads the list of geo-clusters associated with a specific query id (usually fetched by the function above)
example request: /cluster/read/2
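The example requests above all follow the same shape: an endpoint followed by positional ids or `key=value` path segments. A client-side helper could look like the sketch below. It is an assumption that the API is served under the demo URL shown earlier and that responses are JSON; `build_path` and `fetch_json` are hypothetical names, and `fetch_json` is defined but never called here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Demo URL from the slides; assumed (not confirmed) to be the API base as well.
BASE_URL = "http://fom.londondroids.com/fom"


def build_path(endpoint, **params):
    """Build a request path with key=value segments, in the order given."""
    segments = [endpoint.strip("/")]
    segments += [f"{quote(str(k))}={quote(str(v))}" for k, v in params.items()]
    return "/" + "/".join(segments)


def fetch_json(path, base_url=BASE_URL, timeout=10):
    """Fetch a path and decode the body as JSON (assumes a JSON response)."""
    with urlopen(base_url + path, timeout=timeout) as response:
        return json.load(response)


# Rebuilds the cluster stats example request shown above.
stats_path = build_path(
    "cluster/dzstat",
    c_since="2011-05-07", c_until="2011-05-10",
    swLat=44.61, swLon=8.52, neLat=45.57, neLon=11.33,
)
```

Endpoints that take a positional id need no parameters, for example `build_path("cluster/read_post/560951")`.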
Cron
keep everything running
restart the streaming API now and then, so as to keep twitter happy
create the clusters at the end of the day
servers
final thoughts
improvements
optimize time slicing!
emerging topics should be checked on an hourly basis among the complete dataset
train models!
a training set would be ideal to create models and optimize the performance of the topic extraction algorithm
models could relate to specific contexts in order to improve results (e.g. all the tweets from newspapers)
create language classifiers
increase the precision of language detection with naive Bayes classifiers
think of scalability
increasing the amount of data makes it necessary to scale up to Map/Reduce architectures
increase flexibility (e.g. manage multimedia data, offer a rich contextualized API, ...)
enhance analysis and visualization (e.g. reinforce topic correlation / n-grams)
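For the language classifier improvement, a naive Bayes model over character trigrams is one common starting point. The sketch below is illustrative only: the class name, the choice of trigrams, the add-one smoothing and the two training sentences are all assumptions, and a usable classifier would need far more training text per language.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Character trigrams of the lowercased text, padded with spaces."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class NaiveBayesLanguageClassifier:
    """Multinomial naive Bayes over character trigrams with add-one smoothing."""

    def __init__(self):
        self.ngram_counts = defaultdict(Counter)  # language -> trigram counts
        self.doc_counts = Counter()               # language -> training texts seen
        self.vocab = set()                        # all trigrams seen in training

    def train(self, text, language):
        grams = trigrams(text)
        self.ngram_counts[language].update(grams)
        self.doc_counts[language] += 1
        self.vocab.update(grams)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        grams = trigrams(text)
        best_lang, best_score = None, -math.inf
        for lang, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each trigram.
            score = math.log(self.doc_counts[lang] / total_docs)
            for g in grams:
                score += math.log((counts[g] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


clf = NaiveBayesLanguageClassifier()
clf.train("il treno per milano è in ritardo a causa dello sciopero", "it")
clf.train("the train to london is delayed because of the strike", "en")
```

Short posts give few trigrams to vote with, so precision on tweets depends heavily on the size and cleanliness of the training set.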
other refs
algorithms
LDA - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
HAC - http://en.wikipedia.org/wiki/Cluster_analysis
libraries
twitter 4 java - http://twitter4j.org
machine learning - http://mallet.cs.umass.edu/
jquery (core + ui) - http://jquery.org/
data tables - http://datatables.net/
chart api - http://code.google.com/apis/chart/
image courtesy
http://yesyesno.com/nike-city-runs
thanks!
codebase source + wiki https://github.com/grudelsud/fom
thomas alisi @grudelsud
giuseppe serra @giuseppeserra
marco bertini @bertinimarco
?