Flux of MEME - final report

flux of meme - final report
telecom italia, milan 30.9.11
thomas alisi @grudelsud

Friday, September 30, 11

the basics


the idea

Meme: a postulated unit or element of cultural ideas transmitted from one mind to another through speech or similar phenomena.

Zeitgeist: German language expression referring to "the spirit of the times"

Semantic Web: an evolving development of the World Wide Web in which the meaning (semantics) of information on the web is defined, making it possible for machines to process it

Flux of MEME: analysis of the web Zeitgeist through geo-localized Memes, updated and shared on social media mainly via mobile networks


background

yahoo research
WWW2011 - Who Says What to Whom on Twitter - Wu, Hofman, Mason, Watts
WSDM2011 - Who Uses Web Search for What? And How? - Weber, Jaimes
CSCW2011 - Peaks and Persistence: Modeling the Shape of Microblog Conversations - Shamma, Kennedy, Churchill

others
WWW2010 - What is Twitter, a Social Network or a News Media? - Kwak, Lee, Park, Moon
Tech report 2009 (Princeton / Carnegie Mellon) - Topic Models - Blei, Lafferty
Tech report 2009 (Facebook / Maryland / Princeton) - Reading Tea Leaves: How Humans Interpret Topic Models - Chang, Boyd-Graber, Gerrish, Wang, Blei


algorithm steps

1. fetch data
2. create clusters
3. extract topics
4. analyze stats


implementation


step 1. fetch data!

using the free Spritzer access to the Twitter streaming API (~1% of total tweets)
defined set of location boxes (Italy, UK, France, Spain)
reinforcing locations with geonames didn’t prove to be effective (e.g. origin: "from a galaxy far far away")
enrich content through web scraping, also carrying meta & opengraph keywords
blacklist of noisy sources
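The location-box filter can be sketched as below. This is a minimal illustration, not the project's code: the box coordinates are rough assumed values, and `Box`, `in_box` and `matching_boxes` are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A location box given by its south-west and north-east corners."""
    sw_lat: float
    sw_lon: float
    ne_lat: float
    ne_lon: float


# Rough, illustrative bounding boxes (assumed values, not the project's own).
BOXES = {
    "Italy": Box(35.5, 6.6, 47.1, 18.5),
    "UK": Box(49.9, -8.2, 60.9, 1.8),
    "France": Box(41.3, -5.2, 51.1, 9.6),
    "Spain": Box(36.0, -9.3, 43.8, 3.3),
}


def in_box(lat: float, lon: float, box: Box) -> bool:
    """True when the point lies inside the box (borders included)."""
    return box.sw_lat <= lat <= box.ne_lat and box.sw_lon <= lon <= box.ne_lon


def matching_boxes(lat: float, lon: float) -> list:
    """Names of all boxes containing the point; rectangular boxes can overlap."""
    return [name for name, box in BOXES.items() if in_box(lat, lon, box)]
```

Because the boxes are plain rectangles, neighbouring countries overlap, so a point near a border can match more than one box.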


step 2. create geo-clusters

create time slices
select all the posts within a time slice
choose geo-granularity (radius of clusters)
agglomerate posts with Hierarchical Agglomerative Clustering (HAC)
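The steps above can be sketched as follows. The slides do not say which linkage criterion or stopping rule was used, so this sketch assumes centroid linkage and stops merging once the two closest centroids are farther apart than a chosen distance; the post fields `created_at` and `geo` and all function names are hypothetical.

```python
import math
from datetime import datetime
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def posts_in_slice(posts, start, end):
    """Select the posts whose timestamp falls in the half-open slice [start, end)."""
    return [p for p in posts if start <= p["created_at"] < end]


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def centroid(points):
    """Arithmetic mean of lat/lon: an approximation that is fine for small clusters."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def geo_hac(points, max_km):
    """Bottom-up clustering: merge the two closest clusters until none are within max_km."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), haversine_km(centroid(clusters[i]), centroid(clusters[j])))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if dist > max_km:
            break
        # i < j always holds here, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The naive pair search is quadratic per merge, which is acceptable for a sketch but would need an index or a library implementation on a full day of posts.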


step 3. extract topics

a geo-cluster represents the whole bag of words used to define a document
topic extraction is implemented with LDA
α is the Dirichlet prior parameter on the per-document topic distributions (frontend output: weight)
β is the Dirichlet prior parameter on the per-topic word distributions
θi is the topic distribution for document i
zij is the topic for the jth word in document i
wij is the specific word

user defined params: number of topics, number of words per topic, min followers
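For reference, the symbols above fit together in the standard LDA generative model. The names φ_k (word distribution of topic k) and K (number of topics) are the usual textbook notation and do not appear on the slides.

```latex
\begin{aligned}
\theta_i &\sim \mathrm{Dirichlet}(\alpha) && \text{topic distribution of document } i \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k,\ k = 1, \dots, K \\
z_{ij} &\sim \mathrm{Categorical}(\theta_i) && \text{topic of the } j\text{th word in document } i \\
w_{ij} &\sim \mathrm{Categorical}(\varphi_{z_{ij}}) && \text{the observed word}
\end{aligned}
```

Here each document i is one geo-cluster, and K corresponds to the user-defined number of topics.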


step 4. analyze data

define search context: topics or keywords
perform live search with TF-IDF indicators
display time-lapse of clusters’ analytics evolution (log-scale count and average size)
quick and easy interface: toggle visibility of clusters



drag and zoom on specific location boxes
select time interval
display aggregated stats of clusters (count and size) within location box
show and export breakdown of posts’ languages



show stats and content of specific clusters

lat-lon of centroids, std. deviation, surface and radius

display weighted topics, TF-IDF of terms within topics, TF-IDF of meta keywords
show / export list of posts
show related links



show query metrics and parameters
display overall TF-IDF for the selected query
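The TF-IDF indicators used throughout the interface can be illustrated with a minimal sketch. The slides do not state which weighting variant the project uses, so this assumes the plain form: relative term frequency times the log of the inverse document frequency. `tf_idf` is a hypothetical name, and each "document" stands for the tokenized content of one cluster.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists; returns one {term: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

A term that occurs in every document scores zero under this variant, which is why very common words drop out of the ranking.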


demo
http://fom.londondroids.com/fom/



sorry guys, now the boring stuff...
backend, front-end API, cron jobs


Backend

Streaming API
a batch process is constantly running and saving data on the db
options: fetch by search query, expand terms with wikiminer, access all the stream, filter geotagged, filter location box, fetch related content

Clustering and Topic extraction
define geo granularity time/size of geo clusters
followers and retweets
number of topics / keywords
language mapping


API

search clusters containing specific topics / keywords
returns lists of clusters ordered by topic weight
all the data extraction API conforms to a RESTful model and returns JSON structured data


API

read list of geographic clusters
usually called after a search topic has been raised


API

read semantic content of a geographic cluster
topics grouped by score (alpha parameter in LDA) and words weighted with TF-IDF with respect to the whole cluster content


API

read meta / opengraph content of a geographic cluster


API

export list of posts

exports all the posts contained in a cluster

example request: /cluster/export_posts/1026/csv

read post content

reads the content of a post

example request: /cluster/read_post/560951

read related link

reads the content of a link related to a post (the id is usually fetched through the variable “links” returned by the function above)

example request: /cluster/read_link/16268

execute cluster stats within a location box

reads the list of clusters contained within a location box and creates stat charts (in the form of Google Chart images)

example request: /cluster/dzstat/c_since=2011-05-07/c_until=2011-05-10/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33

execute post stats within a location box

reads the list of posts contained within a location box and performs stats on languages

example request: /search/dzstat/p_since=2011-05-07/p_until=2011-05-10/p_timespan=daily/swLat=44.61/swLon=8.52/neLat=45.57/neLon=11.33

read query content

reads the list of geo-clusters associated with a specific query id (usually fetched by the function above)

example request: /cluster/read/2
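The example requests above all follow the same path pattern: an endpoint, optional positional segments, then `key=value` segments. A small client sketch under that reading is shown below. `build_path` and `fetch_json` are hypothetical helpers, and joining the paths onto the demo URL is an assumption, since the slides do not state the API base URL.

```python
import json
from urllib.request import urlopen

# Assumed base: the demo URL from the slides, without the trailing slash.
BASE_URL = "http://fom.londondroids.com/fom"


def build_path(endpoint, *segments, **params):
    """Build a request path such as /cluster/read_post/560951."""
    parts = [endpoint.strip("/")]
    parts += [str(segment) for segment in segments]
    # Keyword arguments keep their call order, matching the examples above.
    parts += [f"{key}={value}" for key, value in params.items()]
    return "/" + "/".join(parts)


def fetch_json(path, base_url=BASE_URL, timeout=10):
    """Fetch a path and decode the JSON body (for endpoints that return JSON)."""
    with urlopen(base_url + path, timeout=timeout) as response:
        return json.load(response)
```

For example, `build_path("cluster/export_posts", 1026, "csv")` reproduces the export request shown earlier; `fetch_json` only suits JSON endpoints, not the CSV export.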


Cron

keep everything running
restart the streaming API now and then, so as to keep twitter happy
create the clusters at the end of the day


servers


final thoughts


improvements

optimize time slicing!
emerging topics should be checked on an hourly basis across the complete dataset

train models!
a training set would be ideal to create models and optimize the performance of the topic extraction algorithm
models could relate to specific contexts in order to improve results (e.g. all the tweets from newspapers)

create language classifiers
increase the precision of language detection with naive Bayes classifiers

think of scalability
increasing the amount of data makes it necessary to scale up to Map/Reduce architectures

increase flexibility (e.g. manage multimedia data, offer a rich contextualized API, ...)
enhance analysis and visualization (e.g. reinforce topic correlation / n-grams)


other refs

algorithms
LDA - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
HAC - http://en.wikipedia.org/wiki/Cluster_analysis

libraries
twitter 4 java - http://twitter4j.org
machine learning - http://mallet.cs.umass.edu/
jquery (core + ui) - http://jquery.org/
data tables - http://datatables.net/
chart api - http://code.google.com/apis/chart/

image courtesy
http://yesyesno.com/nike-city-runs



thanks!
codebase source + wiki https://github.com/grudelsud/fom
thomas alisi @grudelsud
giuseppe serra @giuseppeserra
marco bertini @bertinimarco ?


