Using Digg to Save Lives
Or, Using Digg to find interesting events.
Presented by: Luis Zaman, Amir Khakpour, and John Felix
Outline
Explanation
Digg is a social web-media discovery tool based on user-submitted content.
1 or 2 submissions a minute
Half-life of “interest” is about a day
Digg aggregates “interesting” content.
But how do we find interesting Events and know their Themes?
Motivation
The collaborative nature of social media can scour the WWW very thoroughly.
But this generates A LOT of data (you’ll see).
It would be cool to find emergencies or critical situations based on this collaborative media.
Apple seems like a pretty good starting point.
Approach
Preprocessing
Digg API
REST API: http://services.digg.com/stories/topic/apple?count=10
XML response:
<?xml version="1.0" encoding="utf-8" ?>
<users timestamp="1176998598" total="1" offset="0" count="1">
  <user name="sbwms" icon="http://digg.com/img/user-large/user-default.png" registered="1135702996" profileviews="3104" />
</users>
Limitations
100 results per request
1 hour of time series data
Can’t go fast, or else.
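The request and response above could be handled roughly as follows. This is a minimal sketch, not the presenters' code: the `offset` query parameter, the helper names `build_url`, `parse_users` and `paged_urls`, and the one-second pause are assumptions made for illustration, and the endpoint shown on the slide may no longer respond.

```python
import time
import urllib.parse
import xml.etree.ElementTree as ET

BASE_URL = "http://services.digg.com/stories/topic/apple"  # URL from the slide

def build_url(count=100, offset=0):
    """Build a paged request URL; `offset` as a query parameter is an assumption."""
    query = urllib.parse.urlencode({"count": count, "offset": offset})
    return f"{BASE_URL}?{query}"

def parse_users(xml_bytes):
    """Return one dict of attributes per <user> element in a <users> response."""
    root = ET.fromstring(xml_bytes)
    return [dict(user.attrib) for user in root.iter("user")]

def paged_urls(pages, page_size=100, pause_seconds=1.0):
    """Yield one URL per page, sleeping between pages to avoid going too fast."""
    for page in range(pages):
        if page > 0:
            time.sleep(pause_seconds)  # hypothetical pause; the real limit is not stated
        yield build_url(count=page_size, offset=page * page_size)

# Sample response trimmed from the slide (icon attribute omitted for brevity).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" ?>
<users timestamp="1176998598" total="1" offset="0" count="1">
  <user name="sbwms" registered="1135702996" profileviews="3104" />
</users>"""

users = parse_users(SAMPLE)
```

Paging in steps of 100 mirrors the "100 results per request" limit; the pause between pages is the simplest way to respect the "can't go fast" constraint.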
Preprocessing
Time Series
Each digg is the event (only 100 at a time)
Rows: each story’s digg count
Columns: every hour (2,207 of them from August 08 – November 08)
Clustering
Rows: each story that was digged at any point in the time series
Columns: the words in the title and description of this story
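The time-series table described above (one row per story, one column per hour) can be sketched as follows. The input format, a list of (story id, Unix timestamp) pairs, and the function name `hourly_counts` are assumptions for illustration; the slides do not show how the diggs were stored.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600

def hourly_counts(diggs, start, n_hours):
    """Bucket (story_id, unix_timestamp) pairs into per-story hourly counts.

    Returns {story_id: [count_hour_0, ..., count_hour_(n_hours - 1)]}.
    Diggs outside the window [start, start + n_hours hours) are ignored.
    """
    table = defaultdict(lambda: [0] * n_hours)
    for story_id, timestamp in diggs:
        hour = (timestamp - start) // SECONDS_PER_HOUR
        if 0 <= hour < n_hours:
            table[story_id][hour] += 1
    return dict(table)

# Hypothetical data: three diggs on story 1, two on story 2 (one out of range).
diggs = [(1, 0), (1, 10), (1, 3600), (2, 7300), (2, 99999)]
series = hourly_counts(diggs, start=0, n_hours=3)
```

With 2,207 hourly columns most cells stay at zero, so a sparse layout may be worth considering for the full dataset.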
Preprocessing - Challenges
SLOW
Really dirty data
Different formats of data
REALLY SLOW
Introduction to Document Clustering
Challenges of clustering text documents, unlike structured data:
Volume
Dimensionality
Sparsity
Complex semantics
In information retrieval and text mining, text data is represented in a common representation model, e.g. the Vector Space Model (VSM).
Huge sparse matrix; we just store the non-zero values.
Text documents are converted to a matrix $A_{m \times n}$: for m documents and a total of n words (or phrases), each element $x_{i,j}$ represents the frequency of the jth term in the ith document.
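A small sketch of the term-frequency matrix described above, storing only the non-zero entries. The whitespace tokenizer, the function name `build_sparse_matrix`, and the sample documents are assumptions for illustration; the slides do not describe the actual tokenization.

```python
from collections import Counter

def build_sparse_matrix(documents):
    """Build a sparse term-frequency matrix from whitespace-tokenized documents.

    Returns (vocabulary, rows): vocabulary maps each word to a column index j,
    and rows[i] is a {j: frequency} dict holding only the non-zero entries.
    """
    vocabulary = {}
    rows = []
    for text in documents:
        counts = Counter(text.lower().split())
        row = {}
        for word, freq in counts.items():
            # Assign the next free column index the first time a word is seen.
            j = vocabulary.setdefault(word, len(vocabulary))
            row[j] = freq
        rows.append(row)
    return vocabulary, rows

# Hypothetical story titles.
docs = ["Apple releases new iPod", "new Apple laptop new battery"]
vocab, rows = build_sparse_matrix(docs)
nonzero = sum(len(row) for row in rows)
density = nonzero / (len(rows) * len(vocab))  # fraction of non-zero cells
```

Keeping one dictionary per row means memory grows with the number of non-zero values rather than with m times n.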
Clustering
Dataset
Number of stories (m): 25470
Total number of unique words (n): 55557
Nonzero values: 469323 (0.03214%)
Clustering using the CLUTO software
Using K-means and bisecting K-means
Calculating centroids and SSE
A C++ program is run on “black”
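The centroid and SSE calculation can be illustrated with a short sketch. It uses dense Python lists and squared Euclidean distance, which is an assumption; the C++ program mentioned on the slide is not shown, and the function names here are hypothetical.

```python
def centroid(points):
    """Mean vector of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((x - cx) ** 2 for x, cx in zip(p, c))
    return total

# Hypothetical two-cluster assignment.
clusters = [
    [[0.0, 0.0], [2.0, 0.0]],  # centroid (1, 0)
    [[5.0, 5.0], [5.0, 7.0]],  # centroid (5, 6)
]
total_sse = sse(clusters)
```

A lower SSE for the same number of clusters indicates tighter clusters, which is what makes it usable for comparing K-means with bisecting K-means.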
Document Clustering by Optimizing Criterion Functions
According to Zhao et al., a good clustering of documents can be found by choosing a criterion function and optimizing it:
Internal Criterion Functions (I): maximize the internal similarity function.
External Criterion Functions (E): minimize the external similarity function.
Hybrid Criterion Functions (H): maximize the ratio I/E.
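The slide does not say which variants of the criterion functions were used. As an illustration only, the sketch below assumes the I2, E1 and H2 functions from Zhao and Karypis, with unit-length document vectors and D_r denoting the sum of the vectors in cluster r; the helper names are hypothetical.

```python
import math

def composite(vectors):
    """Sum of the document vectors in a cluster (the composite vector D_r)."""
    return [sum(col) for col in zip(*vectors)]

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def i2(clusters):
    """Internal criterion I2: sum over clusters of ||D_r|| (to be maximized)."""
    return sum(norm(composite(c)) for c in clusters)

def e1(clusters):
    """External criterion E1: sum of n_r * (D_r . D) / ||D_r|| (to be minimized)."""
    all_docs = [d for c in clusters for d in c]
    d_all = composite(all_docs)  # composite vector D of the whole collection
    total = 0.0
    for c in clusters:
        d_r = composite(c)
        total += len(c) * dot(d_r, d_all) / norm(d_r)
    return total

def h2(clusters):
    """Hybrid criterion H2 = I2 / E1 (to be maximized)."""
    return i2(clusters) / e1(clusters)

# Hypothetical unit-length documents in two clusters.
clusters = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]]
```

Because I should be large and E small, the ratio rewards clusterings that are both internally tight and well separated from the rest of the collection.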
Experiments
SSE for I (K-Means vs Bisecting K-Means)
Visualization
What we used
jQuery: JavaScript library for DOM manipulation and Ajax requests
PHP/MySQL: scripting language and database backend
Google Visualization API: Time Series Graph, zoomable
Timepedia Chronoscope: clickable
Conclusions
Success?
Of course we think so
Future Work
Save lives?
Better clustering
Cleaner data
More data
Make it scalable and dynamic
On-line and on the fly?