Social Media Crawling & Mining Seminar

Lecture @ International Hellenic University Thessaloniki, 8 May 2014 Social Media Crawling and Mining Overview of Hands-on Workshop Symeon (Akis) Papadopoulos, Manos Schinas, Katerina Iliakopoulou, Yiannis Kompatsiaris Information Technologies Institute (ITI) Centre for Research & Technologies Hellas (CERTH)

description

Introductory seminar @ International Hellenic University (8 May 2014)

Main topics covered:
- data collection from social sources
- indexing using mongoDB and Solr
- mining (basic analytics & topic detection)

Transcript of Social Media Crawling & Mining Seminar

Page 1: Social Media Crawling & Mining Seminar

Lecture @ International Hellenic University, Thessaloniki, 8 May 2014

Social Media Crawling and Mining
Overview of Hands-on Workshop
Symeon (Akis) Papadopoulos, Manos Schinas, Katerina Iliakopoulou, Yiannis Kompatsiaris
Information Technologies Institute (ITI)
Centre for Research & Technologies Hellas (CERTH)

Page 2: Social Media Crawling & Mining Seminar

IHU SocialSensor Seminar – May 2014 CERTH-ITI

workshop objectives

• get a glimpse into the problem of social media monitoring and mining
• get a feeling for the nature of social data
• understand the basic operations/tasks involved
• look into underlying research problems
• gain some practical experience with new database technologies (mongo, Solr)
• be motivated to explore further problems and become experts

Page 3: Social Media Crawling & Mining Seminar

Page 4: Social Media Crawling & Mining Seminar

additional material

https://www.dropbox.com/s/ip0ah4u5r5mqi6i/dbs.zip

http://bit.ly/1nkyfQ7 (~230MB)

https://www.dropbox.com/s/d0d3586fxqtlx4u/ihu-material.zip

http://bit.ly/1uEfo4t (~180MB)

Page 5: Social Media Crawling & Mining Seminar

problem setting

• input: streams of content from social sources
• output: statistics/insights

operations:
• data management (pre-processing, indexing)
• data access/retrieval
• mining:
  – basic analytics
  – trend detection

Page 6: Social Media Crawling & Mining Seminar


basic concepts I

• Item:
  – The unit of social content: can refer to a tweet, a Facebook post, a Google+ post, an Instagram post, etc.
  – It is typically associated with some attributes: ID, title, description, tags, contributor (here to be referred to as StreamUser), publication date, etc.
  – There are some attributes that are social network-specific: location, retweets, likes, shares, etc.
• MediaItem:
  – For Items that point to multimedia content (image/video), we make use of a special type of object, called MediaItem, to store the associated URL of the image/video along with additional attributes (e.g. image size, video duration, etc.).
  – Depending on the social network, there can be Items with no associated MediaItem (e.g. Facebook post), or an Item must always be associated with a MediaItem (e.g. YouTube video).
• Webpage:
  – Represents a webpage. It is linked to the Item that shared it and may contain one or more MediaItems.
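The entities above can be pictured as a handful of plain Java classes. This is only a minimal sketch: the class names follow the slide, but every field name, the constructor signatures and the `hasMedia` helper are illustrative assumptions rather than the workshop's actual data model.

```java
import java.util.ArrayList;
import java.util.List;

/** Contributor of an Item (fields are hypothetical). */
final class StreamUser {
    final String id;
    final String username;

    StreamUser(String id, String username) {
        this.id = id;
        this.username = username;
    }
}

/** Image or video referenced by an Item or embedded in a Webpage. */
final class MediaItem {
    final String url;
    final String type; // assumed values: "image" or "video"

    MediaItem(String url, String type) {
        this.url = url;
        this.type = type;
    }
}

/** Webpage shared by an Item; may contain MediaItems. */
final class Webpage {
    final String url;
    final List<MediaItem> mediaItems = new ArrayList<>();

    Webpage(String url) {
        this.url = url;
    }
}

/** The unit of social content, e.g. a tweet. */
final class Item {
    final String id;
    final String title;
    final StreamUser contributor;
    final long publicationTime; // Unix timestamp in milliseconds (assumption)
    final List<String> tags = new ArrayList<>();
    final List<MediaItem> mediaItems = new ArrayList<>();
    final List<Webpage> webpages = new ArrayList<>();

    Item(String id, String title, StreamUser contributor, long publicationTime) {
        this.id = id;
        this.title = title;
        this.contributor = contributor;
        this.publicationTime = publicationTime;
    }

    /** True if at least one MediaItem is attached to this Item. */
    boolean hasMedia() {
        return !mediaItems.isEmpty();
    }
}
```

Network-specific attributes (retweets, likes, location) are left out here; they could be added as optional fields or a generic attribute map.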

Page 7: Social Media Crawling & Mining Seminar

basic concepts I

[Figure: relationships between the basic entities. A StreamUser shares/creates Items; an Item links/contains MediaItems and links Webpages; a Webpage contains MediaItems. Edges carry 1 and N cardinality labels.]

Page 8: Social Media Crawling & Mining Seminar


basic concepts II

• crawling:
  – Typically this refers to a process that “explores” the Web by starting from a seed set of webpages and following the included links (in a recursive way).
  – In a social media context, we will use crawling in a relaxed manner to mean “collection (in a focused way) of content shared by social network users”.
  – Collection from social media can be done in many ways. In this workshop, we will use the paradigm of the “stream manager”.

Page 9: Social Media Crawling & Mining Seminar

basic concepts III

• stream manager:
  – This is a process that continuously retrieves Items from a social source based on a given configuration. The collected Items, along with the embedded MediaItems and Webpages, can be stored in the selected databases for further processing.
  – For instance, in the case of Twitter the configuration may specify a set of keywords/users and/or locations to track (as supported by the Streaming API). The configuration options vary depending on the social network in question.
  – Once a new Item is obtained, its further handling may include the following: a) storing to a DB, b) indexing of its text, c) extraction of contained URLs and/or MediaItems, d) further analysis (e.g. sentiment detection), etc.
  – For this workshop we will restrict ourselves to Twitter data.

Page 10: Social Media Crawling & Mining Seminar

basic concepts IV

• indexing:
  – There are different types of indices that serve different access requirements.
  – Here, we will deal with the following types of indices:
    • Full-text index: This supports free text queries and will be based on the Solr framework.
    • Numerical value index: This supports interval/threshold queries (e.g. retrieve all tweets with more than 100 retweets) and is applied on numerical (int/double) fields. The same index can be used for temporal filtering if the Unix timestamp is indexed.
    • Text similarity index: This supports similarity-based queries using the whole text of an input Item, e.g. bring the most similar tweets. Our implementation relies on Locality Sensitive Hashing (LSH).
    • Visual similarity index: This supports search by example using the visual content of an image (will not be used in this workshop).

Page 11: Social Media Crawling & Mining Seminar

basic concepts V

• mining:
  – Mining refers to the processing of multiple Items with the goal of extracting insights and higher-level conclusions about the data collection, and ultimately about the real world (e.g. understand which candidate is more popular).
  – In this workshop, we will deal with two popular and relatively straightforward mining problems:
    • basic analytics: this involves the computation of basic aggregate statistics and the extraction of the most “important” objects in a given dataset, e.g. most important contributors, hashtags, Items.
    • trend detection: this involves the detection of keywords or phrases that attract increasing interest during a specific interval. Those may refer to news stories, events, persons (e.g. celebrities), memes, etc. These are often also referred to as topics.
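As an illustration of the basic analytics case, the sketch below counts hashtags over a list of tweet texts and returns the most frequent ones. It is a minimal in-memory example and not the workshop code: the class name `TopHashtags`, the regular expression and the tie-breaking rule are assumptions, and `\w` in Java matches ASCII word characters only unless Unicode mode is enabled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TopHashtags {
    // A hashtag is taken to be '#' followed by word characters (assumption).
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    /** Counts case-insensitive hashtag occurrences across all texts. */
    static Map<String, Integer> count(List<String> texts) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : texts) {
            Matcher m = HASHTAG.matcher(text);
            while (m.find()) {
                counts.merge(m.group(1).toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns up to n hashtags, most frequent first; ties are broken alphabetically. */
    static List<String> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

The same counting pattern applies to contributors or retweeted Items; for a large collection the aggregation would more likely be pushed down to the database or index.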

Page 12: Social Media Crawling & Mining Seminar

overview of architecture

[Figure: architecture in three stages (crawling, indexing, mining). The StreamManager is configured through input.conf.xml (crawling configuration) and streams.conf.xml (DB & OSN credentials configuration) and writes to mongo and Solr through a Mongo DAO and a Solr Handler. The mining components (basic analytics, trend detection) access mongo and Solr through a Mongo DAO and a Solr Handler as well.]

Page 13: Social Media Crawling & Mining Seminar


useful techniques I: tokenization

• Split an input text into lexical units (=tokens)
  – useful for indexing (happens behind the scenes)
  – useful for trending keyword detection
• Most crude implementation:
  String[] tokens = message.split(" ");
• Available standard implementations in Solr, e.g.:
  – Standard Tokenizer
  – Letter Tokenizer
  – N-Gram Tokenizer (N-Gram = sequence of N tokens)
• For Twitter, Twokenizer is popular.
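Splitting on a single space leaves punctuation glued to words ("explosion," or "@BBC!"). The sketch below is one possible step up, using a regular expression that keeps URLs, @mentions and #hashtags intact. It is a simplified illustration, not Twokenizer and not a Solr tokenizer; the class name and the pattern are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleTweetTokenizer {
    // Alternatives are tried in order: URLs, then @mentions/#hashtags,
    // then plain words (optionally with inner apostrophes, e.g. "don't").
    private static final Pattern TOKEN =
        Pattern.compile("https?://\\S+|[@#]\\w+|\\w+(?:'\\w+)*");

    /** Returns the tokens of the text in order of appearance. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Bomb explosion, says @BBC! #Ukraine http://t.co/abc"));
    }
}
```

Case folding and stop-word removal are deliberately left out; they would normally be separate steps after tokenization.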

Page 14: Social Media Crawling & Mining Seminar

useful techniques II: entity detection

• entity detection (or named entity detection) is the marking of particular tokens that refer to things like persons, organizations (e.g. companies), locations
  – useful for giving more importance to these tokens
  – useful for filtering noise (e.g. tweets that contain no entities)
  – named entities make good query terms (e.g. for retrieving content from external sources)
• standard implementations are available
  – perhaps the most popular is the Stanford NER library
  – others include Balie, GATE, OpenNLP

Page 15: Social Media Crawling & Mining Seminar

useful techniques III: LSH (a)

• Locality Sensitive Hashing (LSH) is a set of probabilistic methods for the hashing of high-dimensional data.

• The basic idea is that similar items according to some metric are hashed to the same value with high probability.

• There are available hash functions for several distance measures: Jaccard Coefficient (MinHash), L1 and L2 distance, Cosine Similarity (random projection).

• Random projection: For an input vector u of length d, and using K random d-dimensional vectors (with K<<d) we create a signature of length K, e.g. for K=4: hash(u) = {1, 0, 1, 1}
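The random projection scheme can be sketched in a few lines: draw K random d-dimensional vectors once, then set bit i of the signature to 1 when the input vector has a non-negative dot product with the i-th random vector. The class below is an illustrative sketch under those assumptions (Gaussian random vectors, a fixed seed, the name `RandomProjectionLsh`), not the implementation used in the workshop.

```java
import java.util.Random;

final class RandomProjectionLsh {
    private final double[][] planes; // K random d-dimensional vectors

    RandomProjectionLsh(int k, int d, long seed) {
        Random rnd = new Random(seed);
        planes = new double[k][d];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < d; j++) {
                planes[i][j] = rnd.nextGaussian();
            }
        }
    }

    /** Returns a K-bit signature: bit i is 1 if u has a non-negative dot product with plane i. */
    int[] signature(double[] u) {
        int[] bits = new int[planes.length];
        for (int i = 0; i < planes.length; i++) {
            if (u.length != planes[i].length) {
                throw new IllegalArgumentException("expected a vector of length " + planes[i].length);
            }
            double dot = 0.0;
            for (int j = 0; j < u.length; j++) {
                dot += planes[i][j] * u[j];
            }
            bits[i] = dot >= 0 ? 1 : 0;
        }
        return bits;
    }
}
```

Because only the sign of each dot product is kept, vectors pointing in the same direction get identical signatures regardless of their length, which is why this family of hash functions suits cosine similarity.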

Page 16: Social Media Crawling & Mining Seminar

useful techniques III: LSH (b)

Applications
• Approximate Nearest-Neighbour search
  – Typical values for K are 4, 6, 8, which give 16, 64, 256 unique signatures respectively. Given this partitioning of the input collection, instead of searching the whole dataset for the nearest item, one may search only in the subset of items with the same signature.
  – To increase recall, one could use L different hash functions and merge the results.
• Near Duplicate Detection
  – Items with the same or similar signature are considered near duplicates. For this to be more precise, we need higher values of K, e.g. K=12 gives 2^12 = 4,096 unique signatures.
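Both applications rely on grouping items by signature so that a lookup touches only one bucket. A minimal sketch of such a bucket table follows; it assumes signatures arrive as arrays of 0/1 bits, and the class name `SignatureBuckets` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SignatureBuckets {
    // Maps a signature such as "1011" to the ids of the items hashed to it.
    private final Map<String, List<String>> buckets = new HashMap<>();

    /** Registers an item under its signature. */
    void add(String itemId, int[] signature) {
        buckets.computeIfAbsent(key(signature), k -> new ArrayList<>()).add(itemId);
    }

    /** Returns the ids sharing exactly this signature (candidate neighbours or near duplicates). */
    List<String> candidates(int[] signature) {
        List<String> found = buckets.get(key(signature));
        return found == null
            ? Collections.<String>emptyList()
            : Collections.unmodifiableList(found);
    }

    private static String key(int[] signature) {
        StringBuilder sb = new StringBuilder(signature.length);
        for (int bit : signature) {
            sb.append(bit);
        }
        return sb.toString();
    }
}
```

To trade some speed for recall, one could keep L such tables, each fed by a different hash function, and merge the candidate lists before computing exact similarities.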

Page 17: Social Media Crawling & Mining Seminar

useful techniques IV: querying mongo (a)

• MongoDB stores data in the form of documents, which are JSON-like field and value pairs.

Page 18: Social Media Crawling & Mining Seminar

useful techniques IV: querying mongo (b)

Page 19: Social Media Crawling & Mining Seminar

useful techniques V: querying Solr (a)

• For Solr, each Item is a document with a set of indexed fields, which can be queried in combination.

• Example:
  String query = "(title:Crimea) OR (description:Crimea)";
  SolrQuery solrQuery = new SolrQuery(query);
• Number of results & sorting:
  solrQuery.setRows(100);
  solrQuery.addSortField("publicationTime", ORDER.desc);

Page 20: Social Media Crawling & Mining Seminar

useful techniques V: querying Solr (b)

Additional examples:
• Restrict to a selected time period (minDateTime and maxDateTime are long values; the OR clause is grouped so that the time filter applies to both fields):
  String query = "(title:Crimea OR description:Crimea) AND publicationTime:[" + minDateTime + " TO " + maxDateTime + "]";
• Retrieve results in the order of descending retweets:
  solrQuery.addSortField("retweetsCount", ORDER.desc);
• Retrieve only those Items with >100 retweets (range bounds in square brackets are inclusive, hence 101):
  solrQuery.addFilterQuery("retweetsCount:[101 TO *]");

Page 21: Social Media Crawling & Mining Seminar

useful techniques V: querying Solr (c)

Page 22: Social Media Crawling & Mining Seminar

topic detection: basics

• Trending topic: A news story or discussion that attracts a lot of activity during a given time interval.

• How is it represented?
  – Headline (“Bomb explosion in the Z Embassy of X.”)
  – Set of keywords (“bomb”, “explosion”, “embassy”, etc.)
  – N-grams (“bomb explosion”, “Z Embassy”, etc.)
  – Set of characteristic Items and possibly MediaItems
• There are different categories of topic detection methods:
  – document-pivot: Cluster incoming Items into groups referring to the same topic and try to extract a topic representation by processing the group Items.
  – feature-pivot: Try to extract frequently occurring or trending keywords or phrases that are supposed to correspond to topics.
  – probabilistic-generative: Try to find a generative topic model that underlies the topic distribution on the set of documents (e.g. LDA).

Page 23: Social Media Crawling & Mining Seminar

topic detection: document-pivot

• Objective: Cluster together Items referring to the same topic.
• Simple technique (incremental clustering):
  – Compare each incoming Item with existing topics-clusters and if similarity is higher than a given threshold then assign it to the most similar, otherwise create a new topic.
  – Two issues: a) threshold selection, b) what do you compare with: i) a representative Item per cluster, ii) an aggregate cluster representation (e.g. centroid)
• Other conventional clustering techniques could be tried out: k-means, DBSCAN, hierarchical agglomerative clustering, but most of them are not easy to apply incrementally.
• Another challenge stems from the short length of Tweets, which makes similarity between them high only when they are practically near-duplicate. This leads to the well-known problem of topic fragmentation.
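The incremental clustering rule can be sketched as follows. The sketch makes several assumptions that the slide leaves open: each Item is reduced to a set of tokens, similarity is the Jaccard coefficient, and each cluster is represented by the tokens of its first Item (option i above). The class name `IncrementalClusterer` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class IncrementalClusterer {
    private final double threshold;
    // Token set of the first Item of each cluster, used as its representative.
    private final List<Set<String>> representatives = new ArrayList<>();
    private final List<List<String>> members = new ArrayList<>();

    IncrementalClusterer(double threshold) {
        this.threshold = threshold;
    }

    /** Jaccard coefficient: |A ∩ B| / |A ∪ B|; defined as 1 for two empty sets. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    /** Assigns the Item to the most similar cluster, or opens a new one; returns the cluster index. */
    int add(String itemId, Set<String> tokens) {
        int best = -1;
        double bestSimilarity = 0.0;
        for (int i = 0; i < representatives.size(); i++) {
            double similarity = jaccard(tokens, representatives.get(i));
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                best = i;
            }
        }
        if (best >= 0 && bestSimilarity > threshold) {
            members.get(best).add(itemId);
            return best;
        }
        representatives.add(new HashSet<>(tokens));
        members.add(new ArrayList<>(Collections.singletonList(itemId)));
        return representatives.size() - 1;
    }

    int clusterCount() {
        return representatives.size();
    }
}
```

A high threshold reproduces the fragmentation problem noted above: short tweets about the same story rarely share enough tokens, so they end up in separate clusters.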

Page 24: Social Media Crawling & Mining Seminar

topic detection: feature-pivot

• Objective: Try to find terms or phrases that appear very prominently in the dataset.
• Simple technique:
  – Select N most frequent keywords or hashtags.
  – Problem-1: If you consider all keywords, you’ll end up with too many generic words or very broad topics.
  – Problem-2: A single keyword is often insufficient to describe a topic accurately.
• More advanced technique:
  – Select those keywords and phrases that are used more frequently now (in the current time slot) compared to the previous, cf. BN-gram approach (Aiello et al., 2013).
  – Need to compute n-gram frequencies per time slot (use Solr N-Gram Tokenizer and retrieve most frequent n-grams).
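The "more frequent now than before" idea can be illustrated with a simple smoothed ratio between the frequency of a term in the current time slot and in the previous one. This is a deliberately simplified sketch and not the BN-gram scoring of Aiello et al.; the class name `BurstyKeywords`, the add-one smoothing and the tie-breaking rule are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class BurstyKeywords {
    /** Add-one smoothed ratio of current to previous frequency; above 1 means growing interest. */
    static double burstScore(int current, int previous) {
        return (current + 1.0) / (previous + 1.0);
    }

    /** Returns up to n terms of the current slot, ordered by descending burst score. */
    static List<String> topBursty(Map<String, Integer> currentSlot,
                                  Map<String, Integer> previousSlot,
                                  int n) {
        List<String> terms = new ArrayList<>(currentSlot.keySet());
        terms.sort((a, b) -> {
            double scoreA = burstScore(currentSlot.get(a), previousSlot.getOrDefault(a, 0));
            double scoreB = burstScore(currentSlot.get(b), previousSlot.getOrDefault(b, 0));
            int byScore = Double.compare(scoreB, scoreA);
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }
}
```

With this scoring a generic word that is equally frequent in both slots scores close to 1 and drops out of the ranking, which addresses Problem-1; applying the same function to n-gram counts instead of single keywords addresses Problem-2.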

Page 25: Social Media Crawling & Mining Seminar

dataset: SNOW 2014 Data Challenge

• A set of ~1M tweets collected using a list of 5000 UK-focused “news hounds” and the keywords “Syria”, “terror”, “Ukraine”, and “bitcoin” for a period of 24 hours starting from Feb 25, 18:00.

• Average rate: ~720 tweets/minute
• Number of unique twitter accounts: ~556K
• Number of retweets: ~648K
• Number of replies: ~135K
• Ground truth topics: http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755

Page 26: Social Media Crawling & Mining Seminar

overview of hands-on workshop

• introduction (30-40 mins)
  – how to use the stream manager
  – how to index and query content in Solr and mongoDB
  – how to import an existing dataset in the system
• basic analytics (~45 mins)
  – how to compute and maintain statistics for most active/influential users, top hashtags, top tweets
  – how to create activity timelines
• trend detection (~90 mins)
  – existing solutions: a) document-pivot, b) keywords-based
  – work on own implementation: bursty keyword detection
• future work (~10 mins)

Page 27: Social Media Crawling & Mining Seminar

our tutors

Manos Schinas

Katerina Iliakopoulou

Page 28: Social Media Crawling & Mining Seminar


Questions?

Page 29: Social Media Crawling & Mining Seminar

project idea I

• improve topic detection
  – read literature (cf. references at the end)
  – make use of external sources as input
    • RSS feeds
    • Gazetteer of named entities (e.g. Wikipedia) to detect tokens that refer to persons, locations, organizations
  – make use of machine learning
    • train classifiers that can separate high-quality Items from noise
    • train classifiers that can separate trustworthy sources (StreamUsers) from less reliable ones
    • how to create training set?

Page 30: Social Media Crawling & Mining Seminar

project idea II

• integrate additional sources
  – there are several wrappers available for collecting content from multiple social networks
  – try to programmatically use the wrappers to collect social content around events of interest (e.g. get lists of RSS feeds as input)
  – present collected content in a Web UI
  – cf. “meteoroid on steroids” paper (References)

Page 31: Social Media Crawling & Mining Seminar

project idea III

• create an alerting system
  – monitor a set of keywords that relate to specific types of events, e.g. fires, explosions, etc. [try also with Greek keywords]
  – check whether collected Items indeed refer to these events or maybe are irrelevant
    • use topic models (needs training)
    • use writing style rules (to check quality of writing)
  – if number of Items is considerably larger than mean value, set alarm!
    • automatically send email (also include sample Items)
    • create an RSS feed
    • automatically tweet about it, see for instance @WikiLiveMon
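One way to make "considerably larger than mean value" concrete is to compare the Item count of the current time slot with the mean of past slots plus a multiple of their standard deviation. The sketch below assumes exactly that rule; the class name `VolumeAlarm`, the parameter k and the use of the population standard deviation are illustrative choices, not part of the original slides.

```java
final class VolumeAlarm {
    /**
     * Returns true if the current count exceeds mean + k * standard deviation
     * of the historical per-slot counts. An empty history never raises an alarm.
     */
    static boolean shouldAlarm(int[] history, int current, double k) {
        if (history.length == 0) {
            return false;
        }
        double mean = 0.0;
        for (int count : history) {
            mean += count;
        }
        mean /= history.length;

        double variance = 0.0;
        for (int count : history) {
            variance += (count - mean) * (count - mean);
        }
        variance /= history.length;

        return current > mean + k * Math.sqrt(variance);
    }

    public static void main(String[] args) {
        int[] history = {10, 12, 9, 11, 8};
        System.out.println(shouldAlarm(history, 40, 3.0));
        System.out.println(shouldAlarm(history, 12, 3.0));
    }
}
```

When the alarm fires, the follow-up actions listed above (email, RSS feed, tweet) would be triggered; note that a perfectly constant history has zero deviation, so any increase would raise an alarm unless a minimum margin is added.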

Page 32: Social Media Crawling & Mining Seminar

project idea IV

• create a geo-topic detector
  – monitor geotagged Items (around a given list of bounding boxes)
  – find trending topics per location
  – monitor these locations for a longer time
  – find persistent topics per location
  – find unique topics per location (i.e. topics that do not appear in other locations)
  – visualize the results on a web UI
  – See: http://trendsmap.com/

Page 33: Social Media Crawling & Mining Seminar

project idea V

• create a twitter account profiler
  – monitor a set of selected twitter accounts
  – analyze tweets from these accounts with respect to keywords and shared URLs
  – categorize tweets by these accounts (e.g. economy, politics, sports, etc.)
  – create topic profiles for each account (e.g. user X: 10% sports, 60% economy, 30% politics)
  – create a user profile search engine (e.g. “give me accounts that mostly discuss sports”)
  – See: http://wefollow.com/

Page 34: Social Media Crawling & Mining Seminar

Thank You!

Contact
@sympapadopoulos

Check out
https://github.com/socialsensor/
http://www.slideshare.net/sympapadopoulos/

Acknowledgements

Page 35: Social Media Crawling & Mining Seminar


references

• L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, A. Jaimes. “Sensing trending topics in Twitter.” IEEE Transactions on Multimedia 15(6), 2013: 1268-1282.

• SNOW 2014 Data Challenge Proceedings: http://ceur-ws.org/Vol-1150/

• T. Steiner. “A meteoroid on steroids: ranking media items stemming from multiple social networks.” In Proceedings of the 22nd International Conference on World Wide Web Companion (pp. 31-34), 2013. http://www2013.org/companion/p31.pdf