Social media crawling and mining [exercises]

Lecture @ International Hellenic University, Thessaloniki, 8 May 2014

Social Media Crawling and Mining: Overview of Hands-on Workshop

Symeon (Akis) Papadopoulos, Manos Schinas, Katerina Iliakopoulou, Yiannis Kompatsiaris

Information Technologies Institute (ITI), Centre for Research & Technologies Hellas (CERTH)

IHU SocialSensor Seminar – May 2014, CERTH-ITI

Stream Manager

Supports search by:
1. Keywords
2. Users
3. Locations

Supports storage to:
1. MongoDB
2. Solr

Supports retrieval from:
1. Twitter
2. Facebook
3. Google+, etc.

Configuration files: input.conf.xml, streams.conf.xml

How to run:

java -jar StreamsManager.jar streams.conf.xml input.conf.xml

Items, MediaItems and StreamUsers

Item class, basic fields:
  String id
  String title
  String[] tags
  long publicationTime
  String uid
  String reference
  String referenceUserId
  String[] mentions

MediaItem class, basic fields:
  String id
  String title
  String[] tags
  long publicationTime
  String uid
  String reference

StreamUser class, basic fields:
  String id
  String username
  String url
  int items
  long followers
  long friends

Getters / Setters for each field
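As a rough illustration of the "fields plus getters / setters" layout described above, here is a minimal sketch of the smallest of the three classes. The class name StreamUserSketch is hypothetical and the code is not the SocialSensor source; only the field names and types are taken from the slide.

```java
// Hypothetical stand-in for the workshop's StreamUser class (assumption:
// a plain bean with one private field and one getter/setter pair per field).
final class StreamUserSketch {
    private String id;        // e.g. "Twitter#<numeric id>"
    private String username;
    private String url;
    private int items;        // number of items posted by the user
    private long followers;
    private long friends;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getUsername() { return username; }
    public void setUsername(String username) { this.username = username; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    public int getItems() { return items; }
    public void setItems(int items) { this.items = items; }

    public long getFollowers() { return followers; }
    public void setFollowers(long followers) { this.followers = followers; }

    public long getFriends() { return friends; }
    public void setFriends(long friends) { this.friends = friends; }
}
```

Item and MediaItem would presumably follow the same pattern with their own field lists.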

MongoDB – Import Data

mongoimport -h localhost -d Snow14 -c Items --file ../../Items

mongoimport -h localhost -d Snow14 -c MediaItems --file ../../MediaItems

MongoDB – Direct Queries

1. Find an Item by its id

db.Items.find({"id" : "Twitter#438612090748416"})

2. Find all Items posted before a certain date (publicationTime is in milliseconds since the Unix epoch)

db.Items.find({"publicationTime" : {$lt : 1393408367000}})

3. Find a Media Item by its reference

db.MediaItems.find({"reference" : "Twitter#438612090748416"})

4. Find all Users with at least 1000 followers

db.StreamUsers.find({"followers" : {$gte : 1000}})
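The date bound in query 2 is a plain number, so it helps to see where such a value comes from. The sketch below converts a UTC timestamp to epoch milliseconds with java.time and prints the corresponding shell filter. The class and method names are hypothetical; the value 1393408367000 used in the query corresponds to 2014-02-26T09:52:47Z.

```java
import java.time.Instant;

// Hypothetical helper: turns a readable UTC timestamp into the numeric
// bound used when filtering on publicationTime.
final class PublicationTimeBound {

    // Parses an ISO-8601 UTC timestamp and returns milliseconds since the epoch.
    static long toEpochMillis(String isoUtc) {
        return Instant.parse(isoUtc).toEpochMilli();
    }

    // Builds the text of a "posted before" filter for the mongo shell.
    static String beforeFilter(String isoUtc) {
        return "{\"publicationTime\" : {$lt : " + toEpochMillis(isoUtc) + "}}";
    }

    public static void main(String[] args) {
        System.out.println(beforeFilter("2014-02-26T09:52:47Z"));
    }
}
```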

MongoDB – Query using DAO classes

1. Create an instance of ItemDAO to retrieve items

ItemDAO itemDAO = new ItemDAOImpl("localhost", "Snow14", "Items");

2. Create an instance of MediaItemDAO to retrieve mediaItems

MediaItemDAO mediaItemDAO = new MediaItemDAOImpl("localhost", "Snow14", "MediaItems");

3. Create an instance of StreamUserDAO to retrieve users

StreamUserDAO userDAO = new StreamUserDAOImpl("localhost", "Snow14", "StreamUsers");

Queries through the DAO instances created above:

1. Find an Item by its id

itemDAO.getItem("Twitter#438612090748416");

2. Find a Media Item by its reference

List<String> items = new ArrayList<String>();
items.add("Twitter#438612090748416");
mediaItemDAO.getMediaItemsForItems(items, "image", 20);

3. Find the 1000 latest Items

itemDAO.getLatestItems(1000);

MongoDB – Generic queries & Iteration

Use the BasicDBObject class to represent JSON objects,

e.g. {"id" : "Twitter#1234567"} ->

BasicDBObject query = new BasicDBObject("id", "Twitter#1234567");

List<Item> items = itemDAO.getItems(query);

To iterate:

ItemIterator it = itemDAO.getIterator(query);

Use the methods hasNext() and next() to iterate over the collection of Items.

Solr – Query using SocialSensor wrappers

1. Create an instance of SolrItemHandler to index and retrieve items

SolrItemHandler itemHandler = SolrItemHandler.getInstance("http://localhost:8080/solr/Items");

2. Create an instance of SolrMediaItemHandler to index and retrieve mediaItems

SolrMediaItemHandler mediaItemHandler = SolrMediaItemHandler.getInstance("http://localhost:8080/solr/MediaItems");

Solr – Use of UI and SocialSensor wrappers

Assignment #1

Index all the items from MongoDB to Solr.

Fill in the method eu.socialsensor.ihu_workshop.indexItems

Assignment #2

Run the following queries to get relevant Items:

Q1: terror attack
Q2: Crimea
Q3: Bitcoin

Basic Social Media Analytics

Assignment #1

1. Find the N most frequent hashtags in a list of Items
   1. Process all items in the list one by one
   2. Create a map of all detected hashtags and their number of occurrences
   3. Select the N hashtags with the highest counts

2. Find the N most frequent terms in a list of Items using tokenization

3. Find the N most re-tweeted tweets in the dataset
   1. Process all items in the collection one by one
   2. Create a map from each item (item id) to its number of retweets
   3. Select the N items with the highest values
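The count-then-select steps above can be sketched as follows. This is one possible approach, not the workshop solution: it assumes the item text is available as plain strings and that a hashtag is a '#' followed by word characters. The class name HashtagCounter is hypothetical. If the tags field of Item already holds the hashtags, the regular expression step could be skipped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashtagCounter {
    // Assumed hashtag shape: '#' followed by one or more word characters.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // Steps 1-2: scan every text and count each lower-cased hashtag.
    static Map<String, Integer> countHashtags(List<String> texts) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : texts) {
            Matcher m = HASHTAG.matcher(text);
            while (m.find()) {
                counts.merge(m.group(1).toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Step 3: keys with the N highest counts; ties are broken alphabetically.
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

The topN method works on any count map, so the same selection step could serve the term and retweet tasks once their maps are built.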

4. Find the N top users based on:
   a) Number of posted items
   b) Aggregated number of retweets
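Both rankings come down to grouping by user and then sorting. The sketch below assumes each item exposes an author id (uid) and a retweet count; the Post class is a hypothetical stand-in, since the basic Item fields listed earlier do not show how the retweet count is stored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TopUsers {
    // Hypothetical minimal view of an item: author id plus retweet count.
    static final class Post {
        final String uid;
        final long retweets;
        Post(String uid, long retweets) {
            this.uid = uid;
            this.retweets = retweets;
        }
    }

    // a) Number of posted items per user.
    static Map<String, Long> postsPerUser(List<Post> posts) {
        Map<String, Long> counts = new HashMap<>();
        for (Post p : posts) {
            counts.merge(p.uid, 1L, Long::sum);
        }
        return counts;
    }

    // b) Sum of the retweets received by each user's items.
    static Map<String, Long> retweetsPerUser(List<Post> posts) {
        Map<String, Long> totals = new HashMap<>();
        for (Post p : posts) {
            totals.merge(p.uid, p.retweets, Long::sum);
        }
        return totals;
    }

    // User ids with the N highest scores; ties are broken alphabetically.
    static List<String> topN(Map<String, Long> scores, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> {
            int byScore = Long.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

The two criteria can rank users differently: a user with few but widely retweeted items may lead ranking b) while being absent from the top of ranking a).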

5. Create an activity timeline for the tweets in the dataset and for the set of original tweets

6. Create the timeline of the tweets that contain a hashtag (or keyword) of your choice

7. Try to visualize the timelines you have created in the previous steps.
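One simple way to build such a timeline is to bucket publicationTime values into fixed-length slots and count the items per slot. The sketch below assumes timestamps in epoch milliseconds and renders a plain-text bar chart as a stand-in for a real plot; the class and method names are hypothetical.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class ActivityTimeline {

    // Counts timestamps per slot; each key is the slot start in epoch milliseconds.
    static SortedMap<Long, Integer> build(List<Long> publicationTimes, long slotMillis) {
        SortedMap<Long, Integer> timeline = new TreeMap<>();
        for (long t : publicationTimes) {
            long slotStart = Math.floorDiv(t, slotMillis) * slotMillis;
            timeline.merge(slotStart, 1, Integer::sum);
        }
        return timeline;
    }

    // Renders one line per non-empty slot: UTC slot start followed by one '#' per item.
    static String render(SortedMap<Long, Integer> timeline) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Long, Integer> e : timeline.entrySet()) {
            sb.append(Instant.ofEpochMilli(e.getKey())).append(' ');
            for (int i = 0; i < e.getValue(); i++) {
                sb.append('#');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

For the timeline of original tweets, or of tweets containing a chosen hashtag, the list of timestamps would be filtered before calling build. Slots with no items do not appear in the map, which matters if the result is fed to a plotting tool that expects evenly spaced points.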

Detection of Trending Topics and Events

What is a trending topic?

Keywords, n-grams, named entities or phrases that are shared heavily in social media during a certain period of time.

Assignment #2: Feature-pivot topic detection using hashtags

1. Baseline method: split the data into timeslots of equal length and calculate the most frequent hashtags of each timeslot.

2. Calculate the most trending hashtags by comparing the current frequency of each hashtag with its frequencies in the previous timeslots.
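The slide does not fix a formula for comparing the current frequency with earlier ones, so the sketch below picks one simple option as an assumption: score = current count / (1 + mean count over the previous timeslots). The input is a list of per-timeslot hashtag counts, oldest first, with the last entry being the current timeslot. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TrendingHashtags {

    // Scores every hashtag of the current (last) slot against its history.
    // Requires at least one slot. Assumed score: current / (1 + mean of previous slots).
    static Map<String, Double> scores(List<Map<String, Integer>> slots) {
        Map<String, Integer> current = slots.get(slots.size() - 1);
        int previousSlots = slots.size() - 1;
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : current.entrySet()) {
            double sum = 0;
            for (int i = 0; i < previousSlots; i++) {
                sum += slots.get(i).getOrDefault(e.getKey(), 0);
            }
            double mean = previousSlots == 0 ? 0.0 : sum / previousSlots;
            result.put(e.getKey(), e.getValue() / (1.0 + mean));
        }
        return result;
    }

    // Returns the hashtag with the highest score, or null if the current slot is empty.
    static String mostTrending(List<Map<String, Integer>> slots) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores(slots).entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

With this score, a hashtag that is steadily frequent stays close to 1, while one that jumps from near zero scores much higher, which is the difference between "most frequent" and "most trending".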

Assignment #2: Document-pivot event detection by clustering tweets

Cluster “similar” tweets to create groups of tweets that represent candidate events.

The similarity between two tweets could be a combination of similarity measures across different dimensions, e.g. textual similarity, proximity in time and space, etc.
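To make the clustering idea concrete, here is a minimal sketch that uses only the textual dimension: Jaccard similarity between token sets, combined with a single-pass scheme in which each tweet joins the first cluster whose seed tweet is similar enough. This is an illustrative simplification with hypothetical names, not the clustering provided by SocialSensor; time and space proximity would have to be added as further terms of the similarity.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TweetClustering {

    // Lower-cases the text and splits it on anything that is not a letter, digit, '#' or '@'.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9#@]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 0 for two empty sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    // Single pass: a tweet joins the first cluster whose seed is similar enough,
    // otherwise it starts a new cluster and becomes that cluster's seed.
    static List<List<String>> cluster(List<String> tweets, double threshold) {
        List<Set<String>> seeds = new ArrayList<>();
        List<List<String>> clusters = new ArrayList<>();
        for (String tweet : tweets) {
            Set<String> tok = tokens(tweet);
            int target = -1;
            for (int i = 0; i < seeds.size() && target < 0; i++) {
                if (jaccard(tok, seeds.get(i)) >= threshold) {
                    target = i;
                }
            }
            if (target < 0) {
                seeds.add(tok);
                clusters.add(new ArrayList<>());
                target = clusters.size() - 1;
            }
            clusters.get(target).add(tweet);
        }
        return clusters;
    }
}
```

The result depends on the order of the tweets and on the threshold, which is a known limitation of single-pass schemes; it is used here only because it fits in a few lines.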

Assignment #2: Frequency-pivot event detection by clustering tweets

1. Run the document-pivot clustering provided by SocialSensor to create a set of candidate events.

2. For each candidate event produced, find a list of representative hashtags.

3. Try to calculate a measure of “trendiness” for each event.