Network Analytics meets Text Mining for Social Media Analysis

Post on 26-Feb-2016

115 views 5 download

Tags:

description

Network Analytics meets Text Mining for Social Media Analysis. Dr. Rosaria Silipo. Social Media Data Water Water Everywhere , and not a drop to drink. Social Media Data Water Water Everywhere , and not a drop to drink. What companies do with it: Download and keep - PowerPoint PPT Presentation

Transcript of Network Analytics meets Text Mining for Social Media Analysis

Network Analytics meets

Text Mining for Social Media Analysis

Dr. Rosaria Silipo

Social Media DataWater Water Everywhere, and not a drop to drink

2

Social Media DataWater Water Everywhere, and not a drop to drink

What companies do with it:• Download and keep• Topic [Shift] Detection (email content routing,

detect market interest shift, clinical studies, query non structured DBs, ...)

• Sentiment Analysis (marketing, polls, elections, ...)

• Connection Analysis (influencers, risk analysis, ...)

• .... 3

Social Media DataWater Water Everywhere, and not a drop to drink

The Analysis Tools:• Web Crawlers• Visual Exploration• Topic Detection (Text Mining, NLP,

Ontologies)• Sentiment Score (Text Mining, NLP)• Influence Score (Network Analytics)• Find Groups (Predictive Analytics)

4

Case Study Example: Slashdot Data

5

Basic Numbers:

• 24532 users

• 491 threads with • 15 – 843 responses• 12 – 507 users

• 113505 posts

• 60 main topics

• Selected Topic: Politics

Post

Comments

Case Study Example: Slashdot• Very rich data sources about customers

!

• We want to establish:• How users feel about the discussed topic• Whether it matters how users feel• A more general abstraction of the results

6

Sentiment Analysis

Network Analytics

Clustering

Sentiment AnalysisRemove anonymous users, group by PostID Words Tagging

Positive words

Negative words

MPQACorpus

BoW

, Ent

ity F

ilter

, Wor

d Fr

eque

ncy,

A

ttitu

de C

alcu

latio

n by

Doc

umen

t

User Bins

Word cloud for selected users

Total Attitude by User

Slashdot – Text MiningMost Negative User pNutz

Slashdot – Text MiningMost Positive User dada21

Slashdot – Sentiment Analysis• 16016 positive users• 7107 negative users• Most positive user: dada21 (2838 positive/1725 negative

words)• Most negative user: pNutz (43 positive/109 negative

words)• Which Topics have positive users in common ?

– Government– People– Law/s– Money– Market– Parties

11

Network CreationUser

1

User2

User3

User6

User4

User5

Topic Graphs

12

14

Topic Graph: NASA

15

Topic Graph: Sci-Fi

Hubs & Authorities

16

• Hubs = Followers• Authorities = Leaders

Filtering anonymous users and creating network Centrality index to define hub weight

and authority weight

Users with hub and authority weights and

other features

Hubs & Authorities

17

dada21

Doc Ruby

Carl Bialik from the WSJ

pNutz

99BottlesOfBeerInMyF

Tube Steak

18

KNIME: Bringing it all together

Network Analysis

Text Analysis

Users with hub and authority weights and other features

Users bins: positive, negative, neutral

19

Carl Bialik from the WSJ

dada21

Doc Ruby

99BottlesOfBeerInMyFWeb

Hos

ting

Guy

pNutz

Tube Steak

Catbeller

What we have found ...- The positive

leaders- The neutral

leaders- The negative

leaders- The inactive users

20

What identifies each group?How do I identify a new user?How do I handle each user?

Why Clustering?

- No a priori knowledge (not even on a subset of users)

- Prediction and interpretation capabilities required

21

k-Means algorithm

Re-sampling the Training Set

23

k = 10

The k-Means Clusters

24

The k-Means Clusters

25

Superfans

Negativeusers

Neutralusers

Fans

Additional Discoveries• There are only very few real leaders!

Authority and hub scores identify active participants rather than leaders.

• Superfans can be found in cluster_3 • Negative and (sigh!) active users are collected in

cluster_1.• Neutral users are usually inactive (cluster_2,

cluster_7, and cluster_8)• Positive users with different degrees of activity are

scattered across the remaining clusters.26

The operational Workflow

27

Pre-processingCluster Extraction

Assignment of new data

Notes• MPQA Corpus: publicly available

Subjectivity Lexicon (http://www.cs.pitt.edu/mpqa/lexicons.html)

• User Characterization is Sum -> Mean• NLP: No sentence splitting, no negation

identification.• For a more refined syntax-based

sentiment analysis -> „External Tool“ node 28

External Tool NodeThe „External Tool“ node executes any

external program from command line1. Writes input data to an input file2. Calls Tool to run on input file and command

line options and to write results to output file

3. Reads output file and presents data at output port

29

Alternative Sentiment AnalysisFree non-interactive Command Line

running Tools for Sentiment Analysis not found

SentiStrength v2.2 (still interactive)

30External Tool and

Generic Web Service Client

Web Crawling Workflow

31

Community Web Crawler Node

XML Parsing Nodes

Next Steps- Integrate topic information- Integrate user demographic and

behavioural information- Discover [time series] patterns for

early detection of negative users and superfans

- Try other techniques, maybe even on manually segmented data, to discover new user segments 32

Where do I find more?Whitepaper: rosariasilipo@yahoo.com Complete Workflows + Data:

www.knime.com - text mining- network mining- combined analysis

(note the above 3 process huge data and require 16G memory)– clustering

Open Source Software: KNIME www.knime.com 33

Next Appointment

User Day US Boston (free)

October 22nd 2013 10:00 -17:00Microsoft New England R&D Center (NERD) One Memorial Drive, Suite 100, Cambridge

http://www.knime.com/user-day-boston-2013

34

Hands-on Session

1. Download KNIME from www.knime.com

35

Hands-on Session2. Install ExtensionsHelp -> Install New

SoftwareSelect:

• KNIME & Extensions

In KNIME Labs Extensions, select:

• KNIME Network Mining• KNIME Textprocessing

36

Hands-on Session3. Get workflows and Slashdot data

• Get workflows from USB stick (KNIMEBoston2013.zip)• Text Mining• Network Analytics• Text and Network Mining• Social Media Clustering

• Slashdot Raw Data is included in the downloaded workflows

• A smaller set of data is available, Slashdot Reduced Data, for lower memory requirements

• Both data sets are available from USB Stick 37

Hands-on Session3. Import Workflows

38

Hands-on SessionMemory Increase in knime.ini-startupplugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar--launcher.libraryplugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20110502-vmargs-Xmx2G-XX:MaxPermSize=256m-server-Dsun.java2d.d3d=false-Dosgi.classloader.lock=classname-XX:+UnlockDiagnosticVMOptions-XX:+UnsyncloadClass-Dknime.enable.fastload=true-Djava.library.path=C:\Users\rosy\Documents\R\win-library\2.15\rJava\jri\x64

39

Hands-on Session

5. Improve Workflows: Text Mining

40

Data Reading

Data Preprocessing

Reading Tag Corpus

Tagging Words

BoW

Scoring and Tag Cloud

Hands-on Session

6. Improve Workflows: Network Analytics

41

Data Reading andpreprocessing

Create Network Object Clean up Network

Visualize Network

zoomba

42

nahdude812

43