Connecting Users across Social Media Sites: A Behavioral-Modeling Approach

Post on 06-Jan-2016


Connecting Users across Social Media Sites: A Behavioral-Modeling Approach. Reza Zafarani and Huan Liu, Data Mining and Machine Learning Laboratory (DMML), Arizona State University. KDD 2013 – Chicago, Illinois. How hard can it be to identify an individual across sites?

Transcript of Connecting Users across Social Media Sites: A Behavioral-Modeling Approach

Connecting Users across Social Media Sites: A Behavioral-Modeling Approach

REZA ZAFARANI AND HUAN LIU

DATA MINING AND MACHINE LEARNING LABORATORY (DMML)

ARIZONA STATE UNIVERSITY

KDD 2013 – CHICAGO, ILLINOIS

How hard can it be to identify an individual across sites?

Privacy Experts Claim Advertisers Know a lot about People

Can they stop showing you the same repetitive ads across sites?

More information about individuals

Many social media sites

Partial Information

Complementary Information

Better User Profiles

Profile of Huan Liu on two sites:

Attribute    Facebook     Google+
Age          N/A          N/A
Location     Tempe, AZ    USA
Education    USC          USC (1985-89)

Can we connect individuals across sites?

Connectivity is not available

Consistency in Information Availability

Can we verify that the information provided across sites belongs to the same individual?

MOdeling Behavior for Identifying Users across Sites

Human behavior generates Information redundancy

Information shared across sites

provides a behavioral fingerprint

MOBIUS

- Behavioral Modeling

- Minimum Information

Identification Function

Minimum information available on ALL sites:

Usernames

Candidate Username (john.smith)

Prior Usernames ({jsmith, john.s})
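The identification function above can be read as a yes/no decision over a candidate username and a set of prior usernames. The following is a minimal sketch of that interface only, not the MOBIUS model: `similarity_score`, `identify`, and the 0.5 threshold are hypothetical names and values, and the scorer is a plain string-similarity stand-in from Python's standard library.

```python
from difflib import SequenceMatcher
from typing import Iterable


def similarity_score(candidate: str, priors: Iterable[str]) -> float:
    # Stand-in scorer (assumption): best similarity ratio against any prior username.
    return max(
        (SequenceMatcher(None, candidate, p).ratio() for p in priors),
        default=0.0,
    )


def identify(candidate: str, priors: Iterable[str], threshold: float = 0.5) -> bool:
    # True means "the candidate username likely belongs to the owner of the priors".
    # The threshold is an arbitrary illustrative value, not taken from the slides.
    return similarity_score(candidate, priors) >= threshold


# Usernames from the slide: candidate john.smith, priors {jsmith, john.s}.
decision = identify("john.smith", ["jsmith", "john.s"])
```

In the approach described in the slides, the scorer would be a classifier trained on behavioral features rather than a single similarity ratio.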

[Diagram: Behaviors 1 to n generate information redundancy, which is captured via Feature Sets 1 to n. The feature sets and data feed a learning framework that produces the identification function.]

Behaviors

Human Limitation

Time & Memory Limitation

Knowledge Limitation

Exogenous Factors

Typing Patterns

Language Patterns

Endogenous Factors

Personal Attributes & Traits

Habits

Using Same Usernames

Username Length Likelihood

Time and Memory Limitation

59% of individuals use the same username
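The two signals on this slide, reusing the same username and choosing a username of a familiar length, can be sketched as simple features. This is an illustrative sketch: the function name `length_features` and the exact feature set are assumptions, not the feature definitions used in the paper.

```python
from statistics import mean


def length_features(candidate, priors):
    """Compare a candidate username with prior usernames (priors must be non-empty)."""
    lengths = [len(p) for p in priors]
    return {
        # Exact reuse of an earlier username.
        "same_as_prior": candidate in priors,
        "candidate_length": len(candidate),
        "min_prior_length": min(lengths),
        "max_prior_length": max(lengths),
        "mean_prior_length": mean(lengths),
        # Does the candidate's length fall inside the range seen so far?
        "within_prior_range": min(lengths) <= len(candidate) <= max(lengths),
    }


features = length_features("john.smith", ["jsmith", "john.s"])
```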


Limited Vocabulary

Limited Alphabet

Knowledge Limitation

Identifying individuals by their vocabulary size

Alphabet size is correlated with language: शमंत कुमार -> Shamanth Kumar
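A limited alphabet can be turned into features by comparing the set of characters in the candidate username with the characters seen in prior usernames. The sketch below is an assumption about how such features might look; `alphabet_features` and its keys are hypothetical names.

```python
def alphabet_features(candidate, priors):
    """Character-set overlap between a candidate and prior usernames."""
    prior_chars = set("".join(priors))
    candidate_chars = set(candidate)
    return {
        "candidate_alphabet_size": len(candidate_chars),
        "prior_alphabet_size": len(prior_chars),
        "shared_alphabet_size": len(candidate_chars & prior_chars),
        # Characters the user has not been observed to use before.
        "new_characters": sorted(candidate_chars - prior_chars),
    }


same_user = alphabet_features("john.smith", ["jsmith", "john.s"])
other_user = alphabet_features("xq9zzk", ["jsmith", "john.s"])
```

A candidate that introduces many unseen characters, as in `other_user`, is weaker evidence for a shared owner than one drawn entirely from the known alphabet.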

Typing Patterns

QWERTY Keyboard Variants: AZERTY, QWERTZ

DVORAK Keyboard

Keyboard type impacts your usernames

QWER1234 AOEUISNTH
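One way to see how keyboard layout leaves a trace in usernames is to measure how often consecutive characters sit on the same keyboard row. The sketch below is illustrative only: the row tables cover letters and digits, punctuation keys are ignored, and `same_row_fraction` is a hypothetical feature name rather than one defined in the slides.

```python
# Letter and digit rows only; punctuation keys are left out (assumption).
QWERTY_ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]
DVORAK_ROWS = ["1234567890", "pyfgcrl", "aoeuidhtns", "qjkxbmwvz"]


def row_index(layout_rows):
    """Map each key to the index of the row it sits on."""
    return {key: i for i, row in enumerate(layout_rows) for key in row}


def same_row_fraction(username, layout_rows):
    """Fraction of adjacent character pairs typed on the same keyboard row."""
    rows = row_index(layout_rows)
    keys = [rows[c] for c in username.lower() if c in rows]
    pairs = list(zip(keys, keys[1:]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


qwerty_example = same_row_fraction("QWER1234", QWERTY_ROWS)
dvorak_example = same_row_fraction("AOEUISNTH", DVORAK_ROWS)
dvorak_on_qwerty = same_row_fraction("AOEUISNTH", QWERTY_ROWS)
```

Under these row tables, AOEUISNTH stays entirely on the DVORAK home row, while the same string scored against QWERTY jumps between rows.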

Modifying Previous Usernames

Creating Similar Usernames

Username Observation Likelihood

Habits - old habits die hard

Adding Prefixes/Suffixes, Abbreviating, Swapping or Adding/Removing Characters

Nametag and Gateman
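Habits such as adding prefixes or suffixes, abbreviating, and adding or removing characters tend to leave a candidate username close to an earlier one. A minimal sketch, assuming edit distance and longest common substring as the closeness measures (the slides do not name specific measures, and `habit_features` is a hypothetical name):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def longest_common_substring(a, b):
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, curr[-1])
        prev = curr
    return best


def habit_features(candidate, priors):
    """Closest prior username under each measure (priors must be non-empty)."""
    return {
        "min_edit_distance": min(edit_distance(candidate, p) for p in priors),
        "max_common_substring": max(
            longest_common_substring(candidate, p) for p in priors
        ),
    }


features = habit_features("john.smith", ["jsmith", "john.s"])
```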

Usernames come from a language model
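The language-model view can be illustrated with a character bigram model fitted to a user's prior usernames and then used to score a candidate. This is a sketch under stated assumptions: the bigram order, add-one smoothing, and the function names are illustrative choices, not details given in the slides.

```python
import math
from collections import Counter


def train_bigram_model(usernames):
    """Count character bigrams, with ^ and $ marking the start and end."""
    bigrams, contexts, alphabet = Counter(), Counter(), set()
    for name in usernames:
        padded = "^" + name + "$"
        alphabet.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, alphabet


def avg_log_prob(name, model):
    """Average per-bigram log-probability of a username, add-one smoothed."""
    bigrams, contexts, alphabet = model
    vocab = len(alphabet) + 1  # one extra bucket for unseen characters
    padded = "^" + name + "$"
    pairs = list(zip(padded, padded[1:]))
    total = sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab)) for a, b in pairs
    )
    return total / len(pairs)


model = train_bigram_model(["jsmith", "john.s"])
likely = avg_log_prob("john.smith", model)
unlikely = avg_log_prob("xq9zzk", model)
```

A candidate built from character sequences the user has produced before scores higher than one with no overlap, which is the sense in which a username can be "observed" under the user's own model.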

Experiment Setup

Data:

200,000 instances (50% class balance)

414 Features

Previous Methods:

1) Zafarani and Liu, 2009

2) Perito et al., 2011

Baselines:

3) Exact Username Match

4) Substring Match

5) Patterns in Letters
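Baselines 3 and 4 are simple rules that can be scored by accuracy on labeled instances. The sketch below assumes that a substring match means one username contains the other, ignoring case; the slides do not define it precisely. The toy instances are made up for illustration and are not the paper's data.

```python
def exact_match(candidate, priors):
    """Baseline: the candidate is identical to some prior username."""
    return candidate in priors


def substring_match(candidate, priors):
    """Baseline (assumed definition): one username contains the other."""
    c = candidate.lower()
    return any(c in p.lower() or p.lower() in c for p in priors)


def accuracy(rule, instances):
    """Fraction of (candidate, priors, label) instances the rule gets right."""
    correct = sum(rule(c, p) == label for c, p, label in instances)
    return correct / len(instances)


# Made-up toy instances; label True means "same individual".
toy_instances = [
    ("jsmith", ["jsmith", "john.s"], True),
    ("john.smith", ["john.s"], True),
    ("xq9zzk", ["jsmith"], False),
    ("smith", ["jsmith"], False),
]

exact_accuracy = accuracy(exact_match, toy_instances)
substring_accuracy = accuracy(substring_match, toy_instances)
```

The two rules fail on different instances: exact matching misses modified usernames, while substring matching accepts a common fragment that belongs to someone else.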

MOBIUS Performance

Method                    Accuracy (%)
Exact Username Match      77
Substring Matching        63.12
Patterns in Letters       49.25
Zafarani and Liu          66
Perito et al.             77.59
MOBIUS (Naïve Bayes)      91.38

Choice of Learning Algorithm

Algorithm                     Accuracy (%)
Naïve Bayes                   91.38
J48                           90.87
Random Forest                 93.59
L2-reg L2-Loss SVM            93.7
L1-reg L2-Loss SVM            93.71
L2-reg Logistic Regression    93.77
L1-reg Logistic Regression    93.8

Diminishing Returns for Adding More Usernames

Discover applications of connecting users across sites

Information shared across sites acts as a behavioral fingerprint

Human Behavior Results in Information Redundancy

Incorporating features indigenous to specific sites

A methodology for connecting individuals across sites

- A behavioral modeling approach

- Uses minimum information across sites

- Allows for integration of additional behaviors when required

Conclusions + Future Work