Post on 17-Jan-2016

Social Networks and Surveillance: Evaluating Suspicion by Association

Ryan P. Layfield
Dr. Bhavani Thuraisingham
Dr. Latifur Khan
Dr. Murat Kantarcioglu

The University of Texas at Dallas

{layfield, bxt043000, lkhan, muratk}@utdallas.edu

Overview

Introduction
► Our Goal
► System Design
► Social Networks
► Threat Detection
► Correlation Analysis

The Experiment
► Setup
► Current Results
► Issues
► Future Work

Introduction

Automated message surveillance is essential to communication monitoring
► Widespread use of electronic communication
► Exponential data growth
► Impossible to sift through all ‘by hand’

Going beyond basic surveillance
► Identifying groups rather than individuals
► Monitoring conversations rather than messages

Our Goal

Design new techniques and apply existing algorithms to…
► Create a machine-understandable model of existing social networks
► Identify abnormal conversations and behavior
► Monitor a given communications system in real-time
► Continuously learn and adapt to a dynamic environment

System Design

Three major components:
► Social Network Modeler
► Initial Activity Detector
► Correlated Activity Investigator

Social Networks

Individuals engaged in suspicious or undesirable behavior rarely act alone

We can infer that those associated with a person positively identified as suspicious have a high probability of being either:
► Accomplices (participants in suspicious activity)
► Witnesses (observers of suspicious activity)

Making these assumptions, we create a context of association between users of a communication network

Social Networks

Within our model:
► Every node is a unique user
► Every message creates or strengthens a link between nodes

Over time, the network changes
► Frequent communication leads to stronger links
► Intermittent messaging implies weakening social ties

The strength of a link indicates how strong the association between individuals is

From this data, we can theoretically identify
► Hubs
► Groups
► Liaisons
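The link model described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `SocialNetwork`, the `reinforcement` and `decay` parameters, the use of undirected links, and the "total link strength" definition of a hub are all hypothetical choices, not details given in the slides.

```python
from collections import defaultdict


class SocialNetwork:
    """Weighted graph of users: messages strengthen links, time weakens them."""

    def __init__(self, reinforcement=1.0, decay=0.9):
        self.reinforcement = reinforcement  # assumed weight added per message
        self.decay = decay                  # assumed multiplier per time step
        self.links = defaultdict(float)     # frozenset({a, b}) -> link strength

    def record_message(self, sender, recipient):
        # Every message creates or strengthens the link between two users.
        self.links[frozenset((sender, recipient))] += self.reinforcement

    def tick(self):
        # Links that are not refreshed by new messages gradually weaken.
        for pair in self.links:
            self.links[pair] *= self.decay

    def strength(self, a, b):
        return self.links.get(frozenset((a, b)), 0.0)

    def hubs(self, top=1):
        # Assumed hub measure: users with the largest total link strength.
        totals = defaultdict(float)
        for pair, weight in self.links.items():
            for user in pair:
                totals[user] += weight
        return sorted(totals, key=totals.get, reverse=True)[:top]
```

For example, after two messages between `alice` and `bob`, one between `alice` and `carol`, and one call to `tick()`, `alice` is the strongest hub and her link to `bob` has decayed from 2.0 to 1.8.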

Threat Detection

Every message sent is scrutinized in the interest of identifying suspicious communication
► Keyword analysis
► Prior context (i.e. previous message content)

When a detection algorithm yields a strong result, a token is created
► The token is created at the origin and passed to the recipient(s)
► Existing tokens, if any, are cloned instead

The result is a web that potentially reflects the dissemination of suspicious information
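The token mechanism above might look roughly like the following. It is a sketch, not the system's implementation: the `Token` fields, the `SUSPICIOUS_KEYWORDS` set, the plain keyword match used as the "detection algorithm", and the reading that "existing tokens" means tokens already held by the sender are all assumptions made for illustration.

```python
import itertools
import re
from dataclasses import dataclass, replace

SUSPICIOUS_KEYWORDS = {"detonator", "payload"}  # hypothetical watch list
_token_ids = itertools.count(1)


@dataclass(frozen=True)
class Token:
    token_id: int
    origin: str          # user at whom the token was first created
    keywords: frozenset  # keywords that triggered the detection


def detect(text):
    # Stand-in detector: a "strong result" is any watch-list keyword hit.
    words = set(re.findall(r"[a-z']+", text.lower()))
    return frozenset(words & SUSPICIOUS_KEYWORDS)


def process_message(sender, recipients, text, holdings):
    """holdings maps user -> set of Tokens; returns the tokens passed on."""
    hits = detect(text)
    if not hits:
        return []
    held = holdings.setdefault(sender, set())
    if held:
        # Existing tokens are cloned instead of creating a new one.
        passed = [replace(token) for token in held]
    else:
        token = Token(next(_token_ids), sender, hits)
        held.add(token)
        passed = [token]
    for recipient in recipients:
        holdings.setdefault(recipient, set()).update(passed)
    return passed
```

Following the `holdings` map from user to user after a series of messages gives the "web" of token copies that the slide refers to.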

Correlation Analysis

Future messages with similar suspicious topics are not always identifiable with the same ‘initial’ techniques
► Quick replies
► Pronoun use
► Assumption that recipient is aware of topic

If a token is present at the sender when a message is sent:
► The message the token is associated with and the new message are analyzed
► If analysis yields a strong match, the token is further cloned and passed to the recipient
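The follow-up check can be sketched as below. The slides do not say how the two messages are compared at this stage, so Jaccard word overlap is swapped in purely as a placeholder measure; the function names, the `(token_id, source_text)` representation, and the `threshold` value are likewise hypothetical.

```python
import re


def words(text):
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    # Placeholder similarity: shared words divided by all distinct words.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def correlate(sender, recipients, text, tokens_held, threshold=0.3):
    """tokens_held maps user -> list of (token_id, source_text).

    Returns the ids of tokens cloned to the recipients.
    """
    propagated = []
    new_words = words(text)
    for token_id, source_text in list(tokens_held.get(sender, [])):
        # Compare the message that earned the token with the new message.
        if jaccard(words(source_text), new_words) >= threshold:
            for recipient in recipients:
                bucket = tokens_held.setdefault(recipient, [])
                if (token_id, source_text) not in bucket:
                    bucket.append((token_id, source_text))
            propagated.append(token_id)
    return propagated
```

The point of the design is that a terse reply containing no watch-list keyword can still inherit a token, because it is judged against the earlier flagged message rather than in isolation.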

The Experiment

A set of rare words shared between two or more messages is a candidate for keyword analysis, but such words are not always easily sifted from ‘noise’

Noise within text-based messages comes in a variety of forms
► Misspelled words
► Unusual word choice
► Incompatible variations of the same language (e.g. British vs. American English)
► Unexpected language

However, we do not want to eliminate potential keywords
► Document names
► Terminology specific to a subject
► ‘Buzz’ words

The Experiment

We proposed an experiment that attempts to eliminate false positives due to noisy data while strengthening and expanding our correlation techniques

Setup

Tools
► Running word ‘rank’ database
► Implementation of word set theory infrastructure
► JAMA Matrix Library: Singular Value Decomposition

Our Approach
► Apply SVD noise filtering based on 100 messages
► Analyze word frequency correlation between current message and prior suspicious messages
► Generate a score based on the results

Setup

Construct a matrix based on the last 100 messages

$$w_i \in W, \qquad W = M_1 \cup M_2 \cup \dots \cup M_t$$

$$c_{i,j} = \mathrm{count}(w_i, m_j)$$

Rows: words (more common → less common). Columns: messages.
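As a rough illustration of the word-by-message count matrix, the sketch below builds it from raw message strings. The tokenizer, the function name `build_count_matrix`, and the choice to order rows by total frequency within the window (ties broken alphabetically) are assumptions; the slides only indicate that rows run from more common to less common words.

```python
import re
from collections import Counter


def build_count_matrix(messages, window=100):
    """Return (vocab, matrix) where matrix[i][j] = count of vocab[i] in message j."""
    recent = messages[-window:]  # only the last `window` messages are used
    counts = [Counter(re.findall(r"[a-z']+", m.lower())) for m in recent]

    totals = Counter()
    for message_counts in counts:
        totals.update(message_counts)

    # Rows ordered from more common to less common words.
    vocab = sorted(totals, key=lambda w: (-totals[w], w))
    matrix = [[message_counts[w] for message_counts in counts] for w in vocab]
    return vocab, matrix
```

With the messages `"the cat sat"`, `"the cat ran"`, and `"the dog"`, the first row belongs to `the` with counts `[1, 1, 1]` and the second to `cat` with counts `[1, 1, 0]`.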

Setup

Decompose and rebuild

$$A = U \Sigma V^T$$

Eliminate ‘weak’ singular values
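The decompose-and-rebuild step can be illustrated as follows. The slides name the JAMA library (Java); NumPy is used here only as a stand-in, and keeping a fixed number `keep` of the largest singular values is an assumed rule, since the slides do not say how ‘weak’ values are chosen.

```python
import numpy as np


def svd_filter(a, keep):
    """Rebuild `a` from its `keep` largest singular values, zeroing the rest."""
    u, s, vt = np.linalg.svd(np.asarray(a, dtype=float), full_matrices=False)
    s_filtered = s.copy()
    s_filtered[keep:] = 0.0          # drop the 'weak' singular values
    return (u * s_filtered) @ vt     # equivalent to U @ diag(s_filtered) @ V^T
```

Applied to the word-by-message count matrix, the rebuilt matrix retains the dominant co-occurrence patterns, while small variations, which this approach treats as noise, are smoothed out.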

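Setup

$$\mathrm{score}(w_i) = \frac{\mathrm{count}(w_i, m_j) \cdot \mathrm{count}(w_i, m_k)}{\mathrm{rank}(w_i)}$$

► score(w_i) is the ‘raw’ total score for word w_i
► The counts are pulled from messages j and k
► rank(w_i) is pulled from the ‘running’ word database
► The operator joining the two counts is not legible in the transcript; a product is shown

$$\sum_{w_i \in W_j \cap W_k} \mathrm{score}(w_i)$$

► The sum counts only the intersection of the words of the two messages
► The total is compared against a predefined fixed threshold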
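A minimal sketch of this scoring step is shown below. Several details are assumptions rather than facts from the slides: the two counts are multiplied, `rank` is larger for more common words (so dividing by it discounts them), words with a missing or zero rank are skipped, and the total must meet or exceed the threshold. All function and parameter names are hypothetical.

```python
def word_score(word, counts_j, counts_k, rank):
    """'Raw' score for one word shared by messages j and k."""
    r = rank.get(word, 0.0)
    if r <= 0.0:
        return 0.0  # assumed guard: unranked words contribute nothing
    return counts_j[word] * counts_k[word] / r


def correlation_score(counts_j, counts_k, rank):
    # Only words present in both messages (the intersection) are counted.
    shared = counts_j.keys() & counts_k.keys()
    return sum(word_score(w, counts_j, counts_k, rank) for w in shared)


def is_correlated(counts_j, counts_k, rank, threshold):
    # The total is compared against a predefined fixed threshold.
    return correlation_score(counts_j, counts_k, rank) >= threshold
```

With `counts_j = {"the": 4, "dock": 2}`, `counts_k = {"the": 3, "dock": 1}`, and ranks of 12.0 for `the` and 0.5 for `dock`, the common word contributes 1.0 and the rarer word contributes 4.0.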

Current Results

Method is not currently accurate

Large fluctuations
► Correlation easily swayed by a plethora of common words
► Uncommon words not given enough weight

Current Results

Figure: pie chart, “Accuracy of Results over 900 Messages”. Legend: True Positives, False Positives, True Negatives, False Negatives. Slice values as extracted: 3%, 12%, 59%, 26%.

1000 messages evaluated, first 100 used to seed word ranks.

Issues

Word frequencies fluctuate wildly during the beginning of the experiment (0.0 – 10.0+)

Extreme cost for current construction methods and computation

Filtering context limited to recent global history

Affected by large bodies of text

Future Work

Tap potential of existing matrix for further analysis

Adaptive filtering feedback algorithms

Speed improvements to accommodate real-time streams

Flexible communication platform monitoring

Addition of pipe architecture for modular threat detection and correlation