Csc410 presentation

Location Based Analysis on the Twitter Feed Student Matthew Ross Professor Mitsunori Ogihara

Location Based Analysis on the Twitter Feed

Student: Matthew Ross

Professor: Mitsunori Ogihara

Goals For the Project

● Write software that pulls data from Twitter based on location

● Gain insights into the language used in different cities

● Become well versed in web technologies and software development practices

Technology Used

Tasks

Step 1: Collect and format data from the Twitter API

Step 2: Structure the collected data

Step 3: Perform data analysis on the structured data

Step 1: Let me in!

Step 2: Crunching Numbers

Step by Step

● For each set of tweets

● Read each tweet into an ArrayList

o Find all HashTags

o Find all User Mentions

o Find all Uppercase Words

● Remove the stop words

o For every tweet String

o Take each word and insert it into an ArrayList

o Store each ArrayList in an ArrayList: ArrayList<ArrayList<String>>

o Iterate through the ArrayList of stop words

o If any individual word in the data structure matches a stop word, replace it with STOP
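
The stop-word steps above could be sketched in Java roughly as follows. This is a minimal illustration, not the project's actual code: the class name StopWordFilter, the whitespace tokenization, and the case-insensitive match are assumptions, and a HashSet lookup stands in for iterating through an ArrayList of stop words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; the name is not taken from the slides.
class StopWordFilter {
    // Placeholder token that replaces every stop word, as on the slide.
    static final String STOP = "STOP";

    // Split each tweet on whitespace: one inner list of words per tweet.
    static ArrayList<ArrayList<String>> tokenize(List<String> tweets) {
        ArrayList<ArrayList<String>> result = new ArrayList<>();
        for (String tweet : tweets) {
            ArrayList<String> words = new ArrayList<>();
            String trimmed = tweet.trim();
            if (!trimmed.isEmpty()) {
                words.addAll(Arrays.asList(trimmed.split("\\s+")));
            }
            result.add(words);
        }
        return result;
    }

    // Replace, in place, every word that equals a stop word (ignoring case).
    // Assumes the stop words are stored in lowercase.
    static void replaceStopWords(ArrayList<ArrayList<String>> tweets, Set<String> stopWords) {
        for (ArrayList<String> words : tweets) {
            for (int i = 0; i < words.size(); i++) {
                if (stopWords.contains(words.get(i).toLowerCase(Locale.ROOT))) {
                    words.set(i, STOP);
                }
            }
        }
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("thanks", "to", "the"));
        ArrayList<ArrayList<String>> tweets = tokenize(Arrays.asList("Thanks to the Heat"));
        replaceStopWords(tweets, stopWords);
        System.out.println(tweets);
    }
}
```

Matching whole words rather than substrings avoids replacing a word such as "there" just because it contains "the".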

Classes

● Main: Find the important stuff

o Collected data from files

o Gathered information through regular expressions

● City: Store the information

o Only mutator (set) and accessor (get) methods

o More data formatting

● Comparison: Do stuff to the information

o Takes in two City objects

o Get data proportions?
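
The slides say that Main gathered its information through regular expressions but do not show the patterns. A minimal Java sketch of how hashtags, user mentions, and uppercase words might be pulled from a tweet is given below. The class name TweetFeatures and all three patterns are assumptions for illustration; Twitter's real rules for valid hashtags and usernames are more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and patterns are not from the slides.
class TweetFeatures {
    // '#' followed by one or more word characters.
    static final Pattern HASHTAG = Pattern.compile("#\\w+");
    // '@' followed by one or more word characters.
    static final Pattern MENTION = Pattern.compile("@\\w+");
    // Two or more capital letters that form a whole word and are not
    // part of a hashtag or mention.
    static final Pattern UPPERCASE = Pattern.compile("(?<![#@\\w])[A-Z]{2,}(?!\\w)");

    // Collect every match of the pattern in the tweet, in order of appearance.
    static List<String> findAll(Pattern pattern, String tweet) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(tweet);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String tweet = "GO Heat! #Miami #HEAT thanks @KingJames WOW";
        System.out.println(findAll(HASHTAG, tweet));
        System.out.println(findAll(MENTION, tweet));
        System.out.println(findAll(UPPERCASE, tweet));
    }
}
```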

Stop Words?

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text).

take

taken

tell

tends

th

than

thank

thanks

thanx

that

that's

thats

the

their

theirs

them

themselves

then

thence

there

there's

thereafter

thereby

therefore

therein

thereupon

these

they

they'd

they'll

they're

they've

think

third

this

thorough

thoroughly

those

though

three

through

throughout

thru

thus

to

together

too

took

toward

towards

tried

tries

truly

try

trying

twice

two

Step 3: THE ALGORITHM

● The idea was to compare the distribution of tags, uppercase words, and mentions

● So we could say "location X is more similar to location Y than to location Z"

● For a given information piece (tag/word/mention), let its proportions at two locations L1 and L2 be P1 and P2, where both P1 and P2 range from 0 to 1. Then we compute the absolute difference d = |P1 - P2|

● The distance between L1 and L2 in this context can be computed, for example, by selecting the top X information pieces (again, tag/word/mention) from those that appear in the tweets at L1 (call this set S1, and S2 for L2)

● Then compute d with respect to the top pieces from L1's point of view and d from L2's point of view, and take the average of the two distance values
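
One way to read the distance described above is sketched below in Java. Several details are assumptions, since the slides do not spell them out: a proportion is taken to be a piece's count divided by the total count of all pieces at that location, d is averaged over the top X pieces for each point of view, and ties in the ranking are broken alphabetically. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and aggregation choices are assumptions.
class LocationDistance {
    // Proportion (0 to 1) of one piece among all pieces counted at a location.
    static double proportion(Map<String, Integer> counts, String piece) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total == 0 ? 0.0 : counts.getOrDefault(piece, 0) / (double) total;
    }

    // The x most frequent pieces at a location; ties are broken alphabetically.
    static List<String> topPieces(Map<String, Integer> counts, int x) {
        List<String> pieces = new ArrayList<>(counts.keySet());
        pieces.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(pieces.subList(0, Math.min(x, pieces.size())));
    }

    // Mean of d = |P1 - P2| over the top x pieces of the "from" location.
    static double oneSided(Map<String, Integer> from, Map<String, Integer> to, int x) {
        List<String> top = topPieces(from, x);
        if (top.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (String piece : top) {
            sum += Math.abs(proportion(from, piece) - proportion(to, piece));
        }
        return sum / top.size();
    }

    // Average of the distance from L1's point of view and from L2's.
    static double distance(Map<String, Integer> l1, Map<String, Integer> l2, int x) {
        return (oneSided(l1, l2, x) + oneSided(l2, l1, x)) / 2.0;
    }

    public static void main(String[] args) {
        Map<String, Integer> l1 = Map.of("#heat", 3, "#nba", 1);
        Map<String, Integer> l2 = Map.of("#heat", 1, "#knicks", 1);
        System.out.println(distance(l1, l2, 1));
    }
}
```

Averaging the two one-sided values makes the result symmetric, so the distance from L1 to L2 equals the distance from L2 to L1.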

Tools and Technologies Learned

● Java Regular Expressions

● Ruby Programming

● Ruby Gems

● jQuery

● AJAX

● RESTful Web Architecture

● Practical Object Oriented Practices

Tasks for the future

● Scheduling Data collection

o There’s data and insights, but are they valid?

● Lots of Unused Data

● Custom HashTags

● Custom Cities