Hashtags on Twitter: Linguistic Aspects, Popularity...

116
Hashtags on Twitter: Linguistic Aspects, Popularity Prediction and Information Diffusion Pawan Goyal CSE, IITKGP July 24-28, 2014 Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 1 / 59

Transcript of Hashtags on Twitter: Linguistic Aspects, Popularity...

Page 1: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtags on Twitter: Linguistic Aspects, PopularityPrediction and Information Diffusion

Pawan Goyal

CSE, IITKGP

July 24-28, 2014

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 1 / 59

Page 2: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

What are #Hashtags?

Come under the general category of memes; a short unit of text that spreadsfrom person to person within a culture.

syntaxAdding a hash symbol (#) before a string of

letters

numerical digits or

underscore sign (_)

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 2 / 59

Page 3: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Why are Hashtags useful?

Hashtags are used to classify messages, propagate ideas and to promotespecific topics and people.

Allow users to create communities of people interested in the same topic,making it easier to find and share information related to it.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 3 / 59

Page 4: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Why are Hashtags useful?

Hashtags are used to classify messages, propagate ideas and to promotespecific topics and people.

Allow users to create communities of people interested in the same topic,making it easier to find and share information related to it.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 3 / 59

Page 5: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Why are Hashtags useful?

Hashtags are used to classify messages, propagate ideas and to promotespecific topics and people.

Allow users to create communities of people interested in the same topic,making it easier to find and share information related to it.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 3 / 59

Page 6: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Social Phenomena

A new social event can lead to the simultaneous emergence of severaldifferent hashtags, each one generated by a different user.

They can either be accepted by other members of the network or not.

Some propagate and thrive, while others die eventually or immediatelyafter birth, being restricted to a few messages.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 4 / 59

Page 7: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Social Phenomena

A new social event can lead to the simultaneous emergence of severaldifferent hashtags, each one generated by a different user.

They can either be accepted by other members of the network or not.

Some propagate and thrive, while others die eventually or immediatelyafter birth, being restricted to a few messages.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 4 / 59

Page 8: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Social Phenomena

A new social event can lead to the simultaneous emergence of severaldifferent hashtags, each one generated by a different user.

They can either be accepted by other members of the network or not.

Some propagate and thrive, while others die eventually or immediatelyafter birth, being restricted to a few messages.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 4 / 59

Page 9: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Social Phenomena

A new social event can lead to the simultaneous emergence of severaldifferent hashtags, each one generated by a different user.

They can either be accepted by other members of the network or not.

Some propagate and thrive, while others die eventually or immediatelyafter birth, being restricted to a few messages.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 4 / 59

Page 10: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Novelty’s propogation process

Subgraphs from Twitter dataset showing two distinct moments in the processof spreading #musicmonday

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 5 / 59

Page 11: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

First Reference

Evandro Cunha, Gabriel Magno, Giovanni Comarela, Virgilio Almeida, MarcosAndré Goncalves, and Fabrício Benevenuto. 2011. Analyzing the dynamicevolution of hashtags on Twitter: a language-based approach. In Proceedingsof the Workshop on Languages in Social Media (LSM ’11). Association forComputational Linguistics, Stroudsburg, PA, USA, 58 - 65.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 6 / 59

Page 12: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Research Objectives

To understand the process of propagation of innovative hashtags in light oflinguistic theories.

Interesting QuestionsDoes the distribution of the hashtags in frequency rankings follow somepattern, as the words in the lexicon of a language?

Is the length of a hashtag a factor that influences to its success or failure?

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 7 / 59

Page 13: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Dataset Used

55 million users

2 billion follow links

8% users ignored because the profile was private

More than 1.7 billion tweets between July 2006 and August 2009 wereanalyzed

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 8 / 59

Page 14: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Dataset Used

55 million users

2 billion follow links

8% users ignored because the profile was private

More than 1.7 billion tweets between July 2006 and August 2009 wereanalyzed

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 8 / 59

Page 15: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

What aspect of the tweets would be important?

Must find interchangeable hashtags, i.e. different tags used for the samepurpose, to characterize messages on the same topic.

For example, #michaeljackson, #mj, #jackson refer to the same subject.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 9 / 59

Page 16: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

What aspect of the tweets would be important?

Must find interchangeable hashtags, i.e. different tags used for the samepurpose, to characterize messages on the same topic.

For example, #michaeljackson, #mj, #jackson refer to the same subject.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 9 / 59

Page 17: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Topics Selected

A minor base was built for each of the following topics:

Michael Jackson (singer’s death widely reported during that period)→ MJ

Swine Flu (H1N1 epidemic as major issue of 2009)→ SF

Music Monday (a very successful campaign in favor of posting tweetsrelated to music on Mondays)→ MM

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 10 / 59

Page 18: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Forming the bases

Filtering tweets that included

atleast one hashtagat least one of the following terms that was thought to be related to thetopics:

I MJ: ‘michael jackson’I SF: ‘swine flu’ or ‘#swineflu’I MM: ‘#musicmonday’

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 11 / 59

Page 19: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Forming the bases

Filtering tweets that included

atleast one hashtag

at least one of the following terms that was thought to be related to thetopics:

I MJ: ‘michael jackson’I SF: ‘swine flu’ or ‘#swineflu’I MM: ‘#musicmonday’

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 11 / 59

Page 20: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Forming the bases

Filtering tweets that included

atleast one hashtagat least one of the following terms that was thought to be related to thetopics:

I MJ: ‘michael jackson’I SF: ‘swine flu’ or ‘#swineflu’I MM: ‘#musicmonday’

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 11 / 59

Page 21: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Summary information

Number of tweets posted in that base

Number of users who posted tweets

Number of follow links among users of the base

Number of different hashtags used in the tweets of the base

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 12 / 59

Page 22: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Summary information

Number of tweets posted in that base

Number of users who posted tweets

Number of follow links among users of the base

Number of different hashtags used in the tweets of the base

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 12 / 59

Page 23: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Empirical Phenomenon

Rich-get-richer phenomenonAlso known as ‘preferential attachment process’: In some systems, thepopularity of the most common items tends to increase faster than thepopularity of the less comon ones.

Zipf’s LawFrequency of words in English or any other language follow a power law.

f ∝1r

log(f ) = log(k)− log(r)

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 13 / 59

Page 24: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Empirical Phenomenon

Rich-get-richer phenomenonAlso known as ‘preferential attachment process’: In some systems, thepopularity of the most common items tends to increase faster than thepopularity of the less comon ones.

Zipf’s LawFrequency of words in English or any other language follow a power law.

f ∝1r

log(f ) = log(k)− log(r)

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 13 / 59

Page 25: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Distribution of Hashtags

i-tweets hastags: hastags appearing in at most i tweets

j-tweet hashtags: hashtags that appear in at least j tweets

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 14 / 59

Page 26: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Distribution of Hashtags

i-tweets hastags: hastags appearing in at most i tweets

j-tweet hashtags: hashtags that appear in at least j tweets

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 14 / 59

Page 27: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Data from most used Hashtags

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 15 / 59

Page 28: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Verification of Zipfian Law

Volume of Tweets vs. its position in popularity ranking

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 16 / 59

Page 29: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Length and Frequency

Zipf’s Other Laws: Word length and word frequencyWord frequency is inversely proportional to their length.

The length of the most popular hashtags was compared with the the lesspopular ones.

Main FindingsMost popular ones are simple, direct and short

Among those with little utilization, many are formed by long strings ofcharacters

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 17 / 59

Page 30: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Most Common Hashtags and most common 15-characterhashtags

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 18 / 59

Page 31: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Average Length

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 19 / 59

Page 32: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Avergae Number of Characters

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 20 / 59

Page 33: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Underscores in Hashtags

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 21 / 59

Page 34: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Underscores in Hashtags

Distribution of hashtags containing the sign “_”97% of _-hashtags are used in 10 or less tweets

#michael_jackson: position 248, only 128 tweets

#swine_flu: position 67, only 246 tweets

#music_monday: wasn’t even used

User behavior seems to indicate rejection of this sign

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 22 / 59

Page 35: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Second Reference

Oren Tsur and Ari Rappoport. 2012. What’s in a hashtag?: content basedprediction of the spread of ideas in microblogging communities. InProceedings of the fifth ACM international conference on Web search and datamining (WSDM ’12). ACM, New York, NY, USA, 643-652.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 23 / 59

Page 36: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Content-based Popularity Prediction

ObjectiveGiven an idea/meme m, and a time frame t, can we predict the acceptance ofm in the community (social network)?

Interesting QuestionsCan we accurately predict the acceptance of a meme based solely on thememe’s content?

Does the meme’s context improve the prediction?

Relation between graph topology and the content and how do theyintegrate for efficient propagation?

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 24 / 59

Page 37: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Content-based Popularity Prediction

ObjectiveGiven an idea/meme m, and a time frame t, can we predict the acceptance ofm in the community (social network)?

Interesting QuestionsCan we accurately predict the acceptance of a meme based solely on thememe’s content?

Does the meme’s context improve the prediction?

Relation between graph topology and the content and how do theyintegrate for efficient propagation?

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 24 / 59

Page 38: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Corpus Used

400 million tweets, tweeted between June-December, 2009.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 25 / 59

Page 39: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Filetering and Normalization

FilteringFiltered tweets containing non-Latin characters, to maintain a corpus ofEnglish tweets only

Hashtags that appear over 100 times

NormalizationThe same hashtag could have got different counts because of beingintroduced in a different week.

N(hti) = ∑j∈weeks

count(htij)

w1

wj

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 26 / 59

Page 40: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Filetering and Normalization

FilteringFiltered tweets containing non-Latin characters, to maintain a corpus ofEnglish tweets only

Hashtags that appear over 100 times

NormalizationThe same hashtag could have got different counts because of beingintroduced in a different week.

N(hti) = ∑j∈weeks

count(htij)

w1

wj

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 26 / 59

Page 41: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Fresh Hashtags

Hashtags that did not get popular before the corpus was collected.

DefinitionA hashtag is defined as fresh if it did not appear in the first week or if itsnormalized count in the first week is less than 10% of its normalized count inits peak.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 27 / 59

Page 42: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Fresh Hashtags

Hashtags that did not get popular before the corpus was collected.

DefinitionA hashtag is defined as fresh if it did not appear in the first week or if itsnormalized count in the first week is less than 10% of its normalized count inits peak.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 27 / 59

Page 43: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Fresh hashtags

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 28 / 59

Page 44: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Preprocessing

Identifying the distinct words in a hashtag#thankyousachin

Issues with hashtags: #savethenhs, #weluvjb

Matching hashtags against a lexicon of English words

Exploiting redundancy of hashtags that differ only orthographically

matches tuples like #freeiran, #FreeIran; performs segmentation as per thecapital letters

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 29 / 59

Page 45: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Preprocessing

Identifying the distinct words in a hashtag#thankyousachinIssues with hashtags: #savethenhs, #weluvjb

Matching hashtags against a lexicon of English words

Exploiting redundancy of hashtags that differ only orthographically

matches tuples like #freeiran, #FreeIran; performs segmentation as per thecapital letters

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 29 / 59

Page 46: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Preprocessing

Identifying the distinct words in a hashtag#thankyousachinIssues with hashtags: #savethenhs, #weluvjb

Matching hashtags against a lexicon of English words

Exploiting redundancy of hashtags that differ only orthographically

matches tuples like #freeiran, #FreeIran; performs segmentation as per thecapital letters

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 29 / 59

Page 47: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Prediction Model

The Target Function

f (ht) = n

ht is a vector space representation of a given hashtagn is the normalized count of its occurrences in a time frame.

Transformed target function: f ′(ht) = log(n).

Regression Model

Training set: (X,Y) = {xi,yi}, where for a hashtag hti:xi: feature vector representation of htiyi = log(ni), where ni is the normalized count of occurrences of hti

Y = b+wTX

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 30 / 59

Page 48: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Prediction Model

The Target Function

f (ht) = n

ht is a vector space representation of a given hashtagn is the normalized count of its occurrences in a time frame.Transformed target function: f ′(ht) = log(n).

Regression Model

Training set: (X,Y) = {xi,yi}, where for a hashtag hti:xi: feature vector representation of htiyi = log(ni), where ni is the normalized count of occurrences of hti

Y = b+wTX

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 30 / 59

Page 49: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Prediction Model

The Target Function

f (ht) = n

ht is a vector space representation of a given hashtagn is the normalized count of its occurrences in a time frame.Transformed target function: f ′(ht) = log(n).

Regression Model

Training set: (X,Y) = {xi,yi}, where for a hashtag hti:xi: feature vector representation of htiyi = log(ni), where ni is the normalized count of occurrences of hti

Y = b+wTX

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 30 / 59

Page 50: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Prediction Model

The Target Function

f (ht) = n

ht is a vector space representation of a given hashtagn is the normalized count of its occurrences in a time frame.Transformed target function: f ′(ht) = log(n).

Regression Model

Training set: (X,Y) = {xi,yi}, where for a hashtag hti:xi: feature vector representation of htiyi = log(ni), where ni is the normalized count of occurrences of hti

Y = b+wTX

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 30 / 59

Page 51: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

L1 Regularization with Stochastic Gradient Descent

Parameter update for Stochastic Gradient Descent (SGD)

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 31 / 59

Page 52: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

L1 Regularization with Stochastic Gradient Descent

Parameter update for Stochastic Gradient Descent (SGD)

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 31 / 59

Page 53: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Model Features

Hashtag content :- features that can be extracted from the hashtag itself.

Global tweet features:- features related to the content of the tweetscontaining the hashtag.

Graph topology features:- features related to graph topology and retweetstatistics.

Global temporal features:- features related to temporal pattern of the useof the hashtag.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 32 / 59

Page 54: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

Character Length7 bins were used: 2, 3, 4, 5, 6-9, 10-14, >14 characters

Number of words55% of the hashtags were compounds of more than one word, e.g. #freeIran,#GoogleGoesGaga.Four bins: 1 word, 2-3 words, 4 words, >4 words

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 33 / 59

Page 55: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

Character Length7 bins were used: 2, 3, 4, 5, 6-9, 10-14, >14 characters

Number of words55% of the hashtags were compounds of more than one word, e.g. #freeIran,#GoogleGoesGaga.Four bins: 1 word, 2-3 words, 4 words, >4 words

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 33 / 59

Page 56: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

OrthographyHashtags can be written in capital letters, contain some capital and/or digits,e.g. #myheart4JB.‘right’ writing style may make it readable: savethenhs vs. saveTheNHSAttributes: no caps, some caps, all caps, contain digits

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 34 / 59

Page 57: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

Lexical ItemsHashtag/its words are matched against five predefined lists:

a general lexicon containing all words from a large portion (612MB) ofEnglish Wikipedia

a list of proper names taken from the name list compiled at the UScensus of 1995

a list of celebrity names compiled from Forbes’ ‘The Celebrity 100’ lists of2008-2010.

a short list of holidays and days of the week

a list of all the world’s countries

Each of the five attributes is an attribute in the vector

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 35 / 59

Page 58: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

Lexical ItemsHashtag/its words are matched against five predefined lists:

a general lexicon containing all words from a large portion (612MB) ofEnglish Wikipedia

a list of proper names taken from the name list compiled at the UScensus of 1995

a list of celebrity names compiled from Forbes’ ‘The Celebrity 100’ lists of2008-2010.

a short list of holidays and days of the week

a list of all the world’s countries

Each of the five attributes is an attribute in the vector

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 35 / 59

Page 59: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

Lexical ItemsHashtag/its words are matched against five predefined lists:

a general lexicon containing all words from a large portion (612MB) ofEnglish Wikipedia

a list of proper names taken from the name list compiled at the UScensus of 1995

a list of celebrity names compiled from Forbes’ ‘The Celebrity 100’ lists of2008-2010.

a short list of holidays and days of the week

a list of all the world’s countries

Each of the five attributes is an attribute in the vector

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 35 / 59

Page 60: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

Lexical ItemsHashtag/its words are matched against five predefined lists:

a general lexicon containing all words from a large portion (612MB) ofEnglish Wikipedia

a list of proper names taken from the name list compiled at the UScensus of 1995

a list of celebrity names compiled from Forbes’ ‘The Celebrity 100’ lists of2008-2010.

a short list of holidays and days of the week

a list of all the world’s countries

Each of the five attributes is an attribute in the vector

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 35 / 59

Page 61: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

Lexical ItemsHashtag/its words are matched against five predefined lists:

a general lexicon containing all words from a large portion (612MB) ofEnglish Wikipedia

a list of proper names taken from the name list compiled at the UScensus of 1995

a list of celebrity names compiled from Forbes’ ‘The Celebrity 100’ lists of2008-2010.

a short list of holidays and days of the week

a list of all the world’s countries

Each of the five attributes is an attribute in the vector

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 35 / 59

Page 62: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

Lexical ItemsHashtag/its words are matched against five predefined lists:

a general lexicon containing all words from a large portion (612MB) ofEnglish Wikipedia

a list of proper names taken from the name list compiled at the UScensus of 1995

a list of celebrity names compiled from Forbes’ ‘The Celebrity 100’ lists of2008-2010.

a short list of holidays and days of the week

a list of all the world’s countries

Each of the five attributes is an attribute in the vector

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 35 / 59

Page 63: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

LocationLocation of a hashtag can give an indication of the way it is used:For instance, if located in the middle of the tweet, hashtag also serves as partof the sentence and not only as a meta tag.

Three locations: prefix, infix, suffixEx: “AP: Report: #Iran’s paramilitary launches cyber attack http://is.gd/HiCYJU#iranelections #freeiran”Last two hashtags are considered suffixes#Iran is considered Infix

CollocationWhether it collocates with other hashtags?Value 1 if more than 40% of the occurrences are collocated with otherhashtags.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 36 / 59

Page 64: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

LocationLocation of a hashtag can give an indication of the way it is used:For instance, if located in the middle of the tweet, hashtag also serves as partof the sentence and not only as a meta tag.Three locations: prefix, infix, suffixEx: “AP: Report: #Iran’s paramilitary launches cyber attack http://is.gd/HiCYJU#iranelections #freeiran”

Last two hashtags are considered suffixes#Iran is considered Infix

CollocationWhether it collocates with other hashtags?Value 1 if more than 40% of the occurrences are collocated with otherhashtags.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 36 / 59

Page 65: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

LocationLocation of a hashtag can give an indication of the way it is used:For instance, if located in the middle of the tweet, hashtag also serves as partof the sentence and not only as a meta tag.Three locations: prefix, infix, suffixEx: “AP: Report: #Iran’s paramilitary launches cyber attack http://is.gd/HiCYJU#iranelections #freeiran”Last two hashtags are considered suffixes

#Iran is considered Infix

CollocationWhether it collocates with other hashtags?Value 1 if more than 40% of the occurrences are collocated with otherhashtags.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 36 / 59

Page 66: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

LocationLocation of a hashtag can give an indication of the way it is used:For instance, if located in the middle of the tweet, hashtag also serves as partof the sentence and not only as a meta tag.Three locations: prefix, infix, suffixEx: “AP: Report: #Iran’s paramilitary launches cyber attack http://is.gd/HiCYJU#iranelections #freeiran”Last two hashtags are considered suffixes#Iran is considered Infix

CollocationWhether it collocates with other hashtags?Value 1 if more than 40% of the occurrences are collocated with otherhashtags.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 36 / 59

Page 67: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

LocationLocation of a hashtag can give an indication of the way it is used:For instance, if located in the middle of the tweet, hashtag also serves as partof the sentence and not only as a meta tag.Three locations: prefix, infix, suffixEx: “AP: Report: #Iran’s paramilitary launches cyber attack http://is.gd/HiCYJU#iranelections #freeiran”Last two hashtags are considered suffixes#Iran is considered Infix

CollocationWhether it collocates with other hashtags?Value 1 if more than 40% of the occurrences are collocated with otherhashtags.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 36 / 59

Page 68: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

Cognitive DimensionSome words trigger specific emotions and encourage specific behavior andthis psychological dimension can influence its spread.

LIWC project assigns words to a number of emotional and cognitivedimensions.Ex: positive sentiment, negative sentiment, optimistic, self, anger etc.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 37 / 59

Page 69: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

Cognitive DimensionSome words trigger specific emotions and encourage specific behavior andthis psychological dimension can influence its spread.LIWC project assigns words to a number of emotional and cognitivedimensions.

Ex: positive sentiment, negative sentiment, optimistic, self, anger etc.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 37 / 59

Page 70: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Content Features

Cognitive DimensionSome words trigger specific emotions and encourage specific behavior andthis psychological dimension can influence its spread.LIWC project assigns words to a number of emotional and cognitivedimensions.Ex: positive sentiment, negative sentiment, optimistic, self, anger etc.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 37 / 59

Page 71: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Global Tweet Features

1000 most frequent words the hashtag co-occurred with were extracted.

This list is mapped to the 69 LIWC categories

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 38 / 59

Page 72: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Graph Topology Features

Average Number of followersAverage number of followers of users, who used the hashtag, is divided to 19bins on a sub logarithmic scale

Max Number of followersMax number of followers of users, who used the hashtag, is divided to 19 binson a logarithmic scale

Retweets ratioTendency of a hashtag to appear in retweeted messages.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 39 / 59

Page 73: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Graph Topology Features

Average Number of followersAverage number of followers of users, who used the hashtag, is divided to 19bins on a sub logarithmic scale

Max Number of followersMax number of followers of users, who used the hashtag, is divided to 19 binson a logarithmic scale

Retweets ratioTendency of a hashtag to appear in retweeted messages.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 39 / 59

Page 74: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Graph Topology Features

Average Number of followersAverage number of followers of users, who used the hashtag, is divided to 19bins on a sub logarithmic scale

Max Number of followersMax number of followers of users, who used the hashtag, is divided to 19 binson a logarithmic scale

Retweets ratioTendency of a hashtag to appear in retweeted messages.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 39 / 59

Page 75: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Graph Temporal Features

The normalized weekly count of each hashtag was sampled in four timestamps: wi, i ∈ {t, t+1, t+2, t+6}

t is the first week of occurrence and t+ j is the j-th week after the firstoccurrence.

Three lag values are otained dk∈1,2,3, where dk is the ratio of change fromthe previous time stamp. (stickiness and persistence)

17 bins on a logarithmic scale (-200% to 200% change)

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 40 / 59

Page 76: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Graph Temporal Features

The normalized weekly count of each hashtag was sampled in four timestamps: wi, i ∈ {t, t+1, t+2, t+6}t is the first week of occurrence and t+ j is the j-th week after the firstoccurrence.

Three lag values are otained dk∈1,2,3, where dk is the ratio of change fromthe previous time stamp. (stickiness and persistence)

17 bins on a logarithmic scale (-200% to 200% change)

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 40 / 59

Page 77: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Graph Temporal Features

The normalized weekly count of each hashtag was sampled in four timestamps: wi, i ∈ {t, t+1, t+2, t+6}t is the first week of occurrence and t+ j is the j-th week after the firstoccurrence.

Three lag values are otained dk∈1,2,3, where dk is the ratio of change fromthe previous time stamp. (stickiness and persistence)

17 bins on a logarithmic scale (-200% to 200% change)

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 40 / 59

Page 78: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Graph Temporal Features

The normalized weekly count of each hashtag was sampled in four timestamps: wi, i ∈ {t, t+1, t+2, t+6}t is the first week of occurrence and t+ j is the j-th week after the firstoccurrence.

Three lag values are otained dk∈1,2,3, where dk is the ratio of change fromthe previous time stamp. (stickiness and persistence)

17 bins on a logarithmic scale (-200% to 200% change)

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 40 / 59

Page 79: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Experimental Setup

Learning three aspects in the predictionwhat is the attribute combination that yields the best prediction?

what are the strongest attributes and how do they complement eachother?

how does the prediction accuracy change given different time frames?

Performance measurementExperiments were executed in a 10-fold cross validation manner.

Performance is measured by the mean square error (MSE).

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 41 / 59

Page 80: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Experimental Setup

Learning three aspects in the predictionwhat is the attribute combination that yields the best prediction?

what are the strongest attributes and how do they complement eachother?

how does the prediction accuracy change given different time frames?

Performance measurement

Experiments were executed in a 10-fold cross validation manner.

Performance is measured by the mean square error (MSE).

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 41 / 59

Page 81: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Experimental Setup

Learning three aspects in the predictionwhat is the attribute combination that yields the best prediction?

what are the strongest attributes and how do they complement eachother?

how does the prediction accuracy change given different time frames?

Performance measurementExperiments were executed in a 10-fold cross validation manner.

Performance is measured by the mean square error (MSE).

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 41 / 59

Page 82: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Experimental Setup

Learning three aspects in the predictionwhat is the attribute combination that yields the best prediction?

what are the strongest attributes and how do they complement eachother?

how does the prediction accuracy change given different time frames?

Performance measurementExperiments were executed in a 10-fold cross validation manner.

Performance is measured by the mean square error (MSE).

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 41 / 59

Page 83: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Results

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 42 / 59

Page 84: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Results

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 43 / 59

Page 85: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Results

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 44 / 59

Page 86: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Third Reference

Daniel M. Romero, Brendan Meeder, and Jon Kleinberg. 2011. Differences inthe mechanics of information diffusion across topics: idioms, politicalhashtags, and complex contagion on twitter. In Proceedings of the 20thinternational conference on World wide web (WWW ’11). ACM, New York, NY,USA, 695-704.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 45 / 59

Page 87: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

What is Information Diffusion?

Online Information DiffusionUnderstanding the tendency for people to engage in activities such asforwarding messages, linking to articles, joining groups, purchasing products,or becoming fans of pages after some number of their friends have.

Objectives of this researchWidespread belief that different kinds of information spread differentlyonline.

To study this issue on Twitter, analyzing the ways in which Hashtagsspread on a network defined by interactions among Twitter users.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 46 / 59

Page 88: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

What is Information Diffusion?

Online Information DiffusionUnderstanding the tendency for people to engage in activities such asforwarding messages, linking to articles, joining groups, purchasing products,or becoming fans of pages after some number of their friends have.

Objectives of this researchWidespread belief that different kinds of information spread differentlyonline.

To study this issue on Twitter, analyzing the ways in which Hashtagsspread on a network defined by interactions among Twitter users.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 46 / 59

Page 89: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

What is Information Diffusion?

Online Information DiffusionUnderstanding the tendency for people to engage in activities such asforwarding messages, linking to articles, joining groups, purchasing products,or becoming fans of pages after some number of their friends have.

Objectives of this researchWidespread belief that different kinds of information spread differentlyonline.

To study this issue on Twitter, analyzing the ways in which Hashtagsspread on a network defined by interactions among Twitter users.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 46 / 59

Page 90: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Twitter Data and Graph Construction

Twitter data crawled from August 2009 until January 2010.

Collected over 3 billion messages from more than 60 million users.

Graph construction via @-messages: X→ Y if X directed at least 3@-messages to Y .

Graph size: 8.5 million non-isolated nodes, 50 million links

Studies 500 most used hashtags

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 47 / 59

Page 91: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Twitter Data and Graph Construction

Twitter data crawled from August 2009 until January 2010.

Collected over 3 billion messages from more than 60 million users.

Graph construction via @-messages: X→ Y if X directed at least 3@-messages to Y .

Graph size: 8.5 million non-isolated nodes, 50 million links

Studies 500 most used hashtags

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 47 / 59

Page 92: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Twitter Data and Graph Construction

Twitter data crawled from August 2009 until January 2010.

Collected over 3 billion messages from more than 60 million users.

Graph construction via @-messages: X→ Y if X directed at least 3@-messages to Y .

Graph size: 8.5 million non-isolated nodes, 50 million links

Studies 500 most used hashtags

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 47 / 59

Page 93: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Twitter Data and Graph Construction

Twitter data crawled from August 2009 until January 2010.

Collected over 3 billion messages from more than 60 million users.

Graph construction via @-messages: X→ Y if X directed at least 3@-messages to Y .

Graph size: 8.5 million non-isolated nodes, 50 million links

Studies 500 most used hashtags

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 47 / 59

Page 94: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Twitter Data and Graph Construction

Twitter data crawled from August 2009 until January 2010.

Collected over 3 billion messages from more than 60 million users.

Graph construction via @-messages: X→ Y if X directed at least 3@-messages to Y .

Graph size: 8.5 million non-isolated nodes, 50 million links

Studies 500 most used hashtags

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 47 / 59

Page 95: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Hashtag Categories

Manually identified 8broad categories withatleast 20 HTs in each

Authors and 3volunteersindependentlyannotated eachhashtag.

Levels of agreementwas high

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 48 / 59

Page 96: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Exposure Curve: Defining p(k)

Neighbor Set of XFor a given user X, the set of other users to whom X has an edge.

When does X start mentioning a hashtag H?How do successive exposures to H affect the probability that X will beginmentioning it?

Look at all users X who have not mentioned H, but for whom k neighborshave

p(k): fraction of users who adopt the hashtag direct after their kth

exposure, given that they hadn’t yet adopted it.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 49 / 59

Page 97: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Exposure Curve: Defining p(k)

Neighbor Set of XFor a given user X, the set of other users to whom X has an edge.

When does X start mentioning a hashtag H?How do successive exposures to H affect the probability that X will beginmentioning it?

Look at all users X who have not mentioned H, but for whom k neighborshave

p(k): fraction of users who adopt the hashtag direct after their kth

exposure, given that they hadn’t yet adopted it.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 49 / 59

Page 98: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Exposure Curve: Defining p(k)

Neighbor Set of XFor a given user X, the set of other users to whom X has an edge.

When does X start mentioning a hashtag H?How do successive exposures to H affect the probability that X will beginmentioning it?

Look at all users X who have not mentioned H, but for whom k neighborshave

p(k): fraction of users who adopt the hashtag direct after their kth

exposure, given that they hadn’t yet adopted it.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 49 / 59

Page 99: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Average Exposure Curve for 500 most-mentioned hashtags

A ramp-up to the peak value, reached relatively early (k = 2,3,4)

Decline for larger values of k

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 50 / 59

Page 100: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Average Exposure Curve for 500 most-mentioned hashtags

A ramp-up to the peak value, reached relatively early (k = 2,3,4)

Decline for larger values of k

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 50 / 59

Page 101: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Persistence and Stickiness

StickinessThe maximum value of p(k)(probability of usage at the mosteffective exposure)

PersistenceA measure of the decay ofexposure curves.The ratio of the area under thecurve P and the area of therectangle of length max(P) andwidth max(D(P)).

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 51 / 59

Page 102: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Persistence and Stickiness

StickinessThe maximum value of p(k)(probability of usage at the mosteffective exposure)

PersistenceA measure of the decay ofexposure curves.

The ratio of the area under thecurve P and the area of therectangle of length max(P) andwidth max(D(P)).

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 51 / 59

Page 103: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Persistence and Stickiness

StickinessThe maximum value of p(k)(probability of usage at the mosteffective exposure)

PersistenceA measure of the decay ofexposure curves.The ratio of the area under thecurve P and the area of therectangle of length max(P) andwidth max(D(P)).

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 51 / 59

Page 104: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Approximating Exposure Curves via Stickiness and Persistence

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 52 / 59

Page 105: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Comparison of Hashtags based on Persistence

Idioms and Music have lower persistence than a random subset ofhashtags of the same size

Politics and Sports have higher persistence than a random subset

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 53 / 59

Page 106: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Comparison of Hashtags based on Stickiness

Technology and Movies have lower stickiness than a random subset

Music has higher stickiness than a random subset

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 54 / 59

Page 107: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Comparison of Hashtags based on Stickiness

Technology and Movies have lower stickiness than a random subset

Music has higher stickiness than a random subset

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 54 / 59

Page 108: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Persistence vs. Stickiness

Idioms and Politics: Same stickiness but opposite persistence

Music has high stickiness but low persistence

Stickiness does not explain the diffusion well by itself

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 55 / 59

Page 109: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Persistence vs. Stickiness

Idioms and Politics: Same stickiness but opposite persistence

Music has high stickiness but low persistence

Stickiness does not explain the diffusion well by itself

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 55 / 59

Page 110: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Persistence vs. Stickiness

Idioms and Politics: Same stickiness but opposite persistence

Music has high stickiness but low persistence

Stickiness does not explain the diffusion well by itself

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 55 / 59

Page 111: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Sample curves for #cantlivewithout (blue) and #hcr (red)

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 56 / 59

Page 112: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Comparison of Hashtag by Mention and User Counts

Political and Idioms are among the most mentioned, but Idioms are used bytwice the number of people that use Politics

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 57 / 59

Page 113: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Comparison of Hashtag by Mention and User Counts

Political and Idioms are among the most mentioned, but Idioms are used bytwice the number of people that use Politics

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 57 / 59

Page 114: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

The Structure of Initial Sets

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 58 / 59

Page 115: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Structure Comparison for Political Hashtags (G500)

The early adopters of a political hashtag message more with each other,create more triangles, and have a border of people with more links intothe early adopter set.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 59 / 59

Page 116: Hashtags on Twitter: Linguistic Aspects, Popularity ...cse.iitkgp.ac.in/~pawang/courses/SC14/lec3.pdfMore than 1.7 billion tweets between July 2006 and August 2009 were analyzed Pawan

Structure Comparison for Political Hashtags (G500)

The early adopters of a political hashtag message more with each other,create more triangles, and have a border of people with more links intothe early adopter set.

Pawan Goyal (IIT Kharagpur) Hashtags on Twitter July 24-28, 2014 59 / 59