Blogs (web logs) contain online stamped entries

Implicit Structure and Dynamics of BlogSpace Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose HP Labs, Palo Alto, CA


Page 1: Blogs (web logs) contain online stamped entries

Implicit Structure and Dynamics of BlogSpace

Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose

HP Labs, Palo Alto, CA

Page 2: Blogs (web logs) contain online stamped entries

Blogs (weblogs) contain online stamped entries

date and time stamps

list of read blogs

URL that is being commented on

via link

Page 3: Blogs (web logs) contain online stamped entries

Blogs: structure and transmission

• Blog use:
– Record real-world and virtual experiences
– Note and discuss things “seen” on the net

• Blog structure: blog-to-blog linking

• Use + Structure
– Great to track “memes” (catchy ideas)

• Patterns of information flow
– How does the popularity of a topic evolve over time?
– Who is getting information from whom?

• Ranking algorithms that take advantage of transmission patterns

Page 4: Blogs (web logs) contain online stamped entries

Related Work

Link prediction in social networks:
– Butts, C., Network Inference, Error, and Information (In)Accuracy: A Bayesian Approach, Social Networks, 25(2):103-140.
– Dombroski, M., P. Fischbeck, and K. Carley, An Empirically-Based Model for Network Estimation and Prediction, NAACSOS conference proceedings, Pittsburgh, PA, 2003.
– O’Madadhain, J., P. Smyth, and L. Adamic, Learning Predictive Models for Link Formation, Sunbelt 2005 (hope you were there!)
– Getoor, L., N. Friedman, D. Koller, and B. Taskar, Learning Probabilistic Models of Link Structure, Journal of Machine Learning Research, vol. 3 (2002), pp. 690-707.
– Adamic, L., and E. Adar, Friends and neighbors on the Web, Social Networks, 2003.
– Kleinberg, J., and D. Liben-Nowell, The Link Prediction Problem for Social Networks, in Proceedings of CIKM ’03 (New Orleans, LA, November 2003), ACM Press.

Blog ranking: Technorati, BlogPulse, Daypop…

Blog epidemic tracking:
– Blogdex at MIT Media Lab, Cameron Marlow, Sunbelt 2003
– BlogPulse

Page 5: Blogs (web logs) contain online stamped entries

Intelliseek’s BlogPulse

Service for tracking trends in the blogosphere: popular URLs, phrases, people

Page 6: Blogs (web logs) contain online stamped entries

BlogPulse Data analyzed

37,153 blogs

Differential daily crawls (to find new posts) for May 2003

Full page crawl for May 18, 2003 to capture blogrolls

175,712 URLs occurring on > 2 blogs

Page 7: Blogs (web logs) contain online stamped entries

Tracking popularity over time

[Figure: popularity versus time, annotated with "Slashdot Effect" and "BoingBoing Effect"]

Blogdex, BlogPulse, etc. track the most popular links/phrases of the day

Page 8: Blogs (web logs) contain online stamped entries

Election Map Cartograms

Michael Gastner, Cosma Shalizi, and Mark Newman

University of Michigan

http://www-personal.umich.edu/~mejn/election/

Page 9: Blogs (web logs) contain online stamped entries
Page 10: Blogs (web logs) contain online stamped entries

Tracking popularity over time

[Figure: popularity versus time, shown as number of blogs (y-axis, 0 to 50) per day (x-axis, 0 to 25). Series: "U-M: Election Cartographs" and "WIRED: Orrin Hatch: Software Pirate?". Annotations: "total mentions: 100" and "total mentions: 92".]

Page 11: Blogs (web logs) contain online stamped entries

Clustering information popularity profiles

May 2003

Total # of mentions substantial (40)

URL mentioned for the first time in May

Page 12: Blogs (web logs) contain online stamped entries

K-means clustering

259 URLs in the sample satisfy criteria

Take normalized cumulative profiles

[Figure: example profile; axis labels: all mentions, day]

K-means minimizes the sum of squared distances between each profile and the center of its cluster

4 clusters captured most of the differences
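The clustering step can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function names, the toy mention counts, the random initialization, and the fixed iteration count are all hypothetical and are not taken from the slides, which do not say how the K-means run was configured.

```python
import random

def normalized_cumulative(daily_counts):
    """Turn daily mention counts into a cumulative profile that ends at 1.0."""
    total = sum(daily_counts)
    if total == 0:
        raise ValueError("profile needs at least one mention")
    running, profile = 0, []
    for count in daily_counts:
        running += count
        profile.append(running / total)
    return profile

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(profiles, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm: assign to nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = rng.sample(profiles, k)  # assumption: initialize from k random profiles
    labels = [0] * len(profiles)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in profiles
        ]
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy data (made up): two URLs that spike on day 1, two with sustained interest.
daily = [[10, 0, 0, 0], [9, 1, 0, 0], [1, 1, 1, 1], [1, 1, 1, 2]]
profiles = [normalized_cumulative(d) for d in daily]
labels, centers = kmeans(profiles, k=2)
```

The toy run uses k=2 only because it has four profiles; the slides report that 4 clusters captured most of the differences among the 259 URLs.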

Page 13: Blogs (web logs) contain online stamped entries

Different kinds of information have different popularity profiles

[Figure: four panels, numbered 1 to 4, each plotting a profile from 0 to 1 against day (5 to 15). Panel labels: Slashdot postings; Front-page news; Major-news site (editorial content), back of the paper; Products, etc.]

Page 14: Blogs (web logs) contain online stamped entries

Popularity profiles

Cluster  Profile                                      # URLs  Examples
1        Sharp peak on day 1 followed by fast decay   38      Slashdot postings
2        Day 1 peak followed by decay                 46      Front page news
3        Day 2 peak followed by gradual decay         51      Editorial content, Sun Java release
4        Sustained interest                           124     iPod, iTunes, quizzila

[Figure: popularity profiles of clusters 1 to 4; x-axis 2 to 22, y-axis 0 to 1]

Page 15: Blogs (web logs) contain online stamped entries

Micro example: Giant Microbes

Page 16: Blogs (web logs) contain online stamped entries

Microscale Dynamics

• What do we need to track specific info ‘epidemics’?
– Timings
– Underlying network

[Figure: blogs b1, b2, and b3 placed along a time-of-infection axis from t0 to t1]

Page 17: Blogs (web logs) contain online stamped entries

Microscale Dynamics

• Challenges
– Root may be unknown
– Multiple possible paths
– Uncrawled space, alternate media (email, voice)
– No links

[Figure: blogs b1, b2, b3, and bn placed along a time-of-infection axis from t0 to t1, with sources marked "??"]

Page 18: Blogs (web logs) contain online stamped entries

Microscale Dynamics: who is getting info from whom

• Via links (< 2% of links, 50% within sample): unambiguous

• Multiple explicit links: which link is more likely?

• No explicit links (70%): which implicit path is more likely?

Page 19: Blogs (web logs) contain online stamped entries

Link Inference

Use machine learning algorithms:

A) Support Vector Machine (SVM)
B) Logistic Regression

What we can use

Full text

Blogs in common

Links in common

History of infection

[Figure labels: BoingBoing, WIRED]

Page 20: Blogs (web logs) contain online stamped entries

Percentage of blog pairs sharing at least one link

Link type       Same day   A after B   A before B
A ↔ B           17.4%      24.5%       24.5%
A → B           10.9%      22.9%       17.0%
A, B unlinked   0.6%       1.5%        1.3%

Page 21: Blogs (web logs) contain online stamped entries

Similarity in links between reciprocated, unreciprocated, and non-linked blog pairs

[Figure: two panels plotting fraction of pairs (y-axis, 0 to 1) against similarity (x-axis, 0 to 0.5) for bidirectional, unidirectional, and not linked blog pairs. Panel 1: similarity in links to non-blog URLs. Panel 2: similarity in links to other blogs.]
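One way to compute such a link similarity for a pair of blogs is sketched below. The slides do not state which similarity measure was used, so cosine similarity over sets of outgoing links is an assumption here, and the function name and example URLs are made up for illustration.

```python
import math

def cosine_similarity(links_a, links_b):
    """Cosine similarity between two sets of outgoing links (binary vectors)."""
    a, b = set(links_a), set(links_b)
    if not a or not b:
        return 0.0  # no links on one side: treat as no similarity
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical link sets for two blogs, split into non-blog URLs and blog links.
a_urls = {"http://example.org/story1", "http://example.org/story2"}
b_urls = {"http://example.org/story2", "http://example.org/story3"}
a_blogs = {"blogX", "blogY"}
b_blogs = {"blogZ"}

url_similarity = cosine_similarity(a_urls, b_urls)     # shares 1 of 2 URLs each
blog_similarity = cosine_similarity(a_blogs, b_blogs)  # no blogs in common
```

Keeping the two similarities separate mirrors the two panels of the figure: one feature for links to non-blog URLs and one for links to other blogs.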

Page 22: Blogs (web logs) contain online stamped entries

Training on positive and negative examples of ‘infection’

[Figure: pairs (Blog A, Blog B) labeled as a positive example (+), annotated with T_infection(Blog B) > T_infection(Blog A), or as a negative example (-). Legend: infected, uninfected.]
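Training a classifier on labeled pairs like these can be illustrated with a tiny logistic regression fitted by gradient descent. Everything below is a sketch under assumptions: the feature values are invented, the two features (a link similarity and a timing ratio) are only one plausible choice, and the learning rate and epoch count are arbitrary. It is not the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Estimated probability that the pair x is a positive example."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean logistic loss."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = predict(w, b, x) - y
            grad_b += err
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Made-up pair features: [link similarity, fraction of shared URLs A cited first]
features = [[0.8, 0.6], [0.9, 0.7], [0.1, 0.2], [0.0, 0.1]]
labels = [1, 1, 0, 0]  # 1 = positive example, 0 = negative example
w, b = train_logistic(features, labels)
```

With real data the pairs would be split into training and held-out sets before reporting accuracy; the toy set here is too small for that.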

Page 23: Blogs (web logs) contain online stamped entries

Prediction results

Link inference:
– SVM: 91% accuracy
– Regression: 92% accuracy (blog-blog links most predictive)

Infection inference:
– SVM: 71.5% accuracy, using blog and non-blog link similarity + timing features (A before B)/nA, (B before A)/nA, (A same day B)/nA, …
– Regression: 75% accuracy using only timing features
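The timing features named above can be computed roughly as follows. The slides do not define nA, so this sketch assumes it is the number of URLs blog A mentions, and it assumes each blog's history is available as a mapping from URL to the day of first mention. The function and key names are hypothetical.

```python
def timing_features(mentions_a, mentions_b):
    """Timing features for the pair (A, B).

    mentions_a / mentions_b map URL -> day on which the blog first cited it.
    Each count is divided by n_a, assumed here to be the number of URLs A cites.
    """
    n_a = len(mentions_a)
    if n_a == 0:
        return {"a_before_b": 0.0, "b_before_a": 0.0, "same_day": 0.0}
    shared = mentions_a.keys() & mentions_b.keys()
    a_before = sum(1 for u in shared if mentions_a[u] < mentions_b[u])
    b_before = sum(1 for u in shared if mentions_b[u] < mentions_a[u])
    same_day = sum(1 for u in shared if mentions_a[u] == mentions_b[u])
    return {
        "a_before_b": a_before / n_a,
        "b_before_a": b_before / n_a,
        "same_day": same_day / n_a,
    }

# Made-up histories: A cites four URLs, three of which B also cites.
blog_a = {"u1": 1, "u2": 3, "u3": 5, "u4": 2}
blog_b = {"u1": 2, "u2": 3, "u3": 4}
features = timing_features(blog_a, blog_b)
```

Because the crawl has one-day resolution, the same-day count is kept as its own feature rather than being folded into either direction.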

Page 24: Blogs (web logs) contain online stamped entries

Sources of error

Coarseness and sparseness of timing data (1 day resolution)

Mirror URLs (actually helps)

Incomplete crawls

[Figure: blogs A, B, and C on a time axis, with links labeled "inferred" and "actual" and a node labeled "uncrawled blog or media source"]

Page 25: Blogs (web logs) contain online stamped entries

Visualization by Eytan Adar

• GUESS tool (build your own, see demo @ 5:30!)

– Using GraphViz (by AT&T) layouts

• Simple algorithm
– If a single explicit link exists, draw it (add node if needed)
– Otherwise use ML algorithm
  • Pick the most likely explicit link
  • Pick the most likely possible link

• Tool lets you zoom around space, control threshold, link types, etc.
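The simple algorithm in the list above might look like this for a single infected blog. This is a sketch, not the tool's code: `score(source, target)` is a hypothetical stand-in for the classifier's likelihood that `target` got the URL from `source`, candidates are assumed to be blogs infected no later than the target, and the `threshold` parameter is an assumed way to mimic the tool's threshold control.

```python
def choose_source(blog, infection_day, explicit_links, score, threshold=0.0):
    """Pick the edge to draw into `blog`; returns (source, kind) or (None, None).

    infection_day:  blog -> day it first cited the URL
    explicit_links: blog -> set of blogs it links to
    score:          hypothetical function (source, target) -> likelihood
    """
    candidates = [
        other for other, day in infection_day.items()
        if other != blog and day <= infection_day[blog]
    ]
    explicit = [c for c in candidates if c in explicit_links.get(blog, set())]
    if len(explicit) == 1:
        return explicit[0], "explicit"  # single explicit link: draw it
    if explicit:
        # several explicit links: keep the most likely one
        return max(explicit, key=lambda c: score(c, blog)), "explicit"
    if candidates:
        # no explicit link: fall back to the most likely possible link
        best = max(candidates, key=lambda c: score(c, blog))
        if score(best, blog) >= threshold:
            return best, "inferred"
    return None, None

# Made-up epidemic: b1 is infected first, b2 links to b1, b3 links to nobody.
infection_day = {"b1": 0, "b2": 1, "b3": 2}
explicit_links = {"b2": {"b1"}}
toy_scores = {"b1": 0.2, "b2": 0.7}
score = lambda source, target: toy_scores.get(source, 0.0)

edges = {b: choose_source(b, infection_day, explicit_links, score) for b in infection_day}
```

Running this over every infected blog yields at most one incoming edge per blog, which keeps the drawn graph tree-like and readable.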

http://www-idl.hpl.hp.com/blogstuff

Page 26: Blogs (web logs) contain online stamped entries

Giant Microbes epidemic visualization

[Legend: via link, explicit link, inferred link, blog]

Page 27: Blogs (web logs) contain online stamped entries

iRank

Find early sources of good information using inferred information paths or timing

[Figure: blogs b1, b2, b3, b4, b5, …, bn, with labels "True source" and "Popular site"]

Page 28: Blogs (web logs) contain online stamped entries

iRank Algorithm

• Draw a weighted edge for all pairs of blogs that cite the same URL
• Higher weight for mentions closer together
• Run PageRank
• Control for ‘spam’

[Figure: time-of-infection axis from t0 to t1]
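A rough sketch of these steps follows. Several choices are assumptions not stated on the slide: edges point from the later blog to the earlier one so that rank flows toward early sources, the weight is 1 divided by the gap in days, same-day pairs are skipped because their order is unknown, and the damping factor is the conventional 0.85. The spam control step is not shown.

```python
from collections import defaultdict
from itertools import combinations

def build_edges(mentions):
    """mentions: URL -> {blog: day of first mention}. Returns weights[later][earlier]."""
    weights = defaultdict(lambda: defaultdict(float))
    for days in mentions.values():
        for a, b in combinations(days, 2):
            if days[a] == days[b]:
                continue  # same day: direction unknown, skipped in this sketch
            earlier, later = (a, b) if days[a] < days[b] else (b, a)
            # assumed weighting: closer mentions get a higher weight
            weights[later][earlier] += 1.0 / (days[later] - days[earlier])
    return weights

def pagerank(weights, nodes, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration; dangling nodes spread rank evenly."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = weights.get(v, {})
            total = sum(out.values())
            if total == 0:
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                for u, w in out.items():
                    new[u] += damping * rank[v] * w / total
        rank = new
    return rank

# Made-up data: b1 cites both URLs first, b3 is always last.
mentions = {
    "url1": {"b1": 0, "b2": 1, "b3": 3},
    "url2": {"b1": 2, "b3": 3},
}
nodes = ["b1", "b2", "b3"]
rank = pagerank(build_edges(mentions), nodes)
```

In this toy example the consistently early blog b1 ends up with the highest rank, which is the behavior the slide describes as finding early sources of good information.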

Page 29: Blogs (web logs) contain online stamped entries

Do Bloggers Kill Kittens?

02:00 AM Friday Mar. 05, 2004 PST Wired publishes:

"Warning: Blogs Can Be Infectious."

7:25 AM Friday Mar. 05, 2004 PST Slashdot posts:

"Bloggers' Plagiarism Scientifically Proven"

9:55 AM Friday Mar. 05, 2004 PST Metafilter announces

"A good amount of bloggers are outright thieves."

Page 30: Blogs (web logs) contain online stamped entries

For more info

Information Dynamics Lab @ HP
http://www.hpl.hp.com/research/idl

Blog Epidemic Analyzer
http://www-idl.hpl.hp.com/blogstuff

Eytan, Li, Lada & Rajan
http://www.hpl.hp.com/research/idl/people/eytan/
http://www.hpl.hp.com/personal/Li_Zhang/
http://www.hpl.hp.com/personal/Lada_Adamic
http://www.hpl.hp.com/research/idl/people/lukose/

Page 31: Blogs (web logs) contain online stamped entries

CNN: Wal-Mart banishes bawdy mags