Blogosphere by FrancoSH

Blogosphere: Research Issues, Tools and Applications EDA June, 2010 Franco Sánchez Huertas (UCSP) 21/06/2010 UCSP -FASH 1

description

A complete analysis of the Blogosphere and its interaction with Web 2.0

Transcript of Blogosphere by FrancoSH

Page 1: Blogosphere by FrancoSH

Blogosphere: Research Issues, Tools and Applications

EDA – June, 2010

Franco Sánchez Huertas

(UCSP)

21/06/2010 UCSP -FASH 1

Page 2: Blogosphere by FrancoSH

Overview

• Background: Web 2.0 and Social Networks

• Blogosphere: Definition, Types, and Comparison

• Blogosphere Research Issues

• Tools and APIs

• Data Collection

• Searching the Influentials: The Top Bloggers

• Conclusions


Page 3: Blogosphere by FrancoSH

Web 2.0 and Social Networks


Page 4: Blogosphere by FrancoSH

Characteristics of Web 2.0

• Rich Internet Applications

• User generated contents

• User enriched contents

• User developed widgets

• Collaborative environment: Participatory Web, Citizen journalism

• Thus, it leverages the power of the Long Tail with user generated data as the driving force

• More of a paradigm shift than a technology shift


Page 5: Blogosphere by FrancoSH

Technology Overview of Web 2.0

• Cascading Style Sheets to aid in the separation of presentation and content

• Folksonomies (collaborative tagging, social classification, social indexing, and social tagging)

• REST and/or XML- and/or JSON-based APIs

• Rich Internet application techniques, often Ajax and/or Flex, Flash-based

• Semantically valid XHTML and HTML markup

• Syndication, aggregation and notification of data in RSS or Atom feeds

• Mashups, merging content from different sources, client- and server-side

• Weblog-publishing tools

• Wiki or forum software to support user-generated content

Page 6: Blogosphere by FrancoSH

Some Web 2.0 Services

• Blogs
– Blogspot
– Wordpress
– Lamula (Perú)

• Wikis
– Wikipedia
– Wikiversity

• Social Networking Sites
– Facebook
– Twitter
– MySpace
– Orkut

• Digital media sharing websites
– Youtube
– Flickr
– Vimeo
– Twitpic

• Social Tagging
– Del.icio.us

Page 7: Blogosphere by FrancoSH

Social Networks

• A social structure made of nodes (individuals or organizations) that are related to each other by various interdependencies such as friendship, kinship, likes, ...

• Graphical representation

– Nodes = members – Edges = relationships



Page 9: Blogosphere by FrancoSH

Social Networks


• Various realizations
– Social bookmarking (Del.icio.us)
– Friendship networks (Facebook, MySpace)
– Blogosphere
– Media Sharing (Flickr, Youtube)
– Folksonomies

Page 10: Blogosphere by FrancoSH

BLOGOSPHERE

Definitions, Types, and Comparison


Page 11: Blogosphere by FrancoSH

Blogging Phenomenon

• It’s growing fast as a new means for online communications and interactions

• A blogger could gain instant fame via his blogs

• A blogger may make a good living with her blogs

• Abundant, lucrative business opportunities

• A new political arena


Page 12: Blogosphere by FrancoSH

Blog Structure

[Figure: Blog Site]

Page 13: Blogosphere by FrancoSH

Blog Structure

[Figure: Blog Post]

Page 14: Blogosphere by FrancoSH

Blog Structure

[Figure: Blogger]

Page 15: Blogosphere by FrancoSH

Types of Blogs

• Individual vs. community
– Single authored (Individual blog sites)
– Multi authored (Community blog sites)

• Regulated vs. anonymous

Individual Blog Sites | Community Blog Sites
Owned and maintained by individual users. | Owned and maintained by a group of like-minded users.
More like personal accounts, journals or diaries. | More like discussion forums and discussion boards.
No or almost negligible group interaction. | High degree of group discussion and collaboration.
No or almost negligible collective wisdom. | Enormous collective wisdom and open source intelligence.

Page 16: Blogosphere by FrancoSH

Blogosphere


• Complex Social Networks

• Vertices (Nodes): Bloggers/ Blog posts/Blog sites

• Edges: Relationships/Links

• In-Degree: Number of inlinks

• Out-Degree: Number of outlinks
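The graph vocabulary above can be made concrete with a small sketch. It assumes the Blogosphere is given as a list of directed hyperlinks between blog sites; the site names and the `degree_counts` helper are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def degree_counts(edges):
    """Return (in_degree, out_degree) dicts for a directed edge list."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for source, target in edges:
        out_deg[source] += 1  # source has one more outlink
        in_deg[target] += 1   # target has one more inlink
    return dict(in_deg), dict(out_deg)


# Hypothetical blog sites; each pair means "source links to target".
links = [("blogA", "blogB"), ("blogC", "blogB"), ("blogB", "blogA")]
in_degree, out_degree = degree_counts(links)
print(in_degree)
print(out_degree)
```

Nodes could equally be bloggers or blog posts; only the meaning of an edge changes.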


Page 17: Blogosphere by FrancoSH

Friendship Networks vs. Blogosphere

Friendship Networks | Blogosphere
Explicit Links/Edges | Implicit Links/Edges
Undirected Graph | Directed Graph
Network Centrality Measures | Blog Statistics
Quantifying Spread of Influence | Quantifying Influential Members
Nodes are members/actors | Nodes can be bloggers/blogs or blog sites
Strictly defined graph structure | Loosely defined graph structure
“Being in touch” or “Making Friends” | Sharing ideas and opinions
Person-to-person | Person-to-group
Friendship Oriented | Community Oriented
Member’s Reputation/Trust based on network connections and/or location in the network | Member’s Reputation/Trust based on the response to other member’s knowledge solicitations

Page 18: Blogosphere by FrancoSH

Friendship Networks vs. Blogosphere

[Figure: Social Networks, containing Social Friendship Networks and Blogosphere as overlapping sets]

– Social Friendship Networks: Orkut, Facebook, LinkedIn, Classmates.com, etc.

– Both: LiveJournal, MySpace, etc.

– Blogosphere: TUAW, Blogger, Windows Live Spaces, etc.

Page 19: Blogosphere by FrancoSH

BLOGOSPHERE RESEARCH ISSUES


Page 20: Blogosphere by FrancoSH

Understanding Blogosphere


• Blogosphere

• Blog sites

• Bloggers

• Blog posts

• Reverse chronologically ordered entries

• Blogroll

• Permalinks

• Everyone can publish, but few are heard

• Many interesting questions to address

– How to build traffic

– How to find niche online

– How to increase influence

– How to …

• Fertile research domain

Page 21: Blogosphere by FrancoSH

Understanding Blogosphere

• Understand structures and properties of Blogosphere

• Gain insights into the relationships between bloggers, readers, blog posts, comments, different blog sites in Blogosphere

• Models help generate artificial data, tune the parameters to simulate special scenarios, and compare various studies and different algorithms

• Study peculiarities in Blogosphere and infer latent patterns and structures that could explain certain phenomena like influence, diffusion, splogs, community discovery.


Page 22: Blogosphere by FrancoSH

Modeling Web and Blogosphere

• Some key differences between Web and Blogosphere

– Models developed for the Web assume a dense graph structure due to a large number of interconnecting hyperlinks within webpages. This assumption does not hold for the Blogosphere, which is shown to have a very sparse hyperlink structure [Kritikopoulos et al. 2006].

– The level of interaction in terms of comments and replies to blog posts makes the Blogosphere different from the Web

– The highly dynamic and “short-lived” nature of blog posts cannot be simulated by Web models, which do not consider dynamicity in webpages

– Web models assume webpages accumulate links over time. However, this is not true of the Blogosphere

– “Categories” and “tags” give blogs flexibility that conventional websites typically don’t have

– Blogs use descriptive filenames in permalinks, unlike typical webpage filenames


Page 25: Blogosphere by FrancoSH

Modeling Blogosphere

• Preferential attachment

– Probability of a new edge being added to a node depends on the node's degree

– “The rich get richer”

– Power law distribution or scale free distribution

– $P(e : v_i \to v_j) \propto \deg(v_j) / |V|$

• Hybrid model

– Mixture of both preferential attachment model and random model

– Give a lucky poor guy some chance to get rich

– To handle irreducibility (strong connectedness with few isolated subgraphs), the random walk on a graph model proposes a random jump with a fixed probability

– $P(e : v_i \to v_j) \propto \alpha \deg(v_j) / |V| + (1 - \alpha)$, where $\alpha$ is the mixing weight between the two models

• Leskovec et al. 2007 studied temporal patterns

– How often people create blog posts

– Burstiness and popularity

– How these posts are linked and what is the link density

– Developed an SIS-based model

• Kumar et al. 2003 use blogrolls on the blog posts to construct a network of blog posts, assuming that blogrolls contain similar blog posts
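As a rough illustration of the hybrid model, the sketch below grows a degree table one edge at a time. It is a minimal simulation under stated assumptions, not the model of any cited paper: `alpha` is an assumed mixing probability, only the degree of the receiving node is tracked, and the function names are hypothetical.

```python
import random


def pick_target(degrees, alpha, rng):
    """Pick the node that receives a new edge.

    With probability alpha the choice is proportional to degree
    (preferential attachment); otherwise it is uniform (random model).
    """
    nodes = list(degrees)
    total = sum(degrees.values())
    if total > 0 and rng.random() < alpha:
        weights = [degrees[n] for n in nodes]
        return rng.choices(nodes, weights=weights, k=1)[0]
    return rng.choice(nodes)


def grow(num_new_edges, degrees, alpha, seed=0):
    """Add edges one by one and return the updated degree table."""
    rng = random.Random(seed)
    degrees = dict(degrees)  # work on a copy
    for _ in range(num_new_edges):
        degrees[pick_target(degrees, alpha, rng)] += 1
    return degrees


result = grow(100, {"a": 1, "b": 1, "c": 1}, alpha=0.8)
print(result)
```

Setting `alpha=1.0` gives pure preferential attachment ("the rich get richer"), while `alpha=0.0` gives the purely random model in which a "lucky poor guy" is as likely to be picked as anyone else.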


Page 26: Blogosphere by FrancoSH

Blog Clustering


Page 27: Blogosphere by FrancoSH

Blog Clustering

• Dynamic and automatic organization of the content

• Convenient accessibility

• Optimizing search engines by reducing search space

– Search only the relevant cluster

• Focused crawling

• Summarization

• Topic identification

• Reduce information overload

– 175,000 blog posts per day, i.e., 2 blog posts per second – Dec 2006

• Extraction and analysis of the trends


Page 28: Blogosphere by FrancoSH

Blog Clustering

• Brooks and Montanez 2006 used tf-idf and picked top 3 keywords for blog posts

– Clustered blogs based on these keywords

– Reported improved clustering as compared to that using tags

• Li et al. 2007 assigned different weights to title, body, and comments of blog posts

– Need to address high dimensionality and sparsity due to their keyword-based approach

• Agarwal et al. 2008 proposed a collective-wisdom based approach

– Generate a category relation graph based on user assignments

– Compute similarity matrix from this graph

$tf_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$

$idf_i = \log \dfrac{|D|}{|\{d_j : t_i \in d_j\}|}$

$tfidf_{i,j} = tf_{i,j} \cdot idf_i$
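The tf-idf definitions above can be sketched in a few lines. This is an illustrative implementation only, assuming each blog post is already tokenized into a list of words; the `tfidf_top_keywords` name and the sample posts are hypothetical, and ties are broken alphabetically as an arbitrary choice.

```python
import math
from collections import Counter


def tfidf_top_keywords(docs, k=3):
    """docs: list of token lists. Returns the top-k tf-idf terms per doc."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    result = []
    for tokens in docs:
        counts = Counter(tokens)
        total = sum(counts.values())
        scores = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        }
        # Highest score first; alphabetical order breaks ties.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result


posts = [
    ["python", "code", "code", "blog"],
    ["politics", "vote", "blog"],
    ["music", "blog", "band"],
]
top = tfidf_top_keywords(posts, k=2)
print(top)
```

A term such as "blog" that appears in every post gets an idf of zero, so it never ranks as a distinguishing keyword.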


Page 29: Blogosphere by FrancoSH

Blog Mining

• Interactions between producers and consumers improved with blogs

• Consumers not only speak their mind but also broadcast their opinions

• Blogs are invaluable information sources

– consumers’ beliefs and opinions,

– initial reaction to a launch,

– understand consumer language,

– track trends and buzzwords, and

– fine-tune information needs

• Blog conversations leave behind the trails of links, useful for understanding how information flows and how opinions are shaped and influenced

• Tracking blogs also helps in gaining deeper insights

Page 30: Blogosphere by FrancoSH

Blog Influence

• Two types of influence

– Influential blog sites and site networks [Gill 2004, Gruhl et al 2004, Java et al 2006]

– Influential bloggers in a community [Agarwal et al. 2008]

• Blogosphere vs. Friendship Networks

– Implicit vs. Explicit links

– Blog statistics vs. Centrality measures

– “influencing” vs. “could influence”

– Loosely vs. Strictly defined graph structures

• Blog vs. Webpage Ranking

– Blog sites too sparse for webpage ranking algorithms to work [Kritikopoulos et al 2006]

– Webpage acquires authority over time, blog posts’ influence diminishes

– Greedy approach works better than PageRank, HITS to maximize influence flow [Kempe et al 2003, Richardson & Domingos 2002]

Page 31: Blogosphere by FrancoSH

Issue of Trust

• Open standards and low barriers to publishing have created an overwhelming amount of collective wisdom

• Yet in some cases it is more difficult for readers to discern whom to trust

• Similar to WWW

– Authoritative webpages, e.g., HITS [Kleinberg 1998], PageRank [Page et al. 1999]

• The Blogosphere allows the masses to create and edit content, compromising the sanctity of the original content

• Some work exists for the social friendship network domain, but not many researchers have explored the Blogosphere

• Huge potential for trust study in Blogosphere domain

Page 32: Blogosphere by FrancoSH

Trust

• Kale et al. 2007 transformed the problem of trust in blogosphere to the one in social friendship networks

– Studied propagation of trust among different blog sites

– Mined sentiments from a window of words around hyperlinks

– Identified positive, negative, or neutral sentiments towards the linked blog site

– Constructed a network of blog sites using hyperlinks

– Used Gruhl et al. 2004 trust propagation algorithm

– Some concerns

• These blog sites have to be linked for trust propagation

• Trust is computed between blog sites based on how much one blog agrees or disagrees with the other

$M_{i+1} = M_i \cdot C_i$ (perform till convergence)

$M$ = belief matrix; $C_i$ = atomic propagation

$C_i = M + M^T M + M^T + M M^T$
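The propagation step above can be sketched with plain nested lists. The slide does not say whether the atomic propagation matrix is recomputed at every step, so this sketch assumes it is built once from the initial belief matrix; it also runs a fixed number of steps instead of testing for convergence, since unnormalized values can grow without bound. All function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def add(*matrices):
    """Element-wise sum of matrices of identical shape."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]


def atomic_propagation(m):
    """C = M + M^T M + M^T + M M^T, built from the belief matrix M."""
    mt = transpose(m)
    return add(m, matmul(mt, m), mt, matmul(m, mt))


def propagate(m, steps):
    """Apply M_{i+1} = M_i * C for a fixed number of steps (assumption)."""
    c = atomic_propagation(m)
    current = m
    for _ in range(steps):
        current = matmul(current, c)
    return current


# Hypothetical two-site network: site 0 trusts site 1.
belief = [[0, 1], [0, 0]]
print(atomic_propagation(belief))
print(propagate(belief, 1))
```

The four terms of the sum correspond to the four ways the formula combines the belief matrix with its transpose; a practical implementation would likely weight and normalize them.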


Page 33: Blogosphere by FrancoSH

Community Extraction

• Blogosphere doesn’t have an explicit notion of communities

• Different from blog clustering

• Researchers identify communities based on

– Links: network of hyperlinks allows identification of virtual communities

• Several studies on finding community of webpages like Kleinberg 1998 and Kumar et al. 1999

• While Kleinberg used authority and hubs idea to explore communities of webpages, Kumar et al. extended the idea of hubs and authorities and included co-citations as a way to extract all communities on the web and used graph theoretic algorithms to identify all instances of graph structures that reflect community characteristics.

– Content: blogs with similar content or inspired by the same event form a virtual community

• Kumar et al. 2003, Efimova and Hendrick 2005, Blanchard 2004


Page 34: Blogosphere by FrancoSH

Community Extraction

• Chin and Chignell 2006 proposed a model for finding communities taking the blogging behavior of bloggers into account

– They complemented behavioral approaches with a blog reader survey when studying blog communities.

• Blanchard and Marcus 2004 studied a multiple sport newsgroup “Virtual Settlement” and analyzed the possibility of emerging virtual communities

– Newsgroups and discussion forums are similar in terms of interaction patterns to Blogosphere

– More person-to-group interaction rather than person-to-person interaction


Page 35: Blogosphere by FrancoSH

Spam blog (Splogs) Filtering

• One of the major rising concerns on Blogosphere

• Spammers make most of their money by getting viewers to click on ads that run adjacent to their nonsensical text

• Open standards and low barriers to publishing escalate the problem and make it harder to solve

• Besides degrading search quality, splogs consume network resources

Page 36: Blogosphere by FrancoSH

Spam blog (Splogs) Filtering


• Initial research applied web spam detection approaches

– Ntoulas et al. 2006 distinguish between normal webpages and spam webpages based on statistical properties such as number of words, average length of words, anchor text, title keyword frequency, and tokenized URL

– Gyongyi et al. 2004 and Gyongyi et al. 2006 use PageRank to compute the spam score of a webpage

• Kolari et al. 2006 consider each blog post as a static webpage and use both content and hyperlinks to classify a blog post as spam using an SVM-based classifier
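To give a feel for the content statistics mentioned above, here is a minimal feature-extraction sketch. It is not the method of any cited paper: the tokenizer, the feature names, and the definition of the title-keyword fraction are all assumptions, and a real filter would feed such features into a trained classifier.

```python
import re

WORD = re.compile(r"[A-Za-z0-9']+")


def content_features(text, title=""):
    """Compute a few simple content statistics for one blog post."""
    words = WORD.findall(text.lower())
    title_words = set(WORD.findall(title.lower()))
    n = len(words)
    return {
        "num_words": n,
        "avg_word_length": sum(map(len, words)) / n if n else 0.0,
        # Share of body words that also occur in the title (assumed definition).
        "title_keyword_fraction":
            sum(w in title_words for w in words) / n if n else 0.0,
    }


features = content_features("Cheap pills cheap deals", title="Cheap pills")
print(features)
```

A post that mostly repeats its own title keywords is one of the patterns such statistics are meant to expose.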


Page 37: Blogosphere by FrancoSH

Tools and APIs

Working in the Blogosphere…


Page 38: Blogosphere by FrancoSH

Analysis and Visualization Tools

• Tools

– Data Analysis & Visualization tools

– Statistics like centrality measures

• NetLogo (http://ccl.northwestern.edu/netlogo/)

– Multi-agent programming language and modeling environment designed in Logo

– Modelers can give instructions to hundreds or thousands of concurrently operating autonomous agents.

– Exploring the connection between the individuals (micro-level) and the patterns that emerge from the interaction of many individuals (macro-level).

Page 39: Blogosphere by FrancoSH

Analysis and Visualization Tools

• UCINet (http://www.analytictech.com/)

– Package for the analysis of social network data including centrality measures, subgroup identification, role analysis, elementary graph theory, and permutation-based statistical analysis

– Has strong matrix analysis routines, such as matrix algebra and multivariate statistics

• Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/)

– Slovenian for spider

– Analyzing and visualizing large networks like social networks

• Network package in R (http://cran.r-project.org/src/contrib/Descriptions/network.htm)

– The network class can represent a range of relational data types, and supports arbitrary vertex/edge/graph attributes

– Used to create and/or modify network objects for social network analysis (SNA)

Page 40: Blogosphere by FrancoSH

Analysis and Visualization Tools

• InFlow (http://www.orgnet.com/inflow3.html)

– Integrated product for network analysis and visualization

– Used in the SNA domain

• NetMiner (http://www.netminer.com/)

– Tool for exploratory network data analysis and visualization

– NetMiner lets users explore network data visually and interactively, and helps in detecting underlying patterns and structures of the network

Page 41: Blogosphere by FrancoSH

APIs

• APIs

– Data collection (blog posts, inlinks, tags, etc.)

– Technorati

– Digg

– del.icio.us

– Facebook

– StumbleUpon


Page 42: Blogosphere by FrancoSH

Technorati API

• bloginfo query

API url: http://api.technorati.com/bloginfo?key=[apikey]&url=[blog url]

Sample response:

<result>

<url>[URL]</url>

<weblog>

<name>[blog name]</name>

<url>[blog URL]</url>

<rssurl>[blog RSS URL]</rssurl>

<atomurl>[blog Atom URL]</atomurl>

<inboundblogs>[inbound blogs]</inboundblogs>

<inboundlinks>[inbound links]</inboundlinks>

<lastupdate>[date blog last updated]</lastupdate>

<rank>[blog ranking]</rank>

<lang></lang>

<foafurl>[blog foaf URL]</foafurl>

</weblog>

</result>


Page 43: Blogosphere by FrancoSH

Technorati API

• BlogPostTags query

API url: http://api.technorati.com/blogposttags?key=[apikey]&url=[blog url]

Sample response:

<document>

<result>

<querycount>[limit parameter]</querycount>

</result>

<item>

<tag>[tag name]</tag>

<posts>[tag count]</posts>

</item>

</document>


Page 44: Blogosphere by FrancoSH

del.icio.us API

https://api.del.icio.us/v1/tags/get

Returns a list of tags and number of times used

Sample response

<tags>

<tag count="1" tag="activedesktop" />

<tag count="1" tag="business" />

<tag count="3" tag="radio" />

<tag count="5" tag="xml" />

<tag count="1" tag="xp" />

<tag count="1" tag="xpi" />

</tags>
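A response shaped like the sample above can be turned into a tag-count table with the Python standard library. This is only a sketch: it parses a copy of the sample text and makes no network call, and the `parse_tags` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Copy of the sample response shown above.
SAMPLE = """<tags>
  <tag count="1" tag="activedesktop" />
  <tag count="1" tag="business" />
  <tag count="3" tag="radio" />
  <tag count="5" tag="xml" />
  <tag count="1" tag="xp" />
  <tag count="1" tag="xpi" />
</tags>"""


def parse_tags(xml_text):
    """Map each tag name to the number of times it was used."""
    root = ET.fromstring(xml_text)
    return {el.get("tag"): int(el.get("count")) for el in root.iter("tag")}


tag_counts = parse_tags(SAMPLE)
most_used = max(tag_counts, key=tag_counts.get)
print(tag_counts)
print(most_used)
```

The same pattern would apply to the Technorati sample responses, with element text read instead of attributes.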


Page 45: Blogosphere by FrancoSH

Data Collection

Using the Blogosphere…

Page 46: Blogosphere by FrancoSH

Available Datasets

• TREC (http://ir.dcs.gla.ac.uk/test_collections/blog06info.html)

– A crawl of Feeds, and associated Permalink and homepage documents (from late 2005 and early 2006)

– 100,649 feeds were polled once a week for 11 weeks

– Total number of feeds collected: 753,681

– Average feeds collected every day: 10,615

– Uncompressed size: 38.6 GB; compressed size: 8.0 GB

– Reasonably sized spam component for added realism

– Fee: £400 ~ $794.36


Page 47: Blogosphere by FrancoSH

Available Datasets

• Mobile Network (http://kdl.cs.umass.edu/data/msn/msn-info.html)

– 27 objects

– over 180,000 links

– 1 object attribute

– 2 link attributes

• Other ways

– Crawl blogs

– Blogcatalog

– Statistics available from technorati API

– Tagging available from del.icio.us API


Page 48: Blogosphere by FrancoSH

Data Crawler

• BlogTrackers

– User interface to crawl blog sites

• Scratch crawling (from blog archives)

• Incremental crawling (from RSS feeds)

– Stores the blog posts in Microsoft SQL Server

– Collects: blog post title, blog post tags, blog post content, blog post permalink, outlinks, blogger name, inlinks, blog post date and time

– Tracks blog posts, e.g., generates tag clouds for a user-specified time window

Page 49: Blogosphere by FrancoSH

Collectable Statistics from Blogs

• Inbound links – Blogs, blog post, webpage

• Outbound links – Blogs, blog post, webpage

• Comments

• Blog server logs

• Subscribers

• Time to read/length

• Links to post and incoming traffic from them

• Links from post and outgoing traffic to them

• Topic frequency score

• Blogroll links

• Tagged urls (del.icio.us, furl)


Page 50: Blogosphere by FrancoSH

Searching the Influentials: The Top Bloggers

• Active bloggers

– Easy to define

– Often listed at a blog site

– Are they necessarily influential?

• How to define an influential blogger?

– Influential bloggers have influential posts

– Subjective

– Collectable statistics

– How to use these statistics


Page 51: Blogosphere by FrancoSH

Intuitive Properties

• Social Gestures (statistics)

– Recognition: Citations (incoming links)

An influential blog post is recognized by many. The more influential the referring posts are, the more influential the referred post becomes.

– Activity Generation: Volume of discussion (comments)

Amount of discussion initiated by a blog post can be measured by the comments it receives. A large number of comments indicates that the blog post affects many such that they care to write comments, hence influential.

– Novelty: Referring to (outgoing links)

Novel ideas exert more influence. A large number of outlinks suggests that the blog post refers to several other blog posts, hence less novel.

– Eloquence: “goodness” of a blog post (length)

An influential blogger is often eloquent. Given the informal nature of the Blogosphere, there is no incentive for a blogger to write a lengthy piece that bores the readers. Hence, a long post often suggests some necessity of doing so.

• Influence Score = f(Social Gestures)
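One possible shape for such a function is a weighted sum of the four gestures. The sketch below is purely illustrative: the linear form, the weight names, and their default values are assumptions and are not taken from the source.

```python
def influence_score(inlinks, comments, outlinks, length_words,
                    w_in=1.0, w_com=1.0, w_out=1.0, w_len=0.01):
    """Hypothetical linear f(social gestures); all weights are assumptions."""
    return (w_in * inlinks            # recognition raises the score
            + w_com * comments        # activity generation raises the score
            - w_out * outlinks        # many outlinks suggest less novelty
            + w_len * length_words)   # longer posts treated as more eloquent


# Hypothetical statistics for two blog posts.
posts = {
    "post1": dict(inlinks=10, comments=5, outlinks=2, length_words=300),
    "post2": dict(inlinks=1, comments=0, outlinks=8, length_words=100),
}
ranking = sorted(posts, key=lambda p: influence_score(**posts[p]), reverse=True)
print(ranking)
```

The weights let an analyst emphasize whichever gesture matters most for a given community. Note that this sketch ignores how influential the referring posts themselves are, which the recognition property says should also count.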


Page 52: Blogosphere by FrancoSH

Understanding the Influentials

• Are influential bloggers simply active bloggers?

• If not, in what ways are they different?

– Can the model differentiate them?

• Are there different types of influential bloggers?

• What other parameters can we include to evolve the model?

• Are there temporal patterns of the influential bloggers?


Page 53: Blogosphere by FrancoSH

Active & Influential Bloggers

• Active and Influential Bloggers

• Inactive but Influential Bloggers

• Active but Non-influential Bloggers

• The authors do not consider “Inactive and Non-influential Bloggers”, because such bloggers seldom submit blog posts and do not influence others.

Page 54: Blogosphere by FrancoSH

Conclusions…

The Blogosphere is one of the fastest-growing social networking media. The virtual communities in the Blogosphere are not constrained by physical proximity and allow anytime, anywhere, and instant communications.

In this paper the authors discuss current research issues in the Blogosphere, including modeling, blog clustering, blog mining, community discovery and factorization, influence and propagation, trust and reputation, and filtering spam blogs.

Page 55: Blogosphere by FrancoSH

Questions
