Blogosphere

Blogosphere: Research Issues, Tools, and Applications Nitin Agarwal and Huan Liu Sunil Bandla INF384H – Fall 2011


Page 1: Blogosphere

Blogosphere: Research Issues, Tools, and Applications

Nitin Agarwal and Huan Liu

Sunil Bandla
INF384H – Fall 2011

Page 2: Blogosphere

Agenda
Introduction
Research issues
Tools and Methods
Case Study
Blogosphere and Social Networks

Page 3: Blogosphere

Web 2.0
It is the reason behind the surge of interest in online communities
Former consumers are now producers
Collaborative environment
User-generated content
Collective wisdom
Web 2.0 services:
  Blogs, wikis, social networking sites, social tagging
  WordPress, Wikipedia, Facebook, YouTube, Twitter, Yelp

Page 4: Blogosphere

Social Networks
“A social network is a social structure made up of individuals connected by one or more types of interdependency, such as friendship, common interest…” – Wikipedia
Web 2.0 is enabling virtual social networks
Size and connectedness vary across networks
Examples:
  Friendship networks (Facebook, Myspace)
  Media sharing (Flickr, YouTube)

Page 5: Blogosphere

Source: The New York Times

“The site, chock full of advertising, is a moneymaking machine – so much so that Ms. Armstrong and her husband have both quit their regular jobs.” The reason? The advertisers are eager to influence her 850,000 readers.

Arnold Kim, founder and senior editor of MacRumors.com.

“The site places MacRumors No. 2 on a list of the ‘25 most valuable blogs,’ …” What is the potential value? “Two of the other tech-oriented blogs on its list, …, were sold earlier this year, reportedly for sums in excess of $25 million.”

Slide Credit: Liu & Nitin

Page 6: Blogosphere

Blogosphere
Blog sites
Bloggers
Blog posts
Blogroll
Permalinks
Low barrier to publication
Readers can comment instantly, which gives the blogger a feeling of satisfaction
Individual vs. community blogs

Page 7: Blogosphere

Blogosphere
Complex social networks
Bloggers/blog posts/blog sites become nodes
Relationships are represented by edges between nodes
Inlinks & outlinks
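The graph view on this slide can be sketched in a few lines of Python. This is only an illustration: the blog names and links are made-up toy data, and `build_graph` is a hypothetical helper, not something taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy data: (source blog, target blog) hyperlink pairs.
links = [
    ("blog_a", "blog_b"),
    ("blog_a", "blog_c"),
    ("blog_b", "blog_c"),
    ("blog_d", "blog_c"),
]

def build_graph(edges):
    """Return (outlinks, inlinks) adjacency maps for a directed graph."""
    outlinks = defaultdict(set)
    inlinks = defaultdict(set)
    for source, target in edges:
        outlinks[source].add(target)   # edge leaving the source node
        inlinks[target].add(source)    # same edge, seen from the target node
    return outlinks, inlinks

outlinks, inlinks = build_graph(links)
nodes = sorted(set(outlinks) | set(inlinks))
for node in nodes:
    print(node, "in:", len(inlinks[node]), "out:", len(outlinks[node]))
```

The same structure works whether the nodes stand for bloggers, blog posts, or blog sites; only the meaning of an edge changes.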

Page 9: Blogosphere

Modeling the Blogosphere
Helps in generating an artificial dataset to compare algorithms
Study patterns that could explain community discovery, spam blogs, influence, etc.

Key differences between Web and Blogosphere

Web                                      | Blogosphere
Web models assume dense graph structure  | Blogosphere has a very sparse hyperlink structure
Not much interaction                     | Interaction in the form of comments and replies
Static web pages                         | Dynamic blog posts
Conventional web pages do not have tags  | Blog posts have tags and categories

Page 10: Blogosphere

Modeling the Blogosphere
Web models:
  Random graph
  Preferential attachment graph models
  Hybrid graph models
Blogosphere models:
  To study temporal patterns of the blogosphere, like how often people create blog posts and how they are linked
  Blogrolls to create a network of connected posts
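As a rough illustration of the preferential attachment idea listed under web models, the sketch below grows a graph one node at a time. The single-link-per-node rule, the "+1" smoothing, and the function name `preferential_attachment` are assumptions made for this example, not details from the slides.

```python
import random

def preferential_attachment(num_nodes, seed=0):
    """Grow a directed graph one node at a time.

    Each new node links to one existing node, chosen with probability
    proportional to (inlink count + 1), so well-linked nodes attract more links.
    """
    rng = random.Random(seed)
    indegree = [0]   # node 0 starts alone
    edges = []
    for new_node in range(1, num_nodes):
        weights = [d + 1 for d in indegree]
        target = rng.choices(range(new_node), weights=weights, k=1)[0]
        edges.append((new_node, target))
        indegree[target] += 1
        indegree.append(0)   # the new node has no inlinks yet
    return edges, indegree

edges, indegree = preferential_attachment(200)
print("edges:", len(edges), "max inlinks:", max(indegree))
```

A synthetic graph like this is one way to get an artificial dataset for comparing algorithms, although a blogosphere model would also need to capture sparsity and posting times.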

Page 11: Blogosphere

Blog Clustering
Automatic organization of the content
Helps readers focus on interesting categories
Keyword based:
  Brooks and Montanez 2006 pick the top 3 keywords to cluster blog posts
  Li et al. 2007 assign different weights to the title, body and comments of blog posts
Collective wisdom based:
  Agarwal et al. 2008 use a category relation graph to merge categories and cluster blogs
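The keyword-based idea can be sketched as follows. The slide only says that the top 3 keywords are picked; scoring terms by TF-IDF and grouping posts that share a keyword are assumptions for this example, and the posts are toy data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy posts; a real pipeline would also remove stop words.
posts = {
    "p1": "iphone battery review iphone battery life",
    "p2": "iphone battery tips battery drain",
    "p3": "election debate poll election results",
}

def top_keywords(posts, k=3):
    """Return the k highest-scoring TF-IDF terms for each post."""
    tokens = {pid: text.lower().split() for pid, text in posts.items()}
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    n_docs = len(tokens)
    top = {}
    for pid, words in tokens.items():
        term_freq = Counter(words)
        scores = {
            term: count * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in term_freq.items()
        }
        # Sort by score, breaking ties alphabetically for a stable result.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        top[pid] = ranked[:k]
    return top

def group_by_keyword(top):
    """Group posts that share a top keyword; drop single-post groups."""
    groups = defaultdict(list)
    for pid in sorted(top):
        for term in top[pid]:
            groups[term].append(pid)
    return {term: pids for term, pids in groups.items() if len(pids) > 1}

top = top_keywords(posts)
clusters = group_by_keyword(top)
print(top)
print(clusters)
```

Here the two phone posts end up in one group through a shared keyword, while the election post stays on its own.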

Page 12: Blogosphere

Blog Mining
Valuable resources to track:
  Consumers’ beliefs and opinions
  Initial reaction to a launch
  Trends and buzzwords
Blog conversations provide insights into how information flows and how opinions are shaped and influenced
Pulse uses a Naïve Bayes classifier trained on annotated sentences to classify unlabeled data
Attardi and Simi 2006 use opinionated words acquired from WordNet to improve blog retrieval
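To make the Naïve Bayes step concrete, here is a minimal word-count classifier trained on a handful of annotated sentences. It is a generic sketch, not a description of how Pulse is implemented: the training sentences, the labels, and the add-one smoothing are all assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical annotated sentences: (text, label).
training = [
    ("love this phone great battery", "positive"),
    ("great screen love it", "positive"),
    ("terrible battery hate it", "negative"),
    ("hate the screen terrible phone", "negative"),
]

def train(examples):
    """Collect per-label word counts, label frequencies, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(training)
print(classify("love the great battery", model))
print(classify("terrible phone hate", model))
```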

Page 13: Blogosphere

Community Discovery
Content analysis and text analysis of the blog posts to identify communities
Kleinberg et al. cluster all the expert communities together as authorities using an authority-based approach
Kumar et al. extend it to include co-citations to extract all communities on the web
Some researchers studied community extraction using newsgroups and discussion boards
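An authority-based approach can be illustrated with a hubs-and-authorities iteration. The sketch assumes the slide refers to something in the style of the HITS algorithm; the toy edges, the iteration count, and the function name `hits` are choices made for this example only.

```python
import math

def hits(edges, iterations=50):
    """Iteratively compute hub and authority scores for a directed graph."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the nodes linking in.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        # Hub score: sum of authority scores of the nodes linked to.
        new_hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_hub[source] += auth[target]
        # Normalize both vectors so the scores stay bounded.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {node: v / auth_norm for node, v in auth.items()}
        hub = {node: v / hub_norm for node, v in new_hub.items()}
    return hub, auth

# Hypothetical toy graph: a, b, d link to c; a and b also link to e.
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "e"), ("b", "e")]
hub, auth = hits(edges)
print("top authority:", max(auth, key=auth.get))
```

Nodes that many good hubs point to rise as authorities, which is the sense in which expert pages cluster together.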

Page 14: Blogosphere

Influence in Blogs
Influential bloggers:
  Are potential market-movers
  Sway opinions in political campaigns
  Troubleshoot the problems of peer consumers
  Useful for “word-of-mouth” advertising of products
Finding influential blog sites is different from identifying influential bloggers
Agarwal et al. studied the influence of a blogger by modeling the blog site as a graph

Page 15: Blogosphere

Trust and Reputation
Overwhelming amount of collective wisdom
Difficult for a reader to decide whom to trust
Assess the reputation of influential members in the community
Not much work deals with trust in the Blogosphere
Kale et al. 2007 mined sentiments about the cited blog post using a window of words around the links
  They compute trust in a network of blog sites
Use comments on the blog post to judge a blogger’s trust

Page 16: Blogosphere

Filtering Spam Blogs
Splogs == Spam blogs
Degrade search quality and waste network resources
Initial researchers used web spam detection techniques
Kolari et al. 2006 use content and hyperlinks to train an SVM-based classifier to classify a blog post as spam
Content on blog sites is dynamic, so content-based spam filters are ineffective
Lin et al. propose a self-similarity-based splog detection algorithm based on patterns in posting times of splogs, content similarity and similar links in splogs
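One of the signals named above, content similarity between a blog's own posts, can be approximated very roughly as below. This is not Lin et al.'s algorithm: using Jaccard similarity over word sets, comparing only consecutive posts, and the toy posts themselves are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def content_self_similarity(posts):
    """Average Jaccard similarity between consecutive posts of one blog."""
    word_sets = [set(post.lower().split()) for post in posts]
    pairs = list(zip(word_sets, word_sets[1:]))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy data: a templated splog and an ordinary blog.
splog_posts = [
    "buy cheap pills online now",
    "buy cheap watches online now",
    "buy cheap loans online now",
]
blog_posts = [
    "my trip to austin",
    "notes on graph theory",
    "new camera review",
]

splog_score = content_self_similarity(splog_posts)
blog_score = content_self_similarity(blog_posts)
print(round(splog_score, 3), round(blog_score, 3))
```

The same function could be applied to sets of outgoing links instead of words to cover the "similar links" signal.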

Page 18: Blogosphere

Tools and APIs
Tools to simulate social networks to study their properties
Multi-agent simulation tools
Analysis of social networks
Visualization of social networks
APIs:
  Facebook
  StumbleUpon
  Del.icio.us

Page 19: Blogosphere

Methodologies
Centrality measures
Content analysis
Link analysis
Decision theoretic approaches
Agent-based modeling

Page 20: Blogosphere

Datasets
Nielsen BuzzMetrics dataset
  About 14M blog posts from 3M blog sites
  Annotated with 1.7M blog-blog links
  Up to half of the blog outlinks are missing
  Only 51% of the total blog posts are in English
Enron Email dataset
  Emails from about 150 users at Enron
  0.5M messages
  Social networks between users were studied based on link construction
  Email senders and recipients are used to construct links
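Link construction from senders and recipients might look like the sketch below. The message tuples are invented and do not reflect the actual Enron file format; counting repeated messages as edge weights is likewise an assumption.

```python
from collections import Counter

# Hypothetical messages: (sender, list of recipients).
messages = [
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("alice", ["bob"]),
]

def build_links(messages):
    """Count directed sender -> recipient links across all messages."""
    links = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:  # skip self-addressed copies
                links[(sender, recipient)] += 1
    return links

links = build_links(messages)
for (sender, recipient), weight in sorted(links.items()):
    print(sender, "->", recipient, "weight", weight)
```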

Page 21: Blogosphere

Experiments and Performance Metrics
Concepts like influence, trust, etc. in the Blogosphere are socio-psychological and subjective
Evaluating them is non-trivial
Hard to compare different approaches since there is no ground truth!
Search engines’ rankings serve as the baseline for most of the existing works
A Web 2.0 application, Digg, was used to evaluate influence in the blogosphere

Page 23: Blogosphere

Finding influential bloggers
“A blogger can be influential if s/he has more than one influential blog post”
Properties that represent influential blog posts:
  Recognition – An influential blog post is recognized by many
  Activity Generation – Number of comments received and amount of discussion initiated
  Novelty – Number of outlinks
  Eloquence – Length of a post
Data Collection
  The Unofficial Apple Weblog
  Crawled 10,000 posts
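The four properties can be combined into a toy score to show how the definition on this slide might be applied. This is a simplified sketch, not the model from the paper: the linear form, the weights, the threshold, the use of inlinks as the recognition measure, and the choice to count outlinks against novelty are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Post:
    author: str
    inlinks: int    # recognition (assumed measure)
    comments: int   # activity generation
    outlinks: int   # novelty (assumed: more outlinks means less novel)
    length: int     # eloquence, in words

# Hypothetical weights, chosen only for illustration.
W_IN, W_COMMENTS, W_OUT, W_LENGTH = 1.0, 0.5, 0.25, 0.001

def post_score(post):
    """Weighted combination of the four post properties."""
    return (W_IN * post.inlinks
            + W_COMMENTS * post.comments
            - W_OUT * post.outlinks
            + W_LENGTH * post.length)

def influential_bloggers(posts, threshold=10.0):
    """Bloggers with more than one post scoring above the threshold."""
    counts = Counter(p.author for p in posts if post_score(p) > threshold)
    return sorted(author for author, n in counts.items() if n > 1)

posts = [
    Post("ann", 12, 8, 2, 400),
    Post("ann", 9, 6, 1, 600),
    Post("ben", 20, 10, 0, 300),
    Post("ben", 1, 0, 5, 100),
    Post("cat", 2, 1, 3, 200),
]
print(influential_bloggers(posts))
```

In this toy data "ben" has one strong post but only one, so only "ann" meets the "more than one influential post" condition.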

Page 24: Blogosphere

Results
Top 5 bloggers according to TUAW and the proposed model
Some bloggers are both active and influential
Some of them are active but not influential
Some influential bloggers are not active
Inactive and non-influential bloggers

Page 25: Blogosphere

Verification
Challenges:
  No testing and training data
  Absence of ground truth
Use another Web 2.0 site, Digg, to provide a reference point
A more liked post will have a higher score on Digg
Digg returns the top 100 voted posts
Intersection of the Digg 100 and the top 20 from their model
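The intersection check can be expressed as a small helper. The post identifiers and list sizes below are toy values, and `overlap` is a hypothetical name; the real comparison would use the Digg top 100 and the model's top 20.

```python
def overlap(model_ranking, reference_ranking, k=20):
    """Return the posts shared by the model's top k and the reference list."""
    reference = set(reference_ranking)
    return [post for post in model_ranking[:k] if post in reference]

# Hypothetical toy rankings, best first.
model_ranking = ["p3", "p1", "p7", "p9"]
digg_ranking = ["p1", "p2", "p3", "p4"]

shared = overlap(model_ranking, digg_ranking, k=3)
print(shared, "overlap size:", len(shared))
```

A larger overlap suggests the model agrees more closely with what Digg users voted up, which stands in for the missing ground truth.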

Page 26: Blogosphere

Verification
Importance of each parameter
Inlinks > comments > outlinks > blog post length, in decreasing order of importance to influence estimation

Page 28: Blogosphere

Blogosphere and Social Networks

Blogosphere                                      | Social Networks
Influential nodes have “been influencing”        | Influential nodes “could influence”
To share ideas or opinions                       | To stay in touch or make friends
Reputation is based on previous responses        | Reputation is based on the number of connections
Person-to-group interaction                      | Person-to-person interaction
Community experience                             | Friendship experience
Loosely defined graph                            | Strictly defined graph
Nodes could be bloggers, blog posts, blog sites  | Nodes are members
Implicit links                                   | Predefined links
Directed graph                                   | Undirected graph

Page 29: Blogosphere

Conclusion
Virtual communities and a low barrier to publication are helping the growth of the blogosphere
A lot is yet to be done in terms of research specific to the blogosphere
Need accurate ground truth data
Experiments and an evaluation plan should be devised to allow objective analysis of different algorithms

Page 30: Blogosphere

Thank you!

Page 31: Blogosphere

References

http://www.sigkdd.org/explorations/issues/10-1-2008-07/V10N1-Blogosphere.pdf

http://videolectures.net/kdd08_liu_briat/