Modeling Text and Links: Overview
William W. Cohen
Machine Learning Dept. and Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Joint work with: Amr Ahmed, Andrew Arnold, Ramnath Balasubramanyan, Frank Lin, Matt Hurst (MSFT), Ramesh Nallapati, Noah Smith, Eric Xing, Tae Yano
Outline
• Tools for analysis of text
  – Probabilistic models for text, communities, and time
    • Mixture models and LDA models for text
    • LDA extensions to model hyperlink structure
Introduction to Topic Models
Introduction to Topic Models
• Probabilistic Latent Semantic Analysis (PLSA)

[Plate diagram: d → z → w, inner plate N, outer plate M; θd is the document-specific topic distribution]

• Select document d ~ Mult(π)
• For each position n = 1, …, Nd:
  – generate zn ~ Mult(· | θd)
  – generate wn ~ Mult(· | βzn)

Mixture model: each document is generated by a single (unknown) multinomial distribution over words; the corpus is “mixed” by π.
PLSA model: each word is generated by a single (unknown) multinomial distribution over words; each document is mixed by its document-specific topic distribution θd.
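The PLSA generative story above can be sketched directly. Here β (topic-word distributions) and θ (per-document topic mixtures) are random toy parameters; all names and sizes are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, M = 6, 2, 3                            # vocabulary size, topics, documents
beta = rng.dirichlet(np.ones(V), size=K)     # one word multinomial per topic
theta = rng.dirichlet(np.ones(K), size=M)    # one topic mixture per document

def generate_doc(d, n_words=10):
    """For each position: draw topic z_n ~ Mult(theta_d),
    then word w_n ~ Mult(beta_{z_n})."""
    doc = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta[d])
        doc.append(int(rng.choice(V, p=beta[z])))
    return doc

docs = [generate_doc(d) for d in range(M)]
```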
Introduction to Topic Models
• PLSA topics (TDT-1 corpus)
Introduction to Topic Models
• The Latent Dirichlet Allocation paper (Blei, Ng & Jordan, JMLR, 2003)
Introduction to Topic Models
• Latent Dirichlet Allocation (LDA)

[Plate diagram: α → θ → z → w, inner plate N, outer plate M]

• For each document d = 1, …, M:
  – Generate θd ~ Dir(· | α)
  – For each position n = 1, …, Nd:
    • generate zn ~ Mult(· | θd)
    • generate wn ~ Mult(· | βzn)
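LDA's only change to the PLSA story is that each document's topic mixture θd is itself drawn from a Dirichlet prior with hyperparameter α. A minimal sketch, with toy sizes and an illustrative α:

```python
import numpy as np

rng = np.random.default_rng(1)
V, K, M, N_d = 6, 2, 3, 8                    # vocab, topics, docs, words per doc
alpha = np.full(K, 0.5)                      # Dirichlet hyperparameter (toy value)
beta = rng.dirichlet(np.ones(V), size=K)     # topic-word distributions

corpus = []
for d in range(M):
    theta_d = rng.dirichlet(alpha)           # theta_d ~ Dir(alpha): the LDA addition
    doc = [int(rng.choice(V, p=beta[rng.choice(K, p=theta_d)]))
           for _ in range(N_d)]              # z_n ~ Mult(theta_d), w_n ~ Mult(beta_z)
    corpus.append(doc)
```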
Introduction to Topic Models
• Perplexity comparison of various models: unigram, mixture model, PLSA, LDA (lower is better)
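Perplexity, the metric behind this comparison, is the exponentiated negative average per-token log-likelihood on held-out text; lower means the model finds the text less surprising. A small illustrative helper:

```python
import math

def perplexity(token_log_probs):
    """exp(- mean log p(w)) over held-out tokens; lower is better."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model assigning probability 1/8 to each of 16 held-out tokens
# has perplexity 8: it is as uncertain as a uniform 8-way choice.
ppl = perplexity([math.log(1 / 8)] * 16)
```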
Introduction to Topic Models
• Prediction accuracy for classification using topic models as features (higher is better)
Outline
• Tools for analysis of text
  – Probabilistic models for text, communities, and time
    • Mixture models and LDA models for text
    • LDA extensions to model hyperlink structure
    • LDA extensions to model time
  – Alternative framework based on graph analysis to model time & community
    • Preliminary results & tradeoffs
• Discussion of results & challenges
Hyperlink modeling using PLSA
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

[Plate diagram: document d generates words w through topics z (plate N) and citations c through topics z′ (plate L); outer plate M]

• Select document d ~ Mult(π)
• For each position n = 1, …, Nd:
  – generate zn ~ Mult(· | θd)
  – generate wn ~ Mult(· | βzn)
• For each citation j = 1, …, Ld:
  – generate zj ~ Mult(· | θd)
  – generate cj ~ Mult(· | γzj)
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

[Plate diagram repeated from the previous slide]

PLSA likelihood (over words only); the new likelihood adds a corresponding term over citations. Learning using EM.
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]

Heuristic: a weight α with 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks; the content term is weighted by α and the hyperlink term by (1 − α).
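A plausible reconstruction of the weighted objective, with symbols following the PLSA slides (the paper's exact notation may differ): the word term is weighted by α and the citation term by (1 − α).

```latex
\mathcal{L}(\alpha) \;=\; \sum_{d}\Big[\,
  \alpha \sum_{w} n(d,w)\,\log \sum_{z} P(w \mid z)\,P(z \mid d)
  \;+\; (1-\alpha) \sum_{c} n(d,c)\,\log \sum_{z'} P(c \mid z')\,P(z' \mid d)
\,\Big]
```

Here n(d,w) and n(d,c) count word occurrences and citations in document d; at α = 1 this reduces to plain PLSA, and at α = 0 links alone drive the topics.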
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
• Experiments: text classification
• Datasets:
  – WebKB: 6000 CS dept web pages with hyperlinks; 6 classes: faculty, course, student, staff, etc.
  – Cora: 2000 machine learning abstracts with citations; 7 classes: sub-areas of machine learning
• Methodology:
  – Learn the model on the complete data and obtain θd for each document
  – Classify each test document with the label of its nearest neighbor in the training set
  – Distance measured as cosine similarity in the θ space
  – Measure performance as a function of α
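The methodology above amounts to 1-nearest-neighbor classification in θ space under cosine similarity. A sketch with made-up topic vectors and labels:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two topic-mixture vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nn_label(theta_test, train_thetas, train_labels):
    """Label a test document with the label of the training document
    whose topic mixture theta is closest in cosine similarity."""
    sims = [cosine(theta_test, t) for t in train_thetas]
    return train_labels[int(np.argmax(sims))]

train = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]
labels = ["faculty", "course"]
pred = nn_label(np.array([0.8, 0.2]), train, labels)
```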
Hyperlink modeling using PLSA [Cohn and Hofmann, NIPS 2001]
• Classification performance (x-axis sweeps α from hyperlink-only to content-only weighting)
Hyperlink modeling using LDA
Hyperlink modeling using LinkLDA [Erosheva, Fienberg, Lafferty, PNAS, 2004]

[Plate diagram: α → θ, with θ driving both z → w (plate N) and z′ → c (plate L); outer plate M]

• For each document d = 1, …, M:
  – Generate θd ~ Dir(· | α)
  – For each position n = 1, …, Nd:
    • generate zn ~ Mult(· | θd)
    • generate wn ~ Mult(· | βzn)
  – For each citation j = 1, …, Ld:
    • generate zj ~ Mult(· | θd)
    • generate cj ~ Mult(· | γzj)

Learning using variational EM
Hyperlink modeling using LDA[Erosheva, Fienberg, Lafferty, PNAS, 2004]
Newswire Text
• Goals of analysis:
  – Extract information about events from text
  – “Understanding” text requires understanding the “typical reader”:
    • conventions for communicating with him/her
    • prior knowledge, background, …

Social Media Text
• Goals of analysis:
  – Very diverse
  – Evaluation is difficult, and requires revisiting often as goals evolve
  – Often “understanding” social text requires understanding a community

Science as a testbed for social text: an open community which we understand
Author-Topic Model for Scientific Literature
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

[Plate diagram: for each word, an author x is chosen from the document’s author set ad, then a topic z from θx, then the word w from φz; plates N and M, plus A author distributions θ and K topic distributions φ]

• For each author a = 1, …, A:
  – Generate θa ~ Dir(· | α)
• For each topic k = 1, …, K:
  – Generate φk ~ Dir(· | β)
• For each document d = 1, …, M:
  – For each position n = 1, …, Nd:
    • Generate author x ~ Unif(ad), the document’s author set
    • generate zn ~ Mult(· | θx)
    • generate wn ~ Mult(· | φzn)
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]

Learning: Gibbs sampling

[Plate diagram repeated from the previous slide]
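For intuition, here is one sweep of collapsed Gibbs sampling for plain LDA, the sampler that the author-topic model extends (the real author-topic sampler also resamples the author assignment x; names, hyperparameters, and toy data below are illustrative):

```python
import numpy as np

def gibbs_sweep(docs, z, K, V, alpha=0.1, eta=0.01, rng=None):
    """One pass of collapsed Gibbs sampling for plain LDA.
    Conditional: p(z_i = k | rest) ∝ (n_dk + alpha) * (n_kw + eta) / (n_k + V*eta)."""
    rng = rng or np.random.default_rng(3)
    ndk = np.zeros((len(docs), K))   # doc-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # per-topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]              # remove the current assignment...
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k              # ...and resample from the conditional
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z

docs = [[0, 1, 2], [2, 3, 3]]
z0 = [[0, 1, 0], [1, 0, 1]]
z1 = gibbs_sweep(docs, z0, K=2, V=4)
```

In practice many sweeps are run until the assignments mix, and θ and φ are read off the final count tables.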
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
• Topic-author visualization
Stopped here
Author-Topic Model for Scientific Literature [Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004]
• Application 1: Author similarity
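One natural way to score author similarity here is to compare two authors' topic distributions θa, for example with a symmetrized KL divergence (the choice of divergence is illustrative, as are the toy distributions):

```python
import math

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two authors' topic
    distributions; smaller means more similar research topics."""
    kl = lambda a, b: sum(x * math.log((x + eps) / (y + eps))
                          for x, y in zip(a, b))
    return kl(p, q) + kl(q, p)

a1, a2, a3 = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]
```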
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI ’05]
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI ’05]
Gibbs sampling
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI ’05]
• Datasets:
  – Enron email data: 23,488 messages between 147 users
  – McCallum’s personal email: 23,488(?) messages with 128 authors
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI ’05]
• Topic visualization: Enron set
Author-Topic-Recipient model for email data [McCallum, Corrada-Emmanuel, Wang, IJCAI ’05]
• Topic visualization: McCallum’s data
Modeling Citation Influences
Modeling Citation Influences [Dietz, Bickel, Scheffer, ICML 2007]
• Copycat model of citation influence:
  – LDA model for cited papers
  – Extended LDA model for citing papers
  – For each word, depending on a coin flip c, you might choose to copy a word from a cited paper instead of generating the word
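The coin-flip step can be rendered schematically: with probability λ the topic for a word is borrowed from a cited paper's mixture (the word is "copied"), otherwise it comes from the citing paper's own mixture. This is an illustrative sketch, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(4)
V, K = 6, 2
beta = rng.dirichlet(np.ones(V), size=K)     # shared topic-word distributions

def citing_word(theta_own, cited_thetas, lam=0.7):
    """Coin flip c ~ Bernoulli(lam): heads -> draw the topic from a
    (uniformly chosen) cited paper's mixture, tails -> use the citing
    paper's own mixture; then emit a word from that topic."""
    if rng.random() < lam:
        theta = cited_thetas[rng.integers(len(cited_thetas))]
    else:
        theta = theta_own
    z = rng.choice(K, p=theta)
    return int(rng.choice(V, p=beta[z]))

words = [citing_word(np.array([0.9, 0.1]),
                     [np.array([0.2, 0.8]), np.array([0.5, 0.5])])
         for _ in range(20)]
```

Averaging the coin over a document estimates how much of it was "influenced" by each citation, which is what the influence graphs on the following slides visualize.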
Modeling Citation Influences[Dietz, Bickel, Scheffer, ICML 2007]
• Citation influence model
Modeling Citation Influences[Dietz, Bickel, Scheffer, ICML 2007]
• Citation influence graph for LDA paper
Models of hypertext for blogs [ICWSM 2008]
Ramesh Nallapati, me, Amr Ahmed, Eric Xing
Link-PLSA-LDA
• LinkLDA model for citing documents
• Variant of PLSA model for cited documents
• Topics are shared between citing and cited documents
• Links depend on topics in the two documents
Experiments
• 8.4M blog postings in the Nielsen/BuzzMetrics corpus
  – Collected over three weeks in summer 2005
• Selected all postings with >=2 inlinks or >=2 outlinks
  – 2248 citing documents (2+ outlinks), 1777 cited documents (2+ inlinks)
  – Only 68 documents appear in both sets; these are duplicated
• Fit model using variational EM
Topics in blogs
Model can answer questions like: which blogs are most likely to be cited when discussing topic z?
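In Link-PLSA-LDA each topic carries a multinomial over cited documents, so "which blogs are most likely to be cited when discussing topic z" reduces to sorting that distribution. A sketch (the parameter name omega_z and the blog ids are illustrative):

```python
def top_cited_for_topic(omega_z, cited_ids, k=3):
    """Rank cited documents by p(c | z), where omega_z[i] is topic z's
    probability of citing the document cited_ids[i]."""
    ranked = sorted(zip(cited_ids, omega_z), key=lambda pair: -pair[1])
    return [c for c, _ in ranked[:k]]

top = top_cited_for_topic([0.5, 0.1, 0.4], ["blogA", "blogB", "blogC"], k=2)
```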
Topics in blogs
Model can be evaluated by predicting which links an author will include in an article. (Plot compares Link-PLSA-LDA and Link-LDA; lower is better.)
Another model: Pairwise Link-LDA

[Plate diagram: two LDA components (z → w, plate N) for the citing and cited document, with a link indicator c depending on the topic assignments in both]

• LDA for both cited and citing documents
• Generate an indicator for every pair of docs (vs. generating pairs of docs)
• The link depends on the mixing components (the θ’s)
• A stochastic block model
Pairwise Link-LDA supports new inferences…
…but doesn’t perform better on link prediction
Outline
• Tools for analysis of text
  – Probabilistic models for text, communities, and time
    • Mixture models and LDA models for text
    • LDA extensions to model hyperlink structure
      – Observation: these models can be used for many purposes…
    • LDA extensions to model time
  – Alternative framework based on graph analysis to model time & community
• Discussion of results & challenges
Authors are using a number of clever tricks for inference….
Predicting Response to Political Blog Posts with Topic Models [NAACL ’09]
Tae Yano Noah Smith
Political blogs and comments
• Posts are often coupled with comment sections
• Comment style is casual, creative, less carefully edited
Political blogs and comments
• Most of the text associated with large “A-list” community blogs is comments
  – 5–20x as many words in comments as in the post text for the 5 sites considered in Yano et al.
• A large part of socially-created commentary in the blogosphere is comments, not blog-to-blog hyperlinks
• Comments do not just echo the post
Modeling political blogs
Our political blog model: CommentLDA

[Plate diagram legend] D = # of documents; N = # of words in the post; M = # of words in the comments; z, z′ = topic; w = word (in post); w′ = word (in comments); u = user
Modeling political blogs
Our proposed political blog model: CommentLDA
• The LHS is vanilla LDA
Modeling political blogs
Our proposed political blog model: CommentLDA
• The RHS captures the generation of the reaction separately from the post body
• Two separate sets of word distributions
• The two chambers share the same topic mixture
Modeling political blogs
Our proposed political blog model: CommentLDA
• User IDs of the commenters are included as part of the comment text
• Commenters generate the words in the comment section
Modeling political blogs
Another model we tried: CommentLDA with the words taken out of the comment section, making the model agnostic to the words in the comment section.
This model is structurally equivalent to LinkLDA from (Erosheva et al., 2004).
Topic discovery - Matthew Yglesias (MY) site
Comment prediction
• LinkLDA and CommentLDA consistently outperform baseline models
• Neither consistently outperforms the other

User prediction (precision at top 10) on the three sites (CB), (MY), (RS); bars from left to right: LinkLDA (-v, -r, -c), CommentLDA (-v, -r, -c), baselines (Freq, NB). Highlighted values: CommentLDA (R) 20.54%, LinkLDA (R) 16.92%, LinkLDA (C) 32.06%.
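The user-prediction metric above, precision at top 10, is simply the fraction of the model's ten highest-ranked users who actually commented on the post. A minimal sketch with made-up users:

```python
def precision_at_k(ranked_users, actual_commenters, k=10):
    """Fraction of the top-k predicted users who really commented."""
    hits = sum(1 for u in ranked_users[:k] if u in set(actual_commenters))
    return hits / k

# 2 of the 10 predicted users ("a" and "c") actually commented -> 0.2
p = precision_at_k(list("abcdefghij"), {"a", "c", "z"}, k=10)
```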
Social Media Text
• Goals of analysis:
  – Very diverse
  – Evaluation is difficult, and requires revisiting often as goals evolve
  – Often “understanding” social text requires understanding a community

Comments
• Probabilistic models can model many aspects of social text:
  – Community (links, comments)
  – Time
• Evaluation:
  – Introspective and qualitative on communities we understand (e.g., scientific communities)
  – Quantitative on predictive tasks: link prediction, user prediction, …
  – Against “gold-standard visualization” (sagas)