Journalistic Corpus Similarity over Time
description
Transcript of Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Journalistic Corpus Similarity over Time
American Association for Corpus Linguistics ConferenceProvo, Utah - March 12th - 15th 2008
Cristina Mota
Instituto Superior Tecnico & L2F INESC-ID (Portugal)&
New York University (USA)
(Advisors: Ralph Grishman & Nuno Mamede)
This research was funded by Fundacao para a Ciencia e a Tecnologia (doctoral scholarship SFRH/BD/3237/2000)
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Outline
1 Introduction
2 Methodology
3 Experiments
4 Final Remarks
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
MotivationApproach
1 IntroductionMotivationApproach
2 Methodology
3 Experiments
4 Final Remarks
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
MotivationApproach
Motivation
Mary is studying in the United States at Brigham Young University� NE Tagger �
MaryPER is studying in the United StatesLOC at Brigham
Young UniversityORG
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
MotivationApproach
Motivation
o o o o o o
oo
o o
o
o
o
o
o
o
05
1015
2025
Time frame (semester)N
ame
occu
rren
ces
/ 100
K w
ords
x
x
x
x x x x x x x x x x x x xO O
OO
O O
O
O O O
O
O
O
O
O O
xx
x
x
x x
x
x x x x x x x x x
oxOx
UECEEUnião EuropeiaComunidade Europeia
91a 92a 93a 94a 95a 96a 97a 98a
Do texts vary over time in a way that affects NE recognition?
Should NE taggers be also conceived time-aware?
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
MotivationApproach
Approach
Corpus Analysis
Measure corpus similarity based on
Words
Upper case words
Lower case words
Compute name and context overlaps
By type
By token
NER Performance Analysis
Assess performance by trainingand testing with differentconfigurations (train,test)
Time issue
More data or better data?
Labeled vs. unlabeled
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Corpus Similarity Algorithm (Kilgarriff, 2001)Application
1 Introduction
2 MethodologyCorpus Similarity Algorithm (Kilgarriff, 2001)Application
3 Experiments
4 Final Remarks
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Corpus Similarity Algorithm (Kilgarriff, 2001)Application
Corpus Similarity Algorithm (Kilgarriff, 2001)
Similarity(A,B):
Split corpus A and B into k slices each
Repeat m times:
Randomly allocate k2 slices to Ai and k
2 to Bi
Construct word frequency lists for Ai and Bi
Compute CBDF between A and B for the n most frequentwords of the joint corpus (Ai+Bi )[CBDF = χ
2 by degrees of freedom]
Output mean and standard deviation of CBDF of allexperiments
Repeat using corpus A only: Similarity(A,A) → Homogeneity(A)Repeat using corpus B only: Similarity(B,B) → Homogeneity(B)
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Corpus Similarity Algorithm (Kilgarriff, 2001)Application
Application
Corpus A
DAA′1
DAA′2
.
.
.
DAA′n
DAA′
Homogeneity(A)
12 Corpus A + 1
2 Corpus B
DAB′1
DAB′2
.
.
.
DAB′n
DAB
Similarity(A, B)
Corpus B
DBB′1
DBB′2
.
.
.
DBB′n
DBB′
Homogeneity(B)
Lower values of D ⇒ higher homogeneity/similarity
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Experimental SettingWithin-topic homogeneityWithin-topic similarity over timeCross-topic similarity over timeWithin-topic similarity using lower and upper case words
1 Introduction
2 Methodology
3 ExperimentsExperimental SettingWithin-topic homogeneityWithin-topic similarity over timeCross-topic similarity over timeWithin-topic similarity using lower and upper case words
4 Final Remarks
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Experimental SettingWithin-topic homogeneityWithin-topic similarity over timeCross-topic similarity over timeWithin-topic similarity using lower and upper case words
Experimental Setting
91a 92a 93a 94a 95a 96a 97a 98a
Time frame (semester)
Num
ber
of w
ords
0e+
002e
+06
4e+
066e
+06
8e+
061e
+07
CultureSportsEconomyPoliticsSociety
CETEMPublico (Santos & Rocha, 2001) is aPortuguese public journalistic corpus
Size: 180 million words
Time span: 8 years
Organization: randomly shuffled extracts[1 extract ≅ 2 paragraphs]
Classification: 10 topics and 16 timeframes (year + semester)
Mark up: paragraphs, sentences,enumeration lists and authors
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Experimental SettingWithin-topic homogeneityWithin-topic similarity over timeCross-topic similarity over timeWithin-topic similarity using lower and upper case words
Experimental Setting
Experiment parameters:
Topics: culture, sports, economy, politics and society
Time unit: semester
Text unit: sentence
Size: 10 slices x 37600 words per topic and time frame (80% lowercase and 8% upper case)
N most frequent words: 2000 words (80% lower case and 8% uppercase)
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Experimental SettingWithin-topic homogeneityWithin-topic similarity over timeCross-topic similarity over timeWithin-topic similarity using lower and upper case words
Within-topic homogeneity
Cul Spo Eco Pol Soc
1.16
1.18
1.20
1.22
1.24
1.26
1.28
Topics
Het
erog
enei
ty (
= m
ean
CB
DF
dis
tanc
e)
Very low values of homogeneity foreach semester within each topic
Politics has the most homogeneoussemesters
Culture has the least homogeneoussemesters
Economy values spread widely
What happens if we begin comparing similarity between semesters?
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Experimental SettingWithin-topic homogeneityWithin-topic similarity over timeCross-topic similarity over timeWithin-topic similarity using lower and upper case words
Within-topic similarity over time
Cul Spo Eco Pol Soc
12
34
56
7
Topics
Dis
sim
ilarit
y (=
mea
n C
BD
F d
ista
nce)
0 5 10 15
12
34
5
Politics (within−topic over time)
Time gap (semester)D
issi
mila
rity
(= m
ean
CB
DF
dis
tanc
e)
Within-topic similarity decreases over time
The similarity decay is more significant for some topics
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Experimental SettingWithin-topic homogeneityWithin-topic similarity over timeCross-topic similarity over timeWithin-topic similarity using lower and upper case words
Cross-topic similarity over time
Cul Spo Eco Pol Soc
510
1520
Topics
Dis
sim
ilarit
y (=
mea
n C
BD
F d
ista
nce)
0 5 10 15
510
1520
Politics (within− and cross−topic over time)
Time gap (semester)D
issi
mila
rity
(= m
ean
CB
DF
dis
tanc
e)
Cross-topic similarity doesn’t show a decreasing pattern
Within-topic similarity over time tends to cross-topic similarity
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Experimental SettingWithin-topic homogeneityWithin-topic similarity over timeCross-topic similarity over timeWithin-topic similarity using lower and upper case words
Within-topic similarity using lower and upper case words
words lower upper
24
68
1012
Frequency list word types
Dis
sim
ilarit
y (=
mea
n C
BD
F d
ista
nce)
0 5 10 15
24
68
1012
Politics (within−topic by word type)
Time gap (semester)D
issi
mila
rity
(= m
ean
CB
DF
dis
tanc
e)
Similarity based on upper case words decreases moresignificantly
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Experimental SettingWithin-topic homogeneityWithin-topic similarity over timeCross-topic similarity over timeWithin-topic similarity using lower and upper case words
Within-topic similarity using lower and upper case words
Cul Spo Eco Pol Soc
12
34
5
Dissimilarity based on lower case words
Topics
Dis
sim
ilarit
y (=
mea
n C
BD
F d
ista
nce)
Cul Spo Eco Pol Soc
24
68
1012
14
Dissimilarity based on capitalized words
TopicsD
issi
mila
rity
(= m
ean
CB
DF
dis
tanc
e)
Similarity profiles depend on types of words→ Sports and Politics have similar profiles with lower case words→ Politics varies much more than Sports based on upper case words
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
1 Introduction
2 Methodology
3 Experiments
4 Final Remarks
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Final Remarks
The within-topic similarity between texts decreases as weincrease the time gap between them
This tendency doesn’t seem to flatten in a short period of 8years
The within-topic similarity decrease is more significant forsome topics
In some cases the within-topic similarity values get close tocross-topic similarity values
The similarity based on capitalized words decreases in a morepronounced way
Cristina Mota Journalistic Corpus Similarity over Time
OutlineIntroductionMethodologyExperiments
Final Remarks
Final Remarks
Other related issues we are currently investigating aiming at betternamed entity recognition
Does the decrease in similarity affect the perfomance ofnamed entity recognition?
How do other more structured lexical units (e.g., namedentities and surrounding context) vary over time?
Cristina Mota Journalistic Corpus Similarity over Time