Journalistic Corpus Similarity over Time

OutlineIntroductionMethodologyExperiments

Final Remarks

Journalistic Corpus Similarity over Time

American Association for Corpus Linguistics ConferenceProvo, Utah - March 12th - 15th 2008

Cristina Mota

Instituto Superior Tecnico & L2F INESC-ID (Portugal)&

New York University (USA)

(Advisors: Ralph Grishman & Nuno Mamede)

This research was funded by Fundacao para a Ciencia e a Tecnologia (doctoral scholarship SFRH/BD/3237/2000)

Cristina Mota Journalistic Corpus Similarity over Time


Final Remarks

Outline

1 Introduction

2 Methodology

3 Experiments

4 Final Remarks



Final Remarks

MotivationApproach

1 IntroductionMotivationApproach

2 Methodology

3 Experiments

4 Final Remarks



Final Remarks

MotivationApproach

Motivation

Mary is studying in the United States at Brigham Young University� NE Tagger �

MaryPER is studying in the United StatesLOC at Brigham

Young UniversityORG



Final Remarks

MotivationApproach

Motivation

o o o o o o

oo

o o

o

o

o

o

o

o

05

1015

2025

Time frame (semester)N

ame

occu

rren

ces

/ 100

K w

ords

x

x

x

x x x x x x x x x x x x xO O

OO

O O

O

O O O

O

O

O

O

O O

xx

x

x

x x

x

x x x x x x x x x

oxOx

UECEEUnião EuropeiaComunidade Europeia

91a 92a 93a 94a 95a 96a 97a 98a

Do texts vary over time in a way that affects NE recognition?

Should NE taggers be also conceived time-aware?



Final Remarks

MotivationApproach

Approach

Corpus Analysis

Measure corpus similarity based on

Words

Upper case words

Lower case words

Compute name and context overlaps

By type

By token

NER Performance Analysis

Assess performance by trainingand testing with differentconfigurations (train,test)

Time issue

More data or better data?

Labeled vs. unlabeled



Final Remarks

Corpus Similarity Algorithm (Kilgarriff, 2001)Application

1 Introduction

2 MethodologyCorpus Similarity Algorithm (Kilgarriff, 2001)Application

3 Experiments

4 Final Remarks



Final Remarks


Corpus Similarity Algorithm (Kilgarriff, 2001)

Similarity(A,B):

Split corpus A and B into k slices each

Repeat m times:

Randomly allocate k2 slices to Ai and k

2 to Bi

Construct word frequency lists for Ai and Bi

Compute CBDF between A and B for the n most frequentwords of the joint corpus (Ai+Bi )[CBDF = χ

2 by degrees of freedom]

Output mean and standard deviation of CBDF of allexperiments

Repeat using corpus A only: Similarity(A,A) → Homogeneity(A)Repeat using corpus B only: Similarity(B,B) → Homogeneity(B)



Final Remarks


Application

Corpus A

DAA′1

DAA′2

.

.

.

DAA′n

DAA′

Homogeneity(A)

12 Corpus A + 1

2 Corpus B

DAB′1

DAB′2

.

.

.

DAB′n

DAB

Similarity(A, B)

Corpus B

DBB′1

DBB′2

.

.

.

DBB′n

DBB′

Homogeneity(B)

Lower values of D ⇒ higher homogeneity/similarity



Final Remarks

Experimental SettingWithin-topic homogeneityWithin-topic similarity over timeCross-topic similarity over timeWithin-topic similarity using lower and upper case words

1 Introduction

2 Methodology

3 ExperimentsExperimental SettingWithin-topic homogeneityWithin-topic similarity over timeCross-topic similarity over timeWithin-topic similarity using lower and upper case words

4 Final Remarks



Final Remarks


Experimental Setting

91a 92a 93a 94a 95a 96a 97a 98a

Time frame (semester)

Num

ber

of w

ords

0e+

002e

+06

4e+

066e

+06

8e+

061e

+07

CultureSportsEconomyPoliticsSociety

CETEMPublico (Santos & Rocha, 2001) is aPortuguese public journalistic corpus

Size: 180 million words

Time span: 8 years

Organization: randomly shuffled extracts[1 extract ≅ 2 paragraphs]

Classification: 10 topics and 16 timeframes (year + semester)

Mark up: paragraphs, sentences,enumeration lists and authors



Final Remarks


Experimental Setting

Experiment parameters:

Topics: culture, sports, economy, politics and society

Time unit: semester

Text unit: sentence

Size: 10 slices x 37600 words per topic and time frame (80% lowercase and 8% upper case)

N most frequent words: 2000 words (80% lower case and 8% uppercase)



Final Remarks


Within-topic homogeneity

Cul Spo Eco Pol Soc

1.16

1.18

1.20

1.22

1.24

1.26

1.28

Topics

Het

erog

enei

ty (

= m

ean

CB

DF

dis

tanc

e)

Very low values of homogeneity foreach semester within each topic

Politics has the most homogeneoussemesters

Culture has the least homogeneoussemesters

Economy values spread widely

What happens if we begin comparing similarity between semesters?



Final Remarks


Within-topic similarity over time

Cul Spo Eco Pol Soc

12

34

56

7

Topics

Dis

sim

ilarit

y (=

mea

n C

BD

F d

ista

nce)

0 5 10 15

12

34

5

Politics (within−topic over time)

Time gap (semester)D

issi

mila

rity

(= m

ean

CB

DF

dis

tanc

e)

Within-topic similarity decreases over time

The similarity decay is more significant for some topics



Final Remarks


Cross-topic similarity over time

Cul Spo Eco Pol Soc

510

1520

Topics

Dis

sim

ilarit

y (=

mea

n C

BD

F d

ista

nce)

0 5 10 15

510

1520

Politics (within− and cross−topic over time)


issi

mila

rity

(= m

ean

CB

DF

dis

tanc

e)

Cross-topic similarity doesn’t show a decreasing pattern

Within-topic similarity over time tends to cross-topic similarity



Final Remarks


Within-topic similarity using lower and upper case words

words lower upper

24

68

1012

Frequency list word types

Dis

sim

ilarit

y (=

mea

n C

BD

F d

ista

nce)

0 5 10 15

24

68

1012

Politics (within−topic by word type)


issi

mila

rity

(= m

ean

CB

DF

dis

tanc

e)

Similarity based on upper case words decreases moresignificantly



Final Remarks


Within-topic similarity using lower and upper case words

Cul Spo Eco Pol Soc

12

34

5

Dissimilarity based on lower case words

Topics

Dis

sim

ilarit

y (=

mea

n C

BD

F d

ista

nce)

Cul Spo Eco Pol Soc

24

68

1012

14

Dissimilarity based on capitalized words

TopicsD

issi

mila

rity

(= m

ean

CB

DF

dis

tanc

e)

Similarity profiles depend on types of words→ Sports and Politics have similar profiles with lower case words→ Politics varies much more than Sports based on upper case words



Final Remarks

1 Introduction

2 Methodology

3 Experiments

4 Final Remarks



Final Remarks

Final Remarks

The within-topic similarity between texts decreases as weincrease the time gap between them

This tendency doesn’t seem to flatten in a short period of 8years

The within-topic similarity decrease is more significant forsome topics

In some cases the within-topic similarity values get close tocross-topic similarity values

The similarity based on capitalized words decreases in a morepronounced way



Final Remarks

Final Remarks

Other related issues we are currently investigating aiming at betternamed entity recognition

Does the decrease in similarity affect the perfomance ofnamed entity recognition?

How do other more structured lexical units (e.g., namedentities and surrounding context) vary over time?


Journalistic Corpus Similarity over Time

Documents

Transcript of Journalistic Corpus Similarity over Time