Learning Within-Sentence Semantic Coherence
Elena Eneva
Rose Hoberman
Lucian Lita
Carnegie Mellon University
Semantic (in)Coherence
Trigram: content words unrelated

Effect on speech recognition:
– Actual Utterance: “THE BIRD FLU HAS AFFECTED CHICKENS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMANS SICK”
– Top Hypothesis: “THE BIRD FLU HAS AFFECTED SECONDS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMAN SAID”
Our goal: model semantic coherence
A Whole Sentence Exponential Model [Rosenfeld 1997]
P0(s) is an arbitrary initial model (typically N-gram)
fi(s)’s are arbitrary computable properties of s (aka features), each weighted by λi
Z is a universal normalizing constant
\Pr(s) \;\stackrel{\text{def}}{=}\; \frac{1}{Z}\, P_0(s)\, \exp\Big(\sum_i \lambda_i f_i(s)\Big)
A Methodology for Feature Induction

Given corpus T of training sentences:
1. Train best-possible baseline model, P0(s)
2. Use P0(s) to generate corpus T0 of “pseudo sentences”
3. Pose a challenge: find (computable) differences that allow discrimination between T and T0
4. Encode the differences as features fi(s)
5. Train a new model:
P_1(s) \;=\; \frac{1}{Z}\, P_0(s)\, \exp\Big(\sum_i \lambda_i f_i(s)\Big)
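In code, ranking recognizer hypotheses with this model reduces to comparing unnormalized log-scores, since Z is a single global constant. A minimal sketch (the baseline log-probability, feature value, and weight below are made-up numbers):

```python
def exponential_model_score(log_p0, feature_values, lambdas):
    """Unnormalized log-score under a whole-sentence exponential model:
    log P0(s) + sum_i lambda_i * f_i(s).  The normalizer Z is one
    global constant, so it can be dropped when ranking hypotheses."""
    return log_p0 + sum(lam * f for lam, f in zip(lambdas, feature_values))

# Made-up numbers: a baseline trigram log-probability of -42.0 and a
# single coherence feature with value 2.9 and weight 0.5.
score = exponential_model_score(-42.0, [2.9], [0.5])
```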
Discrimination Task:
1. - - - feel - - sacrifice - - sense - - - - - - - - -meant - - - - - - - - trust - - - - truth
2. - - kind - free trade agreements - - - living - - ziplock bag - - - - - - university japan's daiwa bank stocks step –
Are these content words generated from a trigram model or taken from a natural sentence?
Building on Prior Work
– Define “content words” (all but top 50)
– Goal: model distribution of content words in sentence
– Simplify: model pairwise co-occurrences (“content word pairs”)
– Collect contingency tables; calculate measure of association for them
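Collecting the counts for one word pair can be sketched as below (an illustrative simplification: per the slides, the actual tables are also conditioned on the distance between the two words):

```python
def contingency_table(sentences, w1, w2):
    """Within-sentence co-occurrence counts for a word pair, laid out as
    in the talk's 2x2 table: c11 = both words present, c12 = only w1,
    c21 = only w2, c22 = neither."""
    c11 = c12 = c21 = c22 = 0
    for sentence in sentences:
        words = set(sentence.split())
        has1, has2 = w1 in words, w2 in words
        if has1 and has2:
            c11 += 1
        elif has1:
            c12 += 1
        elif has2:
            c21 += 1
        else:
            c22 += 1
    return c11, c12, c21, c22

# Toy corpus: each string is one already-tokenized sentence.
counts = contingency_table(["a b", "a c", "b d", "c d"], "a", "b")
```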
Q Correlation Measure
Q values range from –1 to +1
Q \;=\; \frac{c_{11} c_{22} - c_{12} c_{21}}{c_{11} c_{22} + c_{12} c_{21}}

Derived from the co-occurrence contingency table:

          W1 yes   W1 no
W2 yes     c11      c21
W2 no      c12      c22
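The measure follows directly from the table (a minimal implementation; returning 0.0 on a zero denominator is our choice, not stated on the slide):

```python
def yules_q(c11, c12, c21, c22):
    """Yule's Q association measure from a 2x2 contingency table:
    (c11*c22 - c12*c21) / (c11*c22 + c12*c21), ranging over [-1, +1]."""
    num = c11 * c22 - c12 * c21
    den = c11 * c22 + c12 * c21
    return num / den if den else 0.0
```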
Density Estimates
We hypothesized:
– Trigram sentences: word-pair correlation completely determined by distance
– Natural sentences: word-pair correlation independent of distance

Kernel density estimation:
– distribution of Q values in each corpus
– at varying distances
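The density estimate over observed Q values can be sketched with a plain Gaussian kernel (the bandwidth and the sample Q values below are illustrative, not from the talk):

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a one-dimensional kernel density estimate: an average of
    Gaussian bumps of width `bandwidth` centered on each sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

# Illustrative Q values observed at one fixed distance in one corpus.
dens = gaussian_kde([0.2, 0.3, 0.8, 0.9], bandwidth=0.1)
```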
Q Distributions
[Figure: density of Q values, trigram-generated (dashed) vs. Broadcast News, at distance 1 and distance 3]
Likelihood Ratio Feature
L \;=\; \prod_{\text{wordpairs }(i,j)} \frac{\Pr(Q_{ij} \mid d_{ij}, \text{BNews})}{\Pr(Q_{ij} \mid d_{ij}, \text{Trigram})}
“she is a country singer searching for fame and fortune in nashville”

Q(country, nashville) = 0.76, distance = 8
Pr(Q = 0.76 | d = 8, BNews) = 0.32
Pr(Q = 0.76 | d = 8, Trigram) = 0.11
Likelihood ratio = 0.32 / 0.11 = 2.9
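Assuming the feature combines per-pair ratios Pr(Q | d, BNews) / Pr(Q | d, Trigram) multiplicatively (our reading of the slide), it can be computed as a sum of logs; the country/nashville numbers from the slide are plugged in below via constant stand-in density functions:

```python
import math

def likelihood_ratio_feature(pairs, p_bnews, p_trigram):
    """Log of the product over word pairs of
    Pr(Q | d, BNews) / Pr(Q | d, Trigram); summing logs avoids
    numerical under/overflow on sentences with many pairs."""
    return sum(math.log(p_bnews(q, d) / p_trigram(q, d)) for q, d in pairs)

# The slide's single pair: Q(country, nashville) = 0.76 at distance 8,
# with constant stand-in densities 0.32 (BNews) and 0.11 (Trigram).
feat = likelihood_ratio_feature([(0.76, 8)],
                                p_bnews=lambda q, d: 0.32,
                                p_trigram=lambda q, d: 0.11)
```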
Simpler Features
Q value based:
– Mean, median, min, max of Q values for content word pairs in the sentence (Cai et al. 2000)
– Percentage of Q values above a threshold
– High/low correlations across large/small distances

Other:
– Word and phrase repetition
– Percentage of stop words
– Longest sequence of consecutive stop/content words
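A few of the Q-statistic and stopword features can be sketched as follows (the stopword set here is a tiny stand-in for the talk's top-50 list, and q_lookup is a hypothetical table of precomputed Q values; pairs without an entry are skipped):

```python
from itertools import combinations
from statistics import mean, median

# Tiny stand-in for the talk's stopword list (the top most-frequent words).
STOPWORDS = {"the", "a", "of", "to", "and", "in", "is", "it"}

def simple_features(sentence, q_lookup, threshold=0.5):
    """Q-statistic and stopword features for one sentence.  q_lookup
    maps a frozenset word pair to its precomputed Q value."""
    words = sentence.lower().split()
    content = [w for w in words if w not in STOPWORDS]
    qs = [q_lookup[frozenset(p)]
          for p in combinations(sorted(set(content)), 2)
          if frozenset(p) in q_lookup]
    return {
        "q_mean": mean(qs) if qs else 0.0,
        "q_median": median(qs) if qs else 0.0,
        "q_min": min(qs, default=0.0),
        "q_max": max(qs, default=0.0),
        "pct_above": sum(q > threshold for q in qs) / len(qs) if qs else 0.0,
        "pct_stopwords": 1.0 - len(content) / len(words),
    }

# Hypothetical Q table for the bird-flu example sentence.
feats = simple_features("the bird flu affected chickens",
                        {frozenset(("bird", "flu")): 0.9,
                         frozenset(("bird", "chickens")): 0.7})
```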
Datasets
– LM and contingency tables (Q values) derived from 103 million words of Broadcast News (BN)
– From the remainder of the BN corpus and from sentences sampled from the trigram LM:
  – Q value distributions estimated from ~100,000 sentences
  – Decision tree trained and tested on ~60,000 sentences
– Disregarded sentences with < 7 words, e.g.:
  – “Mike Stevens says it’s not real”
  – “We’ve been hearing about it”
Experiments
Learners:
– C5.0 decision tree
– Boosting decision stumps with AdaBoost.MH

Methodology:
– 5-fold cross-validation on ~60,000 sentences
– Boosting for 300 rounds
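The 5-fold split can be sketched as below (a generic striped assignment of items to folds; the actual fold assignment used in the experiments is not specified):

```python
def five_fold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross-validation,
    assigning item i to fold i % k (a striped split)."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(five_fold_indices(10))
```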
Results
Feature Set                                 Classification Accuracy
Q mean, median, min, max (previous work)    73.39 ± 0.36
Likelihood Ratio                            77.76 ± 0.49
All but Likelihood Ratio                    80.37 ± 0.42
All Features                                80.37 ± 0.46
Likelihood Ratio + non-Q                    (value not recoverable)
Shannon-Style Experiment
– 50 sentences: ½ “real” and ½ trigram-generated, with stopwords replaced by dashes
– 30 participants:
  – average accuracy 73.77% ± 6
  – best individual accuracy 84%
– Our classifier: accuracy 78.9% ± 0.42
Summary
– Introduced a set of statistical features which capture aspects of semantic coherence
– Trained a decision tree to classify with accuracy of 80%
– Next step: incorporate features into the exponential LM
Future Work
Combat data sparsity:
– Confidence intervals
– Different correlation statistic
– Stemming or clustering vocabulary

Evaluate derived features:
– Incorporate into an exponential language model
– Evaluate the model on a practical application
Agreement among Participants
Expected Perplexity Reduction
Semantic coherence feature:
– 78% of broadcast news sentences
– 18% of trigram-generated sentences

Kullback-Leibler divergence: .814

Average perplexity reduction per word = .0419 (2^.814/21) per sentence?

Features modify the probability of the entire sentence, so the effect of the feature on per-word probability is small.
Distribution of Likelihood Ratio

[Figure: density of the likelihood ratio value, trigram-generated (dashed) vs. Broadcast News]
Discrimination Task
Natural sentence:
– but it doesn't feel like a sacrifice in a sense that you're really saying this is you know i'm meant to do things the right way and you trust it and tell the truth

Trigram-generated:
– they just kind of free trade agreements which have been living in a ziplock bag that you say that i see university japan's daiwa bank stocks step though
Q Values at Distance 1

[Figure: density of Q values at distance 1, trigram-generated (dashed) vs. Broadcast News]
Q Values at Distance 3

[Figure: density of Q values at distance 3, trigram-generated (dashed) vs. Broadcast News]
Outline
– The problem of semantic (in)coherence
– Incorporating this into the whole-sentence exponential LM
– Finding better features for this model using machine learning
– Semantic coherence features
– Experiments and results