![Page 1: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/1.jpg)
Applied Natural Language Processing
Info 256
Lecture 7: Testing (Feb 12, 2019)
David Bamman, UC Berkeley
![Page 2: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/2.jpg)
Significance in NLP
• You develop a new method for text classification; is it better than what came before?
• You’re developing a new model; should you include feature X? (when there is a cost to including it)
• You're developing a new model; does feature X reliably predict outcome Y?
![Page 3: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/3.jpg)
Evaluation
• A critical part of developing new algorithms and methods, and of demonstrating that they work
![Page 4: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/4.jpg)
Classification
𝓧 = set of all documents
𝒴 = {english, mandarin, greek, …}
A mapping h from input data x (drawn from instance space 𝓧) to a label (or labels) y from some enumerable output space 𝒴
x = a single document
y = ancient greek
![Page 5: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/5.jpg)
[Figure: the instance space 𝓧, partitioned into train, dev, and test sets]
![Page 6: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/6.jpg)
Experiment design

| | training | development | testing |
|---|---|---|---|
| size | 80% | 10% | 10% |
| purpose | training models | model selection | evaluation; never look at it until the very end |
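The 80/10/10 split above can be sketched in a few lines of Python. This is only an illustration: the function name `split_dataset`, the fixed seed, and the placeholder document names are assumptions, not part of the lecture.

```python
import random


def split_dataset(items, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle a copy of items and cut it into train/dev/test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains; set aside until the very end
    return train, dev, test


# Placeholder documents, purely for illustration
documents = [f"doc_{i}" for i in range(1000)]
train, dev, test = split_dataset(documents)
print(len(train), len(dev), len(test))
```

Splitting once up front, with a fixed seed, makes it harder to accidentally tune on the test portion.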
![Page 7: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/7.jpg)
Metrics
• Evaluations presuppose that you have some metric to evaluate the fitness of a model.
• Text classification: accuracy, precision, recall, F1
• Phrase-structure parsing: PARSEVAL (bracketing overlap)
• Dependency parsing: Labeled/unlabeled attachment score
• Machine translation: BLEU, METEOR
• Summarization: ROUGE
• Language model: perplexity
![Page 8: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/8.jpg)
Metrics
• Downstream tasks that use NLP to predict the natural world also have metrics:
• Predicting presidential approval rates from tweets.
• Predicting the type of job applicants from a job description.
• Conversational agent
![Page 9: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/9.jpg)
Binary classification

| System B ↓ / Truth → | puppy | fried chicken |
|---|---|---|
| puppy | 6 | 3 |
| fried chicken | 2 | 5 |

Accuracy = 11/16 = 68.75%
https://twitter.com/teenybiscuit/status/705232709220769792/photo/1
![Page 10: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/10.jpg)
Multiclass confusion matrix

| True (y) ↓ / Predicted (ŷ) → | Dem | Repub | Indep |
|---|---|---|---|
| Dem | 100 | 2 | 15 |
| Repub | 0 | 104 | 30 |
| Indep | 30 | 40 | 70 |
![Page 11: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/11.jpg)
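Accuracy

| True (y) ↓ / Predicted (ŷ) → | Dem | Repub | Indep |
|---|---|---|---|
| Dem | 100 | 2 | 15 |
| Repub | 0 | 104 | 30 |
| Indep | 30 | 40 | 70 |

$$\text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} I[\hat{y}_i = y_i]$$

$$I[x] = \begin{cases} 1 & \text{if } x \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$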
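As a rough sketch, accuracy can be read directly off the Dem/Repub/Indep confusion matrix from the slide: the correct predictions sit on the diagonal. The names `CONFUSION` and `accuracy` are illustrative, and the code assumes rows are true labels and columns are predicted labels, as the slide's axis labels indicate.

```python
LABELS = ["Dem", "Repub", "Indep"]

# Rows are true labels (y), columns are predicted labels (y-hat)
CONFUSION = [
    [100, 2, 15],
    [0, 104, 30],
    [30, 40, 70],
]


def accuracy(matrix):
    """Fraction of all data points whose prediction matches the true label."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal entries
    total = sum(sum(row) for row in matrix)
    return correct / total


print(f"accuracy = {accuracy(CONFUSION):.4f}")
```

Here 274 of the 391 data points lie on the diagonal, so the accuracy is roughly 0.70.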
![Page 12: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/12.jpg)
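Precision
Precision: proportion of predicted class that are actually that class.

| True (y) ↓ / Predicted (ŷ) → | Dem | Repub | Indep |
|---|---|---|---|
| Dem | 100 | 2 | 15 |
| Repub | 0 | 104 | 30 |
| Indep | 30 | 40 | 70 |

$$\text{Precision}(\text{Dem}) = \frac{\sum_{i=1}^{N} I(\hat{y}_i = y_i = \text{Dem})}{\sum_{i=1}^{N} I(\hat{y}_i = \text{Dem})}$$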
![Page 13: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/13.jpg)
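Recall
Recall: proportion of true class that are predicted to be that class.

| True (y) ↓ / Predicted (ŷ) → | Dem | Repub | Indep |
|---|---|---|---|
| Dem | 100 | 2 | 15 |
| Repub | 0 | 104 | 30 |
| Indep | 30 | 40 | 70 |

$$\text{Recall}(\text{Dem}) = \frac{\sum_{i=1}^{N} I(\hat{y}_i = y_i = \text{Dem})}{\sum_{i=1}^{N} I(y_i = \text{Dem})}$$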
![Page 14: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/14.jpg)
F score

$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
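A minimal sketch of per-class precision, recall, and F score, computed from the same Dem/Repub/Indep confusion matrix. The function names are illustrative; the code assumes rows are true labels and columns are predicted labels, and it does not guard against a class that is never predicted (which would divide by zero).

```python
LABELS = ["Dem", "Repub", "Indep"]

# Rows are true labels (y), columns are predicted labels (y-hat)
CONFUSION = [
    [100, 2, 15],
    [0, 104, 30],
    [30, 40, 70],
]


def precision(matrix, k):
    """Of everything predicted as class k (column k), the share that truly is k."""
    predicted_k = sum(row[k] for row in matrix)
    return matrix[k][k] / predicted_k


def recall(matrix, k):
    """Of everything that truly is class k (row k), the share predicted as k."""
    return matrix[k][k] / sum(matrix[k])


def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


for k, label in enumerate(LABELS):
    p, r = precision(CONFUSION, k), recall(CONFUSION, k)
    print(f"{label}: precision={p:.3f} recall={r:.3f} F={f_score(p, r):.3f}")
```

For Dem, precision is 100/130 (the Dem column sums to 130) and recall is 100/117 (the Dem row sums to 117).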
![Page 15: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/15.jpg)
Ablation test
• To test how important individual features are (or components of a model), conduct an ablation test
• Train the full model with all features included, conduct evaluation.
• Remove feature, train reduced model, conduct evaluation.
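The procedure above can be sketched as a loop over feature groups. Everything here is hypothetical: `train_and_evaluate` stands in for whatever training and dev-set evaluation routine is actually in use, and the toy scoring function and its numbers exist only to make the example runnable.

```python
def ablation_study(feature_groups, train_and_evaluate):
    """Score the full model, then each model with one feature group removed."""
    results = {"full model": train_and_evaluate(feature_groups)}
    for removed in feature_groups:
        reduced = [g for g in feature_groups if g != removed]
        results[f"without {removed}"] = train_and_evaluate(reduced)
    return results


# Hypothetical stand-in for real training plus evaluation:
# each feature group adds a made-up amount to a made-up baseline score.
TOY_CONTRIBUTION = {"unigrams": 0.20, "bigrams": 0.05, "lexicon": 0.02}


def toy_train_and_evaluate(groups):
    return 0.50 + sum(TOY_CONTRIBUTION[g] for g in groups)


results = ablation_study(list(TOY_CONTRIBUTION), toy_train_and_evaluate)
for name, score in results.items():
    print(f"{name}: {score:.2f}")
```

The drop from the full model's score to each reduced model's score is the quantity of interest.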
![Page 16: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/16.jpg)
Ablation test
Gimpel et al. 2011, “Part-of-Speech Tagging for Twitter”
![Page 17: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/17.jpg)
Significance
• If we observe a difference in performance, what’s the cause? Is it because one system is better than another, or is it a function of randomness in the data? If we had tested it on other data, would we get the same result?
Your work 58%
Current state of the art 50%
![Page 18: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/18.jpg)
Hypotheses
hypothesis
The average income in two sub-populations is different
Web design A leads to higher CTR than web design B
Self-reported location on Twitter is predictive of political preference
Your system X is better than state-of-the-art system Y
![Page 19: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/19.jpg)
Null hypothesis
• A claim, assumed to be true, that we’d like to test (because we think it’s wrong)

| hypothesis | H0 |
|---|---|
| The average income in two sub-populations is different | The incomes are the same |
| Web design A leads to higher CTR than web design B | The CTR are the same |
| Self-reported location on Twitter is predictive of political preference | Location has no relationship with political preference |
| Your system X is better than state-of-the-art system Y | There is no difference in the two systems. |
![Page 20: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/20.jpg)
Hypothesis testing
• If the null hypothesis were true, how likely is it that you’d see the data you see?
![Page 21: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/21.jpg)
Hypothesis testing
• Hypothesis testing measures our confidence in what we can say about a null from a sample.
![Page 22: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/22.jpg)
Hypothesis testing
• Current state of the art = 50%; your model = 58%. Both evaluated on the same test set of 1000 data points.
• Null hypothesis = there is no difference, so we would expect your model to get 500 of the 1000 data points right.
• If we make parametric assumptions, we can model this with a Binomial distribution (number of successes in n trials)
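As a rough illustration of the Binomial model, the sketch below computes the exact probability of getting at least k of the 1000 data points right if the null (success rate 0.5) were true. The function name is illustrative, and the code is deliberately restricted to p = 0.5 so that the sum can be done in exact integer arithmetic.

```python
from math import comb


def upper_tail_under_null(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5), using exact integer counts."""
    favourable = sum(comb(n, i) for i in range(k, n + 1))
    return favourable / 2**n  # each of the 2**n outcomes is equally likely when p = 0.5


p_580 = upper_tail_under_null(580, 1000)
p_510 = upper_tail_under_null(510, 1000)
print(f"P(X >= 580) = {p_580:.2e}")
print(f"P(X >= 510) = {p_510:.3f}")
```

Under this null, 580 or more correct is extremely unlikely, while 510 or more correct is unremarkable.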
![Page 23: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/23.jpg)
Example
Binomial probability distribution for number of correct predictions in n=1000 with p = 0.5
[Figure: probability (y-axis, 0.000 to 0.025) against the count (x-axis, 400 to 600, labeled "# Dem")]
![Page 24: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/24.jpg)
Example
At what point is a sample statistic unusual enough to reject the null hypothesis?
[Figure: the same binomial distribution (x-axis 400 to 600, labeled "# Dem"), with 510 and 580 marked]
![Page 25: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/25.jpg)
Example
• The form we assume for the null hypothesis lets us quantify that level of surprise.
• We can do this for many parametric forms that allow us to measure P(X ≤ x) for some sample of size n; for large n, we can often make a normal approximation.
![Page 26: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/26.jpg)
Z score
For Normal distributions, transform into standard normal (mean = 0, standard deviation =1 )
$$Z = \frac{X - \mu}{\sigma / \sqrt{n}}$$
![Page 27: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/27.jpg)
Z score
[Figure: standard normal density over z from −6 to 6]
580 correct = z score 5.06
510 correct = z score 0.63
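The two z scores quoted here can be reproduced with a short sketch. It assumes X is the observed accuracy, μ = 0.5 is the accuracy expected under the null, and σ = √(p(1 − p)) with p = 0.5 is the standard deviation of a single correct/incorrect outcome; the function name is illustrative.

```python
from math import sqrt


def z_score(n_correct, n, p_null=0.5):
    """Z = (X - mu) / (sigma / sqrt(n)) for an observed accuracy X = n_correct / n."""
    x = n_correct / n
    mu = p_null
    sigma = sqrt(p_null * (1 - p_null))  # std. dev. of one Bernoulli outcome under the null
    return (x - mu) / (sigma / sqrt(n))


z_580 = z_score(580, 1000)
z_510 = z_score(510, 1000)
print(f"580 correct: z = {z_580:.2f}")
print(f"510 correct: z = {z_510:.2f}")
```

The same values come out when working with counts instead of rates: (580 − 500) / √(1000 · 0.5 · 0.5).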
![Page 28: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/28.jpg)
Tests
• We will define “unusual” to equal the most extreme areas in the tails
![Page 29: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/29.jpg)
least likely 10%: z < −1.65 or z > 1.65
[Figure: standard normal density over z from −4 to 4]
![Page 30: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/30.jpg)
least likely 5%: z < −1.96 or z > 1.96
[Figure: standard normal density over z from −4 to 4]
![Page 31: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/31.jpg)
least likely 1%: z < −2.58 or z > 2.58
[Figure: standard normal density over z from −4 to 4]
![Page 32: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/32.jpg)
Tests
[Figure: standard normal density over z from −6 to 6]
580 correct = z score 5.06
510 correct = z score 0.63
![Page 33: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/33.jpg)
Tests
• Decide on the level of significance α, e.g. α ∈ {0.05, 0.01}.
• Testing is evaluating whether the sample statistic falls in the rejection region defined by α
![Page 34: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/34.jpg)
Tails
• Two-tailed tests measure whether the observed statistic is different (in either direction)
• One-tailed tests measure difference in a specific direction
• All differ in where the rejection region is located; α = 0.05 for all.
[Figure: three standard normal densities over z from −4 to 4, one each for the two-tailed, lower-tailed, and upper-tailed test]
![Page 35: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/35.jpg)
p values
• Two-tailed test: $\text{p-value}(z) = 2 \times P(Z \le -|z|)$
• Lower-tailed test: $\text{p-value}(z) = P(Z \le z)$
• Upper-tailed test: $\text{p-value}(z) = 1 - P(Z \le z)$
A p value is the probability of observing a statistic at least as extreme as the one we did if the null hypothesis were true.
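The three p-value definitions can be sketched with the standard normal CDF from Python's standard library (`statistics.NormalDist`, available from Python 3.8). The function names are illustrative.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def two_tailed_p(z):
    """2 * P(Z <= -|z|): extreme in either direction."""
    return 2 * STANDARD_NORMAL.cdf(-abs(z))


def lower_tailed_p(z):
    """P(Z <= z): extreme in the negative direction."""
    return STANDARD_NORMAL.cdf(z)


def upper_tailed_p(z):
    """1 - P(Z <= z): extreme in the positive direction."""
    return 1 - STANDARD_NORMAL.cdf(z)


for z in (0.63, 1.96, 5.06):
    print(f"z = {z}: two-tailed p = {two_tailed_p(z):.3g}, upper-tailed p = {upper_tailed_p(z):.3g}")
```

At z = 1.96 the two-tailed p-value is almost exactly 0.05, which is why ±1.96 bounds the least likely 5% of the distribution.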
![Page 36: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/36.jpg)
Errors
| Truth ↓ / Test results → | keep null | reject null |
|---|---|---|
| keep null | | Type I error (α) |
| reject null | Type II error (β) | Power |
![Page 37: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/37.jpg)
Errors
• Type I error: we reject the null hypothesis but we shouldn’t have.
• Type II error: we don’t reject the null, but we should have.
![Page 38: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/38.jpg)
1 “jobs” is predictive of presidential approval rating
2 “job” is predictive of presidential approval rating
3 “war” is predictive of presidential approval rating
4 “car” is predictive of presidential approval rating
5 “the” is predictive of presidential approval rating
6 “star” is predictive of presidential approval rating
7 “book” is predictive of presidential approval rating
8 “still” is predictive of presidential approval rating
9 “glass” is predictive of presidential approval rating
… …
1,000 “bottle” is predictive of presidential approval rating
![Page 39: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/39.jpg)
Errors
• For any significance level α and n hypothesis tests, we can expect α × n type I errors (if every null hypothesis is true).
• α = 0.01, n = 1000: 10 “significant” results simply by chance
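A small simulation can make the α × n figure concrete. It is a sketch under one stated assumption: when a null hypothesis is true, the p-value of a test with a continuous statistic is uniformly distributed on [0, 1], so each p-value is drawn with `random.random()`. The function name and seed are illustrative.

```python
import random


def count_false_positives(n_tests, alpha, rng):
    """Draw one p-value per test under a true null and count how many fall below alpha."""
    return sum(rng.random() < alpha for _ in range(n_tests))


rng = random.Random(0)
# Repeat the 1000-test experiment 200 times and average the number of type I errors
counts = [count_false_positives(1000, 0.01, rng) for _ in range(200)]
mean_count = sum(counts) / len(counts)
print(f"average type I errors per 1000 tests: {mean_count:.1f}")
```

The average should land close to α × n = 10, even though no tested effect is real in this simulation.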
![Page 40: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/40.jpg)
Multiple hypothesis corrections
• Bonferroni correction: for family-wise significance level α0 with n hypothesis tests:

$$\alpha = \frac{\alpha_0}{n}$$

• [Very strict; controls the probability of at least one type I error.]
• False discovery rate
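The Bonferroni correction amounts to dividing the family-wise level α0 by the number of tests and comparing every p-value against that stricter threshold. In this sketch the function names and the four p-values are made up for illustration.

```python
def bonferroni_threshold(alpha_family, n_tests):
    """Per-test significance level: alpha = alpha_0 / n."""
    return alpha_family / n_tests


def significant_after_bonferroni(p_values, alpha_family=0.05):
    """Flag each p-value that survives the corrected threshold."""
    threshold = bonferroni_threshold(alpha_family, len(p_values))
    return [p <= threshold for p in p_values]


# Illustrative p-values only; they do not come from any real experiment
example_p_values = [0.00001, 0.004, 0.03, 0.2]
print(bonferroni_threshold(0.05, len(example_p_values)))
print(significant_after_bonferroni(example_p_values))
```

With four tests the per-test threshold is 0.0125, so p = 0.03 no longer counts as significant even though it is below 0.05.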
![Page 41: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/41.jpg)
Confidence intervals
• Even in the absence of a specific test, we want to quantify our uncertainty about any metric.
• Confidence intervals specify a range that is likely to contain the (unobserved) population value from a measurement in a sample.
![Page 42: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/42.jpg)
Confidence intervals
Binomial confidence intervals (again using Normal approximation):

$$p \pm z_\alpha \sqrt{\frac{p(1 - p)}{n}}$$

• p = rate of success (e.g., for binary classification, the accuracy).
• n = the sample size (e.g., number of data points in test set).
• zα = the critical value at significance level α.
• 95% confidence interval: α = 0.05; zα = 1.96
• 99% confidence interval: α = 0.01; zα = 2.58
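As a worked example of the interval formula, the sketch below applies it to the lecture's running numbers (accuracy 0.58 on a test set of 1000 data points). The function name is illustrative.

```python
from math import sqrt


def binomial_confidence_interval(p, n, z=1.96):
    """Normal-approximation interval p +/- z * sqrt(p * (1 - p) / n)."""
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


low, high = binomial_confidence_interval(0.58, 1000)           # 95% interval
low_99, high_99 = binomial_confidence_interval(0.58, 1000, z=2.58)  # 99% interval
print(f"95% CI: [{low:.3f}, {high:.3f}]")
print(f"99% CI: [{low_99:.3f}, {high_99:.3f}]")
```

The 95% interval comes out at roughly 0.549 to 0.611, which excludes the 50% baseline. Because the half-width shrinks with √n, a smaller test set gives a much wider interval.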
![Page 43: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/43.jpg)
Issues
• Evaluation performance may not hold across domains (e.g., WSJ → literary texts)
• Covariates may explain performance (MT/parsing, sentences up to length n)
• Multiple metrics may offer competing results
Søgaard et al. 2014
![Page 44: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/44.jpg)
| Task | Original domain | Score | New domain | Score |
|---|---|---|---|---|
| English POS | WSJ | 97.0 | Shakespeare | 81.9 |
| German POS | Modern | 97.0 | Early Modern | 69.6 |
| English POS | WSJ | 97.3 | Middle English | 56.2 |
| Italian POS | News | 97.0 | Dante | 75.0 |
| English POS | WSJ | 97.3 | Twitter | 73.7 |
| English NER | CoNLL | 89.0 | Twitter | 41.0 |
| Phrase structure parsing | WSJ | 89.5 | GENIA | 79.3 |
| Dependency parsing | WSJ | 88.2 | Patent | 79.6 |
| Dependency parsing | WSJ | 86.9 | Magazines | 77.1 |
![Page 45: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/45.jpg)
Takeaways
• At a minimum, always evaluate a method on the domain you’re using it on
• When comparing the performance of models, quantify your uncertainty with significance tests/confidence bounds
• Use ablation tests to identify the impact that a feature class has on performance.
![Page 46: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/46.jpg)
Ethics
Why does a discussion about ethics need to be a part of NLP?
![Page 47: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/47.jpg)
Conversational Agents
![Page 48: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/48.jpg)
Question Answering
http://searchengineland.com/according-google-barack-obama-king-united-states-209733
![Page 49: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/49.jpg)
Language Modeling
![Page 50: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/50.jpg)
Vector semantics
![Page 51: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/51.jpg)
• The decisions we make about our methods (training data, algorithm, evaluation) are often tied up with their use and impact in the world.
![Page 52: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/52.jpg)
Scope
[Figure: dependency parse of "I saw the man with the telescope"; arc labels: nsubj, dobj, det, det, pobj, prep, prep]
• NLP often operates on text divorced from the context in which it is uttered.
• It’s now being used more and more to reason about human behavior.
![Page 53: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/53.jpg)
Privacy
![Page 54: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/54.jpg)
![Page 55: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/55.jpg)
![Page 56: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/56.jpg)
Interventions
![Page 57: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/57.jpg)
![Page 58: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/58.jpg)
Exclusion
• Focus on data from one domain/demographic
• State-of-the-art models perform worse for young people (Hovy and Søgaard 2015) and minorities (Blodgett et al. 2016)
![Page 59: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/59.jpg)
Exclusion
[Figure panels: Language identification; Dependency parsing]
Blodgett et al. (2016), "Demographic Dialectal Variation in Social Media: A Case Study of African-American English" (EMNLP)
![Page 60: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/60.jpg)
Overgeneralization
• Managing and communicating the uncertainty of our predictions
• Is a false answer worse than no answer?
![Page 61: Applied Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/anlp19_slides/7_hypotheses.pdfApplied Natural Language Processing ... or is it a function of randomness in the](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ec9de4abb8ca67fb44659cf/html5/thumbnails/61.jpg)
Dual Use
• Authorship attribution (author of Federalist Papers vs. author of ransom note vs. author of political dissent)
• Fake review detection vs. fake review generation
• Censorship evasion vs. enabling more robust censorship