
Page 1: Patrick Juola Duquesne University  juola@mathcs.duq.edu Authorship Attribution and Stylometry (lecture 5)

Patrick Juola

Duquesne University

www.jgaap.com

juola@mathcs.duq.edu

Authorship Attribution and Stylometry (lecture 5)

Page 2:

Some Housekeeping

• I’m having trouble with network connectivity to Duquesne

• Watch www.mathcs.duq.edu/~juola

• Watch www.jgaap.com

• Will be posting new developments as they occur

• (Will also post NG corpus as requested.)

Page 3:

ESSLLI material

• The Personae corpus is freely available

• BUT the one we’ve developed is not

• If you’re willing to have your essays and information published, contact me

• juola@mathcs.duq.edu

• I will collate and publish via the web

Page 4:

JGAAP material

• JGAAP is freeware; use and enjoy

• New developments to JGAAP are always welcome, subject to licensure (i.e. GPL).

• Wiki at www.jgaap.com is open for

• Feature requests

• Bug reports

• Comments

• New developers

Page 5:

Interest in a volume?

• Depending upon public interest, i.e. yours, should we pursue the idea of an edited collection of JGAAP-related papers?

• There are a lot of publishers at this summer school

• Contact me if you’re interested

Page 6:

So, now what?

• JGAAP seems to work, but needs more development

• More corpora (and more specialist corpora) are needed

• But if you have an authorship problem to solve NOW…

Page 7:

Top/bottom methods

• Sorry, still having n/w troubles 8-(

• Best canonicizers : unify case, normalize whitespace

• Stripping punctuation hinders

• Best events : word bigrams

• Worst : word lengths

• Best analysis : KL-distance, cosine distance

• Worst : LZW
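The pipeline named on this slide (canonicize, extract events, compare histograms) can be sketched in a few lines of Python. This is a minimal illustration, not JGAAP's implementation: the function names are hypothetical, and the additive smoothing in `kl_distance` is an assumption made here so that events missing from one text do not produce log(0).

```python
import math
import re
from collections import Counter


def canonicize(text):
    """Unify case and normalize whitespace; punctuation is left in place."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def word_bigrams(text):
    """Return a Counter of adjacent word pairs from the canonicized text."""
    words = canonicize(text).split(" ")
    return Counter(zip(words, words[1:]))


def cosine_distance(a, b):
    """1 - cosine similarity between two event-count histograms."""
    dot = sum(a[e] * b[e] for e in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kl_distance(a, b, alpha=0.5):
    """KL divergence D(a || b) over the joint vocabulary.

    alpha is an additive-smoothing constant (an assumption of this sketch).
    """
    vocab = a.keys() | b.keys()
    total_a = sum(a.values()) + alpha * len(vocab)
    total_b = sum(b.values()) + alpha * len(vocab)
    result = 0.0
    for e in vocab:
        p = (a[e] + alpha) / total_a
        q = (b[e] + alpha) / total_b
        result += p * math.log(p / q)
    return result


# Made-up example texts, for illustration only.
known = word_bigrams("The  cat sat on the mat. The cat slept.")
disputed = word_bigrams("the cat sat on THE mat.")
print(cosine_distance(known, disputed), kl_distance(known, disputed))
```

A smaller distance means the disputed text's bigram histogram looks more like the known author's.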

Page 8:

But....

• (Show spreadsheet, stupid!)

Page 9:

Testing transference

• 8 AAAC problems are “English”

• 5 are “foreign” (French [x2], Dutch, Latin, Serbian/Slavonic)

• Does English score reflect “foreign” score?

• If so, we have evidence that best practices in English are also best practices in a novel language.

• N.b. evidence is not proof!

Page 10:

2008/9 AAAC data

• 281 different analyses, generally better than AAAC submissions.

• Correlation: r = 0.6680 (cf. 0.594)

• Significance: p < 0.0001 (cf. 0.05)

• Coefficient of determination (r²)

• 45% of variation explained by algorithm performance alone (rather than other factors)
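The 45% figure follows from squaring the reported correlation: 0.6680² ≈ 0.446. The sketch below checks that arithmetic and shows how such a correlation could be computed from paired per-method scores. The `english` and `foreign` lists are made-up numbers for illustration only, not AAAC results.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Correlation reported on the slide; its square is the share of variance explained.
reported_r = 0.6680
print(round(reported_r ** 2, 3))  # 0.446, i.e. about 45%

# Hypothetical per-method accuracies (invented for this example).
english = [0.50, 0.62, 0.71, 0.40, 0.80]
foreign = [0.45, 0.60, 0.55, 0.42, 0.70]
r = pearson_r(english, foreign)
print(r, r ** 2)
```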

Page 11:

Transference

• Best practices transfer – a best practice in one environment is likely to be a “good” practice in another

• Turn it around : Do we really expect something terrible in English to magically improve in Polish?

• Caveat : No predictions about “absolute” error rates

• Caveat (2) : Assumes language agnosticism

Page 12:

Some other findings

• OCR errors do not materially impact accuracy (Noecker et al.)

• Asymmetry is a significant factor in distance-based attribution methods (Ryan and Juola)

• Algorithm performance dominates language or data size effects (Juola)
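To see how asymmetry can arise in a distance-based method, consider KL divergence, which is used here only as an illustrative example (the slide does not say which measures were studied). D(p || q) and D(q || p) generally differ, so it matters which text is treated as the reference. The distributions below are made up.

```python
import math


def kl(p, q):
    """KL divergence D(p || q) for two distributions over the same events."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two hypothetical event distributions over the same two events.
p = [0.9, 0.1]
q = [0.5, 0.5]

forward = kl(p, q)   # disputed text measured against the known author
backward = kl(q, p)  # known author measured against the disputed text
print(forward, backward)  # the two values differ
```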

Page 13:

Other findings (2)

• Cosine distance on large numbers of words outperforms higher-overhead methods on fewer words (Noecker & Juola)

• Characters trump words for Chinese at current word segmentation technology (Zhao & Juola)

• Mosteller-Wallace’s function words are overtuned (in preparation)
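One reason characters are convenient for Chinese is that character n-grams need no word segmenter at all. A minimal sketch follows, with a hypothetical helper name and a made-up example string.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping character n-grams; no word segmentation is required."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Four characters yield three overlapping bigrams.
print(char_ngrams("我爱北京", 2))
```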

Page 14:

Best practices for now

• “Mixture of experts” improves accuracy

• Run multiple analyses, mixing event types (character and word n-grams)

• Cosine distance and KL-distance work well on large event sets

• SVM works well on small event sets

• Current leader : KL-distance (max) on word bigrams
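A "mixture of experts" over several event types can be sketched as a simple majority vote. The slides do not specify how the experts are combined, so majority voting is an assumption here, as are the function names and the toy texts; each expert is a nearest-author classifier using cosine distance on one event type.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-gram counts after unifying case and whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=2):
    """Word n-gram counts after unifying case and whitespace."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_distance(a, b):
    """1 - cosine similarity between two event-count histograms."""
    dot = sum(a[e] * b[e] for e in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)


def nearest_author(disputed, known, events):
    """Pick the author whose known text is closest under one event type."""
    target = events(disputed)
    return min(known, key=lambda author: cosine_distance(events(known[author]), target))


def mixture_of_experts(disputed, known, experts):
    """Majority vote over experts; ties go to the label that was voted first."""
    votes = Counter(nearest_author(disputed, known, ev) for ev in experts)
    return votes.most_common(1)[0][0]


# Toy training texts (invented for this example).
known = {
    "A": "the cat sat on the mat and the cat slept on the mat",
    "B": "stock prices rose sharply as markets opened this morning",
}
# Three experts: character trigrams, word bigrams, word unigrams.
experts = [char_ngrams, word_ngrams, lambda t: word_ngrams(t, 1)]
print(mixture_of_experts("the cat slept on the mat", known, experts))
```

Mixing character and word n-grams means a single misleading event type is less likely to decide the outcome on its own.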

Page 15:

Future extensions

• AAAC corpus too small to distinguish among 20,000 methods (testing continuing, though)

• Add more methods to JGAAP, hopefully solicited from the community

• Continue to develop/publish “best practices”

Page 16:

• Merci

• Arigato

• Спасибо

• Danke

• Gracias

• Teşekkür ederim

• Dank U

Tak!