Patrick Juola
Duquesne University
www.jgaap.com
Authorship Attribution and Stylometry (lecture 5)
Some Housekeeping
• I’m having trouble with network connectivity to Duquesne
• Watch www.mathcs.duq.edu/~juola
• Watch www.jgaap.com
• Will be posting new developments as they occur
• (Will also post NG corpus as requested.)
ESSLLI material
• The Personae corpus is freely available
• BUT the one we’ve developed is not
• If you’re willing to have your essays and information published, contact me
• [email protected]
• I will collate and publish via the web
JGAAP material
• JGAAP is freeware; use and enjoy
• New developments to JGAAP are always welcome, subject to licensure (i.e. GPL).
• Wiki at www.jgaap.com is open for
  • Feature requests
  • Bug reports
  • Comments
  • New developers
Interest in a volume?
• Depending upon public interest (i.e., yours), should we pursue the idea of an edited collection of JGAAP-related papers?
• There are a lot of publishers at this summer school
• Contact me if you’re interested
So, now what?
• JGAAP seems to work, but needs more development
• More corpora (and more specialist corpora) are needed
• But if you have an authorship problem to solve NOW…
Top/bottom methods
• Sorry, still having network troubles 8-(
• Best canonicizers: unify case, normalize whitespace
• Stripping punctuation hinders
• Best events: word bigrams
• Worst: word lengths
• Best analysis: KL-distance, cosine distance
• Worst: LZW
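The best-performing pipeline listed above (case and whitespace canonicization, word-bigram events, cosine or KL distance) can be sketched in a few lines of Python. This is a minimal illustration, not JGAAP's implementation: the function names, the add-alpha smoothing used for the KL computation, and the nearest-author rule are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def canonicize(text):
    """Unify case and normalize whitespace; punctuation is left in place."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def word_bigrams(text):
    """Count adjacent word pairs (the 'events') in a canonicized text."""
    words = text.split()
    return Counter(zip(words, words[1:]))


def cosine_distance(a, b):
    """1 - cosine similarity between two event-count histograms."""
    dot = sum(a[e] * b[e] for e in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kl_distance(p_counts, q_counts, alpha=1.0):
    """D(P || Q) over the joint vocabulary.

    Add-alpha smoothing is an assumption of this sketch; it keeps
    every probability nonzero so the logarithm is always defined.
    """
    vocab = p_counts.keys() | q_counts.keys()
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    total = 0.0
    for e in vocab:
        p = (p_counts[e] + alpha) / p_total
        q = (q_counts[e] + alpha) / q_total
        total += p * math.log(p / q)
    return total


def attribute(unknown_text, known_texts, distance=cosine_distance):
    """Return the author whose sample is nearest to the unknown text.

    known_texts maps a (hypothetical) author label to one sample text.
    """
    target = word_bigrams(canonicize(unknown_text))
    return min(
        known_texts,
        key=lambda author: distance(
            target, word_bigrams(canonicize(known_texts[author]))
        ),
    )
```

Calling `attribute(disputed, {"A": sample_a, "B": sample_b})` returns the label of the closer sample; passing `distance=kl_distance` swaps the analysis without touching the canonicizer or event set, which mirrors the slide's split into three independent stages.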
But....
• (Show spreadsheet, stupid!)
Testing transference
• 8 AAAC problems are “English”
• 5 are “foreign” (French [x2], Dutch, Latin, Serbian/Slavonic)
• Does the English score reflect the “foreign” score?
• If so, we have evidence that best practices in English are also best practices in a novel language.
• N.b. evidence is not proof!
2008/9 AAAC data
• 281 different analyses, generally better than AAAC submissions.
• Correlation: r = 0.6680 (cf. 0.594)
• Significance: p < 0.0001 (cf. 0.05)
• Coefficient of determination (r²)
• 45% of variation explained by algorithm performance alone (rather than other factors)
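The 45% figure follows directly from squaring the reported correlation. The sketch below does that arithmetic and, for reference, includes a plain Pearson correlation function of the kind that could be applied to paired per-method scores. The function name and its list-of-scores interface are assumptions for illustration; no AAAC scores are reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Correlation reported on the slide.
r = 0.6680
r_squared = r ** 2          # 0.446224, i.e. about 45% of variation

# The comparison value quoted alongside it ("cf. 0.594").
r_cf = 0.594
r_cf_squared = r_cf ** 2    # 0.352836, i.e. about 35%
```

In use, `xs` would hold each method's score on the English problems and `ys` the same method's score on the "foreign" problems, one pair per analysis.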
Transference
• Best practices transfer: a best practice in one environment is likely to be a “good” practice in another
• Turn it around: Do we really expect something terrible in English to magically improve in Polish?
• Caveat: No predictions about “absolute” error rates
• Caveat (2): Assumes language agnosticism
Some other findings
• OCR errors do not materially impact accuracy (Noecker, et al.)
• Asymmetry is a significant factor in distance-based attribution methods (Ryan and Juola)
• Algorithm performance dominates language or data size effects (Juola)
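One way to see what asymmetry means for a distance-based method: KL divergence gives different values depending on which document is treated as the reference. The sketch below uses two made-up toy distributions (not drawn from any corpus) and shows a few common ways to symmetrize. The helper names are hypothetical, and this is not a reconstruction of the Ryan and Juola experiments.

```python
import math


def kl(p, q):
    """D(P || Q) for two distributions over the same events.

    Assumes q[e] > 0 wherever p[e] > 0.
    """
    return sum(p[e] * math.log(p[e] / q[e]) for e in p if p[e] > 0)


# Toy distributions, invented for illustration only.
p = {"a": 0.9, "b": 0.1}
q = {"a": 0.5, "b": 0.5}

forward = kl(p, q)    # unknown document measured against the known one
backward = kl(q, p)   # known document measured against the unknown one

# Three possible symmetric variants built from the two directions.
sym_max = max(forward, backward)
sym_min = min(forward, backward)
sym_mean = (forward + backward) / 2
```

Because `forward` and `backward` differ, the choice of direction (or of symmetrization) is itself a parameter of the analysis and can change which author is nearest.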
Other findings (2)
• Cosine distance on large numbers of words outperforms higher-overhead methods on fewer words (Noecker & Juola)
• Characters trump words for Chinese at current word-segmentation technology (Zhao & Juola)
• Mosteller-Wallace’s function words are overtuned (in preparation)
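Part of the appeal of character events for unsegmented scripts is that they need no word segmenter at all. A minimal sketch, assuming a simple sliding window over the raw string; the function name and the sample sentence are illustrative and not taken from the cited work:

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Count every run of n consecutive characters in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Illustrative sentence, not from any corpus.
sample = "我喜欢语言学"
bigrams = char_ngrams(sample, 2)
```

The resulting `Counter` can be fed to the same distance functions as word-based histograms, so only the event stage of the pipeline changes.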
Best practices for now
• “Mixture of experts” improves accuracy
• Run multiple analyses, mixing event types (character and word n-grams)
• Cosine distance and KL-distance work well on large event sets
• SVM works well on small event sets
• Current leader : KL-distance (max) on word bigrams
• AAAC corpus too small to distinguish among 20,000 methods (testing continuing, though)
• Add more methods to JGAAP, hopefully solicited from community
• Continue to develop/publish “best practices”
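The "mixture of experts" advice above can be sketched as a simple majority vote over the answers of several independent analyses (for example, one on character n-grams and one on word n-grams). The voting rule, the tie handling, and the function name are assumptions for illustration; the slides do not specify how the experts are combined.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine author labels from several analyses.

    predictions: list of labels, one per analysis.
    Returns the most frequent label, or None on a tie or empty input.
    """
    if not predictions:
        return None
    ranked = Counter(predictions).most_common()
    best_label, best_count = ranked[0]
    tied = [label for label, count in ranked if count == best_count]
    return best_label if len(tied) == 1 else None
```

For instance, `majority_vote(["Author A", "Author B", "Author A"])` returns `"Author A"`, while a one-to-one split returns `None`, signalling that the analyses disagree and no attribution should be reported.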
Future extensions
• Merci
• Arigato
• Спасибо
• Danke
• Gracias
• Teşekkür ederim
• Dank U
Tak!