Trausan-Matu: Natural Language Processing and Topic Modelling
Natural Language Processing and Topic Modelling
Ştefan Trăușan-Matu
University Politehnica of Bucharest
Romanian Academy Research Institute for Artificial Intelligence
COST Action IS1310 - Reassembling the Republic of Letters
23rd March 2015, St Anne’s College, University of Oxford
Natural Language Processing (NLP)
• Input:
– Text in digital format (strings of characters)
• document
• corpus
• question
• transcription of a monologue or of a conversation
• instant messenger log
• discussion forum, social network
• corpus of interlinked documents (e.g. letters)
• dialog
Republic of Letters, University of Oxford, 23 March 2015
Natural Language Processing
• Output:
– Text(s) in digital format
• translation – e.g. Google translate
• document(s) summary - summarizers
• answer – question answering
• clusters of documents
– Automatically generated annotations
– List of topics in the text
– Links among topics
– Similar documents
– Links among documents – intertextuality
– Threads of discussion, COLLABORATION
– Other data
• collocations
• structures (syntactical, discourse, rhetorical, etc.)
• opinions in text
• participation & collaboration degrees in conversations
• …
NLP approaches
• Grammar-based
• Statistical (corpus-based, machine learning)
– Unsupervised (clustering, LSA, LDA)
– Supervised
• annotated corpus → learned model → automated annotation
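The supervised pipeline above (annotated corpus, learned model, automated annotation) can be sketched in a few lines. This is only an illustration, assuming a multinomial Naive Bayes classifier as the learned model; the talk does not name a specific learner, and the toy corpus and the labels `correspondence` and `science` are invented.

```python
import math
from collections import Counter, defaultdict


def train(annotated_corpus):
    """Learn per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in annotated_corpus:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def annotate(text, model):
    """Return the most probable label under multinomial Naive Bayes."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical annotated corpus, invented for illustration only.
annotated_corpus = [
    ("the letter arrived from Paris", "correspondence"),
    ("he sent a letter to Leiden", "correspondence"),
    ("the experiment measured air pressure", "science"),
    ("a new telescope for the experiment", "science"),
]
model = train(annotated_corpus)                              # learned model
label = annotate("another letter sent from Leiden", model)   # automated annotation
```

In practice the annotated corpus would be far larger, and the annotation could be any of the layers listed in the talk (named entities, syntax, discourse), not only a document label.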
Text annotation
• Space
• Time
• Named Entities
• Links
• Syntactic
• Semantic
• Pragmatic
• Discourse
• Rhetoric
• …
Text Annotators
Topic Modeling
• No generally accepted definition for a “topic” in NLP
– Document clusters
– Abstractions based on document clusters
• Labels
• Centroids, etc.
– (Word, Probability) pairs
• Bayesian statistical models
– Topics: distributions over words
– Documents: distributions over topics
– Generative model
• Topic intertwining
– Conceptually similar to the ideas of Mikhail Bakhtin
– Topics and voices
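The generative view (documents as distributions over topics, topics as distributions over words) can be made concrete with a small sampler. This is a minimal sketch of an LDA-style generative process only, not of topic inference; the two topics, their (word, probability) pairs and the value `alpha=0.5` are assumptions made up for the example.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw a probability vector from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Pick a topic mixture for the document, then a topic and a word per position."""
    doc_topic_dist = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(topics)), weights=doc_topic_dist)[0]
        vocab, probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return doc_topic_dist, words


# Hypothetical topics given as (word, probability) pairs.
topics = [
    {"letter": 0.5, "post": 0.3, "reply": 0.2},
    {"moon": 0.4, "telescope": 0.4, "orbit": 0.2},
]
rng = random.Random(0)
doc_topic_dist, words = generate_document(topics, alpha=0.5, length=8, rng=rng)
```

Topic modeling runs this process in reverse: given only the words of many documents, it estimates the topic-word and document-topic distributions that most plausibly generated them.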
Topic Modeling (2)
• LSA/pLSA/hLDA/CTM
– Each newer version corrects some flaws of the earlier ones
• LDA
– Readily available
• Mallet
• Easily reproducible experiments
The LSA idea
• Reducing the dimensionality of the vector space, similarly to the least squares method
• The effect is the creation of semantic spaces containing semantically related words
• Bag-of-words approach
• http://lsa.colorado.edu
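The bag-of-words starting point of LSA is a terms-by-documents count matrix. A minimal sketch, assuming whitespace tokenization; the six toy documents are invented so that their counts match the Manning and Schütze (1999) example used in this talk.

```python
from collections import Counter


def term_document_matrix(documents):
    """Build a terms x documents count matrix, ignoring word order."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


# Hypothetical documents d1..d6.
documents = [
    "cosmonaut moon car",
    "astronaut moon",
    "cosmonaut",
    "car truck",
    "car",
    "truck",
]
terms, matrix = term_document_matrix(documents)
for term, row in zip(terms, matrix):
    print(f"{term:<10} {row}")
```

Each row is a term vector and each column a document vector; LSA then reduces the dimensionality of this space.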
LSA - Vector space model
Singular value decomposition (SVD)
$A_{t \times d} = T_{t \times n}\, S_{n \times n}\, (D_{d \times n})^{T}$, with $n = \min(t, d)$
A: terms-documents array (example from Manning and Schütze, 1999)

            d1  d2  d3  d4  d5  d6
cosmonaut    1   0   1   0   0   0
astronaut    0   1   0   0   0   0
moon         1   1   0   0   0   0
car          1   0   0   1   1   0
truck        0   0   0   1   0   1

T^T

       cosmonaut  astronaut   moon    car  truck
dim1       -0.44      -0.13  -0.48  -0.70  -0.26
dim2       -0.30      -0.33  -0.51   0.35   0.65
dim3        0.57      -0.59  -0.37   0.15  -0.41
dim4        0.58       0.00   0.00  -0.58   0.58
dim5        0.25       0.73  -0.61   0.16  -0.09

S

2.16  0.00  0.00  0.00  0.00
0.00  1.59  0.00  0.00  0.00
0.00  0.00  1.28  0.00  0.00
0.00  0.00  0.00  1.00  0.00
0.00  0.00  0.00  0.00  0.39

D^T

          d1     d2     d3     d4     d5     d6
dim1   -0.75  -0.28  -0.20  -0.45  -0.33  -0.12
dim2   -0.29  -0.53  -0.19   0.63   0.22   0.41
dim3    0.28  -0.75   0.45  -0.20   0.12  -0.33
dim4    0.00   0.00   0.58   0.00  -0.58   0.58
dim5   -0.53   0.29   0.63   0.19   0.41  -0.22
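The decomposition can be checked numerically by multiplying the three factors back together. A minimal sketch in plain Python, assuming the two-decimal values (with signs) of the Manning and Schütze (1999) example; because the factors are rounded, the product only approximates A.

```python
# Rows: cosmonaut, astronaut, moon, car, truck; columns: d1..d6.
A = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
]
# T: terms x dimensions (values rounded to two decimals).
T = [
    [-0.44, -0.30, 0.57, 0.58, 0.25],
    [-0.13, -0.33, -0.59, 0.00, 0.73],
    [-0.48, -0.51, -0.37, 0.00, -0.61],
    [-0.70, 0.35, 0.15, -0.58, 0.16],
    [-0.26, 0.65, -0.41, 0.58, -0.09],
]
# S: singular values, i.e. the diagonal of the matrix S.
S = [2.16, 1.59, 1.28, 1.00, 0.39]
# DT: dimensions x documents (D transposed).
DT = [
    [-0.75, -0.28, -0.20, -0.45, -0.33, -0.12],
    [-0.29, -0.53, -0.19, 0.63, 0.22, 0.41],
    [0.28, -0.75, 0.45, -0.20, 0.12, -0.33],
    [0.00, 0.00, 0.58, 0.00, -0.58, 0.58],
    [-0.53, 0.29, 0.63, 0.19, 0.41, -0.22],
]


def reconstruct(T, S, DT):
    """Compute T * diag(S) * DT entry by entry."""
    n_terms, n_docs = len(T), len(DT[0])
    return [
        [sum(T[i][k] * S[k] * DT[k][j] for k in range(len(S))) for j in range(n_docs)]
        for i in range(n_terms)
    ]


A_hat = reconstruct(T, S, DT)
max_error = max(
    abs(A_hat[i][j] - A[i][j]) for i in range(len(A)) for j in range(len(A[0]))
)
print(f"largest deviation from A: {max_error:.3f}")
```

The deviation stays small and comes only from rounding the factors to two decimals.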
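Reduced A
• SVD maps the n-dimensional space onto a k-dimensional one, with n >> k
• Common values for k are 100 and 150.
• The reduced matrix $\hat{A}$ minimizes $\| A - \hat{A} \|_{2}$
• For k = 2: $B = S_{2 \times 2}\, D^{T}_{2 \times d}$

B

          d1     d2     d3     d4     d5     d6
dim1   -1.62  -0.60  -0.04  -0.97  -0.71  -0.26
dim2   -0.46  -0.84  -0.30   1.00   0.35   0.65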
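The effect of the reduction shows up when comparing documents. A minimal sketch, assuming the matrix B as printed in the Manning and Schütze (1999) example and cosine similarity as the comparison measure: d2 ("astronaut moon") and d3 ("cosmonaut") share no term, yet they end up close in the two-dimensional space.

```python
import math

# Original document vectors over (cosmonaut, astronaut, moon, car, truck).
d2_original = [0, 1, 1, 0, 0]
d3_original = [1, 0, 0, 0, 0]

# Reduced representation B: rows dim1 and dim2, columns d1..d6.
B = [
    [-1.62, -0.60, -0.04, -0.97, -0.71, -0.26],
    [-0.46, -0.84, -0.30, 1.00, 0.35, 0.65],
]


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def column(matrix, j):
    """Extract column j of a row-major matrix."""
    return [row[j] for row in matrix]


sim_original = cosine(d2_original, d3_original)
sim_reduced = cosine(column(B, 1), column(B, 2))
print(f"d2 vs d3, original space: {sim_original:.2f}")
print(f"d2 vs d3, reduced space:  {sim_reduced:.2f}")
```

The similarity is zero in the original term space and roughly 0.88 in the reduced one, which is the sense in which LSA builds semantic spaces of related words and documents.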
LSA based text processing
The most significant 20 words from Plato
[Plato | The Apology, Justin | The Second Apology - (0.6475); Plato | The Republic.7, Irenaeus | Against Heresies.6 - (0.6095)]
The similarity of Plato’s works with the works of other writers
Latent Dirichlet Allocation
http://www.columbia.edu/~ih2240/dataviz/G4063-week5/images/text/LDA.png
Bakhtin’s Polyphonic Intertextuality
[Figure: Voice I, Voice II and Voice III in dialog; Text 1, Text 2 and Text 3 in dialog in Text 4]
Polyphony
• Appears in music (e.g. J.S. Bach) and in novels (Bakhtin)
• The Polyphonic Model (Trausan-Matu, 2005, 2010)
• Analysis method (Trausan-Matu, Dascalu and Rebedea, 2010)
• Computer support tools for the polyphonic analysis of conversations and networks of documents
– The “Polyphony” system (Trausan-Matu et al., 2007)
– ASAP (Dascalu, Chioasca and Trausan-Matu, 2008)
– PolyCAFe (Trausan-Matu, Rebedea and Dascalu, 2011; Rebedea, Dascalu, Trausan-Matu et al., 2010)
– Collaboration regions detection (Banica, Trausan-Matu and Rebedea, 2011)
– Detection of the important moments (Chiru and Trausan-Matu, 2012)
– Intertextuality detection (Ghiban and Trausan-Matu, 2012)
– ReaderBench (Dascălu, Trăușan-Matu and Dessus, 2013)
Intertextuality analysis
• Mikhail Bakhtin’s dialogical and polyphonic model
• Intertextuality (Kristeva)
• Analyze how concepts are echoed from one text to another (intertextual networks)
• To indicate membership in a philosophical trend or influences among authors
Bakhtin’s Polyphonic Intertextuality
Theme 2 and Theme 3 may have the same words but different concepts
Sections 1 and 6 are dialogical or polyphonic. They may have a higher expressive force.
PolyCAFe (Trăușan-Matu, Dascălu and Rebedea)
• Polyphony-based Collaboration Analysis and Feedback generation
• Developed in the “Language Technologies for Lifelong Learning” EU FP7 project (http://www.ltfll-project.org/)
• Analyses chat (instant messenger) logs with more than two participants using the polyphonic model (Trăușan-Matu)
From: Trăuşan-Matu, A Polyphonic Model for Interethnic Discourse, 2013
PolyCAFe
From: Trăuşan-Matu, A Polyphonic Model for Interethnic Discourse, 2013
ReaderBench (Dascalu, Trăușan-Matu and Dessus)
• Based on
– LSA, LDA
– Polyphonic model
– WordNet (http://wordnet.princeton.edu)
– Social Network Analysis
NLP Text pre-processing in PolyCAFe and ReaderBench
ReaderBench Document view
ReaderBench Concept View
Concept View
ReaderBench Corpus Similarity
ReaderBench Document Centrality
Thank you!
Questions?
http://www.racai.ro/trausan