Trausan-Matu: Natural Language Processing and Topic Modelling

Natural Language Processing and Topic Modelling. Ştefan Trăușan-Matu, University Politehnica of Bucharest and Romanian Academy Research Institute for Artificial Intelligence. COST Action IS1310 - Reassembling the Republic of Letters, 23rd March, St Anne’s College, University of Oxford.

Page 1: Trausan-Matu: Natural Language Processing and Topic Modelling

Natural Language Processing and Topic Modelling

Ştefan Trăușan-Matu

University Politehnica of Bucharest

Romanian Academy Research Institute for Artificial Intelligence

COST Action IS1310 - Reassembling the Republic of Letters

23rd March, St Anne’s College, University of Oxford

Page 2: Trausan-Matu: Natural Language Processing and Topic Modelling

Natural Language Processing (NLP)

• Input:

– Text in digital format (strings of characters)

• document

• corpus

• question

• transcription of a monologue or of a conversation

• instant messenger log

• discussion forum, social network

• corpus of interlinked documents (e.g. letters)

• dialog

Republic of Letters, University of Oxford, 23 March 2015

Page 3: Trausan-Matu: Natural Language Processing and Topic Modelling

Natural Language Processing

• Output:

– Text(s) in digital format

• translation – e.g. Google translate

• document(s) summary - summarizers

• answer – question answering

• clusters of documents

– Automatically generated annotations

– List of topics in the text

– Links among topics

– Similar documents

– Links among documents – intertextuality

– Threads of discussion, collaboration

– Other data

• collocations

• structures (syntactical, discourse, rhetorical, etc.)

• opinions in text

• participation & collaboration degrees in conversations

• …

Page 4: Trausan-Matu: Natural Language Processing and Topic Modelling

NLP approaches

• Grammar-based

• Statistical (corpus-based, machine learning)

– Unsupervised (clustering, LSA, LDA)

– Supervised: annotated corpus → learned model → automated annotation
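The supervised route above (annotated corpus, learned model, automated annotation) can be sketched with a deliberately simple model. This is only an illustration, not a method from the talk: the toy corpus, the tag names and the helpers `train` and `annotate` are hypothetical, and the "model" is just each word's most frequent tag.

```python
from collections import Counter, defaultdict

# Hypothetical annotated corpus: (word, tag) pairs labelled by hand.
ANNOTATED_CORPUS = [
    ("Oxford", "PLACE"), ("in", "O"), ("March", "TIME"),
    ("letters", "O"), ("from", "O"), ("Bucharest", "PLACE"),
    ("March", "TIME"), ("in", "O"), ("Oxford", "PLACE"),
]


def train(corpus):
    """Learn a model: for each word, remember its most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def annotate(model, text, default_tag="O"):
    """Automated annotation: tag each token of new text with the learned model."""
    return [(tok, model.get(tok.lower(), default_tag)) for tok in text.split()]


model = train(ANNOTATED_CORPUS)
print(annotate(model, "Letters sent from Oxford in March"))
```

Real systems replace the lookup table with a statistical classifier or sequence model, but the three stages stay the same.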

Page 5: Trausan-Matu: Natural Language Processing and Topic Modelling

Text annotation

• Space

• Time

• Named Entities

• Links

• Syntactic

• Semantic

• Pragmatic

• Discourse

• Rhetoric

• …

Page 6: Trausan-Matu: Natural Language Processing and Topic Modelling

Text Annotators

Page 7: Trausan-Matu: Natural Language Processing and Topic Modelling

Topic Modeling

• No generally accepted definition for a “topic” in NLP

– Document clusters

– Abstractions based on document clusters (labels, centroids, etc.)

– (Word, probability) pairs

• Bayesian statistical models

– Topics: distributions over words

– Documents: distributions over topics

– Generative model

• Topic intertwining

– Conceptually similar to the ideas of Mikhail Bakhtin

– Topics and voices
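To make the generative view concrete (topics as distributions over words, documents as distributions over topics), here is a minimal sketch of how such a model would generate one document. The vocabulary, the two topics, the `alpha` value and all function names are hypothetical; in full LDA the topics themselves are also drawn from a Dirichlet prior, whereas here they are fixed by hand for readability.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probabilities, rng):
    """Draw an index according to a discrete probability distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def generate_document(topics, vocabulary, doc_length, alpha, rng):
    """Generate one document: pick a topic per word, then a word from that topic."""
    # Document-level distribution over topics.
    topic_mixture = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(doc_length):
        z = sample_index(topic_mixture, rng)  # topic for this word position
        w = sample_index(topics[z], rng)      # word drawn from that topic
        words.append(vocabulary[w])
    return topic_mixture, words


rng = random.Random(0)
VOCABULARY = ["letter", "friend", "reply", "planet", "orbit", "telescope"]
# Hypothetical topics, each a probability distribution over the vocabulary.
TOPICS = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # "correspondence"
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # "astronomy"
]
mixture, document = generate_document(TOPICS, VOCABULARY, 8, alpha=0.5, rng=rng)
print([round(p, 2) for p in mixture], document)
```

Topic modelling runs this story in reverse: given only the documents, it infers the topics and the per-document mixtures that most plausibly generated them.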

Page 8: Trausan-Matu: Natural Language Processing and Topic Modelling

Topic Modeling (2)

• LSA/pLSA/hLDA/CTM

– Each newer version corrects some flaws of the earlier ones

• LDA

– Readily available

• Mallet

• Easily reproducible experiments

Page 9: Trausan-Matu: Natural Language Processing and Topic Modelling

The LSA idea

• Reducing the dimensionality of the vector space, similarly to the least squares method

• The effect is the creation of semantic spaces containing semantically related words

• Bag-of-words approach

• http://lsa.colorado.edu

LSA - Vector space model
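As a rough illustration of the bag-of-words vector space model that LSA starts from, the sketch below turns texts into word-count vectors and compares them by cosine similarity. The toy sentences and the helper names `bag_of_words` and `cosine` are assumptions made for this example only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as word counts, ignoring word order."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents.
d1 = bag_of_words("the cosmonaut saw the moon")
d2 = bag_of_words("the astronaut saw the moon")
d3 = bag_of_words("a truck and a car")

print(round(cosine(d1, d2), 2))  # high: most words are shared
print(round(cosine(d1, d3), 2))  # 0.0: no words in common
```

In this raw space "cosmonaut" and "astronaut" are unrelated dimensions; the dimensionality reduction in LSA is what allows such words to end up close together.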

Page 10: Trausan-Matu: Natural Language Processing and Topic Modelling

Singular value decomposition (SVD)

$$A_{t \times d} = T_{t \times n} \, S_{n \times n} \, (D_{d \times n})^{T}, \qquad n = \min(t, d)$$

Terms-documents array A (ex. from Manning and Schütze, 1999)

            d1   d2   d3   d4   d5   d6
cosmonaut    1    0    1    0    0    0
astronaut    0    1    0    0    0    0
moon         1    1    0    0    0    0
car          1    0    0    1    1    0
truck        0    0    0    1    0    1

T^T

       cosmonaut  astronaut   moon    car    truck
dim1     -0.44      -0.13    -0.48   -0.70   -0.26
dim2     -0.30      -0.33    -0.51    0.35    0.65
dim3      0.57      -0.59    -0.37    0.15   -0.41
dim4      0.58       0.00     0.00   -0.58    0.58
dim5      0.25       0.73    -0.61    0.16   -0.09

S

$$S = \operatorname{diag}(2.16,\ 1.59,\ 1.28,\ 1.00,\ 0.39)$$

D^T

         d1      d2      d3      d4      d5      d6
dim1   -0.75   -0.28   -0.20   -0.45   -0.33   -0.12
dim2   -0.29   -0.53   -0.19    0.63    0.22    0.41
dim3    0.28   -0.75    0.45   -0.20    0.12   -0.33
dim4    0.00    0.00    0.58    0.00   -0.58    0.58
dim5   -0.53    0.29    0.63    0.19    0.41   -0.22

Reduced A

• SVD maps the n-dimensional space onto a k-dimensional one, with n >> k

• Common values for k are 100 and 150.

• The reduced matrix $\hat{A}$ is the rank-k matrix that minimizes $\|A - \hat{A}\|_2$

$$B = S_{2 \times 2} \, (D_{d \times 2})^{T}$$

B

         d1      d2      d3      d4      d5      d6
dim1   -1.62   -0.60   -0.44   -0.97   -0.71   -0.26
dim2   -0.46   -0.84   -0.30    1.00    0.35    0.65
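The reduction to two dimensions can be reproduced on the small terms-documents array of this example. The sketch below is one possible way to do it without external libraries: it uses power iteration with deflation as a stand-in for a library SVD routine, and relies on the identity S_k D_k^T = T_k^T A to obtain B. All function names are hypothetical, and the signs of the computed vectors may differ from a reference table, since each singular vector is only defined up to sign.

```python
import math

# Terms-documents array (rows: cosmonaut, astronaut, moon, car, truck).
A = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def normalise(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def top_singular_pairs(a, k, iterations=500):
    """Return the k largest singular values and their left singular vectors."""
    n = len(a)
    # Gram matrix A * A^T; its eigenvalues are the squared singular values of A.
    gram = [[sum(x * y for x, y in zip(a[i], a[j])) for j in range(n)] for i in range(n)]
    values, vectors = [], []
    for _rank in range(k):
        v = normalise([float(i + 1) for i in range(n)])  # fixed, arbitrary start
        for _step in range(iterations):
            v = normalise(matvec(gram, v))
        eigenvalue = sum(x * y for x, y in zip(v, matvec(gram, v)))
        values.append(math.sqrt(eigenvalue))
        vectors.append(v)
        # Deflation: remove the component just found before the next pass.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]
    return values, vectors


singular_values, left_vectors = top_singular_pairs(A, k=2)
# Each row of B is one reduced dimension: B = T_k^T * A = S_k * D_k^T.
B = [matvec(transpose(A), t) for t in left_vectors]
docs_original = transpose(A)
docs_reduced = transpose(B)

print([round(s, 2) for s in singular_values])
print(cosine(docs_original[1], docs_original[2]))           # d2 vs d3 share no terms
print(round(cosine(docs_reduced[1], docs_reduced[2]), 2))   # d2 vs d3 after reduction
```

Documents d2 (astronaut, moon) and d3 (cosmonaut) have no terms in common, so their similarity is zero in the original space, yet they come out as similar in the two-dimensional space. This is the "semantic space" effect that LSA aims for.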

Page 11: Trausan-Matu: Natural Language Processing and Topic Modelling

LSA based text processing

The most significant 20 words from Plato

[Plato|TheApology,Justin|TheSecondApology-(0.6475);Plato|TheRepublic.7,Irenaeus|AgainstHeresies.6-(0.6095)]

The similarity of Plato’s works with the works of other writers

Page 12: Trausan-Matu: Natural Language Processing and Topic Modelling

Latent Dirichlet Allocation

http://www.columbia.edu/~ih2240/dataviz/G4063-week5/images/text/LDA.png

Page 13: Trausan-Matu: Natural Language Processing and Topic Modelling

Bakhtin’s Polyphonic Intertextuality

[Figure: Voices I, II and III in dialog; Texts 1, 2 and 3 in dialog within Text 4]

Page 14: Trausan-Matu: Natural Language Processing and Topic Modelling

Polyphony

• Appears in music (e.g. J.S. Bach) and in novels (Bakhtin)

• The Polyphonic Model (Trausan-Matu, 2005, 2010)

• Analysis method (Trausan-Matu, Dascalu and Rebedea, 2010)

• Computer support tools for the polyphonic analysis of conversations and networks of documents

– The “Polyphony” system (Trausan-Matu et al., 2007)

– ASAP (Dascalu, Chioasca and Trausan-Matu, 2008)

– PolyCAFe (Trausan-Matu, Rebedea and Dascalu, 2011; Rebedea, Dascalu, Trausan-Matu et al., 2010)

– Collaboration regions detection (Banica, Trausan-Matu and Rebedea, 2011)

– Detection of the important moments (Chiru and Trausan-Matu, 2012)

– Intertextuality detection (Ghiban and Trausan-Matu, 2012)

– ReaderBench (Dascălu, Trăușan-Matu and Dessus, 2013)

Page 15: Trausan-Matu: Natural Language Processing and Topic Modelling

Intertextuality analysis

• Mikhail Bakhtin’s dialogical and polyphonic model

• Intertextuality (Kristeva)

• Analyze how concepts are echoed from one text to another (intertextual networks)

• To indicate membership in a philosophical trend or influences among authors
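One very crude way to picture an intertextual network is to link two texts whenever they share enough vocabulary. The sketch below does this with Jaccard overlap on non-stop-words; it is an assumption-laden toy, not the polyphonic analysis method described in the talk. The texts, the stop-word list, the threshold and the helper names are all hypothetical.

```python
from itertools import combinations

# Hypothetical stop-word list and toy texts.
STOP_WORDS = {"the", "of", "and", "a", "is", "in", "to"}
TEXTS = {
    "text1": "the soul is immortal and seeks the good",
    "text2": "the good of the soul is virtue",
    "text3": "the planets move in circular orbits",
}


def concepts(text):
    """Crude stand-in for concept extraction: the set of non-stop-words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def intertextual_links(texts, threshold=0.2):
    """Link two texts when the Jaccard overlap of their concepts reaches the threshold."""
    links = []
    for (name_a, text_a), (name_b, text_b) in combinations(texts.items(), 2):
        a, b = concepts(text_a), concepts(text_b)
        shared, union = a & b, a | b
        overlap = len(shared) / len(union) if union else 0.0
        if overlap >= threshold:
            links.append((name_a, name_b, sorted(shared), round(overlap, 2)))
    return links


print(intertextual_links(TEXTS))
```

Word overlap only captures literal echoes. Detecting concepts that recur under different wording would need a semantic representation such as LSA, LDA or WordNet.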

Page 16: Trausan-Matu: Natural Language Processing and Topic Modelling

Bakhtin’s Polyphonic Intertextuality

Theme 2 and Theme 3 may share the same words yet express different concepts

Sections 1 and 6 are dialogical or polyphonic. They may have greater expressive force.

Page 17: Trausan-Matu: Natural Language Processing and Topic Modelling

PolyCAFe (Trăușan-Matu, Dascălu and Rebedea)

• Polyphony-based Collaboration Analysis and Feedback generation

• Developed in the “Language Technologies for Lifelong Learning” EU FP7 project (http://www.ltfll-project.org/)

• Analyses chat (instant messenger) logs with more than two participants using the polyphonic model (Trăușan-Matu)

Page 18: Trausan-Matu: Natural Language Processing and Topic Modelling

From: Trăuşan-Matu, A Polyphonic Model for Interethnic Discourse, 2013

Page 19: Trausan-Matu: Natural Language Processing and Topic Modelling

PolyCAFe

From: Trăuşan-Matu, A Polyphonic Model for Interethnic Discourse, 2013

Page 20: Trausan-Matu: Natural Language Processing and Topic Modelling

ReaderBench (Dascalu, Trăușan-Matu and Dessus)

• Based on

– LSA, LDA

– Polyphonic model

– WordNet (http://wordnet.princeton.edu)

– Social Network Analysis

Page 21: Trausan-Matu: Natural Language Processing and Topic Modelling

NLP Text pre-processing in PolyCAFe and ReaderBench

Page 22: Trausan-Matu: Natural Language Processing and Topic Modelling

ReaderBench Document view

Page 23: Trausan-Matu: Natural Language Processing and Topic Modelling

ReaderBench Concept View

Page 24: Trausan-Matu: Natural Language Processing and Topic Modelling

Concept View

Page 25: Trausan-Matu: Natural Language Processing and Topic Modelling

ReaderBench Corpus Similarity

Page 26: Trausan-Matu: Natural Language Processing and Topic Modelling

ReaderBench Document Centrality

Page 27: Trausan-Matu: Natural Language Processing and Topic Modelling

Thank you!

Questions?

http://www.racai.ro/trausan
