Natural Language Processing in R (rNLP)

Natural Language Processing in R (rNLP) Fridolin Wild, The Open University, UK Tutorial to the Doctoral School at the Institute of Business Informatics of the Goethe University Frankfurt

description

The introductory slides of a workshop given to the doctoral school at the Institute of Business Informatics of the Goethe University Frankfurt. The tutorials are available on http://crunch.kmi.open.ac.uk/w/index.php/Tutorials

Transcript of Natural Language Processing in R (rNLP)

Page 1: Natural Language Processing in R (rNLP)

Natural Language Processing in R (rNLP)
Fridolin Wild, The Open University, UK

Tutorial to the Doctoral School at the Institute of Business Informatics of the Goethe University Frankfurt

Page 2: Natural Language Processing in R (rNLP)

Structure of this tutorial

• An introduction to R and cRunch
• Language basics in R
• Basic I/O in R
• Social Network Analysis
• Latent Semantic Analysis
• Twitter
• Sentiment
• (Advanced I/O in R: MySQL, SPARQL)

Page 3: Natural Language Processing in R (rNLP)

Introduction

Page 4: Natural Language Processing in R (rNLP)

cRunch

• is an infrastructure
• for computationally-intense learning analytics
• supporting researchers
• in investigating big data
• generated in the co-construction of knowledge

… and beyond …

Page 5: Natural Language Processing in R (rNLP)

Architecture

(Thiele & Lehner, 2011)

Page 6: Natural Language Processing in R (rNLP)

Architecture

(Thiele & Lehner, 2011)

Living Reports

data shop

cron jobs

R webservices

Page 7: Natural Language Processing in R (rNLP)

Reports

Page 8: Natural Language Processing in R (rNLP)

Living reports

• reports with embedded scripts and data

• knitr and Sweave
• render to HTML, PDF, …
• visualisations:
  – ggplot2, trellis, graphics
  – jpg, png, eps, pdf

png(file = "n.png"); plot(network(m)); dev.off()

• Fill-in-the-blanks:
  Drop out quote went down to
  <<echo=FALSE>>=
  doquote["OU", "2011"]
  @

\documentclass[a4paper]{article}
\title{Sweave Example 1}
\author{Friedrich Leisch}
\begin{document}
\maketitle

In this example we embed parts of the examples from the
\texttt{kruskal.test} help page into a \LaTeX{} document:
<<>>=
data(airquality)
library(ctest)
kruskal.test(Ozone ~ Month, data = airquality)
@
which shows that the location parameter of the Ozone
distribution varies significantly from month to month.
Finally we include a boxplot of the data:
\begin{center}
<<fig=TRUE,echo=FALSE>>=
boxplot(Ozone ~ Month, data = airquality)
@
\end{center}
\end{document}

Page 9: Natural Language Processing in R (rNLP)

Example PDF report

Page 10: Natural Language Processing in R (rNLP)

Example html5 report

Example Report
==============

This is an example of embedded scripts and data.
```{r}
a = "hello world"
print(a)
```

And here is an example of how to embed a chart.
```{r fig.width=7, fig.height=6}
plot(5:20)
```

Page 11: Natural Language Processing in R (rNLP)

Shiny Widgets (1)

• Widgets: use-case sized encapsulations of mini apps

• HTML5
• Two files: ui.R, server.R
• Still missing: manifest files (info.plist, config.xml)

Page 12: Natural Language Processing in R (rNLP)

Shiny Widgets (2)

From http://www.rstudio.com/shiny/

Page 13: Natural Language Processing in R (rNLP)

Web Services
harmonization & data warehousing

Page 14: Natural Language Processing in R (rNLP)

Example R web service

print("hello world")

Page 15: Natural Language Processing in R (rNLP)

More complex R web service

setContentType("image/png")

a = c(1, 3, 5, 12, 13, 15)
image_file = tempfile()

png(file = image_file)
plot(a, main = "The magic image", ylab = "", xlab = "",
     col = c("darkred", "darkblue", "darkgreen"))
dev.off()

sendBin(readBin(image_file, 'raw', n = file.info(image_file)$size))
unlink(image_file)

Page 16: Natural Language Processing in R (rNLP)

R web services

• Uses the Apache mod_R.so
• See http://Rapache.net
• Common server functions:
  – GET and POST variables
  – setContentType
  – sendBin
  – …

Page 17: Natural Language Processing in R (rNLP)

A word on memory mgmt.

• Advanced memory management (see p. 70 of the Dietl diploma thesis):
  – Use package bigmemory (for shared memory across threads)
  – Use package Rserve (for shared read-only access across threads)
  – Swap out memory objects with save() and load()
  – The latter is typically sufficient (hard disks are fast!)
• Data management abstraction layer for mod_R.so: configure a handler in httpd.conf: specify a directory match and load specific data management routines at start-up:
  REvalOnStartup "source('/dbal.R');"
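A minimal sketch of the save()/load() swap-out approach mentioned above; the object and file names are illustrative:

```r
# Build a large object, serialise it to disk, free the memory, reload on demand
big_object <- matrix(rnorm(1000 * 100), nrow = 1000)
swap_file <- tempfile(fileext = ".rda")   # illustrative path
save(big_object, file = swap_file)        # swap out to disk
rm(big_object)                            # release the memory
# ... later, when the object is needed again ...
load(swap_file)                           # restores 'big_object' by name
unlink(swap_file)
```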

Page 18: Natural Language Processing in R (rNLP)

Harvesting
data acquisition

Page 19: Natural Language Processing in R (rNLP)

Job scheduling

• crontab entries for R webservices
• e.g. harvest feeds
• e.g. store in local DB

Page 20: Natural Language Processing in R (rNLP)

data shop
sharing

Page 21: Natural Language Processing in R (rNLP)

Data shop and the community

• You have a 'public/' folder :)
  – 'public/data': save() any .rda file and it will be indexed within the hour
  – 'public/services': use this to execute your scripts; indexed within the hour
  – 'public/gallery': use this to store your public visualisations
  – code sharing: any .R script in your 'public/' folder is source-readable by the web

Page 22: Natural Language Processing in R (rNLP)

Not covered
The useful pointers

Page 23: Natural Language Processing in R (rNLP)

More NLP packages

install.packages("NaturalLanguageProcessing")

library("NaturalLanguageProcessing")

Page 24: Natural Language Processing in R (rNLP)

studio
exploratory programming

Page 25: Natural Language Processing in R (rNLP)

studio

Page 26: Natural Language Processing in R (rNLP)

Social Network Analysis
Fridolin Wild, The Open University, UK

Page 27: Natural Language Processing in R (rNLP)

The Idea

Page 28: Natural Language Processing in R (rNLP)

The basic concept

• Precursors date back to 1920s, math to Euler’s ‘Seven Bridges of Koenigsberg’

Page 30: Natural Language Processing in R (rNLP)

The basic concept

• Precursors date back to 1920s, math to Euler’s ‘Seven Bridges of Koenigsberg’

• Social Networks are:
  • Actors (people, groups, media, tags, …)
  • Ties (interactions, relationships, …)
• Actors and ties form a graph
• The graph has measurable structural properties:
  • Betweenness
  • Degree of Centrality
  • Density
  • Cohesion
  • Structural Patterns
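One of these structural properties, degree centrality, can be sketched in base R; the four-actor network below is made up for illustration:

```r
# Undirected example network: 4 actors, ties as a symmetric adjacency matrix
a <- matrix(0, 4, 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
a["A", "B"] <- a["B", "A"] <- 1
a["B", "C"] <- a["C", "B"] <- 1
a["B", "D"] <- a["D", "B"] <- 1
degree <- rowSums(a > 0)   # number of ties per actor
print(degree)              # B is the most central actor
```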

Page 31: Natural Language Processing in R (rNLP)

Forum Messages

        message_id   forum_id   parent_id   author
  130      2853483    2853445          \N     2043
  131      1440740     785876          \N     1669
  132      2515257    2515256          \N     5814
  133      4704949    4699874          \N     5810
  134      2597170    2558273          \N     2054
  135      2316951    2230821          \N     5095
  136      3407573    3407568          \N       36
  137      2277393    2277387          \N      359
  138      3394136    3382201          \N     1050
  139      4603931    4167338          \N      453
  140      6234819    6189254     6231352     5400
  141       806699     785877      804668     2177
  142      4430290    3371246     3380313       48
  143      3395686    3391024     3391129       35
  144      6270213    6024351     6265378     5780
  145      2496015    2491522     2491536     2774
  146      4707562    4699873     4707502     5810
  147      2574199    2440094     2443801     5801
  148      4501993    4424215     4491650     5232

        message_id   forum_id   parent_id   author
   60       734569      31117          \N     2491
  221       762702      31117                    1
  317       762717      31117      762702     1927
 1528       819660      31117      793408     1197
 1950       840406      31117      839998     1348
 1047       841810      31117      767386     1879
 2239       862709      31117          \N     1982
 2420       869839      31117      862709     2038
 2694       884824      31117          \N     5439
 2503       896399      31117      862709     1982
 2846       901691      31117      895022      992
 3321       951376      31117          \N     5174
 3384       952895      31117      951376     1597
 1186       955595      31117      767386     5724
 3604       958065      31117          \N      716
 2551       960734      31117      862709     1939
 4072       975816      31117          \N      584
 2574       986038      31117      862709     2043
 2590       987842      31117      862709     1982

Page 32: Natural Language Processing in R (rNLP)

Incidence Matrix

• msg_id = incident, authors appear in incidents

Page 33: Natural Language Processing in R (rNLP)

Derive Adjacency Matrix

am = t(im) %*% im
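The derivation can be sketched in base R on a toy incidence matrix; the message ids and author names are invented:

```r
# Rows = messages (incidents), columns = authors appearing in them
im <- matrix(c(1, 1, 0,
               0, 1, 1,
               1, 0, 1), nrow = 3, byrow = TRUE,
             dimnames = list(c("m1", "m2", "m3"), c("anna", "ben", "cara")))
am <- t(im) %*% im   # author-by-author co-occurrence counts
diag(am) <- 0        # drop self-ties
print(am)
```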

Page 34: Natural Language Processing in R (rNLP)

Visualization: Sociogramme

Page 35: Natural Language Processing in R (rNLP)

Degree

Page 36: Natural Language Processing in R (rNLP)

Betweenness

Page 37: Natural Language Processing in R (rNLP)

Network Density

• Total edges = 29
• Possible edges = 18 × (18-1)/2 = 153
• Density = 29/153 ≈ 0.19
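The density computation from the slide as a few lines of base R, using the 29 edges among 18 actors given above:

```r
n_actors <- 18
n_edges  <- 29
possible <- n_actors * (n_actors - 1) / 2   # 153 undirected pairs
density  <- n_edges / possible
round(density, 2)                           # 0.19, matching the slide
```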

Page 38: Natural Language Processing in R (rNLP)

kmeans Cluster (k=3)

Page 39: Natural Language Processing in R (rNLP)

Analysis

• Mix
• Match
• Optimise

Page 40: Natural Language Processing in R (rNLP)

Tutorials

• Starter: sna-simple.Rmd
• Real: sna-blog.Rmd
• Advanced: sna-forum.Rmd

Page 41: Natural Language Processing in R (rNLP)

Latent Semantic AnalysisFridolin Wild, The Open University, UK

Page 42: Natural Language Processing in R (rNLP)

Latent Semantic Analysis

• “Humans learn word meanings and how to combine them into passage meaning through experience with ~paragraph unitized verbal environments.”

• “They don’t remember all the separate words of a passage; they remember its overall gist or meaning.”

• “LSA learns by ‘reading’ ~paragraph unitized texts that represent the environment.”

• “It doesn’t remember all the separate words of a text; it remembers its overall gist or meaning.”

(Landauer, 2007)

Page 43: Natural Language Processing in R (rNLP)

Word choice is over-rated

• An educated adult understands ~100,000 word forms
• An average sentence contains 20 tokens
• Thus there are 100,000^20 possible combinations of words in a sentence
• = a maximum of log2(100,000^20) ≈ 332 bits in word choice alone
• 20! ≈ 2.4 × 10^18 possible orders of 20 words
• = a maximum of log2(20!) ≈ 61 bits from the order of the words
• 332 / (61 + 332) ≈ 84% word choice

(Landauer, 2007)
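Landauer's back-of-the-envelope numbers can be checked directly in R:

```r
bits_choice <- 20 * log2(100000)        # information in word choice
bits_order  <- lfactorial(20) / log(2)  # log2(20!) = information in word order
round(bits_choice)                      # about 332
round(bits_order)                       # about 61
round(bits_choice / (bits_choice + bits_order), 2)  # about 0.84
```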

Page 44: Natural Language Processing in R (rNLP)

LSA (2)

• Assumption: texts have a semantic structure

• However, this structure is obscured by word usage (noise, synonymy, polysemy, …)

• Proposed LSA Solution:
  – map the doc-term matrix
  – using conceptual indices
  – derived statistically (truncated SVD)
  – and make similarity comparisons using angles
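This proposal can be sketched in base R: truncate the SVD of a toy term-by-document matrix and compare documents by the angle (cosine) between their reduced vectors. The matrix and the choice of k are illustrative:

```r
# Toy term-by-document matrix (terms in rows, documents in columns)
m <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 1, 1,
              0, 0, 1, 1), nrow = 4, byrow = TRUE)
s <- svd(m)
k <- 2                                   # keep only the first k singular values
mk <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])  # reduced reconstruction
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cosine(mk[, 3], mk[, 4])                 # doc3-doc4 similarity in reduced space
```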

Page 45: Natural Language Processing in R (rNLP)

Input (e.g., documents)

{ M } = 

Deerwester, Dumais, Furnas, Landauer, and Harshman (1990): Indexing by Latent Semantic Analysis, In: Journal of the American Society for Information Science, 41(6):391-407

Only the red terms appear in more than one document, so strip the rest.

term = feature

vocabulary = ordered set of features

TEXTMATRIX

Page 46: Natural Language Processing in R (rNLP)

Singular Value Decomposition

=

Page 47: Natural Language Processing in R (rNLP)

Truncated SVD
latent-semantic space

Page 48: Natural Language Processing in R (rNLP)

Reconstructed, Reduced Matrix

m4: Graph minors: A survey

Page 49: Natural Language Processing in R (rNLP)

Similarity in a Latent-Semantic Space

[Figure: a query vector and two target vectors (Target 1, Target 2) plotted in a two-dimensional latent-semantic space (X dimension, Y dimension); the angles between the query and each target (Angle 1, Angle 2) express their similarity.]

Page 50: Natural Language Processing in R (rNLP)

doc2doc - similarities

• Unreduced = pure vector space model
  – based on M = T S D'
  – Pearson correlation over document vectors
• Reduced
  – based on M₂ = T S₂ D'
  – Pearson correlation over document vectors

Page 51: Natural Language Processing in R (rNLP)

Ex Post Updating: Folding-In

• SVD factor stability
  – SVD calculates factors over a given text base
  – Different texts – different factors
  – Challenge: avoid unwanted factor changes (e.g., bad essays)
  – Solution: folding-in of essays instead of recalculating
• SVD is computationally expensive

Page 52: Natural Language Processing in R (rNLP)

Folding-In in Detail

With the truncated SVD Mk = Tk Sk Dkᵀ, a new document vector v is folded in two steps (Berry et al., 1995):

(1) convert the original vector v to the "Dk" format:
    d̂ = vᵀ Tk Sk⁻¹

(2) convert the "Dk"-format vector back to the "Mk" format:
    m̂ = Tk Sk d̂ᵀ
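The two conversion steps above can be sketched with base R's svd(); the matrix, the new document v, and k are illustrative:

```r
m <- matrix(c(1, 1, 0,
              1, 0, 1,
              0, 1, 1,
              0, 0, 1), nrow = 4, byrow = TRUE)  # terms x documents
s <- svd(m)
k <- 2
Tk <- s$u[, 1:k]; Sk <- diag(s$d[1:k]); Dk <- s$v[, 1:k]
v  <- c(1, 0, 1, 0)                 # new document as a raw term vector
d_hat <- t(v) %*% Tk %*% solve(Sk)  # step (1): map into the Dk space
m_hat <- Tk %*% Sk %*% t(d_hat)     # step (2): map back into the Mk space
```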

Page 53: Natural Language Processing in R (rNLP)

LSA Process & Driving Parameters

4 × 12 × 7 × 2 × 3 = 2016 combinations

Page 54: Natural Language Processing in R (rNLP)

Pre-Processing

• Stemming
  – Porter Stemmer (snowball.tartarus.org)
  – 'move', 'moving', 'moves' => 'move'
  – in German even more important (more inflected forms)
• Stop Word Elimination
  – 373 stop words in German
• Stemming plus Stop Word Elimination
• Unprocessed ('raw') Terms

Page 55: Natural Language Processing in R (rNLP)

Term Weighting Schemes

• Global Weights (GW)
  – None ('raw' tf)
  – Normalisation:
      norm_i = 1 / sqrt( Σ_j tf_ij² )
  – Inverse Document Frequency (IDF):
      idf_i = log2( numdocs / (1 + docfreq_i) )
  – 1 + Entropy:
      entplusone_i = 1 + Σ_j ( p_ij log2(p_ij) / log2(numdocs) ),
      where p_ij = tf_ij / Σ_j tf_ij

• Local Weights (LW)
  – None ('raw' tf)
  – Binary Term Frequency
  – Logarithmised Term Frequency (log)

weight_ij = lw(tf_ij) · gw(tf_i·)
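A base-R sketch of these weighting schemes; the function names echo the lsa package's naming but the formulas here follow the slide, and the toy matrix is invented:

```r
m <- matrix(c(2, 0, 1,
              1, 1, 0,
              0, 3, 1), nrow = 3, byrow = TRUE)  # tf, terms x documents

lw_logtf <- function(tf) log(tf + 1)             # logarithmised local weight
gw_norm  <- function(tf) 1 / sqrt(rowSums(tf^2)) # normalisation
gw_idf   <- function(tf)                         # inverse document frequency
  log2(ncol(tf) / (1 + rowSums(tf > 0)))
gw_entropy <- function(tf) {                     # 1 + entropy
  p <- tf / rowSums(tf)
  1 + rowSums(ifelse(p > 0, p * log2(p), 0)) / log2(ncol(tf))
}

w <- lw_logtf(m) * gw_idf(m)   # weight_ij = lw(tf_ij) * gw_i
```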

Page 56: Natural Language Processing in R (rNLP)

SVD-Dimensionality

• Many different proposals (see package)• 80% variance is a good estimator

Page 57: Natural Language Processing in R (rNLP)

Proximity Measures

• Pearson Correlation
• Cosine Correlation
• Spearman's Rho

pics: http://davidmlane.com/hyperstat/A62891.html

Page 58: Natural Language Processing in R (rNLP)

Pair-wise dis/similarity

Convergence expected: ‘eu’, ‘österreich’ Divergence expected: ‘jahr’, ‘wien’

Page 59: Natural Language Processing in R (rNLP)

The Package

• Available via CRAN, e.g.:http://cran.r-project.org/web/packages/lsa/index.html

• Higher-level Abstraction to Ease Use
  – Core methods:
      textmatrix() / query()
      lsa()
      fold_in()
      as.textmatrix()

– Support methods for term weighting, dimensionality calculation, correlation measurement, …

Page 60: Natural Language Processing in R (rNLP)

Core Workflow

• tm = textmatrix('dir/')

• tm = lw_logtf(tm) * gw_idf(tm)

• space = lsa(tm, dims = dimcalc_share())

• tm3 = fold_in(tm, space)

• as.textmatrix(space)

Page 61: Natural Language Processing in R (rNLP)

Pre-Processing Chain

Page 62: Natural Language Processing in R (rNLP)

Tutorials

• Starter: lsa-indexing.Rmd
• Real: lsa-essayscoring.Rmd
• Advanced: lsa-sparse.Rmd

Page 63: Natural Language Processing in R (rNLP)

Additional tutorials
Fridolin Wild, The Open University, UK

Page 64: Natural Language Processing in R (rNLP)

Tutorials

• Advanced I/O: twitter.Rmd
• Advanced I/O: sparql.Rmd
• Advanced NLP: twitter-sentiment.Rmd
• Evaluation: interrater-agreement.Rmd