Natural Language Processing in R (rNLP)

Post on 18-Nov-2014


Description

The introductory slides of a workshop given to the doctoral school at the Institute of Business Informatics of the Goethe University Frankfurt. The tutorials are available on http://crunch.kmi.open.ac.uk/w/index.php/Tutorials

Transcript of Natural Language Processing in R (rNLP)

Natural Language Processing in R (rNLP)
Fridolin Wild, The Open University, UK

Tutorial to the Doctoral School at the Institute of Business Informatics of the Goethe University Frankfurt

Structure of this tutorial

• An introduction to R and cRunch
• Language basics in R
• Basic I/O in R
• Social Network Analysis
• Latent Semantic Analysis
• Twitter
• Sentiment
• (Advanced I/O in R: MySQL, SPARQL)

Introduction

cRunch

• is an infrastructure
• for computationally-intense learning analytics
• supporting researchers
• in investigating big data
• generated in the co-construction of knowledge

… and beyond …

Architecture

(Thiele & Lehner, 2011)


[Architecture diagram: living reports, data shop, cron jobs, R web services, reports]

Living reports

• reports with embedded scripts and data

• knitr and Sweave
• render to HTML, PDF, …
• visualisations:
  – ggplot2, trellis, graphics
  – jpg, png, eps, pdf

png(file = "n.png")
plot(network(m))
dev.off()

• Fill-in-the-blanks:
  Drop-out quote went down to
  <<echo=FALSE>>=
  doquote["OU", "2011"]
  @

\documentclass[a4paper]{article}
\title{Sweave Example 1}
\author{Friedrich Leisch}
\begin{document}
\maketitle
In this example we embed parts of the examples from the \texttt{kruskal.test} help page into a \LaTeX{} document:
<<>>=
data(airquality)
library(ctest)
kruskal.test(Ozone ~ Month, data = airquality)
@
which shows that the location parameter of the Ozone distribution varies significantly from month to month. Finally we include a boxplot of the data:
\begin{center}
<<fig=TRUE,echo=FALSE>>=
boxplot(Ozone ~ Month, data = airquality)
@
\end{center}
\end{document}

Example PDF report

Example html5 report

Example Report
==============

This is an example of embedded scripts and data.

```{r}
a = "hello world"
print(a)
```

And here is an example of how to embed a chart.

```{r fig.width=7, fig.height=6}
plot(5:20)
```

Shiny Widgets (1)

• Widgets: use-case sized encapsulations of mini apps

• HTML5
• Two files: ui.R, server.R
• Still missing: manifest files (Info.plist, config.xml)

Shiny Widgets (2)

From http://www.rstudio.com/shiny/

Web Services
harmonization & data warehousing

Example R web service

print("hello world")

More complex R web service

setContentType("image/png")

a = c(1, 3, 5, 12, 13, 15)
image_file = tempfile()

png(file = image_file)
plot(a, main = "The magic image", ylab = "", xlab = "",
     col = c("darkred", "darkblue", "darkgreen"))
dev.off()

sendBin(readBin(image_file, 'raw', n = file.info(image_file)$size))
unlink(image_file)

R web services

• Uses the Apache mod_R.so
• See http://rapache.net
• Common server functions:
  – GET and POST variables
  – setContentType
  – sendBin
  – …

A word on memory mgmt.

• Advanced memory management (see p. 70 of the Dietl diploma thesis):
  – Use package bigmemory (for shared memory across threads)
  – Use package Rserve (for shared read-only access across threads)
  – Swap out memory objects with save() and load()
  – The latter is typically sufficient (hard disks are fast!)
• Data management abstraction layer for mod_R.so: configure the handler in httpd.conf, specify a directory match, and load specific data management routines at start-up:
  REvalOnStartup "source('/dbal.R');"

Harvesting
data acquisition

Job scheduling

• crontab entries for R web services
• e.g. harvest feeds
• e.g. store in local DB

Data shop
sharing

Data shop and the community

• You have a 'public/' folder :)
  – 'public/data': save() any .rda file and it will be indexed within the hour
  – 'public/services': use this to execute your scripts; indexed within the hour
  – 'public/gallery': use this to store your public visualisations
  – Code sharing: any .R script in your 'public/' folder is source-readable by the web

Not covered
The useful pointer

More NLP packages

install.packages("NaturalLanguageProcessing")

library("NaturalLanguageProcessing")

studio
exploratory programming

Social Network Analysis
Fridolin Wild, The Open University, UK

The Idea

The basic concept

• Precursors date back to 1920s, math to Euler’s ‘Seven Bridges of Koenigsberg’


• Social networks are:
  – Actors (people, groups, media, tags, …)
  – Ties (interactions, relationships, …)
• Actors and ties form a graph
• The graph has measurable structural properties:
  – Betweenness
  – Degree centrality
  – Density
  – Cohesion
  – Structural patterns

Forum Messages

      message_id  forum_id  parent_id  author
130   2853483     2853445   \N         2043
131   1440740     785876    \N         1669
132   2515257     2515256   \N         5814
133   4704949     4699874   \N         5810
134   2597170     2558273   \N         2054
135   2316951     2230821   \N         5095
136   3407573     3407568   \N         36
137   2277393     2277387   \N         359
138   3394136     3382201   \N         1050
139   4603931     4167338   \N         453
140   6234819     6189254   6231352    5400
141   806699      785877    804668     2177
142   4430290     3371246   3380313    48
143   3395686     3391024   3391129    35
144   6270213     6024351   6265378    5780
145   2496015     2491522   2491536    2774
146   4707562     4699873   4707502    5810
147   2574199     2440094   2443801    5801
148   4501993     4424215   4491650    5232

      message_id  forum_id  parent_id  author
60    734569      31117     \N         2491
221   762702      31117                1
317   762717      31117     762702     1927
1528  819660      31117     793408     1197
1950  840406      31117     839998     1348
1047  841810      31117     767386     1879
2239  862709      31117     \N         1982
2420  869839      31117     862709     2038
2694  884824      31117     \N         5439
2503  896399      31117     862709     1982
2846  901691      31117     895022     992
3321  951376      31117     \N         5174
3384  952895      31117     951376     1597
1186  955595      31117     767386     5724
3604  958065      31117     \N         716
2551  960734      31117     862709     1939
4072  975816      31117     \N         584
2574  986038      31117     862709     2043
2590  987842      31117     862709     1982

Incidence Matrix

• msg_id = incident, authors appear in incidents

Derive Adjacency Matrix

am = t(im) %*% im
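A minimal base-R sketch of this step, using made-up toy data rather than the forum dump above: rows of the incidence matrix are messages (incidents), columns are authors, and the cross-product yields the author-by-author adjacency matrix.

```r
# Toy incidence matrix: rows = messages, columns = authors;
# a 1 marks that the author appears in that message thread.
im <- matrix(c(1, 1, 0,
               0, 1, 1,
               1, 0, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("msg", 1:3),
                             c("anna", "bert", "cara")))

# Adjacency: entry [i, j] counts the incidents in which authors
# i and j co-occur; the diagonal counts each author's messages.
am <- t(im) %*% im
am
```

The resulting `am` is symmetric and can be fed directly into graph plotting or centrality routines.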

Visualisation: Sociogram

Degree

Betweenness

Network Density

• Total edges = 29
• Possible edges = 18 × (18 − 1)/2 = 153
• Density = 29/153 ≈ 0.19
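The density figures from the slide can be recomputed in a few lines of base R:

```r
# Undirected network density = observed edges / possible edges,
# with possible = n * (n - 1) / 2. Numbers taken from the slide.
n <- 18
edges <- 29
possible <- n * (n - 1) / 2   # 153
density <- edges / possible
round(density, 2)             # 0.19
```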

kmeans Cluster (k=3)

Analysis

• Mix
• Match
• Optimise

Tutorials

• Starter: sna-simple.Rmd
• Real: sna-blog.Rmd
• Advanced: sna-forum.Rmd

Latent Semantic Analysis
Fridolin Wild, The Open University, UK

Latent Semantic Analysis

• “Humans learn word meanings and how to combine them into passage meaning through experience with ~paragraph unitized verbal environments.”

• “They don’t remember all the separate words of a passage; they remember its overall gist or meaning.”

• “LSA learns by ‘reading’ ~paragraph unitized texts that represent the environment.”

• “It doesn’t remember all the separate words of a text; it remembers its overall gist or meaning.”

(Landauer, 2007)

Word choice is over-rated

• An educated adult understands ~100,000 word forms
• An average sentence contains 20 tokens
• Thus there are 100,000^20 possible combinations of words in a sentence
• = a maximum of log2(100,000^20) ≈ 332 bits in word choice alone
• 20! ≈ 2.4 × 10^18 possible orders of 20 words
• = a maximum of log2(20!) ≈ 61 bits from the order of the words
• 332/(61 + 332) ≈ 84% word choice

(Landauer, 2007)
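Landauer's back-of-the-envelope numbers check out directly in R:

```r
# Reproducing the slide's information-theoretic estimate.
vocab <- 1e5    # ~100,000 word forms
tokens <- 20    # average sentence length

choice_bits <- tokens * log2(vocab)      # log2(vocab^tokens), ~332 bits
order_bits  <- log2(factorial(tokens))   # log2(20!), ~61 bits

share <- choice_bits / (choice_bits + order_bits)  # ~0.84
round(share, 2)
```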

LSA (2)

• Assumption: texts have a semantic structure

• However, this structure is obscured by word usage (noise, synonymy, polysemy, …)

• Proposed LSA solution:
  – map the doc-term matrix
  – using conceptual indices
  – derived statistically (truncated SVD)
  – and make similarity comparisons using angles

Input (e.g., documents)

[Figure: the example doc-term matrix { M }]

Deerwester, Dumais, Furnas, Landauer, and Harshman (1990): Indexing by Latent Semantic Analysis, In: Journal of the American Society for Information Science, 41(6):391-407

Only the red terms appear in more than one document, so strip the rest.

term = feature

vocabulary = ordered set of features

TEXTMATRIX

Singular Value Decomposition

[Figure: M = T S D']

Truncated SVD
latent-semantic space

Reconstructed, Reduced Matrix

[Figure: reconstructed matrix of the example; m4: "Graph minors: A survey"]
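A minimal base-R sketch of the truncation and reconstruction step, on a made-up toy matrix rather than the example data from the figure:

```r
# Tiny term-by-document matrix (toy data).
M <- matrix(c(1, 0, 1, 0,
              1, 1, 0, 0,
              0, 1, 1, 1,
              0, 0, 1, 1), nrow = 4, byrow = TRUE)

s <- svd(M)   # M = U diag(d) V'

# Keep only the first k singular values: the reduced,
# reconstructed matrix M2 = T_k S_k D_k'.
k <- 2
M2 <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
```

With all singular values kept, the reconstruction equals M; truncating to k dimensions smooths the matrix, which is exactly what surfaces the latent structure.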

Similarity in a Latent-Semantic Space

[Figure: a query vector and two target vectors plotted over the X and Y dimensions; similarity to Target 1 and Target 2 is measured by Angle 1 and Angle 2]

doc2doc similarities

• Unreduced = pure vector space model
  – based on M = T S D'
  – Pearson correlation over document vectors
• Reduced
  – based on M2 = T S2 D'
  – Pearson correlation over document vectors
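A small base-R illustration of comparing two document vectors (toy vectors; the lsa package ships a cosine() function, but a hand-rolled one keeps the sketch self-contained):

```r
# Cosine of the angle between two vectors.
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

d1 <- c(2, 0, 1, 3)   # toy document vectors
d2 <- c(1, 1, 0, 2)

cosine(d1, d2)   # angle-based similarity
cor(d1, d2)      # Pearson correlation, as on the slide
```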

Ex Post Updating: Folding-In

• SVD factor stability
  – SVD calculates factors over a given text base
  – Different texts – different factors
  – Challenge: avoid unwanted factor changes (e.g., bad essays)
  – Solution: folding-in of essays instead of recalculating
• SVD is computationally expensive

Folding-In in Detail

(1) convert the original term vector v into "Dk" format:
    d = v^T · T_k · S_k^(-1)

(2) convert the "Dk"-format vector into "Mk" format:
    m = T_k · S_k · d^T

(Berry et al., 1995)
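The two folding-in formulas translate almost literally to base R; the matrix and the new document vector below are made-up toy data:

```r
# Existing space: SVD of a toy term-by-document matrix.
M <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 1, 1,
              0, 0, 1), nrow = 4, byrow = TRUE)
s <- svd(M)
k <- 2
Tk <- s$u[, 1:k]
Sk <- diag(s$d[1:k])

# A new document as a raw term-frequency vector over the same 4 terms.
v_new <- c(1, 0, 1, 0)

# (1) its coordinates in D_k format: d = v' T_k S_k^-1
d_hat <- t(v_new) %*% Tk %*% solve(Sk)

# (2) mapped back into M_k (term) format: m = T_k S_k d'
m_hat <- Tk %*% Sk %*% t(d_hat)
```

No SVD is recomputed, so the factors of the existing space stay fixed, which is the whole point of folding-in.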

LSA Process & Driving Parameters

4 x 12 x 7 x 2 x 3 = 2016 Combinations

Pre-Processing

• Stemming
  – Porter stemmer (snowball.tartarus.org)
  – 'move', 'moving', 'moves' => 'move'
  – in German even more important (more flections)
• Stop word elimination
  – 373 stop words in German
• Stemming plus stop word elimination
• Unprocessed ('raw') terms

Term Weighting Schemes

weight_ij = lw(tf_ij) · gw_i

• Local weights (LW):
  – None ('raw' tf)
  – Binary term frequency
  – Logarithmised term frequency (log)
• Global weights (GW):
  – None ('raw' tf)
  – Normalisation: norm_i = 1 / sqrt( Σ_j tf_ij^2 )
  – Inverse document frequency: idf_i = 1 + log2( numdocs / docfreq_i )
  – 1 + entropy: entplusone_i = 1 + Σ_j ( p_ij · log p_ij ) / log(numdocs),
    where p_ij = tf_ij / Σ_j tf_ij
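Two of these weightings sketched in base R (the lsa package ships them as lw_logtf() and gw_idf(); the toy matrix here is illustrative):

```r
# Local weight: logarithmised term frequency.
lw_logtf <- function(m) log(m + 1)

# Global weight: inverse document frequency per term (row).
gw_idf <- function(m) {
  df <- rowSums(m > 0)          # document frequency of each term
  1 + log2(ncol(m) / df)
}

# Toy term-by-document matrix.
tm <- matrix(c(2, 0, 1,
               1, 1, 1), nrow = 2, byrow = TRUE,
             dimnames = list(c("data", "graph"), paste0("d", 1:3)))

# weight_ij = lw(tf_ij) * gw_i (row-wise recycling of the global weight).
weighted <- lw_logtf(tm) * gw_idf(tm)
```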

SVD-Dimensionality

• Many different proposals (see package)
• 80% variance is a good estimator

Proximity Measures

• Pearson correlation
• Cosine correlation
• Spearman's rho

pics: http://davidmlane.com/hyperstat/A62891.html
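The three measures side by side in base R (toy vectors):

```r
# Cosine of the angle between two vectors.
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

cor(x, y)                       # Pearson correlation
cor(x, y, method = "spearman")  # Spearman's rho (rank-based)
cosine(x, y)
```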

Pair-wise dis/similarity

Convergence expected: ‘eu’, ‘österreich’ Divergence expected: ‘jahr’, ‘wien’

The Package

• Available via CRAN, e.g.:http://cran.r-project.org/web/packages/lsa/index.html

• Higher-level abstraction to ease use
  – Core methods:
    textmatrix() / query()
    lsa()
    fold_in()
    as.textmatrix()

– Support methods for term weighting, dimensionality calculation, correlation measurement, …

Core Workflow

• tm = textmatrix('dir/')
• tm = lw_logtf(tm) * gw_idf(tm)
• space = lsa(tm, dims=dimcalc_share())
• tm3 = fold_in(tm, space)
• as.textmatrix(space)

Pre-Processing Chain

Tutorials

• Starter: lsa-indexing.Rmd
• Real: lsa-essayscoring.Rmd
• Advanced: lsa-sparse.Rmd

Additional tutorials
Fridolin Wild, The Open University, UK

Tutorials

• Advanced I/O: twitter.Rmd
• Advanced I/O: sparql.Rmd
• Advanced NLP: twitter-sentiment.Rmd
• Evaluation: interrater-agreement.Rmd