Natural Language Processing in R (rNLP)
Transcript of Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
Fridolin Wild, The Open University, UK
Tutorial to the Doctoral School at the Institute of Business Informatics
of the Goethe University Frankfurt
Structure of this tutorial
• An introduction to R and cRunch
• Language basics in R
• Basic I/O in R
• Social Network Analysis
• Latent Semantic Analysis
• Twitter
• Sentiment
• (Advanced I/O in R: MySQL, SparQL)
Introduction
cRunch
• is an infrastructure
• for computationally-intense learning analytics
• supporting researchers
• in investigating big data
• generated in the co-construction of knowledge
… and beyond …
Architecture
(Thiele & Lehner, 2011)
Living Reports
data shop
cron jobs
R webservices
Reports
Living reports
• reports with embedded scripts and data
• knitr and Sweave
• render to html, PDF, …
• visualisations:
– ggplot2, trellis, graphics
– jpg, png, eps, pdf
png(file = "n.png")
plot(network(m))
dev.off()
• Fill-in-the-blanks: drop-out quote went down to
<<echo=FALSE>>=
doquote["OU", "2011"]
@
\documentclass[a4paper]{article}
\title{Sweave Example 1}
\author{Friedrich Leisch}
\begin{document}
\maketitle

In this example we embed parts of the examples from the
\texttt{kruskal.test} help page into a \LaTeX{} document:
<<>>=
data(airquality)
library(ctest)
kruskal.test(Ozone ~ Month, data = airquality)
@
which shows that the location parameter of the Ozone distribution
varies significantly from month to month. Finally we include a
boxplot of the data:
\begin{center}
<<fig=TRUE,echo=FALSE>>=
boxplot(Ozone ~ Month, data = airquality)
@
\end{center}
\end{document}
Example PDF report
Example html5 report

Example Report
=============

This is an example of embedded scripts and data.

```{r}
a = "hello world"
print(a)
```

And here is an example of how to embed a chart.

```{r fig.width=7, fig.height=6}
plot(5:20)
```
Shiny Widgets (1)
• Widgets: use-case sized encapsulations of mini apps
• HTML5
• Two files: ui.R, server.R
• Still missing: manifest files (info.plist, config.xml)
Shiny Widgets (2)
From http://www.rstudio.com/shiny/
Web Services
harmonization & data warehousing
Example R web service
print("hello world")
More complex R web service
setContentType("image/png")

a = c(1, 3, 5, 12, 13, 15)
image_file = tempfile()
png(file = image_file)
plot(a, main = "The magic image", ylab = "", xlab = "",
     col = c("darkred", "darkblue", "darkgreen"))
dev.off()

sendBin(readBin(image_file, 'raw', n = file.info(image_file)$size))
unlink(image_file)
R web services
• Uses the Apache mod_R.so
• See http://Rapache.net
• Common server functions:
– GET and POST variables
– setContentType
– sendBin
– …
A word on memory mgmt.
• Advanced memory management (see p. 70 of Dietl's diploma thesis):
– Use package bigmemory (for shared memory across threads)
– Use package Rserve (for shared read-only access across threads)
– Swap out memory objects with save() and load()
– The latter is typically sufficient (hard disks are fast!)
• Data management abstraction layer for mod_R.so: configure a handler in httpd.conf: specify a directory match and load specific data management routines at start-up:
REvalOnStartup "source('/dbal.R');"
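The save()/load() swap-out approach mentioned above can be sketched in a few lines of base R; the object and file names here are illustrative only:

```r
# Swap a large object out to disk, free it from the session, reload on demand
x = runif(1000)
swap_file = tempfile(fileext = ".rda")
save(x, file = swap_file)   # swap out to disk
rm(x)                       # free memory in the R session
load(swap_file)             # swap back in when needed
unlink(swap_file)           # clean up
```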
Harvesting
data acquisition
Job scheduling
• crontab entries for R webservices
• e.g. harvest feeds
• e.g. store in local DB
data shop
sharing
Data shop and the community
• You have a 'public/' folder :)
– 'public/data': save() any .rda file and it will be indexed within the hour
– 'public/services': use this to execute your scripts; indexed within the hour
– 'public/gallery': use this to store your public visualisations
– code sharing: any .R script in your 'public/' folder is source-readable by the web
Not covered
The useful pointer
More NLP packages
install.packages("NaturalLanguageProcessing")
library("NaturalLanguageProcessing")
studio
exploratory programming
studio
Social Network Analysis
Fridolin Wild, The Open University, UK
The Idea
The basic concept
• Precursors date back to 1920s, math to Euler’s ‘Seven Bridges of Koenigsberg’
• Social Networks are:
• Actors (people, groups, media, tags, …)
• Ties (interactions, relationships, …)
• Actors and ties form a graph
• Graph has measurable structural properties
• Betweenness
• Degree of Centrality
• Density
• Cohesion
• Structural Patterns
Forum Messages

  #   message_id  forum_id  parent_id  author
130      2853483   2853445         \N    2043
131      1440740    785876         \N    1669
132      2515257   2515256         \N    5814
133      4704949   4699874         \N    5810
134      2597170   2558273         \N    2054
135      2316951   2230821         \N    5095
136      3407573   3407568         \N      36
137      2277393   2277387         \N     359
138      3394136   3382201         \N    1050
139      4603931   4167338         \N     453
140      6234819   6189254    6231352    5400
141       806699    785877     804668    2177
142      4430290   3371246    3380313      48
143      3395686   3391024    3391129      35
144      6270213   6024351    6265378    5780
145      2496015   2491522    2491536    2774
146      4707562   4699873    4707502    5810
147      2574199   2440094    2443801    5801
148      4501993   4424215    4491650    5232
  #   message_id  forum_id  parent_id  author
  60      734569     31117         \N    2491
 221      762702     31117                  1
 317      762717     31117     762702    1927
1528      819660     31117     793408    1197
1950      840406     31117     839998    1348
1047      841810     31117     767386    1879
2239      862709     31117         \N    1982
2420      869839     31117     862709    2038
2694      884824     31117         \N    5439
2503      896399     31117     862709    1982
2846      901691     31117     895022     992
3321      951376     31117         \N    5174
3384      952895     31117     951376    1597
1186      955595     31117     767386    5724
3604      958065     31117         \N     716
2551      960734     31117     862709    1939
4072      975816     31117         \N     584
2574      986038     31117     862709    2043
2590      987842     31117     862709    1982
Incidence Matrix
• msg_id = incident, authors appear in incidents
Derive Adjacency Matrix
= t(im) %*% im
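This derivation can be sketched in base R; the message and author names below are hypothetical, standing in for the forum data:

```r
# Incidence matrix: rows = messages (incidents), columns = authors
im = matrix(c(1, 1, 0,
              0, 1, 1,
              1, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("msg", 1:3), c("anna", "ben", "cara")))

# Adjacency matrix: author-by-author, counting shared messages
am = t(im) %*% im
```

The diagonal of am counts each author's messages; the off-diagonal entries count how often two authors appear in the same message thread.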
Visualization: Sociogramme
Degree
Betweenness
Network Density
• Total edges = 29
• Possible edges = 18 × (18 - 1)/2 = 153
• Density = 29/153 ≈ 0.19
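The density figure above can be reproduced directly:

```r
n = 18        # number of actors
edges = 29    # observed ties
possible = n * (n - 1) / 2            # possible undirected edges = 153
density = edges / possible            # ~ 0.19
round(density, 2)
```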
kmeans Cluster (k=3)
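A k-means clustering with k = 3, as on the slide, is one call in base R; the point coordinates here are random, standing in for actor positions:

```r
set.seed(1)
pts = matrix(rnorm(40), ncol = 2)   # 20 actors in a 2-D layout
km = kmeans(pts, centers = 3)       # partition into 3 clusters
km$cluster                          # cluster assignment per actor
```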
Analysis
• Mix
• Match
• Optimise
Tutorials
• Starter: sna-simple.Rmd
• Real: sna-blog.Rmd
• Advanced: sna-forum.Rmd
Latent Semantic Analysis
Fridolin Wild, The Open University, UK
Latent Semantic Analysis
• “Humans learn word meanings and how to combine them into passage meaning through experience with ~paragraph unitized verbal environments.”
• “They don’t remember all the separate words of a passage; they remember its overall gist or meaning.”
• “LSA learns by ‘reading’ ~paragraph unitized texts that represent the environment.”
• “It doesn’t remember all the separate words of a text; it remembers its overall gist or meaning.”
(Landauer, 2007)
Word choice is over-rated
• An educated adult understands ~100,000 word forms
• An average sentence contains 20 tokens.
• Thus 100,000^20 possible combinations of words in a sentence
• = maximum of log2(100,000^20) ≈ 332 bits in word choice alone.
• 20! ≈ 2.4 × 10^18 possible orders of 20 words
• = maximum of ~61 bits from the order of the words.
• 332/(61 + 332) ≈ 84% word choice
(Landauer, 2007)
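Landauer's back-of-the-envelope figures check out in R:

```r
vocab = 100000
len = 20
bits_choice = len * log2(vocab)           # ~332 bits from word choice
bits_order = log2(factorial(len))         # 20! ~ 2.4e18, ~61 bits from order
share = bits_choice / (bits_choice + bits_order)   # ~0.84
```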
LSA (2)
• Assumption: texts have a semantic structure
• However, this structure is obscured by word usage (noise, synonymy, polysemy, …)
• Proposed LSA Solution:
– map doc-term matrix
– using conceptual indices
– derived statistically (truncated SVD)
– and make similarity comparisons using angles
Input (e.g., documents)
{ M } =
Deerwester, Dumais, Furnas, Landauer, and Harshman (1990): Indexing by Latent Semantic Analysis, In: Journal of the American Society for Information Science, 41(6):391-407
Only the red terms appear in more than one document, so strip the rest.
term = feature
vocabulary = ordered set of features
TEXTMATRIX
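A term-by-document matrix can be sketched in plain base R (two of the Deerwester et al. document titles, lower-cased; the lsa package's textmatrix() does this with many more options):

```r
docs = list(
  c1 = "human machine interface for abc computer applications",
  c2 = "a survey of user opinion of computer system response time"
)

# tokenise, build the ordered vocabulary, then count terms per document
tokens = lapply(docs, function(d) strsplit(d, " ")[[1]])
vocab = sort(unique(unlist(tokens)))   # vocabulary = ordered set of features
tm = sapply(tokens, function(tk) table(factor(tk, levels = vocab)))
```

Rows of tm are terms (features), columns are documents; each cell holds a raw term frequency.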
Singular Value Decomposition
M = TSD'
Truncated SVD
latent-semantic space
Reconstructed, Reduced Matrix
m4: Graph minors: A survey
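The decomposition and rank-k reconstruction can be sketched with base R's svd(); the matrix below is random, standing in for the doc-term matrix M:

```r
set.seed(42)
M = matrix(sample(0:2, 30, replace = TRUE), nrow = 6)  # 6 terms x 5 docs
s = svd(M)                                             # M = T S D'
k = 2                                                  # keep 2 dimensions
Mk = s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])   # reconstructed, reduced
```

Mk has the same shape as M but only rank k; it is the least-squares best rank-k approximation of M.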
Similarity in a Latent-Semantic Space
[Figure: a query vector and two target vectors in a 2-D latent-semantic space (X and Y dimensions); similarity is measured by the angle between the query and each target (Angle 1 to Target 1, Angle 2 to Target 2)]
doc2doc - similarities
Unreduced = pure vector space model
– based on M = TSD'
– Pearson correlation over document vectors

Reduced
– based on M_2 = T_2 S_2 D_2'
– Pearson correlation over document vectors
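Pearson correlation over document (column) vectors is a single call in base R; the tiny matrix here is hypothetical:

```r
# 3 terms x 3 documents, hypothetical counts
M = matrix(c(1, 0, 2,
             0, 1, 2,
             1, 1, 0), nrow = 3, byrow = TRUE)

sims = cor(M)   # Pearson correlation between every pair of document columns
```

The same call works on the unreduced matrix or on the reconstructed, reduced matrix.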
Ex Post Updating: Folding-In
• SVD factor stability
– SVD calculates factors over a given text base
– Different texts – different factors
– Challenge: avoid unwanted factor changes (e.g., bad essays)
– Solution: folding-in of essays instead of recalculating
• SVD is computationally expensive
Folding-In in Detail
Given the truncated SVD M_k = T_k S_k D_k^T (Berry et al., 1995):

(1) convert the original document vector v to "D_k" format:
    d̂ = v^T T_k S_k^(-1)

(2) convert the "D_k"-format vector to "M_k" format:
    m̂ = T_k S_k d̂^T
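The two steps can be sketched in base R. Folding an existing document of M back in reproduces its row of D_k exactly, which makes a handy sanity check (the matrix here is a small hypothetical example):

```r
M = matrix(c(1, 0, 0,
             1, 1, 0,
             0, 1, 1,
             0, 0, 1), nrow = 4, byrow = TRUE)  # 4 terms x 3 docs
s = svd(M)
k = 2
Tk = s$u[, 1:k]; Sk = diag(s$d[1:k]); Dk = s$v[, 1:k]

v = M[, 1]                          # "new" document vector (here: doc 1)
d_hat = t(v) %*% Tk %*% solve(Sk)   # (1) convert to "D_k" format
m_hat = Tk %*% Sk %*% t(d_hat)      # (2) convert back to "M_k" format
```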
LSA Process & Driving Parameters
4 x 12 x 7 x 2 x 3 = 2016 Combinations
Pre-Processing
• Stemming
– Porter Stemmer (snowball.tartarus.org)
– 'move', 'moving', 'moves' => 'move'
– in German even more important (more inflections)
• Stop Word Elimination
– 373 Stop Words in German
• Stemming plus Stop Word Elimination
• Unprocessed ('raw') Terms
Term Weighting Schemes
• Global Weights (GW)
– None ('raw' tf)
– Normalisation:
    norm_i = 1 / sqrt( Σ_j tf_ij² )
– Inverse Document Frequency (IDF):
    idf_i = 1 + log2( numdocs / docfreq_i )
– 1 + Entropy:
    entplusone_i = 1 + Σ_j ( p_ij · log(p_ij) ) / log(numdocs),
    where p_ij = tf_ij / Σ_j tf_ij

weight_ij = lw(tf_ij) · gw_i
• Local Weights (LW)
– None ('raw' tf)
– Binary Term Frequency
– Logarithmised Term Frequency (log)
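The weighting scheme can be sketched in base R; these helpers mimic the lsa package's lw_logtf()/gw_idf() and are written out here only for transparency:

```r
tm = matrix(c(2, 0, 1,
              0, 3, 1,
              1, 1, 0), nrow = 3, byrow = TRUE)  # 3 terms x 3 docs

lw_logtf = function(tf) log(tf + 1)   # logarithmised local weight

gw_idf = function(tf) {               # idf_i = 1 + log2(numdocs / docfreq_i)
  numdocs = ncol(tf)
  docfreq = rowSums(tf > 0)
  1 + log2(numdocs / docfreq)
}

# weight_ij = lw(tf_ij) * gw_i  (the global weight vector recycles over rows)
w = lw_logtf(tm) * gw_idf(tm)
```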
SVD-Dimensionality
• Many different proposals (see package)
• 80% variance is a good estimator
Proximity Measures
• Pearson Correlation
• Cosine Correlation
• Spearman's Rho
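All three measures are one-liners in base R (the two vectors are hypothetical document vectors):

```r
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 5)

pearson = cor(x, y)                                # Pearson correlation
cosine  = sum(x * y) / sqrt(sum(x^2) * sum(y^2))   # cosine similarity
rho     = cor(x, y, method = "spearman")           # Spearman's rho
```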
pics: http://davidmlane.com/hyperstat/A62891.html
Pair-wise dis/similarity
Convergence expected: 'eu', 'österreich'
Divergence expected: 'jahr', 'wien'
The Package
• Available via CRAN, e.g.:
http://cran.r-project.org/web/packages/lsa/index.html
• Higher-level Abstraction to Ease Use
– Core methods:
  textmatrix() / query()
  lsa()
  fold_in()
  as.textmatrix()
– Support methods for term weighting, dimensionality calculation, correlation measurement, …
Core Workflow
• tm = textmatrix('dir/')
• tm = lw_logtf(tm) * gw_idf(tm)
• space = lsa(tm, dims=dimcalc_share())
• tm3 = fold_in(tm, space)
• as.textmatrix(space)
Pre-Processing Chain
Tutorials
• Starter: lsa-indexing.Rmd• Real: lsa-essayscoring.Rmd• Advanced: lsa-sparse.Rmd
Additional tutorials
Fridolin Wild, The Open University, UK
Tutorials
• Advanced I/O: twitter.Rmd
• Advanced I/O: sparql.Rmd
• Advanced NLP: twitter-sentiment.Rmd
• Evaluation: interrater-agreement.Rmd