Text Mining Infrastructure in R

Text Mining Infrastructure in R. Presented by Ashraf Uddin (http://ashrafsau.blogspot.in/), South Asian University, New Delhi, India. 29 January 2014

description

This presentation includes basic text processing techniques using R packages

Transcript of Text Mining Infrastructure in R

Page 1: Text Mining Infrastructure in R

Text Mining Infrastructure in R

Presented by Ashraf Uddin

(http://ashrafsau.blogspot.in/) South Asian University, New Delhi, India.

29 January 2014

Page 2: Text Mining Infrastructure in R

What is R?

A free software environment for statistical computing and graphics.

Open source and package based; an implementation of the S language, which was developed at Bell Labs

Many statistical functions are already built in

Contributed packages expand the functionality to cutting edge research

Implementation languages: C, Fortran

Page 3: Text Mining Infrastructure in R

What is R?

R is the result of a collaborative effort with contributions from all over the world

R was initially written by Robert Gentleman and Ross Ihaka (also known as "R & R") of the Statistics Department of the University of Auckland

R was inspired by the S environment

R can be extended (easily) via packages.

More about R

Page 4: Text Mining Infrastructure in R

What R does and does not do

• R is not a database, but it connects to DBMSs
• The language interpreter can be very slow, but R allows calling your own C/C++ code
• No professional / commercial support

Page 5: Text Mining Infrastructure in R

Data Types in R

• numeric (integer, double, complex)
• character
• logical
• data frame
• factor
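The types listed above can be inspected directly at the R prompt. The following is a small sketch using only base R; the variable names are made up for illustration.

```r
# One value of each basic type listed above
x_int  <- 42L            # integer
x_dbl  <- 3.14           # double (reported as "numeric" by class())
x_cplx <- 1 + 2i         # complex
x_chr  <- "text mining"  # character
x_lgl  <- TRUE           # logical

# A factor stores categorical labels as integer codes plus a set of levels
topic <- factor(c("acq", "crude", "acq"))

# A data frame combines equal-length columns of different types
docs <- data.frame(id = 1:3, topic = topic)

types <- c(class(x_int), class(x_dbl), class(x_cplx), class(x_chr), class(x_lgl))
print(types)
print(levels(topic))
print(sapply(docs, class))
```

For text mining, character vectors hold the raw text, while factors are a natural fit for class labels such as a document's topic.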

Page 6: Text Mining Infrastructure in R

Contributed Packages

Currently (as of January 2014), the CRAN package repository features 5034 available packages

Page 7: Text Mining Infrastructure in R

Growing users of R

Page 8: Text Mining Infrastructure in R

Text Mining: Basics

Text is an unstructured collection of words

Documents are basic units consisting of a sequence of tokens or terms

Terms are words or roots of words, semantic units or phrases which are the atoms of indexing

Repositories (databases) and corpora are collections of documents.

A corpus is a conceptual entity, similar to a database, for holding and managing text documents

Text mining applies computations to these documents to extract interesting information
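To make the terms above concrete, here is a minimal sketch in base R that treats a named character vector as a tiny corpus and splits each document into lower-case tokens. The two sample documents and the helper `tokenize` are invented for illustration; they are not part of the tm package.

```r
# A toy "corpus": one character string per document
corpus <- c(doc1 = "Text mining finds patterns in text.",
            doc2 = "A corpus holds many documents.")

# Split a document into lower-case word tokens (letters only)
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]  # drop empty strings left by leading separators
}

tokens <- lapply(corpus, tokenize)

# The distinct terms across all documents form the vocabulary
terms <- sort(unique(unlist(tokens)))

print(tokens)
print(terms)
```

Real tokenizers also handle digits, hyphens and non-ASCII letters; this sketch ignores them on purpose.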

Page 9: Text Mining Infrastructure in R

Text Mining: Practical Applications

• Spam filtering
• Business intelligence, marketing applications: predictive analytics
• Sentiment analysis
• Text IR, indexing
• Creating suggestions and recommendations (like Amazon)
• Monitoring public opinions (for example in blogs or review sites)
• Customer service, email support
• Automatic labeling of documents in business libraries
• Fraud detection by investigating notification of claims
• Fighting cyberbullying or cybercrime in IM and IRC chat
• And many more

Page 10: Text Mining Infrastructure in R

A List of Text Mining Tools

Page 11: Text Mining Infrastructure in R

Text Mining Packages in R

corpora gsubfn kernlab KoNLP

koRpus lda lsa maxent

movMF openNLP qdap RcmdrPlugin.temis

RKEA RTextTools RWeka skmeans

Snowball SnowballC tau textcat

textir tm tm.plugin.dc tm.plugin.factiva

tm.plugin.mail topicmodels wordcloud

wordnet zipfR

Page 12: Text Mining Infrastructure in R

Text Mining Packages in R

• plyr: Tools for splitting, applying and combining data
• class: Various functions for classification
• tm: A framework for text mining applications
• corpora: Statistics and data sets for corpus frequency data
• Snowball: Stemmers
• RWeka: Interface to Weka, a collection of ML algorithms for data mining tasks
• wordnet: Interface to WordNet using the Jawbone Java API to WordNet
• wordcloud: Draws word clouds
• textir: A suite of tools for text and sentiment mining
• tau: Text Analysis Utilities
• topicmodels: An interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM)
• zipfR: Statistical models for word frequency distributions

Page 13: Text Mining Infrastructure in R

Conceptual process in Text Mining

• Organize and structure the texts (into a repository)
• Find a convenient representation (preprocessing)
• Transform the texts into structured formats (e.g. a term-document matrix, TDM)
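The last step, turning texts into a term-document matrix (TDM), can be sketched in a few lines of base R. This is only an illustration of the data structure under simplifying assumptions (whitespace tokenization, three invented documents); tm's own TermDocumentMatrix uses a sparse representation instead.

```r
# Three toy documents (invented for illustration)
docs <- c(d1 = "oil prices rise",
          d2 = "oil output falls",
          d3 = "prices fall as output rises")

# Tokenize on whitespace and keep the document names
tokens <- strsplit(tolower(docs), " +")
names(tokens) <- names(docs)

# Vocabulary: every distinct term in the collection
terms <- sort(unique(unlist(tokens)))

# Count how often each term occurs in one document
count_terms <- function(tok) vapply(terms, function(t) sum(tok == t), numeric(1))

# Rows are terms, columns are documents
tdm <- sapply(tokens, count_terms)
print(tdm)
```

Each column is the bag-of-words vector of one document, which is the representation the classifier later in the slides works on.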

Page 14: Text Mining Infrastructure in R

The framework

Documents come in different file formats and from different locations, so the framework needs standardized interfaces to access them (sources)

Metadata gives valuable insights into the document structure, so the framework must make metadata easy to use

To work efficiently with the documents, the framework must provide tools and algorithms to perform common tasks (transformations) and to extract patterns of interest (filtering)

Page 15: Text Mining Infrastructure in R

Text document collections: Corpus

Constructor:

Corpus(object = ...,
       readerControl = list(reader = object@DefaultReader,
                            language = "en_US",
                            load = FALSE))

Example:

> txt <- system.file("texts", "txt", package = "tm")
> (ovid <- Corpus(DirSource(txt),
                  readerControl = list(reader = readPlain, language = "la", load = TRUE)))
A corpus with 5 text documents

Page 16: Text Mining Infrastructure in R

Corpus: Meta Data

> meta(ovid[[1]])
Available meta data pairs are:
  Author       :
  DateTimeStamp: 2013-11-19 18:54:04
  Description  :
  Heading      :
  ID           : ovid_1.txt
  Language     : la
  Origin       :

> ID(ovid[[1]])
[1] "ovid_1.txt"

Page 17: Text Mining Infrastructure in R

Corpus: Document's text

> ovid[[1]]
Si quis in hoc artem populo non novit amandi,
hoc legat et lecto carmine doctus amet.
arte citae veloque rates remoque moventur,
arte leves currus: arte regendus amor.
curribus Automedon lentisque erat aptus habenis,
Tiphys in Haemonia puppe magister erat:
me Venus artificem tenero praefecit Amori;
Tiphys et Automedon dicar Amoris ego.
ille quidem ferus est et qui mihi saepe repugnet:
sed puer est, aetas mollis et apta regi.
Phillyrides puerum cithara perfecit Achillem,
atque animos placida contudit arte feros.
qui totiens socios, totiens exterruit hostes,
creditur annosum pertimuisse senem.

Page 18: Text Mining Infrastructure in R

Corpus: Meta Data

> c(ovid[1:2], ovid[3:4])
A corpus with 4 text documents

> length(ovid)
[1] 5

> summary(ovid)
A corpus with 5 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

Page 19: Text Mining Infrastructure in R

Corpus: Meta Data

> CMetaData(ovid)
$create_date
[1] "2013-11-19 18:54:04 GMT"
$creator
[1] ""

> DMetaData(ovid)
  MetaID
1      0
2      0
3      0
4      0
5      0

Page 20: Text Mining Infrastructure in R

Corpus: Transformations and Filters

> getTransformations()
[1] "as.PlainTextDocument" "removeNumbers" "removePunctuation" "removeWords"
[5] "stemDocument" "stripWhitespace"

> tm_map(ovid, FUN = tolower)
A corpus with 5 text documents

> getFilters()
[1] "searchFullText" "sFilter" "tm_intersect"

> tm_filter(ovid, FUN = searchFullText, "Venus", doclevel = TRUE)
A corpus with 1 text document
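The two ideas on this slide, a transformation applied to every document and a filter that keeps only matching documents, can be mimicked in base R. The sketch below is an analogy, not tm's implementation: a plain list stands in for the corpus and `has_venus` is a hypothetical predicate. As a side note, newer tm releases (0.6 and later) are understood to expect base functions such as tolower to be wrapped in content_transformer() before they are passed to tm_map.

```r
# A plain list standing in for a corpus (three lines taken from the Ovid sample)
docs <- list(a = "me Venus artificem tenero praefecit Amori",
             b = "arte leves currus: arte regendus amor",
             c = "sed puer est, aetas mollis et apta regi")

# Transformation: apply the same function to every document
lower <- lapply(docs, tolower)

# Filter: keep only the documents for which the predicate returns TRUE
has_venus <- function(text) grepl("Venus", text, fixed = TRUE)
hits <- Filter(has_venus, docs)

print(lower$a)
print(names(hits))
```

As in the tm_filter call above, only one of the sample documents mentions "Venus".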

Page 21: Text Mining Infrastructure in R

Text Preprocessing: import

> txt <- system.file("texts", "acq", package = "tm")
> (acq <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "en", load = TRUE)))
A corpus with 50 text documents

> txt <- system.file("texts", "crude", package = "tm")
> (crude <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "en", load = TRUE)))
A corpus with 20 text documents

resulting in 50 articles of topic acq and 20 articles of topic crude

Page 22: Text Mining Infrastructure in R

Preprocessing: stemming

Morphological variants of a word (morphemes). Similar terms derived from a common stem:

• engineer, engineered, engineering
• use, user, users, used, using

Stemming in Information Retrieval. Grouping words with a common stem together.

For example, a search on reads, also finds read, reading, and readable

Stemming consists of removing suffixes and conflating the resulting morphemes. Occasionally, prefixes are also removed.

Page 23: Text Mining Infrastructure in R

Preprocessing: stemming

Reduce terms to their "roots": automate(s), automatic, automation are all reduced to automat.

Original: for example compressed and compression are both accepted as equivalent to compress.

Stemmed: for exampl compress and compress ar both accept as equival to compress

Page 24: Text Mining Infrastructure in R

Preprocessing: stemming

Typical rules in Stemming:

• sses → ss
• ies → i
• ational → ate
• tional → tion

Rules sensitive to the measure (weight) of the word: (m>1) EMENT → (removed)
• replacement → replac
• cement → cement
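The suffix rules sses → ss, ies → i, ational → ate and tional → tion can be tried out with a short base R sketch. It is deliberately simplified: it applies only these four rules, takes the first rule that matches, and ignores the measure condition (m>1), so it is not a full Porter stemmer. The function name `apply_rules` is hypothetical.

```r
# Suffix rewrite rules; "ational" is listed before "tional" so the longer suffix wins
rules <- c("sses$" = "ss", "ies$" = "i", "ational$" = "ate", "tional$" = "tion")

# Apply the first matching rule to a single word; return the word unchanged otherwise
apply_rules <- function(word) {
  for (pattern in names(rules)) {
    if (grepl(pattern, word)) {
      return(sub(pattern, rules[[pattern]], word))
    }
  }
  word
}

words <- c("caresses", "ponies", "relational", "conditional", "cat")
stems <- vapply(words, apply_rules, character(1), USE.NAMES = FALSE)
print(stems)
# "caress" "poni" "relate" "condition" "cat"
```

For real work, stemDocument from tm (backed by the Snowball stemmers) covers the complete rule set.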

Page 25: Text Mining Infrastructure in R

Preprocessing: stemming

Stemming can help recall for some queries but harm precision on others. Fine distinctions may be lost through stemming.

Page 26: Text Mining Infrastructure in R

Preprocessing: stemming

> acq[[10]]
Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter

> stemDocument(acq[[10]])
Gulf Appli Technolog Inc said it sold it subsidiari engag in pipelin and terminal oper for 12.2 mln dlrs. The compani said the sale is subject to certain post clos adjustments, which it did not explain. Reuter

> tm_map(acq, stemDocument)
A corpus with 50 text documents

Page 27: Text Mining Infrastructure in R

Preprocessing: Whitespace elimination & lower case conversion

> stripWhitespace(acq[[10]])
Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter

> tolower(acq[[10]])
gulf applied technologies inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. the company said the sale is subject to certain post closing adjustments, which it did not explain. reuter
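Because the sample document already has clean spacing, the effect of whitespace elimination is hard to see in the output above. The base R sketch below uses an invented string with extra spaces and a line break; it approximates what stripWhitespace and tolower do, and is not tm's own code.

```r
# Invented input with repeated spaces and a line break
text <- "Gulf  Applied   Technologies Inc\n said it sold its subsidiaries"

# Collapse every run of whitespace into a single space
collapsed <- gsub("[[:space:]]+", " ", text)

# Convert to lower case
lowered <- tolower(collapsed)

print(collapsed)
print(lowered)
```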

Page 28: Text Mining Infrastructure in R

Preprocessing: Stopword removal

Very common words, such as of, and, the, are rarely of use in information retrieval. A long stop list saves space in indexes, speeds processing, and eliminates many false hits. However, common words are sometimes significant in information retrieval, which is an argument for a short stop list.

(Consider the query, "To be or not to be?")

Page 29: Text Mining Infrastructure in R

Preprocessing: Stopword removal

Include the most common words in the English language (perhaps 50 to 250 words).

Do not include words that might be important for retrieval (Among the 200 most frequently occurring words in general literature in English are time, war, home, life, water, and world).

In addition, include words that are very common in context (e.g., computer, information, system in a set of computing documents).

Page 30: Text Mining Infrastructure in R

Preprocessing: Stopword removal

about above according across actually adj after afterwards again against all almost alone along already also although always among amongst an another any anyhow anyone anything anywhere are aren't around at be became because become becomes becoming been before beforehand begin beginning behind being below beside besides between beyond billion both but by can can't cannot caption co could couldn't did didn't do does doesn't don't down during each eg eight eighty either else elsewhere end ending enough etc even ever every everyone everything

Page 31: Text Mining Infrastructure in R

Preprocessing: Stopword removal

How many words should be in the stop list?
• Long list lowers recall

Which words should be in list?
• Some common words may have retrieval importance:
-- war, home, life, water, world
• In certain domains, some words are very common:
-- computer, program, source, machine, language

Page 32: Text Mining Infrastructure in R

Preprocessing: Stopword removal

> mystopwords <- c("and", "for", "in", "is", "it", "not", "the", "to")

> removeWords(acq[[10]], mystopwords)
Gulf Applied Technologies Inc said sold its subsidiaries engaged pipeline terminal operations 12.2 mln dlrs. The company said sale subject certain post closing adjustments, which did explain. Reuter

> tm_map(acq, removeWords, mystopwords)
A corpus with 50 text documents
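What stop word removal does to a single document can be sketched in base R with a token filter. This is an approximation under simple assumptions (space-separated tokens, exact matching), not tm's removeWords; the sample sentence is adapted from the document above. Note that matching is case sensitive, which is consistent with "The" surviving in the output above.

```r
mystopwords <- c("and", "for", "in", "is", "it", "not", "the", "to")

text <- "The company said the sale is subject to certain adjustments"

# Split into tokens, drop those in the stop list, and glue the rest back together
tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
kept <- tokens[!tokens %in% mystopwords]
result <- paste(kept, collapse = " ")

print(result)
# "The company said sale subject certain adjustments"
```

Lower-casing the text before filtering would also remove the capitalized "The".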

Page 33: Text Mining Infrastructure in R

Preprocessing: Synonyms

> library("wordnet")
> synonyms("company")
[1] "caller" "companionship" "company" "fellowship"
[5] "party" "ship's company" "society" "troupe"

> replaceWords(acq[[10]], synonyms(dict, "company"), by = "company")

> tm_map(acq, replaceWords, synonyms(dict, "company"), by = "company")

Page 34: Text Mining Infrastructure in R

Preprocessing: Part of speech tagging

library("NLP")
library("openNLP")
s <- as.String(acq[[10]])
## Need sentence and word token annotations.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))
## Add part-of-speech tags on top of the token annotations.
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
a3 <- annotate(s, pos_tag_annotator, a2)
## Keep the word annotations and pair each word with its tag.
a3w <- subset(a3, type == "word")
tags <- sapply(a3w$features, "[[", "POS")
sprintf("%s/%s", s[a3w], tags)

Page 35: Text Mining Infrastructure in R

Preprocessing: Part of speech tagging

"Gulf/NNP" "Applied/NNP" "Technologies/NNP" "Inc/NNP" "said/VBD" "it/PRP" "sold/VBD" "its/PRP$"
"subsidiaries/NNS" "engaged/VBN" "in/IN" "pipeline/NN" "and/CC" "terminal/NN" "operations/NNS" "for/IN"
"12.2/CD" "mln/NN" "dlrs/NNS" "./." "The/DT" "company/NN" "said/VBD" "the/DT"
"sale/NN" "is/VBZ" "subject/JJ" "to/TO" "certain/JJ" "post/NN" "closing/NN" "adjustments/NNS"
",/," "which/WDT" "it/PRP" "did/VBD" "not/RB" "explain/VB" "./." "Reuter/NNP"
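Once the text is in "word/TAG" form, ordinary string functions are enough to work with the tags. The base R sketch below takes a few of the tagged tokens shown above, separates words from tags, and counts the tags; it does not call openNLP itself.

```r
# A few tagged tokens copied from the output above
tagged <- c("Gulf/NNP", "Applied/NNP", "said/VBD", "it/PRP",
            "sold/VBD", "its/PRP$", "subsidiaries/NNS")

# Split each token at its last slash
words <- sub("/[^/]*$", "", tagged)
tags  <- sub("^.*/", "", tagged)

# Frequency of each tag, and the words tagged as proper nouns
tag_counts <- table(tags)
proper_nouns <- words[tags == "NNP"]

print(tag_counts)
print(proper_nouns)
```

Filtering by tag in this way is a common preprocessing step, for example to keep only nouns before building a term-document matrix.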

Page 36: Text Mining Infrastructure in R

Preprocessing

R Demo

Page 37: Text Mining Infrastructure in R

Classification using KNN

K-Nearest Neighbor algorithm:
• Most basic instance-based method
• Data are represented in a vector space
• Supervised learning

The target function is f: R^n → V, where V is the finite set {v1, ..., vn}

For a query instance xq, k-NN returns the most common value among the k training examples nearest to xq.

Page 38: Text Mining Infrastructure in R

KNN Feature space

Page 39: Text Mining Infrastructure in R

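KNN

Training algorithm:
• For each training example <x, f(x)>, add the example to the list

Classification algorithm:
• Given a query instance xq to be classified
• Let x1, ..., xk be the k instances which are nearest to xq
• Return

$$\hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$$

where \delta(a, b) = 1 if a = b, else \delta(a, b) = 0 (Kronecker function)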
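The majority vote among the k nearest training examples can be written out in a few lines of base R. This is a minimal sketch assuming Euclidean distance and numeric features; the function `knn_predict` and the toy data are invented for illustration, and ties are broken by whichever label comes first alphabetically.

```r
# Classify one query point by majority vote among its k nearest training rows
knn_predict <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Indices of the k closest training examples
  nearest <- order(dists)[seq_len(k)]
  # Count the labels among the neighbours and return the most common one
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy training set: three "red" points near (1, 1), three "blue" points near (6, 6)
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    6, 7,
                    7, 6), ncol = 2, byrow = TRUE)
train_y <- c("red", "red", "red", "blue", "blue", "blue")

pred_a <- knn_predict(train_x, train_y, c(1.5, 1.5), k = 3)
pred_b <- knn_predict(train_x, train_y, c(6.5, 6.5), k = 3)
print(c(pred_a, pred_b))
# "red" "blue"
```

The knn function from the class package, used later in the slides, implements the same idea with random tie-breaking.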

Page 40: Text Mining Infrastructure in R

Classification using KNN : Example

Two classes: Red and Blue

Green is Unknown

With k=3, classification is Red

With k=4, classification is Blue

Page 41: Text Mining Infrastructure in R

How to determine a good value for k?

• Determined experimentally
• Start with k=1 and use a test set to validate the error rate of the classifier
• Repeat with k=k+2
• Choose the value of k for which the error rate is minimum

Note: k should be an odd number to avoid ties
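The search over odd values of k can be sketched with the knn function from the class package. The example below is an assumption-laden illustration: it uses R's built-in iris data instead of text, a single 70/30 hold-out split, and an arbitrary upper limit of k = 15. The error rates depend on the random split, so none are quoted here.

```r
library(class)

set.seed(1)  # fixes the random split and knn's random tie-breaking

# 70/30 hold-out split of the built-in iris data (150 rows)
train_idx <- sample(nrow(iris), 105)
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

# Try k = 1, 3, 5, ..., 15 and record the test error rate for each
ks <- seq(1, 15, by = 2)
error_rate <- sapply(ks, function(k) {
  pred <- knn(train_x, test_x, train_y, k = k)
  mean(pred != test_y)
})

# Pick the k with the lowest error rate (the smallest such k on ties)
best_k <- ks[which.min(error_rate)]
print(data.frame(k = ks, error_rate = error_rate))
print(best_k)
```

With small data sets, averaging the error over several random splits or using cross-validation gives a more stable choice than a single split.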

Page 42: Text Mining Infrastructure in R

KNN for speech classification

Datasets:
• Size: 40 instances
• Barack Obama: 20 speeches
• Mitt Romney: 20 speeches

Training dataset: 70% (28)
Test dataset: 30% (12)

Accuracy: on average more than 90%

Page 43: Text Mining Infrastructure in R

Speech Classification Implementation in R

# initialize the R environment
libs <- c("tm", "plyr", "class")
lapply(libs, require, character.only = TRUE)

# set parameters / source directory
dir.names <- c("obama", "romney")
path <- "E:/Ashraf/speeches"

# clean text / preprocessing
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # note: tm 0.6 and later expect content_transformer(tolower) here
  corpus.tmp <- tm_map(corpus.tmp, tolower)
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}

Page 44: Text Mining Infrastructure in R

Speech Classification Implementation in R

# build term document matrix
generateTDM <- function(dir.name, dir.path) {
  s.dir <- sprintf("%s/%s", dir.path, dir.name)
  s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  # drop terms that are missing from more than 70% of the documents
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  result <- list(name = dir.name, tdm = s.tdm)
  return(result)
}

tdm <- lapply(dir.names, generateTDM, dir.path = path)

Page 45: Text Mining Infrastructure in R

Speech Classification Implementation in R

# attach candidate name to each row of TDM
bindCandidateToTDM <- function(tdm) {
  # transpose so that rows are documents and columns are terms
  s.mat <- t(data.matrix(tdm[["tdm"]]))
  s.df <- as.data.frame(s.mat, stringsAsFactors = FALSE)
  s.df <- cbind(s.df, rep(tdm[["name"]], nrow(s.df)))
  colnames(s.df)[ncol(s.df)] <- "targetcandidate"
  return(s.df)
}

candTDM <- lapply(tdm, bindCandidateToTDM)

Page 46: Text Mining Infrastructure in R

Speech Classification Implementation in R

# stack the TDMs together (for both Obama and Romney)
tdm.stack <- do.call(rbind.fill, candTDM)
# terms missing from one candidate's matrix become NA; count them as 0
tdm.stack[is.na(tdm.stack)] <- 0

# hold-out / splitting training and test data sets
train.idx <- sample(nrow(tdm.stack), ceiling(nrow(tdm.stack) * 0.7))
test.idx <- (1:nrow(tdm.stack))[-train.idx]

Page 47: Text Mining Infrastructure in R

Speech Classification Implementation in R

# model KNN
tdm.cand <- tdm.stack[, "targetcandidate"]
tdm.stack.nl <- tdm.stack[, !colnames(tdm.stack) %in% "targetcandidate"]

knn.pred <- knn(tdm.stack.nl[train.idx, ], tdm.stack.nl[test.idx, ], tdm.cand[train.idx])

# accuracy of the prediction
conf.mat <- table("Predictions" = knn.pred, Actual = tdm.cand[test.idx])
(accuracy <- (sum(diag(conf.mat)) / length(test.idx)) * 100)

# show result
show(conf.mat)
show(accuracy)
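The accuracy computation can be checked by hand on a small example. The labels below are invented purely for illustration (they are not results from the speech data): six test documents, one of which is misclassified.

```r
# Invented true and predicted labels for six test documents
actual    <- c("obama", "obama", "obama", "romney", "romney", "romney")
predicted <- c("obama", "obama", "romney", "romney", "romney", "romney")

# Rows are predictions, columns are the true classes
conf_mat <- table(Predictions = predicted, Actual = actual)

# Correct predictions sit on the diagonal: (2 + 3) / 6 = 83.3%
accuracy <- sum(diag(conf_mat)) / length(actual) * 100

print(conf_mat)
print(round(accuracy, 1))
# 83.3
```

Using diag() this way assumes the table is square with rows and columns in the same order. If a class never appears among the predictions, converting both vectors to factors with the same levels keeps the table square.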

Page 48: Text Mining Infrastructure in R

Speech Classification Implementation in R

Show R Demo

Page 49: Text Mining Infrastructure in R

References

1. Ingo Feinerer, Kurt Hornik, David Meyer. Text Mining Infrastructure in R. Journal of Statistical Software, Vol. 25, Issue 5, March 2008.

2. http://mittromneycentral.com/speeches/

3. http://obamaspeeches.com/

4. http://cran.r-project.org/