A journey into Text Analytics

A journey into Text Analytics John McConnell Analytical People ASC Winchester 7th September 2013 © analytical-people 2013


Page 1: A journey into Text Analytics

A journey into Text Analytics

John McConnell
Analytical People

ASC Winchester
7th September 2013

© analytical-people 2013

Page 2: A journey into Text Analytics

Contents

• Background & Objectives
• Our current view on Text Analytics
  – Value
  – Process
• An example application
• Conclusions

Page 3: A journey into Text Analytics

Background

• Text Analytics and Text Mining are largely synonymous
• Interest in, and use of, Text Analytics is growing
  – Social Media sources are largely responsible for this
  – And that often means “Big Data”
• This should lead to further improvements in technology and methodology, which will benefit survey practitioners

Page 4: A journey into Text Analytics

Objectives

• We’ve been involved in more Text Analytics work in the last 2 years than in all previous years

• Our objective in this presentation is to share some of our experience and thoughts around some of the technology we have used


Page 5: A journey into Text Analytics

The Value Propositions

1. Reduce cost (and time)
2. Generate actionable insights
   – Improve public and commercial processes

*http://wp.eaagle.com/

Page 6: A journey into Text Analytics

Using Text Analytics to find Text Analytics software

http://www.isvworld.com/

Page 7: A journey into Text Analytics

3 Software tools

R
• Open Source Statistical Platform
• Command driven

Rapid Miner
• Open Source Data Mining Workbench
• GUI
• Integrates with R and Weka

SPSS Text Analytics for Surveys
• Commercial Text Analytics
• GUI

Page 8: A journey into Text Analytics

The Process – Highest Level

Unstructured data → Structured data

Page 9: A journey into Text Analytics

Process – Level 2

1. Extract
2. Refine
3. Analyse

Page 10: A journey into Text Analytics

How can we tell if we are using the right tool(s)?

Extract
• How good is the first extraction?
• How long does it take to get to an acceptable extraction?

Refine
• How easy is it to refine?
• How easy is it to capture refinements so they can be re-used in future?

Analyse
• What tools exist to support the Text Analytics process?
• What tools exist to use the Structured Text in other analyses?

How well do the tools/methods deliver on the value propositions?

Page 11: A journey into Text Analytics

Algorithms and Dictionaries

1. Extract

Algorithms
• e.g. Natural Language Processing (NLP)

Dictionaries
• Variously called Lexicons, Resources, Libraries, etc.
• Are usually contextual, e.g. Customer Satisfaction

Page 12: A journey into Text Analytics

Example Data

• The American Physical Society (APS)
• Student Survey Comments from 2009 (Base=1304)
• Q4.2 Comments about the best features of, and what could be added or improved to, the special programs for Student Members*

*http://www.aps.org/about/governance/committees/commemb/upload/2009-student-comments.pdf

Page 13: A journey into Text Analytics

The first extraction with R

library("tm", lib.loc="C:/Users/jmcconnell/Documents/R/win-library/3.0")

# Read the survey verbatims
APS2009df <- read.csv("C:/AP/ASC/APS/APS2009Verbatims.csv", header = TRUE)

# Build the corpus. Note: passing the whole data frame makes each column one document
text_corpus <- Corpus(VectorSource(APS2009df), readerControl = list(language = "en"))

summary(text_corpus) # check what went in
text_corpus <- tm_map(text_corpus, removeNumbers)
text_corpus <- tm_map(text_corpus, removePunctuation)
text_corpus <- tm_map(text_corpus, stripWhitespace)
# Current tm versions (0.6 and later) expect base functions wrapped in content_transformer()
text_corpus <- tm_map(text_corpus, content_transformer(tolower))

We apply a basic set of text handling methods (simple NLP), e.g. removePunctuation.
We also apply a small dictionary of known “Stopwords” (not shown).
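The stopword step and the counts behind a "top terms" chart can be sketched in base R. This is only an illustration, not the code used for the presentation: the three comments and the stopword list are invented for the example.

```r
# Hypothetical comments standing in for the survey verbatims
comments <- c("The job fair was great",
              "More jobs at the job fair please",
              "Great talks and a great venue")

# A tiny illustrative stopword list (not the one used in the presentation)
stopwords_small <- c("the", "was", "at", "and", "a", "please")

# Lower-case, strip punctuation and digits, then split on whitespace
clean <- tolower(gsub("[[:punct:][:digit:]]", "", comments))
tokens <- unlist(strsplit(clean, "\\s+"))

# Drop stopwords and empty strings, then count and sort the terms
tokens <- tokens[nzchar(tokens) & !(tokens %in% stopwords_small)]
term_counts <- sort(table(tokens), decreasing = TRUE)

# Show the most frequent terms (the top 20 on real data)
print(head(term_counts, 20))
```

Note that "job" and "jobs" are counted separately here, which is the kind of issue the Refine step has to deal with.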

Page 14: A journey into Text Analytics

R Extraction Results – Top 20 Terms


Page 15: A journey into Text Analytics

The first extraction with Rapid Miner


We visually construct a similar set of steps

Page 16: A journey into Text Analytics

Rapid Miner Extraction Results – Top 20 Terms


Page 17: A journey into Text Analytics

Improving and creating new data

2. Refine

Improve the extraction
• Correct mistakes
• Add omissions

Map the extraction to structured data
• Group and combine meaningful terms that will become data for further analysis

In second and subsequent waves (where applicable), Refine should be a shorter step where we look for new concepts
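Mapping grouped terms to structured data might look like the base R sketch below. The comments, the concept names and the term groups are all hypothetical; the point is only that each concept becomes a 0/1 column with one row per respondent.

```r
# Hypothetical comments and concept dictionary, invented for illustration
comments <- c("more employers at the job fair",
              "cheaper accommodation please",
              "the careers fair needs more companies")

concepts <- list(
  job_fair      = c("job fair", "careers fair"),
  accommodation = c("accommodation", "housing")
)

# For each concept, flag the comments that mention any of its terms
flags <- sapply(concepts, function(terms) {
  pattern <- paste(terms, collapse = "|")
  as.integer(grepl(pattern, comments, ignore.case = TRUE))
})

# One row per respondent, one 0/1 column per concept
structured <- data.frame(id = seq_along(comments), flags)
print(structured)
```

In a later wave the same `concepts` list could be re-used, so only new concepts would need adding.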

Page 18: A journey into Text Analytics

Rapid Miner - Refine

We add one process step to fix up some of the issues in the first extraction.
Filter Tokens sets a lower limit for the length of an extracted term/attribute.
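The same kind of length filter is easy to express in R. A minimal sketch, assuming a threshold of 3 characters and a made-up list of extracted terms (neither comes from the presentation):

```r
# Hypothetical extracted terms, including short fragments left over from cleaning
terms <- c("job", "fair", "s", "aps", "t", "meeting", "re")

# Assumed threshold: keep terms of at least 3 characters
min_chars <- 3
kept <- terms[nchar(terms) >= min_chars]
print(kept)
```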

Page 19: A journey into Text Analytics

Rapid Miner results after first refinement


Page 20: A journey into Text Analytics

The first extraction with SPSS

SPSS uses a Wizard to specify the extraction steps

Page 21: A journey into Text Analytics

SPSS Extraction Results – Top 20 Terms

Synonyms from the dictionaries are used

SPSS is counting respondents, not occurrences
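The difference between counting respondents and counting occurrences can be shown with a short base R sketch. The comments are invented for the example:

```r
# Hypothetical comments, one per respondent
comments <- c("great job fair, great speakers",
              "the job fair was great",
              "more travel grants")

# Lower-case, strip punctuation, and split each comment into tokens
tokens_by_comment <- strsplit(gsub("[[:punct:]]", "", tolower(comments)), "\\s+")

# Occurrences: every appearance of a term counts
occurrences <- table(unlist(tokens_by_comment))

# Respondents: each comment counts a term at most once
respondents <- table(unlist(lapply(tokens_by_comment, unique)))

print(occurrences["great"])
print(respondents["great"])
```

Here "great" occurs three times but is used by only two respondents, so top-term lists from tools that count differently will not match exactly.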

Page 22: A journey into Text Analytics

Synonyms for “Excellent”


10 stars, 10/10, 100 % correct, 100% accurate, 100% correct, 100% grade a, 5 star, 5 stars, 5-star, ^ best $, ^ great $, a must, a nice plus, a plus, a+, a++, aagood, above and beyond, above excellence, absolute life saver, absolute word class, acceptional, admirable, all was well, allright, alright, always a please, amazing, among the best, among the very best, appreciable, appreciative, award winning, awesome, awesopme, awsome, beenfantastic, best asset, best of all, best possible, beyond expectation, beyond expectations, big asset, big beast, big hit, big hits, big kudos to, big plus, blow all others away, blows all others away, blows the doors off, brilliant $, can not be beat, can't be beat, can't beat, cannot be beat, capable, capible, class service, compliment, compliment one another well, congrats, congratulations, copious, cutting edge, cutting-edge, dandy, delight, deluxe, deserves a raise, deserves credit, does that well, doing her best, doing his best, doing their best, done very well, dynamite, exccellent, excelent, excellant, excellence, excellet, excelllent, excepional, exceptional, exceptionl, execellent, exelant, exelent, exellant, exellecent, exellent, exlt, expectional, exquis, exquise, exquises, exquisite, exquisitely $, extraordinary, extrodinary, fabulous, fairly well, fanatstic, fantabulous, fantasic, fantastic, fantatic, finest, first class, first-class, first-rate, five stars, formidable, frantastic, given me the most, godsend, goes over well, goodd, gooood, graet, grat, grea, greaat, great pleasure, greate, greatest, greeeeeeeaaattttt, gret, greta, hats down, hats off, head and shoulders better, heavenly, high hats off, ideal, impecable, impeccable, impress, impresses me most, impressive, in an orderly fashion, incomparable, incredibe, incredible, increible, indisputable, ingenious, inpecable, invaluable, is still the best, it was a pleasure, knock socks off, knock spots off, kudos, kudos to, laudable, lifesaver, made an impression, made 
the difference, magnificent, marvellous, marvelous, my compliments to, nicest, number 1, number one, oustanding, out of the woods, out of the world, out of this world, outperform, outperforming, outsanding, outstanding, peachy, perfect, perfection, perfectly done, phenomenal, phenominal, pleasure of working with, prettier, pretty good, quintessential, reach a ten, real good, real nice, remarkable, right direction, rock $, rocked my world, second to none, sensational, smashing, spectacular, spendid, splendid, stand head & shoulders above, stand head and shoulders above, standing head & shoulders above, standing head and shoulders above, stands head & shoulders above, stands head and shoulders above, stood head & shoulders above, stood head and shoulders above, strong positive, superb, supurb, surpassed my expectations, surreal, sweetheart, ten stars, terric, terrific, terrifig, the best, the best one so far, the best thing, the highlight of, the only one that works, thebest, think highly, think very highly, to die for, top notch, top quality, top ranked, top-flight, top-notch, top-of-the-line, top-ranked, top-ranking, topflight, topnotch, topranked, topranking, tremedous, tremendous, tried and proven, trmendous, turn out good, two thumbs up, unbeatable, unmatched, unmnatched, unparalleled, unquestionable, unquestionnable, unsurpassed, up 2 standard, up 2 standards, up 2 usual standards, up to standard, up to standards, up to usual standards, up to your usual standards, up-beat, upbeat, utmost, v-good, well done, went above and beyond my expectations, woderful, womderful, wondeful, wonderful, wonderfull, wonedeful, wonederful, would be the smartest, wounderful

Page 23: A journey into Text Analytics

Adding Wordnet to our R (/RapidMiner) analysis

library("wordnet")
setDict("C:/Wordnet/WordNet-3.0/dict")
synonyms("excellent", "ADJECTIVE")

[1] "excellent" "fantabulous" "first-class" "splendid"
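Once a synonym list is available, from WordNet or from a tool's dictionary, it can be applied to the text to collapse variants into one concept. A minimal base R sketch, using a handful of the synonyms listed earlier and invented comments:

```r
# A few entries from the synonym list shown earlier, plus the WordNet results
excellent_synonyms <- c("awesome", "excelent", "top notch", "first-class",
                        "fantabulous", "splendid")

# Hypothetical comments
comments <- c("The meeting was awesome", "Top notch speakers", "Food was poor")

# Build one case-insensitive pattern that matches any synonym as a whole term
pattern <- paste0("\\b(", paste(excellent_synonyms, collapse = "|"), ")\\b")
normalised <- gsub(pattern, "excellent", comments, ignore.case = TRUE, perl = TRUE)
print(normalised)
```

After this step the term counts treat all of these variants as the single term "excellent".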

Page 24: A journey into Text Analytics

Analytics to aid refinement


Page 25: A journey into Text Analytics

Job … Fair


Students are asking for more “stuff” at the job fair

Page 26: A journey into Text Analytics

R Extraction Results – Top 20 Terms


Page 27: A journey into Text Analytics

Onward to analysis

[Bar chart: Key Drivers of Recommendation*. Categories shown: Job Fair - Would like more, Accommodation, Support Services, Teaching quality. Axis: 0% to 45%.]

*This is an anonymised example

Page 28: A journey into Text Analytics

Onward to analysis

3. Analyse

R
• In R we are in a statistical platform already
• Text Analytics outputs are part of the data in the current “Workspace”
• For Research style charts and tables we may need to export data

Rapid Miner
• In RM we are in a Data Mining platform already
• Text Analytics is part of the current process flow

SPSS Text Analytics for Surveys
• Data needs to be exported elsewhere for Analysis
• To SPSS .sav, Excel or Data Collection

Page 29: A journey into Text Analytics

A High Level Comparison

Tools compared: R, Rapid Miner, SPSS TAfS (SPSS Text Analytics for Surveys)

Help & Support
• R: Lots of User Generated Content
• Rapid Miner: Lots of User Generated Content; paid support option
• SPSS TAfS: Paid support

Usability
• R: Low level coding control
• Rapid Miner: Visual programming
• SPSS TAfS: Visual UI

Scalability
• R: R in itself isn’t too scalable but many scalable implementations exist, e.g. Revolution, Hadoop
• Rapid Miner: Radoop
• SPSS TAfS: We experienced issues with data sets around 100,000 cases*

Extensibility
• R: Various options
• Rapid Miner: Various options
• SPSS TAfS: None

Automation
• R: Can be run in batch
• Rapid Miner: Can be run in batch
• SPSS TAfS: None

Overall
• R: Great for the coder and those familiar with R
• Rapid Miner: The power of R with a GUI
• SPSS TAfS: The most graphical, and tuned for generic survey types, e.g. Opinions

*IBM/SPSS have a Text Analytics option for Data Mining which may be more scalable – we haven’t tested it yet

Page 30: A journey into Text Analytics

Our current conclusions

• Dictionaries help in the initial extraction
  – But it is almost inevitable you will want to extend them to get to the specificity of the study. If the study domain is very specific you can build your own dictionaries in all 3 tools. A lot of social media monitoring starts with libraries of regular expressions built from the ground up.
• Open Source tools like R and Rapid Miner will continue to improve with “packages” added by the R community
• There is no “silver bullet”. The Refine step will typically require a lot of manual input
  – Especially in the initial “build” phase
  – More is required on larger surveys
• But the ROI – in time and/or cost – should be clear
  – And the results more robust and reliable
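As an illustration of the regular-expression libraries mentioned in the first conclusion, a hand-built "dictionary" can be as simple as a named vector of patterns. The concepts, patterns and posts below are hypothetical:

```r
# Hypothetical regular-expression "dictionary": one pattern per concept
regex_library <- c(
  price    = "\\b(price|prices|cost|costs|expensive|cheap)\\b",
  delivery = "\\b(deliver(y|ed|ies)?|shipping|courier)\\b"
)

# Hypothetical social media posts
posts <- c("Delivery took two weeks!",
           "Too expensive for what you get",
           "Love the new colours")

# Match every post against every pattern; rows are posts, columns are concepts
hits <- sapply(regex_library, grepl, x = posts, ignore.case = TRUE, perl = TRUE)
print(hits)
```

Extending the library for a new study then means adding or editing patterns, which is the manual Refine effort the conclusions describe.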


Page 31: A journey into Text Analytics

A journey into Text Analytics
Thank-you & Questions

John McConnell
Analytical People
