A Text Analytics Marketscape (from Strata NY 2014)

Extending "Variety" of Data to "Variety" of Users

Tina Groves

Big Data and Analytics, IBM

© 2014 IBM Corporation | #strataconf #hadoopworld | ibm.com/hadoop | @tinagroves

Tina Groves

IBM Big Data & Analytics Product Strategy team

Product manager, 15+ years

• Focus on new product introduction and innovation areas
• Results tied to 1,000s of customers; 1,000,000s of users and 100s of millions in revenue

Personal

• Hockey mom, skier, closet Scrabble nerd and oftentimes analytics geek


What Makes Text Analytics Challenging?

Company: IBM
Annual Revenue: 99,751
Annual Revenue Units: Billion
Number of Employees: 432,212
Tone: Conservative

Easier: One source; derive attributes

Harder: Many sources; infer perception & behaviour


What Makes Text Analytics Even MORE CHALLENGING?

Setting aside the obvious linguistic challenges…

Culture, Slang, Sarcasm

• Same word, different meanings, e.g., “Sick”
• Same meaning, different words, e.g., “daks”, “trousers”, “pants”

Multiple languages

• “Juan” = “John” or “Jean”
• “The lazy brown dog lazed in the sunshine” = 怠惰な茶色の犬は太陽の下でlazed

Infrastructure

Increasing

• Volume
• Sources
• Users
• Analytic complexity
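The "same meaning, different words" case can be sketched as a lookup table that maps variants onto one canonical term. This is a minimal illustration in base R, not something shown in the talk: the `synonyms` table and the `normalise_terms` helper are hypothetical names, and the sample sentence is made up. It does nothing for the harder "same word, different meanings" case (e.g., “Sick”), which needs context.

```r
# Hypothetical synonym table: maps regional variants to one canonical term.
synonyms <- c(daks = "trousers", pants = "trousers", trousers = "trousers")

normalise_terms <- function(text, table) {
  # Split on anything that is not a letter, lower-case, and drop empty tokens.
  tokens <- tolower(unlist(strsplit(text, "[^A-Za-z]+")))
  tokens <- tokens[nzchar(tokens)]
  # Replace tokens found in the table; leave all other tokens unchanged.
  hits <- tokens %in% names(table)
  tokens[hits] <- unname(table[tokens[hits]])
  tokens
}

normalised <- normalise_terms("Nice daks! Are those pants new?", synonyms)
print(normalised)
```

Both "daks" and "pants" come back as "trousers", so downstream counts treat them as one term. A real deployment would need a curated, domain-specific table rather than three hand-typed entries.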


The Dilemma with Text Analytics: Skills vs. Need Disconnect

Roles: Business Analyst; Application Developer; Data Scientist

Skills: Domain Knowledge; Programming; Advanced Analytics & NLP


NLP Engines & Tools: Developer and Data Scientist Tools

• NLP market is about 10-15 yrs old
• Highly fragmented, no clear leader
• Many open source or free alternatives

Diagram labels: tm (Text Mining); Free / Open Source; NLP Pure Plays

Sources:

• A Review of Text Analytics Suppliers, Butler Analytics, 2014-01
• Text Analytics 2014: User Perspectives on Solutions and Providers, Seth Grimes, 2014-06
• Who's Who in Text Analytics, Gartner, 2012-09


Let’s look at an example

Occupation: Software Developer
Rating: 8.4
Upward Mobility: good (Average)
Stress Level: Fair (Average)
Flexibility: Fair (Average)

“100 Best Jobs” Copyrighted 2014. U.S. News & World Report. 112878:914JM.


RStudio using the stringr & Hmisc libraries against BigInsights


# Text Analytics with Open Source R
# Strata Conference - October 2014, NYC
#################################

# Clean-up environment
rm(list=ls())

# Load required open source R packages
library(stringr)   # str_extract()
library(Hmisc)     # first.word()

# Loop through all files in a directory
# (pattern is a regular expression, so the dot is escaped and anchored)
files <- list.files(path="/home/biadmin/Desktop/Best-jobs", pattern="\\.txt$", all.files=TRUE, full.names=TRUE)

# Empty result table, one row per input document
outDF <- data.frame(InputDocument=character(), JobTitle=character(), OverallScore=numeric(),
                    Stress=character(), UpWardMobility=character(), Flexibility=character(),
                    stringsAsFactors=FALSE)

for (file in files) {
  # Read in the text file of interest
  f <- readLines(file)

  # Text to extract: Occupation (first line of the file)
  cline1 <- f[1]
  val1 <- as.character(str_extract(cline1, "[a-zA-Z]+\\s*[a-zA-Z]*"))
  val1 <- ifelse(is.null(val1) == TRUE, NA, val1)

  # Text to extract: Rating (Overall Score); the dot is escaped to match a literal decimal point
  cline2 <- grep("Overall Score", f, value=TRUE)
  val2 <- as.numeric(str_extract(cline2, "[0-9]+\\.[0-9]+"))
  val2 <- ifelse(is.null(val2) == TRUE, NA, val2)

  # Text to extract: Stress Level (value starts at character 14)
  cline3 <- grep("Stress Level", f, value=TRUE)
  val3 <- as.character(substring(cline3, 14))
  val3 <- first.word(val3)
  val3 <- ifelse(is.null(val3) == TRUE, NA, val3)

  # Text to extract: Upward Mobility (value starts at character 17)
  cline4 <- grep("Upward Mobility", f, value=TRUE)
  val4 <- as.character(substring(cline4, 17))
  val4 <- first.word(val4)
  val4 <- ifelse(is.null(val4) == TRUE, NA, val4)

  # Text to extract: Flexibility (value starts at character 13)
  cline5 <- grep("Flexibility", f, value=TRUE)
  val5 <- as.character(substring(cline5, 13))
  val5 <- first.word(val5)
  val5 <- ifelse(is.null(val5) == TRUE, NA, val5)

  fileName <- basename(file)

  # The slide text is cut off from here on; the column assignments and the
  # closing lines are reconstructed from the outDF definition above.
  newRow <- data.frame(InputDocument=fileName, JobTitle=val1, OverallScore=val2,
                       Stress=val3, UpWardMobility=val4, Flexibility=val5,
                       stringsAsFactors=FALSE)
  outDF <- rbind(outDF, newRow)
}


# Text to extract: Occupation
cline1 <- f[1]
val1 <- as.character(str_extract(cline1,"[a-zA-Z]+\\s*[a-zA-Z]*"))
val1 <- ifelse(is.null(val1) == TRUE, NA, val1)

# Text to extract: Stress Level
cline3 <- grep("Stress Level",f, value=TRUE)
val3 <- as.character(substring(cline3, 14))
val3 <- first.word(val3)
val3 <- ifelse(is.null(val3) == TRUE, NA, val3)
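To see what these two extractions produce, here is a small worked sketch on an in-memory document. Several things are assumptions rather than content from the talk: the `doc` vector is modelled on the "Software Developer" example, the `extract_first` helper is a hypothetical base-R stand-in for `stringr::str_extract()`, and the `"^[A-Za-z]+"` pattern stands in for `Hmisc::first.word()` so the snippet runs without extra packages.

```r
# Assumed sample document, modelled on the "Software Developer" slide.
doc <- c("Software Developer",
         "Overall Score 8.4",
         "Upward Mobility Good (Average)",
         "Stress Level Fair (Average)",
         "Flexibility Fair (Average)")

# Hypothetical base-R stand-in for stringr::str_extract():
# returns the first match, or NA when there is no match (or no input line).
extract_first <- function(text, pattern) {
  m <- regmatches(text, regexpr(pattern, text))
  if (length(m) == 0) NA_character_ else m[1]
}

# Occupation: one or two words from the first line.
job_title <- extract_first(doc[1], "[a-zA-Z]+\\s*[a-zA-Z]*")

# Rating: the decimal number on the "Overall Score" line.
score_line <- grep("Overall Score", doc, value = TRUE)
score <- as.numeric(extract_first(score_line, "[0-9]+\\.[0-9]+"))

# Stress Level: skip the 13-character label "Stress Level ", keep the first word.
stress_line <- grep("Stress Level", doc, value = TRUE)
stress_value <- substring(stress_line, 14)
stress <- extract_first(stress_value, "^[A-Za-z]+")

print(list(JobTitle = job_title, OverallScore = score, Stress = stress))
```

The fixed offset 14 only works while the label is exactly "Stress Level " with a single trailing space, which is one reason hand-written parsing like this tends to be brittle across sources.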


Finished Results

Time: a few hours

Considerations

Programming

• Text parsing
• Multiple files
• Missing values resulting in missing rows

Infrastructure

• Dataset size
• Single machine vs cluster
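The "missing values resulting in missing rows" consideration can be illustrated with a small guard. In R, `grep(..., value = TRUE)` returns a zero-length character vector, not `NULL`, when no line matches, so an `is.null()` check does not detect the gap. The sketch below is one possible approach, not the speaker's code: `field_after` is a hypothetical helper, and the two in-memory documents stand in for files on disk.

```r
# Hypothetical helper: return the text after a label, or NA when no line has it.
field_after <- function(lines, label) {
  hit <- grep(label, lines, value = TRUE, fixed = TRUE)
  if (length(hit) == 0) return(NA_character_)   # explicit zero-length check
  trimws(sub(label, "", hit[1], fixed = TRUE))
}

# Made-up documents; the second one lacks a "Stress Level" line.
docs <- list(
  complete = c("Software Developer", "Stress Level Fair (Average)"),
  partial  = c("Dentist")
)

# Build one row per document, even when a field is absent.
rows <- lapply(names(docs), function(name) {
  lines <- docs[[name]]
  data.frame(InputDocument = name,
             JobTitle = lines[1],
             Stress = field_after(lines, "Stress Level"),
             stringsAsFactors = FALSE)
})
out <- do.call(rbind, rows)
print(out)
```

Every document contributes exactly one row, with `NA` marking the absent field, so downstream counts stay aligned with the number of input files.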


NLP Tool: BigInsights Big Text Example

Note: In Beta


But… how to reach the Business Analyst?

Roles: Business Analyst; Application Developer; Data Scientist

Skills: Domain Knowledge; Programming; Advanced Analytics & NLP


Reaching the Business Analyst with Tools

• Key drivers: ease of use and time-to-results
• Differences from NLP tools
  • GUI-driven
  • Built-in algorithms
  • Multi-language support
• Related technologies: Search or Information Discovery, Machine Learning

Diagram labels: NLP Engines; A. Within Enterprise Offerings; B. Niche Tool


The Difference

[Chart: Projected Job Growth 2020, in thousands (axis 1,000 to 7,000), by education level: High School or less; 2 years postsecondary or less; Bachelor's degree or higher]

“100 Best Jobs” Copyrighted 2014. U.S. News & World Report. 112878:914JM.


Business Analyst: IBM Social Media Analytics example


Where’s the Growth?

• Key influence: SaaS
• NLP Engines → Solution Platforms
• LOB Tools → Point Solutions
• Data integration services

Areas trending

• Marketing, fraud, healthcare

Callout: Tools & Engines incorporate Domain Knowledge

Diagram labels: NLP Engines; LOB Tools; Marketplace; Platforms; Point Solutions


Business Analyst: IBM Social Media Analytics example


Conclusion

1. A variety of tools is needed to reach a variety of users.
2. With a highly fragmented market, look for integration.
3. This market is changing. Don’t be afraid to re-assess.


ibmhadoop.challengepost.com

Stop by IBM Booth #321 to learn more!


Legal Disclaimer

• © IBM Corporation 2014. All Rights Reserved.

• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
