
Text Analytics Software: Choosing the Right Fit

Tom Reamy, Chief Knowledge Architect

KAPS Group

http://www.kapsgroup.com

Text Analytics World

October 20 New York


Agenda

Introduction – Text Analytics Basics
Evaluation Process & Methodology
– Two Stages – Initial Filters & POC (Proof of Concept)
– Methodology – Results
Text Analytics and “Text Analytics”
Conclusions


KAPS Group: General

Knowledge Architecture Professional Services
Virtual Company: Network of consultants – 8-10
Partners – SAS, SAP, FAST, Smart Logic, Concept Searching, etc.
Consulting, Strategy, Knowledge architecture audit
Services:
– Taxonomy/Text Analytics development, consulting, customization
– Evaluation of Enterprise Search, Text Analytics
– Text Analytics Assessment, Fast Start
– Technology Consulting – Search, CMS, Portals, etc.
– Knowledge Management: Collaboration, Expertise, e-learning
– Applied Theory – Faceted taxonomies, complexity theory, natural categories


Introduction to Text Analytics: Text Analytics Features

Noun Phrase Extraction
– Catalogs with variants, rule based dynamic
– Multiple types, custom classes – entities, concepts, events
– Feeds facets
Summarization
– Customizable rules, map to different content
Fact Extraction
– Relationships of entities – people-organizations-activities
– Ontologies – triples, RDF, etc.
Sentiment Analysis
– Rules – Objects and phrases
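To make "catalogs with variants" feeding facets more concrete, here is a minimal sketch in Python. It is not any vendor's implementation: the catalog entries, the `CATALOG` structure, and the function names `build_matcher` and `extract_entities` are all hypothetical, chosen only to illustrate mapping surface variants to one canonical entity per type.

```python
import re

# Hypothetical catalog: canonical name -> (entity type, known variants).
CATALOG = {
    "IBM": ("organization", ["IBM", "International Business Machines"]),
    "New York City": ("location", ["New York City", "NYC"]),
}


def build_matcher(catalog):
    """Compile one regex covering every variant, plus a variant lookup table."""
    lookup = {}
    for canonical, (entity_type, variants) in catalog.items():
        for variant in variants:
            lookup[variant.lower()] = (canonical, entity_type)
    # Longest variants first, so a long variant wins over a shorter overlap.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def extract_entities(text, catalog=CATALOG):
    """Return {entity type: set of canonical names} found in the text."""
    pattern, lookup = build_matcher(catalog)
    facets = {}
    for match in pattern.finditer(text):
        canonical, entity_type = lookup[match.group(0).lower()]
        facets.setdefault(entity_type, set()).add(canonical)
    return facets


if __name__ == "__main__":
    note = "International Business Machines opened an office in NYC; IBM staff moved."
    print(extract_entities(note))
```

Each entity type in the result could serve as one facet, with the canonical names as its values. A real catalog would also need disambiguation rules, which this sketch leaves out.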


Introduction to Text Analytics: Text Analytics Features

Auto-categorization
– Training sets – Bayesian, Vector space
– Terms – literal strings, stemming, dictionary of related terms
– Rules – simple – position in text (Title, body, url)
– Semantic Network – Predefined relationships, sets of rules
– Boolean – Full search syntax – AND, OR, NOT
– Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE
This is the most difficult to develop
Build on a Taxonomy
Combine with Extraction
– If any of list of entities and other words
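The Boolean and proximity operators listed above can be sketched as small composable rule functions. This is an illustration under stated assumptions, not the syntax of any product: the helper names (`term`, `all_of`, `any_of`, `none_of`, `dist`) are hypothetical, and `dist(n, a, b)` is assumed to mean "terms a and b occur within n tokens of each other, in either order". The "Billing Complaint" rule is an invented example category.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term(word):
    """Rule: the literal token appears anywhere in the document."""
    return lambda tokens: word in tokens


def all_of(*rules):
    """AND: every sub-rule must match."""
    return lambda tokens: all(rule(tokens) for rule in rules)


def any_of(*rules):
    """OR: at least one sub-rule must match."""
    return lambda tokens: any(rule(tokens) for rule in rules)


def none_of(*rules):
    """NOT: no sub-rule may match."""
    return lambda tokens: not any(rule(tokens) for rule in rules)


def dist(max_gap, a, b):
    """Assumed DIST semantics: a and b within max_gap tokens, any order."""
    def rule(tokens):
        pos_a = [i for i, tok in enumerate(tokens) if tok == a]
        pos_b = [i for i, tok in enumerate(tokens) if tok == b]
        return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)
    return rule


# Hypothetical rule for an invented "Billing Complaint" category.
billing_complaint = all_of(
    any_of(term("bill"), term("invoice")),
    dist(4, "wrong", "amount"),
    none_of(term("resolved")),
)


def categorize(text, rule):
    """Apply a rule to raw text."""
    return rule(tokenize(text))


if __name__ == "__main__":
    print(categorize("The invoice shows the wrong total amount again.", billing_complaint))
```

The proximity operator is what separates "wrong amount" as a complaint signal from documents where the two words merely co-occur far apart, which is one reason rule development takes several rounds of refinement.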

Case Study – Categorization & Sentiment


Evaluation Process & Methodology: Overview

Start with Self Knowledge
– Think Big, Start Small, Scale Fast
Eliminate the unfit
– Filter One – Ask Experts – reputation, research – Gartner, etc.
• Market strength of vendor, platforms, etc.
• Feature scorecard – minimum, must have, filter to top 3
– Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus
– Filter Three – In-Depth Demo – 3-6 vendors
Deep POC (2) – advanced, integration, semantics
Focus on working relationship with vendor.
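As a rough illustration of the feature scorecard step in Filter One, the sketch below drops any vendor missing a must-have feature and then ranks the rest by weighted nice-to-have features. The vendor names, feature labels, weights, and the `shortlist` function are all hypothetical. The scorecard here only narrows the field and is not meant to pick a winner.

```python
def shortlist(vendors, must_have, weights, top_n=3):
    """Drop vendors lacking any must-have feature, then keep the top_n by score."""
    qualified = [v for v in vendors if must_have <= v["features"]]

    def score(vendor):
        # Sum the weights of the optional features this vendor offers.
        return sum(w for feature, w in weights.items() if feature in vendor["features"])

    # Sort by descending score, breaking ties alphabetically by name.
    ranked = sorted(qualified, key=lambda v: (-score(v), v["name"]))
    return [(v["name"], score(v)) for v in ranked[:top_n]]


# Hypothetical scorecard data.
VENDORS = [
    {"name": "Vendor A", "features": {"categorization", "entity_extraction", "sentiment", "api"}},
    {"name": "Vendor B", "features": {"categorization", "entity_extraction"}},
    {"name": "Vendor C", "features": {"categorization", "sentiment", "api"}},
    {"name": "Vendor D", "features": {"categorization", "entity_extraction", "api", "multilingual"}},
    {"name": "Vendor E", "features": {"categorization", "entity_extraction", "multilingual", "sentiment", "api"}},
]
MUST_HAVE = {"categorization", "entity_extraction"}
WEIGHTS = {"sentiment": 3, "multilingual": 2, "api": 2}

if __name__ == "__main__":
    for name, points in shortlist(VENDORS, MUST_HAVE, WEIGHTS):
        print(name, points)
```

In this invented data, Vendor C is eliminated outright for lacking entity extraction, whatever its other strengths.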


Design of the Text Analytics Selection Team

Traditional Candidates – IT, Business, Library
IT – Experience with software purchases, needs assessment, budget
– Search/Categorization is unlike other software, deeper look
Business – understand business, focus on business value
They can get executive sponsorship, support, and budget
– But don’t understand information behavior, semantic focus
Library, KM – Understand information structure
Experts in search experience and categorization
– But don’t understand business or technology


Design of the Text Analytics Selection Team

Interdisciplinary Team, headed by Information Professionals
Relative Contributions
– IT – Set necessary conditions, support tests
– Business – provide input into requirements, support project
– Library – provide input into requirements, add understanding of search semantics and functionality
Much more likely to make a good decision
Create the foundation for implementation


Evaluating Taxonomy/Text Analytics Software: Start with Self Knowledge

Strategic and Business Context
Info Problems – what, how severe
Strategic Questions – why, what value from the text analytics, how are you going to use it
– Platform or Applications?
Formal Process – KA audit – content, users, technology, business and information behaviors, applications – or informal for smaller organization
Text Analytics Strategy/Model – forms, technology, people
– Existing taxonomic resources, software
Need this foundation to evaluate and to develop


Varieties of Taxonomy/Text Analytics Software

Taxonomy Management
– Synaptica, SchemaLogic
Full Platform
– SAS, SAP, Smart Logic, Linguamatics, Concept Searching, Expert System, IBM, GATE
Embedded – Search or Content Management
– FAST, Autonomy, Endeca, Exalead, etc.
– Nstein, Interwoven, Documentum, etc.
Specialty / Ontology (other semantic)
– Sentiment Analysis – Lexalytics, Clarabridge, Lots of players
– Ontology – extraction, plus ontology

Vendors of Taxonomy/Text Analytics Software

– Attensity
– Business Objects – Inxight
– Clarabridge
– ClearForest
– Concept Searching
– Data Harmony / Access Innovations
– Expert Systems
– GATE (Open Source)
– IBM Infosphere
– Lexalytics
– Multi-Tes
– Nstein
– SAS
– SchemaLogic
– Smart Logic
– Synaptica


Initial Evaluation – Factors

Traditional Software Evaluation – Deeper
Basic & Advanced Capabilities
Lack of Essential Feature
– No Sentiment Analysis, Limited language support
Customization vs. OOB
– Strongest OOB – highest customization cost
Company experience, multiple products vs. platform
Ease of integration – APIs, Java
– Internal and External Applications
– Technical Issues, Development Environment
Total Cost of Ownership and support, initial price
POC Candidates – 1-4


Initial Evaluation – Factors: Case Studies

Amdocs
– Customer Support Notes – short, badly written, millions of documents
– Total Cost, multiple languages, Integration with their application
– Distributed expertise – Platform – resell full range of services, Sentiment Analysis
– Twenty to Four to POC (Two) to SAS
GAO
– Library of 200 page PDF formal documents, plus public web site
– People – library staff – 3-4 taxonomists – centralized expertise
– Enterprise search, general public
– Twenty to POC with SAS

Phase II - Proof Of Concept - POC

Measurable Quality of results is the essential factor
4 weeks POC – bake off / or short pilot
Real life scenarios, categorization with your content
2 rounds of development, test, refine / Not OOB
Need SMEs as test evaluators – also to do an initial categorization of content
Majority of time is on auto-categorization
Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time
Taxonomy Developers – expert consultants plus internal taxonomists


POC Design: Evaluation Criteria & Issues

Basic Test Design – categorize test set
– Score – by file name, human testers
Categorization & Sentiment – Accuracy 80-90%
– Effort Level per accuracy level
Quantify development time – main elements
Comparison of two vendors – how to score?
– Combination of scores and report
Quality of content & initial human categorization
– Normalize among different test evaluators
Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors
Quality of taxonomy – structure, overlapping categories
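The basic test design, scoring a categorized test set by file name against human labels, might look like the following sketch. The file names, category labels, vendor names, and the `accuracy` function are hypothetical. The 0.80 threshold is taken from the low end of the 80-90% accuracy target on the slide.

```python
def accuracy(gold, predicted):
    """Share of test files whose predicted category matches the human label.

    Both arguments map file name -> category. A file missing from
    `predicted` counts as a miss.
    """
    if not gold:
        raise ValueError("gold standard is empty")
    hits = sum(1 for name, label in gold.items() if predicted.get(name) == label)
    return hits / len(gold)


# Hypothetical human categorization of the test set, keyed by file name.
GOLD = {
    "note_001.txt": "billing",
    "note_002.txt": "outage",
    "note_003.txt": "billing",
    "note_004.txt": "upgrade",
    "note_005.txt": "outage",
}

# Hypothetical output from two candidate tools after a development round.
VENDOR_X = dict(GOLD, **{"note_004.txt": "billing"})
VENDOR_Y = dict(GOLD, **{"note_002.txt": "billing", "note_005.txt": "upgrade"})

TARGET = 0.80  # low end of the 80-90% range named on the slide

if __name__ == "__main__":
    for name, output in (("Vendor X", VENDOR_X), ("Vendor Y", VENDOR_Y)):
        acc = accuracy(GOLD, output)
        print(f"{name}: {acc:.0%} {'meets' if acc >= TARGET else 'below'} target")
```

Recording this number after each development round, alongside the hours spent, gives a rough view of effort level per accuracy level.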

Text Analytics POC Outcomes: Evaluation Factors

Variety & Limits of Content
– Twitter to large formal libraries
Quality of Categorization
– Scores – Recall, Precision (harder)
– Operators – NOT, DIST, START
Development Environment & Methodology
– Toolkit or Integrated Product
– Effort Level and Usability
Importance of relevancy – can be used for precision, applications
Combination of workbench, statistical modeling
Measures – scores, reports, discussions
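Recall and precision per category can be computed as in this sketch, which uses the standard definitions: precision is the share of documents the tool assigned to a category that humans also assigned, and recall is the share of human-assigned documents the tool found. The data and the `precision_recall` function are hypothetical, and documents may carry more than one category.

```python
def precision_recall(gold, predicted, category):
    """Return (precision, recall) for one category.

    Both arguments map document name -> set of categories.
    A ratio with a zero denominator is reported as 0.0.
    """
    true_pos = false_pos = false_neg = 0
    for doc, gold_cats in gold.items():
        in_gold = category in gold_cats
        in_pred = category in predicted.get(doc, set())
        if in_gold and in_pred:
            true_pos += 1
        elif in_pred:
            false_pos += 1
        elif in_gold:
            false_neg += 1
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical human labels and tool output.
GOLD_LABELS = {
    "a.txt": {"billing"},
    "b.txt": {"billing", "outage"},
    "c.txt": {"outage"},
    "d.txt": {"billing"},
}
TOOL_LABELS = {
    "a.txt": {"billing"},
    "b.txt": {"outage"},
    "c.txt": {"billing", "outage"},
    "d.txt": {"billing"},
}

if __name__ == "__main__":
    for cat in ("billing", "outage"):
        p, r = precision_recall(GOLD_LABELS, TOOL_LABELS, cat)
        print(f"{cat}: precision={p:.2f} recall={r:.2f}")
```

Broad rules tend to raise recall quickly. Precision usually needs narrowing operators such as NOT and DIST, which fits the slide's note that precision is harder.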


POC and Early Development: Risks and Issues

CTO Problem – This is not a regular software process
Semantics is messy not just complex
– 30% accuracy isn’t 30% done – could be 90%
Variability of human categorization
Categorization is iterative, not “the program works”
– Need realistic budget and flexible project plan
Anyone can do categorization
– Librarians often overdo, SMEs often get lost (keywords)
Meta-language issues – understanding the results
– Need to educate IT and business in their language
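One common way to quantify the variability of human categorization is Cohen's kappa, which measures agreement between two evaluators beyond what chance would produce. The slides do not name this statistic, so it is offered here as an assumption. The labels and the `cohens_kappa` function are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items in the same order.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each rater's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two subject matter experts on six documents.
SME_1 = ["billing", "billing", "outage", "outage", "billing", "other"]
SME_2 = ["billing", "outage", "outage", "outage", "billing", "billing"]

if __name__ == "__main__":
    print(f"kappa = {cohens_kappa(SME_1, SME_2):.3f}")
```

If two human evaluators only moderately agree with each other, a tool scored against either one has a ceiling well below 100%, which helps explain the results to IT and business sponsors.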


Text Analytics and “Text Analytics” – Text Mining

TA is pre-processing for text mining
TA adds huge dimensions of unstructured text
– Now 85-90% of all content, Social Media
TA can improve the quality of text
– Categorization, Disambiguated metadata extraction
Unstructured text into data – What are the possibilities?
– New Kinds of Taxonomies – emotion, small smart modular
– Information Overload – search, facets, auto-tagging, etc.
– Behavior Prediction – individual actions (cancel or not?)
– Customer & Business Intelligence – new relationships
– Crowd sourcing – technical support
– Expertise Analysis – documents, authors, communities


Conclusion

Start with self-knowledge – what will you use it for?
– Current Environment – technology, information
Basic Features are only filters, not scores
Integration – need an integrated team (IT, Business, KA)
– For evaluation and development
POC – your content, real world scenarios – not scores
Foundation for development, experience with software
– Development is better, faster, cheaper
Categorization is essential, time consuming
Text Analytics opens up new worlds of applications


Questions?

Tom Reamy

KAPS Group

Knowledge Architecture Professional Services

http://www.kapsgroup.com
