Text Analytics in Action: Using Text Analytics as a Toolset TBC 4:15 p.m. - 5:00 p.m. Marjorie Hlava Semantic enrichment / Semantic Fingerprinting

Text Analytics in Action: Using Text Analytics as a Toolset

TBC 4:15 p.m. - 5:00 p.m.

Marjorie Hlava

Semantic enrichment / Semantic Fingerprinting

Abstract

• Big data inferences are increasingly used to mine huge heaps of data.
• The applications are endless.
• However, those inferences do not work well when many lines go to a single bubble. The lines and relationships must be drawn between concepts, not simply between words.
• Text analytics is a powerful tool, but it is a means to an end, not the end itself.
• The important work is in the interpretation of the data.
• This session outlines a highly accurate and efficient approach and provides a case study of the application.

Outline of the talk

• Using text analytics in term extraction
  – 3 examples
  – Pattern recognition
  – String tagging
  – Taxonomy control
• Achieving Synonymy
• Now what do I do with it?

Term clouds

• Good place to start
• Show concept landscape
• Basis =
  – Levenshtein distances
  – N-grams
• Redundant concepts, separately shown
• No disambiguation
• Not direct XML tagging
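
The two bases named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tool shown in the talk: the function names `levenshtein` and `char_ngrams` and the sample words are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(levenshtein("colombia", "columbia"))   # 1
print(levenshtein("colombia", "colombian"))  # 1
print(char_ngrams("farc", 3))                # ['far', 'arc']
```

String similarity of this kind groups spelling variants, but it cannot tell that two nearly identical strings name different concepts, which is the "No disambiguation" limitation listed above.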

Sample article

Normal text extraction

Near conceptual synonyms

Nonsensical suggestions

Small Taxonomy

Near synonym, conceptual duplicate

Refined presentation

Dependent concepts

Ontological dependencies

Achieving Synonymy

• Find like concepts
• Merge the terms
• Choose a preferred form
• Build term record
  – Hierarchy
  – Equivalence
  – Associative
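
A term record with the three relationship types above might look like the following sketch. The class `TermRecord`, its field names, and the helper `merge_terms` are hypothetical and chosen only for illustration. The FARC equivalence comes from a later slide in this deck, while the associative link to "Colombia" is an assumed example.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    preferred: str                                    # the chosen preferred form
    broader: set[str] = field(default_factory=set)    # hierarchy, one level up
    narrower: set[str] = field(default_factory=set)   # hierarchy, one level down
    used_for: set[str] = field(default_factory=set)   # equivalence: merged variants
    related: set[str] = field(default_factory=set)    # associative links


def merge_terms(preferred: str, variants: list[str]) -> TermRecord:
    """Merge like terms into one record; variants become non-preferred entry terms."""
    record = TermRecord(preferred=preferred)
    record.used_for.update(v for v in variants if v != preferred)
    return record


record = merge_terms("FARC", ["FARC", "People's Armed Forces of Colombia"])
record.related.add("Colombia")  # assumed associative link, for illustration only
print(record)
```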

Overview, Upload 7K documents, search for text string, add a tag, “Columbia”

“Colombian” – no stemming

Same document – different terms

Colombiana – record overlap

“FARC” – No Synonymy

“People’s Armed Forces of Colombia”, i.e., FARC, lacks synonymy, some doc overlap

Tag suite, no hierarchy, no equivalence, no combining tags for synonymy
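
The string-tagging behaviour described on these slides can be imitated with a short sketch. The three sample documents are invented, and `string_tag` and `concept_tag` are hypothetical helpers. The actual tool in the screenshots may match strings differently, for example on whole words only.

```python
# Invented sample documents, for illustration only.
docs = {
    1: "Talks with the FARC resumed this week.",
    2: "The People's Armed Forces of Colombia issued a statement.",
    3: "Colombian officials met FARC negotiators.",
}


def string_tag(docs: dict[int, str], needle: str) -> set[int]:
    """Exact substring tagging: no stemming, no synonyms, case-sensitive."""
    return {doc_id for doc_id, text in docs.items() if needle in text}


def concept_tag(docs: dict[int, str], variants: list[str]) -> set[int]:
    """Tag a document once if any variant of the concept occurs in it."""
    tagged: set[int] = set()
    for variant in variants:
        tagged |= string_tag(docs, variant)
    return tagged


print(sorted(string_tag(docs, "Colombia")))    # [2, 3]  overlaps with the next tag
print(sorted(string_tag(docs, "Colombian")))   # [3]
print(sorted(string_tag(docs, "FARC")))        # [1, 3]
print(sorted(string_tag(docs, "People's Armed Forces of Colombia")))  # [2]
print(sorted(concept_tag(docs, ["FARC", "People's Armed Forces of Colombia"])))  # [1, 2, 3]
```

Each plain string tag retrieves only part of the set. Combining the variants under one concept is what the tag suite on the slide cannot do.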

Disambiguation

Bridge Structure

Bridge Dentistry

Bridge Game

Bridge Concept
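
One simple way to separate the four senses is to look at the words that surround "bridge". The sketch below is only an assumption about how such a step could work: the cue words in `SENSE_CUES` are invented for the example, and a production rulebase would be written and tuned by editors.

```python
import re
from typing import Optional

# Hypothetical cue words per sense; not taken from the talk.
SENSE_CUES = {
    "Bridge Structure": {"span", "river", "steel", "suspension", "traffic"},
    "Bridge Dentistry": {"tooth", "teeth", "crown", "dental", "dentist"},
    "Bridge Game": {"cards", "bid", "trick", "trump", "partner"},
    "Bridge Concept": {"gap", "divide", "cultures"},
}


def disambiguate(sentence: str) -> Optional[str]:
    """Return the sense whose cue words overlap most with the sentence, else None."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if not words & {"bridge", "bridges"}:
        return None
    scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue word present the sentence stays ambiguous.
    return best if scores[best] > 0 else None


print(disambiguate("The dentist fitted a bridge over the missing tooth."))  # Bridge Dentistry
print(disambiguate("They played bridge and she won every trick."))          # Bridge Game
print(disambiguate("We need a bridge."))                                    # None
```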

Now what do I do with it?

• Tag documents
  – Consistently
  – Even depth of treatment
  – Full breadth of conceptual area
• Insert concepts in full text or as linked data
• Implement in search
• Use for internal statistics and analysis
• Track industry trends
• Create semantic fingerprints
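
The slides do not define a semantic fingerprint. One common reading, assumed here, is a weighted profile of the taxonomy terms assigned to a document or author, which can then be compared with other profiles. The term names and the helpers `fingerprint` and `cosine` are illustrative.

```python
import math
from collections import Counter


def fingerprint(tags: list[str]) -> Counter:
    """Weighted profile: how often each taxonomy term was assigned."""
    return Counter(tags)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two fingerprints (0.0 when either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical term assignments for three articles.
article_a = fingerprint(["Thin films", "Sputtering", "Thin films"])
article_b = fingerprint(["Thin films", "Sputtering"])
article_c = fingerprint(["Degenerate stars"])

print(round(cosine(article_a, article_b), 2))  # 0.95
print(cosine(article_a, article_c))            # 0.0
```

Because the comparison runs on controlled terms rather than raw words, it depends on the consistent tagging listed in the first bullet.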

The AIP Thesaurus

Hierarchy

Term Record

The AIP Thesaurus: Rulebase

This article is about (among other things) degenerate stars.

The text string “degenerate stars” occurs zero times in the text of the article.

But because the rulebase is tuned to recognize when certain other words appear near the text “star” or “stars”, the article was correctly indexed.

The AIP Thesaurus: Rulebase

If the word “star” or “stars” appears in the same sentence as “degenerate” or “compact”, MAI applies the term “Degenerate stars” instead of just using “Stars”.
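
The rule can be approximated in Python as follows. MAI's own rule syntax is not shown in the slides, so this is only a sketch of the logic: the function names are hypothetical, the sentence splitter is deliberately naive, and the sample text is invented.

```python
import re
from typing import Optional


def index_sentence(sentence: str) -> Optional[str]:
    """Apply the star rule to one sentence and return the indexing term, if any."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if not words & {"star", "stars"}:
        return None
    if words & {"degenerate", "compact"}:
        return "Degenerate stars"
    return "Stars"


def index_text(text: str) -> set[str]:
    """Split on sentence-ending punctuation (naive) and collect the assigned terms."""
    terms: set[str] = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        term = index_sentence(sentence)
        if term is not None:
            terms.add(term)
    return terms


text = ("White dwarfs are compact objects. "
        "Such a compact star cools slowly. "
        "Many stars are visible tonight.")
print(sorted(index_text(text)))  # ['Degenerate stars', 'Stars']
```

The sample text never contains the string "degenerate stars", yet the rule still assigns the term, which mirrors the case described on the previous slide.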

The AIP Thesaurus: Applications

Listing of the AIP Thesaurus terms in JATS. Includes the term, keyword-ID, weight, code.

Inline tagged terms (denoted by the highlighting). The keyword ID (kwd1.4) corresponds with the name in the previous screenshot.

HTML Header

Copyright © 2013 Access Innovations, Inc.

7. Content Recommender

More Articles on the same topic

Selected Article Search “thin film sputtering”

Grants available

Upcoming conferences on this topic

Authors working in this space

Taxonomy Driven Search Presentation

Navigate the full taxonomy “tree”

BROWSE

Auto-completion using the taxonomy

Guide the user
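
Auto-completion driven by a taxonomy can be sketched as a prefix lookup over entry terms that always suggests the preferred term. The small `TAXONOMY` below, including the synonym "Compact stars", is a made-up example, and `build_index` and `complete` are hypothetical helpers rather than any product's API.

```python
import bisect

# Preferred term -> non-preferred entry terms (illustrative data only).
TAXONOMY = {
    "Stars": [],
    "Degenerate stars": ["Compact stars"],
    "Sputtering": [],
    "Thin films": [],
}


def build_index(taxonomy: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Sorted (lowercased entry term, preferred term) pairs."""
    pairs = []
    for preferred, synonyms in taxonomy.items():
        for entry in [preferred, *synonyms]:
            pairs.append((entry.lower(), preferred))
    return sorted(pairs)


def complete(index: list[tuple[str, str]], prefix: str, limit: int = 5) -> list[str]:
    """Suggest preferred terms whose entry terms start with the typed prefix."""
    key = prefix.lower()
    start = bisect.bisect_left(index, (key,))  # first entry >= the prefix
    suggestions: list[str] = []
    for entry, preferred in index[start:]:
        if not entry.startswith(key):
            break
        if preferred not in suggestions:
            suggestions.append(preferred)
        if len(suggestions) == limit:
            break
    return suggestions


index = build_index(TAXONOMY)
print(complete(index, "s"))     # ['Sputtering', 'Stars']
print(complete(index, "comp"))  # ['Degenerate stars']
```

Typing a non-preferred form still leads the user to the preferred term, which is one way a taxonomy can guide the user.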

Copyright © 2005 - Access Innovations, Inc.

Taxonomy view

Thesaurus Term Record view

Suggested taxonomy descriptors

Visualization Strategies

Matrix Visualization

Software

Pattern Analysis: Domain Associations

Pattern Analysis: Gap Analyses
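
The slide shows only a title, so the following is an assumed interpretation: a gap is a taxonomy term with no tagged documents, and over-coverage is a term applied to more than a chosen share of the collection. The data, the threshold, and the helper names are all illustrative.

```python
from collections import Counter


def coverage(taxonomy_terms: list[str], doc_tags: dict[int, list[str]]) -> dict[str, int]:
    """Number of documents tagged with each taxonomy term."""
    counts = Counter(tag for tags in doc_tags.values() for tag in set(tags))
    return {term: counts.get(term, 0) for term in taxonomy_terms}


def gaps_and_overcoverage(counts: dict[str, int], total_docs: int,
                          max_share: float = 0.8) -> tuple[list[str], list[str]]:
    """Terms never used (gaps) and terms used on more than max_share of documents."""
    gaps = sorted(t for t, c in counts.items() if c == 0)
    over = sorted(t for t, c in counts.items()
                  if total_docs and c / total_docs > max_share)
    return gaps, over


# Hypothetical taxonomy and tagged collection.
terms = ["Stars", "Degenerate stars", "Sputtering", "Thin films"]
doc_tags = {
    1: ["Thin films", "Sputtering"],
    2: ["Thin films"],
    3: ["Thin films", "Stars"],
}

counts = coverage(terms, doc_tags)
print(gaps_and_overcoverage(counts, total_docs=len(doc_tags)))
# (['Degenerate stars'], ['Thin films'])
```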

Summary

• Taxonomy tool box
• Text extraction / mining for terms
• Gather synonyms
• Disambiguate terms
• Look for gaps and over coverage
• Map all conceptual groupings
  – Hierarchical, Associative, Equivalence
• Apply to content
• Leverage knowledge of the collection

Thank you

Marjorie M.K. Hlava, President

Access Innovations

[email protected]

The Semantic Enrichment Company

SMART CONTENT

About Access Innovations

Access Innovations are experts in content creation, enrichment, and conversion services. We provide services to semantically enrich and tag raw text into highly structured data. We deliver clean, well-formed, metadata-enriched content so our clients can reuse, repurpose, store, and find their knowledge assets. We go beyond the standards to build taxonomies and other data control structures as a solid foundation for your information. Our services and software allow organizations to use and present their information to both internal and external constituents by leveraging search, presentation, and e-commerce. We change search to found!

Quick Facts

• Founded in 1978
• Headquartered in Albuquerque, NM
• Privately held
• Delivered more than 2000 engagements