Tim Estes - Information Systems in an Entity Centric World



Tim Estes, CEO of Digital Reasoning, talks about using Hadoop and other scalable technologies, together with Digital Reasoning's analytics, for automated understanding of text at cloud scale. This presentation was delivered at Hadoop World in New York in October 2010.

Transcript of Tim Estes - Information Systems in an Entity Centric World

Page 1: Tim Estes - Information Systems in an Entity Centric World

Digital Reasoning™ ❘ Copyright © 2010 ❘ All Rights Reserved

Information Systems in an Entity-Centric World

Page 2: Tim Estes - Information Systems in an Entity Centric World

What Are We Covering?


‣ The New Entity-Oriented Mission in an Era of Information Overload

‣ Demonstration

‣ Entity-Oriented Analytics

‣ Processing

‣ Schema/Semantics

‣ The Takeaways

Page 3: Tim Estes - Information Systems in an Entity Centric World

Mission Evolution from Forces to Entities

[Figure: timeline with a Time axis marked 1980, 1990, 2000, and Current. The mission focus moves from Armies to Groups to Individuals to Multiple Cyber Presences for Individuals.]


Page 4: Tim Estes - Information Systems in an Entity Centric World

Information Explosion:

“A Wealth of Information creates a Poverty of Attention”

‣ Most of the time, what we don’t know poses the greatest risk. Missing critical information can have tragic consequences.

• President Obama, when referencing the Christmas Day bomb attempt, recognized that the intelligence community failed to “connect the dots”

‣ Hiring more analysts to review information manually is expensive and still insufficient

• Costs are not reasonable, and the lack of coverage is intolerable for intelligence, defense and law enforcement agencies

‣ Most automated solutions do not understand human language sufficiently to be useful

• Variations in grammar, spelling and context obscure meaning to software

‣ Today's search solutions link you to matching keywords and don't deliver understanding of hidden relationships buried in unstructured text

Global information created shown in Exabytes (Source: IDC)

80% of all data is unstructured

• 1.73 billion internet users as of September 2009

• 247 billion emails on average sent every day in 2009

• 126 million blogs on the internet

• 400+ million Facebook users generate more than 5 billion pieces of content every week

• 50 million Twitter messages were sent every day in January 2010 (~17-18 billion tweets so far…)

Tenfold growth in data in five years


Page 5: Tim Estes - Information Systems in an Entity Centric World

The Last 10 Years Have Been About

Search

The Next 10 Years Are Going To Be About

Summarization


Page 6: Tim Estes - Information Systems in an Entity Centric World

Community Response

[Figure: stages of the community response, plotted as Innovation against Time]

1992: Keyword Search

‣ Documents retrieved by matching keywords

‣ Limited, as matching keywords doesn’t provide understanding

2000: Improved Tools

‣ Faceted search and related filters make discovery easier

‣ Actual “meaning” of data remains hidden

2006: Enriched Data

‣ Extracts known entities, adding metadata to allow filtering

‣ Predetermined taxonomies of limited value for unknowns

Current: Understand Data

‣ Entity Resolution

‣ Horizontal Scalability

‣ Statistical Filtering

‣ No preconceptions

‣ Context discovered automatically
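
The "Understand Data" stage names entity resolution with context discovered automatically. As a rough illustration only, and not Digital Reasoning's own method, the sketch below groups mentions whose surrounding context words overlap, using Jaccard similarity and union-find. The mention data, the `resolve` function, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two sets of context words, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def resolve(mentions, threshold=0.5):
    """Cluster mentions whose context sets are similar enough.

    mentions: dict mapping a mention string to a set of context words.
    Returns a sorted list of clusters, each a sorted list of mention strings.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Walk up to the cluster representative, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in combinations(mentions, 2):
        if jaccard(mentions[a], mentions[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical mentions with the words seen around them.
mentions = {
    "J. Smith": {"nashville", "engineer", "acme"},
    "John Smith": {"nashville", "engineer", "acme", "manager"},
    "Jon Smyth": {"london", "banker"},
}
print(resolve(mentions))
```

No name dictionary or taxonomy is supplied here: the two Nashville mentions merge only because their contexts overlap, while the similar-looking "Jon Smyth" stays separate.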


Page 7: Tim Estes - Information Systems in an Entity Centric World

Transform Enterprise Knowledge: From Docs to Things


Page 8: Tim Estes - Information Systems in an Entity Centric World

Demonstration


Page 9: Tim Estes - Information Systems in an Entity Centric World

What Are We Using?


‣ Hadoop (Cloudera Distribution of Hadoop V3 – CDH3) for horizontally scalable ingestion and analytics processing

• CDH3 has beneficial tooling over vanilla Hadoop

• It's nice to have supported/assured capabilities by really smart guys

‣ Cassandra for persistence of entity data

• No Single Point of Failure – true peer data architecture

• Leveraging its very fast write performance

• Pushing and extending its distributed query capabilities

• Multi-point ingest and eventual consistency are useful for downstream multi-datacenter/multi-cloud deployment scenarios

‣ Complex and novel analytics algorithms working on Hadoop

• NLP and extraction technology is fresh and state of the art

• Associative Network algorithms are unique and patented

• Concept resolution from structured and unstructured data using an unsupervised learning approach

• Can also incorporate human guidance (augmented intelligence), but current capabilities give us a great way to control the fire hose of information before an analyst has to deal with it
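
The bullets on this slide describe Hadoop as the horizontally scalable processing layer. A minimal, single-process sketch of the MapReduce pattern that Hadoop distributes across machines is shown here, counting how often pairs of entities appear in the same document. The function names and sample documents are hypothetical, entities are assumed to have been extracted upstream, and none of this is Synthesys code.

```python
from collections import defaultdict
from itertools import combinations


def map_doc(entities):
    """Map step: emit ((entity_a, entity_b), 1) for each pair in one document."""
    for pair in combinations(sorted(set(entities)), 2):
        yield pair, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: total the counts for one entity pair."""
    return key, sum(values)


def cooccurrence(docs):
    """Run map, shuffle, and reduce in sequence over all documents."""
    emitted = (kv for doc in docs for kv in map_doc(doc))
    return dict(reduce_counts(k, v) for k, v in shuffle(emitted).items())


# Hypothetical documents, each reduced to the entities found in it.
docs = [
    ["Alice", "Bob", "Acme Corp"],
    ["Bob", "Alice"],
    ["Acme Corp", "Carol"],
]
counts = cooccurrence(docs)
print(counts[("Alice", "Bob")])
```

Because each map call sees one document and each reduce call sees one key, both steps can be spread over many machines without coordination, which is the property the slide relies on.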

Page 10: Tim Estes - Information Systems in an Entity Centric World

Synthesys™ in Entity Analytics


Page 11: Tim Estes - Information Systems in an Entity Centric World

Synthesys™ in Entity Analytics


Page 12: Tim Estes - Information Systems in an Entity Centric World

Entity-Centric Clouds - Takeaways


‣ Analysts (and people) want an entity-centric system vs. a document-centric one (but it has to work).

‣ Dealing with entities vs. documents or records makes the IT complexity jump 100X

‣ But it gives our analysts a 10X boost in productivity if done right (maybe 100X when we get around to writing Agents on the Entity-Centric Cloud)

‣ That’s where the cloud comes in

‣ We as a community have the processing and storage capacity to handle this now at the 100M-10B record level

‣ We should take a “conservative” approach to analysis where analytics is a continuous enrichment process and our current models are not dramatically privileged over potential future models

‣ Nearly all web-scale search / analytics systems use the following recipe:

• Distributed, columnar-type data store

• Horizontally scalable analytics framework (like Hadoop)

‣ Future systems are all distributed and learning all the time. Our investments today can be ready for this future with the right architectural and technology considerations.
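
One way to read the "continuous enrichment" takeaway is that each analytic model writes its results alongside earlier ones instead of overwriting them. The sketch below imitates a wide-column row per entity with an in-memory dictionary. The `EntityStore` class, the model names, and the sample values are hypothetical, and a real deployment would use a distributed store such as Cassandra.

```python
from collections import defaultdict


class EntityStore:
    """Toy stand-in for a wide-column store: row key -> {column -> value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def enrich(self, entity_id, attribute, value, model):
        # Each model writes its own column, so earlier results are never overwritten.
        self.rows[entity_id][(attribute, model)] = value

    def view(self, entity_id, attribute):
        """Return every model's value for one attribute of one entity."""
        return {
            model: value
            for (attr, model), value in self.rows[entity_id].items()
            if attr == attribute
        }


store = EntityStore()
store.enrich("entity-42", "type", "PERSON", model="extractor-v1")
store.enrich("entity-42", "type", "ORGANIZATION", model="extractor-v2")
store.enrich("entity-42", "city", "Nashville", model="extractor-v2")
print(store.view("entity-42", "type"))
```

Keeping the model name in the column key means a later model can disagree with an earlier one without erasing it, so the choice of which result to trust is deferred to read time.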

Page 13: Tim Estes - Information Systems in an Entity Centric World


Tim Estes | CEO | Digital Reasoning Systems | 730 Cool Springs Blvd, Suite 110, Franklin, TN 37067

office: 615.370.1860 | fax: 615.370.1865

email: [email protected]

website: http://www.digitalreasoning.com

twitter: http://twitter.com/spooksandgeeks