Big Data Challenges, Presented by Wes Caldwell at SolrExchange DC

Big Data Challenges

in the DoD and IC

Wes Caldwell, Chief Architect

Intelligent Software Solutions

Topics

• Introduction to ISS

• The growth of data

• Our customer’s data environment

• The need for effective big-data management

• Search as the cornerstone of a big-data strategy

About ISS

• Headquartered in Colorado Springs

• Other offices located in Washington DC, Hampton VA, Tampa FL, and Rome NY

• Innovative Solutions from “Space to Mud and Everything Between”

• Sole prime on multiple Air Force Research Labs programs IDIQ

• Currently Executing More Than 100 Software Development Projects

• Over 800 employees

• Strength in Solutions Development and Deployment

• Consistently Recognized as a Leader

• Recognized as a Deloitte Fast 50 Colorado company and a Deloitte Fast 500 company over eight consecutive years

• Three-time Inc. Magazine 500 winner

• 2009 Defense Company of the Year

ISS Solution Space/Value Proposition

• Reusable and license-free to US Federal Government (GOTS)

• Committed to providing best ROI to our customers by integrating leading open-source solutions into our products and services

• Scalable from a single desktop solution to large distributed networks with thousands of users

• Customizable to each organization’s unique analytical and information technology infrastructure

• Operationally proven, secure and accredited for all major classified networks

ISS Business Strategy

Government Off The Shelf (GOTS)

Commercial Off The Shelf (COTS)

Subject Matter Experts (SMEs)

• Low Barrier to Entry: No license fees to US Government Agencies

• Fast: Proven baseline provides immediate capability

• Turnkey: Highly customizable solutions can be implemented quickly with no development

• Solutions Oriented: Subject Matter Experts support implementation in each domain

• Low Cost: Cost of Adding Features is shared across large customer base; all customers benefit

Blending the best elements of each industry model to provide low risk, nonproprietary, high payoff solutions, fast!

The growth of data

• Most electronic information is not relational, but unstructured (textual, binary) or semi-structured (spreadsheet, RSS feed, etc.)

– In 2007, the estimated information content of all human knowledge was 295 exabytes (295 million terabytes)

– Data production will be 44 times greater in 2020 than in 2009

• Approx. 35 zettabytes total (35 billion terabytes)

• A majority of the data produced in the future will be unstructured

– A tremendous amount of information and knowledge is dormant within unstructured data
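The unit conversions on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units and reads the 35 zettabyte figure as the 2020 total that the 44x growth refers to. The implied 2009 volume is derived here and is not stated on the slide.

```python
# Decimal (SI) storage units expressed in terabytes.
TB_PER_EXABYTE = 10**6     # 1 EB = 1,000,000 TB
TB_PER_ZETTABYTE = 10**9   # 1 ZB = 1,000,000,000 TB


def exabytes_to_terabytes(exabytes):
    """Convert exabytes to terabytes."""
    return exabytes * TB_PER_EXABYTE


def zettabytes_to_terabytes(zettabytes):
    """Convert zettabytes to terabytes."""
    return zettabytes * TB_PER_ZETTABYTE


# Figures quoted on the slide.
human_knowledge_2007_tb = exabytes_to_terabytes(295)   # 295 million TB
projected_2020_tb = zettabytes_to_terabytes(35)        # 35 billion TB

# Assumption: 35 ZB is the 2020 total and growth from 2009 is 44x,
# so the implied 2009 volume is 35 / 44 zettabytes.
growth_factor = 44
implied_2009_zb = 35 / growth_factor

print(f"2007 estimate: {human_knowledge_2007_tb:,} TB")
print(f"2020 projection: {projected_2020_tb:,} TB")
print(f"Implied 2009 volume: {implied_2009_zb:.2f} ZB")
```

Under these assumptions the 2009 baseline comes out a little under one zettabyte.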

Our customer’s data environment

• Literally thousands of data sources/feeds from a variety of strategic, national, and tactical sources

– Media (documents, images, etc.)

– Human interactions

– Geospatial

– Open Source (News feeds, RSS)

– Imagery/Video

– Many more…

How our analysts feel

The need for effective “big-data” management

• Analysts are looking to extract knowledge from massive heterogeneous data sets, providing “actionable intelligence”

• Tactical environments absolutely demand effective management of data

– Time to live on the relevance of data collected can be very short

– Communications pipes aren’t as capable as those of large CONUS-based data centers, so reduction of data based on tactical conditions (e.g. AOR, Problem Domain) is critical

• Search and Analytics are key enablers that allow an analyst to reliably search through large amounts of information, and to focus their efforts on a subset of that information to perform deeper analysis
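The idea of reducing data by tactical conditions before it crosses a constrained link can be sketched as a simple filter. This is a minimal illustration, not the ISS implementation: the `Record` and `BoundingBox` types, the field names, and the domain labels are all hypothetical, and the AOR is simplified to a latitude/longitude box.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Record:
    doc_id: str
    domain: str            # hypothetical problem-domain label
    lat: float
    lon: float
    collected_at: datetime


@dataclass(frozen=True)
class BoundingBox:
    """Simplified AOR: a lat/lon box that does not cross the antimeridian."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat, lon):
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


def reduce_for_tactical_node(records, aor, domains, ttl, now):
    """Keep records that are inside the AOR, in a wanted domain, and still fresh."""
    return [
        r for r in records
        if aor.contains(r.lat, r.lon)
        and r.domain in domains
        and now - r.collected_at <= ttl
    ]


now = datetime(2012, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    Record("a", "maritime", 12.0, 45.0, now - timedelta(hours=1)),   # kept
    Record("b", "maritime", 12.0, 45.0, now - timedelta(days=3)),    # stale
    Record("c", "logistics", 12.0, 45.0, now - timedelta(hours=1)),  # wrong domain
    Record("d", "maritime", 50.0, 10.0, now - timedelta(hours=1)),   # outside AOR
]
aor = BoundingBox(min_lat=10.0, max_lat=15.0, min_lon=40.0, max_lon=50.0)
kept = reduce_for_tactical_node(records, aor, {"maritime"}, timedelta(hours=24), now)
print([r.doc_id for r in kept])
```

Only the records that pass all three conditions would be pushed to the tactical node, which keeps the volume sent over a limited communications pipe small.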

Search IS the cornerstone of an effective big-data strategy

Structured Content

Semi-Structured Content

Un-Structured Content

Content Cache (Haystacks)

Content Acquisition

Tenets

• Connector architecture

• Data normalization

• Data staging

• Data Compartmenting (Multiple Haystacks)
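One way to picture the content acquisition tenets is a small connector interface whose output is normalized into a common document shape and staged into separate haystacks. The sketch below is an assumption-laden illustration: `Connector`, `NormalizedDoc`, `RssConnector`, `TableRowConnector`, and `stage` are hypothetical names, and the input records are invented sample data.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class NormalizedDoc:
    """Common shape every source is normalized into (hypothetical schema)."""
    source: str
    doc_id: str
    title: str
    body: str
    compartment: str


class Connector(ABC):
    """One connector per source type; each knows how to fetch and normalize."""

    @abstractmethod
    def fetch(self) -> Iterable[dict]:
        ...

    @abstractmethod
    def normalize(self, raw: dict) -> NormalizedDoc:
        ...


class RssConnector(Connector):
    def __init__(self, items: List[dict], compartment: str):
        self._items = items
        self._compartment = compartment

    def fetch(self) -> Iterable[dict]:
        return iter(self._items)

    def normalize(self, raw: dict) -> NormalizedDoc:
        return NormalizedDoc("rss", raw["guid"], raw["title"].strip(),
                             raw.get("description", "").strip(), self._compartment)


class TableRowConnector(Connector):
    def __init__(self, rows: List[dict], compartment: str):
        self._rows = rows
        self._compartment = compartment

    def fetch(self) -> Iterable[dict]:
        return iter(self._rows)

    def normalize(self, raw: dict) -> NormalizedDoc:
        return NormalizedDoc("table", str(raw["id"]), raw["name"],
                             raw.get("notes", ""), self._compartment)


def stage(connectors: Iterable[Connector]) -> Dict[str, List[NormalizedDoc]]:
    """Normalize every record and stage it in the haystack for its compartment."""
    haystacks: Dict[str, List[NormalizedDoc]] = defaultdict(list)
    for connector in connectors:
        for raw in connector.fetch():
            doc = connector.normalize(raw)
            haystacks[doc.compartment].append(doc)
    return dict(haystacks)


haystacks = stage([
    RssConnector([{"guid": "n1", "title": " Port closure ", "description": "Storm"}], "open"),
    TableRowConnector([{"id": 7, "name": "Convoy schedule", "notes": "weekly"}], "restricted"),
])
print({name: [d.title for d in docs] for name, docs in haystacks.items()})
```

Keeping normalization inside each connector means a new source type only needs a new connector class, while the staging and compartmenting logic stays unchanged.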

Tenets

• Optimized Index of Content for Search and Discovery of Big Data

• Analyst Topics that “Shrink the Haystack” Search Features (Facets, Auto-Complete, Tagging, Comments, etc.)

• Semantic (Synonym) Search based on pluggable taxonomies
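Given the Solr venue, one plausible way to "shrink the haystack" with facets is Solr's standard select handler, using `fq` filter queries to narrow the result set and `facet.field` to return counts per field value. The sketch below only builds the request URL. The host, the core name `haystack`, and the field names are hypothetical, and nothing here is confirmed by the slides as the presenter's actual setup.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, core, text, filters, facet_fields, rows=10):
    """Build a Solr /select URL with filter queries and field facets.

    filters is a list of (field, value) pairs; values are wrapped in double
    quotes, so values that themselves contain quotes would need escaping.
    """
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    params += [("fq", f'{field}:"{value}"') for field, value in filters]
    if facet_fields:
        params.append(("facet", "true"))
        params += [("facet.field", name) for name in facet_fields]
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names for illustration only.
url = build_select_url(
    "http://localhost:8983",
    "haystack",
    "port closure",
    filters=[("source_type", "rss")],
    facet_fields=["source_type", "country"],
)
print(url)

# Decode the query string again to show the parameters Solr would receive.
query = parse_qs(urlsplit(url).query)
print(query["fq"], query["facet.field"])
```

Each `fq` narrows the candidate set without affecting relevance scoring, and the facet counts in the response tell the analyst which further filters are worth applying.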

Search/Discovery

Content Index

NLP Pipeline

Semantic Enrichment

Categorization

Named Entity Recognition

Clustering

Gazetteers

Tenets

• “Domain Spaces” that support pluggable entity recognition and categorization

• Continuous feedback loop that improves the system over time with analyst input

• Lexicon-based analytics that allows for targeted categorization across corpus of data
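The semantic enrichment tenets can be illustrated with a toy domain space that bundles a gazetteer (for entity recognition) and a lexicon (for categorization), plus a feedback hook that lets an analyst add terms. This is a simplified sketch under stated assumptions: `DomainSpace`, `recognize_entities`, `categorize`, and the sample terms are all hypothetical, and a production NLP pipeline would do far more than exact phrase lookup.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class DomainSpace:
    """Pluggable bundle of a gazetteer and a lexicon for one domain."""
    name: str
    gazetteer: Dict[str, str] = field(default_factory=dict)    # lowercase phrase -> entity type
    lexicon: Dict[str, Set[str]] = field(default_factory=dict)  # category -> trigger terms

    def add_feedback(self, surface: str, entity_type: str) -> None:
        """Analyst feedback: register a phrase the gazetteer missed."""
        self.gazetteer[" ".join(tokenize(surface))] = entity_type


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def recognize_entities(text: str, space: DomainSpace) -> List[Tuple[str, str]]:
    """Greedy longest-match lookup of gazetteer phrases in the token stream."""
    tokens = tokenize(text)
    max_len = max((len(phrase.split()) for phrase in space.gazetteer), default=0)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in space.gazetteer:
                found.append((candidate, space.gazetteer[candidate]))
                i += n
                break
        else:
            i += 1  # no phrase starts at this token
    return found


def categorize(text: str, space: DomainSpace) -> Set[str]:
    """Assign every category whose lexicon shares a term with the text."""
    tokens = set(tokenize(text))
    return {category for category, terms in space.lexicon.items() if tokens & terms}


space = DomainSpace(
    "maritime",
    gazetteer={"port of aden": "LOCATION", "coast guard": "ORGANIZATION"},
    lexicon={"shipping": {"vessel", "cargo", "port"}, "weather": {"storm", "cyclone"}},
)
text = "Coast Guard reports a cargo vessel delayed at the Port of Aden."
entities = recognize_entities(text, space)
categories = categorize(text, space)

# Feedback loop: an analyst adds a phrase, and later documents pick it up.
space.add_feedback("Gulf of Aden", "LOCATION")
later_entities = recognize_entities("Storm in the Gulf of Aden", space)
print(entities, categories, later_entities)
```

Swapping in a different `DomainSpace` changes what is recognized without touching the recognition code, which is the sense of "pluggable" assumed here.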

Tenets

• Data Reduction into focused “Data Perspectives”

• Data perspectives stored in optimized formats (e.g. Graph, Time Series, Geo, etc.) for the questions being asked

• Leveraging industry-standard parallel processing frameworks for scalable analytics
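As a rough illustration of data perspectives, the same enriched records can be reduced into a time-series view (documents per day) and a graph view (entity co-occurrence edges). The record layout and function names below are hypothetical, and the sample data is invented. The per-record work is independent, so it has the map-then-aggregate shape that a parallel processing framework could distribute, though which framework ISS used is not stated in the slides.

```python
from collections import Counter
from datetime import date
from itertools import combinations

# Hypothetical enriched records: a collection day plus recognized entities.
records = [
    {"day": date(2012, 1, 1), "entities": ["coast guard", "port of aden"]},
    {"day": date(2012, 1, 1), "entities": ["port of aden", "gulf of aden"]},
    {"day": date(2012, 1, 2), "entities": ["coast guard", "port of aden"]},
]


def time_series_perspective(records):
    """Time-series view: number of documents per day."""
    return Counter(record["day"] for record in records)


def graph_perspective(records):
    """Graph view: weighted edges between entities seen in the same document."""
    edges = Counter()
    for record in records:
        # Sort so each undirected edge has one canonical (a, b) key.
        for a, b in combinations(sorted(set(record["entities"])), 2):
            edges[(a, b)] += 1
    return edges


series = time_series_perspective(records)
graph = graph_perspective(records)
print(dict(series))
print(dict(graph))
```

Each perspective is much smaller than the source corpus and is shaped for one kind of question: "when did activity spike" for the time series, "who or what appears together" for the graph.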

Data Perspectives

Data

How can Search help you?

Have a great conference!!!
