Customer summit - big data (final)

BIG DATA Defined: Data Stack 3.0 Persistent Systems June 2012 1 24 July 2012

description

Presentation from the Persistent Customer Summit about Big Data

Transcript of Customer summit - big data (final)

Page 1: Customer summit  - big data (final)

BIG DATA Defined:

Data Stack 3.0

Persistent Systems

June 2012

1 24 July 2012

Page 2: Customer summit  - big data (final)

The Data Revolution is Happening Now

The growing need for large-volume, multi-structured “Big Data” analytics, as well as … “Fast Data”, have positioned the industry at the cusp of the most radical revolution in database architectures in 20 years.

We believe that the economics of data will increasingly drive competitive advantage.

Source: Credit Suisse Research, Sept 2011

Page 3: Customer summit  - big data (final)

Enterprise Value is Shifting to Data

[Figure: timeline showing enterprise value shifting across Mainframe, Operating Systems, Database, ERP Apps and Data, with year markers 1975, 1985, 1995, 2006 and 2013]

Page 4: Customer summit  - big data (final)

Organizational leaders want analytics to exploit their growing data and computational power to get smart, and get innovative, in ways they never could before.

Source: MIT Sloan Management Review, The New Intelligent Enterprise: Big Data, Analytics and the Path From Insights to Value, by Steve LaValle, Eric Lesser, Rebecca Shockley, Michael S. Hopkins and Nina Kruschwitz, December 21, 2010

What Data Can Do For You

Page 5: Customer summit  - big data (final)

Source: New York Times, September 2, 2009. Tesco, British Grocer, Uses Weather to Predict Sales By Julia Werdigier

http://www.nytimes.com/2009/09/02/business/global/02weather.html

Britain often conjures images of unpredictable weather, with downpours sometimes followed

by sunshine within the same hour — several times a day.

Such randomness has prompted Tesco, the country’s largest grocery chain, to create…its own software that calculates how shopping patterns change “for every degree of temperature and every hour of sunshine.”

Determining Shopping Patterns

British Grocer Tesco Uses Big Data by Applying Weather Results to Predict Demand and Increase Sales

Page 6: Customer summit  - big data (final)

GlaxoSmithKline is aiming to build direct relationships with 1 million consumers in a year using social media as a base for research and multichannel marketing. Targeted offers and promotions will drive people to particular brand websites where external data is integrated with information already held by the marketing teams.

Source: Big data: Embracing the elephant in the room By Steve Hemsley

http://www.marketingweek.co.uk/big-data-embracing-the-elephant-in-the-room/3030939.article

Tracking Customers in Social Media

GlaxoSmithKline Uses Big Data to Efficiently Target Customers

Page 7: Customer summit  - big data (final)

What does India Think?

Persistent enables Aamir Khan Productions and Star Plus to use Big Data to know how people react to some of the most excruciating social issues.

http://www.satyamevjayate.in/

Satyamev Jayate, Aamir Khan’s pioneering, interactive socio-cultural TV show, has caught the interest of the entire nation. It has already generated ~7.5M responses in 4 weeks over SMS, Facebook, Twitter, phone calls and discussion forums from viewers the world over. This data is being analyzed and delivered in real time to allow the producers to understand the pulse of the viewers, to gauge the appreciation for the show and, most importantly, to spread the message. Harnessing the truth from all this data is a key component of the show’s success.

Page 8: Customer summit  - big data (final)


Page 9: Customer summit  - big data (final)

WE ALREADY HAVE DATABASES. WHY DO WE NEED TO DO ANYTHING DIFFERENT?

Page 10: Customer summit  - big data (final)

● Transaction processing capabilities ideally suited for transaction-oriented operational stores.

● Data types – numbers, text, etc.

● SQL as the Query language

● De-facto standard as the operational store for ERP and mission critical systems.

● Interface through application programs and query tools

Relational Database Systems for Operational Store

Page 11: Customer summit  - big data (final)

● Operational data stores store on-line transactions – Many writes, some reads.

● Large fact table, multiple dimension tables

● Schema has a specific pattern – star schema

● Joins are also very standard and create cubes

● Queries focus on aggregates.

● Users access data through tools such as Cognos, Business Objects, Hyperion etc.

Enterprise Data Warehouse for Decision Support
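
The star-schema pattern on this slide (one large fact table, several dimension tables, standard joins, aggregate queries) can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table names, columns and rows are made-up assumptions, not taken from any real warehouse.

```python
import sqlite3

def build_star_schema(conn):
    """Create one fact table and two dimension tables (all names are hypothetical)."""
    conn.executescript("""
        CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE fact_sales (
            store_id INTEGER REFERENCES dim_store(store_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            amount REAL
        );
    """)
    # Illustrative rows only.
    conn.executemany("INSERT INTO dim_store VALUES (?, ?)",
                     [(1, "North"), (2, "South")])
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                     [(10, "Grocery"), (20, "Apparel")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 10, 100.0), (1, 20, 50.0), (2, 10, 70.0), (2, 10, 30.0)])

def sales_by_region_and_category(conn):
    """Join the fact table to both dimensions and aggregate, as a cube-style query would."""
    return conn.execute("""
        SELECT s.region, p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_store AS s ON s.store_id = f.store_id
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY s.region, p.category
        ORDER BY s.region, p.category
    """).fetchall()

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
rows = sales_by_region_and_category(conn)
```

The query touches the fact table once and reaches every dimension through a single key join, which is why such joins are "very standard" and easy for reporting tools to generate.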

Page 12: Customer summit  - big data (final)

Standard Enterprise Data Architecture

[Diagram: Data Stack 1.0: Operational Data Systems. Layers: Relational Databases, Application Logic, Presentation Layer]

[Diagram: Data Stack 2.0: Enterprise Data Warehouse Systems. Sources: Relational Databases, Legacy Data, Purchased Data, ERP Systems. Components: Extraction, Cleansing, Optimized Loader (ETL); Data Warehouse Engine; Metadata Repository; Analyze, Query]

Page 13: Customer summit  - big data (final)

One in two business executives believe that they do not have sufficient information across their organization to do their job

Source: IBM Institute for Business Value

Despite the two data stacks ..


Page 14: Customer summit  - big data (final)

Data has Variety

Less than 40% of the Enterprise Data is stored in Data Stack 1.0 or Data Stack 2.0.

Page 15: Customer summit  - big data (final)

Beyond the Operational Systems, data required for decision making is scattered within and beyond the enterprise

[Diagram: data sources by category]

Structured Data Sources: ERP Systems, CRM Systems, Enterprise Data Warehouse

Unstructured Data Sources: Email Systems, Collaboration/Wiki Sites, Document Repositories, Project artifacts, Employee Surveys, Customer Call Center Records

Cloud Data Sources: CRM Systems, Expense Management System, Vendor Collaboration Systems, Supply Chain Systems, Location and Presence Data

Public Data Sources: Weather forecasts, Demographic Data, Maps, Economic Data, Social Networking Data, Twitter Feeds

Other labels: Organizational Workflow, Sensor Data

Page 16: Customer summit  - big data (final)

5 Exabytes of information was created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing

Eric Schmidt at the Techonomy Conference, August 4, 2010 (1 exabyte = 10^18 bytes)

Data Volumes are Growing

Page 17: Customer summit  - big data (final)

The Continued Explosion of Data in the Enterprise and Beyond

80% of new information growth is unstructured content – 90% of that is currently unmanaged

[Chart: growth of data and content, 1990 to 2020. 2009: 800,000 petabytes. 2020: 35 zettabytes. 44x as much Data and Content Over Coming Decade]

Source: IDC, The Digital Universe Decade – Are You Ready?, May 2010
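
The "44x" figure follows directly from the two volumes on the chart. The short check below assumes decimal (SI) prefixes, i.e. 1 petabyte = 10^15 bytes and 1 zettabyte = 10^21 bytes; the variable names are illustrative.

```python
# Decimal (SI) unit sizes in bytes; assumed, since the slide does not state the convention.
PETABYTE = 10**15
ZETTABYTE = 10**21

data_2009 = 800_000 * PETABYTE   # 2009 volume from the chart
data_2020 = 35 * ZETTABYTE       # 2020 projection from the chart

data_2009_in_zb = data_2009 / ZETTABYTE   # 2009 volume expressed in zettabytes
growth = data_2020 / data_2009            # growth multiple over the decade

print(f"2009 volume: {data_2009_in_zb} ZB, growth: {growth}x")
```

In other words, 800,000 petabytes is 0.8 zettabytes, and 35 / 0.8 = 43.75, which the chart rounds to 44x.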

Page 18: Customer summit  - big data (final)

What comes first -- Structure or data?

[Diagram: Schema/Structure versus Data]

Structure First is Constraining
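
One way to read "data first, structure later" is schema-on-read: records are stored as they arrive, and a structure is applied only when a question is asked. The sketch below is a minimal illustration under that assumption; the sample records, field names and the `read_with_schema` helper are all hypothetical.

```python
import json

# Heterogeneous raw records stored as-is; no table definition was needed up front.
RAW_EVENTS = [
    '{"type": "sms", "sender": "viewer1", "text": "Great episode"}',
    '{"type": "tweet", "user": "viewer42", "text": "Moving story", "retweets": 3}',
    '{"type": "call", "duration_sec": 95}',
]

def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: project each record onto `fields`, using None when absent."""
    for line in raw_lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# The same raw data can be read with whatever structure the current question needs.
texts = [row["text"]
         for row in read_with_schema(RAW_EVENTS, ["type", "text"])
         if row["text"] is not None]
```

A structure-first store would have rejected the call record, or forced a schema change, before any of this data could be loaded.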

Page 19: Customer summit  - big data (final)

Time to create a new data stack for unstructured data. Data Stack 3.0.


Page 20: Customer summit  - big data (final)

The Path to Data Stack 3.0: Must support Variety, Volume and Velocity

Data Stack 3.0: Dynamic Data Platform

● Uncovering Key Insights

● Schema-less Approach

● PBs of Data

● End User Direct Access

● Structured + Semi Structured

Data Stack 2.0: Enterprise Data Warehouse

● Support for Decision Making

● Un-normalized Dimensional Model

● TBs of Data

● End User Access Through Reports

● Structured

Data Stack 1.0: Relational Database Systems

● Recording Business Events

● Highly Normalized Data

● GBs of Data

● End User Access through Ent Apps

● Structured

Page 21: Customer summit  - big data (final)

Can Data Stack 3.0 Address Real Problems?

● Large Data Volume at Low Price

● Diverse Data beyond Structured Data

● Queries that Are Difficult to Answer

● Answer Queries that No One Dare Ask

Page 22: Customer summit  - big data (final)

Time-out!

Internet companies have already addressed the same problems.

Page 23: Customer summit  - big data (final)

● Twitter has 140 million active users and more than 400 million tweets per day.

● Facebook has over 900 million active users, who generate an average of 3.2 billion Likes and Comments per day.

● 3.1 billion email accounts in 2011, expected to rise to over 4 billion by 2015.

● There were 2.3 billion internet users (2,279,709,629) worldwide in the first quarter of 2012, according to Internet World Stats data updated 31st March 2012.

Internet Companies have to deal with large volumes of unstructured real-time data.

Page 24: Customer summit  - big data (final)

● Hosted service

● Large cluster (1000s of nodes) of low-cost commodity servers.

● Very large amounts of data -- Indexing billions of documents, video, images etc.

● Batch updates.

● Fault tolerance.

● Hundreds of millions of users.

● Billions of queries every day.

Their data loads and pricing requirements do not fit traditional relational systems

Page 25: Customer summit  - big data (final)

● It is the platform that distinguishes them from everyone else.

● They required:

– high reliability across data centers

– scalability to thousands of network nodes

– huge read/write bandwidth

– support for large blocks of data which are gigabytes in size

– efficient distribution of operations across nodes to reduce bottlenecks

Relational databases were not suitable and would have been cost prohibitive.

They built their own systems

Page 26: Customer summit  - big data (final)

Companies have created business models to support and enhance this software.

Internet Companies have open-sourced the source code they created for their own use.

Page 27: Customer summit  - big data (final)

Open Source Rules!

Hadoop Infrastructure

Page 28: Customer summit  - big data (final)

What about support?

Page 29: Customer summit  - big data (final)

Allows for analysis of massive volumes of information

• Structured and Unstructured

• External and Internal

Thousands of users, millions of files, terabytes of data need to be handled

Commoditized hardware can be used to reduce costs

Big Data can and should integrate with existing enterprise information architecture

Only Big Data makes it possible!

Enterprises Always had Data. Now there is a way to handle it!

Page 30: Customer summit  - big data (final)

PERSISTENT SYSTEMS AND BIG DATA


Page 31: Customer summit  - big data (final)

Persistent Systems has an experienced team of Big Data Experts that has created the technology building blocks to help you implement a Big Data Solution that offers a direct path to unlock the value in your data.

Page 32: Customer summit  - big data (final)

Big Data Expertise at Persistent

● 10+ projects executed with Leading ISVs and Enterprise Customers

● Group dedicated to MapReduce, Hadoop and Big Data Ecosystem (formed 3 years ago)

● Engaged with the Big Data Ecosystem, including leading ISVs and experts

● Preferred Big Data Services Partner of IBM and Microsoft

Page 33: Customer summit  - big data (final)

Big Data Leadership and Contributions

● Code Contributions to Big Data Open Source Projects, including:

– Hadoop, Hive, and SciDB

● Dedicated Hadoop cluster in Persistent

● Created PeBAL – Persistent Big Data Analytics Library

● Created Visual Programming Environment for Hadoop

● Created Data Connectors for Moving Data

● Pre-built Solutions to Accelerate Big Data Projects


Page 34: Customer summit  - big data (final)

Persistent’s Big Data Offerings

1. Setting up and Maintaining Big Data Platform

2. Data Analytics on Big Data Platform

3. Building Applications on Big Data

Technology Assets

● Foundational Infrastructure and Platform (Built Upon Selected 3rd Party Big Data Platforms and Technologies; Cluster of Commodity Hardware)

● Persistent Platform Enhancement IP (PeBAL Analytics Library, Data Connectors)

● Persistent Pre-built Horizontal Solutions (Email, Text, IT Analytics, … )

● Persistent Pre-built Industry Solutions: Retail, Banking, Telco

● Visual Programming Tools

People Assets and Methodology

● Big Data Custom Services

● Extension of Your Team

● Discovery Workshop

● Training for Your Team

● Team Formation Process

● Cluster Sizing/Config

Page 35: Customer summit  - big data (final)

Persistent Next Generation Data Architecture

[Diagram. Legend: Commercial/Open Source Product, Persistent IP, External Data source. Labels: Connector Framework (shown twice); Email Server, Web Proxy, IBM Tivoli, BBCA; Social Media Connector (Twitter, Facebook); DW, NoSQL, RDBMS, Data Warehouse; MapReduce and HDFS; Cluster Monitoring; Admin App; Workflow Integration; PIG/JAQL; Hive; Text Analytics/GATE/SystemT; Persistent Analytics Library (PeBAL) with Graph Fn, Set Fn, …, Text Analytics Fn; Solutions; BI Tools; Reports & Alerts]

Page 36: Customer summit  - big data (final)

Persistent Big Data Analytics Library

WHY PEBAL

• Lots of common problems – not all of them are solved in Map Reduce

• PigLatin, Hive, JAQL are languages and not libraries – something is needed to run on top that is not tied to SQL-like interfaces

BENEFITS OF A READY MADE SOLUTION

• Proven – well written and tested

• Reuse across multiple applications

• Quicker implementation of map reduce applications

• High performance

FEATURES

• Organized as JAQL functions, PeBAL implements several graph, set, text extraction, indexing and correlation algorithms.

• PeBAL functions are schema agnostic.

• All PeBAL functions are tried and tested against well defined use cases.
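
To make the idea of a reusable indexing function concrete, here is a minimal sketch of an inverted-list build written in map/reduce style. It is plain Python that simulates the map, shuffle and reduce phases in memory; it is not PeBAL's API or JAQL code, and every function name and sample document is an assumption made for illustration.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in the document."""
    for term in sorted(set(text.lower().split())):
        yield term, doc_id

def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, doc_ids):
    """Reduce: produce the sorted posting list for one term."""
    return term, sorted(doc_ids)

def build_inverted_index(documents):
    """Run the three phases over a dict of {doc_id: text}."""
    pairs = [pair
             for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())

# Illustrative documents only.
documents = {"d1": "big data platform", "d2": "data warehouse", "d3": "big cluster"}
index = build_inverted_index(documents)
```

Packaging a routine like this once, behind a single call, is the kind of reuse the benefits list describes: applications ask for an index rather than rewriting the map and reduce steps each time.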

Page 37: Customer summit  - big data (final)

[Figure: PeBAL function categories: Graph, Set, Text Analytics, Inverted Lists, Web Analytics, Statistics]

Page 38: Customer summit  - big data (final)

Visual Programming Environment

ADOPTION BARRIERS

• Steep Learning Curve

• Difficult to Code

• Ad-hoc reporting can’t always be done by writing programs

• Limited tooling available

VISUAL PROGRAMMING ENVIRONMENT

• Use Standard ETL tool as the UI environment for generating PIG scripts

BENEFITS

• ETL Tools are widely used in Enterprises

• Can leverage large pool of skilled people who are experts in ETL and BI tools

• UI helps in iterative and rapid data analysis

• More people will start using it

Page 39: Customer summit  - big data (final)

Visual Programming Environment for Hadoop

[Diagram labels: Data Sources, Data, ETL Tool, Data Flow UI, Metadata, Persistent IP, PIG Convertor, PIG UDF Library, PIG code, Big Data Platform, HDFS/Hive, HDFS]

Page 40: Customer summit  - big data (final)

Persistent Connector Framework

OUT OF THE BOX

• Database, Data Warehouse

• Microsoft Exchange

• Web proxy

• IBM Tivoli

• BBCA

• Generic Push connector for *any* content

FEATURES

• Bi-directional connector (as applicable)

• Supports Push/Pull mechanism

• Stores data on HDFS in an optimized format

• Supports masking of data

WHY CONNECTOR FRAMEWORK

• Pluggable Architecture

20+ Years
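
A pluggable connector design with data masking could look roughly like the sketch below. This is an assumption-laden illustration, not Persistent's framework: the `Connector` interface, the registry, the `ingest` function and the masking rule are all hypothetical, and results are collected in a Python list rather than written to HDFS.

```python
import re
from abc import ABC, abstractmethod

class Connector(ABC):
    """Hypothetical plug-in interface: each source type implements pull()."""
    @abstractmethod
    def pull(self):
        """Yield raw text records from the source."""

class InMemoryEmailConnector(Connector):
    """Stand-in for a real mail-server connector; reads from a list of strings."""
    def __init__(self, messages):
        self._messages = messages

    def pull(self):
        yield from self._messages

# Simple pattern for email addresses; a real masking policy would be configurable.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_emails(record):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("<masked>", record)

REGISTRY = {}

def register(name, factory):
    """Plug a new connector type into the framework under a name."""
    REGISTRY[name] = factory

def ingest(name, *args, mask=mask_emails):
    """Build the named connector, pull its records and mask each one before storing."""
    connector = REGISTRY[name](*args)
    return [mask(record) for record in connector.pull()]

register("email", InMemoryEmailConnector)
masked = ingest("email", ["contact bob@example.com today", "no address here"])
```

Because `ingest` only depends on the `Connector` interface, a new source can be supported by registering another class without touching the ingestion code.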

Page 41: Customer summit  - big data (final)

Persistent Data Connectors


Page 42: Customer summit  - big data (final)

Persistent’s Breadth of Big Data Capabilities

Persistent IP for Big Data Solutions

● Horizontal and Vertical Pre-built Solutions

● Big Data Platform (PeBAL) analytics libraries and Connectors

Big Data Platform Components

● Tooling

• RDBMS/DWH to import/export data

• Text Analytics libraries

• Data Visualization using Web2.0 and reporting tools - Cognos, Microstrategy

• Ecosystem tools like - Nutch, Katta, Lucene

● IT Management

• Job configuration, management and monitoring with BigInsights’ job scheduler (MetaTracker)

• Job failure and recovery management

● Big Data Application Programming

• Deep JAQL expertise - JAQL Programming, Extending JAQL using UDFs, Integration of third party tools/libraries, Performance tuning, ETL using JAQL

• Expertise in MR programming - PIG, Hive, Java MR

• Deep expertise in analytics - Text Analytics - IBM’s text extraction solution (AQL + SystemT)

• Statistical Analytics - R, SPSS, BigInsights Integration with R

● Distributed File Systems

• HDFS

• IBM GPFS

● Cluster Layer

• Platform Setup on multi-node clusters, monitoring, VM based setup

• Product Deployment

Page 43: Customer summit  - big data (final)

Persistent Roadmap to Big Data

1. Learn: Discover and Define Use Cases

2. Initiate: Validate with a POC

3. Scale: Upgrade to Production if Successful

4. Measure: Measure Effectiveness and Business Value

5. Manage: Improve Knowledge Base and Shared Big Data Platform

Page 44: Customer summit  - big data (final)

Customer Analytics

Identifying your most influential customers?

Targeting influential customers is the best way to improve campaign ROI!

70 million customers and > 1 billion transactions over twenty years, narrowed down to a few thousand influential customers

1. Build a social graph of all customers

2. Overlay sales data on the graph

3. Identify influential customers using network analysis

4. Target these customers for promotions.
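
The steps above can be sketched on a toy graph. The slide does not say which network-analysis measure is used; the example below assumes PageRank, computed by simple power iteration in plain Python, and treats an edge (a, b) as "customer a is influenced by customer b". The customer names, edges and sales figures are invented for illustration.

```python
from collections import defaultdict

def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over directed (influenced, influencer) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A node with no outgoing edges spreads its rank evenly over all nodes.
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

def rank_customers(edges, sales, top_n):
    """Return the top_n customers by rank, with their sales overlaid as an attribute."""
    rank = pagerank(edges)
    ordered = sorted(rank, key=lambda customer: rank[customer], reverse=True)
    return [(customer, rank[customer], sales.get(customer, 0.0))
            for customer in ordered[:top_n]]

# Step 1: social graph (illustrative). Step 2: sales overlay (illustrative).
edges = [("alice", "carol"), ("bob", "carol"), ("dave", "carol"),
         ("carol", "alice"), ("erin", "bob")]
sales = {"alice": 1200.0, "bob": 300.0, "carol": 800.0, "dave": 150.0, "erin": 90.0}

# Steps 3 and 4: pick the most influential customers to target.
top = rank_customers(edges, sales, top_n=2)
```

At the scale quoted on the slide, the same iteration would run as a distributed job over the cluster rather than in a single process; the logic per iteration is unchanged.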

Page 45: Customer summit  - big data (final)

Overview of Email Analytics

● Key Business Needs

– Ensure compliance with respect to a variety of business and IT communications and information sharing guidelines.

– Provide an ongoing analysis of customer sentiment through email communications.

● Use Cases

– Quickly identify if there has been an information breach or if the information is being shared in ways that are not in compliance with organizational guidelines.

– Identify if a particular customer is not being appropriately managed.

● Benefits

– Ability to proactively manage email analytics and communications across the organization in a cost-effective way.

– Reduce the response time to manage a breach and proactively address issues that emerge through ongoing analysis of email.

Page 46: Customer summit  - big data (final)

Using Email to Analyze Customer Sentiment

Sense the mood of your customers through their emails

Carry out detailed analysis on customer team interactions and response times

Page 47: Customer summit  - big data (final)

Analyzing Prescription Data

1.5 million patients are harmed by medication errors every year

Identifying erroneous prescriptions can save lives!

Source: Center for Medication Safety & Clinical Improvement

Page 48: Customer summit  - big data (final)

Overview of IT Analytics

● Key Business Needs

– Troubleshooting issues in the world of advanced and cloud based systems is highly complex, requiring analysis of data from various systems.

– Information may be in different formats, locations, granularity, data stores.

– System outages have a negative impact on short-term revenue, as well as long-term credibility and reliability.

– The ability to quickly identify if a particular system is unstable and take corrective action is imperative.

● Use Cases

– Identify security threats and isolate the corresponding external factors quickly.

– Identify if an email server is unstable, determine the priority and take preventative action before a complete failure occurs.

● Benefits

– Reduced maintenance cost

– Higher reliability and SLA compliance

Page 49: Customer summit  - big data (final)

Consumer Insight from Social Media

Find out what customers are saying about your organization or product on social media

Page 50: Customer summit  - big data (final)

Insights for Satyamev Jayate – Variety of sources

[Diagram: Web/TV Viewer responses arrive through IVR; SMS; Web, Social Media (Structured); Social Media (unstructured). Inputs: Response to Pledge multiple choice questions; Web, emails, IVR/Calls; Individual blogs; Social widgets; Videos …]

1. Structured Analysis: Responses to Pledge, multiple choice questions

2. Unstructured Analysis: Responses to following questions

• Share your story

• Ask a question to Aamir

• Send a message of hope

• Share your solution

Content Filtering Rating Tagging System (CFRTS): L0, L1, L2 phased analytics

3. Impact Analysis: Crawling general internet for measuring the before & after scenario on a particular topic

Page 51: Customer summit  - big data (final)

Rigorous Weekly Operation Cycle producing instant analytics

Killer combo of Human+Software to analyze the data efficiently

● Topic opens on Sunday

● Live Analytics report is sent during the show

● Data capture from SMS, phone calls, social media, website

● System runs L0 analysis; L1 and L2 analysts continue

● JSONs are created for the external and internal dashboards

● Featured content is delivered thrice a day throughout the week.

● Episode Tags are refined and messages are re-ingested for another pass

Page 52: Customer summit  - big data (final)


Page 53: Customer summit  - big data (final)

Thank you

Anand Deshpande ([email protected])

http://in.linkedin.com/in/ananddeshpande

Persistent Systems Limited

www.persistentsys.com


Page 54: Customer summit  - big data (final)

Next Generation Sequencing


Sequencing machines are getting affordable

Running cost of sequencing is going down

NGS machines generate TBs of data per week.

Need to analyze this data in time

Analysis results are critical for human life, personalized medicines