Big Data - Umesh Bellur


Transcript of Big Data - Umesh Bellur

Page 1: Big Data - Umesh Bellur

Not Only Big Data, But FAST

Prof. Umesh Bellur

Department of Computer Science

The Indian Institute of Technology (IIT) Bombay, India

Page 2: Big Data - Umesh Bellur

What’s Big Data? No single definition; here is one from Wikipedia:

• “…difficult to process using on-hand database management tools or traditional data processing applications.”

• This is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”


Page 3: Big Data - Umesh Bellur

The Vs of Big Data


Page 4: Big Data - Umesh Bellur

Volume

• 12+ TBs of tweet data every day

• 25+ TBs of log data every day

• ? TBs of data every day

• 2+ billion people on the Web by end 2011

• 30 billion RFID tags today (1.3B in 2005)

• 4.6 billion camera phones worldwide

• 100s of millions of GPS-enabled devices sold annually

• 76 million smart meters in 2009… 200M by 2014

Page 5: Big Data - Umesh Bellur

Variety - A Single Perspective of the Digital Universe

[Figure: the Customer at the center, surrounded by data sources labeled Social Media, Gaming, Entertain, Banking Finance, Our Known History, and Purchase]

Page 6: Big Data - Umesh Bellur

Velocity (Speed)

• Data is being generated fast and needs to be processed fast

• Online data analytics

• Late decisions mean missed opportunities

• Examples

– E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you

– Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction


Page 7: Big Data - Umesh Bellur

Motivational Use Cases

Customer: Influence Behavior

• Product recommendations that are relevant and compelling

• Friend invitations to join a game or activity that expands business

• Preventing fraud as it is occurring, and preventing more proactively

• Learning why customers switch to competitors and their offers, in time to counter

• Improving the marketing effectiveness of a promotion while it is still in play

Page 8: Big Data - Umesh Bellur

“Fast” in Smart Grids

An electricity network that can intelligently integrate the actions of all users connected to it (generators, consumers and those that do both) in order to efficiently deliver sustainable, economic and secure electricity supplies

Page 9: Big Data - Umesh Bellur

No longer just an experiment!

Estimated investments of ~ 60-75 Billion Euro by 2020

Page 10: Big Data - Umesh Bellur

Hinges on

• Real time decision making to route energy from producers to consumers

• Based on fine-grained energy demand predictions.

• Millions of events a second have to be processed “on the fly”

– A billion events per day (10,000 smart plugs, per-second readings)
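The "billion events per day" figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes every one of the 10,000 smart plugs emits exactly one reading per second around the clock, and the constant names are illustrative.

```python
# Assumption: each smart plug emits exactly one reading per second, all day.
SMART_PLUGS = 10_000
READINGS_PER_PLUG_PER_SECOND = 1
SECONDS_PER_DAY = 24 * 60 * 60

events_per_second = SMART_PLUGS * READINGS_PER_PLUG_PER_SECOND
events_per_day = events_per_second * SECONDS_PER_DAY

print(f"{events_per_second:,} events per second")
print(f"{events_per_day:,} events per day")
```

Under these assumptions the deployment produces 864 million events per day, which the slide rounds to a billion.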

Page 11: Big Data - Umesh Bellur

Another Motivational Angle for “Fast”

Performance of disks:

                         1987         2004             Increase
CPU Performance          1 MIPS       2,000,000 MIPS   2,000,000 x
Memory Size              16 Kbytes    32 Gbytes        2,000,000 x
Memory Performance       100 usec     2 nsec           50,000 x
Disc Drive Capacity      20 Mbytes    300 Gbytes       15,000 x
Disc Drive Performance   60 msec      5.3 msec         11 x

Source: Seagate Technology Paper: “Economies of Capacity and Speed: Choosing the most cost-effective disc drive size and RPM to meet IT requirements”

Memory I/O is much faster than disk I/O!


Page 12: Big Data - Umesh Bellur

Processing Fast Data

• Streams of data that must be processed in one pass in real time:

– No random access allowed

– Continuous

– Massive

– Unbounded

– May be dense or sparse

– Events arrive faster than they can be “mined”

– Uncertainty: missing values

Lack of a real-time response may be either life-threatening or result in large revenue losses.
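To make "one pass, no random access" concrete, here is a minimal sketch of a single-pass aggregate. It uses Welford's online method for mean and variance, chosen here only as an illustration and not taken from the slides; the names `RunningStats` and `consume` are hypothetical. Missing readings are skipped rather than imputed, which is one of several possible policies.

```python
import math
from typing import Iterable, Optional


class RunningStats:
    """Single-pass mean and variance (Welford's method) in O(1) memory."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: Optional[float]) -> None:
        # Assumption: missing readings (None or NaN) are skipped, not imputed.
        if value is None or math.isnan(value):
            return
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of the values seen so far.
        return self._m2 / self.count if self.count else 0.0


def consume(stream: Iterable[Optional[float]]) -> RunningStats:
    stats = RunningStats()
    for reading in stream:  # each element is seen exactly once, in arrival order
        stats.update(reading)
    return stats


stats = consume(iter([2.0, None, 4.0, float("nan"), 6.0]))
print(stats.count, stats.mean, stats.variance)
```

The state kept between readings is three numbers, regardless of how long the stream runs, so the same code works on an unbounded stream.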

Page 13: Big Data - Umesh Bellur

Challenges

• Time/space constrained

– Not enough memory

– Can’t afford storing/revisiting the data

• Single-pass computation

– External memory algorithms for handling data sets larger than main memory cannot be used: they do not support continuous queries, and their real-time response is too slow

• Noise

– Missing data is a common feature

– Outliers

– Aged (stale) data
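One common way to cope with limited memory and aged data together is a time-based sliding window, which keeps only recent readings and expires the rest. The sketch below is illustrative: the class name `SlidingWindowAverage` is hypothetical, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowAverage:
    """Average over the last `window_seconds` of readings, expiring older ones."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._items: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        # Assumption: timestamps arrive in non-decreasing order.
        self._items.append((timestamp, value))
        self._total += value
        # Drop aged (stale) readings that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._items and self._items[0][0] <= cutoff:
            _, old_value = self._items.popleft()
            self._total -= old_value
        return self._total / len(self._items)


window = SlidingWindowAverage(window_seconds=3)
events = [(0, 10.0), (1, 20.0), (2, 30.0), (3, 40.0), (5, 50.0)]
averages = [window.add(t, v) for t, v in events]
print(averages)
```

Memory use is bounded by the number of readings that fit in one window, not by the length of the stream.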

Page 14: Big Data - Umesh Bellur

So…..

• No time to stop and smell the roses

• Only one chance to look at the data…

Page 15: Big Data - Umesh Bellur

Harnessing Big Data – the Evolution

• OLTP: Online Transaction Processing (DBMSs)

• OLAP: Online Analytical Processing (Data Warehousing)

• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)


Page 16: Big Data - Umesh Bellur

DBMS vs. DSMS

DBMS

[Figure: SQL Query → Query Processing (Main Memory, Disk) → Result]

• Persistent relations (relatively static, stored)

• Random access

• “Unbounded” disk store

• Only current state matters

• No real-time services

DSMS

[Figure: Continuous Query (CQ) → Query Processing (Main Memory) over Data Stream(s) → Result]

• Transient

• Continuous queries

• Bounded memory

• Real-time requirements
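The difference between a one-time query and a continuous query can be sketched in a few lines. This is not any particular DSMS API: the function names and the `plug_id` and `load_kw` fields are hypothetical, and a Python generator stands in for a registered continuous query.

```python
from typing import Dict, Iterable, Iterator, List

Reading = Dict[str, float]


def one_shot_query(table: List[Reading], threshold: float) -> List[Reading]:
    """DBMS style: run once over stored data and return a finite result."""
    return [row for row in table if row["load_kw"] > threshold]


def continuous_query(stream: Iterable[Reading], threshold: float) -> Iterator[Reading]:
    """DSMS style: stay registered and emit a result as each matching tuple arrives."""
    for row in stream:  # the stream may never end
        if row["load_kw"] > threshold:
            yield row


# Hypothetical smart-plug readings used only for illustration.
readings = [
    {"plug_id": 1, "load_kw": 0.4},
    {"plug_id": 2, "load_kw": 2.5},
    {"plug_id": 3, "load_kw": 3.1},
]
stored_result = one_shot_query(readings, threshold=2.0)
streamed_result = list(continuous_query(iter(readings), threshold=2.0))
print(stored_result == streamed_result)
```

On a finite input both forms return the same rows. The continuous form differs in that it never needs the whole input to exist at once and can keep producing results for as long as data arrives.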

Page 17: Big Data - Umesh Bellur

Technical Aspects of DSMS

Techniques

• Synopsis

– Random sampling

– Histograms

– Wavelets

• Aging

– Sliding window

• Stream processing

– Temporal and spatial operators

– Distributed complex event processing

• Approximations

– Deterministic bounds

– Probabilistic bounds
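As an illustration of the random sampling synopsis, the sketch below uses reservoir sampling (Algorithm R), one standard way to keep a fixed-size uniform sample of a stream whose length is not known in advance. The slides do not name a specific algorithm, so this choice and the function name `reservoir_sample` are assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for index, item in enumerate(stream):
        if index < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)  # randint is inclusive on both ends
            if slot < k:
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=5, rng=random.Random(42))
print(sample)
```

The reservoir never holds more than `k` items, so the synopsis fits in bounded memory while each element is inspected only once.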

Page 18: Big Data - Umesh Bellur

Maturity Model

Monitoring

Insights

Process Optimization

Data Monetization

Metamorphosis

Page 19: Big Data - Umesh Bellur

(Role of) Standards in Big Data Adoption

• OGC Standards

– SOS: Sensor Observation Service

• IEEE Big Data Initiative (BDI)

– Metadata standards for Big Data management

– Verticals: healthcare, energy, etc.

• ISO/IEC CD 20546: Big Data Vocabulary

• NIST Public Working Group on Big Data

• ITU-T Technology Watch report on Big Data

• …

Page 20: Big Data - Umesh Bellur

Summary

• Fast data processing is fundamentally different from Big Data processing

– DSMS vs. Hadoop, data warehousing, etc.

• More and more applications have real-time needs.

• While there are some solutions, there is a wide-open space for research and technological innovation.

– Role of standards cannot be emphasized enough

Page 21: Big Data - Umesh Bellur

Questions?

[email protected]

Page 22: Big Data - Umesh Bellur

NIST Reference Architecture for Big Data