Big Data - Umesh Bellur
sts-forum-2016 -
Not Only Big Data But FAST
Prof. Umesh Bellur
Department of Computer Science
The Indian Institute of Technology (IIT) Bombay, India
What’s Big Data?
No single definition; here is one from Wikipedia:
• “…difficult to process using on-hand database management tools or traditional data processing applications.”
• This is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”
The Vs of Big Data
Volume
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• ? TB of data every day
• 2+ billion people on the Web by end of 2011
• 30 billion RFID tags today (1.3 billion in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009, 200 million by 2014
Variety - A Single Perspective of the Digital Universe
[Figure: sources that feed a single view of the customer: Social Media, Gaming, Entertainment, Banking and Finance, Our Known History, Purchase]
Velocity (Speed)
• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions mean missed opportunities
• Examples
– E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
– Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction
Motivational Use Cases
Customer: Influence Behavior
• Product recommendations that are relevant and compelling
• Friend invitations to join a game or activity that expands business
• Preventing fraud as it is occurring, and preventing more proactively
• Learning why customers switch to competitors and their offers, in time to counter
• Improving the marketing effectiveness of a promotion while it is still in play
“Fast” in Smart Grids
An electricity network that can intelligently integrate the actions of all users connected to it (generators, consumers and those that do both) in order to efficiently deliver sustainable, economic and secure electricity supplies
No longer just an experiment!
Estimated investments of ~ 60-75 Billion Euro by 2020
Hinges on
• Real time decision making to route energy from producers to consumers
• Based on fine-grained energy demand predictions.
• Millions of events a second have to be processed “on the fly”
– A billion events per day (10,000 smart plugs, per-second readings)
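The "billion events per day" figure can be checked with a short calculation. The sketch below assumes each smart plug reports exactly one reading per second, which is how the slide's "per-second readings" is read here; the function name `events_per_day` is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def events_per_day(num_plugs: int, readings_per_second: int = 1) -> int:
    """Daily event count for devices that each report at a fixed rate."""
    return num_plugs * readings_per_second * SECONDS_PER_DAY


daily = events_per_day(10_000)
print(f"{daily:,} events per day")  # 864,000,000, i.e. roughly a billion
```

Under that assumption, 10,000 plugs already produce about 0.86 billion events per day, so larger deployments quickly exceed it.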
Another Motivational Angle for “Fast”
Performance of disks:
                          1987         2004             Increase
CPU Performance           1 MIPS       2,000,000 MIPS   2,000,000 x
Memory Size               16 Kbytes    32 Gbytes        2,000,000 x
Memory Performance        100 usec     2 nsec           50,000 x
Disc Drive Capacity       20 Mbytes    300 Gbytes       15,000 x
Disc Drive Performance    60 msec      5.3 msec         11 x
Source: Seagate Technology Paper: “Economies of Capacity and Speed: Choosing the most cost-effective disc drive size and RPM to meet IT requirements”
Memory I/O is much faster than disk I/O!
Processing Fast Data
• Streams of data that must be processed in one pass in real time:
– No random access allowed
– Continuous
– Massive
– Unbounded
– May be dense or sparse
– Events arrive faster than they can be “mined”
– Uncertainty: missing values
Lack of a real-time response may be either life threatening or result in large revenue losses.
Challenges
• Time/space constrained
– Not enough memory
– Can’t afford storing/revisiting the data
• Single-pass computation
– External memory algorithms for handling data sets larger than main memory cannot be used: they do not support continuous queries, and their real-time response is too slow
• Noise
– Missing data is a common feature
– Outliers
– Aged (stale) data
So…..
• No time to stop and smell the roses
• Only one chance to look at the data…
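As an illustration of what "only one chance to look at the data" implies, here is a minimal sketch of a single-pass, constant-memory computation. It uses Welford's method for a running mean and variance, a standard technique that the slides do not name. Treating `None` and NaN as the "missing data" mentioned above is an assumption, and the class name `RunningStats` is hypothetical.

```python
import math
from typing import Iterable, Optional


class RunningStats:
    """Single-pass mean and variance in O(1) memory (Welford's method)."""

    def __init__(self) -> None:
        self.count = 0      # number of valid readings seen so far
        self.missing = 0    # readings skipped because they were missing
        self.mean = 0.0
        self._m2 = 0.0      # sum of squared deviations from the running mean

    def update(self, value: Optional[float]) -> None:
        # Assumption: missing readings arrive as None or NaN and are skipped.
        if value is None or math.isnan(value):
            self.missing += 1
            return
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two readings have arrived."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def consume(stream: Iterable[Optional[float]]) -> RunningStats:
    """Read each element exactly once; nothing is stored or revisited."""
    stats = RunningStats()
    for reading in stream:
        stats.update(reading)
    return stats


stats = consume(iter([2.0, None, 4.0, 4.0, float("nan"), 6.0]))
print(stats.count, stats.missing, stats.mean, stats.variance)
```

The state kept per stream is four numbers regardless of how many readings arrive, which is the property the time/space constraint asks for.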
Harnessing Big Data – the Evolution
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
DBMS vs. DSMS
[Figure: in a DBMS, an SQL query is processed in main memory over data on disk and returns a result; in a DSMS, a continuous query (CQ) is processed in main memory over data stream(s) and returns a result]
DBMS
• Persistent relations (relatively static, stored)
• Random access
• “Unbounded” disk store
• Only current state matters
• No real-time services
DSMS
• Transient streams
• Continuous queries
• Bounded memory
• Real-time requirements
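To make the contrast concrete, the following is a minimal sketch of a continuous query: it emits an updated result for every arriving event instead of answering once over stored data. The query (average over a time-based sliding window), the `(timestamp, value)` event shape, and the function name `windowed_average` are assumptions made for illustration. Events are assumed to arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, value)


def windowed_average(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Event]:
    """For each event, yield (timestamp, average of values in the window).

    The window covers the interval (timestamp - window_seconds, timestamp].
    Assumes events arrive in non-decreasing timestamp order.
    """
    window: Deque[Event] = deque()
    total = 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Age out events that have fallen out of the window.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total / len(window)


readings = [(0, 10.0), (1, 20.0), (2, 30.0), (5, 40.0)]
results = list(windowed_average(readings, window_seconds=3))
print(results)  # [(0, 10.0), (1, 15.0), (2, 20.0), (5, 40.0)]
```

Memory here is bounded by the number of events inside one window, not by the length of the stream. A real DSMS would additionally have to handle out-of-order arrival, which this sketch does not.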
Technical Aspects of DSMS
Techniques
• Synopsis
– Random sampling
– Histograms
– Wavelets
• Aging
– Sliding window
• Stream processing
– Temporal and spatial operators
– Distributed complex event processing
• Approximations
– Deterministic bounds
– Probabilistic bounds
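One common way to build a random-sampling synopsis over an unbounded stream is reservoir sampling (Algorithm R). The slides mention random sampling but do not name a specific algorithm, so this choice is an assumption. The sketch keeps a uniform random sample of at most `k` items in a single pass; the function name and parameters are illustrative.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k items, seen in one pass."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=5, rng=random.Random(42))
print(sample)
```

The reservoir never grows beyond `k` items, so the synopsis fits in bounded memory however long the stream runs. Aggregates computed on the sample are approximations of the stream's true values.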
Maturity Model
• Monitoring
• Insights
• Process Optimization
• Data Monetization
• Metamorphosis
(Role of) Standards in Big Data Adoption
• OGC Standards
– SOS: Sensor Observation Service
• IEEE Big Data Initiative (BDI)
– Metadata standards for Big Data management
– Verticals: healthcare, energy, etc.
• ISO/IEC CD 20546
– Big Data Vocabulary
• NIST Public Working Group on Big Data
• ITU-T Technology Watch report on Big Data
• …
Summary
• Fast data processing is fundamentally different from Big Data processing
– DSMS vs. Hadoop/data warehousing, etc.
• More and more applications have real-time needs.
• While there are some solutions, there is wide-open space for research and technological innovation.
– Role of standards cannot be emphasized enough
Questions?
NIST Reference Architecture for Big Data