Big Data vs Data Warehousing
Thomas Kejser
http://blog.kejser.org
@thomaskejser
Bigdata vs. Data Warehousing
Synergy or Conflict?
Who is this Guy?
• Formerly: Lead SQLCAT EMEA
• Now: CTO Fusion-io EMEA
• 15 years of database experience
• Performance Tuner
[Chart: world population in billions of humans by year, 2000 to 2250; vertical axis from 5 to 10]
Source: United Nations Projections
Human Consciousness Doesn’t Scale
Text Messages in a Table
CREATE TABLE AllTexts (
    Sender BIGINT              -- 8B
  , Receiver BIGINT            -- 8B
  , SenderLocation BIGINT      -- 8B
  , ReceiverLocation BIGINT    -- 8B
  , Time DATETIME              -- 8B
  , SMS VARCHAR(140)           -- 140B
)
-- = 180 bytes per row
How much do we text?
• World Average
  • 6.1 Trillion Text Messages / year
  • About 80% cell phone coverage
  • 7 billion people
  • 3 messages/day/person
• But:
  • Teenagers: 50 messages/day
Source: Pew Internet Research 2010 & ITU
How much will we EVER text?
• 9B people acting like teenagers (in 2050)
  • 50 texts/day
• That’s 450 billion texts/day
  • 164 Trillion texts/year (about 27x today's 6.1 Trillion)
  • 180 bytes each
  • Assume x3 compression
• Approximation: 10 Petabytes/year in 2050
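The estimate above can be checked with a few lines of arithmetic. This is only a sketch using the slide's own figures; the 365-day year and decimal units (1 PB = 10**15 bytes) are assumptions, and the function name is made up for illustration.

```python
# Back-of-envelope check of the 2050 text-message estimate.
# Inputs are the slide's figures; decimal petabytes are an assumption.

PEOPLE = 9_000_000_000          # 9B people in 2050
TEXTS_PER_PERSON_PER_DAY = 50   # everyone texts like a teenager
ROW_BYTES = 180                 # row size of the AllTexts table
COMPRESSION = 3                 # assumed x3 compression
DAYS_PER_YEAR = 365             # assumption


def yearly_text_volume():
    """Return (texts per day, texts per year, compressed PB per year)."""
    texts_per_day = PEOPLE * TEXTS_PER_PERSON_PER_DAY
    texts_per_year = texts_per_day * DAYS_PER_YEAR
    raw_bytes = texts_per_year * ROW_BYTES
    compressed_bytes = raw_bytes / COMPRESSION
    return texts_per_day, texts_per_year, compressed_bytes / 10**15


texts_per_day, texts_per_year, petabytes = yearly_text_volume()
print(f"{texts_per_day / 1e9:.0f} billion texts/day")
print(f"{texts_per_year / 1e12:.0f} trillion texts/year")
print(f"{petabytes:.1f} PB/year after compression")
```

The result lands just under 10 PB/year, which matches the slide's rounded figure.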
[Chart: hard drive capacity in GB (log scale) by year]
Can it be done?
Moore’s Hard Drives
How Large is this/year?
Hard Disk (4TB): 2.5”
Wine Bottle (75cl): 4.0”
About 1500 Wine Bottles
In the Data Center
• Calculating:
  • 2U Storage = 24 Disks (includes compute)
  • 4TB per Disk
  • 100TB in 2U (a bit less)
  • 10PB = 200U storage
• About six racks
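A rough version of the rack calculation, as a sketch. The disk size, disks per 2U chassis and the 10 PB target come from the slide; the 42U rack height is an assumption that the slide does not state, and the function name is hypothetical.

```python
import math

DISK_TB = 4          # slide: 4TB per disk
DISKS_PER_2U = 24    # slide: 2U storage = 24 disks
TARGET_PB = 10       # slide: 10 PB per year
RACK_UNITS = 42      # assumption: standard 42U rack, not stated on the slide


def storage_footprint(target_pb=TARGET_PB):
    """Return (TB per 2U chassis, chassis count, rack units, racks)."""
    tb_per_chassis = DISK_TB * DISKS_PER_2U               # 96 TB, "a bit less" than 100 TB
    chassis = math.ceil(target_pb * 1000 / tb_per_chassis)
    rack_units = chassis * 2                              # each chassis is 2U
    racks = rack_units / RACK_UNITS
    return tb_per_chassis, chassis, rack_units, racks


tb_per_chassis, chassis, rack_units, racks = storage_footprint()
print(tb_per_chassis, chassis, rack_units, racks)
```

With these assumptions the storage alone fills about five racks; the slide's "about six" is in the same range once space for switches and other gear is counted.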
Warehouses Serve us Well..
• Good Management Interfaces
• Standard SQL
  • with a few extensions
• Appliances
  • Support system
  • Homogeneous HW
  • In chunks
… And it is Becoming a Commodity
vs.
PDW vs. Hive – Scan/seek
SELECT count(*) FROM lineitem
SELECT max(l_quantity) FROM lineitem WHERE l_orderkey > 1000 AND l_orderkey < 100000 GROUP BY l_linestatus
[Chart: run time in seconds (axis 0 to 1400) for Query 1 and Query 2, Hive vs. PDW]
PDW vs. Hive - Joins
[Chart: run time in seconds (axis 0 to 4000) for Hive, PDW-U and PDW-P]
PDW-U:
• orders partitioned on c_custkey
• lineitem partitioned on l_partkey
PDW-P:
• orders partitioned on o_orderkey
• lineitem partitioned on l_orderkey
SELECT max(l_orderkey) FROM orders JOIN lineitem ON l_orderkey = o_orderkey
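Why the partitioning choice matters for this join can be shown with a toy model. This is a sketch, not how PDW or Hive work internally: the four-node cluster, the modulo placement, the made-up rows and the names node_for and rows_to_move are all assumptions for illustration.

```python
NODES = 4  # hypothetical cluster size


def node_for(key):
    # Simple modulo placement; a real system would use a hash function.
    return key % NODES


# Tiny made-up tables: (o_orderkey, custkey) and (l_orderkey, l_partkey).
orders = [(k, k * 7 % 13) for k in range(1, 21)]
lineitem = [(k, p) for k in range(1, 21) for p in (k * 3 % 11, k * 5 % 17)]


def rows_to_move(order_col, line_col):
    """Count lineitem rows stored on a different node than their matching order."""
    order_node = {o[0]: node_for(o[order_col]) for o in orders}
    moved = 0
    for row in lineitem:
        if node_for(row[line_col]) != order_node[row[0]]:
            moved += 1
    return moved


# PDW-P style: both tables are placed by the join key, so the join is local.
partitioned = rows_to_move(order_col=0, line_col=0)
# PDW-U style: tables are placed by customer key and part key instead.
unpartitioned = rows_to_move(order_col=1, line_col=1)
print(partitioned, unpartitioned)
```

When both tables are placed by the join key, no row has to cross the network before the join; with any other placement, some rows must be shuffled first.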
• Thread startup times
• Co-location awareness
• Files vs. optimized DB memory structures
• Column stores and other DB tech
Generic is good…
… but when there is structure, make use of it!
What does Big Data need to Catch up?
• What is Bigdata?
Very Unstructured Data
How many Pictures of Cats?
• Flickr Today:
  • 300MB/month
  • 2GB/year
  • 51M users (too small?)
• Estimate: 102 PB / year
  • 10 x text messages
Source: Wikipedia
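The picture estimate is the same kind of arithmetic as the text-message one. A short sketch using the slide's figures, assuming decimal units (1 PB = 1,000,000 GB):

```python
USERS = 51_000_000      # slide: 51M users
GB_PER_USER_YEAR = 2    # slide: 2GB/year per user
TEXT_PB_PER_YEAR = 10   # slide's earlier estimate for text messages

# GB -> PB with decimal units (assumption).
photo_pb_per_year = USERS * GB_PER_USER_YEAR / 1_000_000
ratio_to_text = photo_pb_per_year / TEXT_PB_PER_YEAR
print(photo_pb_per_year, ratio_to_text)
```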
How big is this in wine bottles?
We have learned how to store it!
What is HDFS?
• Distributed File System
• Open Source
• No more SAN
• The Failure Unit is the Server
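"The failure unit is the server" can be illustrated with a toy replica placement. This is a sketch only: the replication factor of 3 is the HDFS default, but the server names, the round-robin placement and the function names are made up, and real HDFS uses a rack-aware placement policy instead.

```python
REPLICATION = 3                            # HDFS default replication factor
SERVERS = ["s1", "s2", "s3", "s4", "s5"]   # hypothetical servers


def place_blocks(num_blocks, servers=SERVERS, replication=REPLICATION):
    """Place each block on `replication` distinct servers, round-robin."""
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            servers[(block + i) % len(servers)] for i in range(replication)
        ]
    return placement


def surviving_replicas(placement, failed_server):
    """Drop every replica that lived on the failed server."""
    return {
        block: [s for s in replicas if s != failed_server]
        for block, replicas in placement.items()
    }


placement = place_blocks(10)
after_failure = surviving_replicas(placement, "s3")
min_left = min(len(replicas) for replicas in after_failure.values())
print(min_left)
```

Losing a whole server still leaves at least two copies of every block, which is why no shared SAN is needed to protect the data.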
Fully unstructured data is boring
…Unless you get money for storing it
Acquiring Personal Information
Your Semi-structured Data, the Old Fashioned Way
The Social Angle
Who do you talk to and how often?
The Reasons
Why do you own a cell phone?
Your Semi-structured Data, For Free
- at The Pub, Saturday, 1:39am
Big Value
Extraction of meaning and insight from semi-structured data
Extracting Meaning from Humans
Method                            Examples
Turn semi-structure to structure  Image recognition, network proximity and super nodes, social media
Needle in a haystack              Extract outliers, Fraud
Herd behaviors                    Clustering, Pattern Recognition, “Customers who bought this also bought”
Text classification and search    Text indexes, syntactic counting, pagerank
Text to structure                 Semantic analysis, loose structure into structure
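As one example of the "herd behaviors" row, "Customers who bought this also bought" can be approximated by counting which items share a basket. This is a minimal sketch with made-up baskets and hypothetical function names, not a description of any real recommender.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets, purely for illustration.
baskets = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"laptop", "charger"},
]


def co_purchases(baskets):
    """Count how often each pair of items appears in the same basket."""
    pairs = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            pairs[(a, b)] += 1
    return pairs


def also_bought(item, baskets, top=2):
    """Return the items most often bought together with `item`."""
    related = Counter()
    for (a, b), n in co_purchases(baskets).items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return related.most_common(top)


print(also_bought("phone", baskets))
```

On real data the pair counting is the expensive part, which is where lots of CPU cores and distributed algorithms come in.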
Find New Customers
“Michael, who is respected among his peers, often talks about his new, cool gadgets”
Michael
Thomas
Tommy
Cross Sell
“Families who own an Aston Martin will often buy a Mini Cooper too”
Free Information
Need: Lots of CPU Cores!
Need: Data Centers!
Provisioning has to be REALLY fast
Things to Learn for the Future
• Get good at
  • Statistics (again)
  • Distributed Algorithms
  • Tuning
• Understand Physical Constraints
• Acquire deep domain knowledge
Something is Changing
You
Today: CAPEX Hardware
Tomorrow: OPEX Hardware
The Mother of All Stovepipes
Data you are afraid to lose
Big Data / Staging (No Model)
Delivery (Model)
Data You actually need
Synergy
Create Structure for me
Here is a table
Warehouse
Applying Social Media to Structure
Summary
Data Warehouse
• There is a model
• Seek Co-location
• Respond in seconds
• Calculate first, query after
• Expensive HW
• Optimise for target HW
• Homogeneous HW
• Pay vendor, expect optimised
Big Data
• Don’t bother modeling!
• Optional Co-Location
• Respond in minutes
• Calculate while querying
• Cheap HW
• Good enough on all HW
• Heterogeneous HW
• Free license, optimise yourself
Q&A