Distributed Data Analytics Thorsten Papenbrock Storage and … · 2017-11-08 · 0 Tree; frequently...
Transcript of Distributed Data Analytics Thorsten Papenbrock Storage and … · 2017-11-08 · 0 Tree; frequently...
Distributed Data Analytics
Storage and Retrieval Thorsten Papenbrock
G-3.1.09, Campus III
Hasso Plattner Institut
Introduction
Layering Data Models
Slide 2
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
1. Conceptual layer
Data structures, objects, modules, …
Application code
2. Logical layer
Relational tables, JSON, XML, graphs, …
Database management system (DBMS) or storage engine
3. Representation layer
Bytes in memory, on disk, on network, …
Database management system (DBMS) or storage engine
4. Physical layer
Electrical currents, pulses of light, magnetic fields, …
Operating system and hardware drivers
our focus now
Overview
Storage and Retrieval
Slide 3
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index Structures
Hash Indexes
SSTables and LSM-Trees
B-Trees
Further Indexes
Online Analytical Processing (OLAP)
Data Warehousing
Star- and Snowflake Schemata
Column-Oriented Storage
Data Cubes and Materialized Views
Overview
Storage and Retrieval
Slide 4
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index Structures
Hash Indexes
SSTables and LSM-Trees
B-Trees
Further Indexes
Online Analytical Processing (OLAP)
Data Warehousing
Star- and Snowflake Schemata
Column-Oriented Storage
Data Cubes and Materialized Views
Index Structures
A Tiny Database
Slide 5
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Basic database tasks: (a) write given data, (b) read specific data
A tiny key-value store in two Bash functions:
It works:
$ db_set 1234 '{"name":"Berlin","type":"city"}'
$ db_get 1234
'{"name":"Berlin","type":"city"}'
#!/bin/bash
db_set () {
echo "$1,$2" >> database
}
db_get () {
grep "^$1," database | sed –e "s/^$1,//" | tail –n 1
}
Concatenate the first two parameters by “,” and write/append them to the
file named “database”
Find all lines starting with first parameter, remove first parameter from lines, and select the last line
Why?
Index Structures
A Tiny Database
Slide 6
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Assume the following input-sequence:
$ db_set 1234 '{"name":"Berlin","type":"city"}'
$ db_set 42 '{"name":"Germany","type":"country"}'
$ db_set 42 '{"name":"Germany","type":"country","capital":"Berlin"}'
The according “database”-file:
$ cat database
{"name":"Berlin","type":"city"}
{"name":"Germany","type":"country"}
{"name":"Germany","type":"country","capital":"Berlin"}
“database” is a Log file:
Append only, no removal of old values
only last entry for each key is valid
Fast writes ( O(1) ) but slow reads ( O(n) with n records in the log )
To speed-up reads: Indexes!
Index Structures
Indexes
Slide 7
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index
An additional data structure that locates data by some
search criterion, i.e., key (usually one or more attributes, or
value identifier)
Basically a key-value store, where values can be actual data
or pointers to relational records, documents, graph
nodes/edges, …
Improves data retrieval operations
Costs additional writes to index structure and storage space
Use indexes carefully (not too many)!
Different index implementations (data structures) have
different strengths
Choose the right index for your queries!
Overview
Storage and Retrieval
Slide 8
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index Structures
Hash Indexes
SSTables and LSM-Trees
B-Trees
Further Indexes
Online Analytical Processing (OLAP)
Data Warehousing
Star- and Snowflake Schemata
Column-Oriented Storage
Data Cubes and Materialized Views
Hash Indexes
Hash Indexes
Slide 9
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Definition
A hash index is a hash map (dictionary) that maps keys to the positions
(in memory, on disk, on the network, …) of their values/records
The hash map uses a hash function to calculate mapping of keys and
positions and is usually kept in memory
Uses
key-value stores, other indexes, data distribution (load balancing,
sharding, …)
Strength
Point queries: one index look up directly delivers a value’s position
Weaknesses
Range queries require to look up each key individually
Hash map must fit into main memory; hash maps on disk perform poorly
Hash Indexes
Example
Slide 10
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
1 2 3 4 , { " n a m e " : " B e r l i n " , " t y p e " : "
c i t y " } \n 4 3 , { " n a m e " : " G e r m a n y " , " t
y p e " : " c o u n t r y " } \n
key byte offset
1234 0
42 37
0
30
60
10
40
70
20
50
80
37
In-memory hash map
Log-structured file on disk (each box is one byte)
hash function
Hash Indexes
Hash Indexes
Slide 11
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Controlling file growth
Indexed data, i.e., log is usually insert only (for good write performance)
Frequent updates make files unnecessarily large
Example: a store that maps products to stock-counts
Each purchase increments a stock-count new record!
Each sale decrements a stock-count new record!
But: collection of products is almost constant
Solution: consolidate/compact the log frequently freeing up disk space
How to do this on a running system?
Segmentation!
Hash Indexes
Hash Indexes
Slide 12
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Segmentation
Break log into segments of fixed size
Whenever a segment reaches max size:
1. Close the segment and redirect writes to a new segment
2. Compact the old segment:
1. Read the old segment backwards
2. If a key is read for the first time:
Write the entry (key + value) into a compacted segment
3. Merge the compacted segment with older compacted segment(s):
1. Read older compacted segment
2. If a key is not present in the newer compacted segment:
Write the entry (key + value) into the compacted segment
4. Delete the old segment and old compacted segment
Werwrth Wrthrth Wrthwrthwrth Wrthwrht Wrthwrh Wrhwthwht wrth
Rujrtj Wethwrth Reerzj Wrtwrjerjzezj Erzje Erzjezjetzj Etzetjz etzjejzezjejzetzj
Wrjezjejz Ejzez Etzje etjzetzjejzeWRE RTHWERTHE WRTHE ETZJETZJE ETZJ
EZJEZJ EJZETZJEJZ ZWEZ WETJHEJZU Q4WZ6ZR SR SRZR
ERZJWRZJE SFDGS SRTJHSW
RZJETZJE FGJRU RZIRZIK,R ARGA Q5ZW4E57JW6 SD
ATEHSWRZJ SRT SRTJHSRTJHSRT SSRZJ RZJERJZ WRJREJZREJ RJSER REJZWRJZWR REZJEJERZJ
RWZJWRZ RZJ RZJJREZ REZJWRZJW RZJ WZJ ERZJDGH
Q5THW SRTW WTRHWR 57J472 ETZJE 245Z36J357
E5J7E7JE EJ EZJEJE57JE WRJWRJ WJZWEJE5JE ETZJ WZJWWE6ZJEJ
Werwrth Wrthrth SATSFG Wrthwrht SRSSSRZNRZR Wrhwthwht wrthSRNS
SRT Etzjejzezjejzetzj SRTHS SRSRZJSWJZ ZTJEDJZ DTZJETZJETZJ
WrETZJ SRTHejz EjzeETZJETZJz Etzje etjzeWRE RTHERWRTHE ETZJETZJE ETZJ
EZJEZJ EJZETZJEJZ ZWEZ WETJHEJZU Q4TEZJWETZJZ6ZRTZJ SR SRETZJZR
ERZJWRZJE SFDGRTHS SERZJ EJZEZTJE ETZJETZJ ETZJETJZEJZETZ ETZJET ETZJRTJHSW
RZJETZJE FGJRU RZIRZIETZJK,R ARETZGA Q5JW6 SD ETZJETJ
ATEHSWRZJ SRTETZJETZJ TJHSRTJHSRT SSRZJETZJ RZJZ WRJREJ RJSER REJZWRJZWR REZJEJERZJ
RWZJWRZ RZJ RZJJREZ REZJWRZJW RZJTEZJETZJ WZJETZJ ERZJDGHETZJ ETZJEJ ETZJETZJ
Q5THW SRTW WTRHETZJWR 57J472ETJZ ETZJEETZ 245Z36J35 ETZJET7
E5J7E7JETZJE EJETZJ EZJEJE57JE WRJWRJ WJZWE ETZJETZJE5JE ETZJ WZJWWE6ZJEJ
Compacted
Current
ATEHS SRT SR
Writes
Reads
Hash Indexes
Hash Indexes
Slide 13
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Segmentation
Compact:
Merge:
Overview
Storage and Retrieval
Slide 14
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index Structures
Hash Indexes
SSTables and LSM-Trees
B-Trees
Further Indexes
Online Analytical Processing (OLAP)
Data Warehousing
Star- and Snowflake Schemata
Column-Oriented Storage
Data Cubes and Materialized Views
SSTables and LSM-Trees
SSTables
Slide 15
Thorsten Papenbrock
Definition
Sorted String Tables (SSTables) are segment files that contain key-value
pairs sorted by their key; each key only appears once
Sortation and distinctness of keys disallows simple appends to the log
Merge
Given two (or more) SSTables, their merge is calculated similarly to the
merge-sort algorithm:
1. Read all SSTables simultaneously:
1. Copy the smallest key with its value to the merged SSTable
and read next key-value pair
2. If keys are similar:
Copy only the key-value pair from the most recent SSTable to the merged SSTable
1 2 3 4 ?
2 4 5 7 1 3 6 7
Storage and Retrieval
Distributed Data Analytics
SSTables and LSM-Trees
SSTables
Slide 16
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index
Two options:
a) A dense index, i.e., each key-value pair in the SSTable is indexed
Allows direct look ups
b) A sparse index, i.e., first key-value pair in each block of an SSTable
is indexed
If a query key is not directly indexed:
find the next larger key in the index (binary search)
find the block of the next larger key (look up)
search for the query key in the block (binary search or
linear scan, if log entries have variable size)
Blocks can be compressed to reduce memory consumption and
I/O bandwidth
SSTables and LSM-Trees
SSTables
Slide 17
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Merge
Sparse index
SSTables and LSM-Trees
LSM-Trees
Slide 18
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Definition
Log-Structured Merge-Trees (LSM Trees) are multilayered search trees for
key-value log-data that use different data structures, each of which
optimized for its underlying storage medium
Example
First layer (C0 Tree): index structure that …
1) efficiently takes new key-value pairs in any order
2) outputs all contained key-value pairs in sorted order
B-trees, red-black trees, AVL trees, …
Second layer (C1 Tree): index structure that …
1) is able to merge with sorted key-value pair lists
2) effectively compacts/compresses contained key-value pairs
SSTables
Patrick O'Neil et. al. “The log-structured merge-tree (LSM-tree)”, Acta Information, volume 33, number 4, pages 351-385, 1996
memtable
SSTables and LSM-Trees
LSM-Trees
Slide 19
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Intuition
Sorted trees are fast in-memory indexes but they outgrow main memory
SSTables are indexable and compact but don’t support random inserts
Insert: add new key-value pairs to C0 Tree; frequently merge trees
down the hierarchy (C0 C1 C2 …) to free memory
Read: search for the key in every layer or chronological if only the
most recent value is needed
Data stores with
LSM-Tree implementations
LevelDB, RocksDB,
Cassandra, Hbase,
Lucene (in Elastic-
search and Solr), …
SSTables and LSM-Trees
LSM-Trees
Slide 20
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Optimizations Querying non-existend values is expensive (check all layers)
Catch most of these queries with a Bloomfilter
Size-tiered compaction
Merge newer, smaller SSTables successively into older, larger SSTables
Overview
Storage and Retrieval
Slide 21
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index Structures
Hash Indexes
SSTables and LSM-Trees
B-Trees
Further Indexes
Online Analytical Processing (OLAP)
Data Warehousing
Star- and Snowflake Schemata
Column-Oriented Storage
Data Cubes and Materialized Views
B-Tree
Definition and Structure
Slide 22
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Definition
A self-balancing, tree-based data structure, that stores values sorted by key and
allows read/write access in logarithmic time
A generalization of a binary search tree as nodes can have more than two
children
Structure
Blocks:
Nodes in the tree that contain key-value pairs and pointers to other blocks
Correspond to physical, fixed sized disk blocks/pages that are addressed
and read as single units
Pointers:
Edges in the tree that connect blocks in tree structure
Correspond to physical block/page addresses
R. Bayer and E. McCreight. “Organization and maintenance of large ordered indices”, In Proceedings of SIGFIDET (now SIGMOD) Workshop, pages 107-141, 1970
B-Tree
Constraints
Slide 23
Thorsten Papenbrock
Constraints
Balanced:
Same distance from root-node to all leaf-nodes
Depth of the tree is in O(log(n)) (= key-look-up complexity)
If tree grows unbalanced, re-balancing required
Block-Content:
A block contains n keys and n+1 pointers in alternating order
Pointers left to a key point to blocks containing smaller keys
Pointers right to a key point to blocks containing larger/equal keys
All values in the tree are sorted by their keys!
Block-Size:
Typically 4096 Byte per block; 4 Byte per key; 8 Byte per pointer
4n + 8(n+1) ≤ 4096 => n = 340
B-Tree
Constraints
Slide 24
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Constraints
Root Node:
Points to underlying nodes (and values)
At least 2 pointers used
Inner Node:
Points to underlying nodes (and values)
At least (𝑛 + 1)/2 pointers used
Leaf Node:
Points to right neighbor leaf and values
At least 1 neighbor pointer (if present) and (𝑛 + 1)/2 value pointers used
Uses
Any data store: most widely used index structure for DBMSs
Sorted, dense, and sparse indexes
B-Tree vs. B+-Tree
B-Trees store keys and values in both internal and leaf nodes;
B+-Trees store values only in leaf nodes.
The following examples show B+-Trees
B-Tree
Example
Slide 25
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Example
Block Key
Pointer free space Root
Inners
Leafs
Values
B-Tree vs. SSTable .
Always rewrite entire blocks vs. always append and merge frequently
B-Tree
More Examples
Slide 26
Storage and Retrieval
Thorsten Papenbrock
Distributed Data Analytics
Neighbor pointer
Look up scenario
Grow scenario
With data
B-Tree
Insert and Delete
No
Find leaf node of the key
Delete key and value-pointer
Underflow?
Redistribute
with neighbors?
Adjust the keys of
all parent nodes
Redistribute
Done No
Yes
N leaf?
Move all keys of the right neighbor
into the left neighbor node and
delete the (empty) right node
Move split-key
into parent node of
left neighbor node
Yes No Merge
Yes
No
Find corresponding leaf node for key
Insert key and value-pointer
Overflow?
N root?
Create new
root node
(with only one
Single key)
Split node N into nodes N1, N2
and distribute keys evenly
N leaf?
Insert new key-pointer
pair into parent node
Copy smallest
key in N2
upwards
Move largest
key in N1
upwards
Done
Yes
No
Yes No
Yes
Split
Split and Merge operations guarantee that the B-Tree is always balanced and the blocks are filled correctly.
Overview
Storage and Retrieval
Slide 28
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index Structures
Hash Indexes
SSTables and LSM-Trees
B-Trees
Further Indexes
Online Analytical Processing (OLAP)
Data Warehousing
Star- and Snowflake Schemata
Column-Oriented Storage
Data Cubes and Materialized Views
Further Indexes
Alternative Index Types
Slide 29
Thorsten Papenbrock
Clustered Index
Stores indexed data or parts of it within the index (instead of pointers to data)
Example: An index on attribute delivery_status allows to count pending
deliveries without data access.
Improves the performance of certain queries
Reduces write performance and requires additional storage
Redundant values (in data and index) complicate data consistency
Multi-Column Index
a) Concatenated index: merge keys into one key
b) Multi-dimensional index: split multi-dim. key domain into multi-dim. shapes
Example: An index on two-dim. geo locations (longitude, latitude)
to answer intersection, containment, and nearest neighbor queries.
Most common implementation: R-Trees
Further Indexes
R-Tree
Slide 30
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
R-Tree
A variation of a B-Tree that uses a hierarchy of rectangles as keys
Also: balanced and block-sized nodes
Indexed points …
are clustered into leaf nodes
might occur in multiple clusters
Insertion:
into appropriate clusters
split cluster if too large
find smallest cluster extension
via heuristic if no cluster fits directly
Further Indexes
Alternative Index Types
Slide 31
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Fuzzy Index
Index on terms/keys that allows for value misspellings, synonyms, variations, …
Idea: sparse, sorted index (e.g. SSTable or B-Tree) with similarity look up
Example: An index on attribute firstname where names might be misspelled.
Look up most similar key
Scan the (sorted!) neighborhood of that key’s value for similar values
John Sno
John Snow
Jon Snow
john snow
jonny snow
…
…
Overview
Storage and Retrieval
Slide 32
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index Structures
Hash Indexes
SSTables and LSM-Trees
B-Trees
Further Indexes
Online Analytical Processing (OLAP)
Data Warehousing
Star- and Snowflake Schemata
Column-Oriented Storage
Data Cubes and Materialized Views
Onlien Analytical Processing (OLAP)
What is Analytics?
Slide 33
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
“Calculate the number of sales per year and state.”
SELECT SUM(S.sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid = T.timeid AND S.timeid = L.timeid
GROUP BY T.year, L.state
Aggregate Queries
Outlier Detection
Clustering
Trend Prediction
And many further questions
to data sets!
Overview
Storage and Retrieval
Slide 34
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index Structures
Hash Indexes
SSTables and LSM-Trees
B-Trees
Further Indexes
Online Analytical Processing (OLAP)
Data Warehousing
Star- and Snowflake Schemata
Column-Oriented Storage
Data Cubes and Materialized Views
Data Warehousing
Data Warehouse
Slide 35
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Definition
A central repository of integrated, potentially pre-aggregated historical data
from one ore more distinct sources (usually operational/OLTP systems)
Ecosystem
Customer
Employee
Supplier DeliveryDB
InventoryDB
SalesDB
Staging Area
Operational Systems
Data Warehouse
raw data
meta data
summary data data integration, cleaning, pre-aggregation, …
ETL
ETL
ETL
ETL ETL
Data Marts
Purchasing
Sales Analyst
Reporting
data preparation, filtering, …
ETL-processes (extract, transform, load)
Data Warehousing
Assessment
Slide 36
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Advantages
Compute intense analytical queries do not interfere with operational business
Data is static and does not change while queries are run
Analytics-friendly data layouts, e.g., star-schemata, data cubes, or
materialized views as well as indexes for analytical query patters are possible
Data can be sorted by certain keys that are often queried as range or group
(e.g. timestamps, version-IDs, tags, or country codes), whereas operational
data is usually not sorted for better insert performance
Data can be compressed more aggressively, which can improve read
performance
Disadvantages
Data is not up-to-date (one ETL-cycle ≈ one day old)
Data Warehousing
Solutions
Slide 37
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Commercial
Microsoft SQL Server
SAP HANA
Teradata
Vertica
ParAccel
…
Open-Source
Apache Hive
Spark SQL
Cloudera Impala
Facebook Presto
Apache Tajo
Apache Drill
…
Most of these are Hadoop-based and use some kind of
MapReduce paradigm.
Overview
Storage and Retrieval
Slide 38
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index Structures
Hash Indexes
SSTables and LSM-Trees
B-Trees
Further Indexes
Online Analytical Processing (OLAP)
Data Warehousing
Star- and Snowflake Schemata
Column-Oriented Storage
Data Cubes and Materialized Views
Star- and Snowflake Schemata
Definitions
Slide 39
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Star Schema
An acyclic graph of relational tables with depth one, where the root is
called fact table and the leafs dimension tables
Fact tables: contain events (transactions, measurements, snapshots,…)
Usually very large tables
Examples: sales, page views, clicks, shippings, sensor readings, …
Dimension tables: contain entity data and descriptive information
Usually small tables with fix domain
Examples: products, employees, customers, dates, locations, …
Answer for each event: who, what, where, when, how, or why
Snowflake Schema
Same as star schema, but with arbitrary depth, i.e., dimension tables
might have further dimension tables
Star- and Snowflake Schemata
Examples
Slide 40
Storage and Retrieval
Thorsten Papenbrock
Star Schema Snowflake Schema
Distributed Data Analytics
Star- and Snowflake Schemata
Benefits
Slide 41
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Star- and Snowflake Schemata
Improved query performance for (most) aggregation queries:
Better join-performance than normalized data
Better scan performance than one big table
May contain pre-aggregated data
Simpler queries:
Clear join logic and limited number of necessary joins
Overview
Storage and Retrieval
Slide 42
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index Structures
Hash Indexes
SSTables and LSM-Trees
B-Trees
Further Indexes
Online Analytical Processing (OLAP)
Data Warehousing
Star- and Snowflake Schemata
Column-Oriented Storage
Data Cubes and Materialized Views
Column-Oriented Storage
Motivation and Idea
Slide 43
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Observation
Data warehouse tables are often very wide, but analytical queries access only
very few columns (usually 4 to 5 and rarely “SELECT *”)
Most data models (relational, key-value, column family, document) store data
record-wise and must read, parse and filter all data for analytical queries
Solution
Store data attribute-wise instead of record-wise:
One file per attribute
Values ordered by record index in each file
For each query: scan only required attribute files
Column-Oriented Storage
Example
date_key product_sk store_sk promotion_sk customer_sk quantity net_price discount_price
140102 69 4 NULL NULL 1 13.99 13.99
140102 69 5 19 NULL 3 14.99 9.99
140102 69 5 NULL 191 1 14.99 14.99
140102 74 3 23 202 5 0.99 0.89
140103 31 2 NULL NULL 1 2.49 2.49
140103 31 3 NULL NULL 3 14.99 9.99
140103 31 3 21 123 1 49.99 39.99
140103 31 8 NULL 233 1 0.99 0.99
fact_sales table
file 1 file 2 file 3 file 4 file 5 file 6 file 7 file 8
Column-Oriented Storage
Query
product_sk file
69 69 69 69 74 31 31 31 31 29 30 30 31 31 31 68 69 69
Query
SELECT product_sk, SUM(quantity) AS product_sales
FROM fact_sales
WHERE product_sk IN (30, 68, 69)
GROUP BY product_sk;
1. Scan product_sk file for values 30, 68, and 69 and
remember the row of each value
2. Read the quantities at the retrieved rows in the
quantity file and sum accordingly
Slide 45
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock We can do even better!
Compression
Column-Oriented Storage
Bitmap Encoding
product_sk file
69 69 69 69 74 31 31 31 31 29 30 30 31 31 31 68 69 69
Bitmap encoding
Slide 46
Thorsten Papenbrock
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 29
0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 30
0 0 0 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 31
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 68
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 69
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 74
Smaller? … Here, yes:
18 · 32 Bit = 576 Bit vs. 6 · (32 Bit + 32 Bit) = 384 Bit
Column-Oriented Storage
Bitmap Encoding
Bitmap encoding
Slide 47
Thorsten Papenbrock
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 29
0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 30
0 0 0 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 31
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 68
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 69
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 74
Bit-wise OR
SELECT product_sk, SUM(quantity) AS product_sales FROM fact_sales WHERE product_sk IN (30, 68, 69) GROUP BY product_sk;
Column-Oriented Storage
Run-length Encoding
Problem: Bitmaps are very long and sparse in practice
Run-length encoding
29 9 zeros, 1 one, rest zeros
30 10 zeros, 2 one, rest zeros
31 5 zeros, 4 one, 3 zeros, 3 ones, rest zeros
68 15 zeros, 1 one, rest zeros
69 0 zeros, 4 one, 12 zeros, 2 ones, rest zeros
74 4 zeros, 1 one, rest zeros
9 1
10 2
5 4 3 3
15 1
0 4 12 2
4 1
Slide 48
Thorsten Papenbrock
For more compression techniques see:
Daniel Abadi et. al. “The Design and Implementation of Modern Column-Oriented Database Systems”, 2013.
See also: Roaring Bitmaps http://roaringbitmap.org/
Overview
Storage and Retrieval
Slide 49
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Index Structures
Hash Indexes
SSTables and LSM-Trees
B-Trees
Further Indexes
Online Analytical Processing (OLAP)
Data Warehousing
Star- and Snowflake Schemata
Column-Oriented Storage
Data Cubes and Materialized Views
Data Cubes and Materialized Views
Pre-Aggregation
Slide 50
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Observation
Most analytical queries in data warehouses are aggregate queries
Aggregation patterns repeat frequently
Pre-calculate common aggregates!
Materialized Views
Are query results that are written back to disk
DBMS query optimizer can automatically use these views to answer
queries (or parts of queries)
Strategies to improve data analytics:
Pre-calculation: estimate common aggregates and pre-calculate them
Lazy pre-calculation: store each aggregate once it was queried
Data Cubes and Materialized Views
Data Cubes
Slide 51
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Definition
Materialized views for multi-dimensional aggregates, i.e.,
a grid of aggregates grouped by different dimensions
Example
32 33 34 35 … total
140101 149.60 31.01 84.58 28.18 … 40710.53
140102 132.18 19.78 82.91 10.96 … 73091.28
140103 196.75 0.00 12.52 64.67 … 54688.10
140104 178.36 9.98 88.75 56.16 … 95121.09
… … … … … … …
total 14967.09 5910.43 7328.85 6885.39 … 5365M
product_sk
date
_key
+ + + +
+
+
+
+
+
+
+
+
SELECT product_sk, date_sk, SUM(net_price) FROM fact_sales GROUP BY product_sk, date_sk
Data Cubes and Materialized Views
Slicing- and Dicing-Queries
Slide 52
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
32 33 34 35 … total
140101 149.60 31.01 84.58 28.18 … 40710.53
140102 132.18 19.78 82.91 10.96 … 73091.28
140103 196.75 0.00 12.52 64.67 … 54688.10
140104 178.36 9.98 88.75 56.16 … 95121.09
… … … … … … …
total 14967.09 5910.43 7328.85 6885.39 … 5365M
product_sk
date
_key
SELECT SUM(net_price) FROM fact_sales WHERE product_sk = 32 AND date_sk = 140101
SELECT SUM(net_price) FROM fact_sales WHERE date_sk = 140101
SELECT SUM(net_price) FROM fact_sales WHERE product_sk = 32 AND net_price > 100.00
Data Cubes and Materialized Views
Data Cubes of higher Dimension
Slide 53
Storage and Retrieval
Distributed Data Analytics
Thorsten Papenbrock
Distributed Data Analytics
Introduction Thorsten Papenbrock
G-3.1.09, Campus III
Hasso Plattner Institut