6.830 Lecture 7


Page 1: 6.830 Lecture 7

6.830 Lecture 7

B+Trees & Column Stores (9/30/2015)

Signup: http://bit.ly/6830f15

Page 2: 6.830 Lecture 7

B+Trees

[Figure: a B+Tree. The root node holds separator values (val11, val12, val13, ...) with a child pointer on either side of each value; inner nodes hold their own separators (val21, val22, ...; valn1, ...), so a search for a key < val11 follows the leftmost root pointer, and a key > val21 but < val22 follows the pointer between those two separators. Leaf nodes hold RIDs (RIDn, RIDn+1, RIDn+2, ...) in sorted order, connected by link pointers for range scans.]


Page 5: 6.830 Lecture 7

Properties of B+Trees
• Branching factor = B
• logB(tuples) levels
• Logarithmic insert/delete/lookup performance
• Support for range scans
• Link pointers between leaves
• No data in internal pages
• Balanced: "rotation" algorithms (see text) rebalance on insert/delete
• Fill factor: all nodes except the root kept at least 50% full (merged when occupancy falls below)
• Clustered / unclustered
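The lookup and range-scan behavior above can be sketched in a few lines of Python. This is a simplified in-memory sketch with hypothetical `Node` fields; a real B+Tree stores each node on a disk page:

```python
# Minimal B+Tree search sketch (illustrative; real nodes live on disk pages).
import bisect

class Node:
    def __init__(self, keys, children=None, rids=None, next_leaf=None):
        self.keys = keys            # separator values (inner) or keys (leaf)
        self.children = children    # child nodes; None for a leaf
        self.rids = rids            # record IDs, parallel to keys (leaf only)
        self.next_leaf = next_leaf  # sibling link pointer for range scans

def lookup(root, key):
    """Descend from root to leaf: O(log_B n) node visits."""
    node = root
    while node.children is not None:              # inner node
        node = node.children[bisect.bisect_right(node.keys, key)]
    i = bisect.bisect_left(node.keys, key)        # leaf node
    if i < len(node.keys) and node.keys[i] == key:
        return node.rids[i]
    return None

def range_scan(root, lo, hi):
    """Find the leaf for lo, then follow link pointers: O(log_B n + R)."""
    node = root
    while node.children is not None:
        node = node.children[bisect.bisect_right(node.keys, lo)]
    out = []
    while node is not None:
        for k, r in zip(node.keys, node.rids):
            if lo <= k <= hi:
                out.append(r)
            elif k > hi:
                return out
        node = node.next_leaf                     # link pointer hop
    return out
```

Note how the range scan pays the logarithmic descent only once and then walks the leaf chain, which is exactly why the link pointers matter.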

Page 6: 6.830 Lecture 7

Indexes Recap

         Heap File   B+Tree           Hash File
Insert   O(1)        O(logB n)        O(1)
Delete   O(P)        O(logB n)        O(1)
Scan     O(P)        O(logB n + R)    -- / O(P)
Lookup   O(P)        O(logB n)        O(1)

n: number of tuples
P: number of pages in the file
B: branching factor of the B+Tree
R: number of pages in the range
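To see why the O(logB n) terms above are so cheap in practice, here is a quick back-of-the-envelope calculation; the branching factor of 500 is an illustrative assumption (a few hundred separators fit on a typical page), not a figure from the lecture:

```python
# Depth grows as log_B(n), so a large branching factor keeps trees shallow.
import math

def levels(n_tuples, branching_factor):
    """Approximate number of levels in a B+Tree over n_tuples."""
    return math.ceil(math.log(n_tuples, branching_factor))

# A billion-tuple table with B = 500 needs a root-to-leaf path of only
# about 4 nodes, and the top levels are usually cached in the buffer pool.
depth = levels(10**9, 500)
```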

Page 7: 6.830 Lecture 7

R-Trees / Spatial Indexes

[Figure: points in a 2D (x, y) plane, grouped into nested bounding rectangles that form the R-Tree's pages.]


Page 10: 6.830 Lecture 7
Page 11: 6.830 Lecture 7

[Figure: a query rectangle Q overlaid on the bounding rectangles; the search descends into every rectangle that overlaps Q.]

Page 12: 6.830 Lecture 7

Quad-Tree

[Figure: points in a 2D (x, y) plane, recursively partitioned into four quadrants; crowded quadrants are split again.]

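The recursive quadrant splitting can be sketched as follows; the node capacity and class layout are illustrative assumptions, not details from the lecture:

```python
# Minimal point quad-tree sketch: each node covers a square region and
# splits into four quadrants when it holds more than CAPACITY points.
CAPACITY = 4

class QuadTree:
    def __init__(self, x0, y0, size):
        self.x0, self.y0, self.size = x0, y0, size
        self.points = []
        self.children = None      # four sub-quadrants after a split

    def _child_for(self, x, y):
        half = self.size / 2
        col = 1 if x >= self.x0 + half else 0
        row = 1 if y >= self.y0 + half else 0
        return self.children[row * 2 + col]

    def insert(self, x, y):
        if self.children is None:
            self.points.append((x, y))
            if len(self.points) > CAPACITY:       # overflow: split
                half = self.size / 2
                self.children = [
                    QuadTree(self.x0 + dx * half, self.y0 + dy * half, half)
                    for dy in (0, 1) for dx in (0, 1)
                ]
                for px, py in self.points:        # redistribute points
                    self._child_for(px, py).insert(px, py)
                self.points = []
        else:
            self._child_for(x, y).insert(x, y)

    def query(self, qx0, qy0, qx1, qy1):
        """Return points inside the rectangle [qx0,qx1] x [qy0,qy1]."""
        # Prune this quadrant entirely if it cannot overlap the query box.
        if (qx1 < self.x0 or qx0 > self.x0 + self.size or
                qy1 < self.y0 or qy0 > self.y0 + self.size):
            return []
        hits = [(x, y) for x, y in self.points
                if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        if self.children is not None:
            for c in self.children:
                hits.extend(c.query(qx0, qy0, qx1, qy1))
        return hits
```

The pruning test in `query` is the spatial analogue of the B+Tree's separator comparison: whole subtrees are skipped when their region cannot contain an answer.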

Page 15: 6.830 Lecture 7

Typical Database Setup

Transactional database
• Lots of writes/updates
• Reads of individual records

Analytics / Reporting Database ("Warehouse")
• Lots of reads of many records
• Bulk updates
• Typical query touches a few columns

Data moves from the transactional database to the warehouse via "Extract, Transform, Load" (ETL).

Page 16: 6.830 Lecture 7

How Long Does a Scan Take?

• Time proportional to amount of data read
• Example:

SELECT avg(price) FROM tickstore
WHERE symbol = 'GM' AND date = '1/17/2007'

"Row" representation on magnetic disk (symbol, price, quantity, exchange, date):

GM    30.77  1,000   NYSE  1/17/2007
GM    30.77  10,000  NYSE  1/17/2007
GM    30.78  12,500  NYSE  1/17/2007
AAPL  93.24  9,000   NQDS  1/17/2007

Even though we only need price, date, and symbol, if the data is on disk we must scan over all columns.

Page 17: 6.830 Lecture 7

Column Representation Reduces Scan Time

• Idea: Store each column in a separate file

Column representation:

symbol:    GM, GM, GM, AAPL
price:     30.77, 30.77, 30.78, 93.24
quantity:  1,000, 10,000, 12,500, 9,000
exchange:  NYSE, NYSE, NYSE, NQDS
date:      1/17/2007 (x4)

The query reads just 3 of the 5 columns. Assuming each column is the same size, this cuts the bytes read from disk to 3/5 of a full row scan.

In reality, database tables often have 100s of columns.
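The difference between the two layouts can be made concrete with a toy sketch; the in-memory "files" and function names are illustrative assumptions, not a real storage engine:

```python
# Toy row store vs. column store: with one "file" per column, the query
# touches only the columns it names (symbol, date, price).
rows = [
    ("GM",   30.77,  1000, "NYSE", "1/17/2007"),
    ("GM",   30.77, 10000, "NYSE", "1/17/2007"),
    ("GM",   30.78, 12500, "NYSE", "1/17/2007"),
    ("AAPL", 93.24,  9000, "NQDS", "1/17/2007"),
]
schema = ["symbol", "price", "quantity", "exchange", "date"]

# Column representation: one list ("file") per attribute.
columns = {name: [row[i] for row in rows] for i, name in enumerate(schema)}

def avg_price_row_store():
    # Row store: every attribute of every row passes through the scan.
    hits = [r[1] for r in rows if r[0] == "GM" and r[4] == "1/17/2007"]
    return sum(hits) / len(hits)

def avg_price_column_store():
    # Column store: only the symbol, date, and price files are read.
    sym, dt, price = columns["symbol"], columns["date"], columns["price"]
    hits = [price[i] for i in range(len(sym))
            if sym[i] == "GM" and dt[i] == "1/17/2007"]
    return sum(hits) / len(hits)
```

Both plans return the same answer; what changes is how many bytes the scan must pull off disk to produce it.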

Page 18: 6.830 Lecture 7

When Are Columns Right?

• Warehousing (OLAP)
  • Read-mostly; batch update
  • Queries: scan and aggregate a few columns
• Vs. Transaction Processing (OLTP)
  • Write-intensive, mostly single-record ops
• Column stores: OLAP-optimized
  • In practice >10x performance on comparable HW for many real-world analytic applications
  • True even with flash or main memory!

Different architectures for different workloads

Page 19: 6.830 Lecture 7

C-Store: Rethinking Database Design from the Ground Up

[Figure: C-Store architecture. Inserts go into write-optimized storage; a Tuple Mover migrates them into read-optimized storage, where each table (SYM, PRICE, VOL, EXCH, TIME: IBM 100 10244 NYSE 1.17.07; IBM 102 11245 NYSE 1.17.07; SUN 58 3455 NQDS 1.17.07) is kept as separate files per column with column-based compression. A column-oriented query executor reads both stores, and data is horizontally partitioned across shared-nothing nodes.]

"C-Store: A Column-oriented DBMS" -- VLDB 05

Page 20: 6.830 Lecture 7

Query Processing Example

• Traditional Row Store

SELECT avg(price)
FROM tickstore
WHERE symbol = 'GM' AND date = '1/17/2007'

[Figure: the plan reads complete tuples from disk (GM 30.77 1,000 NYSE 1/17/2007; GM 30.77 10,000 NYSE 1/17/2007; GM 30.78 12,500 NYSE 1/17/2007; AAPL 93.24 9,000 NQDS 1/17/2007) and passes complete tuples between every operator: SELECT sym = 'GM', then SELECT date = '1/17/07', then AVG price.]

Page 21: 6.830 Lecture 7

Query Processing Example

• Basic Column Store
• "Early Materialization"

SELECT avg(price)
FROM tickstore
WHERE symbol = 'GM' AND date = '1/17/2007'

[Figure: the column files (30.77, 30.77, 30.78, 93.24; GM, GM, GM, AAPL; 1,000, 10,000, 12,500, 9,000; NYSE, NYSE, NYSE, NQDS; 1/17/2007 x4) are read from disk, and a Construct Tuples operator stitches together fields from the same tuple, found at the same index (position) in each column file, into complete tuples (e.g., GM 30.77 1/17/07). The complete tuples then flow through the same row-oriented plan: SELECT sym = 'GM', SELECT date = '1/17/07', AVG price.]

Page 22: 6.830 Lecture 7

Query Processing Example

• C-Store
• "Late Materialization"

[Figure: the column files are read from disk; Pos.SELECT sym = 'GM' emits position bitmap (1,1,1,0) and Pos.SELECT date = '1/17/07' emits (1,1,1,1); ANDing them gives (1,1,1,0); a Position Lookup then fetches only the matching prices (30.77, 30.77, 30.78), which feed AVG.]

Much less data flows through memory.

See Abadi et al, ICDE 07
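The position-bitmap pipeline can be sketched directly; the operator names (`pos_select`, `position_lookup`) are illustrative, while the column data comes from the slides:

```python
# Late-materialization sketch: predicates run per column and emit position
# bitmaps; only the matching prices are ever materialized.
symbol = ["GM", "GM", "GM", "AAPL"]
date   = ["1/17/2007"] * 4
price  = [30.77, 30.77, 30.78, 93.24]

def pos_select(column, pred):
    """Position-SELECT: evaluate a predicate over one column,
    returning a bitmap of qualifying positions."""
    return [1 if pred(v) else 0 for v in column]

def bitmap_and(a, b):
    return [x & y for x, y in zip(a, b)]

def position_lookup(column, bitmap):
    """Fetch only the column values at set positions."""
    return [v for v, bit in zip(column, bitmap) if bit]

sym_bm  = pos_select(symbol, lambda s: s == "GM")        # (1,1,1,0)
date_bm = pos_select(date, lambda d: d == "1/17/2007")   # (1,1,1,1)
both    = bitmap_and(sym_bm, date_bm)                    # (1,1,1,0)
prices  = position_lookup(price, both)
avg     = sum(prices) / len(prices)
```

Until the final lookup, the plan moves only bitmaps, not tuples, which is where the memory-bandwidth savings come from.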

Page 23: 6.830 Lecture 7


Why Compress?

• Database size is 2x-5x larger than the volume of data loaded into it

• Database performance is proportional to the amount of data flowing through the system

Abadi et al, SIGMOD 06

Page 24: 6.830 Lecture 7

Column-Oriented Compression

• Query engine processes compressed data
  • Transfers load from disk to CPU
• Multiple compression types
  • Run-Length Encoding (RLE), LZ, Delta Value, Block Dictionary, Bitmaps, Null Suppression
• System chooses which to apply
• Typically see 50%-90% compression
• NULLs take virtually no space

[Figure: each column gets the scheme that suits it: symbols (GM, GM, GM, AAPL) → RLE "3xGM, 1xAAPL"; prices (30.77, 30.77, 30.78, 93.24) → Delta "30.77, +0, +.01, +62.46"; quantities (1,000, 10,000, 12,500, 9,000) → LZ; exchanges (NYSE, NYSE, NYSE, NQDS) → RLE "3xNYSE, 1xNQDS"; dates → RLE "4 x 1/17/2007".]

Columns contain similar data, which makes compression easy.
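Two of the schemes named above, RLE and delta encoding, are simple enough to sketch; the encodings below are illustrative, not C-Store's actual on-disk formats:

```python
# Minimal RLE and delta-encoding sketches over the slide's column data.

def rle_encode(values):
    """['GM','GM','GM','AAPL'] -> [(3, 'GM'), (1, 'AAPL')]"""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0] + 1, v)   # extend the current run
        else:
            runs.append((1, v))               # start a new run
    return runs

def rle_decode(runs):
    return [v for count, v in runs for _ in range(count)]

def delta_encode(values):
    """[30.77, 30.77, 30.78, 93.24] -> [30.77, 0.0, 0.01, 62.46]"""
    return [values[0]] + [round(b - a, 2) for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(round(out[-1] + d, 2))
    return out
```

Sorted, repetitive columns are exactly where these shine: a long sorted run of equal symbols collapses to a single (count, value) pair.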

Page 25: 6.830 Lecture 7

Operating on Compressed Data

[Figure: the late-materialization plan runs directly on the compressed columns: RLE symbols "3xGM, 1xAAPL" feed Pos.SELECT sym = 'GM', producing position bitmap (3x1, 1x0); RLE dates "4x1/17/2007" feed Pos.SELECT date = '1/17/07', producing (4x1); ANDing gives (3x1, 1x0); a compression-aware Position Lookup fetches the matching delta-encoded prices for AVG.]

Only possible with late materialization!

Page 26: 6.830 Lecture 7

Direct Operation Optimizations

• Compressed data used directly for position lookup
  • RLE, Dictionary, Bitmap
• Direct aggregation and GROUP BY on compressed blocks
  • RLE, Dictionary
• Join runs of compressed blocks
  • RLE, Dictionary
• Min/max directly extracted from sorted data

Page 27: 6.830 Lecture 7

TPC-H Compression Performance

Query: SELECT colY, SUM(colX) FROM lineItem GROUP BY colY

• TPC-H Scale 10 (60M records)
• Sorted on colY, then colX
• colY uncompressed, cardinality varies

Y  X
1  A
1  C
1  D
2  B
2  C
3  A

Page 28: 6.830 Lecture 7

Compression + Sorting is a Huge Win

How can we get more sorted data? Store duplicate copies of the data:
• Use different physical orderings
• Improves ad-hoc query performance, due to the ability to directly operate on sorted, compressed data
• Supports fail-over / redundancy

Page 29: 6.830 Lecture 7

Write Performance

Tuple Mover: asynchronous data movement
• Queries read from both WOS and ROS
• Batched:
  • Amortizes seeks
  • Amortizes recompression
  • Enables continuous load
• Trickle load: very fast inserts

Page 30: 6.830 Lecture 7

When to Rewrite ROS Objects?

• Store multiple ROS objects, instead of just one
  • Each of which must be scanned to answer a query
• Tuple Mover writes new objects
  • Avoids rewriting the whole ROS on merge
• Periodically merge ROS objects to limit the number of distinct objects that must be scanned (like BigTable)

[Figure: the Tuple Mover drains the WOS into a new ROS object, which sits alongside older objects.]

Page 31: 6.830 Lecture 7

C-Store Performance

• How much do these optimizations matter?
• Wanted to compare against the best you could do with a commercial row-store system

Page 32: 6.830 Lecture 7

Emulating a Column Store

• Two approaches:
  1. Vertical partitioning: for an n-column table, store n two-column tables, with the ith table containing a tuple-id and attribute i
     • Sort on tuple-id
     • Merge joins for query results
  2. Index-only plans
     • Create a secondary index on each column
     • Never follow pointers to the base table
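The vertical-partitioning approach can be sketched as follows; the table contents and the reassembly query are illustrative assumptions, not the paper's benchmark schema:

```python
# Vertical partitioning: each attribute becomes a (tuple_id, value) table
# sorted on tuple_id; queries merge-join the pieces back together.
symbol_t = [(0, "GM"), (1, "GM"), (2, "GM"), (3, "AAPL")]
price_t  = [(0, 30.77), (1, 30.77), (2, 30.78), (3, 93.24)]
date_t   = [(0, "1/17/2007"), (1, "1/17/2007"),
            (2, "1/17/2007"), (3, "1/17/2007")]

def merge_join(left, right):
    """Merge-join two tables sorted on tuple_id; emit (tuple_id, *values)."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lid, rid = left[i][0], right[j][0]
        if lid == rid:
            out.append((lid,) + left[i][1:] + right[j][1:])
            i += 1
            j += 1
        elif lid < rid:
            i += 1
        else:
            j += 1
    return out

# Reassemble only the columns the query needs, then filter and aggregate.
joined = merge_join(merge_join(symbol_t, date_t), price_t)
hits = [p for (_tid, s, d, p) in joined if s == "GM" and d == "1/17/2007"]
avg = sum(hits) / len(hits)
```

Because every partition is sorted on tuple-id, each join is a single linear merge; as the later slides note, the trouble in a real row store is that the optimizer may not know the partitions are sorted and may sort them again.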

Page 33: 6.830 Lecture 7

Two Emulation Approaches

Page 34: 6.830 Lecture 7

Bottom Line

Average query time (s):

C-Store, Compression          4
C-Store, No Compression       15
C-Store, Early Materialize    41
Rows                          26
Rows, Vert. Part.             80
Rows, All Indexes             221

SSBM (Star Schema Benchmark -- O'Neil et al, ICDE 08)
• Data warehousing benchmark based on TPC-H
• Scale 100 (60M row table), 17 columns
• Average across 12 queries
• Row store is a commercial DB, tuned by a professional DBA, vs. C-Store

Commercial system does not benefit from vertical partitioning

Page 35: 6.830 Lecture 7

Problems with Vertical Partitioning

① Tuple headers
  • Total table is 4GB; each column table is ~1.0 GB
  • Factor of 4 overhead from tuple headers and tuple-ids
② Merge joins
  • Answering queries requires joins
  • Row store doesn't know that the column-tables are sorted; sorting hurts performance

Would need to fix these, plus add direct operation on compressed data, to approach C-Store performance.

Page 36: 6.830 Lecture 7

Problems with Index-Only Plans

Consider the query:

SELECT store_name, SUM(revenue)
FROM Facts, Stores
WHERE Facts.store_id = Stores.store_id
  AND Stores.country = 'Canada'
GROUP BY store_name

• The two WHERE clauses result in a list of tuple IDs that pass all predicates
• Need to go pick up values from the store_name and revenue columns
• But indexes map from value → tuple ID!
• Column stores can efficiently go from tuple ID → value in each column

Page 37: 6.830 Lecture 7

Recommendations for Row-Store Designers

• Might be possible to get C-Store-like performance
  ① Need to store tuple headers elsewhere (not require that they be read from disk with the tuples)
  ② Need to provide an efficient merge-join implementation that understands sorted columns
  ③ Need to support direct operation on compressed data
    • Requires a "late materialization" design