Page 1:

Database scalability and indexes

Goetz Graefe

Hewlett-Packard Laboratories

Palo Alto, CA – Madison, WI

Page 2:

Dimensions of scalability

• Data size – cost per terabyte ($/TB)

• Information complexity (database schema size)

• Operational scale (data sources & transformations)

• Multi-programming level (many queries)

• Concurrency (updates, roll-in load, roll-out purge)

• Query complexity (tables, operations, parameters)

• Representation (indexing) complexity

• Storage hierarchy (levels, staging)

• Hardware architecture (e.g., parallelism)

Page 3:

Agenda

• Indexing taxonomy

• B-tree technology

Page 7:

Balancing bandwidths

• Disk, network, memory, CPU processing
– Decompression, predicate evaluation, copying

• Table scans
– Row stores, column stores
– NSM versus PAX versus ?

• Index scans
– Range queries, look-ups, MDAM

• Intermediate results
– Sort, hash join, hybrid hash join, etc.

How many disks per CPU core?

Flash devices or traditional disks?
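As a rough illustration of balancing these bandwidths, the "how many disks per CPU core?" question reduces to matching a core's consumption rate against per-device scan bandwidth. Every constant in this sketch is an assumption for illustration, not a number from the talk:

```python
# Back-of-envelope bandwidth balancing; all figures are assumed.
disk_bandwidth_mb_s = 100          # assumed sequential scan bandwidth per disk
cpu_rows_per_s = 2_000_000         # assumed rows one core can decompress + filter
row_size_bytes = 200               # assumed average row size

core_consumption_mb_s = cpu_rows_per_s * row_size_bytes / 1e6
disks_per_core = core_consumption_mb_s / disk_bandwidth_mb_s
print(f"one core consumes ~{core_consumption_mb_s:.0f} MB/s, "
      f"so it keeps ~{disks_per_core:.0f} disks busy")
```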

Page 9:

Hardware support

• CPU caches
– Alignment, data organization
– Prefetch instructions

• Instructions for large data
– Quadwords, etc.

• Native encoding
– Avoid decimal numerics

• GPUs? FPGAs?

Binary search or interpolation search?

Avoid XML?
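The "binary search or interpolation search?" callout can be made concrete. This generic sketch (not from the slides) assumes roughly uniformly distributed numeric keys, which is when interpolation search pays off:

```python
def interpolation_search(keys, target):
    """Search a sorted list of integer keys; assumes a roughly uniform key
    distribution so the first probe lands close to the target."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:                     # avoid division by zero
            break
        # probe position proportional to the target's place in the key range
        pos = lo + (hi - lo) * (target - keys[lo]) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return lo if lo < len(keys) and keys[lo] == target else -1

print(interpolation_search(list(range(0, 1000, 10)), 730))   # -> 73
```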

Page 11:

Read-ahead and write-behind

Buffer pool = latency × bandwidth

• Disk-order scans
– Guided by allocation information

• Index-order scans
– Guided by parent & grandparent levels
– Avoid neighbor pointers in B-tree leaves

• Index-to-index navigation
– Sort references prior to index nested loops join
– Hint references from query execution to storage layer

More I/O requests than devices!
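The rule of thumb "buffer pool = latency × bandwidth" is a bandwidth-delay product. A small sketch with assumed device characteristics shows why each device needs several outstanding read-ahead requests:

```python
# Bandwidth-delay product: buffer space that keeps one device busy.
# The device characteristics below are assumptions for illustration.
latency_s = 0.010                  # assumed ~10 ms access latency
bandwidth_bytes_s = 100e6          # assumed ~100 MB/s transfer rate
page_bytes = 64 * 1024             # assumed 64 KB pages

in_flight_bytes = latency_s * bandwidth_bytes_s      # 1 MB per device
in_flight_pages = in_flight_bytes / page_bytes       # ~15 outstanding pages
print(f"~{in_flight_bytes/1e6:.0f} MB, i.e. ~{in_flight_pages:.0f} "
      "outstanding read-ahead requests per device to hide latency")
```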

Page 13:

“Fail fast” and fault isolation

• Local slow-down produces asymmetry
– Weakest node imposes global slow-down

• Enable asynchrony in I/O and in processing

• Enable incremental load balancing
– Schedule multiple work units per server
– Largest first, assign work as servers free up

25 work units for 8 servers: S, J, etc. first – Q, Z, Y, X last
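The load-balancing rule above (largest work units first, each assigned to whichever server frees up next) is the classic longest-processing-time heuristic. A minimal sketch with made-up work-unit costs:

```python
import heapq

def largest_first(work_units, servers):
    """Assign (name, cost) work units: sort largest first, then give each
    unit to the server that becomes free earliest."""
    heap = [(0.0, s) for s in range(servers)]        # (finish time, server id)
    heapq.heapify(heap)
    schedule = {s: [] for s in range(servers)}
    for name, cost in sorted(work_units, key=lambda u: -u[1]):
        finish, s = heapq.heappop(heap)              # server that frees up first
        schedule[s].append(name)
        heapq.heappush(heap, (finish + cost, s))
    return schedule, max(f for f, _ in heap)         # makespan = slowest server

# hypothetical sizes for 25 work units A..Y on 8 servers
units = [(chr(ord('A') + i), 25 - i) for i in range(25)]
plan, makespan = largest_first(units, 8)
print(makespan, plan[0])
```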

Page 15:

Scheduling in query execution

• Admission control – too much concurrency

• Degree of parallelism – match available cores

• Pipelining of operations – avoid thrashing

• “Slack” between producers and consumers
– Partitioning: output buffer per consumer
– Merging: input buffer per producer
– “Free” packets to enable asynchronous execution
– 512 × 512 × 4 × 64 KB = 2^36 B = 64 GB

Lower memory need with more synchronization?
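The memory figure in the last bullet can be checked directly, interpreting it (as an assumption) as producers × consumers × packets per pair × packet size:

```python
producers, consumers = 512, 512    # example degree of parallelism from the slide
packets_per_pair = 4               # packets per producer/consumer pair (assumed meaning)
packet_bytes = 64 * 1024           # 64 KB packets

slack_bytes = producers * consumers * packets_per_pair * packet_bytes
print(slack_bytes == 2**36, slack_bytes / 2**30)   # True, 64.0 GiB of buffer slack
```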

Page 17:

Synchronization in communication

• “Slack” is a bad place to save memory!

• Demand-driven versus data-driven execution
– Faster producer will starve for free packets
– Faster consumer will starve for full packets
– Slowest step in pipeline determines bandwidth

Page 19:

Bad algorithms in query execution

• Query optimization versus query execution
– Compile-time versus run-time
– Anticipated sizes, memory availability, etc.

• Fast execution with perfect query optimization
– Merge join: sorted indexes, sorted intermediate results
– Hash join

• Robust execution by run-time adaptation
– Index nested loops join
– Requires some innovation …

Page 20:

Query

• Varying predicate selectivity together or separately

• Forced plans – focus on robustness of execution
– Resource management (memory allocation)
– Index use, join algorithm, join order

select count (*) from lineitem where l_partkey >= :lowpart and l_shipdate >= :lowdate

Page 21:

Physical database

• Primary index on order key, line number

• 1-column (non-covering) secondary indexes
– Foreign keys, date columns

• 2-column (covering) secondary indexes
– Part key + ship date, ship date + part key

• Large plan space
– Table scan
– Single index + fetch from table
– Join two indexes to cover the query
– Exploit two-column indexes

Page 22:

Wildly different performance curves

[Chart: "Single-table execution times", time in seconds versus row count, comparing the scan plan, fetch plan, join plan, "Fetch 9115", hash join, merge join, and join + fetch plans.]

Page 23:

Observations

• Table scan is very robust but not efficient
– Materialized views should enable fetching query results

• Traditional fetch is very efficient but not robust
– Perhaps addressed with risk-based cost calculation

• Multi-index plans are efficient and robust
– Independent of join order + method (in this experiment)

• Non-traditional fetch is quite robust
– Asynchronous prefetch or read-ahead
– Sorting record identifiers or keys in primary index
– Sort effect seems limited at high end

Page 25:

Hash join vs index nested loops join

• In-memory is an index!
– Direct address calculation
– Thread-private: memory allocation, concurrency control

• Traditional index nested loops join
– Index search using comparisons and binary search
– Shared pages in the buffer pool

• Improved index nested loops join
– Prefetch & pin the index in the buffer pool
– Replace page identifiers with in-memory pointers
– Replace binary search with interpolation search
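A minimal sketch (not the talk's implementation) of two of the listed ideas: sorting the outer references before probing, so index nested loops join touches the inner index in key order, and probing an index that is pinned in memory as plain arrays instead of going through the buffer pool. The binary search here could be replaced by the interpolation search shown earlier:

```python
import bisect

def index_nested_loops_join(outer_rows, key_of, index_keys, index_rows):
    """Sketch: index nested loops join against an index pinned in memory as
    parallel sorted lists (index_keys / index_rows). Probes are sorted first
    so successive look-ups walk the index in key order."""
    results = []
    for row in sorted(outer_rows, key=key_of):            # sort references first
        k = key_of(row)
        i = bisect.bisect_left(index_keys, k)             # binary search probe
        while i < len(index_keys) and index_keys[i] == k:
            results.append((row, index_rows[i]))
            i += 1
    return results

inner_keys = [1, 3, 3, 7, 9]
inner_rows = ["a", "b1", "b2", "c", "d"]
print(index_nested_loops_join([(7, "x"), (3, "y")], lambda r: r[0],
                              inner_keys, inner_rows))
```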

Page 26:

Index maintenance

• Data warehouse: fact table with 3-9 foreign keys
– Non-clustered index per foreign key
– Plus 1-3 date columns with non-clustered indexes
– Plus materialized and indexed views

• Traditional bulk insertion (load, roll-in)
– Per row: 4-12 index insertions, read-write 1 leaf each
– Per disk: 200 I/Os per second, 10 rows/sec = 1 KB/sec

• Known techniques
– Drop indexes prior to bulk insertion?
– Deferred index & view maintenance?
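The slide's bandwidth estimate follows from a short calculation; the row size used to turn rows/sec into KB/sec is an assumption:

```python
indexes_per_row = 10        # within the slide's range of 4-12 index insertions per row
ios_per_index = 2           # read + write one leaf per index insertion
ios_per_second = 200        # random I/Os per second for one traditional disk
row_bytes = 100             # assumed fact-table row size

rows_per_second = ios_per_second / (indexes_per_row * ios_per_index)   # 10 rows/s
print(rows_per_second, rows_per_second * row_bytes / 1000, "KB/s per disk")
```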

Page 27:

Partitioned B-trees

[Diagrams: a traditional B-tree index covering keys a–z; a partitioned B-tree whose partitions #1–#4 each cover keys a–z; and the same index after merging key range a–j, which then resides in partition #0 while keys k–z remain in partitions #1–#4.]
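Partitioned B-trees keep all partitions within a single B-tree by prefixing every key with an artificial partition number, so partitions appear and disappear through ordinary insertions and deletions. A toy sketch in which a sorted list stands in for the B-tree and whole partitions are merged (the real technique can merge key ranges incrementally):

```python
import bisect

class PartitionedBTree:
    """Toy model: one sorted structure whose keys are (partition_no, user_key)."""
    def __init__(self):
        self.entries = []                            # sorted (partition, key) pairs

    def insert(self, partition, key):
        bisect.insort(self.entries, (partition, key))

    def lookup(self, key, partitions):
        # A query must probe every existing partition for the key.
        return [(p, key) for p in partitions if (p, key) in self.entries]

    def merge(self, sources, target=0):
        # Merging moves records into the target partition, restoring one sorted
        # run for those keys, much like a step of an external merge sort.
        moved = [(target, k) for p, k in self.entries if p in sources]
        self.entries = [(p, k) for p, k in self.entries if p not in sources]
        for e in moved:
            bisect.insort(self.entries, e)

t = PartitionedBTree()
for p, k in [(1, "a"), (1, "m"), (2, "c"), (3, "j")]:
    t.insert(p, k)
t.merge({1, 2, 3})        # afterwards all keys sit, fully sorted, in partition #0
print(t.entries)
```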

Page 28:

Algorithms

• Run generation
– Quicksort or replacement selection (priority queue)
– Exploit all available memory, grow & shrink as needed

• Merging
– Like external merge sort, efficient on block-access
– Exploit all available memory, grow & shrink as needed
– Best case: single merge step
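A compact sketch of run generation by replacement selection with a priority queue; the in-memory capacity is a made-up constant, and with random input this tends to produce runs about twice the size of memory:

```python
import heapq

def replacement_selection(records, memory_slots=4):
    """Generate sorted runs: keep `memory_slots` records in a priority queue and
    emit the smallest; a record smaller than the last one written must wait for
    the next run (tagged with run number + 1)."""
    it = iter(records)
    heap = [(0, r) for r in (next(it, None) for _ in range(memory_slots))
            if r is not None]
    heapq.heapify(heap)
    runs, last = {}, {}
    while heap:
        run, rec = heapq.heappop(heap)
        runs.setdefault(run, []).append(rec)
        last[run] = rec
        nxt = next(it, None)
        if nxt is not None:
            # joins the current run only if it can still be emitted in order
            heapq.heappush(heap, (run if nxt >= last[run] else run + 1, nxt))
    return [runs[r] for r in sorted(runs)]

print(replacement_selection([5, 2, 9, 1, 7, 3, 8, 6, 4], memory_slots=4))
# -> [[1, 2, 3, 5, 6, 7, 8, 9], [4]]
```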

Page 29:

Concurrency control and recovery

“Must reads” for database geeks

Page 30:

Concurrency control and recovery

“Should reads” for database geeks

Page 31:

Tutorial on hierarchical locking

• More generally: multi-granularity locking

• Lock acquisition down a hierarchy
– “Intention” locks IS and IX

• Standard example: file & page
– T1 holds S lock on file
– T2 wants IS lock on file, S locks on some pages
– T3 wants X lock on file
– T4 wants IX lock on file, X locks on some pages

Compatibility matrix with intention locks:

     S    X    IS   IX   SIX
S    ok        ok
X
IS   ok        ok   ok   ok
IX             ok   ok
SIX            ok

Basic S/X compatibility matrix:

     S    X
S    ok
X
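A small sketch of the example: each request is checked against the compatibility matrix above. The matrix is the standard one from the slide; the function and variable names are illustrative scaffolding:

```python
# Standard multi-granularity compatibility matrix (pairs that are compatible).
COMPAT = {
    ("S", "S"), ("S", "IS"), ("IS", "S"),
    ("IS", "IS"), ("IS", "IX"), ("IX", "IS"),
    ("IX", "IX"), ("IS", "SIX"), ("SIX", "IS"),
}

def compatible(held, requested):
    return (held, requested) in COMPAT

def can_grant(holders, requested):
    """Grant a request only if it is compatible with every lock already held
    on that resource by other transactions."""
    return all(compatible(h, requested) for h in holders)

# The slide's example: T1 holds S on the file.
file_holders = ["S"]
print(can_grant(file_holders, "IS"))   # T2: IS on file -> True, then S on pages
print(can_grant(file_holders, "X"))    # T3: X on file  -> False
print(can_grant(file_holders, "IX"))   # T4: IX on file -> False, so no X on pages
```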

Page 32:

Quiz

• Why are all intention locks compatible?

• Conflicts are decided more accurately at a finer granularity of locking.

Page 33:

SQL Server lock modes

Page 34:

Lock manager invocations

• Combine IS+S+Ø into SØ (“key shared, gap free”)
– Cut lock manager invocations by factor 2

• Strict application of standard techniques
– No new semantics

Traditional lock modes:

     S    X    IS   IX
S    ok        ok
X
IS   ok        ok   ok
IX             ok   ok

Automatic derivation of combined modes (key mode, gap mode):

     S    X    SØ   ØS   XØ   ØX   SX   XS
S    ok        ok   ok
X
SØ   ok        ok   ok        ok   ok
ØS   ok        ok   ok   ok             ok
XØ                  ok        ok
ØX             ok        ok
SX             ok
XS                  ok
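The derivation itself is mechanical: a combined mode is a (key mode, gap mode) pair, Ø (no lock) conflicts with nothing, and two combined modes are compatible exactly when both components are. A sketch that reproduces the large matrix above; the spellings "SS" and "XX" for the plain S and X modes are shorthand introduced here:

```python
BASE = {("S", "S")}                    # S/X compatibility for a single resource

def base_compatible(a, b):
    return a == "Ø" or b == "Ø" or (a, b) in BASE

def compatible(m1, m2):
    """Combined modes are (key mode, gap mode) pairs, e.g. 'SØ' = key S, gap free.
    Two modes conflict iff they conflict on the key or on the gap."""
    (k1, g1), (k2, g2) = m1, m2
    return base_compatible(k1, k2) and base_compatible(g1, g2)

modes = ["SS", "XX", "SØ", "ØS", "XØ", "ØX", "SX", "XS"]
for m in modes:
    row = ["ok" if compatible(tuple(m), tuple(n)) else "  " for n in modes]
    print(f"{m:3}", " ".join(row))
```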

Page 35:

Key deletion

• User transaction
– Sets ghost bit in record header
– Lock mode is XØ (“key exclusive, gap free”)

• System transaction
– Verifies absence of locks & lock requests
– Erases ghost record
– No lock required, data structure change only
– Absence of other locks is required

Page 36:

Key insertion after deletion

• Insertion finds ghost record
– Clears ghost bit
– Sets other fields as appropriate
– Lock mode is XØ (“key exclusive, gap free”)

• Insertion reverses deletion

Page 37:

Key insertion

• System transaction creates a ghost record
– Verifies absence of ØS lock on low gap boundary (actually compatibility with ØX)
– No lock acquisition required

• User transaction marks the record valid
– Locking the new key in XØ (“key exclusive, gap free”)
– High concurrency among user insertions

• No need for “creative” lock modes or durations

• Insertion mirrors deletion
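A toy end-to-end sketch of the ghost-record protocol on these slides; the record layout, the lock manager, and the boundary between system and user transactions are all heavily simplified stand-ins:

```python
class LockManager:
    """Stub lock manager for the sketch; real conflict checking is omitted."""
    def __init__(self):
        self.locks = {}
    def lock(self, key, mode):
        self.locks.setdefault(key, []).append(mode)

class Record:
    def __init__(self, key):
        self.key, self.ghost, self.value = key, True, None   # created as a ghost

def insert(index, key, value, lm):
    # System transaction: create the ghost slot (structural change, no user lock;
    # it only needs to be compatible with ØX on the preceding gap).
    rec = index.setdefault(key, Record(key))
    # User transaction: lock the key XØ ("key exclusive, gap free"), then
    # turn the ghost into a valid record.
    lm.lock(key, "XØ")
    rec.value, rec.ghost = value, False
    return rec

def delete(index, key, lm):
    # User transaction: lock XØ and set the ghost bit; contents stay in place
    # until a system transaction erases the ghost once all locks are gone.
    lm.lock(key, "XØ")
    index[key].ghost = True

idx, lm = {}, LockManager()
insert(idx, 4711, "payload", lm)
delete(idx, 4711, lm)
print(idx[4711].ghost, lm.locks[4711])    # True ['XØ', 'XØ']
```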

Page 38:

Logging a deletion

• Traditional design
– Small log record in user transaction
– Full undo log record in system transaction

• Optimization
– Single log record for entire system transaction
– With both old record identifier and transaction commit
– No need for transaction undo
– No need to log record contents
– Big savings in clustered indexes

Transaction …, Page …, erase ghost 2; commit!

Page 39:

Logging an insertion

• 1st design
– Minimal log record for ghost creation – key value only
– Full log record in user transaction for update

• 2nd design
– Full user record created as ghost – full log record
– Small log record in user transaction

• Bulk append
– Use 1st design above
– Run-length encoding of multiple new keys

Transaction …, Page …, create ghosts 4-8, keys 4711 (+1)

Page 40:

Summary: key range locking

• “Radically old” design

• Sound theory – no “creative” lock modes
– Strict application of multi-granularity locking
– Automatic derivation of “macro” lock modes
– Standard lock retention until end-of-transaction

• More concurrency than traditional designs
– Orthogonality avoids missing lock modes

• Key insertion & deletion via ghost records
– Insertion is symmetric to deletion
– Efficient system transactions, including logging

Page 41:

Like scalable database indexing

Page 42:

Summary

• Re-think parallel data & algorithms:
– Partitioning: load balancing
– Pipelining: communication & synchronization
– Local execution: algorithms & data structures!

• Re-think power efficiency
– Algorithms & data structures!

• Database query & update processing
– Re-think indexes & their implementation