Time for Change: Why Not Transact In Memory?


Sang K. Cha

Email: [email protected]

on leave from Seoul National University, Korea

2

Transact In Memory Project

My interest in main memory databases started with WS-IRIS at HP Labs (1991-1992.1):

• In-memory personal object manager project of Tore Risch (before SmallBase project which became TimesTen)

Related previous projects at Seoul National Univ.: C++ design & implementation of MMDBMS prototypes

• Initial design of MMDBMS modules (1993-1994)

• Mr.RT for ETRI (1995-1996) – 2+ local startups

• Xmas: Extension of Mr.RT (1997-1999)

Object-oriented middleware for spatial object database (1994-1999)

Transact in Memory project (2000 - present)

3


Spirit:

Experimental

Presently, self-managed through Transact In Memory, Inc. for research and development freedom

Venture capital financing in the future?

Funding Scheme:

Do what we think is important rather than what others (government bureaucrats or committees) think is important.

No more Build and Perish: deal with the end users directly for quick feedback on our work

4


Our Users

Performance-demanding users with simple applications (with 3n-letter company product bases)

• Telco companies

• Stock brokerage companies

• ….

Our People:

20+ and expanding

• Graduate students

• Consultant (patent attorney for IP protection)

• Full-time field service engineering & marketing staff

5

Outline

Introduction to Main Memory DBMS

• MMDBMS vs DRDBMS

• Memory-primary vs disk-primary architecture

More on Transact In Memory Project

• Some hypotheses

Parallel Architecture with Parallel Logging and Recovery

Some Memory-Primary Design

P*TIME: a Highly Parallel Transact In Memory Engine

Summary

6

Main Memory DBMS

Database resident in memory

Read transactions simply read the in-memory data.

Update transactions do in-memory updates and write update log to the log disk.

Occasionally, checkpoint the dirty pages of the in-memory database to the disk-resident backup DB to shorten the recovery time.

[Figure: MMDBMS with the primary DB in memory; logging writes to the log, and checkpointing writes to the backup DB.]
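The read, update, checkpoint, and recovery paths described on this slide can be sketched in a few lines. This is only an illustrative toy, not the project's code: the class and member names are hypothetical, the log disk and backup DB are simulated with in-memory vectors, and the checkpoint copies the whole database rather than only dirty pages.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical redo record: which slot changed and its new value.
struct LogRecord {
    std::size_t slot;
    int new_value;
};

class TinyMMDB {
public:
    explicit TinyMMDB(std::size_t n) : primary_(n, 0), backup_(n, 0) {}

    // Read transactions touch only the in-memory primary DB.
    int read(std::size_t slot) const { return primary_.at(slot); }

    // Update transactions append a redo record and update memory.
    void update(std::size_t slot, int value) {
        assert(slot < primary_.size());
        log_.push_back({slot, value});  // stands in for a sequential log-disk write
        primary_[slot] = value;
    }

    // Checkpoint: copy the primary DB to the backup DB (simplified to a full
    // copy); log records written before this point are no longer needed.
    void checkpoint() {
        backup_ = primary_;
        log_.clear();
    }

    // Simulated crash and restart: memory is lost, so reload the backup DB
    // and replay the remaining log in order.
    void crash_and_recover() {
        primary_ = backup_;
        for (const LogRecord& r : log_) primary_[r.slot] = r.new_value;
    }

    std::size_t log_length() const { return log_.size(); }

private:
    std::vector<int> primary_;    // in-memory primary DB
    std::vector<int> backup_;     // stands in for the disk-resident backup DB
    std::vector<LogRecord> log_;  // stands in for the log disk
};
```

The shorter the log left after a checkpoint, the less work `crash_and_recover` has to do, which is the reason given above for checkpointing occasionally.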

7

MMDBMS Technical Issues

Logging and Recovery: critical for not losing updates

In-Memory Index Structures

• Efficient L2 Cache Utilization?

Concurrency Control of Index and Data

DBMS Process & Application Binding Architecture

Query Processing

Memory Management

Availability for Mission-critical Applications

8

Reminder on Hardware Advances

1000 times over the 15 years [Patterson]

• CPU power, memory capacity, disk capacity

• Not really for memory and disk speed

• Network interface speed: 100Mbps – Gbps (maybe 100 times)

VAX 11/750 in 1984 (Stanford KBMS personal workstation)

• CPU (32 bit, cycle time: 320 ns, CPI: 10)

• 8MB memory (800 ns for write, 640 ns for read)

• Hundreds of MB disk space

• 10Mbps Ethernet connection

How much speedup in the DBMS performance over the same period? --- Not 1000 times!

9

In the old days (1984), a few MB was said…

With the availability of very large, relatively inexpensive main memories, it is becoming possible to keep large databases resident in memory. …. At the present time, memory for VAX 11/780 costs approximately $1,500 a megabyte. In 1990, a gigabyte of memory should cost less than $200,000 (with the availability of 1Mb chips). If 4 Mbit chips are available, the price might be as low as $50,000.

- DeWitt, Katz, Olken, Shapiro, Stonebraker, and Wood, “Implementation Techniques for Main Memory Database Systems,” SIGMOD 1984

10

and today a few GB….

We write the paper with essentially the same wording:

As the price of memory continues to drop below $1,000/GB, it is now feasible to place many of the database tables and indexes in main memory. - Kim, Cha, Kwon, SIGMOD 2001

Real numbers:

System          Max CPU#   Max Memory   Memory Price (www.kingston.com)
Sun Blade 1K    2          8GB
Sun Fire 3800   8          64GB         $1170/2GB, $7021/4GB
Sun Fire 15K    108        0.5TB
Compaq ML 570   4          16GB         $914/2GB, $2143/4GB

11

MMDBMS vs DRDBMS

Well-Designed MMDBMS should outperform DRDBMS significantly

not only for read-oriented applications

but also for update-intensive applications!

with

memory-optimized data structures and algorithms

disk access reduced to the sequential form of log writing and occasional checkpointing

12

Q: Is Disk Database with large buffer the same as Main Memory Database?

No!

• Complex mapping between disk and memory

– E.g., traversing index blocks in buffer requires bookkeeping the mapping between disk and memory addresses

» Swizzle or not

• Disk index block design is not optimized against hardware cache misses.

[Figure: large buffer holding data blocks and index blocks of the disk-resident database, with labels for record, disk address, database, and log.]

13

Other Reasons to Say No!

Update performance advantage of main memory database

Disk-resident DBMS is limited in taking advantage of hardware advances

Huge code base

• many levels of abstraction for portability

• backward compatibility with the old customers (long history of evolution)

Slow to change

Too many control knobs to optimize

Disk-primary design

14

Two Different Paradigms

Memory Primary

• Design the memory-primary structures and algorithms.

– Cache-conscious design

• If necessary, map the in-memory structures to the disk.

– Uniform mapping

Disk Primary

• Design the disk-primary structures and algorithms.

• Cache the disk blocks of the disk-resident database in the memory buffer.

15

Memory Primary versus Disk Primary

[Figure: memory hierarchy of CPU, cache memory, and main memory (DRAM), marking the memory-primary focus and the disk-primary focus.]

16

Why Memory-Primary Design? Memory access is no longer uniform!

Cost of an L2 cache miss: on the order of a hundred clock cycles, or the lost opportunity to execute on the order of a thousand instructions

(From David Patterson et al., IEEE Micro, 1997)

CPU speed: 60% increase per year

DRAM speed: 10% increase per year
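As a rough worked calculation, compounding the two growth rates above shows how fast the gap widens. The 15-year horizon is borrowed from the earlier hardware slide, and the rounded values are simple arithmetic on the quoted rates rather than measured figures.

```latex
\[
\frac{\text{CPU speed gain after } n \text{ years}}{\text{DRAM speed gain after } n \text{ years}}
  = \left(\frac{1.60}{1.10}\right)^{n}
\]
\[
n = 15:\qquad 1.60^{15} \approx 1150, \qquad 1.10^{15} \approx 4.2, \qquad
\left(\frac{1.60}{1.10}\right)^{15} \approx 276
\]
```

At these rates the CPU gain alone is close to the "1000 times" figure quoted earlier, while memory speed improves by only about a factor of four.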

17

Memory-Primary Design

Cache-conscious: the well-known disk access rule comes back in a slightly different flavor

Random access is expensive:

• Example: Pointer chasing

Block access is more efficient:

• Align data structures with L2 cache lines (e.g., 64 bytes).

• Pack the right data in the cache block.
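The two rules above can be illustrated with a small layout sketch. The type names are hypothetical, the 64-byte line size is the example value from the slide, and 4-byte keys are assumed; the point is only that keys searched together sit in one aligned block instead of being reached through pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed line size, as in the slide's example

// Pointer-chasing layout: each cell lives wherever the allocator put it,
// so following `next` may cost one cache miss per key.
struct ListCell {
    std::int32_t key;
    ListCell* next;
};

// Block layout: keys that are searched together share one cache line.
struct alignas(kCacheLine) KeyBlock {
    std::int32_t count;     // number of valid keys
    std::int32_t keys[15];  // 4 + 15 * 4 = 64 bytes
};
static_assert(sizeof(KeyBlock) == kCacheLine, "KeyBlock must fill exactly one line");
static_assert(alignof(KeyBlock) == kCacheLine, "KeyBlock must start on a line boundary");

// Scans one block; every comparison hits the same cache line.
inline bool block_contains(const KeyBlock& b, std::int32_t key) {
    assert(b.count >= 0 && b.count <= 15);
    for (std::int32_t i = 0; i < b.count; ++i) {
        if (b.keys[i] == key) return true;
    }
    return false;
}
```

With this layout a lookup over up to 15 keys touches one line, whereas the linked list may touch up to 15 different lines.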

18

Cache behavior of commercial DBMS (on uniprocessor Pentium II Xeon)

Memory related delays: 40-80% of execution time.

Data accesses on caches: 19-86% of memory stalls.

Multiprocessor cache behavior? Probably worse because of coherence cache misses

Anastassia Ailamaki et al., “DBMSs on a Modern Processor: Where Does Time Go?,” VLDB 1999

19

Coherence Cache Miss Problem

Updates on the shared data structures on shared-memory multiprocessor systems → Cache invalidation → Coherence cache miss

Problems with

Frequently updated, shared internal control structures

locks, transaction tables, etc.
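To make the problem concrete, here is one common mitigation for a frequently updated shared counter, shown only as an illustration and not as the technique used in this project: give every thread its own cache line so that its writes do not invalidate lines other threads are using. All names, the 64-byte line size, and the thread limit are assumptions.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size
constexpr std::size_t kMaxThreads = 8;  // hypothetical upper bound on worker threads

// Each slot owns a full cache line, so a write by one thread does not
// invalidate the line that holds another thread's slot.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

class ShardedCounter {
public:
    // Hot path: touches only the caller's own line.
    void add(std::size_t thread_id, std::uint64_t n) {
        assert(thread_id < kMaxThreads);
        slots_[thread_id].value.fetch_add(n, std::memory_order_relaxed);
    }

    // Rare path: reads every line and sums the per-thread values.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (const PaddedCounter& s : slots_) {
            sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::array<PaddedCounter, kMaxThreads> slots_;
};
```

A single shared counter would force the line to bounce between processors on every update; here only `total()` pays for cross-processor reads.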

20

Transact In Memory Project at SNU

Objective: Build a highly parallel, highly scalable Main-Memory DBMS

Specific technical issues published so far:

• Parallel logging and recovery [ICDE01]

• Cache-conscious index [SIGMOD01]

• Cache-conscious concurrency control of index [VLDB01]

Exploit the parallelism in logging, recovery, and CC

Exploit advances in HW computing power

21

Transact In Memory Project: Some Hypotheses

MMDBMS should do more than just keeping the database in memory

• New architecture and new algorithms are needed; otherwise DRDBMS will catch up eventually.

Memory-primary design should boost the performance significantly

MMDBMS is the best place to apply the self-tuning RISC DBMS concept!

Thanks to Surajit Chaudhuri and Gerhard Weikum for their VLDB 2000 vision paper!

Once we have a high-performance MMDBMS, today’s multi-tier architecture may not be the best fit.

Database servers may become idle …

This is also a good place for inventing …

• Because of new cost model, new premise

22

Parallel Main Memory DBMS

Parallel Checkpointing

Parallel Logging

23

Parallel Main Memory DBMS

Fully Parallel Recovery: Backup Database Loading and Log Processing all in Parallel

24

Problems with Existing Logging Schemes

Constraint on log replay by the serialization order

Conventional Wisdom

Distribute log records to multiple disks

• Pays the cost of merging and applying log records by the serialization order during recovery

Partition the database and assign different log disks to different partitions => limited parallelism

• Poor utilization of multiple log disks when updates are skewed to certain partitions

• Difficulty of handling transactions updating multiple segments

25

Differential Logging: Definition

Let the value of resource R change from p to q: $p \rightarrow q$

Differential log: $\Delta(p,q) = p \oplus q$

Redo: $p \oplus \Delta(p,q) = q$

Undo: $q \oplus \Delta(p,q) = p$

where $\oplus$ denotes the bit-wise XOR

[Lee, Kim, and Cha, ICDE 2001]

26

Differential Logging Example

Value of a resource, R: 0000 → 0101 → 1001

Log records (log sequence number, resource ID, …, XOR difference):

L1: R … 0101

L2: R … 1100

Redo L1; Redo L2: 0000 → 0101 → 1001

Redo L2; Redo L1: 0000 → 1100 → 1001
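The example above can be checked mechanically. The sketch below is a minimal illustration with a hypothetical record layout that keeps only the fields named on the slide; the actual record format in [Lee, Kim, and Cha, ICDE 2001] may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical differential log record with the fields named on the slide.
struct DiffLogRecord {
    std::uint32_t lsn;          // log sequence number
    std::uint32_t resource_id;  // which resource R the record belongs to
    std::uint32_t diff;         // XOR difference of before- and after-image
};

// Builds the differential log record for an update of R from p to q.
inline DiffLogRecord make_diff_log(std::uint32_t lsn, std::uint32_t rid,
                                   std::uint32_t p, std::uint32_t q) {
    return DiffLogRecord{lsn, rid, p ^ q};
}

// Redo and undo are the same operation: XOR the difference into the value.
inline std::uint32_t apply(std::uint32_t value, const DiffLogRecord& rec) {
    return value ^ rec.diff;
}

// Usage: the slide's example, replayed in both orders.
inline void slide_example() {
    const DiffLogRecord l1 = make_diff_log(1, 0, 0x0, 0x5);  // 0000 -> 0101
    const DiffLogRecord l2 = make_diff_log(2, 0, 0x5, 0x9);  // 0101 -> 1001
    assert(l1.diff == 0x5 && l2.diff == 0xC);                // 0101 and 1100
    assert(apply(apply(0x0, l1), l2) == 0x9);                // Redo L1; Redo L2
    assert(apply(apply(0x0, l2), l1) == 0x9);                // Redo L2; Redo L1
    assert(apply(0x9, l2) == 0x5);                           // Undo L2
}
```

Because `apply` serves for both redo and undo, a recovery pass does not need separate code paths for the two.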

27

Source of Parallelism

XOR operation is commutative and associative

Serialization order is insignificant

• Log records can be applied in arbitrary order

• Log records can be distributed to multiple disks to improve the logging and recovery performance.

Redo and undo can be intermixed

• Single-pass recovery possible

Backup DB loading can be intermixed with log processing

• Initialize the primary DB with 0’s. Copying the backup DB into memory is equivalent to applying XOR to the backup data and the corresponding memory initialized with 0’s.
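The last point can be sketched as follows: once memory starts at zero, a word from the backup DB and a differential log record are the same kind of work item, so they can be applied in any interleaving. This is a single-threaded illustration with hypothetical names; a truly parallel version would also need atomic XOR or a partitioning of the slots among threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One unit of recovery work: XOR `bits` into memory word `slot`.
// A backup-DB word and a differential log record look the same at this level.
struct XorWork {
    std::size_t slot;
    std::uint32_t bits;
};

// Rebuilds the primary DB from zeroed memory by applying the work items
// in whatever order they arrive.
inline std::vector<std::uint32_t> recover(std::size_t db_words,
                                          const std::vector<XorWork>& work) {
    std::vector<std::uint32_t> primary(db_words, 0);  // initialize with 0's
    for (const XorWork& w : work) {
        assert(w.slot < db_words);
        primary[w.slot] ^= w.bits;
    }
    return primary;
}
```

For example, with backup words 0101 and 0011 and log differences 1100 (slot 0) and 0001 (slot 1), any ordering of the four items yields 1001 and 0010.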

28

Cache-conscious Index Structures

Design focus: minimize the L2 cache misses

Index node alignment with cache lines

• Node size: a few multiples of L2 cache blocks

Pointer elimination for increasing fanout → Reduced height → Reduced cache misses

• CSB+-tree [Rao & Ross, SIGMOD 2000]

Key compression for increasing fanout

• CR-tree [Kim et al., SIGMOD 2001], pkT/pkB-tree [Chen et al., SIGMOD 2001]

[Figure: B+-tree node versus CSB+-tree node, with example keys 23, 34, 47, 58.]
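Pointer elimination in the CSB+-tree style can be sketched as a node that stores a single reference to its first child, with all children of a node laid out contiguously. The layout below is a hypothetical one that assumes 4-byte keys, a 64-byte node, and children addressed by index into a node pool; it is not taken from the cited paper's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-byte node: one child reference instead of one per key.
struct CsbNode {
    std::uint32_t num_keys;     // valid entries in keys[]
    std::uint32_t first_child;  // pool index of child 0; child i is at first_child + i
    std::int32_t keys[14];      // 8 + 14 * 4 = 64 bytes
};
static_assert(sizeof(CsbNode) == 64, "node should fill one 64-byte cache line");

// Returns the pool index of the child to follow for `key`.
inline std::uint32_t child_for(const CsbNode& node, std::int32_t key) {
    assert(node.num_keys <= 14);
    std::uint32_t i = 0;
    while (i < node.num_keys && key >= node.keys[i]) ++i;
    return node.first_child + i;  // computed from the position, not stored per key
}
```

A node of the same size that stored one 4-byte child reference per key would hold roughly half as many keys, which is where the higher fanout and reduced height come from.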

29

Concurrency Control of Index

What if we combine cache-conscious indexes with traditional latch-based index concurrency control schemes?

Lock coupling involves latching and unlatching index nodes while traversing down the tree

Latching/unlatching the index node involves memory write on the cache block containing the latch.

Memory writes on the shared data structures on the shared-memory multiprocessor systems lead to coherence cache misses.

30

OLFIT Concurrency Control: Overview

Consistent node access protocol [Cha et al, VLDB 2001, Patent-Pending]:

ReadNode:

• Just read the node optimistically with no latch operations

• At the end, detect the read-update conflict by checking the latch state and the version number updated by the UpdateNode operation

• Retry on the conflict.

UpdateNode:

• Latch and unlatch the index node before and after updating the node content

– Prevent other node updates from interfering

• Increment the version number of the index node before unlatching
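The two operations above can be sketched with a single control word per node. This is a simplified illustration, not the published OLFIT implementation: packing the latch into bit 0 and the version into the remaining bits is an assumption, the node payload is reduced to two integers, and all atomics use the default sequentially consistent ordering for clarity rather than speed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical node: bit 0 of `ctrl` is the latch, the other bits are the version.
// Payload fields are atomics so that optimistic reads are not data races in C++.
struct OlfitNode {
    std::atomic<std::uint64_t> ctrl{0};
    std::atomic<std::int32_t> low{0};
    std::atomic<std::int32_t> high{0};
};

struct Range {
    std::int32_t low;
    std::int32_t high;
};

// UpdateNode: latch, modify the content, bump the version, unlatch.
inline void update_node(OlfitNode& n, Range r) {
    assert(r.low <= r.high);
    std::uint64_t seen = n.ctrl.load();
    for (;;) {
        seen &= ~std::uint64_t{1};  // expect the node to be unlatched
        if (n.ctrl.compare_exchange_weak(seen, seen | 1)) break;  // latch acquired
    }
    n.low.store(r.low);
    n.high.store(r.high);
    n.ctrl.store(seen + 2);  // version + 1, latch bit cleared
}

// ReadNode: no latch and no write to the node; retry if an update interfered.
inline Range read_node(const OlfitNode& n) {
    for (;;) {
        const std::uint64_t before = n.ctrl.load();
        const Range r{n.low.load(), n.high.load()};
        const std::uint64_t after = n.ctrl.load();
        if ((before & 1) == 0 && before == after) return r;  // consistent snapshot
    }
}
```

Readers never write to the node, so traversals by many processors do not invalidate each other's cached copies; only an actual update does.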

31

P*TIME: Highly Parallel Transact In Memory Engine

[Figure: P*TIME architecture. Labels: CPUs with caches; cache-conscious index; concurrency control; primary DB; parallel logging to log disks; parallel checkpointing to backup DB partitions; parallel recovery.]

32

P*TIME: Highly Parallel Transact In Memory Engine

Lightweight, multithreaded architecture

• Scalable performance on multiprocessor platforms

Implemented mostly in C++

Small footprint (<10MB)

Industry standard API: SQL, JDBC (with binding to Java and C++), ODBC

Support of DB embedded application architecture

Multi-level concurrency control

Data replication for high availability

Simple performance model

• Ease of application tuning and optimization

33

P*TIME-Embedded Telco Directory Server

15M customer database

• 1 table (~1GB)

• 3 hash indexes (~1.3GB)

[Figure: deployment diagram. Labels: Lookup, Update, Authentication; IBM Mainframe; Tel-Coin Server; SK Membership Web Servers …; P*TIME-MQ Bridge; IBM MQ; SUN Enterprise 6500 (6 UltraSPARC II 400MHz CPUs, 10GB EDO RAM); 100MB Fiber Channel; SUN A5200 RAID System (Volume 1, Volume 2, Volume 3, Hot Plug).]

• Backup DB loading time: 16 sec (65MB/sec)

• Log processing time: 30-40MB/sec

• Parallel index building time: 19 sec

• Initial DB loading from text file: 6 min

34

Summary

The memory-primary architecture

• Utilizes the hardware resources more efficiently.

• Still many things to do

P*TIME: second-generation MMDBMS

• Memory-primary architecture with parallelism exploited

• Other optimizations in implementation

Numerous applications in Telecom and Internet

• Directory servers

• Update-intensive applications: batch on-line
