Time for Change: Why Not Transact In Memory?
Time for Change:
Why Not Transact In Memory?
Sang K. Cha
Email: [email protected]
on leave from Seoul National University, Korea
2
Transact In Memory Project
My interest in main memory databases started with WS-IRIS at HP Labs (1991-1992.1):
• In-memory personal object manager project of Tore Risch (before the SmallBase project, which became TimesTen)
Related previous projects at Seoul National Univ.: C++ design & implementation of MMDBMS prototypes
• Initial design of MMDBMS modules (1993-1994)
• Mr.RT for ETRI (1995-1996) – 2+ local startups
• Xmas: Extension of Mr.RT (1997-1999)
Object-oriented middleware for spatial object database (1994-1999)
Transact in Memory project (2000 - present)
3
Transact In Memory Project
Spirit:
• Experimental
• Do what we think is important rather than what others (government bureaucrats or committees) think is important.
• No more Build and Perish: deal with the end users directly for quick feedback on our work
Funding Scheme:
• Presently, self-managed through Transact In Memory, Inc. for research and development freedom
• Venture capital financing in the future?
4
Transact In Memory Project
Our Users
Performance-demanding users with simple applications (with 3n-letter company product bases)
• Telco companies
• Stock brokerage companies
• ….
Our People:
20+ and expanding
• Graduate students
• Consultant (patent attorney for IP protection)
• Full-time field service engineering & marketing staff
5
Outline
Introduction to Main Memory DBMS
• MMDBMS vs DRDBMS
• Memory-primary vs disk-primary architecture
More on Transact In Memory Project
• Some hypotheses
Parallel Architecture with Parallel Logging and Recovery
Some Memory-Primary Design
P*TIME: a Highly Parallel Transact In Memory Engine
Summary
6
Main Memory DBMS
Database resident in memory
Read transactions simply read the in-memory data.
Update transactions do in-memory updates and write update log to the log disk.
Occasionally, checkpoint the dirty pages of the in-memory database to the disk-resident backup DB to shorten the recovery time.
[Figure: MMDBMS with the Primary DB in memory, Logging to the Log, and Checkpointing to the Backup DB]
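The read, update, and checkpoint cycle described on this slide can be sketched as a toy model. This is only an illustration, not P*TIME code: the class `ToyMMDB`, its one-integer-per-page layout, and the Python lists standing in for the log disk and the backup DB are all assumptions made for the example, and it ignores transactions in flight during a checkpoint.

```python
class ToyMMDB:
    """Toy model: primary DB in memory, with stand-ins for the log disk and backup DB."""

    def __init__(self, num_pages):
        self.primary = [0] * num_pages   # in-memory primary DB (one int per page)
        self.backup = [0] * num_pages    # stand-in for the disk-resident backup DB
        self.log = []                    # stand-in for the sequential log disk
        self.dirty = set()               # pages updated since the last checkpoint

    def read(self, page):
        # Read transactions simply read the in-memory data.
        return self.primary[page]

    def update(self, page, value):
        # Update in memory and append a redo record to the log.
        self.primary[page] = value
        self.log.append((page, value))
        self.dirty.add(page)

    def checkpoint(self):
        # Write only the dirty pages to the backup DB. In this toy model
        # (no transactions in flight) the older log records can then be dropped,
        # which is what shortens recovery.
        for page in self.dirty:
            self.backup[page] = self.primary[page]
        self.dirty.clear()
        self.log.clear()

    def recover(self):
        # Reload the backup DB, then replay the log written since the last checkpoint.
        self.primary = list(self.backup)
        for page, value in self.log:
            self.primary[page] = value


db = ToyMMDB(num_pages=4)
db.update(0, 10)
db.checkpoint()
db.update(1, 20)
db.primary = [0] * 4   # simulate losing the in-memory database in a crash
db.recover()
```

After `recover()`, page 0 comes back from the backup DB and page 1 from the single log record written after the checkpoint.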
7
MMDBMS Technical Issues
Logging and Recovery: critical for not losing updates
In-Memory Index Structures
• Efficient L2 Cache Utilization?
Concurrency Control of Index and Data
DBMS Process & Application Binding Architecture
Query Processing
Memory Management
Availability for Mission-critical Applications
8
Reminder on Hardware Advances
1000 times over the 15 years [Patterson]
• CPU power, memory capacity, disk capacity
• Not really for memory and disk speed
• Network interface speed: 100Mbps – Gbps (maybe 100 times)
VAX 11/750 in 1984 (Stanford KBMS personal workstation)
• CPU (32 bit, cycle time: 320 ns, CPI: 10)
• 8MB memory (800 ns for write, 640 ns for read)
• Hundreds of MB disk space
• 10Mbps Ethernet connection
How much speedup in the DBMS performance over the same period? --- Not 1000 times!
9
In the old days (1984), a few MB was said…
With the availability of very large, relatively inexpensive main memories, it is becoming possible to keep large databases resident in memory. …. At the present time, memory for VAX 11/780 costs approximately $1,500 a megabyte. In 1990, a gigabyte of memory should cost less than $200,000 (with the availability of 1Mb chips). If 4 Mbit chips are available, the price might be as low as $50,000.
- DeWitt, Katz, Olken, Shapiro, Stonebraker, and Wood, “Implementation Techniques for Main Memory Database Systems,” SIGMOD 1984
10
and today a few GB….
We wrote the paper with essentially the same wording:
As the price of memory continues to drop below $1,000/GB, it is now feasible to place many of the database tables and indexes in main memory.
- Kim, Cha, Kwon, SIGMOD 2001
Real numbers (memory prices from www.kingston.com):
System           Max CPU#   Max Memory   Memory Price
Sun Blade 1K     2          8GB
Sun Fire 3800    8          64GB         $1170/2GB, $7021/4GB
Sun Fire 15K     108        0.5TB
Compaq ML 570    4          16GB         $914/2GB, $2143/4GB
11
MMDBMS vs DRDBMS
Well-Designed MMDBMS should outperform DRDBMS significantly
not only for read-oriented applications
but also for update-intensive applications!
with
memory-optimized data structures and algorithms
disk access reduced to the sequential form of log writing and occasional checkpointing
12
Q: Is a disk database with a large buffer the same as a main memory database?
No!
• Complex mapping between disk and memory
– E.g., traversing index blocks in the buffer requires bookkeeping of the mapping between disk and memory addresses
» Swizzle or not
• Disk index block design is not optimized against hardware cache misses.
[Figure: Large Buffer holding Data Blocks and Index Blocks, with a record referenced by disk address; Database and Log on disk]
13
Other Reasons to Say No!
Update performance advantage of main memory database
Disk-resident DBMS is limited in taking advantage of hardware advances
Huge code base
• many levels of abstraction for portability
• backward compatibility with the old customers (long history of evolution)
Slow to change
Too many control knobs to optimize
Disk-primary design
14
Two Different Paradigms
Disk Primary
• Design the disk-primary structures and algorithms.
• Cache the disk blocks of the disk-resident database in the memory buffer.
Memory Primary
• Design the memory-primary structures and algorithms.
– Cache-conscious design
• If necessary, map the in-memory structures to the disk.
– Uniform mapping
15
Memory Primary versus Disk Primary
[Figure: memory hierarchy of CPU, Cache Memory, and Main Memory (DRAM), marking the Memory-Primary Focus and the Disk-Primary Focus]
16
Why Memory-Primary Design? Memory access is no longer uniform!
Cost of an L2 cache miss: a hundred clock cycles, or a thousand-instruction opportunity
(From David Patterson et al., IEEE Micro, 1997)
CPU speed: 60% increase per year
DRAM speed: 10% increase per year
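To see why the two growth rates matter, the gap can be compounded over several years. The sketch below uses only the two rates quoted on this slide; the function name `speed_gap` and the choice of time horizons are assumptions for illustration.

```python
CPU_GROWTH = 0.60    # CPU speed increase per year (from the slide)
DRAM_GROWTH = 0.10   # DRAM speed increase per year (from the slide)


def speed_gap(years):
    """Ratio of CPU speedup to DRAM speedup after `years` of compounding."""
    return ((1 + CPU_GROWTH) / (1 + DRAM_GROWTH)) ** years


for years in (1, 5, 10):
    print(years, "years:", round(speed_gap(years), 1), "x")
```

At these rates the relative gap grows by about 45% a year, which compounds to roughly a 40-fold gap over ten years. This is the sense in which memory access is no longer uniform: a miss that falls through to DRAM keeps getting more expensive relative to the work the CPU could have done.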
17
Memory-Primary Design
Cache-conscious: the well-known disk access rule comes back in a slightly different flavor
Random access is expensive:
• Example: Pointer chasing
Block access is more efficient:
• Align data structures with L2 cache lines (e.g., 64 bytes).
• Pack the right data in the cache block.
18
Cache behavior of commercial DBMS (on Uniprocessor Pentium II Xeon)
Memory related delays: 40-80% of execution time.
Data accesses on caches: 19-86% of memory stalls.
Multiprocessor cache behavior? Probably worse because of coherence cache misses
Anastassia Ailamaki et al, DBMSs on a Modern Processor: Where does time go?, VLDB 99
19
Coherence Cache Miss Problem
Updates on the shared data structures on shared-memory multiprocessor systems → Cache invalidation → Coherence cache miss
Problems with
Frequently updated, shared internal control structures
locks, transaction tables, etc.
20
Transact In Memory Project at SNU
Objective: Build a highly parallel, highly scalable Main-Memory DBMS
Specific technical issues published so far:
• Parallel logging and recovery [ICDE01]
• Cache-conscious index [SIGMOD01]
• Cache-conscious concurrency control of index [VLDB01]
Exploit the parallelism in logging, recovery, and CC
Exploit advances in HW computing power
21
Transact In Memory Project: Some Hypotheses
MMDBMS should do more than just keeping the database in memory
• New architecture and new algorithms are needed; otherwise DRDBMS will catch up eventually.
Memory-primary design should boost the performance significantly
MMDBMS is the best place to apply the self-tuning RISC DBMS concept!
Thanks to Surajit Chaudhuri and Gerhard Weikum for their VLDB 2000 vision paper!
Once we have a high-performance MMDBMS, today’s multi-tier architecture may not be the best fit.
Database servers may become idle …
This is also a good place for inventing …
• Because of new cost model, new premise
22
Parallel Main Memory DBMS
Parallel Checkpointing Parallel Logging
23
Parallel Main Memory DBMS
Fully Parallel Recovery: Backup Database Loading and Log Processing all in Parallel
24
Problems with Existing Logging Schemes
Constraint on log replay by the serialization order
Conventional wisdom:
Distribute log records to multiple disks
• Pays the cost of merging and applying log records by the serialization order during recovery
Partition the database and assign different log disks to different partitions => limited parallelism
• Poor utilization of multiple log disks when updates are skewed to certain partitions
• Difficulty of handling transactions updating multiple segments
25
Differential Logging: Definition
Let the value of R change: $p \rightarrow q$
Differential log: $\Delta(p,q) = p \oplus q$
Redo: $p \oplus \Delta(p,q) = q$
Undo: $q \oplus \Delta(p,q) = p$
where $\oplus$ denotes the bit-wise XOR
[Lee, Kim, and Cha, ICDE 2001]
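The definition can be checked directly on byte strings. This is a minimal sketch of the idea, not the authors' implementation: the helper names `xor_bytes`, `differential_log`, `redo`, and `undo`, and the two sample values, are assumptions for the example.

```python
def xor_bytes(a, b):
    """Bit-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("before and after images must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def differential_log(before, after):
    # The log record stores only the XOR difference, not two separate images.
    return xor_bytes(before, after)


def redo(before, delta):
    return xor_bytes(before, delta)


def undo(after, delta):
    return xor_bytes(after, delta)


p = b"\x12\x34"                    # hypothetical value of R before the update
q = b"\xab\xcd"                    # hypothetical value of R after the update
delta = differential_log(p, q)

assert redo(p, delta) == q
assert undo(q, delta) == p
```

Redo and undo turn out to be the same operation applied to different starting values, so a single log record serves both purposes.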
26
Differential Logging Example
Value of a resource, R: 0000 → 0101 → 1001
Log records (log sequence number, resource ID, XOR difference):
L1: R, 0101
L2: R, 1100
Redo L1; Redo L2: 0000 → 0101 → 1001
Redo L2; Redo L1: 0000 → 1100 → 1001
27
Source of Parallelism
XOR operation is commutative and associative
Serialization order is insignificant
• Log records can be applied in arbitrary order
• Log records can be distributed to multiple disks to improve the logging and recovery performance.
Redo and undo can be intermixed
• Single-pass recovery possible
Backup DB loading can be intermixed with log processing
• Initialize the primary DB with 0’s. Copying Backup DB into memory is equivalent to applying XOR to the backup data and the corresponding memory initialized with 0’s.
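The order-independence claims can be exercised with a few lines of code. The sketch below replays the two log records from the example slide in both orders, then treats the backup image as one more XOR operand and shuffles it in with the log records. The function name `replay`, the backup value, and the extra log values are assumptions for illustration.

```python
import random
from functools import reduce


def replay(initial, deltas):
    """Apply XOR differences to an integer-valued resource in the given order."""
    return reduce(lambda value, delta: value ^ delta, deltas, initial)


# The example slide: R starts at 0000, L1 carries 0101, L2 carries 1100.
l1, l2 = 0b0101, 0b1100
assert replay(0b0000, [l1, l2]) == replay(0b0000, [l2, l1]) == 0b1001

# Backup loading intermixed with log processing: start from zeroed memory and
# XOR in the backup image and the log records in a random order.
backup_value = 0b0011                  # hypothetical checkpointed value of R
logs = [0b0101, 0b1100, 0b0110]        # hypothetical differences logged afterwards
expected = replay(backup_value, logs)  # conventional order: load backup, then redo

steps = logs + [backup_value]
random.shuffle(steps)
recovered = replay(0, steps)
assert recovered == expected
```

Because every step is an XOR into the same memory word, recovery threads reading different log disks and backup partitions do not need to agree on an order.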
28
Cache-conscious Index Structures
Design focus: minimize the L2 cache misses
Index node alignment with cache lines
• Node size: a few multiples of L2 cache blocks
Pointer elimination for increasing fanout → Reduced height → Reduced cache misses
• CSB+-tree [Rao & Ross, SIGMOD 2000]
Key compression for increasing fanout
• CR-tree [Kim et al, SIGMOD 2001], pkT/pkB-tree [Chen, et al SIGMOD 2001]
[Figure: node layouts of a B+-tree and a CSB+-tree]
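The effect of pointer elimination on fanout can be estimated with simple arithmetic. The sketch below assumes a 64-byte node with 4-byte keys, 4-byte pointers, and a 4-byte key count, and assumes every node is full; these sizes and the function names are assumptions for illustration, not figures from the slides.

```python
def bplus_keys(node_bytes=64, key_bytes=4, ptr_bytes=4, count_bytes=4):
    """Keys per B+-tree internal node: n keys need n + 1 child pointers."""
    return (node_bytes - count_bytes - ptr_bytes) // (key_bytes + ptr_bytes)


def csb_keys(node_bytes=64, key_bytes=4, ptr_bytes=4, count_bytes=4):
    """Keys per CSB+-tree internal node: one pointer to a contiguous child group."""
    return (node_bytes - count_bytes - ptr_bytes) // key_bytes


def levels_needed(num_leaves, keys_per_node):
    """Internal levels needed to reach num_leaves leaves when every node is full."""
    fanout = keys_per_node + 1
    levels, reachable = 0, 1
    while reachable < num_leaves:
        reachable *= fanout
        levels += 1
    return levels


for name, keys in (("B+-tree", bplus_keys()), ("CSB+-tree", csb_keys())):
    print(name, keys, "keys per node,", levels_needed(10_000_000, keys), "levels")
```

Under these assumptions the B+-tree node holds 7 keys and the CSB+-tree node holds 14, so reaching ten million leaves takes 8 internal levels versus 6. Each level saved is one fewer node visit, and so potentially one fewer L2 cache miss per lookup.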
29
Concurrency Control of Index
What if we combine traditional latch-based index concurrency control schemes?
Lock coupling involves latching and unlatching index nodes while traversing down the tree
Latching/unlatching the index node involves memory write on the cache block containing the latch.
Memory writes on the shared data structures on the shared-memory multiprocessor systems lead to coherence cache misses.
30
OLFIT Concurrency Control: Overview
Consistent node access protocol [Cha et al, VLDB 2001, Patent-Pending]:
ReadNode:
• Just read the node optimistically with no latch operations
• At the end, detect the read-update conflict by checking the latch state and the version number updated by the UpdateNode operation
• Retry on the conflict.
UpdateNode:
• Latch and unlatch the index node before and after updating the node content
– Prevent other node updates from interfering
• Increment the version number of the index node before unlatching
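The two primitives can be sketched as follows. This is an illustrative model of the protocol as listed on this slide, not the patented implementation: the `Node` class, the use of `threading.Lock` as the latch, and the busy-wait retry loop are assumptions, and a real engine would need atomic loads and memory barriers that Python does not expose.

```python
import threading


class Node:
    """Hypothetical index node: a latch, a version number, and the node content."""

    def __init__(self, content):
        self.latch = threading.Lock()
        self.version = 0
        self.content = list(content)


def update_node(node, new_content):
    # Latch before updating so that other updaters cannot interfere.
    with node.latch:
        node.content[:] = new_content
        node.version += 1          # increment the version before unlatching


def read_node(node):
    # Optimistic read: no latch is acquired, so no shared cache line is written.
    while True:
        version = node.version             # remember the version first
        snapshot = list(node.content)      # read the node content
        if node.latch.locked():            # an update is in progress: retry
            continue
        if node.version != version:        # an update completed meanwhile: retry
            continue
        return snapshot


node = Node([23, 34, 47])
update_node(node, [23, 34, 58])
snapshot = read_node(node)
```

The point of the design is that readers only read the latch and version, so traversals do not invalidate the cache line holding them on other processors; only updaters pay for a write.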
31
P*TIME: Highly Parallel Transact In Memory Engine
[Figure: P*TIME architecture. Labels: CPUs with caches, Cache-Conscious Index, Concurrency Control, Primary DB, Parallel Logging to Log Disks, Parallel Checkpointing to Backup DB Partitions, Parallel Recovery]
32
P*TIME: Highly Parallel Transact In Memory Engine
Lightweight, multithreaded architecture
• Scalable performance on multiprocessor platforms
Implemented mostly in C++
Small footprint (<10MB)
Industry standard API: SQL, JDBC (with binding to Java and C++), ODBC
Support of DB embedded application architecture
Multi-level concurrency control
Data replication for high availability
Simple performance model
• Ease of application tuning and optimization
33
P*TIME-Embedded Telco Directory Server
15M customer database
• 1 table (~1GB)
• 3 hash indexes (~1.3GB)
[Figure: deployment diagram. Labels: IBM Mainframe, Tel-Coin Server, SK Membership Web Servers …; Lookup, Update, Authentication; IBM MQ, P*TIME-MQ Bridge; SUN Enterprise 6500 (6 UltraSPARC II 400MHz CPUs, 10GB EDO RAM); 100MB Fiber Channel; SUN A5200 RAID System (Volume 1, Volume 2, Volume 3, Hot Plug)]
• Backup DB loading time: 16 sec (65MB/sec)
• Log processing time: 30-40MB/sec
• Parallel index building time: 19 sec
• Initial DB loading from text file: 6 min
34
Summary
The memory-primary architecture
• Utilizes the hardware resource more efficiently.
• Still many things to do
P*TIME: second-generation MMDBMS
• Memory-primary architecture with parallelism exploited
• Other optimizations in implementation
Numerous applications in Telecom and Internet
• Directory servers
• Update-intensive applications: batch on-line