Parallel and distributed databases
R & G Chapter 22
What is a distributed database?
Why distribute a database?
Scalability and performance
Resilience to failures
[Figure: throughput versus data size]
Why distribute a database?
Data is already distributed
Or needs to be distributed
Data is in multiple systems
Why not distribute a database?
You must earn your complexity!
Communication needed
  Must build a complex infrastructure
  Unpredictable latencies must be masked
More types of failures
  More components to fail
  Network failures
  Congestion, timeouts
More complex planning
  Communication cost plus I/O cost
May have to deal with heterogeneity
  Different types of systems
  Different schemas, possibly incompatible
  Different administrative domains
Types of distributed databases
The old days: mainframes
Definitely not distributed!
Client-server
[Figure: user interaction and data processing, connected by a network]
Parallel database
Primary/secondary
Multidatabase
How do they work?
What is shared?
How to distribute the data?
How to process the data?
How to update the data?
What is shared? Memory
[Figure: CPUs, RAM, disk]
Example: most modern DBMSs
What is shared? Disk
[Figure: RAM]
Example: Oracle RAC
What is shared? Nothing
[Figure: RAM]
Examples: search engines, Teradata
How to distribute the data?
Server 1 Server 2 Server 3 Server 4
Bike      $86     6/2/07    636353
Chair     $10     6/5/07    662113
Couch     $570    6/1/07    424252
Car       $1123   6/1/07    256623
Lamp      $19     6/7/07    121113
Bike      $56     6/9/07    887734
Scooter   $18     6/11/07   252111
Hammer    $8000   6/11/07   116458
How to distribute the data?
Hash partitioning: (key,value) is routed through Hash()
Range partitioning: (key,value) is routed by comparing the key (<= X, > X)
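The two routing rules can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular system: the function names, the choice of SHA-256 as the hash, and the boundary keys "C", "H", "M" are all assumptions made for the example.

```python
import hashlib
from bisect import bisect_left

SERVERS = ["Server 1", "Server 2", "Server 3", "Server 4"]

def hash_partition(key, servers=SERVERS):
    """Route a key by a stable hash, modulo the number of servers."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

def range_partition(key, boundaries, servers=SERVERS):
    """Route a key to the first server whose upper bound is >= key.

    boundaries[i] is the largest key kept on servers[i] (the "<= X" test);
    anything above the last boundary goes to the last server ("> X").
    """
    assert len(boundaries) == len(servers) - 1
    return servers[bisect_left(boundaries, key)]

# Hypothetical boundaries chosen only for this example.
BOUNDARIES = ["C", "H", "M"]
for item in ["Bike", "Chair", "Lamp", "Scooter"]:
    print(item, hash_partition(item), range_partition(item, BOUNDARIES))
```

Hash partitioning spreads keys evenly but scatters neighbouring keys, while range partitioning keeps neighbouring keys together, which helps range queries but can produce hot spots if the boundaries are badly chosen.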
How to distribute the data?
Server 1 Server 2 Server 3 Server 4
Bike      $86     6/2/07    636353
Chair     $10     6/5/07    662113
Couch     $570    6/1/07    424252
Car       $1123   6/1/07    256623
Lamp      $19     6/7/07    121113
Bike      $56     6/9/07    887734
Scooter   $18     6/11/07   252111
Hammer    $8000   6/11/07   116458
Query processing
Intra-operator parallelism
Inter-operator parallelism
Parallel scanning
[Figure: several filter operators running in parallel, feeding one Result]
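A parallel scan can be sketched as one filter per partition whose outputs are concatenated into a single result. The sketch below is an assumption-laden illustration: threads stand in for servers, the helper names are hypothetical, and the placement of two rows per partition is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(rows, predicate):
    """The filter run locally on one partition."""
    return [row for row in rows if predicate(row)]

def parallel_scan(partitions, predicate):
    """Run the same filter on every partition and merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        partial = pool.map(lambda part: scan_partition(part, predicate), partitions)
        return [row for rows in partial for row in rows]

# Hypothetical placement of (item, price) rows on four partitions.
partitions = [
    [("Bike", 86), ("Chair", 10)],
    [("Couch", 570), ("Car", 1123)],
    [("Lamp", 19), ("Bike", 56)],
    [("Scooter", 18), ("Hammer", 8000)],
]
cheap = parallel_scan(partitions, lambda row: row[1] < 50)
print(cheap)
```

Because each filter touches only its own partition, no coordination is needed until the final merge.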
Sorting
Parallel hash join
[Figure: rows routed through Hash() to Join operators]
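One common way to realize a parallel hash join is to hash both inputs on the join key, so that matching rows land in the same partition, and then join each pair of partitions independently. The sketch below assumes that scheme; the function names and the owners table are hypothetical, and the per-partition joins run in a plain loop here although each could run on its own server.

```python
from collections import defaultdict

def partition_by_hash(rows, key_index, n):
    """Split rows into n partitions by hashing the join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key_index]) % n].append(row)
    return parts

def local_hash_join(left, right, left_key, right_key):
    """Classic build/probe hash join within one partition."""
    table = defaultdict(list)
    for row in left:                      # build phase
        table[row[left_key]].append(row)
    return [l + r for r in right for l in table.get(r[right_key], [])]  # probe phase

def parallel_hash_join(left, right, left_key, right_key, n=4):
    left_parts = partition_by_hash(left, left_key, n)
    right_parts = partition_by_hash(right, right_key, n)
    result = []
    # Each pair of partitions is independent of the others.
    for lp, rp in zip(left_parts, right_parts):
        result.extend(local_hash_join(lp, rp, left_key, right_key))
    return result

items = [(636353, "Bike"), (662113, "Chair"), (424252, "Couch")]
owners = [(636353, "alice"), (424252, "bob"), (999999, "carol")]  # invented data
joined = sorted(parallel_hash_join(items, owners, 0, 0))
print(joined)
```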
Semi-join
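The slide gives only the name. As the technique is usually described, a semi-join cuts communication cost by shipping just the join-key values to the remote site, letting that site discard rows that cannot match, and shipping back only the survivors. The sketch below follows that description; all names and the remote data are hypothetical.

```python
def semi_join_reduce(remote_rows, key_index, keys):
    """Runs at the remote site: keep only rows whose join key was shipped over."""
    return [row for row in remote_rows if row[key_index] in keys]

def distributed_join(local_rows, local_key, remote_rows, remote_key):
    # Step 1: project the join column; only these keys cross the network.
    keys = {row[local_key] for row in local_rows}
    # Step 2: the remote site filters its rows against the shipped keys.
    shipped = semi_join_reduce(remote_rows, remote_key, keys)
    # Step 3: only the reduced rows come back and are joined locally.
    joined = [l + r for l in local_rows for r in shipped if l[local_key] == r[remote_key]]
    return joined, len(shipped)

local_items = [(636353, "Bike"), (424252, "Couch")]
remote_orders = [(636353, "shipped"), (111111, "pending"), (222222, "pending")]  # invented data
result, rows_shipped = distributed_join(local_items, 0, remote_orders, 0)
print(result, rows_shipped)
```

In this example only one of the three remote rows crosses the network, at the price of first sending two keys in the other direction.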
Inter-operator parallelism
Updating distributed data
Synchronous: read-any-write-all
Reads are fast
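Read-any-write-all can be sketched as follows: a write must reach every replica before it succeeds, so a read can be served by any single replica. The classes below are a toy in-memory model with hypothetical names, not a real replication protocol.

```python
import random

class Replica:
    def __init__(self):
        self.data = {}
        self.up = True   # False models a disconnected replica

class ReadAnyWriteAll:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # Every replica must be reachable, otherwise the write cannot proceed.
        if not all(replica.up for replica in self.replicas):
            raise RuntimeError("write blocked: a replica is unreachable")
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, key):
        # Any one reachable replica holds the latest value.
        live = [replica for replica in self.replicas if replica.up]
        return random.choice(live).data[key]

replicas = [Replica() for _ in range(3)]
store = ReadAnyWriteAll(replicas)
store.write("Bike", 86)
print(store.read("Bike"))
```

The cost of fast reads is that a single unreachable replica blocks all writes.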
Updating distributed data
Synchronous: voting
Writes tolerant to disconnection
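Voting is commonly implemented with versioned quorums: a write updates W replicas, a read consults R replicas and takes the highest version, and R + W > N guarantees the two sets overlap. The sketch below assumes that scheme; the class names and the version counter are illustrative choices, not taken from the slides.

```python
class VersionedReplica:
    def __init__(self):
        self.store = {}   # key -> (version, value)
        self.up = True    # False models a disconnected replica

class QuorumStore:
    def __init__(self, replicas, read_quorum, write_quorum):
        if read_quorum + write_quorum <= len(replicas):
            raise ValueError("quorums must overlap: R + W > N")
        self.replicas = replicas
        self.read_quorum = read_quorum
        self.write_quorum = write_quorum

    def _reachable(self, needed):
        live = [replica for replica in self.replicas if replica.up]
        if len(live) < needed:
            raise RuntimeError("not enough replicas reachable for a quorum")
        return live[:needed]

    def _latest(self, key):
        votes = [r.store.get(key, (0, None)) for r in self._reachable(self.read_quorum)]
        return max(votes, key=lambda vote: vote[0])   # highest version wins

    def read(self, key):
        return self._latest(key)[1]

    def write(self, key, value):
        targets = self._reachable(self.write_quorum)
        version = self._latest(key)[0] + 1
        for replica in targets:
            replica.store[key] = (version, value)

nodes = [VersionedReplica() for _ in range(3)]
quorum = QuorumStore(nodes, read_quorum=2, write_quorum=2)
quorum.write("Bike", 86)
nodes[0].up = False            # one replica disconnects
quorum.write("Bike", 56)       # the write still succeeds with 2 of 3
nodes[0].up, nodes[2].up = True, False
print(quorum.read("Bike"))     # the stale copy on nodes[0] is outvoted
```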
Consistency of distributed data
Should provide ACID
Primary/secondary
Two-phase commit
PREPARE, then PREPARED and PREPARED, then COMMIT
PREPARE, then PREPARED and ABORT, then ABORT
PREPARE, then only one PREPARED, then ABORT
PREPARE, then PREPARED and PREPARED, then X (failure)
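The first three message sequences can be reproduced with a small coordinator sketch: it sends PREPARE to every participant, commits only if all reply PREPARED, and aborts on any ABORT vote or missing reply. The names are hypothetical, a missing reply is modeled as None rather than a real timeout, and the failure case marked X is deliberately not modeled.

```python
from enum import Enum

class Vote(Enum):
    PREPARED = "PREPARED"
    ABORT = "ABORT"

class Participant:
    def __init__(self, vote):
        self.vote = vote        # Vote.PREPARED, Vote.ABORT, or None for no reply
        self.outcome = None

    def prepare(self):
        """Phase 1: answer the coordinator's PREPARE message."""
        return self.vote

    def finish(self, decision):
        """Phase 2: apply the coordinator's COMMIT or ABORT."""
        self.outcome = decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]
    # Commit only on a unanimous PREPARED; anything else aborts.
    decision = "COMMIT" if all(v is Vote.PREPARED for v in votes) else "ABORT"
    for p in participants:
        p.finish(decision)
    return decision

print(two_phase_commit([Participant(Vote.PREPARED), Participant(Vote.PREPARED)]))
print(two_phase_commit([Participant(Vote.PREPARED), Participant(Vote.ABORT)]))
print(two_phase_commit([Participant(Vote.PREPARED), Participant(None)]))
```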
Conclusion
Parallelism and distribution very useful
  Performance
  Fault tolerance
  Scale
But complex!
  Rethink lots of aspects of the system
  Must earn the complexity