pddb

Parallel and distributed databases

R & G Chapter 22

What is a distributed database?

Why distribute a database?

Scalability and performance

Resilience to failures

[Figure: throughput versus data size]

Why distribute a database?

Data is already distributed

Or needs to be distributed

Data is in multiple systems

Why not distribute a database?

You must earn your complexity!

Communication needed: must build a complex infrastructure; unpredictable latencies must be masked

More types of failures: more components to fail; network failures (congestion, timeouts)

More complex planning: communication cost plus I/O cost

May have to deal with heterogeneity: different types of systems; different schemas, possibly incompatible; different administrative domains

Types of distributed databases

The old days: mainframes

Definitely not distributed!

Client-server

User interaction

Data processing

Network

Parallel database

Primary/secondary

Multidatabase

How do they work?

What is shared?

How to distribute the data?

How to process the data?

How to update the data?

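What is shared? Memory

[Figure: CPUs sharing RAM and disk]

Example: most modern DBMSs

What is shared? Disk

[Figure: each CPU has its own RAM; the disk is shared]

Example: Oracle RAC

What is shared? Nothing

[Figure: each server has its own CPU, RAM, and disk]

Examples: search engines, Teradata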

How to distribute the data?

[Figure: rows of the table below distributed across Server 1, Server 2, Server 3, Server 4]

Bike      $86     6/2/07    636353
Chair     $10     6/5/07    662113
Couch     $570    6/1/07    424252
Car       $1123   6/1/07    256623
Lamp      $19     6/7/07    121113
Bike      $56     6/9/07    887734
Scooter   $18     6/11/07   252111
Hammer    $8000   6/11/07   116458

How to distribute the data?

Hash partitioning: (key, value) goes to the server chosen by Hash()

Range partitioning: (key, value) goes to one server if key <= X, to another if key > X
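The two placement rules above can be sketched in a few lines. This is a minimal illustration, not something from the slides: the function names `hash_partition` and `range_partition`, the choice of SHA-256 as Hash(), and the 0-based server numbering are all assumptions.

```python
import hashlib
from bisect import bisect_left

NUM_SERVERS = 4  # assumption: four servers, as drawn on the slides


def hash_partition(key: str, num_servers: int = NUM_SERVERS) -> int:
    """Pick a server index by hashing the key.

    SHA-256 is an arbitrary stand-in for Hash(); it is used here because
    Python's built-in hash() for strings changes between processes.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def range_partition(key, boundaries) -> int:
    """Pick a server index by comparing the key with sorted split points.

    With boundaries [X], keys <= X map to server 0 and keys > X to server 1.
    """
    return bisect_left(boundaries, key)
```

Hash partitioning spreads keys evenly but scatters neighbouring keys; range partitioning keeps neighbouring keys together, which helps range queries but can overload one server if the keys are skewed.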

How to distribute the data?

[Figure: each column of the table stored on a different server]

Server 1   Server 2   Server 3   Server 4
Bike       $86        6/2/07     636353
Chair      $10        6/5/07     662113
Couch      $570       6/1/07     424252
Car        $1123      6/1/07     256623
Lamp       $19        6/7/07     121113
Bike       $56        6/9/07     887734
Scooter    $18        6/11/07    252111
Hammer     $8000      6/11/07    116458

Query processing

Intra-operator parallelism

Inter-operator parallelism

Parallel scanning

[Figure: six filter operators run in parallel and feed a single Result]
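A parallel scan can be sketched as the same filter applied to every partition at once, with the partial results concatenated. This is a rough illustration only: the names `scan_partition` and `parallel_scan` are hypothetical, and threads in one process stand in for separate servers.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(rows, predicate):
    """The 'filter' box: keep the rows of one partition that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def parallel_scan(partitions, predicate):
    """Run the same filter over every partition concurrently, then merge."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields the partial results in partition order.
        partial_results = pool.map(
            scan_partition, partitions, [predicate] * len(partitions)
        )
        return [row for part in partial_results for row in part]


# Assumption: (item, price) pairs from the example table, two rows per server.
partitions = [
    [("Bike", 86), ("Chair", 10)],
    [("Couch", 570), ("Car", 1123)],
    [("Lamp", 19), ("Bike", 56)],
    [("Scooter", 18), ("Hammer", 8000)],
]
expensive = parallel_scan(partitions, lambda row: row[1] > 100)
```

This is intra-operator parallelism: one logical operator (the filter) is split into identical copies that each see a slice of the data.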

Sorting

Parallel hash join

[Figure: tuples of both relations routed to servers by Hash() of the join key]
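One way to picture the parallel hash join: both inputs are repartitioned by a hash of the join key, so matching tuples land on the same worker, and each worker then runs an ordinary local hash join. The sketch below is an assumption-laden illustration (hypothetical function names, tuples as rows, workers simulated by a loop), not the exact algorithm from the slides.

```python
from collections import defaultdict


def partition_by_key(rows, key_index, num_workers):
    """Route each row to a worker according to the hash of its join key."""
    parts = [[] for _ in range(num_workers)]
    for row in rows:
        # Built-in hash() is consistent within one process, which is enough here.
        parts[hash(row[key_index]) % num_workers].append(row)
    return parts


def local_hash_join(left, right, left_key, right_key):
    """Classic build-and-probe hash join on one worker."""
    table = defaultdict(list)
    for l_row in left:                       # build phase
        table[l_row[left_key]].append(l_row)
    return [
        l_row + r_row                        # probe phase
        for r_row in right
        for l_row in table.get(r_row[right_key], [])
    ]


def parallel_hash_join(left, right, left_key=0, right_key=0, num_workers=4):
    left_parts = partition_by_key(left, left_key, num_workers)
    right_parts = partition_by_key(right, right_key, num_workers)
    result = []
    # Each (left, right) pair is independent and could run on its own server.
    for l_part, r_part in zip(left_parts, right_parts):
        result.extend(local_hash_join(l_part, r_part, left_key, right_key))
    return result
```

Because equal keys always hash to the same worker, no worker ever needs to see another worker's partition.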

Semi-join
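The idea behind a semi-join is to cut communication: instead of shipping a whole relation to the other site, ship only the join keys, let the remote site discard rows that cannot match, and ship back the survivors. A minimal sketch, with hypothetical function names and in-memory lists standing in for the two sites:

```python
def keys_to_ship(local_rows, key_index):
    """Step 1 (local site): project the distinct join keys."""
    return {row[key_index] for row in local_rows}


def reduce_at_remote(remote_rows, key_index, shipped_keys):
    """Step 2 (remote site): keep only rows whose key was shipped."""
    return [row for row in remote_rows if row[key_index] in shipped_keys]


def semi_join(local_rows, remote_rows, local_key=0, remote_key=0):
    """Step 3 (local site): join against the reduced remote relation.

    Returns the join result and the number of remote rows that had to travel.
    """
    shipped = keys_to_ship(local_rows, local_key)
    reduced = reduce_at_remote(remote_rows, remote_key, shipped)
    joined = [
        l_row + r_row
        for l_row in local_rows
        for r_row in reduced
        if l_row[local_key] == r_row[remote_key]
    ]
    return joined, len(reduced)
```

The trade-off is an extra round trip for the keys in exchange for shipping fewer full rows, which pays off when few remote rows match.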

Inter-operator parallelism

Updating distributed data

Synchronous: read-any-write-all

Reads are fast

Updating distributed data

Synchronous: voting

Writes tolerant to disconnection
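Both schemes can be seen as one quorum rule with different settings. With N replicas, a write must reach W of them and a read must consult R of them; read-any-write-all is R = 1, W = N, while voting uses smaller W. The sketch below is illustrative only: the `Replica` class, the version counter, and the function names are assumptions, and it presumes R + W > N and 2W > N so that quorums overlap.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.version = 0
        self.up = True  # False models a disconnected replica


def quorum_write(replicas, value, write_quorum):
    """Write to write_quorum reachable replicas; fail if too few are reachable."""
    reachable = [r for r in replicas if r.up]
    if len(reachable) < write_quorum:
        return False
    # Any write_quorum-sized set overlaps the previous write set (2W > N),
    # so the highest version seen here is the latest committed one.
    new_version = max(r.version for r in reachable) + 1
    for r in reachable[:write_quorum]:
        r.value, r.version = value, new_version
    return True


def quorum_read(replicas, read_quorum):
    """Consult read_quorum replicas and return the value with the newest version."""
    reachable = [r for r in replicas if r.up][:read_quorum]
    if len(reachable) < read_quorum:
        raise RuntimeError("not enough replicas reachable")
    return max(reachable, key=lambda r: r.version).value
```

With three replicas and one disconnected, `quorum_write(replicas, v, 2)` still succeeds while `quorum_write(replicas, v, 3)` (write-all) does not, which is the sense in which voting makes writes tolerant to disconnection. The price is that reads must contact more than one replica.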

Consistency of distributed data

Should provide ACID

Primary/secondary

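Two-phase commit

Coordinator sends PREPARE

Both participants reply PREPARED

Coordinator sends COMMIT

Two-phase commit

Coordinator sends PREPARE

One participant replies PREPARED, the other replies ABORT

Coordinator sends ABORT

Two-phase commit

Coordinator sends PREPARE

One participant replies PREPARED; no reply arrives from the other

Coordinator sends ABORT

Two-phase commit

Coordinator sends PREPARE

Both participants reply PREPARED

X (a failure occurs before the outcome is delivered)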
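The message exchanges above can be condensed into a small coordinator loop. This is a simplified sketch under stated assumptions: the `Participant` class and `two_phase_commit` function are hypothetical, a vote of `None` models a participant that never replies, and logging, timeouts, and coordinator failure (the last scenario) are not modelled.

```python
class Participant:
    def __init__(self, name, vote="PREPARED"):
        self.name = name
        self.vote = vote       # "PREPARED", "ABORT", or None for no reply
        self.state = "INIT"

    def prepare(self):
        """Phase 1: answer the coordinator's PREPARE message."""
        if self.vote is not None:
            self.state = self.vote
        return self.vote

    def finish(self, decision):
        """Phase 2: apply the coordinator's decision."""
        self.state = decision


def two_phase_commit(participants):
    # Phase 1: send PREPARE to everyone and collect the votes.
    votes = [p.prepare() for p in participants]
    # Commit only if every vote is PREPARED; a missing reply counts as ABORT.
    decision = "COMMIT" if all(v == "PREPARED" for v in votes) else "ABORT"
    # Phase 2: broadcast the decision.
    for p in participants:
        p.finish(decision)
    return decision
```

A participant that has replied PREPARED cannot decide on its own; if the coordinator fails before phase 2, that participant has to wait, which is why the protocol is described as blocking.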

Conclusion

Parallelism and distribution are very useful: performance, fault tolerance, scale

But complex! Rethink lots of aspects of the system

Must earn the complexity
