Cloud Databases Part 2
Transcript of Cloud Databases Part 2
![Page 2: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/2.jpg)
2
Relational Queries over SDDSs
We talk about applying SDDS files to a relational database implementation
In other words, we talk about a relational database using SDDS files instead of more traditional ones
We examine the processing of typical SQL queries
– Using the operations over SDDS files
» Key-based operations & scans
Relational Queries over SDDSs
For most queries, an LH*-based implementation appears easily feasible
The analysis applies to some extent to other potential applications
– e.g., Data Mining
Relational Queries over SDDSs
All the theory of parallel database processing applies to our analysis
– E.g., the classical work by DeWitt's team (U. Wisconsin–Madison)
With a distinctive advantage
– The size of tables matters less
» The partitioned tables were basically static
» See the specs of SQL Server, DB2, Oracle…
» Now they are scalable
– This especially concerns the size of the output table
» Often hard to predict
How Useful Is This Material ?
http://research.microsoft.com/en-us/projects/clientcloud/default.aspx
The apps, demos…
How Useful Is This Material ?
The Computational Science and Mathematics division of the Pacific Northwest National Laboratory is looking for a senior researcher in Scientific Data Management to develop and pursue new opportunities. Our research is aimed at creating new, state-of-the-art computational capabilities using extreme-scale simulation and peta-scale data analytics that enable scientific breakthroughs. We are looking for someone with a demonstrated ability to provide scientific leadership in this challenging discipline and to work closely with the existing staff, including the SDM technical group manager.
Relational Queries over SDDSs
We illustrate the point using the well-known Supplier-Part (S-P) database
S (S#, Sname, Status, City)
P (P#, Pname, Color, Weight, City)
SP (S#, P#, Qty)
See my database classes on SQL
– At the Web site
Relational Database Queries over LH* tables
Single primary-key-based search
Select * From S Where S# = S1
Translates to a simple key-based LH* search
– Assuming naturally that S# becomes the primary key of the LH* file with the tuples of S
(S1 : Smith, 100, London) (S2 : …)
Relational Database Queries over LH* tables
Select * From S Where S# = S1 OR S# = S2
– A series of primary-key-based searches
Non-key-based restriction
– … Where City = Paris or City = London
– Deterministic scan with local restrictions
» Results are perhaps inserted into a temporary LH* file
Relational Operations over LH* tables
Key-based Insert
INSERT INTO P VALUES ('P8', 'nut', 'pink', 15, 'Nice') ;
– Process as usual for LH*
– Or use SD-SQL Server
» If no access “under the cover” of the DBMS
Key-based Update, Delete
– Idem
Relational Operations over LH* tables
Non-key projection
Select S.Sname, S.City from S
– Deterministic scan with local projections
» Results are perhaps inserted into a temporary LH* file (primary key ?)
Non-key projection and restriction
Select S.Sname, S.City from S Where City = ‘Paris’ or City = ‘London’
– Idem
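A minimal sketch of such a scan with local restriction and projection; the bucket contents, predicate and projection below are invented for illustration, and the sequential loop stands in for what the LH* servers do in parallel:

```python
# Sketch (not the LH* implementation): each server bucket applies the
# restriction and projection locally; the client merely collects the
# partial results (or inserts them into a temporary LH* file).
def scan_restrict_project(buckets, predicate, projection):
    results = []
    for bucket in buckets:            # in LH*, one scan per server, in parallel
        results.extend(projection(t) for t in bucket if predicate(t))
    return results

buckets = [
    [("S1", "Smith", 100, "London"), ("S2", "Jones", 200, "Paris")],
    [("S3", "Blake", 300, "Paris"), ("S4", "Clark", 200, "Athens")],
]
rows = scan_restrict_project(
    buckets,
    predicate=lambda t: t[3] in ("Paris", "London"),   # the Where clause
    projection=lambda t: (t[1], t[3]),                 # Sname, City
)
```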
Relational Operations over LH* tables
Non-key Distinct
Select Distinct City from P
– Scan with local or upward-propagated aggregation towards bucket 0
– Process Distinct locally if you do not have any son
– Otherwise wait for input from all your sons
– Process Distinct together
– Send the result to the father if any, or to the client, or to the output table
– Alternative algorithm ?
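The upward propagation above can be sketched as a recursive walk over a bucket tree; giving bucket i the sons 2i+1 and 2i+2 is purely an illustrative assumption, not the LH* bucket numbering:

```python
# Sketch of upward-propagated Distinct: each bucket dedups locally,
# merges the results of its sons, and passes the union to its father;
# bucket 0 ends up holding the global Distinct.
def distinct_at(bucket_id, buckets):
    values = set(buckets[bucket_id])              # local Distinct
    for son in (2 * bucket_id + 1, 2 * bucket_id + 2):
        if son < len(buckets):                    # wait for input from all sons
            values |= distinct_at(son, buckets)
    return values                                 # sent to the father, or to the client

buckets = [["Paris", "London"], ["Paris", "Rome"], ["London", "Oslo"]]
cities = distinct_at(0, buckets)
```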
Relational Operations over LH* tables
Non-key Count or Sum
Select Count(S#), Sum(Qty) from SP
– Scan with local or upward-propagated aggregation
– Eventual post-processing on the client
Non-key Avg, Var, StDev…
– Your proposal here
Relational Operations over LH* tables
Non-key Group By, Histograms…
Select Sum(Qty) from SP Group By S#
– Scan with a local Group By at each server
– Upward propagation
– Or post-processing at the client
– Or the result goes directly into the output table
» Of a priori unknown size
» Which, with SDDS technology, does not need to be estimated upfront
Relational Operations over LH* tables
Equijoin
Select * From S, SP where S.S# = SP.S#
– Scans at S and at SP send all tuples to a temp LH* table T1 with S# as the key
– A scan at T1 merges all couples (r1, r2) of records with the same S#, where r1 comes from S and r2 comes from SP
– The result goes to the client or to a temp table T2
All of the above is an SD generalization of Grace hash join
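A sketch of this SD hash-join idea on invented S and SP tuples; the dict stands in for the temporary LH* table T1, keyed directly on S# for brevity (a real T1 would hash S# to buckets spread over servers):

```python
from collections import defaultdict

# Sketch of the scheme above: both scans partition their tuples on S#
# into T1; each T1 entry then merges the (r1, r2) couples sharing the key.
def sd_equijoin(S, SP):
    T1 = defaultdict(lambda: ([], []))     # stands in for the temp LH* table T1
    for r1 in S:
        T1[r1[0]][0].append(r1)            # S# is field 0 of S
    for r2 in SP:
        T1[r2[0]][1].append(r2)            # S# is field 0 of SP
    out = []                               # result: client or temp table T2
    for key, (left, right) in T1.items():  # local merge at each T1 bucket
        out += [r1 + r2 for r1 in left for r2 in right]
    return out

S = [("S1", "Smith", "London"), ("S2", "Jones", "Paris")]
SP = [("S1", "P1", 300), ("S1", "P2", 200), ("S3", "P1", 400)]
joined = sd_equijoin(S, SP)
```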
Relational Operations over LH* tables
Equijoin & Projections & Restrictions & Group By & Aggregate &…
– Combine the above
– Into a nice SD execution plan
Your Thesis here
Relational Operations over LH* tables
Equijoin & θ-join
Select * From S as S1, S where S.City = S1.City and S.S# < S1.S#
– Processing of the equijoin into T1
– Scan for the parallel restriction over T1, with the final result into the client or (rather) T2
Order By and Top K
– Use RP* as the output table
Relational Operations over LH* tables
Having
Select Sum(Qty) from SP Group By S# Having Sum(Qty) > 100
Here we have to process the result of the aggregation
One approach : post-processing, on the client or in a temp table, of the Group By results
Relational Operations over LH* tables
Subqueries
– In Where, Select or From clauses
– With Exists or Not Exists or Aggregates…
– Non-correlated or correlated
Non-correlated subquery
Select S# from S where status = (Select Max(X.status) from S as X)
– Scan for the subquery, then scan for the superquery
Relational Operations over LH* tables
Correlated Subqueries
Select S# from S where not exists (Select * from SP where S.S# = SP.S#)
Your Proposal here
Relational Operations over LH* tables
Like (…)
– Scan with pattern matching or a regular expression
– Result delivered to the client or to the output table
Your Thesis here
Relational Operations over LH* tables
Cartesian Product & Projection & Restriction…
Select Status, Qty From S, SP Where City = “Paris”
– Scan for local restrictions and projections, with the result for S into T1 and for SP into T2
– Scan T1, delivering every tuple towards every bucket of T3
» Details not that simple, since some flow control is necessary
– Deliver the result of the tuple merge over every couple to T4
Relational Operations over LH* tables
New or Non-standard Aggregate Functions
– Covariance
– Correlation
– Moving Average
– Cube
– Rollup
– -Cube
– Skyline
– … (see my class on advanced SQL)
Your Thesis here
Relational Operations over LH* tables
Indexes
Create Index SX on S (Sname);
Creates, e.g., an LH* file with records
(Sname, (S#1, S#2, …))
Where each S#i is the key of a tuple with that Sname
Notice that an SDDS index is not affected by location changes due to splits
– A potentially huge advantage
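A toy sketch of such an index file, keyed on Sname with the list of matching S# values as the non-key part; a plain dict stands in for the LH* file, and the sample tuples are invented:

```python
from collections import defaultdict

# Sketch of the index records above: one record per Sname, listing the
# primary keys S#i of the matching S tuples.
def build_index(tuples, key_pos, attr_pos):
    index = defaultdict(list)
    for t in tuples:
        index[t[attr_pos]].append(t[key_pos])
    return dict(index)

S = [("S1", "Smith"), ("S2", "Jones"), ("S3", "Smith")]
SX = build_index(S, key_pos=0, attr_pos=1)   # Create Index SX on S (Sname)
```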
Relational Operations over LH* tables
For an ordered index use
– an RP* scheme
– or Baton
– …
For a k-d index use
– k-RP*
– or SD-Rtree
– …
High-availability SDDS schemes
Data remain available despite :
– any single-server failure & most two-server failures
– or any failure of up to k servers
» k-availability
– and some catastrophic failures
k scales with the file size
– To offset the reliability decline which would otherwise occur
High-availability SDDS schemes
Three principles for high-availability SDDS schemes are currently known
– mirroring (LH*m)
– striping (LH*s)
– grouping (LH*g, LH*sa, LH*rs)
They realize different performance trade-offs
High-availability SDDS schemes
Mirroring
– Allows an instant switch to the backup copy
– Costs the most in storage overhead
» k * 100 %
– Hardly applicable for more than 2 copies per site.
High-availability SDDS schemes
Striping
– Storage overhead of O(k / m)
– m times higher messaging cost for a record search
– m – number of stripes per record
– k – number of parity stripes
– At least m + k times higher record search costs while a segment is unavailable
» Or while a bucket is being recovered
High-availability SDDS schemes
Grouping
– Storage overhead of O(k / m)
– m – number of data records in a record (bucket) group
– k – number of parity records per group
– No messaging overhead for a record search
– At least m + k times higher record search costs while a segment is unavailable
High-availability SDDS schemes
Grouping appears most practical
– Good question
» How to do it in practice ?
– One reply : LH*RS
– A general industrial concept : RAIN
» Redundant Array of Independent Nodes
http://continuousdataprotection.blogspot.com/2006/04/larchitecture-rain-adopte-pour-la.html
LH*RS : Record Groups
LH*RS records
– LH* data records & parity records
Records with the same rank r in the bucket group form a record group
Each record group gets n parity records
– Computed using Reed-Solomon erasure-correcting codes
» Additions and multiplications in Galois Fields
» See the Sigmod 2000 paper on the Web site for details
r is the common key of these records
Each group supports unavailability of up to n of its members
LH*RS Record Groups
[Figure : (a) a data record (key c, non-key data) and a parity record (rank r, parity bits) ; (b) the data records and parity records of a bucket group]
LH*RS Scalable availability
Create 1 parity bucket per group until M = 2^i1 buckets
Then, at each split,
– add a 2nd parity bucket to each existing group
– create 2 parity buckets for new groups
until M = 2^i2 buckets, etc.
LH*RS Scalable availability
LH*RS : Galois Fields
A finite set with an algebraic structure
– We only deal with GF(N) where N = 2^f ; f = 4, 8, 16
» Elements (symbols) are 4-bit nibbles, bytes and 2-byte words
Contains elements 0 and 1
Addition, with the usual properties
– In general implemented as XOR : a + b = a XOR b
Multiplication and division
– Usually implemented as log / antilog calculus
» With respect to some primitive element α
» Using log / antilog tables : a * b = antilog (log a + log b) mod (N – 1)
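This log / antilog machinery can be sketched for GF(16); choosing the primitive polynomial x^4 + x + 1 is an assumption of this sketch (it happens to reproduce the log table of the GF(16) slide further below):

```python
# Sketch of GF(2^4) multiplication via log / antilog tables, assuming the
# primitive polynomial x^4 + x + 1.
N = 16
antilog = [0] * (N - 1)          # antilog[i] = alpha^i
log = [0] * N                    # log[0] is undefined; guarded below
x = 1
for i in range(N - 1):
    antilog[i] = x
    log[x] = i
    x <<= 1                      # multiply by the primitive element alpha
    if x & N:                    # reduce modulo x^4 + x + 1 (0b10011)
        x ^= 0b10011

def gf_add(a, b):                # addition (and subtraction) is XOR
    return a ^ b

def gf_mul(a, b):                # antilog(log a + log b) mod (N - 1)
    if a == 0 or b == 0:
        return 0
    return antilog[(log[a] + log[b]) % (N - 1)]
```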
Example: GF(4)
* 00 10 01 11 log antilog
00 00 00 00 00 00 - - 00
10 00 10 01 11 10 0 0 10
01 00 01 11 10 01 1 1 01
11 00 11 10 01 11 2 2 11
Direct Multiplication Logarithm Antilogarithm
Tables for GF(4).
Addition : XORMultiplication :
direct table Primitive element based log / antilog tables
Log tables are more efficient for a large GF
= 01
10 = 1
00 = 0
0 = 10 1 = 01 ; 2 = 11 ; 3 = 10
Example : GF(16)
Addition : XOR
A direct multiplication table would have 256 elements
Elements & logs (primitive element α = 2) :

 string  int  hex  log
 0000     0    0    –
 0001     1    1    0
 0010     2    2    1
 0011     3    3    4
 0100     4    4    2
 0101     5    5    8
 0110     6    6    5
 0111     7    7   10
 1000     8    8    3
 1001     9    9   14
 1010    10    A    9
 1011    11    B    7
 1100    12    C    6
 1101    13    D   13
 1110    14    E   11
 1111    15    F   12
LH*RS Parity Management
Create the m x n generator matrix G
– using elementary transformations of an extended Vandermonde matrix of GF elements
– m is the record group size
– n = 2^l is the max segment size (data and parity records)
– G = [I | P]
– I denotes the identity matrix
The m symbols with the same offset in the records of a group become the (horizontal) information vector U
The matrix multiplication UG provides the (n – m) parity symbols, i.e., the codeword vector C
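The UG product can be sketched with the [I | P] structure above; to stay short this toy uses GF(2) (bitwise AND/XOR) rather than the GF(2^16) symbols of LH*RS, and the P sub-matrix is invented — the shapes and the [I | P] layout are the point, not the field:

```python
# Sketch of C = U * G with G = [I | P] over GF(2): the identity part
# copies U unchanged, so only the parity part needs computing.
def encode(U, P):
    # U: m information symbols; P: m x (n - m) parity sub-matrix
    m, k = len(P), len(P[0])
    parity = [0] * k
    for j in range(k):
        for i in range(m):
            parity[j] ^= U[i] & P[i][j]   # GF(2) multiply-accumulate
    return U + parity                      # codeword C = [U | parity]

P = [[1, 1], [1, 0], [0, 1]]               # toy 3 x 2 parity sub-matrix
C = encode([1, 0, 1], P)
```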
LH*RS Parity Management
Vandermonde matrix V of GF elements
– For info see http://en.wikipedia.org/wiki/Vandermonde_matrix
Generator matrix G
– See http://en.wikipedia.org/wiki/Generator_matrix
LH*RS Parity Management
There are very many different G’s one can derive from any given V
– Leading to different linear codes
Central property of any V :
– Preserved by any G
Every square sub-matrix H is invertible
LH*RS Parity Encoding
This means that for any G, any sub-matrix H of G, any information vector U and any sub-vector D of the codeword C such that D = U * H, we have :
D * H^-1 = U * H * H^-1 = U * I = U
LH*RS Parity Management
If thus :
– for at least k parity columns in P,
– for any U and C, some vector V of at most k data values in U gets erased,
then we can recover V as follows
LH*RS Parity Management
1. We calculate C using P during the encoding phase
» We do not need the full G for that, since we have I at the left.
2. We do it any time data are inserted
» Or updated / deleted
LH*RS Parity Management
During the recovery phase we then :
1. Choose H
2. Invert it to H^-1
3. Form D
– from the at least m – k remaining data values (symbols)
» We find them in the data buckets
– from at most k values in C
» We find these in the parity buckets
4. Calculate U as above
5. Restore the erased values V from U
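For intuition: when k = 1 and the parity column is all 1's, this whole D * H^-1 machinery collapses to XOR parity, where the erased value is simply the XOR of every surviving value. A sketch of that special case only (sample bytes invented), not of the full Reed-Solomon recovery:

```python
# XOR parity: the k = 1, all-ones-column special case of the RS scheme.
def xor_parity(group):
    p = 0
    for v in group:
        p ^= v                   # parity symbol of the group
    return p

def recover_one(surviving, parity):
    missing = parity
    for v in surviving:
        missing ^= v             # XOR out every surviving value
    return missing

group = [0x45, 0x6E, 0x20, 0x41]         # one offset of a record group
p = xor_parity(group)
restored = recover_one([0x45, 0x20, 0x41], p)   # second value erased
```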
LH*RS : GF(16) Parity Encoding
Records :
“En arche ...”, “Dans le ...”, “Am Anfang ...”, “In the beginning”
as GF(16) symbols (hex) : 45 6E 20 41 72 …, 41 6D 20 41 6E …, 44 61 6E 73 20 …, 49 6E 20 70 74 …
[Figures : the four data records form successive information vectors U ; multiplying each U by the generator matrix G = [I | P] yields, offset by offset, the symbols of the group's parity records]
LH*RS Record/Bucket Recovery
Performed when at most k = n – m buckets are unavailable in a segment :
– Choose m available buckets of the segment
– Form the sub-matrix H of G from the corresponding columns
– Invert this matrix into matrix H^-1
– Multiply the horizontal vector D of available symbols with the same offset by H^-1
– The result U contains the recovered data, i.e., the erased values forming V.
Example
Data buckets :
“En arche ...”, “Dans le ...”, “Am Anfang ...”, “In the beginning”
45 6E 20 41 72 …, 41 6D 20 41 6E …, 44 61 6E 73 20 …, 49 6E 20 70 74 …
Available buckets :
“In the beginning” : 49 6E 20 70 74 … ; parity records : 4F 63 6E E4 …, 48 6E DC EE …, 4A 66 49 DD …
[Figures : the sub-matrix H is formed from the G columns of the available buckets and inverted, e.g. by Gauss inversion, into H^-1 ; multiplying the vector D of available symbols by H^-1 restores the erased records / buckets “En arche ...”, “Dans le ...”, “Am Anfang ...”]
LH*RS Parity Management
Easy exercise :
1. How do we recover erased parity values ?
» Thus in C, but not in V
» Obviously, this can happen as well.
2. We can also have data & parity values erased together
» What do we do then ?
LH*RS : Actual Parity Management
An insert of a data record with rank r creates or, usually, updates the parity records r
An update of a data record with rank r updates the parity records r
A split recreates parity records
– Data records usually change their rank after the split
LH*RS : Actual Parity Encoding
Performed at every insert, delete and update of a record
– One data record at a time
Each updated data bucket produces a Δ-record that is sent to each parity bucket
– The Δ-record is the difference between the old and new value of the manipulated data record
» For an insert, the old record is dummy
» For a delete, the new record is dummy
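The Δ-record idea can be sketched with XOR, which is the fold-in used by a parity bucket whose G column is all 1's; other parity buckets would first multiply the Δ by their own GF coefficient. The record values below are invented:

```python
# Sketch of delta-record parity maintenance: the data bucket ships only
# the XOR difference between old and new record; the parity bucket folds
# it into its parity record without seeing the other group members.
def delta(old, new):                 # byte-wise XOR difference
    return bytes(a ^ b for a, b in zip(old, new))

def apply_delta(parity, d):          # done locally at the parity bucket
    return bytes(p ^ x for p, x in zip(parity, d))

old = bytes([0x45, 0x6E, 0x20])      # previous record value (invented)
new = bytes([0x44, 0x61, 0x6E])      # updated value (for insert, old is dummy zeros)
parity = bytes(3)                    # toy XOR parity record, initially zero
parity2 = apply_delta(parity, delta(old, new))
```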
LH*RS : Actual Parity Encoding
The i-th parity bucket of a group contains only the i-th column of G
– Not the entire G, unlike one could expect
The calculus of the i-th parity record happens only at the i-th parity bucket
– No messages to other data or parity buckets
LH*RS : Actual RS code
Over GF(2^16)
– Encoding / decoding typically faster than for our earlier GF(2^8)
» Experimental analysis by Ph.D. student Rim Moussa
– Possibility of very large record groups with a very high availability level k
– Still a reasonable size of the log / antilog multiplication table
» Our (well-known) GF multiplication method
Calculus using the log parity matrix
– About 8 % faster than with the traditional parity matrix
LH*RS : Actual RS code
1st parity record calculus uses only XORing
– The 1st column of the parity matrix contains 1’s only
– Like, e.g., RAID systems
– Unlike our earlier code published in the Sigmod 2000 paper
1st data record parity calculus uses only XORing
– The 1st line of the parity matrix contains 1’s only
It is at present, for our purpose, the best erasure-correcting code around
LH*RS : Actual RS code
Logarithmic Parity Matrix
0000 0000 0000 …
0000 5ab5 e267 …
0000 e267 0dce …
0000 784d 2b66 …
…
Parity Matrix
0001 0001 0001 …
0001 eb9b 2284 …
0001 2284 9e74 …
0001 9e44 d7f1 …
…
All things considered, we believe our code to be the most suitable erasure-correcting code for high-availability SDDS files at present
LH*RS : Actual RS code
Systematic : data values are stored as-is
Linear :
– We can use Δ-records for updates
» No need to access other record group members
– Adding a parity record to a group does not require access to existing parity records
MDS (Maximal Distance Separable) :
– Minimal possible overhead for all practical record and record group sizes
» For records of at least one symbol in the non-key field
– We use 2B-long symbols of GF(2^16)
More on codes :
– http://fr.wikipedia.org/wiki/Code_parfait
Performance
Data bucket load factor : 70 %
Parity overhead : k / m
– m is a file parameter, m = 4, 8, 16… ; a larger m increases the recovery cost
Key search time
• Individual : 0.2419 ms
• Bulk : 0.0563 ms
File creation rate
• 0.33 MB/s for k = 0
• 0.25 MB/s for k = 1
• 0.23 MB/s for k = 2
Record insert time (100 B)
• Individual : 0.29 ms for k = 0, 0.33 ms for k = 1, 0.36 ms for k = 2
• Bulk : 0.04 ms
Record recovery time
• About 1.3 ms
Bucket recovery rate (m = 4)
• 5.89 MB/s from 1-unavailability
• 7.43 MB/s from 2-unavailability
• 8.21 MB/s from 3-unavailability
(Wintel P4 1.8 GHz, 1 Gb/s Ethernet)
Parity Overhead Performance
About the smallest possible
– A consequence of the MDS property of RS codes
Storage overhead (in additional buckets)
– Typically k / m
Insert, update, delete overhead
– Typically k messages
Record recovery cost
– Typically 1 + 2m messages
Bucket recovery cost
– Typically 0.7b (m + x – 1)
Key search and parallel scan performance are unaffected
– LH* performance
Performance
Reliability
• The probability P that all the data are available
• The inverse of the probability of a catastrophic k’-bucket failure ; k’ > k
• P increases for
• higher reliability p of a single node
• greater k, at the expense of higher overhead
• But it must decrease, regardless of any fixed k, when the file scales
• k should scale with the file
• How ??
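A hedged back-of-envelope for such a reliability P, under assumptions that are this sketch's own, not the slides': p is taken as the failure probability of one node, failures are independent, a group of m data + k parity buckets survives up to k failures, and the file consists of g independent groups:

```python
from math import comb

def group_survival(m, k, p):
    # probability that at most k of the group's m + k buckets fail
    n = m + k
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def file_availability(m, k, p, groups):
    # the file is available iff every (assumed independent) group survives;
    # for fixed k this shrinks as the number of groups scales up
    return group_survival(m, k, p) ** groups

P1 = file_availability(m=4, k=1, p=0.1, groups=10)
P2 = file_availability(m=4, k=2, p=0.1, groups=10)   # greater k raises P
```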
Uncontrolled availability
[Figure : availability P versus file size, plotted for k = 4 with p = 0.1 and p = 0.15, and for m = 4 with p = 0.1 and p = 0.15]
RP* schemes
Produce 1-d ordered files
– for range search
Use m-ary trees
– like a B-tree
Efficiently support range queries
– LH* also supports range queries
» but less efficiently
Consist of a family of three schemes
– RP*N, RP*C and RP*S
Current PDBMS technology (Pioneer : NonStop SQL)
Static Range Partitioning
– Done manually by the DBA
– Requires good skills
– Not scalable
RP* schemes
Fig. 1  RP* design trade-offs
– RP*N : no index, all multicast
– RP*C : + client index, limited multicast
– RP*S : + servers index, optional multicast
RP* file expansion
[Figure : an RP* file of words (“the”, “of”, “and”, “to”, “a”, “in”, “that”, “is”, “it”, “for”, …) expanding over buckets 0..3 ; each split moves the upper half of a bucket's key range into a new bucket]
RP* Range Query
Searches for all records in the query range Q
– Q = [c1, c2] or Q = ]c1, c2] etc.
The client sends Q
– either by multicast to all the buckets
» RP*N especially
– or by unicast to the relevant buckets in its image
» those may forward Q to children unknown to the client
RP* Range Query Termination
Time-out
Deterministic
– Each server addressed by Q sends back at least its current range
– The client performs the union U of all results
– It terminates when U covers Q
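The deterministic termination test can be sketched as follows (a minimal illustration, not the SDDS-2000 code): the client merges the ranges returned so far and stops once their union covers Q.

```python
def covers(q, ranges):
    """True if the union of the closed intervals in `ranges` covers q = (c1, c2)."""
    c1, c2 = q
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1]:
            # Overlapping or abutting interval: extend the last merged one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return any(lo <= c1 and hi >= c2 for lo, hi in merged)

Q = (10, 50)
print(covers(Q, [(0, 20), (20, 40)]))            # False: [40, 50] still missing
print(covers(Q, [(0, 20), (20, 40), (35, 60)]))  # True: the union covers Q
```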
![Page 81: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/81.jpg)
81
RP*C client image
[Figure: evolution of the RP*C client image after searches for keys it, that, in – successive IAMs refine the image from '0 for * in 2 of *' (T1) to '0 for * in 2 of 1' (T2) and finally '0 for 3 in 2 of 1' (T3)]
![Page 82: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/82.jpg)
82
RP*S
[Figure: an RP*S file with (a) a 2-level kernel and (b) a 3-level kernel – buckets 0…4 are addressed through a distributed index (root and index pages); IAM = traversed pages]
![Page 83: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/83.jpg)
86
b      RP*C    RP*S   LH*
50     2867    22.9   8.9
100    1438    11.4   8.2
250     543     5.9   6.8
500     258     3.1   6.4
1000    127     1.5   5.7
2000     63     1.0   5.2
Number of IAMs until image convergence
![Page 84: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/84.jpg)
87
RP* Bucket Structure
Header
– Bucket range
– Address of the index root
– Bucket size…
Index
– Kind of B+-tree
– Additional links » for efficient index splitting during RP* bucket splits
Data
– Linked leaves with the data
[Figure: bucket layout – header, B+-tree index (index root, leaf headers), data as a linked list of index leaves holding the records]
![Page 85: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/85.jpg)
88
SDDS-2004 Menu Screen
![Page 86: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/86.jpg)
89
SDDS-2000: Server Architecture
[Figure: server architecture – a Listen Thread receives client requests over the network (TCP/IP, UDP) into a requests queue; work threads W.Thread 1…N analyze and execute the RP* functions (Insert, Search, Update, Delete, Forward, Split) on main-memory RP* buckets through the BAT; a SendAck thread serves the ack queue for flow control; responses and results go back to the clients]
Several buckets of different SDDS files
Multithread architecture
Synchronization queues
Listen Thread for incoming requests
SendAck Thread for flow control
Work Threads for request processing, response sendout, request forwarding
UDP for shorter messages (< 64K), TCP/IP for longer data exchanges
![Page 87: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/87.jpg)
90
SDDS-2000: Client Architecture
[Figure: client architecture – the SDDS applications interface passes requests from Applications 1…N to the client manager; a Send Module sends requests to the servers and a Receive Module receives and analyzes their responses over the network (TCP/IP, UDP), using a requests journal (Id_Req, Id_App, …), client images (key → server IP address) and flow control]
Two modules: Send Module and Receive Module
Multithread architecture: SendRequest, ReceiveRequest, AnalyzeResponse, GetRequest, ReturnResponse
Synchronization queues
Client images
Flow control
![Page 88: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/88.jpg)
91
Performance AnalysisExperimental Environment
Six Pentium III 700 MHz machines under Windows 2000
– 128 MB of RAM
– 100 Mb/s Ethernet
Messages
– 180 bytes: 80 for the header, 100 for the record
– Keys are random integers within some interval
– Flow control: sliding window of 10 messages
Index
– Capacity of an internal node: 80 index elements
– Capacity of a leaf: 100 records
![Page 89: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/89.jpg)
92
Performance AnalysisFile Creation
Bucket capacity: 50,000 records
150,000 random inserts by a single client, with flow control (FC) or without
[Figure: file creation time vs. number of records (0–150,000), RP*C and RP*N with and without FC]
[Figure: average insert time vs. number of records, same configurations]
![Page 90: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/90.jpg)
93
Discussion
Creation time is almost linearly scalable
Flow control is quite expensive – losses without it were negligible
Both schemes perform almost equally well – RP*C slightly better » as one could expect
Insert time 30 times faster than for a disk file
Insert time appears bound by the client speed
![Page 91: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/91.jpg)
94
Performance AnalysisFile Creation
File created by 120,000 random inserts by 2 clients, without flow control
[Figure: file creation by two clients – total time and time per insert (RP*C, RP*N)]
[Figure: comparative file creation time by one or two clients]
![Page 92: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/92.jpg)
95
Discussion
Performance improves
Insert times appear bound by the server speed
More clients would not improve the performance of a server
![Page 93: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/93.jpg)
96
Performance AnalysisSplit Time
[Figure: split time and time per record vs. bucket size (10,000–100,000)]
b        Time (ms)  Time/Record (ms)
10000    1372       0.137
20000    1763       0.088
30000    1952       0.065
40000    2294       0.057
50000    2594       0.052
60000    2824       0.047
70000    3165       0.045
80000    3465       0.043
90000    3595       0.040
100000   3666       0.037
Split times for different bucket capacities
![Page 94: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/94.jpg)
97
Discussion
About linear scalability as a function of bucket size
Larger buckets are more efficient
Splitting is very efficient – reaching as little as 40 µs per record
![Page 95: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/95.jpg)
98
Performance AnalysisInsert without splits
Up to 100,000 inserts into k buckets; k = 1…5
Either with an empty client image adjusted by IAMs, or with a correct image

                RP*C                                            RP*N
        Without FC      With FC                         With FC         Without FC
                        Empty image     Correct image
k       Ttl     /Ins.   Ttl     /Ins.   Ttl     /Ins.   Ttl     /Ins.   Ttl     /Ins.
1       35511   0.355   27480   0.275   27480   0.275   35872   0.359   27540   0.275
2       27767   0.258   14440   0.144   13652   0.137   28350   0.284   18357   0.184
3       23514   0.235   11176   0.112   10632   0.106   25426   0.254   15312   0.153
4       22332   0.223    9213   0.092    9048   0.090   23745   0.237    9824   0.098
5       22101   0.221    9224   0.092    8902   0.089   22911   0.229    9532   0.095

Insert performance (total time and time per insert, ms)
![Page 96: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/96.jpg)
99
Performance AnalysisInsert without splits
[Figure: total insert time vs. number of servers (1–5), RP*C and RP*N with and without FC]
[Figure: per-record insert time, same configurations]
• 100,000 inserts into up to k buckets; k = 1…5
• Client image initially empty
![Page 97: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/97.jpg)
100
Discussion
Cost of IAMs is negligible
Insert throughput 110 times faster than for a disk file – 90 µs per insert
RP*N appears surprisingly efficient for more buckets, closing on RP*C – no explanation at present
![Page 98: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/98.jpg)
101
Performance AnalysisKey Search
A single client sends 100,000 successful random search requests
Flow control here means that the client sends at most 10 requests without a reply
        RP*C                            RP*N
        With FC         Without FC      With FC         Without FC
k       Ttl     Avg     Ttl     Avg     Ttl     Avg     Ttl     Avg
1       34019   0.340   32086   0.321   34620   0.346   32466   0.325
2       25767   0.258   17686   0.177   27550   0.276   20850   0.209
3       21431   0.214   16002   0.160   23594   0.236   17105   0.171
4       20389   0.204   15312   0.153   20720   0.207   15432   0.154
5       19987   0.200   14256   0.143   20542   0.205   14521   0.145
Search time (ms)
![Page 99: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/99.jpg)
102
Performance AnalysisKey Search
[Figure: total search time vs. number of servers (1–5), RP*C and RP*N with and without FC]
[Figure: search time per record, same configurations]
![Page 100: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/100.jpg)
103
Discussion
Single search time about 30 times faster than for a disk file – 350 µs per search
Search throughput more than 65 times faster than that of a disk file – 145 µs per search
RP*N appears again surprisingly efficient with respect to RP*C for more buckets
![Page 101: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/101.jpg)
104
Performance AnalysisRange Query
Deterministic termination
Parallel scan of the entire file, with all 100,000 records sent to the client
[Figure: range query total time vs. number of servers (1–5)]
[Figure: range query time per record vs. number of servers]
![Page 102: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/102.jpg)
105
Discussion
Range search also appears very efficient – reaching 100 µs per record delivered
More servers should further improve the efficiency – the curves do not become flat yet
![Page 103: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/103.jpg)
106
Scalability Analysis
The largest file at the current configuration:
- 64 MB buckets with b = 640 K
- 448,000 records per bucket, loaded at 70 % on average
- 2,240,000 records in total
- 320 MB of distributed RAM (5 servers)
- 264 s creation time by a single RP*N client
- 257 s creation time by a single RP*C client
- A record could reach 300 B
- The servers' RAMs were recently upgraded to 256 MB
![Page 104: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/104.jpg)
107
Scalability Analysis
If the example file with b = 50,000 had scaled to 10,000,000 records:
- It would span 286 buckets (servers)
- There are many more machines at Paris 9
- Creation time by random inserts would be 1235 s for RP*N and 1205 s for RP*C
- 285 splits would last 285 s in total
- Inserts alone would last 950 s for RP*N and 920 s for RP*C
![Page 105: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/105.jpg)
108
Actual results for a big file
Bucket capacity: 751K records, 196 MB
Number of inserts: 3M
Flow control (FC) is necessary to limit the input queue at each server
File creation by a single client – file size: 3,000,000 records
[Figure: file creation time vs. number of records (0–3,500,000), RP*C and RP*N with FC]
![Page 106: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/106.jpg)
109
Actual results for a big file
Bucket capacity: 751K records, 196 MB
Number of inserts: 3M
GA: global average; MA: moving average
Insert time by a single client – file size: 3,000,000 records
[Figure: insert time vs. number of records – global and moving averages (GA, MA), RP*C and RP*N with FC]
![Page 107: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/107.jpg)
110
Related Works

        RP*N Impl.        RP*C Impl.        LH* Impl.   RP*N Thr.
        With FC  No FC    With FC  No FC
tc      51000    40250    69209    47798    67838       45032
ts      0.350    0.186    0.205    0.145    0.200       0.143
ti,c    0.340    0.268    0.461    0.319    0.452       0.279
ti      0.330    0.161    0.229    0.095    0.221       0.086
tm      0.16     0.161    0.037    0.037    0.037       0.037
tr      –        0.005    0.010    0.010    0.010       0.010

tc: time to create the file
ts: time per key search (throughput)
ti: time per random insert (throughput)
ti,c: time per random insert (throughput) during the file creation
tm: time per record for splitting
tr: time per record for a range query
Comparative Analysis
![Page 108: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/108.jpg)
111
Discussion
The 1994 theoretical performance predictions for RP* were quite accurate
RP* schemes under SDDS-2000 appear globally more efficient than LH* – no explanation at present
![Page 109: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/109.jpg)
112
Conclusion
SDDS-2000: a prototype SDDS manager for a Windows multicomputer
- Various SDDSs
- Several variants of RP*
Performance of the RP* schemes appears in line with the expectations
- Access times in the range of a fraction of a millisecond
- About 30 to 100 times faster than disk-file access performance
- About ideal (linear) scalability
The results also prove the overall efficiency of the SDDS-2000 architecture
![Page 110: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/110.jpg)
113
2011 Cloud Infrastructures in RP* Footsteps
RP* were the first schemes for scalable distributed (SD) range partitioning – back in 1994, to recall.
SDDS-2000, up to SDDS-2007, were the first operational prototypes to create RP clouds, in current terminology.
![Page 111: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/111.jpg)
114
2011 Cloud Infrastructures in RP* Footsteps
Today there are several mature implementations using SD-RP
None cites RP* in its references – contrary to honest scientific practice
Unfortunately, this seems more and more often to be a thing of the past
Especially among the industrial folks
![Page 112: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/112.jpg)
115
2011 Cloud Infrastructures in RP* Footsteps (Examples)
Prominent cloud infrastructures using SD-RP are disk oriented
GFS (2006)
– Private cloud of (Key, Value) type
– Behind Google's BigTable
– Basically quite similar to the RP* schemes & SDDS-2007
– Many more features naturally, including replication
![Page 113: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/113.jpg)
116
2011 Cloud Infrastructures in RP* Footsteps (Examples)
Windows Azure Table (2009)– Public Cloud– Uses (Partition Key, Range Key, value) – Each partition key defines a partition– Azure may move the partitions around to balance the overall load
![Page 114: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/114.jpg)
117
2011 Cloud Infrastructures in RP* Footsteps (Examples)
Windows Azure Table (2009), cont.
– It thus provides splitting in this sense
– High availability uses replication
– Azure Table details are still sketchy
– Explore MS Help
![Page 115: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/115.jpg)
118
2011 Cloud Infrastructures in RP* Footsteps (Examples)
MongoDB
– Quite similar to the RP* schemes
– For private clouds of up to 1,000 nodes at present
– Disk-oriented
– Open source
– Quite popular among developers in the US
– Annual conference (the last one in SF)
![Page 116: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/116.jpg)
119
2011 Cloud Infrastructures in RP* Footsteps (Examples)
Yahoo PNuts
– Private Yahoo cloud
– Provides disk-oriented SD-RP, including over hashed keys » like consistent hashing
– Architecture quite similar to GFS & SDDS-2007
– But naturally with more features with respect to the latter
![Page 117: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/117.jpg)
120
2011 Cloud Infrastructures in RP* Footsteps (Examples)
Some others:
– Facebook Cassandra
» Range partitioning & (Key, Value) model
» With Map/Reduce
– Facebook Hive
» SQL interface in addition
Idem for AsterData
![Page 118: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/118.jpg)
121
2011 Cloud Infrastructures in RP* Footsteps (Examples)
Several systems use consistent hashing – e.g., Amazon
This amounts largely to range partitioning
Except that range queries are then meaningless
![Page 119: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/119.jpg)
122
CERIA SDDS Prototypes
![Page 120: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/120.jpg)
123
Prototypes
LH*RS Storage (VLDB 04)
SDDS-2006 (several papers)
– RP* Range Partitioning
– Disk back-up (algebraic-signature based, ICDE 04)
– Parallel string search (algebraic-signature based, ICDE 04)
– Search over encoded content
» Makes impossible any involuntary discovery of the stored data's actual content
» Several times faster pattern matching than Boyer-Moore
– Available at our Web site
SD-SQL Server (CIDR 07 & BNCOD 06)
– Scalable distributed tables & views
SD-AMOS and AMOS-SDDS
![Page 121: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/121.jpg)
124
SDDS-2006 Menu Screen
![Page 122: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/122.jpg)
125
LH*RS Prototype
Presented at VLDB 2004
Video demo at the CERIA site
Integrates our scalable-availability RS-based parity calculus with LH*
Provides actual performance measures
– Search, insert, update operations
– Recovery times
See the CERIA site for papers
– SIGMOD 2000, WDAS workshops, research reports, VLDB 2004
![Page 123: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/123.jpg)
126
LH*RS Prototype : Menu Screen
![Page 124: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/124.jpg)
127
SD-SQL Server : Server Node
The storage manager is a full-scale SQL Server DBMS
The SD-SQL Server layer at the server node provides scalable distributed table management
– SD range partitioning
It uses SQL Server to perform the splits, through SQL triggers and queries
– But, unlike an SDDS server, SD-SQL Server does not perform query forwarding
– We do not have access to the query execution plan
![Page 125: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/125.jpg)
128
Manages a client view of a scalable table
– Scalable distributed partitioned view
» Distributed partitioned updatable view of SQL Server
Triggers specific image-adjustment SQL queries
– Checking image correctness
» Against the actual number of segments
» Using SD-SQL Server meta-tables (SQL Server tables)
– An incorrect view definition is adjusted
– The application query is then executed
The whole system generalizes the PDBMS technology
– Which offers static partitioning only
SD-SQL Server : Client Node
![Page 126: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/126.jpg)
129
SD-SQL ServerGross Architecture
[Figure: applications access SD-DBS managers forming the SDDS layer on top of the SQL Server layer, with databases D1, D2, …, D999]
![Page 127: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/127.jpg)
130
SD-SQL Server Architecture – Server side
[Figure: segment databases DB_1, DB_2, … with meta-tables SD_C and SD_RP, one per SQL Server node; splits propagate segments across the nodes]
• Each segment has a check constraint on the partitioning attribute
• Check constraints partition the key space
• Each split adjusts the constraint
![Page 128: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/128.jpg)
131
Single Segment Split – Single Tuple Insert
When segment S overflows its capacity b (b+1 tuples), p = INT(b/2) tuples migrate to a new segment S1; S keeps b+1−p tuples.
Check constraints:
C(S)  = { c : c < h = c_(b+1−p) }
C(S1) = { c : c ≥ c_(b+1−p) }

SELECT TOP Pi * INTO Ni.Si FROM S ORDER BY C ASC
SELECT TOP Pi * WITH TIES INTO Ni.S1 FROM S ORDER BY C ASC
![Page 129: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/129.jpg)
132132
Single Segment Split – Bulk Insert
A bulk insert of t tuples overflows S to b+t tuples; the split cuts off portions P1…PN of p = INT(b/2) tuples each into new segments S1…SN:
C(S)  = { c : l ≤ c < h }  becomes  { c : l ≤ c < h′ = c_(b+t−Np) }
C(S1) = { c : c_(b+t−p) ≤ c < h }
…
C(SN) = { c : c_(b+t−Np) ≤ c < c_(b+t−(N−1)p) }
[Figure: single segment split – (a) S overflows to b+t tuples, (b) portions P1…PN are cut off, (c) S retains b tuples and S1…SN each receive p tuples]
![Page 130: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/130.jpg)
133133
Multi-Segment Split – Bulk Insert
[Figure: multi-segment split – a bulk insert overflows several segments S, S1…Sk at once; each overflowing segment splits off p tuples into a new segment S1,n1 … Sk,nk]
![Page 131: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/131.jpg)
134134
Split with SDB Expansion
[Figure: a scalable table T in scalable database SDB DB1 over nodes N1…N4; sd_insert triggers a split; when no node database (NDB) is available, sd_create_node and sd_create_node_database extend DB1 with an NDB at a new node Ni]
![Page 132: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/132.jpg)
135
SD-DBS Architecture – Client View
[Figure: a distributed partitioned union-all view over segments Db_1.Segment1, Db_2.Segment1, …]
• The client view may happen to be outdated
• It need not include all the existing segments
![Page 133: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/133.jpg)
136136
Internally, every image is a specific SQL Server view of the segments – a distributed partitioned union view:

CREATE VIEW T AS
  SELECT * FROM N2.DB1.SD._N1_T
  UNION ALL
  SELECT * FROM N3.DB1.SD._N1_T
  UNION ALL
  SELECT * FROM N4.DB1.SD._N1_T

Updatable
• Through the check constraints
With or without Lazy Schema Validation
Scalable (Distributed) Table
![Page 134: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/134.jpg)
137
SD-SQL Server Gross Architecture : Application Query Processing
[Figure: an application query addressed to an SD-DBS manager in the SDDS layer, executed over the SQL Server layer and databases D1, D2, …, D999]
![Page 135: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/135.jpg)
138
USE SkyServer /* SQL Server command */

Scalable update queries:
sd_insert 'INTO PhotoObj SELECT * FROM Ceria5.Skyserver-S.PhotoObj'

Scalable search queries:
sd_select '* FROM PhotoObj'
sd_select 'TOP 5000 * INTO PhotoObj1 FROM PhotoObj', 500
Scalable Queries Management
![Page 136: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/136.jpg)
139139
Concurrency
SD-SQL Server processes every command as an SQL distributed transaction at the Repeatable Read isolation level
• Tuple-level locks
• Shared locks
• Exclusive 2PL locks
• Much less blocking than the Serializable level
![Page 137: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/137.jpg)
140140
Splits use exclusive locks on segments and on tuples in the RP meta-table, and shared locks on the other meta-tables: the Primary and NDB meta-tables
Scalable queries use basically shared locks on the meta-tables and on any other table involved
All the concurrent executions can be shown serializable
Concurrency
![Page 138: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/138.jpg)
141141
(Q) sd_select 'COUNT (*) FROM PhotoObj'
[Figure: execution time of (Q) (sec) vs. PhotoObj size (39,500–158,000 tuples) – adjustment and checking on a peer vs. on a client, compared with an SQL Server peer and an SQL Server client]
Image Adjustment
![Page 139: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/139.jpg)
142142
(Q): sd_select 'COUNT (*) FROM PhotoObj'
[Figure: execution time of (Q) (sec) vs. number of segments (1–5) – SQL Server distributed, SQL Server centralized, SD-SQL Server, and SD-SQL Server with LSV]
SD-SQL Server / SQL Server
![Page 140: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/140.jpg)
143
• Will SD-SQL Server be useful?
• Here is a non-MS hint from practical folks who knew nothing about it
• A book found in the Redmond Town Square Borders café
![Page 141: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/141.jpg)
144
Algebraic Signatures for SDDS
A small string (signature) characterizes the SDDS record.
Calculate the signature of a bucket from the record signatures.
– Determine from the signature whether a record / bucket has changed
» Bucket backup
» Record updates
» Weak, optimistic concurrency scheme
» Scans
![Page 142: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/142.jpg)
145
Signatures
A small bit string calculated from an object.
Different signatures ⇒ different objects.
Different objects ⇒ different signatures, with high probability.
» A.k.a. hash, checksum.
» Cryptographically secure: computationally impossible to find another object with the same signature.
![Page 143: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/143.jpg)
146
Uses of Signatures
Detect discrepancies among replicas.
Identify objects.
– CRC signatures.
– SHA1, MD5, … (cryptographically secure).
– Karp-Rabin fingerprints.
– Tripwire.
![Page 144: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/144.jpg)
147
Properties of Signatures
Cryptographically secure signatures:
– Cannot produce an object with a given signature.
– Cannot substitute objects without changing the signature.
Algebraic signatures:
– Small changes to the object change the signature for sure
» up to the signature length (in symbols).
– One can calculate the new signature from the old one and the change.
Both:
– Collision probability 2^−f (f = signature length in bits).
![Page 145: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/145.jpg)
148
Definition of Algebraic Signature: Page Signature
Page P = (p0, p1, …, p_(l−1)).
– Component signature:
  sig_α(P) = Σ_{i=0}^{l−1} p_i α^i
– n-symbol page signature:
  sig(P) = (sig_α(P), sig_(α²)(P), …, sig_(α^n)(P)),
  over the vector (α, α², α³, …, α^n)
» α is a primitive element, e.g., α = 2.
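A toy rendition of the component signature sig_α(P) = Σ p_i α^i may help. The paper computes in a Galois field GF(2^f); for brevity this sketch uses arithmetic modulo a prime instead – same algebraic shape, different field – so it is an illustration of the formula, not the actual scheme.

```python
Q = (1 << 31) - 1   # a Mersenne prime standing in for the GF(2^f) arithmetic
ALPHA = 2           # primitive element, as in the slide (alpha = 2)

def component_sig(page: bytes, alpha: int = ALPHA) -> int:
    """sig_alpha(P) = sum over i of p_i * alpha^i (mod Q)."""
    s, power = 0, 1
    for symbol in page:            # p_0, p_1, ..., p_{l-1}
        s = (s + symbol * power) % Q
        power = (power * alpha) % Q
    return s

def page_sig(page: bytes, n: int = 4):
    """n-symbol signature: component signatures for alpha, alpha^2, ..., alpha^n."""
    return tuple(component_sig(page, pow(ALPHA, k, Q)) for k in range(1, n + 1))

print(page_sig(b"hello world") != page_sig(b"hello worle"))  # True: 1-symbol change detected
```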
![Page 146: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/146.jpg)
149
Algebraic Signature Properties
Page length < 2^f − 1: detects all changes of up to n symbols.
Otherwise, collision probability 2^−nf.
For a change Δ starting at symbol r:
sig_α(P′) = sig_α(P) + α^r sig_α(Δ).
![Page 147: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/147.jpg)
150
Algebraic Signature Properties
Signature tree: speeds up the comparison of signatures
![Page 148: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/148.jpg)
151
Uses for Algebraic Signatures in SDDS
Bucket backup
Record updates
Weak, optimistic concurrency scheme
Stored-data protection against involuntary disclosure
Efficient scans
– Prefix match
– Pattern match (see VLDB 07)
– Longest common substring match
– …
Application-issued checking of stored record integrity
![Page 149: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/149.jpg)
152
Signatures for File Backup
Back up an SDDS bucket on disk.
The bucket consists of large pages.
Maintain signatures of the pages on disk.
Only back up pages whose signature has changed.
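A hedged sketch of this backup idea (ours, not the SDDS-2000 code): keep the last-backed-up signature of each page and, on a backup pass, rewrite only pages whose signature changed. An ordinary hash stands in here for the algebraic signature.

```python
import hashlib

def sig(page: bytes) -> bytes:
    # sha1 as a stand-in for the algebraic page signature
    return hashlib.sha1(page).digest()

def backup_pass(bucket_pages, disk_sigs, disk_pages):
    """Copy to disk only the pages whose signature differs from the stored one."""
    written = []
    for i, page in enumerate(bucket_pages):
        s = sig(page)
        if disk_sigs.get(i) != s:
            disk_pages[i] = page
            disk_sigs[i] = s
            written.append(i)
    return written

pages = [b"page-1", b"page-2", b"page-3"]
disk_sigs, disk_pages = {}, {}
print(backup_pass(pages, disk_sigs, disk_pages))  # [0, 1, 2] -- first backup writes all
pages[1] = b"page-2-changed"
print(backup_pass(pages, disk_sigs, disk_pages))  # [1] -- only the changed page
```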
![Page 150: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/150.jpg)
153
Signatures for File Backup
[Figure: a bucket of pages 1–7 mirrored on disk, with the backup manager holding sig 1 … sig 7. The application accesses page 2 without changing it and changes page 3; only sig 3 changes, so the backup manager backs up only page 3]
![Page 151: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/151.jpg)
154
Record Update w. Signatures
The application requests record R.
The client provides record R and stores its signature sig_before(R).
The application updates record R and hands it back to the client.
The client compares sig_after(R) with sig_before(R) and only updates if they differ.
This prevents the messaging of pseudo-updates.
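The client-side check can be sketched as follows (a minimal illustration; `zlib.crc32` stands in for the algebraic signature, and the `Client` class is invented for the example):

```python
import zlib

def sig(record: bytes) -> int:
    # crc32 as a stand-in for the algebraic record signature
    return zlib.crc32(record)

class Client:
    def __init__(self):
        self.sent = 0
    def read(self, record: bytes) -> bytes:
        self.sig_before = sig(record)   # remember sig_before(R)
        return record
    def update(self, record: bytes) -> bool:
        if sig(record) == self.sig_before:
            return False                # pseudo-update: nothing really changed
        self.sent += 1                  # a real update message goes to the server
        return True

c = Client()
c.read(b"value-1")
print(c.update(b"value-1"))   # False -- application rewrote the same value
print(c.update(b"value-2"))   # True  -- genuine change, message sent
```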
![Page 152: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/152.jpg)
155
Scans with Signatures
Scan = pattern matching in a non-key field.
Send the signature of the pattern – by the SDDS client.
Apply a Karp-Rabin-like calculation at all SDDS servers. – See the paper for details.
Return hits to the SDDS client.
Filter false positives – at the client.
156
Scans with Signatures
Client: look for “sdfg”. Calculate the signature for “sdfg”.
Server: the field is “qwertyuiopasdfghjklzxcvbnm”.
Compare with the signature for “qwer”
Compare with the signature for “wert”
Compare with the signature for “erty”
Compare with the signature for “rtyu”
Compare with the signature for “tyui”
Compare with the signature for “uiop”
Compare with the signature for “iopa”
…
Compare with the signature for “sdfg” → HIT
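The sliding comparison on this slide can be sketched with a Karp-Rabin rolling hash. This is a toy stand-in: a polynomial hash mod a Mersenne prime replaces the GF(2^f) algebraic signature, and with a real signature scheme any hits would still be filtered for false positives at the client.

```python
# Toy Karp-Rabin scan: the server slides an m-symbol window over the
# field and compares each window's signature with the pattern's.
B, M = 257, (1 << 61) - 1   # hash base and a Mersenne prime modulus

def sig(s: bytes) -> int:
    """Toy signature: polynomial hash of the string."""
    h = 0
    for c in s:
        h = (h * B + c) % M
    return h

def scan(field: bytes, pattern_sig: int, m: int):
    """Offsets of every m-byte window whose signature matches the pattern's."""
    hits, top = [], pow(B, m - 1, M)
    h = sig(field[:m])
    for i in range(len(field) - m + 1):
        if h == pattern_sig:
            hits.append(i)
        if i + m < len(field):   # roll the window one symbol to the right
            h = ((h - field[i] * top) * B + field[i + m]) % M
    return hits

# scan(b"qwertyuiopasdfghjklzxcvbnm", sig(b"sdfg"), 4) -> [11]
```

Each step costs O(1) regardless of the pattern length, which is what makes the server-side scan cheap.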
![Page 154: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/154.jpg)
157
Record Update
SDDS updates change only the non-key field.
Many applications write a record back with the same value.
Record update in SDDS:
– Application requests the record.
– SDDS client reads record Rb.
– Application requests the update.
– SDDS client writes record Ra.
![Page 155: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/155.jpg)
158
Record Update w. Signatures
Weak, optimistic concurrency protocol:
– Read-calculation phase:
» Transaction reads records, calculates, reads more records.
» Transaction stores the signatures of the read records.
– Verify phase: check the signatures of the read records; abort if a signature has changed.
– Write phase: commit the record changes.
Read-Committed isolation (ANSI SQL)
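The three phases can be sketched as below; this is an illustrative skeleton, with `zlib.crc32` standing in for the record signature and a plain dict standing in for the distributed store.

```python
# Sketch of the weak optimistic scheme: remember signatures at read time,
# verify them before writing, abort if any read record changed meanwhile.
import zlib

def read_phase(store: dict, keys):
    """Read-calculation phase: record a signature per record read."""
    return {k: zlib.crc32(store[k]) for k in keys}

def commit(store: dict, read_sigs: dict, writes: dict) -> bool:
    """Verify phase then write phase; returns False on abort."""
    for k, s in read_sigs.items():
        if zlib.crc32(store[k]) != s:
            return False          # a read record changed: abort
    store.update(writes)          # write phase: commit
    return True
```

A concurrent writer that touches a record between `read_phase` and `commit` forces the abort, which is exactly the weak-isolation guarantee described above.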
![Page 156: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/156.jpg)
159
Performance Results
1.8 GHz P4 on 100 Mb/sec Ethernet
Records of 100B with 4B keys; signature size 4B
– One backup collision every 135 years at 1 backup per second
![Page 157: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/157.jpg)
160
Performance Results: Backups
Signature calculation: 20–30 msec per 1 MB
Somewhat independent of the details of the signature scheme
GF(2^16) slightly faster than GF(2^8)
The biggest performance issue is caching.
Compare to SHA-1 at 50 msec/MB.
![Page 158: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/158.jpg)
161
Performance Results: Updates
Run on modified SDDS-2000
– The SDDS prototype at Dauphine
Signature calculation
– 5 μsec/KB on P4
– 158 μsec/KB on P3
– Caching is the bottleneck
Updates
– Normal update: 0.614 msec per 1KB record
– Pseudo-update: 0.043 msec per 1KB record
![Page 159: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/159.jpg)
162
More on Algebraic Signatures
Page P: a string of l < 2^f − 1 symbols pᵢ, i = 0,…,l−1
n-symbol signature base:
– a vector α = (α₁,…,αₙ) of distinct non-zero elements of the GF
The (n-symbol) signature of P based on α is the vector

    sig_α(P) = (sig_α₁(P), sig_α₂(P), …, sig_αₙ(P))

• where for each α:

    sig_α(P) = Σ_{i=0}^{l−1} pᵢ αⁱ
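The per-symbol formula above can be sketched in Python. The slides only require some GF(2^f); the GF(2^8) modulus x⁸+x⁴+x³+x+1 (0x11B) and generator 0x03 used here are illustrative assumptions borrowed from AES.

```python
# Sketch of sig_alpha(P) = sum_i p_i * alpha^i over GF(2^8).
# Multiplication via log/antilog tables, as the slides suggest
# (the field constants below are an assumption, not from the talk).
EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x ^= (x << 1) ^ (0x11B if x & 0x80 else 0)   # multiply x by generator 0x03
for i in range(255, 512):                         # spill-over for LOG[a]+LOG[b]
    EXP[i] = EXP[i - 255]

def gf_mul(a: int, b: int) -> int:
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def sig(page: bytes, alpha: int) -> int:
    """sig_alpha(P): addition in GF(2^f) is XOR, powers of alpha by table."""
    s, a_pow = 0, 1
    for p in page:
        s ^= gf_mul(p, a_pow)
        a_pow = gf_mul(a_pow, alpha)
    return s

def sig_vec(page: bytes, alphas) -> tuple:
    """n-symbol signature: one component per base element alpha_i."""
    return tuple(sig(page, a) for a in alphas)
```

Note the GF-linearity that the collision properties rely on: the signature of the XOR of two pages is the XOR of their signatures.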
![Page 160: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/160.jpg)
163
The sig_α,n and sig_2α,n schemes
sig_α,n : α = (α, α², α³, …, αⁿ) with n << ord(α) = 2^f − 1
• The collision probability is 2^−nf at best
sig_2α,n : α = (α, α², α⁴, α⁸, …, α^(2ⁿ))
• The randomization is possibly better for more than 2-symbol signatures, since all the αᵢ are primitive
• In SDDS-2002 we use sig_α,n
• Computed in fact for p′ = antilog p
• To speed up the multiplication
![Page 161: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/161.jpg)
164
The sig_α,n Algebraic Signature
If P1 and P2 differ by at most n symbols and have no more than 2^f − 1 symbols, then the probability of collision is 0.
• A new property, at present unique to sig_α,n
• Due to its algebraic nature
If P1 and P2 differ by more than n symbols, then the probability of collision reaches 2^−nf.
Good behavior for cut/paste
• But not the best possible
See our IEEE ICDE-04 paper for other properties.
![Page 162: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/162.jpg)
165
The sig_α,n Algebraic Signature: Application in SDDS-2004
Disk backup
– RAM bucket divided into pages, 4KB at present
– The Store command saves only the pages whose signature differs from the stored one
– Restore does the inverse
Updates
– Only effective updates go from the client
» E.g., blind updates of a surveillance-camera image
– Only an update whose before-signature is that of the record at the server gets accepted
» Avoidance of lost updates
![Page 163: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/163.jpg)
166
The sig_α,n Algebraic Signature: Application in SDDS-2004
Non-key distributed scans
– The client sends to all the servers the signature S of the data to find, using:
– Total match
» The whole non-key field F matches S: S_F = S
– Partial match
» S equals the signature S_f of a sub-field f of F
– We use a Karp-Rabin-like computation of S_f
![Page 164: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/164.jpg)
167
SDDS & P2P
P2P architecture as support for an SDDS
– A node is typically both a client and a server
– The coordinator is a super-peer
– Client & server modules are Windows active services
» Run transparently for the user
» Referred to in the Start Up directory
See:
– PlanetLab project literature at UC Berkeley
– J. Hellerstein's VLDB 2004 tutorial
![Page 165: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/165.jpg)
168
SDDS & P2P
P2P node availability (churn)
– Much lower than traditional, for a variety of reasons
» (Kubiatowicz & al., OceanStore project papers)
A node can leave at any time
– Either transferring its data to a spare
– Or taking its data with it
LH*RS parity management seems a good basis to deal with all this.
![Page 166: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/166.jpg)
169
LH*RS-P2P
Each node is a peer
– Client and server
A peer can be
– (Data) server peer: hosting a data bucket
– Parity (server) peer: hosting a parity bucket
» LH*RS only
– Candidate peer: willing to host
![Page 167: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/167.jpg)
170
LH*RS-P2P
A candidate node wishing to become a peer
– Contacts the coordinator
– Gets an IAM message from some peer becoming its tutor
» With the level j of the tutor and its number a
» All the physical addresses known to the tutor
– Adjusts its image
– Starts working as a client
– Remains available for the “call for server duty”
» By multicast or unicast
![Page 168: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/168.jpg)
171
LH*RS-P2P
The coordinator chooses the tutor by LH over the candidate's address
– Good balancing of the tutors' load
A tutor notifies all its pupils and its own client part at its every split
– Sending its new bucket-level value j
The recipients adjust their images.
A candidate peer notifies its tutor when it becomes a server or parity peer.
![Page 169: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/169.jpg)
172
LH*RS-P2P
End result
– Every key search needs at most one forwarding to reach the correct bucket
» Assuming the availability of the buckets concerned
– The fastest search possible for any SDDS
» Otherwise every split would need to be synchronously posted to all the client peers
» Contrary to the SDDS axioms
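The client-side addressing that the image adjustments keep accurate can be sketched as in the LH* papers, with h_i(c) = c mod 2^i; the function below is a hedged illustration of the image calculation, not the LH*RS-P2P implementation.

```python
# Sketch of LH*-style client addressing from an image (i, n):
# i = presumed file level, n = presumed split pointer.
def client_address(key: int, i: int, n: int) -> int:
    """Bucket address for key according to the client's image (i, n)."""
    a = key % (1 << i)            # a = h_i(key), with h_i(c) = c mod 2^i
    if a < n:                     # bucket a has already split at this level
        a = key % (1 << (i + 1))  # so use h_{i+1}
    return a
```

When the peer notifications keep (i, n) at most one split behind the file state, a server that receives a misdirected key needs to forward it at most once.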
![Page 170: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/170.jpg)
173
Churn in LH*RS-P2P
A candidate peer may leave at any time without notice
– The coordinator and the tutor assume so if it does not reply to their messages
– They then delete the peer from their notification tables
A server peer may leave in two ways
– With early notice to its parity group server
» Its stored data move to a spare
– Without notice
» Its stored data are recovered as usual for LH*RS
![Page 171: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/171.jpg)
174
Churn in LH*RS-P2P
Other peers learn that the data of a peer moved when they attempt to access the node of the former peer
– No reply, or another bucket found
They then address the query to any other peer in the recovery group.
This peer resends it to the parity server of the group.
– An IAM comes back to the sender
![Page 172: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/172.jpg)
175
Churn in LH*RS-P2P
Special case
– A server peer S1 is cut off for a while; its bucket gets recovered at server S2 while S1 comes back to service
– Another peer may still address a query to S1
– Getting perhaps outdated data
The case existed for LH*RS, but may now be more frequent.
Solution?
![Page 173: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/173.jpg)
176
Churn in LH*RS-P2P
Sure read
– The server A receiving the query contacts its availability-group manager
» One of the parity data managers
» All these addresses may be outdated at A as well
» Then A contacts its group members
The manager knows for sure
– Whether A is an actual server
– Where the actual server A′ is
![Page 174: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/174.jpg)
177
Churn in LH*RS-P2P
If A′ ≠ A, then the manager
– Forwards the query to A′
– Informs A about its outdated status
The query is then processed.
The correct server informs the client with an IAM.
![Page 175: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/175.jpg)
178
SDDS & P2P
SDDSs within P2P applications
– Directories for structured P2P systems
» LH* especially, versus DHT tables
– CHORD
– P-Trees
– Distributed backup and unlimited storage
» Companies with local nets
» Community networks
– Wi-Fi especially
– MS experiments in Seattle
Other suggestions?
![Page 176: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/176.jpg)
179
Popular DHT: Chord (from J. Hellerstein's VLDB 04 tutorial)
Consistent hashing + DHT
Assume n = 2^m nodes for a moment
– A “complete” Chord ring
Key c and node ID N are integers given by hashing into 0,…,2^4 − 1
– 4 bits here
Every key c should be stored at the first node N ≥ c
– Modulo 2^m
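Chord's placement rule, first node N ≥ c modulo 2^m, can be sketched directly; this is an illustrative helper, not Chord's actual distributed implementation.

```python
# Sketch of Chord's consistent-hashing rule: key c lives at the first
# node with ID >= c on the ring, wrapping around past 2^m - 1.
def successor(c: int, nodes, m: int) -> int:
    """First node ID >= key c on the ring of size 2^m."""
    c %= 1 << m
    live = sorted(nodes)
    for n in live:
        if n >= c:
            return n
    return live[0]   # wrapped past the largest node ID
```

For m = 4 and nodes {0, 3, 7, 12}: key 5 goes to node 7, and key 13 wraps around to node 0.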
![Page 177: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/177.jpg)
180
Popular DHT: Chord
Full finger DHT table at node 0
Used for faster search
![Page 178: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/178.jpg)
181
Popular DHT: Chord
Full finger DHT table at node 0
Used for faster search
Key 3 and Key 7, for instance, from node 0
![Page 179: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/179.jpg)
182
Popular DHT: Chord
Full finger DHT tables at all nodes
O(log n) search cost
– In # of forwarding messages
Compare to LH*
See also P-trees
– VLDB-05 tutorial by K. Aberer
» In our course doc
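The O(log n) forwarding cost comes from finger tables: node n's finger i points at the successor of n + 2^i, and each hop at least halves the remaining ring distance. A compact, centralized sketch of this routing (an illustration, not the Chord RPC protocol) for the 4-bit example ring:

```python
# Sketch of Chord finger routing on a ring of size 2^m.
def ring_successor(k, nodes, size):
    """First node ID >= k mod size, wrapping around."""
    k %= size
    live = sorted(nodes)
    for v in live:
        if v >= k:
            return v
    return live[0]

def between(x, a, b):
    """x strictly inside the ring interval (a, b)."""
    if a == b:
        return x != a
    if a < b:
        return a < x < b
    return x > a or x < b

def lookup(n, key, nodes, m):
    """Greedy finger routing; returns (owning node, forwarding hops)."""
    size, hops = 1 << m, 0
    while True:
        if n == key:
            return n, hops
        succ = ring_successor((n + 1) % size, nodes, size)
        if between(key, n, succ) or key == succ:
            return succ, hops
        nxt = n
        for i in reversed(range(m)):              # finger i -> succ(n + 2^i)
            f = ring_successor((n + (1 << i)) % size, nodes, size)
            if between(f, n, key):                # closest preceding finger
                nxt = f
                break
        n = succ if nxt == n else nxt
        hops += 1
```

On the slide's example ring {0, 3, 7, 12}, node 0 resolves key 3 locally via its successor and reaches key 7's owner in one forwarding.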
![Page 180: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/180.jpg)
183
Churn in Chord
Node join in an incomplete ring
– New node N′ enters the ring between its (immediate) successor N and its (immediate) predecessor
– It gets from N every key c ≤ N′
– It sets up its finger table
» With help of the neighbors
![Page 181: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/181.jpg)
184
Churn in Chord
Node leave
– Inverse of node join
To facilitate the process, every node also keeps a pointer to its predecessor.
Compare these operations to LH*.
Compare Chord to LH*.
High availability in Chord
– Good question
![Page 182: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/182.jpg)
185
DHT : Historical Notice
Invented by Bob Devine
– Published in '93 at FODO
The source is almost never cited.
The concept was also used by S. Gribble
– For Internet-scale SDDSs
– At about the same time
![Page 183: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/183.jpg)
186
DHT : Historical Notice
Most folks incorrectly believe DHTs were invented by Chord
– Which initially cited neither Devine nor our SIGMOD & TODS LH* and RP* papers
– Reason?
» Ask the Chord folks
![Page 184: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/184.jpg)
187
SDDS & Grid & Clouds…
What is a grid?
– Ask I. Foster (University of Chicago)
What is a cloud?
– Ask MS, IBM…
The world is supposed to benefit from power grids and data grids & clouds & SaaS.
Does a grid have fewer nodes than a cloud?
![Page 185: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/185.jpg)
188
SDDS & Grid & Clouds…
Ex. Tempest: a 512-node supercomputer grid at MHPCC
Difference between a grid and a P2P net?
– Local autonomy?
– Computational power of the servers?
– Number of available nodes?
– Data availability & security?
![Page 186: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/186.jpg)
189
SDDS & Grid
An SDDS storage is a tool for data grids
– Perhaps easier to apply than to P2P
» Lesser server autonomy
» Better for stored-data security
![Page 187: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/187.jpg)
190
SDDS & Grid
Sample applications we have been looking at
– SkyServer (J. Gray & Co)
– Virtual Telescope
– Streams of particles (CERN)
– Biocomputing (genes, image analysis…)
![Page 188: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/188.jpg)
191
Conclusion
Cloud databases of all kinds appear to be the future
– SQL, key-value…
RAM Clouds as support are especially promising
– Just type “RAM Cloud” into Google
Any DB-oriented algorithm that scales poorly, or is not designed for scaling, is obsolete.
![Page 189: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/189.jpg)
192
Conclusion
A lot is done in the infrastructure
- Advanced research, especially on SDDSs
- But also for the industry
- GFS, Hadoop, HBase, Hive, Mongo, Voldemort…
We'll say more on some of these systems later.
![Page 190: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/190.jpg)
193
Conclusion
SDDS in 2011
- Research has demonstrated the initial objectives
- Including Jim Gray's expectations
- Distributed RAM-based access can be up to 100 times faster than access to a local disk
- Response time may go down, e.g., from 2 hours to 1 min
- RAM Clouds are promising
![Page 191: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/191.jpg)
194
Conclusion
SDDS in 2011
- A data collection can be almost arbitrarily large
- It can support various types of queries
- Key-based, range, k-dim, k-NN…
- Various types of string search (pattern matching)
- SQL
- The collection can be k-available
- It can be secure
- …
![Page 192: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/192.jpg)
195
Conclusion
SDDS in 2011
- Database schemes: SD-SQL Server
- 48,000 estimated references on Google for "scalable distributed data structure"
![Page 193: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/193.jpg)
196
Conclusion
SDDS in 2011
- Several variants of LH* and RP*
- Numerous new schemes:
- SD-Rtree, LH*RS-P2P, LH*RE, CTH*, IH, Baton, VBI…
- See the ACM Portal for refs
- And Google in general
![Page 194: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/194.jpg)
197
Conclusion
SDDS in 2011: new capabilities
- Pattern matching using algebraic signatures
- Over encoded stored data in the cloud
- Using non-indexed n-grams
- See VLDB 08
- With R. Mokadem, C. du Mouza, Ph. Rigaux, Th. Schwarz
![Page 195: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/195.jpg)
198
Conclusion
Pattern matching using algebraic signatures
- Typically the fastest exact-match string search
- E.g., faster than Boyer-Moore
- Even when there is no parallel search
- Provides client-defined cloud-data confidentiality
- Under the “honest but curious” threat model
![Page 196: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/196.jpg)
199
Conclusion
SDDS in 2011
- Very fast exact-match string search over indexed n-grams in a cloud
- Compact index, with only 1–2 disk accesses per search
- Termed AS-Index
- CIKM 09
- With C. du Mouza, Ph. Rigaux, Th. Schwarz
![Page 197: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/197.jpg)
200
Current Research at Dauphine & al
SD-Rtree
– With CNAM
– Published at ICDE 09
» With C. du Mouza and Ph. Rigaux
– Provides R-tree properties for data in the cloud
» E.g., storage for non-point objects
– Allows for scans (Map/Reduce)
![Page 198: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/198.jpg)
201
Current Research at Dauphine & al
LH*RS-P2P
– Thesis by Y. Hanafi
– Provides at most 1 hop per search
– The best result possible for an SDDS
– See: http://video.google.com/videoplay?docid=-7096662377647111009#
– Efficiently manages churn in P2P systems
![Page 199: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/199.jpg)
202
Current Research at Dauphine & al
LH*RE
– With CSIS, George Mason U., VA
– Patent pending
– Client-side encryption for cloud data with recoverable encryption keys
– Published at IEEE Cloud 2010
» With S. Jajodia & Th. Schwarz
![Page 200: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/200.jpg)
203
Conclusion
The SDDS domain is ready for wide industrial use
- For new industrial-strength applications
- These are likely to appear around the leading new products
- That we outlined, or at least mentioned
![Page 201: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/201.jpg)
204
Credits: Research
LH*RS: Rim Moussa (Ph.D. thesis to defend in Oct. 2004)
SDDS 200X design & implementation (CERIA)
» J. Karlson (U. Linkoping, Ph.D., 1st LH* impl., now Google Mountain View)
» F. Bennour (LH* on Windows, Ph.D.)
» A. Wan Diene (CERIA, U. Dakar: SDDS-2000, RP*, Ph.D.)
» Y. Ndiaye (CERIA, U. Dakar: AMOS-SDDS & SD-AMOS, Ph.D.)
» M. Ljungstrom (U. Linkoping, 1st LH*RS impl., Master's thesis)
» R. Moussa (CERIA: LH*RS, Ph.D.)
» R. Mokadem (CERIA: SDDS-2002, algebraic signatures & their apps, Ph.D., now U. Paul Sabatier, Toulouse)
» B. Hamadi (CERIA: SDDS-2002, updates, research internship)
» See also the CERIA Web page at ceria.dauphine.fr
SD-SQL Server
– Soror Sahri (CERIA, Ph.D.)
![Page 202: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/202.jpg)
205
Credits: Funding
– CEE-EGov bus project
– Microsoft Research
– CEE-ICONS project
– IBM Research (Almaden)
– HP Labs (Palo Alto)
![Page 204: Cloud Databases Part 2](https://reader036.fdocuments.net/reader036/viewer/2022062410/56816464550346895dd64756/html5/thumbnails/204.jpg)
207