L4 Relational Algebra - GitHub Pages › files › lectures › lec4.pdfOverview Last time, learned...
Transcript of L4 Relational Algebra - GitHub Pages › files › lectures › lec4.pdfOverview Last time, learned...
L4Relational Algebra
Eugene Wu
Supplemental Materials
Helpful Referenceshttps://en.wikipedia.org/wiki/Relational_algebra
Overview
Last time, learned aboutpre-relational modelsan informal introduction to relational modelan introduction to the SQL query language.
Learn about formal relational query languagesRelational Algebra (algebra: perform operations)Relational Calculus (logic: are statements true?)
Keys to understanding SQL and query processing
Who Cares?
Clean query semantics & rich program analysis
Helps/enables optimization
Opens up rich set of topics
Materialized views
Data lineage/provenance
Query by example
Distributed query execution
…
What’s a Query Language?
Allows manipulation and retrieval of data from a database.
Traditionally: QL != programming languageDoesn’t need to be turing completeNot designed for computationSupports easy, efficient access to large databases
Recent YearsScaling to large datasets is a realityQuery languages are a powerful way to
think about data algorithms scalethink about asynchronous/parallel programming
What’s a Query Language?
Tutorial at VLDB 2015 Abadi et al.http://www.slideshare.net/abadid/sqlonhadoop-tutorial
Formal Relational Query Languages
Formal basis for real languages e.g., SQL
Relational Algebra
Operational, used to represent execution plans
Relational Calculus
Logical, describes what data users want
(not operational, fully declarative)
Prelims
Query is a function over relation instances
Q(R1,…,Rn) = Rresult
Schemas of input and output relations are fixed and well defined by the query Q.
Positional vs Named field notationPosition easier for formal defs
one-indexed (not 0-indexed!!!)Named more readableBoth used in SQL
Prelims
Relation (for this lecture)Instance is a set of tuplesSchema defines field names and types (domains)Students(sid int, name text, major text, gpa int)
How are relations different than generic sets (ℝ)?Can assume item structure due to schemaSome algebra operations (x) need to be modifiedWill use this later
Relational Algebra Overview
Core 5 operationsPROJECT (π)SELECT (σ)UNION (U)SET DIFFERENCE (-)
CROSSPRODUCT (x)
Additional operationsRENAME (p)INTERSECT (∩)
JOIN (⨝)DIVIDE (/)
Instances Used Today: Library
Students, Reservations
Use positional or named field notation
Fields in query results are inherited from input relations (unless specified)
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
sid name gpa age
4 aziz 3.2 212 barb 3 213 tanya 2 885 rusty 3.5 21
sid rid day
1 101 10/10
2 102 11/11
R1
S1
S2
Project
π<attr1,…>(A) = Rresult
Pick out desired attributes (subset of columns)
Schema is subset of input schema in the projection listπ<a,b,c>(A) has output schema (a,b,c) w/ types carried over
Project
πname,age(S2) = name age
aziz 21barb 21tanya 88rusty 21
sid name gpa age
4 aziz 3.2 212 barb 3 213 tanya 2 885 rusty 3.5 21
S2
Project
πname,name,age(S2) = name name age
aziz aziz 21barb barb 21tanya tanya 88rusty rusty 21
sid name gpa age
4 aziz 3.2 212 barb 3 213 tanya 2 885 rusty 3.5 21
S2
Project
age
2188
sid name gpa age
4 aziz 3.2 212 barb 3 213 tanya 2 885 rusty 3.5 21
S2
Where did all the rows go?Real systems typically don’t remove duplicates. Why?
πage(S2) =
Select
σ<p>(A) = Rresult
Select subset of rows that satisfy condition p
Won’t have duplicates in result. Why?
Result schema same as input
Select
sid name gpa age
1 eugene 4 202 barb 3 21
σage<30 (S1) =
name
eugenebarb
πname(σage<30 (S1)) =
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
S1
Commutatively & Associativity
A + B = B + AA * B = B * A
A + (B * C) = (B * C) + A
A + (B + C) = (A + B) + CA + (B * C) = (A + B) * C
Commutatively & Associativity
A + B = B + AA * B = B * A
A + (B * C) = (B * C) + A
A + (B + C) = (A + B) + CA + (B * C) = (A + B) * C
Commutatively
πage(σage<30 (S1))
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
σage<30 ( )sid name gpa age
1 eugene 4 202 barb 3 21
Commutatively
πage(σage<30 (S1))
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
σage<30
πage
sid name gpa age
1 eugene 4 202 barb 3 21
( )age
2021
( ) =
Commutatively
σage<30(πage(S1))
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
( )age
202188
πage
Commutatively
σage<30(πage(S1))
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
( )age
2021
( ) age
202188
πage
σage<30 =
Commutatively
Does Project and Select commute?
πage(σage<30 (S1)) = σage<30(πage(S1))
What about
πname(σage<30 (S1))?
Commutatively
Does Project and Select commute?
πage(σage<30 (S1)) = σage<30(πage(S1))
What about
πname(σage<30 (S1)) != σage<30(πname(S1))
Commutatively
Does Project and Select commute?
πage(σage<30 (S1)) = σage<30(πage(S1))
What about
πname(σage<30 (S1)) != σage<30(πname, age(S1))
Commutatively
Does Project and Select commute?
πage(σage<30 (S1)) = σage<30(πage(S1))
What about
πname(σage<30 (S1)) = πname(σage<30(πname, age(S1)))
OK!
Union, Set-Difference
A op B = Rresult
A, B must be union-compatibleSame number of fieldsField i in each schema have same type
Result Schema borrowed from first arg (A)A(big int, poppa int) U B(thug int, life int) = ?
Union, Set-Difference
A op B = Rresult
A, B must be union-compatibleSame number of fieldsField i in each schema have same type
Result Schema borrowed from first arg (A)A(big int, poppa int) U B(thug int, life int) = Rresult(big int, poppa int)
Union, Intersect, Set-Difference
S1 US2=
sid name gpa age
1 eugene 4 204 aziz 3.2 215 rusty 3.5 213 tanya 2 882 barb 3 21
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
sid name gpa age
4 aziz 3.2 212 barb 3 213 tanya 2 885 rusty 3.5 21
S1 S2
Union, Intersect, Set-Difference
sid name gpa age
1 eugene 4 20S1- S2=
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
sid name gpa age
4 aziz 3.2 212 barb 3 213 tanya 2 885 rusty 3.5 21
S1 S2
Note on Set Difference & Performance
Notice that most operators are monotonicincreasing size of inputs à outputs growif A⊇B à Q(A, T) ⊇ Q(B, T)can compute incrementally
Set Difference is not monotonicif A⊇B à T – A⊆T – Be.g., 5 > 1 à 9 – 5 < 9 – 1
Set difference is blocking:For T – S, must wait for all S tuples before any results
Cross-Product
A(a1,…,an) x B(an+1,…,am) = Rresult(a1,…,am)
Each row of A paired with each row of B
Result schema concats A and B’s fields, inherit if possibleConflict: students and reservations have sid field
Different than mathematical “X” by flattening results:
math A x B = { (a, b) | a∈A ^ b∈B }e.g., {1, 2} X {3, 4} = { (1, 3), (1, 4), (2, 3), (2, 4) }
what is {1, 2} x {3, 4} x {5, 6}?
Cross-Product
(sid) name gpa age (sid) rid day
1 eugene 4 20 1 101 10/10
2 barb 3 21 1 101 10/10
3 tanya 2 88 1 101 10/10
1 eugene 4 20 2 102 11/11
2 barb 3 21 2 102 11/11
3 tanya 2 88 2 102 11/11
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
S1
S1xR1=
sid rid day
1 101 10/10
2 102 11/11
R1
Rename
p(<new_name>(<mappings>), Q)
Explicitly defines/changes field names of schema
p(C(1 à sid1, 5 à sid2), S1 x R1)
sid1 name gpa age sid2 rid day
1 eugene 4 20 1 101 10/10
2 barb 3 21 1 101 10/10
3 tanya 2 88 1 101 10/10
1 eugene 4 20 2 102 11/11
2 barb 3 21 2 102 11/11
3 tanya 2 88 2 102 11/11
C =
π( ) =Project
σ( ) =Select
x =Cross product
- =Difference
U =Union
∩ =Intersect
Compound/Convenience Operators
INTERSECT (∩)
JOIN (⨝)DIVIDE (/)
Intersect
A ∩ B = Rresult
A, B must be union-compatible
Intersect
sid name gpa age
2 barb 3 213 tanya 2 88
S1∩S2=
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
sid name gpa age
4 aziz 3.2 212 barb 3 213 tanya 2 885 rusty 3.5 21
S1 S2
Intersect
A ∩ B = Rresult
A, B must be union-compatible
Can we express using core operators?A∩B = ?
Intersect
A ∩ B = Rresult
Can we express using core operators?A∩B = A - ? (think venn diagram)
A B
Intersect
A ∩ B = Rresult
Can we express using core operators?A∩B = A – (A – B)
A B
Joins (high level)
Reserves BooksStudents
relation relation relation
What if you want to query across all three tables?e.g., all names of students that reserved “The Purple Crayon”
Need to combine these tables Cross product? But that ignores foreign key references
day
Joins (high level)
Reserves BooksStudents
relation relation relation
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
S1rid name
101 The PurpleCrayon
102 1984
B1
day
sid rid day
1 101 10/10
2 102 11/11
R1
Joins (high level)
Reserves BooksStudents
relation relation relation
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
S1sid rid day
1 101 10/10
2 102 11/11
R1rid name
101 The PurpleCrayon
102 1984
B1
day
Joins (high level)
Reserves
day
BooksStudents
relation relation relation
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
S1 RB1sid (rid) day
1 101 10/10
2 102 11/11
(rid) name
101 The PurpleCrayon
102 1984
Joins (high level)
Reserves BooksStudents
relation relation relation
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
S1 RB1sid (rid) day
1 101 10/10
2 102 11/11
(rid) name
101 The PurpleCrayon
102 1984
day
Joins (high level)
Reserves BooksStudents
relation relation relation
(sid) (name) gpa age
1 eugene 4 202 barb 3 21
SRB1
day
(sid) (rid) day
1 101 10/10
2 102 11/11
(rid) (name)
101 The PurpleCrayon
102 1984
theta (θ) Join
A ⨝c B = σc(A x B)
Most general form
Result schema same as cross product
Often far more efficient to compute than cross product
Commutative
(A⨝cB) ⨝cC = A⨝c(B ⨝cC)
theta (θ) Join
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
S1sid rid day
1 101 10/10
2 102 11/11
R1
(sid) name gpa age (sid) rid day
1 eugene 4 20 1 101 10/10
2 barb 3 21 1 101 10/10
3 tanya 2 88 1 101 10/10
1 eugene 4 20 2 102 11/11
2 barb 3 21 2 102 11/11
3 tanya 2 88 2 102 11/11
S1⨝S1.sid ≤ R1.sid R1 = σ S1.sid ≤ R1.sid(S1xR1)=
theta (θ) Join
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
S1sid rid day
1 101 10/10
2 102 11/11
R1
(sid) name gpa age (sid) rid day
1 eugene 4 20 1 101 10/10
2 barb 3 21 1 101 10/10
3 tanya 2 88 1 101 10/10
1 eugene 4 20 2 102 11/11
2 barb 3 21 2 102 11/11
3 tanya 2 88 2 102 11/11
S1⨝S1.sid ≤ R1.sid R1 = σ S1.sid ≤ R1.sid(S1xR1)=
theta (θ) Join
(sid) name gpa age (sid) rid day
1 eugene 4 20 1 101 10/10
1 eugene 4 20 2 102 11/11
2 barb 3 21 2 102 11/11
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
S1sid rid day
1 101 10/10
2 102 11/11
R1
S1⨝S1.sid ≤ R1.sid R1 = σ S1.sid ≤ R1.sid(S1xR1)=
Equi-Join
A ⨝attr B = πall attrs except B.attr(A ⨝A.attr = B.attr B)
Special case where the condition is attribute equality
Result schema only keeps one copy of equality fields
Natural Join (A⨝B):
Equijoin on all shared fields (fields w/ same name)
Equi-Join
S1⨝sidR1 = sid name gpa age rid day
1 eugene 4 20 101 10/10
2 barb 3 21 102 11/11
sid name gpa age
1 eugene 4 202 barb 3 213 tanya 2 88
S1sid rid day
1 101 10/10
2 102 11/11
R1
Division
Let us have relations A(x, y), B(y)
A/B = { <x> | ∀y ∈B <x,y> ∈A }
Find all students that have reserved all booksA/B = all x (students) s.t. for every y (reservation), <x,y> ∈A
Good to ponder, not supported in most systems (why?)
Generalizationy can be a list of fields in Bx U y is fields in A
Examples
sid rid
1 1
1 2
1 3
1 4
2 1
2 2
3 2
4 2
4 4
rid
2
rid
2
4
rid
1
2
4
A R1 R2 R3
A/R1 A/R2 A/R3
Which sid values have relationship with all vals in Rx?
Examples
sid rid
1 1
1 2
1 3
1 4
2 1
2 2
3 2
4 2
4 4
rid
2
rid
2
4
rid
1
2
4
A R1 R2 R3
sid
1
2
3
4
A/R1 A/R2 A/R3
Which sid values have relationship with all vals in Rx?
Examples
sid rid
1 1
1 2
1 3
1 4
2 1
2 2
3 2
4 2
4 4
rid
2
rid
2
4
rid
1
2
4
A R1 R2 R3
sid
1
2
3
4
sid
1
4
A/R1 A/R2 A/R3
Which sid values have relationship with all vals in Rx?
Examples
sid rid
1 1
1 2
1 3
1 4
2 1
2 2
3 2
4 2
4 4
rid
2
rid
2
4
rid
1
2
4
A R1 R2 R3
sid
1
2
3
4
sid
1
4
sid
1
A/R1 A/R2 A/R3
Which sid values have relationship with all vals in Rx?
Is A/B a Fundamental Operation?
No. Shorthand like Joins
joins natively supported: ubiquitous, can be optimized
Hint: Find all xs not ‘disqualified’ by some y in B.
A(x, y), B(y)x value is disqualified if
1. by attaching y value from B (e.g., create <x, y>)
2. we obtain an <x,y> that is not in A.
sid rid
1 1
1 2
1 3
1 4
2 1
2 2
3 2
4 2
4 4
A
Disqualified = πx((πx(A) x B) – A)A/B = πx(A) - Disqualified
rid
2
4
B
sid rid
1 1
1 2
1 3
1 4
2 1
2 2
3 2
4 2
4 4
A
Disqualified = πx((πsid(A) x B) – A)A/B = πx(A) - Disqualified
rid
2
4
B
sid
1
2
3
4
πsid(A)
sid rid
1 1
1 2
1 3
1 4
2 1
2 2
3 2
4 2
4 4
A
Disqualified = πx((πsid(A) x B) – A)A/B = πx(A) - Disqualified
rid
2
4
B
sid rid
1 2
1 4
2 2
2 4
3 2
3 4
4 2
4 4
πsid(A) x B
sid
1
2
3
4
πsid(A)
sid rid
1 1
1 2
1 3
1 4
2 1
2 2
3 2
4 2
4 4
A
Disqualified = πx((πsid(A) x B) – A)A/B = πx(A) - Disqualified
rid
2
4
B
sid rid
1 2
1 4
2 2
2 4
3 2
3 4
4 2
4 4
sid rid
2 4
3 4
πsid(A) x B (πsid(A) x B) – A
sid
1
2
3
4
πsid(A)
sid rid
1 1
1 2
1 3
1 4
2 1
2 2
3 2
4 2
4 4
A
sid
1
4
A/B
Disqualified = πsid((πsid(A) x B) – A)A/B = πsid(A) - Disqualified
rid
2
4
B πsid(A) x B
sid rid
1 2
1 4
2 2
2 4
3 2
3 4
4 2
4 4
(πsid(A) x B) – A
sid rid
2 4
3 4
sid
1
2
3
4
πsid(A)
Different Plans, Same Results
Semantic equivalence:results are always the same
Note that it is independent of the database instance!
Names of students that reserved book 2
πname(σrid=2 (R1) ⨝ S1)
p(tmp1, σrid=2 (R1))p(tmp2, tmp1 ⨝ S1)πname(tmp2)
πname(σrid=2(R1 ⨝ S1))
EquivalentQueries
Names of students that reserved dbbooks
Need to join DB books with reserve and studentsπname(σtype=‘db’ (Book) ⨝ Reserve ⨝ Student)
Book(rid, type) Reserve(sid, rid) Student(sid, name)
Names of students that reserved dbbooks
Need to join DB books with reserve and studentsπname(σtype=‘db’ (Book) ⨝ Reserve ⨝ Student)
Book(rid, type) Reserve(sid, rid) Student(sid, name)
Names of students that reserved dbbooks
Need to join DB books with reserve and studentsπname(σtype=‘db’ (Book) ⨝ Reserve ⨝ Student
Book(rid, type) Reserve(sid, rid) Student(sid, name)
Names of students that reserved dbbooks
Need to join DB books with reserve and studentsπname(σtype=‘db’ (Book) ⨝ Reserve ⨝ Student)
Book(rid, type) Reserve(sid, rid) Student(sid, name)
Names of students that reserved dbbooks
Need to join DB books with reserve and studentsπname(σtype=‘db’ (Book) ⨝ Reserve ⨝ Student)
More efficient queryπname(πsid((πrid σtype=‘db’ (Book)) ⨝ Reserve) ⨝ Student)
Query optimizer can find the more efficient query!
Book(rid, type) Reserve(sid, rid) Student(sid, name)
Names of students that reserved dbbooks
Need to join DB books with reserve and studentsπname(σtype=‘db’ (Book) ⨝ Reserve ⨝ Student)
More efficient queryπname(πsid((πrid σtype=‘db’ (Book)) ⨝ Reserve) ⨝ Student)
Query optimizer can find the more efficient query!
Book(rid, type) Reserve(sid, rid) Student(sid, name)
Names of students that reserved dbbooks
Need to join DB books with reserve and studentsπname(σtype=‘db’ (Book) ⨝ Reserve ⨝ Student)
More efficient queryπname(πsid((πrid σtype=‘db’ (Book)) ⨝ Reserve) ⨝ Student)
Query optimizer can find the more efficient query!
Book(rid, type) Reserve(sid, rid) Student(sid, name)
Names of students that reserved dbbooks
Need to join DB books with reserve and studentsπname(σtype=‘db’ (Book) ⨝ Reserve ⨝ Student)
More efficient queryπname(πsid((πrid σtype=‘db’ (Book)) ⨝ Reserve) ⨝ Student)
Query optimizer can find the more efficient query!
Book(rid, type) Reserve(sid, rid) Student(sid, name)
Students that reserved DB or HCI book
1. Find all DB or HCI books
2. Find students that reserved one of those books
p(tmp, (σtype=‘DB’ v type=‘HCI’ (Book))
πname(tmp ⨝ Reserve ⨝ Student)
Alternatives
define tmp using UNION (how?)
what if we replaced v with ^ in the query?
Students that reserved a DB and HCI book
Does previous approach work?
p(tmp, (σtype=‘DB’ ^ type=‘HCI’ (Book))
πname(tmp ⨝ Reserve ⨝ Student)
NO
Why?
p(tmp, (σtype=‘DB’ ^ type=‘HCI’ (Book))πname(tmp ⨝ Reserve ⨝ Student)
for b in Book:
if b.type = ‘DB’ and b.type = ‘HCI’:for r in Reserve:
for s in Student:
if r.sid = s.sid and r.bid = b.bid:yield b.name
Students that reserved a DB and HCI book
Does previous approach work?
1. Find students that reserved DB books
2. Find students that reversed HCI books
3. Intersection
p(tmpDB, πsid(σtype=‘DB’ Book) ⨝ Reserve)
p(tmpHCI, πsid(σtype=‘HCI’ Book) ⨝ Reserve)
πname((tmpDB∩tmpHCI) ⨝ Student)
Students that reserved all books
Use division
Be careful with schemas of inputs to / !
p(tmp, (Reserves / Books))
tmp ⨝ Students
Books(rid, type) Reserves(sid, rid, date) Students(sid,)
Type error
Students that reserved all books
Use division
Be careful with schemas of inputs to / !
p(tmp, (Reserves / πrid (Books)))
tmp ⨝ Students
Books(rid, type) Reserves(sid, rid, date) Students(sid,)
Each student must have reserved all books on the same date!
Students that reserved all books
Use divisionBe careful with schemas of inputs to / !
p(tmp, (πsid,rid (Reserves)) / (πrid (Books)))tmp ⨝ Student
What if want students that reserved all horror books?
p(tmp, (πsid,rid Reserves) / (πrid(σtype=‘horror’ Book) ))
Books(rid, type) Reserves(sid, rid, date) Students(sid)
Let’s step back
Relational algebra is expressiveness benchmarkA language equal in expressiveness as relational algebra is relationally complete
But has limitationsnullsaggregationrecursionduplicates
Equi-Joins are a way of life
Matching of two sets based on shared attributes
Yelp: Join between your location and restaurants
Market: Join between consumers and suppliers
High five: Join between two hands on time and space
Comm.: Join between minds on ideas/concepts
Equi-Joins are a way of life
Matching of two things based on shared common attribute
What can we do with RA?
Query(DB instance) à Relation instance
GestureDB. Nandi et al.
What can we do with RA?
Query(DB instance) = Relation instance
Query by exampleHere’s data and examples of result, generate the query for me
Novel relationally complete interfaces
Summary
Relational Algebra (RA) operators
Operators are closed inputs & outputs are relations
Multiple Relational Algebra queries can be equivalent It is operationalSame semantics but different performance
Forms basis for optimizations
Next Time
Relational CalculusSQL