L4 Relational Algebra - GitHub Pages › files › lectures › lec4.pdfOverview Last time, learned...

L4Relational Algebra

Eugene Wu

Supplemental Materials

Helpful Referenceshttps://en.wikipedia.org/wiki/Relational_algebra

Overview

Last time, learned aboutpre-relational modelsan informal introduction to relational modelan introduction to the SQL query language.

Learn about formal relational query languagesRelational Algebra (algebra: perform operations)Relational Calculus (logic: are statements true?)

Keys to understanding SQL and query processing

Who Cares?

Clean query semantics & rich program analysis

Helps/enables optimization

Opens up rich set of topics

Materialized views

Data lineage/provenance

Query by example

Distributed query execution

…

What’s a Query Language?

Allows manipulation and retrieval of data from a database.

Traditionally: QL != programming languageDoesn’t need to be turing completeNot designed for computationSupports easy, efficient access to large databases

Recent YearsScaling to large datasets is a realityQuery languages are a powerful way to

think about data algorithms scalethink about asynchronous/parallel programming

What’s a Query Language?

Tutorial at VLDB 2015 Abadi et al.http://www.slideshare.net/abadid/sqlonhadoop-tutorial

Formal Relational Query Languages

Formal basis for real languages e.g., SQL

Relational Algebra

Operational, used to represent execution plans

Relational Calculus

Logical, describes what data users want

(not operational, fully declarative)

Prelims

Query is a function over relation instances

Q(R1,…,Rn) = Rresult

Schemas of input and output relations are fixed and well defined by the query Q.

Positional vs Named field notationPosition easier for formal defs

one-indexed (not 0-indexed!!!)Named more readableBoth used in SQL

Prelims

Relation (for this lecture)Instance is a set of tuplesSchema defines field names and types (domains)Students(sid int, name text, major text, gpa int)

How are relations different than generic sets (ℝ)?Can assume item structure due to schemaSome algebra operations (x) need to be modifiedWill use this later

Relational Algebra Overview

Core 5 operationsPROJECT (π)SELECT (σ)UNION (U)SET DIFFERENCE (-)

CROSSPRODUCT (x)

Additional operationsRENAME (p)INTERSECT (∩)

JOIN (⨝)DIVIDE (/)

Instances Used Today: Library

Students, Reservations

Use positional or named field notation

Fields in query results are inherited from input relations (unless specified)

sid name gpa age

1 eugene 4 202 barb 3 213 tanya 2 88

sid name gpa age

4 aziz 3.2 212 barb 3 213 tanya 2 885 rusty 3.5 21

sid rid day

1 101 10/10

2 102 11/11

R1

S1

S2

Project

π<attr1,…>(A) = Rresult

Pick out desired attributes (subset of columns)

Schema is subset of input schema in the projection listπ<a,b,c>(A) has output schema (a,b,c) w/ types carried over

Project

πname,age(S2) = name age

aziz 21barb 21tanya 88rusty 21

sid name gpa age


S2

Project

πname,name,age(S2) = name name age

aziz aziz 21barb barb 21tanya tanya 88rusty rusty 21

sid name gpa age


S2

Project

age

2188

sid name gpa age


S2

Where did all the rows go?Real systems typically don’t remove duplicates. Why?

πage(S2) =

Select

σ<p>(A) = Rresult

Select subset of rows that satisfy condition p

Won’t have duplicates in result. Why?

Result schema same as input

Select

sid name gpa age

1 eugene 4 202 barb 3 21

σage<30 (S1) =

name

eugenebarb

πname(σage<30 (S1)) =

sid name gpa age


S1

Commutatively & Associativity

A + B = B + AA * B = B * A

A + (B * C) = (B * C) + A

A + (B + C) = (A + B) + CA + (B * C) = (A + B) * C

Commutatively

πage(σage<30 (S1))

sid name gpa age


σage<30 ( )sid name gpa age


Commutatively

πage(σage<30 (S1))

sid name gpa age


σage<30

πage

sid name gpa age


( )age

2021

( ) =

Commutatively

σage<30(πage(S1))

sid name gpa age


( )age

202188

πage

Commutatively

σage<30(πage(S1))

sid name gpa age


( )age

2021

( ) age

202188

πage

σage<30 =

Commutatively

Does Project and Select commute?

πage(σage<30 (S1)) = σage<30(πage(S1))

What about

πname(σage<30 (S1))?

Commutatively



What about

πname(σage<30 (S1)) != σage<30(πname(S1))

Commutatively



What about

πname(σage<30 (S1)) != σage<30(πname, age(S1))

Commutatively



What about

πname(σage<30 (S1)) = πname(σage<30(πname, age(S1)))

OK!

Union, Set-Difference

A op B = Rresult

A, B must be union-compatibleSame number of fieldsField i in each schema have same type

Result Schema borrowed from first arg (A)A(big int, poppa int) U B(thug int, life int) = ?

Union, Set-Difference

A op B = Rresult

A, B must be union-compatibleSame number of fieldsField i in each schema have same type

Result Schema borrowed from first arg (A)A(big int, poppa int) U B(thug int, life int) = Rresult(big int, poppa int)

Union, Intersect, Set-Difference

S1 US2=

sid name gpa age

1 eugene 4 204 aziz 3.2 215 rusty 3.5 213 tanya 2 882 barb 3 21

sid name gpa age


sid name gpa age


S1 S2

Union, Intersect, Set-Difference

sid name gpa age

1 eugene 4 20S1- S2=

sid name gpa age


sid name gpa age


S1 S2

Note on Set Difference & Performance

Notice that most operators are monotonicincreasing size of inputs à outputs growif A⊇B à Q(A, T) ⊇ Q(B, T)can compute incrementally

Set Difference is not monotonicif A⊇B à T – A⊆T – Be.g., 5 > 1 à 9 – 5 < 9 – 1

Set difference is blocking:For T – S, must wait for all S tuples before any results

Cross-Product

A(a1,…,an) x B(an+1,…,am) = Rresult(a1,…,am)

Each row of A paired with each row of B

Result schema concats A and B’s fields, inherit if possibleConflict: students and reservations have sid field

Different than mathematical “X” by flattening results:

math A x B = { (a, b) | a∈A ^ b∈B }e.g., {1, 2} X {3, 4} = { (1, 3), (1, 4), (2, 3), (2, 4) }

what is {1, 2} x {3, 4} x {5, 6}?

Cross-Product

(sid) name gpa age (sid) rid day

1 eugene 4 20 1 101 10/10

2 barb 3 21 1 101 10/10

3 tanya 2 88 1 101 10/10

1 eugene 4 20 2 102 11/11

2 barb 3 21 2 102 11/11

3 tanya 2 88 2 102 11/11

sid name gpa age


S1

S1xR1=

sid rid day

1 101 10/10

2 102 11/11

R1

Rename

p(<new_name>(<mappings>), Q)

Explicitly defines/changes field names of schema

p(C(1 à sid1, 5 à sid2), S1 x R1)

sid1 name gpa age sid2 rid day

1 eugene 4 20 1 101 10/10

2 barb 3 21 1 101 10/10

3 tanya 2 88 1 101 10/10

1 eugene 4 20 2 102 11/11

2 barb 3 21 2 102 11/11

3 tanya 2 88 2 102 11/11

C =

π( ) =Project

σ( ) =Select

x =Cross product

- =Difference

U =Union

∩ =Intersect

Compound/Convenience Operators

INTERSECT (∩)

JOIN (⨝)DIVIDE (/)

Intersect

A ∩ B = Rresult

A, B must be union-compatible

Intersect

sid name gpa age

2 barb 3 213 tanya 2 88

S1∩S2=

sid name gpa age


sid name gpa age


S1 S2

Intersect

A ∩ B = Rresult

A, B must be union-compatible

Can we express using core operators?A∩B = ?

Intersect

A ∩ B = Rresult

Can we express using core operators?A∩B = A - ? (think venn diagram)

A B

Intersect

A ∩ B = Rresult

Can we express using core operators?A∩B = A – (A – B)

A B

Joins (high level)

Reserves BooksStudents

relation relation relation

What if you want to query across all three tables?e.g., all names of students that reserved “The Purple Crayon”

Need to combine these tables Cross product? But that ignores foreign key references

day

Joins (high level)



sid name gpa age


S1rid name

101 The PurpleCrayon

102 1984

B1

day

sid rid day

1 101 10/10

2 102 11/11

R1

Joins (high level)



sid name gpa age


S1sid rid day

1 101 10/10

2 102 11/11

R1rid name


102 1984

B1

day

Joins (high level)

Reserves

day

BooksStudents


sid name gpa age


S1 RB1sid (rid) day

1 101 10/10

2 102 11/11

(rid) name


102 1984

Joins (high level)



sid name gpa age


S1 RB1sid (rid) day

1 101 10/10

2 102 11/11

(rid) name


102 1984

day

Joins (high level)



(sid) (name) gpa age


SRB1

day

(sid) (rid) day

1 101 10/10

2 102 11/11

(rid) (name)


102 1984

theta (θ) Join

A ⨝c B = σc(A x B)

Most general form

Result schema same as cross product

Often far more efficient to compute than cross product

Commutative

(A⨝cB) ⨝cC = A⨝c(B ⨝cC)

theta (θ) Join

sid name gpa age


S1sid rid day

1 101 10/10

2 102 11/11

R1


1 eugene 4 20 1 101 10/10

2 barb 3 21 1 101 10/10

3 tanya 2 88 1 101 10/10

1 eugene 4 20 2 102 11/11

2 barb 3 21 2 102 11/11

3 tanya 2 88 2 102 11/11

S1⨝S1.sid ≤ R1.sid R1 = σ S1.sid ≤ R1.sid(S1xR1)=

theta (θ) Join


1 eugene 4 20 1 101 10/10

1 eugene 4 20 2 102 11/11

2 barb 3 21 2 102 11/11

sid name gpa age


S1sid rid day

1 101 10/10

2 102 11/11

R1

S1⨝S1.sid ≤ R1.sid R1 = σ S1.sid ≤ R1.sid(S1xR1)=

Equi-Join

A ⨝attr B = πall attrs except B.attr(A ⨝A.attr = B.attr B)

Special case where the condition is attribute equality

Result schema only keeps one copy of equality fields

Natural Join (A⨝B):

Equijoin on all shared fields (fields w/ same name)

Equi-Join

S1⨝sidR1 = sid name gpa age rid day

1 eugene 4 20 101 10/10

2 barb 3 21 102 11/11

sid name gpa age


S1sid rid day

1 101 10/10

2 102 11/11

R1

Division

Let us have relations A(x, y), B(y)

A/B = { <x> | ∀y ∈B <x,y> ∈A }

Find all students that have reserved all booksA/B = all x (students) s.t. for every y (reservation), <x,y> ∈A

Good to ponder, not supported in most systems (why?)

Generalizationy can be a list of fields in Bx U y is fields in A

Examples

sid rid

1 1

1 2

1 3

1 4

2 1

2 2

3 2

4 2

4 4

rid

2

rid

2

4

rid

1

2

4

A R1 R2 R3

A/R1 A/R2 A/R3

Which sid values have relationship with all vals in Rx?

Examples

sid rid

1 1

1 2

1 3

1 4

2 1

2 2

3 2

4 2

4 4

rid

2

rid

2

4

rid

1

2

4

A R1 R2 R3

sid

1

2

3

4

A/R1 A/R2 A/R3


Examples

sid rid

1 1

1 2

1 3

1 4

2 1

2 2

3 2

4 2

4 4

rid

2

rid

2

4

rid

1

2

4

A R1 R2 R3

sid

1

2

3

4

sid

1

4

A/R1 A/R2 A/R3


Examples

sid rid

1 1

1 2

1 3

1 4

2 1

2 2

3 2

4 2

4 4

rid

2

rid

2

4

rid

1

2

4

A R1 R2 R3

sid

1

2

3

4

sid

1

4

sid

1

A/R1 A/R2 A/R3


Is A/B a Fundamental Operation?

No. Shorthand like Joins

joins natively supported: ubiquitous, can be optimized

Hint: Find all xs not ‘disqualified’ by some y in B.

A(x, y), B(y)x value is disqualified if

1. by attaching y value from B (e.g., create <x, y>)

2. we obtain an <x,y> that is not in A.

sid rid

1 1

1 2

1 3

1 4

2 1

2 2

3 2

4 2

4 4

A

Disqualified = πx((πx(A) x B) – A)A/B = πx(A) - Disqualified

rid

2

4

B

sid rid

1 1

1 2

1 3

1 4

2 1

2 2

3 2

4 2

4 4

A

Disqualified = πx((πsid(A) x B) – A)A/B = πx(A) - Disqualified

rid

2

4

B

sid

1

2

3

4

πsid(A)

sid rid

1 1

1 2

1 3

1 4

2 1

2 2

3 2

4 2

4 4

A


rid

2

4

B

sid rid

1 2

1 4

2 2

2 4

3 2

3 4

4 2

4 4

πsid(A) x B

sid

1

2

3

4

πsid(A)

sid rid

1 1

1 2

1 3

1 4

2 1

2 2

3 2

4 2

4 4

A


rid

2

4

B

sid rid

1 2

1 4

2 2

2 4

3 2

3 4

4 2

4 4

sid rid

2 4

3 4

πsid(A) x B (πsid(A) x B) – A

sid

1

2

3

4

πsid(A)

sid rid

1 1

1 2

1 3

1 4

2 1

2 2

3 2

4 2

4 4

A

sid

1

4

A/B

Disqualified = πsid((πsid(A) x B) – A)A/B = πsid(A) - Disqualified

rid

2

4

B πsid(A) x B

sid rid

1 2

1 4

2 2

2 4

3 2

3 4

4 2

4 4

(πsid(A) x B) – A

sid rid

2 4

3 4

sid

1

2

3

4

πsid(A)

Different Plans, Same Results

Semantic equivalence:results are always the same

Note that it is independent of the database instance!

Names of students that reserved book 2

πname(σrid=2 (R1) ⨝ S1)

p(tmp1, σrid=2 (R1))p(tmp2, tmp1 ⨝ S1)πname(tmp2)

πname(σrid=2(R1 ⨝ S1))

EquivalentQueries

Names of students that reserved dbbooks

Need to join DB books with reserve and studentsπname(σtype=‘db’ (Book) ⨝ Reserve ⨝ Student)

Book(rid, type) Reserve(sid, rid) Student(sid, name)


Need to join DB books with reserve and studentsπname(σtype=‘db’ (Book) ⨝ Reserve ⨝ Student




More efficient queryπname(πsid((πrid σtype=‘db’ (Book)) ⨝ Reserve) ⨝ Student)

Query optimizer can find the more efficient query!


Students that reserved DB or HCI book

1. Find all DB or HCI books

2. Find students that reserved one of those books

p(tmp, (σtype=‘DB’ v type=‘HCI’ (Book))

πname(tmp ⨝ Reserve ⨝ Student)

Alternatives

define tmp using UNION (how?)

what if we replaced v with ^ in the query?

Students that reserved a DB and HCI book

Does previous approach work?

p(tmp, (σtype=‘DB’ ^ type=‘HCI’ (Book))

πname(tmp ⨝ Reserve ⨝ Student)

NO

Why?

p(tmp, (σtype=‘DB’ ^ type=‘HCI’ (Book))πname(tmp ⨝ Reserve ⨝ Student)

for b in Book:

if b.type = ‘DB’ and b.type = ‘HCI’:for r in Reserve:

for s in Student:

if r.sid = s.sid and r.bid = b.bid:yield b.name

Students that reserved a DB and HCI book

Does previous approach work?

1. Find students that reserved DB books

2. Find students that reversed HCI books

3. Intersection

p(tmpDB, πsid(σtype=‘DB’ Book) ⨝ Reserve)

p(tmpHCI, πsid(σtype=‘HCI’ Book) ⨝ Reserve)

πname((tmpDB∩tmpHCI) ⨝ Student)

Students that reserved all books

Use division

Be careful with schemas of inputs to / !

p(tmp, (Reserves / Books))

tmp ⨝ Students

Books(rid, type) Reserves(sid, rid, date) Students(sid,)

Type error


Use division

Be careful with schemas of inputs to / !

p(tmp, (Reserves / πrid (Books)))

tmp ⨝ Students

Books(rid, type) Reserves(sid, rid, date) Students(sid,)

Each student must have reserved all books on the same date!


Use divisionBe careful with schemas of inputs to / !

p(tmp, (πsid,rid (Reserves)) / (πrid (Books)))tmp ⨝ Student

What if want students that reserved all horror books?

p(tmp, (πsid,rid Reserves) / (πrid(σtype=‘horror’ Book) ))

Books(rid, type) Reserves(sid, rid, date) Students(sid)

Let’s step back

Relational algebra is expressiveness benchmarkA language equal in expressiveness as relational algebra is relationally complete

But has limitationsnullsaggregationrecursionduplicates

Equi-Joins are a way of life

Matching of two sets based on shared attributes

Yelp: Join between your location and restaurants

Market: Join between consumers and suppliers

High five: Join between two hands on time and space

Comm.: Join between minds on ideas/concepts

Equi-Joins are a way of life

Matching of two things based on shared common attribute

What can we do with RA?

Query(DB instance) à Relation instance

GestureDB. Nandi et al.

What can we do with RA?

Query(DB instance) = Relation instance

Query by exampleHere’s data and examples of result, generate the query for me

Novel relationally complete interfaces

Summary

Relational Algebra (RA) operators

Operators are closed inputs & outputs are relations

Multiple Relational Algebra queries can be equivalent It is operationalSame semantics but different performance

Forms basis for optimizations

Next Time

Relational CalculusSQL

L4 Relational Algebra - GitHub Pages › files › lectures › lec4.pdfOverview Last time, learned...

Documents

Transcript of L4 Relational Algebra - GitHub Pages › files › lectures › lec4.pdfOverview Last time, learned...