CSE 232A Graduate Database Systemscseweb.ucsd.edu/classes/fa19/cse232-a/slides/Topic0... ·...

Post on 15-Mar-2020

2 views 0 download

Transcript of CSE 232A Graduate Database Systemscseweb.ucsd.edu/classes/fa19/cse232-a/slides/Topic0... ·...

CSE 232A Graduate Database Systems

1

Fall 2019

Arun Kumar

2009: Bachelors in CSE from IIT Madras

2009—16: MS and PhD in CS from UW-Madison

PhD thesis area: Data systems for ML workloads

2016-: Asst. Prof. at UC San Diego CSE 2019-: + Asst. Prof. at UC San Diego HDSI

Summers: 110F!

Winters: —40F!

Ahh! :)

About Myself

2

3

What is this course about? Why take it?

1. IBM’s Watson beats humans in Jeapordy!

5

How did Watson achieve that?

6

Watson devoured LOTS of data!

7

2. “Structured” data with Google search results

8

How does Google know that?

9

Google also devours LOTS of data!

3. Amazon’s “spot-on” recommendations

10

11

How does Amazon know that?

12

You guessed it! LOTS and LOTS of data!

Analysis

13

And innumerable “traditional” applications

14

15

Large-scale data management systems are the cornerstone of many digital applications, both modern and traditional

16

The Age of “Big Data”/“Data Science”

17

Data data everywhere, All the wallets did shrink! Data data everywhere, Nor any moment to think?

18

CSE 232A will get you thinking about the fundamentals of database systems

1. How are large structured datasets stored and organized?

2. How are “queries” handled? 3. How to make the system faster? 4. Deeper and more recent issues

19

What this course is NOT about

❖ NOT a course on basics of relational algebra or SQL ❖ Take CSE 132A instead (pre-requisite for 232A!)

❖ NOT a course on how to use an RDBMS for database-backed applications (triggers, physical tuning, etc.) ❖ Take CSE 132B instead

❖ NOT a course on distributed systems or transactions ❖ Take CSE 223B instead

http://cseweb.ucsd.edu/classes/fa19/cse232-a

20

And now for the (boring) logistics …

21

Prerequisites

❖ CSE 132A (or equivalent) is essential ❖ CSE 120 is also helpful ❖ For all other cases, email the instructor with

proper justification. A waiver can be considered.

http://cseweb.ucsd.edu/classes/fa19/cse232-a

22

Course Administrivia

❖ Lectures: MWF 3:00-3:50pm, Ledden Auditorium Attending ALL lectures is mandatory! ❖ Instructor: Arun Kumar; arunkk@eng.ucsd.edu Office hours: Wed 4-5pm, 3218 CSE (EBU3b) ❖ TAs: Nikos Koulouris, Kaiqi Yao, Aman Achpal, and

Allen Ordookhanians ❖ Piazza: TBD

http://cseweb.ucsd.edu/classes/fa19/cse232-a

23

Grading

❖ Midterm Exam 1: 20% Date: Friday, October 25; in-class ❖ Midterm Exam 2: 20% Date: Monday, November 25; in-class ❖ Final Exam: 60% (cumulative)

Date: Friday, December 13, 3-6pm; Room TBD

❖ (Optional/Extra Credit) 6 Paper Reviews: 3%

http://cseweb.ucsd.edu/classes/fa19/cse232-a

24

Course Outline

1. How are large structured datasets stored? Storage and file layout

2. How are “queries” handled? Indexing, sorting, relational operator implementations, and query processing

3. How to make the system faster? Query optimization and parallelism

4. Recent issues and trends Data systems for ML, Data integration and cleaning, semi-structured data, ML for RDBMSs

http://cseweb.ucsd.edu/classes/fa19/cse232-a

25

The primary focus will be the relational data model and Relational Database Management Systems (RDBMS)

26

Relational model in a nutshell

Basically, Relation:Table :: Pilot:Driver (okay, a bit more)

The model formalizes “operations” to manipulate relations

RatingID Rating Date UserID MovieID1 3.5 08/27/15 23294 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …

27

Relational model in a nutshell

Invented by E. F. Codd in 1970s at IBM Research in CA

28

Relational DBMS in a nutshell

A software system to implement the relational model, i.e., enable users to manage data stored as relations

29

Relational DBMS in a nutshell

First RDBMSs: System R (IBM) and Ingres (Berkeley) in 1970s

A rare photo of the original System R manual

Mike Stonebraker won the Turing Award in 2015!

30

Relational DBMS in a nutshell

RDBMS software is now a USD 40+ billions/year industry! Numerous open source RDBMSs also popular

People still start companies about what are basically RDBMSs!

31

Course Textbook

Prescribed: “Database Management Systems” 3rd Edition Raghu Ramakrishnan and Johannes Gehrke

Optional: “Database Systems: The Complete Book” by Garcia-Molina, Widom, and Ullman “Big Data Integration” by Dong and Srivastava

Aka The “Cow Book” Which cow are you?

32

Tentative Course Schedule

Week Topic

0-1 Introduction; Data Storage (Disks; Files; New Hardware)

2-3 Indexing (B+ Tree; Hash Indexes; Learned Indexes); Sorting

3-4 Relational Operator Implementations; Query Processing

4 Midterm Exam 1 on Friday, 10/25

5-6 Query Optimization; Materialized Views

6-7 Parallel RDBMSs; Dataflow Systems; Cloud RDBMSs

7-8 Data Systems for ML Workloads

9 Midterm Exam 2 on Monday, 11/25

9-10 Data Integration and Cleaning; Semi-Structured Data

10 More ML for RDBMSs; Recap

11 Final Exam on Friday, 12/13

33

General Dos and Do NOTs

Do: ❖ Raise your hand before asking questions during lectures ❖ Discuss papers in the reading list with peers and others ❖ Participate in class discussions; and on Piazza, if you like ❖ Use “CSE232A:” as subject prefix for all emails to me/TAs Do NOT: ❖ Plagiarize or share content for your paper reviews ❖ Harass, cut off, or be disrespectful to others in class ❖ Use email as primary communication mechanism for

doubts/questions instead of Office Hours ❖ Record or quote the instructor’s anecdotes out of class! ☺

34

Questions?

35

Example Relational DB: Netflix!

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 20… … … … …

UserID Name Age JoinDate79 Alice 23 01/10/13… … …

MovieID Name ReleaseDate Director20 Inception 07/13/2010 Christopher Nolan… … …

Ratings

Users

Movies

36

Recap: Relational Model

37

Relational Model: Basic Terms

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …

What is a Relation?

A glorified table!

What are Attributes?

These things

What are Domains?

The mathematical “domains” for the attributes

Integers Real …

What is Arity?Ratings

Number of attributes

38

Relational Model: Basic Terms

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …

What are Tuples?What is Cardinality?

These thingsNumber of tuples

Ratings

39

Referring to “tuples”: Two notations

1. Without using attribute names (positional/sequence) 2. Using attribute names (named/set)

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …

 

Ratings (R)

t[1] = 3.5t.NumStars = 3.5

40

Relational Model: Basic Terms

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …

What is Schema?

The relation name, and the name and logical descriptions of the attributes (including domains)

Aka “metadata”

Ratings

41

Relational Model: Basic Terms

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 4232 2933 2.5 08/02/15 54551 846… … … … …

What is an Instance?

A given relation populated with a set of tuples (loose analogy: schema:instance::type:value in PL)

Instance 1

RatingID NumStars Timestamp UserID MovieID3292 1.5 06/27/14 794 10

294122 4.0 07/10/14 232 32974423 0.5 03/08/14 8451 846

… … … … …

Instance 2

Ratings

42

Relational Model: Basic Terms

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 20… … … … …

What is a Relational Database?

UserID Name Age JoinDate79 Alice 23 01/10/13… … …

MovieID Name ReleaseDate Director20 Inception 07/13/2010 Christopher Nolan… … …

A collection of relations; similarly, schema vs. instance

Ratings

Users

Movies

43

“Write Operations” on a Relation

❖ Insert Add tuples to a relation ❖ Delete Remove tuples from a relation (typically based on

“predicate” matches, e.g., “NumStars <= 4.5” ❖ Modify Logically, deletes + inserts, but typically implemented

as in-place updates to a relation instance

44

“Read Operations” on a Relation

❖ “Select” Select all tuples from Ratings with “UserID == 19” ❖ “Project” Select only Director attribute from Movies ❖ “Aggregate” Select Average of all NumStars in Ratings

And a few more formal “algebraic” operations …

45

Recap: Relational Algebra

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

46

Select

47

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16

Ratings (R)

Example: Get all ratings with 4.0 or more stars

“Selection condition/predicate”Select

“Operator”

Select

48

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16

R

RatingID NumStars Timestamp UserID MovieID2 4.0 07/20/15 80 204 4.5 03/05/14 80 16

R’

Schema preservedSubset of tuples

(satisfying selection condition)

Complex Selection Conditions

49

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16

R

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 20

R’

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

50

Project

51

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16

Ratings (R)

Example: Get all MovieIDs

“Projection list”

Project

52

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16

R

MovieID2016

R’ Schema reduced (to projection list)

Tuple values “deduplicated” (slightly different semantics in SQL)

Composition of Relational Ops

53

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16

R

Example: Get UserID and NumStars of ratings with less than 3 stars

UserID NumStars79 2.5

R’RelOp(Relation) gives a Relation

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

54

Rename

55

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16

Ratings (R)

Example: Rename Timestamp to RateDate

⇢C(2�>RateDate)(R)

⇢RatingID,NumStars,RateDate,UserID,MovieID(R)

Rename

56

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16

R

R’

Instance preserved Schema modified

RatingID NumStars RateDate UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16

⇢C(2�>RateDate)(R)

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

57

Cross Product

58

UserID Name Age JoinDate79 Alice 23 01/10/1380 Bob 41 05/10/13

MovieID Name ReleaseYear Director20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron

Users (U)

Movies (M)

❖ Cartesian product (construct all pairs of tuples across tables) ❖ Schema of output “concatenates” the input schemas ❖ Be careful with attribute name conflicts! Use Rename op

Cross Product

59

UserID Name Age JoinDate79 Alice 23 01/10/1380 Bob 41 05/10/13

MovieID Name ReleaseYear Director20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron

Users (U)

Movies (M)

UserID U.Name Age JoinDate MovieID M.Name ReleaseYear

Director

79 Alice 23 01/10/13 20 Inception 2010 Christopher Nolan

79 Alice 23 01/10/13 16 Avatar 2009 Jim Cameron80 Bob 41 05/10/13 20 Inception 2010 Christopher

Nolan80 Bob 41 05/10/13 16 Avatar 2009 Jim Cameron

⇢C(1�>U .Name(U)⇥ ⇢C(1�>M .Name(M)

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

60

Union

61

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 16

R1

RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20

R2Union of sets of tuples (instances)

Inputs must have identical schema: “Union-compatible”

Union

62

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20

 

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

63

Set Difference

64

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 16

R1

RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20

R2

Set difference of sets of tuples (instances) Inputs must be “Union-compatible”

Set Difference

65

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 16

R1

RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20

R2

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 20

R’ = R1 – R2

Basic Relational Operations

Select Project Rename Cross Product (aka Cartesian Product) Set Operations: Union Set Difference

66

Derived and Other Relational Ops

Set Operation: Intersection Join Group By Aggregate

67

Intersection

68

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 16

R1

RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20

R2

Set intersection of sets of tuples (instances) Inputs must be “Union-compatible”

Intersection

69

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 16

R1

RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20

R2

RatingID NumStars Timestamp UserID MovieID3 2.5 08/02/14 79 16

 

Derived and Other Relational Ops

Set Operation: Intersection Join Group By Aggregate

70

Join

71

❖ Equivalent Select on Cross Product, but “bypasses” full X

❖ Perhaps the most intensively studied Rel Op! ❖ Several “types” of Joins:

Natural Join and Equi-Join Condition Join (aka Theta Join) Semi-Join, Inner Join, Outer Join, Anti-Join, etc.

R ./JoinCondition M

�JoinCondition(R⇥M)

Natural Join

72

MovieID Name ReleaseYear Director20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron

Movies (M)

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16

Ratings (R)

R ./ M

Natural Join

73

RatingID

NumStars

Timestamp

UserID

MovieID

Name ReleaseYear

Director

1 3.5 08/27/15 79 20 Inception 2010 Christopher Nolan2 4.0 07/20/15 80 20 Inception 2010 Christopher Nolan3 2.5 08/02/14 79 16 Avatar 2009 Jim Cameron4 4.5 03/05/14 80 16 Avatar 2009 Jim Cameron

❖ “Join attributes”: attributes that determine matching tuples Have same name in both inputs! “MovieID” in R and M

❖ Implicit equality condition on join attributes If > 1 pair, implicit logical “and” of all equality terms ❖ Output schema concatenates input schemas But join attributes appear only once in output (Project)

R ./ M

Equi-Join

74

❖ Generalization of the Natural Join

❖ Attribute names of “join attributes” need not be the same ❖ EqualityCondition is a general boolean expression (logical

“and”, and/or “or”) of terms with equality predicates only ❖ Join attributes from both R and M in output (no Project)

Perhaps the most important and common type of Join! Lots of R&D on efficient implementations!

R ./EqualityCondition M

Equi-Join: Example

75

J K10 x20 y30 x

R1(J,K)P Qx 4y 9x 8

R2(P,Q) T(J,K,P,Q)J K P Q10 x x 420 y y 930 x x 810 x x 830 x x 4

T(J,K, P,Q) = R1 ./K=P R2

(Primary) Key-Foreign Key Join

76

❖ A special kind of equi-join ❖ One of the join attributes is the (Primary) Key of an input

relation; the other is a Foreign Key in the other relation

RatingID NumStars Timestamp UID MovieID

UserID Name Age JoinDate

Ratings

Users

Also a common and important (sub) type of join with even more specialized efficient implementations!

Ratings ./UID=UserID Users

Condition Join (aka “Theta” Join)

77

❖ Generalization of the Equi-Join

 

R ./JoinCondition M

Condition Join: Example

78

J K10 x20 y30 x

R1(J,K)P Qx 4y 9x 12

R2(P,Q) T(J,K,P,Q)J K P Q10 x x 420 y x 420 y y 930 x x 430 x y 930 x x 12

Perhaps the most difficult type of Join to implement efficiently!

T(J,K, P,Q) = R1 ./J/2>Q R2

Join Expressions

79

Can compose many joins into a single complex expression

RatingID NumStars Timestamp UserID MovieID

UserID UName Age JoinDate

MovieID Name ReleaseYear Director

Ratings

UsersMovies

 

Q. What do we get as the output?

UserID

UName

Age

JoinDate

RatingID

NumStars

Timestamp

MovieID

Name ReleaseYEar

Director

AllStuff

80

Taxonomy of Joins

All kinds of joins

Inner joins

Outer joins Semi joins Anti joins

Theta joinsEqui joins

Natural joins

Key-Foreign Key Joins“Snowflake” joins

“Star” joins

Derived and Other Relational Ops

Set Operation: Intersection Join Group By Aggregate

81

Group By Aggregate

82

❖ NOT a part of relational algebra, but “Extended RA”! ❖ Useful for “analytics” queries that aggregate numerical data

RatingID NumStars Timestamp UserID MovieIDRatings

What is the average rating for each movie? How many movies has each user rated?

❖ Standard 5 numerical aggregations supported in SQL:

Count, Sum, Average, Maximum, and Minimum Extra: Median, Mode, Variance, Standard Deviation, etc.

Group By Aggregate

83

“Grouping Attributes” (Subset of R’s attributes)

A numerical attribute in R“Aggregate Function”

(SUM, COUNT, AVG, MAX, MIN)

❖ Output schema will have X and an extra numerical attribute (result of the aggregate function)

❖ Can list multiple aggregate functions in the same operation

�X,Agg(Y )(R)

Group By Aggregate

84

RatingID NumStars Timestamp UserID MovieID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 16

R

MovieID AVG(NumStars)20 3.7516 3.5

R’

What is the average rating for each movie?

�MovieID,AV G(NumStars)(R)

Group By Aggregate

85

AVG(Age)25.75

U’

What is the average age of the users?

UserID Name Age JoinDate79 Alice 23 01/10/1380 Bob 41 05/10/13

123 Carol 19 08/09/14420 Dan 20 03/01/15

Users (U)

The set of Grouping Attributes can be empty too!

�AVG(Age)(U)

Derived and Other Relational Ops

Set Operation: Intersection Join Group By Aggregate

86

87

Recap: SQL

Basic Form of an SQL Query

88

SELECT [DISTINCT] target-list FROM relation-list [WHERE condition]

List of attributes to projectOptional

List of relations (possibly with “aliases”)

Selection/join condition (optional)

What does it mean logically?

89

SELECT [DISTINCT] target-list FROM relation-list [WHERE condition]

1. Cross-product of relations in relation-list 2. If condition given, apply it to filter out tuples 3. Remove attributes not present in target-list 4. If DISTINCT given, deduplicate tuples in result

The above is only a logical interpretation. It is NOT a “plan” an RDBMS would use in general to run an SQL query!

90

Example SQL Query

SELECT M.Name FROM Movies M WHERE M.Year = 2013

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

Example: Get the names of movies released in 2013

⇡Name(�Y ear=2013(M))

91

Example SQL Query

SELECT M.Name FROM Movies M WHERE M.Year = 2013

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

NameGravity

Blue Jasmine

92

Example SQL Query

SELECT * FROM Movies M WHERE M.Year = 2013

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

MovieID Name Year Director53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

93

Example SQL Query

SELECT M.Name FROM Movies M WHERE M.Year <> 2013

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

Example: Get the names of movies from years other than 2013

⇡Name(�Y ear 6=2013(M))

94

Example SQL Query

SELECT M.Year FROM Movies M

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

Example: For which years do we have movie data?⇡Y ear(M)

95

Example SQL Query

SELECT M.Year FROM Movies M

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

Year2010200920132013

SQL allows repetitions of tuples in a relation! Not the same semantics as RA’s Project

Called “bag semantics” vs. RA’s set semantics

96

DISTINCT in SQL

SELECT DISTINCT M.Year FROM Movies M

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

Year201020092013

DISTINCT needed to achieve set semantics of RA’s Project in SQL

97

Aliases in SQL

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

SELECT M.Name FROM Movies M WHERE M.Year = 2013

Why bother with the alias? Not needed here!

SELECT Name FROM Movies WHERE Year = 2013

98

Aliases in SQL – Useful for Joins!

Movies (M)MovieID Name Year DirectorID

SELECT M.Name FROM Movies M, Directors D WHERE D.Name = “Jim Cameron” AND M.DirectorID = D.DID

Aliases help disambiguate attributes with the same name from multiple relations (or even a self-join!)

Example: Get names of movies directed by “Jim Cameron”

Directors (D)DID Name Age

99

More SQL Examples

Example: Get names of movies released in 2013 by Woody Allen or some other director 50 years or older

SELECT M.Name FROM Movies M, Directors D WHERE (D.Name = “Woody Allen” OR D.Age >= 50) AND M.Year = 2013 AND M.DirectorID = D.DID

Movies (M)MovieID Name Year DirectorID

Directors (D)DID Name Age

100

LIKE in SQL

SELECT DISTINCT M.Director FROM Movies M WHERE M.Name LIKE “Blue%”

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

Example: Get the directors of movies that start with “Blue”

“%” matches any number of characters; “_” matches one

101

ORDER BY in SQL

SELECT M.Name FROM Movies M WHERE M.Year = 2013

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

NameGravity

Blue Jasmine

ORDER BY M.NameName

Blue JasmineGravity

Useful for data readability Ordering defined by domain semantics Can specify DESC; multiple attributes

102

LIMIT in SQL

SELECT M.Name FROM Movies M WHERE M.Year >= 2010

MovieID Name Year Director20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron74 Blue Jasmine 2013 Woody Allen

ORDER BY M.Year

YearInceptionGravity

Blue Jasmine

Also useful for data readability Prevents “flooding” of screen with data Be wary of using it without ORDER BY!

LIMIT 2

103

Aggregate Functions in SQL

SELECT COUNT(*) FROM Movies M WHERE M.Year > 2010

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron91 Interstellar 2014 Christopher Nolan

How many movies came out after 2010?

104

Aggregate Functions in SQL

SELECT COUNT(*) FROM Movies M WHERE M.Year > 2010

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron91 Interstellar 2014 Christopher Nolan

How many movies came out after 2010?

COUNT(*)2

105

5 Native Aggregate Functions in SQL

❖ COUNT ([DISTINCT] attribute) ❖ AVG ([DISTINCT] attribute) ❖ SUM ([DISTINCT] attribute) ❖ MAX (attribute) ❖ MIN (attribute)

106

Aggregate Functions in SQL

SELECT COUNT(DISTINCT M.Director) FROM Movies M

Movies (M)MovieID Name Year Director

20 Inception 2010 Christopher Nolan16 Avatar 2009 Jim Cameron53 Gravity 2013 Alfonso Cuaron91 Interstellar 2014 Christopher Nolan

How many directors do we have?

Aggregate Functions in SQL

107

RatingID Stars UserID MovieID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42

Ratings (R)

SELECT R.MovieID, MAX(R.Stars) FROM Ratings R

Which MovieID(s) have the highest rating?

Other attributes NOT allowed in the target-list as such!

Aggregate Functions in SQL

108

RatingID Stars UserID MovieID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42

Ratings (R)

SELECT DISTINCT R.MovieID FROM Ratings R WHERE R.Stars = (SELECT MAX(R2.Stars) FROM Ratings R2)

Which MovieID(s) have the highest rating?

Group By Aggregate in SQL

109

SELECT [DISTINCT] target-list FROM relation-list [WHERE condition] GROUP BY grouping-list HAVING group-condition

X

Condition on each group in aggregate

target-list must be in this form: X’, Agg(Y)

Subset of X

�X,Agg(Y )(R)

Group By Aggregate in SQL

110

RatingID Stars UserID MovieID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42

Ratings (R)

What is the average rating for each movie?

SELECT R.MovieID, AVG(R.Stars) AS AvgRating FROM Ratings R GROUP BY R.MovieID

�MovieID,AVG(Stars)(R)

Group By Aggregate in SQL

111

RatingID Stars UserID MovieID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42

Ratings (R)

SELECT R.MovieID, AVG(R.Stars) AS AvgRating FROM Ratings R GROUP BY R.MovieID

MovieID AvgRating20 4.042 4.053 2.5

One tuple in output per unique value of R.MovieID (aka “group”)

112

Q1. Which of the following is not a basic operation in relational algebra?

ABCD

�[⇢./

Review Questions

113

Q2. What type of join is the following NOT an example of?

RID NumStars RateDate UID MIDUID UName Age JoinDateMID MName Year Director

RUM

ABCD

R ./ U

Inner join

Outer join

Natural join

Theta join

Review Questions

114

Q3. How many movies did “Jim Cameron” direct?

�COUNT(⇤)(�Director=“Jim Cameron”(M))

�ReleaseYear ,COUNT(⇤)(�Director=“Jim Cameron”(M))

�COUNT(⇤)(�Director=“Jim Cameron”(M ./ R))

�Director=“Jim Cameron”(�COUNT(⇤)(M))A

B

C

D

RID NumStars RateDate UID MIDUID UName Age JoinDateMID MName Year Director

RUM

Review Questions

115

Q4. Which of the following attributes is a primary key?

RID NumStars RateDate UID MIDUID UName Age JoinDateMID MName Year Director

RUM

ABCD

U.UID

R.UID

R.MID

M.Year

Review Questions

116

Q5. Which of the following SQL features do not have a counterpart operation in extended relational algebra?

ABCD

SELECT DISTINCT

WHERE

LIMIT

GROUP BY

Review Questions

117

Q6. What is the cardinality of this query’s output?

AB

CD

1

2

3

4

RID NumStars RateDate UID MID1 3.5 08/27/15 79 202 4.0 07/20/15 80 203 2.5 08/02/14 79 164 4.5 03/05/14 80 165 5.0 06/09/13 135 20

SELECT COUNT(DISTINCT MID) FROM R

R

Review Questions

118

Q7. Get the year and director of all movies from the last decade that have the term “Avengers” in their title.

ASELECT Year, Director from M WHERE Year >= 2008 AND MName LIKE “%Avengers%”

MID MName Year DirectorM

BSELECT Year, Director from M WHERE Year >= 2008 AND MName LIKE “%Avengers”

CSELECT Year, Director from M WHERE Year >= 2008 AND MName LIKE “Avengers%”

DSELECT Year, Director from M WHERE Year >= 2008 AND MName = “Avengers”

Review Questions

119

RatingID Stars UserID MovieIDRatings (R)

Q8. Write an SQL query to get the number of 5-star ratings for each movie directed by “Christopher Nolan”

SELECT R.MovieID, COUNT(R.Stars) AS NumHighRatings FROM Ratings R, Movies M WHERE M.Director = “Christopher Nolan” AND R.MovieID = M.MovieID

AND R.Stars = 5 GROUP BY R.MovieID

Movies (M) MovieID Name Year Director

Review Questions

120

Advanced SQL Operations (Optional)

UNION in SQL

121

Get the IDs of users that have rated a movie directed by “Ang Lee” or a movie that released in 2013

RID Stars UID MID

MID Name Year Director

Ratings (R)

Movies (M)

SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” UNION SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013

Union-compatible!

Semantics of UNION in SQL

122

MID Name Year Director20 Inception 2010 Christopher Nolan42 Life of Pi 2012 Ang Lee53 Gravity 2013 Alfonso Cuaron

RID Stars UID MID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42 R M

SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” UNION SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013

UID79

UID79123

Semantics of UNION in SQL

123

UNIONUID79123

UID79

UID79123

UNION implicitly deduplicates tuples (unlike SELECT)!

Q. How to retain duplicates with UNION?

UNION ALLUID79123

UID79

UID7979123

INTERSECT in SQL

124

MID Name Year Director20 Inception 2010 Christopher Nolan42 Life of Pi 2012 Ang Lee53 Gravity 2013 Alfonso Cuaron

RID Stars UID MID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42 R M

SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” INTERSECT SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013

UID79

UID79123

UID79

INTERSECT

EXCEPT (Set Difference) in SQL

125

MID Name Year Director20 Inception 2010 Christopher Nolan42 Life of Pi 2012 Ang Lee53 Gravity 2013 Alfonso Cuaron

RID Stars UID MID1 3.5 79 422 4.0 80 203 2.5 79 534 4.5 123 42 R M

SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Director = “Ang Lee” EXCEPT SELECT R.UID FROM Ratings R, Movies M WHERE R.MID = M.MID AND M.Year = 2013

UID79

UID79123 UID

123

EXCEPT

The Contentious Bag Semantics!

126

UID79797912312380

UID7912312392

UNION ALL UID797979791231231231238092

Add the number of repetitions

The Contentious Bag Semantics!

127

UID79797912312380

UID7912312392

EXCEPT ALL UID797980Subtract the number

of repetitions

The Contentious Bag Semantics!

128

UID79797912312380

UID7912312392

INTERSECT ALL UID79123123Minimum of the

number of repetitions