Introduction to Data Science, Week 2 Notes...Introduction to Data Science, Week 2 Notes 2013...
Transcript of Introduction to Data Science, Week 2 Notes...Introduction to Data Science, Week 2 Notes 2013...
0
Page 1 of 7
Danny King
www.dannyking.eu
Introduction to Data Science, Week 2 Notes
Danny King, 2013
Introduction to Data Science, Week 2 Notes
Different types of database structure/organisation have benefits or weaknesses for different types of
activities (e.g. tradeoffs between read and write speeds). When choosing the data storage mechanism you
should keep in mind: how is the data organised on disk? What kinds of queries are best suited to the
organisation (and what aren’t)? How hard is it to update the data/add new data? What happens when
you encounter unanticipated queries – do you need to
reorganise the data and how hard is that?
The key part of relational databases is that the data is
independent of the application code – activities of users and
applications should remain unaffected when the internal
representation of data is changed. Tabular data has an
algebraic structure which allows reasoning and manipulation
independently of physical data representation. Programs you
write to manipulate data are more robust than they would be
without the relational model.
SQL is converted into relational algebra by relational DBMSs so
that they can use algebraic simplification to optimise the
equation and eliminate unnecessary tasks, making the computation more efficient, which is especially
important over large data sets. Algebraic closure in this context means that any operation (on a table) will
return a table so you can chain operations.
Sets can’t have duplicate elements and bags can.
Union bags Union sets
0
Page 2 of 7
Danny King
www.dannyking.eu
Introduction to Data Science, Week 2 Notes
Danny King, 2013
Intersection
e.g.:
Projection: remove all columns that aren’t explicitly listed and remove all duplicates that remain (for sets)
Bag semantics
Bag semantics
0
Page 3 of 7
Danny King
www.dannyking.eu
Introduction to Data Science, Week 2 Notes
Danny King, 2013
Cross product: all possible combinations of R1 with R2. Size of a cross product is |R1| x |R2|
Equijoin is a join with an equality condition. Join is: for every record in R1 find a corresponding record in R2
that satisfies some condition. Theta join is simply a join where the condition can be anything (so equijoin is a
special case of theta join where the condition is equality).
0
Page 4 of 7
Danny King
www.dannyking.eu
Introduction to Data Science, Week 2 Notes
Danny King, 2013
When dealing with understanding a
complicated query, look for the FROM
clause. In this case, there are 3 nested
FROM clauses (closure allows this –
operating on new tables returned by
each query).
You can save any of these statements
as a view (i.e. give it a name), then you
can refer to it as if it were a table.
Sometimes it helps to draw out complex
conditions graphically to work out what is
happening.
There are three types of user-defined functions
(UDFs).
Aggregate functions only appear in the select
clause and is always associated with a GROUP BY
– a common example is a concatenate function.
Table functions appear in the from clause –
usually the most complicated – commonly they
are used to generate sequence of integers (1, 2,
..., 4, 5) in a temporary table.
Scalar functions can appear pretty much
anywhere an attribute can – e.g. if in a SELECT
clause, a WHERE clause, in a JOIN condition, etc.
0
Page 5 of 7
Danny King
www.dannyking.eu
Introduction to Data Science, Week 2 Notes
Danny King, 2013
Declarative languages: you specify the answer you want but don’t specify anything about how to get it.
Relational Algebra expressions are not declarative (you do specify an order) but SQL is. With SQL you just
specify the properties that must be true of the result and you let the database figure out the right way to
compute it. The SQL is the ‘what’ not the ‘how.’
Logical Independence
Physical independence is the ability to isolate/protect the application from physical changes in the
organisation of the data. So if things were rearranged on disk, we don’t
have to rewrite the application. This is what databases provide.
Logical data independence is the ability to isolate/protect the
application form changes to the logical structure of the database –
e.g. adding a new column. If your application doesn’t care about that
column it shouldn’t have to be re-written. This logical data
independence is provided by views (all relational databases have this
concept).
This allows us to separate the people who are administering the
database from the ones who are actually accessing it.
A view is just a query with a name (saved in the
database) which you can access as if it were a
real table.
No matter how you want to organise your
tables views allows you to provide a different
perspective on the data (e.g. different
structure, different column names) for users or
other queries to use.
Access views exactly as if they were a table.
The database pulls all these together and works out the
best way to deal with them all in one go, so you don’t need
to worry about having a stack of queries or views – the
database will translate the whole thing into one big query.
0
Page 6 of 7
Danny King
www.dannyking.eu
Introduction to Data Science, Week 2 Notes
Danny King, 2013
Indexes allow you to find needles in the
haystack of database data.
Queries will always finish regardless of
database size as long as it fits on disk: you
don’t need to worry about amount of
memory(RAM).
Written Notes
0
Page 7 of 7
Danny King
www.dannyking.eu
Introduction to Data Science, Week 2 Notes
Danny King, 2013