Introduction to SQL and SQL in R - LISA (Laboratory for ... to SQL.pdf · SQL is a special-purpose...
Transcript of Introduction to SQL and SQL in R - LISA (Laboratory for ... to SQL.pdf · SQL is a special-purpose...
2
Laboratory for Interdisciplinary Statistical
Analysis
LISA helps VT researchers benefit from the use of Statistics
Short Courses: Designed to help graduate students apply statistics in their research
Walk-In Consulting: M-F 1-3 PM GLC Video Conference Room;
11 AM-1 PM Old Security Building Room 103
For questions requiring <30 mins
All services are FREE for VT researchers.
We assist with research—not class projects or homework.
Collaboration:
Visit our website to request personalized statistical advice and assistance with:
Experimental Design • Data Analysis • Interpreting
Results
Grant Proposals • Software (R, SAS, JMP, SPSS...) LISA statistical collaborators aim to explain concepts in ways useful for your
research.
Great advice right now: Meet with LISA before collecting your data.
Part I. SQL
1. Basic Concepts in SQL
2. Querying From One Table
3. Nested Queries
4. Querying From Multiple Tables
Part II. SQL in R
1. sqldf Package
2. Write SQL Command in R
3
Course Plan
SQL is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS). (wikipedia)
In the era of Big Data,
(1) Simple Method + Massive Dataset
(2) The time spent on data management often exceeds the time spent on analysis.
Why SQL?
4
Basic Concepts: Table
A Table (or a Relation) is a set of records that have the same attributes. 1. A table always have an unique table name in SQL.
Name Gender Age Occupation
Jessica Female 28 Student
Michael Male 28 Professor
Emma Female 34 Professor
5
Basic Concepts: Column
A Column ( or Attribute or Field) is a characteristics of objects. 1. A column always have a unique column name within a table.
2. A column is either numeric (age), categorical (gender) or one of the other pre-defined types in SQL, e.g. date.
3. Primary Key: a column that has unique value for each object.
Name(PK) Gender Age Occupation
Jessica Female 28 Student
Michael Male 28 Professor
Emma Female 34 Professor 6
Basic Concepts: Row
A Row ( or a Tuple or Record) is an object represented by a list of attributes. 1. A row always have an unique value in Primary Key.
Name(PK) Gender Age Occupation
Jessica Female 28 Student
Michael Male 28 Professor
Emma Female 34 Professor
7
Basic SQL Command
Example dataset: a world fact dataset (sqlzoo.net)
http://sqlzoo.net/wiki/SELECT_basics
(Table name: world)
8
SELECT …FROM …
Function: Selecting one or more columns from a table.
Example:
SELECT * FROM world
SELECT name FROM world
SELECT name, gdp FROM world
A SQL query always start with a SELECT…FROM… command.
9
DISTINCT Clause
Function: Removing duplicate records in query result.
Example: SELECT continent FROM world
(This query includes duplicate results)
SELECT DISTINCT continent FROM world
(This query excludes duplicate results)
10
LIMIT clause
Function: Limiting the number of records returned.
Example: SELECT DISTINCT continent FROM world
(This query returns all distinct results)
SELECT DISTINCT continent FROM world
LIMIT 3
(This query returns three distinct results)
11
WHERE clause
Function: Filtering the records returned by a select command.
Example: Filtering based on a numeric attribute SELECT name, area FROM world /* do not forget this line*/
WHERE area = 9596961 (km^2)
WHERE area<> 9596961
WHERE area > 5000000
WHERE area < 1
WHERE area IN (0, 9596961)
WHERE area BETWEEN 9000000 AND 10000000
WHERE area NOT BETWEEN 1 AND 10000000
12
WHERE clause (cont.)
Example: Filtering based on a categorical attribute. SELECT name, continent FROM world
WHERE continent = 'north america'
WHERE continent <> 'north america'
WHERE continent >'north america'
WHERE continent < 'north america'
WHERE continent IN ('north america', ‘south america')
WHERE continent BETWEEN ‘D’ AND ‘F’
WHERE continent NOT BETWEEN ‘D’ AND ‘F’
WHERE continent LIKE '%america‘
13
WHERE clause (cont.)
Example: Filtering based on two or more conditions. SELECT name, continent FROM world
WHERE area>9000000 AND area<10000000
WHERE continent = 'north america‘ AND area>9000000
WHERE continent = 'north america‘ OR area>9000000
WHERE (continent = 'north america‘ AND area>500000) OR (continent = ‘south america‘ AND area>500000)
14
ORDERED BY clause
Function: Sorting the query result by a column.
Example:
Sort Ascending (Default) SELECT name, area FROM world
ORDER BY area (ASC)
Sort Descending
SELECT name, area FROM world
ORDER BY area DESC
15
Arithmetic operations
SQL permits arithmetic operations on columns.
Example: Query GDP: select name, gdp from world
order by gdp DESC
Query GDP in Trillion: select name, gdp/1000000000 from world
order by gdp DESC
Query GDP per capita select name, gdp/population from world
order by gdp/population DESC
16
Write the SQL queries in the first tutorial.
http://sqlzoo.net/wiki/SELECT_basics
17
Practice: SQLZOO:SELECT basics
Function: Producing descriptive statistics of a column.
Example:
Query the average GDP of the world.
SELECT avg(gdp) FROM world
More Aggregate Functions:
max(), min(), sum(), count(), var(), stdev()
Note, max() and min() operates on strings as well.
Aggregate Functions
18
Function: Stratifying aggregate result by a categorical column
Example:
Query the average GDP of each continent.
SELECT avg(gdp) FROM world.
GROUP BY continent
GROUP BY
19
Function: Filtering aggregate results.
Example:
Find the continent with average country area greater than 1 million.
SELECT continent, avg(area) FROM world
GROUP BY continent
HAVING avg(area)>1000000
HAVING Clause
20
Query result can be used in the where clause of another query. This is referred as Nested Query
SELECT name FROM world WHERE population > (SELECT population FROM world WHERE name=‘Russia') If we run the nested query, we will get 146000000, so the above query is equivalent to
SELECT name FROM world WHERE population > 146000000
Nested Query
21
Function: Concatenating two tables “vertically”
UNION Command
Name Gender Age Occupation
Jessica Female 28 Student
Michael Male 28 Professor
Emma Female 34 Professor
Name Gender Age Occupation
James Male 23 Student
Anna Female 22 Staff
Andy Male 45 Professor
Name Gender Age Occupation
Jessica Female 28 Student
Michael Male 28 Professor
Emma Female 34 Professor
James Male 23 Student
Anna Female 22 Staff
Andy Male 45 Professor
22
Example: SELECT * FROM world
WHERE name = ‘Germany’
UNION
SELECT * FROM world
WHERE name=‘France’
Why is UNION command useful if we can use WHERE clause?
Answer: These queries could come from different tables.
UNION Command
23
Function: Concatenating two tables “horizontally”.
JOIN Command
Name Gender Age Occupation
Jessica Female 28 Student
Michael Male 28 Professor
Emma Female 34 Professor
Name Department
Jessica Math
Michael Statistics
Emma Art
Name Gender Age Occupation Department
Jessica Female 28 Student Math
Michael Male 28 Professor Statistics
Emma Female 34 Professor Art
24
JOIN is generally more complicated than union because you have to specify a common attribute to link two tables. In the previous example, we joined tables with the name column.
If both tables contain a column with the same column name, we should distinguish two columns by attaching the table name in front
Table1Name.ColumnName vs Table2Name.ColumnName
JOIN Command
25
There are four different types of join:
(1) INNER JOIN
(2) LEFT JOIN
(3) RIGHT JOIN
(4) FULL OUTER JOIN
JOIN Command
26
INNER JOIN produce a record if it exist in BOTH tables
INNER JOIN
Name Gender Age Occupation
Jessica Female 28 Student
Michael Male 28 Professor
Emma Female 34 Professor
Name Department
Jessica Math
John CS
Name Gender Age Occupation Department
Jessica Female 28 Student Math
27
LEFT JOIN produce a record if it exists in the left table
LEFT JOIN
Name Gender Age Occupation
Jessica Female 28 Student
Michael Male 28 Professor
Emma Female 34 Professor
Name Department
Jessica Math
John CS
Name Gender Age Occupation Department
Jessica Female 28 Student Math
Michael Male 28 Professor NULL
Emma Female 34 Professor NULL
28
RIGHT JOIN produce a record only if it exists in the right table
RIGHT JOIN
Name Gender Age Occupation
Jessica Female 28 Student
Michael Male 28 Professor
Emma Female 34 Professor
Name Department
Jessica Math
John CS
Name Gender Age Occupation Department
Jessica Female 28 Student Math
John NULL NULL NULL CS
29
FULL OUTER JOIN produce a record if it exists in either table.
Note, FULL OUTER JOIN is not supported in MySQL.
FULL OUTER JOIN
Name Gender Age Occupation
Jessica Female 28 Student
Michael Male 28 Professor
Emma Female 34 Professor
Name Department
Jessica Math
John CS
Name Gender Age Occupation Department
Jessica Female 28 Student Math
Michael Male 28 Professor NULL
Emma Female 34 Professor NULL
John NULL NULL NULL CS 30
JOIN syntax
Left Table Name: Games
Right Table Name: City
Year City
1896 Athens
1948 London
2004 Athens
2008 Beijing
2012 London
Olympic Game Dataset http://sqlzoo.net/wiki/SELECT_.._JOIN
Name Country
Sydney Australia
Athens Greece
Beijing China
London UK
31
INNER JOIN SELECT * FROM games JOIN city ON games.city=city.name
LEFT JOIN SELECT * FROM games LEFT JOIN city ON games.city=city.name
RIGHTT JOIN SELECT * FROM games RIGHT JOIN city ON games.city=city.name FULL OUTER JOIN SELECT * FROM games FULL OUTER JOIN city ON games.city=city.name
JOIN Example
32
Two ways of applying SQL queries in R
(1) A “Local” Way.
Query from a dataframe.
(2) A “Remote” Way
Accessing a SQL database remotely.
SQL in R
33