Data Manipulation
Fabrice Rossi
CEREMADE, Université Paris Dauphine
2019
Data Manipulation
In this course
- tabular data
- elementary extension to multiple-table data
- data transformation
  - wrangling
  - filtering
  - ordering
- data aggregation and summary
- tidy data and reshaping

In other courses
- database management system
- data models
- relational data
- unstructured data
Data Model
In this course
- a data set is
  - a (finite) set of entities (a.k.a. objects, instances, subjects)
  - each entity is described by its values with respect to a fixed set of variables (a.k.a. attributes)
- in practice a data set is a table with
  - a row per entity
  - a column per variable

Extension
- multiple-table data
- a data set = several tables
Example

    age  job            marital  education  default  balance  housing
 1   30  unemployed     married  primary    no          1787  no
 2   33  services       married  secondary  no          4789  yes
 3   35  management     single   tertiary   no          1350  yes
 4   30  management     married  tertiary   no          1476  yes
 5   59  blue-collar    married  secondary  no             0  yes
 6   35  management     single   tertiary   no           747  no
 7   36  self-employed  married  tertiary   no           307  yes
 8   39  technician     married  secondary  no           147  yes
 9   41  entrepreneur   married  tertiary   no           221  yes
10   43  services       married  primary    no           -88  yes
11   39  services       married  secondary  no          9374  yes
12   43  admin.         married  secondary  no           264  yes
13   36  technician     married  tertiary   no          1109  no
14   20  student        single   secondary  no           502  no
15   31  blue-collar    married  secondary  no           360  yes
16   40  management     married  tertiary   no           194  no
17   56  technician     married  secondary  no          4073  no
18   37  admin.         single   tertiary   no          2317  yes
19   25  blue-collar    single   primary    no          -221  yes
20   31  services       married  secondary  no           132  no
Variable types
Numerical
- essentially “physical” measurements
- integer or decimal
- easier to handle than the other types

Categorical
- a.k.a. nominal (factors and levels in R)
- finite number of values (called categories or modalities)
- might be ordered

Dates and times
- very important in numerous applications
- notoriously difficult to handle
- use specific libraries!

Short texts
- a.k.a. strings
- could be handled as categorical data
- specific processing in some cases
- do not confuse them with full texts
Example
Bank dataset
- sources
  - https://archive.ics.uci.edu/ml/datasets/Bank%2BMarketing
  - http://hdl.handle.net/1822/14838
- data types
  - age: integer
  - balance: integer
  - education: categorical, semi-ordered
  - most of the others: categorical, some of them binary
Data Management
Data manipulation software
- typical examples: R with tidyverse or Python with pandas
- limited automatic support for enforcing complex data models
  - declarative support for broad types
  - constraints can be checked explicitly
    - very complex constraints can be enforced
    - error/bug prone
    - difficult to read
- documentation is needed
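As an illustration of "constraints can be checked explicitly", here is a minimal sketch in plain Python (no pandas, so that it stays self-contained). The records, the `constraints` dictionary, the `check` helper and the age bounds 0 to 120 are all hypothetical choices made for this example; they are not part of the course material.

```python
# Hypothetical records mimicking a few columns of the bank data set.
rows = [
    {"age": 30, "balance": 1787, "housing": "no"},
    {"age": 33, "balance": 4789, "housing": "yes"},
    {"age": -5, "balance": 1350, "housing": "maybe"},  # deliberately invalid
]

# One predicate per column: the data model is written down explicitly.
# The bounds on age are an assumption made for the example.
constraints = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "balance": lambda v: isinstance(v, int),
    "housing": lambda v: v in {"yes", "no"},
}


def check(rows, constraints):
    """Return (row index, column, offending value) for every violation."""
    return [
        (i, col, row[col])
        for i, row in enumerate(rows)
        for col, is_valid in constraints.items()
        if not is_valid(row[col])
    ]


violations = check(rows, constraints)
print(violations)
```

Even this small checker shows the trade-off listed above: arbitrary constraints can be expressed, but nothing forces the checks to run, and the rules live in code that has to be read and documented.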
Outline
Introduction
Data transformation
Data grouping and summarizing
Tidy data
Multiple data tables
Subsetting
Working on a subpopulation
- i.e. by “removing” rows
- several motivations
  - speed (for large data sets)
  - robustness (removing outliers)
  - modeling
- generally called filtering
- declarative approach in R and Python
  - give me the subset of the data that fulfills some conditions
  - supported by comparison and Boolean operators
Example
Bank data set
Married clients with secondary education level in their thirties (age between 30 and 39 inclusive)

Python (pandas)
    bank[(bank.marital == 'married') &
         (bank.education == 'secondary') &
         (bank.age >= 30) &
         (bank.age < 40)]

R (dplyr)
    bank %>% filter(marital == "married",
                    education == "secondary",
                    age >= 30,
                    age < 40)
Computational considerations
Running time
- filtering is a row-oriented operation
- naive implementation
  - browse the data row by row
  - keep a row if the conditions are fulfilled
- run time proportional to the number of rows in the data
- can be improved in some cases via indexing

Do not program it yourself!
- far less efficient
- less readable
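To make the naive implementation concrete, here is a small sketch in plain Python. It is meant only to show where the linear running time comes from; in practice the library filters shown earlier should be used. The `naive_filter` helper is a hypothetical name, and the records are five rows (1, 2, 3, 8 and 12) of the sample table restricted to three columns.

```python
# Five rows of the sample bank table, restricted to three columns.
rows = [
    {"age": 30, "marital": "married", "education": "primary"},
    {"age": 33, "marital": "married", "education": "secondary"},
    {"age": 35, "marital": "single", "education": "tertiary"},
    {"age": 39, "marital": "married", "education": "secondary"},
    {"age": 43, "marital": "married", "education": "secondary"},
]


def naive_filter(rows, condition):
    """Browse the data row by row and keep the rows fulfilling the condition."""
    kept = []
    for row in rows:  # a single pass: the work grows with len(rows)
        if condition(row):
            kept.append(row)
    return kept


def married_secondary_thirties(row):
    # Same conditions as the pandas/dplyr example: age in [30, 40).
    return (
        row["marital"] == "married"
        and row["education"] == "secondary"
        and 30 <= row["age"] < 40
    )


selected = naive_filter(rows, married_secondary_thirties)
print([row["age"] for row in selected])
```

Every row is visited exactly once whatever the condition, which is the "run time proportional to the number of rows" behaviour mentioned on the slide.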
Dropping Variables
Column-oriented subsetting
- two main motivations
  - to restrict the data set to variable types compatible with some technique
  - to restrict the data set to meaningful variables for an automated analysis (e.g. clustering or predictive modeling)
- simple declarative approach
Example
Bank data set
Keep some numerical variables

Python (pandas)
    bank[['age', 'balance', 'day', 'duration']]

R (dplyr)
    bank %>% select(age, balance, day, duration)
Ordering
Sorting
- standard sorting feature
- multiple criteria

Python
    bank.sort_values(by=['age', 'balance'])

R
    bank %>% arrange(age, balance)
Transformation of Variables
Variable Operations
- modifying a variable
- adding new variables from other sources
- computing new variables (based on existing ones)

Data Wrangling
- low level transformation
- recoding
- extraction and merging
- etc.

Data Management
- context variables
- enforcing data model

Preparing Analysis
- e.g. recoding categorical to numerical
- or quantizing numerical variables
- scaling, normalization
- merging categories
Computed variables
Principle
- row-oriented calculation
- create a new variable using the existing ones
- e.g. duration from starting and ending times
- combines nicely with aggregation/summary functions

Support
- numerous statistical summary functions (column oriented)
- column-oriented arithmetic (e.g. sum of columns)
- column-oriented logical operations (e.g. comparison)
- function application (e.g. to each row)
Example
Bank data set
Binary variables telling whether a client has some characteristic
- average annual balance above the mean
- at least one loan

Python
    bank['moreavg'] = bank['balance'] > bank['balance'].mean()
    bank['oneormoreloan'] = (bank['loan'] == 'yes') | (bank['housing'] == 'yes')

R
    bank %>% mutate(moreavg = balance > mean(balance))
    bank %>% mutate(oneormoreloan = loan == "yes" | housing == "yes")
Example
One Hot Encoding
- typical preparatory transformation
- a categorical variable is turned into a set of binary variables
- e.g. Gender=Male or Female transformed into GenderMale and GenderFemale (binary)

Python
    bank.join(pd.get_dummies(bank['education']))
    bank.join(pd.get_dummies(bank['education'])).drop('education', axis=1)

R
    bank %>%
      bind_cols(as.data.frame(model.matrix(~education-1, data=bank))) %>%
      select(-education)
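What one hot encoding computes can be spelled out in a few lines of plain Python. This sketch is only illustrative: the `education` list holds the education level of the first five rows of the sample table, and `levels` and `one_hot` are names chosen for the example.

```python
# Education level of the first five clients of the sample table.
education = ["primary", "secondary", "tertiary", "tertiary", "secondary"]

# One binary variable per category observed in the data.
levels = sorted(set(education))

# For each client, set to 1 the variable matching its category, 0 elsewhere.
one_hot = [
    {level: int(value == level) for level in levels}
    for value in education
]

for value, encoded in zip(education, one_hot):
    print(value, encoded)
```

Each encoded row contains exactly one 1, which is where the name of the technique comes from.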
Example

    age  job            marital  education  default  balance  housing
 1   30  unemployed     married  primary    no          1787  no
 2   33  services       married  secondary  no          4789  yes
 3   35  management     single   tertiary   no          1350  yes
 4   30  management     married  tertiary   no          1476  yes
 5   59  blue-collar    married  secondary  no             0  yes
 6   35  management     single   tertiary   no           747  no
 7   36  self-employed  married  tertiary   no           307  yes
 8   39  technician     married  secondary  no           147  yes
 9   41  entrepreneur   married  tertiary   no           221  yes
10   43  services       married  primary    no           -88  yes
11   39  services       married  secondary  no          9374  yes
12   43  admin.         married  secondary  no           264  yes
13   36  technician     married  tertiary   no          1109  no
14   20  student        single   secondary  no           502  no
15   31  blue-collar    married  secondary  no           360  yes
16   40  management     married  tertiary   no           194  no
17   56  technician     married  secondary  no          4073  no
18   37  admin.         single   tertiary   no          2317  yes
19   25  blue-collar    single   primary    no          -221  yes
20   31  services       married  secondary  no           132  no
Example

    age  job            marital  primary  secondary  tertiary  unknown
 1   30  unemployed     married        1          0         0        0
 2   33  services       married        0          1         0        0
 3   35  management     single         0          0         1        0
 4   30  management     married        0          0         1        0
 5   59  blue-collar    married        0          1         0        0
 6   35  management     single         0          0         1        0
 7   36  self-employed  married        0          0         1        0
 8   39  technician     married        0          1         0        0
 9   41  entrepreneur   married        0          0         1        0
10   43  services       married        1          0         0        0
11   39  services       married        0          1         0        0
12   43  admin.         married        0          1         0        0
13   36  technician     married        0          0         1        0
14   20  student        single         0          1         0        0
15   31  blue-collar    married        0          1         0        0
16   40  management     married        0          0         1        0
17   56  technician     married        0          1         0        0
18   37  admin.         single         0          0         1        0
19   25  blue-collar    single         1          0         0        0
20   31  services       married        0          1         0        0
Improving Representation
Data Management
- convert values to proper types, e.g.
  - integer only to language supported integer
  - date as string to language supported date
  - string to semantic content
- proper encoding of missing data
- add diagnostic data

Nominal data
- important particular case
- frequently represented by strings
  - loss of efficiency
  - cannot leverage automatic handling (in R in particular)
Example
Data Representation
- make sure that nominal variables are recognized as such
- yes/no nominal variables can be encoded as logical variables

Python
    for myvar in ['job', 'marital', 'education']:
        bank[myvar] = bank[myvar].astype('category')
    for myvar in ['default', 'housing', 'loan']:
        bank[myvar] = bank[myvar] == 'yes'

R
    bank %>% mutate_at(vars(job, marital, education), ~(as.factor(.)))
    bank %>% mutate_at(vars(default, housing, loan), ~(. == "yes"))
Example
Unknown values
- specific “unknown” category in the bank dataset (for several variables)
- not considered “special” by the software while it should be

Python
NumPy provides a special value nan for “not available”
    bank.replace("unknown", np.nan)

R
Similar special value NA
    bank %>% mutate_all(~replace(., . == "unknown", NA))
Conditional analysis
Finding dependencies and links
One of the main goals of data analysis, e.g.
- predictive models: links between target variables and explanatory variables
- frequent patterns: variables that are frequently non zero at the same time
- etc.

Conditional summaries
- choose one or more variables
- for each possible combination of the values of the chosen variables
  - find all corresponding objects in the data set
  - compute a summary of the other variables on this subset
Mechanism
X  Y
4  C
2  B
2  C
3  C
4  B
1  C
4  C
4  A
3  B
3  A
1  A
1  B
1  A
3  B
2  C

Split

Y  X
A  4
A  3
A  1
A  1

Y  X
B  2
B  4
B  3
B  1
B  3

Y  X
C  4
C  2
C  3
C  1
C  4
C  2

Apply

Y  Sum
A  9

Y  Sum
B  13

Y  Sum
C  16

Combine

Y  Sum
A  9
B  13
C  16
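The split, apply and combine steps of this mechanism can be traced in plain Python on the same fifteen (X, Y) pairs. This is a sketch of the principle, not of how pandas or dplyr implement grouping; the variable names are chosen for the example.

```python
from collections import defaultdict

# The (X, Y) pairs of the original table, in order.
data = [
    (4, "C"), (2, "B"), (2, "C"), (3, "C"), (4, "B"),
    (1, "C"), (4, "C"), (4, "A"), (3, "B"), (3, "A"),
    (1, "A"), (1, "B"), (1, "A"), (3, "B"), (2, "C"),
]

# Split: one list of X values per value of Y.
groups = defaultdict(list)
for x, y in data:
    groups[y].append(x)

# Apply: summarize each group independently (here with a sum).
# Combine: collect the per-group summaries in a single table.
sums = {y: sum(xs) for y, xs in sorted(groups.items())}

print(sums)
```

Replacing `sum` by another summary function (a mean, a median, a count) changes the apply step only; the split and combine steps stay the same.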
Example
Bank dataset
- balance conditioned on the response to the marketing campaign
- age conditioned on marital status and education level

Python
    bank.groupby('y')['balance'].median()
    bank.groupby(['marital', 'education'])['age'].mean()

R
    bank %>% group_by(y) %>% summarize(median_balance = median(balance))
    bank %>% group_by(marital, education) %>%
      summarize(mean_age = mean(age))
Example
Balance versus marketing

y    median_balance
no           419.50
yes          710.00

Age versus marital and education

marital   education  mean_age
divorced  primary       51.39
divorced  secondary     43.50
divorced  tertiary      45.15
divorced  missing       50.38
married   primary       47.51
married   secondary     42.40
married   tertiary      41.78
married   missing       48.44
single    primary       37.01
single    secondary     33.05
single    tertiary      34.51
single    missing       34.65
Pivot Table
Conditioning by two variables
- useful special case
- we can leverage standard tabular representation
  - compute a standard aggregated table with two conditioning variables and a single aggregate
  - “pivot” the table
    - remove one of the conditioning variables and the aggregate
    - create as many columns as there are values of the removed variable
    - use the aggregate to populate cells
Mechanism
X  Y  Z
2  B  U
2  C  V
3  A  U
2  B  U
1  C  U
4  C  V
3  B  U
4  C  U
1  B  V
3  A  U
2  A  V
4  A  U
3  A  V
4  B  U
3  B  U

S-A-C (split, apply, combine)

Y  Z  Sum
A  U   10
A  V    5
B  U   14
B  V    1
C  U    5
C  V    6

Pivot

Y/Z   U  V
A    10  5
B    14  1
C     5  6
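The two stages of this mechanism (aggregate on the pair of conditioning variables, then pivot) can be sketched in plain Python on the same (X, Y, Z) triples. The names `aggregated` and `pivot` are chosen for the example, and nested dictionaries stand in for the pivoted table.

```python
from collections import defaultdict

# The (X, Y, Z) triples of the original table, in order.
data = [
    (2, "B", "U"), (2, "C", "V"), (3, "A", "U"), (2, "B", "U"), (1, "C", "U"),
    (4, "C", "V"), (3, "B", "U"), (4, "C", "U"), (1, "B", "V"), (3, "A", "U"),
    (2, "A", "V"), (4, "A", "U"), (3, "A", "V"), (4, "B", "U"), (3, "B", "U"),
]

# Split-apply-combine: sum of X for each (Y, Z) combination.
aggregated = defaultdict(int)
for x, y, z in data:
    aggregated[(y, z)] += x

# Pivot: Z stops being a column of its own; each of its values becomes
# a column, and the aggregate is used to populate the cells.
pivot = defaultdict(dict)
for (y, z), total in aggregated.items():
    pivot[y][z] = total

for y in sorted(pivot):
    print(y, pivot[y]["U"], pivot[y]["V"])
```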
Example
Age versus marital and education

marital   education  mean_age
divorced  primary       51.39
divorced  secondary     43.50
divorced  tertiary      45.15
divorced  missing       50.38
married   primary       47.51
married   secondary     42.40
married   tertiary      41.78
married   missing       48.44
single    primary       37.01
single    secondary     33.05
single    tertiary      34.51
single    missing       34.65

marital/education  primary  secondary  tertiary  missing
divorced             51.39      43.50     45.15    50.38
married              47.51      42.40     41.78    48.44
single               37.01      33.05     34.51    34.65
Howto
Python
Specific support for pivot tables
    bank.pivot_table('age', index='marital', columns='education')

R
A particular case of table reshaping
    bank %>% group_by(marital, education) %>%
      summarize(mean_age = mean(age)) %>%
      spread(key = education, value = mean_age)
Multidimensional analysis (MDA)
Pivot (hyper)cube
- a pivot table with more than 2 “dimensions” (e.g. a pivot cube)
- specific vocabulary:
  - a dimension: a variable with a finite set of possible values
  - a measure: a numerical variable
  - a cell contains aggregate values for objects with given values for the dimensions

Remarks
- a very convoluted way of presenting conditional analysis
- rich possibilities when the set of values of a “dimension” is structured (e.g. postcodes)
- rich support with specific OLAP software (not in this part of the course)
Example
Bank data set
- Possible dimensions: job, marital, education, housing, loan
- Possible measures: age and balance
- A cell (such as unemployed, married, primary education, no housing and no loan) contains the average age and the median balance for the persons with the specified values on the dimensions

Tabular point of view

job     marital   education  housing  loan  mean_age  median_balance
admin.  divorced  primary    no       no       57.00            1.00
admin.  divorced  primary    no       yes      56.00            0.00
admin.  divorced  primary    yes      no       57.00          179.00
admin.  divorced  secondary  no       no       41.80          432.00
admin.  divorced  secondary  no       yes      45.83          175.00

and 337 additional rows...
Example
Tabular view

y    housing  loan     n
no   no       no    1394
no   no       yes    267
no   yes      no    1958
no   yes      yes    381
yes  no       no     283
yes  no       yes     18
yes  yes      no     195
yes  yes      yes     25

MDA view

y="no"
housing/loan    no  yes
no            1394  267
yes           1958  381

y="yes"
housing/loan   no  yes
no            283   18
yes           195   25

arranged as a cube!
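One possible way to hold such a cube in memory is a nested dictionary with one level per dimension. The sketch below builds it from the tabular view in plain Python; the layout and the names `cube` and `by_y_and_loan` are choices made for this example, not the representation used by OLAP software.

```python
# The tabular view: (y, housing, loan, n).
table = [
    ("no", "no", "no", 1394), ("no", "no", "yes", 267),
    ("no", "yes", "no", 1958), ("no", "yes", "yes", 381),
    ("yes", "no", "no", 283), ("yes", "no", "yes", 18),
    ("yes", "yes", "no", 195), ("yes", "yes", "yes", 25),
]

# Cube: one dictionary level per dimension, cube[y][housing][loan] = n.
cube = {}
for y, housing, loan, n in table:
    cube.setdefault(y, {}).setdefault(housing, {})[loan] = n

# Fixing one dimension gives an ordinary pivot table (the y="yes" face).
print(cube["yes"])

# Summing over one dimension (here housing) gives a smaller table.
by_y_and_loan = {}
for y, housing, loan, n in table:
    by_y_and_loan[(y, loan)] = by_y_and_loan.get((y, loan), 0) + n
print(by_y_and_loan)
```

Both operations are plain conditional analyses on the tabular view, which is the point made earlier about MDA being a convoluted presentation of the same idea.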
Tidy Data
Concept introduced by Hadley Wickham in the Tidy Data paper

Definition
A data set made of several data tables is tidy if
1. each variable forms a column of a table
2. each observation forms a row of a table
3. each type of observational unit forms a table

Observational unit
- a particular type of observations in a data set
- e.g.
  - persons
  - daily behavior of persons
  - etc.
Example
Bank data set
- mixes person information and marketing campaign information (and economic variables in an extended version!)
- marketing campaign data
  - last contact data
  - contacts during the campaign
  - summary of previous campaigns!
- clearly untidy
  - violates property 3
  - multiple types of observational unit in a single table
Example
Pivot table

housing/loan    no  yes
no            1677  285
yes           2153  406

- intrinsically untidy
- columns are not variables but variable values!

Summary table

loan  housing     n
no    no       1677
no    yes      2153
yes   no        285
yes   yes       406

- tidy data
- new observational unit: groups of observations!
Tidying Data
Splitting or Joining
- observational unit level
- splitting
  - separates different observational units into several tables
  - filtering/selecting + linking
- joining
  - merges several tables about the same observational unit
  - specific tools (see the last part of this course)
Tidying Data
Gathering or Spreading
- variable/observation level
- specific tools
- gathering
  - reduces the number of columns
  - merges several columns into a single one that corresponds to a proper variable
- spreading
  - increases the number of columns
  - splits a column into several ones that correspond to proper variables
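Gathering can be sketched in plain Python on a small hypothetical wide table in which the columns A and B are really two values of a single variable. The `gather` helper and its parameter names are assumptions made for this example; they only mimic the idea behind the dedicated library tools.

```python
# A hypothetical wide table: A and B are values of one variable, not variables.
wide = [
    {"X": 1, "A": 2, "B": 3},
    {"X": 2, "A": 4, "B": 5},
]


def gather(rows, id_col, key_name, value_name):
    """Merge every column except id_col into a key column and a value column."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_col:
                long_rows.append(
                    {id_col: row[id_col], key_name: column, value_name: value}
                )
    return long_rows


long_table = gather(wide, id_col="X", key_name="Y", value_name="Z")
for row in long_table:
    print(row)
```

The result has fewer columns (three instead of one per value) and more rows, one per original cell.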
Example (Splitting)
Splitting bank marketing data
Steps:
1. add an identifier to each row (to identify clients)
2. select variables associated to each observational unit, keeping the id
   2.1 persons
   2.2 current campaign (i.e. last contact)
   2.3 previous campaigns
3. clean the tables (remove useless rows, e.g. when no previous campaign is available)
Example (Splitting)
Python
- pandas data frames always have row identifiers
- splitting and cleaning

    bank_persons = bank[['age', 'job', 'marital', 'education', 'default',
                         'balance', 'housing', 'loan']]
    bank_current = bank[['contact', 'day', 'month', 'duration', 'campaign']]
    # pdays == -1 marks clients with no previous campaign
    bank_previous = bank[bank['pdays'] != -1][['pdays', 'previous', 'poutcome']]
Example (Splitting)
R
1. identifier

    bank.tidy <- bank %>% mutate(key = 1:nrow(bank))

2. selection with cleanup

    bank.persons <- bank.tidy %>% select(key, age, job, marital,
                                         education, default, balance, housing, loan)
    bank.current <- bank.tidy %>% select(key, contact, day, month,
                                         duration, campaign)
    bank.previous <- bank.tidy %>% filter(pdays != -1) %>%
      select(key, pdays, previous, poutcome)
Spreading data
Replace a variable encoded on rows by columns
- operates on two variables in the original table: a key and a value
- each value taken by the key becomes a column
- the value variable is used to fill the column

Original table

X  Y  Z
1  A  2
1  B  3
2  A  4
2  B  5

Spread table
Y is the key, Z is the value

X  A  B
1  2  3
2  4  5
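The toy table above can be spread with pandas. This is a minimal sketch, assuming the table is stored in a data frame called `original` (a name chosen here for illustration): `pivot` receives the key (`Y`) as `columns` and the value (`Z`) as `values`.

```python
import pandas as pd

# Original table from the slide: Y is the key, Z is the value.
original = pd.DataFrame({
    "X": [1, 1, 2, 2],
    "Y": ["A", "B", "A", "B"],
    "Z": [2, 3, 4, 5],
})

# Each value taken by Y becomes a column, filled with the matching Z.
spread = original.pivot(index="X", columns="Y", values="Z").reset_index()
spread.columns.name = None  # drop the leftover "Y" label on the column index

print(spread)
```

`pivot` raises an error when an (X, Y) pair appears more than once, which acts as a safeguard: spreading is only well defined when each cell of the result receives a single value.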
45
![Page 50: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/50.jpg)
Example (Spreading)
CalIt2 dataset
- flow (in number of persons) in and out of a building
- direction encoded in the flow column

flow  date        time      count
7     2005-07-24  00:00:00  0
9     2005-07-24  00:00:00  0
7     2005-07-24  00:30:00  1
9     2005-07-24  00:30:00  0
7     2005-07-24  01:00:00  0
9     2005-07-24  01:00:00  0
7     2005-07-24  01:30:00  0
9     2005-07-24  01:30:00  0
7     2005-07-24  02:00:00  0
9     2005-07-24  02:00:00  0

- untidy: flow is not a variable!
- spreading is needed
46
![Page 51: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/51.jpg)
Example (Spreading)
CalIt2 dataset
- flow (in number of persons) in and out of a building
- tidy version

date        time      entering  leaving
2005-07-24  00:00:00  0         0
2005-07-24  00:30:00  1         0
2005-07-24  01:00:00  0         0
2005-07-24  01:30:00  0         0
2005-07-24  02:00:00  0         0
2005-07-24  02:30:00  2         0
2005-07-24  03:00:00  0         0
2005-07-24  03:30:00  0         0
2005-07-24  04:00:00  0         0
2005-07-24  04:30:00  0         0
47
![Page 52: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/52.jpg)
Example (Spreading)
Python
- spreading can be done by leveraging the indexing system
- hierarchical indexing in this case

    # (date, time) becomes a hierarchical row index
    calit.set_index(['date', 'time'], inplace=True)
    tcalit = calit.pivot(columns='flow')
    # flatten the two-level column labels into tuples such as ('count', 7)
    tcalit.columns = tcalit.columns.to_flat_index()
    tcalit.rename(columns={('count', 7): 'entering',
                           ('count', 9): 'leaving'},
                  inplace=True)
    tcalit.reset_index(inplace=True)

- this is somewhat convoluted
48
![Page 53: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/53.jpg)
Example (Spreading)
Python
- spreading can also be done with pivot_table

    tcalit = calit.pivot_table(values='count',
                               index=['date', 'time'],
                               columns='flow')
    tcalit.rename(columns={7: 'entering', 9: 'leaving'}, inplace=True)
    tcalit.reset_index(inplace=True)

- a bit simpler

R
Standard use of the spread function

    calit %>% spread(flow, count) %>% rename(entering = `7`, leaving = `9`)
49
![Page 54: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/54.jpg)
Gathering data
Gather
- gathering is the reverse of spreading
- it reduces the number of columns by encoding them as a series of rows and two new columns/variables
- the new key variable encodes the gathered columns while the new value variable contains the original values

Original table
Gather X, Y and Z

W  X  Y  Z
a  1  2  3
b  5  6  7

Gathered table
K is the key, V is the value

W  K  V
a  X  1
a  Y  2
a  Z  3
b  X  5
b  Y  6
b  Z  7
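The toy example above can be reproduced with the pandas `melt` function. This is a minimal sketch; the data frame names `wide`, `gathered` and `back` are chosen here for illustration. The last step spreads the gathered table again to show that the two operations are the reverse of each other.

```python
import pandas as pd

# Original table from the slide: W identifies the rows, X, Y and Z are gathered.
wide = pd.DataFrame({"W": ["a", "b"], "X": [1, 5], "Y": [2, 6], "Z": [3, 7]})

# K records which column a value came from, V holds the value itself.
gathered = (
    wide.melt(id_vars="W", var_name="K", value_name="V")
    .sort_values(["W", "K"])      # same row order as on the slide
    .reset_index(drop=True)
)
print(gathered)

# Spreading K and V gives back the original columns.
back = gathered.pivot(index="W", columns="K", values="V").reset_index()
back.columns.name = None
print(back)
```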
50
![Page 55: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/55.jpg)
Example (Gathering)
Sales transaction data set
- weekly product sales
- one row per product, one column per week: product view

Product_Code  W0     W1     W2     W3     W4     W5
P1            11.00  12.00  10.00   8.00  13.00  12.00
P2             7.00   6.00   3.00   2.00   7.00   1.00
P3             7.00  11.00   8.00   9.00  10.00   8.00
P4            12.00   8.00  13.00   5.00   9.00   6.00
P5             8.00   5.00  13.00  11.00   6.00   7.00
P6             3.00   3.00   2.00   7.00   6.00   3.00
P7             4.00   8.00   3.00   7.00   8.00   7.00
P8             8.00   6.00  10.00   9.00   6.00   8.00
P9            14.00   9.00  10.00   7.00  11.00  15.00
P10           22.00  19.00  19.00  29.00  20.00  16.00

with 53 columns and 811 rows
51
![Page 56: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/56.jpg)
Example (Gathering)
Gather the weeks
- gather all the columns except the product one
- one observation: (product, week)

Product_Code  Week  Quantity
P1            1     11
P2            1     7
P3            1     7
P4            1     12
P5            1     8
P6            1     3
P7            1     4
P8            1     8
P9            1     14
P10           1     22

with 42162 more rows
52
![Page 57: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/57.jpg)
Example (Gathering)
Python
Using the melt function

    prodlong = pd.melt(prodperweek, 'Product_Code')
    prodlong.rename(columns={'variable': 'Week',
                             'value': 'Quantity'},
                    inplace=True)
    # 'W0' becomes week 1, 'W1' becomes week 2, etc.
    prodlong['Week'] = pd.to_numeric(prodlong['Week'].str[1:]) + 1

R
Standard use of the gather function

    prodperweek %>% gather(Week, Quantity, -Product_Code) %>%
      mutate(Week = parse_number(Week) + 1)
53
![Page 58: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/58.jpg)
Tidy Data
Modeling assumptions
- data are tidy only with respect to some modeling assumptions
- what is an observation?
  - maybe the most important assumption
  - very frequently associated with an independence assumption

Practical aspects
- the data format must be adapted to the tool
- data mining and machine learning
  - generally limited to a single table
  - one might need to merge tables (e.g. bank data set)
- column oriented software (R)
  - have limited row oriented capabilities
  - might need a specific untidy representation
54
![Page 59: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/59.jpg)
Tidy Data
Examples
- bank marketing: several object types or only one (person + marketing action)
- flow in/out of a building
  - the half-hour bidirectional flow is an observation
  - or a day is an observation
- weekly sales
  - product point of view (original data set)
  - product × week point of view
  - week point of view
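The three points of view on the weekly sales can be obtained from one another by gathering and spreading. The sketch below is only an illustration under simple assumptions: it uses the first three products and weeks of the sales example, and the data frame names are chosen here for readability.

```python
import pandas as pd

# Product point of view: one row per product, one column per week.
product_view = pd.DataFrame({
    "Product_Code": ["P1", "P2", "P3"],
    "W0": [11, 7, 7],
    "W1": [12, 6, 11],
    "W2": [10, 3, 8],
})

# Product x week point of view: one row per (product, week) pair.
product_week_view = product_view.melt(
    id_vars="Product_Code", var_name="Week", value_name="Quantity"
)

# Week point of view: one row per week, one column per product.
week_view = product_week_view.pivot(
    index="Week", columns="Product_Code", values="Quantity"
).reset_index()
week_view.columns.name = None

print(product_week_view)
print(week_view)
```

Which of the three tables is the tidy one depends on what is considered to be an observation in the subsequent analysis.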
55
![Page 60: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/60.jpg)
Outline
Introduction
Data transformation
Data grouping and summarizing
Tidy data
Multiple data tables
56
![Page 61: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/61.jpg)
Multiple data tables
Tidy data
- one table per observational unit
- complex data
  - multiple observational units (e.g. persons and products)
  - multiple tables!

Difficulties
- the vast majority of data analysis methods are limited to single tables
- complex real world data use multiple tables!
- a core data manipulation task: join multiple tables into data analysis oriented tables
57
![Page 62: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/62.jpg)
Example
Loan application data set
- https://relational.fit.cvut.cz/dataset/Financial
- 8 tables including
  - client table
  - account table
  - credit card table
  - loan table
  - etc.
- open ended data set: no specific goal
58
![Page 63: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/63.jpg)
Associations and keys
Relational data
- tables must be related one to another
- in the relational model a table is a relation
- some relations describe entities while others describe links between entities

Keys
- a key is a (set of) variable(s) that uniquely identifies an entity
- a primary key does that in the relation/table that describes the entity
- a foreign key does that in another relation/table
59
![Page 64: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/64.jpg)
Loan application data set
Client table

client_id  gender  birth_date  district_id
1          F       1970-12-13  18
2          M       1945-02-04  1
3          F       1940-10-09  1
4          M       1956-12-01  5
5          F       1960-07-03  5

Keys
- primary key client_id
- foreign key district_id

Account table

account_id  district_id  date
1           18           1995-03-24
2           1            1993-02-26
3           5            1997-07-07
4           12           1996-02-21
5           15           1997-05-30

Keys
- primary key account_id
- foreign key district_id
60
![Page 65: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/65.jpg)
Loan application data set
Disposition table

disp_id  client_id  account_id  type
1        1          1           OWNER
2        2          2           OWNER
3        3          2           DISPONENT
4        4          3           OWNER
5        5          3           DISPONENT

Keys
- primary key disp_id
- foreign keys client_id and account_id

Link table
- the disposition table/relation is a typical example of a link table
- it associates clients with accounts
- the link is also an entity as it has characteristics
61
![Page 66: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/66.jpg)
Loan application data set
District table

district_id  Name             Region           Inhabitants  A5  A6  A7  A8
1            Hl.m. Praha      Prague           1204953      0   0   0   1
2            Benesov          central Bohemia  88884        80  26  6   2
3            Beroun           central Bohemia  75232        55  26  4   1
4            Kladno           central Bohemia  149893       63  29  6   2
5            Kolin            central Bohemia  95616        65  30  4   1
6            Kutna Hora       central Bohemia  77963        60  23  4   2
7            Melnik           central Bohemia  94725        38  28  1   3
8            Mlada Boleslav   central Bohemia  112065       95  19  7   1

Direct links
- no link table from accounts and clients to the district table
- district_id is used as a foreign key in the account and client tables
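The roles of primary and foreign keys can be checked directly on the data. The following is a minimal sketch on small extracts of the client and district tables (only a few columns and rows are kept, which is a simplification made here); the variable names holding the results of the checks are illustrative.

```python
import pandas as pd

# Extract of the client table.
client = pd.DataFrame({
    "client_id": [1, 2, 3, 4, 5],
    "gender": ["F", "M", "F", "M", "F"],
    "district_id": [18, 1, 1, 5, 5],
})
# Extract of the district table, restricted to the districts used above.
district = pd.DataFrame({
    "district_id": [1, 5, 18],
    "Name": ["Hl.m. Praha", "Kolin", "Pisek"],
})

# A primary key has no missing value and no duplicate.
client_id_is_key = bool(
    client["client_id"].notna().all() and client["client_id"].is_unique
)

# district_id is not a key of the client table: several clients share a district.
district_id_is_key_of_client = bool(client["district_id"].is_unique)

# A foreign key must refer to existing values of the primary key it points to.
foreign_key_is_valid = bool(
    client["district_id"].isin(district["district_id"]).all()
)

print(client_id_is_key, district_id_is_key_of_client, foreign_key_is_valid)
```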
62
![Page 67: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/67.jpg)
Joining tables
Main operations
- we need to build single tables that gather information from separate ones
- this is done via join operations
  - identifying matching entities in two different tables
  - generating tables which combine variables from said tables for matched entities

Example
- joining client information with account information
- joining client information with district information
63
![Page 68: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/68.jpg)
Basic case
Principle
- input
  - two tables A and B
  - A contains a variable V which is a foreign key
  - B contains the same variable V as its primary key
- result
  - a table with all the variables in A and B (no repeat)
  - such that each entity in A is merged with the entity in B referenced by the value of V
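The principle can be sketched in plain Python, with tables represented as lists of dictionaries. This is an illustration of the idea and not the way pandas or dplyr implement joins; `basic_join` is a hypothetical helper written for this sketch, and it drops the entities of A whose key has no match in B, which is one possible convention among several.

```python
def basic_join(a_rows, b_rows, key):
    """Join table A (where `key` is a foreign key) with table B (where it is the primary key)."""
    # Index B by its primary key: each value must identify exactly one row.
    b_by_key = {}
    for row in b_rows:
        if row[key] in b_by_key:
            raise ValueError(f"{key!r} is not a primary key of B")
        b_by_key[row[key]] = row

    joined = []
    for row in a_rows:
        match = b_by_key.get(row[key])
        if match is not None:
            # All the variables of A and B; the key appears only once.
            joined.append({**row, **match})
    return joined


# Extracts of the client and district tables of the loan application data set.
clients = [
    {"client_id": 1, "gender": "F", "district_id": 18},
    {"client_id": 2, "gender": "M", "district_id": 1},
    {"client_id": 3, "gender": "F", "district_id": 1},
]
districts = [
    {"district_id": 1, "Name": "Hl.m. Praha"},
    {"district_id": 18, "Name": "Pisek"},
]

result = basic_join(clients, districts, "district_id")
for row in result:
    print(row)
```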
64
![Page 69: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/69.jpg)
Example
Client with district information
- join the client table with the district table
- district_id: foreign key in the client table, primary key in the district table
- (part of the) result

client_id  gender  birth_date  district_id  Name         Region           Inhabitants  A5  A6  A7
1          F       1970-12-13  18           Pisek        south Bohemia    70699        60  13  2
2          M       1945-02-04  1            Hl.m. Praha  Prague           1204953      0   0   0
3          F       1940-10-09  1            Hl.m. Praha  Prague           1204953      0   0   0
4          M       1956-12-01  5            Kolin        central Bohemia  95616        65  30  4
5          F       1960-07-03  5            Kolin        central Bohemia  95616        65  30  4

Python

    pd.merge(client, district)

R

    client %>% inner_join(district)

Common semantics: natural join
- common columns/variables are considered as a key
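The natural join semantics can be made explicit by computing the common columns and passing them through the `on` parameter of `pd.merge`. The sketch below assumes small extracts of the client and district tables; `common`, `natural` and `explicit` are names chosen for illustration.

```python
import pandas as pd

client = pd.DataFrame({
    "client_id": [1, 2, 3, 4, 5],
    "gender": ["F", "M", "F", "M", "F"],
    "district_id": [18, 1, 1, 5, 5],
})
district = pd.DataFrame({
    "district_id": [1, 5, 18],
    "Name": ["Hl.m. Praha", "Kolin", "Pisek"],
    "Region": ["Prague", "central Bohemia", "south Bohemia"],
})

# Columns shared by both tables: they play the role of the key.
common = [name for name in client.columns if name in district.columns]

natural = pd.merge(client, district)              # key chosen implicitly
explicit = pd.merge(client, district, on=common)  # same join, key stated

print(common)
print(explicit)
```

Stating the key explicitly is a bit more verbose, but it protects against columns that share a name without sharing a meaning.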
65
![Page 70: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/70.jpg)
Join types
Missing keys
- missing foreign keys
- unreferenced foreign keys
- wrong foreign keys

How to build the join table?
- intersection approach
- missing data approach

Inner join
- most common solution
- keeps only full rows
- discards rows that would have missing values

Outer joins
- produce tables with missing data
- full or asymmetric (left or right) joins
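Before choosing a join type, it can help to locate the problematic keys. A possible approach with pandas is a full outer join with `indicator=True`, which adds a `_merge` column telling where each row comes from. The tables below are hypothetical: client 3 has a missing `district_id` and client 4 refers to a district (99) that does not exist, both invented here to trigger each situation.

```python
import pandas as pd

# Hypothetical client extract: one missing and one wrong foreign key.
clients = pd.DataFrame({
    "client_id": [1, 2, 3, 4],
    "district_id": [18, 1, None, 99],
})
# District 5 is referenced by no client in this extract.
districts = pd.DataFrame({
    "district_id": [1, 5, 18],
    "Name": ["Hl.m. Praha", "Kolin", "Pisek"],
})

audit = pd.merge(clients, districts, on="district_id",
                 how="outer", indicator=True)

left_only = audit[audit["_merge"] == "left_only"]
missing_fk = left_only[left_only["district_id"].isna()]   # no key at all
wrong_fk = left_only[left_only["district_id"].notna()]    # key without target
unreferenced = audit[audit["_merge"] == "right_only"]     # key never used

print(audit)
```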
66
![Page 72: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/72.jpg)
All cases
Left table

x   y
-3  1
4   2
5   NA

Missing foreign key

Right table

y  z
1  a
2  b
3  c

Unreferenced key

Inner join

x   y  z
-3  1  a
4   2  b

Only full rows

Full outer join

x   y   z
-3  1   a
4   2   b
5   NA  NA
NA  3   c

All combinations

Left outer join

x   y   z
-3  1   a
4   2   b
5   NA  NA

All rows from the left table

Right outer join

x   y  z
-3  1  a
4   2  b
NA  3  c

All rows from the right table
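The four cases above can be reproduced with the `how` parameter of `pd.merge`. This is a minimal sketch using the left and right tables of the slide; the dictionary `joins` is a name chosen here for illustration.

```python
import pandas as pd

# Left table: the last row has a missing foreign key.
left = pd.DataFrame({"x": [-3, 4, 5], "y": [1, 2, None]})
# Right table: y = 3 is referenced by no row of the left table.
right = pd.DataFrame({"y": [1, 2, 3], "z": ["a", "b", "c"]})

joins = {
    how: pd.merge(left, right, on="y", how=how)
    for how in ["inner", "outer", "left", "right"]
}

for how, table in joins.items():
    print(f"{how} join: {len(table)} rows")
    print(table)
```

One point to keep in mind with pandas: when both tables contain missing values in the key, those rows are matched with each other, which differs from the usual SQL behaviour. Here only the left table has a missing key, so the results agree with the tables above.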
67
![Page 76: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/76.jpg)
Implementation
Python
- merge function
- pd.merge(left, right): inner join
- parameters
  - how: join type among 'left', 'right', 'outer', 'inner'
  - on: column name(s) for the join
  - many others

R
- merge in base R
- dplyr:
  - several functions with explicit names: inner_join, full_join, left_join, right_join
  - by parameter: column name(s) for the join

    left %>% full_join(right)
68
![Page 77: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/77.jpg)
Multiple joins
Loan dataI table with clients and
accountsI difficulties
I link tableI duplicate variable name
district_id
SolutionI two joinsI columns renaming
Pythonda = pd.merge(disposition, account)da.rename(columns={
'district_id':'acc_district_id'},
inplace=True)fulldata = pd.merge(da, client)
Rdisposition %>% inner_join(account) %>%
rename(acc_district_id=district_id) %>%
inner_join(client)
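The need for the renaming step can be checked on a tiny extract of the three tables. The sketch below keeps only clients 1 and 12 together with their accounts (1 and 9): client 12 lives in district 40 while the associated account belongs to district 70. The names `naive` and `renamed` are chosen here for illustration.

```python
import pandas as pd

disposition = pd.DataFrame({
    "disp_id": [1, 12], "client_id": [1, 12],
    "account_id": [1, 9], "type": ["OWNER", "OWNER"],
})
account = pd.DataFrame({
    "account_id": [1, 9], "district_id": [18, 70],
    "date": ["1995-03-24", "1993-01-27"],
})
client = pd.DataFrame({
    "client_id": [1, 12], "gender": ["F", "M"],
    "birth_date": ["1970-12-13", "1981-02-20"], "district_id": [18, 40],
})

da = pd.merge(disposition, account)  # natural join on account_id

# Without renaming, client_id AND district_id are both used as the key:
# client 12 is silently dropped because the two districts differ.
naive = pd.merge(da, client)

# With renaming, only client_id is common to both tables.
renamed = da.rename(columns={"district_id": "acc_district_id"})
fulldata = pd.merge(renamed, client)

print(len(naive), len(fulldata))
print(fulldata)
```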
69
![Page 79: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/79.jpg)
Result
disp_id  client_id  account_id  type
1        1          1           OWNER
2        2          2           OWNER
3        3          2           DISPONENT
4        4          3           OWNER
5        5          3           DISPONENT
6        6          4           OWNER
7        7          5           OWNER
8        8          6           OWNER
9        9          7           OWNER
10       10         8           OWNER
11       11         8           DISPONENT
12       12         9           OWNER
13       13         10          OWNER
14       14         11          OWNER
15       15         12          OWNER
16       16         12          DISPONENT
17       17         13          OWNER
18       18         13          DISPONENT
19       19         14          OWNER
20       20         15          OWNER

account_id  district_id  date
1           18           1995-03-24
2           1            1993-02-26
3           5            1997-07-07
4           12           1996-02-21
5           15           1997-05-30
6           51           1994-09-27
7           60           1996-11-24
8           57           1995-09-21
9           70           1993-01-27
10          54           1996-08-28
11          76           1995-10-10
12          21           1997-04-15
13          76           1997-08-17
14          47           1996-11-27
15          70           1993-10-02
16          12           1997-09-23
17          1            1997-01-08
18          43           1993-05-26
19          21           1995-04-07
20          74           1996-08-24
70
![Page 80: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/80.jpg)
Result
disp_id  client_id  account_id  type       acc_district_id  date
1        1          1           OWNER      18               1995-03-24
2        2          2           OWNER      1                1993-02-26
3        3          2           DISPONENT  1                1993-02-26
4        4          3           OWNER      5                1997-07-07
5        5          3           DISPONENT  5                1997-07-07
6        6          4           OWNER      12               1996-02-21
7        7          5           OWNER      15               1997-05-30
8        8          6           OWNER      51               1994-09-27
9        9          7           OWNER      60               1996-11-24
10       10         8           OWNER      57               1995-09-21
11       11         8           DISPONENT  57               1995-09-21
12       12         9           OWNER      70               1993-01-27
13       13         10          OWNER      54               1996-08-28
14       14         11          OWNER      76               1995-10-10
15       15         12          OWNER      21               1997-04-15
16       16         12          DISPONENT  21               1997-04-15
17       17         13          OWNER      76               1997-08-17
18       18         13          DISPONENT  76               1997-08-17
19       19         14          OWNER      47               1996-11-27
20       20         15          OWNER      70               1993-10-02
71
![Page 81: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/81.jpg)
Result
client_id  type       acc_district_id  date
1          OWNER      18               1995-03-24
2          OWNER      1                1993-02-26
3          DISPONENT  1                1993-02-26
4          OWNER      5                1997-07-07
5          DISPONENT  5                1997-07-07
6          OWNER      12               1996-02-21
7          OWNER      15               1997-05-30
8          OWNER      51               1994-09-27
9          OWNER      60               1996-11-24
10         OWNER      57               1995-09-21
11         DISPONENT  57               1995-09-21
12         OWNER      70               1993-01-27
13         OWNER      54               1996-08-28
14         OWNER      76               1995-10-10
15         OWNER      21               1997-04-15
16         DISPONENT  21               1997-04-15
17         OWNER      76               1997-08-17
18         DISPONENT  76               1997-08-17
19         OWNER      47               1996-11-27
20         OWNER      70               1993-10-02

client_id  gender  birth_date  district_id
1          F       1970-12-13  18
2          M       1945-02-04  1
3          F       1940-10-09  1
4          M       1956-12-01  5
5          F       1960-07-03  5
6          M       1919-09-22  12
7          M       1929-01-25  15
8          F       1938-02-21  51
9          M       1935-10-16  60
10         M       1943-05-01  57
11         F       1950-08-22  57
12         M       1981-02-20  40
13         F       1974-05-29  54
14         F       1942-06-22  76
15         F       1918-08-28  21
16         M       1919-02-25  21
17         M       1934-10-13  76
18         F       1931-04-05  76
19         M       1942-12-28  47
20         M       1979-01-04  46
72
![Page 82: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/82.jpg)
Result
disp_id  client_id  account_id  type       acc_district_id  date        gender  birth_date  district_id
1        1          1           OWNER      18               1995-03-24  F       1970-12-13  18
2        2          2           OWNER      1                1993-02-26  M       1945-02-04  1
3        3          2           DISPONENT  1                1993-02-26  F       1940-10-09  1
4        4          3           OWNER      5                1997-07-07  M       1956-12-01  5
5        5          3           DISPONENT  5                1997-07-07  F       1960-07-03  5
6        6          4           OWNER      12               1996-02-21  M       1919-09-22  12
7        7          5           OWNER      15               1997-05-30  M       1929-01-25  15
8        8          6           OWNER      51               1994-09-27  F       1938-02-21  51
9        9          7           OWNER      60               1996-11-24  M       1935-10-16  60
10       10         8           OWNER      57               1995-09-21  M       1943-05-01  57
11       11         8           DISPONENT  57               1995-09-21  F       1950-08-22  57
12       12         9           OWNER      70               1993-01-27  M       1981-02-20  40
13       13         10          OWNER      54               1996-08-28  F       1974-05-29  54
14       14         11          OWNER      76               1995-10-10  F       1942-06-22  76
15       15         12          OWNER      21               1997-04-15  F       1918-08-28  21
16       16         12          DISPONENT  21               1997-04-15  M       1919-02-25  21
17       17         13          OWNER      76               1997-08-17  M       1934-10-13  76
18       18         13          DISPONENT  76               1997-08-17  F       1931-04-05  76
19       19         14          OWNER      47               1996-11-27  M       1942-12-28  47
20       20         15          OWNER      70               1993-10-02  M       1979-01-04  46
73
![Page 83: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/83.jpg)
Advanced join topics
Support for real world issues
- key selection
- variable renaming
- filtering join (dplyr R only)
- enforcing/checking uniqueness of keys (python only)
- index based join (python only)
- etc.
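Some of these features can be illustrated with parameters of `pd.merge` and with `DataFrame.join`. The sketch below is one possible reading of the list above, on small extracts of the loan application tables: `on` and `suffixes` for key selection and variable renaming, `validate` for checking the uniqueness of keys, and `join` for an index based join. The result names are chosen here for illustration.

```python
import pandas as pd

client = pd.DataFrame({"client_id": [1, 2, 3], "district_id": [18, 1, 1]})
district = pd.DataFrame({"district_id": [1, 18],
                         "Name": ["Hl.m. Praha", "Pisek"]})
account = pd.DataFrame({"account_id": [1, 2], "district_id": [18, 1]})
disposition = pd.DataFrame({"client_id": [1, 2, 3], "account_id": [1, 2, 2]})

# Key selection and variable renaming: join on client_id only and let
# pandas rename the two district_id columns with suffixes.
da = pd.merge(disposition, account, on="account_id")
both = pd.merge(da, client, on="client_id",
                suffixes=("_account", "_client"))

# Checking uniqueness of keys: many clients may share one district...
checked = pd.merge(client, district, on="district_id",
                   validate="many_to_one")
# ...so a one-to-one requirement is rejected with a MergeError.
try:
    pd.merge(client, district, on="district_id", validate="one_to_one")
    one_to_one = True
except pd.errors.MergeError:
    one_to_one = False

# Index based join: the district table is indexed by its primary key.
by_index = client.join(district.set_index("district_id"), on="district_id")

print(both)
print(one_to_one)
print(by_index)
```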
74
![Page 84: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/84.jpg)
Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
http://creativecommons.org/licenses/by-sa/4.0/
75
![Page 85: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/85.jpg)
Version
Last git commit: 2019-12-11
By: Fabrice Rossi ([email protected])
Git hash: af2cee4da140c15fb47c0ff45a9ff8b1028fcbd0
76
![Page 86: Data Manipulation - Fabrice Rossiapiacoa.org/publications/teaching/data-science/data-manipulation.pdfData Management Data manipulation software I typical examples: R with tidyverse](https://reader034.fdocuments.net/reader034/viewer/2022042323/5f0d482f7e708231d4399279/html5/thumbnails/86.jpg)
Changelog
- November 2019: added multiple table data
- October 2019: initial version
77