An R package for exploratory data
analysis for teaching and research
François Husson, Julie Josse & Sébastien Lê
Why?
• To carry out exploratory multivariate data analysis with free software
• To make it possible to propose new methods (taking into account different structures on the data)
• To offer a user-friendly package oriented towards practitioners (with a very easy GUI)

1 – The classical methods
The methods implemented share the same main objective: to summarize and simplify the data by reducing the dimensionality of the dataset.
• Continuous variables: Principal Component Analysis (PCA)
• Contingency table: Correspondence Analysis (CA)
• Categorical variables: Multiple Correspondence Analysis (MCA)
• Continuous and categorical variables: Mixed Data Analysis

PCA example
Data: performances of 41 athletes during two decathlon meetings

| Athlete | 100m | Long.jump | Shot.put | High.jump | 400m | 110m.hurdle | Discus | Pole.vault | Javeline | 1500m | Rank | Points | Competition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SEBRLE | 11.04 | 7.58 | 14.83 | 2.07 | 49.81 | 14.69 | 43.75 | 5.02 | 63.19 | 291.70 | 1 | 8217 | Decastar |
| CLAY | 10.76 | 7.40 | 14.26 | 1.86 | 49.37 | 14.05 | 50.72 | 4.92 | 60.15 | 301.50 | 2 | 8122 | Decastar |
| KARPOV | 11.02 | 7.30 | 14.77 | 2.04 | 48.37 | 14.09 | 48.95 | 4.92 | 50.31 | 300.20 | 3 | 8099 | Decastar |
| BERNARD | 11.02 | 7.23 | 14.25 | 1.92 | 48.93 | 14.99 | 40.87 | 5.32 | 62.77 | 280.10 | 4 | 8067 | Decastar |
| YURKOV | 11.34 | 7.09 | 15.19 | 2.10 | 50.42 | 15.31 | 46.26 | 4.72 | 63.44 | 276.40 | 5 | 8036 | Decastar |
| Sebrle | 10.85 | 7.84 | 16.36 | 2.12 | 48.36 | 14.05 | 48.72 | 5.00 | 70.52 | 280.01 | 1 | 8893 | OlympicG |
| Clay | 10.44 | 7.96 | 15.23 | 2.06 | 49.19 | 14.13 | 50.11 | 4.90 | 69.71 | 282.00 | 2 | 8820 | OlympicG |
| Karpov | 10.50 | 7.81 | 15.93 | 2.09 | 46.81 | 13.97 | 51.65 | 4.60 | 55.54 | 278.11 | 3 | 8725 | OlympicG |
| Macey | 10.89 | 7.47 | 15.73 | 2.15 | 48.97 | 14.56 | 48.34 | 4.40 | 58.46 | 265.42 | 4 | 8414 | OlympicG |
| Warners | 10.62 | 7.74 | 14.48 | 1.97 | 47.97 | 14.01 | 43.73 | 4.90 | 55.39 | 278.05 | 5 | 8343 | OlympicG |

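As a rough illustration of what a standardized PCA on this kind of table involves, the sketch below uses only base R (`prcomp`), not FactoMineR itself. It is built on the ten rows visible on the slide, so the percentages of variance will not match those printed on the factor maps, which come from all 41 athletes. Rank, Points and Competition are left out of the active analysis here, on the assumption that they play the supplementary role described in the following slides.

```r
# Ten rows transcribed from the slide (the full dataset has 41 athletes).
athletes <- c("SEBRLE", "CLAY", "KARPOV", "BERNARD", "YURKOV",
              "Sebrle", "Clay", "Karpov", "Macey", "Warners")

events <- data.frame(
  X100m        = c(11.04, 10.76, 11.02, 11.02, 11.34, 10.85, 10.44, 10.50, 10.89, 10.62),
  Long.jump    = c(7.58, 7.40, 7.30, 7.23, 7.09, 7.84, 7.96, 7.81, 7.47, 7.74),
  Shot.put     = c(14.83, 14.26, 14.77, 14.25, 15.19, 16.36, 15.23, 15.93, 15.73, 14.48),
  High.jump    = c(2.07, 1.86, 2.04, 1.92, 2.10, 2.12, 2.06, 2.09, 2.15, 1.97),
  X400m        = c(49.81, 49.37, 48.37, 48.93, 50.42, 48.36, 49.19, 46.81, 48.97, 47.97),
  X110m.hurdle = c(14.69, 14.05, 14.09, 14.99, 15.31, 14.05, 14.13, 13.97, 14.56, 14.01),
  Discus       = c(43.75, 50.72, 48.95, 40.87, 46.26, 48.72, 50.11, 51.65, 48.34, 43.73),
  Pole.vault   = c(5.02, 4.92, 4.92, 5.32, 4.72, 5.00, 4.90, 4.60, 4.40, 4.90),
  Javeline     = c(63.19, 60.15, 50.31, 62.77, 63.44, 70.52, 69.71, 55.54, 58.46, 55.39),
  X1500m       = c(291.70, 301.50, 300.20, 280.10, 276.40, 280.01, 282.00, 278.11, 265.42, 278.05),
  row.names = athletes
)

# Standardized PCA: each event is centered and scaled to unit variance,
# so that events measured in seconds and in metres weigh the same.
pca <- prcomp(events, scale. = TRUE)

# Percentage of variance carried by each dimension.
pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)
print(round(pct[1:2], 2))

# Coordinates of the athletes on the first two dimensions.
print(round(pca$x[, 1:2], 2))
```

Scaling matters here because the events are expressed in different units; without it, the 1500m times would dominate the analysis simply because of their larger numeric range.
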
PCA example
Introduction of supplementary information:
• supplementary continuous variables
Graphs enriched by:
• representing the variables according to their quality of representation
Indicators:
• contribution
• quality of representation
[Figure: Variables factor map (PCA). Axes: Dimension 1 (32.72%), Dimension 2 (17.37%). Variables shown: 100m, Long.jump, Shot.put, High.jump, 400m, 110m.hurdle, Discus, Javeline, 1500m, Rank, Points.]

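The two indicators named above can be computed by hand from any standardized PCA. The following base-R sketch is an illustration under assumptions, not FactoMineR's implementation: it uses the built-in `iris` measurements as stand-in data, and the object names `coord`, `cos2` and `contrib` are chosen here for readability.

```r
X <- iris[, 1:4]                      # stand-in data: four continuous variables
pca <- prcomp(X, scale. = TRUE)       # standardized PCA

# Coordinates of the variables: loadings scaled by the component standard
# deviations. In a standardized PCA these equal the correlations between
# each variable and each component.
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# Quality of representation: squared coordinate (squared cosine).
cos2 <- coord^2

# Contribution (in %) of each variable to the construction of each component.
contrib <- 100 * sweep(cos2, 2, pca$sdev^2, `/`)

print(round(coord[, 1:2], 2))
print(round(cos2[, 1:2], 2))
print(round(contrib[, 1:2], 1))
```

A variable whose squared cosines on the first two dimensions sum to a value close to 1 is well represented on the plane; one with a low sum lies mostly outside it and should not be interpreted from the map.
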
PCA example
Introduction of supplementary information:
• supplementary individuals
• supplementary categorical variables
Graphs enriched by:
• coloring according to supplementary information
• confidence ellipses around the categories
Indicators:
• contribution
• quality of representation
[Figure: Individuals factor map (PCA). Axes: Dimension 1 (32.72%), Dimension 2 (17.37%). Athletes are labelled by name. Legend: Decastar, OlympicG.]

PCA example
[Figure: individuals factor map. Axes: Dimension 1 (32.71 %), Dimension 2 (17.37 %). Athletes are labelled by name. Legend: Decastar, OlympicG. Label: Nb points.]

PCA example
[Figure: individuals factor map. Axes: Dimension 1 (32.71 %), Dimension 2 (17.37 %). Athletes are labelled by name. Legend: Decastar, OlympicG. Label: Pole.vault.]

Description of the dimensions
By the quantitative variables:
• The correlation between each variable and the coordinates of the individuals on axis s is calculated
• The correlation coefficients are sorted
• Only the significant correlations are given (significance level = 0.05)

$Dim.1$quanti

| Variable | Dim.1 |
|---|---|
| Points | 0.96 |
| Long.jump | 0.74 |
| Shot.put | 0.62 |
| Rank | -0.67 |
| 400m | -0.68 |
| 110m.hurdle | -0.75 |
| 100m | -0.77 |

$Dim.2$quanti

| Variable | Dim.2 |
|---|---|
| Discus | 0.61 |
| Shot.put | 0.60 |

Best variable to describe the 1st dimension: Points (correlation 0.96)

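The three steps listed above (correlate, sort, keep the significant ones) can be sketched in base R. This is an illustrative sketch rather than the package's own code: it uses the built-in `iris` data instead of the decathlon data, and `describe_axis` is a hypothetical helper name introduced for this example.

```r
X <- iris[, 1:4]                          # stand-in continuous variables
scores <- prcomp(X, scale. = TRUE)$x      # coordinates of the individuals

# For one axis: correlation of every variable with the coordinates,
# a significance test for each correlation, then a sorted, filtered result.
describe_axis <- function(axis, alpha = 0.05) {
  r <- sapply(X, function(v) cor(v, scores[, axis]))
  p <- sapply(X, function(v) cor.test(v, scores[, axis])$p.value)
  sort(r[p < alpha], decreasing = TRUE)
}

print(round(describe_axis(1), 2))
print(round(describe_axis(2), 2))
```

The sign of a principal component is arbitrary, so the signs of the correlations may flip between software or platforms; what matters for interpretation is which variables sit at opposite ends of the sorted list.
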
Description of the dimensions
By the qualitative variables:
• Perform a one-way analysis of variance with the coordinates of the individuals on the axis explained by the qualitative variable
• An F-test for each variable
• For each category, a Student's t-test to compare the average of the category with the general mean
Significance level = 0.2

$Dim.1$quali

| Variable | P-value |
|---|---|
| Competition | 0.155 |

$Dim.1$category

| Category | Estimate | P-value |
|---|---|---|
| OlympicG | 0.4393 | 0.155 |
| Decastar | -0.4393 | 0.155 |

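One way to obtain an F-test per variable together with per-category estimates that sum to zero, as in the output shown on this slide, is a linear model with sum-to-zero contrasts. The base-R sketch below is an assumption about how such numbers can be produced, not a description of the package internals; it uses `iris` with `Species` standing in for the Competition variable.

```r
scores <- prcomp(iris[, 1:4], scale. = TRUE)$x
dat <- data.frame(coord = scores[, 1], Species = iris$Species)

# One-way analysis of variance: coordinates on the axis explained by the
# categorical variable. Sum-to-zero contrasts make each coefficient the
# deviation of a category from the general mean.
fit <- lm(coord ~ Species, data = dat,
          contrasts = list(Species = "contr.sum"))

# F-test for the variable as a whole.
f_p <- anova(fit)[["Pr(>F)"]][1]

# Estimates for the first k - 1 categories come from the model; the last
# one is minus their sum, because the deviations add up to zero.
est <- coef(fit)[-1]
est_all <- c(est, -sum(est))
names(est_all) <- levels(dat$Species)

# t-test p-values for the categories estimated directly by the model.
t_p <- summary(fit)$coefficients[-1, "Pr(>|t|)"]

print(f_p)
print(round(est_all, 4))
print(t_p)
```

With only two categories, as for Competition, the two estimates are opposite numbers and share the same p-value, which is the pattern visible in the slide's output.
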
2 – Structure on the data
Different structures on the data are proposed:
• a partition on the variables: several sets of variables are studied simultaneously (Multiple Factor Analysis, Generalized Procrustes Analysis)
• a hierarchy on the variables: variables are grouped and subgrouped, as in questionnaires structured in topics and subtopics (Hierarchical Multiple Factor Analysis)
• a partition on the individuals: several sets of individuals described by the same variables (Dual Multiple Factor Analysis)

Groups of variables (MFA)
Groups of variables are quantitative and/or qualitative.
Objectives:
• study the link between the sets of variables
• balance the influence of each group of variables
• give the classical graphs but also specific graphs: groups of variables, partial representation
Examples:
• Genomics: DNA, protein
• Sensory analysis: sensory, physico-chemical
• Comparison of coding (quantitative / qualitative)

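The balancing objective above is commonly achieved in MFA by dividing the variables of each group by the square root of the first eigenvalue of that group's separate PCA, then running a global PCA on the weighted table. The base-R sketch below illustrates that weighting for quantitative groups only. The split of `iris` into a "sepal" and a "petal" group is an arbitrary assumption made for the example, and the code is not taken from the package.

```r
Z <- scale(iris[, 1:4])                      # standardized variables
groups <- list(sepal = 1:2, petal = 3:4)     # hypothetical grouping

# Weight each group by 1 / sqrt(first eigenvalue of its own PCA), so that
# no single group can generate the first global dimension by itself.
weighted <- lapply(groups, function(idx) {
  lambda1 <- prcomp(Z[, idx])$sdev[1]^2
  Z[, idx] / sqrt(lambda1)
})

# Global analysis on the concatenated, weighted groups.
global <- prcomp(do.call(cbind, weighted))
print(round(global$sdev^2, 3))
```

After weighting, the first eigenvalue of every group equals 1, so the first eigenvalue of the global analysis lies between 1 and the number of groups; a value close to the number of groups indicates a dimension common to all of them.
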
Hierarchy on the variables (HMFA)
Two levels for the hierarchy: the first level contains L groups, each group l contains J_l subgroups, and each subgroup j contains K_j variables
Objective: to balance the groups and the subgroups of variables

Partition on the individuals (DMFA)
Objective: to compare the covariance matrices
[Diagram: data table with K variables (columns k = 1, ..., K) whose rows are split into J groups of individuals (Group 1 with I_1 individuals, ..., Group J with I_J individuals); x_ik is the value of variable k for individual i.]

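To make the objects being compared concrete, the sketch below computes one covariance matrix and one correlation matrix per group of individuals in base R. It is not DMFA itself, only the raw matrices that such a comparison starts from; `iris` split by `Species` is used as example data because the same variables are measured on each group.

```r
# Same variables, individuals partitioned into groups.
by_group <- split(iris[, 1:4], iris$Species)

# One covariance matrix and one correlation matrix per group.
cov_by_group <- lapply(by_group, cov)
cor_by_group <- lapply(by_group, cor)

print(round(cov_by_group$setosa, 3))
print(round(cor_by_group$versicolor, 2))
```
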
3 – Graphical User Interface
[Screenshot: menu of the FactoMineR GUI]
[Screenshot: main window of the PCA]
[Screenshot: graphical options]

4 – Conclusion
• For researchers, practitioners and students: with classical and advanced methods
• The FactoMineR package is available on CRAN
• The GUI can be loaded simply with:
source("http://factominer.free.fr/install-facto.r")
• A website is dedicated to this package: http://factominer.free.fr
• Future: dynamic graphs
• Perspectives: UseR!2008 (2 tutorials), UseR!2009 in Rennes

MFA example: representation of the individuals
[Figure: individuals factor map. Axes: Dim 1 (49.38 %), Dim 2 (19.49 %). Individuals: 2EL, 1CHA, 1FON, 1VAU, 1DAM, 2BOU, 1BOI, 3EL, DOM1, 1TUR, 4EL, PER1, 2DAM, 1POY, 1ING, 1BEN, 2BEA, 1ROC, 2ING, T1, T2.]

MFA example: representation of the variables
[Figure: correlation circle. Axes: Dim 1 (49.38 %), Dim 2 (19.49 %). Legend (groups of variables): odor, visual, odor.after.shak., taste, overall. Variables shown include Odor.Intensity.before.shaking, Aroma.quality.before.shaking, Fruity.before.shaking, Flower.before.shaking, Spice.before.shaking, Visual.intensity, Nuance, Surface.feeling, Odor.Intensity, Quality.of.odour, Fruity, Flower, Spice, Plante, Phenolic, Aroma.intensity, Aroma.persistency, Aroma.quality, Attack.intensity, Acidity, Astringency, Alcohol, Balance, Smooth, Bitterness, Intensity, Harmony, Overall.quality, Typical.]

MFA example: representation of the groups
[Figure: groups representation. Axes: Dim 1 (49.38 %), Dim 2 (19.49 %), both from 0 to 1. Groups: odor, visual, odor.after.shaking, taste, origin, overall.]

MFA example: representation of the partial points
[Figure: Individual factor map with partial points. Axes: Dim 1 (49.38 %), Dim 2 (19.49 %). Legend (groups): odor, visual, odor.after.shaking, taste.]

MFA example: representation of the partial points
[Figure: partial points of the categories. Axes: Dim 1 (49.38 %), Dim 2 (19.49 %). Categories: Saumur, Bourgueuil, Chinon, Reference, Env1, Env2, Env4. Legend (groups): olf, vis, olfag, gust.]

Unsupervised classification
[Figure: Cluster Dendrogram for Solution HClust.2. Method = ward; Distance = euclidian. Horizontal axis: Observation Number in Data Set don; vertical axis: Height. The tree is cut into 4 classes.]

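A dendrogram of this kind, with Ward's criterion on Euclidean distances and a cut into 4 classes, can be produced in base R as sketched below. The slide does not say exactly which coordinates were clustered, so the example assumes, for illustration only, that the input is the first two principal components of the built-in `iris` measurements.

```r
# Stand-in input: coordinates of the individuals on the first two dimensions.
scores <- prcomp(iris[, 1:4], scale. = TRUE)$x[, 1:2]

# Hierarchical clustering with Ward's criterion on Euclidean distances.
# Recent versions of R name the historical "ward" method "ward.D".
tree <- hclust(dist(scores, method = "euclidean"), method = "ward.D")

# Cut the tree into 4 classes, as on the slide.
classes <- cutree(tree, k = 4)
print(table(classes))

# plot(tree) draws the dendrogram; the classes can then be used to color
# the individuals on a factor map.
```
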
MFA example: representation of the individuals
[Figure: individuals factor map colored by class. Axes: Dim 1 (49.38 %), Dim 2 (19.49 %). Legend: classe1, classe2, classe3, classe4.]

[Figure: Prefmap-PLS graph between Rank and Points. Correlation between Rank and Points: -0.7392. Axes: Rank, Points. Variables shown: X100m, Long.jump, Shot.put, High.jump, X400m, X110m.hurdle, Discus, Pole.vault, Javeline, X1500m.]

MFA example: representation of the variables
[Figure: correlation circle. Axes: Dimension 1 (61.49%), Dimension 2 (16.46%). Variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, each drawn once per legend entry: setosa, versicolor, virginica, var.]

[Figure: Projection of the groups. Axes: Dim 1, Dim 2. Groups: setosa, versicolor, virginica.]

[Figure: Biplot between axes 1 and 2 for group versicolor. Correlation between Dim.1 and Dim.2: 0.09613. Variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.]

[Figure: Individual factor map. Axes: Dim 1 (20.39 %), Dim 2 (13.29 %). Labels: A, GBM, O, OA.]
[Figure: Individual factor map. Axes: Dim 1 (20.39 %), Dim 2 (13.29 %). Labels: A, GBM, O, OA. Legend: CGH, expr.]
[Figure: Individual factor map. Axes: Dim 1 (20.39 %), Dim 2 (13.29 %). Labels: A, GBM, O, OA.]

[Figure: *Dataframe. Scatter plot with X100m and Long.jump as axes; athletes are labelled by name. Legend: Decastar, OlympicG.]

The FactoMineR team is almost always ready to improve the package