Acoustic chemometrics - monitoring of chemical production ...
BASICS OF CHEMOMETRICS
Transcript of BASICS OF CHEMOMETRICS
1
BASICS OF CHEMOMETRICS
Juan Antonio Fernández Pierna
Vincent Baeten
Pierre Dardenne
Walloon Agricultural Research Centre (CRA-W)
Valorisation of Agricultural Products Department
Gembloux, Belgium
2
INTRODUCTION
- Chemometrics Introduction
What is this and why we need it
- Some definitions
- Overview of methods
- Examples
2
3
The use of multivariate analysis in the discipline X:
- Biometrics (used in biology)
- Technometrics (used in engineering)
- Psychrometrics (used in phychology)
- Chemometrics (used in chemistry)
Statistical, mathematical or graphical
technique, considers multiple variables
simultaneously
X - METRICS
4
“Chemometrics is the chemical discipline that
uses mathematics and statistics to design or
select optimal experimental procedures, to
provide maximum relevant chemical
information by analyzing chemical data, and to
obtain knowledge about chemical systems”
D. L. Massart
CHEMOMETRICS – INTRODUCTION
5
Statistics
Food Feed
Organicchemistry
Theoreticaland physical
chemistry
Analyticalchemistry
Organicchemistry
Theoreticaland physical
chemistry
chemistry
Computing
Engineering
BiologyIndustrialMathematics
Chemometrics
Meeting point of various disciplines
CHEMOMETRICS – INTRODUCTION
6
The scientific world today
CHEMOMETRICS – INTRODUCTION
• The data flood generated by modern analytical
instrumentation produces large quantity of numbers
to understand and quantify phenomenons around us.
• A deeper understanding of those methods and tools
for viewing all data simultaneously are needed.
• The evolution of personal computers allows faster
acquisition, processing and interpretation of chemical
data.
• Every scientist uses software related to mathematical
methods or to processing of knowledge.
7
Use of mathematical and statistical methods for
selecting optimal experiments
Statistical experimental design
Design of Experiments (DoE)…
Use of mathematical and statistical methods for
selecting optimal experiments
Statistical experimental design
Design of Experiments (DoE)…
Extracting maximum amount of information when
analysing multivariate (chemical) data
Classification
Process monitoring,
Multivariate calibration…
Extracting maximum amount of information whenwhen
analysing multivariate (chemical) data
Classification
Process monitoring,
Multivariate calibration…
CHEMOMETRICS – INTRODUCTION
8
CHEMOMETRIC
S Analytical
request
Analytical
method
Analytical
answer
CHEMOMETRICS
Useful at any point in an analysis, from the first
conception of an experiment until the data is
discarded.
CHEMOMETRICS – INTRODUCTION
9
Huge growth area in past 15 years
• Process Control and analysis
• Food and feed analysis
• Biology – metabolomics etc
• Environmental monitoring
• Analytical Chemistry
CHEMOMETRICS – APPLICATIONS
10
Linear algebra is the language of Chemometrics. One cannot expect
to truly understand most chemometric techniques without a basic
understanding of linear algebra (Wise and Gallagher, 1998)
Matrix and vector operations
CHEMOMETRICS – DEFINITIONS
11
- Samples are referred to as OBJECTS
CHEMOMETRICS:
Extract meaningful information about the objects and the
variables from data matrices
- Measurement results (e.g. concentration, absorbances, …) are referred
to as VARIABLES
X
K variables
M objects - observations
-A data table of K variables and M objects is referred to as a DATA
MATRIX OF SIZE MxK
CHEMOMETRICS – DEFINITIONS
12
úúúúú
û
ù
êêêêê
ë
é
=
np2n1n
p22221
p11211
x...xx
............
x...xx
x...xx
X
Variables
Sam
ple
s
If X has 3 rows and 5 columns
3 x 5 data matrix
1200 1400 1600 1800 2000 2200 2400-0.1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Wavelength
Absorb
ance
1200 1400 1600 1800 2000 2200 2400-0.1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Wavelength
Absorb
ance
1200 1400 1600 1800 2000 2200 2400-0.1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Wavelength
Absorb
ance
CHEMOMETRICS – DEFINITIONS
13
úúúúú
û
ù
êêêêê
ë
é
npnn
p
p
xxx
xxx
xxx
...
............
...
...
21
22221
11211
In summary…
X
K variables
M objects
=
CHEMOMETRICS – INTRODUCTION
14
Ø Sampling, selection of objects and variables
Ø Clustering
Ø Multivariate regressions, calibrations and predictions
Ø Neural Networks
Ø Validation
Ø Graphical display and outlier detection
Source: Chemometrics – Introduction - Jens C. Frisvad
CHEMOMETRICS – IMPORTANT DISCIPLINES
15
Selection optimal complexity
Model building
Supervised analysis
Regression models
Model construction
Calibration
Calibration data set (X) Sample Selection
Visualization
Unsupervised analysis
Principal Component Analysis
(PCA)
outlier detection
Data matrix pre-treatment
clustering tendency
Data Exploration
Pattern Recognition
Outliers Prediction
New data
Uncertainty estimation
Validation
16
Population sources of variation:
- origins of samples
- processes
- varieties
- storage conditions
- sample preparations (t°, particle size)
- residual moisture
INCLUDE IN THE CALIBRATION SET ALL THE FACTORS OF VARIATION:
Measurement factors:
- room temperature
- operator
- instrument setup
SAMPLE SELECTION
17
Grouping of objects e.g. how similar is the behaviour of compounds,
how similar are products…
Looking at relationships:
Between samples: food samples / patients / people / spectra / …
Between variables: elemental compositions / compound
concentrations / spectral peaks / …
PATTERN RECOGNITION
PCA
Principal Component Analysis
18
0.20.25
0.30.35
0.4
0.4
0.5
0.6
0.7
0.8-5
-4
-3
-2
-1
x 10-3
PCA is a chemometric technique for visualizing high
dimensional data. PCA reduces the dimensionality (the
number of variables) of a data set by maintaining as much
variance as possible.
PC1
PC2
-3 -2 -1 0 1 2 3 4-1.2
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
PC 1 (96.09%)
PC
2 (
3.4
2%
)
PC1 PC2
PRINCIPAL COMPONENT ANALYSIS
19
-3 -2 -1 0 1 2 3 4-0.25
-0.2
-0.15
-0.1
-0.05
0
0.05
0.1
0.15
0.2
Scores on PC 1 (96.09%)
Score
s o
n P
C 3
(0.3
4%
)
Samples/Scores Plot of Xoffset
Object space
-3 -2 -1 0 1 2 3 4-0.25
-0.2
-0.15
-0.1
-0.05
0
0.05
0.1
0.15
0.2
Scores on PC 1 (96.09%)
Score
s o
n P
C 3
(0.3
4%
)
Samples/Scores Plot of Xoffset
Clusters
Outliers
-3 -2 -1 0 1 2 3 4-0.25
-0.2
-0.15
-0.1
-0.05
0
0.05
0.1
0.15
0.2
Scores on PC 1 (96.09%)
Score
s o
n P
C 3
(0.3
4%
)
Samples/Scores Plot of Xoffset
PRINCIPAL COMPONENT ANALYSIS
20
Quantitative or qualitative estimation. Especially mixtures.
Univariate calibration : one measurement e.g. a peak hight
Multivariate calibration : several measurements e.g. spectra
Statistical, mathematical or graphical technique, considers
multiple variables simultaneously
CALIBRATION
21
Linking two sets of data together : peak hight to
concentration / Spectra to concentrations / biological
activity to structure / …
MULTIVARIATE CALIBRATION
• Relate instrumental response to chemical concentration.
• Improper calibration yields improper results.
• Linear, non-linear and multivariate calibration algorithms
are available, depending on the instrument and the analysis
involved.
22
X – Data
spectra
Y – Data
Reference values
Calibration
model +
Calibration
- Infrared spectroscopy (IR-Raman)
- Fourier-Transform Infrared spectroscopy (FTIR)
- Nuclear Magnetic Resonance (NMR)
- X-Ray Fluorescence (XRF)
- X-Ray Diffraction (XRD) spectroscopy
- …
- Moisture (regression)
- Concentration of protein (regression)
- country of origin (discrimination)
- PDO (discrimination)
- …
MULTIVARIATE CALIBRATION
23
X – Data
spectra
Y – Data
Ref value
Calibration
model +
X – Data
spectra
Calibration
model
Y – Data
// ref value. +
Calibration
Validation
Calibration uses empirical data and prior knowledge for
determining how to predict unknown quantitative
information y from available measurements x, via some
mathematical transfer functions
MULTIVARIATE CALIBRATION - VALIDATION
24
Y is a matrix nxm containing the reference values
X is a matrix nxp containing the spectra (NIR, MIR,
Raman,…)
B is a matrix pxm containing the regression coeficients
E is a matrix nxm explaining for the model error.
Y=f(X)=XB+E
MULTIVARIATE CALIBRATION
25
Multivariate analysis
Unsupervised
Principal Component
Analysis (PCA)
Cluster Analysis
(CA)
Supervised
Regression
Multiple Linear Regression (MLR)
Principal Component Regression (PCR)
Partial Least Squares (PLS)
Artificial Neural Networks (ANN)
LS–Support Vector Machines (LS-SVM)
Local techniques
Discrimination
PLS-Discriminant Analysis (PLS-DA)
SIMCA
2525
K-Nearest Neighbours (k-NN)
25
Support Vector Machines (SVM)Algorithms 26
Multivariate analysis
Unsupervised
Principal Component
Analysis (PCA)
Cluster Analysis
(CA)
Supervised
Regression
Multiple Linear Regression (MLR)
Principal Component Regression (PCR)
Partial Least Squares (PLS)
Artificial Neural Networks (ANN)
LS–Support Vector Machines (LS-SVM)
Local techniques
Discrimination
PLS-Discriminant Analysis (PLS-DA)
SIMCA
K-Nearest Neighbours (k-NN)
Support Vector Machines (SVM)
Multiple Linear Regression (MLR)
Principal Component Regression (PCR)
Partial Least Squares (PLS)
Artificial Neural Networks (ANN)
LS–Support Vector Machines (LS-SVM)
Local techniques
PLS-Discriminant Analysis (PLS-DA)
SIMCA
K-Nearest Neighbours (k-NN)
Support Vector Machines (SVM)
Supervised
Regression
Discrimination
Unsupervised - Pattern recognition
à Seeks similarities and regularities present in the data.
27
Multivariate analysis
Unsupervised
Principal Component
Analysis (PCA)
Cluster Analysis
(CA)
Supervised
Regression
Multiple Linear Regression (MLR)
Principal Component Regression (PCR)
Partial Least Squares (PLS)
Artificial Neural Networks (ANN)
LS–Support Vector Machines (LS-SVM)
Local techniques
Discrimination
PLS-Discriminant Analysis (PLS-DA)
SIMCA
K-Nearest Neighbours (k-NN)
Support Vector Machines (SVM)
PLS-Discriminant Analysis (PLS-DA)
SIMCA
K-Nearest Neighbours (k-NN)
Support Vector Machines (SVM)
Discrimination
Unsupervised
Principal Component
Analysis (PCA)
Cluster Analysis
(CA)
Supervised - Regression
à It includes any techniques for modeling and analyzing several
variables, when the focus is on the relationship between a dependent
variable (property of interest) and one or more independent
variables (spectra).
28
Multivariate analysis
Unsupervised
Principal Component
Analysis (PCA)
Cluster Analysis
(CA)
Supervised
Regression
Multiple Linear Regression (MLR)
Principal Component Regression (PCR)
Partial Least Squares (PLS)
Artificial Neural Networks (ANN)
LS–Support Vector Machines (LS-SVM)
Local techniques
Discrimination
PLS-Discriminant Analysis (PLS-DA)
SIMCA
K-Nearest Neighbours (k-NN)
Support Vector Machines (SVM)
Multiple Linear Regression (MLR)
Principal Component Regression (PCR)
Partial Least Squares (PLS)
Artificial Neural Networks (ANN)
LS–Support Vector Machines (LS-SVM)
Local techniques
(CA)
Regression
Unsupervised
Principal Component
Analysis (PCA)
Cluster Analysis
(CA)
Supervised - Discrimination
à Regression based statistical technique used in determining which
particular classification or group an item of data or an object belongs to
on the basis of its characteristics or essential features.
29
Find a set of predictor variables which gives a good fit,
predicts the dependent value well and is as small as
possible.
Ø Backward elimination: Start with all the predictors and
potentially drop predictors in subsequent steps.
Ø Forward selection: Start with no predictors and potentially
add predictors in subsequent steps.
Ø Stepwise regression: Combination of backward
elimination and forward selection.
VARIABLE SELECTION
30
VARIABLE SELECTION
31
Reducing number of variables
Possibility of constructing fast spectrometers based on a reduced
number of variables aiming to a reduction of training and
utilization times to be used, for instance, in a conveyer belt.
Moreover, the fact of grouping the different selected variables in
regions allows an easy chemical interpretation of the spectra.
References
- ‘Backward variable selection in PLS (BVSPLS)’; J.A. Fernández
Pierna, V. Baeten, P. Dardenne. Analytica Chimica Acta 642 (2009)
pp. 89-93.
- ‘Prediction error improvements using variable selection on small
calibration sets - a comparison of some recent methods’. S.I.
Overgaard, J.A. Fernández Pierna, V. Baeten, P. Dardenne & T.
Isaksson. J. Near Infrared Spectrosc. 20, 329-337 (2012).
VARIABLE SELECTION
32
CHEMOMETRICS – SOFTWARE
33
Vegetal
Fish
Poultry Bovine
+ Pig
Terrestrial animal
Other
Other
CAL LOOCV
Belonging
to...
% classified as... % classified as...
Fish Rest Fish Rest
Fish 98.6 1.4 96.8 3.2
Rest 9.5 90.5 10 90
FARIMAL
93.4 % correct classification
EXAMPLE: FEED
34
EXAMPLE: FEED
35
SVM discrimination models
‘NIR hyperspectral imaging spectroscopy and chemometrics for the detection of undesirable substances in food and
feed’ J.A. Fernández Pierna, Ph. Vermeulen, O. Amand, A. Tossens, P. Dardenne and V. Baeten. Special issue
Chemometrics and Intelligent Laboratory Systems 117 (2012) 233-239
EXAMPLE: CEREALS - IMPURITIES
36
§ Walnut
§ Hazelnut
§ Brazil nut
§ Almond
§ Cashew
§ Raisin
EXAMPLE: NUTS AND DRIED FRUITS
37
KNN - PCA models
‘Using a visible vision system for on-line determination of quality parameters of olive fruits’. E. Guzmán,
V. Baeten, J.A. Fernández Pierna, J.A. García Mesa. Foof and Nutrition Sciences, 4, 90-98 (2013).
EXAMPLE: OLIVES - IMPURITIES
38
‘Application of low-resolution Raman spectroscopy for the analysis of oxidized olive oil’ E. Guzmán,
V. Baeten, J.A. Fernández Pierna, J.A. García Mesa. (2011) Food control 22, 2036-2040.
EXAMPLE: OLIVE OIL – QUALITY PARAMETERS
39
Detection of the Presence of Hazelnut Oil in Olive Oil by FT-Raman and FT-MIR Spectroscopy' V. Baeten,
J.A. Fernández Pierna, P. Dardenne, M. Meurens, D.L. García González, R. Aparicio. Journal of
Agricultural and Food Chemistry, 53 (16) (2005), 6201-6206.
EXAMPLE: OLIVE OIL – ADULTERATION
40
optical microscopy NIR hyperspectral imaging
SVM discrimination models
‘NIR hyperspectral imaging spectroscopy and chemometrics for the detection of undesirable substances in food and
feed’ J.A. Fernández Pierna, Ph. Vermeulen, O. Amand, A. Tossens, P. Dardenne and V. Baeten. Special issue
Chemometrics and Intelligent Laboratory Systems 117 (2012) 233-239
EXAMPLE: SUGAR BEET PLANTS – CYST
41
EXAMPLE: SUGAR BEET PLANTS – CERCOSPORA
42
Multivariate regression method comparison:
PLS, ANN and LS-SVM
EXAMPLE: FEED PRODUCTS
43
Feed -------------------------------------------- (28676x700)
Ash, Fat, Fibre, Starch, Protein
Feed Ingredients --------------------------- (26652x700)
Ash, Fat, Fibre, Protein
Fresh Silages ------------------------------- (1035x700)
Dry Matter, Fibre, Protein
Soils ------------------------------------------- (1625x700)
CEC, COT_SK, N_Kj
Pre-processing: SNV + detrend + First derivative
EXAMPLE: FEED PRODUCTS
44
Example of data distribution (Soils N_Kj)
EXAMPLE: FEED PRODUCTS
45
Feed - Fat
0
0.2
0.4
0.6
0.8
1
pls ann svm
RM
S
train
val
ks
rd
Feed - fibre
0
0.2
0.4
0.6
0.8
1
1.2
1.4
pls ann svm
RM
S
train
val
ks
rd
Feed - ash
0
0.5
1
1.5
2
2.5
pls ann svm
RM
S
train
val
ks
rd
Feed - starch
0
0.5
1
1.5
2
2.5
3
3.5
pls ann svm
RM
S
train
val
ks
rd
Feed - protein
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
pls ann svm
RM
S
train
val
ks
rd
EXAMPLE: FEED PRODUCTS
46
Feed ingredients - ash
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
pls ann svm
RM
S
train
val
ks
rd
Feed ingredients - fat
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
pls ann svm
RM
S
train
val
ks
rd
Feed ingredients - fibre
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
pls ann svm
train
val
ks
rd
Feed ingredients - protein
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
pls ann svm
RM
S
train
val
ks
rd
EXAMPLE: FEED PRODUCTS
47
Fresh silages - fibre
0
0.5
1
1.5
2
2.5
pls ann svm
RM
S
train
val
ks
rd
Fresh silages - protein
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
pls ann svm
RM
S
train
val
ks
rd
Fresh silages - dry matter
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
pls ann svm
RM
S
train
val
ks
rd
EXAMPLE: FEED PRODUCTS
48
0 10 20 30 40 50 60 700
10
20
30
40
50
60
0 10 20 30 40 50 60 700
10
20
30
40
50
60
CALIBRATION TRANSFER FROM DISPERSIVE
INSTRUMENTS TO HANDHELD SPECTROMETERS (MEMS)
‘Calibration Transfer from Dispersive Instruments to Handheld Spectrometers’, J.A. Fernández
Pierna, P. Vermeulen, B. Lecler, V. Baeten, P. Dardenne. Applied Spectroscopy 64 (6) (2010)
EXAMPLE: TRANSFERT
49
EXAMPLE: TRANSFERT
50
http://www.cra.wallonie.be/
j.fernandez@ cra.wallonie.be
QUESTIONS ABOUT CHEMOMETRICS?
51