
Page 1: (source: campus.univ-lyon1.fr/eaj2016/files/2016/05/Robert-christian.pdf)

FRENCH ACTUARIAL SUMMER SCHOOL

5–8 September 2016 – ISFA, Lyon

Données incomplètes et Apprentissage statistique

Quelques réflexions
(Incomplete data and statistical learning: some reflections)

Christian ROBERT

Page 2:

chaire-dami.fr

Page 3:

Insurance models: the impact of the regulatory and accounting environment on their development and management

Risk measures and performance indicators for insurance risk management

Governance of internal models and attitudes of top management with respect to models

Customer behaviour and risk attitudes

Proxies, model points and advanced simulation techniques for risk management

Page 4:

Data analytics in insurance

Governance for data analytics, new business models with big data and analytics

Risk-based pricing, predictive analytics, machine learning

Privacy concerns, data anonymization, open data

Page 5:

Data types

Labeled data: Data: (Y, X)
Y: labels, response variable, output variable
X: explanatory variables, input variables, co-variables

Unlabeled data: Data: (X)

(Scatter plots in the (X1, X2) plane illustrate both cases.)

Page 6:

Data to be explained and/or to be predicted

Explain: Data: (Y, X)

Predict: Train / Test, Data: (Y, X) and (?, X)

(Scatter plots in the (X1, X2) plane illustrate the train and test samples.)

Page 7:

Imperfect labeled data

Censored data: Data: ((min(Y, C), 1_{Y > C}), X) with Y ⊥ C

Truncated data: Data: (((Y, C) | Y > C), X) with Y ⊥ C

Random wrong label: Data: (Y* = Y 1_{ε = -1} + Ŷ 1_{ε = 1}, X) with Y ⊥ Ŷ

Noisy labeled data with endogenous errors: Data: (Y* = Y + ε, X) with ε not independent of X

Only probabilistic schemes?

(Scatter plots in the (X1, X2) plane illustrate each case.)
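The censoring and truncation schemes above can be sketched in a short simulation (a hypothetical illustration; the exponential rates are arbitrary choices):

```python
# Sketch: generating censored and truncated versions of a latent response Y.
import random

random.seed(0)

def simulate(n=1000):
    # Latent pairs (Y, C) with Y independent of C.
    data = [(random.expovariate(1.0), random.expovariate(0.5)) for _ in range(n)]
    # Censored data: every unit is kept, but we only observe
    # (min(Y, C), 1_{Y > C}) -- the indicator flags censoring.
    censored = [(min(y, c), 1 if y > c else 0) for y, c in data]
    # Truncated data: units with Y <= C never enter the sample at all.
    truncated = [(y, c) for y, c in data if y > c]
    return censored, truncated

censored, truncated = simulate()
```

Censoring keeps every unit but masks part of Y; truncation removes units entirely, so the truncated sample is smaller and naive estimates computed on it are biased.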

Page 8:

Labeled with unlabeled data / Missing values

Some labels Y are not observed; some components of X are not observed.

Missing completely at random: (Y* = Y 1_{ε = -1} + Ø 1_{ε = 1}, X) or (Y, X* = X 1_{ε = -1} + Ø 1_{ε = 1}), with ε ⊥ (Y, X)

Missing at random: the same schemes, with ε independent of the missing value given the observed variables

Missing not at random: (Y* = Y 1_{Y < c} + Ø 1_{Y > c}, X) or (Y, X* = X 1_{Y < c} + Ø 1_{Y > c})

(Scatter plots in the (X1, X2) plane illustrate the missing-data patterns.)
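The three missingness mechanisms can be sketched as follows (a hypothetical simulation; the thresholds and probabilities are arbitrary, and None stands in for Ø):

```python
# Sketch: MCAR, MAR and MNAR missingness imposed on a label Y.
import random

random.seed(1)
n = 2000
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]  # (Y, X)

# MCAR: the missingness indicator is independent of (Y, X).
mcar = [(y if random.random() < 0.7 else None, x) for y, x in data]

# MAR: missingness may depend on the observed X, but not on Y given X.
mar = [(y if x < 1.0 or random.random() < 0.3 else None, x) for y, x in data]

# MNAR: missingness depends on the unobserved value itself (Y > c never seen).
c = 1.0
mnar = [(y if y < c else None, x) for y, x in data]
```

Under MCAR a complete-case analysis remains unbiased; under MNAR the observed values of Y are systematically too small here, so their mean is biased downward.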

Page 9:

When train and test data bases differ

Train / Test

Predict controlled data: Data: (Y, X) and (?, X, Z)

Predict test data with a different generating process: Data: (Y, X) and (?, X), where Y = f1(X) on the train base, Y = f2(X) on the test base, and D(f1, f2) < A

(Scatter plots in the (X1, X2) plane illustrate the two situations.)

Page 10:

SAS Institute

Keeping an Open Mind: Multiculturalism in Data Science

Page 11:

Artificial Intelligence (AI): is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.

Data mining: aims to extract information from a data set and to transform it into an understandable structure for further use. It involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Knowledge discovery in databases (KDD): is the process of discovering useful knowledge from a collection of data. This widely used data mining technique is a process that includes data preparation and selection, data cleaning, incorporating prior knowledge on data sets and interpreting accurate solutions from the observed results.

Page 12:

Machine learning: explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.

Pattern recognition: is a branch of machine learning that focuses on the recognition of patterns and regularities in data. Pattern recognition systems are in many cases trained from labeled "training" data, but when no labeled data are available other algorithms can be used to discover previously unknown patterns.

Statistics: is the science that deals with the collection, classification, analysis, and interpretation of numerical facts or data, and that, by use of mathematical theories of probability, imposes order and regularity on aggregates of more or less disparate elements.

Page 13:

Machine Learning vs Statistics/Econometrics

Subfields: Machine Learning is a subfield of computer science and artificial intelligence which deals with building systems that can learn from data, instead of from explicitly programmed instructions. Statistical Modelling is a subfield of mathematics which deals with finding relationships between variables to predict an outcome.

Data mechanism / data generating process: Machine Learning uses algorithmic models and treats the data mechanism as unknown. Statistical Modelling assumes that the data are generated by a given stochastic data model.

Model choice: Machine Learning focuses on predictive accuracy even in the face of a lack of interpretability of models; model choice is based on cross-validation of predictive accuracy using partitioned data sets. Statistical Modelling focuses on hypothesis testing of causes and effects and on interpretability of models; model choice is based on parameter significance and/or confidence intervals, and on in-sample goodness-of-fit.

Page 14:

Bias-Variance Trade-off
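The trade-off rests on the decomposition MSE = bias² + variance. A minimal numerical check, using a deliberately biased shrinkage estimator of a mean (the parameter values are arbitrary choices for this sketch):

```python
# Sketch: empirical bias-variance decomposition of a shrunken sample mean.
import random

random.seed(2)
mu = 2.0          # true parameter
shrink = 0.8      # shrinkage factor: introduces bias, reduces variance
reps, n = 5000, 20

estimates = []
for _ in range(reps):
    sample = [random.gauss(mu, 1.0) for _ in range(n)]
    estimates.append(shrink * sum(sample) / n)

mean_est = sum(estimates) / reps
bias2 = (mean_est - mu) ** 2
variance = sum((e - mean_est) ** 2 for e in estimates) / reps
mse = sum((e - mu) ** 2 for e in estimates) / reps
# The decomposition holds exactly for these empirical moments.
assert abs(mse - (bias2 + variance)) < 1e-9
```

The shrunken estimator trades a squared bias of about (0.2·µ)² for a variance reduction by the factor 0.8²; whether the trade pays off depends on µ, n and the noise level.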

Page 15:

Mining imperfect data in insurance

Truncated / censored data

Page 16:

Incurred But Not Reported claims (IBNR)
Reported But Not Paid claims (RBNP)
Reported But Not Settled claims (RBNS)

Individual claim process

Mining imperfect data in insurance

Page 17:

Insurance products with several generations of policies / customers

Mining imperfect data in insurance

Page 18:

Mining imperfect data in insurance

Novelty / Fraud detection

Page 19:

Probability schemes:

- Simple: every individual in the sampling frame (i.e., the desired population) has an equal and independent chance of being chosen for the study.
- Stratified: the sampling frame is divided into sub-sections comprising groups that are relatively homogeneous with respect to one or more characteristics, and a random sample is selected from each stratum.
- Cluster: selecting intact groups representing clusters of individuals rather than choosing individuals one at a time.
- Systematic/Interval: choosing individuals from a list by selecting every k-th sampling frame member, where k is the population size divided by the preferred sample size.
- Multi-Stage Random: choosing a sample by combining the random sampling schemes above in multiple stages.
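Three of these probability schemes can be sketched in a few lines (hypothetical helper code; the frame and sample sizes are arbitrary):

```python
# Sketch: simple, stratified and systematic sampling from a small frame.
import random

random.seed(3)
frame = list(range(100))                    # sampling frame
strata = {0: frame[:50], 1: frame[50:]}     # two homogeneous strata

# Simple random sampling: every member has an equal, independent chance.
simple = random.sample(frame, 10)

# Stratified sampling: a random sample drawn from each stratum.
stratified = [u for s in strata.values() for u in random.sample(s, 5)]

# Systematic/interval sampling: every k-th member after a random start,
# with k = population size // desired sample size.
k = len(frame) // 10
start = random.randrange(k)
systematic = frame[start::k]
```

Stratification guarantees each stratum its quota; systematic sampling gives evenly spaced units, which is only safe when the frame order carries no periodic structure.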

Train/test data with different sampling schemes

Mining imperfect data in insurance

Page 20:

Major non-probability schemes:

- Maximum Variation: choosing settings, groups, and/or individuals to maximize the range of perspectives investigated in the study.
- Homogeneous: choosing settings, groups, and/or individuals based on similar or specific characteristics.
- Critical Case: choosing settings, groups, and/or individuals based on specific characteristic(s) because their inclusion provides the researcher with compelling insight about a phenomenon of interest.
- Theory-Based: choosing settings, groups, and/or individuals because their inclusion helps the researcher to develop a theory.
- Confirming/Disconfirming: after beginning data collection, the researcher conducts subsequent analyses to verify or contradict initial results.
- Snowball/Chain: participants are asked to recruit individuals to join the study.
- Extreme: selecting outlying cases and conducting comparative analyses.
- Typical Case: selecting and analyzing average or normal cases.
- Intensity: choosing settings, groups, and/or individuals because their experiences relative to the phenomena of interest are viewed as intense but not extreme.
- Politically Important Case: choosing settings, groups, and/or individuals to be included or excluded based on their political connections to the phenomena of interest.

Page 21:

• Lopez et al. (2015) used an approach based on the IPCW ("Inverse Probability of Censoring Weighting") strategy, which consists in determining a weighting scheme that compensates for the lack of complete observations in the sample.
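A minimal sketch of the IPCW idea (an illustration only, not Lopez et al.'s implementation): weight each uncensored observation by w_i = δ_i / Ĝ(T_i⁻), where Ĝ is a Kaplan-Meier estimate of the censoring survival function P(C > t), obtained by treating censoring as the event. Ties and truncation are ignored in this sketch.

```python
# Sketch: IPCW weights from a Kaplan-Meier estimate of censoring survival.
def km_censoring_survival(times, deltas):
    """Return a step function t -> G_hat(t) = P(C > t); delta=0 marks censoring."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    surv, steps = 1.0, []              # (time, survival just after that time)
    at_risk = len(times)
    for i in order:
        if deltas[i] == 0:             # a censoring "event" at times[i]
            surv *= 1.0 - 1.0 / at_risk
            steps.append((times[i], surv))
        at_risk -= 1
    def G(t):
        g = 1.0
        for u, s in steps:
            if u < t:                  # strict inequality gives the left limit
                g = s
        return g
    return G

def ipcw_weights(times, deltas):
    G = km_censoring_survival(times, deltas)
    # Uncensored observations (delta=1) are up-weighted; censored ones get 0.
    return [d / G(t) if d == 1 else 0.0 for t, d in zip(times, deltas)]

w = ipcw_weights([1, 2, 3, 4, 5], [1, 0, 1, 0, 1])
```

In this toy example the three uncensored observations receive weights 1, 4/3 and 8/3, which sum to the sample size: later uncensored points stand in for the censored units that dropped out before them.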

Tree-based censored regression/Survival random forest

• Random forests have been extended to the survival context by Ishwaran et al. (2008), who prove consistency of the Random Survival Forests (RSF) algorithm under the assumption that all variables are categorical.

• Yang et al. (2010) showed that by incorporating kernel functions into RSF, their algorithm KIRSF achieves better results in many situations.

Data: ((min(Y, C), 1_{Y > C}), X) with Y ⊥ C

(Scatter plot in the (X1, X2) plane.)

Page 22:

One-class classification

One-class classification tries to identify objects of a specific class amongst all objects, by learning from a training set containing only the objects of that class.

It is also known as Outlier detection, Novelty detection, Concept learning, Single class classification, or Unary classification.

(Scatter plots in the (X1, X2) plane illustrate the single class and possible outliers.)

An example is the automatic diagnosis of a disease. It is relatively easy to compile positive data (all patients who are known to have a ‘common’ disease) but negative data may be difficult to obtain since other patients in the database cannot be assumed to be negative cases if they have never been tested, and such tests can be expensive.

Algorithms that can be used:
- One-class Support Vector Machines (OSVMs)
- Neural networks
- Decision trees
- Nearest neighbours
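As a toy stand-in for the listed algorithms, a centroid-distance rule already shows the mechanics of one-class classification: learn a region around the training class and flag everything outside it (the class name and the 95% quantile are assumptions of this sketch):

```python
# Sketch: a centroid-distance one-class classifier, trained on positives only.
import math
import random

random.seed(4)

class OneClassCentroid:
    def __init__(self, quantile=0.95):
        self.quantile = quantile
    def fit(self, X):
        d = len(X[0])
        self.center = [sum(x[j] for x in X) / len(X) for j in range(d)]
        # Radius = the chosen quantile of training distances to the centroid.
        dists = sorted(self._dist(x) for x in X)
        self.radius = dists[int(self.quantile * (len(dists) - 1))]
        return self
    def _dist(self, x):
        return math.dist(x, self.center)
    def predict(self, x):
        # +1: looks like the training class; -1: novelty/outlier.
        return 1 if self._dist(x) <= self.radius else -1

train = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]
clf = OneClassCentroid().fit(train)
```

A real OSVM learns a far more flexible boundary, but the interface is the same: training sees only one class, and prediction separates "like the class" from "novel".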

Page 23:


Semi-supervised learning

It is a class of supervised learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data.

Goal: using both labeled and unlabeled data to build better learners than using either kind alone.

In order to make any use of unlabeled data, some structure of the underlying distribution of the data must implicitly be assumed: the smoothness assumption, the cluster assumption, or the manifold assumption.

Algorithms that can be used:
- self-training models,
- EM with generative mixture,
- co-training,
- transductive support vector machines,
- graph-based methods.
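Self-training, the first approach listed, can be sketched with a nearest-centroid base learner (hypothetical code; the confidence rule and batch size are arbitrary choices of this sketch):

```python
# Sketch: self-training with a nearest-centroid classifier. In each round the
# model labels its most confident unlabeled points and retrains on them.
import math

def centroid(points):
    return [sum(p[j] for p in points) / len(points) for j in range(len(points[0]))]

def self_train(labeled, unlabeled, rounds=5, batch=2):
    labeled = {0: list(labeled[0]), 1: list(labeled[1])}
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        c0, c1 = centroid(labeled[0]), centroid(labeled[1])
        # Confidence = margin between the distances to the two centroids.
        scored = sorted(pool, key=lambda p: -abs(math.dist(p, c0) - math.dist(p, c1)))
        for p in scored[:batch]:
            labeled[0 if math.dist(p, c0) < math.dist(p, c1) else 1].append(p)
            pool.remove(p)
    return labeled

labeled = {0: [(-2.0, 0.0)], 1: [(2.0, 0.0)]}
unlabeled = [(-1.5, 0.1), (-1.0, -0.2), (1.2, 0.0), (1.8, 0.3), (0.1, 0.0)]
result = self_train(labeled, unlabeled)
```

The easy, high-margin points get pseudo-labels first and pull the centroids into better positions before the ambiguous points near the boundary are touched; this is exactly where the cluster assumption is doing the work.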

Page 24:

Several approaches:
- self-training models,
- EM with generative mixture,
- co-training,
- transductive support vector machines,
- graph-based methods.

Page 25:

Page 26:

Page 27:

Page 28:

Learning from Positive and Unlabeled data

(Scatter plot in the (X1, X2) plane.)

One has a set of labeled examples of a given class, and a set of unlabeled examples containing both instances of that class and instances not from it (negative examples).

Goal: Build a classifier to classify the unlabeled examples and/or future (test) data.

Key feature of the problem: no labeled negative training data.

This problem is known as PU-learning.

An example is when a company has a database with details of its customers – positive examples – but no information on those who are not its customers, i.e., no negative examples.

2-step strategy for text classification:
Step 1: identifying a set of reliable negative documents from the unlabeled set.
Step 2: building a sequence of classifiers by iteratively applying a classification algorithm and then selecting a good classifier.
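Step 1 can be sketched with a deliberately crude heuristic (an assumption of this illustration, simpler than the techniques used in the text-classification literature): treat the unlabeled points farthest from the positive centroid as reliable negatives; Step 2 would then train an ordinary classifier on the positives versus these reliable negatives.

```python
# Sketch: picking "reliable negatives" from unlabeled data in PU-learning.
import math

def reliable_negatives(positives, unlabeled, frac=0.3):
    d = len(positives[0])
    center = [sum(p[j] for p in positives) / len(positives) for j in range(d)]
    # Rank unlabeled points by distance to the positive centroid, farthest first.
    ranked = sorted(unlabeled, key=lambda u: math.dist(u, center), reverse=True)
    return ranked[:max(1, int(frac * len(unlabeled)))]

P = [(0.0, 0.0), (0.2, -0.1), (-0.1, 0.3)]
U = [(0.1, 0.1), (3.0, 3.2), (0.0, -0.2), (4.1, 3.8), (2.9, 4.0)]
rn = reliable_negatives(P, U)
```

The key property is the one stated above: no labeled negatives exist, so the negative training set must be manufactured from the unlabeled pool before any standard classifier can be applied.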