Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de...

29
Data Mining - Overview Dr. Jean-Michel RICHER 2018 Dr. Jean-Michel RICHER Data Mining - Overview 1 / 27

Transcript of Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de...

Page 1: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Data Mining - Overview

Dr. Jean-Michel RICHER

[email protected]

Dr. Jean-Michel RICHER Data Mining - Overview 1 / 27

Page 2: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Outline

1. Introduction

2. What will we cover ?

3. Skills required and to acquire

4. Evaluation

Dr. Jean-Michel RICHER Data Mining - Overview 2 / 27

Page 3: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

1. Introduction

Dr. Jean-Michel RICHER Data Mining - Overview 3 / 27

Page 4: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Subject

Definition 1Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful informa-tion from data.

Definition 2Data Mining is the computing process of discovering pat-terns in large data sets involving methods at the intersec-tion of machine learning, statistics, and database systems.

Dr. Jean-Michel RICHER Data Mining - Overview 4 / 27

Page 5: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Subject

Definition 1Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful informa-tion from data.

Definition 2Data Mining is the computing process of discovering pat-terns in large data sets involving methods at the intersec-tion of machine learning, statistics, and database systems.

Dr. Jean-Michel RICHER Data Mining - Overview 4 / 27

Page 6: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

What do they have in common ?

Common termsdiscovery, extraction (not obvious)useful information, patterns, rules (form of the usefulinformation)data, database (potentially a huge amount of data)statistics, machine learning, ... (methods)

Dr. Jean-Michel RICHER Data Mining - Overview 5 / 27

Page 7: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

What is data mining ?

DefinitionData Mining is a branch of computer science which is ei-ther

a knowledge discovery process of hidden or nontrivial information (ex: supermarket rules)or a knowledge synthesis process used to improve theunderstanding of raw data (ex: flower classification)

Dr. Jean-Michel RICHER Data Mining - Overview 6 / 27

Page 8: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Data definition

More generally

given a set of Objects X = { ~X1, . . . ,~Xn} (persons,

customers, patients, plants, ...)defined by a set of properties P = {P1, . . . ,Pk} (length,size, weight, color, ...)each Pi having discrete or continuous valueswe want to find

I patterns / rules than can explain the relationshipsbetween objects and their properties

I classify objects

Dr. Jean-Michel RICHER Data Mining - Overview 7 / 27

Page 9: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

What are properties ?

A note on propertiesProperties can be

quantitative: a numerical valuequalitative: a text

ExampleFor temperatures:

quantitative and continuous: [−20◦C,+40◦C]

qualitative and discrete: very cold, cold, ..., hot, veryhot

Dr. Jean-Michel RICHER Data Mining - Overview 8 / 27

Page 10: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

From data to knowledge

Objects P1 P2 ... Pk

~X1 x11 x2

1 ... xk1

...~Xn x1

n x2n ... xk

n

if (x1i > 10) and (x3

i == true) then x7i = humid

x1i = 1.23× x2

i + 2.56x3i

c1 = { ~X1,~X3},c2 = { ~X2,

~X4,~X5}

Dr. Jean-Michel RICHER Data Mining - Overview 9 / 27

Page 11: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Example (1/3)

Truth tableConsider the following truth table in logic (0 = false, 1 =true):

X Y Z f (X ,Y , Z)0 0 0 00 0 1 00 1 0 00 1 1 11 0 0 01 0 1 11 1 0 11 1 1 1

How can you summarize the function f (X ,Y , Z) ?

Dr. Jean-Michel RICHER Data Mining - Overview 10 / 27

Page 12: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Example (2/3)

Boolean expressionYou can compute the boolean expression of f (X ,Y , Z) as :

f (X ,Y , Z) = X .Y .Z + X .Y .Z + X .Y .Z + X .Y .Z

Majority functionIt is the majority function, such that f (X ,Y , Z) = 1 if thenumber of variables (X ,Y , Z) set to 1 is greater than thenumber of variables set to 0.

Dr. Jean-Michel RICHER Data Mining - Overview 11 / 27

Page 13: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Example (2/3)

Boolean expressionYou can compute the boolean expression of f (X ,Y , Z) as :

f (X ,Y , Z) = X .Y .Z + X .Y .Z + X .Y .Z + X .Y .Z

Majority functionIt is the majority function, such that f (X ,Y , Z) = 1 if thenumber of variables (X ,Y , Z) set to 1 is greater than thenumber of variables set to 0.

Dr. Jean-Michel RICHER Data Mining - Overview 11 / 27

Page 14: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Example (3/3)

Majority functionWith the majority function definition, you can extend thefunction with a greater number of input variables andkeep the same definition :

f (X ,Y , Z ,W )

f (X ,Y , Z ,W , T )

...

Dr. Jean-Michel RICHER Data Mining - Overview 12 / 27

Page 15: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

2. What will we cover ?

Dr. Jean-Michel RICHER Data Mining - Overview 13 / 27

Page 16: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

We will cover

Theoretical and practical aspects ofregressionclassificationclusteringdecision treeneural networks

Dr. Jean-Michel RICHER Data Mining - Overview 14 / 27

Page 17: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

We will cover

RegressionPrinciple: try tocompute a property infunction of otherpropertiesexample: weight =f (height ,age, sex)

0

5

10

15

20

0 1 2 3 4 5 6 7 8

y

x

measureslaw 2*x+3

Weka LR

Dr. Jean-Michel RICHER Data Mining - Overview 15 / 27

Page 18: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

We will cover

0

5

10

15

20

0 1 2 3 4 5 6 7 8

y

x

measureslaw 2*x+3

Weka LR

Dr. Jean-Michel RICHER Data Mining - Overview 15 / 27

Page 19: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

We will cover

ClassificationPrinciple: try to splitdata into groups usingthe properties of objets,knowing the number ofgroupspart of supervisedlearning

Dr. Jean-Michel RICHER Data Mining - Overview 16 / 27

Page 20: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

We will cover

ClusteringPrinciple: try to create groups of objects based ontheir similarities, the number of groups is unknownpart of unsupervised learning

Dr. Jean-Michel RICHER Data Mining - Overview 17 / 27

Page 21: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

We will cover

Decision treePrinciple: create amodel that predicts thevalue of a targetvariable by learningsimple decision rulesinferred from the datafeaturesa non-parametricsupervised learningmethod used forclassification andregression

Dr. Jean-Michel RICHER Data Mining - Overview 18 / 27

Page 22: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

We will cover

Dr. Jean-Michel RICHER Data Mining - Overview 18 / 27

Page 23: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

We will cover

Neural Networksalso called ArtificialNeural Networks (ANNs)or ConnectionistSystemsPrinciple: a computingdevice inspired bybiological neuralnetworks that has theability to progressivelylearn by consideringexamples (train set)kind of a black-box

Dr. Jean-Michel RICHER Data Mining - Overview 19 / 27

Page 24: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Resources

SoftwareWe will use the following software:

WEKA Waikato Environment for Knowledge Analysis,http://www.cs.waikato.ac.nz/ml/weka

R (rio, dplyr, ggplot2)Python (numpy, scipy, sklearn, matplotlib)

Dr. Jean-Michel RICHER Data Mining - Overview 20 / 27

Page 25: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

3. Skills

Dr. Jean-Michel RICHER Data Mining - Overview 21 / 27

Page 26: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Required skills

Skills you need to haveBefore entering the course you should have

mathematical background (matrix, statistics,probabilities)computer science background (Java, R or Python)

Dr. Jean-Michel RICHER Data Mining - Overview 22 / 27

Page 27: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

Skills to acquire

Skills to acquireAfter this course you shouldbe able to

give a definition and explain what is data miningdefine the different kind of analysis / analytics of DataMiningchose the correct method / algorithm to produce ananalysis in function of the data and the question youhave to answer

Dr. Jean-Michel RICHER Data Mining - Overview 23 / 27

Page 28: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

3. End

Dr. Jean-Michel RICHER Data Mining - Overview 24 / 27

Page 29: Data Mining - Overviericher/dm/data_mining_1_overview.pdf · Definition 1 Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful

University of Angers - Faculty of Sciences

UA - Angers2 Boulevard Lavoisier 49045 AngersCedex 01Tel: (+33) (0)2-41-73-50-72

Dr. Jean-Michel RICHER Data Mining - Overview 25 / 27