Descriptive Exploratory Data Analysis 9/6/2007

Jagdish S. Gangolly

State University of New York at Albany

Data Manipulation:

– Matrices: bind rows (rbind), bind columns (cbind)

– Arrays: rowMeans, colMeans, rowSums, colSums, rowVars, colVars, …

– apply(data, dim, function, …)

– attach(framename): lets you refer to the frame's variables without the cumbersome framename$variable notation; detach the frame when done.

– function(x) { function definition }: to define your own functions

– rm(comma-separated S-Plus objects): to remove objects
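
A minimal sketch of these operations in R (S-PLUS syntax is essentially the same); the values and object names are made up for illustration:

> m <- cbind(a = 1:3, b = 4:6)          # bind two columns into a matrix
> m <- rbind(m, c(7, 8))                # bind on an extra row
> rowMeans(m); colSums(m)               # row/column summaries
> apply(m, 2, median)                   # apply a function over dimension 2 (columns)
> df <- data.frame(height = c(64, 70, 68), weight = c(120, 170, 150))
> attach(df); mean(height); detach(df)  # refer to height without writing df$height
> cube <- function(x) { x^3 }           # define your own function
> rm(m, df, cube)                       # remove objects when done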

Trellis Graphics I

• A matrix of graphs

Example:

> par(mfrow=c(2,2))       # 2 x 2 matrix of figures
> x <- 1:100/100:1
> plot(x)                 # plot cell (1,1)
> plot(x, type="l")       # plot cell (1,2): line
> hist(x)                 # plot cell (2,1): histogram
> boxplot(x)              # plot cell (2,2): boxplot

Trellis Graphics II

Syntax:

  dependent variable ~ explanatory variable | conditioning variable, data = data set

Output:

> trellis.device(motif)              # open a Trellis graphics device

> dev.off()  or  > graphics.off()    # close the device when done
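
A minimal sketch of this formula syntax using the lattice package in R (the S-PLUS Trellis library is analogous); the barley data set and its variables yield, year, and site come from lattice, not from these slides:

> library(lattice)
> xyplot(yield ~ year | site, data = barley)   # one panel of yield vs. year per site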

Trellis Graphics III

Example:

> histogram(~ height | voice.part, data = singer)

– No dependent variable for a histogram

– height is the explanatory variable

– voice.part is the conditioning variable

– The data set is singer

Trellis Graphics IV

• Layout: layout, skip, and aspect parameters (p. 147).

• Ordering graphs: left to right, bottom to top; if as.table=T, left to right, top to bottom (p. 149).
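
A minimal sketch of these options, reusing the singer example from the previous slide; the particular layout values are illustrative:

> histogram(~ height | voice.part, data = singer,
+           layout = c(2, 4),     # 2 columns by 4 rows of panels
+           as.table = TRUE)      # fill panels left to right, top to bottom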

Data Mining

• What is Data mining?

• Data mining primitives
  – Task-relevant data
  – Kinds of knowledge to be mined
  – Background knowledge
  – Interestingness measures
  – Visualisation of discovered patterns

• Query language

Data Mining

• Concept Description (Descriptive Data Mining)
  – Data generalisation
    • Data cube (OLAP) approach (offline pre-computation)
    • Attribute-oriented induction approach (online aggregation)
    • Presentation of generalisation

• Descriptive Statistical Measures and Displays

What is Data Mining?

• Discovery of knowledge from databases
  – A set of data mining primitives to facilitate such discovery (what data, what kinds of knowledge, what measures are to be evaluated, how the knowledge is to be visualised)
  – A query language for the user to interactively visualise the knowledge mined

Data mining primitives I

• Task-relevant data: attributes relevant for the study of the problem at hand

• Kinds of knowledge to be mined: characterisation, discrimination, association, classification, clustering, evolution,…

• Background knowledge: Knowledge about the domain of the problem (concept hierarchies, beliefs about the relationships, expected patterns of data, …)

Data mining primitives II

• Interestingness measures: support measures (prevalence of the rule pattern) and confidence measures (strength of the implication of the rule)

• Visualisation of discovered patterns: rules, tables, charts, graphs, decision trees, cubes,…

Task-relevant Data

Steps:

• Derivation of initial relation through database queries (data retrieval operations). (Obtaining a minable view)

• Data cleaning & transformation of the initial relation to facilitate mining

• Data mining

Kinds of knowledge to be mined

• Kinds of knowledge & templates (meta-patterns, meta-rules, meta-queries)
  – Association
    An example: age(X: customer, W) ∧ income(X, Y) ⇒ buys(X, Z)
  – Classification
  – Discrimination
  – Clustering
  – Evolution analysis

Background knowledge

• Knowledge from the problem domain, usually in the form of
  – concept hierarchies (rolling up or drilling down)
  – schema hierarchies (lattices)
  – set-grouping hierarchies (successive sub-grouping of attributes)
  – rule-based hierarchies

Interestingness measures I

• Simplicity: the more complex the structure, the harder it is to interpret, and so the less interesting it is likely to be (rule length, …)

• Certainty: validity, trustworthiness

  confidence(A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A)

  Sometimes called the "certainty factor".

Interestingness measures II

• Utility: support is the percentage of task-relevant data tuples for which the pattern is true

  support(A ⇒ B) = (# tuples containing both A and B) / (total # tuples)
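
A minimal sketch of these two measures in R on a made-up logical data frame; the columns A and B simply flag whether each tuple satisfies the antecedent and the consequent:

> tx <- data.frame(A = c(T, T, F, T, F), B = c(T, F, F, T, T))   # one row per tuple
> support    <- sum(tx$A & tx$B) / nrow(tx)     # # tuples with both A and B / total # tuples
> confidence <- sum(tx$A & tx$B) / sum(tx$A)    # # tuples with both A and B / # tuples with A
> support; confidence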

Visualisation of discovered patterns

• Hierarchies

• Tables

• Pie/bar charts

• Dot/box plots

• …

Descriptive Datamining (Concept Description & Characterisation)

• Concept description: description of data generalised at multiple levels of abstraction

• Concept characterisation: Concise and succinct summarisation of a given collection of data

• Concept comparison: Discrimination

Data Generalisation

• Abstraction of task-relevant, high-conceptual-level data from a database containing relatively low-conceptual-level data
  – Data cube (OLAP) approach (offline pre-computation) (Figs 2.1 & 2.2, pages 46 & 47)
  – Attribute-oriented induction approach (online aggregation)

• Presentation of generalisation (Tables 5.3 & 5.4 on p. 191, and Figs 5.2, 5.3, & 5.4 on pages 192 & 193)

Descriptive Statistical Measures and Displays I

• Measures of central tendency
  – Mean, weighted mean (weights signifying importance or occurrence frequency)
  – Median
  – Mode

• Measures of dispersion
  – Quartiles, outliers, boxplots
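
A minimal sketch of these measures in R (S-PLUS is similar); x and w are made-up values, and the mode is computed by hand since base R has no built-in mode function:

> x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 110)
> w <- c(rep(1, 10), 0.5)                   # downweight the extreme value
> mean(x); weighted.mean(x, w); median(x)   # measures of central tendency
> names(which.max(table(x)))                # mode: the most frequent value
> quantile(x)                               # minimum, quartiles, maximum
> boxplot(x)                                # boxplot flags outliers beyond the whiskers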

Descriptive Statistical Measures and Displays II

• Displays– Histograms (Fig 5.6, page 214)

Descriptive Statistical Measures and Displays III

– Barcharts

Descriptive Statistical Measures and Displays IV

– Quantile plot (Fig 5.7, page 215)

Descriptive Statistical Measures and Displays V

– Quantile-Quantile plot (Fig 5.8, page 216)

Descriptive Statistical Measures and Displays VI

– Scatter plot (Fig 5.9, page 216)

Descriptive Statistical Measures and Displays VII

– Loess curve (Fig 5.10, page 217)
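
A minimal sketch of these displays in R; x and y are made-up vectors, and lowess serves here as the loess-style smoother:

> x <- sort(rnorm(100)); y <- 2 * x + rnorm(100)
> hist(x)                               # histogram
> barplot(table(cut(x, 4)))             # bar chart of binned counts
> plot(ppoints(length(x)), x)           # quantile plot: f-values vs. ordered data
> qqplot(x, y)                          # quantile-quantile plot
> plot(x, y)                            # scatter plot
> lines(lowess(x, y))                   # loess (lowess) curve added to the scatter plot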

Descriptive Data Exploration

• summary: mean, median, quartiles (p. 171)

• stem: stem-and-leaf display (p. 171)

• quantile (p. 172)

• stdev (p. 173)

• tapply: splits the data into groups and applies a function to each (p. 174)

• by (p. 175)

• mean works on a vector; other structures need to be converted to vectors before computing means.

• (example on pp. 176-7)
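
A minimal sketch of these functions in R on a made-up data frame; note that S-PLUS's stdev corresponds to sd in R:

> df <- data.frame(height = c(64, 62, 66, 71, 74, 70),
+                  voice.part = c("Soprano", "Soprano", "Alto", "Tenor", "Bass", "Bass"))
> summary(df$height)                       # mean, median, quartiles
> stem(df$height)                          # stem-and-leaf display
> quantile(df$height, c(0.1, 0.9))         # arbitrary quantiles
> sd(df$height)                            # stdev in S-PLUS
> tapply(df$height, df$voice.part, mean)   # split heights by group, apply mean to each
> by(df, df$voice.part, summary)           # summary of each group's sub-frame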

Data Preprocessing for Datamining I

• Why
  – Incomplete
    • attribute values not available, equipment malfunctions, values not considered important
  – Noisy (errors)
    • instrument problems, human/computer errors, transmission errors
  – Inconsistent
    • inconsistencies due to data definitions

Data Preprocessing for Datamining II

• Data Cleaning
  – Missing values:
    • ignore the tuple, fill in values manually, use a global constant ("unknown"), missing value = attribute mean, missing value = attribute group mean, missing value = most probable value
  – Noisy data (see the sketch after this list):
    • Binning: partitioning into equi-sized bins, smoothing by bin means or bin boundaries
    • Clustering
    • Inspection: computer & human
    • Regression
  – Inconsistencies
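
A minimal sketch of two of these cleaning steps in R; the income and price vectors, the grouping factor, and the choice of three bins are all made up for illustration:

> # Missing values: fill in with the attribute mean or the attribute group mean
> income <- c(25, NA, 40, 38, NA, 60)
> grp    <- factor(c("a", "a", "a", "b", "b", "b"))
> filled.mean <- ifelse(is.na(income), mean(income, na.rm = TRUE), income)
> grp.means   <- tapply(income, grp, mean, na.rm = TRUE)
> filled.grp  <- ifelse(is.na(income), grp.means[as.character(grp)], income)

> # Noisy data: equal-depth binning, then smoothing by bin means
> price <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))
> bins  <- rep(1:3, each = 3)               # three bins of three values each
> ave(price, bins, FUN = mean)              # each value replaced by its bin mean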

Data Preprocessing for Datamining III

• Data Integration: combining data from different sources into a coherent whole
  – Schema integration: combining data models (entity identification problems)
  – Redundancy (derived values, calculated fields, use of different key attributes): use correlations to detect redundancies
  – Resolution of data value conflicts (values coded in different measures)

Data Preprocessing for Datamining IV

• Transformation
  – Smoothing
  – Aggregation
  – Generalisation
  – Normalisation (see the sketch below)
  – Attribute (or feature) construction
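
A minimal sketch of normalisation in R; the vector and the two methods shown (min-max and z-score) are illustrative:

> x <- c(200, 300, 400, 600, 1000)
> (x - min(x)) / (max(x) - min(x))     # min-max normalisation to [0, 1]
> (x - mean(x)) / sd(x)                # z-score normalisation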

Data Preprocessing for Datamining V

• Data Reduction & compression
  – Data cube aggregation (p. 117)
  – Dimension reduction: minimise loss of information
    • Attribute selection
    • Decision tree induction
    • Principal components analysis (see the sketch below)
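
A minimal sketch of dimension reduction via principal components in R, on a made-up numeric matrix:

> set.seed(1)
> X <- matrix(rnorm(100 * 4), ncol = 4)    # 100 observations, 4 attributes
> X[, 4] <- X[, 1] + 0.1 * rnorm(100)      # make one attribute nearly redundant
> pc <- princomp(X)
> summary(pc)                              # proportion of variance captured by each component
> reduced <- pc$scores[, 1:2]              # keep only the first two components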

Data Preprocessing for Datamining V (continued)

  – Numerosity reduction (see the sketch below)
    • Regression / log-linear regression
    • Histograms
    • Clustering
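
A minimal sketch of two numerosity-reduction ideas in R, with made-up data: summarising a large sample by histogram bin counts, and replacing many points by a few cluster centres:

> set.seed(2)
> x <- rnorm(10000)
> h <- hist(x, breaks = 20, plot = FALSE)
> cbind(mid = h$mids, count = h$counts)    # ~20 (midpoint, count) pairs stand in for 10,000 values

> xy <- cbind(rnorm(1000), rnorm(1000))
> km <- kmeans(xy, centers = 5)
> km$centers                               # 5 cluster centres summarise 1,000 points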