Descriptive Exploratory Data Analysis 9/6/2007
Jagdish S. Gangolly
State University of New York at Albany
Data Manipulation:
– Matrices: bind rows (rbind), bind columns (cbind)
– Arrays: rowMeans, colMeans, rowSums, colSums, rowVars, colVars,…
– apply(data, dim, function,…)
– attach(framename): permits you to refer to variables without cumbersome notation. You can detach the frame when done.
– function (x) { function definition}: To define your own functions
– rm(comma-separated S-Plus objects): To remove objects
Trellis Graphics I
• A matrix of graphs
Example:
> par(mfrow=c(2,2))    # 2 x 2 matrix of figures
> x <- 1:100/100:1
> plot(x)              # plot cell (1,1)
> plot(x, type="l")    # plot cell (1,2) line
> hist(x)              # plot cell (2,1) histogram
> boxplot(x)           # plot cell (2,2) boxplot
Trellis Graphics II
Syntax:
Dependent variable ~ explanatory variable | conditioning variable, data = data set
Output:
> trellis.device(motif)            # open a Trellis graphics device
> dev.off() or > graphics.off()    # close the current device, or all devices
Trellis Graphics III
Example:
histogram(~height | voice.part, data=singer)
– No dependent variable for histogram
– Height is explanatory variable
– Data set is singer
Trellis Graphics IV
• Layout: layout and skip and aspect parameters (p.147).
• Ordering graphs: left to right, bottom to top. If as.table=T, left to right, top to bottom (p.149).
Data Mining
• What is Data mining?
• Data mining primitives
– Task-relevant data
– Kinds of knowledge to be mined
– Background knowledge
– Interestingness measures
– Visualisation of discovered patterns
• Query language
Data Mining
• Concept Description (Descriptive Datamining)
– Data generalisation
  • Data cube (OLAP) approach (offline pre-computation)
  • Attribute-oriented induction approach (online aggregation)
  • Presentation of generalisation
• Descriptive Statistical Measures and Displays
What is Data mining?
• Discovery of knowledge from databases
– A set of data mining primitives to facilitate such discovery (what data, what kinds of knowledge, measures to be evaluated, how the knowledge is to be visualised)
– A query language for the user to interactively visualise knowledge mined
Data mining primitives I
• Task-relevant data: attributes relevant for the study of the problem at hand
• Kinds of knowledge to be mined: characterisation, discrimination, association, classification, clustering, evolution,…
• Background knowledge: Knowledge about the domain of the problem (concept hierarchies, beliefs about the relationships, expected patterns of data, …)
Data mining primitives II
• Interestingness measures: support measures (prevalence of the rule pattern) and confidence measures (strength of the implication of the rule)
• Visualisation of discovered patterns: rules, tables, charts, graphs, decision trees, cubes,…
Task-relevant Data
Steps:
• Derivation of initial relation through database queries (data retrieval operations). (Obtaining a minable view)
• Data cleaning & transformation of the initial relation to facilitate mining
• Data mining
Kinds of knowledge to be mined
• Kinds of knowledge & templates (meta-patterns, meta-rules, meta-queries)
– Association
  An example: age(X: customer, W) ∧ income(X, Y) ⇒ buys(X, Z)
– Classification
– Discrimination
– Clustering
– Evolution analysis
Background knowledge
• Knowledge from the problem domain
– usually in the form of
  • concept hierarchies (rolling up or drilling down)
  • schema hierarchies (lattices)
  • set-grouping hierarchies (successive sub-grouping of attributes)
  • rule-based hierarchies
Interestingness measures I
• Simplicity: The more complex the structure, the more difficult it is to interpret, and so the less interesting it is likely to be (rule length,…)
• Certainty: validity, trustworthiness. Sometimes called the "certainty factor".
\[ \mathrm{confidence}(A \Rightarrow B) = \frac{\#\ \text{tuples containing both } A \text{ and } B}{\#\ \text{tuples containing } A} \]
Interestingness measures II
• Utility: Support is the percentage of task-relevant data tuples for which the pattern is true
\[ \mathrm{support}(A \Rightarrow B) = \frac{\#\ \text{tuples containing both } A \text{ and } B}{\text{total } \#\ \text{tuples}} \]
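As a minimal sketch of the support and confidence ratios, the Python fragment below counts tuples directly. The transactions, item names, and function names are hypothetical, chosen only for illustration; they do not come from the lecture or its textbook.

```python
# Hypothetical task-relevant data: each tuple is the set of items it contains.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "software"},
]

def support(a, b, tuples):
    """# tuples containing both A and B, divided by the total # tuples."""
    both = sum(1 for t in tuples if a <= t and b <= t)
    return both / len(tuples)

def confidence(a, b, tuples):
    """# tuples containing both A and B, divided by # tuples containing A."""
    has_a = [t for t in tuples if a <= t]
    both = sum(1 for t in has_a if b <= t)
    return both / len(has_a)  # raises ZeroDivisionError if no tuple contains A

A, B = {"computer"}, {"software"}
rule_support = support(A, B, transactions)
rule_confidence = confidence(A, B, transactions)
print(rule_support, rule_confidence)
```

Here three of the five tuples contain both items and four contain "computer", so the rule computer ⇒ software has support 3/5 and confidence 3/4.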
Descriptive Datamining (Concept Description & Characterisation)
• Concept description: Description of data generalised at multiple levels of abstraction
• Concept characterisation: Concise and succinct summarisation of a given collection of data
• Concept comparison: Discrimination
Data Generalisation
• Abstraction of task-relevant data to a high conceptual level from a database containing relatively low conceptual level data
– Data cube (OLAP) approach (offline pre-computation) (Figs 2.1 & 2.2, pages 46 & 47)
– Attribute-oriented induction approach (online aggregation)
• Presentation of generalisation (Tables 5.3 & 5.4 on p. 191, and Figs 5.2, 5.3, & 5.4 on pages 192 & 193)
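One way to picture generalisation is to climb a concept hierarchy and merge the tuples that become identical, keeping a count. The Python sketch below does this for a single attribute. The city-to-country hierarchy, the sales tuples, and the function name are assumptions made up for illustration, and the fragment is a simplification of the idea rather than the full attribute-oriented induction procedure.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country.
city_to_country = {
    "Albany": "USA",
    "Buffalo": "USA",
    "Toronto": "Canada",
    "Montreal": "Canada",
}

# Hypothetical low conceptual level tuples: (city, item type).
sales = [
    ("Albany", "TV"),
    ("Buffalo", "TV"),
    ("Toronto", "TV"),
    ("Albany", "computer"),
    ("Montreal", "computer"),
    ("Toronto", "computer"),
]

def generalise(rows, hierarchy):
    """Replace each city by its parent concept and merge identical tuples,
    counting how many original tuples each generalised tuple covers."""
    return Counter((hierarchy[city], item) for city, item in rows)

generalised = generalise(sales, city_to_country)
for (country, item), count in sorted(generalised.items()):
    print(country, item, count)
```

The six city-level tuples collapse into four country-level tuples with counts, which is the kind of generalised relation that can then be presented as a table or chart.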
Descriptive Statistical Measures and Displays I
• Measures of central tendency
– Mean, Weighted mean (weights signifying importance or occurrence frequency)
– Median
– Mode
• Measures of dispersion
– Quartiles, outliers, boxplots
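The measures listed above can be tried out with Python's standard `statistics` module. This is a sketch under stated assumptions: the data values and weights are hypothetical, the quartiles use the module's default ("exclusive") interpolation, which may differ slightly from what S-Plus reports, and the 1.5 × IQR fence is the usual boxplot convention rather than something stated in the slides.

```python
import statistics

# Hypothetical observations (for example, salaries in thousands).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean_value = statistics.mean(values)
median_value = statistics.median(values)
modes = statistics.multimode(values)      # every value with the highest count

# Weighted mean: weights signify importance or occurrence frequency.
scores, weights = [80, 90], [1, 3]
weighted = sum(w * x for x, w in zip(scores, weights)) / sum(weights)

# Quartiles with the module's default interpolation method.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Assumed boxplot rule: points beyond 1.5 * IQR from the quartiles are outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]

print(mean_value, median_value, modes, weighted)
print(q1, q2, q3, outliers)
```

The data are bimodal, so `multimode` is used rather than `mode`, which would report only one of the two most frequent values.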
Descriptive Data Exploration
• summary : mean, median, quartiles p.171
• stem : stem and leaf display p.171
• quantile p.172
• stdev p.173
• tapply : splits data p.174
• by p.175
• mean works on vectors; other structures must be converted to vectors before computing means.
• (example on p.176-7)
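The split-and-apply idea behind tapply (split a vector by a grouping variable, then apply a function to each piece) can be sketched outside S-Plus as well. The Python fragment below is only an analogy: the heights, voice parts, and the `group_apply` helper are hypothetical and are not the singer data or an S-Plus function.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: one height and one voice part per singer.
heights = [64, 62, 66, 70, 72, 68]
voice_part = ["Soprano", "Soprano", "Alto", "Tenor", "Bass", "Tenor"]

def group_apply(values, groups, func):
    """Split values by group label, then apply func to each group's values."""
    buckets = defaultdict(list)
    for value, group in zip(values, groups):
        buckets[group].append(value)
    return {group: func(vals) for group, vals in buckets.items()}

mean_height = group_apply(heights, voice_part, mean)
print(mean_height)
```

Passing a different function (for example `max` or `len`) gives a different per-group summary without changing the splitting step.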
Data Preprocessing for Datamining I
• Why
– Incomplete
  • Attribute values not available, equipment malfunctions, not considered important
– Noisy (errors)
  • instrument problems, human/computer errors, transmission errors
– Inconsistent
  • inconsistencies due to data definitions
Data Preprocessing for Datamining II
• Data Cleaning
– Missing values:
  • ignore tuple, fill-in values manually, use a global constant (unknown), missing value = attribute mean, missing value = attribute group mean, missing value = most probable value
– Noisy data:
  • Binning: partitioning into equi-sized bins, smoothing by bin means or bin boundaries
  • Clustering
  • Inspection: computer & human
  • Regression
– Inconsistencies
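Two of the cleaning steps above, filling a missing value with the attribute mean and smoothing noisy data by binning, are simple enough to sketch in a few lines of Python. The price list, the bin depth of 3, the use of `None` for a missing value, and all function names are assumptions for illustration. Ties in boundary smoothing go to the lower boundary here, which is an arbitrary choice.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def equal_depth_bins(values, depth):
    """Sort the values and cut them into bins holding `depth` values each."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # hypothetical attribute values
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
filled = fill_with_mean([10, None, 20, 30])

print(bins)
print(by_means)
print(by_boundaries)
print(filled)
```

Filling with a group mean works the same way, except that the mean is taken only over the tuples in the same group as the tuple with the missing value.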
Data Preprocessing for Datamining III
• Data Integration: Combining data from different sources into a coherent whole
– Schema integration: combining data models (entity identification problems)
– Redundancy (derived values, calculated fields, use of different key attributes): use of correlations to detect redundancies
– Resolution of data value conflicts (coding values in different measures)
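To illustrate how correlation can flag a redundant attribute, the Python sketch below computes the Pearson correlation coefficient by hand. The attribute names and values are hypothetical, and the 0.95 cut-off is an assumed threshold for illustration, not a value given in the lecture.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical attributes gathered from two sources.
price_dollars = [10.0, 12.5, 8.0, 20.0, 15.0]
price_cents = [1000, 1250, 800, 2000, 1500]   # derived from price_dollars
units_sold = [3, 9, 4, 6, 2]

r_redundant = pearson(price_dollars, price_cents)
r_other = pearson(price_dollars, units_sold)

THRESHOLD = 0.95  # assumed cut-off for flagging a likely redundancy
print(round(r_redundant, 3), abs(r_redundant) > THRESHOLD)
print(round(r_other, 3), abs(r_other) > THRESHOLD)
```

The two price attributes differ only in their unit of measure, so their correlation is essentially 1 and one of them can be dropped; the weak correlation with units sold gives no such signal.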
Data Preprocessing for Datamining III (continued)
• Transformation
– Smoothing
– Aggregation
– Generalisation
– Normalisation
– Attribute (or feature) construction
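Normalisation is the easiest of these transformations to show concretely. The Python sketch below applies two common variants, min-max scaling and z-score standardisation. The slides do not say which variants the lecture covers, so both are assumptions, as are the income values, the function names, and the use of the population standard deviation.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the
    largest to new_max. Fails if all values are equal (zero range)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre values on their mean and scale by the population std deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [12000, 54000, 73600, 98000]   # hypothetical attribute values
scaled = min_max(incomes)
standardised = z_score(incomes)

print([round(v, 3) for v in scaled])
print([round(v, 3) for v in standardised])
```

Min-max scaling keeps the original spacing of the values within a fixed range, while z-scores are less sensitive to where the extreme values fall.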
Data Preprocessing for Datamining IV
• Data Reduction & compression
– Data cube aggregation (p.117)
– Dimension reduction: minimise loss of information.
  • Attribute selection
  • Decision tree induction
  • Principal components analysis