Geometric Data Analysis: Data Exploration & Visualization
mat6480w.guywolf.org/slides/T02 - Data...


Geometric Data Analysis

Data Exploration & Visualization

MAT 6480W / STT 6705V

Guy Wolf

Université de Montréal

Fall 2019

MAT 6480W (Guy Wolf) Data Exploration UdeM - Fall 2019 1 / 46


Outline

1 Tabular data
    Observations/Data-Points vs. Features/Attributes
    Qualitative vs. Quantitative attributes
    Qualitative: Nominal vs. Ordinal
    Quantitative: Interval vs. Ratio

2 Summary statistics
    Frequency, mode, & percentiles
    Mean & median
    Range & variance
    Covariance & correlation
    Data quality

3 Visualizations
    Box plots
    Histograms
    Star plots
    Parallel coordinate plots
    Scatter plots
    Quiver plots

4 Transactional data
    Term matrix
    Text documents

5 Structured signals (e.g., audio and EEG)
    Fourier & wavelets
    Spectrogram & scalogram

6 Multidimensional signals (e.g., images and videos)
    Visualization with contour plots

7 Nonparametric (affinity-/distance-based) representations
    Graph data
    Visualization with matrix plots


What is data?

Experimental vs. observational data

Experimental data
Data collected from strictly controlled/designed experiments, with efforts made to ensure statistical validity.

Examples
Medical clinical trials
Election polls

Observational data
Data collected from “real-world” settings without control over the captured underlying phenomena. It is easier to collect and obtain, but results and conclusions drawn from such data may be biased or inconclusive.

Most data in “data science” is observational data.


Tabular Data


Tabular data

Organizing data in a table of observations-by-features is considered the most convenient and standard format for data analysis.

Example
Consider the following procedure:

1 From each machine, collect 3 temperature measurements (MOBO, CPU, GPU), 4 software parameters (CPU, RAM, HDD, # processes), and 2 power consumption values (MOBO, GPU)

2 Attach unique identifiers of the machine, OS, and hardware manufacturer

3 Every second, store a record with these values from every machine in the system.

We end up with hundreds of thousands of records, each containing 12 fields.
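As a rough illustration of the record layout described above, the following sketch declares one record type with the 12 fields (3 identifiers, 3 temperatures, 4 software parameters, 2 power values). The class name, field names, units, and sample values are all hypothetical; the slides only specify what is measured, not how it is named or stored.

```python
from dataclasses import dataclass, fields


@dataclass
class MachineRecord:
    # 3 unique identifiers (names are assumptions for illustration)
    machine_id: str
    os: str
    manufacturer: str
    # 3 temperature measurements (MOBO, CPU, GPU)
    temp_mobo: float
    temp_cpu: float
    temp_gpu: float
    # 4 software parameters (CPU, RAM, HDD, # processes)
    cpu_load: float
    ram_usage: float
    hdd_usage: float
    num_processes: int
    # 2 power consumption values (MOBO, GPU)
    power_mobo: float
    power_gpu: float


# One made-up observation; a real system would store one such record
# per machine per second.
record = MachineRecord(
    "m-001", "LNX", "vendor-a",
    41.0, 45.0, 52.0,
    65.0, 48.0, 71.0, 23,
    38.5, 120.0,
)

field_names = [f.name for f in fields(MachineRecord)]
print(len(field_names), "fields per record")
```

Each instance corresponds to one row of the observations-by-features table, and each field to one column.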


Tabular data

Observations/Data-Points vs. Features/Attributes

Rows: observations/objects/data-points/samples/records
Columns: features/attributes/properties/fields

Timestamp        OS    Temp    ···   CPU   # proc
...              ...   ...     ...   ...   ...
9/1/16 1:00AM    LNX   45 °C   ···   65%   23
...              ...   ...     ...   ...   ...


Tabular data

Types of features/attributes

It is important to recognize the types of values each feature/attribute takes in order to understand which operations make sense for it.

Examples
Can we compute an average eye color?
How do we compute the difference between phone numbers?
Can we say today is “twice as hot/cold” as yesterday?

This is similar to problems like 6 apples / 4 people = 1.5 apples per person, but 10 people / 4 car seats = 3 cars (2.5 rounded up, since a fraction of a car is not meaningful).
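The apples-and-cars arithmetic can be sketched in a few lines. The assumption here is that the second quotient is rounded up because cars come in whole units; the variable names are illustrative only.

```python
import math

# Apples are divisible, so a fractional result is meaningful.
apples_per_person = 6 / 4

# Cars are not divisible: 10 people with 4 seats per car need
# the quotient rounded up to a whole number of cars.
cars_needed = math.ceil(10 / 4)

print(apples_per_person, cars_needed)
```

The same division operator is used in both cases; only the type of the quantity decides how the result should be interpreted.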


Tabular data

Qualitative vs. Quantitative attributes

Attribute values can be split into two types:

Qualitative attributes
Attributes that take values from a (finite) set of categories are called categorical or qualitative attributes. In some sense, they describe an object/observation, rather than measure its properties.

Quantitative attributes
Attributes that represent quantities are called numerical or quantitative attributes. They provide concrete quantifiable measurements of an object/observation.


Tabular data

Qualitative: Nominal vs. Ordinal

Qualitative attributes can be split further into two types:

Nominal attributes
Examples: zip codes, eye color, operating system, gender
Values of such attributes just specify names without any particular order or relation between them (except for = and ≠).

Binary attributes are nominal attributes with only two values (Yes/No or 0/1). They can be symmetric or asymmetric based on whether or not their values are equally informative.

Ordinal attributes
Examples: ratings, grades, street/avenue numbers
Values of such attributes have some order, even though they don’t specify an exact quantity.


Tabular data

Quantitative: Interval vs. Ratio

Quantitative attributes can also be split into two types:

Interval attributes
Examples: calendar dates, azimuth direction, Fahrenheit temperatures
Such attributes represent quantities with meaningful differences (or fixed intervals) between their values (but no multiplicative relations).

Ratio attributes
Examples: mass, length, distance, currency, age, electrical current
Such attributes represent quantities that have meaningful ratios between their values. Unlike interval attributes, ratio ones usually have an “absolute zero”.

We can also split quantities into discrete and continuous ones. All qualitative attributes are considered discrete.


Tabular data

Summary of attribute types

The types of attributes can be regarded via the operations that can be applied to them:

Comparison (= and ≠) - every type
Ordering (> and <) - every type except nominal
Differences (−) and addition (+) - only quantitative
Division (/) and multiplication (×, ·) - only ratio

Other operations (e.g., mean, median, correlation) may also be inapplicable for some types while applicable to others.
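The operation table above can be encoded as a small lookup, which is sometimes handy for validating an analysis step before running it. This is only a sketch: the `AttrType` enum, the operation names, and the `is_meaningful` helper are hypothetical names introduced for illustration.

```python
from enum import Enum


class AttrType(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Which attribute types each family of operations makes sense for,
# following the summary of attribute types.
ALLOWED = {
    "compare": {AttrType.NOMINAL, AttrType.ORDINAL, AttrType.INTERVAL, AttrType.RATIO},
    "order": {AttrType.ORDINAL, AttrType.INTERVAL, AttrType.RATIO},
    "subtract": {AttrType.INTERVAL, AttrType.RATIO},
    "divide": {AttrType.RATIO},
}


def is_meaningful(operation: str, attr_type: AttrType) -> bool:
    """Return True if the operation makes sense for this attribute type."""
    return attr_type in ALLOWED[operation]


# Eye color is nominal: equality is fine, ordering is not.
print(is_meaningful("compare", AttrType.NOMINAL))
print(is_meaningful("order", AttrType.NOMINAL))
# Fahrenheit temperature is interval: differences are fine, ratios are not.
print(is_meaningful("subtract", AttrType.INTERVAL))
print(is_meaningful("divide", AttrType.INTERVAL))
```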


Tabular data

Technical formats

Tabular data can be stored, collected, or given in several standard formats, such as:

Comma separated file (CSV)
Flat file or delimited text file (e.g., space or tab delimited)
XML or other log files
Proprietary formats (e.g., FCS for biological data or MAT files for Matlab data)
Database tables

There are several techniques and standard designs to collect and store big data in databases. Data warehouse, ETL (extract-transform-load), and OLAP (Online Analytical Processing) are some related terms encountered frequently in the IT industry.
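As a minimal sketch of working with the first format in the list, the snippet below parses CSV text with Python's standard `csv` module. The column names and values are made up for illustration, and the text is read from an in-memory string rather than a real file.

```python
import csv
import io

# Hypothetical CSV content: a header row followed by one record per line.
raw = (
    "timestamp,os,temp_cpu,num_processes\n"
    "2016-09-01T01:00:00,LNX,45,23\n"
    "2016-09-01T01:00:01,LNX,46,24\n"
)

# DictReader maps each row to a dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(raw)))

# CSV stores everything as text, so quantitative columns
# have to be converted explicitly.
for row in rows:
    row["temp_cpu"] = float(row["temp_cpu"])
    row["num_processes"] = int(row["num_processes"])

print(rows[0])
```

The explicit conversion step is where the attribute types discussed earlier come back in: the file format itself does not say which columns are nominal and which are numeric.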


Tabular data

Data warehouse: star and snowflake schemas

[Figure: star schema]

[Figure: snowflake schema]


Summary statistics

The raw representation of the data is often not convenient for initial exploration and understanding of the data.

How do we get general insights into the data and its attributes as a whole?

Summary statistics
Properties that summarize global information, such as central tendency, spread, and variations of observations and features.

These statistics provide an important first step in data analysis, and most of them are not difficult to compute in linear time w.r.t. the size of the data.


Summary statistics

Frequency, mode, & percentiles

Frequency
The proportion (e.g., percentage) of the observations with each specific value of a categorical or discrete attribute.

Mode
The most frequent value of an attribute in the data.

Percentiles
The p-th percentile (with 0 ≤ p ≤ 100) of an attribute is a value $P_p$ such that p% of the observed values of this attribute are less than $P_p$. We typically take $P_p$ as one of the observed values of the attribute. Alternatives: quartile $Q_i$ (i = 1, 2, 3), quantile, etc.

Visual examples: stem-and-leaf displays; quantile & Q-Q plots.
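The three definitions above can be sketched with the standard library. The percentile below uses the nearest-rank convention (the smallest observed value with at least p% of the observations at or below it), which is one of several common conventions and is an assumption here; the sample values are made up.

```python
import math
from collections import Counter


def frequencies(values):
    """Fraction of observations taking each distinct value."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in counts.items()}


def mode(values):
    """Most frequent value (ties broken by first occurrence)."""
    return Counter(values).most_common(1)[0][0]


def percentile(values, p):
    """Nearest-rank percentile; always returns an observed value."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


os_values = ["LNX", "WIN", "LNX", "MAC", "LNX"]
temps = [41, 45, 47, 52, 60, 43, 45, 49]

print(frequencies(os_values))
print(mode(os_values))
print(percentile(temps, 25), percentile(temps, 50), percentile(temps, 75))
```

With this convention the quartiles are simply `percentile(values, 25)`, `percentile(values, 50)`, and `percentile(values, 75)`.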


Summary statistics

Mean & median

Mean
The mean (or average) $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the most common way to measure the central location or value of data points. However, it is very sensitive to outliers. A trimmed mean is more robust to outliers because it disregards extreme values. A weighted mean also takes into account weights for each observation.

Median
The median of an attribute is a value such that half of the observed values are above it and half are below it. It is the middle value for an odd number of observations, or the average (when it makes sense) of the two middle values for an even number of observations. The median corresponds to $P_{50}$ and $Q_2$.
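To see how these measures react to an outlier, here is a small sketch. The `trimmed_mean` and `weighted_mean` helpers are written for illustration (they are not library functions), the trimming rule of dropping an equal fraction from each end is one common choice, and the data are made up.

```python
import statistics


def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each end.

    Assumes 0 <= trim_fraction < 0.5 so that some values remain.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def weighted_mean(values, weights):
    """Mean in which each observation counts proportionally to its weight."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


data = [2, 3, 3, 4, 100]  # 100 is an outlier

mean = statistics.mean(data)        # pulled far towards the outlier
median = statistics.median(data)    # middle value of the sorted data
trimmed = trimmed_mean(data, 0.2)   # drops 2 and 100, averages [3, 3, 4]
weighted = weighted_mean([1, 2, 3], [1, 1, 2])

print(mean, median, trimmed, weighted)
```

The mean lands at 22.4 even though four of the five values are at most 4, while the median and trimmed mean stay close to the bulk of the data.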


Summary statistics

Centrality and skewed data

Relations between three measures of centrality (mean, median, and mode) can indicate symmetric or skewed distributions of attributes:

[Figure: three distributions, labeled symmetric, positively skewed, and negatively skewed]


Summary statistics

Range & variance

Range
Range is the difference between the max and min observed values of an attribute.

Variance
Variance $s_x^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ and standard deviation (STD) $s_x = \sqrt{s_x^2}$ are the most common ways to measure the spread of values. However, like the mean, they are sensitive to outliers.

Other spread measures include:
average absolute deviation - the average of $|x_i - \bar{x}|$
median absolute deviation - the median of $|x_i - \bar{x}|$
interquartile range - the difference $x_{75\%} - x_{25\%}$
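The spread measures above can be computed with Python's `statistics` module, as in the sketch below. It uses the population forms (`pvariance`, `pstdev`) because the variance formula divides by n, and it takes the quartiles from `statistics.quantiles` with the "inclusive" method, which is one of several quartile conventions and an assumption here. The data are made up.

```python
import statistics

data = [1, 2, 3, 4, 5]
x_bar = statistics.mean(data)

value_range = max(data) - min(data)

# Population variance and standard deviation (divide by n, not n - 1).
variance = statistics.pvariance(data)
std = statistics.pstdev(data)

# Absolute deviations from the mean, as in the definitions above.
abs_dev = [abs(x - x_bar) for x in data]
average_abs_dev = statistics.mean(abs_dev)
median_abs_dev = statistics.median(abs_dev)

# Quartiles; the interquartile range is the 75% value minus the 25% value.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(value_range, variance, std, average_abs_dev, median_abs_dev, iqr)
```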


Summary statistics

Covariance & correlation

Covariance
Measures the degree to which attributes vary together and is computed by $\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$. This value depends on the magnitude/spread of the attribute values.

For k attributes, these form a k × k covariance matrix, with variances $s_x^2 = \mathrm{cov}(x, x)$ on its diagonal.

Correlation
A value between −1 and 1 that indicates how strongly two attributes are (linearly) related. Pearson correlation: $\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{s_x s_y}$.

Notice that it is independent of magnitudes/spreads and that corr(x, x) = 1.

Notice that Pearson correlation is the covariance (a dot product scaled by 1/n) between standardized attributes.


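$$\mathrm{corr}(x, y) = s_x^{-1} s_y^{-1} \, \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$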
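The identity above can be checked numerically. The following sketch implements covariance and Pearson correlation directly from the 1/n formulas and compares the result with the scaled dot product of the standardized attributes. All function names and the sample data are illustrative.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Population covariance: average product of deviations from the means."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)


def std(x):
    """Population standard deviation, using s_x^2 = cov(x, x)."""
    return math.sqrt(covariance(x, x))


def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / (std(x) * std(y))


def standardize(x):
    """Shift to zero mean and scale to unit standard deviation."""
    x_bar, s = mean(x), std(x)
    return [(xi - x_bar) / s for xi in x]


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.5]

corr_direct = correlation(x, y)

# Same quantity as (1/n) times the dot product of standardized attributes.
zx, zy = standardize(x), standardize(y)
corr_standardized = sum(a * b for a, b in zip(zx, zy)) / len(x)

print(corr_direct, corr_standardized)
```

Negating one attribute flips the sign of the correlation, which is why the value ranges over [−1, 1] rather than [0, 1].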


Summary statistics

Misleading example: pirates & global warming

[Figure: pirates & global warming chart, taken from Wikipedia]


Summary statistics

Data quality

Summary statistics enable identification of various data quality issues, such as:

Precision
The closeness of repeated measurements to one another.

Bias
A systematic variation of measurements from the quantity being measured.

Accuracy
The closeness of measurements to the true value of the quantity being measured.

Other issues include missing values, outliers, and duplicate values.
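One common way to put numbers on these notions, sketched below under stated assumptions, is to measure a reference quantity repeatedly: the standard deviation of the measurements then serves as a precision estimate, and the difference between their mean and the known true value as a bias estimate. The reference value and the measurements are made up for illustration, and other quantifications are possible.

```python
import statistics

# Hypothetical setup: a reference mass known to be exactly 1 g,
# weighed five times on the same scale.
true_value = 1.000
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]

# Bias: systematic offset of the measurements from the true value.
bias = statistics.mean(measurements) - true_value

# Precision: spread of the repeated measurements around their own mean
# (a smaller value means the measurements agree more closely).
precision = statistics.pstdev(measurements)

print(round(bias, 4), round(precision, 4))
```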


Visualizations

Why do we need visualizations?

While summary statistics provide useful information about the data, they can be overwhelming and hard to track when many attributes are considered.

Visualization
Conversion of data into visual elements that express characteristics, relationships, and information about data points and attributes.

Visualizations provide graphic representations that enable us to draw insights at a single glance.

Example (TreeMap)

[Figure: TreeMap example, taken from Wikipedia]


Visualizations

What constitutes a good visualization?

There is no really good answer, but there are some general guidelines:

ACCENT principles
Apprehension: we can correctly perceive relations among variables.
Clarity: visually distinguish important relations and elements.
Consistency: comparing graphical elements/displays shows faithful (dis)similarities in the data.
Efficiency: simplify complex relations and patterns in the visualization.
Necessity: only include necessary graphical elements - no extraneous elements.
Truthfulness: true values (absolute or relative) can be determined from graphical elements.


Visualizations

Box plots

Box plots (invented by J. Tukey) show a five-number summary of the distribution of attribute values based on percentiles:

[Figure: annotated box plot with labels for outliers, 90th percentile, 75th percentile, median, 25th percentile, and 10th percentile]
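The numbers a box plot of this kind displays can be computed as in the sketch below. It assumes the convention suggested by the figure labels (whiskers at the 10th and 90th percentiles, with values beyond them drawn as outliers) and a nearest-rank percentile; other box plot conventions, such as whiskers at 1.5 times the interquartile range, are also widely used. Function names and data are illustrative.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile; always returns an observed value."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def box_plot_summary(values):
    """Five percentile markers plus the points lying beyond the whiskers."""
    low = percentile(values, 10)
    high = percentile(values, 90)
    return {
        "p10": low,
        "p25": percentile(values, 25),
        "median": percentile(values, 50),
        "p75": percentile(values, 75),
        "p90": high,
        "outliers": [v for v in values if v < low or v > high],
    }


data = list(range(1, 21))  # the integers 1..20
summary = box_plot_summary(data)
print(summary)
```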


Visualizations

Histograms

[Figure: histogram examples. Taken from: Pierchala, C. “The choice of age groupings may affect the quality of tabular presentations.” 2002.]


Visualizations

Star plots

[Figure: star plot example. Taken from: www.coffeeanalysts.com/2011/11/coffee-spider-graphs-explained/]


Visualizations

Parallel coordinate plots

Notice that attributes in this case do not have a particular order.

Visualizations

Scatter plots

Visualizations

Quiver plots


Non-tabular Data


Transactional data

In transactional data, each observation is a transaction that contains a collection of items or a sequence of events.

Example (market basket data)
Customer #1: {milk, bread, butter}
Customer #2: {orange juice, milk}
Customer #3: {orange juice, peanut butter, jelly, bread}
. . .

Transaction items can also contain numerical attributes, such as the number of purchased items (e.g., 3 boxes of cookies) or their price. When sequences (e.g., events, actions, or genes) are considered, temporal/order information may also be included.


Transactional data

Term matrix

In some cases, transactional data can be converted to tabular form by considering a term matrix (a.k.a. bag of words/features techniques).

Example

CustomerID   milk  bread  butter  cheese  O.J.  P.B.  jelly
Customer#1   1     1      1       0       0     0     0
Customer#2   1     0      0       0       1     0     0
Customer#3   0     1      0       0       1     1     1
...          ...   ...    ...     ...     ...   ...   ...

This representation loses sequential information, and applying it to continuous values requires a discretization step.
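A term matrix like the one above can be built from the market baskets in a few lines. This is a minimal sketch: the vocabulary is simply the sorted set of all items seen, so the column order differs from the table, and items nobody bought (such as cheese) do not get a column unless the vocabulary is fixed in advance.

```python
# Market baskets from the transactional data example.
baskets = {
    "Customer#1": ["milk", "bread", "butter"],
    "Customer#2": ["orange juice", "milk"],
    "Customer#3": ["orange juice", "peanut butter", "jelly", "bread"],
}

# One column per distinct item, in a fixed (sorted) order.
vocabulary = sorted({item for items in baskets.values() for item in items})

# One binary row per customer: 1 if the item is in the basket, else 0.
term_matrix = {
    customer: [1 if term in items else 0 for term in vocabulary]
    for customer, items in baskets.items()
}

print(vocabulary)
for customer, row in term_matrix.items():
    print(customer, row)
```

Replacing the 0/1 indicator with a count would give the numerical variant mentioned earlier (e.g., 3 boxes of cookies).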


Transactional data: Text documents

Text documents can be considered as transactional data in one of two ways:

1 Each document can be considered as a big transaction containing words. Bag-of-words techniques ignore grammatical structures and represent a document as a histogram of word occurrences. Similar approaches can also be applied to images, questionnaires, etc., with an appropriate dictionary-building clustering step.

2 A document can be considered as a transactional dataset on its own, which contains word contexts (e.g., with n-grams or skip-grams). Word2vec techniques use this approach to associate numerical coordinates (typically in $\mathbb{R}^{300}$) to words based on contexts in which they appear.
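The two views can be sketched in a few lines. Whitespace tokenization, the function names, and the window size are simplifying assumptions made for illustration. The (word, context) pairs are the kind of input that skip-gram style models are trained on; no embedding is learned here.

```python
from collections import Counter


def bag_of_words(text):
    """Histogram of word occurrences; word order and grammar are discarded."""
    return Counter(text.lower().split())


def context_pairs(text, window=2):
    """All (word, context word) pairs where the context lies within `window` positions."""
    words = text.lower().split()
    pairs = []
    for i, word in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        pairs.extend((word, words[j]) for j in range(lo, hi) if j != i)
    return pairs


document = "the cat sat on the mat"
histogram = bag_of_words(document)          # view 1: the document is one big transaction
pairs = context_pairs(document, window=1)   # view 2: the document is a dataset of contexts
print(histogram.most_common(2))
print(pairs[:4])
```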

Example (Term analysis of Donald Trump’s tweets)

[Figures: most frequent words; iPhone vs. Android.]

Taken from varianceexplained.org/r/trump-tweets/

Taken from github.com/aubry74/visual-word2vec/

Structured signals

Structured signals have well-known relations between their “attributes”. They are typically numerical, with temporal or spatial ordering.

Examples
Audio recordings
EEG signals
Heart rate
Room temperatures

Each data-point is then a signal collected over time (or space), and it can be analyzed with signal processing tools.

Structured signals: Fourier & power spectrum

[Figure: a time series and its power spectrum.]
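A minimal sketch of computing a power spectrum with NumPy's FFT routines. The toy signal (two sinusoids at 5 Hz and 40 Hz, sampled at 200 Hz) is an assumption made for illustration and is not the signal shown on the slide.

```python
import numpy as np

# Assumed toy signal: 1 second sampled at 200 Hz, with 5 Hz and 40 Hz sinusoids.
fs = 200
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

# FFT of a real-valued signal and the matching frequency axis (in Hz).
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

# Power spectrum: squared magnitude of each Fourier coefficient.
power = np.abs(spectrum) ** 2

# The two largest peaks sit at the frequencies of the two sinusoids.
top_two = sorted(freqs[np.argsort(power)[-2:]].tolist())
print(top_two)
```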

Structured signals: STFT & wavelets

[Figure: STFT.]

[Figure: wavelets.]

[Figure: Haar wavelets, with lowpass, scale 1, and scale 2 components.]
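The lowpass / scale 1 / scale 2 split can be sketched with a two-level Haar transform. This uses the orthonormal normalization (division by the square root of 2), which is one common convention and an assumption here; the function names and the example signal are hypothetical.

```python
import math


def haar_step(values):
    """One Haar level: scaled pairwise sums (lowpass) and scaled pairwise differences (detail)."""
    s = math.sqrt(2)
    pairs = list(zip(values[0::2], values[1::2]))
    lowpass = [(a + b) / s for a, b in pairs]
    detail = [(a - b) / s for a, b in pairs]
    return lowpass, detail


def haar_transform(values, levels):
    """Return the final lowpass band and the detail bands, finest scale first."""
    details = []
    lowpass = list(values)
    for _ in range(levels):
        lowpass, detail = haar_step(lowpass)
        details.append(detail)
    return lowpass, details


# The signal length must be divisible by 2**levels.
signal = [4.0, 2.0, 5.0, 5.0, 1.0, 1.0, 0.0, 2.0]
lowpass, (scale1, scale2) = haar_transform(signal, levels=2)
print(lowpass, scale1, scale2)
```

With this normalization the transform is orthogonal, so the sum of squared coefficients equals the sum of squared signal values.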

Structured signals: Spectrogram & scalogram

[Figure: spectrogram and scalogram.]
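A spectrogram can be sketched as the squared magnitudes of a short-time Fourier transform. The Hann window, window size, hop length, and toy signal below are illustrative assumptions. The scalogram, which is the wavelet analogue, is not covered by this sketch.

```python
import numpy as np


def spectrogram(signal, fs, window_size=64, hop=32):
    """Squared STFT magnitudes; rows are frequencies, columns are time frames."""
    window = np.hanning(window_size)
    starts = range(0, len(signal) - window_size + 1, hop)
    # Cut the signal into overlapping frames and taper each one with the window.
    frames = np.array([signal[s:s + window_size] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(window_size, d=1 / fs)
    times = (np.array(list(starts)) + window_size / 2) / fs  # frame centers in seconds
    return freqs, times, power.T


# Assumed toy signal: 2 seconds at 256 Hz; 16 Hz during the first second, 64 Hz afterwards.
fs = 256
t = np.arange(2 * fs) / fs
signal = np.where(t < 1, np.sin(2 * np.pi * 16 * t), np.sin(2 * np.pi * 64 * t))

freqs, times, power = spectrogram(signal, fs)
dominant = freqs[np.argmax(power, axis=0)]  # strongest frequency in each time frame
print(dominant[0], dominant[-1])
```

Unlike the global power spectrum, this representation shows when each frequency is active, at the cost of a coarser frequency resolution set by the window size.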

Multidimensional signals

Multidimensional signals are similar, but have several coordinates that specify the relations between their “attributes”.

Examples
Grayscale images have two spatial coordinates that determine pixel positions.
Videos have two spatial & one temporal coordinates that determine pixel positions.
Geographic data has two or three coordinates determining longitude, latitude, and elevation.
Colored and hyperspectral images have two spatial coordinates and one spectral/channel coordinate.

In general, many signal processing approaches can be extended from one-dimensional signals to multidimensional ones.

Multidimensional signals: Two-dimensional wavelets
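One way to extend a one-dimensional transform to images is to apply it separably: first along every row, then along every column. The sketch below does this for a single Haar level. The orthonormal scaling, the function names, and the 4x4 image are assumptions made for illustration.

```python
import math


def haar_1d(values):
    """One Haar level on a list: scaled pairwise sums followed by scaled pairwise differences."""
    s = math.sqrt(2)
    pairs = list(zip(values[0::2], values[1::2]))
    return [(a + b) / s for a, b in pairs] + [(a - b) / s for a, b in pairs]


def haar_2d(image):
    """One separable 2D Haar level: transform every row, then every column."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major order


# Hypothetical 4x4 grayscale image with a vertical edge between the first two columns.
image = [
    [1.0, 9.0, 9.0, 9.0],
    [1.0, 9.0, 9.0, 9.0],
    [1.0, 9.0, 9.0, 9.0],
    [1.0, 9.0, 9.0, 9.0],
]
coeffs = haar_2d(image)
for row in coeffs:
    print([round(v, 6) for v in row])
```

In the output, the top-left quadrant is a coarse approximation of the image, the top-right quadrant holds differences along rows (which respond to the vertical edge), and the bottom half holds differences along columns, which vanish here because all rows are identical.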

Multidimensional signals: Visualization with contour plots

Nonparametric representations

In some cases, the important information in the data is the relations between data points, rather than their attributes.

Examples
Spatial locations and trajectories
Phone calls and email correspondences
Gene interactions and cell progressions

In these cases, an affinity matrix between data points, based on similarities or distances, can be used for analysis.

Essentially, each data point is represented by its relations to other data points rather than by its own attributes.
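A minimal sketch of one way to build an affinity matrix from pairwise distances. The Gaussian kernel and the bandwidth `sigma` are assumptions chosen for illustration; the slides do not prescribe a particular kernel.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Affinity matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [
        [math.exp(-squared_distance(p, q) / (2 * sigma ** 2)) for q in points]
        for p in points
    ]


# Hypothetical spatial locations: two nearby points and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
affinity = gaussian_affinity(points)
for row in affinity:
    print([round(v, 4) for v in row])
```

Row i of the matrix describes point i only through its similarity to every other point: nearby points get affinities close to 1, and distant points get affinities close to 0.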

Nonparametric representations: Graph data

Graphs can be used to formalize relations in data in two ways:

1 Relationships between attributes can form graphs (e.g., molecule data, such as benzene, C6H6). In this case each data point is a graph on its own, and this is a more complicated example of structured data.

2 The graph is considered as the dataset, and each node is a data point (e.g., social networks and web-reference data). In this case, an adjacency matrix can form an affinity matrix. Conversely, affinity matrices can form adjacency matrices, so nonparametric data is often considered as graph data.

Spectral graph methods (e.g., SVD of graph Laplacian) can be used to associate coordinates to data points in the second case, for visualization with scatter plots and further analysis.
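A minimal sketch of such a spectral embedding, assuming the unnormalized Laplacian L = D - A and a small hypothetical graph. The Laplacian is symmetric and positive semidefinite, so its eigendecomposition (used here through `numpy.linalg.eigh`) coincides with its SVD.

```python
import numpy as np

# Hypothetical graph: two triangles (nodes 0-2 and nodes 3-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
adjacency = np.zeros((n, n))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

# Unnormalized graph Laplacian L = D - A, with D the diagonal degree matrix.
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)

# Skip the constant eigenvector (eigenvalue 0) and use the next two as 2D coordinates.
coordinates = eigenvectors[:, 1:3]
print(np.round(eigenvalues, 3))
print(np.round(coordinates[:, 0], 3))
```

The first coordinate has one sign on nodes 0-2 and the opposite sign on nodes 3-5, so a scatter plot of these coordinates separates the two triangles.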

Nonparametric representations: Visualization with matrix plots

Summary

We considered the following data and attribute types, and briefly showed how to handle, process, and visualize them:

Types of attributes
Nominal
Ordinal
Interval
Ratio

Types of data
Tabular data
Transactional & text data
Structured (1D, 2D, & more) data
Nonparametric & graph data

Exploratory data analysis is crucial for obtaining intelligible results, e.g., by identifying valid applicable operations on the data and possibly transforming it to a more amenable representation for analysis.

Other preprocessing steps include normalization/standardization, sampling, discretization, aggregation, and dimensionality reduction.
