Noise & Data Reduction. Paired Sample t Test Data Transformation - Overview From Covariance Matrix...


Page 1: Noise & Data Reduction. Paired Sample t Test Data Transformation - Overview From Covariance Matrix to PCA and Dimension Reduction Fourier Analysis - Spectrum.

Noise & Data Reduction

Page 2:

• Paired Sample t Test
• Data Transformation - Overview
• From Covariance Matrix to PCA and Dimension Reduction
• Fourier Analysis - Spectrum
• Dimension Reduction
• Data Integration
• Automatic Concept Hierarchy Generation

Page 3:

Testing Hypothesis

Page 4:

Remember: Central Limit Theorem

The sampling distribution of the mean of samples of size N approaches a normal (Gaussian) distribution as N approaches infinity.

If the samples are drawn from a population with mean \( \mu \) and standard deviation \( \sigma \), then the mean of the sampling distribution is \( \mu \) and its standard deviation is \( \sigma_{\bar{x}} = \sigma / \sqrt{N} \), which shrinks as N increases.

These statements hold irrespective of the shape of the original distribution (provided its variance is finite).
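A minimal simulation sketch of the two statements above. It assumes an exponential population (mean 1, standard deviation 1), chosen only because it is strongly skewed; the sample size, number of trials and seed are illustrative choices, not values from the slides.

```python
import math
import random
import statistics

def sample_means(draw, n, trials):
    """Return `trials` sample means, each computed from n independent draws."""
    return [statistics.fmean(draw() for _ in range(n)) for _ in range(trials)]

rng = random.Random(0)   # fixed seed so the run is repeatable
mu, sigma = 1.0, 1.0     # exponential(rate=1): mean 1, std 1, clearly non-normal
n = 50                   # sample size N (illustrative)

means = sample_means(lambda: rng.expovariate(1.0), n, trials=4000)

mean_of_means = statistics.fmean(means)   # expected to be close to mu
std_of_means = statistics.stdev(means)    # expected to be close to sigma / sqrt(N)
predicted_std = sigma / math.sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (mu = {mu})")
print(f"std of sample means:  {std_of_means:.3f} (sigma/sqrt(N) = {predicted_std:.3f})")
```

Plotting a histogram of `means` should show a roughly bell-shaped curve even though the individual draws are skewed.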

Page 5:

Z test: uses the population standard deviation \( \sigma \)

t test: uses the sample standard deviation \( s \)

• when the population standard deviation is unknown and the samples are small

Population mean \( \mu \), sample mean \( \bar{x} \)

\[ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}} \qquad t = \frac{\bar{x} - \mu}{s / \sqrt{N}} \]

\[ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} \qquad \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} \]

Page 6:

p Values

• Commonly we reject H0 when the probability of obtaining a sample statistic given the null hypothesis is low, say < .05
• The null hypothesis is rejected but might be true
• We find the probabilities by looking them up in tables, or statistics packages provide them
• The probability of obtaining a sample statistic at least as extreme as the observed one, given the null hypothesis, is called the p value
• By convention, one usually does not reject the null hypothesis unless p < 0.05 (statistically significant)

Page 7:

Example

• Five cars are parked; the mean price of the cars is 20,270 € and the standard deviation of the sample is 5,811 €
• The mean cost of cars in town is 12,000 € (population)
• H0 hypothesis: parked cars are as expensive as the cars in town

\[ t = \frac{20270 - 12000}{5811 / \sqrt{5}} = 3.18 \]

For N − 1 = 4 degrees of freedom, t = 3.18 has a p value less than 0.025, so reject H0!
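The arithmetic of this example can be checked with a short sketch. The function name `one_sample_t` is illustrative, and the critical value 2.776 (t0.025 for 4 degrees of freedom) is a standard table value that does not appear on the slide.

```python
import math

def one_sample_t(sample_mean, population_mean, sample_std, n):
    """t = (x_bar - mu) / (s / sqrt(N))"""
    return (sample_mean - population_mean) / (sample_std / math.sqrt(n))

t = one_sample_t(sample_mean=20270, population_mean=12000, sample_std=5811, n=5)

T_CRIT_4_DF = 2.776             # table value t_0.025 for N - 1 = 4 degrees of freedom
reject_h0 = t > T_CRIT_4_DF     # t beyond the critical value means p < 0.025

print(round(t, 2), reject_h0)   # 3.18 True
```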

Page 8:

Paired Sample t Test

Given a set of paired observations (from two normal populations)

A     B     δ = A − B
x1    y1    x1 − y1
x2    y2    x2 − y2
x3    y3    x3 − y3
x4    y4    x4 − y4
x5    y5    x5 − y5

Page 9:

Calculate the mean \( \bar{x}_\delta \) and the standard deviation \( s_\delta \) of the differences

H0: \( \mu_\delta = 0 \) (no difference)

H0: \( \mu_\delta = k \) (difference is a constant)

\[ t_\delta = \frac{\bar{x}_\delta - \mu_\delta}{\hat{\sigma}_\delta} \qquad \hat{\sigma}_\delta = \frac{s_\delta}{\sqrt{N}} \]

where N is the number of pairs.
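A minimal sketch of the paired test, assuming the standard error of the mean difference is the standard deviation of the differences divided by the square root of the number of pairs. The two measurement lists are made-up illustration data, not from the slides.

```python
import math
import statistics

def paired_t(a, b, mu_delta=0.0):
    """t statistic for paired samples a and b under H0: mean difference = mu_delta."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    deltas = [x - y for x, y in zip(a, b)]
    mean_delta = statistics.fmean(deltas)
    s_delta = statistics.stdev(deltas)               # sample std, divides by N - 1
    standard_error = s_delta / math.sqrt(len(deltas))
    return (mean_delta - mu_delta) / standard_error

# Hypothetical paired observations (e.g. before/after measurements)
a = [12.0, 15.0, 11.0, 14.0, 13.0]
b = [10.0, 14.0, 10.0, 11.0, 12.0]

t_delta = paired_t(a, b)   # differences 2, 1, 1, 3, 1 give a t of about 4.0
print(round(t_delta, 3))
```

The resulting value is compared against a t table with N − 1 degrees of freedom, exactly as in the one-sample case.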

Page 10:

Confidence Intervals (σ known)

Standard error from the population standard deviation:

\[ \sigma_{\bar{x}} = \frac{\sigma_{\text{Population}}}{\sqrt{N}} \]

The 95 percent confidence interval for a normal distribution about the mean is

\[ \bar{x} \pm 1.96 \cdot \sigma_{\bar{x}} \]

Page 11:

Confidence Interval (σ unknown)

Standard error from the sample standard deviation:

\[ \hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{N}} \]

The 95 percent confidence interval for the t distribution (\( t_{0.025} \) from a table) is

\[ \bar{x} \pm t_{0.025} \cdot \hat{\sigma}_{\bar{x}} \]

Previous Example: [figure not available in the transcript]
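Since the figure for the previous example is missing, here is a sketch of the interval for the parked-cars data (mean 20,270 €, s = 5,811 €, N = 5). The critical value 2.776 is the standard table entry for t0.025 with 4 degrees of freedom; it is supplied here as an assumption, since the slide does not show it.

```python
import math

def confidence_interval(mean, std, n, critical_value):
    """Return (low, high) = mean -/+ critical_value * std / sqrt(N)."""
    half_width = critical_value * std / math.sqrt(n)
    return mean - half_width, mean + half_width

# sigma unknown: sample standard deviation with t_0.025 for 4 degrees of freedom
low, high = confidence_interval(mean=20270, std=5811, n=5, critical_value=2.776)
print(f"95% CI: [{low:.0f}, {high:.0f}] EUR")

# sigma known: the same helper with the normal critical value 1.96
low_z, high_z = confidence_interval(mean=20270, std=5811, n=5, critical_value=1.96)
```

The town mean of 12,000 € lies below the t-based interval, which agrees with rejecting H0 in the example.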

Page 12:

Overview

• Data Transformation
• Reduce Noise
• Reduce Data

Page 13:

Data Transformation

• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
• Attribute/feature construction
  • New attributes constructed from the given ones

Page 14:

Data Transformation: Normalization

Min-max normalization: to [new_minA, new_maxA]

\[ v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A \]

Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000} (1.0 - 0) + 0 = 0.716 \]

Z-score normalization (μ: mean, σ: standard deviation):

\[ v' = \frac{v - \mu_A}{\sigma_A} \]

Ex. Let μ = 54,000, σ = 16,000. Then

\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]

Normalization by decimal scaling
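The three normalizations can be sketched as follows. The min-max and z-score functions follow the formulas of this slide; the decimal-scaling function uses the usual textbook definition (divide by the smallest power of ten that brings every absolute value below 1), which is an assumption because the slide's formula did not survive in the transcript.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Express v in standard deviations away from the mean."""
    return (v - mean) / std

def decimal_scaling(values):
    """v' = v / 10**j with the smallest integer j such that max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-986, 917]))             # values scaled by 10**3
```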

Page 15:

How to Handle Noisy Data? (How to Reduce Features?)

• Binning
  • first sort data and partition into (equal-frequency) bins
  • then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Regression
  • smooth by fitting the data into regression functions
• Clustering
  • detect and remove outliers
• Combined computer and human inspection
  • detect suspicious values and check by human (e.g., deal with possible outliers)

Page 16:

Data Reduction Strategies

• A data warehouse may store terabytes of data
  • Complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
  • Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
  • Data cube aggregation
  • Dimensionality reduction: remove unimportant attributes
  • Data compression
  • Numerosity reduction: fit data into models
  • Discretization and concept hierarchy generation

Page 17:

Simple Discretization Methods: Binning

• Equal-width (distance) partitioning:
  • Divides the range into N intervals of equal size: uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B − A)/N
  • The most straightforward, but outliers may dominate presentation
  • Skewed data is not handled well
• Equal-depth (frequency) partitioning:
  • Divides the range into N intervals, each containing approximately the same number of samples
  • Good data scaling
  • Managing categorical attributes can be tricky
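A small sketch of equal-width partitioning with W = (B − A)/N. The function name and the sample prices are illustrative; placing the maximum value into the last bin is a design choice made here, not something the slide specifies.

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of equal width W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    if width == 0:
        raise ValueError("all values are identical; no range to partition")
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would index one past the end, so clamp it into the last bin
        index = min(int((v - a) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(prices, 3)
print(width)   # 10.0
print(bins)    # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
```

The uneven bin sizes in the output illustrate why skewed data is handled poorly: the bins have equal width but very different counts.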

Page 18:

Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (min and max are identified, each bin value is replaced by the closest boundary value):
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
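The price example can be reproduced with a short sketch. It assumes the number of values is divisible by the number of bins, rounds bin means to whole dollars as the slide does, and sends ties to the lower boundary (an arbitrary choice that this data never triggers).

```python
def equal_depth_bins(sorted_values, n_bins):
    """Partition sorted values into n_bins bins holding the same number of values."""
    size = len(sorted_values) // n_bins   # assumes len is divisible by n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]           # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(smooth_by_means(bins))        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```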

Page 19:

Cluster Analysis

Page 20:

Regression

[Figure: regression line y = x + 1 in the x-y plane, with labels X1, Y1 and Y1']

Page 21:

Heuristic Feature Selection Methods

• There are 2^d − 1 possible non-empty feature subsets of d features
• Several heuristic feature selection methods:
  • Best single features under the feature independence assumption: choose by significance tests
  • Best step-wise feature selection:
    • The best single feature is picked first
    • Then the next best feature conditioned on the first, ...
  • Step-wise feature elimination:
    • Repeatedly eliminate the worst feature
  • Best combined feature selection and elimination
  • Optimal branch and bound:
    • Use feature elimination and backtracking
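A sketch of best step-wise (forward) feature selection next to the exhaustive subset count. The `score` callback, the feature names and the weights are hypothetical stand-ins for whatever quality measure is actually used (for example a significance test or validation accuracy).

```python
from itertools import combinations

def count_subsets(features):
    """Count all non-empty feature subsets by enumeration: 2**d - 1 of them."""
    return sum(1 for k in range(1, len(features) + 1)
               for _ in combinations(features, k))

def forward_selection(features, score, k):
    """Greedily add the feature that most improves score(selected + [feature])."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical usefulness weights; a real score would evaluate a model
weights = {"age": 0.5, "income": 0.9, "zip": 0.1, "height": 0.3}
score = lambda subset: sum(weights[f] for f in subset)

print(count_subsets(list(weights)))                # 15
print(forward_selection(list(weights), score, 2))  # ['income', 'age']
```

The greedy search evaluates on the order of d·k subsets instead of all 2^d − 1, at the cost of possibly missing the best combination.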

Page 22:

Sampling: with or without Replacement

[Figure: Raw Data sampled as SRSWOR (simple random sample without replacement) and as SRSWR (simple random sample with replacement)]
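Both sampling schemes map directly onto Python's standard library, as the sketch below shows. The raw data, sample size and seed are illustrative.

```python
import random

rng = random.Random(42)              # fixed seed for a repeatable illustration
raw_data = list(range(1, 21))        # hypothetical record ids 1..20

# SRSWOR: each record can be drawn at most once
srswor = rng.sample(raw_data, k=8)

# SRSWR: a drawn record is put back, so duplicates are possible
srswr = rng.choices(raw_data, k=8)

print("SRSWOR:", srswor)
print("SRSWR: ", srswr)
```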

Page 23:

Principal Component Analysis

From Covariance Matrix to PCA and Dimension Reduction

[Figure: axes X1, X2 and Y1, Y2]

Page 24:

Feature space

Sample

\[ \{ \vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(k)}, \ldots, \vec{x}^{(n)} \} \]

\[ \vec{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} \in \mathbb{R}^d \]

\[ \| \vec{x} - \vec{y} \| = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} \]

Page 25:

Scaling

• A well-known scaling method consists of subtracting the mean and dividing by the standard deviation

\[ y_i = \frac{x_i - m_i}{s_i} \]

• \( m_i \): sample mean, \( s_i \): sample standard deviation

Page 26:

According to the scaled metric the scaled feature vector is expressed as

\[ \| \vec{y} \|_s = \sqrt{\sum_{i=1}^{d} \frac{(x_i - m_i)^2}{s_i^2}} \]

• shrinking large variance values, \( s_i > 1 \)
• stretching low variance values, \( s_i < 1 \)

Fails to preserve distances when a general linear transformation is applied!

Page 27:

Covariance

• Measures the tendency of two features \( x_i \) and \( x_j \) to vary in the same direction
• The covariance between features \( x_i \) and \( x_j \) is estimated for n patterns

\[ c_{ij} = \frac{\sum_{k=1}^{n} \left( x_i^{(k)} - m_i \right) \left( x_j^{(k)} - m_j \right)}{n - 1} \]

Page 28:

\[ C = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1d} \\ c_{21} & c_{22} & \cdots & c_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ c_{d1} & c_{d2} & \cdots & c_{dd} \end{bmatrix} \]

Page 29:

Correlation

• Covariances are symmetric: \( c_{ij} = c_{ji} \)
• Covariance is related to correlation

\[ r_{ij} = \frac{\sum_{k=1}^{n} \left( x_i^{(k)} - m_i \right) \left( x_j^{(k)} - m_j \right)}{(n - 1) s_i s_j} = \frac{c_{ij}}{s_i s_j} \in [-1, 1] \]
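A plain-Python sketch of the covariance and correlation estimates, written to mirror the index notation (patterns k = 1..n, features i, j = 1..d). The four two-dimensional patterns are made-up illustration data.

```python
import math

def covariance_matrix(patterns):
    """c_ij = sum_k (x_i(k) - m_i)(x_j(k) - m_j) / (n - 1) for n patterns."""
    n, d = len(patterns), len(patterns[0])
    means = [sum(p[i] for p in patterns) / n for i in range(d)]
    return [[sum((p[i] - means[i]) * (p[j] - means[j]) for p in patterns) / (n - 1)
             for j in range(d)]
            for i in range(d)]

def correlation_matrix(cov):
    """r_ij = c_ij / (s_i s_j), where s_i = sqrt(c_ii)."""
    s = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (s[i] * s[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Hypothetical patterns with d = 2 features
patterns = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0)]
C = covariance_matrix(patterns)
R = correlation_matrix(C)
print(C)
print(round(R[0][1], 3))   # 0.837
```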

Page 30:

Karhunen-Loève Transformation

• Covariance matrix C of the features (a d × d matrix)
  • Symmetric and positive semi-definite (positive definite when no feature is a linear combination of the others)
• There are d eigenvalues and eigenvectors

\[ C \vec{u}_i = \lambda_i \vec{u}_i \qquad (\lambda I - C) \vec{u} = 0 \]

\[ U^T C U = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d) \]

• \( \lambda_i \) is the ith eigenvalue of C and \( \vec{u}_i \), the ith column of U, is the ith eigenvector

Page 31:

• Eigenvectors of the symmetric matrix C can always be chosen orthogonal
• U is an orthonormal matrix: \( U U^T = U^T U = I \)
• U defines the K-L transformation
• The transformed features by the K-L transformation are given by

\[ \vec{y} = U^T \vec{x} \quad \text{(linear transformation)} \]

• K-L transformation rotates the feature space into alignment with uncorrelated features

Page 32:

Example

\[ C = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \]

\[ |\lambda I - C| = 0 \quad \Rightarrow \quad \lambda^2 - 3\lambda + 1 = 0 \]

\[ \lambda_1 = 2.618 \qquad \lambda_2 = 0.382 \]

For \( \lambda_1 \):

\[ \begin{bmatrix} 0.618 & -1 \\ -1 & 1.618 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = 0 \]

\[ u^{(1)} = [1 \;\; 0.618] \qquad u^{(2)} = [-1 \;\; 1.618] \]
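The numbers of this example can be checked with a closed-form sketch for symmetric 2 × 2 matrices. The function name is illustrative, the eigenvectors are left unnormalized with first component 1, and the code assumes a non-zero off-diagonal entry.

```python
import math

def eigen_2x2_symmetric(a, b, c):
    """Eigenpairs of [[a, b], [b, c]] from lambda^2 - (a + c) lambda + (a c - b^2) = 0."""
    trace, det = a + c, a * c - b * b
    root = math.sqrt(trace * trace - 4 * det)   # never negative for a symmetric matrix
    eigenvalues = [(trace + root) / 2, (trace - root) / 2]
    # First row of (C - lambda I) u = 0 gives (a - lambda) u1 + b u2 = 0; set u1 = 1
    # (assumes b != 0)
    eigenvectors = [(1.0, (lam - a) / b) for lam in eigenvalues]
    return eigenvalues, eigenvectors

eigenvalues, eigenvectors = eigen_2x2_symmetric(2, 1, 1)
print([round(lam, 3) for lam in eigenvalues])                   # [2.618, 0.382]
print([tuple(round(x, 3) for x in u) for u in eigenvectors])    # [(1.0, 0.618), (1.0, -1.618)]
```

The second eigenvector comes out as [1, −1.618], which is the slide's [−1, 1.618] multiplied by −1; eigenvectors are only defined up to such a scale factor.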


Page 34:

PCA (Principal Components Analysis)

• New features y are uncorrelated, with the covariance matrix \( \Lambda \)
• Each eigenvector \( u_i \) is associated with some variance, given by \( \lambda_i \)
• Uncorrelated features with higher variance (represented by \( \lambda_i \)) contain more information
• Idea: Retain only the significant eigenvectors \( u_i \)

Page 35:

Dimension Reduction

• How many eigenvalues (and corresponding eigenvectors) to retain?
• Kaiser criterion
  • Discards eigenvectors whose eigenvalues are below 1
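A sketch of the Kaiser criterion followed by projection onto the retained eigenvectors, using the eigenpairs of the earlier 2 × 2 example. The function names and the test vector are illustrative. The threshold of 1 is usually applied to standardized features, which is an added remark, not something the slide states.

```python
import math

def kaiser_select(eigenvalues, eigenvectors, threshold=1.0):
    """Keep only the eigenpairs whose eigenvalue is at least the threshold."""
    return [(lam, vec) for lam, vec in zip(eigenvalues, eigenvectors)
            if lam >= threshold]

def project(x, kept):
    """Coordinates of x along each retained eigenvector (normalized to unit length)."""
    reduced = []
    for _, vec in kept:
        norm = math.sqrt(sum(v * v for v in vec))
        reduced.append(sum(xi * vi for xi, vi in zip(x, vec)) / norm)
    return reduced

eigenvalues = [2.618, 0.382]
eigenvectors = [(1.0, 0.618), (-1.0, 1.618)]

kept = kaiser_select(eigenvalues, eigenvectors)
y = project((2.0, 1.0), kept)    # hypothetical 2-d pattern reduced to 1 dimension
print(len(kept), y)
```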

Page 36:

Problems

• Principal components are linear transformations of the original features
• It is difficult to attach any semantic meaning to principal components

Page 37:

Fourier Analysis

• It is always possible to analyze "complex" periodic waveforms into a set of sinusoidal waveforms
• Any periodic waveform can be approximated by adding together a number of sinusoidal waveforms
• Fourier analysis tells us what particular set of sinusoids go together to make up a particular complex waveform

Page 38:

Spectrum

• In the Fourier analysis of a complex waveform the amplitude of each sinusoidal component depends on the shape of the particular complex wave
  • Amplitude of a wave: maximum or minimum deviation from zero line
  • T: duration of a period
  • Frequency: \( f = 1/T \)

Page 39:

Noise reduction or Dimension Reduction

• It is difficult to identify the frequency components by looking at the original signal
• Converting to the frequency domain
  • If dimension reduction: store only a fraction of the frequencies (those with high amplitude)
  • If noise reduction:
    • remove high frequencies (fast change): smoothing
    • remove low frequencies (slow change): remove global trends
• Inverse discrete Fourier transform
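The steps above (transform, drop frequencies, transform back) can be sketched with a naive discrete Fourier transform. It is written out in O(n²) form for readability; a real application would use an FFT routine. The test signal, cutoff and function names are illustrative assumptions.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft(spectrum):
    """Naive inverse DFT; returns the real part, since the input signal was real."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def low_pass(signal, cutoff):
    """Zero all components above `cutoff` cycles per window, then transform back."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        frequency = min(k, n - k)   # bins k and n - k describe the same frequency
        if frequency > cutoff:
            spectrum[k] = 0
    return inverse_dft(spectrum)

n = 64
clean = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]        # slow component
noisy = [c + 0.3 * math.sin(2 * math.pi * 20 * t / n)                # plus fast "noise"
         for t, c in enumerate(clean)]

smoothed = low_pass(noisy, cutoff=5)
max_error = max(abs(s - c) for s, c in zip(smoothed, clean))
print(f"largest deviation from the clean signal: {max_error:.2e}")
```

Removing low frequencies instead (to drop global trends) only requires flipping the comparison, and dimension reduction corresponds to storing just the few non-zero spectrum entries.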


Page 42:

Dimensionality Reduction: Wavelet Transformation

• Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis
• Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
• Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
• Method:
  • Length, L, must be an integer power of 2 (padding with 0's, when necessary)
  • Each transform has 2 functions: smoothing, difference
  • Applies to pairs of data, resulting in two sets of data of length L/2
  • Applies the two functions recursively, until it reaches the desired length

[Figure: Haar-2 and Daubechies-4 wavelets]
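A sketch of the method with the Haar wavelet, assuming pairwise averaging as the smoothing function and half the pairwise difference as the difference function (one common, unnormalized variant). The input data and the function names are illustrative.

```python
def haar_transform(data):
    """Full Haar decomposition: [overall average, coarse details, ..., fine details]."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2 (pad with zeros if necessary)")
    smooth, details = list(data), []
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]   # difference function
        smooth = [(a + b) / 2 for a, b in pairs]          # smoothing function
        details = level_details + details                 # coarser levels go in front
    return smooth + details

def keep_strongest(coefficients, k):
    """Compressed approximation: zero all but the k largest-magnitude coefficients."""
    order = sorted(range(len(coefficients)),
                   key=lambda i: abs(coefficients[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coefficients)]

coefficients = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
print(coefficients)                      # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(keep_strongest(coefficients, 4))   # [2.75, -1.25, 0.0, 0.0, 0.0, -1.0, -1.0, 0.0]
```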

Page 43:

Data Integration

• Data integration:
  • Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  • Integrate metadata from different sources
• Entity identification problem:
  • Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  • For the same real world entity, attribute values from different sources are different
  • Possible reasons: different representations, different scales, e.g., metric vs. British units

Page 44:

Handling Redundancy in Data Integration

• Redundant data occur often when integrating multiple databases
  • Object identification: The same attribute or object may have different names in different databases
  • Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Page 45:

Automatic Concept Hierarchy Generation

• Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  • The attribute with the most distinct values is placed at the lowest level of the hierarchy
  • Exceptions, e.g., weekday, month, quarter, year

country (15 distinct values)
province_or_state (365 distinct values)
city (3567 distinct values)
street (674,339 distinct values)
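The heuristic amounts to sorting attributes by their number of distinct values, as in this sketch using the counts from the slide. The function name is illustrative.

```python
def generate_hierarchy(distinct_counts):
    """Order attributes from the top of the hierarchy (fewest distinct values)
    to the bottom (most distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get)

distinct_counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}

hierarchy = generate_hierarchy(distinct_counts)
print(" > ".join(hierarchy))   # country > province_or_state > city > street
```

Attributes such as weekday (7 values), month (12) and year would be ordered incorrectly by this rule, which is why the slide lists them as exceptions.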

Page 46:

• Paired Sample t Test
• Data Transformation - Overview
• From Covariance Matrix to PCA and Dimension Reduction
• Fourier Analysis - Spectrum
• Dimension Reduction
• Data Integration
• Automatic Concept Hierarchy Generation

Page 47:

Mining Association Rules

• Apriori Algorithm (Chapter 6, Han and Kamber)