
Dimensionality reduction in big data sets

Gudmund Horn Hermansen

Department of Mathematics at University of Oslo

26 September 2014


What are the big data phenomena?

• Typically characterised by high dimensionality and/or large sample sizes.

• Granularity and heterogeneities.

• Collection of massive datasets for both ‘classical’ statistical testing and for exploratory science.

• A meeting point of computer science and statistics.

• Sensor networks, monitoring, physics, astronomy, genomics, human (online) activity, images, voice translation, documents, ...


What are big data challenges?

• More data inspires more detailed questions and hypotheses:
- Personalised medicine.
- Personalised advertising.
- Personalised recommendation systems.

• Noise accumulation and/or a low signal-to-noise ratio.

• Spurious relationships and lurking variables.

• Too big is not always ‘big’, e.g. numerical integration and fast predictions.

• Runtime and risk trade-offs (cf. Jordan et al. (2013)).

• Traditional methods may not necessarily work or be valid.

• New algorithms and methodology:
- Machine learning.

• New (real) computer infrastructure, like Hadoop.


Why dimensionality reduction?

• Making massive datasets somewhat less massive.

• Data compression.

• Visualisation.

• More accurate and relevant representations of data.

• Runtime and memory, storage and bandwidth bottlenecks.

• Validity and accuracy of algorithms.


Outline

Machine learning

Principal component analysis (PCA)

Unsupervised feature learning

Regression and LASSO

Sufficiency

Focused dimensionality reduction


What is machine learning?

“Field of study that gives computers the ability to learn without being explicitly programmed.”

Arthur Samuel (1959)

“Algorithms that learn/generalise from experience/data.”


Unsupervised learning

• Let X be the space of input features.

• Let S = {xi : i ≤ n} be a set of training data/observations.

• Unsupervised learning:
- Learning from, or finding structure in, unlabelled training data.

• Methods (clustering and dimensionality reduction):
- Cluster analysis (k-means clustering).
- EM and mixtures of Gaussians.
- Factor analysis.
- Principal component analysis (PCA).
- Independent component analysis (ICA).

• Applications:
- Image segmentation.
- Pattern recognition.
- Clustering users, genes, plants, ...
- Novelty detection.


Supervised learning

• Let X be the space of input features and Y the output values.

• Let S = {(xi , yi) : i ≤ n} be a set of training data/observations.

• Supervised learning:
- Learning from labelled training data.
- The task is to ‘learn’ a function h : X → Y so that h(x) is a ‘good’ predictor for y.

• Methods:
- Linear, logistic and nonparametric regression.
- Support vector machines (SVM).
- Neural networks.
- Naive Bayes.
- Perceptron.

• Applications:
- Handwriting analysis.
- Spam detection.
- Pattern and speech recognition.
- Autonomous driving (cars).


Supervised learning

• What about focused supervised learning?

• Make good estimators for a general focus parameter µ:
- Predictions and forecasts.
- Confidence bounds and quantiles.
- Threshold probabilities.


Redundancy in data - a prototype illustration

[Figure: scatter plot of heights measured in cm (x-axis, roughly 179.6–180.8) against the same heights measured in inches (y-axis, roughly 70.7–71.2).]


Redundancy in data - a prototype illustration

[Figure: the same heights shown along a single ‘height’ axis (roughly 179.6–180.8).]

Principal component analysis (PCA)

• The most common and by far the most used algorithm for dimensionality reduction.

• Finds the ‘best’ lower-dimensional subspace with respect to minimising the squared projection error.

• Let x1, . . . , xn, with xi ∈ Rm. For a given k ≤ m, PCA finds the ‘best’ k-dimensional subspace spanned by u1, . . . , uk, with ul ∈ Rm.

• This motivates an approximation of x1, . . . , xn in the linear subspace spanned by u1, . . . , uk, i.e.

x̃i = ∑_{l=1}^{k} αi,l ul ≈ xi, for i ≤ n,

for certain weights αi,l.

• Unsupervised.

• Singular value decomposition (SVD).
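As a small aside (my own illustration, not code from the talk), a minimal numpy sketch of PCA via the SVD: the directions u1, . . . , uk are the leading right singular vectors of the centred data matrix, and each xi is approximated by its projection onto their span.

    import numpy as np

    def pca_approximation(X, k):
        """X is an (n, m) data matrix; returns the weights, the directions and the rank-k approximation."""
        mu = X.mean(axis=0)                         # centre the data
        Xc = X - mu
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        W = Vt[:k].T                                # (m, k) matrix of directions u_1, ..., u_k
        alpha = Xc @ W                              # weights alpha_{i,l}
        X_tilde = mu + alpha @ W.T                  # approximations of x_1, ..., x_n
        return alpha, W, X_tilde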


PCA illustration

[Figure: scatter plot of two-dimensional data, x1 against x2.]

PCA illustration

[Figure: the same data with the principal component (PC) axes added.]

PCA illustration

[Figure: the data, the PC axes, and the one-dimensional PCA approximation of each point.]


PCA is not linear regression

[Figure: the same data with the least squares (LM) fit, the PC axes and the PCA approximation. The regression line minimises vertical residuals in x2, whereas PCA minimises orthogonal projection distances.]

Text classification illustration

• Dictionary with e.g. 50000 words:

aapa, aardvark, aardwolf, aargh, abaca, aback, abacus, ...

• Document i is represented as a vector xi ∈ R^50000, for i = 1, . . . , n.

• The elements of xi are

xi,j = 1 if word j in the dictionary occurs in document i, and xi,j = 0 otherwise.

• Sparse vectors in a very high-dimensional space.

• Documents are seen as similar if the angle between them is small, i.e.

∆(xi, xj) = cos(θ) = xiᵗxj / (‖xi‖ ‖xj‖),

where θ is the angle between xi and xj.
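A tiny numpy sketch of this similarity measure (my own illustration, on a made-up five-word dictionary):

    import numpy as np

    def cosine_similarity(xi, xj):
        """Delta(x_i, x_j) = x_i^t x_j / (||x_i|| ||x_j||); zero vectors get similarity 0."""
        denom = np.linalg.norm(xi) * np.linalg.norm(xj)
        return float(xi @ xj) / denom if denom > 0 else 0.0

    # 0/1 indicator vectors over a hypothetical dictionary ['bus', 'car', 'vehicle', 'bike', 'banana'].
    doc_i = np.array([1, 1, 1, 0, 0])
    doc_j = np.array([0, 1, 1, 1, 0])
    print(cosine_similarity(doc_i, doc_j))   # 2/3, since the documents share 'car' and 'vehicle'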


Text classification illustration

• Documents about transportation will typically include words like bus, car, vehicle, bike, ...

• Note that the numerator of ∆(xi, xj) is

xiᵗxj = ∑_{l=1}^{k} I(documents i and j both contain word l),

which is 0 if the documents have no words in common.

[Figure: two document vectors xi and xj plotted against the ‘car’ and ‘vehicle’ word axes.]

Text classification illustration

[Figure: the same two document vectors xi and xj, now with a lower-dimensional direction ul added (cf. the PCA directions u1, . . . , uk).]

Image learning

input → learning algorithm

• Applications:
- Classification.
- Finding more images of the same/similar type.
- Deciding whether two images show the same object.

• Working directly on the pixel level won’t work.

• Deep learning research, see e.g. the work by Andrew Y. Ng.

[http://ufldl.stanford.edu]

Feature representation in image learning

input → feature representation → learning algorithm

• Represent images by features.

• Hand designed features such as ‘wheels’, ‘handlebar’, ‘seat’, ...


Feature representation

Input → feature representation → learning algorithm

• The feature representation is how the algorithm sees the world:
- Reduce the dimensionality.
- More relevant data representation.
- Data compression.
- Reduce noise.
- Standard methodology.

• Handmade features are hard and time consuming to design, and require expert knowledge.

• Separate communities (vision, audio, text, ...) have their own uniquely designed features.

• Is there a better way?


Biological inspiration

• The auditory cortex ‘learns’ how to see.

• The ‘one learning algorithm’ hypothesis.


Learning input representations

• Is there a better way to represent these images than the raw pixels?

• Is there a similar strategy for audio and text documents?


Feature learning

• Is there a better representation of a 14×14 image patch x than a vector of 196 raw pixel intensities?

[Figure: an example patch and (part of) its vector of raw pixel values.]

• How can we make a feature representation without any hand design?


Classification illustration

[Figure: example images labelled ‘bikes’ and ‘motorcycles’, a test image (‘bike?’), and a panel labelled ‘random’.]

Feature learning by sparse coding

• Sparse coding was developed in Olshausen & Field (1997) to explain visual processing in the brain.

• Image input x1, . . . , xn, where xi ∈ R^{m×m}.

• The idea is to ‘learn’ a set of basis functions φ1, . . . , φk, where φj ∈ R^{m×m}, such that each input

x ≈ ∑_{j=1}^{k} aj φj,

with the restriction that the coefficient vector (a1, . . . , ak) is mostly zero (sparse).
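A hedged sketch of the idea with scikit-learn's dictionary learning (my own illustration, not the method used in the talk); the image patches are assumed to be flattened into the rows of a matrix X, and random numbers stand in for real patches:

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 64))            # stand-in for 500 flattened 8x8 patches

    # Learn k = 30 basis functions phi_1, ..., phi_k with a sparsity penalty alpha.
    learner = MiniBatchDictionaryLearning(n_components=30, alpha=1.0, random_state=0)
    codes = learner.fit_transform(X)              # sparse coefficients a_j for each patch
    phi = learner.components_                     # (30, 64) learned basis functions

    print(np.mean(codes == 0))                    # most coefficients are exactly zero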


Sparse coding illustration

• Example patch:

[Figure: an image patch approximated as a weighted sum, with weights roughly 0.6, 0.4 and 0.8, of three learned basis patches.]

Unsupervised feature learning

• The algorithm ‘invents’ edge detection.

• Hopefully a better representation.

• Similar to how we believe the brain processes images in the visual cortex (area V1).

• Works for audio, text and other types of complex data.

• Performs better than many of the old standards based on hand-designed features.

[http://ufldl.stanford.edu]

Outline

Machine learning

Principal component analysis (PCA)

Unsupervised feature learning

Regression and LASSO

Sufficiency

Focused dimensionality reduction


Regression and LASSO

• For observations (xi, yi), i = 1, . . . , n, where xi ∈ R^p and p < n.

• The goal is to represent the relationship between xi and yi by

yi ≈ ŷi = β̂0 + β̂1 xi,1 + · · · + β̂p xi,p,

where

β̂ = argmin_β ∑_{i=1}^{n} (yi − β0 − β1 xi,1 − · · · − βp xi,p)².
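A minimal numpy sketch of this least squares fit (my own illustration; the data-generating coefficients below are made up):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 200, 5
    X = rng.standard_normal((n, p))
    y = 2.0 + X @ np.array([1.0, 0.0, -0.5, 0.0, 0.3]) + 0.1 * rng.standard_normal(n)

    # Add an intercept column and solve min_beta sum_i (y_i - beta_0 - beta^t x_i)^2.
    X1 = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    y_hat = X1 @ beta_hat                          # fitted values
    print(beta_hat)                                # close to (2.0, 1.0, 0.0, -0.5, 0.0, 0.3)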


Regression and LASSO

• By adding a penalty to the least squares criterion above,

β̂ = argmin_β { ∑_{i=1}^{n} (yi − β0 − β1 xi,1 − · · · − βp xi,p)² + λ ∑_{j=1}^{p} |βj| },

we obtain the LASSO.

• Some β coefficients may be set to zero.

• Feature selection or dimensionality reduction.
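A small scikit-learn sketch of this (my own illustration; the alpha parameter plays the role of λ, and the sparse data-generating coefficients are made up):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(2)
    n, p = 100, 20
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:3] = [2.0, -1.0, 0.5]                 # only three active features
    y = X @ beta_true + 0.1 * rng.standard_normal(n)

    lasso = Lasso(alpha=0.1).fit(X, y)
    print(np.flatnonzero(lasso.coef_))               # indices of the features kept (non-zero coefficients)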


Regression and LASSO

• If n < p, standard methodology does not work.

• Use PCA or another type of dimensionality reduction as an intermediate step.

• Forward selection: start with an empty set of features.
- Find the best feature xj by e.g. p-value or cross-validation.
- Include this in the feature set.
- Repeat until the stopping rule is satisfied.

• Testing all possible subsets is NP-hard.

• Preselection: make an ordered list of features,
- e.g. by the pairwise correlations cor(y, xj) for each j = 1, . . . , p.
- Include features until the stopping rule is satisfied.

• Focused preselection and dimensionality reduction.
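A minimal sketch of correlation-based preselection (my own illustration of the idea above, not the speaker's code):

    import numpy as np

    def preselect_by_correlation(X, y, n_keep):
        """Order features by |cor(y, x_j)| and return the indices of the n_keep strongest ones."""
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        cors = (Xc * yc[:, None]).sum(axis=0) / (
            np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
        )
        order = np.argsort(-np.abs(cors))            # strongest correlations first
        return order[:n_keep]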


Illustration

• Let (xi, yi), where xi ∈ R^100, n = 50, and where yi depends on only 10 of these features.

• Make several random partitions of the feature set into subsets of size 25.

• Run standard linear regression and keep track of features with p-value less than 0.05.

• The popularity of each feature is used for the preselection list.

[Figure: selection counts (0–1000) for each of the 100 features across the repeated fits.]

Illustration

• Same setup as above.

• The stopping rule is ‘significant at the 0.05 level’ and fewer than 25 features.

• This ‘finds’ the 10 correct features.

• ‘Standard’ preselection selects only 6 features, all of which are correct.

• Forward selection ends up with 25 features, including the 10 correct ones.
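A rough sketch of this kind of experiment (my own reconstruction; the data-generating coefficients, noise level and number of repetitions are assumptions, not the speaker's exact setup):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, p, n_signal = 50, 100, 10
    X = rng.standard_normal((n, p))
    y = X[:, :n_signal] @ np.ones(n_signal) + rng.standard_normal(n)   # first 10 features matter

    counts = np.zeros(p)
    for _ in range(200):                                 # number of random partitions is a guess
        perm = rng.permutation(p)
        for block in perm.reshape(4, 25):                # partition the 100 features into 4 blocks of 25
            Xb = np.column_stack([np.ones(n), X[:, block]])
            beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
            resid = y - Xb @ beta
            df = n - Xb.shape[1]
            sigma2 = resid @ resid / df
            se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xb.T @ Xb)))
            pval = 2 * stats.t.sf(np.abs(beta / se), df=df)
            counts[block] += (pval[1:] < 0.05)           # drop the intercept

    preselection_order = np.argsort(-counts)             # 'popularity' list for preselection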


Large samples (and high-dimensional feature spaces)

• Long time series:
- Econometrics.
- Stock market trading.
- Auctions.

• Monitoring:
- Network traffic.
- Public security.
- Production at a factory.
- Air pollution.

• Predicting time to failure, predicting the crossing of a threshold value, classifying behaviour, ...

• Fast and reliable predictions.


Sufficiency

• As a prototype example, suppose x1, . . . , xn are i.i.d. with xi ∼ N(µ, σ²). Then

s1 = (1/n) ∑_{i=1}^{n} xi and s2 = (1/n) ∑_{i=1}^{n} xi²

are sufficient statistics for µ and σ².

• Dimensionality reduction without (relative) loss of information.

• Exponential class of models:

f(y, θ) = g(y) exp{θᵗT(y) − c(θ)},

for a suitable T(y) = (T1(y), . . . , Tp(y))ᵗ.

• In exponential families the dimensionality of the sufficient statistic does not increase as the sample size n increases.

• Fast updates for on-line data streams.
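A small sketch of such fast on-line updates for the normal prototype above (my own illustration): only the running sums need to be stored and updated as new observations stream in.

    class RunningSufficientStats:
        """Keeps (n, sum of x_i, sum of x_i^2), which is enough to recover estimates of mu and sigma^2."""

        def __init__(self):
            self.n = 0
            self.sum_x = 0.0
            self.sum_x2 = 0.0

        def update(self, x):
            # O(1) update per new observation, regardless of how much data has been seen.
            self.n += 1
            self.sum_x += x
            self.sum_x2 += x * x

        def estimates(self):
            mu_hat = self.sum_x / self.n
            sigma2_hat = self.sum_x2 / self.n - mu_hat ** 2   # maximum likelihood estimate of sigma^2
            return mu_hat, sigma2_hat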


Sufficient dimension reduction

• In regression the main interest is the properties of y | x.

• Let R(x) represent a reduction of x .

• It is said to be a sufficient reduction if the distribution of y | R(x) is the same as that of y | x, see e.g. Adragni & Cook (2009).

• No information is lost if the model is correctly specified.

• As an illustration, consider the model

y = α + βᵗx + ε,

where ε is independent of x; then the distribution of y | x is the same as that of y | βᵗx, so R(x) = βᵗx is a sufficient reduction.

• Focused sufficient dimensionality reduction, i.e. a reduction that results in no loss with respect to a chosen focus parameter.


Why focused dimensionality reduction?

• Interested in the bright and clear objects.

• But also the background signal.

• Dimensionality reduction should take this into account.

[Big Dipper: The Universe]

Why focused dimensionality reduction?

• Runtime.

• More accurate predictions with larger samples of relevant data.

[www.luftkvalitet.info]

Why focused dimensionality reduction?

• Runtime.

• Advertising, reading suggestions and recommendations.

[www.aftenposten.no]

Concluding Remarks

• Focused inference and methodology.

• Scaling up, i.e. more data and more features, is (usually) more important than smart algorithms (Coates et al., 2011).

• Sometimes high dimensionality is ‘good’, as in some kernel methods.

• Vapnik–Chervonenkis (VC) dimension.

• Related methodology:
- Hierarchical models.
- Bayesian nonparametrics.
- Clustering.
- Support vector machines (SVM).
- Classification.
- Partial least squares regression (PLS).
- Online learning.
- Random forests.
- Independent component analysis (ICA).
- Subsampling and the bag of little bootstraps.


Bibliography

ADRAGNI, K. P. & COOK, R. D. (2009). Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367, 4385–4405.

COATES, A., NG, A. Y. & LEE, H. (2011). An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics.

JORDAN, M. I. et al. (2013). On statistics, computation and scalability. Bernoulli 19, 1378–1390.

OLSHAUSEN, B. A. & FIELD, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 37, 3311–3325.
