
Optimization in Data Science

Stephen Wright

University of Wisconsin-Madison

UC-Irvine, October, 2018


Outline

Data Analysis, Machine Learning, Data Science

Context: Data Science

Relating Data Science and Optimization

Formulating 14 specific data science problems as continuous optimization problems


Optimization and Data Science

Optimization is being revolutionized by its interactions with machine learning and data analysis.

New algorithms, and new interest in old algorithms;

Challenging formulations and new paradigms;

Renewed emphasis on certain topics: convex optimization algorithms, complexity, structured nonsmoothness, now nonconvex optimization.

Large research community now working on the machine learning / optimization spectrum. The optimization / ML interface is a key component of many top conferences (ISMP, SIOPT, NIPS, ICML, COLT, ICLR, AISTATS, ...) and journals (Math Programming, SIOPT, ...).


Data Science

Related Terms: AI, Data Analysis, Machine Learning, Statistical Inference, Data Mining.

Extract meaning from data: Understand statistical properties, learn important features and fundamental structures in the data.

Use this knowledge to make predictions about other, similar data.

Highly multidisciplinary area!

Foundations in Statistics;

Computer Science: AI, Machine Learning, Databases, Parallel Systems, Architectures (GPUs);

Optimization provides a toolkit of modeling / formulation and algorithmic techniques.

Modeling and domain-specific knowledge is vital: “80% of data analysis is spent on the process of cleaning and preparing the data.” [Dasu and Johnson, 2003].

(Most academic research deals with the other 20%.)

Typical Setup

After cleaning and formatting, obtain a data set of m objects:

Vectors of features: $a_j$, $j = 1, 2, \ldots, m$.

Outcome / observation / label $y_j$ for each feature vector.

The outcomes $y_j$ could be:

a real number: regression

a label indicating that $a_j$ lies in one of $M$ classes (for $M \ge 2$): classification

no labels ($y_j$ is null):

subspace identification: Locate low-dimensional subspaces that approximately contain the (high-dimensional) vectors $a_j$;

clustering: Partition the $a_j$ into a few clusters.

(Structure may reveal which features in the $a_j$ are important / distinctive, or enable predictions to be made about new vectors $a$.)


Fundamental Data Analysis Task

Seek a function $\phi$ that:

approximately maps $a_j$ to $y_j$ for each $j$: $\phi(a_j) \approx y_j$, $j = 1, 2, \ldots, m$.

if there are no labels $y_j$, or if some labels are missing, seek $\phi$ that does something useful with the data $a_j$, e.g. assigns each $a_j$ to an appropriate cluster or subspace.

satisfies some additional properties — simplicity, structure — that make it “plausible” for the application, robust to perturbations in the data, generalizable to data instances beyond the training set.

Can usually define $\phi$ in terms of some parameter vector $x$ — thus identification of $\phi$ becomes a data-fitting problem: Find the best $x$.

Objective function in this problem often built up of $m$ terms that capture mismatch between predictions and observations for data item $(a_j, y_j)$.

The process of finding $\phi$ is called learning or training.


What’s the use of the mapping φ?

Analysis: $\phi$ — especially the parameter $x$ that defines it — reveals structure in the data. Examples:

Feature selection: reveal the components of vectors $a_j$ that are most important in determining the outputs $y_j$;

Uncovers some hidden structure, e.g.

reveals low-dimensional subspaces that contain the $a_j$;
finds clusters that contain the $a_j$;
defines a decision tree that defines the mapping $a_j \to y_j$.

Prediction: Given new data vectors $a_k$, predict outputs $y_k \leftarrow \phi(a_k)$.


Complications

The data items $(a_j, y_j)$ available for training and testing are viewed as an empirical sample drawn from some underlying reality. Want our conclusions to generalize to the unknown underlying set.

noise or errors in $a_j$ and $y_j$. Would like $\phi$ (and $x$) to be robust to such errors. Regularized formulations are used.

avoid overfitting to the training data. Again, generalization / regularization can be used.

missing data: Vectors $a_j$ may be missing elements (but may still contain useful information).

missing labels: Some or all $y_j$ may be missing or null — semi-supervised or unsupervised learning.

online learning: Data $(a_j, y_j)$ arrives in a stream.


Continuous Optimization and Data Analysis

Optimization is a major source of algorithms for machine learning and data analysis.

Optimization Formulations translate statistical principles (e.g. risk, likelihood, significance, generalizability) into measures and functions that can be solved algorithmically.

Optimization Algorithms provide practical means to solve these problems, but they must be tailored to the structure and context.

Duality is valuable in several cases (e.g. kernel learning).

Nonsmoothness appears often as a regularization device, but often in a highly structured way that can be exploited by algorithms.


ML’s Influence on (Continuous) Optimization

The needs of ML (including summation form and nonsmooth regularization) have caused revival, reexamination, and development of known approaches, particularly first-order and “zero-order” methods.

stochastic gradient

accelerated gradient

coordinate descent

conditional gradient (Frank-Wolfe)

sparse and regularized optimization e.g. forward-backward.

augmented Lagrangian, ADMM

(sampled Newton and quasi-Newton)

This trend continues in nonconvex formulations:

stochastic gradient

steepest descent (+ noise)

trust-region methods

Nonlinear conjugate gradient, L-BFGS, Newton-conjugate gradient.


[Figure: SAS machine-learning algorithm cheat sheet.]

Source: https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/

There’s a lot of continuous optimization here (yellow)!


Microsoft Azure cheat sheet (optimization in yellow)

[Figure: Microsoft Azure Machine Learning Algorithm Cheat Sheet (© 2015 Microsoft Corporation), organized by task: Anomaly Detection, Clustering, Two-Class Classification, Multi-Class Classification, and Regression. Download: http://aka.ms/MLCheatSheet]

Application I: (Linear) Least Squares

$$\min_x \; f(x) := \frac{1}{2}\sum_{j=1}^m (a_j^T x - y_j)^2 = \frac{1}{2}\|Ax - y\|_2^2.$$

[Gauss, 1799], [Legendre, 1805]; see [Stigler, 1981].

Here the function mapping data to output is linear: $\phi(a_j) = a_j^T x$.

$\ell_2$ regularization reduces sensitivity of the solution $x$ to noise in $y$:

$$\min_x \; \frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_2^2.$$

$\ell_1$ regularization yields solutions $x$ with few nonzeros:

$$\min_x \; \frac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1.$$

Feature selection: Nonzero locations in $x$ indicate important components of $a_j$.
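
To make these formulations concrete, here is a small illustrative sketch (not from the talk) using NumPy and scikit-learn on synthetic data; the matrix $A$, labels $y$, and penalty values are placeholders, and scikit-learn's penalty scaling conventions differ slightly from the formulas above.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
m, n = 200, 50
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:5] = rng.standard_normal(5)              # sparse ground truth
y = A @ x_true + 0.01 * rng.standard_normal(m)

# Plain least squares: min_x (1/2)||Ax - y||_2^2
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# l2 (ridge) regularization: min_x (1/2)||Ax - y||_2^2 + lam*||x||_2^2
x_ridge = Ridge(alpha=0.1, fit_intercept=False).fit(A, y).coef_

# l1 (lasso) regularization: min_x (1/2)||Ax - y||_2^2 + lam*||x||_1
x_lasso = Lasso(alpha=0.01, fit_intercept=False).fit(A, y).coef_

print("nonzeros: least squares", np.count_nonzero(x_ls),
      "| lasso", np.count_nonzero(x_lasso))
```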


Application II: Robust Linear Regression

Least squares is motivated statistically by the assumption of Gaussian errors in observations $y_j$. What if the errors are distributed otherwise, or contain “outliers”?

Use statistics to write down a likelihood function for $x$ given $y$, then find the maximum likelihood estimate — optimization! General form is

$$\min_x \; \frac{1}{m}\sum_{j=1}^m \ell(a_j^T x - y_j) + \lambda R(x),$$

where $\ell$ is the loss function and $R$ is the regularizer.

$\ell$ and $R$ could be convex or nonconvex.

Tukey biweight: $\ell(\theta) = \theta^2/(1 + \theta^2)$. Nonconvex: outliers don’t affect solution much.

Nonconvex separable regularizers $R$ such as SCAD and MCP behave like $\|\cdot\|_1$ at zero, but flatten out for larger $x$.
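
For illustration only, a sketch of the robust formulation with the bounded loss $\ell(\theta) = \theta^2/(1+\theta^2)$ and a plain $\ell_2$ regularizer in place of SCAD/MCP, fit on synthetic data with a few gross outliers; the weight lam and the choice of L-BFGS are assumptions, not part of the talk.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
m, n = 200, 10
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true + 0.05 * rng.standard_normal(m)
y[:10] += 20.0                        # gross outliers in a few observations

lam = 1e-3                            # placeholder regularization weight

def objective(x):
    r = A @ x - y
    loss = r**2 / (1.0 + r**2)        # nonconvex, bounded loss from the slide
    return loss.mean() + lam * np.sum(x**2)

x0 = np.linalg.lstsq(A, y, rcond=None)[0]   # least-squares warm start
res = minimize(objective, x0, method="L-BFGS-B")
print("robust fit error:", np.linalg.norm(res.x - x_true))
```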


Application III: Matrix Completion

Regression over a structured matrix: Observe a matrix $X$ by probing it with linear operators $A_j(X)$, giving observations $y_j$, $j = 1, 2, \ldots, m$. Solve a regression problem:

$$\min_X \; \frac{1}{2m}\sum_{j=1}^m (A_j(X) - y_j)^2 = \frac{1}{2m}\|A(X) - y\|_2^2.$$

Each $A_j$ may observe a single element of $X$, or a linear combination of elements. Can be represented as a matrix $A_j$, so that $A_j(X) = \langle A_j, X\rangle$.

Seek the “simplest” $X$ that satisfies the observations. Nuclear-norm (sum-of-singular-values) regularization term induces low rank on $X$:

$$\min_X \; \frac{1}{2m}\|A(X) - y\|_2^2 + \lambda\|X\|_*, \quad \text{for some } \lambda > 0.$$

[Recht et al., 2010]
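
A minimal sketch of one way to solve this problem by proximal gradient, where the prox of the nuclear norm soft-thresholds the singular values. Assumptions: each $A_j$ observes a single entry, the $1/(2m)$ scaling is dropped, and the penalty and step size are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, r = 60, 50, 3
X_true = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
mask = rng.random((n1, n2)) < 0.4           # observed entries (each A_j picks one entry)
Y = mask * X_true

def svd_soft_threshold(Z, tau):
    """Prox of tau*||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

lam, step = 0.5, 1.0                        # placeholder penalty and step size
X = np.zeros((n1, n2))
for _ in range(300):
    grad = mask * (X - Y)                   # gradient of (1/2)||P_Omega(X) - Y||_F^2
    X = svd_soft_threshold(X - step * grad, step * lam)

err = np.linalg.norm(X - X_true) / np.linalg.norm(X_true)
print("relative recovery error:", round(err, 3))
```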


Explicit Low-Rank Parametrization

Compact, nonconvex formulation is obtained by parametrizing $X$ directly:

$$X = LR^T, \quad \text{where } L \in \mathbb{R}^{m\times r},\ R \in \mathbb{R}^{n\times r},$$

where $r$ is the known (or suspected) rank.

$$\min_{L,R} \; \frac{1}{2m}\sum_{j=1}^m (A_j(LR^T) - y_j)^2.$$

(No need for regularizer — rank is hard-wired into the formulation.)

Despite the nonconvexity, near-global minima can be found when $A_j$ are incoherent. Use appropriate initialization [Candes et al., 2014], [Zheng and Lafferty, 2015] or the observation that all local minima are near-global [Bhojanapalli et al., 2016].
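
A sketch under simplifying assumptions (entrywise observations, alternating least squares over $L$ and $R$ rather than the gradient schemes of the cited papers, and plain random initialization):

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, r = 60, 50, 3
X_true = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
mask = rng.random((n1, n2)) < 0.4           # each observation A_j reads one entry
Y = np.where(mask, X_true, 0.0)

L = rng.standard_normal((n1, r))
R = rng.standard_normal((n2, r))

def update_factor(F_other, M, obs):
    """Row-wise least squares: for each row i, fit F[i] to the observed entries of row i."""
    F_new = np.zeros((M.shape[0], F_other.shape[1]))
    for i in range(M.shape[0]):
        idx = np.where(obs[i])[0]
        if idx.size:
            F_new[i], *_ = np.linalg.lstsq(F_other[idx], M[i, idx], rcond=None)
    return F_new

for _ in range(30):                          # alternating minimization over L and R
    L = update_factor(R, Y, mask)            # fix R, solve for L row by row
    R = update_factor(L, Y.T, mask.T)        # fix L, solve for R row by row

err = np.linalg.norm(L @ R.T - X_true) / np.linalg.norm(X_true)
print("relative error:", round(err, 3))
```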


Application IV: Nonnegative Matrix Factorization

Given an $m \times n$ matrix $Y$, seek factors $L$ ($m \times r$) and $R$ ($n \times r$) that are element-wise nonnegative, such that $LR^T \approx Y$.

$$\min_{L,R} \; \frac{1}{2}\|LR^T - Y\|_F^2 \quad \text{subject to } L \ge 0,\ R \ge 0.$$

Applications in computer vision, document clustering, chemometrics, . . .

Could combine with matrix completion, when not all elements of $Y$ are known, if it makes sense in the application to have nonnegative factors.

If the nonnegativity constraint were not present, could solve this in closed form with an SVD, since $Y$ is observed completely.
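
A quick sketch using scikit-learn's NMF solver, which minimizes the same Frobenius objective with its own default algorithm and tolerances; the sizes, rank, and data below are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(4)
m, n, r = 80, 60, 5
Y = rng.random((m, r)) @ rng.random((r, n))   # exactly factorable nonnegative matrix

# scikit-learn minimizes (1/2)||Y - L R^T||_F^2 over nonnegative factors
model = NMF(n_components=r, init="random", random_state=0, max_iter=1000)
L = model.fit_transform(Y)                     # m x r, nonnegative
R = model.components_.T                        # n x r, nonnegative

print("relative fit error:", np.linalg.norm(L @ R.T - Y) / np.linalg.norm(Y))
```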


Application V: Sparse Inverse Covariance

Let $Z \in \mathbb{R}^p$ be a (vector) random variable with zero mean. Let $z_1, z_2, \ldots, z_N$ be samples of $Z$. Sample covariance matrix (estimates covariance between components of $Z$):

$$S := \frac{1}{N-1}\sum_{\ell=1}^N z_\ell z_\ell^T.$$

Seek a sparse inverse covariance matrix: $X \approx S^{-1}$.

$X$ reveals dependencies between components of $Z$: $X_{ij} = 0$ if the $i$ and $j$ components of $Z$ are conditionally independent.

Do nodes $i$ and $j$ influence each other directly, or only indirectly via other nodes?

Obtain X from the regularized formulation:

$$\min_X \; \langle S, X\rangle - \log\det(X) + \lambda\|X\|_1, \quad \text{where } \|X\|_1 = \sum_{i,j}|X_{ij}|.$$

[d’Aspremont et al., 2008, Friedman et al., 2008]
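
A brief sketch using scikit-learn's GraphicalLasso, which solves the $\ell_1$-penalized log-determinant problem above (its alpha parameter plays the role of $\lambda$); the chain-structured precision matrix and sample size are invented for illustration.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(5)
p, N = 6, 2000
# Build a sparse precision (inverse covariance) matrix: a chain-like dependency graph
precision = np.eye(p)
for i in range(p - 1):
    precision[i, i + 1] = precision[i + 1, i] = 0.4
cov = np.linalg.inv(precision)
Z = rng.multivariate_normal(np.zeros(p), cov, size=N)

# GraphicalLasso solves the l1-penalized log-det problem from the slide
model = GraphicalLasso(alpha=0.05).fit(Z)
X = model.precision_

print(np.round(X, 2))        # near-zero entries correspond to conditional independence
```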


Reveals Network Structure. Example with $p = 6$.

[Figure: graph on six nodes whose edges correspond to the nonzero off-diagonal entries of $X$.]

$$X = \begin{pmatrix} * & * & 0 & 0 & 0 & 0\\ * & * & * & * & 0 & 0\\ 0 & * & * & 0 & * & 0\\ 0 & * & 0 & * & * & 0\\ 0 & 0 & * & * & * & *\\ 0 & 0 & 0 & 0 & * & * \end{pmatrix}$$


Application VI: Sparse Principal Components (PCA)

Seek sparse approximations to the leading eigenvectors of the sample covariance matrix $S$.

For the leading sparse principal component, solve

$$\max_{v\in\mathbb{R}^n} \; v^T S v = \langle S, vv^T\rangle \quad \text{s.t. } \|v\|_2 = 1,\ \|v\|_0 \le k,$$

for some given $k \in \{1, 2, \ldots, n\}$. Convex relaxation replaces $vv^T$ by an $n \times n$ positive semidefinite proxy $M$:

$$\max_{M\in\mathcal{SR}^{n\times n}} \; \langle S, M\rangle \quad \text{s.t. } M \succeq 0,\ \langle I, M\rangle = 1,\ \|M\|_1 \le R,$$

where $\|\cdot\|_1$ is the sum of absolute values [d’Aspremont et al., 2007].

Adjust the parameter $R$ to obtain desired sparsity.

Could also get a nonconvex relaxation by replacing $\|v\|_0$ by $\|v\|_1$:

$$\max_{v\in\mathbb{R}^n} \; v^T S v \quad \text{s.t. } \|v\|_2 = 1,\ \|v\|_1 \le R.$$
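
One simple heuristic for the $\ell_0$-constrained problem, not discussed on the slide, is a truncated power iteration: multiply by $S$, keep the $k$ largest-magnitude entries, renormalize. A sketch on a synthetic spiked covariance:

```python
import numpy as np

rng = np.random.default_rng(6)
n, N, k = 20, 500, 5
# Planted sparse leading component: spiked covariance model
v_star = np.zeros(n); v_star[:k] = 1.0 / np.sqrt(k)
cov = np.eye(n) + 10.0 * np.outer(v_star, v_star)
Z = rng.multivariate_normal(np.zeros(n), cov, size=N)
S = Z.T @ Z / (N - 1)                       # sample covariance

def truncated_power_iteration(S, k, iters=200):
    """Heuristic for max v^T S v s.t. ||v||_2 = 1, ||v||_0 <= k."""
    v = rng.standard_normal(S.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = S @ v
        keep = np.argsort(np.abs(w))[-k:]   # keep the k largest-magnitude coordinates
        v = np.zeros_like(w)
        v[keep] = w[keep]
        v /= np.linalg.norm(v)
    return v

v = truncated_power_iteration(S, k)
print("support found:", np.nonzero(v)[0])
```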


Application VII: Sparse + Low-Rank

Given $Y \in \mathbb{R}^{m\times n}$, seek low-rank $M$ and sparse $S$ such that $M + S \approx Y$.

Applications:

Robust PCA: Sparse $S$ represents “outlier” observations.

Foreground-Background separation in video processing.

Each column of $Y$ is one frame of video, each row is a single pixel evolving in time.

Low-rank part $M$ represents background, sparse part $S$ represents foreground.

Convex formulation:

$$\min_{M,S} \; \|M\|_* + \lambda\|S\|_1 \quad \text{s.t. } Y = M + S.$$

[Candes et al., 2011, Chandrasekaran et al., 2011]
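
An illustrative sketch that solves a penalized variant, $\min_{M,S} \frac{1}{2}\|M+S-Y\|_F^2 + \lambda_M\|M\|_* + \lambda_S\|S\|_1$, rather than the equality-constrained problem above, by alternating exact minimization over $M$ and $S$; the penalty weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(7)
n1, n2, r = 60, 60, 2
M_true = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
S_true = np.zeros((n1, n2))
idx = rng.random((n1, n2)) < 0.05
S_true[idx] = 10.0 * rng.standard_normal(idx.sum())   # sparse, large corruptions
Y = M_true + S_true

def svd_soft_threshold(Z, tau):
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

lam_M, lam_S = 1.0, 0.2                     # placeholder penalty weights
M = np.zeros((n1, n2))
S = np.zeros((n1, n2))
for _ in range(100):
    M = svd_soft_threshold(Y - S, lam_M)    # exact minimization over M with S fixed
    S = np.sign(Y - M) * np.maximum(np.abs(Y - M) - lam_S, 0.0)  # prox of lam_S*||S||_1

print("background error:",
      round(np.linalg.norm(M - M_true) / np.linalg.norm(M_true), 3))
```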


Sparse + Low-Rank: Compact Formulation

Compact formulation: Variables $L \in \mathbb{R}^{m\times r}$, $R \in \mathbb{R}^{n\times r}$, $S \in \mathbb{R}^{m\times n}$ sparse.

$$\min_{L,R,S} \; \frac{1}{2}\|LR^T + S - Y\|_F^2 + \lambda\|S\|_1 \quad \text{(fully observed)}$$

$$\min_{L,R,S} \; \frac{1}{2}\|P_\Phi(LR^T + S - Y)\|_F^2 + \lambda\|S\|_1 \quad \text{(partially observed)},$$

where $\Phi$ represents the locations of the observed entries. [Chen and Wainwright, 2015, Yi et al., 2016]


Application VIII: Subspace Identification

Given vectors $a_j \in \mathbb{R}^n$ with missing entries, find a subspace of $\mathbb{R}^n$ such that all “completed” vectors $a_j$ lie approximately in this subspace.

If $\Omega_j \subset \{1, 2, \ldots, n\}$ is the set of observed elements in $a_j$, seek $X \in \mathbb{R}^{n\times d}$ such that
$$[a_j - X s_j]_{\Omega_j} \approx 0,$$
for some $s_j \in \mathbb{R}^d$ and all $j = 1, 2, \ldots$ [Balzano et al., 2010, Balzano and Wright, 2014].

Application: Structure from motion. Reconstruct opaque object from planar projections of surface reference points.


Application IX: Linear Support Vector Machines

Each item of data belongs to one of two classes: $y_j = +1$ and $y_j = -1$.

Seek $(x, \beta)$ such that

$a_j^T x - \beta \ge 1$ when $y_j = +1$;

$a_j^T x - \beta \le -1$ when $y_j = -1$.

The mapping is $\phi(a_j) = \operatorname{sign}(a_j^T x - \beta)$.

Design an objective so that the $j$th loss term is zero when $\phi(a_j) = y_j$, positive otherwise. A popular one is hinge loss:

$$H(x, \beta) = \frac{1}{m}\sum_{j=1}^m \max(1 - y_j(a_j^T x - \beta), 0).$$

Add a regularization term $(\lambda/2)\|x\|_2^2$ for some $\lambda > 0$ to maximize the margin between the classes.
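
A minimal sketch (synthetic separable data, placeholder step size and $\lambda$) of subgradient descent on the regularized hinge loss:

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 200, 2
A = rng.standard_normal((m, n))
y = np.where(A[:, 0] + A[:, 1] > 0, 1.0, -1.0)   # linearly separable labels

lam, step, iters = 0.01, 0.1, 2000               # placeholder hyperparameters
x = np.zeros(n)
beta = 0.0
for _ in range(iters):
    margins = y * (A @ x - beta)
    active = margins < 1.0                        # points with nonzero hinge loss
    # Subgradient of (1/m) sum max(1 - y_j(a_j^T x - beta), 0) + (lam/2)||x||^2
    gx = -(y[active][:, None] * A[active]).sum(axis=0) / m + lam * x
    gb = y[active].sum() / m
    x, beta = x - step * gx, beta - step * gb

print("training accuracy:", np.mean(np.sign(A @ x - beta) == y))
```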


Regularize for Generalizability

[Sequence of figure-only slides.]

Application X: Nonlinear SVM

Data $a_j$, $j = 1, 2, \ldots, m$ may not be separable neatly into two classes $y_j = +1$ and $y_j = -1$. Apply a nonlinear transformation $a_j \to \psi(a_j)$ (“lifting”) to make separation more effective. Seek $(x, \beta)$ such that

$\psi(a_j)^T x - \beta \ge 1$ when $y_j = +1$;

$\psi(a_j)^T x - \beta \le -1$ when $y_j = -1$.

Leads to the formulation:

$$\min_x \; \frac{1}{m}\sum_{j=1}^m \max(1 - y_j(\psi(a_j)^T x - \beta), 0) + \frac{\lambda}{2}\|x\|_2^2.$$

Can avoid defining ψ explicitly by using instead the dual of this QP.


Nonlinear SVM: Dual

Dual is a quadratic program in m variables, with simple constraints:

$$\min_{\alpha\in\mathbb{R}^m} \; \frac{1}{2}\alpha^T Q\alpha - e^T\alpha \quad \text{s.t. } 0 \le \alpha \le (1/\lambda)e,\ y^T\alpha = 0,$$

where $Q_{k\ell} = y_k y_\ell\, \psi(a_k)^T\psi(a_\ell)$, $y = (y_1, y_2, \ldots, y_m)^T$, $e = (1, 1, \ldots, 1)^T$.

No need to choose $\psi(\cdot)$ explicitly. Instead choose a kernel $K$, such that

$$K(a_k, a_\ell) \sim \psi(a_k)^T\psi(a_\ell).$$

[Boser et al., 1992, Cortes and Vapnik, 1995]. “Kernel trick.”

Gaussian kernels are popular:

$$K(a_k, a_\ell) = \exp(-\|a_k - a_\ell\|^2/(2\sigma)), \quad \text{for some } \sigma > 0.$$
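
For illustration, scikit-learn's SVC solves this dual QP; kernel="rbf" is the Gaussian kernel above with gamma playing the role of $1/(2\sigma)$, and C corresponding (up to scaling) to the bound $1/\lambda$. The circularly separable data below are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(9)
m = 400
A = rng.standard_normal((m, 2))
y = np.where(np.linalg.norm(A, axis=1) < 1.0, 1, -1)   # not linearly separable

# SVC solves the dual QP; the Gaussian kernel handles the "lifting" implicitly
clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(A, y)

print("training accuracy:", clf.score(A, y))
print("number of support vectors:", clf.support_.size)
```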


Nonlinear SVM

[Figure-only slide.]

Application XI: Logistic Regression

Binary logistic regression is similar to binary SVM, except that we seek a function $p$ that gives odds of data vector $a$ being in class $1$ or class $-1$, rather than making a simple prediction.

Seek odds function $p$ parametrized by $x \in \mathbb{R}^n$:

$$p(a; x) := (1 + e^{a^T x})^{-1}.$$

Choose $x$ so that $p(a_j; x) \approx 1$ when $y_j = 1$ and $p(a_j; x) \approx 0$ when $y_j = -1$.

Choose x to minimize a negative log likelihood function:

$$L(x) = -\frac{1}{m}\left[\sum_{y_j=-1}\log(1 - p(a_j; x)) + \sum_{y_j=1}\log p(a_j; x)\right]$$

Sparse solutions $x$ are interesting because they indicate which components of $a_j$ are critical to classification. Can solve: $\min_z L(z) + \lambda\|z\|_1$.
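
A short sketch using scikit-learn's $\ell_1$-penalized logistic regression; its C parameter is an inverse penalty strength, roughly $1/\lambda$, and the data and the value C=0.5 are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(10)
m, n = 300, 20
A = rng.standard_normal((m, n))
x_true = np.zeros(n); x_true[:3] = [2.0, -2.0, 1.5]       # only 3 relevant features
y = np.where(A @ x_true + 0.1 * rng.standard_normal(m) > 0, 1, -1)

# l1-penalized logistic regression: min_x L(x) + lam*||x||_1
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5, fit_intercept=False)
clf.fit(A, y)

print("nonzero coefficients:", np.nonzero(clf.coef_.ravel())[0])
```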


Multiclass Logistic Regression

Have $M$ classes instead of just 2. $M$ can be large, e.g. identify phonemes in speech, identify line outages in a power grid.

Labels $y_{j\ell} = 1$ if data point $j$ is in class $\ell$; $y_{j\ell} = 0$ otherwise; $\ell = 1, \ldots, M$.

Find subvectors $x_{[\ell]}$, $\ell = 1, 2, \ldots, M$ such that if $a_j$ is in class $k$ we have

$$a_j^T x_{[k]} \gg a_j^T x_{[\ell]} \quad \text{for all } \ell \ne k.$$

Find $x_{[\ell]}$, $\ell = 1, 2, \ldots, M$ by minimizing a negative log-likelihood function:

$$f(x) = -\frac{1}{m}\sum_{j=1}^m\left[\sum_{\ell=1}^M y_{j\ell}\,(a_j^T x_{[\ell]}) - \log\left(\sum_{\ell=1}^M \exp(a_j^T x_{[\ell]})\right)\right]$$

Can use group LASSO regularization terms to select important features from the vectors $a_j$, by imposing a common sparsity pattern on all $x_{[\ell]}$.
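
A direct NumPy transcription of $f(x)$, evaluated on made-up data; the numerically stable log-sum-exp is an implementation detail, not something from the slide.

```python
import numpy as np

def multiclass_nll(X, A, Y):
    """f(x) from the slide: negative log-likelihood of multiclass logistic regression.

    A: m x n feature matrix, Y: m x M one-hot labels, X: n x M stacked subvectors x_[l].
    """
    scores = A @ X                                   # m x M, entries a_j^T x_[l]
    shift = scores.max(axis=1, keepdims=True)        # for numerical stability
    logsumexp = np.log(np.exp(scores - shift).sum(axis=1)) + shift.ravel()
    return -np.mean((Y * scores).sum(axis=1) - logsumexp)

# Tiny synthetic check
rng = np.random.default_rng(11)
m, n, M = 100, 5, 3
A = rng.standard_normal((m, n))
labels = rng.integers(0, M, size=m)
Y = np.eye(M)[labels]                                # one-hot encoding
X = rng.standard_normal((n, M))
print("f(x) =", multiclass_nll(X, A, Y))
```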


Application XII: Atomic-Norm Regularization

Seek an approx minimizer of $f(x)$ such that $x$ combines a small number of fundamental elements — atoms. Define the atomic norm of $x$ via its “minimal” representation in terms of the set $\mathcal{A}$ of atoms (possibly infinite):

$$\|x\|_{\mathcal{A}} := \inf\left\{\sum_{a\in\mathcal{A}} c_a : x = \sum_{a\in\mathcal{A}} c_a\, a,\ c_a \ge 0\right\}.$$

Compressed Sensing: $x \in \mathbb{R}^n$: Atoms are $\pm e_i$, where $e_i = (0, 0, \ldots, 0, 1, 0, \ldots, 0)^T$; then $\|x\|_{\mathcal{A}} = \|x\|_1$.

Low-rank Matrix Problems: $x \in \mathbb{R}^{m\times n}$: Atoms are the rank-one matrices (infinite); then $\|\cdot\|_{\mathcal{A}}$ is the nuclear norm.

Signal processing with few frequencies: $x(t) = \sum_{j=1}^k c_j \exp(2\pi i f_j t)$: signal with $k$ frequencies. Atoms are $a_f := \exp(2\pi i f t)$ for any $f$.

Image processing: Atoms are subtrees of wavelet coefficients.

Can solve with Frank-Wolfe, gradient projection [Rao et al., 2015].
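
A sketch of Frank-Wolfe for the compressed-sensing case, where the atoms are $\pm e_i$ and the feasible set is the $\ell_1$ ball of radius tau; the data, radius, and iteration count are placeholders.

```python
import numpy as np

rng = np.random.default_rng(12)
m, n, tau = 100, 200, 4.0                     # tau: radius of the atomic-norm (l1) ball
A = rng.standard_normal((m, n))
x_true = np.zeros(n); x_true[[3, 17, 42]] = [2.0, -1.0, 1.0]
y = A @ x_true

x = np.zeros(n)
for t in range(500):
    grad = A.T @ (A @ x - y)                  # gradient of (1/2)||Ax - y||^2
    i = np.argmax(np.abs(grad))               # linear minimization over atoms +-tau*e_i
    s = np.zeros(n); s[i] = -tau * np.sign(grad[i])
    gamma = 2.0 / (t + 2.0)                   # standard Frank-Wolfe step size
    x = (1 - gamma) * x + gamma * s

print("recovered support:", np.nonzero(np.abs(x) > 0.1)[0])
```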


Application XIII: Community Detection in Graphs

Given an undirected graph, find “communities” (subsets of nodes) such that nodes inside a given community are more likely to be connected to each other than to nodes outside that community.

Probability $p \in (0, 1)$ of being connected to a node within your community, and $q \in (0, p)$ of being connected to a node outside your community.


Community Detection: Formulation

Given adjacency matrix $A$ such that

$$A_{ij} = \begin{cases} 1 & \text{if nodes } i \text{ and } j \text{ are connected}\\ 0 & \text{if nodes } i \text{ and } j \text{ are not connected,}\end{cases}$$

together with probabilities $p$ and $q$ (both assumed known), find a partition matrix $X$ that defines the communities:

$$X_{ij} = \begin{cases} 1 & \text{if nodes } i \text{ and } j \text{ are in same community}\\ 0 & \text{if nodes } i \text{ and } j \text{ are not in same community.}\end{cases}$$

Likelihood function is

$$p(A_{ij} \mid X_{ij}, p, q) = \begin{cases} p & \text{if } A_{ij} = 1 \text{ and } X_{ij} = 1\\ 1-p & \text{if } A_{ij} = 0 \text{ and } X_{ij} = 1\\ q & \text{if } A_{ij} = 1 \text{ and } X_{ij} = 0\\ 1-q & \text{if } A_{ij} = 0 \text{ and } X_{ij} = 0.\end{cases}$$


Community Detection: Formulation

The max-log-likelihood problem has the form

$$\max_{\text{partition matrix } X} \; \langle X, A - \lambda J_n\rangle,$$

where $J_n$ is the $n \times n$ matrix of all ones and

$$\lambda = \frac{\log(1-q) - \log(1-p)}{\log p - \log q + \log(1-q) - \log(1-p)}.$$

[Li et al., 2018] This is not tractable, so seek relaxations.


Community Detection: Relaxations

Convex relaxation:

$$\max_X \; \langle X, A - \lambda J_n\rangle \quad \text{s.t. } X \succeq 0,\ X \ge 0,\ X_{ii} = 1,\ i = 1, 2, \ldots, n.$$

When there are just two communities of similar size, then under some (strong) conditions on $p$, $q$, this relaxation will recover the correct $X$ [Li et al., 2018, Theorem 2.1].

Nonconvex relaxation: Write $X = \frac{1}{2}(xx^T + J_n)$, where ideally $x \in \mathbb{R}^n$ has

$$x_i = \begin{cases} 1 & \text{if } i \text{ is in community 1}\\ -1 & \text{if } i \text{ is in community 2.}\end{cases}$$

Then write
$$\max_x \; \langle xx^T + J_n, A - \lambda J_n\rangle \quad \text{s.t. } \|x\|_2 \le 1,$$
and replace $x_i \leftarrow \operatorname{sign}(x_i)$ for all $i$ to get a prediction.
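
An illustrative sketch of the nonconvex relaxation: since the $J_n$ term is constant in $x$, the maximizer over $\|x\|_2 \le 1$ is the leading eigenvector of $A - \lambda J_n$ (when its top eigenvalue is positive), which is then rounded by sign. The block-model parameters below are made up.

```python
import numpy as np

rng = np.random.default_rng(13)
n, p, q = 100, 0.7, 0.2                       # two equal communities, p > q
labels = np.array([1] * (n // 2) + [-1] * (n // 2))

# Sample a symmetric adjacency matrix from the two-community block model
same = np.equal.outer(labels, labels)
probs = np.where(same, p, q)
upper = np.triu(rng.random((n, n)) < probs, k=1)
A = (upper | upper.T).astype(float)

lam = (np.log(1 - q) - np.log(1 - p)) / (
    np.log(p) - np.log(q) + np.log(1 - q) - np.log(1 - p))

# The <xx^T, A - lam*J_n> term is maximized (over ||x||_2 <= 1) by the
# leading eigenvector of A - lam*J_n; the constant J_n term is dropped.
M = A - lam * np.ones((n, n))
eigvals, eigvecs = np.linalg.eigh(M)
x = eigvecs[:, -1]                            # eigenvector of the largest eigenvalue
pred = np.sign(x)

agreement = max(np.mean(pred == labels), np.mean(pred == -labels))
print("fraction of nodes correctly assigned:", agreement)
```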


Application XIV: Deep Learning

[Figure: feed-forward network with input nodes, hidden layers, and output nodes.]

Inputs are the vectors $a_j$; outputs are a prediction about $a_j$ belonging to each class (e.g. multiclass logistic regression).

At each layer, inputs are converted to outputs by a linear transformation composed with an element-wise function $\sigma$:

$$a^{\ell+1} = \sigma(W^\ell a^\ell + g^\ell),$$

where $a^\ell$ is the vector of node values at layer $\ell$, and $(W^\ell, g^\ell)$ are parameters in the network. Nowadays, the ReLU function $\sigma(t) = \max(t, 0)$ is popular.
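
A tiny NumPy sketch of this layer recursion with made-up dimensions; real networks would usually skip the ReLU on the final layer and feed its output into a loss such as multiclass logistic regression.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def forward(a0, weights, biases):
    """Compute a^{l+1} = sigma(W^l a^l + g^l) layer by layer."""
    a = a0
    for W, g in zip(weights, biases):
        a = relu(W @ a + g)
    return a

# Tiny network with made-up dimensions: 4 inputs -> 8 hidden -> 8 hidden -> 3 outputs
rng = np.random.default_rng(14)
dims = [4, 8, 8, 3]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]

a0 = rng.standard_normal(4)                   # one input vector a_j
print("network output:", forward(a0, weights, biases))
```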


Training Deep Learning Networks

The network contains many parameters — $(W^\ell, g^\ell)$, $\ell = 1, 2, \ldots, L$ — that are selected by training on the data $(a_j, y_j)$, $j = 1, 2, \ldots, m$. Objective has the form:

$$\frac{1}{m}\sum_{j=1}^m h(x; a_j, y_j)$$

where $x = (W^1, g^1, W^2, g^2, \ldots)$ are the parameters in the model and $h$ measures the mismatch between observed output $y_j$ and the outputs produced by the model.

Nonlinear, Nonconvex, Nonsmooth.

Many software packages available for training: Caffe, PyTorch, TensorFlow, Theano,... Many run on GPUs.


Overparametrization in Neural Networks

Number of parameters (elements in $x$) is often vastly greater than the number of data points — sometimes by 1–2 orders of magnitude — yet recent experience shows that “overfitting” is not necessarily a problem!

Training such networks can often achieve “zero loss,” that is, all items in the training data set are correctly classified. Two big questions arise.

1. Isn’t this overfitting? Yet such models often generalize well, flouting conventional wisdom.

2. Why is Stochastic Gradient (SGD) reliably finding the global minimum of a nonsmooth, nonlinear, nonconvex problem?

We have only started to get some intuition on these issues. See for example [Li and Liang, 2018].


Overparametrization

In an overparametrized network, we suspect that:

There are many solutions;

A randomly chosen initial point for the weights will be close to one of the solutions;

You have to change few (if any) ReLU activations to get from the initial point to the solution;

Gradient descent (or stochastic gradient descent with big enough batches) will get us from the initial point to the solution efficiently.

This remains an active area of investigation, with important consequences for the understanding of why neural networks are so effective in some applications.


Adversarial Machine Learning

Deep learning classifiers can be fooled with a carefully chosen attack.

(Szegedy et al., Dec 2013): Train on digits in the MNIST set, apply a carefully chosen perturbation. The NN misclassifies, even though it’s visually obvious what the digit should be.

Note that a random perturbation is OK, even when large. But a small, carefully crafted perturbation causes misclassification.


Adversarial ML: The Issues

1. Can we generate efficiently the “carefully chosen perturbations” that break the classifier?

2. Can we train the network to be robust to perturbations of a certain size?

3. Can we verify that a given network will continue to give the same classification when we perturb a given training example $x$ by any perturbation of a given size $\epsilon > 0$?

For 1, various optimization formulations have been proposed, depending on the type of classification done. Usually constrained nonlinear.

To implement 2, we can use robust optimization techniques (but these are expensive) or selectively generate perturbed data examples and re-train.

Mixed-integer programming formulations have been devised for 3. Very expensive even for small networks and data sets (e.g. MNIST, CIFAR).


Training for Robustness

Instead of incurring a loss $h(x; a_j, y_j)$ for parameters $x$ and data item $(a_j, y_j)$ as above, define the loss to be the worst possible loss for all $a$ within a ball of radius $\epsilon$ centered at $a_j$. That is,

$$\max_{v_j : \|v_j\| \le \epsilon} h(x; a_j + v_j, y_j).$$

Thus, the training problem becomes the following min-max problem:

$$\min_x \; \frac{1}{m}\sum_{j=1}^m \max_{v_j : \|v_j\| \le \epsilon} h(x; a_j + v_j, y_j).$$

The inner “max” problems can at least be solved in parallel and sometimes in closed form. We can also often generate a generalized gradient w.r.t. $x$, and so implement a first-order method for the outer loop.

But this is expensive in general!
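
A rough PyTorch sketch of this min-max training loop, in which the inner max over an $\ell_\infty$ ball is approximated by a single gradient-sign step (a common heuristic, not the closed-form solution alluded to above); the model, data, $\epsilon$, and step sizes are all placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, n, eps = 256, 20, 0.1
A = torch.randn(m, n)
y = (A[:, 0] > 0).long()                       # synthetic binary labels

model = nn.Sequential(nn.Linear(n, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(50):
    # Approximate the inner max over ||v_j||_inf <= eps with one gradient-sign step
    a = A.clone().requires_grad_(True)
    loss = loss_fn(model(a), y)
    loss.backward()
    v = eps * a.grad.sign()
    a_adv = (A + v).detach()                   # approximately worst-case perturbed inputs

    # Outer minimization: one SGD step on the robust loss
    opt.zero_grad()
    robust_loss = loss_fn(model(a_adv), y)
    robust_loss.backward()
    opt.step()

print("final robust training loss:", float(robust_loss))
```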


Summary

Continuous optimization provides powerful frameworks for formulating and solving problems in data analysis and machine learning.

I haven’t talked about the contributions of discrete optimization, which are being promoted more and more.

BUT it’s usually not enough to just formulate these problems and use off-the-shelf optimization technology to solve them. The algorithms need to be customized to the problem structure (in particular, the large amount of data) and the context.

Research in this area has exploded over the past decade and is still going strong, with a great many unanswered questions. (Many of them in deep learning.)


References I

Balzano, L., Nowak, R., and Recht, B. (2010). Online identification and tracking of subspaces from highly incomplete information. In 48th Annual Allerton Conference on Communication, Control, and Computing, pages 704–711. http://arxiv.org/abs/1006.4046.

Balzano, L. and Wright, S. J. (2014). Local convergence of an algorithm for subspace identification from partial data. Foundations of Computational Mathematics, 14:1–36. DOI: 10.1007/s10208-014-9227-7.

Bhojanapalli, S., Neyshabur, B., and Srebro, N. (2016). Global optimality of local search for low-rank matrix recovery. Technical Report arXiv:1605.07221, Toyota Technological Institute.

Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152.

Candes, E., Li, X., and Soltanolkotabi, M. (2014). Phase retrieval via a Wirtinger flow. Technical Report arXiv:1407.1065, Stanford University.


References II

Candes, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust principal component analysis? Journal of the ACM, 58(3):11.

Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. S. (2011). Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596.

Chen, Y. and Wainwright, M. J. (2015). Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. Technical Report arXiv:1509.03025, University of California-Berkeley.

Cortes, C. and Vapnik, V. N. (1995). Support-vector networks. Machine Learning, 20:273–297.

d’Aspremont, A., Banerjee, O., and El Ghaoui, L. (2008). First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and Applications, 30:56–66.

d’Aspremont, A., El Ghaoui, L., Jordan, M. I., and Lanckriet, G. (2007). A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448.


References III

Dasu, T. and Johnson, T. (2003). Exploratory Data Mining and Data Cleaning. John Wiley & Sons.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441.

Li, X., Chen, Y., and Xu, J. (2018). Convex relaxation methods for community detection. Technical Report arXiv:1810.00315v1, University of California-Davis.

Li, Y. and Liang, Y. (2018). Learning overparametrized neural networks via stochastic gradient descent on structured data. Technical Report arXiv:1808.01204, University of Wisconsin-Madison.

Rao, N., Shah, P., and Wright, S. J. (2015). Forward-backward greedy algorithms for atomic-norm regularization. IEEE Transactions on Signal Processing, 63:5798–5811. http://arxiv.org/abs/1404.5692.


References IV

Recht, B., Fazel, M., and Parrilo, P. (2010). Guaranteed minimum-rank solutions to linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501.

Stigler, S. M. (1981). Gauss and the invention of least squares. Annals of Statistics, 9(3):465–474.

Yi, X., Park, D., Chen, Y., and Caramanis, C. (2016). Fast algorithms for robust PCA via gradient descent. Technical Report arXiv:1605.07784, University of Texas-Austin.

Zheng, Q. and Lafferty, J. (2015). A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. Technical Report arXiv:1506.06081, Statistics Department, University of Chicago.
