
Data Science - Convex optimization and application

Summary. We begin with some illustrations of challenging topics in modern data science. Then, this session introduces (or recalls) some basics of optimization and illustrates some key applications in supervised classification.

1 Data Science

1.1 What is data science?

Extracting knowledge from data for industrial or academic exploitation. It generally involves:

1. Signal Processing (how to record the data and represent it?)

2. Modelling (what is the problem, what kind of mathematical model and answer?)

3. Statistics (reliability of estimation procedures?)

4. Machine Learning (what kind of efficient optimization algorithm?)

5. Implementation (software needs)

6. Visualization (how can I represent the resulting knowledge?)

As a whole, this sequence of questions is at the core of Artificial Intelligence and may also be referred to as Computer Science problems. In this lecture, we will address some of the issues raised above. Each time, practical examples will be provided.

Most of our motivation comes from the Big Data world, encountered in image processing, finance, genetics and many other fields where knowledge extraction is needed, when facing many observations described by many variables.

n: number of observations; p: number of variables per observation.

p >> n >> O(1).

1.2 Several examples

Spam detection. From a set of labelled messages (spam or not), build a classification rule for automatic spam rejection.

Select meaningful elements among the words? Automatic classification?

Gene expression profile analysis. One measures micro-array datasets built from a huge number of gene expression profiles. Number of genes p (of order thousands). Number of samples n (of order hundreds).


Diagnostic help: healthy or ill? Select meaningful elements among the genes? Automatic classification?

Recommendation problems

And more recently:

What kind of database? Reliable recommendations for clients? Online strategy?

Credit scoring. Build an indicator (Q score) from a dataset for the probability of interest in a financial product (Visa Premier credit card).

1. Define a model, a question?
2. Use a supervised classification algorithm to rank the best clients.
3. Use logistic regression to provide a score.

1.3 What about maths?

Various mathematical fields we will talk about:

Analysis: Convex optimization, Approximation theory.
Statistics: Penalized procedures and their reliability.
Probabilistic methods: concentration inequalities, stochastic processes, stochastic approximations.

Famous keywords: Lasso, Boosting, Convex relaxation, Supervised classification, Support Vector Machine,


Aggregation rules, Gradient descent, Stochastic gradient descent, Sequential prediction, Bandit games and minimax policies, Matrix completion.

In this session: we will briefly talk about optimization problems, which are mainly convex in our statistical world. Non-convex problems are also very interesting, even though they are much more difficult to deal with from a numerical point of view.

2 Standard Convex optimisation procedures

2.1 Convex functions

We recall some background material that is necessary for a clear understanding of how some machine learning procedures work. We will cover some basic relationships between convexity, positive semidefiniteness, local and global minimizers.

DEFINITION 1. — [Convex sets, convex functions] A set D is convex if and only if for any (x1, x2) ∈ D² and all α ∈ [0, 1],

x = αx1 + (1 − α)x2 ∈ D.

A function f is convex if its domain D is convex and if, for any (x1, x2) ∈ D² and all α ∈ [0, 1], f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2).

DEFINITION 2. — [Positive Semi-Definite (PSD) matrix] A p × p matrix H is PSD if, for all p × 1 vectors z, we have z^tHz ≥ 0.

There exists a strong link between PSD matrices and convex functions, given by the following proposition.

PROPOSITION 3. — A smooth C2(D) function f is convex if and only if D²f is PSD at any point of D.

The proof follows easily from a second order Taylor expansion.
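To make Proposition 3 concrete, here is a minimal numerical sketch (assuming NumPy; the matrix H and the sampled points are hypothetical, not taken from the lecture):

import numpy as np

# Sketch of Proposition 3 on a quadratic function
# f(x) = (1/2) x^t H x + b^t x, whose (constant) Hessian is H.
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # hypothetical symmetric matrix
b = np.array([1.0, -1.0])

def f(x):
    return 0.5 * x @ H @ x + b @ x

# PSD test: all eigenvalues of the symmetric Hessian are non-negative.
print("Hessian PSD:", np.all(np.linalg.eigvalsh(H) >= -1e-12))

# Direct check of the convexity inequality of Definition 1 on random points.
rng = np.random.default_rng(0)
for _ in range(1000):
    x1, x2, a = rng.normal(size=2), rng.normal(size=2), rng.uniform()
    assert f(a * x1 + (1 - a) * x2) <= a * f(x1) + (1 - a) * f(x2) + 1e-9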

2.2 Examples of convex functions

Exponential function: θ ∈ R ↦ exp(aθ) on R, whatever a is.
Affine function: θ ∈ Rd ↦ a^tθ + b.
Entropy-type function: θ ∈ R+ ↦ θ log(θ) (negative entropy).
p-norm: θ ∈ Rd ↦ ∥θ∥p := (∑_{i=1}^d |θi|^p)^{1/p}, with p ≥ 1.
Quadratic form: θ ∈ Rd ↦ θ^tPθ + 2q^tθ + r, where P is symmetric and positive semidefinite.

2.3 Why such an interest in convexity?

From external motivations:

Many problems in machine learning come from the minimization of a convex criterion and provide meaningful results for the initial statistical task.

Many optimization problems admit a convex reformulation (SVM classification or regression, LASSO regression, ridge regression, permutation recovery, . . . ).

From a numerical point of view:

Local minimizer = global minimizer. This is a powerful point since, in general, descent methods involve ∇f(x) (or something related to it), which is local information on f.

x is a local (global) minimizer of f if and only if 0 ∈ ∂f(x).

Many fast algorithms for the optimization of convex functions exist, and their cost is sometimes independent of the dimension d of the original space.


2.4 Why is convexity powerful?

Two kinds of optimization problems:

On the left (non-convex optimization problem): use Travelling Salesman-type methods and greedy exploration steps (simulated annealing, genetic algorithms).

On the right (convex optimization problem): use local descent methods with gradients or subgradients.

DEFINITION 4. — [Subgradient (nonsmooth functions)] For any function f: Rd → R and any x in Rd, the subgradient ∂f(x) is the set of all vectors g such that, for all y ∈ Rd,

f(x) − f(y) ≤ ⟨g, x − y⟩.

This set of subgradients may be empty. Fortunately, this is not the case for convex functions.

PROPOSITION 5. — f: Rd → R is convex if and only if ∂f(x) ≠ ∅ for any x ∈ Rd.
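As a classical one-dimensional illustration (not taken from the text above): for f(x) = |x| on R, Definition 4 gives ∂f(0) = [−1, 1], since f(0) − f(y) = −|y| ≤ g(0 − y) for every y as soon as |g| ≤ 1, while ∂f(x) = {sign(x)} for x ≠ 0. In particular 0 ∈ ∂f(0), which certifies that 0 is a global minimizer of |x|.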

3 Gradient descent method

3.1 On C1L functions

DEFINITION 6. — [C1L functions: L-Lipschitz gradient] The objective function f is C1L if and only if

∀(x, y) ∈ (Rn)², ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥.

FIGURE 1 – Geometrical illustration of the MM algorithm: we minimize θ ↦ D(θ) with the help of some auxiliary functions θ ↦ G(θ, α).

Implicitly, such a function f is of course C1, but not necessarily C2. Implicitly, we assume that f is defined on Rn equipped with a norm ∥·∥.

For C1L functions, we can derive important surrogate inequalities.

PROPOSITION 7. — If f ∈ C1L, then

f(y) ≤ φ+(y) = f(x) + ⟨∇f(x), y − x⟩ + (L/2)∥y − x∥², for all (x, y).

3.2 Gradient descent as a Majorization-Minimization method

One way to understand the gradient descent method is more geometrical and relies on the "Majorization-Minimization" (MM) algorithm. The geometrical idea is illustrated in Figure 1.

Imagine that:

we are able to produce, for each point y ∈ Rn, an auxiliary function x ↦ G(x, y) such that

∀x ∈ Rn, f(x) ≤ G(x, y) and f(y) = G(y, y);

we have an explicit exact formula that makes it possible to minimize the auxiliary function x ↦ G(x, y):

arg min_{x ∈ Rn} G(x, y).

Then, a possible method to minimize f is to produce a sequence (xk)k≥0 as follows: starting from x0, set xk+1 = arg min_x G(x, xk).
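To connect this MM point of view with gradient descent, one can take the surrogate of Proposition 7 as the auxiliary function (a standard computation): with

G(x, y) = f(y) + ⟨∇f(y), x − y⟩ + (L/2)∥x − y∥²,

we have f(x) ≤ G(x, y) and f(y) = G(y, y), and cancelling the gradient in x of this quadratic function gives ∇f(y) + L(x − y) = 0, i.e. x = y − (1/L)∇f(y). Each MM step is therefore exactly a gradient step with step size γ = 1/L, the step size used in Theorem 8 below.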


FIGURE 2 – Schematic representation of a projected gradient descent.

3.3 Gradient Descent algorithm

In either constrained or unconstrained problems, descent methods are powerful with convex functions. The most famous local descent method relies on Algorithm 1.

Algorithm 1: Gradient descent scheme
Input: function f, step-size sequence (γk)k∈N.
Initialization: pick x0 ∈ Rn.
Iterate:

∀k ∈ N, xk+1 = xk − γk∇f(xk). (1)

Output: lim_{k→+∞} xk.

It can easily be adapted to constrained problems if the projection π onto the constraint set is explicit (see an example in Figure 2).
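As an illustration of Algorithm 1, here is a minimal sketch in Python/NumPy (not the official course code; the least-squares objective, the step size 1/L and the projection onto the unit ball are hypothetical choices):

import numpy as np

def gradient_descent(grad_f, x0, gamma, n_iter=500, projection=None):
    """Iterates x_{k+1} = pi(x_k - gamma * grad f(x_k)) as in (1).

    projection is an optional map onto the constraint set (projected variant)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - gamma * grad_f(x)
        if projection is not None:
            x = projection(x)
    return x

# Hypothetical objective: f(x) = (1/2)||Ax - b||^2, gradient A^t(Ax - b),
# Lipschitz constant of the gradient L = ||A^t A||_2.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)

x_hat = gradient_descent(grad, x0=np.zeros(2), gamma=1.0 / L)
print(x_hat, np.linalg.solve(A, b))   # unconstrained minimizer, both should be close

# Projected variant: same objective, constrained to the unit Euclidean ball.
proj = lambda x: x / max(1.0, np.linalg.norm(x))
print(gradient_descent(grad, x0=np.zeros(2), gamma=1.0 / L, projection=proj))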

3.4 Smooth unconstrained case

THEOREM 8. — [Convergence of the gradient descent method, fixed step-size] If f ∈ C1L, then the choice γk = 1/L leads to

lim_{k→+∞} ∇f(xk) = 0.

If f is convex, we have

f(xt) − min f ≤ 2L∥x0 − x⋆∥² / (t − 1).

Remark. — Note that the two past results do not depend on the dimension d of the state space. The last result can be extended to the constrained situation.

3.5 Strongly convex unconstrained case

The results become even better if f is assumed to be strongly convex.

DEFINITION 9. — [Strongly convex functions SC(α)] We say that f: Rd → R is α-strongly convex if x ↦ f(x) − (α/2)∥x∥² is convex.

It can be shown that it is equivalent to

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (α/2)∥y − x∥², for all (x, y). (2)

PROPOSITION 10. — Let f be a C1L and α-strongly convex function on Rd. Then, for all (x, y) ∈ (Rd)²,

⟨∇f(x) − ∇f(y), x − y⟩ ≥ (αL/(α + L)) ∥x − y∥² + (1/(α + L)) ∥∇f(x) − ∇f(y)∥².

THEOREM 11. — Let f be an L-smooth and α-strongly convex function, and let κ = L/α denote its condition number. Then the choice of the step size γ = 2/(L + α) leads to

f(xn+1) − f(x⋆) ≤ (L/2) exp(−4n/(κ + 1)) ∥x1 − x⋆∥².
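A quick numerical illustration of Theorem 11 (a hypothetical sketch: the quadratic below has α and L as the extreme eigenvalues of its Hessian):

import numpy as np

H = np.diag([1.0, 10.0])            # Hessian of f(x) = (1/2) x^t H x
alpha, L = 1.0, 10.0                 # strong convexity and smoothness constants
kappa = L / alpha                    # condition number
gamma = 2.0 / (L + alpha)            # step size of Theorem 11

f = lambda x: 0.5 * x @ H @ x        # min f = 0, attained at x* = 0
x1 = np.array([5.0, 5.0])
x = x1.copy()
for n in range(1, 31):
    x = x - gamma * (H @ x)          # gradient step: x is now x_{n+1}
    bound = (L / 2) * np.exp(-4 * n / (kappa + 1)) * np.linalg.norm(x1) ** 2
    if n % 10 == 0:
        print(n, f(x), bound)        # f(x_{n+1}) stays below the theoretical bound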


4 Constrained optimization

4.1 Definition of the problem

θ: unknown vector of Rd to be recovered.
J: Rd → R: function to be minimized.
fi and gj: differentiable functions defining a set of constraints.

Definition of the problem:

min_{θ ∈ Rd} J(θ) such that fi(θ) = 0, ∀i = 1, . . . , n and gj(θ) ≤ 0, ∀j = 1, . . . , m.

Set of admissible vectors:

Ω := {θ ∈ Rd | fi(θ) = 0, ∀i and gj(θ) ≤ 0, ∀j}.
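In practice, such a constrained problem can be handed to a generic numerical solver. A minimal sketch with scipy.optimize is given below (the objective and constraints are hypothetical placeholders, not the example studied later):

import numpy as np
from scipy.optimize import minimize

# Hypothetical instance: minimize J(theta) = ||theta - c||^2
# subject to f(theta) = theta_1 + theta_2 - 1 = 0 and g(theta) = -theta_1 <= 0.
c = np.array([2.0, 0.0])
J = lambda theta: np.sum((theta - c) ** 2)

constraints = [
    {"type": "eq",   "fun": lambda theta: theta[0] + theta[1] - 1.0},  # f(theta) = 0
    {"type": "ineq", "fun": lambda theta: theta[0]},  # scipy expects fun >= 0, i.e. -g >= 0
]

res = minimize(J, x0=np.zeros(2), constraints=constraints)
print(res.x, res.fun)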

Typical situation: Ω is the circle of radius √2; optimal solution: θ⋆ = (−1, −1)^t and J(θ⋆) = −2.

Important restriction: we will restrict our study to convex functions J.

DEFINITION 12. — A constrained problem is convex iff J is a convex function, the fi are linear or affine functions, and the gj are convex functions.

Example: min_θ J(θ) such that a^tθ − b = 0.

Descent direction h: ∇J(θ)^t h < 0. Admissible direction h: a^t(θ + h) − b = 0 ⟺ a^t h = 0.

Optimality: θ∗ is optimal if there is no admissible descent direction starting from θ∗. The only possible case is when ∇J(θ∗) and a are linearly dependent:

∃λ ∈ R, ∇J(θ∗) + λa = 0.

In this situation:

∇J(θ) = (2θ1 + θ2 − 2, θ1 + 2θ2 + 2)^t and a = (1, −1)^t.

Hence, we are looking for θ such that ∇J(θ) ∝ a. Computations lead to θ1 = −θ2. The optimal value is reached for θ1 = 1/2 (and J(θ∗) = −15/4).
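The computation above can be checked symbolically; here is a small sketch with sympy (the right-hand side b is kept symbolic since it is not specified here):

import sympy as sp

t1, t2, lam, b = sp.symbols("theta1 theta2 lambda b")

grad_J = sp.Matrix([2 * t1 + t2 - 2, t1 + 2 * t2 + 2])   # gradient given above
a = sp.Matrix([1, -1])

# Lagrange system: grad J(theta) + lambda * a = 0 together with a^t theta = b.
equations = list(grad_J + lam * a) + [t1 - t2 - b]
solution = sp.solve(equations, [t1, t2, lam], dict=True)[0]
print(solution)   # gives theta1 = -theta2 (= b/2), as stated in the text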


4.2 Lagrangian function

min_θ J(θ) such that f(θ) := a^tθ − b = 0.

We have seen the important role of the scalar value λ above.

DEFINITION 13. — [Lagrangian function]

L(λ, θ) = J(θ) + λf(θ)

λ is the Lagrange multiplier. The optimal choice of (θ∗, λ∗) corresponds to

∇θL(λ∗, θ∗) = 0 and ∇λL(λ∗, θ∗) = 0.

Argument: θ∗ is optimal if there is no admissible descent direction h. Hence, ∇J and ∇f are linearly dependent. As a consequence, there exists λ∗ such that

∇θL(λ∗, θ∗) = ∇J(θ∗) + λ∗∇f(θ∗) = 0 (Dual equation)

Since θ∗ must be admissible, we have

∇λL(λ∗, θ∗) = f(θ∗) = 0 (Primal equation)

4.3 Inequality constraint

Case of a unique inequality constraint:

min_θ J(θ) such that g(θ) ≤ 0.

Descent direction h: ∇J(θ)^t h < 0. Admissible direction h: ∇g(θ)^t h ≤ 0, which guarantees that g(θ + αh) is decreasing in α.

Optimality: θ∗ is optimal if there is no admissible descent direction starting from θ∗. The only possible case is when ∇J(θ∗) and ∇g(θ∗) are linearly dependent and of opposite direction:

∃µ ≥ 0, ∇J(θ∗) = −µ∇g(θ∗).

We can check that θ∗ = (−1,−1).

4.3.1 Lagrangian in general settings

We consider the minimization problem: min_θ J(θ) such that gj(θ) ≤ 0, ∀j = 1, . . . , m and fi(θ) = 0, ∀i = 1, . . . , n.

DEFINITION 14. — [Lagrangian function] We associate to this problem the Lagrange multipliers (λ, µ) = (λ1, . . . , λn, µ1, . . . , µm) and the Lagrangian

L(θ, λ, µ) = J(θ) + ∑_{i=1}^n λi fi(θ) + ∑_{j=1}^m µj gj(θ),

where θ denotes the primal variables and (λ, µ) the dual variables.

4.3.2 KKT Conditions

DEFINITION 15. — [KKT Conditions] If J and f, g are smooth, we define the Karush-Kuhn-Tucker (KKT) conditions as:

Stationarity: ∇θL(λ, µ, θ) = 0.
Primal admissibility: f(θ) = 0 and g(θ) ≤ 0.
Dual admissibility: µj ≥ 0, ∀j = 1, . . . , m.
Complementary slackness: µj gj(θ) = 0, ∀j = 1, . . . , m.

THEOREM 16. — For a convex minimization problem of J under convex constraints f and g, θ∗ is a solution if and only if there exist λ∗ and µ∗ such that the KKT conditions hold at (θ∗, λ∗, µ∗).


Example :

J(θ) = (1/2)∥θ∥₂² s.t. θ1 − 2θ2 + 2 ≤ 0.

We get L(θ, µ) = ∥θ∥₂²/2 + µ(θ1 − 2θ2 + 2) with µ ≥ 0.

Stationarity: (θ1 + µ, θ2 − 2µ) = 0, hence θ2 = −2θ1 with θ1 ≤ 0.

The constraint is active at the optimum, θ1 − 2θ2 + 2 = 0, which gives −5µ + 2 = 0, i.e. µ = 2/5.

We deduce that θ∗ = (−2/5,4/5).
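The KKT conditions of Definition 15 can be verified numerically at this candidate pair (a minimal sketch with NumPy; the tolerances are arbitrary):

import numpy as np

theta = np.array([-2/5, 4/5])    # candidate primal solution
mu = 2/5                          # candidate Lagrange multiplier

grad_J = theta                               # gradient of J(theta) = (1/2)||theta||^2
grad_g = np.array([1.0, -2.0])               # gradient of g(theta) = theta_1 - 2 theta_2 + 2
g = theta[0] - 2 * theta[1] + 2

print("stationarity:", np.allclose(grad_J + mu * grad_g, 0.0))
print("primal admissibility:", g <= 1e-12)
print("dual admissibility:", mu >= 0)
print("complementary slackness:", abs(mu * g) <= 1e-12)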

4.3.3 Dual function

We introduce the dual function:

L(λ, µ) = min_θ L(θ, λ, µ).

We have the following important result

THEOREM 17. — Denote by p∗ = min{J(θ) | f(θ) = 0, g(θ) ≤ 0} the optimal value of the constrained problem. Then, for any (λ, µ) ∈ Rn × R+^m,

L(λ,µ) ≤ p∗.

Remark: the dual function L is lower than p∗ for any (λ, µ) ∈ Rn × R+^m. We aim to make this lower bound as close as possible to p∗: the idea is to maximize the function L with respect to (λ, µ).

DEFINITION 18. — [Dual problem]

max_{λ ∈ Rn, µ ∈ R+^m} L(λ, µ).

For each θ, (λ, µ) ↦ L(θ, λ, µ) is an affine function. Hence, the dual function L is concave (as a minimum of affine functions), and the dual problem is a convex, almost unconstrained, maximization problem.

Dual problems are easier than primal ones (the constraints almost disappear).

Dual problems are equivalent to primal ones: maximization of the dual ⇔ minimization of the primal (under suitable qualification conditions; not shown in this lecture).

Dual solutions permit to recover primal ones through the KKT conditions (Lagrange multipliers).

Example :

Lagrangian: L(θ, µ) = (θ1² + θ2²)/2 + µ(θ1 − 2θ2 + 2).
Dual function: L(µ) = min_θ L(θ, µ) = −(5/2)µ² + 2µ.
Dual solution: max L(µ) such that µ ≥ 0: µ = 2/5.
Primal solution: KKT ⟹ θ = (−µ, 2µ) = (−2/5, 4/5).
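The same numbers can be recovered by maximizing the dual numerically (a minimal sketch with NumPy; the grid over µ is an arbitrary choice):

import numpy as np

dual = lambda mu: -2.5 * mu ** 2 + 2 * mu      # dual function of the example

mus = np.linspace(0.0, 2.0, 200001)            # grid over mu >= 0
mu_star = mus[np.argmax(dual(mus))]
theta_star = np.array([-mu_star, 2 * mu_star]) # primal recovery via the KKT conditions

print(mu_star)                                 # close to 2/5
print(theta_star)                              # close to (-2/5, 4/5)
print(dual(mu_star))                           # equals the primal optimal value J(theta*) = 2/5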

To obtain further details, see von Neumann's Minimax Theorem.

4.4 Take-home message from convex optimization

Big Data problems arise in a large variety of fields. They are complicated for computational reasons (and also for statistical ones, see later).
Many Big Data problems can be translated into the optimization of a convex problem.
Efficient algorithms are available to optimize them, sometimes independently of the dimension of the underlying space.
Primal-dual formulations are important to overcome some constraints on the optimization.
Numerical convex solvers are widely and freely distributed.

5 Homework


5.1 Features of a good homework

Length limitation: 10 pages! Deadline: 28th of February. Groups of 2 students allowed.

This report should be short: strictly less than 10 pages, including the references.

The work relies either on a widespread academic subject or on a group of selected papers. In any case, you have to highlight the relationship between the concerned chapter and the theme you selected.

For the chosen subject, the report should be organized as follows

1. First, motivate the problem with a concrete application and propose a reasonable modelling.

2. Second, the report should explain the mathematical difficulties in solving the model and some recent developments to bypass these difficulties. You can also describe the behaviour of some algorithms.

3. Third, the report should propose either: numerical simulations using packages found on the web or your own experiments; some sketches of proofs of baseline theoretical results; or a discussion part that presents alternative methods (with references), exposing the pros and cons of each method.

You can choose to exploit only a subsample of the proposed references, as long as the content of your work is interesting enough. You can also complement your report with a reproducible set of simulations (use R, Matlab or Python please) that can be inspired by existing packages. (If packages are not public, send the whole source files.) These simulations are not counted in the 10 pages of the report.

The report files should be named lastname.doc or lastname.pdf and are expected in my mailbox before the 28th of February.

And to do this, anything is fair game (you can do what you want and find sources everywhere, but take care to avoid plagiarism!).

5.2 Classification with NN & SVM

The supervised classification problem is a long-standing issue in statistics and machine learning, and many algorithms can be found to deal with this standard framework. After a brief introduction, a concrete example and a modelling of this statistical problem, explain the important role of the Bayes classifier and of the NN (nearest neighbour) rule. Then, present the geometric interpretation of the SVM classifier, the role of convexity and the maths behind it. Finally, discuss the influence of the various parameters: number of observations, dimension of the ambient space, etc.

References:
CRAN repository
Journal of Statistical Software webpage
Hastie, Tibshirani and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Gyorfi, Lugosi, A Probabilistic Theory of Pattern Recognition
My website: perso.math.univ-toulouse.fr/gadat/
Wikistat: wikistat.fr/