
The Era of "Non"-Optimization Problems

Jong-Shi Pang

Daniel J Epstein Department of Industrial and Systems Engineering

University of Southern California

presented at

OneWorld Optimization Seminar Series

3:30–4:30 PM CEST, Monday, September 14, 2020

Based on joint work with Dr. Ying Cui at the University of Minnesota and many collaborators.

Contents of presentation

• Some general thoughts on the state of the field
• A readiness test (a bit of light-hearted humor)
• Some key "non"-features
• Some novel "non"-optimization problems
• A gentle touch of mathematics
• The monograph


Some General Thoughts

addressing the negative impact of the field of machine learning on the optimization field, and raising an alarm

• Convexity and/or differentiability have been the safe haven for optimization and its applications for decades, and particularly popular in machine learning in recent years,

to the extent that

• modellers are routinely sacrificing realism for convenience;
• algorithm designers will do whatever it takes to stay in this comfort zone, even at the expense of a potentially wrong approach;
• practitioners are content with the resulting models, algorithms, and "solutions";

resulting in

• tools being abused,
• advances in science being stalled, and
• a future being built on soft ground.


This talk: Benefits, yes or no?
Ready for the "non"-era, yes or no?

A little humor before the storm.

[Readiness flowchart, consolidated from the incremental slides:]

A. Are you satisfied with the comfort zone of convexity and differentiability?
• Yes → you will not learn from this talk; live happily ever after.
• No → go to B.

B. How do you deal with non-convexity/non-smoothness?
• C1. Ignore these properties, or give them minimal, or even wrong, treatment.
• C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
• C3. Faithfully treat these non-properties, by either:
  - D1. employing heuristics, potentially leading to heuristics on heuristics when complex models are being formulated; or
  - D2. the hard road: striving for rigor without sacrificing practice, advancing fundamental knowledge, and preparing for the future.

Features

Non-Smooth + Non-Convex + Non-Deterministic ⇒ Non-Problems

Non-Separable, Non-Stylistic, Non-Interbreeding, and many more non-topics ···

Examples of Non-Problems

I. Structured Learning

Piecewise Affine Regression

extending classical affine regression:

$$
\operatorname*{minimize}_{a,\alpha,b,\beta}\ \frac{1}{N}\sum_{s=1}^{N}\left(y_s-\underbrace{\left[\max_{1\le i\le k_1}\left\{(a^i)^{\top}x_s+\alpha_i\right\}-\max_{1\le i\le k_2}\left\{(b^i)^{\top}x_s+\beta_i\right\}\right]}_{\text{capturing all piecewise affine functions}}\right)^{2}
$$

illustrating coupled non-convexity and non-differentiability.

Estimate Change Points/Hyperplanes in Linear Regression
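For concreteness, here is a minimal numpy sketch of evaluating this piecewise affine regression objective; the data, shapes, and names are illustrative assumptions, not from the talk.

```python
import numpy as np

def pa_regression_loss(A, alpha, B, beta, X, y):
    """(1/N) * sum_s ( y_s - [max_i {a_i^T x_s + alpha_i} - max_i {b_i^T x_s + beta_i}] )^2.
    A: (k1, d), alpha: (k1,), B: (k2, d), beta: (k2,), X: (N, d), y: (N,)."""
    g = np.max(X @ A.T + alpha, axis=1)   # convex max-affine part
    h = np.max(X @ B.T + beta, axis=1)    # second max-affine part, entering with a minus sign
    return np.mean((y - (g - h)) ** 2)

# tiny usage example on random (illustrative) data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
A, alpha = rng.normal(size=(4, 3)), rng.normal(size=4)
B, beta = rng.normal(size=(2, 3)), rng.normal(size=2)
print(pa_regression_loss(A, alpha, B, beta, X, y))
```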

Comments

• This is a convex quadratic composed with a piecewise affine function, and is thus piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$
\varphi_1\!\left(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\right)-\varphi_2\!\left(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\right),
$$

where all functions are convex.


Logical Sparsity Learning

$$
\begin{aligned}
\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\;& f(\beta)\\
\text{subject to}\;& \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \le b_i,\quad i=1,\cdots,m,
\end{aligned}
\qquad\text{where}\quad
|t|_0 \triangleq \begin{cases}1 & \text{if } t\neq 0\\ 0 & \text{otherwise.}\end{cases}
$$

• strong/weak hierarchy, e.g., |β_j|₀ ≤ |β_k|₀
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection

Comments

• The ℓ0 function is discrete in nature; an ASC (affine sparsity constraint) can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind

$$
\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\; f(\beta)+\gamma\sum_{i=1}^{m}\left[\,b_i-\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\,\right]_+
$$

for a given γ > 0.
• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semi-continuous objective.
• When |•|₀ is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints (see the sketch below).
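To make the ℓ0 constraint and its folded concave approximation concrete, a small numpy sketch; the capped-ℓ1 surrogate is one standard folded concave choice, an assumption for illustration rather than the specific surrogate of the cited work.

```python
import numpy as np

def l0(t, tol=1e-12):
    """|t|_0 applied entrywise: 1 where t is nonzero, 0 otherwise."""
    return (np.abs(t) > tol).astype(float)

def asc_residual(A, b, beta):
    """Residuals of the affine sparsity constraints sum_j a_ij |beta_j|_0 <= b_i.
    Entries <= 0 mean the corresponding constraint holds."""
    return A @ l0(beta) - b

def capped_l1(t, delta=1e-2):
    """Folded concave surrogate of |t|_0: min(|t|/delta, 1), which is the
    difference of the convex functions |t|/delta and max(|t|/delta - 1, 0)."""
    return np.minimum(np.abs(t) / delta, 1.0)

beta = np.array([0.0, 0.3, -0.004])
A, b = np.array([[1.0, 1.0, 1.0]]), np.array([2.0])
print(asc_residual(A, b, beta), capped_l1(beta))
```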


Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

$$
\operatorname*{minimize}_{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}\;\frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x_s)\,\big]^2
$$

where f_θ is the L-fold composite function with the non-smooth max (ReLU) activation:

$$
f_\theta(x)=\underbrace{\big(W^L\,\bullet+w^L\big)}_{\text{Output Layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\,\bullet+w^2\big)}_{\text{Second Layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{Input Layer}}
$$

i.e., linear (output) ∘ ··· ∘ max ∘ linear ∘ max ∘ linear (input).
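A minimal numpy sketch of f_θ and the empirical loss; layer sizes and data are illustrative.

```python
import numpy as np

def f_theta(x, weights):
    """weights = [(W1, w1), ..., (WL, wL)]; ReLU on all but the output layer."""
    for W, w in weights[:-1]:
        x = np.maximum(0.0, W @ x + w)    # non-smooth max (ReLU) activation
    W, w = weights[-1]
    return W @ x + w                      # affine output layer

def empirical_loss(weights, X, y):
    """(1/S) * sum_s [y_s - f_theta(x_s)]^2 for scalar outputs."""
    preds = np.array([f_theta(x, weights)[0] for x in X])
    return np.mean((y - preds) ** 2)

rng = np.random.default_rng(0)
sizes = [3, 5, 5, 1]                      # input, two hidden layers, scalar output
weights = [(rng.normal(size=(m, n)), rng.normal(size=m))
           for n, m in zip(sizes[:-1], sizes[1:])]
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
print(empirical_loss(weights, X, y))
```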

Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization (a code sketch follows these comments):

$$
\operatorname*{minimize}_{z\in Z,\,u}\;\zeta_\rho(z,u)\triangleq\frac{1}{S}\sum_{s=1}^{S}\varphi_s(u^{s,L})+\rho\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Big|\,u^{s,\ell}_j-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}
$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
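To illustrate the structure of ζ_ρ, a hedged numpy sketch of the per-sample ℓ1 consistency penalty, taking ψ_ℓ to be the ReLU layer map of the network above; the lifted variables u and all names are illustrative assumptions.

```python
import numpy as np

def layer_map(z, u_prev):
    """psi_ell(z, u^{ell-1}) with z = (W, w): the ReLU layer output (an assumption)."""
    W, w = z
    return np.maximum(0.0, W @ u_prev + w)

def l1_layer_penalty(z_list, u_list, x):
    """rho-free part of the penalty for one sample with input x:
    sum_ell sum_j | u^ell_j - psi_ell(z^ell, u^{ell-1})_j |,
    where u_list[ell] is the lifted (auxiliary) output of layer ell+1."""
    total, u_prev = 0.0, x
    for z, u in zip(z_list, u_list):
        total += np.sum(np.abs(u - layer_map(z, u_prev)))
        u_prev = u
    return total

rng = np.random.default_rng(0)
z_list = [(rng.normal(size=(4, 3)), rng.normal(size=4)),
          (rng.normal(size=(2, 4)), rng.normal(size=2))]
u_list = [rng.normal(size=4), rng.normal(size=2)]
print(l1_layer_penalty(z_list, u_list, x=rng.normal(size=3)))
```

As the comments note, the point of this penalty is that, for ρ large enough, the theory relates stationary solutions of the penalized problem to those of the constrained lifted formulation.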


Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$$
\operatorname*{minimize}_{x\in X}\;\varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big]
$$

where the recourse function is

$$
\psi(x,\omega_1,\omega_2)\triangleq\operatorname*{minimum}_{y}\;y^{\top}c\big([\omega_2-x-\omega_1]_+\big)\quad\text{subject to}\quad A(\omega)x+Dy\ge\xi(\omega).
$$

The unit cost of fast-ramping generators depends on:
▸ observed renewable production & demand of electricity;
▸ the shortfall due to the first-stage thermal power decisions.
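A sketch of evaluating the recourse ψ(x, ω₁, ω₂) with scipy's linprog; the data (A(ω), D, ξ(ω), the unit-cost map c) and the added bound y ≥ 0 are illustrative assumptions to make the toy LP well posed.

```python
import numpy as np
from scipy.optimize import linprog

def recourse(x, omega1, omega2, A_w, D, xi_w, unit_cost):
    """psi(x, w1, w2) = min_y c([w2 - x - w1]_+)^T y  s.t.  A(w) x + D y >= xi(w)."""
    shortfall = np.maximum(omega2 - x - omega1, 0.0)
    c = unit_cost(shortfall)                      # cost vector depends on the shortfall
    # linprog uses A_ub @ y <= b_ub, so negate the >= constraint
    res = linprog(c, A_ub=-D, b_ub=-(xi_w - A_w @ x),
                  bounds=[(0, None)] * D.shape[1])  # y >= 0 is an added assumption
    return res.fun if res.success else np.inf

rng = np.random.default_rng(0)
n = 3
A_w, D, xi_w = np.eye(n), np.eye(n), rng.uniform(1, 2, size=n)
x, w1, w2 = np.full(n, 0.5), rng.uniform(0, 1, n), rng.uniform(1, 2, n)
print(recourse(x, w1, w2, A_w, D, xi_w, unit_cost=lambda s: 1.0 + s))
```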

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$
\operatorname*{minimize}_{x}\;\zeta(x)\triangleq\varphi(x)+\mathbb{E}_\xi\big[\psi(x,\xi)\big]\quad\text{subject to}\quad x\in X\subseteq\mathbb{R}^{n_1},
$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

$$
\psi(x,\xi)\triangleq\operatorname*{minimum}_{y}\;\big[f(\xi)+G(\xi)x\big]^{\top}y+\tfrac{1}{2}\,y^{\top}Qy
\quad\text{subject to}\quad y\in Y(x,\xi)\triangleq\{\,y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\,\}.
$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a combined regularization, convexification, and sampling method for computing a generalized critical point with an almost-sure convergence guarantee (a schematic sample-average evaluation is sketched below).
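A schematic sample-average evaluation of ζ(x), solving each QP recourse with scipy's SLSQP; this is only an illustrative stand-in, not the regularization/convexification sampling method referred to above, and all data are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def qp_recourse(x, Q, f, G, A, D, b):
    """psi(x, xi): min_y [f + G x]^T y + 0.5 y^T Q y  s.t.  A x + D y >= b."""
    q = f + G @ x
    cons = {"type": "ineq", "fun": lambda y: A @ x + D @ y - b}   # fun(y) >= 0
    res = minimize(lambda y: q @ y + 0.5 * y @ Q @ y, x0=np.zeros(len(q)),
                   jac=lambda y: q + Q @ y, constraints=[cons], method="SLSQP")
    return res.fun

def saa_objective(x, xis, phi, qp_data):
    """zeta(x) ~ phi(x) + (1/M) sum_i psi(x, xi_i) over sampled scenarios."""
    return phi(x) + np.mean([qp_recourse(x, **qp_data(xi)) for xi in xis])

rng = np.random.default_rng(0)
n = 2
data = lambda xi: dict(Q=np.eye(n), f=xi, G=np.eye(n), A=np.eye(n),
                       D=np.eye(n), b=np.zeros(n))
print(saa_objective(np.ones(n), rng.normal(size=(5, n)),
                    phi=lambda x: x @ x, qp_data=data))
```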


(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem: anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

$$
\operatorname*{minimize}_{x\ge 0}\;(c+\delta_c)^{\top}x+\rho\,\underbrace{\Big\|\big[(A+\delta_A)x-b-\delta_b\big]_+\Big\|_\infty}_{\text{constraint residual}}
$$

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$
\operatorname*{maximum}_{\delta\in\Delta}\;\lambda_1\underbrace{(c+\delta_c)^{\top}x}_{\text{disrupt objective}}+\lambda_2\underbrace{\Big\|\big[(A+\delta_A)x-b-\delta_b\big]_+\Big\|_\infty}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}
$$

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
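A numpy sketch of the two players' objectives for fixed x and δ = (δ_c, δ_A, δ_b); all data are illustrative.

```python
import numpy as np

def defender_objective(x, c, A, b, delta_c, delta_A, delta_b, rho):
    """(c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf."""
    residual = np.maximum((A + delta_A) @ x - b - delta_b, 0.0)
    return (c + delta_c) @ x + rho * np.max(residual, initial=0.0)

def attacker_objective(delta, x, c, A, b, lam1, lam2, attack_cost):
    """lam1 * (c + dc)^T x + lam2 * ||[(A + dA) x - b - db]_+||_inf - C(delta)."""
    delta_c, delta_A, delta_b = delta
    residual = np.maximum((A + delta_A) @ x - b - delta_b, 0.0)
    return (lam1 * (c + delta_c) @ x
            + lam2 * np.max(residual, initial=0.0) - attack_cost(delta))

c, A, b, x = np.ones(2), np.eye(2), np.ones(2), np.ones(2)
zero = (np.zeros(2), np.zeros((2, 2)), np.zeros(2))
print(defender_objective(x, c, A, b, *zero, rho=10.0))
```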

Robustification of nonconvex problems

A general nonlinear program:

$$
\operatorname*{minimize}_{x\in S}\;\theta(x)+f(x)\quad\text{subject to}\quad H(x)=0,
$$

where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H: O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$
\operatorname*{minimize}_{x\in S}\;\theta(x)+\operatorname*{maximum}_{\delta\in\Delta}\,f(x,\delta)\quad\text{subject to}\quad\underbrace{H(x,\delta)=0\;\;\forall\,\delta\in\Delta}_{\substack{\text{easily infeasible in the}\\\text{current way of modeling}}}
$$


New formulation

via a pessimistic value-function minimization: given γ > 0,

$$
\operatorname*{minimize}_{x\in S}\;\theta(x)+v_\gamma(x),
$$

where v_γ(x) models constraint violation together with the cost of model attacks:

$$
v_\gamma(x)\triangleq\operatorname*{maximum}_{\delta\in\Delta}\;\varphi_\gamma(x,\delta)\triangleq f(x,\delta)+\gamma\underbrace{\|H(x,\delta)\|_\infty}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}},
$$

with the novel features:
• a combined perturbation δ in a joint uncertainty set Δ ⊆ ℝᴸ;
• constraint satisfaction robustified via penalization.
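A crude way to get intuition for v_γ is to maximize over a finite sample of Δ; this discretization is purely an illustration, whereas the talk derives explicit forms of the value function under piecewise structure.

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    """max over sampled delta of f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)."""
    return max(f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in deltas)

x = np.ones(2)
deltas = [np.zeros(2), 0.1 * np.ones(2)]          # illustrative sample of Delta
print(v_gamma(x, deltas, f=lambda x, d: x @ d, H=lambda x, d: x - d,
              C=lambda d: d @ d, gamma=5.0))
```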

Comments

• Related the VFOP (value-function optimization problem) to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

$$
\operatorname*{minimize}_{x\in S}\;\phi(x)\triangleq\theta(x)+v(x)\quad\text{with}\quad v(x)\triangleq\max_{1\le j\le J}\;\operatorname*{maximum}_{\delta\in\Delta\cap\mathbb{R}^L_+}\big[\,g_j(x,\delta)-h_j(x)^{\top}\delta\,\big].
$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.


Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$
\text{In-CVaR}_\alpha^\beta(Z)\triangleq\frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)} z\,dF_Z(z)
=\frac{1}{\beta-\alpha}\Big[(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\Big]
$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$
\text{bPOE}(Z,\tau)\triangleq 1-\min\big\{\alpha\in[0,1]:\text{CVaR}_\alpha(Z)\ge\tau\big\},\qquad
\text{bPOD}(Z,u)\triangleq\min\big\{\beta\in[0,1]:\text{In-CVaR}_0^\beta(Z)\ge u\big\}.
$$
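On a finite sample these quantities are easy to estimate; a numpy sketch using the standard empirical CVaR estimator (an approximation for intuition, not the estimators analyzed in the talk):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the losses at or above the alpha-quantile."""
    var = np.quantile(z, alpha)
    return z[z >= var].mean()

def in_cvar(z, alpha, beta):
    """In-CVaR_alpha^beta via the identity
    (1/(beta-alpha)) * [(1-alpha) CVaR_alpha - (1-beta) CVaR_beta]."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.random.default_rng(0).normal(size=10_000)
print(cvar(z, 0.9), in_cvar(z, 0.1, 0.9))
```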


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

$$
\mathbb{P}(u\ge Z>\tau)\le\text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\triangleq\text{bPOI}_\tau^u(Z).
$$

Two risk minimization problems of a composite random functional:

$$
\operatorname*{minimize}_{\theta\in\Theta}\;\underbrace{\text{In-CVaR}_\alpha^\beta(Z(\theta))}_{\substack{\text{a difference of two convex}\\\text{SP value functions}}}
\qquad\text{and}\qquad
\operatorname*{minimize}_{\theta\in\Theta}\;\underbrace{\text{bPOI}_\tau^u(Z(\theta))}_{\substack{\text{a stochastic program with a}\\\text{difference-of-convex objective}}}
$$

where

$$
Z(\theta)\triangleq c\Big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)-h(\bullet,z)\text{ is difference-of-convex}}\Big),
$$

with c: ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
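Since CVaR_α(Z) is nondecreasing in α, bPOE can be estimated directly from its definition by bisection on α; a sketch reusing the cvar helper from the previous snippet (illustrative only):

```python
import numpy as np

def bpoe(z, tau, iters=60):
    """Empirical bPOE(Z, tau) = 1 - min{alpha in [0,1] : CVaR_alpha(Z) >= tau}."""
    if cvar(z, 0.0) >= tau:          # already exceeded at alpha = 0 (the mean)
        return 1.0
    if cvar(z, 1.0 - 1e-9) < tau:    # even the largest losses fall short of tau
        return 0.0
    lo, hi = 0.0, 1.0 - 1e-9
    for _ in range(iters):           # bisection on the monotone map alpha -> CVaR_alpha
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if cvar(z, mid) >= tau else (mid, hi)
    return 1.0 - hi

z = np.random.default_rng(0).normal(size=10_000)
print(bpoe(z, 1.5))
```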

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$
\underbrace{\operatorname*{minimum}_{\eta_1\in\Upsilon_1}\;\mathbb{E}\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}\;-\;\underbrace{\operatorname*{minimum}_{\eta_2\in\Upsilon_2}\;\mathbb{E}\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}.
$$

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: relationship between the stationary points. SOSC ⇒ loc-min ⇒ SONC; and loc-min ⇒ directional stationary (for dd functions) ⇒ limiting stationary ⇒ Clarke stationary ⇒ critical (for dc functions).]

▸ Stationarity:
- smooth case: ∇V(x) = 0;
- nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the summands is continuously differentiable.

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute.
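A minimal worked illustration of the failed sum rule (a standard textbook example, not from the slides): take V₁(x) = |x| and V₂(x) = −|x| on ℝ. Then V₁ + V₂ ≡ 0, so

$$
\partial(V_1+V_2)(0)=\{0\},\qquad \partial V_1(0)+\partial V_2(0)=[-1,1]+[-1,1]=[-2,2],
$$

since the Clarke subdifferential of −|·| at 0 is [−1, 1]. The sum rule survives only as the inclusion ∂(V₁+V₂)(x) ⊆ ∂V₁(x) + ∂V₂(x); equality is restored when one summand is continuously differentiable.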

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:
- dealing with non-convexity: the majorization-minimization algorithm (see the sketch after this list);
- dealing with non-smoothness: the nonsmooth Newton algorithm;
- dealing with pathological constraints: the (exact) penalty method.

▸ An integration of sampling and deterministic methods:
- dealing with population optimization in statistics;
- dealing with stochastic programming in operations research;
- dealing with risk minimization in financial management and tail control.
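As one concrete instance of surrogation for a dc function f = g − h (with g, h convex and h smooth here for simplicity; a generic sketch under these assumptions, not one of the book's specific algorithms): majorize f at the current iterate by linearizing the concave part −h, and minimize the convex surrogate.

```python
import numpy as np
from scipy.optimize import minimize

def dc_mm(g, h, grad_h, x0, iters=50):
    """Majorization-minimization for f = g - h:
    x_{k+1} = argmin_x  g(x) - [h(x_k) + grad_h(x_k)^T (x - x_k)].
    By convexity of h the surrogate majorizes f and is tight at x_k,
    so the f-values are nonincreasing along the iterates."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        gk = grad_h(x)
        x = minimize(lambda z: g(z) - gk @ z, x).x   # convex subproblem
    return x

# toy dc example: f(x) = ||x - 1||^2 - sqrt(1 + ||x||^2)
g = lambda x: np.sum((x - 1.0) ** 2)
h = lambda x: np.sqrt(1.0 + np.sum(x ** 2))
grad_h = lambda x: x / np.sqrt(1.0 + np.sum(x ** 2))
print(dc_mm(g, h, grad_h, x0=np.zeros(2)))
```

Limit points of such iterations are critical points of the dc function, which is exactly why the sharper stationarity concepts above matter: the algorithm certifies criticality, not local optimality.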

Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic;
• Difference-of-convex: facilitates algorithmic design via convex majorization;
• Semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• Sub-analytic and semi-analytic;

* and many more.

[Figure residue from Section 4.4 of the monograph ("Classes of Nonsmooth Functions", p. 231): a diagram of inclusions among function classes, including PC² ⊆ PC¹ ⊆ semismooth ⊆ B-differentiable (locally Lipschitz + directionally differentiable); PQ, PLQ, and PA functions within the dc and locally dc classes; and dc-composite, diff-max-convex, implicit convex-concave (icc), (locally) strongly weakly convex, C¹, LC¹, and lower-C² functions, with implications holding under conditions (4.65) and (4.66) and convexity/openness assumptions on the domains.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to the publisher (Friday, September 11, 2020)

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations: eigenvalues and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683

Bibliography 691
Index 721

Page 2: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Some General Thoughtsaddressing the negative impact of the field of machine learning

on the optimization field

and raising an alarm

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

A

A Are you satisfied with the comfort zone of convexity and differentiability

A

will not learn from this talk

Yes

A

will not learn from this talk

Live happily ever after

Yes

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

In-CVaR_α^β(Z) ≜ 1/(β − α) ∫_{VaR_α(Z) ≤ z < VaR_β(Z)} z dF_Z(z)
             = 1/(β − α) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ]

Buffered probability of exceedance/deceedance at level τ > 0:

bPOE(Z, τ) ≜ 1 − min{ α ∈ [0, 1] : CVaR_α(Z) ≥ τ }

bPOD(Z, u) ≜ min{ β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }
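As a quick numerical illustration, here is a minimal Python sketch of these quantities on a sample, using the empirical CVaR (the average of the worst (1 − α)-fraction of the observations) and the In-CVaR identity displayed above; the function names are hypothetical.

import numpy as np

def cvar(z, alpha):
    # empirical CVaR_alpha: average of the largest (1 - alpha)*N observations,
    # with a fractional atom at the boundary of the tail
    z = np.sort(z)[::-1]                  # descending
    n = len(z)
    k = (1.0 - alpha) * n                 # tail mass, in sample units
    m = int(k)
    return (z[:m].sum() + (k - m) * (z[m] if m < n else 0.0)) / k

def in_cvar(z, alpha, beta):
    # the identity on the slide:
    # In-CVaR_alpha^beta = [(1-alpha) CVaR_alpha - (1-beta) CVaR_beta] / (beta - alpha)
    return ((1.0 - alpha) * cvar(z, alpha) - (1.0 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.random.default_rng(1).standard_normal(10_000)
print(cvar(z, 0.90))          # upper 10%-tail average; about 1.75 for N(0, 1)
print(in_cvar(z, 0.1, 0.9))   # mean of the middle 80% of the sample; about 0 here

In-CVaR thus acts like a trimmed mean: both tails are cut away, which is precisely the two-sided control of estimation errors advertised above.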


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

P(u ≥ Z > τ) ≤ bPOE(Z, τ) + bPOD(Z, u) − 1 ≜ bPOI_τ^u(Z)

Two risk minimization problems of a composite random functional:

minimize_{θ∈Θ} In-CVaR_α^β(Z(θ))   (a difference of two convex SP value functions)

and  minimize_{θ∈Θ} bPOI_τ^u(Z(θ))   (a stochastic program with a difference-of-convex objective)

where

Z(θ) ≜ c( g(θ, Z) − h(θ, Z) ),   with g(•, z) − h(•, z) a difference of convex functions,

c : R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
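Continuing the sketch, the buffered quantities can be estimated directly from their definitions by scanning the level over a grid (CVaR_α and In-CVaR_0^β are monotone in their levels, so a coarse scan suffices for an illustration). The cvar and in_cvar helpers are those of the previous sketch; grid endpoints and the empty-set conventions are assumptions made for this demo.

import numpy as np

def cvar(z, alpha):
    z = np.sort(z)[::-1]
    n = len(z); k = (1.0 - alpha) * n; m = int(k)
    return (z[:m].sum() + (k - m) * (z[m] if m < n else 0.0)) / k

def in_cvar(z, alpha, beta):
    return ((1.0 - alpha) * cvar(z, alpha) - (1.0 - beta) * cvar(z, beta)) / (beta - alpha)

def bpoe(z, tau, levels):
    # bPOE(Z, tau) = 1 - min{ alpha : CVaR_alpha(Z) >= tau };  empty set -> 0
    feas = [a for a in levels if cvar(z, a) >= tau]
    return 1.0 - min(feas) if feas else 0.0

def bpod(z, u, levels):
    # bPOD(Z, u) = min{ beta : In-CVaR_0^beta(Z) >= u };  empty set -> 1 (coarse convention)
    feas = [b for b in levels if in_cvar(z, 0.0, b) >= u]
    return min(feas) if feas else 1.0

z = np.random.default_rng(2).standard_normal(5_000)
levels = np.linspace(0.0, 0.999, 1000)        # stay below 1 to avoid division by zero
tau, u = -1.0, -0.2
bound = bpoe(z, tau, levels) + bpod(z, u, levels[1:]) - 1.0   # bPOI_tau^u(Z)
print(bound, np.mean((z > tau) & (z <= u)))   # a valid, though possibly loose, upper bound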

Comments

• The In-CVaR minimization leads to the following stochastic program:

minimize over θ ∈ Θ:   min_{η1∈Υ1} E[ ϕ1(θ, η1, ω) ]  (denoted v1(θ))  −  min_{η2∈Υ2} E[ ϕ2(θ, η2, ω) ]  (denoted v2(θ))

Both functions v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective (the value-function form is spelled out below).

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
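For concreteness, the value-function structure comes from the classical Rockafellar-Uryasev representation of CVaR (a standard fact, not restated on the slide):

CVaR_α(Z(θ)) = min_{η∈R} { η + (1/(1 − α)) E[ (Z(θ) − η)_+ ] },

so that, multiplying the In-CVaR identity through by (β − α),

(β − α) · In-CVaR_α^β(Z(θ)) = min_{η1} E[ (1 − α) η1 + (Z(θ) − η1)_+ ] − min_{η2} E[ (1 − β) η2 + (Z(θ) − η2)_+ ] = v1(θ) − v2(θ);

one may take ϕ1(θ, η1, ω) = (1 − α) η1 + (Z(θ)(ω) − η1)_+ and similarly ϕ2 with β. Each piece is the value function of a univariate expectation minimization, convex in θ whenever Z(θ) is convex in θ; the dc composite Z(θ) above requires the additional care developed in the paper.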


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: relationship between the stationary points, from the computable to the sharp: critical points (for dc functions), Clarke stationary and limiting stationary points, directional stationary points (for dd functions), points satisfying the second-order necessary condition (SONC), local minimizers, and points satisfying the second-order sufficient condition (SOSC).]

▸ Stationarity:
– smooth case: ∇V(x) = 0;
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail],
e.g. ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable.

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute.
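A one-line example of the gap between these notions (a standard fact, added for illustration): take V(x) = −|x| at x̄ = 0. Then ∂V(0) = [−1, 1] ∋ 0, so x̄ = 0 is Clarke stationary; yet V′(0; d) = −|d| < 0 for every d ≠ 0, so it is not directionally stationary (it is in fact a local maximizer). The directional concept rules out such spurious points, which is exactly why it is sharper and harder to compute.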

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:

– dealing with non-convexity: majorization-minimization algorithm;

– dealing with non-smoothness: nonsmooth Newton algorithm (see the sketch after this list);

– dealing with pathological constraints: (exact) penalty method.

▸ An integration of sampling and deterministic methods:

– dealing with population optimization in statistics;

– dealing with stochastic programming in operations research;

– dealing with risk minimization in financial management and tail control.
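To illustrate the nonsmooth Newton ingredient, here is a minimal Python sketch of a semismooth Newton method for a linear complementarity problem 0 ≤ x ⟂ Mx + q ≥ 0, reformulated through the Fischer-Burmeister function. This is a textbook instance of the technique under simplifying assumptions (no globalization), not the specific algorithms of the monograph; the data M, q are made up for the demo.

import numpy as np

def fb(a, b):
    # Fischer-Burmeister function: phi(a, b) = sqrt(a^2 + b^2) - a - b,
    # which vanishes iff a >= 0, b >= 0 and a*b = 0
    return np.sqrt(a * a + b * b) - a - b

def ssn_lcp(M, q, x0, tol=1e-10, iters=50):
    # solve 0 <= x  perp  Mx + q >= 0 via semismooth Newton on Phi(x) = fb(x, Mx + q)
    x = x0.copy()
    for _ in range(iters):
        y = M @ x + q
        Phi = fb(x, y)
        if np.linalg.norm(Phi) <= tol:
            break
        r = np.maximum(np.sqrt(x * x + y * y), 1e-300)
        # one element of the B(ouligand)-subdifferential of Phi: the smooth
        # formula where (x_i, y_i) != (0, 0), a fixed element at the kink
        da = np.where(r > 1e-300, x / r - 1.0, np.sqrt(0.5) - 1.0)
        db = np.where(r > 1e-300, y / r - 1.0, np.sqrt(0.5) - 1.0)
        J = np.diag(da) + np.diag(db) @ M
        x = x + np.linalg.solve(J, -Phi)   # full (local) Newton step
    return x

M = np.array([[2.0, 1.0], [1.0, 3.0]])     # positive definite: unique solution
q = np.array([-1.0, 1.0])
x = ssn_lcp(M, q, x0=np.ones(2))
print(x, M @ x + q)                        # expect x ~ (0.5, 0) and Mx + q ~ (0, 1.5)

Locally, such iterations converge fast because the Fischer-Burmeister residual is semismooth; this is precisely why semismoothness appears among the function classes listed next.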

Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic;

• Difference-of-convex: facilitates algorithmic design via convex majorization (see the representation example after this list);

• Semismooth: enables fast algorithms of the Newton type;

• Bouligand differentiable: locally Lipschitz + directionally differentiable;

• Sub-analytic and semi-analytic.

* and many more
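As a small illustration of how broad the difference-of-convex class already is (a classical fact, echoed by the piecewise affine regression model earlier in the talk): every piecewise affine function can be written as

f(x) = max_{1≤i≤k1} ( aᵢ⊤x + αᵢ ) − max_{1≤i≤k2} ( bᵢ⊤x + βᵢ ),

a difference of two convex, max-of-affine functions. For instance, the concave "hat" min(x, 2 − x) equals 2 − max(x, 2 − x), an affine function minus a max of affine functions.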

[Diagram (monograph §4.4, p. 231): inclusions among the classes of nonsmooth functions: PQ, PLQ, and PA functions; PC² and PC¹ functions; dc, dc composite, and diff-max-convex functions; icc functions; (locally) strongly weakly convex and locally dc functions; semismooth and B-differentiable functions; and locally Lipschitz + directionally differentiable functions, with side conditions such as C¹ or LC¹, lower-C², and open or closed convex domains.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to the publisher (Friday, September 11, 2020)

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̃_c 683

Bibliography 691
Index 721

Page 3: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Some General Thoughtsaddressing the negative impact of the field of machine learning

on the optimization field

and raising an alarm

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

A

A Are you satisfied with the comfort zone of convexity and differentiability

A

will not learn from this talk

Yes

A

will not learn from this talk

Live happily ever after

Yes

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 4: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Some General Thoughtsaddressing the negative impact of the field of machine learning

on the optimization field

and raising an alarm

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

A

A Are you satisfied with the comfort zone of convexity and differentiability

A

will not learn from this talk

Yes

A

will not learn from this talk

Live happily ever after

Yes

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems


Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity

x: production from traditional thermal plants; y: production from backup fast-ramping generators

minimize_{x∈X}  ϕ(x) + E_{ω1,ω2}[ ψ(x, ω1, ω2) ]

where the recourse function is

ψ(x, ω1, ω2) ≜ minimum_y  yᵀ c( [ω2 − x − ω1]_+ )  subject to  A(ω) x + D y ≥ ξ(ω)

The unit cost of the fast-ramping generators depends on

▸ the observed renewable production & demand of electricity

▸ the shortfall due to the first-stage thermal power decisions

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

minimize_x  ζ(x) ≜ ϕ(x) + E_ξ[ ψ(x, ξ) ]  subject to  x ∈ X ⊆ ℝ^{n1}

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP) below, with Q symmetric positive semidefinite:

ψ(x, ξ) ≜ minimum_y  [ f(ξ) + G(ξ) x ]ᵀ y + ½ yᵀ Q y  subject to  y ∈ Y(x, ξ) ≜ { y ∈ ℝ^{n2} | A(ξ) x + D y ≥ b(ξ) }

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a combined regularization-convexification-sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
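A minimal sketch of evaluating the QP recourse and a sample-average approximation of its expectation, assuming cvxpy is available; all problem data are invented, and feasibility is arranged by construction:

import numpy as np
import cvxpy as cp

def recourse(x, f, G, A, D, b, Q):
    # psi(x, xi): optimal value of the convex QP in y
    y = cp.Variable(Q.shape[0])
    objective = cp.Minimize((f + G @ x) @ y + 0.5 * cp.quad_form(y, Q))
    problem = cp.Problem(objective, [A @ x + D @ y >= b])
    problem.solve()
    return problem.value

rng = np.random.default_rng(2)
n1, n2, m = 3, 2, 4
Q, x = np.eye(n2), np.ones(n1)           # Q symmetric positive semidefinite
vals = []
for _ in range(3):                       # sample-average approximation of E[psi(x, xi)]
    f, G = rng.standard_normal(n2), rng.standard_normal((n2, n1))
    A, D = rng.standard_normal((m, n1)), rng.standard_normal((m, n2))
    b = A @ x - np.ones(m)               # makes y = 0 feasible by construction
    vals.append(recourse(x, f, G, A, D, b, Q))
print(np.mean(vals))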


(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize cᵀx subject to Ax = b and x ≥ 0.

Defender's problem: anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

minimize_{x≥0}  (c + δ_c)ᵀ x + ρ ‖ [ (A + δ_A) x − b − δ_b ]_+ ‖_∞

where the norm term is the constraint residual.

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

maximum_{δ∈∆}  λ1 (c + δ_c)ᵀ x + λ2 ‖ [ (A + δ_A) x − b − δ_b ]_+ ‖_∞ − C(δ)

where the first term disrupts the objective, the second disrupts the constraints, and C(δ) is the attack cost.

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.
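A small sketch (toy data, invented δ) of evaluating both players' objectives for a fixed pair (x, δ):

import numpy as np

def residual_inf(x, A, dA, b, db):
    # || [ (A + dA) x - b - db ]_+ ||_inf : size of the worst constraint violation
    return np.max(np.maximum((A + dA) @ x - b - db, 0.0))

def defender_obj(x, c, dc, A, dA, b, db, rho):
    return (c + dc) @ x + rho * residual_inf(x, A, dA, b, db)

def attacker_obj(x, c, dc, A, dA, b, db, lam1, lam2, cost):
    return lam1 * (c + dc) @ x + lam2 * residual_inf(x, A, dA, b, db) - cost

c, A, b = np.array([1.0, 2.0]), np.array([[1.0, 1.0]]), np.array([1.0])
x = np.array([0.5, 0.5])
dc, dA, db = np.array([0.1, 0.0]), np.array([[0.0, 0.2]]), np.array([-0.1])
print(defender_obj(x, c, dc, A, dA, b, db, rho=10.0))
print(attacker_obj(x, c, dc, A, dA, b, db, lam1=1.0, lam2=1.0, cost=0.3))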

Robustification of nonconvex problems

A general nonlinear program:

minimize_{x∈S}  θ(x) + f(x)  subject to  H(x) = 0

where

• S is a closed convex set in ℝ^n, not subject to alterations;

• θ and f are real-valued functions defined on an open convex set O in ℝ^n containing S;

• H: O → ℝ^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

minimize_{x∈S}  θ(x) + maximum_{δ∈∆} f(x, δ)  subject to  H(x, δ) = 0 for all δ ∈ ∆

(easily infeasible in the current way of modeling).


New formulation

via a pessimistic value-function minimization: given γ > 0,

minimize_{x∈S}  θ(x) + v_γ(x)

where v_γ(x) models constraint violation together with the cost of model attacks:

v_γ(x) ≜ maximum_{δ∈∆}  φ_γ(x, δ) ≜ f(x, δ) + γ ‖ H(x, δ) ‖_∞ − C(δ)

with the ‖•‖_∞ term the constraint residual and C(δ) the attack cost, and with the novel features:

• a combined perturbation δ in a joint uncertainty set ∆ ⊆ ℝ^L;

• robustified constraint satisfaction via penalization.
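A crude sketch approximating v_γ(x) from below by enumerating a finite sample of ∆; the explicit forms mentioned in the Comments below are not implemented here, and f, H, C are invented:

import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    # lower approximation of v_gamma(x): maximize phi_gamma(x, .) over a
    # finite sample of the uncertainty set Delta
    return max(f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in deltas)

f = lambda x, d: (1.0 + d[0]) * x.sum()
H = lambda x, d: np.array([x @ x - 1.0 - d[1]])   # one perturbed equality constraint
C = lambda d: 0.5 * np.abs(d).sum()               # attack cost
deltas = [np.zeros(2), np.array([0.2, 0.0]), np.array([0.0, -0.3])]
print(v_gamma(np.array([0.6, 0.8]), deltas, f, H, C, gamma=5.0))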

Comments

• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

minimize_{x∈S}  ϕ(x) ≜ θ(x) + v(x)  with  v(x) ≜ max_{1≤j≤J} maximum_{δ∈∆∩ℝ^L_+} [ g_j(x, δ) − h_j(x)ᵀ δ ]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.


Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

In-CVaR^β_α(Z) ≜ (1/(β − α)) ∫_{VaR_β(Z) > z ≥ VaR_α(Z)} z dF_Z(z)

              = (1/(β − α)) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ]

Buffered probability of exceedance/deceedance at level τ > 0:

bPOE(Z, τ) ≜ 1 − min{ α ∈ [0, 1] : CVaR_α(Z) ≥ τ }

bPOD(Z, u) ≜ min{ β ∈ [0, 1] : In-CVaR^β_0(Z) ≥ u }
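A minimal empirical sketch of CVaR (a simple tail-averaging estimator on samples; ties and fractional tail sizes are ignored) and of In-CVaR via the identity above:

import numpy as np

def cvar(z, alpha):
    # empirical CVaR_alpha: average of the worst (largest) (1 - alpha) tail
    z = np.sort(z)[::-1]
    k = max(1, int(np.ceil((1.0 - alpha) * len(z))))
    return z[:k].mean()

def in_cvar(z, alpha, beta):
    # In-CVaR^beta_alpha = ((1-alpha) CVaR_alpha - (1-beta) CVaR_beta) / (beta - alpha)
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

rng = np.random.default_rng(3)
z = rng.standard_normal(100_000)
print(cvar(z, 0.95))           # roughly 2.06 for a standard normal
print(in_cvar(z, 0.05, 0.95))  # trims both tails; near 0 by symmetry here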


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

P( u ≥ Z > τ ) ≤ bPOE(Z, τ) + bPOD(Z, u) − 1 ≜ bPOI^u_τ(Z)

Two risk minimization problems of a composite random functional:

minimize_{θ∈Θ}  In-CVaR^β_α(Z(θ))    (a difference of two convex SP value functions)

and

minimize_{θ∈Θ}  bPOI^u_τ(Z(θ))    (a stochastic program with a difference-of-convex objective)

where

Z(θ) ≜ c( g(θ, Z) − h(θ, Z) )

with c: ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions, so that g(•, z) − h(•, z) is difference-of-convex.
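A grid-scan sketch of the empirical bPOE; the tail-averaging CVaR estimator and the grid resolution are crude assumptions of this sketch:

import numpy as np

def cvar(z, alpha):
    # empirical CVaR_alpha, as in the previous sketch
    z = np.sort(z)[::-1]
    k = max(1, int(np.ceil((1.0 - alpha) * len(z))))
    return z[:k].mean()

def bpoe(z, tau, grid=np.linspace(0.0, 0.999, 500)):
    # bPOE(Z, tau) = 1 - min{ alpha : CVaR_alpha(Z) >= tau }; since alpha ->
    # CVaR_alpha(Z) is nondecreasing, a grid scan finds the first feasible alpha
    feasible = [a for a in grid if cvar(z, a) >= tau]
    return 1.0 - min(feasible) if feasible else 0.0

rng = np.random.default_rng(5)
z = rng.standard_normal(20_000)
print(bpoe(z, 1.0))  # roughly 0.38, upper-bounding P(Z > 1.0) of about 0.16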

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

v_1(θ) − v_2(θ),  where  v_i(θ) ≜ minimum_{η_i∈Υ_i} E[ ϕ_i(θ, η_i, ω) ],  i = 1, 2.

Both v_1 and v_2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: the hierarchy of computable stationarity concepts, from weakest to sharpest: critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions); the second-order conditions (SONC, necessary; SOSC, sufficient) bracket local minimality.]

Relationship between the stationary points:

▸ Stationarity:
  – smooth case: ∇V(x) = 0
  – nonsmooth case: 0 ∈ ∂V(x)  [subdifferential; many calculus rules fail], e.g. ∂V_1(x) + ∂V_2(x) ≠ ∂(V_1 + V_2)(x) unless one of them is continuously differentiable

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder to compute.
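For example (a standard illustration, not from the slides): take V(x) = −|x| at the point 0. The Clarke subdifferential is ∂V(0) = [−1, 1], which contains 0, so 0 is Clarke stationary; yet V′(0; d) = −|d| < 0 for every d ≠ 0, so 0 is not directionally stationary. The sharper directional concept correctly rejects this point (a global maximizer of V), while the subdifferential-based one does not.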

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions

▸ A fusion of first-order and second-order algorithms
  – Dealing with non-convexity: majorization-minimization algorithm
  – Dealing with non-smoothness: nonsmooth Newton algorithm
  – Dealing with pathological constraints: (exact) penalty method

▸ An integration of sampling and deterministic methods
  – Dealing with population optimization in statistics
  – Dealing with stochastic programming in operations research
  – Dealing with risk minimization in financial management and tail control
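A minimal sketch of one majorization-minimization loop for a toy dc program; the objective and its closed-form surrogate minimizer are chosen purely for illustration:

import numpy as np

def mm_dc(a, lam, x0, iters=20):
    # minimize f(x) = 0.5*||x - a||^2 - lam*||x||_1  (a toy dc program, g - h)
    # MM step: linearize the concave part -lam*||x||_1 at x_k via a subgradient
    # s_k in the subdifferential of ||.||_1, then minimize the convex surrogate
    # 0.5*||x - a||^2 - lam*s_k@x, whose minimizer is x = a + lam*s_k
    x = x0.astype(float)
    for _ in range(iters):
        s = np.sign(x)          # a subgradient of ||.||_1 at x
        x = a + lam * s
    return x

a, lam = np.array([0.3, -2.0]), 0.5
print(mm_dc(a, lam, x0=np.zeros(2)))  # reaches a fixed point; MM descends at every step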

Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic

• Difference-of-convex: facilitates algorithmic design via convex majorization

• Semismooth: enables fast algorithms of the Newton type

• Bouligand differentiable: locally Lipschitz + directionally differentiable

• Sub-analytic and semi-analytic

* and many more

[Figure (monograph Section 4.4, p. 231): inclusion diagram among the classes of nonsmooth functions: the piecewise classes PA, PLQ, PQ, PC¹, PC²; semismooth and B-differentiable functions (within locally Lipschitz + directionally differentiable); and the dc, dc-composite, diff-max-convex, icc, (locally) strongly weakly convex, and locally dc classes, with the conditions under which these classes embed in one another.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to the publisher (Friday, September 11, 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683

Bibliography 691
Index 721

Page 5: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Some General Thoughtsaddressing the negative impact of the field of machine learning

on the optimization field

and raising an alarm

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

A

A Are you satisfied with the comfort zone of convexity and differentiability

A

will not learn from this talk

Yes

A

will not learn from this talk

Live happily ever after

Yes

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 6: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Some General Thoughtsaddressing the negative impact of the field of machine learning

on the optimization field

and raising an alarm

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

This talk: benefits, yes or no?

Ready for the "non"-era, yes or no?

A little humor before the storm: a readiness test.

A. Are you satisfied with the comfort zone of convexity and differentiability?
   Yes → you will not learn from this talk; live happily ever after.
   No → go to B.

B. How do you deal with non-convexity/non-smoothness?
   C1. Ignore these properties, or give them minimal or even wrong treatment.
   C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
   C3. Faithfully treat these non-properties, choosing between:
       D1. Employ heuristics, potentially leading to heuristics on heuristics, when complex models are being formulated.
       D2. The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.

Features

Non-Smooth + Non-Convex + Non-Deterministic = Non-Problems

Non-Separable • Non-Stylistic • Non-Interbreeding

and many more non-topics · · ·

Examples of Non-Problems

I. Structured Learning

Piecewise Affine Regression

extending classical affine regression:

\underset{a,\alpha,b,\beta}{\text{minimize}} \; \frac{1}{N}\sum_{s=1}^{N}\Bigg( y_s - \underbrace{\Big[ \max_{1\le i\le k_1}\big\{ (a^i)^\top x_s + \alpha_i \big\} - \max_{1\le i\le k_2}\big\{ (b^i)^\top x_s + \beta_i \big\} \Big]}_{\text{capturing all piecewise affine functions}} \Bigg)^{2},

illustrating coupled non-convexity and non-differentiability.

Application: estimate change points/hyperplanes in linear regression.

Comments

• This is a convex quadratic composed with a piecewise affine function, thus is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming based algorithm (a small sketch follows these comments).

• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

\varphi_1\Big( \max_{1\le i\le k_{11}} \psi^1_{1,i}(x) - \max_{1\le i\le k_{12}} \psi^1_{2,i}(x) \Big) \;-\; \varphi_2\Big( \max_{1\le i\le k_{21}} \psi^2_{1,i}(x) - \max_{1\le i\le k_{22}} \psi^2_{2,i}(x) \Big),

where all functions are convex.
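As a concrete illustration, below is a minimal numpy sketch of a partition-then-refit heuristic for the difference-of-max-affine model: freeze the argmax pieces at the current iterate, where the model becomes affine in the parameters, and refit by least squares. The function names, the random initialization, and the heuristic itself are illustrative assumptions on my part, not the algorithm analyzed in the monograph.

import numpy as np

def dma_predict(X, A, alpha, B, beta):
    # difference-of-max-affine model:
    # f(x) = max_i {a_i^T x + alpha_i} - max_j {b_j^T x + beta_j}
    return (X @ A.T + alpha).max(axis=1) - (X @ B.T + beta).max(axis=1)

def fit_dma(X, y, k1=3, k2=2, iters=30, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    A = rng.normal(size=(k1, d)); alpha = rng.normal(size=k1)
    B = rng.normal(size=(k2, d)); beta = rng.normal(size=k2)
    for _ in range(iters):
        i_star = (X @ A.T + alpha).argmax(axis=1)  # active "plus" piece per point
        j_star = (X @ B.T + beta).argmax(axis=1)   # active "minus" piece per point
        # With the active pieces frozen, the model is affine in the parameters,
        # so the refit is an ordinary least-squares problem.
        Phi = np.zeros((N, (k1 + k2) * (d + 1)))
        for s in range(N):
            o = i_star[s] * (d + 1)
            Phi[s, o:o + d], Phi[s, o + d] = X[s], 1.0
            o = (k1 + j_star[s]) * (d + 1)
            Phi[s, o:o + d], Phi[s, o + d] = -X[s], -1.0
        theta = np.linalg.lstsq(Phi, y, rcond=None)[0].reshape(k1 + k2, d + 1)
        A, alpha = theta[:k1, :d], theta[:k1, d]
        B, beta = theta[k1:, :d], theta[k1:, d]
    return A, alpha, B, beta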

Logical Sparsity Learning

\underset{\beta\in\mathbb{R}^d}{\text{minimize}} \; f(\beta) \quad \text{subject to} \quad \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \le b_i, \quad i = 1, \cdots, m,

where |t|_0 = 1 if t ≠ 0 and |t|_0 = 0 otherwise. Such constraints model:

• strong/weak hierarchy, e.g. |β_j|_0 ≤ |β_k|_0;

• cost-based variable selection: coefficients ≥ 0 but not all equal;

• hierarchical/group variable selection.

Comments

• The ℓ_0 function is discrete in nature; an affine sparsity constraint (ASC) can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative (a toy sketch follows these comments).

• A soft formulation of ASCs is possible, leading to an objective of the kind

\underset{\beta\in\mathbb{R}^d}{\text{minimize}} \; f(\beta) + \gamma \sum_{i=1}^{m} \Big[\, b_i - \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \,\Big]_+ \quad \text{for a given } \gamma > 0.

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semi-continuous objective.

• When |•|_0 is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints.
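To make the first bullet concrete, here is a hedged PuLP sketch of the big-M integer formulation on a made-up toy instance; an ℓ_1 residual stands in for f(β) so that the model stays linear. The data, the bound M, and the single ASC row (a cardinality budget, with a nonnegative row of A as the bullet requires) are illustrative assumptions.

import pulp

# Toy data (hypothetical): 3 samples, d = 3 coefficients.
X = [[1.0, 2.0, 0.5], [0.0, 1.0, 1.0], [1.0, 0.0, 2.0]]
y = [3.0, 1.5, 2.0]
d, M = 3, 10.0                       # M bounds |beta_j|, as boundedness requires

prob = pulp.LpProblem("asc_bigM", pulp.LpMinimize)
beta = [pulp.LpVariable(f"beta{j}", -M, M) for j in range(d)]
z = [pulp.LpVariable(f"z{j}", cat="Binary") for j in range(d)]   # z_j = |beta_j|_0
r = [pulp.LpVariable(f"r{s}", lowBound=0.0) for s in range(len(y))]

prob += pulp.lpSum(r)                # l1 residual as a linear stand-in for f(beta)
for s in range(len(y)):
    fit = pulp.lpSum(X[s][j] * beta[j] for j in range(d))
    prob += r[s] >= y[s] - fit
    prob += r[s] >= fit - y[s]
for j in range(d):                   # big-M link: beta_j != 0 forces z_j = 1
    prob += beta[j] <= M * z[j]
    prob += -beta[j] <= M * z[j]
prob += pulp.lpSum(z) <= 2           # ASC row with nonnegative A: at most 2 active

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([pulp.value(v) for v in beta], [pulp.value(v) for v in z])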

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

\underset{\theta = (W^\ell, w^\ell)_{\ell=1}^{L}}{\text{minimize}} \; \frac{1}{S}\sum_{s=1}^{S} \big[\, y_s - f_\theta(x_s) \,\big]^{2},

where f_θ is the L-fold composite function with the non-smooth max (ReLU) activation:

f_\theta(x) = \underbrace{\big( W^L \bullet + w^L \big)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\big( 0,\, W^2 \bullet + w^2 \big)}_{\text{second layer}} \circ \underbrace{\max\big( 0,\, W^1 x + w^1 \big)}_{\text{input layer}},

i.e., linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input).
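For concreteness, a minimal numpy forward pass matching the displayed composition, together with the empirical squared-error risk; the function and argument names are my own.

import numpy as np

def f_theta(x, weights):
    # weights = [(W^1, w^1), ..., (W^L, w^L)]; ReLU max(0, .) on every layer
    # except the affine output layer, matching the composition displayed above
    u = np.asarray(x, dtype=float)
    for W, w in weights[:-1]:
        u = np.maximum(0.0, W @ u + w)
    W_L, w_L = weights[-1]
    return W_L @ u + w_L

def risk(weights, xs, ys):
    # (1/S) * sum_s (y_s - f_theta(x_s))^2
    return np.mean([np.sum((y - f_theta(x, weights)) ** 2) for x, y in zip(xs, ys)])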

Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ_1 penalization (a small evaluation sketch follows these comments):

\underset{z\in Z,\, u}{\text{minimize}} \; \zeta_\rho(z, u) \,\triangleq\, \frac{1}{S}\sum_{s=1}^{S} \varphi_s(u^{s,L}) \;+\; \rho \sum_{\ell=1}^{L}\sum_{j=1}^{N} \underbrace{\Big|\, u^{s,\ell}_j - \psi_{\ell j}(z^\ell, u^{s,\ell-1}) \,\Big|}_{\text{penalizing the components}}.

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
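A hedged numpy sketch of evaluating the penalized objective ζ_ρ(z, u) on samples. The instantiation is mine, not spelled out on the slide: ψ_ℓ is taken to be the ReLU map for hidden layers and affine for the output layer, and φ_s is taken to be the squared error.

import numpy as np

def zeta_rho(z, u, xs, ys, rho):
    # z : [(W^1, w^1), ..., (W^L, w^L)], the network weights
    # u : u[s][l] is the auxiliary copy of layer l's output for sample s
    S, L = len(xs), len(z)
    loss, pen = 0.0, 0.0
    for s in range(S):
        loss += np.sum((ys[s] - u[s][L - 1]) ** 2)   # phi_s(u^{s,L})
        prev = xs[s]
        for l, (W, w) in enumerate(z):
            out = W @ prev + w
            if l < L - 1:
                out = np.maximum(0.0, out)           # hidden-layer psi_l
            pen += np.abs(u[s][l] - out).sum()       # |u^{s,l} - psi_l(z^l, u^{s,l-1})|
            prev = u[s][l]
    return loss / S + rho * pen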

Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω_1: uncertain renewable energy generation
ω_2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

\underset{x\in X}{\text{minimize}} \; \varphi(x) + \mathbb{E}_{\omega_1,\omega_2}\big[\, \psi(x, \omega_1, \omega_2) \,\big],

where the recourse function is

\psi(x, \omega_1, \omega_2) \,\triangleq\, \underset{y}{\text{minimum}} \; y^\top c\big( [\,\omega_2 - x - \omega_1\,]_+ \big) \quad \text{subject to} \; A(\omega)x + Dy \ge \xi(\omega).

The unit cost of the fast-ramping generators depends on:

▸ the observed renewable production & demand of electricity;

▸ the shortfall due to the first-stage thermal power decisions.

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse (a small evaluation sketch follows these comments):

\underset{x}{\text{minimize}} \; \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_\xi\big[\, \psi(x, \xi) \,\big] \quad \text{subject to} \; x \in X \subseteq \mathbb{R}^{n_1},

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP) below, with Q symmetric positive semidefinite:

\psi(x, \xi) \,\triangleq\, \underset{y}{\text{minimum}} \; \big[\, f(\xi) + G(\xi)x \,\big]^\top y + \tfrac{1}{2}\, y^\top Q y \quad \text{subject to} \; y \in Y(x, \xi) \,\triangleq\, \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \ge b(\xi) \,\big\}.

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a combined regularization, convexification, and sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
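To see the recourse function in action, here is a hedged cvxpy sketch that evaluates ψ(x, ξ) for one sampled ξ and forms a sample-average approximation of ζ(x). The helper names and the tuple layout of a sample are my own conventions; the combined regularization-convexification-sampling method of the last bullet is substantially more than this plain evaluation.

import cvxpy as cp
import numpy as np

def psi(x, f_xi, G_xi, A_xi, b_xi, D, Q):
    # recourse value psi(x, xi) for one realization xi of the random data;
    # Q must be symmetric positive semidefinite for the QP to be convex
    y = cp.Variable(Q.shape[0])
    objective = (f_xi + G_xi @ x) @ y + 0.5 * cp.quad_form(y, Q)
    problem = cp.Problem(cp.Minimize(objective), [A_xi @ x + D @ y >= b_xi])
    return problem.solve()

def saa_objective(x, phi, samples, D, Q):
    # sample-average approximation of zeta(x) = phi(x) + E[psi(x, xi)];
    # samples is a list of tuples (f_xi, G_xi, A_xi, b_xi)
    return phi(x) + np.mean([psi(x, *xi, D, Q) for xi in samples])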

(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize c^⊤ x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

\underset{x\ge 0}{\text{minimize}} \; (c + \delta_c)^\top x + \rho\, \underbrace{\big\|\, [\, (A + \delta_A)x - b - \delta_b \,]_+ \,\big\|_\infty}_{\text{constraint residual}}.

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

\underset{\delta\in\Delta}{\text{maximum}} \; \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2 \underbrace{\big\|\, [\, (A + \delta_A)x - b - \delta_b \,]_+ \,\big\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}}.

• Many variations are possible.

• These offer alternative approaches to the convexity-based robust formulations.

• Adversarial Programming (a small payoff-evaluation sketch follows).
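A small numpy sketch of the two payoff evaluations; with a finite uncertainty set Δ, the attacker's best response reduces to enumeration over Δ. The function names and the tuple packing of an attack (δ_c, δ_A, δ_b) are illustrative assumptions.

import numpy as np

def defender_objective(x, c, A, b, attack, rho):
    # (c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf
    dc, dA, db = attack
    residual = np.maximum((A + dA) @ x - b - db, 0.0)
    return (c + dc) @ x + rho * np.linalg.norm(residual, np.inf)

def attacker_objective(x, c, A, b, attack, lam1, lam2, cost):
    # lam1 * disrupted objective + lam2 * disrupted constraints - attack cost
    dc, dA, db = attack
    residual = np.maximum((A + dA) @ x - b - db, 0.0)
    return lam1 * (c + dc) @ x + lam2 * np.linalg.norm(residual, np.inf) - cost(attack)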

Robustification of nonconvex problems

A general nonlinear program:

\underset{x\in S}{\text{minimize}} \; \theta(x) + f(x) \quad \text{subject to} \; H(x) = 0,

where:

• S is a closed convex set in R^n, not subject to alterations;

• θ and f are real-valued functions defined on an open convex set O in R^n containing S;

• H : O → R^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

\underset{x\in S}{\text{minimize}} \; \theta(x) + \underset{\delta\in\Delta}{\text{maximum}} \; f(x, \delta) \quad \underbrace{\text{subject to} \; H(x, \delta) = 0 \;\; \forall\, \delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}.

New formulation

via a pessimistic value-function minimization: given γ > 0,

\underset{x\in S}{\text{minimize}} \; \theta(x) + v_\gamma(x),

where v_γ(x) models constraint violation together with the cost of model attacks:

v_\gamma(x) \,\triangleq\, \underset{\delta\in\Delta}{\text{maximum}} \; \varphi_\gamma(x, \delta) \,\triangleq\, f(x, \delta) + \gamma \underbrace{\|\, H(x, \delta) \,\|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}},

with the novel features:

• a combined perturbation δ in a joint uncertainty set Δ ⊆ R^L;

• robustified constraint satisfaction via penalization (an enumeration sketch follows).
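A minimal numpy sketch of evaluating v_γ(x), under the simplifying assumption (mine, for illustration only) that Δ is finite, so the inner maximization reduces to enumeration; f, H, and C are passed in as callables.

import numpy as np

def v_gamma(x, Delta, f, H, C, gamma):
    # pessimistic value function: max over delta of
    # f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)
    return max(f(x, d) + gamma * np.linalg.norm(H(x, d), np.inf) - C(d)
               for d in Delta)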

Comments

• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

\underset{x\in S}{\text{minimize}} \; \varphi(x) \,\triangleq\, \theta(x) + v(x), \quad \text{with} \quad v(x) \,\triangleq\, \max_{1\le j\le J} \; \underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\text{maximum}} \; \big[\, g_j(x, \delta) - h_j(x)^\top\delta \,\big].

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.

Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z) \,\triangleq\, \frac{1}{\beta - \alpha} \int_{\text{VaR}_\beta(Z) > z \ge \text{VaR}_\alpha(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta - \alpha}\Big[\, (1 - \alpha)\,\text{CVaR}_\alpha(Z) - (1 - \beta)\,\text{CVaR}_\beta(Z) \,\Big].

Buffered probability of exceedance/deceedance at level τ > 0:

\text{bPOE}(Z, \tau) \,\triangleq\, 1 - \min\big\{\, \alpha \in [0, 1] : \text{CVaR}_\alpha(Z) \ge \tau \,\big\}, \qquad \text{bPOD}(Z, u) \,\triangleq\, \min\big\{\, \beta \in [0, 1] : \text{In-CVaR}^{\,\beta}_{\,0}(Z) \ge u \,\big\}.

Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \,\triangleq\, \text{bPOI}^{\,u}_{\,\tau}(Z).

Two risk minimization problems of a composite random functional:

\underset{\theta\in\Theta}{\text{minimize}} \; \underbrace{\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z(\theta))}_{\text{a difference of two convex SP value functions}} \qquad \text{and} \qquad \underset{\theta\in\Theta}{\text{minimize}} \; \underbrace{\text{bPOI}^{\,u}_{\,\tau}(Z(\theta))}_{\text{a stochastic program with a difference-of-convex objective}},

where

Z(\theta) \,\triangleq\, c\big(\, \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet, z) - h(\bullet, z) \text{ is difference-of-convex}} \,\big),

with c : R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.

Comments

• The In-CVaR minimization leads to the following stochastic program (an empirical sketch follows these comments): minimize over θ ∈ Θ

\underbrace{\underset{\eta_1\in\Upsilon_1}{\text{minimum}} \; \mathbb{E}\big[\, \varphi_1(\theta, \eta_1, \omega) \,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\underset{\eta_2\in\Upsilon_2}{\text{minimum}} \; \mathbb{E}\big[\, \varphi_2(\theta, \eta_2, \omega) \,\big]}_{\text{denoted } v_2(\theta)}.

Both v_1 and v_2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
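The building blocks can be checked empirically on samples. A rough numpy sketch, assuming an i.i.d. sample z of Z and using a quantile-based tail estimate together with a crude grid search; this is cruder than the optimization-based definitions above and is for illustration only.

import numpy as np

def cvar(z, alpha):
    # empirical CVaR_alpha: mean of the upper (1 - alpha) tail of the sample
    tau = np.quantile(z, alpha)
    return z[z >= tau].mean()

def in_cvar(z, alpha, beta):
    # In-CVaR via the identity ((1-a) CVaR_a - (1-b) CVaR_b) / (b - a)
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

def bpoe(z, tau, grid=np.linspace(0.0, 0.99, 100)):
    # grid search for 1 - min{ alpha : CVaR_alpha(z) >= tau }
    feasible = [a for a in grid if cvar(z, a) >= tau]
    return 1.0 - min(feasible) if feasible else 1.0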

A touch of mathematics
bridging computations with theory

First of all: what can be computed?

[Diagram: the hierarchy of computable solution concepts: critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions), together with SONC, local minimizers, and SOSC, and the relationships between these stationary points.]

▸ Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g. ∂V_1(x) + ∂V_2(x) ≠ ∂(V_1 + V_2)(x) unless one of them is continuously differentiable.

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute (a tiny numeric illustration follows).
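A tiny numeric illustration of why the sharper concept matters, using the dc function V(x) = −|x| (this specific example is mine): the origin satisfies the Clarke condition 0 ∈ ∂V(0) = [−1, 1], yet fails directional stationarity.

import numpy as np

def dir_deriv(V, x, d, t=1e-8):
    # one-sided finite-difference estimate of V'(x; d)
    return (V(x + t * d) - V(x)) / t

V = lambda x: -abs(x)        # a dc function: V(x) = 0 - |x|
# x = 0 is Clarke stationary (0 lies in the Clarke subdifferential [-1, 1]),
# yet V'(0; d) = -|d| < 0 in every direction, so 0 is NOT d-stationary:
print(dir_deriv(V, 0.0, 1.0), dir_deriv(V, 0.0, -1.0))   # both approx -1.0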

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: the majorization-minimization algorithm (a toy run follows after this list);
  – dealing with non-smoothness: the nonsmooth Newton algorithm;
  – dealing with pathological constraints: the (exact) penalty method.

▸ An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
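A one-variable toy run of the majorization-minimization idea on a dc function (my own example, not from the talk): the concave part −h is majorized by its linearization, and each convex subproblem then has a closed-form solution.

import numpy as np

# Minimize f(x) = g(x) - h(x) with g(x) = x**4 and h(x) = 2*x**2, both convex.
# MM step: replace h by its linearization at x_k and solve the convex
# subproblem min_x g(x) - h'(x_k) * x; the optimality condition 4 x^3 = 4 x_k
# gives the closed-form update x_{k+1} = cbrt(x_k).
x = 0.2
for _ in range(40):
    x = np.cbrt(x)
print(x)   # tends to 1.0, a directional stationary point (a local minimizer)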

Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader*:

• piecewise smooth, e.g. piecewise linear-quadratic;

• difference-of-convex: facilitates algorithmic design via convex majorization;

• semismooth: enables fast algorithms of the Newton type;

• Bouligand differentiable: locally Lipschitz + directionally differentiable;

• sub-analytic and semi-analytic;

* and many more.

[Diagram from §4.4 of the monograph (p. 231): inclusion relationships among the classes of nonsmooth functions, including PC², PC¹, piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), piecewise affine (PA), dc and dc-composite, diff-max-convex, implicit convex-concave (icc), semismooth, B-differentiable, locally Lipschitz + directionally differentiable, weakly convex, locally dc, lower-C², LC¹, and C¹ functions, on open or closed convex domains.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

Submitted to the publisher on Friday, September 11, 2020.

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.1.9) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of \widehat{φ}_c 683

Bibliography 691

Index 721

Page 7: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Contents of presentation

bull Some general thoughts of the state of the field

bull A readiness test (a light-hearted humor)

bull Some key ldquononrdquo-features

bull Some novel ldquononrdquo-optimization problems

bull A gentle touch of mathematics

bull The monograph

Some General Thoughtsaddressing the negative impact of the field of machine learning

on the optimization field

and raising an alarm

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

A

A Are you satisfied with the comfort zone of convexity and differentiability

A

will not learn from this talk

Yes

A

will not learn from this talk

Live happily ever after

Yes

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683

Bibliography 691
Index 721



This talk: benefits? Yes or no?

Ready for the "non"-era? Yes or no?

A little humor before the storm

A readiness test:

A. Are you satisfied with the comfort zone of convexity and differentiability?
   Yes → you will not learn from this talk; live happily ever after.
   No → go to B.

B. How do you deal with non-convexity/non-smoothness?
   C1. Ignore these properties, or give them minimal or even wrong treatment.
   C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
   C3. Faithfully treat these non-properties, by either:
       D1. Employing heuristics, potentially leading to heuristics on heuristics, when complex models are being formulated; or
       D2. The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.

Features

Non-Smooth + Non-Convex + Non-Deterministic ⇒ Non-Problems

Non-Separable · Non-Stylistic · Non-Interbreeding · and many more non-topics · · ·

Examples of Non-Problems

I. Structured Learning

Piecewise Affine Regression: extending classical affine regression,

\[
\operatorname*{minimize}_{a,\alpha,b,\beta}\; \frac{1}{N} \sum_{s=1}^{N} \Bigl[\, y_s - \Bigl(\, \underbrace{\max_{1 \le i \le k_1} \bigl\{ (a^i)^\top x_s + \alpha_i \bigr\} - \max_{1 \le i \le k_2} \bigl\{ (b^i)^\top x_s + \beta_i \bigr\}}_{\text{capturing all piecewise affine functions}} \,\Bigr) \Bigr]^2
\]

illustrating coupled non-convexity and non-differentiability.

[Figure: Estimate Change Points/Hyperplanes in Linear Regression]
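A minimal numerical sketch of this difference-of-max-affine model, fitted here by plain subgradient descent on the nonsmooth least-squares objective (an illustrative assumption; the comments below point to the iterative convex-programming algorithm actually advocated):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k1, k2 = 200, 2, 3, 3
X = rng.normal(size=(N, d))
y = np.abs(X[:, 0]) - 0.5 * X[:, 1]          # a piecewise affine target

# Model: f(x) = max_i {a_i^T x + alpha_i} - max_i {b_i^T x + beta_i}
a, al = rng.normal(size=(k1, d)), np.zeros(k1)
b, be = rng.normal(size=(k2, d)), np.zeros(k2)

lr = 0.05
for _ in range(500):
    A = X @ a.T + al                          # (N, k1) affine pieces
    B = X @ b.T + be                          # (N, k2) affine pieces
    i1, i2 = A.argmax(1), B.argmax(1)         # active piece per sample
    r = (A.max(1) - B.max(1)) - y             # residuals
    # Subgradient: each residual flows only into its active affine pieces.
    for s in range(N):
        g = 2 * r[s] / N
        a[i1[s]] -= lr * g * X[s]; al[i1[s]] -= lr * g
        b[i2[s]] += lr * g * X[s]; be[i2[s]] += lr * g

r = (X @ a.T + al).max(1) - (X @ b.T + be).max(1) - y
print("final MSE:", np.mean(r ** 2))
```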

Comments

• This is a convex quadratic composed with a piecewise affine function, thus is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming based algorithm.

• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

\[
\varphi_1\Bigl(\, \max_{1 \le i \le k_{11}} \psi^1_{1i}(x) - \max_{1 \le i \le k_{12}} \psi^1_{2i}(x) \,\Bigr) \;-\; \varphi_2\Bigl(\, \max_{1 \le i \le k_{21}} \psi^2_{1i}(x) - \max_{1 \le i \le k_{22}} \psi^2_{2i}(x) \,\Bigr)
\]

where all functions are convex.


Logical Sparsity Learning

\[
\operatorname*{minimize}_{\beta \in \mathbb{R}^d}\; f(\beta) \quad \text{subject to} \quad \sum_{j=1}^{d} a_{ij}\, |\beta_j|_0 \,\ge\, b_i, \quad i = 1, \cdots, m,
\quad \text{where} \quad |t|_0 \,\triangleq\, \begin{cases} 1 & \text{if } t \neq 0 \\ 0 & \text{otherwise} \end{cases}
\]

• strong/weak hierarchy, e.g. |β_j|₀ ≤ |β_k|₀

• cost-based variable selection: coefficients ≥ 0 but not all equal

• hierarchical/group variable selection

Comments

• The ℓ₀ function is discrete in nature; an ASC can be formulated either as complementarity constraints, or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.

• Soft formulation of ASCs is possible, leading to an objective of the kind

\[
\operatorname*{minimize}_{\beta \in \mathbb{R}^d}\; f(\beta) + \gamma \sum_{i=1}^{m} \Bigl[\, b_i - \sum_{j=1}^{d} a_{ij}\, |\beta_j|_0 \,\Bigr]_+ \quad \text{for a given } \gamma > 0.
\]

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semi-continuous objective.

• When |•|₀ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints. (A capped-ℓ₁ instance is sketched below.)
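One common folded concave surrogate is the capped-ℓ₁ function min{|t|/ε, 1}, which admits an explicit difference-of-convex split; a minimal sketch (the value of ε and the test points are illustrative assumptions):

```python
import numpy as np

def capped_l1(t, eps=0.1):
    """Folded concave surrogate of |t|_0: min(|t| / eps, 1)."""
    return np.minimum(np.abs(t) / eps, 1.0)

def capped_l1_dc(t, eps=0.1):
    """The same surrogate as a difference of two convex functions:
       min(|t|/eps, 1) = |t|/eps - max(|t|/eps - 1, 0)."""
    g = np.abs(t) / eps                        # convex
    h = np.maximum(np.abs(t) / eps - 1.0, 0.0) # convex
    return g - h

t = np.array([-0.5, -0.05, 0.0, 0.02, 1.3])
assert np.allclose(capped_l1(t), capped_l1_dc(t))
print(capped_l1(t))   # -> [1.  0.5 0.  0.2 1. ]
```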


Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

\[
\operatorname*{minimize}_{\theta = (W^\ell, w^\ell)_{\ell=1}^{L}}\; \frac{1}{S} \sum_{s=1}^{S} \bigl[\, y_s - f_\theta(x_s) \,\bigr]^2
\]

where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:

\[
f_\theta(x) \,=\, \underbrace{\bigl(\, W^L \bullet + w^L \,\bigr)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\bigl(\, 0,\, W^2 \bullet + w^2 \,\bigr)}_{\text{second layer}} \circ \underbrace{\max\bigl(\, 0,\, W^1 x + w^1 \,\bigr)}_{\text{input layer}}
\]

schematically: linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input).
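A minimal numpy rendering of this multi-composite least-squares objective with L = 2 (layer sizes and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
S, d0, d1, d2 = 50, 4, 8, 1                  # samples and layer widths
X = rng.normal(size=(S, d0))
y = rng.normal(size=S)

W1, w1 = rng.normal(size=(d1, d0)), np.zeros(d1)
W2, w2 = rng.normal(size=(d2, d1)), np.zeros(d2)

def f_theta(x):
    u1 = np.maximum(0.0, W1 @ x + w1)        # inner layer: max(0, W^1 x + w^1)
    return (W2 @ u1 + w2)[0]                 # output layer: linear

loss = np.mean([(y[s] - f_theta(X[s])) ** 2 for s in range(S)])
print("empirical squared loss:", loss)
```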

Comments

• Problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ₁ penalization:

\[
\operatorname*{minimize}_{z \in Z,\, u}\; \zeta_\rho(z, u) \,\triangleq\, \frac{1}{S} \sum_{s=1}^{S} \Bigl[\, \varphi_s(u^{s,L}) + \rho \sum_{\ell=1}^{L} \sum_{j=1}^{N} \underbrace{\bigl|\, u^{s,\ell}_j - \psi_{\ell j}(z^\ell, u^{s,\ell-1}) \,\bigr|}_{\text{penalizing the components}} \,\Bigr]
\]

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• Framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
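Continuing the sketch above: the penalized objective replaces the exact layer equations by ℓ₁ residuals. A minimal evaluation (the uniform ReLU layer map, variable shapes, and ρ are illustrative assumptions; in the exact formulation the output layer is linear rather than ReLU):

```python
import numpy as np

def zeta_rho(u, Ws, ws, phis, rho):
    """l1 exact-penalty objective: average data fit plus rho-weighted
    residuals of the layer equations u^{s,l} = psi_l(z^l, u^{s,l-1}),
    illustrated here with the map max(0, W u + w) at every layer."""
    total = 0.0
    for s in range(len(u)):
        total += phis[s](u[s][-1])                        # phi_s(u^{s,L})
        for l in range(1, len(u[s])):
            pred = np.maximum(0.0, Ws[l - 1] @ u[s][l - 1] + ws[l - 1])
            total += rho * np.abs(u[s][l] - pred).sum()   # l1 layer residual
    return total / len(u)

# Tiny usage: one sample, two layers, scalar target 0.3
Ws = [np.ones((2, 2)), np.ones((1, 2))]
ws = [np.zeros(2), np.zeros(1)]
u  = [[np.array([0.1, -0.2]), np.array([0.0, 0.0]), np.array([0.5])]]
phis = [lambda uL: float((uL[0] - 0.3) ** 2)]
print(zeta_rho(u, Ws, ws, phis, rho=10.0))   # -> 5.04
```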


Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity

x: production from traditional thermal plants

y: production from backup fast-ramping generators

\[
\operatorname*{minimize}_{x \in X}\; \varphi(x) + \mathbb{E}_{\omega_1, \omega_2}\bigl[\, \psi(x, \omega_1, \omega_2) \,\bigr]
\]

where the recourse function is

\[
\psi(x, \omega_1, \omega_2) \,\triangleq\, \operatorname*{minimum}_{y}\; y^\top c\bigl(\, [\, \omega_2 - x - \omega_1 \,]_+ \,\bigr) \quad \text{subject to} \quad A(\omega)x + Dy \ge \xi(\omega).
\]

Unit cost of fast-ramping generators depends on:

▶ observed renewable production & demand of electricity

▶ shortfall due to the first-stage thermal power decisions

(A sample-average sketch of this two-stage structure follows.)
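A minimal sample-average sketch of evaluating such a two-stage objective: for each scenario the recourse is a linear program in y. All data below (dimensions, the unit-cost map c(·), and the constraint structure) are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n = 3                                        # generators / demand zones

def recourse(x, w1, w2):
    """psi(x, w1, w2): min_y c(shortfall)^T y  s.t.  x + w1 + y >= w2, y >= 0."""
    shortfall = np.maximum(w2 - x - w1, 0.0)
    cost = 1.0 + 2.0 * shortfall             # illustrative unit-cost map c(.)
    # linprog solves min c^T y s.t. A_ub y <= b_ub; here -y <= x + w1 - w2.
    res = linprog(cost, A_ub=-np.eye(n), b_ub=x + w1 - w2,
                  bounds=[(0, None)] * n)
    return res.fun

x = np.full(n, 1.0)                          # first-stage thermal production
samples = [(rng.uniform(0, 1, n), rng.uniform(0.5, 2.5, n)) for _ in range(200)]
saa = np.mean([recourse(x, w1, w2) for w1, w2 in samples])
print("SAA estimate of E[psi(x, w)]:", saa)
```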

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

\[
\operatorname*{minimize}_{x}\; \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_{\xi}\bigl[\, \psi(x, \xi) \,\bigr] \quad \text{subject to} \quad x \in X \subseteq \mathbb{R}^{n_1}
\]

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

\[
\psi(x, \xi) \,\triangleq\, \operatorname*{minimum}_{y}\; \bigl[\, f(\xi) + G(\xi)x \,\bigr]^\top y + \tfrac{1}{2}\, y^\top Q y \quad \text{subject to} \quad y \in Y(x, \xi) \,\triangleq\, \bigl\{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \ge b(\xi) \,\bigr\}.
\]

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a combined regularization, convexification, and sampling method for computing a generalized critical point, with almost-sure convergence guarantee. (A one-scenario recourse evaluation is sketched below.)
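For a single scenario ξ, evaluating ψ(x, ξ) is a convex QP; a minimal sketch using the cvxpy modeling package (an illustrative tool choice; all problem data below are synthetic assumptions, with feasibility built in by construction):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n1, n2, m = 2, 3, 4
x = np.array([0.5, -1.0])                      # first-stage point
f, G = rng.normal(size=n2), rng.normal(size=(n2, n1))
A, D = rng.normal(size=(m, n1)), rng.normal(size=(m, n2))
b = A @ x + D @ rng.normal(size=n2) - 1.0      # keeps Y(x, xi) nonempty
M = rng.normal(size=(n2, n2))
Q = M.T @ M                                    # symmetric PSD (a.s. definite here)

y = cp.Variable(n2)
prob = cp.Problem(cp.Minimize((f + G @ x) @ y + 0.5 * cp.quad_form(y, Q)),
                  [A @ x + D @ y >= b])
prob.solve()
print("psi(x, xi) =", prob.value)
```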


(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize c^⊤x subject to Ax = b and x ≥ 0.

Defender's problem: anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving the problem

\[
\operatorname*{minimize}_{x \ge 0}\; (c + \delta_c)^\top x + \rho\, \underbrace{\bigl\|\, [\, (A + \delta_A)x - b - \delta_b \,]_+ \bigr\|_\infty}_{\text{constraint residual}}
\]

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving the problem

\[
\operatorname*{maximum}_{\delta \in \Delta}\; \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2 \underbrace{\bigl\|\, [\, (A + \delta_A)x - b - \delta_b \,]_+ \bigr\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}}
\]

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.
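The two objectives share the penalized constraint residual; a minimal evaluation helper (data shapes and values are illustrative assumptions):

```python
import numpy as np

def residual_norm(A, dA, x, b, db):
    """|| [ (A + dA) x - b - db ]_+ ||_inf : worst violated constraint."""
    viol = np.maximum((A + dA) @ x - b - db, 0.0)
    return viol.max()

def defender_obj(c, dc, A, dA, b, db, x, rho):
    return (c + dc) @ x + rho * residual_norm(A, dA, x, b, db)

def attacker_obj(c, dc, A, dA, b, db, x, lam1, lam2, cost):
    return (lam1 * (c + dc) @ x
            + lam2 * residual_norm(A, dA, x, b, db) - cost)

# Tiny usage with one constraint and no perturbation of A, b:
A, b, x = np.array([[1.0, 2.0]]), np.array([1.0]), np.array([1.0, 0.5])
print(defender_obj(np.ones(2), np.zeros(2), A, np.zeros((1, 2)),
                   b, np.zeros(1), x, rho=10.0))   # -> 11.5
```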

Robustification of nonconvex problems

A general nonlinear program:

\[
\operatorname*{minimize}_{x \in S}\; \theta(x) + f(x) \quad \text{subject to} \quad H(x) = 0,
\]

where

• S is a closed convex set in ℝⁿ, not subject to alterations;

• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;

• H : O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

\[
\operatorname*{minimize}_{x \in S}\; \theta(x) + \operatorname*{maximum}_{\delta \in \Delta}\, f(x, \delta) \quad \text{subject to} \quad \underbrace{H(x, \delta) = 0 \;\; \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}
\]


New formulation via a pessimistic value-function minimization: given γ > 0,

\[
\operatorname*{minimize}_{x \in S}\; \theta(x) + v_\gamma(x)
\]

where v_γ(x) models constraint violation with the cost of model attacks:

\[
v_\gamma(x) \,\triangleq\, \operatorname*{maximum}_{\delta \in \Delta}\; \varphi_\gamma(x, \delta) \,\triangleq\, f(x, \delta) + \gamma\, \underbrace{\|\, H(x, \delta) \,\|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}}
\]

with the novel features:

• combined perturbation δ in a joint uncertainty set ∆ ⊆ ℝ^L

• robustified constraint satisfaction via penalization

Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

\[
\operatorname*{minimize}_{x \in S}\; \varphi(x) \,\triangleq\, \theta(x) + v(x) \quad \text{with} \quad v(x) \,\triangleq\, \max_{1 \le j \le J}\; \operatorname*{maximum}_{\delta \in \Delta \cap \mathbb{R}^L_+} \bigl[\, g_j(x, \delta) - h_j(x)^\top \delta \,\bigr].
\]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.


Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u
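Minimal empirical estimators of these quantities from a sample (standard discrete approximations; the sample, grid resolution, and levels are illustrative assumptions):

```python
import numpy as np

def var(z, a):
    """Empirical VaR_a: the a-quantile of the sample."""
    return np.quantile(z, a)

def cvar(z, a):
    """Empirical CVaR_a: average of the upper (1 - a) tail."""
    return z[z >= var(z, a)].mean()

def in_cvar(z, a, b):
    """Empirical In-CVaR_a^b: average of z between VaR_a and VaR_b."""
    lo, hi = var(z, a), var(z, b)
    return z[(z >= lo) & (z < hi)].mean()

def bpoe(z, tau, grid=np.linspace(0.0, 0.99, 100)):
    """Empirical bPOE(Z, tau) = 1 - min{a : CVaR_a(Z) >= tau}."""
    ok = [a for a in grid if cvar(z, a) >= tau]
    return 1.0 - min(ok) if ok else 0.0

z = np.random.default_rng(4).normal(size=10_000)
print(cvar(z, 0.9), in_cvar(z, 0.1, 0.9), bpoe(z, 1.5))
```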


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

\[
\mathbb{P}(u \ge Z > \tau) \,\le\, \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \,\triangleq\, \text{bPOI}_\tau^u(Z)
\]

Two risk minimization problems of a composite random functional:

\[
\operatorname*{minimize}_{\theta \in \Theta}\; \underbrace{\text{In-CVaR}_\alpha^\beta(Z(\theta))}_{\text{a difference of two convex SP value functions}} \qquad \text{and} \qquad \operatorname*{minimize}_{\theta \in \Theta}\; \underbrace{\text{bPOI}_\tau^u(Z(\theta))}_{\substack{\text{a stochastic program with} \\ \text{a difference-of-convex objective}}}
\]

where

\[
Z(\theta) \,\triangleq\, c\bigl(\, \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet, z) \,-\, h(\bullet, z) \text{ is a difference of convex}} \,\bigr)
\]

with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 9: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

A

A Are you satisfied with the comfort zone of convexity and differentiability

A

will not learn from this talk

Yes

A

will not learn from this talk

Live happily ever after

Yes

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework


Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization — for risk-based statistical estimation, controlling the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable $Z$ at levels $0 \le \alpha < \beta \le 1$, treating two-sided estimation errors:
\[
\operatorname{In\text{-}CVaR}_\alpha^\beta(Z) \,\triangleq\, \frac{1}{\beta - \alpha} \int_{\operatorname{VaR}_\beta(Z) \,>\, z \,\ge\, \operatorname{VaR}_\alpha(Z)} z \, dF_Z(z)
\,=\, \frac{1}{\beta - \alpha} \Big[\, (1-\alpha)\,\operatorname{CVaR}_\alpha(Z) - (1-\beta)\,\operatorname{CVaR}_\beta(Z) \,\Big].
\]

Buffered probability of exceedance / deceedance at level $\tau > 0$:
\[
\operatorname{bPOE}(Z;\tau) \,\triangleq\, 1 - \min\big\{ \alpha \in [0,1] : \operatorname{CVaR}_\alpha(Z) \ge \tau \big\},
\qquad
\operatorname{bPOD}(Z;u) \,\triangleq\, \min\big\{ \beta \in [0,1] : \operatorname{In\text{-}CVaR}_0^\beta(Z) \ge u \big\}.
\]

(An empirical sketch of these quantities follows.)
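A quick empirical check of these definitions on samples — a sketch assuming the standard superquantile convention $\operatorname{CVaR}_\alpha(Z) = \mathbb{E}[Z \mid Z \ge \operatorname{VaR}_\alpha(Z)]$, approximated here by simple tail averaging of order statistics; the grid scan for bPOE is for illustration only, not an efficient method.

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR (superquantile): mean of the upper (1 - alpha)-tail,
    using the empirical alpha-quantile as VaR."""
    return z[z >= np.quantile(z, alpha)].mean()

def in_cvar(z, alpha, beta):
    """Empirical In-CVaR via the identity on this slide."""
    return ((1 - alpha) * cvar(z, alpha)
            - (1 - beta) * cvar(z, beta)) / (beta - alpha)

def bpoe(z, tau, grid=np.linspace(0.0, 0.999, 1000)):
    """Brute-force bPOE from the definition:
    1 - min{alpha in [0,1] : CVaR_alpha(Z) >= tau}.
    CVaR_alpha is nondecreasing in alpha, so a grid scan finds the minimum."""
    for alpha in grid:
        if cvar(z, alpha) >= tau:
            return 1.0 - alpha
    return 0.0  # tau exceeds even the most extreme tail average

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
print(in_cvar(z, 0.10, 0.90))       # trimmed mean of the middle 80%: ~0 here
print(bpoe(z, tau=cvar(z, 0.90)))   # recovers ~0.10 by construction
```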


Buffered probability of the interval $(\tau, u]$. We have the following bound for an interval probability:
\[
\mathbb{P}(u \ge Z > \tau) \,\le\, \operatorname{bPOE}(Z;\tau) + \operatorname{bPOD}(Z;u) - 1 \,\triangleq\, \operatorname{bPOI}_\tau^u(Z).
\]

Two risk minimization problems of a composite random functional:
\[
\operatorname*{minimize}_{\theta \in \Theta}\;\; \underbrace{\operatorname{In\text{-}CVaR}_\alpha^\beta\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}}
\qquad \text{and} \qquad
\operatorname*{minimize}_{\theta \in \Theta}\;\; \underbrace{\operatorname{bPOI}_\tau^u\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with a} \\ \text{difference-of-convex objective}}},
\]
where
\[
Z(\theta) \,\triangleq\, c\big(\, \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet,\, z) \,-\, h(\bullet,\, z) \ \text{is a difference of convex functions}} \,\big),
\]
with $c : \mathbb{R} \to \mathbb{R}$ a univariate piecewise affine function, $Z$ an $m$-dimensional random vector, and, for each $z$, $g(\bullet, z)$ and $h(\bullet, z)$ both convex.
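Where does the "difference of two convex SP value functions" structure come from? A standard route, noted here for orientation, is the Rockafellar–Uryasev representation of CVaR:
\[
\operatorname{CVaR}_\alpha(Z) \,=\, \min_{\eta \in \mathbb{R}} \Big\{\, \eta + \frac{1}{1-\alpha}\, \mathbb{E}\big[\max\{Z - \eta,\, 0\}\big] \,\Big\}.
\]
Substituting this into the In-CVaR identity above writes $\operatorname{In\text{-}CVaR}_\alpha^\beta(Z(\theta))$ as the difference of two value functions, each defined by minimizing an expectation over a single scalar variable $\eta$ — exactly the structure described in the comments that follow.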

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over $\theta \in \Theta$
\[
\underbrace{\min_{\eta_1 \in \Upsilon_1} \mathbb{E}_{\tilde\omega}\big[\varphi_1(\theta, \eta_1, \tilde\omega)\big]}_{\text{denoted } v_1(\theta)}
\;-\;
\underbrace{\min_{\eta_2 \in \Upsilon_2} \mathbb{E}_{\tilde\omega}\big[\varphi_2(\theta, \eta_2, \tilde\omega)\big]}_{\text{denoted } v_2(\theta)}.
\]
Both $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.


A touch of mathematics — bridging computations with theory

First of all: what can be computed?

[Diagram: hierarchy of stationarity concepts, from easiest to hardest to compute — critical points (for dc functions), Clarke stationary points, limiting stationary points, and directional stationary points (for dd functions) — alongside the second-order conditions SONC and SOSC bracketing local minimizers.]

Relationship between the stationary points:

• Stationarity:
  – smooth case: $\nabla V(x) = 0$;
  – nonsmooth case: $0 \in \partial V(x)$ [a subdifferential; many calculus rules fail], e.g. $\partial V_1(x) + \partial V_2(x) \ne \partial (V_1 + V_2)(x)$ unless one of the two functions is continuously differentiable.

• Directional stationarity: $V'(x; d) \ge 0$ for all directions $d$.

• The sharper the concept, the harder it is to compute (a one-dimensional illustration follows).
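A one-dimensional illustration of why directional stationarity is the sharper notion (a standard example, consistent with the definitions above):
\[
V(x) = -|x|: \qquad \partial V(0) = [-1,1] \ni 0 \quad (\text{Clarke stationary at } 0), \qquad \text{yet } V'(0; d) = -|d| < 0 \ \ \forall\, d \ne 0,
\]
so the origin — in fact a local maximizer — passes the Clarke test but fails the directional test. By contrast, for $V(x) = |x|$ one has $V'(0; d) = |d| \ge 0$ for all $d$, so the origin is directionally stationary (and the global minimizer).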

Computational Tools

• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

• A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithms;
  – dealing with non-smoothness: nonsmooth Newton algorithms (see the sketch after this list);
  – dealing with pathological constraints: (exact) penalty methods.

• An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
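A minimal sketch of the nonsmooth Newton idea on a toy semismooth equation, using an element of the generalized derivative in place of the classical Jacobian; the min/Fischer–Burmeister complementarity reformulation below is an illustrative assumption, not an example taken from the talk.

```python
import numpy as np

# Toy complementarity problem: 0 <= x  _|_  x - 2 >= 0, with solution x* = 2.
# Fischer-Burmeister reformulation: phi(a, b) = sqrt(a^2 + b^2) - a - b,
# with a = x and b = x - 2; phi is semismooth and vanishes exactly at x*.
def F(x):
    a, b = x, x - 2.0
    return np.hypot(a, b) - a - b

def dF(x):
    """An element of the generalized derivative of F (it is the classical
    derivative wherever (a, b) != (0, 0))."""
    a, b = x, x - 2.0
    r = np.hypot(a, b)
    return (a + b) / r - 2.0 if r > 0 else -1.0  # any element works at the kink

x = 5.0
for _ in range(20):          # semismooth Newton iteration
    x -= F(x) / dF(x)
print(x, F(x))               # converges to 2.0
```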

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader (a few canonical members are listed after this list):

• piecewise smooth*, e.g. piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• sub-analytic and semi-analytic.

* and many more
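For orientation, a few standard one-dimensional members of these classes (textbook facts, added here rather than taken from the slides):
\[
\begin{aligned}
&|x| = \max\{x,-x\}: &&\text{piecewise affine, hence dc, semismooth, and B-differentiable;}\\
&\max\{0,x\}^2: &&\text{piecewise linear-quadratic and } C^1 \text{ with Lipschitz gradient, yet not twice differentiable at } 0;\\
&x^2 \sin(1/x), \ \text{extended by } 0: &&\text{locally Lipschitz and B-differentiable, but not semismooth at } 0.
\end{aligned}
\]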

[Figure (monograph, Section 4.4, p. 231): inclusion diagram relating the classes of nonsmooth functions — piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), piecewise affine (PA), PC¹ and PC², difference-of-convex (dc) and dc-composite, diff-max-convex, implicit convex-concave (icc), (locally) strongly weakly convex, locally dc, lower-C², semismooth, B-differentiable, and locally Lipschitz + directionally differentiable, with qualifiers on open or closed convex domains.]

Completed Monograph

"Modern Nonconvex Nondifferentiable Optimization"

Ying Cui and Jong-Shi Pang

submitted to the publisher (Friday, September 11, 2020)

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka–Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683

Bibliography 691
Index 721

Page 10: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

A

A Are you satisfied with the comfort zone of convexity and differentiability

A

will not learn from this talk

Yes

A

will not learn from this talk

Live happily ever after

Yes

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 11: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years

to the extent that

bull modellers are routinely sacrificing realism for convenience

bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach

bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo

resulting in

bull tools being abused

bull advances in science being stalled and

bull future being built on soft ground

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

A

A Are you satisfied with the comfort zone of convexity and differentiability

A

will not learn from this talk

Yes

A

will not learn from this talk

Live happily ever after

Yes

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee


(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize cᵀx subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

\[
\min_{x \ge 0}\; (c + \delta_c)^\top x \;+\; \rho\, \underbrace{\big\| \big[\, (A + \delta_A)\,x - b - \delta_b \,\big]_+ \big\|_\infty}_{\text{constraint residual}}
\]

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

\[
\max_{\delta \in \Delta}\; \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} \;+\; \lambda_2\, \underbrace{\big\| \big[\, (A + \delta_A)\,x - b - \delta_b \,\big]_+ \big\|_\infty}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}
\]

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.

Robustification of nonconvex problems

A general nonlinear program

\[
\min_{x \in S}\; \theta(x) + f(x) \quad \text{subject to} \quad H(x) = 0
\]

where

• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H: O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

\[
\min_{x \in S}\; \theta(x) + \max_{\delta \in \Delta}\, f(x,\delta)
\quad \text{subject to} \quad \underbrace{H(x,\delta) = 0 \;\; \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}
\]


New formulation

via a pessimistic value-function minimization: given γ > 0,

\[
\min_{x \in S}\; \theta(x) + v_\gamma(x)
\]

where v_γ(x) models constraint violation together with the cost of model attacks:

\[
v_\gamma(x) \,\triangleq\, \max_{\delta \in \Delta}\; \varphi_\gamma(x,\delta) \,\triangleq\, f(x,\delta) \;+\; \gamma\, \underbrace{\| H(x,\delta) \|_\infty}_{\text{constraint residual}} \;-\; \underbrace{C(\delta)}_{\text{attack cost}}
\]

with the novel features:

• combined perturbation δ in a joint uncertainty set ∆ ⊆ ℝᴸ;
• constraint satisfaction robustified via penalization.

(A toy evaluation of v_γ over a finite attack set is sketched below.)
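When ∆ is replaced by a finite set of sampled attacks, v_γ can be computed by enumeration. A minimal numpy sketch follows; the toy data f, H, C, and the attack set are illustrative assumptions, not from the talk.

```python
import numpy as np

def v_gamma(x, attacks, f, H, C, gamma):
    """Pessimistic value function over a finite attack set:
    max over delta of f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)."""
    return max(f(x, d) + gamma * np.linalg.norm(H(x, d), ord=np.inf) - C(d)
               for d in attacks)

# Toy instance: perturbation of the equality system H(x) = Ax - b.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
f = lambda x, d: (1.0 + d[0]) * x.sum()          # perturbed cost
H = lambda x, d: (A + d[1] * np.eye(2)) @ x - b  # perturbed residual
C = lambda d: 5.0 * np.abs(d).sum()              # attack cost
attacks = [np.zeros(2), np.array([0.1, 0.0]), np.array([0.0, 0.2])]
x = np.array([1.0, 0.0])
print(v_gamma(x, attacks, f, H, C, gamma=10.0))
```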

Comments

• Related the VFOP (value-function optimization problem) to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation (a generic dc-surrogation sketch follows below):

\[
\min_{x \in S}\; \phi(x) \,\triangleq\, \theta(x) + v(x), \quad \text{with} \quad
v(x) \,\triangleq\, \max_{1 \le j \le J}\; \max_{\delta \in \Delta \cap \mathbb{R}^L_+}\; \big[\, g_j(x,\delta) - h_j(x)^\top \delta \,\big]
\]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
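The MM machinery itself is simple to state. Below is a generic surrogation loop on a toy difference-of-convex program (a sketch of the majorization-minimization principle only, not the talk's unified min-max algorithm): the concave part −h is replaced by its linearization at the current iterate, which yields a convex upper surrogate that touches the objective there.

```python
import numpy as np

def mm_dc(grad_h, solve_surrogate, x0, iters=50, tol=1e-8):
    """Majorization-minimization for phi = g - h with g, h convex:
    at x_k, h is minorized by h(x_k) + s^T (x - x_k) with s in its
    subdifferential, so g(x) - s^T x (+ const) majorizes phi; minimize it."""
    x = x0
    for _ in range(iters):
        x_new = solve_surrogate(grad_h(x))
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

# Toy dc program: minimize 0.5*||x - a||^2 - mu*||x||_1.
a, mu = np.array([0.3, -2.0, 1.5]), 0.5
grad_h = lambda x: mu * np.sign(x)   # a subgradient of h = mu*||.||_1
solve_surrogate = lambda s: a + s    # argmin_x 0.5*||x - a||^2 - s^T x
x = mm_dc(grad_h, solve_surrogate, x0=np.zeros(3))
print(x)  # each nonzero coordinate ends mu beyond a, a stationary point
```

Each surrogate minimization decreases the objective, and accumulation points of the iterates are critical points of the dc program; this descent-plus-touching structure is exactly the surrogation principle invoked throughout the talk.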


Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

\[
\text{In-CVaR}_\alpha^\beta(Z) \,\triangleq\, \frac{1}{\beta - \alpha} \int_{\text{VaR}_\beta(Z) > z \ge \text{VaR}_\alpha(Z)} z \, dF_Z(z)
\;=\; \frac{1}{\beta - \alpha} \Big[\, (1 - \alpha)\,\text{CVaR}_\alpha(Z) - (1 - \beta)\,\text{CVaR}_\beta(Z) \,\Big]
\]

Buffered probability of exceedance/deceedance at level τ > 0:

\[
\text{bPOE}(Z, \tau) \,\triangleq\, 1 - \min\big\{\, \alpha \in [0,1] : \text{CVaR}_\alpha(Z) \ge \tau \,\big\}
\qquad
\text{bPOD}(Z, u) \,\triangleq\, \min\big\{\, \beta \in [0,1] : \text{In-CVaR}_0^\beta(Z) \ge u \,\big\}
\]


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

\[
\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \,\triangleq\, \text{bPOI}_\tau^u(Z)
\]

Two risk minimization problems of a composite random functional:

\[
\min_{\theta \in \Theta}\; \underbrace{\text{In-CVaR}_\alpha^\beta(Z(\theta))}_{\text{a difference of two convex SP value functions}}
\qquad \text{and} \qquad
\min_{\theta \in \Theta}\; \underbrace{\text{bPOI}_\tau^u(Z(\theta))}_{\substack{\text{a stochastic program with a} \\ \text{difference-of-convex objective}}}
\]

where

\[
Z(\theta) \,\triangleq\, c\big(\underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet,\, z)\, -\, h(\bullet,\, z) \text{ is a difference of convex functions}}\big)
\]

with c: ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions. (An empirical In-CVaR estimation is sketched below.)
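A small numpy sketch of the sample versions of these quantities, using the stated identity for In-CVaR; the sorted-sample estimator of CVaR below is a standard choice, not the talk's.

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the upper (1 - alpha) sample tail."""
    z = np.sort(z)
    k = int(np.floor(alpha * z.size))
    return z[k:].mean()

def in_cvar(z, alpha, beta):
    """In-CVaR via the identity
    ((1 - alpha)*CVaR_alpha - (1 - beta)*CVaR_beta) / (beta - alpha)."""
    return ((1 - alpha) * cvar(z, alpha)
            - (1 - beta) * cvar(z, beta)) / (beta - alpha)

rng = np.random.default_rng(2)
z = rng.standard_normal(100_000)
print(cvar(z, 0.95))           # ~2.06 for a standard normal
print(in_cvar(z, 0.05, 0.95))  # trims both tails before averaging
```

For a symmetric distribution and symmetric levels (α, β) = (0.05, 0.95), the In-CVaR estimate is close to 0, reflecting the two-sided trimming of estimation errors.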

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

\[
\underbrace{\min_{\eta_1 \in \Upsilon_1}\; \mathbb{E}\big[\varphi_1(\theta, \eta_1, \omega)\big]}_{\text{denoted } v_1(\theta)}
\;-\;
\underbrace{\min_{\eta_2 \in \Upsilon_2}\; \mathbb{E}\big[\varphi_2(\theta, \eta_2, \omega)\big]}_{\text{denoted } v_2(\theta)}
\]

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective. (A sample-based sketch of v₁ − v₂ follows below.)

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
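A sketch of this value-function structure using the classical Rockafellar-Uryasev representation CVaR_α(Z) = min_η { η + E[(Z − η)₊]/(1 − α) } (assuming scipy; the scaling below rewrites the In-CVaR identity as a difference v₁ − v₂ of two convex univariate value functions).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cvar_ru(z, alpha):
    """Rockafellar-Uryasev form: CVaR_alpha(Z) = min_eta eta + E[(Z-eta)_+]/(1-alpha),
    a convex univariate value function of the kind denoted v_i(theta)."""
    obj = lambda eta: eta + np.maximum(z - eta, 0.0).mean() / (1.0 - alpha)
    return minimize_scalar(obj, bounds=(z.min(), z.max()), method="bounded").fun

rng = np.random.default_rng(3)
z = rng.standard_normal(100_000)
v1 = 0.95 / 0.90 * cvar_ru(z, 0.05)  # (1 - alpha)/(beta - alpha) * CVaR_alpha
v2 = 0.05 / 0.90 * cvar_ru(z, 0.95)  # (1 - beta)/(beta - alpha) * CVaR_beta
print(v1 - v2)                       # the In-CVaR objective as v1 - v2
```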


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

Relationship between the stationary points:

[Diagram: local minimum (implied by the second-order sufficient condition SOSC, implying the second-order necessary condition SONC) ⇒ directional stationary (for dd functions) ⇒ limiting stationary ⇒ Clarke stationary ⇒ critical (for dc functions).]

▶ Stationarity:
– smooth case: ∇V(x) = 0;
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g. ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of them is continuously differentiable.

▶ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute. For instance, V(x) = −|x| has 0 ∈ ∂V(0) = [−1, 1] (the origin is Clarke stationary), yet V′(0; d) = −|d| < 0 for every d ≠ 0, so the origin is not directionally stationary.

Computational Tools

▶ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▶ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: majorization-minimization algorithm;
– dealing with non-smoothness: nonsmooth Newton algorithm (see the sketch after this list);
– dealing with pathological constraints: (exact) penalty method.

▶ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics;
– dealing with stochastic programming in operations research;
– dealing with risk minimization in financial management and tail control.
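As an illustration of the nonsmooth Newton ingredient, here is a minimal semismooth Newton sketch for a linear complementarity problem via the Fischer-Burmeister reformulation (a standard construction, not taken from the talk; the data M, q are illustrative).

```python
import numpy as np

def fb(a, b):
    """Fischer-Burmeister function: zero iff a >= 0, b >= 0, and a*b = 0."""
    return np.sqrt(a * a + b * b) - a - b

def semismooth_newton_lcp(M, q, x0, iters=50, tol=1e-10):
    """Semismooth Newton on Phi(x) = fb(x, Mx + q) = 0 for the LCP(q, M),
    using one element of the generalized Jacobian at each iterate."""
    x = x0.copy()
    for _ in range(iters):
        y = M @ x + q
        phi = fb(x, y)
        if np.linalg.norm(phi) < tol:
            break
        r = np.sqrt(x * x + y * y)
        r[r == 0.0] = 1.0             # pick a valid Jacobian element at kinks
        Da = np.diag(x / r - 1.0)     # partial derivative w.r.t. the x-slot
        Db = np.diag(y / r - 1.0)     # partial derivative w.r.t. the y-slot
        x = x - np.linalg.solve(Da + Db @ M, phi)
    return x

M = np.array([[2.0, 1.0], [1.0, 2.0]])   # positive definite
q = np.array([-1.0, 1.0])
x = semismooth_newton_lcp(M, q, x0=np.ones(2))
print(x, M @ x + q)   # x >= 0, Mx + q >= 0, componentwise complementary
```

Under standard regularity conditions such iterations converge locally superlinearly, which is the speed that fast Newton-type algorithms for semismooth equations deliver.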

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more

[Diagram from Section 4.4 of the monograph (p. 231): inclusion relationships among the classes of nonsmooth functions — PC², PC¹, semismooth, and B-differentiable; PQ, PLQ, and PA; dc, dc composite, difference-of-max-convex, and implicit convex-concave (icc); s-weakly convex (locally, via lower-C² and C¹/LC¹ conditions) and locally dc, on open or closed convex domains — under the conditions annotated in the book, including (4.6.5) and (4.6.6).]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to the publisher (Friday, September 11, 2020)

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̃_c 683

Bibliography 691
Index 721


Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 13: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

This talkBenefits yes or No

Ready for the ldquononrdquo-era yes or no

A little humor before the storm

A

A Are you satisfied with the comfort zone of convexity and differentiability

A

will not learn from this talk

Yes

A

will not learn from this talk

Live happily ever after

Yes

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

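A sketch of the lifted, ℓ₁-penalized objective ζ_ρ, in which each layer output u^{s,ℓ} becomes a free variable and its consistency with the layer map ψ_ℓ is penalized; the squared output loss for φ_s, the scalar output, and the scoping of the sample sum over the penalty term are assumptions made for illustration:

import numpy as np

def zeta_rho(z_layers, u, xs, ys, rho):
    # average output loss + rho * sum of |u^{s,l} - psi_l(z^l, u^{s,l-1})| residuals
    S = len(xs)
    loss, penalty = 0.0, 0.0
    for s in range(S):
        prev = xs[s]                          # u^{s,0} is the input x_s
        for l, (W, w) in enumerate(z_layers):
            affine = W @ prev + w
            psi = affine if l == len(z_layers) - 1 else np.maximum(0.0, affine)
            penalty += np.abs(u[s][l] - psi).sum()   # layer-consistency residual
            prev = u[s][l]                    # downstream layers see the lifted variable
        loss += (ys[s] - u[s][-1].item()) ** 2       # phi_s: squared loss, scalar output assumed
    return loss / S + rho * penalty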

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity

x: production from traditional thermal plants

y: production from backup fast-ramping generators

minimize_{x ∈ X}   ϕ(x) + E_{ω₁,ω₂}[ ψ(x, ω₁, ω₂) ],

where the recourse function is

ψ(x, ω₁, ω₂) ≜ minimum_y   y^⊤ c( [ω₂ − x − ω₁]_+ )
               subject to  A(ω)x + Dy ≥ ξ(ω).

The unit cost of the fast-ramping generators depends on:

▸ the observed renewable production and demand of electricity;

▸ the shortfall due to the first-stage thermal power decisions.

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

minimize_x   ζ(x) ≜ ϕ(x) + E_ξ[ ψ(x, ξ) ]   subject to x ∈ X ⊆ ℝ^{n₁},

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

ψ(x, ξ) ≜ minimum_y   [ f(ξ) + G(ξ)x ]^⊤ y + ½ y^⊤ Q y
          subject to  y ∈ Y(x, ξ) ≜ { y ∈ ℝ^{n₂} | A(ξ)x + Dy ≥ b(ξ) }

(a code sketch of ψ follows these comments).

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a combined regularization-convexification-sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.

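A sketch of evaluating the recourse value ψ(x, ξ) and the sample-average approximation of ζ(x); the scenario encoding is an assumption, and a general-purpose SLSQP call stands in for the dedicated convex-QP solver one would use in practice:

import numpy as np
from scipy.optimize import minimize

def recourse_value(x, f_xi, G_xi, A_xi, b_xi, D, Q):
    # psi(x, xi) = min_y [f(xi) + G(xi) x]^T y + 0.5 y^T Q y  s.t.  A(xi) x + D y >= b(xi)
    c = f_xi + G_xi @ x                       # linear term is bi-parameterized: linear in x
    res = minimize(lambda y: c @ y + 0.5 * y @ Q @ y,
                   np.zeros(Q.shape[0]),
                   jac=lambda y: c + Q @ y,
                   constraints=[{"type": "ineq",
                                 "fun": lambda y: A_xi @ x + D @ y - b_xi,
                                 "jac": lambda y: D}],
                   method="SLSQP")
    return res.fun

def saa_objective(x, phi, scenarios, D, Q):
    # zeta(x) ~ phi(x) + (1/N) sum_i psi(x, xi_i), scenarios encoded as (f, G, A, b) tuples
    return phi(x) + np.mean([recourse_value(x, f, G, A, b, D, Q)
                             for (f, G, A, b) in scenarios])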

(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize c^⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

minimize_{x ≥ 0}   (c + δc)^⊤ x + ρ ‖ [ (A + δA)x − b − δb ]_+ ‖_∞ ,

where the second term is the constraint residual.

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

maximum_{δ ∈ ∆}   λ₁ (c + δc)^⊤ x + λ₂ ‖ [ (A + δA)x − b − δb ]_+ ‖_∞ − C(δ),

where the first term disrupts the objective, the second disrupts the constraints, and C(δ) is the attack cost (both objectives are sketched in code below).

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.
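For fixed (x, δ), both objectives reduce to an ∞-norm of the positive part of the perturbed constraint residual; a minimal numpy sketch (the tuple encoding of δ and the cost callable C are assumptions):

import numpy as np

def violation(x, A, b, dA, db):
    # || [ (A + dA) x - b - db ]_+ ||_inf : largest violated constraint after the attack
    return np.linalg.norm(np.maximum((A + dA) @ x - b - db, 0.0), np.inf)

def defender_objective(x, c, A, b, delta, rho):
    # (c + dc)^T x + rho * constraint residual, anticipating the disruption delta
    dc, dA, db = delta
    return (c + dc) @ x + rho * violation(x, A, b, dA, db)

def attacker_payoff(delta, x, c, A, b, lam1, lam2, C):
    # lam1 * disrupted objective + lam2 * disrupted constraints - attack cost C(delta)
    dc, dA, db = delta
    return lam1 * (c + dc) @ x + lam2 * violation(x, A, b, dA, db) - C(delta)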

Robustification of nonconvex problems

A general nonlinear program:

minimize_{x ∈ S}   θ(x) + f(x)   subject to H(x) = 0,

where

• S is a closed convex set in ℝⁿ, not subject to alterations;

• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;

• H : O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

minimize_{x ∈ S}   θ(x) + maximum_{δ ∈ ∆} f(x, δ)   subject to H(x, δ) = 0 ∀ δ ∈ ∆,

whose constraints are easily infeasible in the current way of modeling.


New formulation

via a pessimistic value-function minimization: given γ > 0,

minimize_{x ∈ S}   θ(x) + v_γ(x),

where v_γ(x) models constraint violation together with the cost of model attacks:

v_γ(x) ≜ maximum over δ ∈ ∆ of   φ_γ(x, δ) ≜ f(x, δ) + γ ‖ H(x, δ) ‖_∞ − C(δ),

with the constraint residual ‖H(x, δ)‖_∞ and the attack cost C(δ), and with the novel features:

• a combined perturbation δ in a joint uncertainty set ∆ ⊆ ℝ^L;

• robustified constraint satisfaction via penalization

(a finite-sample sketch of v_γ follows).
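A minimal sketch of v_γ(x), with the inner maximization approximated by enumerating a finite sample of the joint uncertainty set ∆; the callables f, H, C and the finite-sample approximation itself are assumptions for illustration (the explicit forms discussed in the comments below remove the need for enumeration in special cases):

import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    # max over sampled delta of f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)
    return max(f(x, d) + gamma * np.linalg.norm(np.atleast_1d(H(x, d)), np.inf) - C(d)
               for d in deltas)

def robust_objective(x, theta, deltas, f, H, C, gamma):
    # theta(x) + v_gamma(x): the pessimistic value-function formulation
    return theta(x) + v_gamma(x, deltas, f, H, C, gamma)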

Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation (a generic sketch of the MM iteration follows these comments):

minimize_{x ∈ S}   ϕ(x) ≜ θ(x) + v(x)   with   v(x) ≜ max_{1 ≤ j ≤ J} maximum_{δ ∈ ∆ ∩ ℝ^L_+} [ g_j(x, δ) − h_j(x)^⊤ δ ].

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.

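A generic sketch of the MM iteration: at the current iterate x^k, a convex upper surrogate of v, tight at x^k, is minimized together with θ; the unconstrained inner solve (ignoring the set S) and the stopping rule are simplifying assumptions:

import numpy as np
from scipy.optimize import minimize

def mm_solve(theta, surrogate_v, x0, max_iter=100, tol=1e-8):
    # surrogate_v(y, xk) must majorize v: surrogate_v(y, xk) >= v(y), with equality at y = xk,
    # which forces the true objective theta + v to be nonincreasing along the iterates
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        xk = x.copy()
        x = minimize(lambda y: theta(y) + surrogate_v(y, xk), xk).x
        if np.linalg.norm(x - xk) <= tol * (1.0 + np.linalg.norm(xk)):
            break
    return x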

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z, at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

In-CVaR_α^β(Z) ≜ (1/(β − α)) ∫_{VaR_β(Z) > z ≥ VaR_α(Z)} z dF_Z(z)
              = (1/(β − α)) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ].

Buffered probability of exceedance/deceedance at level τ > 0:

bPOE(Z, τ) ≜ 1 − min{ α ∈ [0, 1] : CVaR_α(Z) ≥ τ }

bPOD(Z, u) ≜ min{ β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }


Buffered probability of interval (τ, u]

We have the following bound for an interval probability:

P( u ≥ Z > τ ) ≤ bPOE(Z, τ) + bPOD(Z, u) − 1 ≜ bPOI_τ^u(Z).

Two risk minimization problems of a composite random functional:

minimize_{θ ∈ Θ}   In-CVaR_α^β( Z(θ) )   (a difference of two convex SP value functions)

and

minimize_{θ ∈ Θ}   bPOI_τ^u( Z(θ) )   (a stochastic program with a difference-of-convex objective),

where

Z(θ) ≜ c( g(θ, Z) − h(θ, Z) ),   with g(•, z) − h(•, z) a difference of convex functions,

c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions (sample-based sketches of CVaR, In-CVaR, and bPOE follow).
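These tail functionals are straightforward to estimate from samples via the Rockafellar-Uryasev representation of CVaR; a sketch (the grid scan for bPOE is a naive stand-in for its convex reformulation, and the toy data are assumptions):

import numpy as np

def cvar(z, alpha):
    # CVaR_alpha(Z) = min_eta eta + E[Z - eta]_+ / (1 - alpha), attained at eta = VaR_alpha(Z)
    eta = np.quantile(z, alpha)
    return eta + np.maximum(z - eta, 0.0).mean() / (1.0 - alpha)

def in_cvar(z, alpha, beta):
    # In-CVaR_alpha^beta(Z) = [ (1-alpha) CVaR_alpha(Z) - (1-beta) CVaR_beta(Z) ] / (beta - alpha)
    upper = 0.0 if beta >= 1.0 else (1.0 - beta) * cvar(z, beta)
    return ((1.0 - alpha) * cvar(z, alpha) - upper) / (beta - alpha)

def bpoe(z, tau, grid=1001):
    # bPOE(Z, tau) = 1 - min{ alpha in [0,1] : CVaR_alpha(Z) >= tau };
    # CVaR_alpha is nondecreasing in alpha, so a scan finds the smallest feasible alpha
    for a in np.linspace(0.0, 1.0 - 1e-6, grid):
        if cvar(z, a) >= tau:
            return 1.0 - a
    return 0.0

z = np.random.default_rng(0).standard_normal(100_000)   # toy samples (assumed)
print(cvar(z, 0.95), in_cvar(z, 0.05, 0.95), bpoe(z, 1.0))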

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

minimum_{η₁ ∈ Υ₁} E[ ϕ₁(θ, η₁, ω) ]  −  minimum_{η₂ ∈ Υ₂} E[ ϕ₂(θ, η₂, ω) ],

denoted v₁(θ) and v₂(θ), respectively. Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: relationship between the stationary points, relating critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions), SONC, local minima, and SOSC.]

▸ Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g.
    ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of them is continuously differentiable.

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d (see the numerical check below).

• The sharper the concept, the harder it is to compute.
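A tiny numerical illustration of checking directional stationarity by one-sided finite differences; the toy function V(x) = ‖x‖₁ at x = 0, the tolerance, and the sampled directions are all assumptions:

import numpy as np

def dir_deriv(V, x, d, t=1e-7):
    # one-sided directional derivative V'(x; d) ~ (V(x + t d) - V(x)) / t
    return (V(x + t * d) - V(x)) / t

V = lambda x: np.abs(x).sum()       # nonsmooth but directionally differentiable
x = np.zeros(3)
dirs = np.random.default_rng(1).standard_normal((500, 3))
# x = 0 is directionally stationary: V'(0; d) = ||d||_1 >= 0 for every direction d
print(all(dir_deriv(V, x, d) >= -1e-6 for d in dirs))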

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithm;
  – dealing with non-smoothness: nonsmooth Newton algorithm;
  – dealing with pathological constraints: (exact) penalty method.

▸ An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• piecewise smooth*, e.g. piecewise linear-quadratic;

• difference-of-convex: facilitates algorithmic design via convex majorization;

• semismooth: enables fast algorithms of the Newton type;

• Bouligand differentiable: locally Lipschitz + directionally differentiable;

• sub-analytic and semi-analytic.

* and many more

[Figure from Section 4.4 of the monograph ("Classes of Nonsmooth Functions", p. 231): an inclusion diagram relating piecewise affine (PA), piecewise quadratic (PQ), and piecewise linear-quadratic (PLQ) functions; PC² and PC¹ functions; dc, dc-composite, and diff-max-convex functions; implicit convex-concave (icc) functions; semismooth and B-differentiable functions; locally Lipschitz + directionally differentiable functions; weakly convex, locally dc, and lower-C² functions.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday, September 11, 2020)

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.1.9) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̃_c 683

Bibliography 691
Index 721

Page 14: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

A

A Are you satisfied with the comfort zone of convexity and differentiability

A

will not learn from this talk

Yes

A

will not learn from this talk

Live happily ever after

Yes

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 15: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

A

will not learn from this talk

Yes

A

will not learn from this talk

Live happily ever after

Yes

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints


Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:
\[
\operatorname*{minimize}_{\theta = (W^\ell, w^\ell)_{\ell=1}^{L}} \; \frac{1}{S} \sum_{s=1}^{S} \big[\, y_s - f_\theta(x_s) \,\big]^2,
\]
where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:
\[
f_\theta(x) \;=\; \underbrace{\big( W^L \bullet + w^L \big)}_{\text{output layer}} \;\circ\; \cdots \;\circ\; \underbrace{\max\big( 0, W^2 \bullet + w^2 \big)}_{\text{second layer}} \;\circ\; \underbrace{\max\big( 0, W^1 x + w^1 \big)}_{\text{input layer}},
\]
i.e., linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input).
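A minimal sketch of f_θ and the empirical squared loss in NumPy (shapes and data are illustrative):

```python
import numpy as np

def f_theta(x, Ws, ws):
    """L-fold composite with ReLU: max(0, W x + w) through the hidden
    layers, affine (no max) at the output layer."""
    for W, w in zip(Ws[:-1], ws[:-1]):
        x = np.maximum(0.0, W @ x + w)
    return Ws[-1] @ x + ws[-1]

def empirical_loss(Ws, ws, X, y):
    """(1/S) * sum_s (y_s - f_theta(x_s))^2, for scalar outputs."""
    preds = np.array([f_theta(x, Ws, ws).item() for x in X])
    return np.mean((y - preds) ** 2)

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]   # L = 2 layers
ws = [rng.standard_normal(4), rng.standard_normal(1)]
X, y = rng.standard_normal((10, 3)), rng.standard_normal(10)
print(empirical_loss(Ws, ws, X, y))
```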

Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ1 penalization: with auxiliary variables u^{s,ℓ} for the layer outputs,
\[
\operatorname*{minimize}_{z \in Z,\, u} \; \zeta_\rho(z, u) \,\triangleq\, \frac{1}{S} \sum_{s=1}^{S} \Big\{ \varphi_s(u^{s,L}) + \rho \sum_{\ell=1}^{L} \sum_{j=1}^{N} \underbrace{\big|\, u^{s,\ell}_j - \psi_{\ell j}(z^\ell, u^{s,\ell-1}) \,\big|}_{\text{penalizing the components}} \Big\}.
\]

• Rigorous algorithms, with supporting convergence theory, for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds for nondifferentiable optimization problems.
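A sketch of the penalized objective ζ_ρ above, assuming (illustratively) that ψ_ℓj is the ReLU map of layer ℓ, that the output map φ_s is the squared loss, and that the penalty is averaged over the samples together with the losses:

```python
import numpy as np

def zeta_rho(Ws, ws, U, X, y, rho):
    """Sketch of the l1-penalized objective: the layer-consistency
    constraints u^{s,l} = max(0, W^l u^{s,l-1} + w^l) are moved into the
    objective as absolute-value penalties, so (z, u) become free variables
    of a nonsmooth but 'unconstrained' problem. U[s][l] holds the auxiliary
    hidden-layer output u^{s,l+1} of sample s."""
    S, L = len(X), len(Ws)
    total = 0.0
    for s in range(S):
        u_prev, pen = X[s], 0.0
        for l in range(L - 1):                        # hidden layers
            pred = np.maximum(0.0, Ws[l] @ u_prev + ws[l])
            pen += np.abs(U[s][l] - pred).sum()       # componentwise l1 penalty
            u_prev = U[s][l]
        out = Ws[-1] @ u_prev + ws[-1]                # affine output layer
        total += (y[s] - out.item()) ** 2 + rho * pen
    return total / S
```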


Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

\[
\operatorname*{minimize}_{x \in X} \; \varphi(x) + \mathbb{E}_{\omega_1, \omega_2} \big[\, \psi(x, \omega_1, \omega_2) \,\big],
\]
where the recourse function is
\[
\psi(x, \omega_1, \omega_2) \,\triangleq\, \operatorname*{minimum}_{y} \; y^\top c\big( [\, \omega_2 - x - \omega_1 \,]_+ \big) \quad \text{subject to} \quad A(\omega)x + Dy \ge \xi(\omega).
\]

The unit cost of the fast-ramping generators depends on:

– the observed renewable production & demand of electricity;

– the shortfall due to the first-stage thermal power decisions.

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
\[
\operatorname*{minimize}_{x} \; \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_{\xi} \big[ \psi(x, \xi) \big] \quad \text{subject to} \quad x \in X \subseteq \mathbb{R}^{n_1},
\]
where the recourse function ψ(x, ξ) is the optimal objective value of a quadratic program (QP), with Q symmetric positive semidefinite:
\[
\psi(x, \xi) \,\triangleq\, \operatorname*{minimum}_{y} \; \big[\, f(\xi) + G(\xi)x \,\big]^\top y + \tfrac{1}{2}\, y^\top Q y \quad \text{subject to} \quad y \in Y(x, \xi) \,\triangleq\, \{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \ge b(\xi) \,\}.
\]

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a regularization + convexification + sampling combined method for computing a generalized critical point, with an almost-sure convergence guarantee.
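A sketch of a sample-average evaluation of ζ(x), simplified to linear recourse (Q = 0) so that each scenario subproblem is an LP; relatively complete recourse is assumed so every subproblem is feasible (all names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def recourse_value(x, q, A_xi, D, b_xi):
    """psi(x, xi) with linear recourse (Q = 0 for simplicity):
        min_y q'y  s.t.  A(xi) x + D y >= b(xi),  y >= 0.
    linprog uses <= constraints, so the inequality is sign-flipped."""
    res = linprog(q, A_ub=-D, b_ub=A_xi @ x - b_xi, bounds=(0, None))
    return res.fun  # assumes feasibility (relatively complete recourse)

def saa_objective(x, phi, scenarios):
    """Sample-average approximation of phi(x) + E[psi(x, xi)] over a
    finite list of scenario data (q, A(xi), D, b(xi))."""
    vals = [recourse_value(x, q, A_xi, D, b_xi) for (q, A_xi, D, b_xi) in scenarios]
    return phi(x) + np.mean(vals)
```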


(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving
\[
\operatorname*{minimize}_{x \ge 0} \; (c + \delta_c)^\top x + \rho \, \underbrace{\big\| \, [\, (A + \delta_A)x - b - \delta_b \,]_+ \big\|_\infty}_{\text{constraint residual}}.
\]

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving
\[
\operatorname*{maximum}_{\delta \in \Delta} \; \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2 \underbrace{\big\| \, [\, (A + \delta_A)x - b - \delta_b \,]_+ \big\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}}.
\]

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.
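A small sketch evaluating the defender's objective for a given attack δ = (δ_c, δ_A, δ_b) (names are illustrative):

```python
import numpy as np

def defender_objective(x, c, A, b, delta, rho):
    """(c + dc)'x + rho * || [ (A + dA) x - b - db ]_+ ||_inf :
    operating cost plus a sup-norm penalty on the residual of the
    attacked constraints, with delta = (dc, dA, db)."""
    dc, dA, db = delta
    residual = np.maximum((A + dA) @ x - b - db, 0.0)
    return (c + dc) @ x + rho * np.linalg.norm(residual, ord=np.inf)
```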

Robustification of nonconvex problems

A general nonlinear program:
\[
\operatorname*{minimize}_{x \in S} \; \theta(x) + f(x) \quad \text{subject to} \quad H(x) = 0,
\]
where

• S is a closed convex set in Rⁿ, not subject to alterations;

• θ and f are real-valued functions defined on an open convex set O in Rⁿ containing S;

• H: O → Rᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:
\[
\operatorname*{minimize}_{x \in S} \; \theta(x) + \operatorname*{maximum}_{\delta \in \Delta} f(x, \delta) \quad \text{subject to} \quad \underbrace{H(x, \delta) = 0 \;\; \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}.
\]


New formulation

via a pessimistic value-function minimization: given γ > 0,
\[
\operatorname*{minimize}_{x \in S} \; \theta(x) + v_\gamma(x),
\]
where v_γ(x) models constraint violation together with the cost of model attacks:
\[
v_\gamma(x) \,\triangleq\, \operatorname*{maximum}_{\delta \in \Delta} \; \varphi_\gamma(x, \delta) \,\triangleq\, f(x, \delta) + \gamma \, \underbrace{\| H(x, \delta) \|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}},
\]
with the novel features:

• combined perturbation δ in a joint uncertainty set Δ ⊆ R^L;

• constraint satisfaction robustified via penalization.
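An enumeration sketch of v_γ over a finite uncertainty set (a stand-in for the simplex-type Δ of the analysis, on which the maximum is attained at finitely many points; f, H, C are assumed callables):

```python
import numpy as np

def v_gamma(x, f, H, C, Delta, gamma):
    """Pessimistic value function over a finite uncertainty set:
        v_gamma(x) = max_{delta in Delta} f(x, delta)
                     + gamma * ||H(x, delta)||_inf - C(delta)."""
    return max(
        f(x, d) + gamma * np.linalg.norm(H(x, d), ord=np.inf) - C(d)
        for d in Delta
    )
```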

Comments

• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:
\[
\operatorname*{minimize}_{x \in S} \; \varphi(x) \,\triangleq\, \theta(x) + v(x) \quad \text{with} \quad v(x) \,\triangleq\, \max_{1 \le j \le J} \; \operatorname*{maximum}_{\delta \in \Delta \cap \mathbb{R}^L_+} \big[\, g_j(x, \delta) - h_j(x)^\top \delta \,\big].
\]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.


Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:
\[
\text{In-CVaR}_\alpha^\beta(Z) \,\triangleq\, \frac{1}{\beta - \alpha} \int_{\text{VaR}_\beta(Z) > z \,\ge\, \text{VaR}_\alpha(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta - \alpha} \Big[\, (1 - \alpha)\, \text{CVaR}_\alpha(Z) - (1 - \beta)\, \text{CVaR}_\beta(Z) \,\Big].
\]

Buffered probability of exceedance/deceedance at level τ > 0:
\[
\text{bPOE}(Z, \tau) \,\triangleq\, 1 - \min\big\{\, \alpha \in [0, 1] : \text{CVaR}_\alpha(Z) \ge \tau \,\big\}, \qquad \text{bPOD}(Z, u) \,\triangleq\, \min\big\{\, \beta \in [0, 1] : \text{In-CVaR}_0^\beta(Z) \ge u \,\big\}.
\]
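Minimal empirical estimators for CVaR and In-CVaR from samples, using the identity displayed above (a simple sorting estimator that ignores the fractional tail atom):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the worst (1 - alpha) tail."""
    z = np.sort(np.asarray(z, dtype=float))[::-1]        # descending
    k = max(1, int(np.ceil((1.0 - alpha) * z.size)))
    return z[:k].mean()

def in_cvar(z, alpha, beta):
    """Interval CVaR via the identity on the slide:
       In-CVaR_alpha^beta(Z) = [ (1-alpha) CVaR_alpha(Z)
                                 - (1-beta) CVaR_beta(Z) ] / (beta - alpha)."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

rng = np.random.default_rng(1)
z = rng.standard_normal(100_000)
print(cvar(z, 0.95))          # ~2.06 for a standard normal
print(in_cvar(z, 0.05, 0.95)) # ~0: a trimmed mean of the middle of the distribution
```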


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:
\[
\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \,\triangleq\, \text{bPOI}_\tau^u(Z).
\]

Two risk minimization problems of a composite random functional:
\[
\operatorname*{minimize}_{\theta \in \Theta} \; \underbrace{\text{In-CVaR}_\alpha^\beta(Z(\theta))}_{\text{a difference of two convex SP value functions}} \qquad \text{and} \qquad \operatorname*{minimize}_{\theta \in \Theta} \; \underbrace{\text{bPOI}_\tau^u(Z(\theta))}_{\substack{\text{a stochastic program with} \\ \text{a difference-of-convex objective}}},
\]
where
\[
Z(\theta) \,\triangleq\, c\big( \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet, z) - h(\bullet, z) \text{ is a difference of convex functions}} \big),
\]
with c: R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ
\[
\underbrace{\operatorname*{minimum}_{\eta_1 \in \Upsilon_1} \; \mathbb{E}\big[ \varphi_1(\theta, \eta_1, \omega) \big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\operatorname*{minimum}_{\eta_2 \in \Upsilon_2} \; \mathbb{E}\big[ \varphi_2(\theta, \eta_2, \omega) \big]}_{\text{denoted } v_2(\theta)}.
\]
Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: the computable stationarity concepts, from loosest to sharpest — critical (for dc functions), Clarke stationary, limiting stationary, directional stationary (for dd functions) — together with SONC, local minimizers, and SOSC.]

Relationship between the stationary points:

– Stationarity:
  smooth case: ∇V(x) = 0;
  nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of them is continuously differentiable.

– Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder to compute.
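A Monte-Carlo illustration of the last two points: testing V′(x; d) ≥ 0 by one-sided difference quotients over random directions (a numerical heuristic, not a certificate of d-stationarity):

```python
import numpy as np

def looks_dstationary(V, x, n_dirs=2000, t=1e-7, tol=1e-9, seed=0):
    """Sample unit directions d and test the forward difference quotient
    (V(x + t d) - V(x)) / t >= -tol, a proxy for V'(x; d) >= 0 for all d.
    A failing d certifies non-d-stationarity; passing is only evidence."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    for _ in range(n_dirs):
        d = rng.standard_normal(x.size)
        d /= np.linalg.norm(d)
        if (V(x + t * d) - V(x)) / t < -tol:
            return False, d                   # a descent direction was found
    return True, None

V = lambda x: abs(x[0]) - abs(x[1])           # dc function, nonsmooth at 0
print(looks_dstationary(V, np.zeros(2)))      # -> (False, d): the origin is a
                                              # critical point of the dc pair
                                              # but not directionally stationary
```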

Computational Tools

– Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

– A fusion of first-order and second-order algorithms:

  dealing with non-convexity: majorization-minimization algorithm;

  dealing with non-smoothness: nonsmooth Newton algorithm;

  dealing with pathological constraints: (exact) penalty method.

– An integration of sampling and deterministic methods:

  dealing with population optimization in statistics;

  dealing with stochastic programming in operations research;

  dealing with risk minimization in financial management and tail control.

Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic.

• Difference-of-convex: facilitates algorithmic design via convex majorization.

• Semismooth: enables fast algorithms of the Newton type.

• Bouligand differentiable: locally Lipschitz + directionally differentiable.

• Sub-analytic and semi-analytic.

* and many more.

[Figure (Section 4.4 of the monograph, p. 231): inclusion diagram of the nonsmooth function classes — PC², PC¹, semismooth, B-differentiable; PQ, PLQ, PA, dc, icc; dc/dc-composite and diff-max-convex; (locally) strongly weakly convex and locally dc; locally Lipschitz + directionally differentiable — with the qualifications (e.g., C¹, open or closed convex domain) under which the inclusions hold.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to the publisher (Friday, September 11, 2020)

Table of Contents

Preface  i
Acronyms  i
Glossary of Notation  i
Index  i

1 Primer of Mathematical Prerequisites  1
1.1 Smooth Calculus  1
1.1.1 Differentiability of functions of several variables  7
1.1.2 Mean-value theorems for vector-valued functions  12
1.1.3 Fixed-point theorems and contraction  16
1.1.4 Implicit-function theorems  18
1.2 Matrices and Linear Equations  23
1.2.1 Matrix factorizations: eigenvalues and singular values  29
1.2.2 Selected matrix classes  36
1.3 Matrix Splitting Methods for Linear Equations  47
1.4 The Conjugate Gradient Method  50
1.5 Elements of Set-Valued Analysis  54

2 Optimization Background  59
2.1 Polyhedral Theory  59
2.1.1 The Fourier-Motzkin elimination in linear inequalities  60
2.1.2 Properties of polyhedra  63
2.2 Convex Sets and Functions  67
2.2.1 Euclidean projection and proximality  71
2.3 Convex Quadratic and Nonlinear Programs  79
2.3.1 The KKT conditions and CQs  81
2.4 The Minmax Theory of von Neumann  85
2.5 Basic Descent Methods for Smooth Problems  86
2.5.1 Unconstrained problems: Armijo line search  87
2.5.2 The PGM for convex programs  91
2.5.3 The proximal Newton method  95
2.6 Nonlinear Programming Methods: A Survey  103

3 Structured Learning via Statistics and Optimization  109
3.1 A Composite View of Statistical Estimation  111
3.1.1 The optimization problems  113
3.1.2 Learning classes  115
3.1.3 Loss functions  122
3.1.4 Regularizers for sparsity  123
3.2 Low Rankness in Matrix Optimization  128
3.2.1 Rank constraints: Hard and soft formulations  128
3.2.2 Shape constrained index models  131
3.3 Affine Sparsity Constraints  137

4 Nonsmooth Analysis  143
4.1 B(ouligand)-Differentiable Functions  144
4.1.1 Generalized directional derivatives  153
4.2 Second-order Directional Derivatives  156
4.3 Subdifferentials and Regularity  162
4.4 Classes of Nonsmooth Functions  176
4.4.1 Piecewise affine functions  177
4.4.2 Piecewise linear-quadratic functions  190
4.4.3 Weakly convex functions  205
4.4.4 Difference-of-convex functions  214
4.4.5 Semismooth functions  221
4.4.6 Implicit convex-concave functions  229
4.4.7 A summary  240

5 Value Functions  243
5.1 Nonlinear Programming based Value Functions  244
5.1.1 Interdiction versus enhancement  247
5.2 Value Functions of Quadratic Programs  251
5.3 Value Functions Associated with Convex Functions  258
5.4 Robustification of Optimization Problems  261
5.4.1 Attacks without set-up costs  264
5.4.2 Attacks with set-up costs  267
5.5 The Danskin Property  271
5.6 Gap Functions  278
5.7 Some Nonconvex Stochastic Value Functions  282
5.8 Understanding Robust Optimization  289
5.8.1 Uncertainty in data realizations  290
5.8.2 Uncertainty in the probabilities of realizations  304

6 Stationarity Theory  311
6.1 First-order Theory  313
6.1.1 The convex constrained case  314
6.1.2 Composite ε-strong stationarity  320
6.1.3 Subdifferential based stationarity  324
6.2 Directional Saddle-Point Theory  332
6.3 Difference-of-Convex Constraints  337
6.3.1 Extension to piecewise dd-convex programs  345
6.4 Second-order Theory  348
6.5 Piecewise Linear-Quadratic Programs  350
6.5.1 Review of nonconvex QPs  350
6.5.2 Return to PLQ programs  352
6.5.3 Finiteness of stationary values  362
6.5.4 Testing copositivity: One negative eigenvalue  367

7 Computational Algorithms by Surrogation  371
7.1 Basic Surrogation  372
7.1.1 A broad family of upper surrogation functions  379
7.1.2 Examples of surrogation families  381
7.1.3 Surrogation algorithms for several problem classes  384
7.2 Refinements and Extensions  392
7.2.1 Moreau surrogation  392
7.2.2 Combined BCD and ADMM for coupled constraints  400
7.2.3 Surrogation with line searches  418
7.2.4 Dinkelbach's algorithm with surrogation  424
7.2.5 Randomized choice of subproblems  432
7.2.6 Inexact surrogation  435

8 Theory of Error Bounds  441
8.1 An Introduction to Error Bounds  442
8.1.1 The stationarity inclusions  444
8.2 The Piecewise Polyhedral Inclusion  449
8.3 Error Bounds for Subdifferential-Based Stationarity  454
8.4 Sequential Convergence Under Sufficient Descent  462
8.4.1 Using the local error bound (8.19)  462
8.4.2 The Kurdyka-Łojasiewicz theory  467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4  475
8.5 Error Bounds and Linear Regularity  476
8.5.1 Piecewise systems  483
8.5.2 Stationarity for convex composite dc optimization  488
8.5.3 dd-convex systems  491
8.5.4 Linear regularity and directional Slater  497

9 Theory of Exact Penalization  501
9.1 Exact Penalization of Optimal Solutions  503
9.2 Exact Penalization of Stationary Solutions  506
9.3 Pulled-out Penalization of Composite Functions  511
9.4 Two Applications of Pull-out Penalization  517
9.4.1 B-stationarity of dc composite diff-max programs  518
9.4.2 Local optimality of composite twice semidiff programs  521
9.5 Exact Penalization of Affine Sparsity Constraints  527
9.6 A Special Class of Cardinality Constrained Problems  531
9.7 Penalized Surrogation for Composite Programs  533
9.7.1 Solving the convex penalized subproblems  539
9.7.2 Semismooth Newton methods  543

10 Nonconvex Stochastic Programs  547
10.1 Expectation Functions  548
10.1.1 Sample average approximations  555
10.2 Basic Stochastic Surrogation  560
10.2.1 Expectation constraints  567
10.3 Two-stage SPs with Quadratic Recourse  568
10.3.1 The positive definite case  569
10.3.2 The positive semidefinite case  571
10.4 Probabilistic Error Bounds  588
10.5 Consistency of Directional Stationary Solutions  591
10.5.1 Convergence of stationary solutions  592
10.5.2 Asymptotic distribution of the stationary values  605
10.5.3 Convergence rate of the stationary points  609
10.6 Noisy Amplitude-based Phase Retrieval  613

11 Nonconvex Game Problems  625
11.1 Definitions and Preliminaries  627
11.1.1 Applications  631
11.1.2 Related models  637
11.2 Existence of a QNE  640
11.3 Computational Algorithms  648
11.3.1 A Gauss-Seidel best-response algorithm  649
11.4 The Role of Potential  658
11.4.1 Existence of a potential  665
11.5 Double-Loop Algorithms  669
11.6 The Nikaido-Isoda Optimization Formulations  678
11.6.1 Surrogation applied to the NI function φ_c  681
11.6.2 Difference-of-convexity and surrogation of φ̂_c  683

Bibliography  691
Index  721

A readiness test (a decision tree):

• A — Yes: will not learn from this talk; live happily ever after.

• A — No: proceed to B.

• B — How to deal with non-convexity/non-smoothness?

  – C1: Ignore these properties, or give them minimal or even wrong treatment.

  – C2: Focus on stylized problems (and there are many interesting ones) and stop at them.

  – C3: Faithfully treat these non-properties, via either:

    D1: Employ heuristics, potentially leading to heuristics on heuristics when complex models are being formulated; or

    D2: The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.

Features

Non-Smooth + Non-Convex + Non-Deterministic → Non-Problems

also Non-Separable, Non-Stylistic, Non-Interbreeding, and many more non-topics · · ·

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression:
\[
\operatorname*{minimize}_{a, \alpha, b, \beta} \; \frac{1}{N} \sum_{s=1}^{N} \Big( y_s - \Big[ \underbrace{\max_{1 \le i \le k_1} \big\{ (a^i)^\top x_s + \alpha_i \big\} - \max_{1 \le i \le k_2} \big\{ (b^i)^\top x_s + \beta_i \big\}}_{\text{capturing all piecewise affine functions}} \Big] \Big)^2,
\]
illustrating coupled non-convexity and non-differentiability.

Estimate Change Points/Hyperplanes in Linear Regression.
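A crude alternating sketch of fitting this model (a heuristic illustration of the coupled nonconvexity and nonsmoothness, not the directional-stationarity algorithm discussed in the Comments):

```python
import numpy as np

def fit_pa_regression(X, y, k1=3, k2=2, iters=50, seed=0):
    """Alternating sketch for y ~ max_i(a_i'x + alpha_i) - max_i(b_i'x + beta_i):
    fix which piece attains each max, solve the resulting least-squares
    problem over all pieces jointly, reassign, and repeat."""
    N, n = X.shape
    Xa = np.hstack([X, np.ones((N, 1))])          # affine features (x, 1)
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((k1, n + 1)) * 0.1    # rows: (a_i, alpha_i)
    B = rng.standard_normal((k2, n + 1)) * 0.1    # rows: (b_i, beta_i)
    for _ in range(iters):
        ia = np.argmax(Xa @ A.T, axis=1)          # active piece of the first max
        ib = np.argmax(Xa @ B.T, axis=1)          # active piece of the second max
        # with the active pieces fixed, the model is linear in the rows:
        # y_s ~ A[ia_s]' xa_s - B[ib_s]' xa_s  ->  one least-squares problem
        Phi = np.zeros((N, (k1 + k2) * (n + 1)))
        for s in range(N):
            Phi[s, ia[s] * (n + 1):(ia[s] + 1) * (n + 1)] = Xa[s]
            off = (k1 + ib[s]) * (n + 1)
            Phi[s, off:off + n + 1] = -Xa[s]
        coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        A = coef[:k1 * (n + 1)].reshape(k1, n + 1)
        B = coef[k1 * (n + 1):].reshape(k2, n + 1)
    pred = np.max(Xa @ A.T, axis=1) - np.max(Xa @ B.T, axis=1)
    return A, B, np.mean((y - pred) ** 2)
```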

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 17: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

A

will not learn from this talk

Live happily ever after

B

Yes No

B How to deal with non-convexitynon-smoothness

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex


Logical Sparsity Learning

$$
\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\;\; f(\beta) \quad \text{subject to} \quad \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \,\le\, b_i, \quad i = 1,\cdots,m,
$$

where $|\,t\,|_0 \triangleq \begin{cases} 1 & \text{if } t \ne 0 \\ 0 & \text{otherwise.} \end{cases}$

• strong/weak hierarchy, e.g., $|\beta_j|_0 \le |\beta_k|_0$

• cost-based variable selection: coefficients ≥ 0 but not all equal

• hierarchical/group variable selection
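Concretely, the strong-hierarchy example is a single row of the constraint system above: taking $a_{ij} = 1$, $a_{ik} = -1$, and $b_i = 0$ gives

$$
|\beta_j|_0 - |\beta_k|_0 \,\le\, 0,
$$

so $\beta_j$ can be nonzero only if $\beta_k$ is.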

Comments

• The $\ell_0$ function is discrete in nature; an ASC (affine sparsity constraint) can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when $A$ is nonnegative.

• A soft formulation of ASCs is possible, leading to an objective of the kind

$$
\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\;\; f(\beta) \;+\; \gamma \sum_{i=1}^{m}\Big[\, b_i - \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \,\Big]_+ \quad \text{for a given } \gamma > 0.
$$

• When the matrix $A$ has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semi-continuous objective.

• When $|\bullet|_0$ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
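A sketch of the big-M route mentioned in the first comment (assuming $A \ge 0$ and a known bound $M$ with $\|\beta\|_\infty \le M$; the binary variable $z_j$ stands in for $|\beta_j|_0$):

$$
\underset{\beta,\,z}{\text{minimize}}\;\; f(\beta) \quad \text{subject to} \quad \sum_{j=1}^{d} a_{ij}\, z_j \le b_i \;\;(i = 1,\dots,m), \qquad -M z_j \le \beta_j \le M z_j, \qquad z_j \in \{0,1\},
$$

so that $z_j = 0$ forces $\beta_j = 0$, while $z_j \ge |\beta_j|_0$ always holds and may be taken with equality at optimality when $A$ is nonnegative.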


Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

$$
\underset{\theta \,=\, (W^\ell, w^\ell)_{\ell=1}^{L}}{\text{minimize}}\;\; \frac{1}{S}\sum_{s=1}^{S}\Big[\, y_s - f_\theta(x_s) \Big]^2,
$$

where $f_\theta$ is the $L$-fold composite function with the nonsmooth max (ReLU) activation:

$$
f_\theta(x) \;=\; \underbrace{\big( W^L \bullet + w^L \big)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\big( 0,\, W^2 \bullet + w^2 \big)}_{\text{second layer}} \circ \underbrace{\max\big( 0,\, W^1 x + w^1 \big)}_{\text{input layer}},
$$

i.e., linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input).

Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the $\ell_1$ penalization:

$$
\underset{z\in Z,\; u}{\text{minimize}}\;\; \zeta_\rho(z,u) \,\triangleq\, \frac{1}{S}\sum_{s=1}^{S}\Bigg[\, \varphi_s\big(u^{s,L}\big) \;+\; \rho \sum_{\ell=1}^{L} \sum_{j=1}^{N} \underbrace{\Big|\, u^{s,\ell}_j - \psi_{\ell j}\big(z^\ell, u^{s,\ell-1}\big) \Big|}_{\text{penalizing the components}} \Bigg].
$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
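For concreteness, a minimal sketch of evaluating the penalized objective above on a feed-forward network. The function name, the data layout of the auxiliary variables `U`, and the scalar network output are assumptions made for illustration; the snippet shows the formulation only, not the talk's algorithms.

```python
import numpy as np

def penalized_objective(Ws, ws, U, X, y, rho):
    """zeta_rho(z, u): the layer equations u^{s,l} = max(0, W^l u^{s,l-1} + w^l)
    are relaxed into absolute-value penalty terms. U[l][s] holds the
    auxiliary variable of hidden layer l for sample s (hypothetical layout)."""
    S, L = X.shape[0], len(Ws)
    total = 0.0
    for s in range(S):
        u_prev, pen = X[s], 0.0
        for l in range(L - 1):                       # hidden ReLU layers
            target = np.maximum(0.0, Ws[l] @ u_prev + ws[l])
            pen += np.abs(U[l][s] - target).sum()    # |u^{s,l} - psi_l(z^l, u^{s,l-1})|
            u_prev = U[l][s]
        out = float(Ws[-1] @ u_prev + ws[-1])        # affine output layer (scalar)
        total += ((y[s] - out) ** 2 + rho * pen) / S
    return total
```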


Examples of Non-Problems

II. Smart Planning

Power Systems Planning

$\omega_1$: uncertain renewable energy generation; $\omega_2$: uncertain demand of electricity
$x$: production from traditional thermal plants
$y$: production from backup fast-ramping generators

$$
\underset{x\in X}{\text{minimize}}\;\; \varphi(x) \;+\; \mathbb{E}_{\omega_1,\omega_2}\big[\, \psi(x,\omega_1,\omega_2) \,\big],
$$

where the recourse function is

$$
\psi(x,\omega_1,\omega_2) \,\triangleq\, \underset{y}{\text{minimum}}\;\; y^\top c\big(\, [\,\omega_2 - x - \omega_1\,]_+ \big) \quad \text{subject to} \quad A(\omega)x + Dy \,\ge\, \xi(\omega).
$$

The unit cost of the fast-ramping generators depends on:
• the observed renewable production & demand of electricity;
• the shortfall due to the first-stage thermal power decisions.

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$
\underset{x}{\text{minimize}}\;\; \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_{\xi}\big[\, \psi(x,\xi) \,\big] \quad \text{subject to} \quad x \in X \subseteq \mathbb{R}^{n_1},
$$

where the recourse function $\psi(x,\xi)$ is the optimal objective value of the quadratic program (QP), with $Q$ symmetric positive semidefinite:

$$
\psi(x,\xi) \,\triangleq\, \underset{y}{\text{minimum}}\;\; \big[\, f(\xi) + G(\xi)x \,\big]^\top y + \tfrac{1}{2}\, y^\top Q y \quad \text{subject to} \quad y \in Y(x,\xi) \,\triangleq\, \big\{ y \in \mathbb{R}^{n_2} \;\big|\; A(\xi)x + Dy \ge b(\xi) \big\}.
$$

• The recourse function $\psi(\bullet,\xi)$ is of the implicit convex-concave kind.

• Developed a combined regularization-convexification-sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
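A small sample-average sketch of the two-stage objective in the special linear-recourse case of the planning model above. The helper `make_recourse_lp` and its data layout are hypothetical; scipy's `linprog` solves the inner recourse problem, with the constraint $A(\omega)x + Dy \ge \xi(\omega)$ rewritten as $(-D)\,y \le A(\omega)x - \xi(\omega)$.

```python
import numpy as np
from scipy.optimize import linprog

def saa_objective(x, scenarios, phi, make_recourse_lp):
    """Sample-average sketch of phi(x) + E[psi(x, w1, w2)]:
    solve the recourse LP for each sampled scenario and average its value.
    make_recourse_lp(x, w1, w2) -> (cost, G, h) for min cost^T y s.t. G y <= h."""
    vals = []
    for (w1, w2) in scenarios:
        cost, G, h = make_recourse_lp(x, w1, w2)
        res = linprog(cost, A_ub=G, b_ub=h, bounds=(None, None))  # HiGHS by default
        vals.append(res.fun)
    return phi(x) + float(np.mean(vals))
```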


(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize $c^\top x$ subject to $Ax = b$ and $x \ge 0$.

Defender's problem. Anticipating the attacker's disruption $\delta \in \Delta$, the defender undertakes the operation by solving

$$
\underset{x\ge 0}{\text{minimize}}\;\; (c+\delta_c)^\top x \;+\; \rho\, \underbrace{\Big\|\, \big[\, (A+\delta_A)x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{constraint residual}}.
$$

Attacker's problem. Anticipating the defender's activity $x$, the attacker aims to disrupt the defender's operations by solving

$$
\underset{\delta\in\Delta}{\text{maximum}}\;\; \lambda_1 \underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}} \;+\; \lambda_2 \underbrace{\Big\|\, \big[\, (A+\delta_A)x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}.
$$

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.
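As a toy illustration of the attacker's problem, here is a brute-force sketch over a *finite* uncertainty set that can be enumerated (all names are hypothetical, and `attack_cost` models $C(\delta)$; the talk's setting allows general $\Delta$, not just enumeration):

```python
import numpy as np

def attacker_best_response(x, A, b, c, deltas, lam1, lam2, attack_cost):
    """Enumerate Delta = deltas, a list of (dc, dA, db) triples, and return
    the attacker's best disruption against the defender operating at x."""
    best_val, best_delta = -np.inf, None
    for (dc, dA, db) in deltas:
        residual = np.max(np.maximum((A + dA) @ x - b - db, 0.0))  # ||[.]_+||_inf
        val = lam1 * (c + dc) @ x + lam2 * residual - attack_cost((dc, dA, db))
        if val > best_val:
            best_val, best_delta = val, (dc, dA, db)
    return best_val, best_delta
```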

Robustification of nonconvex problems

A general nonlinear program:

$$
\underset{x\in S}{\text{minimize}}\;\; \theta(x) + f(x) \quad \text{subject to} \quad H(x) = 0,
$$

where
• $S$ is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• $\theta$ and $f$ are real-valued functions defined on an open convex set $\mathcal{O}$ in $\mathbb{R}^n$ containing $S$;
• $H : \mathcal{O} \to \mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints $g(x) \le 0$ and equality constraints $h(x) = 0$.

Robust counterpart with separate perturbations:

$$
\underset{x\in S}{\text{minimize}}\;\; \theta(x) + \underset{\delta\in\Delta}{\text{maximum}}\; f(x,\delta) \quad \text{subject to} \quad \underbrace{H(x,\delta) = 0 \;\;\forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}.
$$


New formulation

via a pessimistic value-function minimization: given $\gamma > 0$,

$$
\underset{x\in S}{\text{minimize}}\;\; \theta(x) + v_\gamma(x),
$$

where $v_\gamma(x)$ models constraint violation with a cost of model attacks:

$$
v_\gamma(x) \,\triangleq\, \underset{\delta\in\Delta}{\text{maximum}}\;\; \varphi_\gamma(x,\delta) \,\triangleq\, f(x,\delta) \;+\; \gamma\, \underbrace{\big\| H(x,\delta) \big\|_\infty}_{\text{constraint residual}} \;-\; \underbrace{C(\delta)}_{\text{attack cost}},
$$

with the novel features:

• a combined perturbation $\delta$ in a joint uncertainty set $\Delta \subseteq \mathbb{R}^L$;

• constraint satisfaction robustified via penalization.

Comments

• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of $f(\bullet,\delta)$ and $H(\bullet,\delta)$, a linear structure of $C(\delta)$ with a set-up component, and a special simplex-type uncertainty set $\Delta$.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

$$
\underset{x\in S}{\text{minimize}}\;\; \varphi(x) \,\triangleq\, \theta(x) + v(x), \quad \text{with} \quad v(x) \,\triangleq\, \max_{1\le j\le J}\;\; \underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\text{maximum}}\; \Big[\, g_j(x,\delta) - h_j(x)^\top \delta \,\Big].
$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.


Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable $Z$ at levels $0 \le \alpha < \beta \le 1$, treating two-sided estimation errors:

$$
\text{In-CVaR}^{\beta}_{\alpha}(Z) \,\triangleq\, \frac{1}{\beta-\alpha} \int_{\text{VaR}_\beta(Z) \,>\, z \,\ge\, \text{VaR}_\alpha(Z)} z\, dF_Z(z) \;=\; \frac{1}{\beta-\alpha}\Big[\, (1-\alpha)\,\text{CVaR}_\alpha(Z) - (1-\beta)\,\text{CVaR}_\beta(Z) \,\Big].
$$

Buffered probability of exceedance/deceedance at level $\tau > 0$:

$$
\text{bPOE}(Z,\tau) \,\triangleq\, 1 - \min\big\{ \alpha \in [0,1] : \text{CVaR}_\alpha(Z) \ge \tau \big\}, \qquad \text{bPOD}(Z,u) \,\triangleq\, \min\big\{ \beta \in [0,1] : \text{In-CVaR}^{\beta}_{0}(Z) \ge u \big\}.
$$
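For a quick empirical illustration of these quantities (a sketch assuming an i.i.d. sample `z` of $Z$ and levels $0 \le \alpha < \beta < 1$; `cvar` and `in_cvar` are hypothetical helper names):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha via the Rockafellar-Uryasev representation
    CVaR_alpha(Z) = min_c { c + E[(Z - c)_+] / (1 - alpha) }, whose minimizer
    is the alpha-quantile (VaR_alpha). Assumes alpha < 1."""
    c = np.quantile(z, alpha)
    return c + np.mean(np.maximum(z - c, 0.0)) / (1.0 - alpha)

def in_cvar(z, alpha, beta):
    """Empirical interval CVaR, via the identity on the slide:
    In-CVaR_alpha^beta = [(1-alpha) CVaR_alpha - (1-beta) CVaR_beta] / (beta-alpha)."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)
```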


Buffered probability of the interval $(\tau, u\,]$

We have the following bound for an interval probability:

$$
\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z,\tau) + \text{bPOD}(Z,u) - 1 \,\triangleq\, \text{bPOI}^{\,u}_{\,\tau}(Z).
$$

Two risk minimization problems of a composite random functional:

$$
\underset{\theta\in\Theta}{\text{minimize}}\;\; \underbrace{\text{In-CVaR}^{\beta}_{\alpha}\big(Z(\theta)\big)}_{\substack{\text{a difference of two convex}\\ \text{SP value functions}}} \qquad \text{and} \qquad \underset{\theta\in\Theta}{\text{minimize}}\;\; \underbrace{\text{bPOI}^{\,u}_{\,\tau}\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with}\\ \text{a difference-of-convex objective}}},
$$

where

$$
Z(\theta) \,\triangleq\, c\Big(\, \underbrace{g(\theta,Z) - h(\theta,Z)}_{g(\bullet,z) \,-\, h(\bullet,z)\ \text{is difference-of-convex}} \Big),
$$

with $c : \mathbb{R} \to \mathbb{R}$ a univariate piecewise affine function, $Z$ an $m$-dimensional random vector, and, for each $z$, $g(\bullet,z)$ and $h(\bullet,z)$ both convex functions.
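Continuing the sketch, bPOE can be evaluated directly from its definition above by bisection on $\alpha$, since $\alpha \mapsto \text{CVaR}_\alpha(Z)$ is nondecreasing. This reuses the empirical `cvar` helper from the previous snippet and assumes the level $\tau$ is attainable within the sample.

```python
def bpoe(z, tau, tol=1e-6):
    """bPOE(Z, tau) = 1 - min{alpha in [0,1] : CVaR_alpha(Z) >= tau},
    computed by bisection on alpha; `cvar` is the empirical helper above."""
    lo, hi = 0.0, 1.0 - 1e-9           # keep alpha < 1 so cvar() is defined
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cvar(z, mid) >= tau:
            hi = mid                   # level attained; tighten from above
        else:
            lo = mid
    return 1.0 - hi
```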

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over $\theta \in \Theta$

$$
\underbrace{\underset{\eta_1\in\Upsilon_1}{\text{minimum}}\; \mathbb{E}\big[\, \varphi_1(\theta,\eta_1,\omega) \,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\underset{\eta_2\in\Upsilon_2}{\text{minimum}}\; \mathbb{E}\big[\, \varphi_2(\theta,\eta_2,\omega) \,\big]}_{\text{denoted } v_2(\theta)}.
$$

Both $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.


A touch of mathematics

bridging computations with theory

First of All, What Can Be Computed?

The stationarity concepts, from sharpest to weakest:

SOSC ⟹ loc-min ⟹ SONC, and

loc-min ⟹ directional stat (for dd fncs) ⟹ limiting stat ⟹ Clarke stat ⟹ critical (for dc fncs).

Relationship between the stationary points

• Stationarity:
  - smooth case: $\nabla V(x) = 0$;
  - nonsmooth case: $0 \in \partial V(x)$ [subdifferential; many calculus rules fail], e.g., $\partial V_1(x) + \partial V_2(x) \ne \partial(V_1 + V_2)(x)$ unless one of them is continuously differentiable.

• Directional stationarity: $V'(x; d) \ge 0$ for all directions $d$.

• The sharper the concept, the harder to compute.
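A standard one-line illustration of the failed sum rule, added here for concreteness: with $V_1(x) = |x|$ and $V_2(x) = -|x|$ on $\mathbb{R}$,

$$
\partial V_1(0) + \partial V_2(0) \;=\; [-1,1] + [-1,1] \;=\; [-2,2] \;\supsetneq\; \partial\,(V_1 + V_2)(0) \;=\; \{0\},
$$

so the sum of the (Clarke) subdifferentials strictly contains the subdifferential of the sum.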

Computational Tools

• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

• A fusion of first-order and second-order algorithms:
  - dealing with non-convexity: the majorization-minimization algorithm;
  - dealing with non-smoothness: nonsmooth Newton algorithms;
  - dealing with pathological constraints: (exact) penalty methods.

• An integration of sampling and deterministic methods:
  - dealing with population optimization in statistics;
  - dealing with stochastic programming in operations research;
  - dealing with risk minimization in financial management and tail control.
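As a concrete instance of the majorization-minimization idea in the difference-of-convex case (a minimal sketch: `grad_h` and `solve_convex` are assumed-available oracles, with $h$ smooth so that its linearization majorizes $-h$):

```python
import numpy as np

def dc_mm(grad_h, solve_convex, x0, iters=100, tol=1e-8):
    """Majorization-minimization (the classical DC algorithm) for minimizing
    f = g - h with g convex and h convex smooth: at x_k the concave part -h
    is majorized by its linearization, and the convex surrogate
    g(x) - grad_h(x_k)^T x is minimized exactly.
    `solve_convex(v)` must return argmin_x { g(x) - v^T x }."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x_new = np.asarray(solve_convex(grad_h(x)), dtype=float)
        if np.linalg.norm(x_new - x) <= tol * (1.0 + np.linalg.norm(x)):
            return x_new              # approximate critical point of g - h
        x = x_new
    return x
```

Each iterate solves one convex program, matching the iterative convex-programming theme that runs through the examples above.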

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• piecewise smooth*, e.g., piecewise linear-quadratic;

• difference-of-convex: facilitates algorithmic design via convex majorization;

• semismooth: enables fast algorithms of the Newton type;

• Bouligand differentiable: locally Lipschitz + directionally differentiable;

• sub-analytic and semi-analytic.

* and many more.

[Figure: the inclusion diagram from Section 4.4 ("Classes of Nonsmooth Functions") of the monograph, relating the classes PC², PC¹, PA, PQ, PLQ, dc, dc-composite, diff-max-convex, implicit convex-concave (icc), semismooth, B-differentiable, locally Lipschitz + directionally differentiable, weakly convex, locally dc, lower-C², and C¹/LC¹ functions.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to the publisher (Friday, September 11, 2020)

Table of Contents

Preface
Acronyms
Glossary of Notation
Index

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683

Bibliography 691
Index 721

Page 18: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

A

will not learn from this talk

Live happily ever after

B

C1

Yes No

C1 Ignore these properties or give them minimal or even wrong treatment

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 19: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

A

will not learn from this talk

Live happily ever after

B

C1 C2

Yes No

C2 Focus on stylized problems (and there are many interesting ones) and stopat them

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I. Structured Learning

Piecewise Affine Regression, extending classical affine regression:

$$\underset{a,\alpha,b,\beta}{\text{minimize}}\ \frac{1}{N}\sum_{s=1}^{N}\Bigg[\,y_s-\underbrace{\bigg(\max_{1\le i\le k_1}\Big\{(a^i)^{\top}x_s+\alpha_i\Big\}-\max_{1\le i\le k_2}\Big\{(b^i)^{\top}x_s+\beta_i\Big\}\bigg)}_{\text{capturing all piecewise affine functions}}\,\Bigg]^2,$$

illustrating coupled non-convexity and non-differentiability.

Estimate change points/hyperplanes in linear regression.
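Below is a minimal numerical sketch of this least-squares objective (the data and piece counts are hypothetical; evaluating the loss is easy even though minimizing it is the nonconvex, nondifferentiable part):

```python
import numpy as np

def pa_model(X, A, alpha, B, beta):
    """Difference-of-max piecewise affine model:
    max_i {(a^i)^T x + alpha_i} - max_i {(b^i)^T x + beta_i},
    evaluated row-wise on the N-by-n data matrix X."""
    return (X @ A.T + alpha).max(axis=1) - (X @ B.T + beta).max(axis=1)

def pa_loss(X, y, A, alpha, B, beta):
    """Empirical loss (1/N) sum_s [y_s - f(x_s)]^2."""
    return np.mean((y - pa_model(X, A, alpha, B, beta)) ** 2)

# toy usage with k1 = 3 and k2 = 2 pieces in R^4
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 4)), rng.normal(size=100)
A, alpha = rng.normal(size=(3, 4)), rng.normal(size=3)
B, beta = rng.normal(size=(2, 4)), rng.normal(size=2)
print(pa_loss(X, y, A, alpha, B, beta))
```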

Comments

• This is a convex quadratic composed with a piecewise affine function, and is thus piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming based algorithm.

• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_1\bigg(\max_{1\le i\le k_{11}}\psi^1_{1,i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2,i}(x)\bigg)-\varphi_2\bigg(\max_{1\le i\le k_{21}}\psi^2_{1,i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2,i}(x)\bigg),$$

where all functions are convex.


Logical Sparsity Learning

$$\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\ f(\beta)\quad\text{subject to}\quad\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\le b_i,\quad i=1,\cdots,m,$$

where $|t|_0\triangleq\begin{cases}1 & \text{if } t\ne 0\\ 0 & \text{otherwise.}\end{cases}$

• strong/weak hierarchy, e.g., $|\beta_j|_0\le|\beta_k|_0$

• cost-based variable selection: coefficients ≥ 0 but not all equal

• hierarchical/group variable selection

Comments

• The ℓ0 function is discrete in nature; an affine sparsity constraint (ASC) can be formulated either via complementarity constraints or via integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.

• A soft formulation of ASCs is possible, leading to an objective of the kind

$$\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\ f(\beta)+\gamma\sum_{i=1}^{m}\Bigg[\,b_i-\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\Bigg]_+\quad\text{for a given }\gamma>0.$$

• When the matrix A has negative entries, the resulting ASC set may not be closed, and its optimization formulation may not have a lower semicontinuous objective.

• When $|\bullet|_0$ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
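As a sketch of the big-M integer formulation mentioned above (a least-squares loss for f, the nonnegative ASC matrix, and the bound M on |β_j| are all assumptions for illustration):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
n, d, m, M = 50, 8, 2, 10.0
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
Asc = rng.uniform(size=(m, d))      # nonnegative ASC matrix A
b = np.array([3.0, 4.0])

beta = cp.Variable(d)
z = cp.Variable(d, boolean=True)    # z_j plays the role of |beta_j|_0
constraints = [Asc @ z <= b,        # sum_j a_ij z_j <= b_i
               beta <= M * z,       # z_j = 0 forces beta_j = 0
               beta >= -M * z]
prob = cp.Problem(cp.Minimize(cp.sum_squares(X @ beta - y)), constraints)
prob.solve()                        # requires a mixed-integer QP solver
print(beta.value.round(3), z.value)
```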


Deep Neural Network with Feed-Forward Structure, leading to multi-composite nonconvex nonsmooth optimization problems:

$$\underset{\theta=(W^\ell,w^\ell)_{\ell=1}^{L}}{\text{minimize}}\ \frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x_s)\,\big]^2,$$

where $f_\theta$ is the $L$-fold composite function with the nonsmooth max (ReLU) activation:

$$f_\theta(x)=\underbrace{\big(W^L\bullet+w^L\big)}_{\text{Output Layer}}\circ\;\cdots\;\circ\underbrace{\max\big(0,\,W^2\bullet+w^2\big)}_{\text{Second Layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{Input Layer}},$$

i.e., linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input).

Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ1 penalization, in which the layer outputs u^{s,ℓ} are lifted into variables and the layer equations are penalized:

$$\underset{z\in Z,\,u}{\text{minimize}}\ \zeta_\rho(z,u)\triangleq\frac{1}{S}\sum_{s=1}^{S}\Bigg\{\varphi_s\big(u^{s,L}\big)+\rho\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Big|\,u^{s,\ell}_j-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}\Bigg\}.$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• It leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
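A small sketch of this lifted, penalized objective (the shapes and the squared loss φ_s are assumptions for illustration; U[ℓ] holds the lifted layer outputs u^{s,ℓ} as free variables):

```python
import numpy as np

def zeta_rho(Ws, ws, X, y, U, rho):
    """Lifted l1-penalized objective: squared loss at the output
    variables plus rho times the l1 residual of the layer equations
    u^{s,l} = psi_l(z^l, u^{s,l-1}) (ReLU layers, linear last layer)."""
    loss = np.mean((y - U[-1].ravel()) ** 2)      # phi_s(u^{s,L})
    penalty, prev = 0.0, X                        # u^{s,0} = x_s
    for l, (W, w) in enumerate(zip(Ws, ws)):
        pre = prev @ W.T + w
        psi = pre if l == len(Ws) - 1 else np.maximum(0.0, pre)
        penalty += np.abs(U[l] - psi).sum()       # |u^{s,l}_j - psi_lj(...)|
        prev = U[l]                               # next layer sees the variable
    return loss + rho * penalty / X.shape[0]
```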


Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity;
x: production from traditional thermal plants; y: production from backup fast-ramping generators.

$$\underset{x\in X}{\text{minimize}}\ \varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big],$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2)\triangleq\ \underset{y}{\text{minimum}}\ y^{\top}c\big(\big[\,\omega_2-x-\omega_1\,\big]_+\big)\quad\text{subject to}\quad A(\omega)x+Dy\ge\xi(\omega).$$

The unit cost of the fast-ramping generators depends on:
▸ the observed renewable production and demand of electricity;
▸ the shortfall due to the first-stage thermal power decisions.

Comments

• Leads to the study of the class of two-stage, linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x}{\text{minimize}}\ \zeta(x)\triangleq\varphi(x)+\mathbb{E}_{\xi}\big[\psi(x,\xi)\big]\quad\text{subject to}\quad x\in X\subseteq\mathbb{R}^{n_1},$$

where the recourse function ψ(x, ξ) is the optimal objective value of the following quadratic program (QP), with Q symmetric positive semidefinite:

$$\psi(x,\xi)\triangleq\ \underset{y}{\text{minimum}}\ \big[f(\xi)+G(\xi)x\big]^{\top}y+\tfrac{1}{2}\,y^{\top}Qy\quad\text{subject to}\quad y\in Y(x,\xi)\triangleq\big\{y\in\mathbb{R}^{n_2}\,\big|\,A(\xi)x+Dy\ge b(\xi)\big\}.$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a combined regularization-convexification-sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
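A sketch of evaluating ψ(x, ξ) and a sample-average estimate of E_ξ[ψ(x, ξ)] (the scenario data below are random placeholders; D is taken as the identity so each scenario's QP is feasible):

```python
import cvxpy as cp
import numpy as np

def recourse(x, f, G, A, b, D, Q):
    """psi(x, xi): optimal value of the convex QP recourse for one
    scenario xi = (f(xi), G(xi), A(xi), b(xi)); Q is symmetric PSD."""
    y = cp.Variable(Q.shape[0])
    objective = (f + G @ x) @ y + 0.5 * cp.quad_form(y, Q)
    return cp.Problem(cp.Minimize(objective), [A @ x + D @ y >= b]).solve()

rng = np.random.default_rng(2)
n1, n2 = 3, 4
x = rng.uniform(size=n1)
D, Q = np.eye(n2), np.eye(n2)        # identity D keeps every QP feasible
saa = np.mean([recourse(x, rng.normal(size=n2), rng.normal(size=(n2, n1)),
                        rng.normal(size=(n2, n1)), rng.normal(size=n2), D, Q)
               for _ in range(20)])  # SAA estimate of E[psi(x, xi)]
print(saa)
```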


(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize $c^{\top}x$ subject to $Ax=b$ and $x\ge0$.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

$$\underset{x\ge0}{\text{minimize}}\ (c+\delta_c)^{\top}x+\rho\,\underbrace{\Big\|\big[(A+\delta_A)x-b-\delta_b\big]_+\Big\|_\infty}_{\text{constraint residual}}.$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\underset{\delta\in\Delta}{\text{maximum}}\ \lambda_1\underbrace{(c+\delta_c)^{\top}x}_{\text{disrupt objective}}+\lambda_2\underbrace{\Big\|\big[(A+\delta_A)x-b-\delta_b\big]_+\Big\|_\infty}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}.$$

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.
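Both payoffs share the same residual term; a minimal evaluation sketch follows (C is an assumed attack-cost callable):

```python
import numpy as np

def residual(x, A, b, dA, db):
    """l_infinity norm of the positive part of the violated constraints."""
    return np.maximum((A + dA) @ x - b - db, 0.0).max()

def defender_cost(x, c, A, b, dc, dA, db, rho):
    return (c + dc) @ x + rho * residual(x, A, b, dA, db)

def attacker_payoff(x, c, A, b, dc, dA, db, lam1, lam2, C):
    return lam1 * (c + dc) @ x + lam2 * residual(x, A, b, dA, db) - C(dc, dA, db)
```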

Robustification of nonconvex problems

A general nonlinear program:

$$\underset{x\in S}{\text{minimize}}\ \theta(x)+f(x)\quad\text{subject to}\quad H(x)=0,$$

where
• S is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in $\mathbb{R}^n$ containing S;
• H: O → $\mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\underset{x\in S}{\text{minimize}}\ \theta(x)+\underset{\delta\in\Delta}{\text{maximum}}\ f(x,\delta)\quad\underbrace{\text{subject to}\ H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}.$$


New formulation, via a pessimistic value-function minimization: given γ > 0,

$$\underset{x\in S}{\text{minimize}}\ \theta(x)+v_\gamma(x),$$

where $v_\gamma(x)$ models constraint violation with a cost of model attacks:

$$v_\gamma(x)\triangleq\ \underset{\delta\in\Delta}{\text{maximum}}\ \varphi_\gamma(x,\delta)\triangleq f(x,\delta)+\gamma\,\underbrace{\|H(x,\delta)\|_\infty}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}},$$

with the novel features:
• a combined perturbation δ in a joint uncertainty set Δ ⊆ $\mathbb{R}^L$;
• robustified constraint satisfaction via penalization.

Comments

• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

$$\underset{x\in S}{\text{minimize}}\ \varphi(x)\triangleq\theta(x)+v(x),\quad\text{with}\quad v(x)\triangleq\max_{1\le j\le J}\ \underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\text{maximum}}\ \big[\,g_j(x,\delta)-h_j(x)^{\top}\delta\,\big].$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
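A brute-force sketch of the value function v(x) on a finite discretization of Δ (the closed forms above make this unnecessary in the structured cases; g_j, h_j, and the grid below are placeholders):

```python
import numpy as np

def v(x, gs, hs, deltas):
    """max over j and over sampled delta of g_j(x, delta) - h_j(x)^T delta."""
    return max(g(x, d) - h(x) @ d for g, h in zip(gs, hs) for d in deltas)

# toy usage: J = 1, L = 2, Delta = unit simplex discretized on a grid
g1 = lambda x, d: np.sin(x @ x) + 0.1 * d.sum()
h1 = lambda x: x[:2]
deltas = [np.array([t, 1.0 - t]) for t in np.linspace(0.0, 1.0, 11)]
print(v(np.array([0.3, -0.2]), [g1], [h1], deltas))
```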


Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization, for risk-based statistical estimation controlling the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}^{\beta}_{\alpha}(Z)\triangleq\frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)}z\,dF_Z(z)=\frac{1}{\beta-\alpha}\Big[(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\Big].$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$\text{bPOE}(Z,\tau)\triangleq1-\min\big\{\alpha\in[0,1]\,:\,\text{CVaR}_\alpha(Z)\ge\tau\big\},\qquad\text{bPOD}(Z,u)\triangleq\min\big\{\beta\in[0,1]\,:\,\text{In-CVaR}^{\beta}_{0}(Z)\ge u\big\}.$$


Buffered probability of the interval (τ, u]. We have the following bound for an interval probability:

$$\mathbb{P}(u\ge Z>\tau)\le\text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\triangleq\text{bPOI}^{\,u}_{\,\tau}(Z).$$

Two risk minimization problems of a composite random functional:

$$\underset{\theta\in\Theta}{\text{minimize}}\ \underbrace{\text{In-CVaR}^{\beta}_{\alpha}\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}}\qquad\text{and}\qquad\underset{\theta\in\Theta}{\text{minimize}}\ \underbrace{\text{bPOI}^{\,u}_{\,\tau}\big(Z(\theta)\big)}_{\text{a stochastic program with a difference-of-convex objective}},$$

where

$$Z(\theta)\triangleq c\big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)-h(\bullet,z)\text{ is difference-of-convex}}\big),$$

with c: ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\underset{\eta_1\in\Upsilon_1}{\text{minimum}}\ \mathbb{E}\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}\ -\ \underbrace{\underset{\eta_2\in\Upsilon_2}{\text{minimum}}\ \mathbb{E}\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}.$$

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function.

• Developed a convex-programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
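Empirical versions of these tail functionals are easy to sketch from a sample (a tail-averaging estimator of CVaR, then the In-CVaR identity displayed earlier; the fractional atom at the quantile is ignored):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the worst (1 - alpha) tail."""
    z = np.sort(z)
    return z[int(np.floor(alpha * z.size)):].mean()

def in_cvar(z, alpha, beta):
    """Empirical In-CVaR via the identity
    [(1-alpha) CVaR_alpha - (1-beta) CVaR_beta] / (beta - alpha)."""
    return ((1 - alpha) * cvar(z, alpha)
            - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.random.default_rng(3).standard_normal(100_000)
print(in_cvar(z, 0.05, 0.95))   # trims both 5% tails; approx. 0 by symmetry
```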


A touch of mathematics, bridging computations with theory.

First of all, what can be computed?

[Diagram relating the stationarity concepts: critical points (for dc functions) ⊇ Clarke stationary ⊇ limiting stationary ⊇ directional stationary points (for dd functions), alongside SONC, local minimizers, and SOSC.]

Relationship between the stationary points:

▸ Stationarity:
– smooth case: ∇V(x) = 0;
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable.

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute.

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions; a sketch of the basic dc surrogation step follows this list.

▸ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: majorization-minimization algorithms;
– dealing with non-smoothness: nonsmooth Newton algorithms;
– dealing with pathological constraints: (exact) penalty methods.

▸ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics;
– dealing with stochastic programming in operations research;
– dealing with risk minimization in financial management and tail control.
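As an illustration of surrogation, here is a minimal difference-of-convex algorithm (DCA) for the assumed model f(x) = ½xᵀQx + cᵀx − max_i (p_iᵀx + q_i), with Q positive definite; each iteration majorizes f by linearizing the concave part at the current point and minimizes the convex surrogate in closed form:

```python
import numpy as np

def dca(Q, c, P, q, x0, iters=50):
    """Majorization-minimization for 0.5 x'Qx + c'x - max_i (p_i'x + q_i):
    p_{i(k)} is a subgradient of the max at x^k, so the surrogate
    0.5 x'Qx + (c - p_{i(k)})'x majorizes f and has a closed-form minimizer."""
    x = x0
    for _ in range(iters):
        i = int(np.argmax(P @ x + q))      # active piece at x^k
        x = np.linalg.solve(Q, P[i] - c)   # argmin of the convex surrogate
    return x

Q, c = 2.0 * np.eye(2), np.zeros(2)
P, q = np.array([[1.0, 0.0], [-1.0, 0.0]]), np.zeros(2)
print(dca(Q, c, P, q, np.array([0.2, 0.0])))  # -> (0.5, 0), a critical point
```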

Classes of Nonsmooth Functions, pervasive in applications. Progressively broader:

• piecewise smooth*, e.g., piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• subanalytic and semianalytic.

* and many more

[Inclusion diagram of nonsmooth function classes, reproduced from Section 4.4 (p. 231) of the monograph: PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable (= locally Lipschitz + directionally differentiable); PA ⊂ PLQ; together with the dc, dc-composite, diff-max-convex, implicit convex-concave (icc), weakly convex, and locally dc classes, whose inclusions hold under side conditions such as C¹ smoothness, lower-C², and open or closed convex domains.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to the publisher (Friday, September 11, 2020)


Page 20: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

Yes No

C3 Faithfully treat these non-properties

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1

Yes No

D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683

Bibliography 691
Index 721


[Readiness-test flowchart: nodes A, B, C1-C3 with Yes/No branches leading to outcomes D1 and D2]

A: will not learn from this talk; live happily ever after.

D1: Employ heuristics, potentially leading to heuristics on heuristics, when complex models are being formulated.

D2: The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.

Features

Non-Problems: Non-Smooth + Non-Convex + Non-Deterministic + Non-Separable

Non-Stylistic, Non-Interbreeding, and many more non-topics ...

Examples of Non-Problems

I. Structured Learning

Piecewise Affine Regression, extending classical affine regression:

$$\underset{a,\alpha,b,\beta}{\operatorname{minimize}}\ \frac{1}{N}\sum_{s=1}^{N}\Bigg(\,y_s-\underbrace{\Bigg[\max_{1\le i\le k_1}\Big\{(a^i)^\top x_s+\alpha_i\Big\}-\max_{1\le i\le k_2}\Big\{(b^i)^\top x_s+\beta_i\Big\}\Bigg]}_{\text{capturing all piecewise affine functions}}\,\Bigg)^{2}$$

illustrating coupled non-convexity and non-differentiability.

Estimate change points/hyperplanes in linear regression.

Comments

• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming based algorithm; see the sketch after this list.

• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^1_{1,i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2,i}(x)\Big)-\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^2_{1,i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2,i}(x)\Big)$$

where all functions are convex.
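To make the last two bullets concrete, here is a minimal sketch of a surrogation (majorization-minimization) iteration for piecewise affine regression. It is an illustration under assumptions, not the authors' exact algorithm: an absolute-value loss replaces the squared loss for simplicity, and all data and names (X, y, k1, k2) are placeholders. Each iteration linearizes one convex piece of the difference-of-convex residual at the current iterate, producing a convex upper surrogate that is tight there, and solves the resulting convex program.

import numpy as np
import cvxpy as cp

def pa_regression_mm(X, y, k1=3, k2=2, iters=20, seed=0):
    """Surrogation for min (1/N) sum_s | y_s - g_s + h_s |, where
    g_s = max_i {a_i'x_s + alpha_i} and h_s = max_i {b_i'x_s + beta_i}."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    A, al = rng.standard_normal((k1, d)), np.zeros(k1)   # current iterate
    B, be = rng.standard_normal((k2, d)), np.zeros(k2)
    for _ in range(iters):
        Av, alv = cp.Variable((k1, d)), cp.Variable(k1)
        Bv, bev = cp.Variable((k2, d)), cp.Variable(k2)
        loss = 0
        for s in range(N):
            g = cp.max(Av @ X[s] + alv)              # convex max-affine term
            h = cp.max(Bv @ X[s] + bev)              # convex max-affine term
            i = int(np.argmax(A @ X[s] + al))        # active piece of g at iterate
            j = int(np.argmax(B @ X[s] + be))        # active piece of h at iterate
            g_lin = Av[i] @ X[s] + alv[i]            # affine minorant of g, tight here
            h_lin = Bv[j] @ X[s] + bev[j]            # affine minorant of h, tight here
            # convex upper surrogate of |y_s - g + h|, tight at the iterate
            loss += cp.maximum(y[s] - g_lin + h, g - h_lin - y[s])
        cp.Problem(cp.Minimize(loss / N)).solve()
        A, al, B, be = Av.value, alv.value, Bv.value, bev.value
    return A, al, B, be

Each subproblem is a linear program, and the surrogate construction guarantees the loss does not increase from one iterate to the next, which is the basic descent mechanism behind such algorithms.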


Logical Sparsity Learning

$$\underset{\beta\in\mathbb{R}^{d}}{\operatorname{minimize}}\ f(\beta)\quad\text{subject to}\quad\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\le b_i,\quad i=1,\cdots,m,$$

where $|\,t\,|_0\triangleq\begin{cases}1&\text{if }t\neq 0\\ 0&\text{otherwise}\end{cases}$

• strong/weak hierarchy, e.g. $|\beta_j|_0\le|\beta_k|_0$

• cost-based variable selection: coefficients ≥ 0 but not all equal

• hierarchical/group variable selection

Comments

• The ℓ0 function is discrete in nature; an affine sparsity constraint (ASC) can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative; see the sketch after this list.

• A soft formulation of ASCs is possible, leading to an objective of the kind

$$\underset{\beta\in\mathbb{R}^{d}}{\operatorname{minimize}}\ f(\beta)+\gamma\sum_{i=1}^{m}\Bigg[\,b_i-\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\,\Bigg]_+$$

for a given γ > 0.

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semi-continuous objective.

• When |•|_0 is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints.
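A minimal sketch of the big-M integer formulation named in the first bullet, under its assumptions (A nonnegative, variables bounded so that |β_j| ≤ M); the data, the bound M, and the least-squares objective are toy placeholders, not from the slides.

import numpy as np
import cvxpy as cp

d, M = 5, 10.0                                  # assumed bound |beta_j| <= M
A = np.array([[1., 1., 0., 0., 0.],
              [0., 0., 1., 1., 1.]])            # toy nonnegative ASC matrix
b = np.array([1.0, 2.0])
beta = cp.Variable(d)
z = cp.Variable(d, boolean=True)                # z_j stands in for |beta_j|_0
constraints = [cp.abs(beta) <= M * z,           # beta_j != 0 forces z_j = 1
               A @ z <= b]                      # the affine sparsity constraints
target = np.arange(d, dtype=float)              # toy objective f(beta)
problem = cp.Problem(cp.Minimize(cp.sum_squares(beta - target)), constraints)
problem.solve()                                 # requires a mixed-integer-capable solver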


Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

$$\underset{\theta=(W^{\ell},w^{\ell})_{\ell=1}^{L}}{\operatorname{minimize}}\ \frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x_s)\,\big]^{2}$$

where f_θ is the L-fold composite function with the non-smooth max (ReLU) activation:

$$f_\theta(x)=\underbrace{\big(W^{L}\,\bullet+w^{L}\big)}_{\text{Output Layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^{2}\,\bullet+w^{2}\big)}_{\text{Second Layer}}\circ\underbrace{\max\big(0,\,W^{1}x+w^{1}\big)}_{\text{Input Layer}}$$

that is: linear output ∘ ··· ∘ max ∘ linear ∘ max ∘ linear ∘ input.

Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ1 penalization (a numerical sketch follows this list):

$$\underset{z\in Z,\,u}{\operatorname{minimize}}\ \zeta_\rho(z,u)\triangleq\frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^{s,L}\big)+\rho\sum_{s=1}^{S}\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Big|\,u_j^{s,\ell}-\psi_{\ell j}\big(z^{\ell},u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
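A toy numpy sketch of this pulled-out, ℓ1-penalized objective for one sample; the names, shapes, and the choice ψ_ℓ = ReLU for hidden layers and linear for the output layer are assumptions matching the network above, not code from the slides.

import numpy as np

def penalized_objective(Ws, ws, U, x, y, rho):
    """zeta_rho for one sample: fit term plus rho times the l1 penalty on
    the layer equations u^l = psi_l(z^l, u^{l-1}).
    Ws, ws : weights/biases of layers 1..L (ReLU hidden, linear output)
    U      : pulled-out layer variables u^1, ..., u^L (free variables)"""
    L, pen, prev = len(Ws), 0.0, x
    for l in range(L):
        pre = Ws[l] @ prev + ws[l]
        phi = np.maximum(0.0, pre) if l < L - 1 else pre   # psi_l(z^l, u^{l-1})
        pen += np.abs(U[l] - phi).sum()                    # |u_j^l - psi_lj(...)|
        prev = U[l]                                        # next layer reads u^l
    return np.sum((y - U[-1]) ** 2) + rho * pen            # phi_s(u^L) + penalty

At a point where every layer equation holds exactly, the penalty vanishes and the value reduces to the original least-squares loss; exact penalization concerns when stationary points of this relaxation match those of the constrained problem.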


Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$$\underset{x\in X}{\operatorname{minimize}}\ \varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\,\psi(x,\omega_1,\omega_2)\,\big]$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2)\triangleq\underset{y}{\operatorname{minimum}}\ y^{\top}c\big(\,[\,\omega_2-x-\omega_1\,]_+\big)\quad\text{subject to}\ A(\omega)x+Dy\ge\xi(\omega)$$

The unit cost of the fast-ramping generators depends on
▸ the observed renewable production and demand of electricity;
▸ the shortfall due to the first-stage thermal power decisions.

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x}{\operatorname{minimize}}\ \zeta(x)\triangleq\varphi(x)+\mathbb{E}_{\xi}\big[\,\psi(x,\xi)\,\big]\quad\text{subject to}\ x\in X\subseteq\mathbb{R}^{n_1}$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP) below, with Q symmetric positive semidefinite (see the sketch after this list):

$$\psi(x,\xi)\triangleq\underset{y}{\operatorname{minimum}}\ \big[f(\xi)+G(\xi)x\big]^{\top}y+\tfrac{1}{2}\,y^{\top}Qy\quad\text{subject to}\ y\in Y(x,\xi)\triangleq\big\{\,y\in\mathbb{R}^{n_2}\ \big|\ A(\xi)x+Dy\ge b(\xi)\,\big\}$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a combined regularization-convexification-sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
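A small cvxpy sketch of evaluating the recourse value ψ(x, ξ) for one realization; the function assumes the ξ-dependent data (f, G, A, b) has already been instantiated, and the commented sample-average line below it is only a toy illustration, not the regularization-convexification method itself.

import numpy as np
import cvxpy as cp

def recourse_value(x, f, G, A, D, b, Q):
    """psi(x, xi): optimal value of the second-stage QP at one realization,
    with the xi-dependent data (f, G, A, b) already evaluated."""
    y = cp.Variable(D.shape[1])
    objective = (f + G @ x) @ y + 0.5 * cp.quad_form(y, Q)   # convex: Q is PSD
    problem = cp.Problem(cp.Minimize(objective), [A @ x + D @ y >= b])
    problem.solve()
    return problem.value

# sample-average estimate of zeta at a fixed x, given draws xi_1, ..., xi_K:
# zeta_hat = phi(x) + np.mean([recourse_value(x, f(xi), G(xi), A(xi), D, b(xi), Q)
#                              for xi in samples])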


(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize c^⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

$$\underset{x\ge 0}{\operatorname{minimize}}\ (c+\delta_c)^{\top}x+\rho\,\underbrace{\Big\|\,\big[\,(A+\delta_A)x-b-\delta_b\,\big]_+\Big\|_{\infty}}_{\text{constraint residual}}$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\underset{\delta\in\Delta}{\operatorname{maximum}}\ \lambda_1\,\underbrace{(c+\delta_c)^{\top}x}_{\text{disrupt objective}}+\lambda_2\,\underbrace{\Big\|\,\big[\,(A+\delta_A)x-b-\delta_b\,\big]_+\Big\|_{\infty}}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
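For a fixed attack (δ_c, δ_A, δ_b), the defender's problem above is convex; a minimal cvxpy sketch with all data assumed:

import numpy as np
import cvxpy as cp

def defender_best_response(c, A, b, dc, dA, db, rho):
    """Defender's convex problem for a fixed attack (dc, dA, db)."""
    x = cp.Variable(A.shape[1], nonneg=True)
    residual = cp.norm(cp.pos((A + dA) @ x - b - db), "inf")   # ||[ . ]_+||_inf
    cp.Problem(cp.Minimize((c + dc) @ x + rho * residual)).solve()
    return x.value

The attacker's problem, by contrast, maximizes a function that is generally nonconvex in δ, which is what pushes the analysis outside the classical convexity-based robust toolkit.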

Robustification of nonconvex problems

A general nonlinear program:

$$\underset{x\in S}{\operatorname{minimize}}\ \theta(x)+f(x)\quad\text{subject to}\ H(x)=0,$$

where
• S is a closed convex set in R^n, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in R^n containing S;
• H : O → R^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\underset{x\in S}{\operatorname{minimize}}\ \theta(x)+\underset{\delta\in\Delta}{\operatorname{maximum}}\ f(x,\delta)\quad\text{subject to}\ \underbrace{H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}$$


New formulation

via a pessimistic value-function minimization: given γ > 0,

$$\underset{x\in S}{\operatorname{minimize}}\ \theta(x)+v_\gamma(x)$$

where v_γ(x) models constraint violation with a cost of model attacks:

$$v_\gamma(x)\triangleq\underset{\delta\in\Delta}{\operatorname{maximum}}\ \varphi_\gamma(x,\delta)\triangleq f(x,\delta)+\gamma\,\underbrace{\|H(x,\delta)\|_{\infty}}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:
• a combined perturbation δ in a joint uncertainty set ∆ ⊆ R^L;
• constraint satisfaction robustified via penalization.

Comments

• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆; see the toy sketch after this list.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

$$\underset{x\in S}{\operatorname{minimize}}\ \varphi(x)\triangleq\theta(x)+v(x)\quad\text{with}\quad v(x)\triangleq\max_{1\le j\le J}\ \underset{\delta\in\Delta\cap\mathbb{R}^{L}_{+}}{\operatorname{maximum}}\ \big[\,g_j(x,\delta)-h_j(x)^{\top}\delta\,\big]$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
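A toy illustration of why a simplex-type ∆ makes v(x) explicitly computable, under the additional assumption (made here for the sketch, not stated on the slide) that each g_j(x, ·) is convex in δ: the maximum of a convex function over a polytope is attained at a vertex, so enumerating the L + 1 vertices of ∆ = conv{0, e_1, ..., e_L} suffices.

import numpy as np

def v_of_x(g_at_vertex, h):
    """v(x) = max_j max_{delta in Delta} [ g_j(x, delta) - h_j(x)' delta ]
    for Delta = conv{0, e_1, ..., e_L}.
    g_at_vertex[j][v] : g_j(x, delta_v) at the v-th vertex delta_v
    h[j]              : the vector h_j(x)"""
    L = len(h[0])
    vertices = np.vstack([np.zeros(L), np.eye(L)])   # the L + 1 vertices
    return max(max(g_at_vertex[j][v] - h[j] @ d for v, d in enumerate(vertices))
               for j in range(len(h)))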


Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}_{\alpha}^{\beta}(Z)\triangleq\frac{1}{\beta-\alpha}\int_{\text{VaR}_{\beta}(Z)>z\ge\text{VaR}_{\alpha}(Z)}z\,dF_Z(z)=\frac{1}{\beta-\alpha}\Big[(1-\alpha)\,\text{CVaR}_{\alpha}(Z)-(1-\beta)\,\text{CVaR}_{\beta}(Z)\Big]$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$\text{bPOE}(Z,\tau)\triangleq 1-\min\big\{\,\alpha\in[0,1]\ :\ \text{CVaR}_{\alpha}(Z)\ge\tau\,\big\}$$

$$\text{bPOD}(Z,u)\triangleq\min\big\{\,\beta\in[0,1]\ :\ \text{In-CVaR}_{0}^{\beta}(Z)\ge u\,\big\}$$
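A short numpy sketch of sample estimates built directly from the identity above; the empirical quantile stands in for VaR, and the data are synthetic placeholders.

import numpy as np

def cvar(z, alpha):
    """Sample CVaR_alpha: mean of the upper (1 - alpha) tail."""
    return z[z >= np.quantile(z, alpha)].mean()

def in_cvar(z, alpha, beta):
    """Sample In-CVaR_alpha^beta via the CVaR identity above."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.random.default_rng(0).standard_normal(100_000)
print(in_cvar(z, 0.1, 0.9))   # a trimmed mean between the 10% and 90% VaR levels; ~0 here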


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

$$\mathbb{P}(u\ge Z>\tau)\ \le\ \text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\ \triangleq\ \text{bPOI}_{\tau}^{u}(Z)$$

Two risk minimization problems of a composite random functional:

$$\underset{\theta\in\Theta}{\operatorname{minimize}}\ \underbrace{\text{In-CVaR}_{\alpha}^{\beta}\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}}\qquad\text{and}\qquad\underset{\theta\in\Theta}{\operatorname{minimize}}\ \underbrace{\text{bPOI}_{\tau}^{u}\big(Z(\theta)\big)}_{\text{a stochastic program with a difference-of-convex objective}}$$

where

$$Z(\theta)\triangleq c\big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)-h(\bullet,z)\text{ is a difference of convex}}\big)$$

with c : R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\underset{\eta_1\in\Upsilon_1}{\operatorname{minimum}}\ \mathbb{E}\big[\,\varphi_1(\theta,\eta_1,\omega)\,\big]}_{\text{denoted }v_1(\theta)}-\underbrace{\underset{\eta_2\in\Upsilon_2}{\operatorname{minimum}}\ \mathbb{E}\big[\,\varphi_2(\theta,\eta_2,\omega)\,\big]}_{\text{denoted }v_2(\theta)}$$

Both v_1 and v_2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: the hierarchy of computable stationarity concepts: critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions), together with second-order necessary conditions (SONC), local minima, and second-order sufficient conditions (SOSC).]

Relationship between the stationary points:

▸ Stationarity: smooth case, ∇V(x) = 0; nonsmooth case, 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g. ∂V_1(x) + ∂V_2(x) ≠ ∂(V_1 + V_2)(x) unless one of them is continuously differentiable.

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute.
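A tiny numeric illustration of that last point (not from the slides): x = 0 is Clarke stationary for V(x) = −|x|, since 0 ∈ ∂V(0) = [−1, 1], yet it is not directionally stationary, because V′(0; d) = −|d| < 0 for every d ≠ 0; the sharper notion rejects the spurious point.

V = lambda x: -abs(x)

def dir_deriv(V, x, d, t=1e-7):
    """One-sided directional derivative V'(x; d) via a small forward step."""
    return (V(x + t * d) - V(x)) / t

# both directions strictly decrease, so 0 is not directionally stationary
print([round(dir_deriv(V, 0.0, d), 6) for d in (1.0, -1.0)])   # ~[-1.0, -1.0]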

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: majorization-minimization algorithm;
– dealing with non-smoothness: nonsmooth Newton algorithm;
– dealing with pathological constraints: (exact) penalty method.

▸ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics;
– dealing with stochastic programming in operations research;
– dealing with risk minimization in financial management and tail control.

Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more

[Diagram (from Section 4.4, p. 231 of the monograph): inclusions among the classes of nonsmooth functions: PA ⊂ PLQ ⊂ PC² ⊂ PC¹; PC¹ functions are semismooth, and semismooth functions are B-differentiable, i.e. locally Lipschitz and directionally differentiable; the picture also places the dc, dc-composite, diff-max-convex, implicit convex-concave (icc), strongly weakly convex, and locally dc classes, with qualifications such as C¹, LC¹, lower-C², and open or closed convex domains.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday, September 11, 2020)


Page 22: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

A

will not learn from this talk

Live happily ever after

B

C1 C2 C3

D1 D2

Yes No

D2 The hard road

Strive for rigor without sacrificing practice

advance fundamental knowledge and prepare for the future

Features

Non-Smooth Non-Convex Non-Deterministic+ +

Non-Problems

Non-Separable

Non-Stylistic Non-Interbreeding

and many more non-topics middot middot middot

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683
Bibliography 691
Index 721


Features

Non-Smooth + Non-Convex + Non-Deterministic + Non-Separable = Non-Problems

Non-Stylistic, Non-Interbreeding, and many more non-topics · · ·

Examples of Non-Problems

I. Structured Learning

Piecewise Affine Regression, extending classical affine regression:

\underset{a,\alpha,b,\beta}{\text{minimize}}\; \frac{1}{N}\sum_{s=1}^{N}\Bigl[\, y_s - \Bigl(\underbrace{\max_{1\le i\le k_1}\bigl\{(a^i)^{\top}x^s+\alpha_i\bigr\} - \max_{1\le i\le k_2}\bigl\{(b^i)^{\top}x^s+\beta_i\bigr\}}_{\text{capturing all piecewise affine functions}}\Bigr)\Bigr]^{2}

illustrating coupled non-convexity and non-differentiability.

[Figure: Estimate Change Points/Hyperplanes in Linear Regression]

Comments

• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming based algorithm.

• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

\varphi_1\Bigl(\max_{1\le i\le k_{11}}\psi^{1}_{1i}(x) - \max_{1\le i\le k_{12}}\psi^{1}_{2i}(x)\Bigr) \;-\; \varphi_2\Bigl(\max_{1\le i\le k_{21}}\psi^{2}_{1i}(x) - \max_{1\le i\le k_{22}}\psi^{2}_{2i}(x)\Bigr)

where all functions are convex. A numerical sketch of the regression loss follows below.
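To make the coupled nonsmoothness concrete, here is a minimal Python sketch (synthetic data; names such as pa_loss are illustrative, not from the monograph) that evaluates the piecewise affine regression loss and estimates its two one-sided slopes along a line in parameter space, which can disagree at a kink:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a difference-of-max-affine (i.e., piecewise affine) model.
N, d, k1, k2 = 200, 3, 2, 2
X = rng.normal(size=(N, d))
A0, alpha0 = rng.normal(size=(k1, d)), rng.normal(size=k1)
B0, beta0 = rng.normal(size=(k2, d)), rng.normal(size=k2)
y = (np.max(X @ A0.T + alpha0, axis=1)
     - np.max(X @ B0.T + beta0, axis=1) + 0.05 * rng.normal(size=N))

def pa_loss(A, alpha, B, beta):
    """Least-squares loss of the model max_i{(a^i)ᵀx + α_i} − max_i{(b^i)ᵀx + β_i}."""
    fit = np.max(X @ A.T + alpha, axis=1) - np.max(X @ B.T + beta, axis=1)
    return np.mean((y - fit) ** 2)

# One-sided slopes of the loss along a fixed direction; at a nondifferentiable
# point the forward and backward slopes differ, which is exactly what directional
# stationarity (nonnegative slope in every direction) must account for.
theta = [rng.normal(size=(k1, d)), rng.normal(size=k1),
         rng.normal(size=(k2, d)), rng.normal(size=k2)]
direc = [rng.normal(size=p.shape) for p in theta]
phi = lambda t: pa_loss(*(p + t * q for p, q in zip(theta, direc)))
eps = 1e-6
print("forward slope :", (phi(eps) - phi(0.0)) / eps)
print("backward slope:", (phi(0.0) - phi(-eps)) / eps)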


Logical Sparsity Learning

\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\; f(\beta) \quad\text{subject to}\quad \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \le b_i, \quad i = 1,\dots,m

where |t|_0 \triangleq 1 if t \neq 0, and 0 otherwise. Such affine sparsity constraints (ASCs) express, e.g.:

• strong/weak hierarchy, e.g., |\beta_j|_0 \le |\beta_k|_0
• cost-based variable selection: coefficients \ge 0 but not all equal
• hierarchical/group variable selection

Comments

• The ℓ0 function is discrete in nature; an ASC can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.

• A soft formulation of ASCs is possible, leading to an objective of the kind

\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\; f(\beta) + \gamma \sum_{i=1}^{m}\Bigl[\,\sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 - b_i \Bigr]_{+}

for a given \gamma > 0.

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.

• When |\bullet|_0 is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints. A tiny sketch of the ASC mechanics follows below.
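A small illustration (helper names are mine, chosen for this sketch): the ℓ0 function is evaluated elementwise, and the soft formulation's penalty is the positive part of the constraint residual.

import numpy as np

def l0(beta):
    """Elementwise |t|_0: 1 where the entry is nonzero, 0 otherwise."""
    return (beta != 0).astype(float)

def asc_penalty(A, b, beta):
    """Soft-ASC penalty: sum_i [ sum_j a_ij |beta_j|_0 - b_i ]_+ ."""
    return np.maximum(A @ l0(beta) - b, 0.0).sum()

# Strong hierarchy |beta_1|_0 <= |beta_2|_0 encoded as one ASC row a = (1, -1, 0), b = 0.
A, b = np.array([[1.0, -1.0, 0.0]]), np.array([0.0])
print(asc_penalty(A, b, np.array([0.7, 0.0, 2.0])))  # 1.0: beta_1 selected without beta_2
print(asc_penalty(A, b, np.array([0.7, 1.2, 2.0])))  # 0.0: hierarchy respected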


Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

\underset{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}{\text{minimize}}\; \frac{1}{S}\sum_{s=1}^{S}\bigl[\, y_s - f_\theta(x^s) \,\bigr]^2

where f_\theta is the L-fold composite function with the nonsmooth max (ReLU) activation:

f_\theta(x) = \underbrace{\bigl(W^L\,\bullet + w^L\bigr)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\bigl(0,\, W^2\,\bullet + w^2\bigr)}_{\text{second layer}} \circ \underbrace{\max\bigl(0,\, W^1 x + w^1\bigr)}_{\text{input layer}}

that is: linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input).

Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ1 penalization, with auxiliary variables u^{s,ℓ} for the layer outputs:

\underset{z\in Z,\,u}{\text{minimize}}\; \zeta_\rho(z,u) \;\triangleq\; \frac{1}{S}\sum_{s=1}^{S}\varphi_s\bigl(u^{s,L}\bigr) \;+\; \rho \sum_{s=1}^{S}\sum_{\ell=1}^{L}\sum_{j=1}^{N} \underbrace{\Bigl|\, u^{s,\ell}_j - \psi_{\ell j}\bigl(z^\ell, u^{s,\ell-1}\bigr) \Bigr|}_{\text{penalizing the components}}

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems. A sketch of the penalized objective follows below.
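A minimal sketch of the penalized objective ζ_ρ for a two-layer network (one hidden ReLU layer) under the squared loss; the array names U1, U2 stand in for the u^{s,ℓ} blocks and are purely illustrative:

import numpy as np

rng = np.random.default_rng(1)
S, d, H = 50, 4, 6                        # samples, input dim, hidden width
X, y = rng.normal(size=(S, d)), rng.normal(size=S)
W1, w1 = rng.normal(size=(H, d)), rng.normal(size=H)
W2, w2 = rng.normal(size=(1, H)), rng.normal(size=1)

# Auxiliary layer variables, initialized at the exact forward pass so that the
# layer-consistency residuals (hence the penalty) start at zero.
U1 = np.maximum(0.0, X @ W1.T + w1)       # hidden-layer outputs, shape (S, H)
U2 = U1 @ W2.T + w2                       # network outputs, shape (S, 1)

def zeta_rho(W1, w1, W2, w2, U1, U2, rho):
    """(1/S)·Σ_s (y_s − u^{s,2})² + ρ · ℓ1-norm of the layer-consistency residuals."""
    fit = np.mean((y - U2[:, 0]) ** 2)
    res = (np.abs(U1 - np.maximum(0.0, X @ W1.T + w1)).sum()
           + np.abs(U2 - (U1 @ W2.T + w2)).sum())
    return fit + rho * res

print(zeta_rho(W1, w1, W2, w2, U1, U2, rho=10.0))
# With exact U1, U2 this equals the original least-squares loss; the block
# structure in (z, u) is what the exact-penalty surrogation algorithms exploit.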


Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

\underset{x\in X}{\text{minimize}}\; \varphi(x) + \mathbb{E}_{\omega_1,\omega_2}\bigl[\, \psi(x,\omega_1,\omega_2) \,\bigr]

where the recourse function is

\psi(x,\omega_1,\omega_2) \;\triangleq\; \underset{y}{\text{minimum}}\; y^{\top} c\bigl([\,\omega_2 - x - \omega_1\,]_{+}\bigr) \quad\text{subject to}\quad A(\omega)x + Dy \ge \xi(\omega)

The unit cost of the fast-ramping generators depends on:
▸ the observed renewable production & demand of electricity
▸ the shortfall due to the first-stage thermal power decisions

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

\underset{x}{\text{minimize}}\; \zeta(x) \;\triangleq\; \varphi(x) + \mathbb{E}_{\xi}\bigl[\,\psi(x,\xi)\,\bigr] \quad\text{subject to}\quad x \in X \subseteq \mathbb{R}^{n_1}

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

\psi(x,\xi) \;\triangleq\; \underset{y}{\text{minimum}}\; \bigl[\,f(\xi) + G(\xi)x\,\bigr]^{\top}y + \tfrac{1}{2}\,y^{\top}Qy \quad\text{subject to}\quad y \in Y(x,\xi) \;\triangleq\; \bigl\{\, y \in \mathbb{R}^{n_2} \;\big|\; A(\xi)x + Dy \ge b(\xi) \,\bigr\}

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a combined regularization + convexification + sampling method for computing a generalized critical point, with an almost-sure convergence guarantee. A per-scenario evaluation sketch follows below.
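A sketch of evaluating ψ(x, ξ) scenario by scenario with an off-the-shelf convex QP modeler (CVXPY here); the data are random stand-ins, and a box on y is added solely to keep this toy instance bounded and feasible — it is not part of the model above:

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n1, n2, m = 3, 4, 3
Q = np.zeros((n2, n2)); Q[:2, :2] = [[2.0, 0.5], [0.5, 1.0]]   # symmetric PSD
D = rng.normal(size=(m, n2))

def recourse(x, f_xi, G_xi, A_xi, b_xi):
    """psi(x, xi): optimal value of the convex QP recourse in one scenario."""
    y = cp.Variable(n2)
    obj = (f_xi + G_xi @ x) @ y + 0.5 * cp.quad_form(y, Q)
    cons = [A_xi @ x + D @ y >= b_xi, cp.norm(y, "inf") <= 50]   # box keeps toy well-posed
    prob = cp.Problem(cp.Minimize(obj), cons)
    prob.solve()
    return prob.value

x = rng.normal(size=n1)
scenarios = [(rng.normal(size=n2), rng.normal(size=(n2, n1)),
              rng.normal(size=(m, n1)), rng.normal(size=m) - 2.0) for _ in range(8)]
print(np.mean([recourse(x, *xi) for xi in scenarios]))   # sample average of E[psi(x, ·)]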


(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize c^{\top}x subject to Ax = b and x \ge 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

\underset{x\ge 0}{\text{minimize}}\;\; (c+\delta_c)^{\top}x \;+\; \rho\,\underbrace{\bigl\|\,[\,(A+\delta_A)x - b - \delta_b\,]_{+}\bigr\|_{\infty}}_{\text{constraint residual}}

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

\underset{\delta\in\Delta}{\text{maximum}}\;\; \lambda_1\,\underbrace{(c+\delta_c)^{\top}x}_{\text{disrupt objective}} \;+\; \lambda_2\,\underbrace{\bigl\|\,[\,(A+\delta_A)x - b - \delta_b\,]_{+}\bigr\|_{\infty}}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• "Adversarial Programming."
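Both players' payoffs revolve around the same penalized residual; a small sketch (illustrative names) of the defender's objective under a fixed attack δ = (δ_c, δ_A, δ_b):

import numpy as np

def defender_objective(x, c, A, b, dc, dA, db, rho):
    """(c+δc)ᵀx + ρ·‖[(A+δA)x − b − δb]_+‖_∞ : cost plus penalized constraint residual."""
    residual = np.maximum((A + dA) @ x - (b + db), 0.0)
    return (c + dc) @ x + rho * np.max(residual)

c, A, b = np.array([1.0, 2.0]), np.array([[1.0, 1.0]]), np.array([1.0])
x = np.array([0.5, 0.5])                                          # nominally feasible
print(defender_objective(x, c, A, b, 0 * c, 0 * A, 0 * b, rho=10.0))        # 1.5
print(defender_objective(x, c, A, b, 0 * c, 0 * A, np.array([-0.5]), rho=10.0))
# an attack on the right-hand side creates a residual the defender now pays for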

Robustification of nonconvex problems

A general nonlinear program:

\underset{x\in S}{\text{minimize}}\; \theta(x) + f(x) \quad\text{subject to}\quad H(x) = 0

where
• S is a closed convex set in \mathbb{R}^n, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in \mathbb{R}^n containing S;
• H: O → \mathbb{R}^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

\underset{x\in S}{\text{minimize}}\; \theta(x) + \max_{\delta\in\Delta} f(x,\delta) \quad\text{subject to}\quad \underbrace{H(x,\delta) = 0 \;\;\forall\, \delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}


New formulation

via a pessimistic value-function minimization: given γ > 0,

\underset{x\in S}{\text{minimize}}\; \theta(x) + v_{\gamma}(x)

where v_{\gamma}(x) models constraint violation with a cost of model attacks:

v_{\gamma}(x) \;\triangleq\; \max_{\delta\in\Delta}\; \varphi_{\gamma}(x,\delta) \;\triangleq\; f(x,\delta) + \gamma\,\underbrace{\|\,H(x,\delta)\,\|_{\infty}}_{\text{constraint residual}} \;-\; \underbrace{C(\delta)}_{\text{attack cost}}

with the novel features:
• a combined perturbation δ in a joint uncertainty set ∆ ⊆ \mathbb{R}^L;
• robustified constraint satisfaction via penalization.

Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation

\underset{x\in S}{\text{minimize}}\; \varphi(x) \;\triangleq\; \theta(x) + v(x), \quad\text{with}\quad v(x) \;\triangleq\; \max_{1\le j\le J}\; \max_{\delta\in\Delta\cap\mathbb{R}^{L}_{+}} \bigl[\, g_j(x,\delta) - h_j(x)^{\top}\delta \,\bigr]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework. A generic MM sketch follows below.
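The MM idea in its simplest difference-of-convex form, as a sketch on a toy one-dimensional problem: minimize g(x) − h(x) with g, h convex, replacing −h by its linearization at the current iterate (a convex upper surrogate, by convexity of h) and minimizing the surrogate in closed form:

import numpy as np

# Toy DC program: minimize f(x) = x**4/4 - x**2, i.e. g(x) = x**4/4, h(x) = x**2.
# MM/DCA step: minimize the surrogate g(x) - h(x_k) - h'(x_k)*(x - x_k),
# whose minimizer solves x**3 = h'(x_k) = 2*x_k.
def mm_dc(x, iters=50):
    for _ in range(iters):
        x = np.cbrt(2.0 * x)        # closed-form surrogate minimizer
    return x

print(mm_dc(0.5), mm_dc(-0.5))      # ±sqrt(2): the two stationary local minima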


Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

\text{In-CVaR}_{\alpha}^{\beta}(Z) \;\triangleq\; \frac{1}{\beta-\alpha}\int_{\text{VaR}_{\beta}(Z) \,>\, z \,\ge\, \text{VaR}_{\alpha}(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta-\alpha}\Bigl[\,(1-\alpha)\,\text{CVaR}_{\alpha}(Z) - (1-\beta)\,\text{CVaR}_{\beta}(Z)\Bigr]

Buffered probability of exceedance/deceedance at level τ > 0:

\text{bPOE}(Z,\tau) \;\triangleq\; 1 - \min\bigl\{\,\alpha\in[0,1] : \text{CVaR}_{\alpha}(Z) \ge \tau \,\bigr\}

\text{bPOD}(Z,u) \;\triangleq\; \min\bigl\{\,\beta\in[0,1] : \text{In-CVaR}_{0}^{\beta}(Z) \ge u \,\bigr\}


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z,\tau) + \text{bPOD}(Z,u) - 1 \;\triangleq\; \text{bPOI}_{\tau}^{u}(Z)

Two risk minimization problems of a composite random functional:

\underset{\theta\in\Theta}{\text{minimize}}\; \text{In-CVaR}_{\alpha}^{\beta}\bigl(Z(\theta)\bigr) \quad\text{(a difference of two convex SP value functions)}

and

\underset{\theta\in\Theta}{\text{minimize}}\; \text{bPOI}_{\tau}^{u}\bigl(Z(\theta)\bigr) \quad\text{(a stochastic program with a difference-of-convex objective)}

where Z(\theta) \;\triangleq\; c\bigl(\, g(\theta,Z) - h(\theta,Z) \,\bigr), with c: \mathbb{R} \to \mathbb{R} a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(\bullet,z) and h(\bullet,z) both convex functions, so that g(\bullet,z) - h(\bullet,z) is a difference of convex functions.

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ the function

\underbrace{\min_{\eta_1\in\Upsilon_1}\, \mathbb{E}\bigl[\,\varphi_1(\theta,\eta_1,\omega)\,\bigr]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\min_{\eta_2\in\Upsilon_2}\, \mathbb{E}\bigl[\,\varphi_2(\theta,\eta_2,\omega)\,\bigr]}_{\text{denoted } v_2(\theta)}

Both v_1 and v_2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems. A sketch of the sample quantities follows below.
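A numerical sketch of the building blocks (sample versions of the standard Rockafellar–Uryasev identities; the function names are chosen here for illustration):

import numpy as np

def cvar(z, alpha):
    """Sample CVaR_α(Z) = min_t { t + E[Z − t]_+ / (1−α) }, minimized at t = VaR_α."""
    var = np.quantile(z, alpha)
    return var + np.mean(np.maximum(z - var, 0.0)) / (1.0 - alpha)

def in_cvar(z, alpha, beta):
    """In-CVaR_α^β via the identity ((1−α)·CVaR_α − (1−β)·CVaR_β) / (β − α)."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.random.default_rng(3).normal(size=200_000)
print(cvar(z, 0.95))           # ≈ 2.063 for a standard normal
print(in_cvar(z, 0.05, 0.95))  # a trimmed mean: both tails excluded, hence the
                               # robustness to outliers noted above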


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: hierarchy of stationarity concepts — critical (for dc functions), Clarke stationary, limiting stationary, directional stationary (for dd functions), SONC, loc-min, SOSC — ordered from weakest to sharpest.]

Relationship between the stationary points

▸ Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of the functions is continuously differentiable.
▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder to compute.
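The failure of the sum rule, and what directional stationarity checks, in a few lines of numerics (a classic one-dimensional example): with V1(x) = |x| and V2(x) = −|x|, the sum V1 + V2 ≡ 0 has ∂(V1+V2)(0) = {0}, while ∂V1(0) + ∂V2(0) = [−2, 2], so the sum-rule inclusion is strict; directional derivatives, by contrast, see the sum exactly.

import numpy as np

V1, V2 = abs, lambda x: -abs(x)
V = lambda x: V1(x) + V2(x)          # identically zero, yet built from two kinks

def dir_deriv(f, x, d, eps=1e-7):
    """One-sided directional derivative f'(x; d) by forward differencing."""
    return (f(x + eps * d) - f(x)) / eps

# Directional stationarity of V at 0: V'(0; d) >= 0 for every direction d.
print(dir_deriv(V, 0.0, +1.0), dir_deriv(V, 0.0, -1.0))   # 0.0 0.0
# A test based on 0 ∈ ∂V1(0) + ∂V2(0) = [-2, 2] is trivially satisfied and thus
# uninformative: the Clarke sum rule is only an inclusion, not an equality.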

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: the majorization-minimization algorithm;
  – dealing with non-smoothness: nonsmooth Newton algorithms;
  – dealing with pathological constraints: (exact) penalty methods.

▸ An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more

[Diagram (the summary figure of Section 4.4 of the monograph, p. 231): inclusion relations among the classes of nonsmooth functions — PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable (locally Lipschitz + directionally differentiable); PQ ⊂ PLQ ⊂ PA ⊂ dc ⊂ icc; with further links among dc-composite, diff-max-convex, strongly weakly convex, locally dc, lower-C², C¹, and LC¹ functions on open or closed convex domains.]
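As a taste of why semismoothness matters algorithmically, a sketch of a semismooth Newton iteration on a piecewise linear equation (a toy of my own choosing; any element of the Clarke generalized derivative may be used at the kink):

import numpy as np

F = lambda x: np.maximum(0.0, x) + 0.5 * x - 1.0   # piecewise linear, root x = 2/3
dF = lambda x: (1.0 if x >= 0 else 0.0) + 0.5      # a generalized-derivative element

x = -3.0
for k in range(20):
    x -= F(x) / dF(x)                              # semismooth Newton step
    if abs(F(x)) < 1e-12:
        break
print(k, x)   # terminates after two steps: finite convergence is typical on PA equations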

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 24: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Examples of Non-Problems

I Structured Learning

Piecewise Affine Regression

extending classical affine regression

minimizeaαbβ

1

N

Nsums=1

ys minus

max1leilek1

(ai)gtxs + αi

minus max

1leilek2

(bi)gtxs + βi

︸ ︷︷ ︸

capturing all piecewise affine functions

2

illustrating coupled non-convexity and non-differentiability

Estimate Change PointsHyperplanes

in Linear Regression

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday, September 11, 2020)

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.1.9) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̃c 683

Bibliography 691
Index 721


Piecewise Affine Regression

extending classical affine regression:

$$
\operatorname*{minimize}_{a,\alpha,b,\beta}\ \frac{1}{N}\sum_{s=1}^{N}\Bigg(y_s-\underbrace{\bigg[\max_{1\le i\le k_1}\Big\{(a^i)^\top x_s+\alpha_i\Big\}-\max_{1\le i\le k_2}\Big\{(b^i)^\top x_s+\beta_i\Big\}\bigg]}_{\text{capturing all piecewise affine functions}}\Bigg)^{2}
$$

illustrating coupled non-convexity and non-differentiability

Estimate Change Points/Hyperplanes in Linear Regression
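As a concrete reading of the objective, here is a minimal NumPy sketch (the names and array shapes are my own illustrative assumptions, not from the slides) that evaluates the difference-of-max-affine model and its empirical squared loss:

```python
import numpy as np

def pa_loss(A, alpha, B, beta, X, y):
    """Empirical squared loss of the difference-of-max-affine model.

    A: (k1, d), alpha: (k1,)  -- pieces of the first max
    B: (k2, d), beta: (k2,)   -- pieces of the second max
    X: (N, d) features, y: (N,) responses
    """
    f = (X @ A.T + alpha).max(axis=1) - (X @ B.T + beta).max(axis=1)
    return np.mean((y - f) ** 2)
```

Minimizing `pa_loss` over (A, alpha, B, beta) is exactly the nonconvex nondifferentiable problem above; the two max-over-pieces terms are what create the kinks.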

Comments

• This is a convex quadratic composed with a piecewise affine function, thus is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming based algorithm.

• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$
\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^{1}_{1,i}(x)-\max_{1\le i\le k_{12}}\psi^{1}_{2,i}(x)\Big)\;-\;\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^{2}_{1,i}(x)-\max_{1\le i\le k_{22}}\psi^{2}_{2,i}(x)\Big)
$$

where all functions are convex.


Logical Sparsity Learning

$$
\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\ f(\beta)\quad\text{subject to}\quad\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\le b_i,\quad i=1,\cdots,m,
$$

where

$$
|t|_0\;\triangleq\;\begin{cases}1&\text{if }t\neq 0\\[2pt]0&\text{otherwise.}\end{cases}
$$

• strong/weak hierarchy, e.g., $|\beta_j|_0\le|\beta_k|_0$
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection

Comments

• The ℓ₀ function is discrete in nature; an ASC (affine sparsity constraint) can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative (see the sketch after this list).

• A soft formulation of ASCs is possible, leading to an objective of the kind

$$
\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\ f(\beta)+\gamma\sum_{i=1}^{m}\bigg[\,b_i-\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\bigg]_{+}
$$

for a given γ > 0.

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semi-continuous objective.

• When $|\bullet|_0$ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
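For concreteness, here is a sketch of the integer-variable (big-M) reformulation mentioned in the first bullet, under the stated assumptions that A is nonnegative and that $|\beta_j|\le M$ on the region of interest:

$$
|\beta_j|\le M z_j,\qquad z_j\in\{0,1\},\qquad \sum_{j=1}^{d}a_{ij}\,z_j\le b_i,\quad i=1,\dots,m,
$$

since $\beta_j\neq 0$ forces $z_j=1$, i.e., $z_j\ge|\beta_j|_0$; and with $a_{ij}\ge 0$ this gives $\sum_j a_{ij}|\beta_j|_0\le\sum_j a_{ij}z_j\le b_i$. Conversely, any β feasible for the ASC with $|\beta_j|\le M$ yields a feasible pair by setting $z_j=|\beta_j|_0$.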


Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

$$
\operatorname*{minimize}_{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}\ \frac{1}{S}\sum_{s=1}^{S}\big[y_s-f_\theta(x_s)\big]^2
$$

where $f_\theta$ is the L-fold composite function with the nonsmooth max (ReLU) activation:

$$
f_\theta(x)\;=\;\underbrace{\big(W^L\,\bullet+w^L\big)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\,\bullet+w^2\big)}_{\text{second layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{input layer}}
$$

i.e., linear (output) ∘ ··· ∘ max ∘ linear ∘ max ∘ linear (input).
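A minimal NumPy sketch of $f_\theta$ and the empirical loss (shapes and names are illustrative assumptions, not the authors' code):

```python
import numpy as np

def f_theta(x, Ws, ws):
    """L-fold composite: ReLU layers followed by an affine output layer.

    Ws, ws: lists of weight matrices and bias vectors, one pair per layer.
    """
    u = x
    for W, w in zip(Ws[:-1], ws[:-1]):
        u = np.maximum(0.0, W @ u + w)   # nonsmooth max / ReLU activation
    return Ws[-1] @ u + ws[-1]           # affine output layer

def empirical_loss(X, Y, Ws, ws):
    """(1/S) * sum_s (y_s - f_theta(x_s))^2, for scalar outputs."""
    return np.mean([(ys - f_theta(xs, Ws, ws)) ** 2 for xs, ys in zip(X, Y)])
```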

Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ₁ penalization:

$$
\operatorname*{minimize}_{z\in Z,\,u}\ \zeta_\rho(z,u)\;\triangleq\;\frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^{s,L}\big)+\rho\underbrace{\sum_{\ell=1}^{L}\sum_{j=1}^{N}\Big|\,u^{s,\ell}_{j}-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}
$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.


Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$$
\operatorname*{minimize}_{x\in X}\ \varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big]
$$

where the recourse function is

$$
\psi(x,\omega_1,\omega_2)\;\triangleq\;\operatorname*{minimum}_{y}\ y^\top c\big([\omega_2-x-\omega_1]_+\big)\quad\text{subject to}\quad A(\omega)\,x+D\,y\ge\xi(\omega).
$$

The unit cost of the fast-ramping generators depends on:
▸ the observed renewable production & demand of electricity
▸ the shortfall due to the first-stage thermal power decisions
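A sketch of evaluating the recourse value for one scenario with SciPy's LP solver; the data shapes, the unit-cost map, and the nonnegativity of y are assumptions made for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def recourse_value(x, w1, w2, A_w, D, xi_w, unit_cost):
    """psi(x, w1, w2) = min_y c([w2 - x - w1]_+)^T y
       subject to A(w) x + D y >= xi(w), y >= 0 (assumed)."""
    c = unit_cost(np.maximum(w2 - x - w1, 0.0))      # cost driven by the shortfall
    # rewrite A(w) x + D y >= xi(w) as  -D y <= A(w) x - xi(w)
    res = linprog(c, A_ub=-D, b_ub=A_w @ x - xi_w)   # default bounds give y >= 0
    return res.fun
```

Averaging `recourse_value` over sampled (ω₁, ω₂) scenarios approximates the expectation in the first-stage objective.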

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$
\operatorname*{minimize}_{x}\ \zeta(x)\;\triangleq\;\varphi(x)+\mathbb{E}_{\xi}\big[\psi(x,\xi)\big]\quad\text{subject to}\quad x\in X\subseteq\mathbb{R}^{n_1},
$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP) below, with Q symmetric positive semidefinite:

$$
\psi(x,\xi)\;\triangleq\;\operatorname*{minimum}_{y}\ \big[f(\xi)+G(\xi)\,x\big]^\top y+\tfrac{1}{2}\,y^\top Q\,y\quad\text{subject to}\quad y\in Y(x,\xi)\;\triangleq\;\big\{y\in\mathbb{R}^{n_2}\mid A(\xi)\,x+D\,y\ge b(\xi)\big\}.
$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a combined regularization, convexification, and sampling method for computing a generalized critical point with an almost-sure convergence guarantee.


(Simplified) Defender-Attacker Game with Costs

The underlying problem: minimize $c^\top x$ subject to $Ax=b$ and $x\ge 0$.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

$$
\operatorname*{minimize}_{x\ge 0}\ (c+\delta_c)^\top x+\rho\,\underbrace{\Big\|\big[(A+\delta_A)\,x-b-\delta_b\big]_+\Big\|_\infty}_{\text{constraint residual}}
$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$
\operatorname*{maximum}_{\delta\in\Delta}\ \lambda_1\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}}+\lambda_2\underbrace{\Big\|\big[(A+\delta_A)\,x-b-\delta_b\big]_+\Big\|_\infty}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}
$$

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
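For instance, the defender's penalized objective is straightforward to evaluate numerically; a NumPy sketch under assumed data (c, A, b and the disruptions dc, dA, db):

```python
import numpy as np

def defender_objective(x, c, dc, A, dA, b, db, rho):
    """(c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf."""
    residual = np.maximum((A + dA) @ x - b - db, 0.0)  # positive-part residual
    return (c + dc) @ x + rho * np.linalg.norm(residual, np.inf)
```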

Robustification of nonconvex problems

A general nonlinear program:

$$
\operatorname*{minimize}_{x\in S}\ \theta(x)+f(x)\quad\text{subject to}\quad H(x)=0,
$$

where
• S is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in $\mathbb{R}^n$ containing S;
• H : O → $\mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$
\operatorname*{minimize}_{x\in S}\ \theta(x)+\operatorname*{maximum}_{\delta\in\Delta}\ f(x,\delta)\quad\text{subject to}\quad\underbrace{H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}
$$


New formulation

via a pessimistic value-function minimization: given γ > 0,

$$
\operatorname*{minimize}_{x\in S}\ \theta(x)+v_\gamma(x),
$$

where $v_\gamma(x)$ models constraint violation together with the cost of model attacks:

$$
v_\gamma(x)\;\triangleq\;\operatorname*{maximum}_{\delta\in\Delta}\ \varphi_\gamma(x,\delta)\;\triangleq\;f(x,\delta)+\gamma\,\underbrace{\big\|H(x,\delta)\big\|_\infty}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}},
$$

with the novel features:
• combined perturbation δ in a joint uncertainty set ∆ ⊆ $\mathbb{R}^L$;
• robustified constraint satisfaction via penalization.
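When ∆ is replaced by a finite sample of attack scenarios, $v_\gamma$ can be evaluated by enumeration; a sketch, where f, H, and C are problem-supplied callables (my names, assumed for illustration):

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    """Pessimistic value function over a finite uncertainty set:
    max over delta of f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)."""
    return max(f(x, d) + gamma * np.linalg.norm(H(x, d), np.inf) - C(d)
               for d in deltas)
```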

Comments

• Related the VFOP (value-function optimization problem) to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation (a generic sketch of the MM pattern follows this list):

$$
\operatorname*{minimize}_{x\in S}\ \varphi(x)\;\triangleq\;\theta(x)+v(x)\quad\text{with}\quad v(x)\;\triangleq\;\max_{1\le j\le J}\ \operatorname*{maximum}_{\delta\in\Delta\cap\mathbb{R}^L_+}\big[g_j(x,\delta)-h_j(x)^\top\delta\big].
$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
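The MM bullet above refers to the standard majorization-minimization pattern; a generic sketch (not the authors' implementation, and with the surrogate oracle left abstract):

```python
import numpy as np

def mm_minimize(x0, argmin_surrogate, max_iter=200, tol=1e-8):
    """Majorization-minimization: repeatedly minimize a convex upper surrogate.

    argmin_surrogate(x) returns a minimizer of a surrogate phi_hat(., x) with
    phi_hat(y, x) >= phi(y) for all y and phi_hat(x, x) = phi(x); these two
    properties guarantee monotone descent of phi along the iterates.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = argmin_surrogate(x)
        if np.linalg.norm(x_new - x) <= tol * (1.0 + np.linalg.norm(x)):
            return x_new
        x = x_new
    return x
```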


Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$
\text{In-CVaR}^{\beta}_{\alpha}(Z)\;\triangleq\;\frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)}z\,dF_Z(z)\;=\;\frac{1}{\beta-\alpha}\Big[(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\Big].
$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$
\text{bPOE}(Z,\tau)\;\triangleq\;1-\min\big\{\alpha\in[0,1]\mid\text{CVaR}_\alpha(Z)\ge\tau\big\},\qquad
\text{bPOD}(Z,u)\;\triangleq\;\min\big\{\beta\in[0,1]\mid\text{In-CVaR}^{\beta}_{0}(Z)\ge u\big\}.
$$
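On a finite sample these quantities reduce to tail averages; a NumPy sketch of the empirical versions, using the CVaR identity displayed above (function names are mine):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha(Z): average of the upper (1 - alpha) tail of z."""
    var = np.quantile(z, alpha)          # empirical VaR_alpha
    return z[z >= var].mean()

def in_cvar(z, alpha, beta):
    """Empirical In-CVaR^beta_alpha via
    ((1 - alpha) CVaR_alpha - (1 - beta) CVaR_beta) / (beta - alpha)."""
    return ((1 - alpha) * cvar(z, alpha)
            - (1 - beta) * cvar(z, beta)) / (beta - alpha)
```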


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

$$
\mathbb{P}(u\ge Z>\tau)\;\le\;\text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\;\triangleq\;\text{bPOI}^{u}_{\tau}(Z).
$$

Two risk minimization problems of a composite random functional:

$$
\operatorname*{minimize}_{\theta\in\Theta}\ \underbrace{\text{In-CVaR}^{\beta}_{\alpha}\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}}\qquad\text{and}\qquad\operatorname*{minimize}_{\theta\in\Theta}\ \underbrace{\text{bPOI}^{u}_{\tau}\big(Z(\theta)\big)}_{\text{a stochastic program with a difference-of-convex objective}}
$$

where

$$
Z(\theta)\;\triangleq\;c\big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)-h(\bullet,z)\ \text{is difference-of-convex}}\big),
$$

with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$
\underbrace{\operatorname*{minimum}_{\eta_1\in\Upsilon_1}\ \mathbb{E}\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}\;-\;\underbrace{\operatorname*{minimum}_{\eta_2\in\Upsilon_2}\ \mathbb{E}\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}.
$$

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: a hierarchy of stationarity concepts, from weakest to sharpest: critical (for dc functions), Clarke stationary, limiting stationary, directional stationary (for dd functions), together with the second-order conditions SONC, local minimum, and SOSC.]

Relationship between the stationary points:

▸ Stationarity
  – smooth case: ∇V(x) = 0
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the functions is continuously differentiable

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder it is to compute.
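A standard one-dimensional illustration of the failed sum rule (my example, not from the slides): take $V_1(x)=|x|$ and $V_2(x)=-|x|$, so that $V_1+V_2\equiv 0$. With the Clarke subdifferential,

$$
\partial V_1(0)+\partial V_2(0)=[-1,1]+[-1,1]=[-2,2]\;\supsetneq\;\{0\}=\partial(V_1+V_2)(0),
$$

so the sum of the subdifferentials strictly overestimates the subdifferential of the sum; stationarity certified through the sum rule can therefore be much weaker than directional stationarity.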

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions

▸ A fusion of first-order and second-order algorithms
  – Dealing with non-convexity: majorization-minimization algorithm
  – Dealing with non-smoothness: nonsmooth Newton algorithm
  – Dealing with pathological constraints: (exact) penalty method

▸ An integration of sampling and deterministic methods
  – Dealing with population optimization in statistics
  – Dealing with stochastic programming in operations research
  – Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more

[Diagram reproduced from Section 4.4 of the monograph (p. 231): inclusion relationships among the classes of nonsmooth functions, such as PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable and PA ⊂ PLQ, together with dc, implicit convex-concave (icc), diff-max-convex, dc/dc composite, weakly convex, locally dc, lower-C², and C¹/LC¹ functions, related under various domain and regularity conditions.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 26: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function $\varphi_c$ 681
11.6.2 Difference-of-convexity and surrogation of $\widetilde{\varphi}_c$ 683

Bibliography 691
Index 721


Comments

• This is a convex quadratic composed with a piecewise affine function, thus is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming based algorithm.

• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)\;-\;\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\Big)\;-\;\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)\;-\;\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\Big)$$

where all functions are convex.


Logical Sparsity Learning

$$\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\;\;f(\beta)\quad\text{subject to}\;\;\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\;\le\;b_i,\qquad i=1,\cdots,m$$

where $|t|_0\,\triangleq\,\begin{cases}1 & \text{if } t\neq 0\\ 0 & \text{otherwise}\end{cases}$

• strong/weak hierarchy, e.g., $|\beta_j|_0 \le |\beta_k|_0$

• cost-based variable selection: coefficients ≥ 0 but not all equal

• hierarchical/group variable selection

Comments

• The $\ell_0$ function is discrete in nature; an ASC can be formulated either via complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative (see the sketch after this list).

• A soft formulation of ASCs is possible, leading to an objective of the kind

$$\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\;\;f(\beta)+\gamma\sum_{i=1}^{m}\Big[\,b_i-\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\,\Big]_+\qquad\text{for a given }\gamma>0$$

• When the matrix A has negative entries, the resulting ASC set may not be closed, and its optimization formulation may not have a lower semicontinuous objective.

• When $|\bullet|_0$ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
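To make the big-M reformulation concrete, here is a minimal Python/cvxpy sketch, assuming A is nonnegative, the variables are bounded by M, and f is a convex loss; the data and names here are illustrative assumptions, not from the talk.

```python
# A minimal sketch of the big-M mixed-integer ASC formulation (assumptions:
# A >= 0, |beta_j| <= M, f convex; all data below are illustrative).
import cvxpy as cp
import numpy as np

d, m, M = 5, 2, 10.0
A = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]])   # nonnegative ASC matrix (assumption)
b = np.array([2.0, 2.0])                    # at most 2 nonzeros in each group

beta = cp.Variable(d)
z = cp.Variable(d, boolean=True)            # z_j plays the role of |beta_j|_0

constraints = [
    beta <= M * z,                          # beta_j != 0  forces  z_j = 1
    beta >= -M * z,
    A @ z <= b,                             # the ASC: sum_j a_ij |beta_j|_0 <= b_i
]
f = cp.sum_squares(beta - np.arange(d))     # placeholder convex loss f(beta)
cp.Problem(cp.Minimize(f), constraints).solve()  # needs a MIP-capable solver
```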


Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

$$\underset{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}{\text{minimize}}\;\;\frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x_s)\,\big]^2$$

where $f_\theta$ is the L-fold composite function with the nonsmooth max (ReLU) activation:

$$f_\theta(x)\;=\;\underbrace{\big(W^L\bullet+w^L\big)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\bullet+w^2\big)}_{\text{second layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{input layer}}$$

i.e., linear output ∘ ··· ∘ max ∘ linear ∘ max ∘ linear input.
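As a concrete rendering of this composite structure, here is a minimal numpy sketch; it assumes fully connected layers and a scalar network output, and all names are illustrative.

```python
# A minimal numpy sketch of the L-fold ReLU composite f_theta and the
# least-squares objective above (assumptions: dense layers, scalar output).
import numpy as np

def f_theta(x, layers):
    """layers = [(W1, w1), ..., (WL, wL)]; ReLU on every layer except the last."""
    u = x
    for W, w in layers[:-1]:
        u = np.maximum(0.0, W @ u + w)   # max(0, W u + w): the nonsmooth kink
    W_L, w_L = layers[-1]
    return W_L @ u + w_L                 # affine output layer, no activation

def training_objective(data, layers):
    """(1/S) sum_s (y_s - f_theta(x_s))^2 -- nonconvex and nonsmooth in theta."""
    return np.mean([(y - f_theta(x, layers)) ** 2 for x, y in data])
```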

Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the $\ell_1$ penalization:

$$\underset{z\in Z,\;u}{\text{minimize}}\;\;\zeta_\rho(z,u)\,\triangleq\,\frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^{s,L}\big)\;+\;\rho\sum_{s=1}^{S}\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Big|\,u^{\,s,\ell}_{j}-\psi_{\ell j}\big(z^{\ell},u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.


Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation
ω₂: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$$\underset{x\in X}{\text{minimize}}\;\;\varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\,\psi(x,\omega_1,\omega_2)\,\big]$$

where the recourse function, with ω ≜ (ω₁, ω₂), is

$$\psi(x,\omega_1,\omega_2)\,\triangleq\,\underset{y}{\text{minimum}}\;\;y^\top c\big(\,[\,\omega_2-x-\omega_1\,]_+\big)\quad\text{subject to}\;\;A(\omega)x+Dy\ge\xi(\omega)$$

The unit cost of the fast-ramping generators depends on:

▶ the observed renewable production & demand of electricity

▶ the shortfall due to the first-stage thermal power decisions

Comments

• Leads to the study of the class of two-stage, linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x}{\text{minimize}}\;\;\zeta(x)\,\triangleq\,\varphi(x)+\mathbb{E}_\xi\big[\,\psi(x,\xi)\,\big]\quad\text{subject to}\;\;x\in X\subseteq\mathbb{R}^{n_1}$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP) below, with Q symmetric positive semidefinite (a one-scenario evaluation sketch follows after this list):

$$\psi(x,\xi)\,\triangleq\,\underset{y}{\text{minimum}}\;\;\big[\,f(\xi)+G(\xi)x\,\big]^\top y+\tfrac{1}{2}\,y^\top Qy\quad\text{subject to}\;\;y\in Y(x,\xi)\,\triangleq\,\big\{\,y\in\mathbb{R}^{n_2}\;\big|\;A(\xi)x+Dy\ge b(\xi)\,\big\}$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a regularization + convexification + sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
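The following is an illustrative cvxpy sketch, not the talk's code, of evaluating the recourse value for a single scenario and of the sample-average approximation of the expectation; the function names and scenario encoding are assumptions.

```python
# Evaluate psi(x, xi) = min_y [f + G x]^T y + 0.5 y^T Q y  s.t.  A x + D y >= b
# for one scenario xi = (f, G, A, b); Q is symmetric positive semidefinite.
import cvxpy as cp
import numpy as np

def recourse_value(x, f, G, Q, A, D, b):
    y = cp.Variable(Q.shape[0])
    obj = cp.Minimize((f + G @ x) @ y + 0.5 * cp.quad_form(y, Q))
    prob = cp.Problem(obj, [A @ x + D @ y >= b])
    prob.solve()
    return prob.value                 # psi(x, xi); +inf if Y(x, xi) is empty

def saa_objective(x, phi, scenarios):
    # sample-average approximation of zeta(x) = phi(x) + E_xi[psi(x, xi)]
    return phi(x) + np.mean([recourse_value(x, *xi) for xi in scenarios])
```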


(Simplified) Defender-Attacker Game with Costs

The underlying problem: minimize $c^\top x$ subject to $Ax = b$ and $x \ge 0$.

Defender's problem: anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

$$\underset{x\ge 0}{\text{minimize}}\;\;(c+\delta_c)^\top x+\rho\,\underbrace{\Big\|\big[\,(A+\delta_A)x-b-\delta_b\,\big]_+\Big\|_\infty}_{\text{constraint residual}}$$

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\underset{\delta\in\Delta}{\text{maximum}}\;\;\lambda_1\,\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}}\;+\;\lambda_2\,\underbrace{\Big\|\big[\,(A+\delta_A)x-b-\delta_b\,\big]_+\Big\|_\infty}_{\text{disrupt constraints}}\;-\;\underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.

Robustification of nonconvex problems

A general nonlinear program:

$$\underset{x\in S}{\text{minimize}}\;\;\theta(x)+f(x)\quad\text{subject to}\;\;H(x)=0$$

where

• S is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in $\mathbb{R}^n$ containing S;
• H : O → $\mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\underset{x\in S}{\text{minimize}}\;\;\theta(x)+\underset{\delta\in\Delta}{\text{maximum}}\;f(x,\delta)\quad\text{subject to}\;\;\underbrace{H(x,\delta)=0\;\;\forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}$$


New formulation

via a pessimistic value-function minimization: given γ > 0,

$$\underset{x\in S}{\text{minimize}}\;\;\theta(x)+v_\gamma(x)$$

where $v_\gamma(x)$ models constraint violation with a cost of model attacks:

$$v_\gamma(x)\,\triangleq\,\underset{\delta\in\Delta}{\text{maximum}}\;\;\varphi_\gamma(x,\delta)\,\triangleq\,f(x,\delta)+\gamma\,\underbrace{\big\|H(x,\delta)\big\|_\infty}_{\text{constraint residual}}\;-\;\underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:

• a combined perturbation δ in a joint uncertainty set ∆ ⊆ $\mathbb{R}^L$;
• robustified constraint satisfaction via penalization.

Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.

• Introduced a Majorization-Minimization (MM) algorithm (a generic sketch follows below) for solving a unified formulation:

$$\underset{x\in S}{\text{minimize}}\;\;\varphi(x)\,\triangleq\,\theta(x)+v(x)\quad\text{with}\quad v(x)\,\triangleq\,\max_{1\le j\le J}\;\underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\text{maximum}}\;\big[\,g_j(x,\delta)-h_j(x)^\top\delta\,\big]$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
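For orientation, here is a generic majorization-minimization loop; it is an illustrative sketch, not the talk's algorithm, under the assumption that `argmin_surrogate` is a user-supplied oracle minimizing a convex upper surrogate that is tight at the current iterate.

```python
# Generic MM loop: at x_k, minimize a convex surrogate phi_hat(.; x_k) with
# phi_hat(x; x_k) >= phi(x) and phi_hat(x_k; x_k) = phi(x_k).
import numpy as np

def mm(x0, phi, argmin_surrogate, max_iter=200, tol=1e-8):
    """Descent holds automatically: phi(x_{k+1}) <= phi_hat(x_{k+1}; x_k)
    <= phi_hat(x_k; x_k) = phi(x_k); limit points are stationary by the
    surrogation theory."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = argmin_surrogate(x)       # solve the convex subproblem at x
        if phi(x) - phi(x_new) <= tol * (1.0 + abs(phi(x))):
            break                          # negligible decrease: stop
        x = x_new
    return x
```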


Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z)\,\triangleq\,\frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)}z\,dF_Z(z)\;=\;\frac{1}{\beta-\alpha}\Big[\,(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\,\Big]$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$\text{bPOE}(Z,\tau)\,\triangleq\,1-\min\big\{\,\alpha\in[0,1]:\text{CVaR}_\alpha(Z)\ge\tau\,\big\}$$

$$\text{bPOD}(Z,u)\,\triangleq\,\min\big\{\,\beta\in[0,1]:\text{In-CVaR}^{\,\beta}_{\,0}(Z)\ge u\,\big\}$$
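An empirical sketch of these quantities on a finite sample follows; it uses a simple sorting estimator and the In-CVaR identity above, and is illustrative only.

```python
# Empirical CVaR_alpha and In-CVaR^beta_alpha on a sample (illustrative sketch).
import numpy as np

def cvar(z, alpha):
    """Mean of the upper (1 - alpha)-tail of the sample: empirical CVaR_alpha."""
    z = np.sort(np.asarray(z, dtype=float))
    k = int(np.floor(alpha * z.size))     # empirical VaR_alpha index
    return z[k:].mean()

def in_cvar(z, alpha, beta):
    """((1 - alpha) CVaR_alpha - (1 - beta) CVaR_beta) / (beta - alpha)."""
    assert 0.0 <= alpha < beta <= 1.0
    tail_beta = 0.0 if beta == 1.0 else (1.0 - beta) * cvar(z, beta)
    return ((1.0 - alpha) * cvar(z, alpha) - tail_beta) / (beta - alpha)

# z = np.random.standard_normal(100_000); in_cvar(z, 0.05, 0.95) trims both tails
```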


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

$$\mathbb{P}(u\ge Z>\tau)\;\le\;\text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\;\triangleq\;\text{bPOI}^{\,u}_{\,\tau}(Z)$$

Two risk-minimization problems of a composite random functional:

$$\underset{\theta\in\Theta}{\text{minimize}}\;\underbrace{\text{In-CVaR}^{\,\beta}_{\,\alpha}\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}}\qquad\text{and}\qquad\underset{\theta\in\Theta}{\text{minimize}}\;\underbrace{\text{bPOI}^{\,u}_{\,\tau}\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with}\\ \text{a difference-of-convex objective}}}$$

where

$$Z(\theta)\,\triangleq\,c\big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)-h(\bullet,z)\text{ is a difference of convex}}\big)$$

with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\underset{\eta_1\in\Upsilon_1}{\text{minimum}}\;\mathbb{E}\big[\,\varphi_1(\theta,\eta_1,\omega)\,\big]}_{\text{denoted }v_1(\theta)}\;-\;\underbrace{\underset{\eta_2\in\Upsilon_2}{\text{minimum}}\;\mathbb{E}\big[\,\varphi_2(\theta,\eta_2,\omega)\,\big]}_{\text{denoted }v_2(\theta)}$$

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.


A touch of mathematics

bridging computations with theory

First of All, What Can be Computed?

$$\text{SOSC}\;\Longrightarrow\;\text{loc-min}\;\Longrightarrow\;\text{SONC}\;\Longrightarrow\;\underbrace{\text{directional stat}}_{\text{for dd functions}}\;\Longrightarrow\;\text{limiting stat}\;\Longrightarrow\;\text{Clarke stat}\;\Longrightarrow\;\underbrace{\text{critical}}_{\text{for dc functions}}$$

Relationship between the stationary points:

▶ Stationarity:
– smooth case: ∇V(x) = 0
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of them is continuously differentiable

▶ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute (a toy illustration follows below).
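The following toy finite-difference check, an illustrative approximation only, shows why directional stationarity is sharper than Clarke stationarity: for V(x) = −|x| at x = 0, the point is Clarke stationary (0 lies in the Clarke subdifferential [−1, 1]) yet fails the directional test.

```python
# V(x) = -|x| at x = 0: Clarke stationary, but V'(0; d) = -|d| < 0, so NOT
# directionally stationary -- descent exists in every direction.
import numpy as np

def dir_deriv(V, x, d, t=1e-7):
    return (V(x + t * d) - V(x)) / t   # one-sided difference ~ V'(x; d)

V = lambda x: -abs(x)
for d in (1.0, -1.0):
    print(d, dir_deriv(V, 0.0, d))     # both ~ -1: the sharper test rejects x = 0
```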

Computational Tools

▶ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▶ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: the majorization-minimization algorithm
– dealing with non-smoothness: the nonsmooth Newton algorithm
– dealing with pathological constraints: the (exact) penalty method

▶ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics
– dealing with stochastic programming in operations research
– dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• piecewise smooth*, e.g., piecewise linear-quadratic

• difference-of-convex: facilitates algorithmic design via convex majorization

• semismooth: enables fast algorithms of the Newton type

• Bouligand differentiable: locally Lipschitz + directionally differentiable

• sub-analytic and semi-analytic

* and many more

[Figure, from Section 4.4 "Classes of Nonsmooth Functions" of the monograph (p. 231): the summary diagram of inclusions among the nonsmooth function classes — piecewise affine ⊂ piecewise linear-quadratic ⊂ piecewise smooth (PC¹, PC²); dc functions and dc composites, including diff-max-convex; implicit convex-concave (icc); weakly convex and locally dc; semismooth; B-differentiable = locally Lipschitz + directionally differentiable.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 28: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.1.9) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683
Bibliography 691
Index 721


Comments

• This is a convex quadratic composed with a piecewise affine function, thus is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming based algorithm (a sketch follows below).

• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_1\Big( \max_{1 \le i \le k_{11}} \psi^1_{1i}(x) - \max_{1 \le i \le k_{12}} \psi^1_{2i}(x) \Big) \;-\; \varphi_2\Big( \max_{1 \le i \le k_{21}} \psi^2_{1i}(x) - \max_{1 \le i \le k_{22}} \psi^2_{2i}(x) \Big)$$

where all functions are convex.
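To make the "iterative convex-programming based algorithm" bullet concrete, here is a minimal difference-of-convex algorithm (DCA) sketch in Python; the instance (a quadratic g and an ℓ1-type h) and all data are hypothetical, chosen so each convex subproblem has a closed-form solution. This is an illustration of the general surrogation idea, not the specific algorithm of the talk.

```python
import numpy as np

# Minimal DCA sketch for minimizing g(x) - h(x), g and h convex.
# Hypothetical instance: g(x) = 0.5 x^T A x with A positive definite,
# h(x) = ||x - c||_1, so each convex subproblem solves a linear system.

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # positive definite by construction
c = rng.standard_normal(n)

def subgrad_h(x):
    # a subgradient of h(x) = ||x - c||_1
    return np.sign(x - c)

x = np.zeros(n)
for k in range(100):
    v = subgrad_h(x)             # linearize the concave part -h at x^k
    x_new = np.linalg.solve(A, v)  # argmin g(x) - v^T x  <=>  A x = v
    if np.linalg.norm(x_new - x) < 1e-10:
        break
    x = x_new

# At convergence, A x (the gradient of g) is a subgradient of h at x,
# i.e., x is a critical point of the dc function g - h.
print(k, x)
```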


Logical Sparsity Learning

$$\underset{\beta \in \mathbb{R}^d}{\text{minimize}} \;\; f(\beta) \quad \text{subject to} \;\; \sum_{j=1}^d a_{ij} \, |\beta_j|_0 \le b_i, \quad i = 1, \cdots, m,$$

where

$$| t |_0 \,:=\, \begin{cases} 1 & \text{if } t \ne 0 \\ 0 & \text{otherwise.} \end{cases}$$

• strong/weak hierarchy, e.g., $|\beta_j|_0 \le |\beta_k|_0$
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection

Comments

• The ℓ0 function is discrete in nature; an ASC can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative (see the sketch below).

• Soft formulation of ASCs is possible, leading to an objective of the kind

$$\underset{\beta \in \mathbb{R}^d}{\text{minimize}} \;\; f(\beta) + \gamma \sum_{i=1}^m \Big[\, b_i - \sum_{j=1}^d a_{ij} \, |\beta_j|_0 \,\Big]_+ \quad \text{for a given } \gamma > 0.$$

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semi-continuous objective.

• When $| \bullet |_0$ is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints.
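As a concrete, hypothetical instance of the big-M integer formulation in the first bullet, the sketch below encodes each $|\beta_j|_0$ with a binary variable $z_j$ under an assumed bound $M$ on $|\beta_j|$. It uses cvxpy with a least-squares loss; the data, the bound $M$, and the solver availability are all assumptions.

```python
import cvxpy as cp
import numpy as np

# Hypothetical data: f(beta) = ||X beta - y||^2, ASC data (A, b) with
# A nonnegative (as the bullet requires), and a big-M bound on |beta_j|.
rng = np.random.default_rng(1)
n_obs, d, m = 30, 6, 2
X, y = rng.standard_normal((n_obs, d)), rng.standard_normal(n_obs)
A = rng.uniform(0, 1, (m, d))
b = np.full(m, 2.0)
M = 10.0                          # assumed valid bound on |beta_j|

beta = cp.Variable(d)
z = cp.Variable(d, boolean=True)  # z_j plays the role of |beta_j|_0
constraints = [
    beta <= M * z,                # beta_j != 0 forces z_j = 1
    beta >= -M * z,
    A @ z <= b,                   # the affine sparsity constraints
]
prob = cp.Problem(cp.Minimize(cp.sum_squares(X @ beta - y)), constraints)
prob.solve()                      # needs a MIQP-capable solver installed
print(beta.value, z.value)
```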


Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

$$\underset{\theta = (W^\ell, w^\ell)_{\ell=1}^L}{\text{minimize}} \;\; \frac{1}{S} \sum_{s=1}^S \big[\, y_s - f_\theta(x_s) \,\big]^2,$$

where $f_\theta$ is the $L$-fold composite function with the nonsmooth max (ReLU) activation:

$$f_\theta(x) \;=\; \underbrace{\big( W^L \bullet + w^L \big)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\big( 0, W^2 \bullet + w^2 \big)}_{\text{second layer}} \circ \underbrace{\max\big( 0, W^1 x + w^1 \big)}_{\text{input layer}},$$

that is, linear (output) ∘ ··· ∘ max ∘ linear ∘ max ∘ linear (input).
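A minimal numpy sketch of this $L$-fold composition (hypothetical layer widths and random weights), showing exactly where the nonsmooth max enters:

```python
import numpy as np

def f_theta(x, weights):
    """Feed-forward network: ReLU on hidden layers, linear output layer.

    weights: list of (W, w) pairs for layers 1..L.
    """
    u = x
    for W, w in weights[:-1]:
        u = np.maximum(0.0, W @ u + w)   # nonsmooth max(0, .) activation
    W_L, w_L = weights[-1]
    return W_L @ u + w_L                 # linear output layer

rng = np.random.default_rng(2)
dims = [4, 8, 8, 1]                      # hypothetical layer widths
weights = [(rng.standard_normal((dims[l + 1], dims[l])),
            rng.standard_normal(dims[l + 1])) for l in range(len(dims) - 1)]

x_s, y_s = rng.standard_normal(4), 1.0
residual = y_s - f_theta(x_s, weights)   # one term of the least-squares loss
print(residual ** 2)
```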

Comments

• Problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ1 penalization (a numerical sketch follows below):

$$\underset{z \in Z, \, u}{\text{minimize}} \;\; \zeta_\rho(z, u) \,:=\, \frac{1}{S} \sum_{s=1}^S \varphi_s\big(u^{s,L}\big) \;+\; \rho \sum_{s=1}^S \underbrace{\sum_{\ell=1}^L \sum_{j=1}^N \Big|\, u^{s,\ell}_j - \psi_{\ell j}\big(z^\ell, u^{s,\ell-1}\big) \Big|}_{\text{penalizing the components}}$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• Framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
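A sketch of evaluating this penalized objective for the ReLU network above, under the assumptions that $\varphi_s$ is the squared loss at the top layer and $\psi_{\ell j}$ is the $j$-th component of the layer map, so that the $u$-variables are decoupled copies of the layer outputs:

```python
import numpy as np

def zeta_rho(z_weights, u, x, y, rho):
    """Penalized objective: squared loss at the top layer plus an l1
    penalty tying each u^{s,l} to the layer map psi_l(z^l, u^{s,l-1})."""
    S, L = len(x), len(z_weights)
    loss, penalty = 0.0, 0.0
    for s in range(S):
        u_prev = x[s]
        for l, (W, w) in enumerate(z_weights):
            pre = W @ u_prev + w
            psi = np.maximum(0.0, pre) if l < L - 1 else pre
            penalty += np.abs(u[s][l] - psi).sum()
            u_prev = u[s][l]                 # the decoupled copy, not psi
        loss += (y[s] - u_prev[0]) ** 2
    return loss / S + rho * penalty

rng = np.random.default_rng(3)
dims = [4, 8, 1]                             # hypothetical widths
zw = [(rng.standard_normal((dims[l + 1], dims[l])),
       rng.standard_normal(dims[l + 1])) for l in range(2)]
x = [rng.standard_normal(4) for _ in range(3)]
y = [1.0, -1.0, 0.5]
u = [[rng.standard_normal(dims[l + 1]) for l in range(2)] for _ in range(3)]
print(zeta_rho(zw, u, x, y, rho=1.0))
```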


Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation
ω₂: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$$\underset{x \in X}{\text{minimize}} \;\; \varphi(x) + \mathbb{E}_{\omega_1, \omega_2}\big[\, \psi(x, \omega_1, \omega_2) \,\big],$$

where the recourse function is

$$\psi(x, \omega_1, \omega_2) \,:=\, \underset{y}{\text{minimum}} \;\; y^\top c\big( [\, \omega_2 - x - \omega_1 \,]_+ \big) \quad \text{subject to} \;\; A(\omega) x + D y \ge \xi(\omega).$$

Unit cost of fast-ramping generators depends on:
▶ observed renewable production & demand of electricity
▶ shortfall due to the first-stage thermal power decisions
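A sketch of evaluating the recourse $\psi$ for one scenario with scipy's LP solver; all data are hypothetical, the unit cost $c(\cdot)$ is taken to be affine in the shortfall, and $y \ge 0$ is assumed for well-posedness:

```python
import numpy as np
from scipy.optimize import linprog

def recourse(x, w1, w2, A, D, xi, c_fn):
    """psi(x, w1, w2): min_y c([w2 - x - w1]_+)^T y  s.t.  A x + D y >= xi."""
    shortfall = np.maximum(w2 - x - w1, 0.0)
    cost = c_fn(shortfall)                    # unit cost depends on shortfall
    # linprog uses <=; rewrite D y >= xi - A x as -D y <= -(xi - A x)
    res = linprog(cost, A_ub=-D, b_ub=-(xi - A @ x),
                  bounds=[(0, None)] * D.shape[1], method="highs")
    return res.fun

rng = np.random.default_rng(4)
n = 3
x = np.array([1.0, 0.5, 0.2])                 # first-stage thermal decision
w1, w2 = rng.uniform(0, 1, n), rng.uniform(1, 2, n)
A, D = np.eye(n), np.eye(n)
xi = np.full(n, 1.5)
c_fn = lambda s: 1.0 + 2.0 * s                # hypothetical affine unit cost
print(recourse(x, w1, w2, A, D, xi, c_fn))
```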

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x}{\text{minimize}} \;\; \zeta(x) \,:=\, \varphi(x) + \mathbb{E}_\xi\big[\, \psi(x, \xi) \,\big] \quad \text{subject to} \;\; x \in X \subseteq \mathbb{R}^{n_1},$$

where the recourse function $\psi(x, \xi)$ is the optimal objective value of the quadratic program (QP), with $Q$ being symmetric positive semidefinite (a sketch of evaluating $\psi$ and a sample-average estimate of $\zeta$ follows below):

$$\psi(x, \xi) \,:=\, \underset{y}{\text{minimum}} \;\; \big[\, f(\xi) + G(\xi) x \,\big]^\top y + \tfrac{1}{2}\, y^\top Q y \quad \text{subject to} \;\; y \in Y(x, \xi) \,:=\, \{\, y \in \mathbb{R}^{n_2} \mid A(\xi) x + D y \ge b(\xi) \,\}.$$

• The recourse function $\psi(\bullet, \xi)$ is of the implicit convex-concave kind.

• Developed a regularization + convexification + sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
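A minimal sketch, assuming cvxpy with a QP-capable solver and entirely hypothetical scenario data, of the recourse value $\psi(x,\xi)$ and a sample-average estimate of $\zeta$:

```python
import cvxpy as cp
import numpy as np

def psi(x, xi, L_fac, D):
    """Recourse value: min_y [f(xi) + G(xi) x]^T y + 0.5 y^T Q y
       s.t. A(xi) x + D y >= b(xi), where Q = L_fac L_fac^T (psd)."""
    f_xi, G_xi, A_xi, b_xi = xi
    y = cp.Variable(D.shape[1])
    obj = (f_xi + G_xi @ x) @ y + 0.5 * cp.sum_squares(L_fac.T @ y)
    prob = cp.Problem(cp.Minimize(obj), [A_xi @ x + D @ y >= b_xi])
    prob.solve()
    return prob.value

rng = np.random.default_rng(7)
n1, n2, m = 3, 3, 3
L_fac = rng.standard_normal((n2, n2))   # Q = L L^T is symmetric psd
D = np.eye(m)
x = rng.standard_normal(n1)
samples = [(rng.standard_normal(n2), rng.standard_normal((n2, n1)),
            rng.standard_normal((m, n1)), rng.standard_normal(m))
           for _ in range(20)]
phi = lambda x: 0.5 * x @ x             # hypothetical first-stage cost
zeta_hat = phi(x) + np.mean([psi(x, xi, L_fac, D) for xi in samples])
print(zeta_hat)                         # sample-average estimate of zeta(x)
```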


(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

$$\underset{x \ge 0}{\text{minimize}} \;\; (c + \delta_c)^\top x + \rho \, \underbrace{\Big\| \big[\, (A + \delta_A) x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{constraint residual}}.$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\underset{\delta \in \Delta}{\text{maximum}} \;\; \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2 \underbrace{\Big\| \big[\, (A + \delta_A) x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}}.$$

• Many variations are possible.

• Offer alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.
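A small numpy sketch (hypothetical data and a hypothetical attack cost) that evaluates both objectives; it makes the coupling concrete: the same residual term is the defender's penalty and part of the attacker's payoff.

```python
import numpy as np

def residual_inf(A, dA, x, b, db):
    """|| [ (A + dA) x - b - db ]_+ ||_inf, the constraint residual."""
    return np.max(np.maximum((A + dA) @ x - b - db, 0.0))

def defender_obj(x, delta, c, A, b, rho):
    dc, dA, db = delta
    return (c + dc) @ x + rho * residual_inf(A, dA, x, b, db)

def attacker_obj(delta, x, c, A, b, lam1, lam2, cost_fn):
    dc, dA, db = delta
    return (lam1 * (c + dc) @ x
            + lam2 * residual_inf(A, dA, x, b, db)
            - cost_fn(delta))

rng = np.random.default_rng(5)
m, n = 2, 4
c, A, b = rng.standard_normal(n), rng.standard_normal((m, n)), rng.standard_normal(m)
x = np.abs(rng.standard_normal(n))
delta = (0.1 * rng.standard_normal(n),        # (delta_c, delta_A, delta_b)
         0.1 * rng.standard_normal((m, n)),
         0.1 * rng.standard_normal(m))
cost_fn = lambda d: sum(np.abs(p).sum() for p in d)   # hypothetical C(delta)
print(defender_obj(x, delta, c, A, b, rho=10.0))
print(attacker_obj(delta, x, c, A, b, 1.0, 1.0, cost_fn))
```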

Robustification of nonconvex problems

A general nonlinear program:

$$\underset{x \in S}{\text{minimize}} \;\; \theta(x) + f(x) \quad \text{subject to} \;\; H(x) = 0,$$

where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H : O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\underset{x \in S}{\text{minimize}} \;\; \theta(x) + \underset{\delta \in \Delta}{\text{maximum}} \; f(x, \delta) \quad \text{subject to} \;\; \underbrace{H(x, \delta) = 0 \;\; \forall\, \delta \in \Delta}_{\text{easily infeasible in current way of modeling}}$$


New formulation

via a pessimistic value-function minimization: given γ > 0,

$$\underset{x \in S}{\text{minimize}} \;\; \theta(x) + v_\gamma(x),$$

where $v_\gamma(x)$ models constraint violation together with the cost of attacks on the model:

$$v_\gamma(x) \,:=\, \underset{\delta \in \Delta}{\text{maximum}} \;\; \varphi_\gamma(x, \delta) \,:=\, f(x, \delta) + \gamma \underbrace{\| H(x, \delta) \|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}},$$

with the novel features:
• combined perturbation δ in a joint uncertainty set Δ ⊆ ℝᴸ;
• robustified constraint satisfaction via penalization.
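When Δ is a finite set of attack scenarios, $v_\gamma$ is computable by direct enumeration; a minimal sketch, with entirely hypothetical $f$, $H$, and $C$:

```python
import numpy as np

def v_gamma(x, Delta, f, H, C, gamma):
    """Pessimistic value function: max over delta in Delta of
    f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)."""
    return max(f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in Delta)

# Hypothetical instance: scalar x, f(x, d) = d x, H(x, d) = x^2 - 1 - d.
Delta = [np.array([d]) for d in (-0.2, 0.0, 0.3)]   # finite attack set
f = lambda x, d: float(d[0] * x)
H = lambda x, d: np.array([x**2 - 1.0 - d[0]])
C = lambda d: float(np.abs(d).sum())                # hypothetical attack cost
print(v_gamma(0.5, Delta, f, H, C, gamma=5.0))
```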

Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of $f(\bullet, \delta)$ and $H(\bullet, \delta)$, a linear structure of $C(\delta)$ with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

$$\underset{x \in S}{\text{minimize}} \;\; \varphi(x) \,:=\, \theta(x) + v(x) \quad \text{with} \quad v(x) \,:=\, \max_{1 \le j \le J} \; \underset{\delta \in \Delta \cap \mathbb{R}^L_+}{\text{maximum}} \; \big[\, g_j(x, \delta) - h_j(x)^\top \delta \,\big].$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.


Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}_\alpha^\beta(Z) \,:=\, \frac{1}{\beta - \alpha} \int_{\text{VaR}_\alpha(Z) \le z < \text{VaR}_\beta(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta - \alpha} \Big[\, (1 - \alpha)\, \text{CVaR}_\alpha(Z) - (1 - \beta)\, \text{CVaR}_\beta(Z) \,\Big].$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$\text{bPOE}(Z, \tau) \,:=\, 1 - \min\{\, \alpha \in [0, 1] : \text{CVaR}_\alpha(Z) \ge \tau \,\}$$

$$\text{bPOD}(Z, u) \,:=\, \min\{\, \beta \in [0, 1] : \text{In-CVaR}_0^\beta(Z) \ge u \,\}$$


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

$$\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \;=:\; \text{bPOI}_\tau^u(Z).$$

Two risk minimization problems of a composite random functional:

$$\underset{\theta \in \Theta}{\text{minimize}} \; \underbrace{\text{In-CVaR}_\alpha^\beta(Z(\theta))}_{\substack{\text{a difference of two convex} \\ \text{SP value functions}}} \qquad \text{and} \qquad \underset{\theta \in \Theta}{\text{minimize}} \; \underbrace{\text{bPOI}_\tau^u(Z(\theta))}_{\substack{\text{a stochastic program with} \\ \text{a difference-of-convex objective}}},$$

where

$$Z(\theta) \,:=\, c\big(\, \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet, z) - h(\bullet, z) \text{ is difference-of-convex}} \,\big),$$

with c : ℝ → ℝ being a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
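An empirical sketch of these tail functionals from samples: the CVaR estimate is the standard tail average, while bPOE is obtained here by scanning a grid of levels, which is a simple illustrative shortcut rather than the method of the talk.

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) tail."""
    tail = np.sort(z)[int(np.floor(alpha * len(z))):]
    return tail.mean()

def in_cvar(z, alpha, beta):
    # In-CVaR via the identity ((1-a) CVaR_a - (1-b) CVaR_b) / (b - a)
    return ((1 - alpha) * cvar(z, alpha)
            - (1 - beta) * cvar(z, beta)) / (beta - alpha)

def bpoe(z, tau, grid=1000):
    # 1 - min{ alpha : CVaR_alpha(Z) >= tau }, by scanning alpha levels
    alphas = np.linspace(0.0, 1.0 - 1.0 / len(z), grid)
    ok = [a for a in alphas if cvar(z, a) >= tau]
    return 1.0 - min(ok) if ok else 0.0

rng = np.random.default_rng(6)
z = rng.standard_normal(10_000)          # hypothetical sample of Z
print(cvar(z, 0.9), in_cvar(z, 0.1, 0.9), bpoe(z, 1.0))
```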

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\underset{\eta_1 \in \Upsilon_1}{\text{minimum}} \; \mathbb{E}\big[\, \varphi_1(\theta, \eta_1, \tilde\omega) \,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\underset{\eta_2 \in \Upsilon_2}{\text{minimum}} \; \mathbb{E}\big[\, \varphi_2(\theta, \eta_2, \tilde\omega) \,\big]}_{\text{denoted } v_2(\theta)}.$$

Both functions $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: the landscape of computable points; critical points (for dc functions), Clarke stationary and limiting stationary points, directional stationary points (for dd functions), and their relation to the second-order necessary condition (SONC), local minima, and the second-order sufficient condition (SOSC).]

Relationship between the stationary points:

▶ Stationarity:
– smooth case: ∇V(x) = 0;
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of them is continuously differentiable; for instance, with V₁ = |·| and V₂ = −|·|, ∂V₁(0) + ∂V₂(0) = [−2, 2] while ∂(V₁ + V₂)(0) = {0}.

▶ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder to compute.

Computational Tools

▶ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▶ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: majorization-minimization algorithm;
– dealing with non-smoothness: nonsmooth Newton algorithm;
– dealing with pathological constraints: (exact) penalty method.

▶ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics;
– dealing with stochastic programming in operations research;
– dealing with risk minimization in financial management and tail control.

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic;
• Difference-of-convex: facilitates algorithmic design via convex majorization;
• Semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• Sub-analytic and semi-analytic;
* and many more.

[Diagram from §4.4 of the monograph (p. 231): inclusions among nonsmooth function classes, e.g., PC² ⊂ PC¹ ⊂ semismooth, PQ ⊂ PLQ ⊂ PA ⊂ dc ⊂ dc-composite, B-differentiable = locally Lipschitz + directionally differentiable, together with the diff-max-convex, implicit convex-concave (icc), weakly convex, locally dc, and lower-C² classes placed under various C¹/LC¹ and convex-domain conditions.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)


Page 30: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic

bull It has the property that every directional stationary solution is a localminimizer

bull It has finitely many stationary values

bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm

bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is

ϕ1

(max

1leilek11ψ1

1i(x)minus max1leilek12

ψ12i(x)

)minus ϕ2

(max

1leilek21ψ2

1i(x)minus max1leilek2

ψ22i(x)

)where all functions are convex

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 31: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Logical Sparsity Learning

minimizeβ isinRd

f(β)

subject todsumj=1

aij |βj |0 le bi i = 1 middot middot middot m

where | t |0

1 if t 6= 0

0 otherwise

bull strongweak hierarchy eg |βj |0 le |βk |0

bull cost-based variable selection coefficients ge 0 but not all equal

bull hierarchicalgroup variable selection

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

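A minimal sketch of evaluating the penalized objective $\zeta_\rho$ on sampled data follows; here phi (per-sample loss) and psi (layer maps) are caller-supplied stand-ins for the slide's $\varphi_s$ and $\psi_{\ell j}$, and the nested-list layout of u (u[s][l] holding the layer-l auxiliary variables of sample s, with u[s][0] the input) is our own assumption.

```python
import numpy as np

def zeta_rho(z, u, phi, psi, rho):
    """ell_1 exact-penalty objective: average last-layer loss plus rho times
    the absolute layer-consistency residuals, summed componentwise."""
    S = len(u)
    loss = np.mean([phi(s, u[s][-1]) for s in range(S)])
    penalty = sum(abs(u[s][l][j] - psi(l, j, z[l], u[s][l - 1]))
                  for s in range(S)
                  for l in range(1, len(u[s]))
                  for j in range(len(u[s][l])))
    return loss + rho * penalty
```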

Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity

x: production from traditional thermal plants

y: production from backup fast-ramping generators

$$\underset{x\in X}{\text{minimize}}\ \ \phi(x)+\mathbb{E}_{\omega_1,\omega_2}\bigl[\psi(x,\omega_1,\omega_2)\bigr]$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2)\ \triangleq\ \underset{y}{\text{minimum}}\ \ y^\top c\bigl(\,[\,\omega_2-x-\omega_1\,]_+\bigr)\quad\text{subject to}\ \ A(\omega)x+Dy\ \ge\ \xi(\omega)$$

Unit cost of fast-ramping generators depends on

▶ observed renewable production & demand of electricity;

▶ shortfall due to the first-stage thermal power decisions.

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x}{\text{minimize}}\ \ \zeta(x)\ \triangleq\ \phi(x)+\mathbb{E}_{\xi}\bigl[\psi(x,\xi)\bigr]\quad\text{subject to}\ \ x\in X\subseteq\mathbb{R}^{n_1}$$

where the recourse function $\psi(x,\xi)$ is the optimal objective value of the quadratic program (QP), with $Q$ being symmetric positive semidefinite:

$$\psi(x,\xi)\ \triangleq\ \underset{y}{\text{minimum}}\ \ \bigl[f(\xi)+G(\xi)x\bigr]^\top y+\tfrac{1}{2}\,y^\top Qy\quad\text{subject to}\ \ y\in Y(x,\xi)\ \triangleq\ \bigl\{\,y\in\mathbb{R}^{n_2}\ \big|\ A(\xi)x+Dy\ \ge\ b(\xi)\,\bigr\}$$

• The recourse function $\psi(\bullet,\xi)$ is of the implicit convex-concave kind.

• Developed a combined regularization + convexification + sampling method for computing a generalized critical point with an almost-sure convergence guarantee.

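As a sketch of how such a recourse value could be evaluated numerically, the inner problem is a convex QP in y for each fixed realization ξ, so it can be handed to a modeling layer such as cvxpy (assumed available); the sample average over realizations then approximates $\mathbb{E}_\xi[\psi(x,\xi)]$.

```python
import cvxpy as cp
import numpy as np

def recourse_value(x, f_xi, G_xi, Q, A_xi, D, b_xi):
    """psi(x, xi): optimal value of the convex QP in y for one realization xi
    (Q symmetric positive semidefinite, as on the slide)."""
    y = cp.Variable(Q.shape[0])
    objective = cp.Minimize((f_xi + G_xi @ x) @ y + 0.5 * cp.quad_form(y, Q))
    problem = cp.Problem(objective, [A_xi @ x + D @ y >= b_xi])
    return problem.solve()   # solve() returns the optimal objective value

# Sample-average approximation of the expectation (xis: list of realizations):
# zeta_hat = phi(x) + np.mean([recourse_value(x, *xi) for xi in xis])
```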

(Simplified) Defender-Attacker Game with Costs

Underlying problem is: minimize $c^\top x$ subject to $Ax=b$ and $x\ge 0$.

Defender's problem. Anticipating the attacker's disruption $\delta\in\Delta$, the defender undertakes the operation by solving the problem

$$\underset{x\ge 0}{\text{minimize}}\ \ (c+\delta_c)^\top x+\rho\,\underbrace{\bigl\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\bigr\|_\infty}_{\text{constraint residual}}$$

Attacker's problem. Anticipating the defender's activity $x$, the attacker aims to disrupt the defender's operations by solving the problem

$$\underset{\delta\in\Delta}{\text{maximum}}\ \ \lambda_1\,\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}}\ +\ \lambda_2\,\underbrace{\bigl\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\bigr\|_\infty}_{\text{disrupt constraints}}\ -\ \underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.
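A minimal numpy sketch of the defender's penalized objective under one fixed attack $\delta=(\delta_c,\delta_A,\delta_b)$ (names are ours):

```python
import numpy as np

def defender_objective(x, c, A, b, delta_c, delta_A, delta_b, rho):
    """(c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf."""
    residual = np.maximum((A + delta_A) @ x - b - delta_b, 0.0)  # [.]_+, componentwise
    return (c + delta_c) @ x + rho * np.linalg.norm(residual, np.inf)
```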

Robustification of nonconvex problems

A general nonlinear program:

$$\underset{x\in S}{\text{minimize}}\ \ \theta(x)+f(x)\quad\text{subject to}\ \ H(x)=0$$

where

• $S$ is a closed convex set in $\mathbb{R}^n$, not subject to alterations;

• $\theta$ and $f$ are real-valued functions defined on an open convex set $O$ in $\mathbb{R}^n$ containing $S$;

• $H:O\to\mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints $g(x)\le 0$ and equality constraints $h(x)=0$.

Robust counterpart with separate perturbations:

$$\underset{x\in S}{\text{minimize}}\ \ \theta(x)+\underset{\delta\in\Delta}{\text{maximum}}\ f(x,\delta)\quad\text{subject to}\ \ \underbrace{H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}$$


New formulation

via a pessimistic value-function minimization: given $\gamma>0$,

$$\underset{x\in S}{\text{minimize}}\ \ \theta(x)+v_\gamma(x)$$

where $v_\gamma(x)$ models constraint violation with cost of model attacks:

$$v_\gamma(x)\ \triangleq\ \underset{\delta\in\Delta}{\text{maximum}}\ \ \varphi_\gamma(x,\delta)\ \triangleq\ f(x,\delta)+\gamma\,\underbrace{\|H(x,\delta)\|_\infty}_{\text{constraint residual}}\ -\ \underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:

• combined perturbation $\delta$ in a joint uncertainty set $\Delta\subseteq\mathbb{R}^L$;

• robustified constraint satisfaction via penalization.
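When $\Delta$ is replaced by a finite sample or grid, $v_\gamma$ can be evaluated by direct enumeration; this is a sketch with caller-supplied f, H, C (the exact-form results quoted in the comments below avoid such enumeration under the stated structures).

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    """Pessimistic value function over a finite subset `deltas` of Delta:
    max over delta of f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)."""
    return max(f(x, d) + gamma * np.linalg.norm(H(x, d), np.inf) - C(d)
               for d in deltas)
```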

Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of $f(\bullet,\delta)$ and $H(\bullet,\delta)$, a linear structure of $C(\delta)$ with a set-up component, and a special simplex-type uncertainty set $\Delta$.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation (a generic MM loop is sketched after these comments):

$$\underset{x\in S}{\text{minimize}}\ \ \varphi(x)\ \triangleq\ \theta(x)+v(x)\quad\text{with}\quad v(x)\ \triangleq\ \max_{1\le j\le J}\ \underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\text{maximum}}\ \bigl[\,g_j(x,\delta)-h_j(x)^\top\delta\,\bigr]$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.

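A generic majorization-minimization loop, as referenced above; the caller supplies the surrogate minimization step (e.g., a convex program built at the current iterate), so this is a structural sketch rather than the specific algorithm of the cited work.

```python
import numpy as np

def mm_minimize(x0, argmin_majorizer, objective, max_iter=100, tol=1e-8):
    """MM loop: minimize a convex upper surrogate that touches the objective
    at the current iterate; objective values decrease monotonically."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_next = argmin_majorizer(x)          # solve the surrogate subproblem at x
        if objective(x) - objective(x_next) < tol:
            return x_next
        x = x_next
    return x
```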

Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable $Z$ at levels $0\le\alpha<\beta\le 1$, treating two-sided estimation errors:

$$\text{In-CVaR}^{\beta}_{\alpha}(Z)\ \triangleq\ \frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)\,>\,z\,\ge\,\text{VaR}_\alpha(Z)} z\,dF_Z(z)\ =\ \frac{1}{\beta-\alpha}\Bigl[(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\Bigr]$$

Buffered probability of exceedance/deceedance at level $\tau>0$:

$$\text{bPOE}(Z,\tau)\ \triangleq\ 1-\min\bigl\{\alpha\in[0,1]\ :\ \text{CVaR}_\alpha(Z)\ge\tau\bigr\}$$

$$\text{bPOD}(Z,u)\ \triangleq\ \min\bigl\{\beta\in[0,1]\ :\ \text{In-CVaR}^{\beta}_{0}(Z)\ \ge\ u\bigr\}$$

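Empirically, these quantities reduce to tail averages of a sample; a minimal sketch (ignoring interpolation subtleties at the quantile; z is a numpy array):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the upper (1 - alpha)-tail."""
    return z[z >= np.quantile(z, alpha)].mean()

def in_cvar(z, alpha, beta):
    """Empirical In-CVaR via the identity on the slide:
    [(1 - alpha) CVaR_alpha - (1 - beta) CVaR_beta] / (beta - alpha)."""
    return ((1 - alpha) * cvar(z, alpha)
            - (1 - beta) * cvar(z, beta)) / (beta - alpha)
```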

Buffered probability of the interval $(\tau,u]$

We have the following bound for an interval probability:

$$\mathbb{P}(u\ge Z>\tau)\ \le\ \text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\ \triangleq\ \text{bPOI}^{u}_{\tau}(Z)$$

Two risk minimization problems of a composite random functional:

$$\underset{\theta\in\Theta}{\text{minimize}}\ \underbrace{\text{In-CVaR}^{\beta}_{\alpha}\bigl(Z(\theta)\bigr)}_{\substack{\text{a difference of two convex}\\ \text{SP value functions}}}\qquad\text{and}\qquad\underset{\theta\in\Theta}{\text{minimize}}\ \underbrace{\text{bPOI}^{u}_{\tau}\bigl(Z(\theta)\bigr)}_{\substack{\text{a stochastic program with}\\ \text{a difference-of-convex objective}}}$$

where

$$Z(\theta)\ \triangleq\ c\Bigl(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)\,-\,h(\bullet,z)\ \text{is difference-of-convex}}\Bigr)$$

with $c:\mathbb{R}\to\mathbb{R}$ being a univariate piecewise affine function, $Z$ an $m$-dimensional random vector, and, for each $z$, $g(\bullet,z)$ and $h(\bullet,z)$ both convex functions.
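Following the slide's definition, an empirical bPOE can be computed by scanning α and reusing the cvar() sketch above; since $\text{CVaR}_\alpha$ is nondecreasing in α, the first α whose CVaR reaches τ gives bPOE = 1 − α.

```python
import numpy as np

def bpoe(z, tau, grid=1001):
    """Empirical bPOE(Z, tau) = 1 - min{alpha in [0,1] : CVaR_alpha(Z) >= tau}."""
    for alpha in np.linspace(0.0, 1.0, grid):
        if cvar(z, alpha) >= tau:
            return 1.0 - alpha
    return 0.0          # even the sample maximum stays below tau
```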

Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over $\theta\in\Theta$ of

$$\underbrace{\underset{\eta_1\in\Upsilon_1}{\text{minimum}}\ \mathbb{E}\bigl[\varphi_1(\theta,\eta_1,\omega)\bigr]}_{\text{denoted }v_1(\theta)}\ -\ \underbrace{\underset{\eta_2\in\Upsilon_2}{\text{minimum}}\ \mathbb{E}\bigl[\varphi_2(\theta,\eta_2,\omega)\bigr]}_{\text{denoted }v_2(\theta)}$$

Both functions $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective (a DC-algorithm sketch follows these comments).

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.

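As referenced above, a difference-of-convex objective $v_1-v_2$ is typically attacked by linearizing the subtracted convex part at the current iterate and solving the resulting convex surrogate; a structural sketch (the caller supplies the convex subproblem solver and a subgradient oracle):

```python
import numpy as np

def dca(x0, argmin_linearized, subgrad_h, max_iter=100, tol=1e-8):
    """DC algorithm for min g(x) - h(x), with g and h convex: each iteration
    minimizes the convex majorizer g(x) - <subgrad_h(x_k), x>."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_next = np.asarray(argmin_linearized(subgrad_h(x)), dtype=float)
        if np.linalg.norm(x_next - x) < tol:
            break
        x = x_next
    return x
```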

A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

Relationship between the stationary points, from sharpest to weakest: SOSC ⇒ loc-min ⇒ SONC, and loc-min ⇒ directional stat (for dd fncs) ⇒ limiting stat ⇒ Clarke stat ⇒ critical (for dc fncs).

▶ Stationarity:
– smooth case: ∇V(x) = 0;
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of them is continuously differentiable.

▶ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute.
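A tiny numeric illustration of the gap between the concepts: for V(x) = −|x| at x = 0, the Clarke subdifferential is [−1, 1] ∋ 0, so 0 is Clarke stationary; yet V′(0; d) = −|d| < 0 in both directions, so 0 is not directionally stationary. A forward-difference check (exact here, since V is piecewise affine):

```python
def dir_deriv(V, x, d, t=1e-7):
    """One-sided directional derivative V'(x; d) by a forward difference."""
    return (V(x + t * d) - V(x)) / t

V = lambda x: -abs(x)                 # dc function; 0 lies in the Clarke subdiff at 0
for d in (1.0, -1.0):
    print(d, dir_deriv(V, 0.0, d))    # both print -1.0: descent in every direction
```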

Computational Tools

▶ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▶ A fusion of first-order and second-order algorithms:
– Dealing with non-convexity: majorization-minimization algorithm;
– Dealing with non-smoothness: nonsmooth Newton algorithm;
– Dealing with pathological constraints: (exact) penalty method.

▶ An integration of sampling and deterministic methods:
– Dealing with population optimization in statistics;
– Dealing with stochastic programming in operations research;
– Dealing with risk minimization in financial management and tail control.

Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic;

• Difference-of-convex: facilitates algorithmic design via convex majorization;

• Semismooth: enables fast algorithms of the Newton type;

• Bouligand differentiable: locally Lipschitz + directionally differentiable;

• Sub-analytic and semi-analytic.

* and many more

[Inclusion diagram from Section 4.4 of the monograph (p. 231), summarizing the classes above: PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable; piecewise affine (PA) ⊂ piecewise linear-quadratic (PLQ); dc, dc composite, diff-max-convex, and implicit convex-concave (icc) functions; (locally) strongly-weakly convex and locally dc functions (with C¹-, lower-C², and convex-domain qualifications); all within the locally Lipschitz + directionally differentiable class.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday, September 11, 2020)

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.1.9) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683

Bibliography 691
Index 721

Page 32: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 33: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative

bull Soft formulation of ASCs is possible leading to an objective of the kind

minimizeβ isinRd

f(β) + γ

msumi=1

bi minus dsumj=1

aij |βj |0

+

for a given γ gt 0

bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective

bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability: since bPOE conservatively upper-bounds the probability of exceedance and bPOD that of deceedance, inclusion-exclusion gives

\[
\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z,\tau) + \text{bPOD}(Z,u) - 1 \;\triangleq\; \text{bPOI}_{\tau}^{u}(Z)
\]

Two risk minimization problems of a composite random functional:

\[
\underset{\theta \in \Theta}{\text{minimize}} \;\; \text{In-CVaR}_{\alpha}^{\beta}(Z(\theta)) \quad \text{[a difference of two convex SP value functions]}
\]
and
\[
\underset{\theta \in \Theta}{\text{minimize}} \;\; \text{bPOI}_{\tau}^{u}(Z(\theta)) \quad \text{[a stochastic program with a difference-of-convex objective]}
\]

where

\[
Z(\theta) \,\triangleq\, c\big(\, g(\theta, Z) - h(\theta, Z) \,\big),
\]

with c : R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each realization z, g(•, z) and h(•, z) both convex functions, so that g(•, z) − h(•, z) is a difference-of-convex function.

Comments

• The In-CVaR minimization leads to the following stochastic program:

\[
\underset{\theta \in \Theta}{\text{minimize}} \;\; \underbrace{\min_{\eta_1 \in \Upsilon_1} \mathbb{E}\big[\varphi_1(\theta, \eta_1, \tilde{\omega})\big]}_{\text{denoted}\; v_1(\theta)} \;-\; \underbrace{\min_{\eta_2 \in \Upsilon_2} \mathbb{E}\big[\varphi_2(\theta, \eta_2, \tilde{\omega})\big]}_{\text{denoted}\; v_2(\theta)}
\]

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function (see the CVaR representation recalled below).

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in handling outliers in regression and classification problems.
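For context, the univariate value-function structure traces back to the classical Rockafellar-Uryasev representation of CVaR (a standard identity, recalled here rather than taken from the slide):

\[
\text{CVaR}_{\alpha}(Z) \;=\; \min_{\eta \in \mathbb{R}} \Big\{\, \eta + \tfrac{1}{1-\alpha}\, \mathbb{E}\big[(Z - \eta)_+\big] \,\Big\},
\]

so each of the two CVaR terms in the In-CVaR difference identity is, after scaling by 1−α or 1−β, the optimal value of a minimization over a single scalar η — precisely the value functions v1 and v2 above.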


A touch of mathematics: bridging computations with theory

First of all: what can be computed?

[Diagram: relationship between the stationary points — critical points (for dc functions), Clarke stationary and limiting stationary points, and directional stationary points (for dd functions) — together with second-order necessary conditions (SONC), local minima, and second-order sufficient conditions (SOSC).]

▸ Stationarity:
– smooth case: ∇V(x) = 0
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail],
e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of them is continuously differentiable

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder it is to compute; a one-line illustration follows.
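A standard example of the gap between the two concepts (chosen here for illustration, not taken from the slides): take V(x) = −|x| at x = 0. Then

\[
\partial V(0) = [-1, 1] \ni 0 \quad \text{(Clarke stationary)}, \qquad V'(0; d) = -|d| < 0 \;\; \text{for all } d \neq 0 \quad \text{(not directionally stationary)}.
\]

Indeed x = 0 is a local maximum of V, which Clarke stationarity fails to rule out but directional stationarity detects.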

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions

▸ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: the majorization-minimization algorithm
– dealing with non-smoothness: the nonsmooth Newton algorithm
– dealing with pathological constraints: the (exact) penalty method

▸ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics
– dealing with stochastic programming in operations research
– dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader (one standard representative of each class is sketched after this list):

• Piecewise smooth*, e.g., piecewise linear-quadratic

• Difference-of-convex: facilitates algorithmic design via convex majorization

• Semismooth: enables fast algorithms of Newton type

• Bouligand-differentiable: locally Lipschitz + directionally differentiable

• Sub-analytic and semi-analytic

* and many more
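For orientation, here is one textbook representative per class (standard examples chosen for illustration, not taken from the slides):

\[
\underbrace{\max(0, x)}_{\text{piecewise linear}}, \qquad \underbrace{|x| - |x - 1|}_{\text{difference-of-convex}}, \qquad \underbrace{\|x\|_2}_{\text{semismooth}}, \qquad \underbrace{x \mapsto \max(0, Ax + b)}_{\text{B(ouligand)-differentiable}}.
\]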

[Figure, from Section 4.4 (p. 231) of the monograph: an inclusion diagram of the classes of nonsmooth functions, relating piecewise affine (PA), piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), PC¹ and PC² functions; dc, dc-composite, and diff-max-convex functions; implicit convex-concave (icc) functions; (locally) weakly convex, locally dc, lower-C², C¹, and LC¹ functions; semismooth functions; and locally Lipschitz, directionally differentiable (B-differentiable) functions.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to the publisher on Friday, September 11, 2020

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
  1.1 Smooth Calculus 1
    1.1.1 Differentiability of functions of several variables 7
    1.1.2 Mean-value theorems for vector-valued functions 12
    1.1.3 Fixed-point theorems and contraction 16
    1.1.4 Implicit-function theorems 18
  1.2 Matrices and Linear Equations 23
    1.2.1 Matrix factorizations, eigenvalues, and singular values 29
    1.2.2 Selected matrix classes 36
  1.3 Matrix Splitting Methods for Linear Equations 47
  1.4 The Conjugate Gradient Method 50
  1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
  2.1 Polyhedral Theory 59
    2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
    2.1.2 Properties of polyhedra 63
  2.2 Convex Sets and Functions 67
    2.2.1 Euclidean projection and proximality 71
  2.3 Convex Quadratic and Nonlinear Programs 79
    2.3.1 The KKT conditions and CQs 81
  2.4 The Minmax Theory of von Neumann 85
  2.5 Basic Descent Methods for Smooth Problems 86
    2.5.1 Unconstrained problems: Armijo line search 87
    2.5.2 The PGM for convex programs 91
    2.5.3 The proximal Newton method 95
  2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
  3.1 A Composite View of Statistical Estimation 111
    3.1.1 The optimization problems 113
    3.1.2 Learning classes 115
    3.1.3 Loss functions 122
    3.1.4 Regularizers for sparsity 123
  3.2 Low Rankness in Matrix Optimization 128
    3.2.1 Rank constraints: Hard and soft formulations 128
    3.2.2 Shape constrained index models 131
  3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
  4.1 B(ouligand)-Differentiable Functions 144
    4.1.1 Generalized directional derivatives 153
  4.2 Second-order Directional Derivatives 156
  4.3 Subdifferentials and Regularity 162
  4.4 Classes of Nonsmooth Functions 176
    4.4.1 Piecewise affine functions 177
    4.4.2 Piecewise linear-quadratic functions 190
    4.4.3 Weakly convex functions 205
    4.4.4 Difference-of-convex functions 214
    4.4.5 Semismooth functions 221
    4.4.6 Implicit convex-concave functions 229
    4.4.7 A summary 240

5 Value Functions 243
  5.1 Nonlinear Programming based Value Functions 244
    5.1.1 Interdiction versus enhancement 247
  5.2 Value Functions of Quadratic Programs 251
  5.3 Value Functions Associated with Convex Functions 258
  5.4 Robustification of Optimization Problems 261
    5.4.1 Attacks without set-up costs 264
    5.4.2 Attacks with set-up costs 267
  5.5 The Danskin Property 271
  5.6 Gap Functions 278
  5.7 Some Nonconvex Stochastic Value Functions 282
  5.8 Understanding Robust Optimization 289
    5.8.1 Uncertainty in data realizations 290
    5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
  6.1 First-order Theory 313
    6.1.1 The convex constrained case 314
    6.1.2 Composite ε-strong stationarity 320
    6.1.3 Subdifferential based stationarity 324
  6.2 Directional Saddle-Point Theory 332
  6.3 Difference-of-Convex Constraints 337
    6.3.1 Extension to piecewise dd-convex programs 345
  6.4 Second-order Theory 348
  6.5 Piecewise Linear-Quadratic Programs 350
    6.5.1 Review of nonconvex QPs 350
    6.5.2 Return to PLQ programs 352
    6.5.3 Finiteness of stationary values 362
    6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
  7.1 Basic Surrogation 372
    7.1.1 A broad family of upper surrogation functions 379
    7.1.2 Examples of surrogation families 381
    7.1.3 Surrogation algorithms for several problem classes 384
  7.2 Refinements and Extensions 392
    7.2.1 Moreau surrogation 392
    7.2.2 Combined BCD and ADMM for coupled constraints 400
    7.2.3 Surrogation with line searches 418
    7.2.4 Dinkelbach's algorithm with surrogation 424
    7.2.5 Randomized choice of subproblems 432
    7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
  8.1 An Introduction to Error Bounds 442
    8.1.1 The stationarity inclusions 444
  8.2 The Piecewise Polyhedral Inclusion 449
  8.3 Error Bounds for Subdifferential-Based Stationarity 454
  8.4 Sequential Convergence Under Sufficient Descent 462
    8.4.1 Using the local error bound (8.19) 462
    8.4.2 The Kurdyka-Łojasiewicz theory 467
    8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
  8.5 Error Bounds and Linear Regularity 476
    8.5.1 Piecewise systems 483
    8.5.2 Stationarity for convex composite dc optimization 488
    8.5.3 dd-convex systems 491
    8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
  9.1 Exact Penalization of Optimal Solutions 503
  9.2 Exact Penalization of Stationary Solutions 506
  9.3 Pulled-out Penalization of Composite Functions 511
  9.4 Two Applications of Pull-out Penalization 517
    9.4.1 B-stationarity of dc composite diff-max programs 518
    9.4.2 Local optimality of composite twice semidiff programs 521
  9.5 Exact Penalization of Affine Sparsity Constraints 527
  9.6 A Special Class of Cardinality Constrained Problems 531
  9.7 Penalized Surrogation for Composite Programs 533
    9.7.1 Solving the convex penalized subproblems 539
    9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
  10.1 Expectation Functions 548
    10.1.1 Sample average approximations 555
  10.2 Basic Stochastic Surrogation 560
    10.2.1 Expectation constraints 567
  10.3 Two-stage SPs with Quadratic Recourse 568
    10.3.1 The positive definite case 569
    10.3.2 The positive semidefinite case 571
  10.4 Probabilistic Error Bounds 588
  10.5 Consistency of Directional Stationary Solutions 591
    10.5.1 Convergence of stationary solutions 592
    10.5.2 Asymptotic distribution of the stationary values 605
    10.5.3 Convergence rate of the stationary points 609
  10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
  11.1 Definitions and Preliminaries 627
    11.1.1 Applications 631
    11.1.2 Related models 637
  11.2 Existence of a QNE 640
  11.3 Computational Algorithms 648
    11.3.1 A Gauss-Seidel best-response algorithm 649
  11.4 The Role of Potential 658
    11.4.1 Existence of a potential 665
  11.5 Double-Loop Algorithms 669
  11.6 The Nikaido-Isoda Optimization Formulations 678
    11.6.1 Surrogation applied to the NI function φ_c 681
    11.6.2 Difference-of-convexity and surrogation of φ̂_c 683

Bibliography 691
Index 721

Page 34: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

minimizeθ=

(W `w`)

L

`=1

1

S

Ssums=1

[ys minus fθ(xs)

]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation

fθ(x) =(WL bull+wL

)︸ ︷︷ ︸Output Layer

middot middot middot max(

0W 2 bull+w2)

︸ ︷︷ ︸Second Layer

max(

0W 1x+ w1)

︸ ︷︷ ︸Input Layer

linearoutput middot middot middot max linear max linear input

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 35: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρ

Lsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 36: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

• Problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ1 penalization (a toy evaluation is sketched after this slide):

$$\operatorname*{minimize}_{z \in Z,\, u}\; \zeta_\rho(z,u) \,\triangleq\, \frac{1}{S}\sum_{s=1}^{S} \varphi_s\!\left(u^{s,L}\right) \;+\; \rho \sum_{s=1}^{S} \sum_{\ell=1}^{L} \sum_{j=1}^{N} \underbrace{\left|\, u^{s,\ell}_j - \psi_{\ell j}\!\left(z^{\ell}, u^{s,\ell-1}\right) \right|}_{\text{penalizing the components}}$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• Framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
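To make the penalized objective concrete, here is a minimal numerical sketch; the layer maps ψ, the per-sample losses φ_s, and the array shapes are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch: evaluating the l1-penalized objective zeta_rho for a
# toy L-layer recursion u^{s,l} = psi(z^l, u^{s,l-1}); all names are assumed.
import numpy as np

def zeta_rho(z, u, x0, rho, phi, psi):
    """z: list of L parameter blocks; u: array of shape (S, L, N);
    x0: inputs of shape (S, N); phi(s, uL): per-sample loss on the last layer;
    psi(zl, u_prev): layer map returning an array of shape (N,)."""
    S, L, N = u.shape
    loss = np.mean([phi(s, u[s, -1]) for s in range(S)])
    penalty = 0.0
    for s in range(S):
        prev = x0[s]
        for l in range(L):
            # l1 residual of the layer equation u^{s,l} = psi(z^l, u^{s,l-1})
            penalty += np.abs(u[s, l] - psi(z[l], prev)).sum()
            prev = u[s, l]
    return loss + rho * penalty
```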


Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation
ω2: uncertain demand of electricity

x: production from traditional thermal plants

y: production from backup fast-ramping generators

$$\operatorname*{minimize}_{x \in X}\;\; \varphi(x) + \mathbb{E}_{\omega_1,\omega_2}\big[\, \psi(x,\omega_1,\omega_2) \,\big]$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2) \,\triangleq\, \operatorname*{minimum}_{y}\;\; y^\top c\big(\, [\,\omega_2 - x - \omega_1\,]_+ \big) \quad \text{subject to} \quad A(\omega)\,x + D\,y \,\ge\, \xi(\omega)$$

Unit cost of fast-ramping generators depends on

▸ observed renewable production & demand of electricity

▸ shortfall due to the first-stage thermal power decisions

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$\operatorname*{minimize}_{x}\;\; \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_{\xi}\big[\, \psi(x,\xi) \,\big] \quad \text{subject to} \quad x \in X \subseteq \mathbb{R}^{n_1},$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP) below, with Q symmetric positive semidefinite:

$$\psi(x,\xi) \,\triangleq\, \operatorname*{minimum}_{y}\;\; \big[\, f(\xi) + G(\xi)\,x \,\big]^\top y + \tfrac{1}{2}\, y^\top Q\, y \quad \text{subject to} \quad y \in Y(x,\xi) \,\triangleq\, \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi)\,x + D\,y \ge b(\xi) \,\big\}$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a regularization-convexification-sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee (a toy SAA evaluation is sketched after this slide).
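As a toy illustration of the structure (not the authors' regularization-convexification method), the QP recourse and a sample average of ζ can be evaluated with an off-the-shelf convex solver; the data-generating names below are assumptions.

```python
# Sketch: SAA of zeta(x) = phi(x) + E[psi(x, xi)] with QP recourse (Q psd).
import numpy as np
import cvxpy as cp

def psi(x, f, G, A, D, b, Q):
    """Optimal value of: min_y (f + G x)^T y + 0.5 y^T Q y, s.t. A x + D y >= b."""
    y = cp.Variable(Q.shape[0])
    objective = (f + G @ x) @ y + 0.5 * cp.quad_form(y, Q)
    problem = cp.Problem(cp.Minimize(objective), [A @ x + D @ y >= b])
    problem.solve()
    return problem.value

def zeta_saa(x, phi, scenarios):
    """scenarios: sampled tuples (f, G, A, D, b, Q) standing in for xi."""
    return phi(x) + np.mean([psi(x, *sc) for sc in scenarios])
```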


(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize $c^\top x$ subject to $Ax = b$ and $x \ge 0$.

Defender's problem: Anticipating the attacker's disruption $\delta \in \Delta$, the defender undertakes the operation by solving the problem

$$\operatorname*{minimize}_{x \ge 0}\;\; (c+\delta_c)^\top x + \rho\, \underbrace{\big\| \big[\, (A+\delta_A)\,x - b - \delta_b \,\big]_+ \big\|_\infty}_{\text{constraint residual}}$$

Attacker's problem: Anticipating the defender's activity $x$, the attacker aims to disrupt the defender's operations by solving the problem

$$\operatorname*{maximum}_{\delta \,\in\, \Delta}\;\; \lambda_1 \underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}} \;+\; \lambda_2 \underbrace{\big\| \big[\, (A+\delta_A)\,x - b - \delta_b \,\big]_+ \big\|_\infty}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.
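For concreteness, the defender's penalized objective above is cheap to evaluate for a given attack; a minimal sketch with assumed shapes (not tied to any particular implementation):

```python
# Sketch: (c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf
import numpy as np

def defender_objective(x, c, A, b, dc, dA, db, rho):
    residual = (A + dA) @ x - b - db        # constraint residual under attack
    violation = np.maximum(residual, 0.0)   # positive part [.]_+
    return (c + dc) @ x + rho * np.linalg.norm(violation, np.inf)
```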

Robustification of nonconvex problems

A general nonlinear program:

$$\operatorname*{minimize}_{x \in S}\;\; \theta(x) + f(x) \quad \text{subject to} \quad H(x) = 0,$$

where

• S is a closed convex set in $\mathbb{R}^n$, not subject to alterations;

• θ and f are real-valued functions defined on an open convex set O in $\mathbb{R}^n$ containing S;

• H : O → $\mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\operatorname*{minimize}_{x \in S}\;\; \theta(x) + \operatorname*{maximum}_{\delta \,\in\, \Delta}\, f(x,\delta) \quad \text{subject to} \quad \underbrace{H(x,\delta) = 0 \;\;\forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}$$


New formulation

via a pessimistic value-function minimization: given $\gamma > 0$,

$$\operatorname*{minimize}_{x \in S}\;\; \theta(x) + v_\gamma(x)$$

where $v_\gamma(x)$ models constraint violation together with the cost of model attacks:

$$v_\gamma(x) \,\triangleq\, \operatorname*{maximum}_{\delta \,\in\, \Delta}\;\; \varphi_\gamma(x,\delta) \,\triangleq\, f(x,\delta) + \gamma\, \underbrace{\big\| H(x,\delta) \big\|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:

• combined perturbation δ in a joint uncertainty set $\Delta \subseteq \mathbb{R}^L$;

• robustified constraint satisfaction via penalization.
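When ∆ is approximated by a finite set of attack scenarios, $v_\gamma$ reduces to a plain maximum; a hedged sketch (f, H, C are assumed callables, not part of the original formulation):

```python
# Sketch: v_gamma(x) over a finite scenario set standing in for Delta.
import numpy as np

def v_gamma(x, scenarios, f, H, C, gamma):
    """scenarios: iterable of attack vectors delta; H(x, d) returns a vector."""
    return max(f(x, d) + gamma * np.linalg.norm(H(x, d), np.inf) - C(d)
               for d in scenarios)
```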

Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.

• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation (a schematic MM loop is sketched after this slide):

$$\operatorname*{minimize}_{x \in S}\;\; \varphi(x) \,\triangleq\, \theta(x) + v(x) \quad \text{with} \quad v(x) \,\triangleq\, \max_{1 \le j \le J}\; \operatorname*{maximum}_{\delta \,\in\, \Delta \cap \mathbb{R}^L_+}\, \big[\, g_j(x,\delta) - h_j(x)^\top \delta \,\big]$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
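A generic MM loop of the kind referred to above, sketched under the assumption that `solve_majorant(x)` minimizes over S a convex upper model of φ = θ + v that is tight at x; this is a schematic, not the paper's implementation.

```python
# Schematic majorization-minimization: each step minimizes a convex majorant
# of phi that touches phi at the current iterate, so phi decreases monotonically.
def mm_loop(x0, phi, solve_majorant, max_iter=200, tol=1e-8):
    x = x0
    for _ in range(max_iter):
        x_next = solve_majorant(x)            # convex subproblem built at x
        if abs(phi(x) - phi(x_next)) <= tol:  # stagnation of objective values
            return x_next
        x = x_next
    return x
```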


Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels $0 \le \alpha < \beta \le 1$, treating two-sided estimation errors:

$$\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z) \,\triangleq\, \frac{1}{\beta-\alpha} \int_{\text{VaR}_\beta(Z) \,>\, z \,\ge\, \text{VaR}_\alpha(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta-\alpha} \Big[\, (1-\alpha)\,\text{CVaR}_\alpha(Z) - (1-\beta)\,\text{CVaR}_\beta(Z) \,\Big]$$

Buffered probability of exceedance/deceedance at level $\tau > 0$:

$$\text{bPOE}(Z,\tau) \,\triangleq\, 1 - \min\big\{\, \alpha \in [0,1] : \text{CVaR}_\alpha(Z) \ge \tau \,\big\}, \qquad \text{bPOD}(Z,u) \,\triangleq\, \min\big\{\, \beta \in [0,1] : \text{In-CVaR}^{\,\beta}_{\,0}(Z) \ge u \,\big\}$$
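These tail functionals are easy to estimate on a finite sample; a small sketch with assumed empirical conventions (CVaR_α as the mean of the upper (1−α)-tail, and a grid scan for the minimization defining bPOE):

```python
# Sketch: empirical CVaR, In-CVaR (via the identity above), and bPOE.
import numpy as np

def cvar(z, alpha):
    """z: 1-D numpy array of sample values."""
    var = np.quantile(z, alpha)        # empirical VaR_alpha
    return z[z >= var].mean()          # mean of the upper (1-alpha)-tail

def in_cvar(z, alpha, beta):
    return ((1 - alpha) * cvar(z, alpha)
            - (1 - beta) * cvar(z, beta)) / (beta - alpha)

def bpoe(z, tau):
    alphas = np.linspace(0.0, 0.999, 1000)
    feasible = [a for a in alphas if cvar(z, a) >= tau]
    return 1.0 - min(feasible) if feasible else 0.0
```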


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

$$\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z,\tau) + \text{bPOD}(Z,u) - 1 \;\triangleq\; \text{bPOI}^{\,u}_{\,\tau}(Z)$$

Two risk minimization problems of a composite random functional:

$$\operatorname*{minimize}_{\theta \in \Theta}\; \underbrace{\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z(\theta))}_{\substack{\text{a difference of two convex}\\ \text{SP value functions}}} \qquad \text{and} \qquad \operatorname*{minimize}_{\theta \in \Theta}\; \underbrace{\text{bPOI}^{\,u}_{\,\tau}(Z(\theta))}_{\substack{\text{a stochastic program with a}\\ \text{difference-of-convex objective}}}$$

where

$$Z(\theta) \,\triangleq\, c\big(\, \underbrace{g(\theta,Z) - h(\theta,Z)}_{g(\bullet,\,z) \,-\, h(\bullet,\,z)\ \text{is difference-of-convex}} \,\big)$$

with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.

Comments

• The In-CVaR minimization leads to the following stochastic program:

minimize over $\theta \in \Theta$ of

$$\underbrace{\operatorname*{minimum}_{\eta_1 \in \Upsilon_1}\; \mathbb{E}\big[\, \varphi_1(\theta,\eta_1,\omega) \,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\operatorname*{minimum}_{\eta_2 \in \Upsilon_2}\; \mathbb{E}\big[\, \varphi_2(\theta,\eta_2,\omega) \,\big]}_{\text{denoted } v_2(\theta)}$$

Both functions $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; the overall objective is thus a difference-of-convex function.

• Developed a convex-programming-based sampling method for computing a critical point of the objective (a one-step DC sketch follows this slide).

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
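A one-step sketch of the DC iteration behind such a method; the oracles `subgrad_v2` and `solve_convex` are assumptions (e.g. built from sample averages), and this is not the paper's algorithm verbatim.

```python
# Schematic DC step for minimizing v1(theta) - v2(theta) over Theta:
# linearize the concave part at the current point, then solve the convex model.
def dc_step(theta, subgrad_v2, solve_convex):
    g2 = subgrad_v2(theta)          # some element of the subdifferential of v2
    # returns argmin over Theta of v1(t) - ( v2(theta) + g2 . (t - theta) )
    return solve_convex(g2, theta)
```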


A touch of mathematics: bridging computations with theory

First of All: What Can be Computed?

[Diagram: the hierarchy of computable points, from the broadest to the sharpest notion: critical (for dc functions), Clarke stationary, limiting stationary, directional stationary (for dd functions); shown alongside SONC, local minimum, and SOSC.]

Relationship between the stationary points

▸ Stationarity:
– smooth case: $\nabla V(x) = 0$;
– nonsmooth case: $0 \in \partial V(x)$ [subdifferential; many calculus rules fail], e.g. $\partial V_1(x) + \partial V_2(x) \ne \partial(V_1+V_2)(x)$ unless one of them is continuously differentiable (a concrete instance follows).

▸ Directional stationarity: $V'(x; d) \ge 0$ for all directions $d$.

• The sharper the concept, the harder to compute.
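A standard one-dimensional instance of the failed sum rule referred to above: take $V_1(x) = |x|$ and $V_2(x) = -|x|$. Then $\partial V_1(0) = \partial V_2(0) = [-1,1]$ (Clarke), so $\partial V_1(0) + \partial V_2(0) = [-2,2]$, whereas $V_1 + V_2 \equiv 0$ gives $\partial(V_1+V_2)(0) = \{0\}$.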

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:

– Dealing with non-convexity: majorization-minimization algorithm

– Dealing with non-smoothness: nonsmooth Newton algorithm

– Dealing with pathological constraints: (exact) penalty method

▸ An integration of sampling and deterministic methods:

– Dealing with population optimization in statistics

– Dealing with stochastic programming in operations research

– Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications.

Progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic

• Difference-of-convex: facilitates algorithmic design via convex majorization (see the sketch after this list)

• Semismooth: enables fast algorithms of the Newton type

• Bouligand differentiable: locally Lipschitz + directionally differentiable

• Sub-analytic and semi-analytic

* and many more
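For instance, the convex majorization behind the difference-of-convex class: if $f = g - h$ with $g, h$ convex, then for any $\bar{x}$ and any $\xi \in \partial h(\bar{x})$,

$$\widehat{f}(x;\bar{x}) \,\triangleq\, g(x) - h(\bar{x}) - \xi^\top (x - \bar{x}) \;\ge\; f(x) \quad \text{for all } x, \qquad \widehat{f}(\bar{x};\bar{x}) = f(\bar{x}),$$

by the subgradient inequality for $h$; each surrogation step thus minimizes a convex upper model that is tight at the current point.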

[Diagram from Section 4.4 (p. 231) of the monograph: an inclusion map of the nonsmooth function classes (PC², PC¹, semismooth, B-differentiable; PA, PLQ, PQ; dc, dc composite, diff-max-convex, icc, locally dc; strongly weakly convex, locally strongly weakly convex; locally Lipschitz + directionally differentiable), annotated with the qualifications (e.g. C¹, LC¹, lower-C², open or closed convex domain) under which the inclusions hold.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday, September 11, 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683
Bibliography 691
Index 721

Page 37: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 38: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Comments

bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers

bull Note the `1 penalization

minimizezisinZu

ζρ(zu) 1

S

Ssums=1

ϕs(u

sL) + ρLsum`=1

Nsumj=1

∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸

penalizing the components

bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed

bull Framework allows the treatment of data perturbation

bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems

Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω1 uncertain renewable energy generationω2 uncertain demand of electricity

x production from traditional thermal plants

y production from backup fast-ramping generators

minimizexisinX

ϕ(x) + Eω1ω2

[ψ(x ω1 ω2)

]where the recourse function is

ψ(x ω1 ω2) minimumy

ygtc([ω2 minus xminus ω1]+

)subject to A(ω)x+Dy ge ξ(ω)

Unit cost of fast-ramping generators depends on

I observed renewable production amp demand of electricity

I shortfall due to the first-stage thermal power decisions

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

▶ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions

▶ A fusion of first-order and second-order algorithms
– Dealing with non-convexity: majorization-minimization algorithm
– Dealing with non-smoothness: nonsmooth Newton algorithm
– Dealing with pathological constraints: (exact) penalty method (toy sketch below)

▶ An integration of sampling and deterministic methods
– Dealing with population optimization in statistics
– Dealing with stochastic programming in operations research
– Dealing with risk minimization in financial management and tail control
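A minimal sketch of the (exact) $\ell_1$ penalty method on a toy equality-constrained problem (my example; the penalty is exact once $\rho$ exceeds the multiplier magnitude, here $|\lambda| = 2$):

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2.0)**2 + (x[1] - 1.0)**2   # smooth objective
h = lambda x: x[0] + x[1] - 1.0                   # equality constraint

rho = 10.0                                        # rho > |lambda| = 2: exact
pen = lambda x: f(x) + rho * abs(h(x))            # nonsmooth ell_1 penalty
sol = minimize(pen, x0=np.zeros(2), method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-10})
print(sol.x, h(sol.x))   # ~ (1, 0) with h ~ 0: the constrained minimizer
```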

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more
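For concreteness, a few standard one-dimensional members of these classes (textbook examples, not taken from the slides):

```latex
\begin{align*}
|x| = \max\{x,-x\} &\quad \text{piecewise affine, convex, hence dc} \\
\max\{0,x\}^{2} &\quad \text{piecewise linear-quadratic, } C^{1} \text{ but not } C^{2} \\
\max\{x,-x\} - \max\{2x,-x\} &\quad \text{difference of convex piecewise affine functions} \\
\|x\|_{2},\;\; x \in \mathbb{R}^{n} &\quad \text{semismooth; locally Lipschitz and directionally} \\
&\qquad \text{differentiable, hence B(ouligand)-differentiable}
\end{align*}
```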

[Figure (reproduced from Section 4.4, p. 231 of the monograph): an inclusion diagram relating the classes — piecewise affine (PA), piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), PC¹ and PC² piecewise smooth; dc, dc-composite, diff-max-convex, implicit convex-concave (icc), strongly weakly-convex, locally dc, lower-C², LC¹, C¹; semismooth and B-differentiable, all within locally Lipschitz + directionally differentiable — with the arrows qualified by smoothness and open/closed convex-domain conditions.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday, September 11, 2020)

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations: eigenvalues and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier–Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.1.9) 462
8.4.2 The Kurdyka–Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido–Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683

Bibliography 691
Index 721


Comments

• Problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the $\ell_1$ penalization:

$\displaystyle\operatorname*{minimize}_{z\in Z,\,u}\;\; \zeta_\rho(z,u) \,\triangleq\, \frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^s_L\big) \;+\; \rho \sum_{s=1}^{S}\sum_{\ell=1}^{L}\sum_{j=1}^{N}\; \underbrace{\Big|\, u^s_{\ell j} - \psi_{\ell j}\big(z_\ell,\, u^s_{\ell-1}\big) \Big|}_{\text{penalizing the components}}$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• Framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems (a schematic sketch of the penalized objective follows).
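A schematic sketch of assembling $\zeta_\rho$ for a small layered model. The layer maps $\psi_{\ell j}$, the shapes, and the loss below are hypothetical stand-ins; the slide leaves them abstract:

```python
import numpy as np

def psi(z_l, u_prev):
    # hypothetical layer map (a ReLU layer with weight matrix z_l)
    return np.maximum(z_l @ u_prev, 0.0)

def zeta_rho(z, u, phi, rho):
    """z: list of L layer weights; u[s][l]: surrogate output of layer l for
    sample s (u[s][0] is the input); phi(s, uL): loss on the last layer.
    The ell_1 term charges the residual of each layer equation
    u_l = psi_l(z_l, u_{l-1}) instead of imposing it as a hard constraint."""
    S, L = len(u), len(z)
    loss = np.mean([phi(s, u[s][L]) for s in range(S)])
    pen = sum(np.abs(u[s][l] - psi(z[l - 1], u[s][l - 1])).sum()
              for s in range(S) for l in range(1, L + 1))
    return loss + rho * pen

rng = np.random.default_rng(0)
z = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]             # L = 2
u = [[rng.normal(size=3), rng.normal(size=4), rng.normal(size=2)]  # S = 5
     for _ in range(5)]
print(zeta_rho(z, u, lambda s, uL: (uL ** 2).sum(), rho=10.0))
```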

Examples of Non-Problems

II. Smart Planning

Power Systems Planning

$\omega_1$: uncertain renewable energy generation; $\omega_2$: uncertain demand of electricity
$x$: production from traditional thermal plants
$y$: production from backup fast-ramping generators

$\displaystyle\operatorname*{minimize}_{x\in X}\;\; \varphi(x) + \mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big]$

where the recourse function is

$\psi(x,\omega_1,\omega_2) \,\triangleq\, \operatorname*{minimum}_{y}\;\; y^\top c\big([\,\omega_2 - x - \omega_1\,]_+\big)$ subject to $A(\omega)x + Dy \ge \xi(\omega)$

Unit cost of fast-ramping generators depends on:
▶ observed renewable production & demand of electricity
▶ shortfall due to the first-stage thermal power decisions
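To make the recourse evaluation concrete, a minimal sample-average sketch (all data here — $A(\omega) = I$, $D = I$, $\xi(\omega) = \omega_2 - \omega_1$, and the affine unit-cost rule — are toy stand-ins for the slide's abstract model):

```python
import numpy as np
from scipy.optimize import linprog

def recourse(x, w1, w2):
    # scenario-dependent unit cost: grows with the shortfall [w2 - x - w1]_+
    shortfall = np.maximum(w2 - x - w1, 0.0)
    c = 1.0 + 2.0 * shortfall
    # constraints x + w1 + y >= w2, i.e. -y <= x + w1 - w2, with y >= 0
    res = linprog(c, A_ub=-np.eye(x.size), b_ub=x + w1 - w2,
                  bounds=[(0, None)] * x.size)
    return res.fun

x = np.array([0.6, 0.8])                       # first-stage thermal decision
rng = np.random.default_rng(1)
samples = [(rng.uniform(0.0, 0.5, 2), rng.uniform(0.5, 1.5, 2))
           for _ in range(200)]
saa = np.mean([recourse(x, w1, w2) for w1, w2 in samples])
print(saa)   # sample-average estimate of E[psi(x, omega1, omega2)]
```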

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$\displaystyle\operatorname*{minimize}_{x}\;\; \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_\xi\big[\psi(x,\xi)\big]$ subject to $x \in X \subseteq \mathbb{R}^{n_1}$,

where the recourse function $\psi(x,\xi)$ is the optimal objective value of the quadratic program (QP), with $Q$ symmetric positive semidefinite:

$\psi(x,\xi) \,\triangleq\, \operatorname*{minimum}_{y}\;\; \big[\,f(\xi) + G(\xi)x\,\big]^\top y + \tfrac{1}{2}\, y^\top Q y$ subject to $y \in Y(x,\xi) \,\triangleq\, \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \ge b(\xi) \,\big\}$

• The recourse function $\psi(\bullet,\xi)$ is of the implicit convex-concave kind (a toy evaluation of $\psi$ follows below).

• Developed a regularization + convexification + sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
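A small sketch of the recourse value $\psi(x,\xi)$ for a single scenario (toy data; SLSQP is used as a stand-in convex-QP solver):

```python
import numpy as np
from scipy.optimize import minimize

Q = np.array([[2.0, 0.0], [0.0, 0.0]])   # symmetric PSD, rank-deficient

def psi(x, f_xi, G_xi, A_xi, D, b_xi):
    # convex QP in y defining the recourse value for one scenario xi
    q = f_xi + G_xi @ x                   # linear term is affine in x
    obj = lambda y: q @ y + 0.5 * (y @ Q @ y)
    cons = {"type": "ineq", "fun": lambda y: A_xi @ x + D @ y - b_xi}
    return minimize(obj, np.zeros(2), method="SLSQP", constraints=[cons]).fun

val = psi(np.array([1.0]),
          f_xi=np.array([1.0, 1.0]), G_xi=np.array([[0.5], [0.2]]),
          A_xi=np.array([[1.0], [1.0]]), D=np.eye(2),
          b_xi=np.array([2.0, 2.0]))
print(val)   # 3.7, attained at y* = (1, 1) on the boundary of Y(x, xi)
```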


(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize $c^\top x$ subject to $Ax = b$ and $x \ge 0$.

Defender's problem: Anticipating the attacker's disruption $\delta \in \Delta$, the defender undertakes the operation by solving the problem

$\displaystyle\operatorname*{minimize}_{x\ge 0}\;\; (c+\delta_c)^\top x \;+\; \rho\, \underbrace{\big\|\, [\,(A+\delta_A)x - b - \delta_b\,]_+ \big\|_\infty}_{\text{constraint residual}}$

Attacker's problem: Anticipating the defender's activity $x$, the attacker aims to disrupt the defender's operations by solving the problem

$\displaystyle\operatorname*{maximum}_{\delta\in\Delta}\;\; \lambda_1 \underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}} \;+\; \lambda_2 \underbrace{\big\|\, [\,(A+\delta_A)x - b - \delta_b\,]_+ \big\|_\infty}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}$

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program:

$\displaystyle\operatorname*{minimize}_{x\in S}\;\; \theta(x) + f(x)$ subject to $H(x) = 0$, where

• $S$ is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• $\theta$ and $f$ are real-valued functions defined on an open convex set $O$ in $\mathbb{R}^n$ containing $S$;
• $H : O \to \mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints $g(x) \le 0$ and equality constraints $h(x) = 0$.

Robust counterpart with separate perturbations:

$\displaystyle\operatorname*{minimize}_{x\in S}\;\; \theta(x) + \operatorname*{maximum}_{\delta\in\Delta}\; f(x,\delta)$ subject to $\underbrace{H(x,\delta) = 0 \;\;\forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}$


New formulation

via a pessimistic value-function minimization: given $\gamma > 0$,

$\displaystyle\operatorname*{minimize}_{x\in S}\;\; \theta(x) + v_\gamma(x)$

where $v_\gamma(x)$ models constraint violation together with the cost of model attacks:

$v_\gamma(x) \,\triangleq\, \operatorname*{maximum}_{\delta\in\Delta}\;\; \varphi_\gamma(x,\delta) \,\triangleq\, f(x,\delta) \;+\; \gamma\, \underbrace{\|H(x,\delta)\|_\infty}_{\text{constraint residual}} \;-\; \underbrace{C(\delta)}_{\text{attack cost}}$

with the novel features:
• combined perturbation $\delta$ in a joint uncertainty set $\Delta \subseteq \mathbb{R}^L$
• robustified constraint satisfaction via penalization (a toy evaluation sketch follows)
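A schematic evaluation of the pessimistic value function over a finite sample of $\Delta$ (the functions f, H, C below are toy stand-ins; the slide leaves them abstract):

```python
import numpy as np

f = lambda x, d: (x @ x) * (1.0 + d[0])            # toy perturbed objective
H = lambda x, d: np.array([x.sum() - 1.0 - d[1]])  # toy perturbed constraint
C = lambda d: 5.0 * np.abs(d).sum()                # attack cost

def v_gamma(x, deltas, gamma=10.0):
    # maximize the penalized attack payoff over the sampled joint Delta
    vals = [f(x, d) + gamma * np.abs(H(x, d)).max() - C(d) for d in deltas]
    return max(vals)

rng = np.random.default_rng(2)
Delta = rng.uniform(-0.1, 0.1, size=(500, 2))      # sampled joint Delta
x = np.array([0.5, 0.5])
print(v_gamma(x, Delta))   # pessimistic penalized value at x
```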

Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of $f(\bullet,\delta)$ and $H(\bullet,\delta)$, a linear structure of $C(\delta)$ with a set-up component, and a special simplex-type uncertainty set $\Delta$.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

$\displaystyle\operatorname*{minimize}_{x\in S}\;\; \varphi(x) \,\triangleq\, \theta(x) + v(x)$ with $v(x) \,\triangleq\, \max_{1\le j\le J}\; \operatorname*{maximum}_{\delta\in\Delta\cap\mathbb{R}^L_+}\; \big[\, g_j(x,\delta) - h_j(x)^\top \delta \,\big]$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.






under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 42: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x \,\in\, X \,\subseteq\, \mathbb{R}^{n_1}}{\text{minimize}} \quad \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_{\xi}\big[\, \psi(x,\xi) \,\big],$$

where the recourse function $\psi(x,\xi)$ is the optimal objective value of the quadratic program (QP) below, with $Q$ symmetric positive semidefinite:

$$\psi(x,\xi) \,\triangleq\, \underset{y}{\text{minimum}} \;\; \big[\, f(\xi) + G(\xi)\,x \,\big]^{\top} y + \tfrac{1}{2}\, y^{\top} Q\, y \quad \text{subject to } y \in Y(x,\xi) \,\triangleq\, \big\{\, y \in \mathbb{R}^{n_2} \;\big|\; A(\xi)\,x + D\,y \ge b(\xi) \,\big\}.$$

• The recourse function $\psi(\bullet,\xi)$ is of the implicit convex-concave kind.

• Developed a combined regularization-convexification-sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
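
As a concrete, deliberately simplified illustration of how such a program is evaluated numerically, the sketch below forms the sample-average objective $\zeta_N(x) = \varphi(x) + \frac{1}{N}\sum_{i=1}^N \psi(x,\xi_i)$, solving one convex QP per sampled scenario. This is our sketch, not the authors' implementation: the scenario tuples (f, G, A, b), the fixed data D, Q, and the use of cvxpy are all assumptions.

```python
import numpy as np
import cvxpy as cp

def recourse_value(x, f, G, A, b, D, Q):
    """psi(x, xi): optimal value of  min_y [f + G x]^T y + (1/2) y^T Q y
    subject to  A x + D y >= b,  for one realized scenario xi -> (f, G, A, b)."""
    y = cp.Variable(D.shape[1])
    objective = cp.Minimize((f + G @ x) @ y + 0.5 * cp.quad_form(y, Q))
    problem = cp.Problem(objective, [A @ x + D @ y >= b])
    return problem.solve()  # +inf if the scenario QP happens to be infeasible

def saa_objective(x, phi, scenarios, D, Q):
    """zeta_N(x) = phi(x) + (1/N) sum_i psi(x, xi_i) over sampled scenario data."""
    values = [recourse_value(x, f, G, A, b, D, Q) for (f, G, A, b) in scenarios]
    return phi(x) + float(np.mean(values))
```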

(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize $c^{\top}x$ subject to $Ax = b$ and $x \ge 0$.

Defender's problem: anticipating the attacker's disruption $\delta \in \Delta$, the defender undertakes the operation by solving

$$\underset{x \,\ge\, 0}{\text{minimize}} \;\; (c + \delta_c)^{\top} x \;+\; \rho\, \underbrace{\Big\| \big[\, (A + \delta_A)\,x - b - \delta_b \,\big]_{+} \Big\|_{\infty}}_{\text{constraint residual}}.$$

Attacker's problem: anticipating the defender's activity $x$, the attacker aims to disrupt the defender's operations by solving

$$\underset{\delta \,\in\, \Delta}{\text{maximum}} \;\; \lambda_1\, \underbrace{(c + \delta_c)^{\top} x}_{\text{disrupt objective}} \;+\; \lambda_2\, \underbrace{\Big\| \big[\, (A + \delta_A)\,x - b - \delta_b \,\big]_{+} \Big\|_{\infty}}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}.$$

• Many variations are possible.

• Offer alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.
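
As a quick numerical companion (assumed dense data shapes; not code from the talk), the shared residual term and the two players' objectives for a fixed pair $(x, \delta)$ can be written directly:

```python
import numpy as np

def residual_norm(x, A, b, dA, db):
    """|| [ (A + dA) x - b - db ]_+ ||_inf : size of the worst violated constraint."""
    return np.linalg.norm(np.maximum((A + dA) @ x - b - db, 0.0), np.inf)

def defender_objective(x, c, A, b, dc, dA, db, rho):
    return (c + dc) @ x + rho * residual_norm(x, A, b, dA, db)

def attacker_objective(x, c, A, b, dc, dA, db, lam1, lam2, attack_cost):
    return (lam1 * (c + dc) @ x
            + lam2 * residual_norm(x, A, b, dA, db)
            - attack_cost(dc, dA, db))
```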

Robustification of nonconvex problems

A general nonlinear program:

$$\underset{x \,\in\, S}{\text{minimize}} \;\; \theta(x) + f(x) \quad \text{subject to } H(x) = 0,$$

where
• $S$ is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• $\theta$ and $f$ are real-valued functions defined on an open convex set $\mathcal{O}$ in $\mathbb{R}^n$ containing $S$;
• $H \colon \mathcal{O} \to \mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints $g(x) \le 0$ and equality constraints $h(x) = 0$.

Robust counterpart with separate perturbations:

$$\underset{x \,\in\, S}{\text{minimize}} \;\; \theta(x) + \underset{\delta \,\in\, \Delta}{\text{maximum}} \; f(x,\delta) \quad \text{subject to } \underbrace{H(x,\delta) = 0 \;\; \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}.$$
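
One conventional way to realize the last bullet (our illustration; the residual map below is an assumption, not taken from the talk) is to stack the clipped inequality residual on top of the equality residual, so that $H(x) = 0$ exactly on the feasible set:

```python
import numpy as np

def H(x, g, h):
    """Nonsmooth residual map: H(x) = 0  iff  g(x) <= 0 and h(x) = 0."""
    return np.concatenate([np.maximum(g(x), 0.0), h(x)])
```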

New formulation

via a pessimistic value-function minimization: given $\gamma > 0$,

$$\underset{x \,\in\, S}{\text{minimize}} \;\; \theta(x) + v_{\gamma}(x),$$

where $v_{\gamma}(x)$ models constraint violation together with the cost of model attacks:

$$v_{\gamma}(x) \,\triangleq\, \underset{\delta \,\in\, \Delta}{\text{maximum}} \;\; \varphi_{\gamma}(x,\delta) \,\triangleq\, f(x,\delta) \;+\; \gamma\, \underbrace{\big\| H(x,\delta) \big\|_{\infty}}_{\text{constraint residual}} \;-\; \underbrace{C(\delta)}_{\text{attack cost}},$$

with the novel features:

• a combined perturbation $\delta$ in a joint uncertainty set $\Delta \subseteq \mathbb{R}^L$;

• constraint satisfaction robustified via penalization.
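
If the inner maximization is approximated over a finite candidate set of attacks (a simplification we assume here purely for illustration; the talk's value function maximizes over all of $\Delta$), then $v_{\gamma}(x)$ can be evaluated by enumeration:

```python
import numpy as np

def v_gamma(x, candidate_attacks, f, H, C, gamma):
    """max over delta of  f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta),
    with the (generally infinite) set Delta replaced by a finite sample of attacks."""
    return max(
        f(x, d) + gamma * np.linalg.norm(H(x, d), np.inf) - C(d)
        for d in candidate_attacks
    )
```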

Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of $f(\bullet,\delta)$ and $H(\bullet,\delta)$, a linear structure of $C(\delta)$ with a set-up component, and a special simplex-type uncertainty set $\Delta$.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

$$\underset{x \,\in\, S}{\text{minimize}} \;\; \phi(x) \,\triangleq\, \theta(x) + v(x) \quad \text{with} \quad v(x) \,\triangleq\, \max_{1 \le j \le J} \;\; \underset{\delta \,\in\, \Delta \cap \mathbb{R}^L_{+}}{\text{maximum}} \;\big[\, g_j(x,\delta) - h_j(x)^{\top}\delta \,\big].$$

• Implemented the algorithm and reported computational results that demonstrate the viability of the robust paradigm in the "non"-framework.
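
The MM template behind the last two bullets is generic enough to state in a few lines. Below is a schematic sketch under assumed oracles, not the paper's algorithm: `build_surrogate` must return a convex upper model of $\phi$ that is tight at the current iterate (e.g., obtained by linearizing the concave terms $-h_j(x)^{\top}\delta$), and `solve_convex` minimizes that model over $S$.

```python
import numpy as np

def majorization_minimization(x0, build_surrogate, solve_convex,
                              tol=1e-8, max_iter=100):
    """Generic MM: repeatedly minimize a convex majorant of phi touching at x_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        surrogate = build_surrogate(x)     # convex upper model of phi, tight at x
        x_next = solve_convex(surrogate)   # minimize the surrogate over S
        if np.linalg.norm(x_next - x) <= tol * (1.0 + np.linalg.norm(x)):
            return x_next                  # descent has stalled at a stationary point
        x = x_next
    return x
```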

Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable $Z$ at levels $0 \le \alpha < \beta \le 1$, treating two-sided estimation errors:

$$\text{In-CVaR}_{\alpha}^{\beta}(Z) \,\triangleq\, \frac{1}{\beta - \alpha} \int_{\text{VaR}_{\beta}(Z) \,>\, z \,\ge\, \text{VaR}_{\alpha}(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta - \alpha} \Big[\, (1-\alpha)\,\text{CVaR}_{\alpha}(Z) - (1-\beta)\,\text{CVaR}_{\beta}(Z) \,\Big].$$

Buffered probability of exceedance/deceedance at level $\tau > 0$:

$$\text{bPOE}(Z, \tau) \,\triangleq\, 1 - \min\big\{\, \alpha \in [0,1] \;:\; \text{CVaR}_{\alpha}(Z) \ge \tau \,\big\}, \qquad \text{bPOD}(Z, u) \,\triangleq\, \min\big\{\, \beta \in [0,1] \;:\; \text{In-CVaR}_{0}^{\beta}(Z) \ge u \,\big\}.$$
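
These quantities all have plug-in estimators on sorted samples. The sketch below (an illustration following the definitions above, not the talk's code; the grid search over $\alpha$ is our simplification) computes an empirical CVaR, the In-CVaR combination, and bPOE:

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha(Z): mean of the worst (1 - alpha) fraction of samples."""
    z = np.sort(np.asarray(z, dtype=float))
    k = int(np.floor(alpha * z.size))
    return z[k:].mean()

def in_cvar(z, alpha, beta):
    """In-CVaR_alpha^beta = [ (1-alpha) CVaR_alpha - (1-beta) CVaR_beta ] / (beta - alpha)."""
    tail = 0.0 if beta >= 1.0 else (1.0 - beta) * cvar(z, beta)
    return ((1.0 - alpha) * cvar(z, alpha) - tail) / (beta - alpha)

def bpoe(z, tau, grid=1000):
    """bPOE(Z, tau) = 1 - min{ alpha : CVaR_alpha(Z) >= tau }, by grid search
    (CVaR is nondecreasing in alpha, so the first hit attains the minimum)."""
    for alpha in np.linspace(0.0, 1.0, grid, endpoint=False):
        if cvar(z, alpha) >= tau:
            return 1.0 - alpha
    return 0.0  # tau exceeds the largest sample: the buffered probability vanishes
```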

Buffered probability of the interval $(\tau, u]$

We have the following bound for an interval probability:

$$\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \,\triangleq\, \text{bPOI}_{\tau}^{u}(Z).$$

Two risk minimization problems of a composite random functional:

$$\underset{\theta \,\in\, \Theta}{\text{minimize}} \;\; \underbrace{\text{In-CVaR}_{\alpha}^{\beta}(Z(\theta))}_{\substack{\text{a difference of two convex} \\ \text{SP value functions}}} \qquad \text{and} \qquad \underset{\theta \,\in\, \Theta}{\text{minimize}} \;\; \underbrace{\text{bPOI}_{\tau}^{u}(Z(\theta))}_{\substack{\text{a stochastic program with a} \\ \text{difference-of-convex objective}}},$$

where

$$Z(\theta) \,\triangleq\, c\big(\, \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet,\,z) \,-\, h(\bullet,\,z) \text{ is a difference of convex functions}} \,\big),$$

with $c \colon \mathbb{R} \to \mathbb{R}$ a univariate piecewise affine function, $Z$ an $m$-dimensional random vector, and, for each $z$, $g(\bullet,z)$ and $h(\bullet,z)$ both convex functions.
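
For a concrete instance of such a composite (our example, not necessarily the one used in the talk): the truncated absolute-deviation loss $\min(|r|, \kappa)$, popular for outlier-resistant regression, satisfies $\min(|r|,\kappa) = |r| - \max(|r| - \kappa, 0)$ and is therefore exactly of the form $g - h$ with both parts convex in $\theta$, with $c$ the identity (piecewise affine):

```python
import numpy as np

# min(|r|, kappa) = |r| - max(|r| - kappa, 0), with r = theta^T z_x - z_y
def g(theta, z_features, z_response):              # convex in theta
    return np.abs(theta @ z_features - z_response)

def h(theta, z_features, z_response, kappa=1.0):   # convex in theta
    return np.maximum(np.abs(theta @ z_features - z_response) - kappa, 0.0)
```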

Comments

• The In-CVaR minimization leads to the following stochastic program:

$$\text{minimize over } \theta \in \Theta \text{ of} \quad \underbrace{\underset{\eta_1 \,\in\, \Upsilon_1}{\text{minimum}} \; \mathbb{E}_{\omega}\big[\, \varphi_1(\theta, \eta_1, \omega) \,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\underset{\eta_2 \,\in\, \Upsilon_2}{\text{minimum}} \; \mathbb{E}_{\omega}\big[\, \varphi_2(\theta, \eta_2, \omega) \,\big]}_{\text{denoted } v_2(\theta)}.$$

Both functions $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
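
One natural way to realize such a convex-programming step (a hedged sketch with assumed oracles; the paper's actual sampling scheme may differ) is DCA-style: obtain a subgradient of the subtracted convex value function $v_2$ at the current iterate from any inner minimizer $\eta_2$, then minimize the convexified model over $\Theta$:

```python
def dc_step(theta_k, v2_subgradient, solve_convex_model):
    """One difference-of-convex step for minimizing v1 - v2:
    linearize v2 at theta_k and minimize  v1(.) - <s_k, . - theta_k>  over Theta."""
    s_k = v2_subgradient(theta_k)  # any s_k in the subdifferential of v2 at theta_k
    return solve_convex_model(theta_k, s_k)
```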

A touch of mathematics: bridging computations with theory

First of All: What Can Be Computed?

[Slide diagram: a hierarchy of computable points, from critical points (for dc functions), through Clarke stationary and limiting stationary points, to directional stationary points (for dd functions), alongside the second-order conditions SONC and SOSC surrounding a local minimum.]

Relationship between the stationary points:

▸ Stationarity:
  – smooth case: $\nabla V(x) = 0$;
  – nonsmooth case: $0 \in \partial V(x)$ [subdifferential; many calculus rules fail],
  e.g., $\partial V_1(x) + \partial V_2(x) \ne \partial(V_1 + V_2)(x)$ unless one of them is continuously differentiable. (For instance, with $V_1 = |\cdot|$ and $V_2 = -|\cdot|$ on $\mathbb{R}$, one has $\partial V_1(0) + \partial V_2(0) = [-2, 2]$, whereas $\partial(V_1 + V_2)(0) = \{0\}$.)

▸ Directional stationarity: $V'(x; d) \ge 0$ for all directions $d$.

• The sharper the concept, the harder it is to compute.

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:

  – dealing with non-convexity: majorization-minimization algorithm;

  – dealing with non-smoothness: nonsmooth Newton algorithm;

  – dealing with pathological constraints: (exact) penalty method.

▸ An integration of sampling and deterministic methods:

  – dealing with population optimization in statistics;

  – dealing with stochastic programming in operations research;

  – dealing with risk minimization in financial management and tail control.

Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic.

• Difference-of-convex: facilitates algorithmic design via convex majorization.

• Semismooth: enables fast algorithms of the Newton type.

• Bouligand differentiable: locally Lipschitz + directionally differentiable.

• Sub-analytic and semi-analytic.

* and many more

[Slide reproduces the summary diagram from Section 4.4 of the monograph (p. 231): inclusions among the classes of nonsmooth functions — piecewise affine (PA) ⊂ piecewise quadratic/linear-quadratic (PQ, PLQ) ⊂ PC¹ (with PC² ⊂ PC¹), dc and dc-composite, diff-max-convex, implicit convex-concave (icc), semismooth, B-differentiable, locally Lipschitz + directionally differentiable, and (locally) strongly weakly convex/concave and locally dc functions — with qualifiers such as + C¹, LC¹, + lower-C², conditions (4.65) + (4.66), and open or closed convex domains.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to the publisher (Friday, September 11, 2020)

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683

Bibliography 691
Index 721

Page 43: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 44: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse

minimizex

ζ(x) ϕ(x) + Eξ

[ψ(x ξ)

]subject to x isin X sube Rn1

where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite

ψ(x ξ) minimumy

[f(ξ) +G(ξ)x

]gty + 1

2 ygtQy

subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)

bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind

bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee

(Simplified) Defender-Attacker Game with Costs

Underlying problem is minimize cgtx subject to Ax = b and x ge 0

Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem

minimizexge0

(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb

]+

∥∥∥infin︸ ︷︷ ︸

constraint residual

Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem

maximumδ isin∆

λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective

+λ2

∥∥∥ [ (A+ δA)xminus bminus δb]+

∥∥∥infin︸ ︷︷ ︸

disrupt constraints

minusC(δ)︸︷︷︸cost

bull Many variations are possible

bull Offer alternative approaches to the convexity-based robust formulations

bull Adversarial Programming

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: relationship between the stationary points, as nested sets of candidate solutions:
critical (for dc functions) ⊇ Clarke stationary ⊇ limiting stationary ⊇ directional stationary (for dd functions) ⊇ SONC ⊇ loc-min ⊇ SOSC]

▸ Stationarity:
– smooth case: $\nabla V(x) = 0$
– nonsmooth case: $0 \in \partial V(x)$ [subdifferential; many calculus rules fail], e.g., $\partial V_1(x) + \partial V_2(x) \ne \partial (V_1 + V_2)(x)$ unless one of them is continuously differentiable

▸ Directional stationarity: $V'(x; d) \ge 0$ for all directions $d$

• The sharper the concept, the harder it is to compute.
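For concreteness, here is a one-line worked example (added for this writeup) showing both phenomena on the real line:
\[
V_1(x) = |x|, \quad V_2(x) = -|x| \;\Longrightarrow\; \partial (V_1 + V_2)(0) = \{0\} \;\subsetneq\; [-2, 2] = \partial V_1(0) + \partial V_2(0),
\]
so the sum rule holds only as an inclusion. Moreover, for $V(x) = -|x|$ the point $x = 0$ is Clarke stationary, since $0 \in \partial V(0) = [-1, 1]$, yet it is not directionally stationary: $V'(0; d) = -|d| < 0$ for every $d \ne 0$, so every direction is a descent direction.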

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions

▸ A fusion of first-order and second-order algorithms:
– Dealing with non-convexity: majorization-minimization algorithm (a minimal sketch follows after this list)
– Dealing with non-smoothness: nonsmooth Newton algorithm
– Dealing with pathological constraints: (exact) penalty method

▸ An integration of sampling and deterministic methods:
– Dealing with population optimization in statistics
– Dealing with stochastic programming in operations research
– Dealing with risk minimization in financial management and tail control
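As referenced above, here is a minimal majorization-minimization (DCA-style) sketch for a difference-of-convex objective $f = g - h$; the instance and names are my illustration, not the talk's implementation. The concave part $-h$ is majorized by its linearization at the current iterate, and the resulting convex surrogate is minimized in closed form:

```python
import numpy as np

# Illustrative dc objective on the real line: f(x) = g(x) - h(x),
# with g(x) = x^4 and h(x) = 2 x^2, both convex; f has local minima at x = +/- 1.
dh = lambda x: 4.0 * x  # gradient of h

def mm_dc(x, iters=50):
    # At iterate xk, minimize the convex majorant g(x) - h(xk) - dh(xk) (x - xk):
    # since h is convex, its linearization underestimates h, so the surrogate
    # sits above f and touches it at xk, giving monotone descent.
    for _ in range(iters):
        # Surrogate stationarity: 4 x^3 = dh(xk), i.e., x = cbrt(dh(xk) / 4).
        x = np.cbrt(dh(x) / 4.0)
    return x

x_star = mm_dc(0.5)
print(x_star, x_star**4 - 2.0 * x_star**2)  # -> 1.0 and f = -1, a stationary point
```

Starting from $x_0 = 0$ the iteration stalls at the critical point $0$ (a local maximum of $f$), which illustrates why the sharpness of the computed stationarity concept matters.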

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth∗, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

∗ and many more

[Figure (monograph §4.4, p. 231): summary diagram of inclusions among the nonsmooth function classes: PC² ⊆ PC¹ ⊆ semismooth ⊆ B-differentiable; PQ, PLQ, PA ⊆ dc ⊆ icc; together with the relations among dc composite, diff-max-convex, locally Lipschitz + directionally differentiable, weakly convex, locally dc, and lower-C² functions, under side conditions (C¹, open or closed convex domain, (4.65)–(4.66)) governing some of the arrows.]
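As a concrete instance of the inclusions in the diagram (a standard fact, recalled here for illustration): every piecewise affine function is a difference of two convex, max-affine functions, which is one reason the dc class contains PA:
\[
f(x) \;=\; \max_{1 \le i \le k} \big( a_i^\top x + \alpha_i \big) \;-\; \max_{1 \le j \le m} \big( b_j^\top x + \beta_j \big),
\qquad \text{e.g., } |x| - |x - 1| = \max(x, -x) - \max(x - 1,\, 1 - x) \text{ on } \mathbb{R}.
\]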

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday, September 11, 2020)

Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̃c 683

Bibliography 691
Index 721


(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize $c^\top x$ subject to $Ax = b$ and $x \ge 0$.

Defender's problem: anticipating the attacker's disruption $\delta \in \Delta$, the defender undertakes the operation by solving
\[
\operatorname*{minimize}_{x \ge 0}\ (c + \delta_c)^\top x \;+\; \rho \, \underbrace{\Big\| \big[\, (A + \delta_A)\,x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{constraint residual}}
\]

Attacker's problem: anticipating the defender's activity $x$, the attacker aims to disrupt the defender's operations by solving
\[
\operatorname*{maximize}_{\delta \in \Delta}\ \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} \;+\; \lambda_2 \underbrace{\Big\| \big[\, (A + \delta_A)\,x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}
\]

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming
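As a concrete reading of the two objectives, the following sketch (my own toy data and function names, purely illustrative) evaluates the defender's penalized cost and the attacker's disruption payoff for a given pair $(x, \delta)$:

```python
import numpy as np

# Toy data for the underlying LP "minimize c^T x s.t. Ax = b, x >= 0" (made-up numbers).
c = np.array([1.0, 2.0, 0.5])
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])

def residual(x, dA, db):
    # Constraint residual || [ (A + dA) x - b - db ]_+ ||_inf .
    return np.maximum((A + dA) @ x - b - db, 0.0).max()

def defender_cost(x, dc, dA, db, rho):
    # Disrupted operating cost plus a penalty on the violated constraints.
    return (c + dc) @ x + rho * residual(x, dA, db)

def attacker_payoff(x, dc, dA, db, cost, lam1, lam2):
    # Weighted disruption of objective and constraints, net of the attack cost C(delta).
    return lam1 * (c + dc) @ x + lam2 * residual(x, dA, db) - cost

x = np.array([0.5, 0.5, 0.5])
dc, dA, db = 0.1 * c, 0.05 * A, np.zeros(2)  # one candidate attack delta
print(defender_cost(x, dc, dA, db, rho=10.0))
print(attacker_payoff(x, dc, dA, db, cost=0.3, lam1=1.0, lam2=1.0))
```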

Robustification of nonconvex problems

A general nonlinear program:
\[
\operatorname*{minimize}_{x \in S}\ \theta(x) + f(x) \quad \text{subject to } H(x) = 0,
\]
where
• $S$ is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• $\theta$ and $f$ are real-valued functions defined on an open convex set $\mathcal{O}$ in $\mathbb{R}^n$ containing $S$;
• $H : \mathcal{O} \to \mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints $g(x) \le 0$ and equality constraints $h(x) = 0$.

Robust counterpart with separate perturbations:
\[
\operatorname*{minimize}_{x \in S}\ \theta(x) + \operatorname*{maximum}_{\delta \in \Delta}\ f(x, \delta) \quad \text{subject to } \underbrace{H(x, \delta) = 0 \ \ \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}
\]


New formulation

via a pessimistic value-function minimization: given $\gamma > 0$,
\[
\operatorname*{minimize}_{x \in S}\ \theta(x) + v_\gamma(x)
\]
where $v_\gamma(x)$ models constraint violation together with the cost of model attacks:
\[
v_\gamma(x) \,\triangleq\, \operatorname*{maximum}_{\delta \in \Delta}\ \varphi_\gamma(x, \delta) \,\triangleq\, f(x, \delta) + \gamma \, \underbrace{\| H(x, \delta) \|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}}
\]
with the novel features:
• a combined perturbation $\delta$ in a joint uncertainty set $\Delta \subseteq \mathbb{R}^L$;
• robustified constraint satisfaction via penalization.
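A minimal sketch of evaluating this pessimistic value function (my illustration; the inner maximum is taken over a finite sample of attacks standing in for $\Delta$, and the one-dimensional instance is hypothetical):

```python
import numpy as np

def v_gamma(x, attacks, f, H, C, gamma):
    # v_gamma(x) = max over delta of f(x, delta) + gamma ||H(x, delta)||_inf - C(delta),
    # evaluated on a finite attack set; returns the value and the worst-case attack.
    values = [f(x, d) + gamma * np.abs(H(x, d)).max() - C(d) for d in attacks]
    k = int(np.argmax(values))
    return values[k], attacks[k]

# Hypothetical instance: the equality constraint x = delta under attack delta.
f = lambda x, d: (x - 1.0) ** 2 + d * x
H = lambda x, d: np.array([x - d])
C = lambda d: 0.5 * abs(d)               # attack cost with a linear (set-up free) structure
attacks = np.linspace(-1.0, 1.0, 201)    # finite surrogate for the joint set Delta

val, worst = v_gamma(0.3, attacks, f, H, C, gamma=5.0)
print(val, worst)  # worst-case penalized value and the attack delta achieving it
```

The outer problem then minimizes $\theta(x) + v_\gamma(x)$ over $x \in S$; since $v_\gamma$ is a pointwise maximum, it is nonsmooth, which is exactly where the "non"-machinery enters.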

Comments

• Related the VFOP (value-function optimization problem) to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of $f(\bullet, \delta)$ and $H(\bullet, \delta)$, a linear structure of $C(\delta)$ with a set-up component, and a special simplex-type uncertainty set $\Delta$.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:
\[
\operatorname*{minimize}_{x \in S}\ \varphi(x) \,\triangleq\, \theta(x) + v(x), \quad \text{with } v(x) \,\triangleq\, \max_{1 \le j \le J}\ \operatorname*{maximum}_{\delta \in \Delta \cap \mathbb{R}^L_+} \big[\, g_j(x, \delta) - h_j(x)^\top \delta \,\big]
\]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the “non”-framework.

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 46: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 47: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Robustification of nonconvex problems

A general nonlinear program

minimizexisinS

θ(x) + f(x)

subject to H(x) = 0 where

bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0

Robust counterpart with separate perturbations

minimizexisinS

θ(x) + maximumδ isin ∆

f(x δ)

subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible

in current way

of modeling

New formulation

via a pessimistic value-function minimization given γ gt 0

minimizexisinS

θ(x) + vγ(x)

where vγ(x) models constraint violation with cost of model attacks

vγ(x) maximum over δ isin ∆ of

φγ(x δ) f(x δ) + γ

H(x δ) infin︸ ︷︷ ︸constraint residual

minus C(δ)︸ ︷︷ ︸attack cost

with the novel features

bull combined perturbation δ in joint uncertainty set ∆ sube RL

bull robustify constraint satisfaction via penalization

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential-based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683
Bibliography 691
Index 721


New formulation

via a pessimistic value-function minimization: given γ > 0,

minimize_{x ∈ S}  θ(x) + v_γ(x),

where v_γ(x) models constraint violation together with the cost of model attacks:

v_γ(x) ≜ max_{δ ∈ ∆}  φ_γ(x, δ) ≜ f(x, δ) + γ ‖H(x, δ)‖_∞ − C(δ),

in which the term γ ‖H(x, δ)‖_∞ penalizes the constraint residual and C(δ) is the attack cost,

with the novel features:

• a combined perturbation δ in a joint uncertainty set ∆ ⊆ R^L;

• constraint satisfaction robustified via penalization (a toy evaluation of v_γ over a finite sampled ∆ is sketched below).
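A minimal numeric sketch of this formulation (my illustration, not the implementation behind the reported results): v_γ is evaluated by brute force over a finite sampled uncertainty set, with hypothetical toy choices for f, H, C, θ, and ∆, and a scalar constraint residual so that ‖H(x, δ)‖_∞ = |H(x, δ)|.

import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    # v_gamma(x) = max over delta in Delta of f(x, delta) + gamma*|H(x, delta)| - C(delta)
    return max(f(x, d) + gamma * abs(H(x, d)) - C(d) for d in deltas)

rng = np.random.default_rng(0)
deltas = rng.dirichlet(np.ones(2), size=200)                 # sampled simplex-type uncertainty set
f = lambda x, d: 0.5 * float(x @ x) + (d[0] - d[1]) * x[0]   # toy objective
H = lambda x, d: float(x.sum()) + d[0] - 1.5                 # scalar constraint residual
C = lambda d: 0.1 * d[0]                                     # linear attack cost, no set-up term
theta = lambda x: float(np.abs(x).sum())

# Crude grid search for minimize_{x in S} theta(x) + v_gamma(x) on S = [-1, 1]^2.
grid = np.linspace(-1.0, 1.0, 41)
xs = [np.array([a, b]) for a in grid for b in grid]
xstar = min(xs, key=lambda x: theta(x) + v_gamma(x, deltas, f, H, C, gamma=5.0))
print(xstar)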

Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.

• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation

minimize_{x ∈ S}  ϕ(x) ≜ θ(x) + v(x),  with  v(x) ≜ max_{1 ≤ j ≤ J}  max_{δ ∈ ∆ ∩ R^L_+}  [ g_j(x, δ) − h_j(x)^⊤ δ ]

(a minimal MM sketch follows below).

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
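To illustrate the MM principle named above on the simplest possible instance (a stand-alone sketch, not the unified robust formulation itself): for a difference-of-convex objective, the concave part is majorized by its linearization at the current iterate, and the resulting convex surrogate is minimized, here in closed form. The toy objective is hypothetical.

import numpy as np

# Toy dc program: minimize phi(x) = 0.5*||x||^2 - ||x||_1 over R^3.
# MM surrogate at x_k: 0.5*||x||^2 - s^T x (+ constant), with s a subgradient
# of ||.||_1 at x_k; the surrogate majorizes phi and its unique minimizer is x = s.
subgrad_h = lambda x: np.sign(x)        # a subgradient of ||x||_1 (sign(0) = 0)
surrogate_argmin = lambda s: s          # argmin_x 0.5*||x||^2 - s^T x

x = np.array([0.3, -2.0, 1.5])
for _ in range(20):
    x = surrogate_argmin(subgrad_h(x))  # each step does not increase phi
print(x)                                # -> [ 1. -1.  1.], a d-stationary point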


Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

In-CVaR_α^β(Z) ≜ (1 / (β − α)) ∫_{VaR_α(Z) ≤ z < VaR_β(Z)} z dF_Z(z)
             = (1 / (β − α)) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ].

Buffered probability of exceedance / deceedance at level τ > 0:

bPOE(Z; τ) ≜ 1 − min { α ∈ [0, 1] : CVaR_α(Z) ≥ τ },
bPOD(Z; u) ≜ min { β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }.

(A sample-based sketch of In-CVaR via the CVaR identity follows below.)
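A minimal sample-based sketch (an illustration, assuming the standard empirical approximation of CVaR_α as the average of the worst (1 − α)-fraction of outcomes; this discrete tail average is my choice of estimator):

import numpy as np

def cvar(z, alpha):
    # Empirical CVaR_alpha: average of the largest (1 - alpha)-fraction of samples.
    zs = np.sort(z)
    k = min(int(np.ceil(alpha * len(zs))), len(zs) - 1)
    return zs[k:].mean()

def in_cvar(z, alpha, beta):
    # In-CVaR via the identity ((1-alpha)*CVaR_alpha - (1-beta)*CVaR_beta) / (beta-alpha),
    # used here for 0 <= alpha < beta < 1 (as beta -> 1 the second term vanishes).
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.random.default_rng(1).standard_normal(100_000)
print(in_cvar(z, 0.1, 0.9))   # two-sided trimmed tail mean; ~0 for a symmetric law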


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

P(u ≥ Z > τ) ≤ bPOE(Z; τ) + bPOD(Z; u) − 1 ≜ bPOI_τ^u(Z).

Two risk-minimization problems of a composite random functional:

minimize_{θ ∈ Θ}  In-CVaR_α^β(Z(θ))   [a difference of two convex SP value functions]

and   minimize_{θ ∈ Θ}  bPOI_τ^u(Z(θ))   [a stochastic program with a difference-of-convex objective],

where

Z(θ) ≜ c( g(θ; Z) − h(θ; Z) ),

with c : R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each realization z, g(•, z) and h(•, z) both convex functions, so that g(•, z) − h(•, z) is a difference of convex functions. (A sample-based check of the bPOI bound follows below.)
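The bound can be checked empirically by scanning the level sets in the two definitions above; the grid scan below is my construction (with the empirical CVaR approximation from the previous sketch), exploiting that CVaR_α is nondecreasing in α and that In-CVaR_0^β, the mean of the smallest β-fraction, is nondecreasing in β.

import numpy as np

def bpoe(z, tau, grid=np.linspace(0.0, 0.999, 1000)):
    # bPOE(Z; tau) = 1 - min{alpha : CVaR_alpha(Z) >= tau}; the empirical tail
    # average is nondecreasing in alpha, so the first qualifying grid point
    # attains the minimum.
    zs = np.sort(z); n = len(zs)
    for a in grid:
        if zs[min(int(np.ceil(a * n)), n - 1):].mean() >= tau:
            return 1.0 - a
    return 0.0

def bpod(z, u, grid=np.linspace(0.001, 1.0, 1000)):
    # bPOD(Z; u) = min{beta : In-CVaR_0^beta(Z) >= u}, with In-CVaR_0^beta
    # the average of the smallest beta-fraction of samples.
    zs = np.sort(z); n = len(zs)
    for b in grid:
        k = max(int(np.ceil(b * n)), 1)
        if zs[:k].mean() >= u:
            return b
    return 1.0

z = np.random.default_rng(2).standard_normal(100_000)
tau, u = -0.5, -0.2
lhs = np.mean((z > tau) & (z <= u))        # P(u >= Z > tau)
rhs = bpoe(z, tau) + bpod(z, u) - 1.0      # bPOI_tau^u(Z)
print(lhs, rhs, bool(lhs <= rhs))          # the inequality holds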

Comments

• The In-CVaR minimization leads to the following stochastic program:

minimize over θ ∈ Θ of  v1(θ) − v2(θ),  where  v_i(θ) ≜ min_{η_i ∈ Υ_i} E[ ϕ_i(θ, η_i, ω) ],  i = 1, 2.

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in handling outliers in regression and classification problems.


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: the nested families of candidate points, from the weakest to the sharpest notion —
critical (for dc functions) ⊇ Clarke stationary ⊇ limiting stationary ⊇ directional stationary (for dd functions) ⊇ SONC points ⊇ local minimizers ⊇ SOSC points.]

Relationship between the stationary points:

▸ Stationarity:
– smooth case: ∇V(x) = 0;
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail],
e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of the two functions is continuously differentiable.

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute. (A one-line numeric illustration follows below.)
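A one-line numeric illustration of the gap between the two concepts (my example, not from the talk): for V(x) = −|x|, the origin is Clarke stationary, since ∂V(0) = [−1, 1] contains 0, yet V′(0; d) = −|d| < 0 for every d ≠ 0, so it fails the sharper directional test.

V = lambda x: -abs(x)

def ddir(V, x, d, t=1e-8):
    # One-sided directional derivative V'(x; d) via a forward difference.
    return (V(x + t * d) - V(x)) / t

for d in (1.0, -1.0):
    print(f"V'(0; {d:+.0f}) = {ddir(V, 0.0, d):.4f}")   # both ~ -1.0 < 0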

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: the majorization-minimization algorithm;
– dealing with non-smoothness: nonsmooth Newton algorithms;
– dealing with pathological constraints: (exact) penalty methods.

▸ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics;
– dealing with stochastic programming in operations research;
– dealing with risk minimization in financial management and tail control.

Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic;

• Difference-of-convex: facilitates algorithmic design via convex majorization;

• Semismooth: enables fast algorithms of the Newton type;

• Bouligand differentiable: locally Lipschitz + directionally differentiable;

• Sub-analytic and semi-analytic.

* and many more

[Figure, from Section 4.4 of the monograph ("Classes of Nonsmooth Functions", p. 231): an inclusion diagram of the nonsmooth function classes — PA ⊆ PLQ, and PC² ⊆ PC¹ ⊆ semismooth ⊆ B-differentiable (locally Lipschitz + directionally differentiable), with the dc, dc-composite, diff-max-convex, implicit convex-concave (icc), weakly convex, locally dc, and lower-C² classes related under additional conditions (e.g., C¹ data, open or closed convex domains).]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)


Page 49: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 50: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 51: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Comments

bull Related the VFOP to the two game formulations in terms of directionalsolutions

bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆

bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation

minimizexisinS

ϕ(x) θ(x) + v(x) with

v(x) max1lejleJ

maximumδisin∆capRL+

[gj(x δ)minus hj(x)gtδ

]

bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

4.4.7 A summary 240

5 Value Functions 243

5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441

8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.1.9) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̃c 683

Bibliography 691

Index 721


Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation (a generic MM skeleton is sketched after this list):

minimize_{x ∈ S}  φ(x) := θ(x) + v(x),  with  v(x) := max_{1 ≤ j ≤ J}  max_{δ ∈ Δ ∩ ℝ^L_+}  [ g_j(x, δ) − h_j(x)ᵀδ ]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
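As a concrete rendering of the MM idea referenced above, here is a minimal generic majorization-minimization skeleton in Python. It is an illustrative sketch only, not the algorithm of the paper: majorize and solve_convex are hypothetical user-supplied oracles satisfying the usual surrogation conditions.

    # Generic MM skeleton (sketch).  majorize(x_k) must return a convex
    # surrogate phi_hat with phi_hat >= phi everywhere on the feasible set S
    # and phi_hat(x_k) == phi(x_k); solve_convex minimizes it over S.
    def mm(phi, majorize, solve_convex, x0, tol=1e-8, max_iter=500):
        x = x0
        for _ in range(max_iter):
            phi_hat = majorize(x)            # upper surrogate at the iterate
            x_next = solve_convex(phi_hat)   # convex subproblem
            # monotone descent: phi(x_next) <= phi_hat(x_next)
            #                   <= phi_hat(x) == phi(x)
            if abs(phi(x) - phi(x_next)) <= tol:
                return x_next
            x = x_next
        return x

The touching and upper-bounding conditions guarantee that each convex subproblem can only improve the true objective, which is the descent mechanism behind the reported computations.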

Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control left- and right-tail distributions simultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

In-CVaR_α^β(Z) := 1/(β − α) ∫_{VaR_α(Z) ≤ z < VaR_β(Z)} z dF_Z(z)
               = 1/(β − α) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ]

Buffered probability of exceedance/deceedance at level τ > 0:

bPOE(Z; τ) := 1 − min{ α ∈ [0, 1] : CVaR_α(Z) ≥ τ }

bPOD(Z; u) := min{ β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }
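A quick sample-based sanity check of the displayed identity, under the upper-tail convention CVaR_α(Z) = E[Z | Z ≥ VaR_α(Z)]. This is an illustrative sketch; on finite samples the two sides agree only approximately.

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.standard_normal(1_000_000)           # samples of Z
    alpha, beta = 0.80, 0.95

    def cvar(z, a):
        # empirical CVaR_a: mean of the upper (1 - a)-tail
        return z[z >= np.quantile(z, a)].mean()

    var_a, var_b = np.quantile(z, [alpha, beta])
    direct = z[(z >= var_a) & (z < var_b)].mean()   # mean on [VaR_a, VaR_b)
    via_identity = ((1 - alpha) * cvar(z, alpha)
                    - (1 - beta) * cvar(z, beta)) / (beta - alpha)
    print(direct, via_identity)                  # approximately equal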


Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

P(u ≥ Z > τ) ≤ bPOE(Z; τ) + bPOD(Z; u) − 1 =: bPOI_τ^u(Z)

Two risk minimization problems of a composite random functional:

minimize_{θ ∈ Θ}  In-CVaR_α^β(Z(θ))   [a difference of two convex SP value functions]

and  minimize_{θ ∈ Θ}  bPOI_τ^u(Z(θ))   [a stochastic program with a difference-of-convex objective]

where

Z(θ) := c( g(θ; Z) − h(θ; Z) ),

with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions, so that g(•, z) − h(•, z) is a difference of convex functions.
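To make the bPOE definition concrete, the following sketch evaluates it empirically by direct grid search over α; since CVaR_α(Z) is nondecreasing in α, the first level whose CVaR clears τ attains the minimum. Purely illustrative: in practice one would use a convex one-dimensional reformulation rather than a grid.

    import numpy as np

    def bpoe(z, tau, n_grid=2000):
        # empirical bPOE(Z; tau) = 1 - min{ alpha : CVaR_alpha(Z) >= tau }
        for a in np.linspace(0.0, 1.0 - 1.0 / len(z), n_grid):
            if z[z >= np.quantile(z, a)].mean() >= tau:
                return 1.0 - a
        return 0.0                    # tau above every tail mean

    rng = np.random.default_rng(1)
    z = rng.standard_normal(200_000)
    print(bpoe(z, tau=1.5))           # a conservative upper bound on P(Z > 1.5)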

Comments

• The In-CVaR minimization leads to the following stochastic program:

minimize over θ ∈ Θ:  v₁(θ) − v₂(θ),  where  vᵢ(θ) := min_{ηᵢ ∈ Υᵢ} E[ φᵢ(θ, ηᵢ, ω) ],  i = 1, 2.

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function (a minimal dc-algorithm sketch follows this slide).

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in handling outliers in regression and classification problems.
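The difference-of-convex structure v₁ − v₂ invites the classical dc-algorithm template sketched below. This is a minimal deterministic sketch, not the paper's sampling method; subgrad_v2 and argmin_model are hypothetical oracles returning a subgradient of v₂ and a solution of the convex subproblem.

    # Minimal DCA-type iteration for minimizing v1 - v2, with v1, v2 convex.
    def dca(v1, v2, subgrad_v2, argmin_model, theta0, tol=1e-8, max_iter=500):
        theta = theta0
        for _ in range(max_iter):
            s = subgrad_v2(theta)         # s in the subdifferential of v2
            # convex subproblem: minimize v1(t) - [v2(theta) + <s, t - theta>],
            # i.e., replace -v2 by its affine majorant at theta
            theta_next = argmin_model(theta, s)
            gap = (v1(theta) - v2(theta)) - (v1(theta_next) - v2(theta_next))
            if abs(gap) <= tol:
                return theta_next
            theta = theta_next
        return theta

Each iterate solves a convex program and the true objective decreases monotonically; this is the deterministic analogue of the descent that the sampling-based method targets.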


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: the hierarchy of stationarity concepts, from weakest to sharpest: critical (for dc functions), Clarke stationary, limiting stationary, directional stationary (for dd functions); alongside the second-order conditions SONC, local minimum, and SOSC.]

Relationship between the stationary points:

▸ Stationarity
– smooth case: ∇V(x) = 0
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the summands is continuously differentiable (a one-line example follows below)

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder it is to compute.
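A standard one-variable illustration of the failed sum rule, added here for concreteness (a worked example, not taken from the slides):

    % Take V_1(x) = |x| and V_2(x) = -|x|.  At the origin the Clarke
    % subdifferentials are dV_1(0) = dV_2(0) = [-1, 1], so
    \[
    \partial V_1(0) + \partial V_2(0) = [-1,1] + [-1,1] = [-2,2],
    \qquad\text{yet}\qquad
    \partial(V_1 + V_2)(0) = \partial\, 0 = \{0\},
    \]
    % since V_1 + V_2 vanishes identically; in general only the inclusion
    % \partial(V_1 + V_2)(x) \subseteq \partial V_1(x) + \partial V_2(x) holds.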

Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions

▸ A fusion of first-order and second-order algorithms
– Dealing with non-convexity: majorization-minimization algorithm
– Dealing with non-smoothness: nonsmooth Newton algorithm (a minimal sketch follows this list)
– Dealing with pathological constraints: (exact) penalty method

▸ An integration of sampling and deterministic methods
– Dealing with population optimization in statistics
– Dealing with stochastic programming in operations research
– Dealing with risk minimization in financial management and tail control
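A minimal sketch of the nonsmooth Newton idea listed above, assuming a user-supplied gen_jac(x) that returns an element of a generalized Jacobian (e.g., of the B-subdifferential) at x; no globalization is attempted.

    import numpy as np

    def semismooth_newton(F, gen_jac, x0, tol=1e-10, max_iter=100):
        # solve F(x) = 0 by Newton steps built on generalized Jacobians
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            r = F(x)
            if np.linalg.norm(r) <= tol:
                break
            H = gen_jac(x)                   # any generalized-Jacobian element
            x = x - np.linalg.solve(H, r)    # plain Newton step, no line search
        return x

    # toy usage: F(x) = |x| - 1 componentwise, nonsmooth at the origin
    F = lambda x: np.abs(x) - 1.0
    J = lambda x: np.diag(np.where(x >= 0.0, 1.0, -1.0))
    print(semismooth_newton(F, J, np.array([3.0, -2.0])))   # ~ [ 1. -1.]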

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic

• Difference-of-convex: facilitates algorithmic design via convex majorization (see the construction after this list)

• Semismooth: enables fast algorithms of the Newton type

• Bouligand differentiable: locally Lipschitz + directionally differentiable

• Sub-analytic and semi-analytic

* and many more
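The convex majorization behind the difference-of-convex bullet, stated for completeness (a standard construction under the stated convexity assumptions, not specific to the slides):

    % For f = g - h with g, h convex and any subgradient s in dh(x^k),
    % convexity of h gives h(x) >= h(x^k) + <s, x - x^k> for all x, hence
    \[
    f(x) \;\le\; \widehat{f}(x; x^k) := g(x) - h(x^k) - \langle s,\, x - x^k \rangle,
    \qquad s \in \partial h(x^k),
    \]
    % with equality at x = x^k, so \widehat{f} is a convex upper surrogate of f
    % at x^k; precisely the object required by the MM scheme sketched earlier.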

[Figure (Section 4.4 of the monograph, p. 231): inclusion diagram of the classes of nonsmooth functions: piecewise affine (PA), piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), PC¹, PC², difference-of-convex (dc), dc composite, diff-max-convex, implicit convex-concave (icc), semismooth, B-differentiable (locally Lipschitz + directionally differentiable), (locally) strongly weakly convex, and locally dc functions; some containments hold by definition, others under additional conditions such as C¹ smoothness, lower-C² structure, or an open or closed convex domain.]

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday, September 11, 2020)


Page 53: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 54: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 55: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Composite CVaR and bPOE minimization

for risk-based statistical estimation to control left and right-tail distributionssimultaneously

Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors

In-CVaR βα (Z)

1

β minus α

intVaRβ(Z)gtzgeVaRα(Z)

z dFZ(z)

=1

β minus α[

(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]

Buffered probability of exceedancedeceedance at level τ gt 0

bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ

bPOD(Zu) min

β isin [0 1] In-CVaR β

0 (Z) ge u

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 56: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Buffered probability of interval (τ u]

We have the following bound for an interval probability

P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)

Two risk minimization problems of a composite random functional

minimizeθisinΘ

In-CVaR βα (Z(θ))︸ ︷︷ ︸

a difference of two convexSP value functions

and minimizeθisinΘ

bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective

where

Z(θ) c

g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex

with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683

Bibliography 691

Index 721


Comments

• The In-CVaR minimization leads to the following stochastic program: minimize, over θ ∈ Θ,

\[
\underbrace{\min_{\eta_1 \in \Upsilon_1} \mathbb{E}\big[\,\varphi_1(\theta,\eta_1,\omega)\,\big]}_{\text{denoted } v_1(\theta)}
\;-\;
\underbrace{\min_{\eta_2 \in \Upsilon_2} \mathbb{E}\big[\,\varphi_2(\theta,\eta_2,\omega)\,\big]}_{\text{denoted } v_2(\theta)}
\]

Both functions v1 and v2 are (convex) value functions of a univariate expectation minimization; the overall objective is therefore a difference-of-convex objective.

• Developed a convex-programming-based sampling method for computing a critical point of the objective (a minimal numerical sketch follows below).

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
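To make the difference-of-convex structure concrete, here is a minimal numerical sketch in Python — an illustration under stated assumptions, not the talk's actual method or model. Each value function v_i is realized with the hypothetical integrand φ_i(θ, η, ω) = (a_i θω − η)², whose inner minimum over the scalar η is available in closed form; one surrogation (DCA) step keeps the convex v1 and replaces the concave −v2 by its linearization at the current iterate, with the gradient of v2 obtained from Danskin's formula.

import numpy as np

rng = np.random.default_rng(0)
omega = rng.normal(size=2000)          # i.i.d. samples standing in for the expectation

def v(theta, a):
    # value function: v(theta) = min_eta mean[(a*theta*omega - eta)^2];
    # the inner minimizer is eta* = a*theta*mean(omega), so v is convex quadratic
    eta = a * theta * omega.mean()
    return np.mean((a * theta * omega - eta) ** 2)

def grad_v2(theta, a2):
    # Danskin: differentiate the integrand in theta at the inner minimizer eta*
    eta = a2 * theta * omega.mean()
    return np.mean(2.0 * (a2 * theta * omega - eta) * a2 * omega)

a1, a2, theta = 1.0, 0.6, 3.0          # a2 < a1 keeps v1 - v2 bounded below
curv = 2.0 * a1 ** 2 * omega.var()     # curvature of the convex part v1
for _ in range(30):
    g = grad_v2(theta, a2)             # linearize the concave part -v2 at theta_k
    theta = g / curv                   # closed-form minimizer of the convex surrogate
print(theta, v(theta, a1) - v(theta, a2))   # both tend to 0, the critical point here

Each iteration solves (here, in closed form) a convex program, and the iterates converge to a critical point of v1 − v2 — exactly the notion the sampling method above targets.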


A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram — relationship between the stationary points, from sharpest to weakest: SOSC ⇒ local minimum ⇒ SONC ⇒ directional stationary (for dd functions) ⇒ limiting stationary ⇒ Clarke stationary ⇒ critical (for dc functions)]

• Stationarity:
– smooth case: ∇V(x) = 0
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail],
e.g. ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable

• Directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder it is to compute (see the example below).
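A standard one-dimensional example, added here for illustration, of why directional stationarity is the sharper notion: take V(x) = −|x| at x̄ = 0. The origin is Clarke stationary, yet every direction is a descent direction, so it is not directionally stationary:

\[
\partial_C V(0) = [-1,\,1] \ni 0,
\qquad
V'(0;d) = \lim_{t \downarrow 0} \frac{V(td) - V(0)}{t} = -|d| < 0
\quad \text{for all } d \neq 0 .
\]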

Computational Tools

• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions (a one-dimensional penalty example follows this list)

• A fusion of first-order and second-order algorithms

– Dealing with non-convexity: majorization-minimization algorithm

– Dealing with non-smoothness: nonsmooth Newton algorithm

– Dealing with pathological constraints: (exact) penalty method

• An integration of sampling and deterministic methods

– Dealing with population optimization in statistics

– Dealing with stochastic programming in operations research

– Dealing with risk minimization in financial management and tail control
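As a one-dimensional illustration of the (exact) penalty idea — an example assumed here, not taken from the slides — replacing the constraint x ≥ 1 by a nonsmooth max-penalty recovers the constrained minimizer exactly once the penalty parameter passes a finite threshold:

\[
\min_{x \ge 1} x \;=\; \min_{x \in \mathbb{R}} \Big\{\, x + \rho \max(1 - x,\, 0) \,\Big\}
\quad \text{for every } \rho > 1,
\]

with both minima attained at x* = 1. By contrast, a smooth quadratic penalty ρ·max(1 − x, 0)² is never exact: its minimizer x = 1 − 1/(2ρ) is infeasible for every finite ρ.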

Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic

• Difference-of-convex: facilitates algorithmic design via convex majorization

• Semismooth: enables fast algorithms of Newton type (a one-dimensional sketch follows the figure below)

• Bouligand differentiable: locally Lipschitz + directionally differentiable

• Sub-analytic and semi-analytic

* and many more

[Figure, from Section 4.4 (p. 231) of the monograph: inclusion diagram of the nonsmooth function classes — PA ⊂ PQ/PLQ; PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable = locally Lipschitz + directionally differentiable; together with dc, dc-composite, diff-max-convex, implicit convex-concave (icc), (strongly) weakly convex, and locally dc functions, with implications qualified by conditions (4.65)–(4.66), C¹/LC¹ smoothness, lower-C², and open or closed convex domains.]
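As promised above, a minimal one-dimensional semismooth-Newton sketch in Python — an illustrative toy, not code from the monograph — for the piecewise affine equation F(x) = x + max(x, 0) − 1 = 0, using an element of the B(ouligand)-subdifferential in place of the Jacobian; on piecewise affine equations the iteration terminates finitely.

def F(x):
    # piecewise affine residual with a kink at x = 0; unique root at x = 0.5
    return x + max(x, 0.0) - 1.0

def dF(x):
    # an element of the B-subdifferential of F: the slope of the active piece
    return 2.0 if x > 0.0 else 1.0

x = -5.0
for _ in range(10):
    x = x - F(x) / dF(x)   # semismooth Newton step
print(x)                   # -> 0.5, reached after two steps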

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 58: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 59: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Comments

bull The In-CVaR minimization leads to the following stochastic program

minimize over θ isin Θ of

minimumη1isinΥ1

[ϕ1(θ η1 ω)

]︸ ︷︷ ︸

denoted v1(θ)

minusminimumη2isinΥ2

[ϕ2(θ η2 ω)

]︸ ︷︷ ︸

denoted v2(θ)

Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective

bull Developed a convex programming based sampling method for computing acritical point of the objective

bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 60: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

A touch of mathematicsbridging computations with theory

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 61: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

First of All What Can be Computed

critical(for dc fncs)

Clarke statlimiting stat

directional stat(for dd fncs)

SONC

loc-min

SOSC

Relationship between the stationary points

I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]

eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable

I directional stationarity V prime(x d) ge 0 for all the directions d

bull The sharper the concept the harder to compute

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 62: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Computational Tools

I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions

I A fusion of first-order and second-order algorithms

ndash Dealing with non-convexity majorization-minimization algorithm

ndash Dealing with non-smoothness nonsmooth Newton algorithm

ndash Dealing with pathological constraints (exact) penalty method

I An integration of sampling and deterministic methods

ndash Dealing with population optimization in statistics

ndash Dealing with stochastic programming in operations research

ndash Dealing with risk minimization in financial management and tail control

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

44 CLASSES OF NONSMOOTH FUNCTIONS 231

dcdc composite

diff-max-cvx

local Lip+

dir diff

PC2 PC1 semismooth B-diff

PQ PLQ PA dc icc

s-weak-cve locally dc

PLQ

+ C1LC1 + lower-C2

s-weak-cvxlocally

s-weak-cvx

onopen

convexdomain by def

under(465) +

(466)

+C1

+ openor closed

convexdomain

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday September 11 2020)

Table of Contents

Preface i

Acronyms i

Glossary of Notation i

Index i

1 Primer of Mathematical Prerequisites 1

11 Smooth Calculus 1

111 Differentiability of functions of several variables 7

112 Mean-value theorems for vector-valued functions 12

113 Fixed-point theorems and contraction 16

114 Implicit-function theorems 18

12 Matrices and Linear Equations 23

121 Matrix factorizations eigenvalues and singular values 29

122 Selected matrix classes 36

13 Matrix Splitting Methods for Linear Equations 47

14 The Conjugate Gradient Method 50

15 Elements of Set-Valued Analysis 54

iii

iv TABLE OF CONTENTS

2 Optimization Background 59

21 Polyhedral Theory 59

211 The Fourier-Motzkin elimination in linear inequalities 60

212 Properties of polyhedra 63

22 Convex Sets and Functions 67

221 Euclidean projection and proximality 71

23 Convex Quadratic and Nonlinear Programs 79

231 The KKT conditions and CQs 81

24 The Minmax Theory of von Neumann 85

25 Basic Descent Methods for Smooth Problems 86

251 Unconstrained problems Armijo line search 87

252 The PGM for convex programs 91

253 The proximal Newton method 95

26 Nonlinear Programming Methods A Survey 103

3 Structured Learning via Statistics and Optimization 109

31 A Composite View of Statistical Estimation 111

311 The optimization problems 113

312 Learning classes 115

313 Loss functions 122

314 Regularizers for sparsity 123

32 Low Rankness in Matrix Optimization 128

321 Rank constraints Hard and soft formulations 128

322 Shape constrained index models 131

33 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143

41 B(ouligand)-Differentiable Functions 144

411 Generalized directional derivatives 153

TABLE OF CONTENTS v

42 Second-order Directional Derivatives 156

43 Subdifferentials and Regularity 162

44 Classes of Nonsmooth Functions 176

441 Piecewise affine functions 177

442 Piecewise linear-quadratic functions 190

443 Weakly convex functions 205

444 Difference-of-convex functions 214

445 Semismooth functions 221

446 Implicit convex-concave functions 229

447 A summary 240

5 Value Functions 243

51 Nonlinear Programming based Value Functions 244

511 Interdiction versus enhancement 247

52 Value Functions of Quadratic Programs 251

53 Value Functions Associated with Convex Functions 258

54 Robustification of Optimization Problems 261

541 Attacks without set-up costs 264

542 Attacks with set-up costs 267

55 The Danskin Property 271

56 Gap Functions 278

57 Some Nonconvex Stochastic Value Functions 282

58 Understanding Robust Optimization 289

581 Uncertainty in data realizations 290

582 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311

61 First-order Theory 313

611 The convex constrained case 314

612 Composite ε-strong stationarity 320

vi TABLE OF CONTENTS

613 Subdifferential based stationarity 324

62 Directional Saddle-Point Theory 332

63 Difference-of-Convex Constraints 337

631 Extension to piecewise dd-convex programs 345

64 Second-order Theory 348

65 Piecewise Linear-Quadratic Programs 350

651 Review of nonconvex QPs 350

652 Return to PLQ programs 352

653 Finiteness of stationary values 362

654 Testing copositivity One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371

71 Basic Surrogation 372

711 A broad family of upper surrogation functions 379

712 Examples of surrogation families 381

713 Surrogation algorithms for several problem classes 384

72 Refinements and Extensions 392

721 Moreau surrogation 392

722 Combined BCD and ADMM for coupled constraints 400

723 Surrogation with line searches 418

724 Dinkelbachrsquos algorithm with surrogation 424

725 Randomized choice of subproblems 432

726 Inexact surrogation 435

8 Theory of Error Bounds 441

81 An Introduction to Error Bounds 442

811 The stationarity inclusions 444

82 The Piecewise Polyhedral Inclusion 449

83 Error Bounds for Subdifferential-Based Stationarity 454

TABLE OF CONTENTS vii

84 Sequential Convergence Under Sufficient Descent 462

841 Using the local error bound (819) 462

842 The Kurdyka-983040Lojaziewicz theory 467

843 Connections between Theorems 841 and 844 475

85 Error Bounds and Linear Regularity 476

851 Piecewise systems 483

852 Stationarity for convex composite dc optimization 488

853 dd-convex systems 491

854 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501

91 Exact Penalization of Optimal Solutions 503

92 Exact Penalization of Stationary Solutions 506

93 Pulled-out Penalization of Composite Functions 511

94 Two Applications of Pull-out Penalization 517

941 B-stationarity of dc composite diff-max programs 518

942 Local optimality of composite twice semidiff programs 521

95 Exact Penalization of Affine Sparsity Constraints 527

96 A Special Class of Cardinality Constrained Problems 531

97 Penalized Surrogation for Composite Programs 533

971 Solving the convex penalized subproblems 539

972 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547

101 Expectation Functions 548

1011 Sample average approximations 555

102 Basic Stochastic Surrogation 560

1021 Expectation constraints 567

103 Two-stage SPs with Quadratic Recourse 568

1031 The positive definite case 569

viii TABLE OF CONTENTS

1032 The positive semidefinite case 571

104 Probabilisitic Error Bounds 588

105 Consistency of Directional Stationary Solutions 591

1051 Convergence of stationary solutions 592

1052 Asymptotic distribution of the stationary values 605

1053 Convergence rate of the stationary points 609

106 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625

111 Definitions and Preliminaries 627

1111 Applications 631

1112 Related models 637

112 Existence of a QNE 640

113 Computational Algorithms 648

1131 A Gauss-Seidel best-response algorithm 649

114 The Role of Potential 658

1141 Existence of a potential 665

115 Double-Loop Algorithms 669

116 The Nikaido-Isoda Optimization Formulations 678

1161 Surrogation applied to the NI function φc 681

1162 Difference-of-convexity and surrogation of 983141φc 683

Bibliography 691

Index 721

Page 63: Non-Optimization Problems Jong-Shi Pang€¦ · Jong-Shi Pang Daniel J. Epstein Department of Industrial and Systems Engineering University of Southern California presented at OneWorld

Classes of Nonsmooth Functions

Pervasive in applications

Progressively broader

bull Piecewise smoothlowast eg piecewise linear-quadratic

bull Difference-of convex facilitates algorithmic design via convex majorization

bull Semismooth enables fast algorithms of the Newton-type

bull Bouligand differentiable locally Lipschitz + directionally differentiable

bull Sub-analytic and semi-analytic

lowast and many more

[Figure: inclusion diagram of the nonsmooth function classes, reproduced from Section 4.4 (p. 231) of the monograph. Recoverable labels: PQ, PLQ, PA, dc, icc, dc-composite, diff-max-convex, weakly convex, locally dc, lower-C², C¹/LC¹, and the chain PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable (locally Lipschitz + directionally differentiable), with some inclusions holding on open or closed convex domains or under conditions (4.65)–(4.66).]
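As one concrete reading of the diagram (an illustrative example added here, not part of the slide): the capped-ℓ1 function, a common nonconvex sparsity surrogate,

\[
f(t) \;=\; \min(\,|t|,\,1\,) \;=\; |t| \;-\; \max(\,|t|-1,\,0\,),
\]

is piecewise affine but nonconvex, and the right-hand side is an explicit dc decomposition, since both |t| and max(|t| − 1, 0) are convex. Being PA, it is also PLQ, semismooth, and B-differentiable, so it sits in the innermost classes of the figure.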

Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

Submitted to publisher (Friday, September 11, 2020)
