The Era of "Non"-Optimization Problems

Jong-Shi Pang
Daniel J. Epstein Department of Industrial and Systems Engineering
University of Southern California

Presented at the One World Optimization Seminar Series
3:30–4:30 PM CEST, Monday, September 14, 2020

Based on joint work with Dr. Ying Cui at the University of Minnesota and many collaborators
Contents of presentation
• Some general thoughts on the state of the field
• A readiness test (a bit of light-hearted humor)
• Some key "non"-features
• Some novel "non"-optimization problems
• A gentle touch of mathematics
• The monograph
Some general thoughts: addressing the negative impact of the field of machine learning on the optimization field, and raising an alarm.

• Convexity and/or differentiability have been the safe haven of optimization and its applications for decades, and have been particularly popular in machine learning in recent years, to the extent that
  – modelers routinely sacrifice realism for convenience;
  – algorithm designers do whatever it takes to stay in this comfort zone, even at the expense of a potentially wrong approach;
  – practitioners are content with the resulting models, algorithms, and "solutions";
resulting in
  – tools being abused,
  – advances in science being stalled, and
  – a future being built on soft ground.
This talk: will you benefit? Yes or no?
Are you ready for the "non"-era? Yes or no?
A little humor before the storm.
A readiness test (flowchart):

A. Are you satisfied with the comfort zone of convexity and differentiability?
   Yes → you will not learn from this talk; live happily ever after.
   No → go to B.
B. How do you deal with non-convexity/non-smoothness?
   C1. Ignore these properties, or give them minimal or even wrong treatment.
   C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
   C3. Faithfully treat these "non"-properties, in one of two ways:
       D1. Employ heuristics, potentially leading to heuristics on heuristics as more complex models are formulated; or
       D2. The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features of non-problems:
• Non-Smooth + Non-Convex + Non-Deterministic
• Non-Separable
• Non-Stylistic
• Non-Interbreeding
• and many more "non"-topics ...
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression
extending classical affine regression:
\[
\operatorname*{minimize}_{a,\alpha,b,\beta}\ \frac{1}{N}\sum_{s=1}^{N}\Bigl[\,y_s-\Bigl(\underbrace{\max_{1\le i\le k_1}\bigl\{(a^i)^{\top}x_s+\alpha_i\bigr\}-\max_{1\le i\le k_2}\bigl\{(b^i)^{\top}x_s+\beta_i\bigr\}}_{\text{capturing all piecewise affine functions}}\Bigr)\Bigr]^{2}
\]
illustrating coupled non-convexity and non-differentiability; it estimates change points/hyperplanes in linear regression.
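As a concrete illustration, here is a minimal NumPy sketch of evaluating this difference-of-max-affine model and its least-squares loss; the names `pa_model` and `pa_loss` and the toy data are hypothetical, not from the talk:

```python
import numpy as np

def pa_model(x, A, alpha, B, beta):
    """Difference of two max-affine pieces:
    max_i {(a^i)'x + alpha_i} - max_i {(b^i)'x + beta_i},
    which can represent any piecewise affine function."""
    return np.max(A @ x + alpha) - np.max(B @ x + beta)

def pa_loss(X, y, A, alpha, B, beta):
    """Empirical least-squares loss (1/N) sum_s (y_s - model(x_s))^2."""
    preds = np.array([pa_model(x, A, alpha, B, beta) for x in X])
    return np.mean((y - preds) ** 2)

# Example: with A = [[1], [-1]], the first max realizes |x|, so the model
# reproduces y = |x| exactly and the loss vanishes.
A, alpha = np.array([[1.0], [-1.0]]), np.zeros(2)
B, beta = np.zeros((1, 1)), np.zeros(1)
X = np.array([[1.0], [-2.0], [3.0]])
y = np.abs(X).ravel()
loss = pa_loss(X, y, A, alpha, B, beta)  # 0.0 for this exact fit
```

The coupled "non"-features are already visible here: the loss is a convex quadratic composed with a difference of max functions, hence nonconvex and nondifferentiable in the parameters.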
Comments
• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is
\[
\varphi_1\Bigl(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\Bigr)-\varphi_2\Bigl(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\Bigr),
\]
where all the functions are convex.
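The "iterative convex-programming based algorithm" mentioned above is, in spirit, a difference-of-convex (DC) algorithm: linearize the subtracted convex part at the current iterate and minimize the resulting convex surrogate. A minimal one-dimensional sketch; the toy objective f(x) = x² − 2|x| is mine, not from the talk:

```python
def dc_algorithm(solve_linearized, h_subgrad, x0, iters=50):
    """For f = g - h with g, h convex: at x_k, replace h by its linearization
    h(x_k) + s*(x - x_k), s a subgradient of h at x_k, and minimize the
    convex surrogate g(x) - s*x."""
    x = x0
    for _ in range(iters):
        x = solve_linearized(h_subgrad(x))
    return x

# Toy DC objective f(x) = x^2 - 2|x|, i.e. g(x) = x^2, h(x) = 2|x|.
solve_linearized = lambda s: s / 2.0            # argmin_x x^2 - s*x
h_subgrad = lambda x: 2.0 if x >= 0 else -2.0   # a subgradient of 2|x|
x_star = dc_algorithm(solve_linearized, h_subgrad, 0.3)
# x_star = 1.0, a directional stationary point (and local minimizer) of f
```

Starting to the left of zero drives the iterates to the other local minimizer x = −1, illustrating that such methods certify stationarity, not global optimality.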
Logical Sparsity Learning
\[
\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{\beta\in\mathbb{R}^d} & f(\beta)\\[4pt]
\text{subject to} & \displaystyle\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\le b_i,\quad i=1,\dots,m,
\end{array}
\]
where \(|t|_0 \triangleq 1\) if \(t\neq 0\) and \(0\) otherwise. Such constraints model, e.g.:
• strong/weak hierarchy, e.g. \(|\beta_j|_0\le|\beta_k|_0\);
• cost-based variable selection (coefficients ≥ 0 but not all equal);
• hierarchical/group variable selection.
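A tiny sketch of checking feasibility of such affine sparsity constraints (ASCs); the single-row encoding of the weak-hierarchy example is an illustrative assumption:

```python
import numpy as np

def l0(beta):
    """Componentwise |.|_0: 1 where the entry is nonzero, 0 otherwise."""
    return (np.asarray(beta) != 0).astype(int)

def asc_feasible(beta, A, b):
    """Check the affine sparsity constraints sum_j a_ij |beta_j|_0 <= b_i, all i."""
    return bool(np.all(A @ l0(beta) <= b))

# Hierarchy constraint |beta_0|_0 <= |beta_1|_0 written as one ASC row:
A = np.array([[1, -1]])   # 1*|beta_0|_0 - 1*|beta_1|_0 <= 0
b = np.array([0])
asc_feasible([0.0, 2.0], A, b)   # True: beta_0 enters only if beta_1 does
asc_feasible([2.0, 0.0], A, b)   # False: hierarchy violated
```

Note that this row of A has a negative entry, exactly the delicate case for closedness of the constraint set discussed in the comments that follow.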
Comments
• The ℓ0 function is discrete in nature; an affine sparsity constraint (ASC) can be formulated either via complementarity constraints or, under boundedness of the original variables, via integer variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind
\[
\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\ f(\beta)+\gamma\sum_{i=1}^{m}\Bigl[\,\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0-b_i\Bigr]_+
\]
for a given γ > 0.
• When the matrix A has negative entries, the resulting ASC set may not be closed, and its optimization formulation may not have a lower semicontinuous objective.
• When \(|\bullet|_0\) is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:
\[
\operatorname*{minimize}_{\theta=(W^{\ell},w^{\ell})_{\ell=1}^{L}}\ \frac{1}{S}\sum_{s=1}^{S}\bigl[\,y_s-f_\theta(x_s)\bigr]^2,
\]
where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:
\[
f_\theta(x)=\underbrace{\bigl(W^L\bullet{}+w^L\bigr)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\bigl(0,\,W^2\bullet{}+w^2\bigr)}_{\text{second layer}}\circ\underbrace{\max\bigl(0,\,W^1x+w^1\bigr)}_{\text{input layer}},
\]
i.e., linear (output) ∘ ⋯ ∘ max ∘ linear ∘ max ∘ linear (input).
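A minimal NumPy sketch of the L-fold composite f_θ and the empirical loss; the shapes and the toy data are illustrative assumptions:

```python
import numpy as np

def f_theta(x, layers):
    """Feed-forward composition: ReLU layers max(0, W u + w), then an affine
    output layer. layers = [(W1, w1), ..., (WL, wL)]."""
    u = np.asarray(x, dtype=float)
    for W, w in layers[:-1]:
        u = np.maximum(0.0, W @ u + w)   # nonsmooth max (ReLU) activation
    W_L, w_L = layers[-1]
    return W_L @ u + w_L                 # affine output layer

def empirical_loss(X, y, layers):
    """(1/S) sum_s [y_s - f_theta(x_s)]^2 for scalar outputs."""
    preds = np.array([float(f_theta(x, layers)) for x in X])
    return float(np.mean((np.asarray(y) - preds) ** 2))

# Two-layer toy net that realizes f(x) = max(0, x): loss vanishes on matching data.
layers = [(np.eye(1), np.zeros(1)), (np.ones((1, 1)), np.zeros(1))]
X, y = np.array([[2.0], [-3.0]]), np.array([2.0, 0.0])
empirical_loss(X, y, layers)   # 0.0
```

Each `max` kink makes the training objective nondifferentiable in θ, and the layer-by-layer composition makes it nonconvex, which is what the exact-penalization treatment below addresses.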
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:
\[
\operatorname*{minimize}_{z\in Z,\,u}\ \zeta_\rho(z,u)\,\triangleq\,\frac{1}{S}\sum_{s=1}^{S}\varphi_s\bigl(u^{s,L}\bigr)+\rho\sum_{s=1}^{S}\sum_{\ell=1}^{L}\sum_{j=1}^{N_\ell}\underbrace{\Bigl|\,u^{s,\ell}_{j}-\psi_{\ell j}\bigl(z^{\ell},u^{s,\ell-1}\bigr)\Bigr|}_{\text{penalizing the components}}.
\]
• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• It leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems
II. Smart Planning

Power Systems Planning
ω1: uncertain renewable energy generation; ω2: uncertain demand for electricity;
x: production from traditional thermal plants;
y: production from backup fast-ramping generators.
\[
\operatorname*{minimize}_{x\in X}\ \varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\bigl[\psi(x,\omega_1,\omega_2)\bigr],
\]
where the recourse function is
\[
\psi(x,\omega_1,\omega_2)\,\triangleq\,\begin{array}[t]{ll}\operatorname*{minimum}_{y} & y^{\top}c\bigl([\,\omega_2-x-\omega_1\,]_+\bigr)\\[2pt] \text{subject to} & A(\omega)x+Dy\ge\xi(\omega).\end{array}
\]
The unit cost of the fast-ramping generators depends on
• the observed renewable production and demand for electricity;
• the shortfall due to the first-stage thermal power decisions.
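A sketch of evaluating the recourse function ψ for one scenario, restricted to the special case D = I and nonnegative unit costs so that the inner linear program has the closed-form solution y* = [ξ − Ax]_+; the one-dimensional data and the affine shortfall-dependent cost are made-up assumptions:

```python
import numpy as np

def recourse_cost(x, w1, w2, A, xi, base_cost, ramp_cost):
    """psi(x, w1, w2) for the special case D = I, c >= 0:
    min_y c'y s.t. A x + y >= xi, y >= 0 is solved by y* = [xi - A x]_+.
    The per-unit cost c = base + ramp * [w2 - x - w1]_+ grows with the
    shortfall left by the first-stage decision x."""
    shortfall = np.maximum(w2 - x - w1, 0.0)
    c = base_cost + ramp_cost * shortfall
    y_star = np.maximum(xi - A @ x, 0.0)   # cheapest feasible backup production
    return float(c @ y_star)

# One bus: x + y >= 3 forces y = 2; shortfall = 4 - 1 - 1 = 2, so the unit
# cost is 1 + 0.5*2 = 2 and psi = 2 * 2 = 4.
psi = recourse_cost(np.array([1.0]), np.array([1.0]), np.array([4.0]),
                    np.array([[1.0]]), np.array([3.0]),
                    base_cost=1.0, ramp_cost=0.5)
```

The plus-function inside the cost is what couples the two stages nonsmoothly: ψ is not differentiable in x even though each scenario subproblem is easy.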
Comments
• This leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
\[
\operatorname*{minimize}_{x}\ \zeta(x)\,\triangleq\,\varphi(x)+\mathbb{E}_{\xi}\bigl[\psi(x,\xi)\bigr]\quad\text{subject to }x\in X\subseteq\mathbb{R}^{n_1},
\]
where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:
\[
\psi(x,\xi)\,\triangleq\,\begin{array}[t]{ll}\operatorname*{minimum}_{y} & \bigl[f(\xi)+G(\xi)x\bigr]^{\top}y+\tfrac{1}{2}\,y^{\top}Qy\\[2pt] \text{subject to} & y\in Y(x,\xi)\,\triangleq\,\bigl\{\,y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\,\bigr\}.\end{array}
\]
• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a regularization-convexification method, combined with sampling, for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs
The underlying problem is: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving
\[
\operatorname*{minimize}_{x\ge 0}\ (c+\delta_c)^{\top}x+\rho\,\underbrace{\bigl\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\bigr\|_\infty}_{\text{constraint residual}}.
\]
Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving
\[
\operatorname*{maximum}_{\delta\in\Delta}\ \lambda_1\underbrace{(c+\delta_c)^{\top}x}_{\text{disrupt objective}}+\lambda_2\underbrace{\bigl\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\bigr\|_\infty}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}.
\]
• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial programming.
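A numerical sketch of the attacker's objective and a best response over a finite menu of candidate attacks; all data below are made up, and the continuous set ∆ is replaced by an explicit list:

```python
import numpy as np

def residual(x, A, b, dA, db):
    """Constraint residual ||[(A + dA) x - b - db]_+||_inf."""
    return float(np.max(np.maximum((A + dA) @ x - b - db, 0.0)))

def attacker_payoff(x, c, A, b, attack, lam1=1.0, lam2=1.0):
    """lam1*(c + dc)'x + lam2*||[(A + dA)x - b - db]_+||_inf - C(delta)."""
    dc, dA, db, cost = attack
    return float(lam1 * (c + dc) @ x + lam2 * residual(x, A, b, dA, db) - cost)

def best_attack(x, c, A, b, attacks):
    """Attacker's best response over a finite list of (dc, dA, db, cost) tuples."""
    return int(np.argmax([attacker_payoff(x, c, A, b, a) for a in attacks]))

x, c = np.array([1.0, 1.0]), np.array([1.0, 1.0])
A, b = np.array([[1.0, 1.0]]), np.array([3.0])
no_attack  = (np.zeros(2), np.zeros((1, 2)), np.zeros(1), 0.0)
rhs_attack = (np.zeros(2), np.zeros((1, 2)), np.array([-2.0]), 0.5)
best_attack(x, c, A, b, [no_attack, rhs_attack])  # 1: paying 0.5 to break feasibility wins
```

The plus-function composed with the infinity norm is the source of the nonsmoothness: the payoff is piecewise affine in the attack, not differentiable.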
Robustification of nonconvex problems
A general nonlinear program:
\[
\operatorname*{minimize}_{x\in S}\ \theta(x)+f(x)\quad\text{subject to } H(x)=0,
\]
where
• S is a closed convex set in R^n, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in R^n containing S;
• H: O → R^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:
\[
\operatorname*{minimize}_{x\in S}\ \theta(x)+\operatorname*{maximum}_{\delta\in\Delta}\,f(x,\delta)\quad\text{subject to}\ \underbrace{H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}.
\]
New formulation
via a pessimistic value-function minimization: given γ > 0,
\[
\operatorname*{minimize}_{x\in S}\ \theta(x)+v_\gamma(x),
\]
where v_γ(x) models constraint violation together with the cost of model attacks:
\[
v_\gamma(x)\,\triangleq\,\operatorname*{maximum}_{\delta\in\Delta}\ \varphi_\gamma(x,\delta)\,\triangleq\,f(x,\delta)+\gamma\,\underbrace{\bigl\|H(x,\delta)\bigr\|_\infty}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}},
\]
with the novel features:
• a combined perturbation δ in a joint uncertainty set ∆ ⊆ R^L;
• constraint satisfaction robustified via penalization.
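A sketch of evaluating this pessimistic value function when ∆ is approximated by a finite sample of perturbations; the functions f, H, C below are toy stand-ins, not from the talk:

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    """v_gamma(x) = max over delta of f(x,d) + gamma*||H(x,d)||_inf - C(d),
    with the max taken over a finite sample of the uncertainty set."""
    return max(
        f(x, d) + gamma * np.max(np.abs(np.atleast_1d(H(x, d)))) - C(d)
        for d in deltas
    )

# Toy data: f(x,d) = d*x, H(x,d) = x - d, attack cost C(d) = d^2.
f = lambda x, d: d * x
H = lambda x, d: x - d
C = lambda d: d ** 2
v_gamma(2.0, [0.0, 2.0], f, H, C, gamma=1.0)  # max(0 + 2 - 0, 4 + 0 - 4) = 2.0
```

Even in this toy, v_γ is a pointwise maximum of nonconvex functions of x, which is exactly the piecewise structure the explicit-form results below exploit.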
Comments
• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:
\[
\operatorname*{minimize}_{x\in S}\ \varphi(x)\,\triangleq\,\theta(x)+v(x),\quad\text{with}\quad v(x)\,\triangleq\,\max_{1\le j\le J}\ \operatorname*{maximum}_{\delta\in\Delta\cap\mathbb{R}^L_+}\ \bigl[\,g_j(x,\delta)-h_j(x)^{\top}\delta\,\bigr].
\]
• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:
\[
\text{In-CVaR}^{\beta}_{\alpha}(Z)\,\triangleq\,\frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)}z\,dF_Z(z)
=\frac{1}{\beta-\alpha}\Bigl[\,(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\Bigr].
\]
Buffered probability of exceedance/deceedance at level τ > 0:
\[
\text{bPOE}(Z,\tau)\,\triangleq\,1-\min\bigl\{\,\alpha\in[0,1]\mid\text{CVaR}_\alpha(Z)\ge\tau\,\bigr\},\qquad
\text{bPOD}(Z,u)\,\triangleq\,\min\bigl\{\,\beta\in[0,1]\mid\text{In-CVaR}^{\beta}_{0}(Z)\ge u\,\bigr\}.
\]
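For an empirical (equally weighted, discrete) distribution these quantities are easy to compute. A sketch using the Rockafellar-Uryasev representation of CVaR, where the minimization over the threshold t may be restricted to the sample points because the objective is piecewise linear in t:

```python
import numpy as np

def cvar(z, alpha):
    """CVaR_alpha(Z) = min_t { t + E[(Z - t)_+] / (1 - alpha) }  (alpha < 1),
    evaluated on an equally weighted sample z."""
    z = np.asarray(z, dtype=float)
    return min(t + np.mean(np.maximum(z - t, 0.0)) / (1.0 - alpha) for t in z)

def in_cvar(z, alpha, beta):
    """Interval CVaR via ((1-alpha) CVaR_alpha - (1-beta) CVaR_beta)/(beta-alpha),
    for 0 <= alpha < beta < 1."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = [1.0, 2.0, 3.0, 4.0]
cvar(z, 0.5)          # 3.5: the mean of the worst half of the sample
in_cvar(z, 0.0, 0.5)  # 1.5: the mean of the best half, isolating the left tail
```

The difference of the two CVaR terms is what makes In-CVaR a difference-of-convex functional, which is the structure exploited in the comments below.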
Buffered probability of the interval (τ, u]
We have the following bound for an interval probability:
\[
\mathbb{P}(u\ge Z>\tau)\ \le\ \text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\,\triangleq\,\text{bPOI}^{u}_{\tau}(Z).
\]
Two risk minimization problems of a composite random functional:
\[
\operatorname*{minimize}_{\theta\in\Theta}\ \underbrace{\text{In-CVaR}^{\beta}_{\alpha}\bigl(Z(\theta)\bigr)}_{\text{a difference of two convex SP value functions}}
\qquad\text{and}\qquad
\operatorname*{minimize}_{\theta\in\Theta}\ \underbrace{\text{bPOI}^{u}_{\tau}\bigl(Z(\theta)\bigr)}_{\text{a stochastic program with a difference-of-convex objective}},
\]
where
\[
Z(\theta)\,\triangleq\,c\bigl(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)-h(\bullet,z)\ \text{is difference-of-convex}}\bigr),
\]
with c: R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
Comments
• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ
\[
\underbrace{\operatorname*{minimum}_{\eta_1\in\Upsilon_1}\ \mathbb{E}_{\omega}\bigl[\varphi_1(\theta,\eta_1,\omega)\bigr]}_{\text{denoted }v_1(\theta)}\ -\ \underbrace{\operatorname*{minimum}_{\eta_2\in\Upsilon_2}\ \mathbb{E}_{\omega}\bigl[\varphi_2(\theta,\eta_2,\omega)\bigr]}_{\text{denoted }v_2(\theta)}.
\]
Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of all: what can be computed?

[Diagram: hierarchy of stationarity concepts — directional stationary (for dd functions) ⇒ limiting stationary ⇒ Clarke stationary ⇒ critical (for dc functions); shown alongside local minimizers, the second-order necessary condition (SONC), and the second-order sufficient condition (SOSC).]

Relationship between the stationary points:
• Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail, e.g. ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable].
• Directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder it is to compute.
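Directional stationarity, the sharpest of these concepts, can at least be probed numerically by sampling directions. A rough finite-difference sketch, adequate only as a sanity check for well-behaved directionally differentiable functions:

```python
import numpy as np

def looks_directionally_stationary(f, x, n_dirs=200, h=1e-6, tol=1e-4, seed=0):
    """Probe f'(x; d) >= 0 over sampled unit directions d using one-sided
    difference quotients (f(x + h*d) - f(x)) / h >= -tol."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    for _ in range(n_dirs):
        d = rng.normal(size=x.size)
        d /= np.linalg.norm(d)
        if (f(x + h * d) - f(x)) / h < -tol:
            return False   # found a descent direction: not d-stationary
    return True

f = lambda v: float(np.sum(np.abs(v)))      # nonsmooth; d-stationary only at 0
looks_directionally_stationary(f, [0.0, 0.0])   # True
looks_directionally_stationary(f, [1.0, 1.0])   # False: descent directions exist
```

Note how the test certifies non-stationarity by exhibiting a descent direction, but can only suggest stationarity; this asymmetry mirrors the "sharper is harder" remark above.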
Computational Tools
• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
• A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithms;
  – dealing with non-smoothness: nonsmooth Newton algorithms;
  – dealing with pathological constraints: (exact) penalty methods.
• An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
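The "integration of sampling and deterministic methods" can be illustrated in miniature by sample average approximation (SAA): replace the expectation by an empirical mean and minimize the resulting deterministic surrogate (here by brute force over a candidate grid; the quadratic F is a toy stand-in):

```python
import numpy as np

def saa_minimize(F, samples, grid):
    """Sample average approximation: minimize x -> (1/S) sum_s F(x, w_s)
    over a finite candidate grid."""
    objective = lambda x: float(np.mean([F(x, w) for w in samples]))
    return min(grid, key=objective)

F = lambda x, w: (x - w) ** 2        # population minimizer of E[F] is E[w]
samples = [0.0, 2.0]
saa_minimize(F, samples, [0.0, 0.5, 1.0, 1.5, 2.0])   # 1.0, the sample mean
```

In the nonconvex nonsmooth setting of the talk, the deterministic surrogate is itself solved only to (directional) stationarity, which is why the consistency questions in Chapter 10 of the monograph concern stationary solutions rather than minimizers.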
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• piecewise smooth*, e.g. piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• sub-analytic and semi-analytic.
(* and many more)
[Diagram (monograph Section 4.4, p. 231): inclusion relations among classes of nonsmooth functions — PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable; PQ ⊂ PLQ ⊂ PA ⊂ dc ⊂ icc (implicit convex-concave); together with dc composite, diff-max-convex, weakly convex, and locally dc functions, related under additional conditions (C¹, LC¹, lower-C²) on open or closed convex domains.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher on Friday, September 11, 2020
Table of Contents
Preface  i
Acronyms  i
Glossary of Notation  i
Index  i
1 Primer of Mathematical Prerequisites  1
  1.1 Smooth Calculus  1
    1.1.1 Differentiability of functions of several variables  7
    1.1.2 Mean-value theorems for vector-valued functions  12
    1.1.3 Fixed-point theorems and contraction  16
    1.1.4 Implicit-function theorems  18
  1.2 Matrices and Linear Equations  23
    1.2.1 Matrix factorizations, eigenvalues, and singular values  29
    1.2.2 Selected matrix classes  36
  1.3 Matrix Splitting Methods for Linear Equations  47
  1.4 The Conjugate Gradient Method  50
  1.5 Elements of Set-Valued Analysis  54
2 Optimization Background  59
  2.1 Polyhedral Theory  59
    2.1.1 The Fourier-Motzkin elimination in linear inequalities  60
    2.1.2 Properties of polyhedra  63
  2.2 Convex Sets and Functions  67
    2.2.1 Euclidean projection and proximality  71
  2.3 Convex Quadratic and Nonlinear Programs  79
    2.3.1 The KKT conditions and CQs  81
  2.4 The Minmax Theory of von Neumann  85
  2.5 Basic Descent Methods for Smooth Problems  86
    2.5.1 Unconstrained problems: Armijo line search  87
    2.5.2 The PGM for convex programs  91
    2.5.3 The proximal Newton method  95
  2.6 Nonlinear Programming Methods: A Survey  103
3 Structured Learning via Statistics and Optimization  109
  3.1 A Composite View of Statistical Estimation  111
    3.1.1 The optimization problems  113
    3.1.2 Learning classes  115
    3.1.3 Loss functions  122
    3.1.4 Regularizers for sparsity  123
  3.2 Low Rankness in Matrix Optimization  128
    3.2.1 Rank constraints: Hard and soft formulations  128
    3.2.2 Shape constrained index models  131
  3.3 Affine Sparsity Constraints  137
4 Nonsmooth Analysis  143
  4.1 B(ouligand)-Differentiable Functions  144
    4.1.1 Generalized directional derivatives  153
  4.2 Second-order Directional Derivatives  156
  4.3 Subdifferentials and Regularity  162
  4.4 Classes of Nonsmooth Functions  176
    4.4.1 Piecewise affine functions  177
    4.4.2 Piecewise linear-quadratic functions  190
    4.4.3 Weakly convex functions  205
    4.4.4 Difference-of-convex functions  214
    4.4.5 Semismooth functions  221
    4.4.6 Implicit convex-concave functions  229
    4.4.7 A summary  240
5 Value Functions  243
  5.1 Nonlinear Programming based Value Functions  244
    5.1.1 Interdiction versus enhancement  247
  5.2 Value Functions of Quadratic Programs  251
  5.3 Value Functions Associated with Convex Functions  258
  5.4 Robustification of Optimization Problems  261
    5.4.1 Attacks without set-up costs  264
    5.4.2 Attacks with set-up costs  267
  5.5 The Danskin Property  271
  5.6 Gap Functions  278
  5.7 Some Nonconvex Stochastic Value Functions  282
  5.8 Understanding Robust Optimization  289
    5.8.1 Uncertainty in data realizations  290
    5.8.2 Uncertainty in the probabilities of realizations  304
6 Stationarity Theory  311
  6.1 First-order Theory  313
    6.1.1 The convex constrained case  314
    6.1.2 Composite ε-strong stationarity  320
    6.1.3 Subdifferential based stationarity  324
  6.2 Directional Saddle-Point Theory  332
  6.3 Difference-of-Convex Constraints  337
    6.3.1 Extension to piecewise dd-convex programs  345
  6.4 Second-order Theory  348
  6.5 Piecewise Linear-Quadratic Programs  350
    6.5.1 Review of nonconvex QPs  350
    6.5.2 Return to PLQ programs  352
    6.5.3 Finiteness of stationary values  362
    6.5.4 Testing copositivity: One negative eigenvalue  367
7 Computational Algorithms by Surrogation  371
  7.1 Basic Surrogation  372
    7.1.1 A broad family of upper surrogation functions  379
    7.1.2 Examples of surrogation families  381
    7.1.3 Surrogation algorithms for several problem classes  384
  7.2 Refinements and Extensions  392
    7.2.1 Moreau surrogation  392
    7.2.2 Combined BCD and ADMM for coupled constraints  400
    7.2.3 Surrogation with line searches  418
    7.2.4 Dinkelbach's algorithm with surrogation  424
    7.2.5 Randomized choice of subproblems  432
    7.2.6 Inexact surrogation  435
8 Theory of Error Bounds  441
  8.1 An Introduction to Error Bounds  442
    8.1.1 The stationarity inclusions  444
  8.2 The Piecewise Polyhedral Inclusion  449
  8.3 Error Bounds for Subdifferential-Based Stationarity  454
  8.4 Sequential Convergence Under Sufficient Descent  462
    8.4.1 Using the local error bound (8.19)  462
    8.4.2 The Kurdyka-Łojasiewicz theory  467
    8.4.3 Connections between Theorems 8.4.1 and 8.4.4  475
  8.5 Error Bounds and Linear Regularity  476
    8.5.1 Piecewise systems  483
    8.5.2 Stationarity for convex composite dc optimization  488
    8.5.3 dd-convex systems  491
    8.5.4 Linear regularity and directional Slater  497
9 Theory of Exact Penalization  501
  9.1 Exact Penalization of Optimal Solutions  503
  9.2 Exact Penalization of Stationary Solutions  506
  9.3 Pulled-out Penalization of Composite Functions  511
  9.4 Two Applications of Pull-out Penalization  517
    9.4.1 B-stationarity of dc composite diff-max programs  518
    9.4.2 Local optimality of composite twice semidiff programs  521
  9.5 Exact Penalization of Affine Sparsity Constraints  527
  9.6 A Special Class of Cardinality Constrained Problems  531
  9.7 Penalized Surrogation for Composite Programs  533
    9.7.1 Solving the convex penalized subproblems  539
    9.7.2 Semismooth Newton methods  543
10 Nonconvex Stochastic Programs  547
  10.1 Expectation Functions  548
    10.1.1 Sample average approximations  555
  10.2 Basic Stochastic Surrogation  560
    10.2.1 Expectation constraints  567
  10.3 Two-stage SPs with Quadratic Recourse  568
    10.3.1 The positive definite case  569
    10.3.2 The positive semidefinite case  571
  10.4 Probabilistic Error Bounds  588
  10.5 Consistency of Directional Stationary Solutions  591
    10.5.1 Convergence of stationary solutions  592
    10.5.2 Asymptotic distribution of the stationary values  605
    10.5.3 Convergence rate of the stationary points  609
  10.6 Noisy Amplitude-based Phase Retrieval  613
11 Nonconvex Game Problems  625
  11.1 Definitions and Preliminaries  627
    11.1.1 Applications  631
    11.1.2 Related models  637
  11.2 Existence of a QNE  640
  11.3 Computational Algorithms  648
    11.3.1 A Gauss-Seidel best-response algorithm  649
  11.4 The Role of Potential  658
    11.4.1 Existence of a potential  665
  11.5 Double-Loop Algorithms  669
  11.6 The Nikaido-Isoda Optimization Formulations  678
    11.6.1 Surrogation applied to the NI function φ_c  681
    11.6.2 Difference-of-convexity and surrogation of φ̂_c  683
Bibliography  691
Index  721
Contents of presentation
• Some general thoughts on the state of the field
• A readiness test (a light-hearted humor)
• Some key "non"-features
• Some novel "non"-optimization problems
• A gentle touch of mathematics
• The monograph
Some General Thoughts: addressing the negative impact of the field of machine learning on the optimization field, and raising an alarm
• Convexity and/or differentiability have been the safe haven for optimization and its applications for decades, and have been particularly popular in machine learning in recent years, to the extent that:
  • modellers are routinely sacrificing realism for convenience;
  • algorithm designers will do whatever it takes to stay in this comfort zone, even at the expense of a potentially wrong approach;
  • practitioners are content with the resulting models, algorithms, and "solutions";
resulting in:
  • tools being abused;
  • advances in science being stalled; and
  • a future being built on soft ground.
This talk: benefits? Yes or no?
Ready for the "non"-era? Yes or no?
A little humor before the storm.
[Decision-tree slide, built up step by step across the original frames:]
A. Are you satisfied with the comfort zone of convexity and differentiability?
• Yes → you will not learn from this talk; live happily ever after.
• No → go to B.
B. How to deal with non-convexity/non-smoothness?
• C1. Ignore these properties, or give them minimal or even wrong treatment.
• C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
• C3. Faithfully treat these "non"-properties.
• D1 (after C1/C2). Employ heuristics, potentially leading to heuristics on heuristics when complex models are being formulated.
• D2 (after C3). The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features
"Non"-Problems = Non-Smooth + Non-Convex + Non-Deterministic + Non-Separable + Non-Stylistic + Non-Interbreeding, and many more non-topics …
Examples of Non-Problems
I. Structured Learning
Piecewise Affine Regression, extending classical affine regression:
\[
\operatorname*{minimize}_{a,\alpha,b,\beta}\;\frac{1}{N}\sum_{s=1}^{N}\Bigl(y_s-\Bigl[\underbrace{\max_{1\le i\le k_1}\bigl\{(a^i)^{\top}x_s+\alpha_i\bigr\}-\max_{1\le i\le k_2}\bigl\{(b^i)^{\top}x_s+\beta_i\bigr\}}_{\text{capturing all piecewise affine functions}}\Bigr]\Bigr)^{2}
\]
illustrating coupled non-convexity and non-differentiability: estimate change points/hyperplanes in linear regression.
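As a concrete illustration (a minimal sketch of mine, not from the talk; the helper names are hypothetical), the difference-of-max-affine model and its least-squares fitting objective can be evaluated as follows:

```python
import numpy as np

def pa_model(X, A, alpha, B, beta):
    """Difference-of-max-affine model:
    max_{1<=i<=k1} (a_i^T x + alpha_i) - max_{1<=i<=k2} (b_i^T x + beta_i),
    evaluated row-wise on the N x n data matrix X."""
    return (X @ A.T + alpha).max(axis=1) - (X @ B.T + beta).max(axis=1)

def fit_objective(X, y, A, alpha, B, beta):
    """The nonconvex, nonsmooth empirical loss (1/N) sum_s (y_s - model(x_s))^2."""
    return float(np.mean((y - pa_model(X, A, alpha, B, beta)) ** 2))

# Example: |x| = max(x, -x) - max(0) is piecewise affine with k1 = 2, k2 = 1.
X = np.array([[3.0], [-2.0], [0.5]])
y = np.abs(X).ravel()
A, alpha = np.array([[1.0], [-1.0]]), np.zeros(2)
B, beta = np.array([[0.0]]), np.zeros(1)
print(fit_objective(X, y, A, alpha, B, beta))  # 0.0: this model matches y exactly
```

The coupling of the two inner max terms is exactly where the non-convexity and non-differentiability interact.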
Comments
• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is
\[
\varphi_1\Bigl(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\Bigr)-\varphi_2\Bigl(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\Bigr)
\]
where all functions are convex.
Logical Sparsity Learning
\[
\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\;f(\beta)\quad\text{subject to}\quad\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\ \ge\ b_i,\qquad i=1,\dots,m,
\]
where
\[
|t|_0\ \triangleq\ \begin{cases}1 & \text{if } t\neq 0\\[2pt] 0 & \text{otherwise.}\end{cases}
\]
• strong/weak hierarchy, e.g. |β_j|_0 ≤ |β_k|_0;
• cost-based variable selection: coefficients ≥ 0 but not all equal;
• hierarchical/group variable selection.
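A minimal numerical sketch (hypothetical helper names; the ≥ orientation matches the soft-penalty formulation discussed in the comments) of the ℓ0 function and an affine-sparsity-constraint feasibility check:

```python
import numpy as np

def l0(beta):
    """Entrywise |t|_0: 1 if t != 0, else 0."""
    return (np.asarray(beta, dtype=float) != 0.0).astype(float)

def asc_feasible(A, b, beta):
    """Check the affine sparsity constraints sum_j a_ij |beta_j|_0 >= b_i for all i."""
    return bool(np.all(np.asarray(A) @ l0(beta) >= np.asarray(b)))

# Strong hierarchy |beta_1|_0 <= |beta_2|_0, i.e. |beta_2|_0 - |beta_1|_0 >= 0:
A, b = np.array([[-1.0, 1.0]]), np.array([0.0])
print(asc_feasible(A, b, np.array([0.0, 3.0])))  # True:  beta_1 absent, beta_2 present
print(asc_feasible(A, b, np.array([3.0, 0.0])))  # False: beta_1 present without beta_2
```

The discontinuity of `l0` is the source of all the difficulty: the feasible set is a union of pieces, not a convex set.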
Comments
• The ℓ0 function is discrete in nature; an ASC (affine sparsity constraint) can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind
\[
\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\;f(\beta)+\gamma\sum_{i=1}^{m}\Bigl[\,b_i-\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\Bigr]_+
\]
for a given γ > 0.
• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When |•|_0 is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:
\[
\operatorname*{minimize}_{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}\;\frac{1}{S}\sum_{s=1}^{S}\bigl[\,y_s-f_\theta(x_s)\,\bigr]^2
\]
where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:
\[
f_\theta(x)=\underbrace{\bigl(W^L\bullet+w^L\bigr)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\bigl(0,\,W^2\bullet+w^2\bigr)}_{\text{second layer}}\circ\underbrace{\max\bigl(0,\,W^1x+w^1\bigr)}_{\text{input layer}}
\]
i.e., linear (output) ∘ ··· ∘ max ∘ linear ∘ max ∘ linear (input).
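The layered composition can be sketched in a few lines of NumPy (an illustrative toy of mine, not the algorithmic treatment discussed next):

```python
import numpy as np

def f_theta(x, Ws, ws):
    """L-fold composite: ReLU layers max(0, W u + w), then a linear output layer."""
    u = np.asarray(x, dtype=float)
    for W, w in zip(Ws[:-1], ws[:-1]):       # input and hidden layers
        u = np.maximum(0.0, W @ u + w)
    return Ws[-1] @ u + ws[-1]               # linear output layer

def ls_loss(X, y, Ws, ws):
    """(1/S) sum_s [y_s - f_theta(x_s)]^2, nonconvex and nonsmooth in theta."""
    preds = np.array([f_theta(x, Ws, ws).item() for x in X])
    return float(np.mean((np.asarray(y) - preds) ** 2))

# A 1-2-1 network computing |x| = max(0, x) + max(0, -x):
Ws = [np.array([[1.0], [-1.0]]), np.array([[1.0, 1.0]])]
ws = [np.zeros(2), np.zeros(1)]
X, y = [np.array([3.0]), np.array([-2.0])], [3.0, 2.0]
print(ls_loss(X, y, Ws, ws))  # 0.0
```

Even this two-layer toy is a composition of nonsmooth maps, which is what makes the loss neither convex nor differentiable in θ.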
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:
\[
\operatorname*{minimize}_{z\in Z,\,u}\;\zeta_\rho(z,u)\ \triangleq\ \frac{1}{S}\sum_{s=1}^{S}\Biggl[\varphi_s\bigl(u^{s,L}\bigr)+\rho\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Bigl|\,u^{s,\ell}_j-\psi_{\ell j}\bigl(z^\ell,u^{s,\ell-1}\bigr)\Bigr|}_{\text{penalizing the components}}\Biggr]
\]
• Rigorous algorithms, with supporting convergence theory, for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems
II. Smart Planning
Power Systems Planning
ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity;
x: production from traditional thermal plants; y: production from backup fast-ramping generators.
\[
\operatorname*{minimize}_{x\in X}\;\varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\bigl[\psi(x,\omega_1,\omega_2)\bigr]
\]
where the recourse function is
\[
\psi(x,\omega_1,\omega_2)\ \triangleq\ \operatorname*{minimum}_{y}\;y^{\top}c\bigl([\,\omega_2-x-\omega_1\,]_+\bigr)\quad\text{subject to}\quad A(\omega)x+Dy\ \ge\ \xi(\omega).
\]
The unit cost of the fast-ramping generators depends on:
► the observed renewable production and demand of electricity;
► the shortfall due to the first-stage thermal power decisions.
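For a fixed first-stage decision and scenario, the recourse value is a linear program; the following toy sketch (all data and the cost map `unit_cost` are hypothetical) evaluates it with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def recourse(x, w1, w2, A, D, xi, unit_cost):
    """psi(x, w1, w2): cheapest backup dispatch y >= 0 under A x + D y >= xi,
    with unit costs c([w2 - x - w1]_+) driven by the realized shortfall."""
    c = unit_cost(np.maximum(w2 - x - w1, 0.0))   # cost vector of the y variables
    res = linprog(c, A_ub=-D, b_ub=A @ x - xi,    # -D y <= A x - xi  <=>  A x + D y >= xi
                  bounds=[(0, None)] * D.shape[1])
    return res.fun if res.success else np.inf

# One thermal unit, one backup unit, demand-type constraint x + y >= 5:
A, D, xi = np.array([[1.0]]), np.array([[1.0]]), np.array([5.0])
cost = lambda s: 1.0 + s                          # unit cost grows with the shortfall
print(recourse(np.array([2.0]), np.array([1.0]), np.array([4.0]), A, D, xi, cost))
# 6.0: shortfall 1 gives unit cost 2, and y = 3 is needed
```

Because the cost vector itself depends on x through the nonsmooth map [ω2 − x − ω1]_+, the expectation of ψ over scenarios is a nonconvex nonsmooth function of x.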
Comments
• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
\[
\operatorname*{minimize}_{x}\;\zeta(x)\ \triangleq\ \varphi(x)+\mathbb{E}_{\xi}\bigl[\psi(x,\xi)\bigr]\quad\text{subject to}\quad x\in X\subseteq\mathbb{R}^{n_1},
\]
where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:
\[
\psi(x,\xi)\ \triangleq\ \operatorname*{minimum}_{y}\;\bigl[f(\xi)+G(\xi)x\bigr]^{\top}y+\tfrac{1}{2}\,y^{\top}Qy\quad\text{subject to}\quad y\in Y(x,\xi)\triangleq\bigl\{y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\bigr\}.
\]
• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a combined regularization-convexification-sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs
Underlying problem: minimize c^⊤x subject to Ax = b and x ≥ 0.
Defender's problem: anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving
\[
\operatorname*{minimize}_{x\ge 0}\;(c+\delta_c)^{\top}x+\rho\,\underbrace{\bigl\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\bigr\|_{\infty}}_{\text{constraint residual}}.
\]
Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving
\[
\operatorname*{maximum}_{\delta\in\Delta}\;\lambda_1\underbrace{(c+\delta_c)^{\top}x}_{\text{disrupt objective}}+\lambda_2\underbrace{\bigl\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\bigr\|_{\infty}}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}.
\]
• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
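A sketch of the defender's penalized objective (hypothetical data; the residual uses the ∞-norm of the positive part, as above):

```python
import numpy as np

def defender_objective(x, c, dc, A, dA, b, db, rho):
    """(c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf."""
    residual = np.maximum((A + dA) @ x - b - db, 0.0)
    return float((c + dc) @ x + rho * residual.max())

c, A, b = np.array([1.0, 1.0]), np.array([[1.0, 1.0]]), np.array([2.0])
x = np.array([1.0, 2.0])
zero = np.zeros_like
# No attack: cost 3 plus rho times the residual 1 of the constraint x1 + x2 = 2.
print(defender_objective(x, c, zero(c), A, zero(A), b, zero(b), rho=10.0))  # 13.0
```

The max-of-positive-parts residual is exactly the kind of nonsmooth piecewise structure the "non"-framework is built to handle.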
Robustification of nonconvex problems
A general nonlinear program:
\[
\operatorname*{minimize}_{x\in S}\;\theta(x)+f(x)\quad\text{subject to}\quad H(x)=0,
\]
where
• S is a closed convex set in R^n, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in R^n containing S;
• H : O → R^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.
Robust counterpart with separate perturbations:
\[
\operatorname*{minimize}_{x\in S}\;\theta(x)+\operatorname*{maximum}_{\delta\in\Delta}\;f(x,\delta)\quad\text{subject to}\quad\underbrace{H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}.
\]
New formulation
via a pessimistic value-function minimization: given γ > 0,
\[
\operatorname*{minimize}_{x\in S}\;\theta(x)+v_\gamma(x),
\]
where v_γ(x) models constraint violation with the cost of model attacks:
\[
v_\gamma(x)\ \triangleq\ \operatorname*{maximum}_{\delta\in\Delta}\;\varphi_\gamma(x,\delta)\ \triangleq\ f(x,\delta)+\gamma\,\underbrace{\|H(x,\delta)\|_{\infty}}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}},
\]
with the novel features:
• combined perturbation δ in a joint uncertainty set Δ ⊆ R^L;
• robustified constraint satisfaction via penalization.
Comments
• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:
\[
\operatorname*{minimize}_{x\in S}\;\varphi(x)\ \triangleq\ \theta(x)+v(x),\quad\text{with}\quad v(x)\ \triangleq\ \max_{1\le j\le J}\ \operatorname*{maximum}_{\delta\in\Delta\cap\mathbb{R}^L_+}\ \bigl[\,g_j(x,\delta)-h_j(x)^{\top}\delta\,\bigr].
\]
• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III. Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control the left- and right-tail distributions simultaneously.
Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:
\[
\text{In-CVaR}^{\beta}_{\alpha}(Z)\ \triangleq\ \frac{1}{\beta-\alpha}\int_{\mathrm{VaR}_{\beta}(Z)>z\ge\mathrm{VaR}_{\alpha}(Z)} z\,dF_Z(z)\ =\ \frac{1}{\beta-\alpha}\Bigl[(1-\alpha)\,\mathrm{CVaR}_{\alpha}(Z)-(1-\beta)\,\mathrm{CVaR}_{\beta}(Z)\Bigr].
\]
Buffered probabilities of exceedance/deceedance at levels τ > 0 and u:
\[
\mathrm{bPOE}(Z;\tau)\ \triangleq\ 1-\min\bigl\{\alpha\in[0,1]:\mathrm{CVaR}_{\alpha}(Z)\ge\tau\bigr\},\qquad
\mathrm{bPOD}(Z;u)\ \triangleq\ \min\bigl\{\beta\in[0,1]:\text{In-CVaR}^{\beta}_{0}(Z)\ge u\bigr\}.
\]
Buffered probability of the interval (τ, u]
We have the following bound for an interval probability:
\[
\mathbb{P}(u\ge Z>\tau)\ \le\ \mathrm{bPOE}(Z;\tau)+\mathrm{bPOD}(Z;u)-1\ \triangleq\ \mathrm{bPOI}^{\,u}_{\,\tau}(Z).
\]
Two risk minimization problems of a composite random functional:
\[
\operatorname*{minimize}_{\theta\in\Theta}\ \underbrace{\text{In-CVaR}^{\beta}_{\alpha}\bigl(Z(\theta)\bigr)}_{\text{a difference of two convex SP value functions}}
\qquad\text{and}\qquad
\operatorname*{minimize}_{\theta\in\Theta}\ \underbrace{\mathrm{bPOI}^{\,u}_{\,\tau}\bigl(Z(\theta)\bigr)}_{\text{a stochastic program with a difference-of-convex objective}},
\]
where
\[
Z(\theta)\ \triangleq\ c\bigl(\,\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)\,-\,h(\bullet,z)\ \text{is difference-of-convex}}\,\bigr),
\]
with c : R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
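These tail functionals are easy to estimate from samples; here is a sketch of mine using the Rockafellar-Uryasev representation CVaR_α(Z) = min_t { t + E[(Z − t)_+]/(1 − α) }, under the assumption that minimizing over the sample points suffices (which holds for discrete empirical distributions):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: min over t of  t + E[(Z - t)_+] / (1 - alpha)."""
    z = np.asarray(z, dtype=float)
    return float(min(t + np.mean(np.maximum(z - t, 0.0)) / (1.0 - alpha) for t in z))

def in_cvar(z, alpha, beta):
    """In-CVaR^beta_alpha = [(1 - alpha) CVaR_alpha - (1 - beta) CVaR_beta] / (beta - alpha)."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = [1.0, 2.0, 3.0, 4.0]
print(cvar(z, 0.5))          # 3.5: average of the worst half
print(in_cvar(z, 0.0, 0.5))  # 1.5: average of the [VaR_0, VaR_0.5) band
```

Note that `cvar` is a convex value function of a univariate minimization, so `in_cvar`, a difference of two such values, is exactly the difference-of-convex structure exploited above.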
Comments
• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ
\[
\underbrace{\operatorname*{minimum}_{\eta_1\in\Upsilon_1}\ \mathbb{E}_{\omega}\bigl[\varphi_1(\theta,\eta_1,\omega)\bigr]}_{\text{denoted }v_1(\theta)}\ -\ \underbrace{\operatorname*{minimum}_{\eta_2\in\Upsilon_2}\ \mathbb{E}_{\omega}\bigl[\varphi_2(\theta,\eta_2,\omega)\bigr]}_{\text{denoted }v_2(\theta)}.
\]
Both v_1 and v_2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory
First of all: what can be computed?
[Diagram: relationships among the stationarity concepts: critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions), second-order necessary conditions (SONC), local minima, and second-order sufficient conditions (SOSC).]
► Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g. ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable.
► Directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder it is to compute.
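The gap between the concepts can be seen on the dc function V(x) = x² − |x| (my example, not from the talk): x = 0 is a critical point of the dc decomposition, since 0 ∈ ∂|·|(0) = [−1, 1], yet V′(0; d) = −|d| < 0, so 0 is not directionally stationary:

```python
def dir_deriv(V, x, d, tau=1e-8):
    """One-sided directional derivative: V'(x; d) ~ (V(x + tau d) - V(x)) / tau."""
    return (V(x + tau * d) - V(x)) / tau

V = lambda x: x * x - abs(x)    # dc function: convex x^2 minus convex |x|

print(dir_deriv(V, 0.0, 1.0))   # approx -1: a descent direction at the critical point 0
print(dir_deriv(V, 0.0, -1.0))  # approx -1: and in the opposite direction too
print(dir_deriv(V, 0.5, 1.0))   # approx  0: x = 0.5 is directionally stationary
```

A criticality-seeking method may stall at 0, while a directional-stationarity-seeking method cannot; this is the computational content of "the sharper the concept, the harder to compute".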
Computational Tools
► Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
► A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithm;
  – dealing with non-smoothness: nonsmooth Newton algorithm;
  – dealing with pathological constraints: (exact) penalty method.
► An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
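The majorization-minimization idea for a dc function can be sketched on the toy V(x) = x² − |x| (an illustration of the surrogation principle under my own choice of decomposition, not an algorithm from the monograph): at x_k, replace the concave part −|x| by its linearization −v_k x with v_k ∈ ∂|·|(x_k), which gives a convex upper surrogate that touches V at x_k, and minimize it:

```python
def mm_dc(x0, iters=50):
    """MM / dc algorithm for V(x) = x^2 - |x|: linearize the concave part at x_k
    and minimize the convex surrogate x^2 - v_k x, whose argmin is v_k / 2."""
    x = x0
    for _ in range(iters):
        v = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # v in the subdifferential of |.|
        x = v / 2.0                                      # exact minimizer of the surrogate
    return x

print(mm_dc(0.3))   # 0.5:  converges to the d-stationary point (a local minimum)
print(mm_dc(-2.0))  # -0.5: the symmetric local minimum
```

Each iteration solves a convex problem, and the surrogate property guarantees descent of V; started exactly at 0, the iteration stays at the critical point 0, illustrating why the choice of surrogation and starting point matters.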
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• piecewise smooth*, e.g. piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• sub-analytic and semi-analytic;
* and many more.
[Diagram from Section 4.4 of the monograph (p. 231): inclusion relationships among the classes of nonsmooth functions, including piecewise affine (PA), piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), PC¹ and PC² functions, semismooth, B(ouligand)-differentiable, difference-of-convex (dc) and dc-composite, diff-max-convex, implicit convex-concave (icc), (locally) strongly weakly convex, locally dc, and locally Lipschitz + directionally differentiable functions.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to the publisher on Friday, September 11, 2020.
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
1032 The positive semidefinite case 571
104 Probabilistic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of φ̂c 683
Bibliography 691
Index 721
Some General Thoughts
addressing the negative impact of the field of machine learning on the optimization field, and raising an alarm
bull Convexity and/or differentiability have been the safe haven of optimization and its applications for decades, and have been particularly popular in machine learning in recent years, to the extent that
bull modellers are routinely sacrificing realism for convenience;
bull algorithm designers will do whatever it takes to stay in this comfort zone, even at the expense of a potentially wrong approach;
bull practitioners are content with the resulting models, algorithms, and "solutions";
resulting in
bull tools being abused,
bull advances in science being stalled, and
bull the future being built on soft ground.
This talk: benefits? Yes or no?
Ready for the "non"-era? Yes or no?
A little humor before the storm.
A readiness test (in flowchart form):

A. Are you satisfied with the comfort zone of convexity and differentiability?
bull Yes: you will not learn from this talk; live happily ever after.
bull No: B. How to deal with non-convexity/non-smoothness?
  C1. Ignore these properties, or give them minimal or even wrong treatment.
  C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
  C3. Faithfully treat these non-properties, choosing between:
    D1. Employ heuristics, potentially leading to heuristics on heuristics when complex models are being formulated.
    D2. The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features
Non-Smooth + Non-Convex + Non-Deterministic
Non-Problems: Non-Separable, Non-Stylistic, Non-Interbreeding, and many more non-topics ···
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression
extending classical affine regression:

$$\operatorname*{minimize}_{a,\alpha,b,\beta}\ \frac{1}{N}\sum_{s=1}^{N}\Biggl(y_s-\underbrace{\Bigl[\max_{1\le i\le k_1}\bigl((a^i)^{\top}x_s+\alpha_i\bigr)-\max_{1\le i\le k_2}\bigl((b^i)^{\top}x_s+\beta_i\bigr)\Bigr]}_{\text{capturing all piecewise affine functions}}\Biggr)^{2}$$

illustrating coupled non-convexity and non-differentiability.

Estimates change points/hyperplanes in linear regression.
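As a concrete illustration, the difference-of-max-affine model above can be evaluated and fitted with a plain subgradient scheme. This is a minimal numpy sketch for intuition only: the talk's actual method is an iterative convex-programming algorithm, and the function names, step size, and piece counts here are illustrative assumptions.

```python
import numpy as np

def pa_predict(X, A, alpha, B, beta):
    # difference of two max-affine functions:
    #   max_i ((a^i)^T x + alpha_i)  -  max_i ((b^i)^T x + beta_i)
    return (X @ A.T + alpha).max(axis=1) - (X @ B.T + beta).max(axis=1)

def loss(X, y, A, alpha, B, beta):
    # the nonconvex, nonsmooth least-squares objective (1/N) sum_s (y_s - model(x_s))^2
    return float(np.mean((y - pa_predict(X, A, alpha, B, beta)) ** 2))

def fit_subgradient(X, y, k1=3, k2=2, steps=300, lr=0.1, seed=0):
    # crude subgradient descent: at each sample only the active (argmax) pieces
    # of the two max terms receive a gradient contribution
    rng = np.random.default_rng(seed)
    N, d = X.shape
    A, alpha = rng.normal(size=(k1, d)), rng.normal(size=k1)
    B, beta = rng.normal(size=(k2, d)), rng.normal(size=k2)
    for _ in range(steps):
        U, V = X @ A.T + alpha, X @ B.T + beta
        i1, i2 = U.argmax(axis=1), V.argmax(axis=1)
        r = (U.max(axis=1) - V.max(axis=1)) - y          # residuals
        for s in range(N):
            g = 2.0 * r[s] / N
            A[i1[s]] -= lr * g * X[s]; alpha[i1[s]] -= lr * g
            B[i2[s]] += lr * g * X[s]; beta[i2[s]] += lr * g
    return A, alpha, B, beta
```

For example, on data generated from y = |x| (itself piecewise affine), a few hundred subgradient steps typically drive the loss well below its value at a random initialization.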
Comments
bull This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
bull It has the property that every directional stationary solution is a local minimizer.
bull It has finitely many stationary values.
bull A directional stationary solution can be computed by an iterative convex-programming based algorithm.
bull It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_{1}\Bigl(\max_{1\le i\le k_{11}}\psi^{1}_{1,i}(x)-\max_{1\le i\le k_{12}}\psi^{1}_{2,i}(x)\Bigr)\;-\;\varphi_{2}\Bigl(\max_{1\le i\le k_{21}}\psi^{2}_{1,i}(x)-\max_{1\le i\le k_{22}}\psi^{2}_{2,i}(x)\Bigr)$$

where all the functions are convex.
Logical Sparsity Learning

$$\operatorname*{minimize}_{\beta\in\mathbb{R}^{d}}\ f(\beta)\quad\text{subject to}\quad\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\,\ge\,b_i,\quad i=1,\cdots,m,$$

where $|t|_0 \triangleq \begin{cases} 1 & \text{if } t\neq 0\\ 0 & \text{otherwise.}\end{cases}$

Such affine sparsity constraints model, for example:
bull strong/weak hierarchy, e.g., |βj|0 ≤ |βk|0;
bull cost-based variable selection: coefficients ≥ 0 but not all equal;
bull hierarchical/group variable selection.
Comments
bull The ℓ0 function is discrete in nature; an ASC can be formulated either via complementarity constraints, or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
bull A soft formulation of ASCs is possible, leading to an objective of the kind

$$\operatorname*{minimize}_{\beta\in\mathbb{R}^{d}}\ f(\beta)+\gamma\sum_{i=1}^{m}\Bigl[\,b_i-\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\,\Bigr]_{+}$$

for a given γ > 0.
bull When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semi-continuous objective.
bull When | bull |0 is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints.
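A tiny numerical sketch of these ingredients: the ℓ0 count, the soft ASC penalty, and a folded-concave surrogate. The capped-ℓ1 choice of surrogate and all function names are illustrative assumptions, not the monograph's notation.

```python
import numpy as np

def l0(beta, tol=1e-12):
    # the vector (|beta_j|_0)_j : 1 where beta_j != 0, else 0
    return (np.abs(beta) > tol).astype(float)

def soft_asc_penalty(beta, A, b, gamma):
    # gamma * sum_i [ b_i - sum_j a_ij |beta_j|_0 ]_+ :
    # charges the violation of the affine sparsity constraints A |beta|_0 >= b
    return gamma * np.maximum(b - A @ l0(beta), 0.0).sum()

def capped_l1(beta, delta):
    # folded-concave surrogate of |.|_0: min(|t|/delta, 1) -> |t|_0 as delta -> 0+
    return np.minimum(np.abs(beta) / delta, 1.0)
```

For instance, the strong hierarchy |β1|0 ≤ |β2|0 is the single row a = (−1, 1) with b = 0: β = (1, 0) violates it and is charged γ, while β = (0, 1) or (1, 1) is not.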
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:

$$\operatorname*{minimize}_{\theta=(W^{\ell},\,w^{\ell})_{\ell=1}^{L}}\ \frac{1}{S}\sum_{s=1}^{S}\bigl[\,y_s-f_{\theta}(x_s)\,\bigr]^{2}$$

where $f_\theta$ is the L-fold composite function with the nonsmooth max (ReLU) activation:

$$f_{\theta}(x)\;=\;\underbrace{\bigl(W^{L}\,\bullet+w^{L}\bigr)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\bigl(0,\,W^{2}\,\bullet+w^{2}\bigr)}_{\text{second layer}}\circ\underbrace{\max\bigl(0,\,W^{1}x+w^{1}\bigr)}_{\text{input layer}}$$

that is: linear (output) ∘ ··· ∘ max ∘ linear ∘ max ∘ linear (input).
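In code, the L-fold composition and the empirical squared loss read as follows. This is a self-contained numpy sketch; a scalar output and the function names are assumptions made purely for illustration.

```python
import numpy as np

def f_theta(x, weights, biases):
    # hidden layers apply the nonsmooth ReLU map max(0, W u + w);
    # the final (output) layer is affine, with no activation
    u = np.asarray(x, dtype=float)
    for W, w in zip(weights[:-1], biases[:-1]):
        u = np.maximum(0.0, W @ u + w)
    return weights[-1] @ u + biases[-1]

def empirical_loss(xs, ys, weights, biases):
    # (1/S) * sum_s [ y_s - f_theta(x_s) ]^2
    return float(np.mean([(y - f_theta(x, weights, biases)) ** 2
                          for x, y in zip(xs, ys)]))
```

Even for this tiny evaluator, the composition of affine maps with the kink of max(0, ·) makes the loss simultaneously nonconvex and nondifferentiable in θ.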
Comments
bull The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
bull Note the ℓ1 penalization:

$$\operatorname*{minimize}_{z\in Z,\,u}\ \zeta_{\rho}(z,u)\ \triangleq\ \frac{1}{S}\sum_{s=1}^{S}\varphi_{s}\bigl(u^{s,L}\bigr)\;+\;\rho\sum_{s=1}^{S}\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Bigl|\,u^{s,\ell}_{j}-\psi_{\ell j}\bigl(z^{\ell},u^{s,\ell-1}\bigr)\Bigr|}_{\text{penalizing the components}}$$

bull Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
bull The framework allows the treatment of data perturbation.
bull Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems
II. Smart Planning

Power Systems Planning
bull ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity;
bull x: production from traditional thermal plants; y: production from backup fast-ramping generators.

$$\operatorname*{minimize}_{x\in X}\ \varphi(x)+\mathbb{E}_{\omega_{1},\omega_{2}}\bigl[\psi(x,\omega_{1},\omega_{2})\bigr]$$

where the recourse function is

$$\psi(x,\omega_{1},\omega_{2})\ \triangleq\ \operatorname*{minimum}_{y}\ y^{\top}c\bigl([\,\omega_{2}-x-\omega_{1}\,]_{+}\bigr)\quad\text{subject to}\quad A(\omega)x+Dy\ \ge\ \xi(\omega).$$

The unit cost of the fast-ramping generators depends on:
bull the observed renewable production and demand of electricity;
bull the shortfall due to the first-stage thermal power decisions.
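To make the two-stage structure concrete, here is a minimal sample-average sketch of the objective. For simplicity it assumes D is the identity, a nonnegative shortfall-dependent unit cost, and A and ξ frozen at nominal values, so the inner linear program has the closed-form solution y* = [ξ − Ax]+; these simplifications and all names are illustrative assumptions.

```python
import numpy as np

def recourse(x, omega1, omega2, A, xi, unit_cost):
    # psi(x, w1, w2): with D = I and unit_cost(.) >= 0, the LP
    #     min_y  c^T y   subject to  A x + y >= xi,  y >= 0
    # is solved in closed form by y* = [xi - A x]_+
    shortfall = np.maximum(omega2 - x - omega1, 0.0)   # argument of the cost map c(.)
    c = unit_cost(shortfall)
    y_star = np.maximum(xi - A @ x, 0.0)
    return float(c @ y_star)

def saa_objective(x, scenarios, phi, A, xi, unit_cost):
    # sample-average approximation of phi(x) + E[psi(x, w1, w2)]
    return phi(x) + float(np.mean([recourse(x, w1, w2, A, xi, unit_cost)
                                   for w1, w2 in scenarios]))
```

Because the unit cost itself depends on the first-stage shortfall through [ω2 − x − ω1]+, the recourse value is nonconvex in x even in this stripped-down form.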
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$\operatorname*{minimize}_{x}\ \zeta(x)\ \triangleq\ \varphi(x)+\mathbb{E}_{\xi}\bigl[\psi(x,\xi)\bigr]\quad\text{subject to}\quad x\in X\subseteq\mathbb{R}^{n_{1}},$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

$$\psi(x,\xi)\ \triangleq\ \operatorname*{minimum}_{y}\ \bigl[f(\xi)+G(\xi)x\bigr]^{\top}y+\tfrac{1}{2}\,y^{\top}Qy\quad\text{subject to}\quad y\in Y(x,\xi)\triangleq\bigl\{\,y\in\mathbb{R}^{n_{2}}\mid A(\xi)x+Dy\ge b(\xi)\,\bigr\}.$$

bull The recourse function ψ( bull , ξ) is of the implicit convex-concave kind.
bull Developed a regularization-convexification sampling combined method for computing a generalized critical point, with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs
The underlying problem is: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem: anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

$$\operatorname*{minimize}_{x\ge 0}\ (c+\delta_{c})^{\top}x+\rho\,\underbrace{\bigl\|\,[\,(A+\delta_{A})x-b-\delta_{b}\,]_{+}\bigr\|_{\infty}}_{\text{constraint residual}}$$

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\operatorname*{maximum}_{\delta\in\Delta}\ \lambda_{1}\,\underbrace{(c+\delta_{c})^{\top}x}_{\text{disrupt objective}}\;+\;\lambda_{2}\,\underbrace{\bigl\|\,[\,(A+\delta_{A})x-b-\delta_{b}\,]_{+}\bigr\|_{\infty}}_{\text{disrupt constraints}}\;-\;\underbrace{C(\delta)}_{\text{cost}}$$

bull Many variations are possible.
bull Offers alternative approaches to the convexity-based robust formulations.
bull Adversarial programming.
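Both objectives are straightforward to evaluate for a candidate pair (x, δ). A numpy sketch follows; the split of δ into (δc, δA, δb) passed as separate arrays, and the function names, are assumptions made for illustration.

```python
import numpy as np

def residual_inf(x, A, dA, b, db):
    # || [ (A + dA) x - b - db ]_+ ||_inf : the worst one-sided constraint violation
    return float(np.maximum((A + dA) @ x - b - db, 0.0).max())

def defender_objective(x, c, dc, A, dA, b, db, rho):
    # perturbed cost plus a penalty rho on the constraint residual
    return float((c + dc) @ x) + rho * residual_inf(x, A, dA, b, db)

def attacker_objective(x, c, dc, A, dA, b, db, lam1, lam2, attack_cost):
    # weighted disruption of the objective and of the constraints, net of attack cost
    return (lam1 * float((c + dc) @ x)
            + lam2 * residual_inf(x, A, dA, b, db)
            - attack_cost)
```

Note the nonsmooth composition [ · ]+ followed by the ∞-norm: in x, each objective is convex for fixed δ, but the attacker's inner maximization over δ makes the defender's overall problem nonconvex and nondifferentiable.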
Robustification of nonconvex problems
A general nonlinear program:

$$\operatorname*{minimize}_{x\in S}\ \theta(x)+f(x)\quad\text{subject to}\quad H(x)=0,$$

where
bull S is a closed convex set in Rⁿ, not subject to alterations;
bull θ and f are real-valued functions defined on an open convex set O in Rⁿ containing S;
bull H: O → Rᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\operatorname*{minimize}_{x\in S}\ \theta(x)+\operatorname*{maximum}_{\delta\in\Delta}\ f(x,\delta)\quad\text{subject to}\quad H(x,\delta)=0\ \ \forall\,\delta\in\Delta$$

— easily infeasible in the current way of modeling.
New formulation
via a pessimistic value-function minimization: given γ > 0,

$$\operatorname*{minimize}_{x\in S}\ \theta(x)+v_{\gamma}(x)$$

where v_γ(x) models constraint violation together with the cost of model attacks:

$$v_{\gamma}(x)\ \triangleq\ \operatorname*{maximum}_{\delta\in\Delta}\ \varphi_{\gamma}(x,\delta)\ \triangleq\ f(x,\delta)+\gamma\,\underbrace{\|\,H(x,\delta)\,\|_{\infty}}_{\text{constraint residual}}\;-\;\underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:
bull a combined perturbation δ in a joint uncertainty set Δ ⊆ R^L;
bull robustified constraint satisfaction via penalization.
Comments
bull Related the VFOP to the two game formulations in terms of directional solutions.
bull Showed that the value function admits an explicit form under piecewise structures of f( bull , δ) and H( bull , δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
bull Introduced a majorization-minimization (MM) algorithm for solving a unified formulation

$$\operatorname*{minimize}_{x\in S}\ \varphi(x)\ \triangleq\ \theta(x)+v(x)\quad\text{with}\quad v(x)\ \triangleq\ \max_{1\le j\le J}\ \operatorname*{maximum}_{\delta\in\Delta\cap\mathbb{R}^{L}_{+}}\ \bigl[\,g_{j}(x,\delta)-h_{j}(x)^{\top}\delta\,\bigr].$$

bull Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control the left and right tails of a distribution simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}_{\alpha}^{\beta}(Z)\ \triangleq\ \frac{1}{\beta-\alpha}\int_{\text{VaR}_{\alpha}(Z)\,\le\,z\,<\,\text{VaR}_{\beta}(Z)}z\,dF_{Z}(z)\ =\ \frac{1}{\beta-\alpha}\Bigl[\,(1-\alpha)\,\text{CVaR}_{\alpha}(Z)-(1-\beta)\,\text{CVaR}_{\beta}(Z)\,\Bigr]$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$\text{bPOE}(Z,\tau)\ \triangleq\ 1-\min\bigl\{\,\alpha\in[0,1]\ :\ \text{CVaR}_{\alpha}(Z)\ge\tau\,\bigr\}$$

$$\text{bPOD}(Z,u)\ \triangleq\ \min\bigl\{\,\beta\in[0,1]\ :\ \text{In-CVaR}_{0}^{\beta}(Z)\ge u\,\bigr\}$$
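On an empirical distribution, the CVaR identity above is easy to check numerically. A sketch follows; the discrete CVaR estimator used here, the mean of the worst ⌈(1−α)n⌉ outcomes, is one common convention and is assumed purely for illustration.

```python
import numpy as np

def empirical_cvar(z, alpha):
    # CVaR_alpha: average of the worst (largest) (1 - alpha) fraction of outcomes
    z = np.sort(np.asarray(z, dtype=float))
    k = int(np.ceil((1.0 - alpha) * len(z)))
    return float(z[-k:].mean())

def in_cvar(z, alpha, beta):
    # In-CVaR_alpha^beta via the identity
    #   (1/(beta - alpha)) * [ (1 - alpha) CVaR_alpha - (1 - beta) CVaR_beta ]
    tail_b = 0.0 if beta >= 1.0 else (1.0 - beta) * empirical_cvar(z, beta)
    return ((1.0 - alpha) * empirical_cvar(z, alpha) - tail_b) / (beta - alpha)
```

For the sample {1, 2, 3, 4}: In-CVaR with α = 0.5, β = 1 recovers the upper-tail CVaR (3.5), while α = 0, β = 0.5 averages the lower half of the distribution (1.5), which is exactly the "interval" reading of the definition.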
Buffered probability of the interval (τ, u]
We have the following bound for an interval probability:

$$\mathbb{P}(u\ge Z>\tau)\ \le\ \text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\ \triangleq\ \text{bPOI}_{\tau}^{u}(Z)$$

Two risk minimization problems of a composite random functional:

$$\operatorname*{minimize}_{\theta\in\Theta}\ \underbrace{\text{In-CVaR}_{\alpha}^{\beta}\bigl(Z(\theta)\bigr)}_{\text{a difference of two convex SP value functions}}\qquad\text{and}\qquad\operatorname*{minimize}_{\theta\in\Theta}\ \underbrace{\text{bPOI}_{\tau}^{u}\bigl(Z(\theta)\bigr)}_{\text{a stochastic program with a difference-of-convex objective}}$$

where

$$Z(\theta)\ \triangleq\ c\bigl(\underbrace{\,g(\theta;Z)-h(\theta;Z)\,}_{g(\bullet;z)-h(\bullet;z)\ \text{is a difference of convex functions}}\bigr)$$

with c: R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g( bull , z) and h( bull , z) both convex functions.
Comments
bull The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\operatorname*{minimum}_{\eta_{1}\in\Upsilon_{1}}\ \mathbb{E}_{\omega}\bigl[\varphi_{1}(\theta,\eta_{1},\omega)\bigr]}_{\text{denoted }v_{1}(\theta)}\ -\ \underbrace{\operatorname*{minimum}_{\eta_{2}\in\Upsilon_{2}}\ \mathbb{E}_{\omega}\bigl[\varphi_{2}(\theta,\eta_{2},\omega)\bigr]}_{\text{denoted }v_{2}(\theta)}$$

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; the overall objective is thus a difference-of-convex objective.
bull Developed a convex-programming based sampling method for computing a critical point of the objective.
bull Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics
bridging computations with theory

First of all: what can be computed?

[Diagram: relationships among the stationarity concepts — critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions), points satisfying the SONC, local minima, and points satisfying the SOSC.]

Relationship between the stationary points:
bull Stationarity:
— smooth case: ∇V(x) = 0;
— nonsmooth case: 0 ∈ ∂V(x) [the subdifferential; many calculus rules fail], e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of the functions is continuously differentiable.
bull Directional stationarity: V′(x; d) ≥ 0 for all directions d.
bull The sharper the concept, the harder it is to compute.
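The gap between the concepts is visible already in one dimension: for V(x) = −|x|, the point x = 0 is Clarke stationary, since 0 lies in the Clarke subdifferential ∂V(0) = [−1, 1], yet it is not directionally stationary, because V′(0; d) = −|d| < 0 for every d ≠ 0. A minimal numerical check (pure Python, finite-difference approximation of the one-sided directional derivative):

```python
def dir_deriv(V, x, d, t=1e-8):
    # one-sided directional derivative: V'(x; d) = lim_{t -> 0+} (V(x + t d) - V(x)) / t
    return (V(x + t * d) - V(x)) / t

V = lambda x: -abs(x)

# descent is available in BOTH directions at x = 0, so directional stationarity
# fails there even though 0 belongs to the Clarke subdifferential [-1, 1]
d_plus = dir_deriv(V, 0.0, 1.0)    # approximately -1
d_minus = dir_deriv(V, 0.0, -1.0)  # approximately -1
```

An algorithm certifying only Clarke stationarity could thus stop at x = 0, a point from which every direction strictly decreases V: this is precisely why the sharper directional concept matters.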
Computational Tools
bull Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
bull A fusion of first-order and second-order algorithms:
— dealing with non-convexity: majorization-minimization algorithms;
— dealing with non-smoothness: nonsmooth Newton algorithms;
— dealing with pathological constraints: (exact) penalty methods.
bull An integration of sampling and deterministic methods:
— dealing with population optimization in statistics;
— dealing with stochastic programming in operations research;
— dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
bull piecewise smooth*, e.g., piecewise linear-quadratic;
bull difference-of-convex: facilitates algorithmic design via convex majorization;
bull semismooth: enables fast algorithms of the Newton type;
bull Bouligand differentiable: locally Lipschitz + directionally differentiable;
bull sub-analytic and semi-analytic;
* and many more.
[Diagram (Section 4.4, p. 231 of the monograph): inclusions among the classes of nonsmooth functions — piecewise quadratic (PQ) and piecewise linear-quadratic (PLQ) within piecewise affine (PA) and piecewise smooth (PC¹, PC²) classes; difference-of-convex (dc), dc-composite, and diff-max-convex functions; implicit convex-concave (icc) functions; semismooth and B(ouligand)-differentiable functions; weakly convex and locally dc functions within the locally Lipschitz + directionally differentiable class.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Contents of presentation
bull Some general thoughts of the state of the field
bull A readiness test (a light-hearted humor)
bull Some key ldquononrdquo-features
bull Some novel ldquononrdquo-optimization problems
bull A gentle touch of mathematics
bull The monograph
Some General Thoughts
addressing the negative impact of machine learning on the optimization field, and raising an alarm

• Convexity and/or differentiability have been the safe haven for optimization and its applications for decades, and have been particularly popular in machine learning in recent years,

to the extent that

• modellers routinely sacrifice realism for convenience;
• algorithm designers do whatever it takes to stay in this comfort zone, even at the expense of a potentially wrong approach;
• practitioners are content with the resulting models, algorithms, and "solutions";

resulting in

• tools being abused,
• advances in science being stalled, and
• a future being built on soft ground.
This talk: beneficial? Yes or no?
Ready for the "non"-era? Yes or no?
A little humor before the storm.
A readiness test:

A. Are you satisfied with the comfort zone of convexity and differentiability?
  • Yes → you will not learn from this talk; live happily ever after.
  • No → proceed to B.
B. How do you deal with non-convexity/non-smoothness?
  • C1. Ignore these properties, or give them minimal or even wrong treatment.
  • C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
  • C3. Faithfully treat these non-properties, by either:
    – D1. employing heuristics, potentially leading to heuristics on heuristics when complex models are formulated; or
    – D2. the hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features

Non-Problems = Non-Smooth + Non-Convex + Non-Deterministic
+ Non-Separable + Non-Stylistic + Non-Interbreeding

and many more non-topics · · ·
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression, extending classical affine regression:

minimize_{a,α,b,β}  (1/N) Σ_{s=1}^N [ y_s − ( max_{1≤i≤k₁} { (a^i)ᵀ x_s + α_i } − max_{1≤i≤k₂} { (b^i)ᵀ x_s + β_i } ) ]²

where the difference of the two max terms captures all piecewise affine functions, illustrating coupled non-convexity and non-differentiability.

Estimate Change Points/Hyperplanes in Linear Regression
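As a minimal sketch (my own illustration, not code from the talk), the difference-of-max-affine objective above can be evaluated directly in NumPy; the names `pa_predict` and `pa_loss` are hypothetical:

```python
import numpy as np

def pa_predict(X, A, alpha, B, beta):
    # difference of two max-affine maps: this family captures every
    # piecewise affine function of x
    return (X @ A.T + alpha).max(axis=1) - (X @ B.T + beta).max(axis=1)

def pa_loss(X, y, A, alpha, B, beta):
    # (1/N) * sum of squared residuals, as in the slide
    r = y - pa_predict(X, A, alpha, B, beta)
    return float((r ** 2).mean())

# tiny demo: |x| = max(x, -x) - max(0) is itself piecewise affine
X = np.array([[-2.0], [3.0]])
y = np.abs(X[:, 0])
A, alpha = np.array([[1.0], [-1.0]]), np.zeros(2)   # k1 = 2 pieces
B, beta = np.array([[0.0]]), np.zeros(1)            # k2 = 1 piece
```

With these parameters the model reproduces y exactly, so the loss is zero; a fitting algorithm would update (A, α, B, β) by, e.g., the iterative convex-programming scheme mentioned below.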
Comments

• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

φ₁( max_{1≤i≤k₁₁} ψ¹_{1i}(x) − max_{1≤i≤k₁₂} ψ¹_{2i}(x) ) − φ₂( max_{1≤i≤k₂₁} ψ²_{1i}(x) − max_{1≤i≤k₂₂} ψ²_{2i}(x) )

where all functions are convex.
Logical Sparsity Learning

minimize_{β∈R^d}  f(β)
subject to  Σ_{j=1}^d a_ij |β_j|₀ ≤ b_i,  i = 1, · · · , m

where |t|₀ ≜ 1 if t ≠ 0 and 0 otherwise. Such affine sparsity constraints (ASCs) can express:

• strong/weak hierarchy, e.g., |β_j|₀ ≤ |β_k|₀
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection
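The ℓ₀ functional and an affine sparsity constraint are easy to evaluate on a given β; the following sketch (my own, with hypothetical names) encodes the strong-hierarchy example |β₀|₀ ≤ |β₁|₀ as a single ASC row:

```python
import numpy as np

def l0(t):
    # |t|_0: 1 if t != 0, 0 otherwise (applied elementwise)
    return (np.asarray(t) != 0).astype(int)

def asc_satisfied(A, b, beta):
    # affine sparsity constraints (ASC): A @ |beta|_0 <= b componentwise
    return bool(np.all(A @ l0(beta) <= b))

# strong hierarchy |beta_0|_0 <= |beta_1|_0 encoded as the row (1, -1) <= 0
A_hier = np.array([[1, -1]])
b_hier = np.array([0])
```

Selecting β₀ without β₁ violates the hierarchy, while selecting both (or neither) satisfies it; note the discontinuity of β ↦ |β|₀ that makes the feasible set nonconvex and, in general, nonclosed.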
Comments

• The ℓ₀ function is discrete in nature; an ASC can be formulated either via complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind

minimize_{β∈R^d}  f(β) + γ Σ_{i=1}^m [ Σ_{j=1}^d a_ij |β_j|₀ − b_i ]₊   for a given γ > 0.

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semi-continuous objective.
• When |•|₀ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:

minimize_{θ = (W^ℓ, w^ℓ)_{ℓ=1}^L}  (1/S) Σ_{s=1}^S [ y_s − f_θ(x_s) ]²

where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:

f_θ(x) = ( W^L • + w^L ) ∘ · · · ∘ max( 0, W² • + w² ) ∘ max( 0, W¹ x + w¹ )

with the outermost map the linear output layer, and the innermost the input layer:
linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input).
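A small sketch of the L-fold ReLU composite f_θ (my own illustration, assuming dense layers stored as (W, w) pairs); the demo network realizes |x| = relu(x) + relu(−x), a nonsmooth function this architecture represents exactly:

```python
import numpy as np

def f_theta(x, layers):
    # layers: list of (W, w) pairs; ReLU max(0, .) is applied after every
    # layer except the final (linear output) one
    for W, w in layers[:-1]:
        x = np.maximum(0.0, W @ x + w)
    W_L, w_L = layers[-1]
    return W_L @ x + w_L

# one hidden layer realizing |x| = 1*relu(x) + 1*relu(-x)
layers = [(np.array([[1.0], [-1.0]]), np.zeros(2)),
          (np.array([[1.0, 1.0]]), np.zeros(1))]
```

The composite is everywhere directionally differentiable but nondifferentiable wherever some pre-activation vanishes, which is what the exact-penalty treatment below has to handle.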
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ₁ penalization:

minimize_{z∈Z, u}  ζ_ρ(z, u) ≜ (1/S) Σ_{s=1}^S φ_s(u^{s,L}) + ρ Σ_{ℓ=1}^L Σ_{j=1}^N | u_j^{s,ℓ} − ψ_{ℓj}(z^ℓ, u^{s,ℓ−1}) |

where the double sum penalizes the layer-consistency components.
• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• It leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems
II. Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

minimize_{x∈X}  φ(x) + E_{ω₁,ω₂}[ ψ(x, ω₁, ω₂) ]

where the recourse function is

ψ(x, ω₁, ω₂) ≜ minimum_y  yᵀ c( [ω₂ − x − ω₁]₊ )  subject to  A(ω)x + Dy ≥ ξ(ω).

The unit cost of the fast-ramping generators depends on:
• the observed renewable production and demand of electricity;
• the shortfall due to the first-stage thermal power decisions.
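A single-bus caricature of the recourse function (my own toy instance, not from the talk): I assume the hypothetical cost curve c(s) = 1 + s, and that the second-stage constraint reduces to the backup y covering the shortfall exactly, so the inner minimization has a closed form:

```python
def recourse_value(x, w1, w2):
    # psi(x, w1, w2): cheapest backup generation y covering the residual
    # demand, with a unit cost that grows with the shortfall [w2 - x - w1]_+
    shortfall = max(0.0, w2 - x - w1)
    unit_cost = 1.0 + shortfall      # hypothetical cost curve c(.)
    y = shortfall                    # backup exactly covers the shortfall
    return unit_cost * y
```

Because the unit cost itself depends on the first-stage decision x through the shortfall, ψ(•, ω₁, ω₂) is nonconvex in x even in this scalar case, which is the source of the "non"-features of the planning model.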
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

minimize_x  ζ(x) ≜ φ(x) + E_ξ[ ψ(x, ξ) ]  subject to  x ∈ X ⊆ R^{n₁}

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

ψ(x, ξ) ≜ minimum_y  [ f(ξ) + G(ξ)x ]ᵀ y + ½ yᵀ Q y  subject to  y ∈ Y(x, ξ) ≜ { y ∈ R^{n₂} | A(ξ)x + Dy ≥ b(ξ) }.

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a combined regularization-convexification-sampling method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize cᵀx subject to Ax = b and x ≥ 0.

Defender's problem: anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

minimize_{x≥0}  (c + δ_c)ᵀ x + ρ ‖ [ (A + δ_A)x − b − δ_b ]₊ ‖_∞   (constraint residual).

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

maximum_{δ∈Δ}  λ₁ (c + δ_c)ᵀ x  (disrupt objective)  +  λ₂ ‖ [ (A + δ_A)x − b − δ_b ]₊ ‖_∞  (disrupt constraints)  −  C(δ)  (cost).

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
Robustification of nonconvex problems

A general nonlinear program:

minimize_{x∈S}  θ(x) + f(x)  subject to  H(x) = 0

where
• S is a closed convex set in Rⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in Rⁿ containing S;
• H: O → R^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

minimize_{x∈S}  θ(x) + max_{δ∈Δ} f(x, δ)  subject to  H(x, δ) = 0 for all δ ∈ Δ

which is easily infeasible in this current way of modeling.
New formulation
via a pessimistic value-function minimization: given γ > 0,

minimize_{x∈S}  θ(x) + v_γ(x)

where v_γ(x) models constraint violation with a cost of model attacks:

v_γ(x) ≜ maximum over δ ∈ Δ of  φ_γ(x, δ) ≜ f(x, δ) + γ ‖H(x, δ)‖_∞  (constraint residual)  −  C(δ)  (attack cost)

with the novel features:
• combined perturbation δ in a joint uncertainty set Δ ⊆ R^L;
• robustified constraint satisfaction via penalization.
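When the uncertainty set Δ is finite, the pessimistic value function v_γ is a plain maximum over scenarios; the sketch below (my own toy data: a scalar decision, two attack scenarios, and made-up maps f, H, C) just enumerates it:

```python
import numpy as np

def v_gamma(x, Delta, f, H, C, gamma):
    # v_gamma(x) = max over delta in Delta of
    #              f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)
    return max(f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in Delta)

# hypothetical toy instance
Delta = [0.0, 1.0]
f = lambda x, d: d * x              # perturbed objective
H = lambda x, d: np.array([x - d])  # perturbed constraint residual map
C = lambda d: 0.1 * d               # attack cost
```

Even in this toy case v_γ is a pointwise maximum of nonconvex functions of x, i.e., exactly the kind of value function the surrogation machinery below is built for.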
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

minimize_{x∈S}  φ(x) ≜ θ(x) + v(x)  with  v(x) ≜ max_{1≤j≤J} maximum_{δ∈Δ∩R^L₊} [ g_j(x, δ) − h_j(x)ᵀ δ ].

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
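The Majorization-Minimization idea can be seen on a one-dimensional dc toy example (my own, not from the talk): for f(x) = x² − |x|, linearizing the concave part −|x| at the current iterate yields a convex upper surrogate with a closed-form minimizer:

```python
def dca_step(x):
    # one MM step for f(x) = x**2 - |x| (convex minus convex):
    # replacing -|x| by its linear upper bound -(|x_k| + s*(x - x_k)),
    # s a subgradient of |.| at x_k, gives the convex surrogate
    # x**2 - s*x + const, whose exact minimizer is s/2
    s = 1.0 if x >= 0 else -1.0
    return s / 2.0

x = 3.7
for _ in range(5):
    x = dca_step(x)   # reaches the stationary point 0.5 (here in one step)
```

The limit x = 1/2 attains the global minimum value f(1/2) = −1/4; in general, MM/surrogation only guarantees convergence to a stationary point of the chosen sharpness.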
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control the left and right tails of a distribution simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

In-CVaR_α^β(Z) ≜ (1/(β − α)) ∫_{VaR_β(Z) > z ≥ VaR_α(Z)} z dF_Z(z)
             = (1/(β − α)) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ].

Buffered probability of exceedance/deceedance at level τ > 0:

bPOE(Z, τ) ≜ 1 − min { α ∈ [0, 1] : CVaR_α(Z) ≥ τ }
bPOD(Z, u) ≜ min { β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }.
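The CVaR identity for In-CVaR can be checked numerically on an empirical sample (my own sketch; it assumes N·α is an integer so the empirical tail average is exact):

```python
import numpy as np

def cvar(z, alpha):
    # empirical CVaR_alpha: average of the worst (1 - alpha) fraction
    # (assumes len(z) * alpha is an integer)
    z = np.sort(z)
    k = int(round(len(z) * alpha))
    return z[k:].mean()

def in_cvar(z, alpha, beta):
    # In-CVaR via the identity on the slide
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 11.0)   # ten equally likely outcomes 1, 2, ..., 10
```

For this sample, In-CVaR with α = 0.5, β = 0.8 averages the middle band {6, 7, 8}, matching the difference-of-CVaRs formula; this difference structure is what makes In-CVaR minimization a difference-of-convex program.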
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

P(u ≥ Z > τ) ≤ bPOE(Z, τ) + bPOD(Z, u) − 1 ≜ bPOI_τ^u(Z).

Two risk minimization problems of a composite random functional:

minimize_{θ∈Θ}  In-CVaR_α^β(Z(θ))   (a difference of two convex SP value functions)

and

minimize_{θ∈Θ}  bPOI_τ^u(Z(θ))   (a stochastic program with a difference-of-convex objective)

where

Z(θ) ≜ c( g(θ; Z) − h(θ; Z) )

with c: R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions, so that g(•, z) − h(•, z) is a difference of convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ of

minimum_{η₁∈Υ₁} E_ω[ φ₁(θ, η₁, ω) ]  (denoted v₁(θ))  −  minimum_{η₂∈Υ₂} E_ω[ φ₂(θ, η₂, ω) ]  (denoted v₂(θ)).

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics
bridging computations with theory

First of all: what can be computed? Relationship between the stationarity concepts (each condition implies the one to its right):

SOSC ⇒ loc-min ⇒ SONC ⇒ directional stationary (for dd functions) ⇒ limiting stationary ⇒ Clarke stationary ⇒ critical (for dc functions)

• Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the two functions is continuously differentiable.
• Directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder it is to compute.
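A one-dimensional illustration of why directional stationarity is sharper than Clarke stationarity (my own example, checked by forward differencing): for f(x) = −|x|, the origin is Clarke stationary, since 0 lies in the Clarke subdifferential [−1, 1], yet f′(0; d) = −|d| < 0 for every d ≠ 0, so it is not directionally stationary:

```python
def ddir(f, x, d, t=1e-7):
    # forward-difference approximation of the one-sided directional
    # derivative f'(x; d)
    return (f(x + t * d) - f(x)) / t

neg_abs = lambda x: -abs(x)   # 0 is Clarke stationary but not d-stationary
pos_abs = lambda x: abs(x)    # 0 is d-stationary (in fact the minimizer)
```

Descent directions exist at a Clarke stationary point of −|x|, which is exactly why the sharper directional concept is the one worth computing.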
Computational Tools

• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
• A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithm;
  – dealing with non-smoothness: nonsmooth Newton algorithm;
  – dealing with pathological constraints: (exact) penalty method.
• An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Subanalytic and semianalytic

* and many more
[Summary diagram from Section 4.4 (p. 231) of the monograph: the inclusion relationships among the function classes — piecewise affine (PA), piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), PC², PC¹, semismooth, B-differentiable, difference-of-convex (dc), dc-composite, diff-max-convex, implicit convex-concave (icc), weakly convex, locally dc, lower-C², and locally Lipschitz + directionally differentiable functions, on open or closed convex domains.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher on Friday, September 11, 2020
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Contents of presentation
• Some general thoughts on the state of the field
• A readiness test (a light-hearted humor)
• Some key "non"-features
• Some novel "non"-optimization problems
• A gentle touch of mathematics
• The monograph
Some General Thoughts
addressing the negative impact of the field of machine learning on the optimization field, and raising an alarm
• Convexity and/or differentiability have been the safe haven for optimization and its applications for decades, and have been particularly popular in machine learning in recent years,
to the extent that
• modellers are routinely sacrificing realism for convenience;
• algorithm designers will do whatever it takes to stay in this comfort zone, even at the expense of a potentially wrong approach;
• practitioners are content with the resulting models, algorithms, and "solutions";
resulting in
• tools being abused,
• advances in science being stalled, and
• the future being built on soft ground.
This talk: benefits, yes or no?
Ready for the "non"-era, yes or no?
A little humor before the storm.
A readiness test (a decision tree):

A. Are you satisfied with the comfort zone of convexity and differentiability?
   Yes: you will not learn from this talk; live happily ever after.
   No: go to B.

B. How do you deal with non-convexity/non-smoothness?
   C1. Ignore these properties, or give them minimal or even wrong treatment.
   C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
   C3. Faithfully treat these non-properties, in one of two ways:
       D1. Employ heuristics, potentially leading to heuristics on heuristics when complex models are being formulated.
       D2. The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features of "Non"-Problems
Non-Smooth + Non-Convex + Non-Deterministic
Non-Separable, Non-Stylistic, Non-Interbreeding,
and many more "non"-topics ···
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression
extending classical affine regression:

$$\underset{a,\alpha,b,\beta}{\text{minimize}}\;\;\frac{1}{N}\sum_{s=1}^{N}\bigg(\,y_s-\underbrace{\Big[\max_{1\le i\le k_1}\big\{(a^i)^\top x_s+\alpha_i\big\}-\max_{1\le i\le k_2}\big\{(b^i)^\top x_s+\beta_i\big\}\Big]}_{\text{capturing all piecewise affine functions}}\,\bigg)^{2}$$

illustrating coupled non-convexity and non-differentiability; estimates change points/hyperplanes in linear regression.
Comments
• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^{1}_{1i}(x)-\max_{1\le i\le k_{12}}\psi^{1}_{2i}(x)\Big)-\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^{2}_{1i}(x)-\max_{1\le i\le k_{22}}\psi^{2}_{2i}(x)\Big)$$

where all functions are convex.
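The difference-of-max-affine model above is easy to evaluate directly. A minimal numpy sketch (function and variable names here are illustrative, not from the talk), showing the model and its empirical squared loss, with |x| = max(x, -x) - 0 as a sanity check that piecewise affine functions are representable:

```python
import numpy as np

def dma_predict(X, A, alpha, B, beta):
    """Difference-of-max-affine model:
    f(x) = max_i ((a^i)^T x + alpha_i) - max_i ((b^i)^T x + beta_i)."""
    return np.max(X @ A.T + alpha, axis=1) - np.max(X @ B.T + beta, axis=1)

def empirical_loss(X, y, A, alpha, B, beta):
    """(1/N) sum_s (y_s - f(x_s))^2, the nonconvex nonsmooth objective."""
    r = y - dma_predict(X, A, alpha, B, beta)
    return np.mean(r ** 2)

# Example: |x| = max(x, -x) minus the zero function; the loss vanishes.
X = np.linspace(-2.0, 2.0, 9).reshape(-1, 1)
A = np.array([[1.0], [-1.0]]); alpha = np.zeros(2)   # pieces x and -x
B = np.array([[0.0]]); beta = np.zeros(1)            # the zero function
y = np.abs(X).ravel()
assert empirical_loss(X, y, A, alpha, B, beta) == 0.0
```

The loss is a convex quadratic composed with this piecewise affine map, matching the piecewise linear-quadratic structure noted in the comments.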
Logical Sparsity Learning

$$\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\;\;f(\beta)\qquad\text{subject to}\;\;\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\,\ge\,b_i,\quad i=1,\cdots,m$$

where

$$|t|_0\,\triangleq\,\begin{cases}1 & \text{if } t\neq 0\\ 0 & \text{otherwise.}\end{cases}$$

Examples of such affine sparsity constraints (ASCs):
• strong/weak hierarchy, e.g. |β_j|₀ ≤ |β_k|₀;
• cost-based variable selection: coefficients ≥ 0 but not all equal;
• hierarchical/group variable selection.
Comments
• The ℓ₀ function is discrete in nature; an ASC can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind

$$\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\;\;f(\beta)+\gamma\sum_{i=1}^{m}\bigg[\,b_i-\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\,\bigg]_{+}$$

for a given γ > 0.
• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When | • |₀ is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints.
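The soft ASC penalty above is straightforward to compute once the ℓ₀ pattern is extracted. A minimal numpy sketch (names and the sign convention Aδ ≥ b follow the formulas above; all instance data is made up for illustration), encoding the strong-hierarchy constraint |β₁|₀ ≤ |β₂|₀ as one ASC row:

```python
import numpy as np

def l0(beta):
    """Componentwise |t|_0: 1 if t != 0, else 0."""
    return (np.asarray(beta) != 0.0).astype(float)

def asc_penalty(beta, A, b, gamma):
    """Soft ASC penalty  gamma * sum_i [ b_i - sum_j a_ij |beta_j|_0 ]_+ ."""
    residual = b - A @ l0(beta)
    return gamma * np.sum(np.maximum(residual, 0.0))

# Strong hierarchy |beta_1|_0 <= |beta_2|_0, i.e. delta_2 - delta_1 >= 0:
A = np.array([[-1.0, 1.0]]); b = np.array([0.0])
assert asc_penalty(np.array([0.0, 3.0]), A, b, gamma=10.0) == 0.0   # feasible
assert asc_penalty(np.array([2.0, 0.0]), A, b, gamma=10.0) == 10.0  # violated
```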
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:

$$\underset{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}{\text{minimize}}\;\;\frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x_s)\,\big]^2$$

where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:

$$f_\theta(x)=\underbrace{\big(W^L\bullet+w^L\big)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\bullet+w^2\big)}_{\text{second layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{input layer}}$$

i.e., linear (output) ∘ ··· ∘ max ∘ linear ∘ max ∘ linear (input).
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ₁ penalization:

$$\underset{z\in Z,\,u}{\text{minimize}}\;\;\zeta_\rho(z,u)\,\triangleq\,\frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^{s,L}\big)+\rho\sum_{s=1}^{S}\sum_{\ell=1}^{L}\sum_{j=1}^{N_\ell}\underbrace{\Big|\,u^{s,\ell}_j-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
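The L-fold composite f_θ and the ℓ₁ penalty residual of the lifted formulation can both be written in a few lines. A minimal numpy sketch (illustrative names; ψ here is the layer map max(0, Wu + w), as in the slide), with a one-hidden-layer network realizing |x| = max(0, x) + max(0, -x):

```python
import numpy as np

def relu_net(x, Ws, ws):
    """f_theta(x): ReLU hidden layers, affine output layer."""
    u = x
    for W, w in zip(Ws[:-1], ws[:-1]):       # hidden layers: max(0, W u + w)
        u = np.maximum(0.0, W @ u + w)
    return Ws[-1] @ u + ws[-1]               # output layer: W^L u + w^L

def penalty_residual(x, us, Ws, ws):
    """ell_1 residual sum_l sum_j |u^l_j - max(0, W^l u^{l-1} + w^l)_j|
    for auxiliary layer variables us = [u^1, ..., u^{L-1}]."""
    total, prev = 0.0, x
    for u, W, w in zip(us, Ws[:-1], ws[:-1]):
        total += np.sum(np.abs(u - np.maximum(0.0, W @ prev + w)))
        prev = u
    return total

# One hidden layer realizing |x| = max(0, x) + max(0, -x):
Ws = [np.array([[1.0], [-1.0]]), np.array([[1.0, 1.0]])]
ws = [np.zeros(2), np.zeros(1)]
assert relu_net(np.array([-3.0]), Ws, ws)[0] == 3.0
# With u^1 set exactly to the layer output, the penalty vanishes:
u1 = np.maximum(0.0, Ws[0] @ np.array([-3.0]) + ws[0])
assert penalty_residual(np.array([-3.0]), [u1], Ws, ws) == 0.0
```

At a feasible lifted point the residual is zero, which is what the exact-penalty theory exploits.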
Examples of Non-Problems
II. Smart Planning

Power Systems Planning
ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity;
x: production from traditional thermal plants;
y: production from backup fast-ramping generators.

$$\underset{x\in X}{\text{minimize}}\;\;\varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big]$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2)\,\triangleq\,\underset{y}{\text{minimum}}\;\;y^\top c\big(\,[\,\omega_2-x-\omega_1\,]_+\,\big)\quad\text{subject to}\;\;A(\omega)x+Dy\ge\xi(\omega)$$

The unit cost of fast-ramping generators depends on
• the observed renewable production and demand of electricity;
• the shortfall due to the first-stage thermal power decisions.
Comments
• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x}{\text{minimize}}\;\;\zeta(x)\,\triangleq\,\varphi(x)+\mathbb{E}_\xi\big[\psi(x,\xi)\big]\quad\text{subject to}\;\;x\in X\subseteq\mathbb{R}^{n_1}$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

$$\psi(x,\xi)\,\triangleq\,\underset{y}{\text{minimum}}\;\;\big[f(\xi)+G(\xi)x\big]^\top y+\tfrac{1}{2}\,y^\top Qy\quad\text{subject to}\;\;y\in Y(x,\xi)\,\triangleq\,\big\{\,y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\,\big\}$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a regularization plus convexification, sampling-combined method for computing a generalized critical point with an almost-sure convergence guarantee.
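For intuition, the recourse value ψ(x, ξ) can be computed in closed form in a scalar toy instance of the bi-parameterized QP above (Q > 0, one constraint). The sketch below is illustrative only: all data (c0, g, q, d, φ) is made up, and the closed-form projection replaces a genuine QP solver:

```python
import numpy as np

def recourse(x, xi, c0=1.0, g=1.0, q=2.0, d=1.0):
    """psi(x, xi) = min_y (c0 + g*x)*y + (q/2)*y^2  s.t.  d*y >= xi - x.
    Scalar instance of the linearly bi-parameterized quadratic recourse."""
    y_free = -(c0 + g * x) / q            # unconstrained minimizer of the QP
    y = max(y_free, (xi - x) / d)         # project onto the feasible half-line
    return (c0 + g * x) * y + 0.5 * q * y * y

def saa_objective(x, xis, phi=lambda x: 0.5 * x * x):
    """Sample-average approximation of zeta(x) = phi(x) + E_xi[psi(x, xi)]."""
    return phi(x) + np.mean([recourse(x, xi) for xi in xis])

# Large demand xi forces costly recourse; small xi leaves the QP unconstrained.
assert recourse(0.0, 2.0) == 6.0      # constraint active: y = 2
assert recourse(0.0, -10.0) == -0.25  # constraint slack: y = -1/2
```

The kink between the active and inactive branches is exactly where the implicit convex-concave structure of ψ(•, ξ) arises.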
(Simplified) Defender-Attacker Game with Costs
The underlying problem is: minimize cᵀx subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

$$\underset{x\ge 0}{\text{minimize}}\;\;(c+\delta_c)^\top x+\rho\,\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\,\big\|_\infty}_{\text{constraint residual}}$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\underset{\delta\in\Delta}{\text{maximum}}\;\;\lambda_1\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}}+\lambda_2\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\,\big\|_\infty}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial programming.
Robustification of nonconvex problems
A general nonlinear program:

$$\underset{x\in S}{\text{minimize}}\;\;\theta(x)+f(x)\quad\text{subject to}\;\;H(x)=0$$

where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H : O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\underset{x\in S}{\text{minimize}}\;\;\theta(x)+\underset{\delta\in\Delta}{\text{maximum}}\;f(x,\delta)\quad\text{subject to}\;\;\underbrace{H(x,\delta)=0\;\;\forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}$$
New formulation
via a pessimistic value-function minimization: given γ > 0,

$$\underset{x\in S}{\text{minimize}}\;\;\theta(x)+v_\gamma(x)$$

where v_γ(x) models constraint violation with a cost of model attacks:

$$v_\gamma(x)\,\triangleq\,\underset{\delta\in\Delta}{\text{maximum}}\;\;\varphi_\gamma(x,\delta)\,\triangleq\,f(x,\delta)+\gamma\underbrace{\big\|H(x,\delta)\big\|_\infty}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:
• a combined perturbation δ in a joint uncertainty set Δ ⊆ ℝᴸ;
• robustified constraint satisfaction via penalization.
Comments
• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

$$\underset{x\in S}{\text{minimize}}\;\;\varphi(x)\,\triangleq\,\theta(x)+v(x)\quad\text{with}\quad v(x)\,\triangleq\,\max_{1\le j\le J}\;\underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\text{maximum}}\;\big[\,g_j(x,\delta)-h_j(x)^\top\delta\,\big]$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
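When the uncertainty set is finite, the pessimistic value function v_γ(x) = max over δ of f(x, δ) + γ‖H(x, δ)‖∞ - C(δ) is a plain maximum and can be evaluated by enumeration. A minimal sketch (the instance f, H, C, Δ below is entirely made up for illustration):

```python
import numpy as np

def value_function(x, deltas, f, H, C, gamma):
    """v_gamma(x) = max over delta in a finite uncertainty set of
    f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)."""
    return max(f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in deltas)

# Illustrative instance: f(x,d) = (1+d)*x, H(x,d) = [x - d], C(d) = d^2.
f = lambda x, d: (1.0 + d) * x
H = lambda x, d: np.array([x - d])
C = lambda d: d * d
deltas = [0.0, 0.5, 1.0]
v = value_function(1.0, deltas, f, H, C, gamma=2.0)   # worst case at delta = 0
assert v == 3.0
```

Each inner term is the attacker's payoff; the outer max is the pessimistic value the defender penalizes.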
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z)\,\triangleq\,\frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)}z\,dF_Z(z)\;=\;\frac{1}{\beta-\alpha}\Big[\,(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\,\Big]$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$\text{bPOE}(Z,\tau)\,\triangleq\,1-\min\big\{\,\alpha\in[0,1]\,:\,\text{CVaR}_\alpha(Z)\ge\tau\,\big\},\qquad\text{bPOD}(Z,u)\,\triangleq\,\min\big\{\,\beta\in[0,1]\,:\,\text{In-CVaR}^{\,\beta}_{\,0}(Z)\ge u\,\big\}$$
Buffered probability of the interval (τ, u]
We have the following bound for an interval probability:

$$\mathbb{P}(u\ge Z>\tau)\;\le\;\text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\,\triangleq\,\text{bPOI}^{\,u}_{\,\tau}(Z)$$

Two risk-minimization problems of a composite random functional:

$$\underset{\theta\in\Theta}{\text{minimize}}\;\underbrace{\text{In-CVaR}^{\,\beta}_{\,\alpha}\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}}\qquad\text{and}\qquad\underset{\theta\in\Theta}{\text{minimize}}\;\underbrace{\text{bPOI}^{\,u}_{\,\tau}\big(Z(\theta)\big)}_{\text{a stochastic program with a difference-of-convex objective}}$$

where

$$Z(\theta)\,\triangleq\,c\big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)\,-\,h(\bullet,z)\text{ is a difference of convex functions}}\big)$$

with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
Comments
• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\underset{\eta_1\in\Upsilon_1}{\text{minimum}}\;\mathbb{E}_\omega\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}\;-\;\underbrace{\underset{\eta_2\in\Upsilon_2}{\text{minimum}}\;\mathbb{E}_\omega\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}$$

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
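On a finite sample, CVaR can be computed from the well-known Rockafellar-Uryasev minimization formula CVaR_α(Z) = min_c { c + E[(Z - c)₊]/(1 - α) }, and In-CVaR and bPOE then follow from the identities above. A minimal numpy sketch (restricting α to the empirical grid k/n; function names are illustrative):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha(Z) = min_c { c + E[(Z - c)_+] / (1 - alpha) },
    0 <= alpha < 1; the minimizing c may be taken among the sample values."""
    z = np.asarray(z, dtype=float)
    return min(c + np.mean(np.maximum(z - c, 0.0)) / (1.0 - alpha) for c in z)

def in_cvar(z, alpha, beta):
    """In-CVaR^beta_alpha(Z) = [(1-alpha)CVaR_alpha(Z) - (1-beta)CVaR_beta(Z)] / (beta - alpha),
    for 0 <= alpha < beta < 1."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

def bpoe(z, tau):
    """bPOE(Z, tau) = 1 - min{ alpha : CVaR_alpha(Z) >= tau }, alpha on the grid k/n."""
    n = len(z)
    for k in range(n):
        if cvar(z, k / n) >= tau:
            return 1.0 - k / n
    return 0.0

z = [1.0, 2.0, 3.0, 4.0]
assert cvar(z, 0.5) == 3.5        # mean of the worst half {3, 4}
assert in_cvar(z, 0.0, 0.5) == 1.5  # mean of the best half {1, 2}
assert bpoe(z, 3.5) == 0.5
```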
A touch of mathematics
bridging computations with theory

First of all: what can be computed?
[Diagram: a nested hierarchy of stationarity concepts, from broadest to sharpest: critical points (for dc functions) ⊇ Clarke stationary ⊇ limiting stationary ⊇ directional stationary (for dd functions) ⊇ points satisfying the SONC ⊇ local minima ⊇ points satisfying the SOSC.]

Relationship between the stationarity concepts:
• Stationarity:
  - smooth case: ∇V(x) = 0;
  - nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g. ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of them is continuously differentiable.
• Directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder to compute.
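The gap between the concepts is visible on a one-line example: for f(x) = -|x| at x = 0, the Clarke subdifferential is [-1, 1], so x = 0 is Clarke stationary, yet f'(0; d) = -|d| < 0 for every d ≠ 0, so it is not directionally stationary. A minimal numeric check (forward-difference estimate of the one-sided directional derivative; the step size is an illustrative choice):

```python
def dir_deriv(f, x, d, t=1e-7):
    """One-sided directional derivative estimate: (f(x + t*d) - f(x)) / t."""
    return (f(x + t * d) - f(x)) / t

f = lambda x: -abs(x)
# 0 is in the Clarke subdifferential [-1, 1] of f at 0 (Clarke stationary),
# but both d = +1 and d = -1 are strict descent directions:
assert dir_deriv(f, 0.0, 1.0) < 0 and dir_deriv(f, 0.0, -1.0) < 0
```

This is exactly why directional stationarity is the sharper, and computationally harder, target.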
Computational Tools
• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
• A fusion of first-order and second-order algorithms:
  - dealing with non-convexity: majorization-minimization algorithms;
  - dealing with non-smoothness: nonsmooth Newton algorithms;
  - dealing with pathological constraints: (exact) penalty methods.
• An integration of sampling and deterministic methods:
  - dealing with population optimization in statistics;
  - dealing with stochastic programming in operations research;
  - dealing with risk minimization in financial management and tail control.
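The majorization-minimization idea for a difference-of-convex function f = g - h is to linearize the concave part -h at the current iterate and minimize the resulting convex surrogate. A minimal sketch on the toy dc function f(x) = x² - 2|x| (this specific instance and its closed-form surrogate minimizer are illustrative choices, not from the talk):

```python
def dca(x0, iters=20):
    """Majorization-minimization (DCA) for f(x) = x^2 - 2|x|:
    majorize -2|x| by its linearization -2*s*x with s a subgradient of |.|
    at x_k, then minimize the convex surrogate x^2 - 2*s*x exactly (x = s)."""
    x = x0
    for _ in range(iters):
        s = (x > 0) - (x < 0)   # a subgradient of |.| at x (sign, 0 at x = 0)
        x = float(s)            # argmin of the convex majorizer x^2 - 2*s*x
    return x

# The method lands on the critical points x = +/-1, where f = -1 (the minima);
# started exactly at the non-minimizing critical point x = 0, it stays there.
assert dca(0.3) == 1.0 and dca(-5.0) == -1.0
```

Each surrogate touches f at the current iterate and lies above it everywhere, so the surrogate minimizer never increases f: the basic descent mechanism behind the surrogation theory of Chapter 7.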
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• piecewise smooth*, e.g. piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• subanalytic and semianalytic.
* and many more
[Figure, from Section 4.4 (p. 231) of the monograph: a diagram of inclusions among classes of nonsmooth functions, e.g. piecewise affine ⊂ piecewise linear-quadratic, PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable (locally Lipschitz + directionally differentiable), together with the dc, composite diff-max-convex, implicit convex-concave, weakly convex, lower-C², and locally dc classes.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Contents of presentation
bull Some general thoughts of the state of the field
bull A readiness test (a light-hearted humor)
bull Some key ldquononrdquo-features
bull Some novel ldquononrdquo-optimization problems
bull A gentle touch of mathematics
bull The monograph
Some General Thoughts: addressing the negative impact of the field of machine learning on the optimization field, and raising an alarm
• Convexity and/or differentiability have been the safe haven for optimization and its applications for decades, and particularly popular in machine learning in recent years,
to the extent that
• modellers are routinely sacrificing realism for convenience;
• algorithm designers will do whatever it takes to stay in this comfort zone, even at the expense of a potentially wrong approach;
• practitioners are content with the resulting models, algorithms, and “solutions”;
resulting in
• tools being abused,
• advances in science being stalled, and
• the future being built on soft ground.
This talk: benefits, yes or no? Ready for the “non”-era, yes or no?
A little humor before the storm.
A. Are you satisfied with the comfort zone of convexity and differentiability?
• Yes → you will not learn from this talk; live happily ever after.
• No → go to B.

B. How do you deal with non-convexity/non-smoothness?
• C1. Ignore these properties, or give them minimal or even wrong treatment.
• C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
• C3. Faithfully treat these non-properties, either by
– D1. employing heuristics, potentially leading to heuristics on heuristics when complex models are being formulated, or
– D2. taking the hard road: striving for rigor without sacrificing practice, to advance fundamental knowledge and prepare for the future.
Features of the “non”-problems:
• Non-Smooth + Non-Convex + Non-Deterministic
• Non-Separable
• Non-Stylistic
• Non-Interbreeding
• and many more non-topics ···
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression, extending classical affine regression:
\[
\underset{a,\alpha,b,\beta}{\text{minimize}}\ \frac{1}{N}\sum_{s=1}^{N}\Bigg( y_s - \underbrace{\Big[\ \max_{1\le i\le k_1}\big\{ (a^i)^\top x_s + \alpha_i \big\} \;-\; \max_{1\le i\le k_2}\big\{ (b^i)^\top x_s + \beta_i \big\}\ \Big]}_{\text{capturing all piecewise affine functions}} \Bigg)^{2},
\]
illustrating coupled non-convexity and non-differentiability, and used to estimate change points/hyperplanes in linear regression.
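As a concrete sketch (with hypothetical toy data, not from the talk), a difference of two max-affine functions evaluates any piecewise affine model, and the sample objective above is its mean-squared loss:

```python
# Sketch: the piecewise affine model f(x) = max_i{a_i.x + alpha_i} - max_i{b_i.x + beta_i}
# and its mean-squared loss on toy data; all data below are hypothetical.

def max_affine(pieces, x):
    """Max over affine pieces; each piece is (slope vector a, offset alpha)."""
    return max(sum(ai * xi for ai, xi in zip(a, x)) + alpha for a, alpha in pieces)

def pwa_model(plus_pieces, minus_pieces, x):
    # A difference of two max-affine functions captures every piecewise affine f.
    return max_affine(plus_pieces, x) - max_affine(minus_pieces, x)

def mse(plus_pieces, minus_pieces, data):
    return sum((y - pwa_model(plus_pieces, minus_pieces, x)) ** 2
               for x, y in data) / len(data)

# One-dimensional example: |x| = max(x, -x) - 0.
plus = [((1.0,), 0.0), ((-1.0,), 0.0)]    # max(x, -x) = |x|
minus = [((0.0,), 0.0)]                   # identically zero
data = [((-2.0,), 2.0), ((3.0,), 3.0)]    # samples of y = |x|
print(mse(plus, minus, data))             # 0.0 -- this model fits exactly
```

Here the non-smoothness (the max) and the non-convexity (the difference) enter exactly as in the display above.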
Comments
• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• Every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is
\[
\varphi_1\Big( \max_{1\le i\le k_{11}} \psi^1_{1i}(x) - \max_{1\le i\le k_{12}} \psi^1_{2i}(x) \Big) \;-\; \varphi_2\Big( \max_{1\le i\le k_{21}} \psi^2_{1i}(x) - \max_{1\le i\le k_{22}} \psi^2_{2i}(x) \Big),
\]
where all the functions are convex.
Logical Sparsity Learning
\[
\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\ f(\beta) \quad \text{subject to } \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \le b_i, \quad i = 1,\cdots,m,
\]
where
\[
|t|_0 \,\triangleq\, \begin{cases} 1 & \text{if } t \ne 0, \\ 0 & \text{otherwise.} \end{cases}
\]
Such constraints model, for example:
• strong/weak hierarchy, e.g. |β_j|₀ ≤ |β_k|₀;
• cost-based variable selection (coefficients ≥ 0 but not all equal);
• hierarchical/group variable selection.
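A minimal sketch (with hypothetical data) of what these affine sparsity constraints express: feasibility depends only on the support pattern of β through the discrete |·|₀ function.

```python
# Sketch: checking feasibility of affine sparsity constraints
#   sum_j a[i][j] * |beta_j|_0 <= b[i],  i = 1, ..., m.

def l0(t):
    """The |t|_0 function: 1 if t is nonzero, 0 otherwise."""
    return 0 if t == 0 else 1

def asc_feasible(A, b, beta):
    support = [l0(bj) for bj in beta]
    return all(sum(aij * sj for aij, sj in zip(row, support)) <= bi
               for row, bi in zip(A, b))

# The strong hierarchy |beta_1|_0 <= |beta_0|_0, written as -|b_0|_0 + |b_1|_0 <= 0:
A = [[-1, 1]]
b = [0]
print(asc_feasible(A, b, (2.5, 0.0)))  # True:  beta_1 is inactive
print(asc_feasible(A, b, (0.0, 1.0)))  # False: beta_1 active while beta_0 is not
```

The negative entry in A is exactly the situation flagged in the comments below: the feasible support patterns need not form a closed set when A has negative entries.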
Comments
• The ℓ₀ function is discrete in nature; an ASC can be formulated either via complementarity constraints or via integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind
\[
\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\ f(\beta) + \gamma \sum_{i=1}^{m} \Big[\ \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 - b_i\ \Big]_+
\]
for a given γ > 0, penalizing the constraint violations.
• When the matrix A has negative entries, the resulting ASC set may not be closed, and its optimization formulation may not have a lower semicontinuous objective.
• When |·|₀ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:
\[
\underset{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}{\text{minimize}}\ \frac{1}{S}\sum_{s=1}^{S} \big[\, y_s - f_\theta(x_s) \,\big]^2,
\]
where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:
\[
f_\theta(x) \;=\; \underbrace{\big( W^L \bullet + w^L \big)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\big( 0,\, W^2 \bullet + w^2 \big)}_{\text{second layer}} \circ \underbrace{\max\big( 0,\, W^1 x + w^1 \big)}_{\text{input layer}},
\]
i.e., linear (output) ∘ ··· ∘ max ∘ linear ∘ max ∘ linear (input).
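The L-fold composite above can be sketched directly (with hypothetical toy weights): hidden layers apply max(0, Wx + w), the output layer is affine, and the training objective is the empirical squared loss.

```python
# Sketch: f_theta as an L-fold ReLU composite, plus the empirical squared loss
#   (1/S) * sum_s (y_s - f_theta(x_s))^2.  Weights and samples are hypothetical.

def affine(W, w, x):
    return [sum(Wij * xj for Wij, xj in zip(row, x)) + wi
            for row, wi in zip(W, w)]

def relu(v):
    return [max(0.0, vi) for vi in v]

def f_theta(layers, x):
    # Hidden layers: ReLU(Wx + w); output layer: affine only, as in the slide.
    *hidden, (WL, wL) = layers
    for W, w in hidden:
        x = relu(affine(W, w, x))
    return affine(WL, wL, x)

def squared_loss(layers, samples):
    return sum((y - f_theta(layers, x)[0]) ** 2 for x, y in samples) / len(samples)

# A 1-2-1 network computing |x| = ReLU(x) + ReLU(-x):
layers = [([[1.0], [-1.0]], [0.0, 0.0]),   # hidden: (x, -x), then ReLU
          ([[1.0, 1.0]], [0.0])]           # output: sum of the two units
samples = [([-3.0], 3.0), ([2.0], 2.0)]
print(squared_loss(layers, samples))       # 0.0
```

Even this tiny network is nonconvex and nonsmooth in θ, which is the point of the multi-composite formulation.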
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ₁ penalization:
\[
\underset{z\in Z,\,u}{\text{minimize}}\ \zeta_\rho(z,u) \,\triangleq\, \frac{1}{S}\sum_{s=1}^{S} \varphi_s\big(u^{s,L}\big) + \rho \sum_{\ell=1}^{L}\sum_{j=1}^{N} \underbrace{\Big|\, u^{s,\ell}_j - \psi_{\ell j}\big(z^\ell,\, u^{s,\ell-1}\big) \Big|}_{\text{penalizing the components}}.
\]
• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• It leads to an advanced study of exact penalization and error bounds for nondifferentiable optimization problems.
Examples of Non-Problems
II. Smart Planning

Power Systems Planning
ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity;
x: production from traditional thermal plants; y: production from backup fast-ramping generators.
\[
\underset{x\in X}{\text{minimize}}\ \varphi(x) + \mathbb{E}_{\omega_1,\omega_2}\big[\, \psi(x,\omega_1,\omega_2) \,\big],
\]
where the recourse function is
\[
\psi(x,\omega_1,\omega_2) \,\triangleq\, \underset{y}{\text{minimum}}\ y^\top c\big(\,[\,\omega_2 - x - \omega_1\,]_+\,\big) \quad \text{subject to } A(\omega)x + Dy \ge \xi(\omega).
\]
The unit cost of the fast-ramping generators depends on:
▶ the observed renewable production and demand of electricity;
▶ the shortfall due to the first-stage thermal power decisions.
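A minimal single-bus sketch (all numbers and the price curve are hypothetical) of why this recourse is nonconvex in x: the unit cost c(·) itself depends on the shortfall [ω₂ − x − ω₁]₊.

```python
# Sketch: the shortfall-dependent unit cost c([w2 - x - w1]_+) that couples
# the first-stage decision x into the backup generators' price.

def shortfall(x, w1, w2):
    # Unserved load after thermal output x and renewables w1 meet demand w2.
    return max(0.0, w2 - x - w1)

def unit_cost(s, base=40.0, surge=15.0):
    # Hypothetical price curve: cost per unit of backup power rises with shortfall.
    return base + surge * s

def backup_cost(x, w1, w2):
    s = shortfall(x, w1, w2)
    return s * unit_cost(s)   # dispatch y = shortfall, priced at the shortfall-dependent rate

# Scenario: demand 100, renewables 20, thermal 70 -> shortfall 10.
print(backup_cost(70.0, 20.0, 100.0))  # 10 * (40 + 15*10) = 1900.0
```

Because the cost is quadratic in the shortfall, the scenario cost is piecewise quadratic and nonconvex as a function of x.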
Comments
• This leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
\[
\underset{x}{\text{minimize}}\ \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_\xi\big[\, \psi(x,\xi) \,\big] \quad \text{subject to } x \in X \subseteq \mathbb{R}^{n_1},
\]
where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:
\[
\psi(x,\xi) \,\triangleq\, \underset{y}{\text{minimum}}\ \big[\, f(\xi) + G(\xi)x \,\big]^\top y + \tfrac12\, y^\top Q y \quad \text{subject to } y \in Y(x,\xi) \,\triangleq\, \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \ge b(\xi) \,\big\}.
\]
• The recourse function ψ(·, ξ) is of the implicit convex-concave kind.
• Developed a regularization-convexification sampling combined method for computing a generalized critical point, with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs
The underlying problem is: minimize cᵀx subject to Ax = b and x ≥ 0.

Defender’s problem: anticipating the attacker’s disruption δ ∈ ∆, the defender undertakes the operation by solving
\[
\underset{x\ge 0}{\text{minimize}}\ (c+\delta_c)^\top x + \rho\, \underbrace{\big\|\, \big[\, (A+\delta_A)x - b - \delta_b \,\big]_+ \big\|_\infty}_{\text{constraint residual}}.
\]
Attacker’s problem: anticipating the defender’s activity x, the attacker aims to disrupt the defender’s operations by solving
\[
\underset{\delta\in\Delta}{\text{maximum}}\ \lambda_1 \underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2 \underbrace{\big\|\, \big[\, (A+\delta_A)x - b - \delta_b \,\big]_+ \big\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}}.
\]
• Many variations are possible.
• This offers alternative approaches to the convexity-based robust formulations.
• Adversarial programming.
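The residual term shared by both players can be sketched on a tiny hypothetical instance (all matrices and perturbations below are made up for illustration):

```python
# Sketch: the penalized residual || [ (A + dA) x - b - db ]_+ ||_inf that
# appears in both the defender's and the attacker's objectives.

def residual_inf_norm(A, dA, x, b, db):
    vals = []
    for row, drow, bi, dbi in zip(A, dA, b, db):
        ri = sum((aij + daij) * xj for aij, daij, xj in zip(row, drow, x)) - bi - dbi
        vals.append(max(0.0, ri))          # the plus-part [.]_+
    return max(vals)                       # the infinity norm

A  = [[1.0, 1.0], [2.0, 0.0]]
dA = [[0.0, 0.5], [0.0, 0.0]]              # attacker perturbs one coefficient
x  = [1.0, 2.0]
b  = [4.0, 2.0]
db = [0.0, -1.0]                           # and tightens one right-hand side

print(residual_inf_norm(A, dA, x, b, db))  # max([4-4]_+, [2-2+1]_+) = 1.0
```

Note the nonsmoothness: both the plus-part and the ∞-norm are piecewise linear, so each player's objective is nonsmooth in its own variable.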
Robustification of nonconvex problems
A general nonlinear program:
\[
\underset{x\in S}{\text{minimize}}\ \theta(x) + f(x) \quad \text{subject to } H(x) = 0,
\]
where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H: O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:
\[
\underset{x\in S}{\text{minimize}}\ \theta(x) + \underset{\delta\in\Delta}{\text{maximum}}\ f(x,\delta) \quad \text{subject to } H(x,\delta) = 0 \ \ \forall\, \delta \in \Delta,
\]
which is easily infeasible in the current way of modeling.
New formulation
via a pessimistic value-function minimization: given γ > 0,
\[
\underset{x\in S}{\text{minimize}}\ \theta(x) + v_\gamma(x),
\]
where v_γ(x) models constraint violation with the cost of model attacks:
\[
v_\gamma(x) \,\triangleq\, \underset{\delta\in\Delta}{\text{maximum}}\ \varphi_\gamma(x,\delta), \qquad \varphi_\gamma(x,\delta) \,\triangleq\, f(x,\delta) + \gamma\, \underbrace{\big\| H(x,\delta) \big\|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}},
\]
with the novel features:
• a combined perturbation δ in a joint uncertainty set ∆ ⊆ ℝ^L;
• robustified constraint satisfaction via penalization.
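When ∆ is a finite scenario set, the pessimistic value function is a plain maximum and can be sketched directly (the one-dimensional f, H, and C below are hypothetical stand-ins):

```python
# Sketch: v_gamma(x) = max_{delta in Delta} [ f(x, delta)
#                     + gamma * ||H(x, delta)||_inf - C(delta) ]
# over a finite uncertainty set Delta.

def v_gamma(x, deltas, f, H, C, gamma):
    def phi(d):
        return f(x, d) + gamma * max(abs(h) for h in H(x, d)) - C(d)
    return max(phi(d) for d in deltas)

# One-dimensional toy: f(x,d) = d*x, H(x,d) = [x - d], C(d) = d^2.
deltas = [0.0, 1.0, 2.0]
val = v_gamma(3.0, deltas,
              f=lambda x, d: d * x,
              H=lambda x, d: [x - d],
              C=lambda d: d * d,
              gamma=10.0)
print(val)  # delta=0: 0+30-0=30; delta=1: 3+20-1=22; delta=2: 6+10-4=12 -> 30.0
```

As a pointwise maximum of nonconvex functions of x, v_γ is exactly the kind of nonsmooth value function the talk's stationarity theory is built for.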
Comments
• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(·, δ) and H(·, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.
• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:
\[
\underset{x\in S}{\text{minimize}}\ \varphi(x) \,\triangleq\, \theta(x) + v(x), \quad \text{with } v(x) \,\triangleq\, \max_{1\le j\le J}\ \underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\text{maximum}}\ \big[\, g_j(x,\delta) - h_j(x)^\top \delta \,\big].
\]
• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the “non”-framework.
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control the left and right tails of a distribution simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:
\[
\text{In-CVaR}^{\,\beta}_{\alpha}(Z) \,\triangleq\, \frac{1}{\beta-\alpha} \int_{\text{VaR}_\beta(Z) \,>\, z \,\ge\, \text{VaR}_\alpha(Z)} z\, dF_Z(z) \;=\; \frac{1}{\beta-\alpha}\,\big[\, (1-\alpha)\,\text{CVaR}_\alpha(Z) - (1-\beta)\,\text{CVaR}_\beta(Z) \,\big].
\]
Buffered probability of exceedance/deceedance at level τ > 0:
\[
\text{bPOE}(Z,\tau) \,\triangleq\, 1 - \min\big\{\, \alpha \in [0,1] \mid \text{CVaR}_\alpha(Z) \ge \tau \,\big\}, \qquad \text{bPOD}(Z,u) \,\triangleq\, \min\big\{\, \beta \in [0,1] \mid \text{In-CVaR}^{\,\beta}_{0}(Z) \ge u \,\big\}.
\]
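On an empirical distribution with equally likely samples, the In-CVaR identity above can be checked numerically (a sketch; the level choices are picked so the tails are exact fractions of the sample):

```python
# Sketch: empirical CVaR_a as the mean of the worst (1-a)-tail, and the identity
#   In-CVaR_a^b(Z) = [ (1-a) CVaR_a(Z) - (1-b) CVaR_b(Z) ] / (b - a).
# Assumes (1-a)*n is an integer so the tail is exact.

def cvar(samples, a):
    k = round((1 - a) * len(samples))      # size of the worst (1-a)-tail
    tail = sorted(samples)[-k:]            # the k largest outcomes
    return sum(tail) / k

def in_cvar(samples, a, b):
    return ((1 - a) * cvar(samples, a) - (1 - b) * cvar(samples, b)) / (b - a)

z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(cvar(z, 0.5))          # mean of {6,...,10} = 8.0
print(in_cvar(z, 0.0, 0.5))  # averages the lower half {1,...,5} -> 3.0
```

The second line confirms the identity: In-CVaR with α = 0 and β = 0.5 averages the values strictly below the median, i.e. the left tail that the composite estimation problem seeks to control.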
Buffered probability of the interval (τ, u]
We have the following bound for an interval probability:
\[
\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z,\tau) + \text{bPOD}(Z,u) - 1 \,\triangleq\, \text{bPOI}^{\,u}_{\tau}(Z).
\]
Two risk minimization problems of a composite random functional:
\[
\underset{\theta\in\Theta}{\text{minimize}}\ \underbrace{\text{In-CVaR}^{\,\beta}_{\alpha}\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}} \qquad \text{and} \qquad \underset{\theta\in\Theta}{\text{minimize}}\ \underbrace{\text{bPOI}^{\,u}_{\tau}\big(Z(\theta)\big)}_{\text{a stochastic program with a difference-of-convex objective}},
\]
where
\[
Z(\theta) \,\triangleq\, c\big(\, \underbrace{g(\theta,Z) - h(\theta,Z)}_{g(\bullet,z)\,-\,h(\bullet,z) \text{ is a difference of convex functions}} \,\big),
\]
with c: ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(·, z) and h(·, z) both convex functions.
Comments
• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ
\[
\underbrace{\underset{\eta_1\in\Upsilon_1}{\text{minimum}}\ \mathbb{E}_\omega\big[\, \varphi_1(\theta,\eta_1,\omega) \,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\underset{\eta_2\in\Upsilon_2}{\text{minimum}}\ \mathbb{E}_\omega\big[\, \varphi_2(\theta,\eta_2,\omega) \,\big]}_{\text{denoted } v_2(\theta)}.
\]
Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of all: what can be computed?
[Diagram: relationships among the stationarity concepts — critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions), and local minima with the second-order conditions SONC and SOSC.]

▶ Stationarity:
– smooth case: ∇V(x) = 0;
– nonsmooth case: 0 ∈ ∂V(x) [a subdifferential; many calculus rules fail], e.g. ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁+V₂)(x) unless one of the summands is continuously differentiable.
▶ Directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder it is to compute.
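The gap between the concepts can be seen on the standard textbook example f(x) = −|x| (not from the slides): at x = 0 the Clarke subdifferential is [−1, 1], which contains 0, so x = 0 is Clarke stationary; yet f′(0; d) = −|d| < 0 for every d ≠ 0, so it is not directionally stationary — every direction is a descent direction.

```python
# Sketch: at x = 0, f(x) = -|x| is Clarke stationary but not d-stationary.
# We estimate the one-sided directional derivatives by forward differences.

def f(x):
    return -abs(x)

def dir_deriv(f, x, d, t=1e-8):
    # One-sided directional derivative f'(x; d) ~ (f(x + t*d) - f(x)) / t.
    return (f(x + t * d) - f(x)) / t

for d in (1.0, -1.0):
    print(d, dir_deriv(f, 0.0, d))  # both are -1.0: descent in both directions
```

This is precisely why the talk insists on directional stationarity as the sharpest computable first-order concept for directionally differentiable functions.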
Computational Tools
▶ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
▶ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: the majorization-minimization algorithm;
– dealing with non-smoothness: nonsmooth Newton algorithms;
– dealing with pathological constraints: (exact) penalty methods.
▶ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics;
– dealing with stochastic programming in operations research;
– dealing with risk minimization in financial management and tail control.
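A scalar sketch of the majorization-minimization idea via the dc algorithm (an illustrative example, not from the slides): for f = g − h with g, h convex, linearizing h at the current iterate yields a convex upper surrogate whose minimizer is the next iterate.

```python
# Sketch: dc algorithm on f(x) = g(x) - h(x) with g(x) = x^2 and h(x) = 2|x|.
# f has the two local minima x = +/-1 with value f = -1. At x_k, a subgradient
# of h is 2*s with s = sign(x_k); the convex surrogate x^2 - 2*s*x majorizes f
# (up to a constant) and is minimized at x = s.

def dca_step(x):
    s = 1.0 if x >= 0 else -1.0   # subgradient choice for h(x) = 2|x|
    return s                      # argmin of the surrogate x^2 - 2*s*x

def dca(x0, iters=10):
    x = x0
    for _ in range(iters):
        x = dca_step(x)
    return x

x_star = dca(0.75)
print(x_star, x_star**2 - 2 * abs(x_star))  # 1.0 -1.0
```

Each iterate decreases f and the limit is a stationary point of the dc decomposition; which of the two local minima is reached depends on the starting sign — the hallmark of nonconvexity.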
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• piecewise smooth*, e.g. piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• subanalytic and semianalytic.
* and many more
[Diagram, from Section 4.4 of the monograph (p. 231): inclusions among the classes of nonsmooth functions — PC², PC¹, piecewise affine (PA), piecewise linear-quadratic (PLQ), dc and dc-composite, diff-max-convex, implicit convex-concave (icc), semismooth, B-differentiable, (locally) s-weakly convex, locally dc, lower-C², LC¹, and C¹ — valid under various domain and differentiability assumptions.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher on Friday, September 11, 2020
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbach’s algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-Łojasiewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
1032 The positive semidefinite case 571
104 Probabilistic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of φ̂c 683
Bibliography 691
Index 721
Some General Thoughts
addressing the negative impact of the field of machine learning on the optimization field, and raising an alarm
• Convexity and/or differentiability have been the safe haven for optimization and its applications for decades, and particularly popular in machine learning in recent years,
to the extent that
• modellers are routinely sacrificing realism for convenience,
• algorithm designers will do whatever it takes to stay in this comfort zone, even at the expense of a potentially wrong approach,
• practitioners are content with the resulting models, algorithms, and "solutions",
resulting in
• tools being abused,
• advances in science being stalled, and
• the future being built on soft ground.
This talk: benefits, yes or no?
Ready for the "non"-era, yes or no?
A little humor before the storm
A readiness flowchart:

A: Are you satisfied with the comfort zone of convexity and differentiability?
• Yes → you will not learn from this talk; live happily ever after.
• No → B: How to deal with non-convexity/non-smoothness?
  – C1: Ignore these properties, or give them minimal or even wrong treatment.
  – C2: Focus on stylized problems (and there are many interesting ones) and stop at them.
  – C3: Faithfully treat these non-properties, via either:
    – D1: Employing heuristics, potentially leading to heuristics on heuristics when complex models are being formulated; or
    – D2: The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features of Non-Problems
• Non-Smooth + Non-Convex + Non-Deterministic
• Non-Separable
• Non-Stylistic
• Non-Interbreeding
• and many more "non"-topics …
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression
extending classical affine regression:

$$\operatorname*{minimize}_{a,\alpha,b,\beta}\;\frac{1}{N}\sum_{s=1}^{N}\Bigg[\,y_s-\bigg(\underbrace{\max_{1\le i\le k_1}\big\{(a^i)^\top x_s+\alpha_i\big\}-\max_{1\le i\le k_2}\big\{(b^i)^\top x_s+\beta_i\big\}}_{\text{capturing all piecewise affine functions}}\bigg)\Bigg]^2$$

illustrating coupled non-convexity and non-differentiability.

Estimate Change Points/Hyperplanes in Linear Regression
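As a quick numerical illustration (all data and parameter values below are made up), the piecewise-affine model and its nonsmooth least-squares loss take only a few lines of numpy; the difference of two max-of-affine terms can represent any continuous piecewise affine function:

```python
import numpy as np

# Toy sketch: evaluate the piecewise-affine regression loss
#   (1/N) * sum_s ( y_s - [ max_i((a^i)^T x_s + alpha_i) - max_i((b^i)^T x_s + beta_i) ] )^2
def pwa_model(X, A, alpha, B, beta):
    # max over k1 affine pieces minus max over k2 affine pieces
    return np.max(X @ A.T + alpha, axis=1) - np.max(X @ B.T + beta, axis=1)

def pwa_loss(X, y, A, alpha, B, beta):
    r = y - pwa_model(X, A, alpha, B, beta)
    return np.mean(r ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.abs(X[:, 0])                       # target |x_0| = max(x_0, -x_0) is itself PWA
A = np.array([[1.0, 0.0], [-1.0, 0.0]]); alpha = np.zeros(2)
B = np.zeros((1, 2)); beta = np.zeros(1)  # a single zero piece in the second max
print(pwa_loss(X, y, A, alpha, B, beta))  # -> 0.0 (this parameter choice fits exactly)
```

The loss is coupled nonconvex and nondifferentiable in (a, α, b, β) even though, for fixed parameters, it is cheap to evaluate.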
Comments
• This is a convex quadratic composed with a piecewise affine function, thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\Big)-\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\Big)$$

where all the functions $\varphi_\bullet$ and $\psi^\bullet_{\bullet i}$ are convex.
Logical Sparsity Learning

$$\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\; f(\beta)\quad\text{subject to}\quad \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \,\ge\, b_i,\qquad i=1,\cdots,m,$$

where $|t|_0 \triangleq \begin{cases} 1 & \text{if } t\ne 0,\\ 0 & \text{otherwise.} \end{cases}$ Such affine sparsity constraints (ASCs) model, e.g.:
• strong/weak hierarchy, e.g., $|\beta_j|_0 \le |\beta_k|_0$;
• cost-based variable selection (coefficients ≥ 0 but not all equal);
• hierarchical/group variable selection.
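A minimal sketch (hypothetical data; the helper names `l0`, `hierarchy_ok`, and `soft_violation` are mine, not from the talk) of the entrywise ℓ0 indicator, the strong-hierarchy constraint |βj|0 ≤ |βk|0, and the [·]+ residual that a soft formulation penalizes:

```python
import numpy as np

# |t|_0 = 1 if t != 0, else 0 (entrywise); a tolerance can absorb round-off
def l0(beta, tol=0.0):
    return (np.abs(np.asarray(beta)) > tol).astype(float)

# strong hierarchy |beta_j|_0 <= |beta_k|_0: beta_j may be active only if beta_k is
def hierarchy_ok(beta, j, k):
    ind = l0(beta)
    return ind[j] <= ind[k]

# [ |beta_j|_0 - |beta_k|_0 ]_+ : the plus-function penalty of the violated hierarchy
def soft_violation(beta, j, k):
    ind = l0(beta)
    return max(ind[j] - ind[k], 0.0)

beta_good = [0.0, 2.5, 1.0]   # beta_0 inactive: hierarchy with (j,k) = (0,1) holds
beta_bad  = [3.0, 0.0, 1.0]   # beta_0 active while beta_1 is not: violated
print(hierarchy_ok(beta_good, 0, 1), soft_violation(beta_bad, 0, 1))  # -> True 1.0
```

The discreteness of |·|0 is what forces the complementarity, integer, or folded-concave reformulations discussed next.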
Comments
• The ℓ0 function is discrete in nature; an ASC can be formulated either via complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind

$$\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\; f(\beta)+\gamma\sum_{i=1}^{m}\Bigg[\,b_i-\sum_{j=1}^{d} a_{ij}\,|\beta_j|_0\Bigg]_+\qquad\text{for a given }\gamma>0.$$

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When $|\bullet|_0$ is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:

$$\operatorname*{minimize}_{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}\;\frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x_s)\,\big]^2$$

where $f_\theta$ is the $L$-fold composite function with the nonsmooth max (ReLU) activation:

$$f_\theta(x)=\underbrace{\big(W^L\bullet+w^L\big)}_{\text{Output Layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\bullet+w^2\big)}_{\text{Second Layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{Input Layer}}$$

i.e., linear (output) ∘ ··· ∘ max ∘ linear ∘ max ∘ linear (input).
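The L-fold composite fθ is easy to write out concretely; this sketch (made-up weights, no training) evaluates the forward composition and the empirical squared loss, using a one-hidden-layer net that represents |t| = max(0, t) + max(0, −t) exactly:

```python
import numpy as np

# forward pass of f_theta: ReLU max(0, .) on every layer except the affine output layer
def f_theta(x, weights):
    u = x
    for W, w in weights[:-1]:
        u = np.maximum(0.0, W @ u + w)   # max(0, W u + w): nonsmooth, piecewise affine in u
    W_L, w_L = weights[-1]
    return W_L @ u + w_L

# (1/S) * sum_s (y_s - f_theta(x_s))^2  -- multi-composite, nonconvex, nonsmooth in theta
def empirical_loss(samples, weights):
    return np.mean([(y - f_theta(x, weights)) ** 2 for x, y in samples])

weights = [(np.array([[1.0], [-1.0]]), np.zeros(2)),   # hidden layer: (t, -t) then ReLU
           (np.array([[1.0, 1.0]]), np.zeros(1))]      # output: sum = |t|
samples = [(np.array([t]), np.abs(t)) for t in (-2.0, -0.5, 0.0, 1.5)]
print(empirical_loss(samples, weights))   # -> 0.0
```

The loss is smooth in no layer's weights because every hidden max(0, ·) is kinked; this is exactly the multi-composite structure the exact-penalization treatment below is built for.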
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:

$$\operatorname*{minimize}_{z\in Z,\,u}\;\zeta_\rho(z,u)\,\triangleq\,\frac{1}{S}\sum_{s=1}^{S}\Bigg\{\varphi_s(u^{s,L})+\rho\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Big|\,u^{s,\ell}_j-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}\Bigg\}$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• It leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems
II. Smart Planning

Power Systems Planning
ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity;
x: production from traditional thermal plants;
y: production from backup fast-ramping generators.

$$\operatorname*{minimize}_{x\in X}\;\varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big]$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2)\,\triangleq\,\operatorname*{minimum}_{y}\;y^\top c\big([\,\omega_2-x-\omega_1\,]_+\big)\quad\text{subject to}\quad A(\omega)x+Dy\ge\xi(\omega).$$

The unit cost of the fast-ramping generators depends on
– the observed renewable production and demand of electricity;
– the shortfall due to the first-stage thermal power decisions.
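A toy sample-average sketch of this two-stage structure (all numbers are hypothetical; the recourse here is a one-dimensional LP solved with `scipy.optimize.linprog`, standing in for the general recourse problem): the unit cost of the backup generator grows with the observed shortfall, which is what makes the recourse value nonconvex in the first-stage decision x:

```python
import numpy as np
from scipy.optimize import linprog

def recourse(x, omega1, omega2, base_cost=1.0, ramp_cost=0.5):
    shortfall = max(omega2 - x - omega1, 0.0)      # unmet demand after renewables
    unit_cost = base_cost + ramp_cost * shortfall  # cost depends on x: nonconvexity enters here
    # value function of an LP:  min unit_cost * y  s.t.  x + omega1 + y >= omega2,  y >= 0
    res = linprog(c=[unit_cost], A_ub=[[-1.0]], b_ub=[x + omega1 - omega2],
                  bounds=[(0.0, None)], method="highs")
    return res.fun

def saa_objective(x, scenarios, thermal_cost=2.0):
    # phi(x) + (1/S) * sum_s psi(x, omega^s): the sample-average approximation
    return thermal_cost * x + np.mean([recourse(x, w1, w2) for w1, w2 in scenarios])

rng = np.random.default_rng(1)
scenarios = list(zip(rng.uniform(0, 2, 50), rng.uniform(3, 5, 50)))  # (renewable, demand)
print(saa_objective(1.0, scenarios))
```

Evaluating the objective at a single x already requires solving one optimization problem per scenario; that is the value-function structure studied in the comments that follow.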
Comments
• Leads to the study of the class of two-stage, linearly bi-parameterized stochastic programs with quadratic recourse:

$$\operatorname*{minimize}_{x}\;\zeta(x)\,\triangleq\,\varphi(x)+\mathbb{E}_\xi\big[\psi(x,\xi)\big]\quad\text{subject to}\quad x\in X\subseteq\mathbb{R}^{n_1},$$

where the recourse function ψ(x, ξ) is the optimal objective value of a quadratic program (QP), with Q symmetric positive semidefinite:

$$\psi(x,\xi)\,\triangleq\,\operatorname*{minimum}_{y}\;\big[f(\xi)+G(\xi)x\big]^\top y+\tfrac{1}{2}\,y^\top Qy\quad\text{subject to}\quad y\in Y(x,\xi)\,\triangleq\,\big\{y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\big\}.$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a regularization + convexification + sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs
The underlying problem is: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

$$\operatorname*{minimize}_{x\ge 0}\;(c+\delta_c)^\top x+\rho\,\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\big\|_\infty}_{\text{constraint residual}}$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\operatorname*{maximum}_{\delta\in\Delta}\;\lambda_1\,\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}}+\lambda_2\,\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\big\|_\infty}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
Robustification of nonconvex problems
A general nonlinear program:

$$\operatorname*{minimize}_{x\in S}\;\theta(x)+f(x)\quad\text{subject to}\quad H(x)=0,$$

where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H: O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\operatorname*{minimize}_{x\in S}\;\theta(x)+\operatorname*{maximum}_{\delta\in\Delta}\,f(x,\delta)\quad\text{subject to}\quad \underbrace{H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}$$
New formulation
via a pessimistic value-function minimization: given γ > 0,

$$\operatorname*{minimize}_{x\in S}\;\theta(x)+v_\gamma(x),$$

where v_γ(x) models constraint violation with a cost of model attacks:

$$v_\gamma(x)\,\triangleq\,\operatorname*{maximum}_{\delta\in\Delta}\;\varphi_\gamma(x,\delta)\,\triangleq\,f(x,\delta)+\gamma\,\underbrace{\|H(x,\delta)\|_\infty}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:
• combined perturbation δ in a joint uncertainty set ∆ ⊆ ℝᴸ;
• robustified constraint satisfaction via penalization.
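When ∆ is a finite set, evaluating v_γ(x) is a plain enumeration; the sketch below (hypothetical data and attack-cost function C, for illustration only) computes the pessimistic value function for a tiny linear model f(x, δ) = (c + δc)⊤x with equality residual H(x, δ) = Ax − b − δb:

```python
import numpy as np

# v_gamma(x) = max_{delta in Delta} [ f(x,delta) + gamma * ||H(x,delta)||_inf - C(delta) ]
def v_gamma(x, deltas, gamma, f, H, C):
    return max(f(x, d) + gamma * np.linalg.norm(H(x, d), ord=np.inf) - C(d)
               for d in deltas)

c = np.array([1.0, 2.0]); A = np.array([[1.0, 1.0]]); b = np.array([1.0])
f = lambda x, d: float((c + d["c"]) @ x)        # perturbed objective
H = lambda x, d: A @ x - b - d["b"]             # perturbed constraint residual
C = lambda d: 0.1 * (np.abs(d["c"]).sum() + np.abs(d["b"]).sum())  # cost of the attack
deltas = [{"c": np.zeros(2), "b": np.zeros(1)},           # no attack
          {"c": np.array([0.5, 0.0]), "b": np.zeros(1)},  # attack the objective
          {"c": np.zeros(2), "b": np.array([0.5])}]       # attack the constraints
x = np.array([0.5, 0.5])
print(v_gamma(x, deltas, gamma=10.0, f=f, H=H, C=C))  # -> 6.45
```

Here γ = 10 makes constraint violation expensive, so the constraint-attacking δ dominates; the outer minimization over x then trades θ(x) against this worst case.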
Comments
• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

$$\operatorname*{minimize}_{x\in S}\;\varphi(x)\,\triangleq\,\theta(x)+v(x)\quad\text{with}\quad v(x)\,\triangleq\,\max_{1\le j\le J}\;\operatorname*{maximum}_{\delta\in\Delta\cap\mathbb{R}^L_+}\big[\,g_j(x,\delta)-h_j(x)^\top\delta\,\big].$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control the left and right tails of the distribution simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z)\,\triangleq\,\frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)} z\,dF_Z(z)\;=\;\frac{1}{\beta-\alpha}\Big[\,(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\,\Big]$$

Buffered probability of exceedance/deceedance at level τ > 0 (resp. u):

$$\text{bPOE}(Z,\tau)\,\triangleq\,1-\min\big\{\alpha\in[0,1]\,:\,\text{CVaR}_\alpha(Z)\ge\tau\big\},\qquad \text{bPOD}(Z,u)\,\triangleq\,\min\big\{\beta\in[0,1]\,:\,\text{In-CVaR}^{\,\beta}_{0}(Z)\ge u\big\}$$
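For an empirical distribution, CVaR can be computed from the Rockafellar-Uryasev minimization formula CVaR_α(Z) = min_η { η + E[Z − η]_+ / (1 − α) } (the minimum is attained at a sample point), and In-CVaR then follows from the identity relating it to two CVaRs; a numpy sketch with a made-up sample:

```python
import numpy as np

# empirical CVaR at level a via the Rockafellar-Uryasev formula,
# minimizing over eta restricted to the sample points (exact for empirical laws)
def cvar(z, a):
    return min(eta + np.mean(np.maximum(z - eta, 0.0)) / (1.0 - a) for eta in z)

# In-CVaR_{a}^{b}(Z) = [ (1-a)*CVaR_a(Z) - (1-b)*CVaR_b(Z) ] / (b - a)
def in_cvar(z, a, b):
    return ((1.0 - a) * cvar(z, a) - (1.0 - b) * cvar(z, b)) / (b - a)

z = np.arange(1.0, 11.0)        # samples 1, 2, ..., 10, equally likely
print(cvar(z, 0.9))             # -> 10.0 (mean of the worst 10% of outcomes)
print(in_cvar(z, 0.5, 0.9))     # ~ 7.5  (mean of the band between VaR_.5 and VaR_.9)
```

The difference of the two (convex) CVaR value functions is exactly the difference-of-convex structure exploited in the composite minimization problems below.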
Buffered probability of the interval (τ, u]
We have the following bound for an interval probability:

$$\mathbb{P}(u\ge Z>\tau)\;\le\;\text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\,\triangleq\,\text{bPOI}^{\,u}_{\,\tau}(Z)$$

Two risk minimization problems of a composite random functional:

$$\operatorname*{minimize}_{\theta\in\Theta}\;\underbrace{\text{In-CVaR}^{\,\beta}_{\,\alpha}\big(Z(\theta)\big)}_{\substack{\text{a difference of two convex}\\\text{SP value functions}}}\qquad\text{and}\qquad\operatorname*{minimize}_{\theta\in\Theta}\;\underbrace{\text{bPOI}^{\,u}_{\,\tau}\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with a}\\\text{difference-of-convex objective}}}$$

where

$$Z(\theta)\,\triangleq\,c\Big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)-h(\bullet,z)\text{ is a difference of convex}}\Big)$$

with c: ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
Comments
• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\operatorname*{minimum}_{\eta_1\in\Upsilon_1}\;\mathbb{E}_\omega\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}\;-\;\underbrace{\operatorname*{minimum}_{\eta_2\in\Upsilon_2}\;\mathbb{E}_\omega\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}$$

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All, What Can Be Computed?

[Diagram: the hierarchy of stationarity concepts — critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions) — together with SONC (second-order necessary conditions), local minimizers, and SOSC (second-order sufficient conditions).]
Relationship between the stationary points
– Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable;
– directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder to compute.
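The gap between the concepts can be seen numerically: at x̄ = 0, V1(x) = |x| is directionally stationary (V1′(0; d) = |d| ≥ 0 for every d), whereas V2(x) = −|x| is Clarke stationary (0 lies in its Clarke subdifferential [−1, 1]) yet not directionally stationary, since V2′(0; d) = −|d| < 0 for d ≠ 0. A forward-difference sketch:

```python
# one-sided directional derivative V'(x; d) approximated by a forward difference
def dir_deriv(V, x, d, t=1e-8):
    return (V(x + t * d) - V(x)) / t

V1 = lambda x: abs(x)     # directionally stationary at 0: every direction is non-descent
V2 = lambda x: -abs(x)    # Clarke stationary at 0, yet every d != 0 is a descent direction
print([round(dir_deriv(V1, 0.0, d), 6) for d in (1.0, -1.0)])  # -> [1.0, 1.0]
print([round(dir_deriv(V2, 0.0, d), 6) for d in (1.0, -1.0)])  # -> [-1.0, -1.0]
```

This is the sense in which directional stationarity is the sharper, and therefore harder to compute, concept.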
Computational Tools
– Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
– A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: the majorization-minimization algorithm;
  – dealing with non-smoothness: nonsmooth Newton algorithms;
  – dealing with pathological constraints: (exact) penalty methods.
– An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
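A minimal majorization-minimization (DCA-style) sketch for a one-dimensional dc function, f(x) = x² − 2|x| = g(x) − h(x) with g(x) = x² and h(x) = 2|x| both convex (the example and iteration count are mine, chosen for illustration): each step majorizes f by linearizing the concave part −h at the current iterate and minimizes the convex surrogate in closed form.

```python
import numpy as np

def dca(x0, iters=20):
    # minimize f(x) = x**2 - 2*abs(x) by successive convex surrogation
    x = x0
    for _ in range(iters):
        s = 2.0 * np.sign(x)        # a subgradient of h(x) = 2|x| at the iterate
        # surrogate: x**2 - [h(x_k) + s*(x - x_k)] >= f(x); its minimizer is x = s/2
        x = s / 2.0
    return x

f = lambda x: x**2 - 2*abs(x)
x_star = dca(0.3)
print(x_star, f(x_star))   # -> 1.0 -1.0 (a directional stationary point; here a global min)
```

Each surrogate touches f at the current iterate and lies above it everywhere, so the objective values are monotonically nonincreasing along the iterates; this is the basic mechanism behind the surrogation algorithms of Chapter 7 of the monograph.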
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• piecewise smooth*, e.g., piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• subanalytic and semianalytic;
* and many more.
[Diagram (Section 4.4 of the monograph, p. 231): inclusions among classes of nonsmooth functions — PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable ⊂ locally Lipschitz + directionally differentiable — together with the PQ/PLQ/PA, dc, dc-composite, diff-max-convex, icc, strongly weakly convex, and locally dc classes, related under various conditions (+C¹, lower-C², open or closed convex domains).]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)
Some General Thoughts
addressing the negative impact of the field of machine learning on the optimization field, and raising an alarm
• Convexity and/or differentiability have been the safe haven for optimization and its applications for decades, and have been particularly popular in machine learning in recent years,
to the extent that
• modellers are routinely sacrificing realism for convenience;
• algorithm designers will do whatever it takes to stay in this comfort zone, even at the expense of a potentially wrong approach;
• practitioners are content with the resulting models, algorithms, and "solutions";
resulting in
• tools being abused,
• advances in science being stalled, and
• a future being built on soft ground.
This talk: benefits? Yes or no?
Ready for the "non"-era? Yes or no?
A little humor before the storm
A readiness test:

A. Are you satisfied with the comfort zone of convexity and differentiability?
• Yes → You will not learn from this talk. Live happily ever after.
• No → B. How to deal with non-convexity/non-smoothness?
  C1. Ignore these properties, or give them minimal or even wrong treatment.
  C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
  C3. Faithfully treat these non-properties, by either:
    D1. Employing heuristics, potentially leading to heuristics on heuristics when complex models are being formulated; or
    D2. The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features

Non-Smooth + Non-Convex + Non-Deterministic = "Non"-Problems

also Non-Separable, Non-Stylistic, Non-Interbreeding, and many more "non"-topics · · ·
Examples of "Non"-Problems
I. Structured Learning

Piecewise Affine Regression
extending classical affine regression:

$$\underset{a,\alpha,b,\beta}{\operatorname{minimize}}\;\frac{1}{N}\sum_{s=1}^{N}\Bigg[\,y_s-\underbrace{\bigg(\max_{1\le i\le k_1}\big\{(a^i)^\top x_s+\alpha_i\big\}-\max_{1\le i\le k_2}\big\{(b^i)^\top x_s+\beta_i\big\}\bigg)}_{\text{capturing all piecewise affine functions}}\Bigg]^2$$

illustrating coupled non-convexity and non-differentiability; estimates change points/hyperplanes in linear regression.
Comments
• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming-based algorithm.
• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\Big)-\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\Big)$$

where all the functions are convex.
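As a concrete illustration (a minimal NumPy sketch of ours, not from the talk; the names `pa_model` and `pa_loss` are hypothetical), the difference-of-max model and its nonsmooth least-squares loss can be written directly:

```python
import numpy as np

def pa_model(X, A, alpha, B, beta):
    # difference of two max-affine functions:
    #   max_i { a_i . x + alpha_i }  -  max_i { b_i . x + beta_i };
    # every piecewise affine function on R^n can be written in this form
    return np.max(X @ A.T + alpha, axis=1) - np.max(X @ B.T + beta, axis=1)

def pa_loss(X, y, A, alpha, B, beta):
    # the nonsmooth, nonconvex least-squares objective of the regression problem
    r = y - pa_model(X, A, alpha, B, beta)
    return float(np.mean(r ** 2))
```

For example, |x| = max(x, −x) − max(0·x) is recovered with A = ((1), (−1)), α = 0, B = ((0)), β = 0, and the loss vanishes on data generated by |x|.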
Logical Sparsity Learning

$$\underset{\beta\in\mathbb{R}^d}{\operatorname{minimize}}\;f(\beta)\quad\text{subject to}\;\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\le b_i,\quad i=1,\cdots,m,$$

where $|t|_0\triangleq\begin{cases}1&\text{if }t\neq 0\\[2pt] 0&\text{otherwise.}\end{cases}$

Such constraints model, e.g.:
• strong/weak hierarchy, e.g., $|\beta_j|_0\le|\beta_k|_0$;
• cost-based variable selection: coefficients ≥ 0 but not all equal;
• hierarchical/group variable selection.
Comments
• The ℓ0 function is discrete in nature; an affine sparsity constraint (ASC) can be formulated either via complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind

$$\underset{\beta\in\mathbb{R}^d}{\operatorname{minimize}}\;f(\beta)+\gamma\sum_{i=1}^{m}\Bigg[\,b_i-\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\Bigg]_+$$

for a given γ > 0.
• When the matrix A has negative entries, the resulting ASC set may not be closed, and its optimization formulation may not have a lower semi-continuous objective.
• When |·|₀ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
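For intuition (a small NumPy sketch of ours, not from the slides), the ℓ0 function and an ASC feasibility check can be coded directly; a hierarchy constraint |β₁|₀ ≤ |β₂|₀ corresponds to the row a = (1, −1) with b = 0:

```python
import numpy as np

def l0(beta):
    # |t|_0 applied componentwise: 1 where t != 0, else 0
    return (np.asarray(beta) != 0).astype(int)

def asc_feasible(beta, A, b):
    # checks  sum_j A[i, j] * |beta_j|_0 <= b[i]  for every row i
    return bool(np.all(A @ l0(beta) <= np.asarray(b)))

# hierarchy |beta_1|_0 <= |beta_2|_0  <=>  (1, -1) . |beta|_0 <= 0
A = np.array([[1, -1]])
b = np.array([0])
```

Note the discrete nature of the constraint: feasibility can flip when a coefficient moves to exactly zero, which is precisely why the ASC set need not be closed in general.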
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:

$$\underset{\theta=(W^\ell,w^\ell)_{\ell=1}^{L}}{\operatorname{minimize}}\;\frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x_s)\big]^2$$

where $f_\theta$ is the L-fold composite function with the nonsmooth max (ReLU) activation:

$$f_\theta(x)=\underbrace{\big(W^L\bullet+w^L\big)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\bullet+w^2\big)}_{\text{second layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{input layer}}$$

i.e., linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input).
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:

$$\underset{z\in Z,\,u}{\operatorname{minimize}}\;\zeta_\rho(z,u)\triangleq\frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^{s,L}\big)+\rho\sum_{s=1}^{S}\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Big|\,u^{s,\ell}_j-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• It leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
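To make the L-fold composition concrete, here is a minimal NumPy sketch of ours of the feed-forward map f_θ (ReLU hidden layers, affine output layer) and the empirical squared loss; the function names are hypothetical:

```python
import numpy as np

def f_theta(x, Ws, ws):
    # L-fold composition: max(0, W u + w) on the hidden layers, affine output layer
    u = np.asarray(x, dtype=float)
    for W, w in zip(Ws[:-1], ws[:-1]):
        u = np.maximum(0.0, W @ u + w)   # nonsmooth max (ReLU) activation
    return Ws[-1] @ u + ws[-1]           # linear output layer

def empirical_loss(X, y, Ws, ws):
    # (1/S) sum_s [y_s - f_theta(x_s)]^2: nonconvex and nondifferentiable in theta
    return float(np.mean([(ys - f_theta(xs, Ws, ws)) ** 2 for xs, ys in zip(X, y)]))
```

The nonsmoothness is visible already in one dimension: with weights (1, 2) and biases (0, 1), the map is x ↦ 2·max(0, x) + 1, which has a kink at x = 0.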
Examples of "Non"-Problems
II. Smart Planning

Power Systems Planning
• ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity
• x: production from traditional thermal plants
• y: production from backup fast-ramping generators

$$\underset{x\in X}{\operatorname{minimize}}\;\varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big]$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2)\triangleq\underset{y}{\operatorname{minimum}}\;y^\top c\big(\,[\,\omega_2-x-\omega_1\,]_+\big)\quad\text{subject to}\;A(\omega)x+Dy\ge\xi(\omega).$$

The unit cost of the fast-ramping generators depends on:
• the observed renewable production & demand of electricity;
• the shortfall due to the first-stage thermal power decisions.
Comments
• This leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x}{\operatorname{minimize}}\;\zeta(x)\triangleq\varphi(x)+\mathbb{E}_\xi\big[\psi(x,\xi)\big]\quad\text{subject to}\;x\in X\subseteq\mathbb{R}^{n_1},$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP) below, with Q symmetric positive semidefinite:

$$\psi(x,\xi)\triangleq\underset{y}{\operatorname{minimum}}\;\big[f(\xi)+G(\xi)x\big]^\top y+\tfrac{1}{2}\,y^\top Qy\quad\text{subject to}\;y\in Y(x,\xi)\triangleq\big\{\,y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\,\big\}.$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• We developed a regularization + convexification + sampling combined method for computing a generalized critical point, with an almost-sure convergence guarantee.
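For a single scenario ξ, the recourse value ψ(x, ξ) is just a convex QP in y. A NumPy-only sketch of ours for the special case D = I (so the feasible set is a lower bound y ≥ b(ξ) − A(ξ)x), solved by projected gradient; this is an illustration, not the regularization/sampling method of the talk:

```python
import numpy as np

def recourse_value(x, f_xi, G_xi, Q, A_xi, b_xi, steps=500, lr=0.1):
    # psi(x, xi) when D = I:  min_y [f(xi) + G(xi) x]'y + 0.5 y'Qy
    #                         s.t.  y >= b(xi) - A(xi) x
    # Q symmetric PSD => a convex QP; projected gradient converges for small lr
    c = f_xi + G_xi @ x
    lo = b_xi - A_xi @ x                    # componentwise lower bound on y
    y = np.maximum(np.zeros_like(c), lo)    # feasible starting point
    for _ in range(steps):
        y = np.maximum(y - lr * (c + Q @ y), lo)   # gradient step + projection
    return float(c @ y + 0.5 * y @ Q @ y)
```

For instance, with c = 0, Q = 2 and the bound y ≥ 1, the minimizer is y = 1 and ψ = 1.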
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize $c^\top x$ subject to $Ax=b$ and $x\ge 0$.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

$$\underset{x\ge 0}{\operatorname{minimize}}\;(c+\delta_c)^\top x+\rho\,\underbrace{\big\|\,\big[(A+\delta_A)x-b-\delta_b\big]_+\big\|_\infty}_{\text{constraint residual}}$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\underset{\delta\in\Delta}{\operatorname{maximum}}\;\lambda_1\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}}+\lambda_2\underbrace{\big\|\,\big[(A+\delta_A)x-b-\delta_b\big]_+\big\|_\infty}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.
• These offer alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
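The defender's penalized objective is easy to state in code; a small NumPy sketch of ours (the function name is hypothetical):

```python
import numpy as np

def defender_objective(x, c, A, b, dc, dA, db, rho):
    # (c + dc)'x + rho * || [ (A + dA) x - b - db ]_+ ||_inf :
    # the second term penalizes the residual of the attacked constraints
    residual = np.maximum((A + dA) @ x - b - db, 0.0)
    return float((c + dc) @ x + rho * np.max(np.abs(residual)))
```

Note the coupled nonsmoothness: both the plus operation and the infinity norm are piecewise linear, so the objective is nonsmooth in x and in the attack δ = (δc, δA, δb).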
Robustification of nonconvex problems

A general nonlinear program:

$$\underset{x\in S}{\operatorname{minimize}}\;\theta(x)+f(x)\quad\text{subject to}\;H(x)=0,$$

where
• S is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in $\mathbb{R}^n$ containing S;
• H: O → $\mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\underset{x\in S}{\operatorname{minimize}}\;\theta(x)+\underset{\delta\in\Delta}{\operatorname{maximum}}\;f(x,\delta)\quad\text{subject to}\;\underbrace{H(x,\delta)=0\;\;\forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}$$
New formulation
via a pessimistic value-function minimization: given γ > 0,

$$\underset{x\in S}{\operatorname{minimize}}\;\theta(x)+v_\gamma(x)$$

where $v_\gamma(x)$ models constraint violation with the cost of model attacks:

$$v_\gamma(x)\triangleq\underset{\delta\in\Delta}{\operatorname{maximum}}\;\varphi_\gamma(x,\delta)\triangleq f(x,\delta)+\gamma\underbrace{\|H(x,\delta)\|_\infty}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:
• a combined perturbation δ in a joint uncertainty set Δ ⊆ $\mathbb{R}^L$;
• robustified constraint satisfaction via penalization.
Comments
• We related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.
• We showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• We introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

$$\underset{x\in S}{\operatorname{minimize}}\;\varphi(x)\triangleq\theta(x)+v(x)\quad\text{with}\quad v(x)\triangleq\max_{1\le j\le J}\;\underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\operatorname{maximum}}\;\big[\,g_j(x,\delta)-h_j(x)^\top\delta\,\big].$$

• We implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
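When the uncertainty set is approximated by finitely many attack scenarios, the pessimistic value function becomes a finite max; a minimal sketch of ours (scenario-based approximation is our simplification, not the explicit form derived in the work):

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    # pessimistic value function over a finite attack set:
    #   max_delta  f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)
    return max(f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in deltas)
```

Even when f(•, δ) and H(•, δ) are smooth in x, v_γ is a pointwise max and hence nonsmooth, which is exactly where the directional-stationarity machinery enters.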
Examples of "Non"-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control the left and right tails of the distribution simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}_\alpha^\beta(Z)\triangleq\frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)}z\,dF_Z(z)=\frac{1}{\beta-\alpha}\Big[(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\Big]$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$\text{bPOE}(Z,\tau)\triangleq 1-\min\big\{\alpha\in[0,1]:\text{CVaR}_\alpha(Z)\ge\tau\big\},\qquad\text{bPOD}(Z,u)\triangleq\min\big\{\beta\in[0,1]:\text{In-CVaR}_0^\beta(Z)\ge u\big\}$$
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

$$\mathbb{P}(u\ge Z>\tau)\le\text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\triangleq\text{bPOI}_\tau^u(Z)$$

Two risk minimization problems of a composite random functional:

$$\underset{\theta\in\Theta}{\operatorname{minimize}}\;\underbrace{\text{In-CVaR}_\alpha^\beta(Z(\theta))}_{\text{a difference of two convex SP value functions}}\qquad\text{and}\qquad\underset{\theta\in\Theta}{\operatorname{minimize}}\;\underbrace{\text{bPOI}_\tau^u(Z(\theta))}_{\substack{\text{a stochastic program with}\\\text{a difference-of-convex objective}}}$$

where

$$Z(\theta)\triangleq c\big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)-h(\bullet,z)\text{ is difference-of-convex}}\big)$$

with c: R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
Comments
• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\underset{\eta_1\in\Upsilon_1}{\operatorname{minimum}}\;\mathbb{E}_\omega\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}-\underbrace{\underset{\eta_2\in\Upsilon_2}{\operatorname{minimum}}\;\mathbb{E}_\omega\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}$$

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• We developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
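The building blocks are easy to compute from samples via the standard Rockafellar-Uryasev formula CVaR_α(Z) = min_η {η + E[Z − η]_+/(1 − α)}; a minimal NumPy sketch of ours (the empirical-sample treatment is our illustration, not the sampling method of the talk):

```python
import numpy as np

def cvar(z, alpha):
    # Rockafellar-Uryasev: CVaR_a(Z) = min_eta  eta + E[Z - eta]_+ / (1 - a);
    # for an empirical distribution a minimizer can be taken at a sample point
    z = np.sort(np.asarray(z, dtype=float))
    vals = z + np.mean(np.maximum(z[None, :] - z[:, None], 0.0), axis=1) / (1.0 - alpha)
    return float(vals.min())

def in_cvar(z, alpha, beta):
    # In-CVaR_a^b(Z) = [ (1-a) CVaR_a(Z) - (1-b) CVaR_b(Z) ] / (b - a)
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)
```

For the samples (1, 2, 3, 4), CVaR at α = 0.5 is 3.5 (the mean of the worst half), and In-CVaR over (0, 0.5] is 1.5 (the mean of the best half), exhibiting the two-sided tail control.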
A touch of mathematics: bridging computations with theory
First of all: what can be computed?

[Diagram: relationship between the stationary points — critical points (for dc functions), Clarke stationary, limiting stationary, directional stationary (for dd functions), SONC, local minima, SOSC.]
• Stationarity:
– smooth case: ∇V(x) = 0;
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of the functions is continuously differentiable.
• Directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder it is to compute.
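A one-line numerical check of ours illustrates the gap between the concepts: V(x) = −|x| is Clarke stationary at 0 (since 0 ∈ ∂V(0) = [−1, 1]) yet not directionally stationary, because V′(0; d) = −|d| < 0 for every d ≠ 0. Approximating the one-sided directional derivative by a small forward step:

```python
def ddir(V, x, d, t=1e-6):
    # finite-difference approximation of the one-sided directional derivative V'(x; d)
    return (V(x + t * d) - V(x)) / t

V = lambda x: -abs(x)   # Clarke stationary at x = 0, but not d-stationary
```

Every direction is a descent direction for −|x| at 0, whereas for |x| no direction is — matching "the sharper the concept, the harder to compute."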
Computational Tools
• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
• A fusion of first-order and second-order algorithms:
– dealing with non-convexity: the majorization-minimization algorithm;
– dealing with non-smoothness: the nonsmooth Newton algorithm;
– dealing with pathological constraints: the (exact) penalty method.
• An integration of sampling and deterministic methods:
– dealing with population optimization in statistics;
– dealing with stochastic programming in operations research;
– dealing with risk minimization in financial management and tail control.
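As a toy illustration of surrogation (our own example, not from the monograph): for f(x) = x² − h(x) with h convex, linearizing h at the current iterate yields a convex upper surrogate whose minimizer has a closed form. With h(x) = 2|x|, the iteration reaches the d-stationary point x = 1 from x₀ = 0.5:

```python
import numpy as np

def dc_surrogation(dh, x0, iters=20):
    # majorization-minimization for f(x) = x**2 - h(x), h convex with subgradient dh:
    # minimizing the surrogate  x**2 - h(x_k) - dh(x_k) * (x - x_k)
    # gives the next iterate  x = dh(x_k) / 2
    x = x0
    for _ in range(iters):
        x = dh(x) / 2.0
    return x

x_star = dc_surrogation(lambda x: 2.0 * np.sign(x), 0.5)   # a subgradient of 2|x|
```

Each surrogate sits above f and touches it at the current iterate, so the objective values descend; the limit is a fixed point, i.e., a stationary point of the dc function.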
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• piecewise smooth*, e.g., piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• subanalytic and semianalytic.
* and many more
[Diagram (Section 4.4 of the monograph): inclusions among the classes of nonsmooth functions — PC², PC¹, piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), piecewise affine (PA), dc and dc composite, diff-max-convex, implicit convex-concave (icc), semismooth, B-differentiable, strongly weakly convex/concave, locally dc, lower-C², C¹ and LC¹, and locally Lipschitz + directionally differentiable functions.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̃c 683
Bibliography 691
Index 721
• Convexity and/or differentiability have been the safe haven for optimization and its applications for decades, and particularly popular in machine learning in recent years,
to the extent that
• modellers are routinely sacrificing realism for convenience;
• algorithm designers will do whatever it takes to stay in this comfort zone, even at the expense of a potentially wrong approach;
• practitioners are content with the resulting models, algorithms, and "solutions";
resulting in
• tools being abused,
• advances in science being stalled, and
• the future being built on soft ground.
This talk: benefits, yes or no?
Ready for the "non"-era, yes or no?
A little humor before the storm.
A. Are you satisfied with the comfort zone of convexity and differentiability?
• Yes → You will not learn from this talk. Live happily ever after.
• No → B.
B. How to deal with non-convexity/non-smoothness?
• C1. Ignore these properties, or give them minimal or even wrong treatment.
• C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
• C3. Faithfully treat these non-properties, by either:
– D1. Employing heuristics, potentially leading to heuristics on heuristics when complex models are being formulated; or
– D2. The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features
Non-Smooth + Non-Convex + Non-Deterministic
Non-Problems
Non-Separable, Non-Stylistic, Non-Interbreeding,
and many more non-topics · · ·
Examples of Non-Problems
I. Structured Learning
Piecewise Affine Regression
extending classical affine regression
$$\underset{a,\alpha,b,\beta}{\text{minimize}}\;\;\frac{1}{N}\sum_{s=1}^{N}\Bigg(\,y_s-\underbrace{\bigg[\max_{1\le i\le k_1}\Big((a^i)^{\top}x_s+\alpha_i\Big)-\max_{1\le i\le k_2}\Big((b^i)^{\top}x_s+\beta_i\Big)\bigg]}_{\text{capturing all piecewise affine functions}}\,\Bigg)^{2}$$
illustrating coupled non-convexity and non-differentiability
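As a minimal numerical sketch of the objective above (synthetic data; names such as `pa_loss` are illustrative, not from the talk):

```python
import numpy as np

def pa_loss(a, alpha, b, beta, X, y):
    """Empirical squared loss of the piecewise affine model
    max_i((a^i)^T x + alpha_i) - max_i((b^i)^T x + beta_i)."""
    pred = np.max(X @ a.T + alpha, axis=1) - np.max(X @ b.T + beta, axis=1)
    return float(np.mean((y - pred) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
a_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # k1 = 2 "positive" pieces
b0, beta0 = np.zeros((1, 3)), np.zeros(1)              # k2 = 1 trivial piece
y = np.max(X @ a_true.T, axis=1)                       # data from the true model
loss_at_truth = pa_loss(a_true, np.zeros(2), b0, beta0, X, y)  # 0.0 at the truth
```

The inner max terms make the residual piecewise affine in the parameters, which is exactly where the coupled non-convexity and non-differentiability enter.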
Estimate Change Points/Hyperplanes in Linear Regression
Comments
• This is a convex quadratic composed with a piecewise affine function, and is thus piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\Big)\;-\;\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\Big)$$

where all functions are convex.
Logical Sparsity Learning
$$\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\;f(\beta)\quad\text{subject to}\;\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\le b_i,\quad i=1,\cdots,m$$

where $|t|_0\triangleq\begin{cases}1&\text{if }t\neq 0\\ 0&\text{otherwise}\end{cases}$

• strong/weak hierarchy, e.g., $|\beta_j|_0\le|\beta_k|_0$
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection
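Such affine sparsity constraints (ASCs) are easy to state computationally; a pure-Python sketch (all names illustrative) of the $|\cdot|_0$ function and an ASC feasibility check, using the strong-hierarchy constraint $|\beta_j|_0\le|\beta_k|_0$ as the example row:

```python
def l0(t, tol=0.0):
    """| t |_0 : 1 if t != 0, else 0 (tol allows a numerical zero threshold)."""
    return 1 if abs(t) > tol else 0

def asc_feasible(A, b, beta):
    """Check the affine sparsity constraints  sum_j a_ij |beta_j|_0 <= b_i."""
    z = [l0(bj) for bj in beta]
    return all(sum(aij * zj for aij, zj in zip(row, z)) <= bi
               for row, bi in zip(A, b))

# hierarchy |beta_1|_0 <= |beta_2|_0 encoded as the row a = (1, -1), b = 0:
A, b = [[1, -1]], [0]
asc_feasible(A, b, [0.0, 2.5])   # True: beta_1 inactive while beta_2 is active
asc_feasible(A, b, [1.0, 0.0])   # False: beta_1 active while beta_2 is not
```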
Comments
• The ℓ0 function is discrete in nature; an ASC can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind

$$\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\;f(\beta)+\gamma\sum_{i=1}^{m}\Bigg[\,b_i-\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\Bigg]_+$$

for a given γ > 0.
• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When $|\bullet|_0$ is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:

$$\underset{\theta=(W^\ell,w^\ell)_{\ell=1}^{L}}{\text{minimize}}\;\frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x_s)\,\big]^2$$

where $f_\theta$ is the L-fold composite function with the nonsmooth max (ReLU) activation:

$$f_\theta(x)=\underbrace{\big(W^L\bullet+w^L\big)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\bullet+w^2\big)}_{\text{second layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{input layer}}$$

output ← linear ← · · · ← max ← linear ← max ← linear ← input
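The L-fold composition above can be written directly; a minimal sketch with hypothetical weights (one hidden layer, so L = 2):

```python
import numpy as np

def f_theta(x, weights, biases):
    """L-fold composition: ReLU hidden layers followed by an affine output layer."""
    u = x
    for W, w in zip(weights[:-1], biases[:-1]):
        u = np.maximum(0.0, W @ u + w)       # max(0, W^l u + w^l): the ReLU layer
    return weights[-1] @ u + biases[-1]      # output layer W^L u + w^L is affine

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]   # L = 2 layers
ws = [rng.normal(size=4), rng.normal(size=1)]
xs, ys = rng.normal(size=(10, 3)), rng.normal(size=10)
# the empirical squared loss (1/S) sum_s [y_s - f_theta(x_s)]^2
mse = float(np.mean([(ys[s] - f_theta(xs[s], Ws, ws)[0]) ** 2 for s in range(10)]))
```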
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:

$$\underset{z\in Z,\;u}{\text{minimize}}\;\zeta_\rho(z,u)\triangleq\frac{1}{S}\sum_{s=1}^{S}\varphi_s(u^{s,L})+\rho\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Big|\,u^{s,\ell}_{j}-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• It leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
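To see why the penalty is "exact" at feasibility, note that the penalty term vanishes exactly when each auxiliary variable $u^{s,\ell}$ equals the true layer output. A sketch (assuming, for illustration only, that $\psi_{\ell j}$ is the ReLU layer map; one layer, one sample, names hypothetical):

```python
import numpy as np

def layer(W, w, u_prev):
    # psi_l(z^l, u^{l-1}) for a ReLU layer with parameters z^l = (W, w)
    return np.maximum(0.0, W @ u_prev + w)

def l1_penalty(weights, biases, x, u):
    """The rho-free penalty  sum_l sum_j | u^l_j - psi_lj(z^l, u^{l-1}) |."""
    total, u_prev = 0.0, x
    for W, w, u_l in zip(weights, biases, u):
        total += float(np.sum(np.abs(u_l - layer(W, w, u_prev))))
        u_prev = u_l
    return total

rng = np.random.default_rng(2)
W1, w1 = rng.normal(size=(4, 3)), rng.normal(size=4)
x = rng.normal(size=3)
u_exact = [layer(W1, w1, x)]                # u equals the true layer output
pen = l1_penalty([W1], [w1], x, u_exact)    # 0.0: the layer equations hold exactly
```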
Examples of Non-Problems
II. Smart Planning
Power Systems Planning
ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$$\underset{x\in X}{\text{minimize}}\;\varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big]$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2)\triangleq\underset{y}{\text{minimum}}\;y^{\top}c\big([\,\omega_2-x-\omega_1\,]_+\big)\quad\text{subject to}\;A(\omega)x+Dy\ge\xi(\omega)$$

The unit cost of the fast-ramping generators depends on
– the observed renewable production and demand of electricity;
– the shortfall due to the first-stage thermal power decisions.
Comments
• This leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x}{\text{minimize}}\;\zeta(x)\triangleq\varphi(x)+\mathbb{E}_{\xi}\big[\psi(x,\xi)\big]\quad\text{subject to}\;x\in X\subseteq\mathbb{R}^{n_1}$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

$$\psi(x,\xi)\triangleq\underset{y}{\text{minimum}}\;\big[f(\xi)+G(\xi)x\big]^{\top}y+\tfrac{1}{2}\,y^{\top}Qy\quad\text{subject to}\;y\in Y(x,\xi)\triangleq\big\{\,y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\,\big\}$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a regularization-convexification sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
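As a drastically simplified sketch of the recourse value (assuming, for illustration only, an unconstrained recourse $Y(x,\xi)=\mathbb{R}^{n_2}$ and Q positive definite, so the minimizer is $y^\*=-Q^{-1}(f+Gx)$ and $\psi$ is concave in x):

```python
import numpy as np

def recourse(x, f, G, Q):
    """psi(x, xi) = min_y (f + G x)^T y + 0.5 y^T Q y  over y in R^{n2},
    evaluated via the closed-form minimizer y* = -Q^{-1}(f + G x)."""
    r = f + G @ x
    y = -np.linalg.solve(Q, r)
    return float(r @ y + 0.5 * y @ Q @ y)   # equals -0.5 r^T Q^{-1} r <= 0

Q = np.array([[2.0, 0.0], [0.0, 1.0]])      # symmetric positive definite
G = np.eye(2)
f = np.zeros(2)
recourse(np.array([2.0, 0.0]), f, G, Q)     # -0.5 * 2^2 / 2 = -1.0
```

The quadratic-in-x (hence nonconvex-in-x) shape of this value function is the implicit convex-concave structure the comment refers to.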
(Simplified) Defender-Attacker Game with Costs
The underlying problem is: minimize $c^{\top}x$ subject to $Ax=b$ and $x\ge 0$.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving the problem

$$\underset{x\ge 0}{\text{minimize}}\;(c+\delta_c)^{\top}x+\rho\,\underbrace{\Big\|\big[\,(A+\delta_A)x-b-\delta_b\,\big]_+\Big\|_{\infty}}_{\text{constraint residual}}$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving the problem

$$\underset{\delta\in\Delta}{\text{maximum}}\;\lambda_1\,\underbrace{(c+\delta_c)^{\top}x}_{\text{disrupt objective}}+\lambda_2\,\underbrace{\Big\|\big[\,(A+\delta_A)x-b-\delta_b\,\big]_+\Big\|_{\infty}}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.
• These offer alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
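The defender's penalized objective is straightforward to evaluate; a minimal sketch with toy data (all names and numbers hypothetical):

```python
import numpy as np

def defender_obj(x, c, A, b, dc, dA, db, rho):
    """(c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf :
    perturbed cost plus the penalized constraint residual."""
    res = (A + dA) @ x - b - db
    return float((c + dc) @ x + rho * np.max(np.maximum(res, 0.0)))

c = np.array([1.0, 1.0]); A = np.array([[1.0, 1.0]]); b = np.array([1.0])
x = np.array([0.5, 0.5])                    # feasible for the unperturbed A x = b
zero_attack = defender_obj(x, c, A, b, 0 * c, 0 * A, 0 * b, rho=10.0)  # 1.0
dA = np.array([[1.0, 0.0]])                 # attack inflates one row of A
attacked = defender_obj(x, c, A, b, 0 * c, dA, 0 * b, rho=10.0)        # 1 + 10*0.5
```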
Robustification of nonconvex problems
A general nonlinear program:

$$\underset{x\in S}{\text{minimize}}\;\theta(x)+f(x)\quad\text{subject to}\;H(x)=0$$

where
• S is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in $\mathbb{R}^n$ containing S;
• H: O → $\mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\underset{x\in S}{\text{minimize}}\;\theta(x)+\underset{\delta\in\Delta}{\text{maximum}}\;f(x,\delta)\quad\text{subject to}\;\underbrace{H(x,\delta)=0\;\;\forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}$$
New formulation
via a pessimistic value-function minimization: given γ > 0,

$$\underset{x\in S}{\text{minimize}}\;\theta(x)+v_\gamma(x)$$

where $v_\gamma(x)$ models constraint violation with the cost of model attacks:

$$v_\gamma(x)\triangleq\underset{\delta\in\Delta}{\text{maximum}}\;\varphi_\gamma(x,\delta)\triangleq f(x,\delta)+\gamma\,\underbrace{\|H(x,\delta)\|_{\infty}}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:
• a combined perturbation δ in a joint uncertainty set Δ ⊆ $\mathbb{R}^L$;
• robustified constraint satisfaction via penalization.
Comments
• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

$$\underset{x\in S}{\text{minimize}}\;\varphi(x)\triangleq\theta(x)+v(x)\quad\text{with}\quad v(x)\triangleq\max_{1\le j\le J}\;\underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\text{maximum}}\;\big[\,g_j(x,\delta)-h_j(x)^{\top}\delta\,\big]$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III. Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control left- and right-tail distributions simultaneously.

The interval conditional value-at-risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}^{\beta}_{\alpha}(Z)\triangleq\frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)}z\,dF_Z(z)=\frac{1}{\beta-\alpha}\Big[(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\Big]$$

The buffered probabilities of exceedance/deceedance at level τ > 0:

$$\text{bPOE}(Z,\tau)\triangleq 1-\min\big\{\alpha\in[0,1]:\text{CVaR}_\alpha(Z)\ge\tau\big\},\qquad\text{bPOD}(Z,u)\triangleq\min\big\{\beta\in[0,1]:\text{In-CVaR}^{\beta}_{0}(Z)\ge u\big\}$$
Buffered probability of the interval (τ, u]
We have the following bound for an interval probability:

$$\mathbb{P}(u\ge Z>\tau)\;\le\;\text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\;\triangleq\;\text{bPOI}^{u}_{\tau}(Z)$$

Two risk minimization problems of a composite random functional:

$$\underset{\theta\in\Theta}{\text{minimize}}\;\underbrace{\text{In-CVaR}^{\beta}_{\alpha}\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}}\qquad\text{and}\qquad\underset{\theta\in\Theta}{\text{minimize}}\;\underbrace{\text{bPOI}^{u}_{\tau}\big(Z(\theta)\big)}_{\text{a stochastic program with a difference-of-convex objective}}$$

where

$$Z(\theta)\triangleq c\Big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)-h(\bullet,z)\text{ is difference-of-convex}}\Big)$$

with c: ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
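On a finite sample, the In-CVaR identity above can be checked directly: a minimal sketch using empirical superquantiles (the sample CVaR here is the plain average of the worst (1−α) fraction; names illustrative):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the worst (1 - alpha) fraction of z."""
    z = np.sort(z)
    k = int(np.ceil((1 - alpha) * len(z)))
    return float(z[-k:].mean())

def in_cvar(z, alpha, beta):
    """In-CVaR^beta_alpha via the identity
    ((1-alpha) CVaR_alpha - (1-beta) CVaR_beta) / (beta - alpha)."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 101.0)      # samples 1, 2, ..., 100
cvar(z, 0.90)                  # mean of the worst 10 values: 95.5
in_cvar(z, 0.80, 0.90)         # upper tail trimmed of its top 10%: 85.5
```

Here `in_cvar(z, 0.80, 0.90)` equals the average of the samples 81 through 90, i.e. the tail average with the extreme top 10% removed, which is exactly the outlier-robust behavior the talk exploits.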
Comments
• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ of

$$\underbrace{\underset{\eta_1\in\Upsilon_1}{\text{minimum}}\;\mathbb{E}_{\omega}\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}\;-\;\underbrace{\underset{\eta_2\in\Upsilon_2}{\text{minimum}}\;\mathbb{E}_{\omega}\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}$$

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory
First of All, What Can be Computed?
Relationship between the stationary points:
[Diagram: relations among stationarity concepts, comprising critical points (for dc functions), Clarke stationary points, limiting stationary points, and directional stationary points (for dd functions), together with second-order necessary conditions (SONC), local minimizers (loc-min), and second-order sufficient conditions (SOSC).]
▸ Stationarity:
– smooth case: ∇V(x) = 0;
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable.
▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder to compute.
Computational Tools
▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
▸ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: majorization-minimization algorithms;
– dealing with non-smoothness: nonsmooth Newton algorithms;
– dealing with pathological constraints: (exact) penalty methods.
▸ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics;
– dealing with stochastic programming in operations research;
– dealing with risk minimization in financial management and tail control.
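The majorization-minimization idea can be sketched on a toy difference-of-convex function (a sketch only, with illustrative names: linearize the concave part −h at the current iterate to get a convex upper surrogate, then minimize the surrogate exactly):

```python
def dca(x0, subgrad_h, argmin_surrogate, iters=50):
    """MM for f = g - h with g, h convex: at x_k, majorize f by
    g(x) - h(x_k) - s_k (x - x_k), s_k in ∂h(x_k), and minimize it."""
    x = x0
    for _ in range(iters):
        s = subgrad_h(x)            # pick s_k in ∂h(x_k)
        x = argmin_surrogate(s)     # argmin_x g(x) - s_k * x
    return x

# toy dc function f(x) = 0.5 x^2 - |x - 1|
subgrad_h = lambda x: 1.0 if x > 1 else -1.0    # a subgradient of h(x) = |x - 1|
argmin_surrogate = lambda s: s                  # argmin_x 0.5 x^2 - s x  is  x = s
x_star = dca(0.0, subgrad_h, argmin_surrogate)  # -> -1.0, a critical point of f
```

Each iteration decreases f (the surrogate touches f at x_k and lies above it), which is the monotone-descent mechanism behind the surrogation algorithms of Chapter 7.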
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• piecewise smooth*, e.g., piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• subanalytic and semianalytic;
* and many more.
[Diagram reproduced from the monograph, Section 4.4 "Classes of Nonsmooth Functions" (p. 231): inclusions among function classes, including piecewise affine (PA), piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), PC1 and PC2 piecewise smooth, difference-of-convex (dc), dc composite, diff-max-convex, implicit convex-concave (icc), weakly convex, locally dc, semismooth, B-differentiable, and locally Lipschitz + directionally differentiable functions.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to the publisher on Friday, September 11, 2020.
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683
Bibliography 691
Index 721
• Convexity and/or differentiability have been the safe haven for optimization and its applications for decades, and particularly popular in machine learning in recent years,
to the extent that
• modelers are routinely sacrificing realism for convenience,
• algorithm designers will do whatever it takes to stay in this comfort zone, even at the expense of a potentially wrong approach, and
• practitioners are content with the resulting models, algorithms, and "solutions",
resulting in
• tools being abused,
• advances in science being stalled, and
• the future being built on soft ground.
This talk: benefits, yes or no?
Ready for the "non"-era, yes or no?
A little humor before the storm.
A. Are you satisfied with the comfort zone of convexity and differentiability?
• Yes → you will not learn from this talk; live happily ever after.
• No → B. How to deal with non-convexity/non-smoothness?
  – C1. Ignore these properties, or give them minimal (or even wrong) treatment.
  – C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
  – C3. Faithfully treat these non-properties, by either:
    ∗ D1. employing heuristics, potentially leading to heuristics on heuristics when complex models are being formulated; or
    ∗ D2. the hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features
Non-Problems: Non-Smooth + Non-Convex + Non-Deterministic, Non-Separable, Non-Stylistic, Non-Interbreeding, and many more non-topics · · ·
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression, extending classical affine regression:
\[
\operatorname*{minimize}_{a,\alpha,b,\beta}\ \frac{1}{N}\sum_{s=1}^{N}\Bigg[\, y_s-\Bigg(\underbrace{\max_{1\le i\le k_1}\big\{(a^i)^{\top}x_s+\alpha_i\big\}-\max_{1\le i\le k_2}\big\{(b^i)^{\top}x_s+\beta_i\big\}}_{\text{capturing all piecewise affine functions}}\Bigg)\Bigg]^{2},
\]
illustrating coupled non-convexity and non-differentiability; it can be used to estimate change points/hyperplanes in linear regression.
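As a sketch (not the talk's algorithm), the difference-of-max-affine model and its least-squares loss can be evaluated directly in NumPy; all data and parameter values below are invented for illustration.

```python
import numpy as np

def pwa_predict(X, A, alpha, B, beta):
    """Difference of two max-affine functions:
    max_i {(a^i)^T x + alpha_i} - max_i {(b^i)^T x + beta_i}."""
    return (X @ A.T + alpha).max(axis=1) - (X @ B.T + beta).max(axis=1)

def pwa_loss(X, y, A, alpha, B, beta):
    """Least-squares loss (1/N) sum_s (y_s - model(x_s))^2."""
    r = y - pwa_predict(X, A, alpha, B, beta)
    return np.mean(r ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))                                 # N = 50 samples in R^3
A = rng.standard_normal((4, 3)); alpha = rng.standard_normal(4)  # k1 = 4 pieces
B = rng.standard_normal((2, 3)); beta = rng.standard_normal(2)   # k2 = 2 pieces
y = pwa_predict(X, A, alpha, B, beta)     # noiseless targets from the model itself
print(pwa_loss(X, y, A, alpha, B, beta))  # 0.0 at the generating parameters
```

The objective is nonconvex and nonsmooth in the parameters; the slide's point is precisely that minimizing it requires the "non"-machinery rather than off-the-shelf convex solvers.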
Comments
• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is
\[
\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\Big)-\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\Big),
\]
where all functions are convex.
Logical Sparsity Learning
\[
\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\ f(\beta)\quad\text{subject to}\quad \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0\le b_i,\quad i=1,\dots,m,
\]
where
\[
|t|_0 \triangleq \begin{cases} 1 & \text{if } t\neq 0,\\ 0 & \text{otherwise.}\end{cases}
\]
Such affine sparsity constraints (ASCs) model, e.g.:
• strong/weak hierarchy, e.g., |βj|0 ≤ |βk|0;
• cost-based variable selection: coefficients ≥ 0, but not all equal;
• hierarchical/group variable selection.
Comments
• The ℓ0 function is discrete in nature; an ASC can be formulated either via complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind
\[
\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\ f(\beta)+\gamma\sum_{i=1}^{m}\Bigg[\, b_i-\sum_{j=1}^{d} a_{ij}\,|\beta_j|_0\Bigg]_+\quad\text{for a given }\gamma>0.
\]
• When the matrix A has negative entries, the resulting ASC set may not be closed, and its optimization formulation may not have a lower semicontinuous objective.
• When |•|0 is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
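A minimal sketch of the pieces named above: the componentwise ℓ0 count, the soft ASC penalty, and a capped-ℓ1 folded concave surrogate of |•|0. The ASC row, bound, and coefficient vector are made up for illustration.

```python
import numpy as np

def l0(beta):
    """Componentwise |t|_0: 1 if t != 0, else 0."""
    return (beta != 0).astype(float)

def soft_asc_penalty(beta, A, b, gamma):
    """gamma * sum_i [ b_i - sum_j a_ij |beta_j|_0 ]_+ , the soft ASC term."""
    slack = b - A @ l0(beta)
    return gamma * np.maximum(slack, 0.0).sum()

def capped_l1(t, delta):
    """Folded concave surrogate of |t|_0: min(|t|/delta, 1), a difference of
    the convex functions |t|/delta and [|t|/delta - 1]_+."""
    return np.minimum(np.abs(t) / delta, 1.0)

beta = np.array([0.0, 2.0, -0.5])
A = np.array([[1.0, 1.0, 0.0]]); b = np.array([2.0])  # one ASC row (made up)
print(l0(beta))                           # [0. 1. 1.]
print(soft_asc_penalty(beta, A, b, 1.0))  # [2 - (0 + 1)]_+ = 1.0
print(capped_l1(beta, 1.0))               # [0. 1. 0.5]
```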
Deep Neural Network with Feed-Forward Structure, leading to multi-composite nonconvex nonsmooth optimization problems:
\[
\operatorname*{minimize}_{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}\ \frac{1}{S}\sum_{s=1}^{S}\big[\, y_s-f_\theta(x_s)\big]^2,
\]
where fθ is the L-fold composite function with the nonsmooth max (ReLU) activation:
\[
f_\theta(x)=\underbrace{\big(W^L\bullet+w^L\big)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\!\big(0,\,W^2\bullet+w^2\big)}_{\text{second layer}}\circ\underbrace{\max\!\big(0,\,W^1x+w^1\big)}_{\text{input layer}},
\]
i.e., linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input).
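The composite map fθ above can be written out directly; a minimal NumPy sketch (the weights and input are invented for illustration):

```python
import numpy as np

def f_theta(x, Ws, ws):
    """L-fold composite: ReLU layers followed by an affine output layer."""
    u = x
    for W, w in zip(Ws[:-1], ws[:-1]):  # hidden layers: max(0, W u + w)
        u = np.maximum(0.0, W @ u + w)
    return Ws[-1] @ u + ws[-1]          # output layer: affine, no activation

# Tiny 2-layer instance
W1 = np.array([[1.0, -1.0], [0.0, 1.0]]); w1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, 1.0]]);              w2 = np.array([0.5])
x = np.array([2.0, 3.0])
# hidden: max(0, [2-3, 3-1]) = [0, 2]; output: 0 + 2 + 0.5 = 2.5
print(f_theta(x, [W1, W2], [w1, w2]))   # [2.5]
```

The squared loss in θ is then a multi-composition of smooth and piecewise linear maps, which is exactly the structure the slide's "non"-framework targets.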
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:
\[
\operatorname*{minimize}_{z\in Z,\,u}\ \zeta_\rho(z,u)\triangleq\frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^{s,L}\big)+\rho\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Big|\,u^{s,\ell}_{j}-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}.
\]
• Rigorous algorithms with a supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• It leads to an advanced study of exact penalization and error bounds for nondifferentiable optimization problems.
Examples of Non-Problems
II. Smart Planning

Power Systems Planning
ω1: uncertain renewable energy generation; ω2: uncertain demand for electricity;
x: production from traditional thermal plants; y: production from backup fast-ramping generators.
\[
\operatorname*{minimize}_{x\in X}\ \varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big],
\]
where the recourse function is
\[
\psi(x,\omega_1,\omega_2)\triangleq\operatorname*{minimum}_{y}\ y^{\top}c\big([\,\omega_2-x-\omega_1]_+\big)\quad\text{subject to}\quad A(\omega)x+Dy\ge\xi(\omega).
\]
The unit cost of the fast-ramping generators depends on:
• the observed renewable production and demand for electricity;
• the shortfall due to the first-stage thermal power decisions.
Comments
• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
\[
\operatorname*{minimize}_{x}\ \zeta(x)\triangleq\varphi(x)+\mathbb{E}_{\xi}\big[\psi(x,\xi)\big]\quad\text{subject to}\quad x\in X\subseteq\mathbb{R}^{n_1},
\]
where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:
\[
\psi(x,\xi)\triangleq\operatorname*{minimum}_{y}\ \big[f(\xi)+G(\xi)x\big]^{\top}y+\tfrac{1}{2}\,y^{\top}Qy\quad\text{subject to}\quad y\in Y(x,\xi)\triangleq\big\{\,y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\,\big\}.
\]
• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a regularization + convexification + sampling combined method for computing a generalized critical point, with an almost-sure convergence guarantee.
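To make the two-stage structure concrete, here is a heavily simplified sample-average sketch (not the method described above): take Q = I and D = I with a lower bound y ≥ 0, so each recourse QP separates and has the closed-form solution yj* = max(0, −qj); the mapping q(x, ξ) = ξ − x and the first-stage cost are invented for illustration.

```python
import numpy as np

def recourse_value(q, lower):
    """psi = min_y q^T y + 0.5 ||y||^2  s.t. y >= lower   (Q = I, D = I).
    Separable in y: y_j* = max(lower_j, -q_j)."""
    y = np.maximum(lower, -q)
    return q @ y + 0.5 * y @ y

def saa_objective(x, xi_samples, phi):
    """Sample-average approximation of phi(x) + E[psi(x, xi)],
    with the illustrative choice q(x, xi) = xi - x and lower bound 0."""
    vals = [recourse_value(xi - x, np.zeros_like(x)) for xi in xi_samples]
    return phi(x) + np.mean(vals)

rng = np.random.default_rng(1)
x = np.array([0.5, 0.5])
xi_samples = rng.standard_normal((1000, 2))
print(saa_objective(x, xi_samples, phi=lambda v: v @ v))
# single-scenario check: q = [1, -2], lower = 0 -> y* = [0, 2], psi = -4 + 2 = -2
print(recourse_value(np.array([1.0, -2.0]), np.zeros(2)))  # -2.0
```

In the general positive semidefinite case the recourse value has no closed form and ψ(•, ξ) is only implicitly convex-concave, which is what drives the specialized algorithm above.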
(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving
\[
\operatorname*{minimize}_{x\ge 0}\ (c+\delta_c)^{\top}x+\rho\,\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b]_+\big\|_{\infty}}_{\text{constraint residual}}.
\]
Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving
\[
\operatorname*{maximum}_{\delta\in\Delta}\ \lambda_1\underbrace{(c+\delta_c)^{\top}x}_{\text{disrupt objective}}+\lambda_2\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b]_+\big\|_{\infty}}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}.
\]
• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• "Adversarial Programming."
Robustification of nonconvex problems

A general nonlinear program:
\[
\operatorname*{minimize}_{x\in S}\ \theta(x)+f(x)\quad\text{subject to}\quad H(x)=0,
\]
where
• S is a closed convex set in Rⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in Rⁿ containing S;
• H : O → Rᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:
\[
\operatorname*{minimize}_{x\in S}\ \theta(x)+\operatorname*{maximum}_{\delta\in\Delta}\ f(x,\delta)\quad\text{subject to}\quad \underbrace{H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}.
\]
New formulation via a pessimistic value-function minimization: given γ > 0,
\[
\operatorname*{minimize}_{x\in S}\ \theta(x)+v_\gamma(x),
\]
where vγ(x) models constraint violation together with the cost of model attacks:
\[
v_\gamma(x)\triangleq\operatorname*{maximum}_{\delta\in\Delta}\ \varphi_\gamma(x,\delta)\triangleq f(x,\delta)+\gamma\,\underbrace{\|H(x,\delta)\|_{\infty}}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}},
\]
with the novel features:
• a combined perturbation δ in a joint uncertainty set Δ ⊆ R^L;
• robustified constraint satisfaction via penalization.
Comments
• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:
\[
\operatorname*{minimize}_{x\in S}\ \varphi(x)\triangleq\theta(x)+v(x),\quad\text{with}\quad v(x)\triangleq\max_{1\le j\le J}\ \operatorname*{maximum}_{\delta\in\Delta\cap\mathbb{R}^L_+}\big[g_j(x,\delta)-h_j(x)^{\top}\delta\big].
\]
• Implemented the algorithm and reported computational results that demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization, for risk-based statistical estimation that controls the left and right tails of a distribution simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:
\[
\operatorname{In\text{-}CVaR}_{\alpha}^{\beta}(Z)\triangleq\frac{1}{\beta-\alpha}\int_{\operatorname{VaR}_\beta(Z)>z\ge\operatorname{VaR}_\alpha(Z)} z\,dF_Z(z)
=\frac{1}{\beta-\alpha}\Big[(1-\alpha)\operatorname{CVaR}_\alpha(Z)-(1-\beta)\operatorname{CVaR}_\beta(Z)\Big].
\]
Buffered probability of exceedance/deceedance at level τ > 0:
\[
\operatorname{bPOE}(Z,\tau)\triangleq 1-\min\big\{\alpha\in[0,1] : \operatorname{CVaR}_\alpha(Z)\ge\tau\big\},\qquad
\operatorname{bPOD}(Z,u)\triangleq\min\big\{\beta\in[0,1] : \operatorname{In\text{-}CVaR}_{0}^{\beta}(Z)\ge u\big\}.
\]
Buffered probability of the interval (τ, u]. We have the following bound for an interval probability:
\[
\mathbb{P}(u\ge Z>\tau)\le\operatorname{bPOE}(Z,\tau)+\operatorname{bPOD}(Z,u)-1\triangleq\operatorname{bPOI}_{\tau}^{u}(Z).
\]
Two risk minimization problems of a composite random functional:
\[
\operatorname*{minimize}_{\theta\in\Theta}\ \underbrace{\operatorname{In\text{-}CVaR}_{\alpha}^{\beta}\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}}
\qquad\text{and}\qquad
\operatorname*{minimize}_{\theta\in\Theta}\ \underbrace{\operatorname{bPOI}_{\tau}^{u}\big(Z(\theta)\big)}_{\text{a stochastic program with a difference-of-convex objective}},
\]
where
\[
Z(\theta)\triangleq c\big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)-h(\bullet,z)\ \text{is a difference of convex functions}}\big),
\]
with c : R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex.
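Empirically, In-CVaR can be computed from its CVaR identity above. A sketch under one common convention (estimating CVaR_α as the average of the upper (1 − α) fraction of sorted samples; discretization conventions vary, and the sample values are made up):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the upper (1 - alpha) tail."""
    z = np.sort(z)
    k = int(np.ceil((1 - alpha) * len(z)))  # number of tail samples
    return z[-k:].mean()

def in_cvar(z, alpha, beta):
    """In-CVaR via the identity
    ((1-alpha) CVaR_alpha - (1-beta) CVaR_beta) / (beta - alpha)."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 11.0)     # samples 1, 2, ..., 10 (made up)
print(cvar(z, 0.8))          # mean of top 2 samples = 9.5
print(in_cvar(z, 0.5, 0.9))  # ≈ (0.5*8 - 0.1*10)/0.4 = 7.5
```

Trimming both tails this way is what lets the composite estimation problem handle outliers on both sides, at the price of a difference-of-convex (rather than convex) objective.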
Comments
• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ
\[
\underbrace{\operatorname*{minimum}_{\eta_1\in\Upsilon_1}\ \mathbb{E}_{\omega}\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}
-\underbrace{\operatorname*{minimum}_{\eta_2\in\Upsilon_2}\ \mathbb{E}_{\omega}\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}.
\]
Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory
First of all: what can be computed?

[Diagram: a hierarchy of stationarity concepts, from weakest to sharpest — critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions) — together with the second-order necessary condition (SONC), local minimizers, and the second-order sufficient condition (SOSC).]
Relationship between the stationary points
• Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the functions is continuously differentiable.
• Directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder it is to compute.
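Directional stationarity can be probed numerically straight from its definition. A minimal sketch (not from the talk; the function V(x) = |x| and the test points are chosen for illustration) estimates the one-sided directional derivative by a forward difference:

```python
def dir_deriv(V, x, d, t=1e-8):
    """One-sided finite-difference estimate of V'(x; d) = lim_{t->0+} (V(x+td)-V(x))/t."""
    return (V(x + t * d) - V(x)) / t

V = abs  # V(x) = |x|: convex, nonsmooth at 0
# at x = 0: V'(0; d) = |d| >= 0 for every d, so x = 0 is directionally stationary
print(dir_deriv(V, 0.0, 1.0), dir_deriv(V, 0.0, -1.0))  # 1.0 1.0
# at x = 1: V'(1; -1) = -1 < 0, so x = 1 is not stationary (d = -1 is a descent direction)
print(dir_deriv(V, 1.0, -1.0))  # ≈ -1.0
```

Note that 0 ∈ ∂V(0) also certifies stationarity here, but for general nonconvex nonsmooth V the subdifferential-based tests are weaker than the directional one, which is the slide's point.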
Computational Tools
• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
• A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: the majorization-minimization algorithm;
  – dealing with non-smoothness: nonsmooth Newton algorithms;
  – dealing with pathological constraints: (exact) penalty methods.
• An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
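As an illustration of the majorization-minimization idea on a difference-of-convex function (a sketch, not an algorithm from the monograph): for f = g − h with g, h convex, linearizing h at the current iterate gives a convex upper surrogate of f, and minimizing that surrogate defines the next iterate. The toy instance f(x) = x² − |x| is invented for illustration:

```python
def dc_algorithm(g_min_given_slope, h_subgrad, x0, iters=50):
    """DC algorithm (a majorization-minimization scheme): at iterate x_k,
    replace h by its linearization h(x_k) + s_k (x - x_k), s_k in subdiff h(x_k),
    and minimize the convex majorized model g(x) - s_k x exactly."""
    x = x0
    for _ in range(iters):
        s = h_subgrad(x)
        x = g_min_given_slope(s)
    return x

# Toy DC function f(x) = x^2 - |x|   (g(x) = x^2, h(x) = |x|)
g_min = lambda s: s / 2.0   # argmin_x x^2 - s x = s / 2
h_sub = lambda x: 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
print(dc_algorithm(g_min, h_sub, x0=1.0))   # 0.5, a d-stationary local minimizer
print(dc_algorithm(g_min, h_sub, x0=-3.0))  # -0.5, the symmetric one
```

Each surrogate minimization decreases f, and the limit points are critical points of the dc decomposition; sharper directional stationarity needs the refinements surveyed in the monograph.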
Classes of Nonsmooth Functions: pervasive in applications, progressively broader.
• Piecewise smooth*, e.g., piecewise linear-quadratic.
• Difference-of-convex: facilitates algorithmic design via convex majorization.
• Semismooth: enables fast algorithms of the Newton type.
• Bouligand differentiable: locally Lipschitz + directionally differentiable.
• Subanalytic and semianalytic.
(* and many more)
[Diagram (Section 4.4 of the monograph, p. 231): inclusions among classes of nonsmooth functions — piecewise affine (PA) ⊂ piecewise linear-quadratic (PLQ) ⊂ piecewise quadratic (PQ); PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable ⊂ locally Lipschitz + directionally differentiable; together with the dc, dc-composite, diff-max-convex, implicit convex-concave (icc), strongly/locally weakly convex, lower-C², LC¹/C¹, and locally dc classes and the implications indicated among them.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to the publisher (Friday, September 11, 2020)
bull Convexity andor differentiability have been the safe haven for optimizationand its applications for decades particularly popular in the machine learning inrecent years
to the extent that
bull modellers are routinely sacrificing realism for convenience
bull algorithm designers will do whatever to stay in this comfort zone even at theexpense of potentially wrong approach
bull practitioners are content with the resulting models algorithms andldquosolutionsrdquo
resulting in
bull tools being abused
bull advances in science being stalled and
bull future being built on soft ground
This talkBenefits yes or No
Ready for the ldquononrdquo-era yes or no
A little humor before the storm
This talkBenefits yes or No
Ready for the ldquononrdquo-era yes or no
A little humor before the storm
A
A Are you satisfied with the comfort zone of convexity and differentiability
A
will not learn from this talk
Yes
A
will not learn from this talk
Live happily ever after
Yes
A
will not learn from this talk
Live happily ever after
B
Yes No
B How to deal with non-convexitynon-smoothness
A
will not learn from this talk
Live happily ever after
B
C1
Yes No
C1 Ignore these properties or give them minimal or even wrong treatment
A
will not learn from this talk
Live happily ever after
B
C1 C2
Yes No
C2 Focus on stylized problems (and there are many interesting ones) and stopat them
A
will not learn from this talk
Live happily ever after
B
C1 C2 C3
Yes No
C3 Faithfully treat these non-properties
A
will not learn from this talk
Live happily ever after
B
C1 C2 C3
D1
Yes No
D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated
A
will not learn from this talk
Live happily ever after
B
C1 C2 C3
D1 D2
Yes No
D2 The hard road
Strive for rigor without sacrificing practice
advance fundamental knowledge and prepare for the future
Features
Non-Smooth Non-Convex Non-Deterministic+ +
Non-Problems
Non-Separable
Non-Stylistic Non-Interbreeding
and many more non-topics middot middot middot
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression
(extending classical affine regression)

$$\underset{a,\alpha,b,\beta}{\operatorname{minimize}}\ \frac{1}{N}\sum_{s=1}^{N}\Bigg[\,y_s-\bigg(\underbrace{\max_{1\le i\le k_1}\big\{(a^i)^{\top}x_s+\alpha_i\big\}\;-\;\max_{1\le i\le k_2}\big\{(b^i)^{\top}x_s+\beta_i\big\}}_{\text{capturing all piecewise affine functions}}\bigg)\Bigg]^{2}$$

illustrating coupled non-convexity and non-differentiability.

Application: estimating change points/hyperplanes in linear regression.
Comments
• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming-based algorithm.
• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\Big)-\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\Big)$$

where all functions are convex.
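To make the difference-of-max-affine model concrete, here is a heuristic alternating-least-squares sketch (not the iterative convex-programming algorithm analyzed in the talk): for fixed active pieces the model is affine in the parameters, so each iteration solves one linear least-squares problem and then re-assigns the active pieces. The function name `fit_pwa` and all defaults are illustrative.

```python
import numpy as np

def fit_pwa(X, y, k1=2, k2=2, iters=30, seed=0):
    """Heuristic alternating fit of a difference-of-max-affine model
        y ~ max_i {(a^i)^T x + alpha_i} - max_j {(b^j)^T x + beta_j}.
    For fixed active pieces the model is affine in the parameters,
    so each iteration solves a single linear least-squares problem."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    A = rng.standard_normal((k1, d)); alpha = rng.standard_normal(k1)
    B = rng.standard_normal((k2, d)); beta = rng.standard_normal(k2)
    Xh = np.hstack([X, np.ones((N, 1))])            # homogeneous coordinates
    for _ in range(iters):
        i_act = np.argmax(X @ A.T + alpha, axis=1)  # active "plus" piece per sample
        j_act = np.argmax(X @ B.T + beta, axis=1)   # active "minus" piece per sample
        # Design matrix: +x block for the active (a, alpha), -x block for (b, beta)
        M = np.zeros((N, (k1 + k2) * (d + 1)))
        for s in range(N):
            M[s, i_act[s]*(d+1):(i_act[s]+1)*(d+1)] = Xh[s]
            off = k1 * (d + 1)
            M[s, off + j_act[s]*(d+1):off + (j_act[s]+1)*(d+1)] = -Xh[s]
        theta, *_ = np.linalg.lstsq(M, y, rcond=None)
        P = theta.reshape(k1 + k2, d + 1)
        A, alpha = P[:k1, :d], P[:k1, d]
        B, beta = P[k1:, :d], P[k1:, d]
    pred = np.max(X @ A.T + alpha, 1) - np.max(X @ B.T + beta, 1)
    return (A, alpha, B, beta), np.mean((y - pred) ** 2)
```

Like other assignment-based heuristics, this is only guaranteed to decrease the fit error per reassignment cycle, not to find a directional stationary point.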
Logical Sparsity Learning

$$\underset{\beta\in\mathbb{R}^d}{\operatorname{minimize}}\ f(\beta)\quad\text{subject to}\quad\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\le b_i,\quad i=1,\cdots,m,$$

where $|t|_0\triangleq\begin{cases}1 & \text{if }t\neq 0\\ 0 & \text{otherwise.}\end{cases}$

• strong/weak hierarchy, e.g., $|\beta_j|_0\le|\beta_k|_0$;
• cost-based variable selection: coefficients ≥ 0 but not all equal;
• hierarchical/group variable selection.
Comments
• The ℓ₀ function is discrete in nature; an ASC (affine sparsity constraint) can be formulated either via complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind

$$\underset{\beta\in\mathbb{R}^d}{\operatorname{minimize}}\ f(\beta)+\gamma\sum_{i=1}^{m}\Bigg[\,\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0-b_i\Bigg]_+\quad\text{for a given }\gamma>0.$$

• When the matrix A has negative entries, the resulting ASC set may not be closed, and its optimization formulation may not have a lower semicontinuous objective.
• When |•|₀ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
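A minimal numpy sketch of the ASC building blocks above (all function names are illustrative): the elementwise ℓ₀ indicator, the constraint residuals of A|β|₀ ≤ b, and the soft penalty on their positive parts. A strong-hierarchy row |β_j|₀ ≤ |β_k|₀, for instance, is the row a = e_j − e_k with b = 0.

```python
import numpy as np

def l0(beta, tol=1e-12):
    """Elementwise |.|_0 indicator: 1 where beta_j != 0, else 0."""
    return (np.abs(beta) > tol).astype(float)

def asc_violation(A, b, beta):
    """Residuals of the affine sparsity constraints A|beta|_0 <= b;
    positive entries mean the corresponding constraint is violated."""
    return A @ l0(beta) - b

def soft_penalty(A, b, beta, gamma):
    """Soft-formulation penalty: gamma * sum_i [ (A|beta|_0 - b)_i ]_+ ."""
    return gamma * np.maximum(asc_violation(A, b, beta), 0.0).sum()
```

These are only evaluation helpers; optimizing through |.|₀ requires the complementarity, integer, or folded-concave reformulations mentioned above.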
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

$$\underset{\theta=(W^\ell,w^\ell)_{\ell=1}^{L}}{\operatorname{minimize}}\ \frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x_s)\,\big]^2$$

where $f_\theta$ is the L-fold composite function with the nonsmooth max (ReLU) activation:

$$f_\theta(x)=\underbrace{\big(W^L\bullet+w^L\big)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\bullet+w^2\big)}_{\text{second layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{input layer}}$$

i.e., linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input).
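A self-contained sketch of the composite $f_\theta$ and the empirical loss above, with numpy (names `f_theta` and `empirical_loss` are illustrative): ReLU is applied after every layer except the last, which is purely affine.

```python
import numpy as np

def f_theta(x, weights):
    """Forward pass of an L-layer feed-forward network: max(0, .) after
    every layer except the last (the output layer is purely affine)."""
    u = x
    for ell, (W, w) in enumerate(weights):
        u = W @ u + w
        if ell < len(weights) - 1:     # nonsmooth ReLU activation
            u = np.maximum(0.0, u)
    return u

def empirical_loss(samples, weights):
    """(1/S) * sum_s (y_s - f_theta(x_s))^2, for scalar outputs."""
    return np.mean([(y - f_theta(x, weights).item()) ** 2
                    for x, y in samples])
```

For example, with $W^1=\binom{1}{-1}$, $w^1=0$, $W^2=(1\ 1)$, $w^2=0$, the network computes $f_\theta(x)=[x]_+ + [-x]_+ = |x|$, a simple piecewise affine (hence nonsmooth) function.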
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ₁ penalization:

$$\underset{z\in Z,\;u}{\operatorname{minimize}}\ \zeta_\rho(z,u)\triangleq\frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^{s,L}\big)+\rho\underbrace{\sum_{\ell=1}^{L}\sum_{j=1}^{N}\Big|\,u^{s,\ell}_j-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• This leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems
II. Smart Planning

Power Systems Planning
ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$$\underset{x\in X}{\operatorname{minimize}}\ \varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\,\psi(x,\omega_1,\omega_2)\,\big]$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2)\triangleq\underset{y}{\operatorname{minimum}}\ y^{\top}c\big([\,\omega_2-x-\omega_1\,]_+\big)\quad\text{subject to}\quad A(\omega)x+Dy\ge\xi(\omega).$$

The unit cost of the fast-ramping generators depends on
• the observed renewable production and demand of electricity;
• the shortfall due to the first-stage thermal power decisions.
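A toy scalar sketch of this two-stage structure under strong simplifying assumptions (single generator, recourse solved in closed form as the shortfall itself, the constraint matrices omitted; all names illustrative): the shortfall-dependent unit cost is exactly what makes the recourse nonconvex in x, and the expectation is replaced by a sample average.

```python
import numpy as np

def recourse_cost(x, w1, w2, unit_cost):
    """Simplified recourse psi(x, w1, w2): backup production covers the
    shortfall [w2 - x - w1]_+, priced at a shortfall-dependent unit cost."""
    shortfall = np.maximum(w2 - x - w1, 0.0)
    return unit_cost(shortfall) * shortfall

def saa_objective(x, phi, scenarios, unit_cost):
    """Sample-average approximation of phi(x) + E[psi(x, w1, w2)]."""
    return phi(x) + np.mean([recourse_cost(x, w1, w2, unit_cost)
                             for w1, w2 in scenarios])
```

For instance, with first-stage cost `phi(x) = x`, one scenario `(w1, w2) = (0, 2)`, and `unit_cost(s) = 1 + s`, the objective at `x = 1` is `1 + (1 + 1) * 1 = 3`.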
Comments
• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x}{\operatorname{minimize}}\ \zeta(x)\triangleq\varphi(x)+\mathbb{E}_{\xi}\big[\,\psi(x,\xi)\,\big]\quad\text{subject to}\quad x\in X\subseteq\mathbb{R}^{n_1},$$

where the recourse function ψ(x, ξ) is the optimal objective value of the following quadratic program (QP), with Q symmetric positive semidefinite:

$$\psi(x,\xi)\triangleq\underset{y}{\operatorname{minimum}}\ \big[\,f(\xi)+G(\xi)x\,\big]^{\top}y+\tfrac{1}{2}\,y^{\top}Qy\quad\text{subject to}\quad y\in Y(x,\xi)\triangleq\big\{y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\big\}.$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a combined regularization-convexification sampling method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize $c^{\top}x$ subject to $Ax=b$ and $x\ge 0$.

Defender's problem: anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

$$\underset{x\ge 0}{\operatorname{minimize}}\ (c+\delta_c)^{\top}x+\rho\,\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\big\|_{\infty}}_{\text{constraint residual}}$$

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\underset{\delta\in\Delta}{\operatorname{maximum}}\ \lambda_1\underbrace{(c+\delta_c)^{\top}x}_{\text{disrupt objective}}+\lambda_2\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\big\|_{\infty}}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial programming.
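The defender's penalized objective above is straightforward to evaluate; a minimal numpy sketch (function name illustrative), where an attack is the triple (δ_c, δ_A, δ_b):

```python
import numpy as np

def defender_objective(x, c, A, b, attack, rho):
    """(c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf
    for a given attack (dc, dA, db)."""
    dc, dA, db = attack
    residual = np.maximum((A + dA) @ x - b - db, 0.0)   # one-sided residual
    return (c + dc) @ x + rho * np.linalg.norm(residual, np.inf)
```

The max-composite residual term is what makes this objective nonsmooth even when the attack is fixed; the attacker's problem maximizes a weighted combination of the same two pieces minus the attack cost C(δ).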
Robustification of Nonconvex Problems

A general nonlinear program:

$$\underset{x\in S}{\operatorname{minimize}}\ \theta(x)+f(x)\quad\text{subject to}\quad H(x)=0,$$

where
• S is a closed convex set in $\mathbb{R}^n$ not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in $\mathbb{R}^n$ containing S;
• H: O → $\mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\underset{x\in S}{\operatorname{minimize}}\ \theta(x)+\underset{\delta\in\Delta}{\operatorname{maximum}}\ f(x,\delta)\quad\text{subject to}\quad\underbrace{H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}$$
New formulation

via a pessimistic value-function minimization: given γ > 0,

$$\underset{x\in S}{\operatorname{minimize}}\ \theta(x)+v_\gamma(x),$$

where $v_\gamma(x)$ models constraint violation with a cost of model attacks:

$$v_\gamma(x)\triangleq\underset{\delta\in\Delta}{\operatorname{maximum}}\ \varphi_\gamma(x,\delta)\triangleq f(x,\delta)+\gamma\underbrace{\|H(x,\delta)\|_{\infty}}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:
• combined perturbation δ in a joint uncertainty set ∆ ⊆ $\mathbb{R}^L$;
• robustified constraint satisfaction via penalization.
Comments
• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

$$\underset{x\in S}{\operatorname{minimize}}\ \varphi(x)\triangleq\theta(x)+v(x)\quad\text{with}\quad v(x)\triangleq\max_{1\le j\le J}\ \underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\operatorname{maximum}}\ \big[\,g_j(x,\delta)-h_j(x)^{\top}\delta\,\big].$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the “non”-framework.
Examples of Non-Problems
III. Operations Research Meets Statistics

Composite CVaR and bPOE Minimization

for risk-based statistical estimation to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\operatorname{In\text{-}CVaR}^{\beta}_{\alpha}(Z)\triangleq\frac{1}{\beta-\alpha}\int_{\operatorname{VaR}_\beta(Z)>z\ge\operatorname{VaR}_\alpha(Z)}z\,dF_Z(z)=\frac{1}{\beta-\alpha}\Big[(1-\alpha)\operatorname{CVaR}_\alpha(Z)-(1-\beta)\operatorname{CVaR}_\beta(Z)\Big]$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$\operatorname{bPOE}(Z;\tau)\triangleq 1-\min\big\{\alpha\in[0,1]\mid\operatorname{CVaR}_\alpha(Z)\ge\tau\big\}$$
$$\operatorname{bPOD}(Z;u)\triangleq\min\big\{\beta\in[0,1]\mid\operatorname{In\text{-}CVaR}^{\beta}_{0}(Z)\ge u\big\}$$
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

$$\mathbb{P}(u\ge Z>\tau)\le\operatorname{bPOE}(Z;\tau)+\operatorname{bPOD}(Z;u)-1\triangleq\operatorname{bPOI}^{u}_{\tau}(Z)$$

Two risk minimization problems of a composite random functional:

$$\underset{\theta\in\Theta}{\operatorname{minimize}}\ \underbrace{\operatorname{In\text{-}CVaR}^{\beta}_{\alpha}\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}}\qquad\text{and}\qquad\underset{\theta\in\Theta}{\operatorname{minimize}}\ \underbrace{\operatorname{bPOI}^{u}_{\tau}\big(Z(\theta)\big)}_{\text{a stochastic program with a difference-of-convex objective}}$$

where

$$Z(\theta)\triangleq c\big(\underbrace{g(\theta;Z)-h(\theta;Z)}_{g(\bullet,z)-h(\bullet,z)\ \text{is a difference of convex}}\big)$$

with c: $\mathbb{R}\to\mathbb{R}$ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
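A small numpy sketch of the empirical quantities above (function names illustrative): CVaR is computed via the Rockafellar-Uryasev formula $\operatorname{CVaR}_\alpha(Z)=\min_t\,t+\mathbb{E}[(Z-t)_+]/(1-\alpha)$, whose minimum over a finite sample is attained at a sample point, and In-CVaR follows from the closed-form combination of two CVaRs stated above.

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha (alpha < 1) via the Rockafellar-Uryasev formula
    min_t  t + E[(Z - t)_+] / (1 - alpha); a sample point attains the min."""
    z = np.asarray(z, dtype=float)
    vals = [t + np.mean(np.maximum(z - t, 0.0)) / (1.0 - alpha) for t in z]
    return min(vals)

def in_cvar(z, alpha, beta):
    """In-CVaR^beta_alpha = [(1-alpha) CVaR_alpha - (1-beta) CVaR_beta] / (beta - alpha)."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)
```

On the sample {1, …, 10}, `cvar(z, 0.5)` returns 8.0 (the mean of the top half) and `in_cvar(z, 0.0, 0.5)` returns 3.0 (the mean of the bottom half), matching the tail-averaging interpretation.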
Comments
• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\underset{\eta_1\in\Upsilon_1}{\operatorname{minimum}}\ \mathbb{E}_{\omega}\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}\;-\;\underbrace{\underset{\eta_2\in\Upsilon_2}{\operatorname{minimum}}\ \mathbb{E}_{\omega}\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}$$

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of all, what can be computed?

[Diagram: hierarchy of solution concepts — critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions), points satisfying second-order necessary conditions (SONC), local minimizers, and points satisfying second-order sufficient conditions (SOSC).]

Relationship between the stationary points:
• Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the functions is continuously differentiable.
• Directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder it is to compute.
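The gap between the concepts is easy to see numerically. For f(x) = −|x| at x = 0, the Clarke subdifferential is [−1, 1], so 0 is Clarke stationary; yet f′(0; d) = −|d| < 0 for every d ≠ 0, so every direction is a strict descent direction and the point is not directionally stationary. A tiny sketch (helper name illustrative):

```python
def one_sided_dd(f, x, d, t=1e-7):
    """Numerical one-sided directional derivative f'(x; d) ~ (f(x+t*d)-f(x))/t."""
    return (f(x + t * d) - f(x)) / t

# f(x) = -|x|: x = 0 is Clarke stationary (0 lies in the Clarke
# subdifferential [-1, 1]) but NOT directionally stationary:
f = lambda x: -abs(x)
print(one_sided_dd(f, 0.0, +1.0))   # approximately -1, a descent direction
print(one_sided_dd(f, 0.0, -1.0))   # approximately -1, a descent direction
```

This is exactly why directional stationarity is the sharper, and harder to compute, target.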
Computational Tools
• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
• A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithms;
  – dealing with non-smoothness: nonsmooth Newton algorithms;
  – dealing with pathological constraints: (exact) penalty methods.
• An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
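The majorization-minimization idea on a difference-of-convex function can be shown in a few lines (a toy sketch of the surrogation principle, not the book's algorithms; all names illustrative). For f = g − h with g, h convex, each step replaces h by its affine minorant at the current point and minimizes the resulting convex majorant — the classical DCA step:

```python
def dca_step(x, g_min, h_subgrad):
    """One MM (DCA) step for f = g - h with g, h convex: linearize h at x
    with a subgradient s, then minimize the convex majorant g(.) - s * (.)."""
    return g_min(h_subgrad(x))

# Toy instance: f(x) = x^2 - |x|, with global minima at x = +/- 0.5.
g_min = lambda s: s / 2.0                        # argmin_x x^2 - s*x = s/2
h_subgrad = lambda x: 1.0 if x >= 0 else -1.0    # a subgradient of |x|
x = 0.9
for _ in range(5):
    x = dca_step(x, g_min, h_subgrad)
# x has converged to 0.5, a global minimizer of this dc function
```

Each step decreases f, and the limit point is a critical point of the dc decomposition; for this toy instance it is also a global minimizer.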
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:
• piecewise smooth*, e.g., piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand-differentiable: locally Lipschitz + directionally differentiable;
• subanalytic and semianalytic;
* and many more.
[Figure from Section 4.4 of the monograph (p. 231): inclusion diagram of nonsmooth function classes — PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable ⊂ locally Lipschitz + directionally differentiable; PA ⊂ PLQ; with the dc, dc composite, difference-of-max-convex, implicit convex-concave (icc), locally dc, strongly weakly convex, C¹, LC¹, and lower-C² classes, and with conditions (4.65)-(4.66) and convexity/openness of domains governing some of the inclusions.]
Completed Monograph

Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683

Bibliography 691
Index 721
This talkBenefits yes or No
Ready for the ldquononrdquo-era yes or no
A little humor before the storm
This talkBenefits yes or No
Ready for the ldquononrdquo-era yes or no
A little humor before the storm
A
A Are you satisfied with the comfort zone of convexity and differentiability
A
will not learn from this talk
Yes
A
will not learn from this talk
Live happily ever after
Yes
A
will not learn from this talk
Live happily ever after
B
Yes No
B How to deal with non-convexitynon-smoothness
A
will not learn from this talk
Live happily ever after
B
C1
Yes No
C1 Ignore these properties or give them minimal or even wrong treatment
A
will not learn from this talk
Live happily ever after
B
C1 C2
Yes No
C2 Focus on stylized problems (and there are many interesting ones) and stopat them
A
will not learn from this talk
Live happily ever after
B
C1 C2 C3
Yes No
C3 Faithfully treat these non-properties
A
will not learn from this talk
Live happily ever after
B
C1 C2 C3
D1
Yes No
D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated
A
will not learn from this talk
Live happily ever after
B
C1 C2 C3
D1 D2
Yes No
D2 The hard road
Strive for rigor without sacrificing practice
advance fundamental knowledge and prepare for the future
Features
Non-Smooth Non-Convex Non-Deterministic+ +
Non-Problems
Non-Separable
Non-Stylistic Non-Interbreeding
and many more non-topics middot middot middot
Examples of Non-Problems
I Structured Learning
Piecewise Affine Regression
extending classical affine regression
minimizeaαbβ
1
N
Nsums=1
ys minus
max1leilek1
(ai)gtxs + αi
minus max
1leilek2
(bi)gtxs + βi
︸ ︷︷ ︸
capturing all piecewise affine functions
2
illustrating coupled non-convexity and non-differentiability
Estimate Change PointsHyperplanes
in Linear Regression
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Logical Sparsity Learning

$$\operatorname*{minimize}_{\beta \,\in\, \mathbb{R}^d} \; f(\beta) \quad \text{subject to} \quad \sum_{j=1}^d a_{ij} \, |\beta_j|_0 \le b_i, \quad i = 1, \cdots, m$$

where $|t|_0 \triangleq \begin{cases} 1 & \text{if } t \ne 0 \\ 0 & \text{otherwise} \end{cases}$

• strong/weak hierarchy, e.g. $|\beta_j|_0 \le |\beta_k|_0$
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection
Comments

• The ℓ0 function is discrete in nature; an ASC (affine sparsity constraint) can be formulated either via complementarity constraints or via integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind

$$\operatorname*{minimize}_{\beta \,\in\, \mathbb{R}^d} \; f(\beta) + \gamma \sum_{i=1}^m \Big[ \sum_{j=1}^d a_{ij} \, |\beta_j|_0 - b_i \Big]_+ \quad \text{for a given } \gamma > 0$$

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When $| \bullet |_0$ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
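The ℓ0 constraint system above is easy to evaluate numerically. A minimal numpy sketch (the helper names `l0` and `asc_feasible` are ours, not from the talk), encoding the strong-hierarchy example $|\beta_j|_0 \le |\beta_k|_0$ as one row of the system $A\,|\beta|_0 \le b$:

```python
import numpy as np

def l0(t, tol=0.0):
    """Indicator |t|_0: 1 if the entry is nonzero, else 0 (applied componentwise)."""
    return (np.abs(np.asarray(t, dtype=float)) > tol).astype(int)

def asc_feasible(beta, A, b):
    """Check the affine sparsity constraints A @ |beta|_0 <= b componentwise."""
    return bool(np.all(A @ l0(beta) <= b))

# Strong hierarchy |beta_0|_0 <= |beta_1|_0 encoded as one ASC row:
#   1*|beta_0|_0 - 1*|beta_1|_0 <= 0   (beta_0 may be active only if beta_1 is)
A = np.array([[1.0, -1.0]])
b = np.array([0.0])
print(asc_feasible(np.array([0.0, 2.0]), A, b))  # True
print(asc_feasible(np.array([3.0, 0.0]), A, b))  # False
```

Note that this hierarchy row already has a negative entry, which is exactly the situation flagged above where the ASC set may fail to be closed.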
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

$$\operatorname*{minimize}_{\theta \,=\, (W^\ell, w^\ell)_{\ell=1}^L} \; \frac{1}{S} \sum_{s=1}^S \big[ \, y_s - f_\theta(x_s) \, \big]^2$$

where $f_\theta$ is the L-fold composite function with the nonsmooth max (ReLU) activation:

$$f_\theta(x) \;=\; \underbrace{\big( W^L \bullet + w^L \big)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\big( 0, \, W^2 \bullet + w^2 \big)}_{\text{second layer}} \circ \underbrace{\max\big( 0, \, W^1 x + w^1 \big)}_{\text{input layer}}$$

linear (output) ∘ ⋯ ∘ max ∘ linear ∘ max ∘ linear (input)
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:

$$\operatorname*{minimize}_{z \in Z, \, u} \; \zeta_\rho(z, u) \,\triangleq\, \frac{1}{S} \sum_{s=1}^S \varphi_s\big( u^{s,L} \big) \;+\; \rho \, \underbrace{\sum_{s=1}^S \sum_{\ell=1}^L \sum_{j=1}^N \Big| \, u^{s,\ell}_j - \psi_{\ell j}\big( z^\ell, u^{s,\ell-1} \big) \, \Big|}_{\text{penalizing the layer-consistency components}}$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds for nondifferentiable optimization problems.
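A minimal numpy sketch of this penalized objective, under assumptions of ours: a feed-forward net with ReLU hidden layers and an affine output layer, squared loss, and auxiliary layer variables `u` playing the role of $u^{s,\ell}$ (all function names below are illustrative, not from the talk):

```python
import numpy as np

def layer(W, w, v, is_output):
    """The layer map psi_l: affine, followed by ReLU except at the output layer."""
    z = W @ v + w
    return z if is_output else np.maximum(0.0, z)

def zeta_rho(Ws, ws, xs, ys, us, rho):
    """l1-exact-penalty objective: (1/S) * sum_s loss  +  rho * consistency penalty."""
    S, L = len(xs), len(Ws)
    loss = pen = 0.0
    for s in range(S):
        prev = xs[s]                                  # u^{s,0} = x_s
        for l in range(L):
            psi = layer(Ws[l], ws[l], prev, is_output=(l == L - 1))
            pen += np.abs(us[s][l] - psi).sum()       # |u^{s,l} - psi_l(z^l, u^{s,l-1})|
            prev = us[s][l]                           # the penalty chains the VARIABLES
        loss += np.sum((ys[s] - us[s][-1]) ** 2)
    return loss / S + rho * pen
```

When `us` is set to the exact forward pass, the penalty vanishes and $\zeta_\rho$ reduces to the plain least-squares loss; the algorithms alluded to above exploit this equivalence at stationarity.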
Examples of Non-Problems II: Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation; ω2: uncertain demand for electricity;
x: production from traditional thermal plants; y: production from backup fast-ramping generators.

$$\operatorname*{minimize}_{x \,\in\, X} \; \varphi(x) + \mathbb{E}_{\omega_1, \omega_2}\big[ \, \psi(x, \omega_1, \omega_2) \, \big]$$

where the recourse function is

$$\psi(x, \omega_1, \omega_2) \,\triangleq\, \operatorname*{minimum}_{y} \; y^\top c\big( [\, \omega_2 - x - \omega_1 \,]_+ \big) \quad \text{subject to} \quad A(\omega) x + D y \ge \xi(\omega)$$

The unit cost of the fast-ramping generators depends on
– the observed renewable production and demand for electricity;
– the shortfall due to the first-stage thermal power decisions.
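A toy Monte-Carlo (sample-average approximation) sketch of this two-stage model, under strong simplifying assumptions of ours: scalar decisions, a constant backup unit cost `c` (in the talk's model the cost depends on the observed shortfall), a linear first-stage cost, and no side constraints:

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.uniform(0.0, 2.0, size=10_000)   # omega_1: renewable generation samples
w2 = rng.uniform(3.0, 5.0, size=10_000)   # omega_2: electricity demand samples

def recourse(x, w1, w2, c=5.0):
    """Simplified recourse psi: backup covers the shortfall [w2 - x - w1]_+ at unit cost c."""
    return c * np.maximum(w2 - x - w1, 0.0)

def saa_objective(x, w1, w2):
    """phi(x) + sample average of psi(x, w1, w2), with phi(x) = x (unit thermal cost)."""
    return x + recourse(x, w1, w2).mean()

grid = np.linspace(0.0, 6.0, 121)
x_star = grid[int(np.argmin([saa_objective(x, w1, w2) for x in grid]))]
```

The grid minimizer balances cheap thermal capacity against the expensive backup: it commits more thermal production than the mean net demand precisely because the backup cost is five times larger.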
Comments

• Leads to the study of the class of two-stage, linearly bi-parameterized stochastic programs with quadratic recourse:

$$\operatorname*{minimize}_{x} \; \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_\xi\big[ \, \psi(x, \xi) \, \big] \quad \text{subject to} \quad x \in X \subseteq \mathbb{R}^{n_1}$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

$$\psi(x, \xi) \,\triangleq\, \operatorname*{minimum}_{y} \; \big[ f(\xi) + G(\xi) x \big]^\top y + \tfrac{1}{2} \, y^\top Q \, y \quad \text{subject to} \quad y \in Y(x, \xi) \,\triangleq\, \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi) x + D y \ge b(\xi) \,\big\}$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a combined regularization-convexification-sampling method for computing a generalized critical point with an almost-sure convergence guarantee.
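In the scalar case ($n_1 = n_2 = 1$, $Q = q > 0$, a single inequality constraint with $d > 0$), the recourse value has a closed form: project the unconstrained QP minimizer onto the feasible half-line. A sketch with illustrative data of our own, checked against brute-force search:

```python
import numpy as np

def psi(x, f, g, q, a, d, b):
    """Scalar quadratic recourse  min_y (f + g*x)*y + 0.5*q*y**2
       subject to  a*x + d*y >= b   (q > 0, d > 0), in closed form."""
    lo = (b - a * x) / d                 # feasibility lower bound on y
    y = max(-(f + g * x) / q, lo)        # project the unconstrained minimizer
    return (f + g * x) * y + 0.5 * q * y * y

# sanity check against a brute-force grid search over the feasible half-line
f, g, q, a, d, b = 1.0, -2.0, 1.0, 1.0, 1.0, 0.5
x = 0.3
ys = np.linspace((b - a * x) / d, 10.0, 200_001)
brute = ((f + g * x) * ys + 0.5 * q * ys**2).min()
```

The bi-parameterization is visible here: `x` enters both the linear cost `f + g*x` and the constraint bound, which is the source of the implicit convex-concave structure noted above.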
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize $c^\top x$ subject to $Ax = b$ and $x \ge 0$.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

$$\operatorname*{minimize}_{x \,\ge\, 0} \; (c + \delta_c)^\top x + \rho \, \underbrace{\Big\| \big[ \, (A + \delta_A) x - b - \delta_b \, \big]_+ \Big\|_\infty}_{\text{constraint residual}}$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\operatorname*{maximum}_{\delta \,\in\, \Delta} \; \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2 \underbrace{\Big\| \big[ \, (A + \delta_A) x - b - \delta_b \, \big]_+ \Big\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial programming.
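A small numpy sketch evaluating the defender's penalized objective under a disruption $\delta = (\delta_c, \delta_A, \delta_b)$; the data below are our own toy instance:

```python
import numpy as np

def defender_objective(x, c, A, b, dc, dA, db, rho):
    """(c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf,
    i.e. the disrupted cost plus the penalized (one-sided) constraint residual."""
    residual = np.maximum((A + dA) @ x - b - db, 0.0)
    return float((c + dc) @ x + rho * residual.max())

c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([2.0])
x = np.array([1.0, 1.0])   # feasible for the nominal problem: A @ x = b

nominal  = defender_objective(x, c, A, b, np.zeros(2), np.zeros((1, 2)), np.zeros(1), 10.0)
attacked = defender_objective(x, c, A, b, np.zeros(2), np.zeros((1, 2)), np.array([-1.0]), 10.0)
```

Without an attack the residual vanishes and the objective is the plain cost $c^\top x$; shifting the right-hand side makes the fixed `x` infeasible and the residual term kicks in, which is exactly the lever the attacker maximizes.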
Robustification of nonconvex problems

A general nonlinear program:

$$\operatorname*{minimize}_{x \,\in\, S} \; \theta(x) + f(x) \quad \text{subject to} \quad H(x) = 0$$

where
• S is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in $\mathbb{R}^n$ containing S;
• $H : O \to \mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\operatorname*{minimize}_{x \,\in\, S} \; \theta(x) + \operatorname*{maximum}_{\delta \,\in\, \Delta} f(x, \delta) \quad \text{subject to} \quad \underbrace{H(x, \delta) = 0 \;\; \forall \, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}$$
New formulation

via a pessimistic value-function minimization: given γ > 0,

$$\operatorname*{minimize}_{x \,\in\, S} \; \theta(x) + v_\gamma(x)$$

where $v_\gamma(x)$ models constraint violation with the cost of model attacks:

$$v_\gamma(x) \,\triangleq\, \operatorname*{maximum}_{\delta \,\in\, \Delta} \; \varphi_\gamma(x, \delta) \,\triangleq\, f(x, \delta) + \gamma \, \underbrace{\| H(x, \delta) \|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:
• a combined perturbation δ in a joint uncertainty set $\Delta \subseteq \mathbb{R}^L$;
• robustified constraint satisfaction via penalization.
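When Δ is a finite set, $v_\gamma$ can be evaluated by enumeration. A sketch with toy ingredients of our own choosing (`f`, `H`, `C` below are illustrative, not the talk's):

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    """Pessimistic value function over a finite uncertainty set:
       max over delta of  f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)."""
    return max(f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in deltas)

# toy ingredients, for illustration only
f = lambda x, d: (1.0 + d) * x        # perturbed objective
H = lambda x, d: np.array([x - d])    # perturbed equality constraint x = d
C = lambda d: d ** 2                  # attack cost

val = v_gamma(1.0, [0.0, 0.5, 1.0], f, H, C, gamma=2.0)
```

The attack cost matters: at $x = 1$ the largest perturbation $\delta = 1$ zeroes the residual but pays $C(1) = 1$, so the pessimistic maximum is attained at the cost-free $\delta = 0$ instead.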
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

$$\operatorname*{minimize}_{x \,\in\, S} \; \varphi(x) \,\triangleq\, \theta(x) + v(x), \quad \text{with} \quad v(x) \,\triangleq\, \max_{1 \le j \le J} \; \operatorname*{maximum}_{\delta \,\in\, \Delta \cap \mathbb{R}^L_+} \big[ \, g_j(x, \delta) - h_j(x)^\top \delta \, \big]$$

• Implemented the algorithm and reported computational results demonstrating the viability of the robust paradigm in the "non"-framework.
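The majorization-minimization idea can be illustrated on a one-dimensional difference-of-convex toy (our example, not the talk's unified formulation): keep the convex part, linearize the concave part at the current iterate, and minimize the resulting convex surrogate.

```python
# MM for the toy dc objective
#   phi(x) = g(x) - h(x),  g(x) = x**2 (convex),  h(x) = 4*|x - 1| (convex);
# linearizing h at the iterate xk gives the convex surrogate
#   g(x) - h(xk) - s*(x - xk),  s a subgradient of h at xk,
# whose exact minimizer is x = s/2.

def h_sub(x):
    """A subgradient of h(x) = 4*|x - 1|."""
    return 4.0 if x > 1 else -4.0 if x < 1 else 0.0

def phi(x):
    return x * x - 4.0 * abs(x - 1.0)

x = 0.0
history = [phi(x)]
for _ in range(20):
    x = h_sub(x) / 2.0          # minimize the surrogate exactly
    history.append(phi(x))
# the surrogate touches phi at xk and majorizes it everywhere, so phi never increases
```

From $x_0 = 0$ the iteration jumps to $x = -2$ and stays there, a directional stationary point (here also the global minimizer, with $\varphi(-2) = -8$).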
Examples of Non-Problems III: Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval conditional value-at-risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}_\alpha^\beta(Z) \,\triangleq\, \frac{1}{\beta - \alpha} \int_{\text{VaR}_\beta(Z) \,>\, z \,\ge\, \text{VaR}_\alpha(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta - \alpha} \big[ \, (1 - \alpha) \, \text{CVaR}_\alpha(Z) - (1 - \beta) \, \text{CVaR}_\beta(Z) \, \big]$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$\text{bPOE}(Z, \tau) \,\triangleq\, 1 - \min\big\{\, \alpha \in [0, 1] : \text{CVaR}_\alpha(Z) \ge \tau \,\big\}, \qquad \text{bPOD}(Z, u) \,\triangleq\, \min\big\{\, \beta \in [0, 1] : \text{In-CVaR}_0^\beta(Z) \ge u \,\big\}$$
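These tail functionals are straightforward to estimate from samples. A numpy sketch using the simple "average of the worst $(1-\alpha)$ fraction" estimator of CVaR (our choice of estimator; the talk does not prescribe one), together with the In-CVaR identity stated above:

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the largest (1 - alpha) fraction of samples."""
    z = np.sort(np.asarray(z, dtype=float))
    k = max(1, int(np.ceil((1.0 - alpha) * z.size)))
    return z[-k:].mean()

def in_cvar(z, alpha, beta):
    """In-CVaR via the identity [(1 - a) CVaR_a - (1 - b) CVaR_b] / (b - a)."""
    return ((1.0 - alpha) * cvar(z, alpha) - (1.0 - beta) * cvar(z, beta)) / (beta - alpha)

rng = np.random.default_rng(1)
z = rng.normal(size=50_000)   # standard normal losses
```

For standard normal losses, CVaR at level 0.95 is about 2.06 and at level 0.5 about 0.80, and In-CVaR between the two levels averages the losses lying between the corresponding quantiles.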
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

$$\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \,\triangleq\, \text{bPOI}_\tau^u(Z)$$

Two risk-minimization problems of a composite random functional:

$$\operatorname*{minimize}_{\theta \,\in\, \Theta} \; \underbrace{\text{In-CVaR}_\alpha^\beta\big( Z(\theta) \big)}_{\text{a difference of two convex SP value functions}} \qquad \text{and} \qquad \operatorname*{minimize}_{\theta \,\in\, \Theta} \; \underbrace{\text{bPOI}_\tau^u\big( Z(\theta) \big)}_{\text{a stochastic program with a difference-of-convex objective}}$$

where

$$Z(\theta) \,\triangleq\, c\big( \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet, z) \,-\, h(\bullet, z) \text{ is difference-of-convex}} \big)$$

with $c : \mathbb{R} \to \mathbb{R}$ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, $g(\bullet, z)$ and $h(\bullet, z)$ both convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\operatorname*{minimum}_{\eta_1 \,\in\, \Upsilon_1} \; \mathbb{E}_\omega\big[ \, \varphi_1(\theta, \eta_1, \omega) \, \big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\operatorname*{minimum}_{\eta_2 \,\in\, \Upsilon_2} \; \mathbb{E}_\omega\big[ \, \varphi_2(\theta, \eta_2, \omega) \, \big]}_{\text{denoted } v_2(\theta)}$$

Both $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of all, what can be computed?

[Diagram: nested stationarity concepts — critical (for dc functions) ⊇ Clarke stationary ⊇ limiting stationary ⊇ directional stationary (for dd functions), alongside SONC, local minimum, and SOSC]

Relationship between the stationary points:

– Stationarity: in the smooth case, ∇V(x) = 0; in the nonsmooth case, 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g. ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the functions is continuously differentiable.
– Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder to compute.
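The gap between the concepts is visible already in one dimension (our standard example): for $f(x) = -|x|$, the Clarke subdifferential at 0 is $[-1, 1] \ni 0$, so $x = 0$ is Clarke stationary, yet $f'(0; d) = -|d| < 0$ for every $d \ne 0$, so it is not directional stationary — every direction descends.

```python
def f(x):
    # Clarke stationary at 0 (0 is in the Clarke subdifferential [-1, 1]),
    # but NOT directional stationary: f'(0; d) = -|d| < 0 for all d != 0
    return -abs(x)

def dir_deriv(f, x, d, t=1e-8):
    """One-sided directional derivative f'(x; d) by a forward difference."""
    return (f(x + t * d) - f(x)) / t

print(dir_deriv(f, 0.0, 1.0))    # -1.0
print(dir_deriv(f, 0.0, -1.0))   # -1.0
```

Both one-sided derivatives are negative, confirming that the sharper (directional) concept correctly rejects this spurious "stationary" point.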
Computational Tools

– Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
– A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: the majorization-minimization algorithm;
  – dealing with non-smoothness: nonsmooth Newton algorithms;
  – dealing with pathological constraints: (exact) penalty methods.
– An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• piecewise smooth*, e.g. piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand-differentiable: locally Lipschitz + directionally differentiable;
• subanalytic and semianalytic.

* and many more
[Diagram from Section 4.4 of the monograph (p. 231): inclusion relationships among the classes of nonsmooth functions — PC², PC¹, piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), piecewise affine (PA), dc and dc-composite, diff-max-convex, implicit convex-concave (icc), (locally) strongly weakly convex, locally dc, lower-C², semismooth, B-differentiable, and locally Lipschitz + directionally differentiable functions, with the inclusions holding under various differentiability and convex-domain qualifications]
Completed Monograph

Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)

Table of Contents

Preface
Acronyms
Glossary of Notation
Index
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683
Bibliography 691
Index 721
This talk: benefits, yes or no? Ready for the "non"-era, yes or no?

A little humor before the storm.

A: Are you satisfied with the comfort zone of convexity and differentiability?
– Yes → you will not learn from this talk; live happily ever after.
– No → B: How do you deal with non-convexity and non-smoothness?
  – C1: Ignore these properties, or give them minimal (or even wrong) treatment.
  – C2: Focus on stylized problems (and there are many interesting ones) and stop at them.
  – C3: Faithfully treat these non-properties; then either:
    – D1: Employ heuristics, potentially leading to heuristics on heuristics when complex models are being formulated; or
    – D2: The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features

Non-Smooth + Non-Convex + Non-Deterministic + Non-Separable + Non-Stylistic + Non-Interbreeding, and many more non-topics ⇒ Non-Problems
Examples of Non-Problems I: Structured Learning

Piecewise Affine Regression

extending classical affine regression:

$$\operatorname*{minimize}_{a, \alpha, b, \beta} \; \frac{1}{N} \sum_{s=1}^N \Big( y_s - \Big[ \underbrace{\max_{1 \le i \le k_1} \big( (a^i)^\top x_s + \alpha_i \big) \;-\; \max_{1 \le i \le k_2} \big( (b^i)^\top x_s + \beta_i \big)}_{\text{capturing all piecewise affine functions}} \Big] \Big)^2$$

illustrating coupled non-convexity and non-differentiability: estimate change points/hyperplanes in linear regression.
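A numpy sketch of the model and its loss (the generating parameters below are our own illustration): every piecewise affine function is a difference of two max-affine functions, so data generated by such a function yields zero loss at the true parameters.

```python
import numpy as np

def pa_model(a, alpha, b, beta, X):
    """Difference of two max-affine functions: captures all piecewise affine maps."""
    return (X @ a.T + alpha).max(axis=1) - (X @ b.T + beta).max(axis=1)

def pa_loss(a, alpha, b, beta, X, y):
    """(1/N) * sum_s ( y_s - [max_i a_i.x + alpha_i  -  max_i b_i.x + beta_i] )^2"""
    r = y - pa_model(a, alpha, b, beta, X)
    return np.mean(r ** 2)

# data generated exactly by a known piecewise affine function
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
a_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # k1 = 2 pieces
alpha_true = np.array([0.0, 0.5])
b_true = np.array([[0.0, 0.0, 1.0]])                    # k2 = 1 piece
beta_true = np.array([0.0])
y = pa_model(a_true, alpha_true, b_true, beta_true, X)
```

The coupled non-convexity and non-differentiability live entirely in the inner max terms; the outer square is the convex quadratic being composed with the piecewise affine map, which is what makes the objective piecewise linear-quadratic.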
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Logical Sparsity Learning
minimizeβ isinRd
f(β)
subject todsumj=1
aij |βj |0 le bi i = 1 middot middot middot m
where | t |0
1 if t 6= 0
0 otherwise
bull strongweak hierarchy eg |βj |0 le |βk |0
bull cost-based variable selection coefficients ge 0 but not all equal
bull hierarchicalgroup variable selection
Comments
bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative
bull Soft formulation of ASCs is possible leading to an objective of the kind
minimizeβ isinRd
f(β) + γ
msumi=1
bi minus dsumj=1
aij |βj |0
+
for a given γ gt 0
bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective
bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints
Comments
bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative
bull Soft formulation of ASCs is possible leading to an objective of the kind
minimizeβ isinRd
f(β) + γ
msumi=1
bi minus dsumj=1
aij |βj |0
+
for a given γ gt 0
bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective
bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems
minimizeθ=
(W `w`)
L
`=1
1
S
Ssums=1
[ys minus fθ(xs)
]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation
fθ(x) =(WL bull+wL
)︸ ︷︷ ︸Output Layer
middot middot middot max(
0W 2 bull+w2)
︸ ︷︷ ︸Second Layer
max(
0W 1x+ w1)
︸ ︷︷ ︸Input Layer
linearoutput middot middot middot max linear max linear input
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the $\ell_1$ penalization (with auxiliary layer states $u^{s,\ell}$, layer maps $\psi^\ell$, and layer variables $z^\ell$; the penalty sums over the samples $s$ as well):

\[
\min_{z \in Z,\, u} \; \zeta_\rho(z, u) \;\triangleq\; \frac{1}{S} \sum_{s=1}^{S} \varphi_s\big( u^{s,L} \big) \;+\; \rho \sum_{s=1}^{S} \sum_{\ell=1}^{L} \sum_{j=1}^{N} \underbrace{\Big|\, u^{s,\ell}_j - \psi^\ell_j\big( z^\ell, u^{s,\ell-1} \big) \Big|}_{\text{penalizing the components}}.
\]

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• It leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
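A sketch of how the penalized objective $\zeta_\rho$ could be evaluated, under simplifying assumptions of mine (not the slides'): every layer map is $\psi^\ell(z^\ell, u) = \max(0, W^\ell u + w^\ell)$ and the per-sample loss is $\varphi_s(u) = (y_s - u_1)^2$:

```python
import numpy as np

def zeta_rho(X, y, Ws, ws, U, rho):
    # zeta_rho(z, u) = (1/S) sum_s phi_s(u^{s,L})
    #                + rho * sum_{s,l,j} | u^{s,l}_j - psi_l(z^l, u^{s,l-1})_j |
    # U[s][l] holds the auxiliary state u^{s,l}; psi is ReLU-affine here.
    S, loss, pen = len(X), 0.0, 0.0
    for s in range(S):
        u_prev = X[s]
        for l, (W, w) in enumerate(zip(Ws, ws)):
            psi = np.maximum(0.0, W @ u_prev + w)
            pen += np.sum(np.abs(U[s][l] - psi))  # l1 residual of the splitting
            u_prev = U[s][l]
        loss += (y[s] - U[s][-1][0]) ** 2         # phi_s applied to the last state
    return loss / S + rho * pen

# exact forward states make the penalty vanish, recovering the plain loss
Ws = [np.array([[1.0, -1.0], [0.0, 1.0]]), np.array([[1.0, 1.0]])]
ws = [np.zeros(2), np.zeros(1)]
X = [np.array([1.0, 2.0]), np.array([2.0, 0.0])]
y = np.array([2.0, 0.0])
U = []
for x in X:
    states, u = [], x
    for W, w in zip(Ws, ws):
        u = np.maximum(0.0, W @ u + w)
        states.append(u)
    U.append(states)
print(zeta_rho(X, y, Ws, ws, U, rho=10.0))  # 2.0
```

With the exact forward states the $\ell_1$ term is zero; an algorithm would instead treat $u$ as free variables and let the penalty enforce the layer equations only approximately along the way.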
Examples of Non-Problems
II Smart Planning
Power Systems Planning

$\omega_1$: uncertain renewable energy generation; $\omega_2$: uncertain demand of electricity;
$x$: production from traditional thermal plants; $y$: production from backup fast-ramping generators.

\[
\min_{x \in X} \; \varphi(x) + \mathbb{E}_{\omega_1, \omega_2}\big[\, \psi(x, \omega_1, \omega_2) \,\big],
\]

where the recourse function is

\[
\psi(x, \omega_1, \omega_2) \;\triangleq\; \min_{y} \; y^\top c\big(\, [\,\omega_2 - x - \omega_1\,]_+ \,\big) \quad \text{subject to} \quad A(\omega)\,x + D\,y \;\ge\; \xi(\omega).
\]

The unit cost of the fast-ramping generators depends on
• the observed renewable production and demand of electricity;
• the shortfall due to the first-stage thermal power decisions.
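For a single scenario, the recourse value $\psi$ is just a linear program whose cost vector depends on the realized shortfall. A minimal sketch with invented small data (the affine markup `1 + 2*shortfall` standing in for the cost map $c(\cdot)$ is my illustrative choice):

```python
import numpy as np
from scipy.optimize import linprog

def recourse(x, w1, w2, A, D, xi):
    # psi(x, w1, w2) = min_y  c(shortfall)^T y  s.t.  A x + D y >= xi, y >= 0
    # unit cost grows with the shortfall [w2 - x - w1]_+  (illustrative choice)
    shortfall = np.maximum(w2 - x - w1, 0.0)
    cost = 1.0 + 2.0 * shortfall                # c([w2 - x - w1]_+), componentwise
    # linprog uses A_ub y <= b_ub, so flip  D y >= xi - A x
    res = linprog(cost, A_ub=-D, b_ub=-(xi - A @ x),
                  bounds=[(0, None)] * D.shape[1])
    assert res.success
    return res.fun

x  = np.array([1.0, 1.0])       # first-stage thermal production
w1 = np.array([0.5, 0.0])       # realized renewable generation
w2 = np.array([2.0, 1.0])       # realized demand
A, D = np.eye(2), np.eye(2)
xi = w2                          # here the constraint reads x + y >= w2
print(round(recourse(x, w1, w2, A, D, xi), 6))  # 2.0
```

Averaging this value over sampled $(\omega_1, \omega_2)$ pairs gives a sample-average estimate of the expectation in the first-stage objective.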
Comments

• This leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

\[
\min_{x} \; \zeta(x) \;\triangleq\; \varphi(x) + \mathbb{E}_{\xi}\big[\, \psi(x, \xi) \,\big] \quad \text{subject to} \quad x \in X \subseteq \mathbb{R}^{n_1},
\]

where the recourse function $\psi(x, \xi)$ is the optimal objective value of the quadratic program (QP), with $Q$ symmetric positive semidefinite:

\[
\psi(x, \xi) \;\triangleq\; \min_{y} \; \big[\, f(\xi) + G(\xi)\,x \,\big]^\top y + \tfrac{1}{2}\, y^\top Q\, y \quad \text{subject to} \quad y \in Y(x, \xi) \;\triangleq\; \big\{\, y \in \mathbb{R}^{n_2} \;\big|\; A(\xi)\,x + D\,y \ge b(\xi) \,\big\}.
\]

• The recourse function $\psi(\bullet, \xi)$ is of the implicit convex-concave kind.

• A combined regularization-convexification-sampling method was developed for computing a generalized critical point, with an almost-sure convergence guarantee.
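The bi-parameterization is visible in code: $x$ enters both the linear cost $[f(\xi)+G(\xi)x]$ and the constraint right-hand side. A sketch for one frozen scenario, solving the convex QP with a generic solver (all data invented; a dedicated QP solver would be used in practice):

```python
import numpy as np
from scipy.optimize import minimize

def psi(x, f, G, Q, A, D, b):
    # psi(x, xi) = min_y [f(xi) + G(xi) x]^T y + 0.5 y^T Q y
    #              s.t. A(xi) x + D y >= b(xi)      (Q symmetric PSD)
    lin = f + G @ x                                  # linear term, bi-parameterized in x
    obj = lambda y: lin @ y + 0.5 * y @ Q @ y
    cons = {"type": "ineq", "fun": lambda y: A @ x + D @ y - b}
    res = minimize(obj, np.zeros(len(lin)), constraints=[cons])
    return res.fun

# one scenario xi, frozen into small matrices: min y + y^2 s.t. y >= 1
f = np.array([0.0]); G = np.array([[1.0]]); Q = np.array([[2.0]])
A = np.array([[0.0]]); D = np.array([[1.0]]); b = np.array([1.0])
print(round(psi(np.array([1.0]), f, G, Q, A, D, b), 4))  # ~2.0, attained at y = 1
```

Because $x$ moves both the cost and the feasible set, $\psi(\bullet, \xi)$ is neither convex nor concave in $x$ in general, which is exactly the implicit convex-concave structure noted above.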
(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize $c^\top x$ subject to $Ax = b$ and $x \ge 0$.

Defender's problem. Anticipating the attacker's disruption $\delta \in \Delta$, the defender undertakes the operation by solving

\[
\min_{x \ge 0} \; (c + \delta_c)^\top x \;+\; \rho\, \underbrace{\big\|\, [\, (A + \delta_A)\,x - b - \delta_b \,]_+ \big\|_\infty}_{\text{constraint residual}}.
\]

Attacker's problem. Anticipating the defender's activity $x$, the attacker aims to disrupt the defender's operations by solving

\[
\max_{\delta \in \Delta} \; \lambda_1\, \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} \;+\; \lambda_2\, \underbrace{\big\|\, [\, (A + \delta_A)\,x - b - \delta_b \,]_+ \big\|_\infty}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}.
\]

• Many variations are possible.
• These offer alternative approaches to the convexity-based robust formulations.
• "Adversarial programming."
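The defender's penalized objective is straightforward to evaluate; a small sketch with invented data shows how an attack on a single matrix entry propagates into the residual term:

```python
import numpy as np

def defender_objective(x, c, A, b, dc, dA, db, rho):
    # (c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf
    residual = np.maximum((A + dA) @ x - b - db, 0.0)
    return float((c + dc) @ x + rho * np.max(residual))

c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([2.0])
x = np.array([1.0, 1.0])                     # feasible for A x = b
zeros = (np.zeros(2), np.zeros((1, 2)), np.zeros(1))
print(defender_objective(x, c, A, b, *zeros, rho=10.0))   # 3.0: no attack, no residual
dA = np.array([[1.0, 0.0]])                  # attacker perturbs one coefficient
print(defender_objective(x, c, A, b, np.zeros(2), dA, np.zeros(1), rho=10.0))  # 13.0
```

The attacker's payoff reuses the same two pieces with weights $\lambda_1, \lambda_2$ minus the attack cost $C(\delta)$, which is what couples the two problems into a game.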
Robustification of nonconvex problems

A general nonlinear program:

\[
\min_{x \in S} \; \theta(x) + f(x) \quad \text{subject to} \quad H(x) = 0,
\]

where
• $S$ is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• $\theta$ and $f$ are real-valued functions defined on an open convex set $O$ in $\mathbb{R}^n$ containing $S$;
• $H : O \to \mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints $g(x) \le 0$ and equality constraints $h(x) = 0$.

Robust counterpart with separate perturbations:

\[
\min_{x \in S} \; \theta(x) + \max_{\delta \in \Delta} f(x, \delta) \quad \text{subject to} \quad \underbrace{H(x, \delta) = 0 \;\; \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}.
\]
New formulation

via a pessimistic value-function minimization: given $\gamma > 0$,

\[
\min_{x \in S} \; \theta(x) + v_\gamma(x),
\]

where $v_\gamma(x)$ models constraint violation with a cost of model attacks:

\[
v_\gamma(x) \;\triangleq\; \max_{\delta \in \Delta} \; \varphi_\gamma(x, \delta) \;\triangleq\; f(x, \delta) \;+\; \gamma\, \underbrace{\big\| H(x, \delta) \big\|_\infty}_{\text{constraint residual}} \;-\; \underbrace{C(\delta)}_{\text{attack cost}},
\]

with the novel features:
• a combined perturbation $\delta$ in a joint uncertainty set $\Delta \subseteq \mathbb{R}^L$;
• robustified constraint satisfaction via penalization.
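When $\Delta$ is replaced by a finite sample of attack scenarios, $v_\gamma$ is a plain maximum and can be evaluated directly. A toy sketch (the scalar instance $f$, $H$, $C$ below is entirely my invention):

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    # v_gamma(x) = max_{delta in Delta} f(x, delta) + gamma*||H(x, delta)||_inf - C(delta)
    vals = [f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in deltas]
    return float(max(vals))

# toy instance: f(x,d) = d*x,  H(x,d) = x - 1 - d (scalar),  C(d) = d^2
f = lambda x, d: d * x
H = lambda x, d: np.array([x - 1.0 - d])
C = lambda d: d ** 2
deltas = np.linspace(0.0, 1.0, 101)   # finite surrogate for Delta
print(round(v_gamma(1.0, deltas, f, H, C, gamma=2.0), 4))  # 2.0, attained at delta = 1
```

Note how the penalty $\gamma\|H\|_\infty$ replaces the hard requirement $H(x,\delta)=0$ for all $\delta$, which is what rescues feasibility in the robustified model.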
Comments

• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of $f(\bullet, \delta)$ and $H(\bullet, \delta)$, a linear structure of $C(\delta)$ with a set-up component, and a special simplex-type uncertainty set $\Delta$.

• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

\[
\min_{x \in S} \; \varphi(x) \;\triangleq\; \theta(x) + v(x), \quad \text{with} \quad v(x) \;\triangleq\; \max_{1 \le j \le J} \; \max_{\delta \in \Delta \cap \mathbb{R}^L_+} \; \big[\, g_j(x, \delta) - h_j(x)^\top \delta \,\big].
\]

• Implemented the algorithm and reported computational results demonstrating the viability of the robust paradigm in the "non"-framework.
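The MM idea can be sketched on the simplest difference-of-convex instance: to minimize $g - h$ with $g, h$ convex, majorize $-h$ at the current iterate by its linearization and minimize the resulting convex surrogate. A one-dimensional illustration (grid search stands in for the convex subproblem solver; the instance $x^2 - |x|$ is my own example):

```python
import numpy as np

def mm_dc(g, grad_h, x0, iters=50):
    # majorization-minimization for f = g - h with g, h convex:
    # at x_k, majorize f by  g(x) - h(x_k) - grad_h(x_k) * (x - x_k)
    # and minimize the convex surrogate exactly (here by fine grid search).
    x = x0
    grid = np.linspace(-2.0, 2.0, 40001)
    for _ in range(iters):
        s = grad_h(x)                             # subgradient of h at x_k
        x = grid[np.argmin(g(grid) - s * grid)]   # argmin of the surrogate
    return float(x)

g = lambda x: x ** 2                  # convex part
grad_h = lambda x: np.sign(x)         # a subgradient of h(x) = |x|
x_star = mm_dc(g, grad_h, x0=1.0)
print(round(x_star, 4))               # 0.5: a stationary point of x^2 - |x|
```

Each surrogate sits above the true objective and touches it at $x_k$, so the iterates descend monotonically; the limit is a critical point, not necessarily a global minimizer, mirroring the stationarity-based guarantees discussed throughout.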
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval conditional value-at-risk of a random variable $Z$ at levels $0 \le \alpha < \beta \le 1$, treating two-sided estimation errors:

\[
\text{In-CVaR}^{\beta}_{\alpha}(Z) \;\triangleq\; \frac{1}{\beta - \alpha} \int_{\text{VaR}_\beta(Z) > z \ge \text{VaR}_\alpha(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta - \alpha} \big[\, (1 - \alpha)\,\text{CVaR}_\alpha(Z) - (1 - \beta)\,\text{CVaR}_\beta(Z) \,\big].
\]

Buffered probability of exceedance/deceedance at level $\tau > 0$:

\[
\text{bPOE}(Z, \tau) \;\triangleq\; 1 - \min\big\{\, \alpha \in [0, 1] \;:\; \text{CVaR}_\alpha(Z) \ge \tau \,\big\}, \qquad
\text{bPOD}(Z, u) \;\triangleq\; \min\big\{\, \beta \in [0, 1] \;:\; \text{In-CVaR}^{\beta}_{0}(Z) \ge u \,\big\}.
\]
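On a finite sample the CVaR identity above is easy to check numerically. A minimal sketch (assuming, for simplicity, that $n(1-\alpha)$ is integral so the empirical tail mean needs no interpolation):

```python
import numpy as np

def cvar(z, alpha):
    # empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of the samples
    z = np.sort(np.asarray(z, float))[::-1]
    k = int(round(len(z) * (1.0 - alpha)))     # assumes n*(1 - alpha) is integral
    return float(z[:k].mean())

def in_cvar(z, alpha, beta):
    # In-CVaR^beta_alpha = [ (1-alpha) CVaR_alpha - (1-beta) CVaR_beta ] / (beta - alpha)
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 11.0)            # samples 1, ..., 10
print(cvar(z, 0.5))                 # 8.0: mean of the top half, 6..10
print(in_cvar(z, 0.5, 0.9))         # 7.5: mean of 6..9, the top decile trimmed away
```

The trimming of the extreme $\beta$-tail is what gives In-CVaR its robustness to outliers in estimation, at the price of the difference-of-convex (rather than convex) structure noted on the next slide.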
Buffered probability of the interval $(\tau, u]$

We have the following bound for an interval probability:

\[
\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \;\triangleq\; \text{bPOI}^{u}_{\tau}(Z).
\]

Two risk minimization problems of a composite random functional:

\[
\min_{\theta \in \Theta} \; \underbrace{\text{In-CVaR}^{\beta}_{\alpha}\big( Z(\theta) \big)}_{\text{a difference of two convex SP value functions}} \qquad \text{and} \qquad \min_{\theta \in \Theta} \; \underbrace{\text{bPOI}^{u}_{\tau}\big( Z(\theta) \big)}_{\text{a stochastic program with a difference-of-convex objective}},
\]

where

\[
Z(\theta) \;\triangleq\; c\big(\, \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet, z)\, -\, h(\bullet, z) \text{ is a difference of convex functions}} \,\big),
\]

with $c : \mathbb{R} \to \mathbb{R}$ a univariate piecewise affine function, $Z$ an $m$-dimensional random vector, and, for each $z$, $g(\bullet, z)$ and $h(\bullet, z)$ both convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over $\theta \in \Theta$

\[
\underbrace{\min_{\eta_1 \in \Upsilon_1} \; \mathbb{E}_\omega\big[\, \varphi_1(\theta, \eta_1, \omega) \,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\min_{\eta_2 \in \Upsilon_2} \; \mathbb{E}_\omega\big[\, \varphi_2(\theta, \eta_2, \omega) \,\big]}_{\text{denoted } v_2(\theta)}.
\]

Both $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function.

• A convex-programming-based sampling method was developed for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of all: what can be computed?

[Diagram: relationships between the stationarity concepts — critical points (for dc functions) ⊇ Clarke stationary ⊇ limiting stationary ⊇ directional stationary points (for dd functions) — alongside the second-order necessary condition (SONC), local minima, and the second-order sufficient condition (SOSC).]

Stationarity:
• smooth case: $\nabla V(x) = 0$;
• nonsmooth case: $0 \in \partial V(x)$ [subdifferential; many calculus rules fail], e.g. $\partial V_1(x) + \partial V_2(x) \neq \partial (V_1 + V_2)(x)$ unless one of them is continuously differentiable;
• directional stationarity: $V'(x; d) \ge 0$ for all directions $d$.

The sharper the concept, the harder it is to compute.
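A numerical illustration of why directional stationarity is the sharper concept (my own toy functions): at the origin, $V_2(x) = -|x|$ has $0 \in \partial_C V_2(0) = [-1, 1]$, so it is Clarke stationary, yet every direction is a descent direction, so it is not directionally stationary; $V_1(x) = |x|$ passes both tests.

```python
import numpy as np

def ddir(V, x, d, t=1e-7):
    # one-sided directional derivative V'(x; d) by a forward difference
    return (V(x + t * d) - V(x)) / t

def is_dstationary(V, x, dirs):
    # x is directionally stationary if V'(x; d) >= 0 for all tested directions d
    return all(ddir(V, x, d) >= -1e-6 for d in dirs)

dirs = [np.array([1.0]), np.array([-1.0])]
V1 = lambda x: abs(x[0])            # V1'(0; d) = |d| >= 0
V2 = lambda x: -abs(x[0])           # V2'(0; d) = -|d| < 0, yet 0 is in the Clarke subdifferential
print(is_dstationary(V1, np.array([0.0]), dirs))   # True
print(is_dstationary(V2, np.array([0.0]), dirs))   # False: Clarke stationary only
```

Algorithms that certify only Clarke or limiting stationarity can thus terminate at points like $x = 0$ for $V_2$ that a directional test would immediately reject.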
Computational Tools

• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

• A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithms;
  – dealing with non-smoothness: nonsmooth Newton algorithms;
  – dealing with pathological constraints: (exact) penalty methods.

• An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• piecewise smooth*, e.g. piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• subanalytic and semianalytic.

* and many more.
[Diagram (monograph Section 4.4, p. 231): inclusion relationships among the classes of nonsmooth functions — e.g., piecewise affine (PA) ⊂ piecewise linear-quadratic (PLQ); PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable; together with dc and dc-composite, difference-max-convex, implicit convex-concave (icc), (locally) strongly weakly convex, locally dc, lower-C², C¹ and LC¹ functions, and locally Lipschitz + directionally differentiable functions.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to the publisher on Friday, September 11, 2020.
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̃c 683

Bibliography 691
Index 721
A Readiness Test

A. Are you satisfied with the comfort zone of convexity and differentiability?
  – Yes → you will not learn from this talk; live happily ever after.
  – No → go to B.

B. How do you deal with non-convexity and non-smoothness?
  – C1. Ignore these properties, or give them minimal or even wrong treatment.
  – C2. Focus on stylized problems (and there are many interesting ones) and stop at them.
  – C3. Faithfully treat these non-properties, which leads to a final choice:
    – D1. Employ heuristics, potentially leading to heuristics on heuristics when complex models are formulated.
    – D2. The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features

Non-Problems = Non-Smooth + Non-Convex + Non-Deterministic + Non-Separable + Non-Stylistic + Non-Interbreeding, and many more non-topics …
Examples of Non-Problems
I Structured Learning
Piecewise Affine Regression

extending classical affine regression:

\[
\min_{a, \alpha, b, \beta} \; \frac{1}{N} \sum_{s=1}^{N} \Big( y_s - \Big[\, \underbrace{\max_{1 \le i \le k_1} \big\{ (a^i)^\top x^s + \alpha_i \big\} - \max_{1 \le i \le k_2} \big\{ (b^i)^\top x^s + \beta_i \big\}}_{\text{capturing all piecewise affine functions}} \,\Big] \Big)^2,
\]

illustrating coupled non-convexity and non-differentiability. Application: estimate change points/hyperplanes in linear regression.
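The inner difference of two max-affine pieces is easy to evaluate; a minimal sketch (my own example, representing $|x| = \max(x, -x)$ with $k_1 = 2$, $k_2 = 1$) shows the model reproducing a kinked function exactly:

```python
import numpy as np

def pa_model(x, a, alpha, b, beta):
    # difference of two max-affine pieces:
    # max_i { a_i^T x + alpha_i }  -  max_i { b_i^T x + beta_i }
    return np.max(a @ x + alpha) - np.max(b @ x + beta)

def mse(X, y, a, alpha, b, beta):
    # (1/N) sum_s ( y_s - model(x^s) )^2
    return float(np.mean([(ys - pa_model(xs, a, alpha, b, beta)) ** 2
                          for xs, ys in zip(X, y)]))

# |x| = max(x, -x) - 0 is piecewise affine: k1 = 2, k2 = 1
a = np.array([[1.0], [-1.0]]); alpha = np.zeros(2)
b = np.array([[0.0]]);         beta  = np.zeros(1)
X = [np.array([v]) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)]
y = np.array([2.0, 1.0, 0.0, 1.0, 2.0])
print(mse(X, y, a, alpha, b, beta))   # 0.0: the model reproduces |x| exactly
```

Fitting $(a, \alpha, b, \beta)$ to data is where the coupled non-convexity and non-differentiability enter: the squared loss is convex, but it is composed with this nonsmooth difference of maxima.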
Comments

• This is a convex quadratic composed with a piecewise affine function; thus it is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming-based algorithm.

• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

\[
\varphi_1\Big( \max_{1 \le i \le k_{11}} \psi^1_{1i}(x) - \max_{1 \le i \le k_{12}} \psi^1_{2i}(x) \Big) \;-\; \varphi_2\Big( \max_{1 \le i \le k_{21}} \psi^2_{1i}(x) - \max_{1 \le i \le k_{22}} \psi^2_{2i}(x) \Big),
\]

where all functions are convex.
Logical Sparsity Learning
minimizeβ isinRd
f(β)
subject todsumj=1
aij |βj |0 le bi i = 1 middot middot middot m
where | t |0
1 if t 6= 0
0 otherwise
bull strongweak hierarchy eg |βj |0 le |βk |0
bull cost-based variable selection coefficients ge 0 but not all equal
bull hierarchicalgroup variable selection
Comments
bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative
bull Soft formulation of ASCs is possible leading to an objective of the kind
minimizeβ isinRd
f(β) + γ
msumi=1
bi minus dsumj=1
aij |βj |0
+
for a given γ gt 0
bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective
bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints
Comments
bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative
bull Soft formulation of ASCs is possible leading to an objective of the kind
minimizeβ isinRd
f(β) + γ
msumi=1
bi minus dsumj=1
aij |βj |0
+
for a given γ gt 0
bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective
bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems
minimizeθ=
(W `w`)
L
`=1
1
S
Ssums=1
[ys minus fθ(xs)
]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation
fθ(x) =(WL bull+wL
)︸ ︷︷ ︸Output Layer
middot middot middot max(
0W 2 bull+w2)
︸ ︷︷ ︸Second Layer
max(
0W 1x+ w1)
︸ ︷︷ ︸Input Layer
linearoutput middot middot middot max linear max linear input
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρ
Lsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Examples of Non-Problems
II Smart Planning
Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

  minimize_{x ∈ X}  φ(x) + E_{ω₁,ω₂}[ ψ(x, ω₁, ω₂) ]

where the recourse function is

  ψ(x, ω₁, ω₂) ≜ minimum_y  yᵀ c( [ω₂ − x − ω₁]₊ )  subject to  A(ω)x + Dy ≥ ξ(ω)

Unit cost of fast-ramping generators depends on
• observed renewable production & demand of electricity
• shortfall due to the first-stage thermal power decisions
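For one realized scenario (ω₁, ω₂), evaluating the recourse function ψ amounts to solving a linear program in y whose cost vector depends on the shortfall [ω₂ − x − ω₁]₊. A minimal sketch on a hypothetical one-generator instance (all data invented; assumes y ≥ 0):

```python
import numpy as np
from scipy.optimize import linprog

def recourse(x, omega1, omega2, A, D, xi, unit_cost):
    # psi(x, w1, w2): cheapest fast-ramping generation y meeting the demand
    # constraint; the per-unit cost depends on the shortfall [w2 - x - w1]_+
    shortfall = np.maximum(omega2 - x - omega1, 0.0)
    c = unit_cost(shortfall)
    # minimize c^T y  subject to  A x + D y >= xi,  y >= 0
    res = linprog(c, A_ub=-D, b_ub=-(xi - A @ x), bounds=[(0, None)] * D.shape[1])
    return res.fun

# hypothetical instance: single constraint x + y >= xi
A = np.array([[1.0]]); D = np.array([[1.0]]); xi = np.array([5.0])
x = np.array([3.0]); w1 = np.array([0.5]); w2 = np.array([4.0])
print(recourse(x, w1, w2, A, D, xi, lambda s: 1.0 + s))  # y* = 2 at unit cost 1.5 -> 3.0
```

The dependence of the cost vector `c` on the first-stage decision x is exactly what makes the overall problem nonconvex in x.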
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

  minimize_x  ζ(x) ≜ φ(x) + E_ξ[ ψ(x, ξ) ]  subject to  x ∈ X ⊆ ℝⁿ¹

  where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

  ψ(x, ξ) ≜ minimum_y  [ f(ξ) + G(ξ)x ]ᵀ y + ½ yᵀQy  subject to  y ∈ Y(x, ξ) ≜ { y ∈ ℝⁿ² | A(ξ)x + Dy ≥ b(ξ) }

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a regularization-convexification-sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize cᵀx subject to Ax = b and x ≥ 0.

Defender's problem: anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

  minimize_{x ≥ 0}  (c + δ_c)ᵀx + ρ ‖ [ (A + δ_A)x − b − δ_b ]₊ ‖_∞   (constraint residual)

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

  maximum_{δ ∈ Δ}  λ₁ (c + δ_c)ᵀx  (disrupt objective)  +  λ₂ ‖ [ (A + δ_A)x − b − δ_b ]₊ ‖_∞  (disrupt constraints)  −  C(δ)  (cost)

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming
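The defender's penalized objective is cheap to evaluate for a given disruption; a small sketch with invented data (the instance below is hypothetical, chosen only to exercise the formula):

```python
import numpy as np

def defender_objective(x, c, A, b, delta_c, delta_A, delta_b, rho):
    # (c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf
    residual = np.maximum((A + delta_A) @ x - b - delta_b, 0.0)
    return (c + delta_c) @ x + rho * residual.max()

# hypothetical instance: one equality constraint x1 + x2 = 2, attacked via b
c = np.array([1.0, 2.0]); A = np.array([[1.0, 1.0]]); b = np.array([2.0])
dc = np.zeros(2); dA = np.zeros((1, 2)); db = np.array([-1.0])
x = np.array([1.0, 1.0])
print(defender_objective(x, c, A, b, dc, dA, db, rho=10.0))  # 3 + 10*1 = 13.0
```

Note that the ∞-norm of the positive part is a nonsmooth, piecewise linear function of x, so even this simplified defender problem is a nondifferentiable (though convex in x for fixed δ) program.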
Robustification of nonconvex problems

A general nonlinear program:

  minimize_{x ∈ S}  θ(x) + f(x)  subject to  H(x) = 0

where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H : O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

  minimize_{x ∈ S}  θ(x) + maximum_{δ ∈ Δ} f(x, δ)  subject to  H(x, δ) = 0 for all δ ∈ Δ   (easily infeasible in the current way of modeling)
New formulation

via a pessimistic value-function minimization: given γ > 0,

  minimize_{x ∈ S}  θ(x) + v_γ(x)

where v_γ(x) models constraint violation together with the cost of model attacks:

  v_γ(x) ≜ maximum_{δ ∈ Δ}  φ_γ(x, δ) ≜ f(x, δ) + γ ‖ H(x, δ) ‖_∞  (constraint residual)  −  C(δ)  (attack cost)

with the novel features:
• combined perturbation δ in a joint uncertainty set Δ ⊆ ℝᴸ;
• robustified constraint satisfaction via penalization.
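When Δ is a finite scenario set, v_γ(x) is a plain maximum and can be evaluated directly; a minimal sketch (finite Δ and the scalar functions below are simplifying assumptions, not from the talk):

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    # pessimistic value function: worst case over a (here: finite) joint
    # uncertainty set, penalizing the constraint residual and charging
    # the attacker's cost C(delta)
    return max(f(x, d) + gamma * np.linalg.norm(np.atleast_1d(H(x, d)), np.inf) - C(d)
               for d in deltas)

# hypothetical scalar instance
f = lambda x, d: d * x          # perturbed objective
H = lambda x, d: x - d          # perturbed equality-constraint residual
C = lambda d: d ** 2            # attack cost
print(v_gamma(1.0, [0.0, 1.0], f, H, C, gamma=1.0))  # max(0 + 1 - 0, 1 + 0 - 1) = 1.0
```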
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

  minimize_{x ∈ S}  φ(x) ≜ θ(x) + v(x)  with  v(x) ≜ max_{1 ≤ j ≤ J}  maximum_{δ ∈ Δ ∩ ℝᴸ₊}  [ g_j(x, δ) − h_j(x)ᵀδ ]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

  In-CVaR_α^β(Z) ≜ (1/(β − α)) ∫_{VaR_α(Z) ≤ z < VaR_β(Z)} z dF_Z(z)
               = (1/(β − α)) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ]

Buffered probability of exceedance/deceedance at level τ > 0:

  bPOE(Z, τ) ≜ 1 − min{ α ∈ [0, 1] : CVaR_α(Z) ≥ τ }
  bPOD(Z, u) ≜ min{ β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }
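The identity between the tail integral and the two CVaR terms can be checked on an empirical distribution; a small sketch (assuming (1 − α)N and (1 − β)N are integers, so the empirical CVaR is a simple tail average):

```python
import numpy as np

def cvar(z, alpha):
    # empirical CVaR_alpha: average of the largest (1 - alpha) fraction
    # of the samples
    z = np.sort(np.asarray(z, dtype=float))
    k = int(round((1 - alpha) * len(z)))
    return z[-k:].mean()

def in_cvar(z, alpha, beta):
    # In-CVaR via the identity on the slide
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 11.0)        # samples 1, 2, ..., 10
print(cvar(z, 0.5))             # mean of {6, ..., 10} = 8.0
print(in_cvar(z, 0.5, 0.8))     # 7.0 = mean of the mid-tail slice {6, 7, 8}
```

Note how In-CVaR isolates exactly the slice of the distribution between the two VaR levels, which is what allows left and right tails to be trimmed at once.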
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

  P(u ≥ Z > τ) ≤ bPOE(Z, τ) + bPOD(Z, u) − 1 ≜ bPOI_τ^u(Z)

Two risk minimization problems of a composite random functional:

  minimize_{θ ∈ Θ}  In-CVaR_α^β(Z(θ))   (a difference of two convex SP value functions)
  minimize_{θ ∈ Θ}  bPOI_τ^u(Z(θ))   (a stochastic program with a difference-of-convex objective)

where

  Z(θ) ≜ c( g(θ, Z) − h(θ, Z) )   with g(•, z) − h(•, z) a difference of convex functions

with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program:

  minimize over θ ∈ Θ of  v₁(θ) − v₂(θ),  where
  v₁(θ) ≜ minimum_{η₁ ∈ Υ₁} E_ω[ φ₁(θ, η₁, ω) ]  and  v₂(θ) ≜ minimum_{η₂ ∈ Υ₂} E_ω[ φ₂(θ, η₂, ω) ]

  Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference of convex functions.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All: What Can be Computed?

[Diagram] Relationship between the stationary points: loc-min (satisfying SONC, and guaranteed by SOSC) ⟹ directional stationary (for dd functions) ⟹ limiting stationary ⟹ Clarke stationary ⟹ critical (for dc functions).

Stationarity:
• smooth case: ∇V(x) = 0
• nonsmooth case: 0 ∈ ∂V(x)  [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of them is continuously differentiable
• directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder it is to compute.
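Directional stationarity can be probed numerically via one-sided directional derivatives; a minimal sketch on |x| at the kink x = 0 (a standard textbook example, not from the slides):

```python
def dir_deriv(f, x, d, t=1e-8):
    # one-sided directional derivative f'(x; d) by a forward difference
    return (f(x + t * d) - f(x)) / t

# |x| is nondifferentiable at 0, yet directionally stationary there:
# f'(0; d) = |d| >= 0 for every direction d
print(dir_deriv(abs, 0.0, 1.0), dir_deriv(abs, 0.0, -1.0))  # 1.0 1.0
```

By contrast, 0 ∈ ∂(−|·|)(0) holds for the Clarke subdifferential of −|x| at 0 even though that point is a maximizer, illustrating why directional stationarity is the sharper target.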
Computational Tools

• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

• A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithm
  – dealing with non-smoothness: nonsmooth Newton algorithm
  – dealing with pathological constraints: (exact) penalty method

• An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics
  – dealing with stochastic programming in operations research
  – dealing with risk minimization in financial management and tail control
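The majorization-minimization idea for a dc function in one sentence: linearize the concave part at the current iterate and minimize the resulting convex surrogate exactly. A minimal sketch on the toy dc function x² − |x| (my choice of illustration, not an example from the talk):

```python
def dca_step(x):
    # surrogate of x^2 - |x| at the iterate x: replace |.| by its
    # linearization s*x, where s is a subgradient of |.| at x
    s = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
    return s / 2.0  # exact minimizer of the convex surrogate x^2 - s*x

x = 1.7
for _ in range(20):
    x = dca_step(x)
print(x)  # 0.5: a critical point of x^2 - |x| (here in fact a global minimizer)
```

Each surrogate majorizes the objective and is tight at the current iterate, so the objective values are monotonically nonincreasing; in general such a scheme only guarantees convergence to a critical point, which is why the sharper stationarity concepts above matter.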
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more
[Diagram from Section 4.4 of the monograph (Classes of Nonsmooth Functions, p. 231): the landscape of nonsmooth function classes — PC², PC¹, piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), piecewise affine (PA), dc, dc-composite, diff-max-convex, implicit convex-concave (icc), weakly convex, locally dc, semismooth, and B-differentiable (locally Lipschitz + directionally differentiable) — together with the side conditions (C¹, LC¹, lower-C², open or closed convex domains) under which the inclusions among them hold.]
Completed Monograph

Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher on Friday, September 11, 2020
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683

Bibliography 691
Index 721
A readiness test

[Flowchart]
A — Yes: you will not learn from this talk; live happily ever after. No: go to B.
B — How to deal with non-convexity/non-smoothness?
• C1 — Ignore these properties, or give them minimal or even wrong treatment.
• C2 — Focus on stylized problems (and there are many interesting ones) and stop at them.
• C3 — Faithfully treat these non-properties, leading to either:
  – D1 — Employ heuristics, potentially leading to heuristics on heuristics when complex models are being formulated.
  – D2 — The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features

Non-Smooth + Non-Convex + Non-Deterministic Non-Problems

Non-Separable, Non-Stylistic, Non-Interbreeding, and many more non-topics ···
Examples of Non-Problems
I Structured Learning
Piecewise Affine Regression

extending classical affine regression:

  minimize_{a,α,b,β}  (1/N) Σ_{s=1}^N ( yₛ − [ max_{1≤i≤k₁} ( (aⁱ)ᵀxₛ + αᵢ ) − max_{1≤i≤k₂} ( (bⁱ)ᵀxₛ + βᵢ ) ] )²

  (the bracketed model captures all piecewise affine functions)

illustrating coupled non-convexity and non-differentiability.

Estimate change points/hyperplanes in linear regression.
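The model inside the brackets is a difference of two max-affine functions; a minimal sketch of evaluating it (the instance recovering |x| is my own illustration):

```python
import numpy as np

def pa_model(x, A, alpha, B, beta):
    # difference of two max-affine functions; this family represents
    # every continuous piecewise affine function
    return np.max(A @ x + alpha) - np.max(B @ x + beta)

def ls_loss(X, y, A, alpha, B, beta):
    # the least-squares objective of the slide
    preds = np.array([pa_model(x, A, alpha, B, beta) for x in X])
    return np.mean((y - preds) ** 2)

# max(x, -x) - 0 = |x|: a piecewise affine model with k1 = 2, k2 = 1
A = np.array([[1.0], [-1.0]]); alpha = np.zeros(2)
B = np.zeros((1, 1)); beta = np.zeros(1)
print(pa_model(np.array([-3.0]), A, alpha, B, beta))  # 3.0
```

The loss is a convex quadratic composed with this piecewise affine map, which is exactly the piecewise linear-quadratic structure discussed in the comments that follow.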
Comments

• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming-based algorithm.

• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

  φ₁( max_{1≤i≤k₁₁} ψ¹₁ᵢ(x) − max_{1≤i≤k₁₂} ψ¹₂ᵢ(x) ) − φ₂( max_{1≤i≤k₂₁} ψ²₁ᵢ(x) − max_{1≤i≤k₂₂} ψ²₂ᵢ(x) )

  where all the functions involved are convex.
Logical Sparsity Learning

  minimize_{β ∈ ℝᵈ}  f(β)
  subject to  Σ_{j=1}^d a_{ij} |β_j|₀ ≤ b_i,  i = 1, ···, m

where |t|₀ ≜ 1 if t ≠ 0 and 0 otherwise.

• strong/weak hierarchy, e.g., |β_j|₀ ≤ |β_k|₀
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection
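Such affine sparsity constraints (ASCs) act on the support pattern of β; a minimal feasibility check, encoding the strong-hierarchy example as one ASC row (the instance is invented for illustration):

```python
import numpy as np

def ell0(beta):
    # |t|_0 = 1 if t != 0, else 0, applied componentwise
    return (np.asarray(beta) != 0).astype(float)

def asc_satisfied(beta, A, b):
    # affine sparsity constraints: A |beta|_0 <= b componentwise
    return bool(np.all(A @ ell0(beta) <= b))

# strong hierarchy |beta_1|_0 <= |beta_2|_0, i.e. beta_1 may enter
# only if beta_2 is already selected
A = np.array([[1.0, -1.0]]); b = np.array([0.0])
print(asc_satisfied([0.0, 3.0], A, b), asc_satisfied([3.0, 0.0], A, b))  # True False
```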
Comments

• The ℓ₀ function is discrete in nature; an ASC can be formulated either via complementarity constraints or via integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.

• A soft formulation of ASCs is possible, leading to an objective of the kind

  minimize_{β ∈ ℝᵈ}  f(β) + γ Σ_{i=1}^m [ b_i − Σ_{j=1}^d a_{ij} |β_j|₀ ]₊

  for a given γ > 0.

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.

• When | • |₀ is approximated by a folded concave function, we obtain some nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

  minimize_{θ = (Wˡ, wˡ)_{ℓ=1}^L}  (1/S) Σ_{s=1}^S [ yₛ − f_θ(xₛ) ]²

where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:

  f_θ(x) = ( Wᴸ • + wᴸ )  ∘  ···  ∘  max( 0, W² • + w² )  ∘  max( 0, W¹x + w¹ )
           (output layer)              (second layer)           (input layer)

i.e., linear (output) ∘ ··· ∘ max ∘ linear ∘ max ∘ linear (input).
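A minimal sketch of the composite map and its empirical squared loss (the tiny 1-1-1 network below is a hypothetical instance chosen so the loss is zero):

```python
import numpy as np

def f_theta(x, Ws, ws):
    # L-fold composition: max (ReLU) hidden layers, affine output layer
    z = np.asarray(x, dtype=float)
    for W, w in zip(Ws[:-1], ws[:-1]):
        z = np.maximum(0.0, W @ z + w)   # max(0, W z + w)
    return Ws[-1] @ z + ws[-1]           # output layer, no activation

def empirical_loss(xs, ys, Ws, ws):
    # (1/S) sum_s [ y_s - f_theta(x_s) ]^2
    return float(np.mean([(y - f_theta(x, Ws, ws)) ** 2 for x, y in zip(xs, ys)]))

# hypothetical 1-1-1 network: f(x) = 1 * max(0, 2x) + 1
Ws = [np.array([[2.0]]), np.array([[1.0]])]
ws = [np.array([0.0]), np.array([1.0])]
print(empirical_loss([np.array([1.0])], [3.0], Ws, ws))  # f(1) = 3 -> loss 0.0
```

Every kink of the loss in θ comes from a max in some layer, and the layers couple multiplicatively, which is the multi-composite nonconvexity the exact-penalization treatment below is designed for.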
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ₁ penalization:

  minimize_{z ∈ Z, u}  ζ_ρ(z, u) ≜ (1/S) Σ_{s=1}^S φₛ(u^{s,L}) + ρ Σ_{s=1}^S Σ_{ℓ=1}^L Σ_{j=1}^N | u^{s,ℓ}_j − ψ_{ℓj}(zˡ, u^{s,ℓ−1}) |
                                                                 (penalizing the components)

• Rigorous algorithms, with supporting convergence theory, for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
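To make the penalized objective concrete, here is a sketch for a hypothetical single-hidden-layer instance (my simplification of the slide's formulation): the auxiliary variable U[s] is a copy of the hidden-layer output for sample s, and the ℓ₁ term charges its deviation from the true layer map max(0, W¹x + w¹):

```python
import numpy as np

def zeta_rho(W1, w1, W2, w2, U, xs, ys, rho):
    # fit term: (1/S) sum_s (y_s - output(U[s]))^2
    # penalty:  rho * sum over samples/components of |u - max(0, W1 x + w1)|
    S = len(xs)
    fit = sum((ys[s] - float(np.dot(W2, U[s]) + w2)) ** 2 for s in range(S)) / S
    pen = sum(np.abs(U[s] - np.maximum(0.0, W1 @ xs[s] + w1)).sum() for s in range(S))
    return fit + rho * pen

# hypothetical data: network f(x) = max(0, 2x) + 1 fits y = 3 at x = 1
W1 = np.array([[2.0]]); w1 = np.array([0.0]); W2 = np.array([1.0]); w2 = 1.0
xs = [np.array([1.0])]; ys = [3.0]
U_feasible = [np.maximum(0.0, W1 @ xs[0] + w1)]   # u equals the true layer output
print(zeta_rho(W1, w1, W2, w2, U_feasible, xs, ys, rho=10.0))  # penalty 0, fit 0 -> 0.0
```

At a feasible point the penalty vanishes and ζ_ρ coincides with the original loss; exactness of the penalty for ρ large enough (at stationary solutions) is what the theory establishes.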
Examples of Non-Problems
II Smart Planning
Power Systems Planning
ω1 uncertain renewable energy generationω2 uncertain demand of electricity
x production from traditional thermal plants
y production from backup fast-ramping generators
minimizexisinX
ϕ(x) + Eω1ω2
[ψ(x ω1 ω2)
]where the recourse function is
ψ(x ω1 ω2) minimumy
ygtc([ω2 minus xminus ω1]+
)subject to A(ω)x+Dy ge ξ(ω)
Unit cost of fast-ramping generators depends on
I observed renewable production amp demand of electricity
I shortfall due to the first-stage thermal power decisions
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize c^⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

  minimize_{x ≥ 0}  (c + δ_c)^⊤ x + ρ ‖ [ (A + δ_A)x − b − δ_b ]₊ ‖_∞

where the norm term is the constraint residual.

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

  maximum_{δ ∈ Δ}  λ₁ (c + δ_c)^⊤ x  +  λ₂ ‖ [ (A + δ_A)x − b − δ_b ]₊ ‖_∞  −  C(δ)

where the first term disrupts the objective, the second disrupts the constraints, and C(δ) is the attack cost.

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming.
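Once a disruption δ = (δ_c, δ_A, δ_b) is fixed, both penalized payoffs are straightforward to evaluate. A minimal sketch, with all data values hypothetical:

```python
# Evaluating the defender's penalized objective and the attacker's payoff
# for one activity x and one disruption delta = (dc, dA, db).
import numpy as np

def residual_inf(A, dA, x, b, db):
    """inf-norm of the nonnegative part of (A + dA) x - b - db."""
    r = (A + dA) @ x - b - db
    return np.max(np.maximum(r, 0.0))

def defender_obj(c, dc, A, dA, b, db, x, rho):
    return (c + dc) @ x + rho * residual_inf(A, dA, x, b, db)

def attacker_payoff(c, dc, A, dA, b, db, x, lam1, lam2, cost):
    return lam1 * (c + dc) @ x + lam2 * residual_inf(A, dA, x, b, db) - cost

c = np.array([1.0, 2.0]); A = np.array([[1.0, 1.0]]); b = np.array([1.0])
x = np.array([0.5, 0.5])                  # feasible for Ax = b, x >= 0
dc = np.array([0.2, 0.0]); dA = np.array([[0.0, 0.5]]); db = np.array([-0.1])
print(defender_obj(c, dc, A, dA, b, db, x, rho=10.0))        # 5.1
print(attacker_payoff(c, dc, A, dA, b, db, x, 1.0, 1.0, 0.3))  # 1.65
```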
Robustification of nonconvex problems

A general nonlinear program:

  minimize_{x ∈ S}  θ(x) + f(x)  subject to  H(x) = 0,

where
• S is a closed convex set in R^n, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in R^n containing S;
• H : O → R^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

  minimize_{x ∈ S}  θ(x) + maximum_{δ ∈ Δ} f(x, δ)  subject to  H(x, δ) = 0 for all δ ∈ Δ

(the for-all constraint is easily infeasible in this way of modeling).
New formulation, via a pessimistic value-function minimization: given γ > 0,

  minimize_{x ∈ S}  θ(x) + v_γ(x)

where v_γ(x) models constraint violation with a cost of model attacks:

  v_γ(x) ≜ maximum_{δ ∈ Δ}  φ_γ(x, δ) ≜ f(x, δ) + γ ‖H(x, δ)‖_∞  −  C(δ)

(the norm term is the constraint residual; C(δ) is the attack cost), with the novel features:

• a combined perturbation δ in a joint uncertainty set Δ ⊆ R^L;

• robustified constraint satisfaction via penalization.
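For a finite sample of the uncertainty set, v_γ(x) is a plain maximum and can be evaluated directly. A minimal sketch, assuming illustrative stand-ins for f, H, and C:

```python
# Pessimistic value function v_gamma(x) over a *finite* set of disruptions:
# v_gamma(x) = max_delta f(x,delta) + gamma * ||H(x,delta)||_inf - C(delta).
# f, H, C and the candidate deltas below are hypothetical toy choices.
import numpy as np

def v_gamma(x, deltas, gamma):
    def phi(d):
        f = x @ x + d @ x                        # perturbed objective f(x, d)
        H = np.array([x.sum() + d.sum() - 1.0])  # perturbed constraint H(x, d)
        C = 0.5 * d @ d                          # attack cost C(d)
        return f + gamma * np.max(np.abs(H)) - C
    return max(phi(d) for d in deltas)

x = np.array([0.4, 0.6])
deltas = [np.zeros(2), np.array([0.2, -0.1]), np.array([-0.3, 0.3])]
print(v_gamma(x, deltas, gamma=5.0))             # worst case over the sample
```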
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

  minimize_{x ∈ S}  φ(x) ≜ θ(x) + v(x),  with  v(x) ≜ max_{1 ≤ j ≤ J}  maximum_{δ ∈ Δ ∩ R^L₊}  [ g_j(x, δ) − h_j(x)^⊤ δ ]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization, for risk-based statistical estimation to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

  In-CVaR_α^β(Z) ≜ (1/(β − α)) ∫_{VaR_β(Z) > z ≥ VaR_α(Z)} z dF_Z(z)
        = (1/(β − α)) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ]

Buffered probability of exceedance/deceedance at level τ > 0:

  bPOE(Z, τ) ≜ 1 − min { α ∈ [0, 1] : CVaR_α(Z) ≥ τ }

  bPOD(Z, u) ≜ min { β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }
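Empirically, these quantities can be estimated from a sample by order statistics; the identity above then produces the interval CVaR from two ordinary tail averages. A rough sketch, using one simple estimator convention among several (no interpolation refinements):

```python
# Empirical CVaR and In-CVaR from a sample, via the identity
# In-CVaR_alpha^beta = [(1-alpha) CVaR_alpha - (1-beta) CVaR_beta] / (beta-alpha).
import numpy as np

def cvar(z, alpha):
    """Average of the worst (1 - alpha) fraction of the sample."""
    z = np.sort(z)
    k = int(np.ceil(alpha * z.size))     # position of VaR_alpha in the sorted sample
    return z[k:].mean() if k < z.size else z[-1]

def in_cvar(z, alpha, beta):
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 101.0)                # sample 1, 2, ..., 100
print(cvar(z, 0.9))                      # mean of the worst 10 values: 95.5
print(in_cvar(z, 0.5, 0.9))              # mean of the middle band 51..90: 70.5
```

On this sample the formula indeed reproduces the direct average over the band between the two empirical VaR levels, which is the content of the identity.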
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

  P(u ≥ Z > τ) ≤ bPOE(Z, τ) + bPOD(Z, u) − 1 ≜ bPOI_τ^u(Z)

Two risk minimization problems of a composite random functional:

  minimize_{θ ∈ Θ}  In-CVaR_α^β(Z(θ))   (a difference of two convex SP value functions)

  and  minimize_{θ ∈ Θ}  bPOI_τ^u(Z(θ))   (a stochastic program with a difference-of-convex objective)

where

  Z(θ) ≜ c( g(θ, Z) − h(θ, Z) )

with c : R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions, so that g(•, z) − h(•, z) is a difference of convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

  v₁(θ) − v₂(θ),  where  v₁(θ) ≜ minimum_{η₁ ∈ Υ₁} E_ω[ φ₁(θ, η₁, ω) ]  and  v₂(θ) ≜ minimum_{η₂ ∈ Υ₂} E_ω[ φ₂(θ, η₂, ω) ]

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: relationships among stationarity concepts: critical points (for dc functions), Clarke stationary, limiting stationary, directional stationary (for dd functions), SONC, local minima, SOSC]
Relationship between the stationary points

► Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the functions is continuously differentiable.

► Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute.
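The failure of the sum rule quoted above can be seen with V₁(x) = |x| and V₂(x) = −|x|: the sum of subdifferentials at 0 is [−1, 1] + [−1, 1] = [−2, 2], while V₁ + V₂ ≡ 0 has ∂(V₁ + V₂)(0) = {0}. A small numerical check via one-sided finite differences:

```python
# Directional derivatives of V1(x) = |x|, V2(x) = -|x|, and their sum at x = 0,
# computed by one-sided finite differences.
def dir_deriv(f, x, d, t=1e-7):
    return (f(x + t * d) - f(x)) / t

V1 = abs
V2 = lambda x: -abs(x)
V = lambda x: V1(x) + V2(x)              # identically zero

for d in (1.0, -1.0):
    print(dir_deriv(V1, 0.0, d), dir_deriv(V, 0.0, d))
# V1'(0; d) = |d| = 1 in both directions, while (V1 + V2)'(0; d) = 0:
# x = 0 is directionally stationary for V even though the naive
# "sum of subdifferentials" contains every point of [-2, 2].
```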
Computational Tools

► Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

► A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithms;
  – dealing with non-smoothness: nonsmooth Newton algorithms;
  – dealing with pathological constraints: (exact) penalty methods.

► An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic

• Difference-of-convex: facilitates algorithmic design via convex majorization

• Semismooth: enables fast algorithms of the Newton type

• Bouligand differentiable: locally Lipschitz + directionally differentiable

• Subanalytic and semianalytic

* and many more
[Diagram, from Section 4.4 of the monograph ("Classes of Nonsmooth Functions", p. 231): inclusions among function classes, including piecewise affine (PA), piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), PC1 and PC2, difference-of-convex (dc), dc-composite, difference-of-max-convex, implicit convex-concave (icc), semismooth, B-differentiable, locally Lipschitz + directionally differentiable, C1, LC1, lower-C2, and (locally) strongly weakly convex functions]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to the publisher on Friday, September 11, 2020.
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683

Bibliography 691
Index 721
A readiness test

A: Yes / No
  No → will not learn from this talk → live happily ever after.
  Yes → B: How to deal with non-convexity / non-smoothness?
    C1: Ignore these properties, or give them minimal or even wrong treatment.
    C2: Focus on stylized problems (and there are many interesting ones) and stop at them.
    C3: Faithfully treat these non-properties:
      D1: Employ heuristics, potentially leading to heuristics on heuristics when complex models are being formulated.
      D2: The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features

Non-Smooth + Non-Convex + Non-Deterministic → Non-Problems

Non-Separable, Non-Stylistic, Non-Interbreeding, and many more non-topics · · ·
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression, extending classical affine regression:

  minimize_{a, α, b, β}  (1/N) Σ_{s=1}^N ( y_s − [ max_{1 ≤ i ≤ k₁} ( (a^i)^⊤ x_s + α_i ) − max_{1 ≤ i ≤ k₂} ( (b^i)^⊤ x_s + β_i ) ] )²

where the bracketed difference of maxima captures all piecewise affine functions; the model illustrates coupled non-convexity and non-differentiability.

Estimate change points/hyperplanes in linear regression.
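A crude way to see the model in action is to fit the difference-of-max-affine form by plain subgradient descent on the nonsmooth least-squares loss. This is only a sketch on synthetic data, not the iterative convex-programming-based algorithm of the talk, and it carries no stationarity guarantees.

```python
# Least-squares fit of m(x) = max_i(a_i^T x + alpha_i) - max_i(b_i^T x + beta_i)
# by plain (sub)gradient descent; target |x| is itself piecewise affine.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = np.abs(X[:, 0])                        # |x| = max(x, -x) - 0

k1, k2 = 2, 1                              # pieces in each max
a = rng.normal(size=(k1, 1)); al = np.zeros(k1)
b = rng.normal(size=(k2, 1)); be = np.zeros(k2)

lr, losses = 0.05, []
for _ in range(300):
    p = X @ a.T + al                       # (N, k1): pieces of the first max
    q = X @ b.T + be                       # (N, k2): pieces of the subtracted max
    i, j = p.argmax(1), q.argmax(1)        # active pieces -> a subgradient choice
    r = p.max(1) - q.max(1) - y            # residuals of the fit
    losses.append(float(np.mean(r ** 2)))
    for s in range(X.shape[0]):            # accumulate per-sample subgradients
        g = 2.0 * r[s] / X.shape[0]
        a[i[s]] -= lr * g * X[s]; al[i[s]] -= lr * g
        b[j[s]] += lr * g * X[s]; be[j[s]] += lr * g
print(losses[0], losses[-1])               # loss drops from the random start
```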
Comments

• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming-based algorithm.

• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

  φ₁( max_{1 ≤ i ≤ k₁₁} ψ¹₁ᵢ(x) − max_{1 ≤ i ≤ k₁₂} ψ¹₂ᵢ(x) ) − φ₂( max_{1 ≤ i ≤ k₂₁} ψ²₁ᵢ(x) − max_{1 ≤ i ≤ k₂₂} ψ²₂ᵢ(x) )

where all functions are convex.
Logical Sparsity Learning

  minimize_{β ∈ R^d}  f(β)  subject to  Σ_{j=1}^d a_ij |β_j|₀ ≤ b_i,  i = 1, ..., m

where |t|₀ ≜ 1 if t ≠ 0, and 0 otherwise. Examples:

• strong/weak hierarchy, e.g., |β_j|₀ ≤ |β_k|₀;

• cost-based variable selection: coefficients ≥ 0 but not all equal;

• hierarchical/group variable selection.
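For tiny d, the logical constraints can be handled by brute force: enumerate the 2^d support patterns of β, keep those whose indicator vector satisfies the affine sparsity constraint, and solve a restricted least-squares problem on each feasible support. A toy sketch with synthetic data and f taken to be a least-squares loss; practical formulations instead use complementarity or big-M reformulations.

```python
# Brute-force illustration of l0 ("logical") constraints: enumerate supports,
# filter by A |beta|_0 <= b, fit least squares on each feasible support.
import itertools
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3)); beta_true = np.array([1.5, 0.0, -2.0])
y = X @ beta_true                              # noise-free synthetic responses
A = np.array([[1, 1, 1]]); b = np.array([2])   # ASC: at most 2 nonzeros

best, best_val = None, np.inf
for pattern in itertools.product([0, 1], repeat=3):
    z = np.array(pattern)                      # indicator vector |beta|_0
    if np.any(A @ z > b):                      # violates the sparsity constraint
        continue
    S = np.flatnonzero(z)
    beta = np.zeros(3)
    if S.size:
        beta[S], *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    val = np.sum((X @ beta - y) ** 2)
    if val < best_val:
        best, best_val = beta, val
print(best)                                    # recovers the support {0, 2}
```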
Comments

• The ℓ₀ function is discrete in nature; an ASC (affine sparsity constraint) can be formulated either via complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.

• A soft formulation of ASCs is possible, leading to an objective of the kind

  minimize_{β ∈ R^d}  f(β) + γ Σ_{i=1}^m [ b_i − Σ_{j=1}^d a_ij |β_j|₀ ]₊

for a given γ > 0.

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.

• When |•|₀ is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure, leading to multi-composite nonconvex nonsmooth optimization problems:

  minimize_{θ = (W^ℓ, w^ℓ)_{ℓ=1}^L}  (1/S) Σ_{s=1}^S [ y_s − f_θ(x_s) ]²

where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:

  f_θ(x) = ( W^L • + w^L ) ∘ · · · ∘ max( 0, W² • + w² ) ∘ max( 0, W¹ x + w¹ )

with the three factors being the output layer, the second layer, and the input layer, respectively; i.e., linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input).
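The L-fold composite f_θ and the empirical squared loss are easy to write down explicitly. The sketch below uses an arbitrary tiny architecture and random data, purely for illustration:

```python
# Forward pass of the L-fold composite f_theta and the empirical squared loss:
# max(0, .) (ReLU) on every layer except the affine output layer.
import numpy as np

rng = np.random.default_rng(3)
layers = [(rng.normal(size=(4, 2)), np.zeros(4)),   # (W1, w1): input layer
          (rng.normal(size=(3, 4)), np.zeros(3)),   # (W2, w2): hidden layer
          (rng.normal(size=(1, 3)), np.zeros(1))]   # (WL, wL): output layer

def f_theta(x):
    for W, w in layers[:-1]:
        x = np.maximum(0.0, W @ x + w)              # nonsmooth ReLU layers
    W, w = layers[-1]
    return (W @ x + w)[0]                           # affine output layer

X = rng.normal(size=(10, 2)); y = rng.normal(size=10)
loss = np.mean([(y[s] - f_theta(X[s])) ** 2 for s in range(10)])
print(loss)                                         # nonnegative empirical loss
```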
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ₁ penalization:

  minimize_{z ∈ Z, u}  ζ_ρ(z, u) ≜ (1/S) Σ_{s=1}^S φ_s(u^{s,L}) + ρ Σ_{ℓ=1}^L Σ_{j=1}^N | u_j^{s,ℓ} − ψ_{ℓj}(z^ℓ, u^{s,ℓ−1}) |

where the absolute-value terms penalize the components of the layer equations.

• Rigorous algorithms, with supporting convergence theory, for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
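For a single sample and two layers, the ℓ₁-penalized objective ζ_ρ reads as below; the weights, data, and the choice ψ_ℓ(z, u) = max(0, W_ℓ u) are all illustrative stand-ins. The penalty vanishes exactly when the auxiliary variables satisfy the layer equations:

```python
# l1 exact penalty zeta_rho for one sample and two layers: decision variables
# are the weights and the auxiliary layer outputs u1, u2; each |.| term
# charges a violation of the corresponding layer equation.
import numpy as np

def zeta_rho(W1, W2, u1, u2, x, y, rho):
    phi = (y - u2.sum()) ** 2                       # loss on the final output
    pen = (np.abs(u1 - np.maximum(0, W1 @ x)).sum()
           + np.abs(u2 - np.maximum(0, W2 @ u1)).sum())
    return phi + rho * pen

W1 = np.array([[1.0, -1.0]]); W2 = np.array([[2.0]])
x = np.array([0.5, 0.25]); y = 0.5
u1_feas = np.maximum(0, W1 @ x)                     # satisfies layer 1 exactly
u2_feas = np.maximum(0, W2 @ u1_feas)               # satisfies layer 2 exactly
print(zeta_rho(W1, W2, u1_feas, u2_feas, x, y, rho=10.0))  # -> 0.0
```

Perturbing u1 away from feasibility makes the penalty strictly positive, which is the mechanism the exact-penalization theory exploits.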
Examples of Non-Problems
II. Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

  minimize_{x ∈ X}  φ(x) + E_{ω₁, ω₂}[ ψ(x, ω₁, ω₂) ]

where the recourse function is

  ψ(x, ω₁, ω₂) ≜ minimum_y  y^⊤ c( [ω₂ − x − ω₁]₊ )  subject to  A(ω)x + Dy ≥ ξ(ω)

The unit cost of the fast-ramping generators depends on:

► the observed renewable production and demand of electricity;

► the shortfall due to the first-stage thermal power decisions.
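Given a realization (ω₁, ω₂) and a first-stage decision x, the recourse value is a linear program whose cost vector depends on the shortfall [ω₂ − x − ω₁]₊. A toy sketch with made-up data, taking A = D = I, ξ(ω) = ω₂ − ω₁, and a hypothetical affine cost map c(·):

```python
# Recourse evaluation for the planning sketch: the per-unit cost of the
# fast-ramping production y steepens with the realized shortfall, after
# which psi(x, w1, w2) is a plain LP.
import numpy as np
from scipy.optimize import linprog

def psi(x, w1, w2):
    shortfall = np.maximum(w2 - x - w1, 0.0)
    cost = 1.0 + 2.0 * shortfall            # c([w2 - x - w1]_+): costlier when short
    # min cost^T y  s.t.  x + y >= w2 - w1 (meet residual demand), y >= 0
    res = linprog(c=cost, A_ub=-np.eye(x.size), b_ub=x - (w2 - w1),
                  bounds=[(0, None)] * x.size)
    return res.fun

x = np.array([1.0, 1.0])
print(psi(x, w1=np.array([0.2, 0.0]), w2=np.array([1.5, 0.8])))  # 0.48
```

In this instance only the first zone is short (by 0.3 units), so the optimal recourse buys 0.3 units at the shortfall-inflated price 1.6.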
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization, for risk-based statistical estimation that controls the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable $Z$ at levels $0 \le \alpha < \beta \le 1$, treating two-sided estimation errors:
$$\text{In-CVaR}_\alpha^\beta(Z) \,\triangleq\, \frac{1}{\beta - \alpha} \int_{\text{VaR}_\beta(Z) > z \ge \text{VaR}_\alpha(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta - \alpha} \Big[\, (1-\alpha)\,\text{CVaR}_\alpha(Z) - (1-\beta)\,\text{CVaR}_\beta(Z) \,\Big].$$

Buffered probabilities of exceedance/deceedance at level $\tau > 0$:
$$\text{bPOE}(Z, \tau) \,\triangleq\, 1 - \min\big\{\, \alpha \in [0,1] : \text{CVaR}_\alpha(Z) \ge \tau \,\big\}, \qquad \text{bPOD}(Z, u) \,\triangleq\, \min\big\{\, \beta \in [0,1] : \text{In-CVaR}_0^\beta(Z) \ge u \,\big\}.$$
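On a finite, equally weighted sample, CVaR and hence In-CVaR can be estimated directly from sorted values via the CVaR-difference formula above. A small sketch, assuming for simplicity that $(1 - \text{level}) \cdot n$ is an integer:

```python
# Empirical CVaR and In-CVaR on a finite sample (equal weights).
# For simplicity this sketch assumes (1 - level) * n is an integer.

def cvar(samples, alpha):
    """Average of the largest (1 - alpha) fraction of the samples."""
    xs = sorted(samples, reverse=True)
    k = round((1 - alpha) * len(xs))
    return sum(xs[:k]) / k

def in_cvar(samples, alpha, beta):
    """In-CVaR_alpha^beta via the CVaR-difference formula on the slide."""
    upper = 0.0 if beta >= 1.0 else (1 - beta) * cvar(samples, beta)
    return ((1 - alpha) * cvar(samples, alpha) - upper) / (beta - alpha)

z = list(range(1, 11))          # samples 1, 2, ..., 10
print(in_cvar(z, 0.0, 0.8))     # mean of the lower 80% of samples: 4.5
print(in_cvar(z, 0.0, 1.0))     # recovers the full mean: 5.5
```

Truncating at $\beta < 1$ discards the largest values, which is exactly how the In-CVaR objective downweights right-tail outliers.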
Buffered probability of the interval $(\tau, u]$. We have the following bound for an interval probability:
$$\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \,\triangleq\, \text{bPOI}_\tau^u(Z).$$
Two risk minimization problems of a composite random functional:
$$\underset{\theta \in \Theta}{\text{minimize}}\;\; \underbrace{\text{In-CVaR}_\alpha^\beta(Z(\theta))}_{\substack{\text{a difference of two convex} \\ \text{SP value functions}}} \qquad \text{and} \qquad \underset{\theta \in \Theta}{\text{minimize}}\;\; \underbrace{\text{bPOI}_\tau^u(Z(\theta))}_{\substack{\text{a stochastic program with a} \\ \text{difference-of-convex objective}}},$$
where
$$Z(\theta) \,\triangleq\, c\big(\, \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\cdot,\, z) \,-\, h(\cdot,\, z) \text{ is a difference of convex functions}} \,\big),$$
with $c : \mathbb{R} \to \mathbb{R}$ a univariate piecewise affine function, $Z$ an $m$-dimensional random vector, and, for each $z$, $g(\cdot, z)$ and $h(\cdot, z)$ both convex.
Comments
• The In-CVaR minimization leads to the stochastic program
$$\underset{\theta \in \Theta}{\text{minimize}}\;\; \underbrace{\min_{\eta_1 \in \Upsilon_1} \mathbb{E}_\omega \big[\, \varphi_1(\theta, \eta_1, \omega) \,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\min_{\eta_2 \in \Upsilon_2} \mathbb{E}_\omega \big[\, \varphi_2(\theta, \eta_2, \omega) \,\big]}_{\text{denoted } v_2(\theta)}.$$
Both $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function.
• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of all: what can be computed?
[Diagram: relationship between the stationary points, ordered from weakest to sharpest: critical (for dc functions), Clarke stationary, limiting stationary, directional stationary (for dd functions); alongside the second-order necessary conditions (SONC), local minima, and second-order sufficient conditions (SOSC).]

• Stationarity:
  - smooth case: $\nabla V(x) = 0$;
  - nonsmooth case: $0 \in \partial V(x)$ [subdifferential; many calculus rules fail], e.g., $\partial V_1(x) + \partial V_2(x) \neq \partial (V_1 + V_2)(x)$ unless one of the functions is continuously differentiable.
• Directional stationarity: $V'(x; d) \ge 0$ for all directions $d$.
• The sharper the concept, the harder it is to compute.
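The gap between the concepts can be seen on $V(x) = -|x|$ at $x = 0$: the Clarke subdifferential there is $[-1, 1]$, which contains $0$, so the point is Clarke stationary; yet $V'(0; d) = -|d| < 0$ in every direction, so it is not directionally stationary. A one-sided finite-difference check (an illustrative sketch, since $V$ is only directionally differentiable):

```python
def V(x):
    return -abs(x)

def dir_deriv(f, x, d, t=1e-8):
    """One-sided finite-difference estimate of f'(x; d)."""
    return (f(x + t * d) - f(x)) / t

# At x = 0, the Clarke subdifferential of -|x| is [-1, 1], containing 0,
# so x = 0 is Clarke stationary.  But V'(0; d) = -|d| < 0 for d != 0,
# so x = 0 is NOT directionally stationary: every direction is a descent.
print(dir_deriv(V, 0.0, 1.0))    # approximately -1
print(dir_deriv(V, 0.0, -1.0))   # approximately -1
```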
Computational Tools
• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
• A fusion of first-order and second-order algorithms:
  - dealing with non-convexity: the majorization-minimization algorithm;
  - dealing with non-smoothness: the nonsmooth Newton algorithm;
  - dealing with pathological constraints: the (exact) penalty method.
• An integration of sampling and deterministic methods:
  - dealing with population optimization in statistics;
  - dealing with stochastic programming in operations research;
  - dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• piecewise smooth*, e.g., piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• subanalytic and semianalytic;
* and many more.
[Figure, reproduced from Section 4.4 (p. 231) of the monograph: a diagram of inclusions among the classes of nonsmooth functions; piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), and piecewise affine (PA) functions inside the piecewise smooth classes PC^2 and PC^1; dc, dc-composite, difference-max-convex, implicit convex-concave (icc), weakly convex, and locally dc functions; semismooth and B-differentiable functions; all within the locally Lipschitz and directionally differentiable functions.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̃c 683
Bibliography 691
Index 721
A readiness test (flowchart):
• A (the "No" branch): you will not learn from this talk; live happily ever after.
• B (the "Yes" branch): how to deal with non-convexity/non-smoothness?
  - C1: ignore these properties, or give them minimal (or even wrong) treatment.
  - C2: focus on stylized problems (and there are many interesting ones) and stop at them.
  - C3: faithfully treat these "non"-properties, leading to either
    D1: employ heuristics, potentially leading to heuristics on heuristics when complex models are being formulated; or
    D2: the hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features
Non-Problems: Non-Smooth + Non-Convex + Non-Deterministic,
plus Non-Separable, Non-Stylistic, Non-Interbreeding, and many more "non"-topics...
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression, extending classical affine regression:
$$\underset{a,\, \alpha,\, b,\, \beta}{\text{minimize}}\;\; \frac{1}{N} \sum_{s=1}^N \Big(\, y_s - \Big[\, \underbrace{\max_{1 \le i \le k_1} \big\{ (a^i)^\top x_s + \alpha_i \big\} \;-\; \max_{1 \le i \le k_2} \big\{ (b^i)^\top x_s + \beta_i \big\}}_{\text{capturing all piecewise affine functions}} \,\Big] \,\Big)^2,$$
illustrating coupled non-convexity and non-differentiability; it estimates change points/hyperplanes in linear regression.
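The inner model, a difference of two max-affine functions, captures nonconvex piecewise affine shapes that a single (convex) max-affine part cannot. A tiny evaluation sketch with hand-picked one-dimensional weights (not fitted):

```python
# Difference-of-max-affine model from the regression objective:
#   m(x) = max_i {a_i * x + alpha_i} - max_i {b_i * x + beta_i}
# Hand-picked 1-D weights below reproduce m(x) = |x|, and swapping the
# two parts gives the nonconvex piece m(x) = -|x|.

def max_affine(pieces, x):
    return max(a * x + c for (a, c) in pieces)

def model(A, B, x):
    return max_affine(A, x) - max_affine(B, x)

A = [(1.0, 0.0), (-1.0, 0.0)]   # max(x, -x) = |x|
B = [(0.0, 0.0)]                # identically zero
print(model(A, B, -3.0))        # |x|  at x = -3:  3.0
print(model(B, A, -3.0))        # -|x| at x = -3: -3.0
```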
Comments
• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• Every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming-based algorithm.
• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is
$$\varphi_1 \Big( \max_{1 \le i \le k_{11}} \psi^1_{1i}(x) - \max_{1 \le i \le k_{12}} \psi^1_{2i}(x) \Big) \;-\; \varphi_2 \Big( \max_{1 \le i \le k_{21}} \psi^2_{1i}(x) - \max_{1 \le i \le k_{22}} \psi^2_{2i}(x) \Big),$$
where all the functions are convex.
Logical Sparsity Learning
$$\underset{\beta \in \mathbb{R}^d}{\text{minimize}}\;\; f(\beta) \quad \text{subject to} \quad \sum_{j=1}^d a_{ij}\, |\beta_j|_0 \le b_i, \quad i = 1, \dots, m,$$
where $|t|_0 \triangleq 1$ if $t \neq 0$ and $0$ otherwise. Such constraints model:
• strong/weak hierarchy, e.g., $|\beta_j|_0 \le |\beta_k|_0$;
• cost-based variable selection: coefficients $\ge 0$ but not all equal;
• hierarchical/group variable selection.
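The constraints can be checked directly from the definition of $|t|_0$. A minimal sketch; the constraint data below encode the weak-hierarchy example and are purely illustrative:

```python
# Affine sparsity constraint check:  sum_j a_ij * |beta_j|_0 <= b_i  for all i,
# with |t|_0 = 1 if t != 0 else 0.  The data (a, b) are toy stand-ins.

def l0(t):
    return 1 if t != 0 else 0

def satisfies_asc(beta, a, b):
    return all(sum(a_i[j] * l0(beta[j]) for j in range(len(beta))) <= b_i
               for a_i, b_i in zip(a, b))

# Hierarchy |beta_1|_0 <= |beta_0|_0 written as one constraint row:
#   -1 * |beta_0|_0 + 1 * |beta_1|_0 <= 0
a, b = [[-1.0, 1.0]], [0.0]
print(satisfies_asc([2.0, 0.0], a, b))  # True: beta_1 is inactive
print(satisfies_asc([0.0, 2.0], a, b))  # False: beta_1 active without beta_0
```

Note that the row of a here has a negative entry, the case flagged in the comments below as losing closedness of the constraint set.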
Comments
• The $\ell_0$ function is discrete in nature; an affine sparsity constraint (ASC) can be formulated either via complementarity constraints or via integer variables under boundedness of the original variables (the big-M formulation) when $A$ is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind
$$\underset{\beta \in \mathbb{R}^d}{\text{minimize}}\;\; f(\beta) + \gamma \sum_{i=1}^m \Big[\, b_i - \sum_{j=1}^d a_{ij}\, |\beta_j|_0 \,\Big]_+ \quad \text{for a given } \gamma > 0.$$
• When the matrix $A$ has negative entries, the resulting ASC set may not be closed, and its optimization formulation may not have a lower semicontinuous objective.
• When $|\bullet|_0$ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:
$$\underset{\theta = (W^\ell,\, w^\ell)_{\ell=1}^L}{\text{minimize}}\;\; \frac{1}{S} \sum_{s=1}^S \big[\, y_s - f_\theta(x_s) \,\big]^2,$$
where $f_\theta$ is the $L$-fold composite function with the nonsmooth max (ReLU) activation:
$$f_\theta(x) \;=\; \underbrace{\big( W^L \bullet + w^L \big)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\big( 0,\, W^2 \bullet + w^2 \big)}_{\text{second layer}} \circ \underbrace{\max\big( 0,\, W^1 x + w^1 \big)}_{\text{input layer}},$$
i.e., linear (output) ∘ ⋯ ∘ max ∘ linear ∘ max ∘ linear (input).
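Evaluating $f_\theta$ is a plain feed-forward pass. A minimal pure-Python sketch with hand-set toy weights (chosen so that the two-layer network computes $|x|$):

```python
# Feed-forward evaluation of f_theta: ReLU hidden layers max(0, W x + w)
# composed with a linear output layer.  Weights below are hand-set toys.

def affine(W, w, x):
    return [sum(Wi[j] * x[j] for j in range(len(x))) + wi
            for Wi, wi in zip(W, w)]

def f_theta(layers, x):
    *hidden, (W_L, w_L) = layers
    for W, w in hidden:
        x = [max(0.0, v) for v in affine(W, w, x)]   # ReLU layer
    return affine(W_L, w_L, x)                       # linear output layer

layers = [([[1.0], [-1.0]], [0.0, 0.0]),   # hidden: [max(0, x), max(0, -x)]
          ([[1.0, 1.0]], [0.0])]           # output: their sum = |x|
print(f_theta(layers, [-2.5]))             # [2.5]
```

The composition of max and affine maps is exactly the piecewise affine, hence nonsmooth and nonconvex, structure the least-squares objective inherits.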
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the $\ell_1$ penalization:
$$\underset{z \in Z,\, u}{\text{minimize}}\;\; \zeta_\rho(z, u) \,\triangleq\, \frac{1}{S} \sum_{s=1}^S \varphi_s(u^{s,L}) + \rho \sum_{\ell=1}^L \sum_{j=1}^N \underbrace{\Big|\, u^{s,\ell}_j - \psi_{\ell j}\big(z^\ell, u^{s,\ell-1}\big) \,\Big|}_{\text{penalizing the components}}.$$
• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds for nondifferentiable optimization problems.
Examples of Non-Problems
II. Smart Planning

Power Systems Planning
$\omega_1$: uncertain renewable energy generation; $\omega_2$: uncertain demand of electricity;
$x$: production from traditional thermal plants; $y$: production from backup fast-ramping generators.
$$\underset{x \in X}{\text{minimize}}\;\; \varphi(x) + \mathbb{E}_{\omega_1, \omega_2}\big[\, \psi(x, \omega_1, \omega_2) \,\big],$$
where the recourse function is
$$\psi(x, \omega_1, \omega_2) \,\triangleq\, \underset{y}{\text{minimum}}\;\; y^\top c\big( [\, \omega_2 - x - \omega_1 \,]_+ \big) \quad \text{subject to} \quad A(\omega) x + D y \ge \xi(\omega).$$
The unit cost of the fast-ramping generators depends on:
• the observed renewable production and demand of electricity;
• the shortfall due to the first-stage thermal power decisions.
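As a simplified illustration only (an assumption, not the slide's full recourse problem), suppose the backup generators must produce exactly the shortfall $[\omega_2 - x - \omega_1]_+$ at a shortfall-dependent unit cost; the expectation can then be estimated by a sample average:

```python
import random

# Simplified scalar sketch of the two-stage cost (an assumption, not the
# slide's full model): backup production equals the shortfall
#   y = [omega2 - x - omega1]_+
# at unit cost c(y), and E[psi] is estimated by a sample average.

def shortfall(x, w1, w2):
    return max(0.0, w2 - x - w1)

def sampled_total_cost(x, scenarios, first_stage_cost, unit_cost):
    recourse = [shortfall(x, w1, w2) * unit_cost(shortfall(x, w1, w2))
                for (w1, w2) in scenarios]
    return first_stage_cost(x) + sum(recourse) / len(recourse)

random.seed(0)
scen = [(random.uniform(0, 2), random.uniform(3, 5)) for _ in range(10000)]
cost = sampled_total_cost(2.0, scen, lambda x: 1.5 * x, lambda y: 1.0 + y)
print(cost)  # first-stage cost 3.0 plus the average recourse cost
```

The state-dependent unit cost makes the per-scenario recourse nonsmooth and nonconvex in x, which is what puts the problem outside classical two-stage stochastic linear programming.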
Comments
• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
$$\underset{x}{\text{minimize}}\;\; \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_\xi\big[\, \psi(x, \xi) \,\big] \quad \text{subject to} \quad x \in X \subseteq \mathbb{R}^{n_1},$$
where the recourse function $\psi(x, \xi)$ is the optimal objective value of the quadratic program (QP), with $Q$ symmetric positive semidefinite:
$$\psi(x, \xi) \,\triangleq\, \underset{y}{\text{minimum}}\;\; \big[\, f(\xi) + G(\xi) x \,\big]^\top y + \tfrac{1}{2}\, y^\top Q y \quad \text{subject to} \quad y \in Y(x, \xi) \,\triangleq\, \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi) x + D y \ge b(\xi) \,\big\}.$$
• The recourse function $\psi(\cdot, \xi)$ is of the implicit convex-concave kind.
• Developed a combined regularization, convexification, and sampling method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs
The underlying problem is: minimize $c^\top x$ subject to $Ax = b$ and $x \ge 0$.
Defender's problem: anticipating the attacker's disruption $\delta \in \Delta$, the defender undertakes the operation by solving
$$\underset{x \ge 0}{\text{minimize}}\;\; (c + \delta_c)^\top x + \rho\, \underbrace{\big\| [\, (A + \delta_A) x - b - \delta_b \,]_+ \big\|_\infty}_{\text{constraint residual}}.$$
Attacker's problem: anticipating the defender's activity $x$, the attacker aims to disrupt the defender's operations by solving
$$\underset{\delta \in \Delta}{\text{maximize}}\;\; \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2 \underbrace{\big\| [\, (A + \delta_A) x - b - \delta_b \,]_+ \big\|_\infty}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}.$$
• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial programming.
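Both players' objectives share the residual term $\| [(A + \delta_A)x - b - \delta_b]_+ \|_\infty$, which measures the worst violation of the perturbed constraints. A small helper with toy data:

```python
# Residual term shared by both objectives:
#   || [ (A + dA) x - b - db ]_+ ||_inf
# measuring the worst violation of the perturbed constraints.
# The matrices and vectors below are toy stand-ins.

def perturbed_residual(A, dA, x, b, db):
    rows = []
    for i in range(len(A)):
        Ai = [A[i][j] + dA[i][j] for j in range(len(x))]
        ri = sum(Ai[j] * x[j] for j in range(len(x))) - b[i] - db[i]
        rows.append(max(0.0, ri))          # [.]_+ row by row
    return max(rows)                       # infinity norm

A  = [[1.0, 1.0], [1.0, -1.0]]
dA = [[0.0, 0.0], [0.5, 0.0]]              # attack on one matrix entry
x  = [2.0, 1.0]
print(perturbed_residual(A, dA, x, b=[3.0, 0.0], db=[0.0, 0.0]))  # 2.0
```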
Robustification of nonconvex problems
A general nonlinear program:
$$\underset{x \in S}{\text{minimize}}\;\; \theta(x) + f(x) \quad \text{subject to} \quad H(x) = 0,$$
where:
• $S$ is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• $\theta$ and $f$ are real-valued functions defined on an open convex set $O$ in $\mathbb{R}^n$ containing $S$;
• $H : O \to \mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints $g(x) \le 0$ and equality constraints $h(x) = 0$.
Robust counterpart with separate perturbations:
$$\underset{x \in S}{\text{minimize}}\;\; \theta(x) + \max_{\delta \in \Delta} f(x, \delta) \quad \text{subject to} \quad \underbrace{H(x, \delta) = 0 \;\; \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}.$$
New formulation
via a pessimistic value-function minimization given γ gt 0
minimizexisinS
θ(x) + vγ(x)
where vγ(x) models constraint violation with cost of model attacks
vγ(x) maximum over δ isin ∆ of
φγ(x δ) f(x δ) + γ
H(x δ) infin︸ ︷︷ ︸constraint residual
minus C(δ)︸ ︷︷ ︸attack cost
with the novel features
bull combined perturbation δ in joint uncertainty set ∆ sube RL
bull robustify constraint satisfaction via penalization
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
A readiness test (flowchart; consolidated from the incremental slides)

A → Yes: you will not learn from this talk; live happily ever after.
A → No: proceed to B, with three possible responses to the "non"-features:

• C1: Ignore these properties, or give them minimal or even wrong treatment.
• C2: Focus on stylized problems (and there are many interesting ones) and stop at them.
• C3: Faithfully treat these non-properties, by one of two routes:
  – D1: Employ heuristics, potentially leading to heuristics on heuristics when complex models are being formulated.
  – D2: The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features

Non-Problems = Non-Smooth + Non-Convex + Non-Deterministic,

and also Non-Separable, Non-Stylistic, Non-Interbreeding, and many more non-topics · · ·
Examples of Non-Problems

I. Structured Learning

Piecewise Affine Regression, extending classical affine regression:

\[
\operatorname*{minimize}_{a,\,\alpha,\,b,\,\beta}\;\frac{1}{N}\sum_{s=1}^{N}\Big(y_s-\Big[\max_{1\le i\le k_1}\big\{(a^i)^\top x_s+\alpha_i\big\}-\max_{1\le i\le k_2}\big\{(b^i)^\top x_s+\beta_i\big\}\Big]\Big)^2
\]

The difference of two max-affine terms captures all piecewise affine functions. The problem illustrates coupled non-convexity and non-differentiability, and estimates change points/hyperplanes in linear regression.
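As a small numerical illustration (the function names and array shapes are my own, not from the talk), the difference-of-max-affine model and its least-squares loss can be evaluated as follows:

```python
import numpy as np

def pa_model(x, A, alpha, B, beta):
    """Difference of two max-affine functions:
    max_i {a_i . x + alpha_i} - max_i {b_i . x + beta_i},
    which can represent any piecewise affine function."""
    return np.max(A @ x + alpha) - np.max(B @ x + beta)

def pa_loss(X, y, A, alpha, B, beta):
    """Empirical least-squares objective (1/N) * sum_s (y_s - f(x_s))^2."""
    preds = np.array([pa_model(x, A, alpha, B, beta) for x in X])
    return float(np.mean((y - preds) ** 2))
```

For example, with A = [[1], [-1]], alpha = 0 and a trivial second piece, the model is |x|; jointly in (A, alpha, B, beta) the loss is nonconvex and nondifferentiable.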
Comments

• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

\[
\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\Big)-\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\Big),
\]

where all functions are convex.
Logical Sparsity Learning

\[
\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\; f(\beta)\quad\text{subject to}\quad\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\le b_i,\quad i=1,\dots,m,
\]

where \(|t|_0 := 1\) if \(t\neq 0\) and \(0\) otherwise. Such affine sparsity constraints model, for example:

• strong/weak hierarchy, e.g. \(|\beta_j|_0\le|\beta_k|_0\);
• cost-based variable selection (coefficients \(\ge 0\) but not all equal);
• hierarchical/group variable selection.
Comments

• The ℓ0 function is discrete in nature; an ASC can be formulated either via complementarity constraints or, when A is nonnegative, by integer variables under boundedness of the original variables (the big-M formulation).
• A soft formulation of ASCs is possible, leading to an objective of the kind

\[
\operatorname*{minimize}_{\beta\in\mathbb{R}^d}\; f(\beta)+\gamma\sum_{i=1}^{m}\Big[\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0-b_i\Big]_+
\]

for a given γ > 0.
• When the matrix A has negative entries, the resulting ASC set may not be closed, and its optimization formulation may not have a lower semicontinuous objective.
• When \(|\bullet|_0\) is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
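A minimal numerical sketch of the hard and soft ASC formulations (the helper names are mine, and the penalty is written for the "≤" constraints above):

```python
import numpy as np

def l0(beta):
    """Elementwise |t|_0: 1 where an entry is nonzero, 0 otherwise."""
    return (np.asarray(beta, dtype=float) != 0.0).astype(float)

def asc_violation(beta, A, b):
    """Componentwise violation [A |beta|_0 - b]_+ of the constraints A |beta|_0 <= b."""
    return np.maximum(A @ l0(beta) - b, 0.0)

def soft_asc_objective(beta, A, b, f, gamma):
    """Soft formulation: f(beta) + gamma * (total constraint violation)."""
    return f(beta) + gamma * float(asc_violation(beta, A, b).sum())
```

Note the discontinuity: the objective jumps as an entry of beta crosses zero, which is exactly the discrete character of ℓ0 mentioned above.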
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

\[
\operatorname*{minimize}_{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}\;\frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x_s)\,\big]^2,
\]

where \(f_\theta\) is the L-fold composite function with the nonsmooth max (ReLU) activation:

\[
f_\theta(x)=\underbrace{\big(W^L\bullet+w^L\big)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\bullet+w^2\big)}_{\text{second layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{input layer}},
\]

i.e., linear output ← · · · ← max ← linear ← max ← linear ← input.
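The L-fold composition above can be sketched in a few lines (a toy evaluation only, with my own naming; theta is a list of (W, w) layer pairs, the last pair being the affine output layer):

```python
import numpy as np

def f_theta(theta, x):
    """ReLU layers max(0, W u + w), followed by an affine output layer."""
    u = np.asarray(x, dtype=float)
    for W, w in theta[:-1]:
        u = np.maximum(0.0, W @ u + w)   # hidden layer: the nonsmooth max/ReLU
    W_L, w_L = theta[-1]
    return W_L @ u + w_L                  # output layer: affine, no activation

def empirical_loss(theta, X, Y):
    """(1/S) * sum_s (y_s - f_theta(x_s))^2, for scalar network outputs."""
    return float(np.mean([(y - f_theta(theta, x).item()) ** 2
                          for x, y in zip(X, Y)]))
```

Even this two-layer toy is nonconvex and nondifferentiable in theta, which is what motivates the exact-penalization treatment in the comments that follow.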
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:

\[
\operatorname*{minimize}_{z\in Z,\;u}\;\zeta_\rho(z,u):=\frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^{s,L}\big)+\rho\sum_{s=1}^{S}\sum_{\ell=1}^{L}\sum_{j=1}^{N}\Big|\,u^{s}_{\ell j}-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|,
\]

which penalizes the layer-consistency constraints componentwise.
• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• It leads to an advanced study of exact penalization and error bounds for nondifferentiable optimization problems.
Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

\[
\operatorname*{minimize}_{x\in X}\;\varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big],
\]

where the recourse function is

\[
\psi(x,\omega_1,\omega_2)\;:=\;\operatorname*{minimum}_{y}\;y^\top c\big([\,\omega_2-x-\omega_1\,]_+\big)\quad\text{subject to}\quad A(\omega)x+Dy\ge\xi(\omega).
\]

The unit cost of the fast-ramping generators depends on:
▸ the observed renewable production and demand of electricity;
▸ the shortfall due to the first-stage thermal power decisions.
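For intuition, here is a deliberately simplified recourse evaluation. It assumes D = I, y ≥ 0, and nonnegative unit costs, so the optimal backup dispatch has a closed form; the cost map and data below are placeholders, not the model from the talk:

```python
import numpy as np

def recourse_value(x, omega1, omega2, cost_fn, A, xi):
    """psi(x, omega1, omega2) under the simplification D = I, y >= 0:
    min_y c([omega2 - x - omega1]_+)^T y  s.t.  A x + y >= xi,
    whose optimal y (for nonnegative costs) is the shortfall [xi - A x]_+."""
    shortfall = np.maximum(omega2 - x - omega1, 0.0)  # unmet demand after renewables
    unit_cost = cost_fn(shortfall)                    # scenario-dependent unit costs
    y_opt = np.maximum(xi - A @ x, 0.0)               # minimal feasible backup output
    return float(unit_cost @ y_opt)
```

The key "non"-feature is visible here: the unit cost itself depends on the first-stage decision x through the shortfall, which is what makes the recourse function nonconvex in x.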
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

\[
\operatorname*{minimize}_{x}\;\zeta(x):=\varphi(x)+\mathbb{E}_{\xi}\big[\psi(x,\xi)\big]\quad\text{subject to}\quad x\in X\subseteq\mathbb{R}^{n_1},
\]

where the recourse function \(\psi(x,\xi)\) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

\[
\psi(x,\xi)\;:=\;\operatorname*{minimum}_{y}\;\big[f(\xi)+G(\xi)x\big]^\top y+\tfrac{1}{2}\,y^\top Qy\quad\text{subject to}\quad y\in Y(x,\xi):=\{\,y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\,\}.
\]

• The recourse function \(\psi(\bullet,\xi)\) is of the implicit convex-concave kind.
• Developed a regularization-convexification-sampling combined method for computing a generalized critical point, with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs

The underlying problem: minimize \(c^\top x\) subject to \(Ax=b\) and \(x\ge 0\).

Defender's problem: anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

\[
\operatorname*{minimize}_{x\ge 0}\;(c+\delta_c)^\top x+\rho\,\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\big\|_\infty}_{\text{constraint residual}}.
\]

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

\[
\operatorname*{maximum}_{\delta\in\Delta}\;\lambda_1\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}}+\lambda_2\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\big\|_\infty}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}.
\]

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial programming.
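The defender's residual-penalized objective for a fixed attack δ = (δc, δA, δb) is straightforward to evaluate (a sketch with hypothetical data shapes):

```python
import numpy as np

def defender_objective(x, c, A, b, dc, dA, db, rho):
    """(c + dc)^T x + rho * || [(A + dA) x - b - db]_+ ||_inf
    for a fixed attack (dc, dA, db)."""
    residual = np.maximum((A + dA) @ x - b - db, 0.0)  # one-sided constraint residual
    return float((c + dc) @ x + rho * residual.max())
```

In the game, the attacker would choose (dc, dA, db) from Δ to drive this quantity up, net of the attack cost C(δ).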
Robustification of nonconvex problems

A general nonlinear program:

\[
\operatorname*{minimize}_{x\in S}\;\theta(x)+f(x)\quad\text{subject to}\quad H(x)=0,
\]

where
• S is a closed convex set in \(\mathbb{R}^n\), not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in \(\mathbb{R}^n\) containing S;
• \(H:O\to\mathbb{R}^m\) is a vector-valued (nonsmooth) function that can be used to model both inequality constraints \(g(x)\le 0\) and equality constraints \(h(x)=0\).

Robust counterpart with separate perturbations:

\[
\operatorname*{minimize}_{x\in S}\;\theta(x)+\operatorname*{maximum}_{\delta\in\Delta}f(x,\delta)\quad\text{subject to}\quad H(x,\delta)=0\;\;\forall\,\delta\in\Delta
\]

— easily infeasible in the current way of modeling.
New formulation, via a pessimistic value-function minimization: given γ > 0,

\[
\operatorname*{minimize}_{x\in S}\;\theta(x)+v_\gamma(x),
\]

where \(v_\gamma(x)\) models constraint violation together with a cost of model attacks:

\[
v_\gamma(x)\;:=\;\operatorname*{maximum}_{\delta\in\Delta}\;\varphi_\gamma(x,\delta):=f(x,\delta)+\gamma\,\underbrace{\|H(x,\delta)\|_\infty}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}},
\]

with the novel features:
• a combined perturbation δ in a joint uncertainty set \(\Delta\subseteq\mathbb{R}^L\);
• constraint satisfaction robustified via penalization.
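When Δ is a finite set of candidate attacks, the pessimistic value function is a plain maximum and can be evaluated directly (a sketch; f, H, and C below are placeholder callables):

```python
import numpy as np

def v_gamma(x, Delta, f, H, C, gamma):
    """v_gamma(x) = max over delta in a finite set Delta of
    f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)."""
    return max(f(x, d) + gamma * float(np.max(np.abs(H(x, d)))) - C(d)
               for d in Delta)
```

Even when f(•, δ) and H(•, δ) are smooth, v_gamma is a nonsmooth (max-type) and generally nonconvex function of x, which is exactly the "non"-structure the talk targets.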
Comments

• Related the value-function optimization problem to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of \(f(\bullet,\delta)\) and \(H(\bullet,\delta)\), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

\[
\operatorname*{minimize}_{x\in S}\;\varphi(x):=\theta(x)+v(x),\quad\text{with}\quad v(x):=\max_{1\le j\le J}\;\operatorname*{maximum}_{\delta\in\Delta\cap\mathbb{R}^L_+}\big[\,g_j(x,\delta)-h_j(x)^\top\delta\,\big].
\]

• Implemented the algorithm and reported computational results demonstrating the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

\[
\mathrm{In\text{-}CVaR}_\alpha^\beta(Z)\;:=\;\frac{1}{\beta-\alpha}\int_{\mathrm{VaR}_\alpha(Z)\le z<\mathrm{VaR}_\beta(Z)}z\,dF_Z(z)\;=\;\frac{1}{\beta-\alpha}\Big[(1-\alpha)\,\mathrm{CVaR}_\alpha(Z)-(1-\beta)\,\mathrm{CVaR}_\beta(Z)\Big].
\]

Buffered probability of exceedance/deceedance at level τ > 0:

\[
\mathrm{bPOE}(Z;\tau)\;:=\;1-\min\big\{\alpha\in[0,1]\;:\;\mathrm{CVaR}_\alpha(Z)\ge\tau\big\},\qquad
\mathrm{bPOD}(Z;u)\;:=\;\min\big\{\beta\in[0,1]\;:\;\mathrm{In\text{-}CVaR}_0^\beta(Z)\ge u\big\}.
\]
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

\[
\mathbb{P}(u\ge Z>\tau)\;\le\;\mathrm{bPOE}(Z;\tau)+\mathrm{bPOD}(Z;u)-1\;=:\;\mathrm{bPOI}_\tau^u(Z).
\]

Two risk minimization problems of a composite random functional:

\[
\operatorname*{minimize}_{\theta\in\Theta}\;\underbrace{\mathrm{In\text{-}CVaR}_\alpha^\beta\big(Z(\theta)\big)}_{\substack{\text{a difference of two convex}\\\text{SP value functions}}}\qquad\text{and}\qquad\operatorname*{minimize}_{\theta\in\Theta}\;\underbrace{\mathrm{bPOI}_\tau^u\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with a}\\\text{difference-of-convex objective}}},
\]

where

\[
Z(\theta)\;:=\;c\big(\,\underbrace{g(\theta;Z)-h(\theta;Z)}_{g(\bullet,z)-h(\bullet,z)\text{ is difference-of-convex}}\,\big),
\]

with \(c:\mathbb{R}\to\mathbb{R}\) a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, \(g(\bullet,z)\) and \(h(\bullet,z)\) both convex functions.
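These tail functionals are easy to estimate from samples. Here is a plain empirical sketch (the discretization and grid-search conventions are my own simplifications, not the talk's algorithms):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the largest (1 - alpha) fraction of outcomes."""
    z = np.sort(np.asarray(z, dtype=float))
    k = int(np.ceil(alpha * len(z)))      # discard the k smallest outcomes
    return float(z[k:].mean())

def in_cvar(z, alpha, beta):
    """In-CVaR^beta_alpha(Z) = [(1-alpha) CVaR_alpha(Z) - (1-beta) CVaR_beta(Z)] / (beta - alpha)."""
    upper = 0.0 if beta >= 1.0 else (1.0 - beta) * cvar(z, beta)
    return ((1.0 - alpha) * cvar(z, alpha) - upper) / (beta - alpha)

def bpoe(z, tau, grid=2001):
    """bPOE(Z; tau) = 1 - min{alpha in [0,1] : CVaR_alpha(Z) >= tau}, by grid search on alpha."""
    n = len(z)
    feasible = [a for a in np.linspace(0.0, 1.0 - 1.0 / n, grid) if cvar(z, a) >= tau]
    return 1.0 - min(feasible) if feasible else 0.0
```

For the sample {1, ..., 10}, cvar(z, 0.5) averages the top half, and in_cvar(z, 0, 0.5) recovers the mean of the bottom half, matching the identity above.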
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

\[
\underbrace{\operatorname*{minimum}_{\eta_1\in\Upsilon_1}\;\mathbb{E}_\omega\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}\;-\;\underbrace{\operatorname*{minimum}_{\eta_2\in\Upsilon_2}\;\mathbb{E}_\omega\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}.
\]

Both \(v_1\) and \(v_2\) are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in handling outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All, What Can Be Computed?

[Diagram: a hierarchy of stationarity concepts — critical points (for dc functions), Clarke stationary and limiting stationary points, and directional stationary points (for dd functions) — alongside second-order necessary conditions (SONC), local minima, and second-order sufficient conditions (SOSC).]

Relationships between the stationary points:

▸ Stationarity:
  – smooth case: \(\nabla V(x)=0\);
  – nonsmooth case: \(0\in\partial V(x)\) [subdifferential; many calculus rules fail], e.g. \(\partial V_1(x)+\partial V_2(x)\neq\partial(V_1+V_2)(x)\) unless one of them is continuously differentiable.
▸ Directional stationarity: \(V'(x;d)\ge 0\) for all directions d.

• The sharper the concept, the harder it is to compute.
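A one-line example of the failed sum rule, and of the gap between Clarke and directional stationarity (standard facts, stated here only for illustration):

```latex
\[
V_1(x) = |x|,\quad V_2(x) = -|x| \;\Longrightarrow\;
\partial(V_1+V_2)(0) = \{0\} \subsetneq [-1,1] + [-1,1] = \partial V_1(0) + \partial V_2(0).
\]
\[
\text{Moreover } 0 \in \partial V_2(0), \text{ so } x=0 \text{ is Clarke stationary for } V_2;
\text{ yet } V_2'(0;d) = -|d| < 0 \;\;\forall\, d \neq 0,
\]
so it is not directionally stationary (indeed x = 0 maximizes V_2), illustrating why the directional concept is sharper.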
Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
▸ A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithms;
  – dealing with non-smoothness: nonsmooth Newton algorithms;
  – dealing with pathological constraints: (exact) penalty methods.
▸ An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
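To make the majorization-minimization idea concrete, here is the classical difference-of-convex algorithm on a toy one-dimensional dc function; the decomposition and names are mine, chosen only to show the surrogation step:

```python
import numpy as np

def dca(argmin_g_minus_linear, h_subgrad, x0, iters=50):
    """Majorization-minimization for f = g - h with g, h convex:
    at x_k, replace -h by its linearization -h(x_k) - v*(x - x_k), v in dh(x_k),
    which majorizes f; then minimize the resulting convex surrogate."""
    x = x0
    for _ in range(iters):
        v = h_subgrad(x)                 # subgradient of the subtracted convex part
        x = argmin_g_minus_linear(v)     # x_{k+1} = argmin_x g(x) - v * x
    return x

# Toy dc function: f(x) = x^2 - |x|, i.e. g(x) = x^2 and h(x) = |x|.
argmin_g_minus_linear = lambda v: v / 2.0           # argmin_x x^2 - v x = v / 2
h_subgrad = lambda x: 1.0 if x >= 0 else -1.0       # an element of the subdifferential of |x|
x_star = dca(argmin_g_minus_linear, h_subgrad, x0=2.0)   # -> 0.5, a critical point of f
```

Each iterate solves a convex problem, and the limit 0.5 is a critical point in the dc sense (here it is in fact a local minimizer of f).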
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic.
• Difference-of-convex: facilitates algorithmic design via convex majorization.
• Semismooth: enables fast algorithms of the Newton type.
• Bouligand differentiable: locally Lipschitz + directionally differentiable.
• Subanalytic and semianalytic.

* and many more
[Venn diagram reproduced from Section 4.4 ("Classes of Nonsmooth Functions", p. 231) of the monograph: the inclusions among piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), piecewise affine (PA), PC1 and PC2, semismooth, B-differentiable, difference-of-convex (dc), dc composite, diff-max-convex, implicit convex-concave (icc), (locally) strongly weakly convex, and locally dc functions, under various smoothness (C1, LC1, lower-C2) and open/closed convex-domain conditions.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
  1.1 Smooth Calculus 1
    1.1.1 Differentiability of functions of several variables 7
    1.1.2 Mean-value theorems for vector-valued functions 12
    1.1.3 Fixed-point theorems and contraction 16
    1.1.4 Implicit-function theorems 18
  1.2 Matrices and Linear Equations 23
    1.2.1 Matrix factorizations, eigenvalues, and singular values 29
    1.2.2 Selected matrix classes 36
  1.3 Matrix Splitting Methods for Linear Equations 47
  1.4 The Conjugate Gradient Method 50
  1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
  2.1 Polyhedral Theory 59
    2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
    2.1.2 Properties of polyhedra 63
  2.2 Convex Sets and Functions 67
    2.2.1 Euclidean projection and proximality 71
  2.3 Convex Quadratic and Nonlinear Programs 79
    2.3.1 The KKT conditions and CQs 81
  2.4 The Minmax Theory of von Neumann 85
  2.5 Basic Descent Methods for Smooth Problems 86
    2.5.1 Unconstrained problems: Armijo line search 87
    2.5.2 The PGM for convex programs 91
    2.5.3 The proximal Newton method 95
  2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
  3.1 A Composite View of Statistical Estimation 111
    3.1.1 The optimization problems 113
    3.1.2 Learning classes 115
    3.1.3 Loss functions 122
    3.1.4 Regularizers for sparsity 123
  3.2 Low Rankness in Matrix Optimization 128
    3.2.1 Rank constraints: Hard and soft formulations 128
    3.2.2 Shape constrained index models 131
  3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
  4.1 B(ouligand)-Differentiable Functions 144
    4.1.1 Generalized directional derivatives 153
  4.2 Second-order Directional Derivatives 156
  4.3 Subdifferentials and Regularity 162
  4.4 Classes of Nonsmooth Functions 176
    4.4.1 Piecewise affine functions 177
    4.4.2 Piecewise linear-quadratic functions 190
    4.4.3 Weakly convex functions 205
    4.4.4 Difference-of-convex functions 214
    4.4.5 Semismooth functions 221
    4.4.6 Implicit convex-concave functions 229
    4.4.7 A summary 240

5 Value Functions 243
  5.1 Nonlinear Programming based Value Functions 244
    5.1.1 Interdiction versus enhancement 247
  5.2 Value Functions of Quadratic Programs 251
  5.3 Value Functions Associated with Convex Functions 258
  5.4 Robustification of Optimization Problems 261
    5.4.1 Attacks without set-up costs 264
    5.4.2 Attacks with set-up costs 267
  5.5 The Danskin Property 271
  5.6 Gap Functions 278
  5.7 Some Nonconvex Stochastic Value Functions 282
  5.8 Understanding Robust Optimization 289
    5.8.1 Uncertainty in data realizations 290
    5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
  6.1 First-order Theory 313
    6.1.1 The convex constrained case 314
    6.1.2 Composite ε-strong stationarity 320
    6.1.3 Subdifferential based stationarity 324
  6.2 Directional Saddle-Point Theory 332
  6.3 Difference-of-Convex Constraints 337
    6.3.1 Extension to piecewise dd-convex programs 345
  6.4 Second-order Theory 348
  6.5 Piecewise Linear-Quadratic Programs 350
    6.5.1 Review of nonconvex QPs 350
    6.5.2 Return to PLQ programs 352
    6.5.3 Finiteness of stationary values 362
    6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
  7.1 Basic Surrogation 372
    7.1.1 A broad family of upper surrogation functions 379
    7.1.2 Examples of surrogation families 381
    7.1.3 Surrogation algorithms for several problem classes 384
  7.2 Refinements and Extensions 392
    7.2.1 Moreau surrogation 392
    7.2.2 Combined BCD and ADMM for coupled constraints 400
    7.2.3 Surrogation with line searches 418
    7.2.4 Dinkelbach's algorithm with surrogation 424
    7.2.5 Randomized choice of subproblems 432
    7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
  8.1 An Introduction to Error Bounds 442
    8.1.1 The stationarity inclusions 444
  8.2 The Piecewise Polyhedral Inclusion 449
  8.3 Error Bounds for Subdifferential-Based Stationarity 454
  8.4 Sequential Convergence Under Sufficient Descent 462
    8.4.1 Using the local error bound (8.19) 462
    8.4.2 The Kurdyka-Łojasiewicz theory 467
    8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
  8.5 Error Bounds and Linear Regularity 476
    8.5.1 Piecewise systems 483
    8.5.2 Stationarity for convex composite dc optimization 488
    8.5.3 dd-convex systems 491
    8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
  9.1 Exact Penalization of Optimal Solutions 503
  9.2 Exact Penalization of Stationary Solutions 506
  9.3 Pulled-out Penalization of Composite Functions 511
  9.4 Two Applications of Pull-out Penalization 517
    9.4.1 B-stationarity of dc composite diff-max programs 518
    9.4.2 Local optimality of composite twice semidiff programs 521
  9.5 Exact Penalization of Affine Sparsity Constraints 527
  9.6 A Special Class of Cardinality Constrained Problems 531
  9.7 Penalized Surrogation for Composite Programs 533
    9.7.1 Solving the convex penalized subproblems 539
    9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
  10.1 Expectation Functions 548
    10.1.1 Sample average approximations 555
  10.2 Basic Stochastic Surrogation 560
    10.2.1 Expectation constraints 567
  10.3 Two-stage SPs with Quadratic Recourse 568
    10.3.1 The positive definite case 569
    10.3.2 The positive semidefinite case 571
  10.4 Probabilistic Error Bounds 588
  10.5 Consistency of Directional Stationary Solutions 591
    10.5.1 Convergence of stationary solutions 592
    10.5.2 Asymptotic distribution of the stationary values 605
    10.5.3 Convergence rate of the stationary points 609
  10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
  11.1 Definitions and Preliminaries 627
    11.1.1 Applications 631
    11.1.2 Related models 637
  11.2 Existence of a QNE 640
  11.3 Computational Algorithms 648
    11.3.1 A Gauss-Seidel best-response algorithm 649
  11.4 The Role of Potential 658
    11.4.1 Existence of a potential 665
  11.5 Double-Loop Algorithms 669
  11.6 The Nikaido-Isoda Optimization Formulations 678
    11.6.1 Surrogation applied to the NI function φc 681
    11.6.2 Difference-of-convexity and surrogation of φ̃c 683

Bibliography 691
Index 721
[Readiness-test flowchart, built up across slides: node A — "will not learn from this talk; live happily ever after" — and node B, branching Yes/No through C1, C2, C3 and then D1, D2.]

C2: Focus on stylized problems (and there are many interesting ones) and stop at them.
C3: Faithfully treat these non-properties.
D1: Employ heuristics, potentially leading to heuristics on heuristics when complex models are being formulated.
D2: The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features of Non-Problems

Non-Smooth + Non-Convex + Non-Deterministic
Non-Separable, Non-Stylistic, Non-Interbreeding, and many more "non"-topics ···
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression — extending classical affine regression:

$$ \underset{a,\alpha,b,\beta}{\mathrm{minimize}}\;\; \frac{1}{N}\sum_{s=1}^{N}\Big( y_s - \Big[\, \underbrace{\max_{1\le i\le k_1}\big\{(a^i)^\top x_s+\alpha_i\big\} - \max_{1\le i\le k_2}\big\{(b^i)^\top x_s+\beta_i\big\}}_{\text{capturing all piecewise affine functions}} \,\Big]\Big)^{2} $$

illustrating coupled non-convexity and non-differentiability: estimate change points/hyperplanes in linear regression.
Comments

• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is
$$ \varphi_1\Big( \max_{1\le i\le k_{11}} \psi^1_{1i}(x) - \max_{1\le i\le k_{12}} \psi^1_{2i}(x) \Big) \;-\; \varphi_2\Big( \max_{1\le i\le k_{21}} \psi^2_{1i}(x) - \max_{1\le i\le k_{22}} \psi^2_{2i}(x) \Big) $$
where all the functions are convex.
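A minimal numerical sketch of the model class (the synthetic data are an illustrative assumption, and this is not the convex-programming algorithm of the talk): the difference-of-max predictor below realizes piecewise affine functions, e.g. it represents $|x_1| = \max(x_1, -x_1) - 0$ exactly.

```python
import numpy as np

def dmax_predict(X, A, alpha, B, beta):
    """Difference-of-max piecewise affine model:
    max_i (a_i^T x + alpha_i) - max_i (b_i^T x + beta_i)."""
    return (X @ A.T + alpha).max(axis=1) - (X @ B.T + beta).max(axis=1)

def loss(X, y, A, alpha, B, beta):
    """Least-squares fitting error (1/N) sum_s (y_s - model(x_s))^2."""
    r = y - dmax_predict(X, A, alpha, B, beta)
    return (r @ r) / len(y)

# Synthetic data from a known piecewise affine target: y = |x_1|
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.abs(X[:, 0])

# Exact representation: max(x_1, -x_1) minus the zero function
A_true = np.array([[1.0, 0.0], [-1.0, 0.0]]); alpha_true = np.zeros(2)
B_true = np.zeros((1, 2)); beta_true = np.zeros(1)
```

At these parameters the fitting loss is exactly zero, confirming that the difference-of-max family contains the target; fitting $A, \alpha, B, \beta$ from data is the nonconvex, nondifferentiable problem the slide describes.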
Logical Sparsity Learning

$$ \underset{\beta\in\mathbb{R}^d}{\mathrm{minimize}}\; f(\beta) \quad \text{subject to} \quad \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \;\le\; b_i, \quad i = 1,\cdots,m, $$

where $|t|_0 \triangleq 1$ if $t \neq 0$, and $0$ otherwise.

• strong/weak hierarchy, e.g. $|\beta_j|_0 \le |\beta_k|_0$
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection
Comments

• The ℓ₀ function is discrete in nature; an ASC can be formulated either as complementarity constraints or by integer variables, under boundedness of the original variables (the big-M formulation), when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind
$$ \underset{\beta\in\mathbb{R}^d}{\mathrm{minimize}}\; f(\beta) \;+\; \gamma \sum_{i=1}^{m} \Big[\, b_i - \sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \,\Big]_+ \quad \text{for a given } \gamma > 0. $$
• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When $|\bullet|_0$ is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints.
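To make the soft formulation concrete, a small sketch evaluating the penalty term from the slide, $\gamma \sum_i [\, b_i - \sum_j a_{ij}|\beta_j|_0 \,]_+$ (the particular $A$, $b$, $\gamma$, and $\beta$ values are illustrative assumptions):

```python
import numpy as np

def l0(v):
    """Elementwise |t|_0: 1 if t != 0, else 0."""
    return (v != 0).astype(float)

def asc_soft_penalty(beta, A, b, gamma):
    """gamma * sum_i [ b_i - sum_j a_ij * |beta_j|_0 ]_+ : the soft ASC term."""
    return gamma * np.maximum(b - A @ l0(beta), 0.0).sum()

A = np.array([[1.0, 1.0, 0.0]])         # one ASC row (illustrative)
b = np.array([2.0])
beta_ok  = np.array([0.5, -1.2, 0.0])   # two nonzeros: row satisfied, no penalty
beta_bad = np.array([0.5,  0.0, 0.0])   # one nonzero: slack of 1, penalized
```

Because $|\cdot|_0$ jumps at zero, the penalty is discontinuous in $\beta$; this is exactly the discreteness the comments above address via complementarity, big-M, or folded-concave approximations.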
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

$$ \underset{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}{\mathrm{minimize}}\;\; \frac{1}{S}\sum_{s=1}^{S}\big[\, y_s - f_\theta(x_s) \,\big]^2, $$

where $f_\theta$ is the $L$-fold composite function with the nonsmooth max (ReLU) activation:

$$ f_\theta(x) \;=\; \underbrace{\big(W^L \bullet + w^L\big)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\big(0,\, W^2 \bullet + w^2\big)}_{\text{second layer}} \circ \underbrace{\max\big(0,\, W^1 x + w^1\big)}_{\text{input layer}}, $$

i.e., linear output ∘ ··· ∘ max ∘ linear ∘ max ∘ linear input.
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ₁ penalization:
$$ \underset{z\in Z,\,u}{\mathrm{minimize}}\;\; \zeta_\rho(z,u) \,\triangleq\, \frac{1}{S}\sum_{s=1}^{S} \varphi_s(u^{s,L}) \;+\; \rho \sum_{\ell=1}^{L}\sum_{j=1}^{N} \underbrace{\Big|\, u^{s,\ell}_j - \psi_{\ell j}\big(z^\ell, u^{s,\ell-1}\big) \,\Big|}_{\text{penalizing the components}} $$
• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
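A small sketch of the multi-composite structure and the ℓ₁ layer-coupling penalty (the shapes and random weights are illustrative assumptions): the forward pass composes $\max(0, Wu + w)$ layers with an affine output, and the penalty vanishes exactly when the auxiliary variables $u^\ell$ equal the true layer outputs.

```python
import numpy as np

def forward(theta, x):
    """f_theta(x): ReLU layers max(0, W u + w) composed with an affine output."""
    *hidden, (W_out, w_out) = theta
    u = x
    for W, w in hidden:
        u = np.maximum(0.0, W @ u + w)   # nonsmooth max composition
    return W_out @ u + w_out

def l1_layer_penalty(theta, x, us, rho):
    """rho * sum_{l,j} | u^l_j - max(0, W^l u^{l-1} + w^l)_j |: the l1 term
    coupling the auxiliary layer variables to the network recursion."""
    pen, prev = 0.0, x
    for (W, w), u in zip(theta[:-1], us):
        pen += np.abs(u - np.maximum(0.0, W @ prev + w)).sum()
        prev = u
    return rho * pen

rng = np.random.default_rng(1)
theta = [(rng.normal(size=(3, 2)), rng.normal(size=3)),   # hidden layer
         (rng.normal(size=(1, 3)), rng.normal(size=1))]   # output layer
x = rng.normal(size=2)
u1 = np.maximum(0.0, theta[0][0] @ x + theta[0][1])       # exact hidden values
```

With the auxiliary variables set to the exact layer outputs the penalty is zero, which is the sense in which the penalized problem reproduces the original composite training problem.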
Examples of Non-Problems
II. Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$$ \underset{x\in X}{\mathrm{minimize}}\;\; \varphi(x) + \mathbb{E}_{\omega_1,\omega_2}\big[\, \psi(x,\omega_1,\omega_2) \,\big], $$

where the recourse function is

$$ \psi(x,\omega_1,\omega_2) \,\triangleq\, \underset{y}{\mathrm{minimum}}\;\; y^\top c\big([\,\omega_2 - x - \omega_1\,]_+\big) \quad \text{subject to } A(\omega)x + Dy \ge \xi(\omega). $$

The unit cost of the fast-ramping generators depends on
▸ the observed renewable production & demand of electricity
▸ the shortfall due to the first-stage thermal power decisions
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
$$ \underset{x}{\mathrm{minimize}}\;\; \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_{\xi}\big[\,\psi(x,\xi)\,\big] \quad \text{subject to } x \in X \subseteq \mathbb{R}^{n_1}, $$
where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:
$$ \psi(x,\xi) \,\triangleq\, \underset{y}{\mathrm{minimum}}\;\; \big[\,f(\xi) + G(\xi)x\,\big]^\top y + \tfrac{1}{2}\, y^\top Q y \quad \text{subject to } y \in Y(x,\xi) \triangleq \{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \ge b(\xi) \,\}. $$
• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a regularization-convexification sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
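A minimal sample-average sketch of this two-stage structure for the linear special case Q = 0 (the toy single-generator data and the use of SciPy's LP solver are illustrative assumptions, not the regularization-convexification method of the talk):

```python
import numpy as np
from scipy.optimize import linprog

def recourse(x, c, A, D, b):
    """psi(x, xi) in the linear special case Q = 0:
    minimize c^T y  subject to  A x + D y >= b,  y >= 0."""
    # D y >= b - A x  is rewritten as  -D y <= A x - b  for linprog's A_ub form
    res = linprog(c, A_ub=-D, b_ub=A @ x - b, bounds=[(0, None)] * len(c))
    return res.fun

def saa_objective(x, phi, scenarios):
    """Sample-average approximation of phi(x) + E_xi[ psi(x, xi) ]."""
    return phi(x) + np.mean([recourse(x, *xi) for xi in scenarios])

# Toy instance: thermal production x plus one backup generator covering demand b
c = np.array([2.0]); A = np.array([[1.0]]); D = np.array([[1.0]])
scenarios = [(c, A, D, np.array([3.0])), (c, A, D, np.array([5.0]))]
x = np.array([4.0])
value = saa_objective(x, lambda x: float(x.sum()), scenarios)
```

Here the backup cost is incurred only in the high-demand scenario (shortfall 1 at unit cost 2), so the SAA objective is 4 + (0 + 2)/2 = 5; with ξ-dependent linear terms and a semidefinite Q the recourse value becomes the nonsmooth implicit convex-concave function discussed above.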
(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize $c^\top x$ subject to $Ax = b$ and $x \ge 0$.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

$$ \underset{x\ge 0}{\mathrm{minimize}}\;\; (c+\delta_c)^\top x + \rho\, \underbrace{\big\| \big[\,(A+\delta_A)x - b - \delta_b\,\big]_+ \big\|_\infty}_{\text{constraint residual}} $$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$ \underset{\delta\in\Delta}{\mathrm{maximum}}\;\; \lambda_1 \underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2 \underbrace{\big\| \big[\,(A+\delta_A)x - b - \delta_b\,\big]_+ \big\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}} $$

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial programming.
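A direct transcription of the defender's penalized objective, evaluated for a fixed attack (the data, the zero attack, and ρ below are illustrative assumptions):

```python
import numpy as np

def defender_objective(x, c, A, b, attack, rho):
    """(c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf
    for a fixed attack = (dc, dA, db)."""
    dc, dA, db = attack
    residual = np.maximum((A + dA) @ x - b - db, 0.0)   # one-sided residual
    return (c + dc) @ x + rho * residual.max()

c = np.array([1.0, 1.0]); A = np.eye(2); b = np.array([1.0, 1.0])
x = np.array([2.0, 0.5])
no_attack = (np.zeros(2), np.zeros((2, 2)), np.zeros(2))
val = defender_objective(x, c, A, b, no_attack, rho=10.0)
```

With no attack the residual is $[1, -0.5]_+$, so the ∞-norm term contributes $10 \cdot 1$ on top of $c^\top x = 2.5$; the attacker's problem perturbs $(\delta_c, \delta_A, \delta_b)$ to drive this quantity up, net of the cost $C(\delta)$.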
Robustification of Nonconvex Problems

A general nonlinear program:

$$ \underset{x\in S}{\mathrm{minimize}}\;\; \theta(x) + f(x) \quad \text{subject to } H(x) = 0, $$

where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H: O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$ \underset{x\in S}{\mathrm{minimize}}\;\; \theta(x) + \underset{\delta\in\Delta}{\mathrm{maximum}}\; f(x,\delta) \quad \text{subject to } \underbrace{H(x,\delta) = 0 \;\; \forall\, \delta\in\Delta}_{\text{easily infeasible in the current way of modeling}} $$
New formulation via a pessimistic value-function minimization: given γ > 0,

$$ \underset{x\in S}{\mathrm{minimize}}\;\; \theta(x) + v_\gamma(x), $$

where $v_\gamma(x)$ models constraint violation with the cost of model attacks:

$$ v_\gamma(x) \,\triangleq\, \underset{\delta\in\Delta}{\mathrm{maximum}}\;\; \varphi_\gamma(x,\delta) \,\triangleq\, f(x,\delta) + \gamma \underbrace{\,\| H(x,\delta) \|_\infty\,}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}}, $$

with the novel features:
• combined perturbation δ in a joint uncertainty set ∆ ⊆ ℝᴸ
• robustified constraint satisfaction via penalization
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:
$$ \underset{x\in S}{\mathrm{minimize}}\;\; \varphi(x) \,\triangleq\, \theta(x) + v(x) \quad \text{with} \quad v(x) \,\triangleq\, \max_{1\le j\le J}\;\; \underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\mathrm{maximum}}\; \big[\, g_j(x,\delta) - h_j(x)^\top \delta \,\big]. $$
• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
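When ∆ ∩ ℝᴸ₊ is a simplex and each inner function is convex in δ (linear in the sketch below), the inner maximum is attained at a vertex, so v(x) can be evaluated by enumeration. The particular $g_j$, $h_j$, and simplex are illustrative assumptions:

```python
import numpy as np

def v(x, gs, hs, vertices):
    """v(x) = max_j max_{delta in Delta} [ g_j(x, delta) - h_j(x)^T delta ],
    via vertex enumeration (valid when each inner function is convex in delta)."""
    return max(g(x, d) - h(x) @ d for g, h in zip(gs, hs) for d in vertices)

# Delta = unit simplex {delta >= 0, sum(delta) <= 1} in R^2: vertices 0, e1, e2
vertices = [np.zeros(2), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
gs = [lambda x, d: x[0] * d.sum()]   # illustrative g_1, linear in delta
hs = [lambda x: x]                    # illustrative h_1
val = v(np.array([2.0, -1.0]), gs, hs, vertices)
```

At x = (2, -1) the three vertex candidates are 0, 0, and 3, so v(x) = 3; v is a pointwise max of functions of x and hence nonsmooth, which is what the MM algorithm has to surrogate.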
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE Minimization

for risk-based statistical estimation, to control the left and right tails of a distribution simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$ \text{In-CVaR}_\alpha^\beta(Z) \,\triangleq\, \frac{1}{\beta-\alpha} \int_{\mathrm{VaR}_\beta(Z) \,>\, z \,\ge\, \mathrm{VaR}_\alpha(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta-\alpha}\Big[\, (1-\alpha)\,\mathrm{CVaR}_\alpha(Z) - (1-\beta)\,\mathrm{CVaR}_\beta(Z) \,\Big] $$

Buffered probability of exceedance/deceedance at levels τ > 0 and u:

$$ \mathrm{bPOE}(Z,\tau) \,\triangleq\, 1 - \min\big\{\, \alpha\in[0,1] : \mathrm{CVaR}_\alpha(Z) \ge \tau \,\big\}, \qquad \mathrm{bPOD}(Z,u) \,\triangleq\, \min\big\{\, \beta\in[0,1] : \text{In-CVaR}_0^\beta(Z) \ge u \,\big\} $$
Buffered Probability of an Interval (τ, u]

We have the following bound for an interval probability:

$$ \mathbb{P}(u \ge Z > \tau) \;\le\; \mathrm{bPOE}(Z,\tau) + \mathrm{bPOD}(Z,u) - 1 \;\triangleq\; \mathrm{bPOI}_\tau^u(Z) $$

Two risk minimization problems of a composite random functional:

$$ \underset{\theta\in\Theta}{\mathrm{minimize}}\;\; \underbrace{\text{In-CVaR}_\alpha^\beta\big(Z(\theta)\big)}_{\substack{\text{a difference of two convex}\\ \text{SP value functions}}} \qquad \text{and} \qquad \underset{\theta\in\Theta}{\mathrm{minimize}}\;\; \underbrace{\mathrm{bPOI}_\tau^u\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with a}\\ \text{difference-of-convex objective}}} $$

where

$$ Z(\theta) \,\triangleq\, c\big(\, \underbrace{g(\theta; Z) - h(\theta; Z)}_{g(\bullet,z)\,-\,h(\bullet,z)\ \text{is difference-of-convex}} \,\big), $$

with c: ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
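A small empirical sketch of the In-CVaR identity above, using sorting-based tail averages on a sample (the sample and the levels α, β are illustrative assumptions):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the worst (1 - alpha) fraction of samples."""
    z = np.sort(np.asarray(z, dtype=float))
    k = int(np.ceil(alpha * len(z)))
    return z[k:].mean()

def in_cvar(z, alpha, beta):
    """Empirical In-CVaR via the identity
    ((1 - alpha) CVaR_alpha - (1 - beta) CVaR_beta) / (beta - alpha),
    for 0 <= alpha < beta < 1."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 101.0)       # illustrative sample 1, 2, ..., 100
tail = cvar(z, 0.9)             # average of the top 10%: 91..100
mid = in_cvar(z, 0.8, 0.9)      # average of the slice 81..90 between the two tails
```

On this sample the identity reproduces the average of the observations strictly between the two tail cut-offs, which is exactly the "interval" average the In-CVaR definition integrates over.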
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ
$$ \underbrace{\underset{\eta_1\in\Upsilon_1}{\mathrm{minimum}}\; \mathbb{E}_\omega\big[\,\varphi_1(\theta,\eta_1,\omega)\,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\underset{\eta_2\in\Upsilon_2}{\mathrm{minimum}}\; \mathbb{E}_\omega\big[\,\varphi_2(\theta,\eta_2,\omega)\,\big]}_{\text{denoted } v_2(\theta)} $$
Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in handling outliers in regression and classification problems.
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
A readiness test (flowchart; slide builds consolidated)

A: will not learn from this talk → live happily ever after
B → C1, C2, C3 → Yes / No
C3: Faithfully treat these non-properties.
D1: Employ heuristics, potentially leading to heuristics on heuristics, when complex models are being formulated.
D2: The hard road: strive for rigor without sacrificing practice; advance fundamental knowledge and prepare for the future.
Features

Non-Problems: Non-Smooth + Non-Convex + Non-Deterministic + Non-Separable,
Non-Stylistic, Non-Interbreeding, and many more non-topics · · ·
Examples of Non-Problems

I. Structured Learning

Piecewise Affine Regression

extending classical affine regression:

\[
\underset{a,\,\alpha,\,b,\,\beta}{\text{minimize}}\;\;
\frac{1}{N}\sum_{s=1}^{N}\Bigg[\, y_s-\underbrace{\bigg(\max_{1\le i\le k_1}\big\{(a^i)^\top x_s+\alpha_i\big\}-\max_{1\le i\le k_2}\big\{(b^i)^\top x_s+\beta_i\big\}\bigg)}_{\text{capturing all piecewise affine functions}}\Bigg]^2
\]

illustrating coupled non-convexity and non-differentiability.

Estimate Change Points/Hyperplanes in Linear Regression.
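The max-minus-max form above is straightforward to evaluate directly; here is a minimal numpy sketch (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def pa_model(x, A, alpha, B, beta):
    # difference of two max-affine pieces:
    #   max_i { (a^i)^T x + alpha_i }  -  max_i { (b^i)^T x + beta_i }
    return np.max(A @ x + alpha) - np.max(B @ x + beta)

# sanity check: |t| = max(t, -t) minus the zero function is piecewise affine
A = np.array([[1.0], [-1.0]]); alpha = np.zeros(2)   # rows a^1 = 1, a^2 = -1
B = np.array([[0.0]]); beta = np.zeros(1)            # single zero piece
print(pa_model(np.array([-3.0]), A, alpha, B, beta))  # prints 3.0
```

Every piecewise affine function can be written in this difference-of-max form, which is what the underbrace on the slide asserts.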
Comments

• This is a convex quadratic composed with a piecewise affine function, thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is
\[
\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\Big)
-\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\Big)
\]
where all functions are convex.
Logical Sparsity Learning

\[
\begin{array}{ll}
\underset{\beta\in\mathbb{R}^d}{\text{minimize}} & f(\beta)\\[4pt]
\text{subject to} & \displaystyle\sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \le b_i, \quad i=1,\cdots,m,
\end{array}
\qquad\text{where}\quad
|t|_0 \triangleq \begin{cases} 1 & \text{if } t\neq 0\\ 0 & \text{otherwise}\end{cases}
\]

• strong/weak hierarchy, e.g. |β_j|_0 ≤ |β_k|_0
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection
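As a concrete reading of the constraint, a tiny numpy feasibility check for an affine sparsity constraint (ASC); names and data are illustrative, not from the talk:

```python
import numpy as np

def asc_feasible(beta, A, b):
    # |beta_j|_0 = 1 if beta_j != 0, else 0; feasibility means A @ |beta|_0 <= b
    supp = (beta != 0).astype(float)
    return bool(np.all(A @ supp <= b))

# strong hierarchy |beta_1|_0 <= |beta_2|_0 encoded as one ASC row: a = (1, -1), b = 0
A = np.array([[1.0, -1.0]]); b = np.array([0.0])
print(asc_feasible(np.array([0.0, 2.0]), A, b))   # True:  beta_2 may be active alone
print(asc_feasible(np.array([3.0, 0.0]), A, b))   # False: beta_1 active without beta_2
```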
Comments

• The ℓ0 function is discrete in nature; an ASC can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• Soft formulation of ASCs is possible, leading to an objective of the kind
\[
\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\;\;
f(\beta)+\gamma\sum_{i=1}^{m}\Bigg[\, b_i-\sum_{j=1}^{d} a_{ij}\,|\beta_j|_0 \Bigg]_+
\]
for a given γ > 0.
• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semi-continuous objective.
• When |•|_0 is approximated by a folded concave function, we obtain some nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

\[
\underset{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}{\text{minimize}}\;\;
\frac{1}{S}\sum_{s=1}^{S}\big[\, y_s-f_\theta(x_s)\,\big]^2
\]
where f_θ is the L-fold composite function with the non-smooth max (ReLU) activation:
\[
f_\theta(x)=\underbrace{\big(W^L\,\bullet+w^L\big)}_{\text{Output Layer}}\circ\cdots\circ
\underbrace{\max\big(0,\,W^2\,\bullet+w^2\big)}_{\text{Second Layer}}\circ
\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{Input Layer}}
\]

linear (output) ∘ · · · ∘ max ∘ linear ∘ max ∘ linear (input)
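The composite f_θ above is just a loop of affine maps and ReLUs; a minimal numpy sketch of the forward pass (the toy weights below are made up):

```python
import numpy as np

def f_theta(x, Ws, ws):
    # hidden layers apply the nonsmooth max(0, .) (ReLU); the output layer is affine
    for W, w in zip(Ws[:-1], ws[:-1]):
        x = np.maximum(0.0, W @ x + w)
    return Ws[-1] @ x + ws[-1]

# two-layer instance: R^3 -> R^4 (ReLU) -> R^1 (affine output), all-ones weights
Ws = [np.ones((4, 3)), np.ones((1, 4))]
ws = [np.zeros(4), np.zeros(1)]
print(f_theta(np.array([1.0, 0.5, 0.0]), Ws, ws))  # prints [6.]
```

Each hidden unit computes max(0, 1.5) = 1.5 and the affine output layer sums the four units, which is where the nonsmoothness of the training objective comes from.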
Comments

• Problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:
\[
\underset{z\in Z,\,u}{\text{minimize}}\;\;
\zeta_\rho(z,u)\triangleq\frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^{s,L}\big)
+\rho\sum_{s=1}^{S}\sum_{\ell=1}^{L}\sum_{j=1}^{N}
\underbrace{\Big|\, u^{s,\ell}_j-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}
\]
• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• Framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

\[
\underset{x\in X}{\text{minimize}}\;\;\varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big]
\]
where the recourse function is
\[
\psi(x,\omega_1,\omega_2)\triangleq\;
\begin{array}{ll}
\underset{y}{\text{minimum}} & y^\top c\big(\,[\,\omega_2-x-\omega_1\,]_+\big)\\[2pt]
\text{subject to} & A(\omega)x+Dy\ge\xi(\omega).
\end{array}
\]

Unit cost of fast-ramping generators depends on
• observed renewable production & demand of electricity
• shortfall due to the first-stage thermal power decisions
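To make the recourse concrete: in the special case D = I (one backup variable per constraint) with nonnegative unit costs, ψ has a closed form; a hedged numpy sketch with made-up data (none of these numbers are from the talk):

```python
import numpy as np

def recourse(x, omega1, omega2, A, xi, unit_cost):
    # special case D = I:  min_y  y^T c([omega2 - x - omega1]_+)
    #                      s.t.   A x + y >= xi,  y >= 0,
    # whose minimizer, for nonnegative costs, is simply y* = [xi - A x]_+
    shortfall = np.maximum(omega2 - x - omega1, 0.0)   # demand unmet by thermal + renewables
    c = unit_cost(shortfall)                           # per-unit cost grows with the shortfall
    return c @ np.maximum(xi - A @ x, 0.0)

x = np.array([1.0]); omega1 = np.array([0.5]); omega2 = np.array([2.0])
A = np.array([[1.0]]); xi = np.array([1.5])
print(recourse(x, omega1, omega2, A, xi, lambda s: 1.0 + s))  # prints 0.75
```

The dependence of the cost vector c on the first-stage shortfall is exactly what makes ψ(•, ω1, ω2) nonconvex in x.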
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
\[
\begin{array}{ll}
\underset{x}{\text{minimize}} & \zeta(x)\triangleq\varphi(x)+\mathbb{E}_{\xi}\big[\psi(x,\xi)\big]\\[2pt]
\text{subject to} & x\in X\subseteq\mathbb{R}^{n_1},
\end{array}
\]
where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q being symmetric positive semidefinite:
\[
\begin{array}{ll}
\psi(x,\xi)\triangleq\underset{y}{\text{minimum}} & \big[f(\xi)+G(\xi)x\big]^\top y+\tfrac{1}{2}\,y^\top Qy\\[2pt]
\text{subject to} & y\in Y(x,\xi)\triangleq\big\{\,y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\,\big\}.
\end{array}
\]
• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a regularization + convexification + sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem: anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving
\[
\underset{x\ge0}{\text{minimize}}\;\;(c+\delta_c)^\top x
+\rho\,\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\big\|_\infty}_{\text{constraint residual}}
\]

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving
\[
\underset{\delta\in\Delta}{\text{maximum}}\;\;
\lambda_1\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}}
+\lambda_2\underbrace{\big\|\,[\,(A+\delta_A)x-b-\delta_b\,]_+\big\|_\infty}_{\text{disrupt constraints}}
-\underbrace{C(\delta)}_{\text{cost}}
\]

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
Robustification of nonconvex problems

A general nonlinear program:
\[
\begin{array}{ll}
\underset{x\in S}{\text{minimize}} & \theta(x)+f(x)\\[2pt]
\text{subject to} & H(x)=0,
\end{array}
\]
where
• S is a closed convex set in R^n, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in R^n containing S;
• H : O → R^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:
\[
\begin{array}{ll}
\underset{x\in S}{\text{minimize}} & \theta(x)+\underset{\delta\in\Delta}{\text{maximum}}\; f(x,\delta)\\[2pt]
\text{subject to} & \underbrace{H(x,\delta)=0\;\;\forall\,\delta\in\Delta}_{\text{easily infeasible in current way of modeling}}
\end{array}
\]
New formulation

via a pessimistic value-function minimization: given γ > 0,
\[
\underset{x\in S}{\text{minimize}}\;\;\theta(x)+v_\gamma(x)
\]
where v_γ(x) models constraint violation with cost of model attacks:
\[
v_\gamma(x)\triangleq\underset{\delta\in\Delta}{\text{maximum}}\;\;
\varphi_\gamma(x,\delta)\triangleq f(x,\delta)
+\gamma\,\underbrace{\|H(x,\delta)\|_\infty}_{\text{constraint residual}}
-\underbrace{C(\delta)}_{\text{attack cost}}
\]
with the novel features:
• combined perturbation δ in a joint uncertainty set ∆ ⊆ R^L
• robustify constraint satisfaction via penalization
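When ∆ is a finite set (e.g., a scenario grid), v_γ is a pointwise maximum and can be evaluated directly; a minimal sketch with illustrative stand-ins for f, H, and C (none of these choices are from the talk):

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    # v_gamma(x) = max over delta of  f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)
    return max(f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in deltas)

# toy ingredients: perturbed linear objective, one perturbed equality, quadratic attack cost
f = lambda x, d: (1.0 + d) * x[0]
H = lambda x, d: np.array([x[0] + d - 1.0])
C = lambda d: d * d
x = np.array([1.0])
print(v_gamma(x, [0.0, 0.5, 1.0], f, H, C, gamma=2.0))  # prints 3.0 (attained at d = 1.0)
```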
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.
• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:
\[
\underset{x\in S}{\text{minimize}}\;\;\varphi(x)\triangleq\theta(x)+v(x)
\quad\text{with}\quad
v(x)\triangleq\max_{1\le j\le J}\;\underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\text{maximum}}\;
\big[\, g_j(x,\delta)-h_j(x)^\top\delta\,\big].
\]
• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:
\[
\text{In-CVaR}_\alpha^\beta(Z)\triangleq
\frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)} z\,dF_Z(z)
=\frac{1}{\beta-\alpha}\Big[(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\Big]
\]

Buffered probability of exceedance/deceedance at level τ > 0:
\[
\text{bPOE}(Z,\tau)\triangleq 1-\min\big\{\alpha\in[0,1]\,:\,\text{CVaR}_\alpha(Z)\ge\tau\big\},
\qquad
\text{bPOD}(Z,u)\triangleq\min\big\{\beta\in[0,1]\,:\,\text{In-CVaR}_0^{\,\beta}(Z)\ge u\big\}
\]
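A quick empirical sanity check of the CVaR identity above, using plain upper-tail sample estimators (a sketch, not the talk's code):

```python
import numpy as np

def cvar(z, alpha):
    # empirical upper-tail CVaR: mean of the largest (1 - alpha) fraction of samples
    z = np.sort(z)
    k = int(np.ceil((1 - alpha) * len(z)))
    return z[-k:].mean()

def in_cvar(z, alpha, beta):
    # identity from the slide: [(1-a)*CVaR_a(Z) - (1-b)*CVaR_b(Z)] / (b - a)
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.random.default_rng(0).normal(size=200_000)
lo, hi = np.quantile(z, [0.8, 0.95])
direct = z[(z >= lo) & (z < hi)].mean()   # average of the samples between the two VaRs
assert abs(in_cvar(z, 0.8, 0.95) - direct) < 2e-2   # the two estimates agree
```

Both sides estimate the conditional mean of Z on the VaR interval, which for a standard normal at (α, β) = (0.8, 0.95) is about 1.18.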
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:
\[
\mathbb{P}(u\ge Z>\tau)\le\text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\triangleq\text{bPOI}_\tau^{\,u}(Z)
\]

Two risk minimization problems of a composite random functional:
\[
\underset{\theta\in\Theta}{\text{minimize}}\;
\underbrace{\text{In-CVaR}_\alpha^\beta\big(Z(\theta)\big)}_{\substack{\text{a difference of two convex}\\ \text{SP value functions}}}
\qquad\text{and}\qquad
\underset{\theta\in\Theta}{\text{minimize}}\;
\underbrace{\text{bPOI}_\tau^{\,u}\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with}\\ \text{a difference-of-convex objective}}}
\]
where
\[
Z(\theta)\triangleq c\Big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)-h(\bullet,z)\text{ is difference of convex}}\Big)
\]
with c : R → R being a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ
\[
\underbrace{\underset{\eta_1\in\Upsilon_1}{\text{minimum}}\;\mathbb{E}_\omega\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted } v_1(\theta)}
-\underbrace{\underset{\eta_2\in\Upsilon_2}{\text{minimum}}\;\mathbb{E}_\omega\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted } v_2(\theta)}
\]
Both functions v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All, What Can Be Computed?

[Diagram: hierarchy of stationarity concepts, sharpest to weakest]
loc-min (SONC necessary, SOSC sufficient) ⟹ directional stat (for dd fncs) ⟹ limiting stat ⟹ Clarke stat ⟹ critical (for dc fncs)
Relationship between the stationary points

• Stationarity:
  - smooth case: ∇V(x) = 0
  - nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail],
    e.g. ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable
• Directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder to compute.
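A one-line illustration of why directional stationarity is sharper than Clarke stationarity: for f(x) = −|x|, the point x = 0 is Clarke stationary (0 ∈ ∂f(0) = [−1, 1]) yet fails the directional test, since f′(0; d) = −|d| < 0 for every d ≠ 0. A numerical check:

```python
def f(x):
    return -abs(x)

def dir_deriv(f, x, d, t=1e-8):
    # one-sided directional derivative f'(x; d) approximated by a small forward step
    return (f(x + t * d) - f(x)) / t

# f'(0; d) = -|d| < 0 in every direction, so x = 0 is NOT directionally stationary
assert dir_deriv(f, 0.0, 1.0) < 0 and dir_deriv(f, 0.0, -1.0) < 0
```

A directionally stationary point would have to pass this test for all d, which is exactly why the concept is harder to achieve computationally.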
Computational Tools

• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions
• A fusion of first-order and second-order algorithms
  - Dealing with non-convexity: majorization-minimization algorithm
  - Dealing with non-smoothness: nonsmooth Newton algorithm
  - Dealing with pathological constraints: (exact) penalty method
• An integration of sampling and deterministic methods
  - Dealing with population optimization in statistics
  - Dealing with stochastic programming in operations research
  - Dealing with risk minimization in financial management and tail control
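The majorization-minimization (surrogation) idea in one dimension, on the dc function f(x) = x² − |x| (a sketch chosen for illustration, not an example from the talk):

```python
# Surrogation for the dc function f(x) = x^2 - |x|:
# linearizing the concave part -|x| at x_k (x_k != 0) gives the convex upper surrogate
#   g(x | x_k) = x^2 - sign(x_k) * x  >=  f(x),   with equality at x = x_k,
# whose unique minimizer is sign(x_k) / 2.
def f(x):
    return x * x - abs(x)

x = 0.3                      # any nonzero starting point
for _ in range(5):
    s = 1.0 if x > 0 else -1.0
    x = s / 2.0              # minimize the surrogate in closed form
assert x == 0.5 and f(x) == -0.25   # a directional stationary point (here a local min)
```

Each iteration minimizes a convex upper bound that touches f at the current iterate, so the objective never increases; this is the basic surrogation mechanism of the monograph's Chapter 7.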
Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Subanalytic and semianalytic

* and many more
[Diagram from Section 4.4 of the monograph (p. 231), summarizing the nonsmooth function classes and their inclusions: PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable (locally Lipschitz + directionally differentiable); PA ⊂ PLQ among the piecewise classes; dc, dc-composite, diff-max-convex, and implicit convex-concave (icc) functions; weakly convex, lower-C², and locally dc functions, with further inclusions holding under additional conditions such as (4.65) + (4.66), C¹ smoothness, or convexity of the (open or closed) domain.]
Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to publisher (Friday, September 11, 2020)
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683

Bibliography 691
Index 721
A
will not learn from this talk
Live happily ever after
B
C1 C2 C3
D1
Yes No
D1 Employ heuristics and potentially leading to heuristics on heuristics whencomplex models are being formulated
A
will not learn from this talk
Live happily ever after
B
C1 C2 C3
D1 D2
Yes No
D2 The hard road
Strive for rigor without sacrificing practice
advance fundamental knowledge and prepare for the future
Features
Non-Smooth Non-Convex Non-Deterministic+ +
Non-Problems
Non-Separable
Non-Stylistic Non-Interbreeding
and many more non-topics middot middot middot
Examples of Non-Problems
I Structured Learning
Piecewise Affine Regression
extending classical affine regression
minimizeaαbβ
1
N
Nsums=1
ys minus
max1leilek1
(ai)gtxs + αi
minus max
1leilek2
(bi)gtxs + βi
︸ ︷︷ ︸
capturing all piecewise affine functions
2
illustrating coupled non-convexity and non-differentiability
Estimate Change PointsHyperplanes
in Linear Regression
Comments
• This is a convex quadratic composed with a piecewise affine function, thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is
\[
\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^1_{1,i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2,i}(x)\Big)-\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^2_{1,i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2,i}(x)\Big),
\]
where all functions are convex.
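As a concrete illustration, the difference-of-max-affine model and its least-squares loss can be evaluated as follows (a minimal sketch in pure Python; the data, the component counts k1, k2, and all parameter values are hypothetical):

```python
# Evaluate the piecewise affine regression model
#   f(x) = max_i { a_i^T x + alpha_i } - max_i { b_i^T x + beta_i }
# and its mean-squared loss over a sample.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def max_affine(x, A, alpha):
    """max_i { a_i^T x + alpha_i }: a convex piecewise affine function."""
    return max(dot(a, x) + al for a, al in zip(A, alpha))

def pa_model(x, A, alpha, B, beta):
    """Difference of two max-affine pieces: captures every piecewise affine function."""
    return max_affine(x, A, alpha) - max_affine(x, B, beta)

def mse_loss(data, A, alpha, B, beta):
    """(1/N) * sum_s (y_s - f(x_s))^2, nonconvex and nondifferentiable in the parameters."""
    return sum((y - pa_model(x, A, alpha, B, beta)) ** 2 for x, y in data) / len(data)

# Hypothetical 1-D example: f(x) = max(x, -x) - 0 = |x|  (k1 = 2, k2 = 1)
A, alpha = [[1.0], [-1.0]], [0.0, 0.0]
B, beta = [[0.0]], [0.0]
data = [([-2.0], 2.0), ([1.0], 1.0), ([3.0], 3.0)]
print(mse_loss(data, A, alpha, B, beta))  # perfect fit: 0.0
```

Minimizing this loss over the parameters is where the coupled non-convexity and non-differentiability arise; the evaluation itself is elementary.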
Logical Sparsity Learning
\[
\mathop{\text{minimize}}_{\beta\in\mathbb{R}^d}\ f(\beta)\quad\text{subject to}\quad\sum_{j=1}^{d} a_{ij}\,|\beta_j|_0\ \le\ b_i,\quad i=1,\cdots,m,
\qquad\text{where}\quad |t|_0\ \triangleq\ \begin{cases} 1 & \text{if } t\neq 0\\ 0 & \text{otherwise}\end{cases}
\]
• strong/weak hierarchy, e.g., |βj|0 ≤ |βk|0;
• cost-based variable selection: coefficients ≥ 0 but not all equal;
• hierarchical/group variable selection.
Comments
• The ℓ0 function is discrete in nature; an ASC (affine sparsity constraint) can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind
\[
\mathop{\text{minimize}}_{\beta\in\mathbb{R}^d}\ f(\beta)+\gamma\sum_{i=1}^{m}\Big[\,\sum_{j=1}^{d} a_{ij}\,|\beta_j|_0-b_i\,\Big]_+\quad\text{for a given }\gamma>0.
\]
• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When |•|0 is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints.
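The hard constraint and one natural soft-penalty variant can be checked numerically (a minimal sketch; the constraint data and the penalty direction, which follows the hard constraint A|β|0 ≤ b, are illustrative assumptions):

```python
# Check an affine sparsity constraint (ASC) system  A |beta|_0 <= b  and compute
# a soft penalty  gamma * sum_i [ (A |beta|_0)_i - b_i ]_+  in pure Python.

def l0(t, tol=0.0):
    """|t|_0 : 1 if t is nonzero (up to tol), else 0."""
    return 1 if abs(t) > tol else 0

def asc_residuals(A, b, beta):
    """Per-row residuals (A sigma)_i - b_i with sigma = |beta|_0 componentwise."""
    sigma = [l0(bj) for bj in beta]
    return [sum(aij * sj for aij, sj in zip(row, sigma)) - bi
            for row, bi in zip(A, b)]

def asc_feasible(A, b, beta):
    return all(r <= 0 for r in asc_residuals(A, b, beta))

def soft_penalty(A, b, beta, gamma):
    """gamma * sum_i [residual_i]_+ : a soft surrogate of the hard ASC."""
    return gamma * sum(max(r, 0.0) for r in asc_residuals(A, b, beta))

# Hypothetical strong-hierarchy constraint |beta_1|_0 <= |beta_2|_0, i.e. sigma_1 - sigma_2 <= 0
A, b = [[1.0, -1.0]], [0.0]
print(asc_feasible(A, b, [0.0, 3.0]))              # True : beta_2 active, beta_1 not
print(asc_feasible(A, b, [2.0, 0.0]))              # False: beta_1 active without beta_2
print(soft_penalty(A, b, [2.0, 0.0], gamma=10.0))  # 10.0
```

Note the second row of A is negative here, which is exactly the situation where the ASC set can fail to be closed under continuous perturbations of β.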
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:
\[
\mathop{\text{minimize}}_{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}\ \frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x_s)\,\big]^2
\]
where fθ is the L-fold composite function with the nonsmooth max (ReLU) activation:
\[
f_\theta(x)\ =\ \underbrace{\big(W^L\bullet+w^L\big)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\bullet+w^2\big)}_{\text{second layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{input layer}}
\]
linear (output) ∘ ··· ∘ max ∘ linear ∘ max ∘ linear (input)
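The L-fold composition itself is easy to write down explicitly (a minimal sketch in pure Python; the weights, biases, and input below are hypothetical, chosen so the network has a closed-form answer to check against):

```python
# Forward pass of the L-fold ReLU composition
#   f_theta(x) = W^L ( ... max(0, W^2 max(0, W^1 x + w^1) + w^2) ... ) + w^L

def affine(W, w, x):
    """W x + w for a list-of-rows matrix W and bias vector w."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + wi for row, wi in zip(W, w)]

def relu(v):
    return [max(0.0, vi) for vi in v]

def f_theta(x, layers):
    """layers = [(W^1, w^1), ..., (W^L, w^L)]; ReLU on all but the output layer."""
    for W, w in layers[:-1]:
        x = relu(affine(W, w, x))
    W, w = layers[-1]
    return affine(W, w, x)  # output layer is purely linear

# Hypothetical 2-layer net computing max(0, x1 - x2) - max(0, x2 - x1) = x1 - x2
layers = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # hidden: ReLU of (x1-x2, x2-x1)
          ([[1.0, -1.0]], [0.0])]                    # output: difference of the two units
print(f_theta([3.0, 1.0], layers))  # [2.0]
```

Each hidden layer contributes one nonsmooth max composition, which is why the training objective is multi-composite, nonconvex, and nondifferentiable in θ.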
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:
\[
\mathop{\text{minimize}}_{z\in Z,\,u}\ \zeta_\rho(z,u)\ \triangleq\ \frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^{s,L}\big)+\rho\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Big|\,u^{s,\ell}_j-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}
\]
• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems
II Smart Planning
Power Systems Planning
ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity.
x: production from traditional thermal plants; y: production from backup fast-ramping generators.
\[
\mathop{\text{minimize}}_{x\in X}\ \varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big]
\]
where the recourse function is
\[
\psi(x,\omega_1,\omega_2)\ \triangleq\ \mathop{\text{minimum}}_{y}\ y^\top c\big([\,\omega_2-x-\omega_1\,]_+\big)\quad\text{subject to}\quad A(\omega)\,x+D\,y\ \ge\ \xi(\omega).
\]
The unit cost of the fast-ramping generators depends on:
• the observed renewable production and demand of electricity;
• the shortfall due to the first-stage thermal power decisions.
Comments
• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
\[
\mathop{\text{minimize}}_{x}\ \zeta(x)\ \triangleq\ \varphi(x)+\mathbb{E}_\xi\big[\psi(x,\xi)\big]\quad\text{subject to}\quad x\in X\subseteq\mathbb{R}^{n_1},
\]
where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:
\[
\psi(x,\xi)\ \triangleq\ \mathop{\text{minimum}}_{y}\ \big[f(\xi)+G(\xi)x\big]^\top y+\tfrac{1}{2}\,y^\top Qy\quad\text{subject to}\quad y\in Y(x,\xi)\ \triangleq\ \big\{\,y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\,\big\}.
\]
• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a regularization-convexification sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs
The underlying problem is: minimize cᵀx subject to Ax = b and x ≥ 0.
Defender's problem: anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving
\[
\mathop{\text{minimize}}_{x\ge 0}\ (c+\delta_c)^\top x+\rho\,\underbrace{\big\|\,\big[(A+\delta_A)x-b-\delta_b\big]_+\big\|_\infty}_{\text{constraint residual}}
\]
Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving
\[
\mathop{\text{maximum}}_{\delta\in\Delta}\ \lambda_1\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}}+\lambda_2\underbrace{\big\|\,\big[(A+\delta_A)x-b-\delta_b\big]_+\big\|_\infty}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}
\]
• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
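The penalized residual that both players' objectives share is straightforward to compute (a minimal sketch; the instance data c, A, b, the perturbations, and ρ below are all hypothetical):

```python
# Compute the defender's penalized objective
#   (c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf

def perturbed_objective(c, dc, x):
    return sum((ci + dci) * xi for ci, dci, xi in zip(c, dc, x))

def residual_inf_norm(A, dA, b, db, x):
    """|| [ (A + dA) x - b - db ]_+ ||_inf : worst one-sided constraint violation, 0 if none."""
    viols = []
    for row, drow, bi, dbi in zip(A, dA, b, db):
        Ax = sum((aij + daij) * xj for aij, daij, xj in zip(row, drow, x))
        viols.append(max(Ax - bi - dbi, 0.0))
    return max(viols)

def defender_value(c, dc, A, dA, b, db, x, rho):
    return perturbed_objective(c, dc, x) + rho * residual_inf_norm(A, dA, b, db, x)

# Hypothetical instance: attacker raises the cost of x1 (dc) but leaves A, b alone
c, dc = [1.0, 1.0], [0.5, 0.0]
A, dA = [[1.0, 1.0]], [[0.0, 0.0]]
b, db = [2.0], [0.0]
x = [2.0, 1.0]                                    # violates the constraint by 1
print(defender_value(c, dc, A, dA, b, db, x, rho=10.0))  # 4.0 + 10*1.0 = 14.0
```

The attacker's objective reuses the same two ingredients with weights λ1, λ2 and subtracts the attack cost C(δ).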
Robustification of nonconvex problems
A general nonlinear program:
\[
\mathop{\text{minimize}}_{x\in S}\ \theta(x)+f(x)\quad\text{subject to}\quad H(x)=0,
\]
where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H : O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.
Robust counterpart with separate perturbations:
\[
\mathop{\text{minimize}}_{x\in S}\ \theta(x)+\mathop{\text{maximum}}_{\delta\in\Delta}\ f(x,\delta)\quad\text{subject to}\quad\underbrace{H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}
\]
New formulation
via a pessimistic value-function minimization: given γ > 0,
\[
\mathop{\text{minimize}}_{x\in S}\ \theta(x)+v_\gamma(x),
\]
where vγ(x) models constraint violation with a cost of model attacks:
\[
v_\gamma(x)\ \triangleq\ \mathop{\text{maximum}}_{\delta\in\Delta}\ \varphi_\gamma(x,\delta)\ \triangleq\ f(x,\delta)+\gamma\,\underbrace{\|H(x,\delta)\|_\infty}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}}
\]
with the novel features:
• combined perturbation δ in a joint uncertainty set ∆ ⊆ ℝᴸ;
• constraint satisfaction robustified via penalization.
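When the uncertainty set is finite, the pessimistic value function is a plain maximum and can be evaluated directly (a minimal sketch; the functions f, H, C and the set ∆ below are hypothetical stand-ins for the abstract data of the formulation):

```python
# Evaluate the pessimistic value function
#   v_gamma(x) = max_{delta in Delta} f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)
# over a finite uncertainty set.

def v_gamma(x, Delta, f, H, C, gamma):
    """Worst case over Delta of objective + penalized constraint residual - attack cost."""
    return max(f(x, d) + gamma * max(abs(h) for h in H(x, d)) - C(d) for d in Delta)

# Hypothetical instance in R^1 with a single perturbed equality constraint x = delta
f = lambda x, d: x * (1.0 + d)   # perturbed objective
H = lambda x, d: [x - d]         # perturbed equality constraint residual
C = lambda d: 0.5 * abs(d)       # cost the attacker pays for using delta
Delta = [-1.0, 0.0, 1.0]
print(v_gamma(2.0, Delta, f, H, C, gamma=3.0))  # worst delta = -1: 0 + 3*3 - 0.5 = 8.5
```

Note how the residual is penalized rather than enforced for every δ, which is what keeps the robustified problem from becoming trivially infeasible.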
Comments
• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.
• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:
\[
\mathop{\text{minimize}}_{x\in S}\ \varphi(x)\ \triangleq\ \theta(x)+v(x)\quad\text{with}\quad v(x)\ \triangleq\ \max_{1\le j\le J}\ \mathop{\text{maximum}}_{\delta\in\Delta\cap\mathbb{R}^L_+}\ \big[\,g_j(x,\delta)-h_j(x)^\top\delta\,\big].
\]
• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the “non”-framework.
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.
Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:
\[
\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z)\ \triangleq\ \frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)} z\,dF_Z(z)\ =\ \frac{1}{\beta-\alpha}\Big[(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\Big]
\]
Buffered probability of exceedance/deceedance at level τ > 0:
\[
\text{bPOE}(Z;\tau)\ \triangleq\ 1-\min\big\{\alpha\in[0,1]:\text{CVaR}_\alpha(Z)\ge\tau\big\},\qquad
\text{bPOD}(Z;u)\ \triangleq\ \min\big\{\beta\in[0,1]:\text{In-CVaR}^{\,\beta}_{\,0}(Z)\ge u\big\}
\]
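The two-CVaR identity for In-CVaR can be checked on a finite sample (a minimal sketch; the sample and the levels are hypothetical, and the levels are chosen so that (1 − level)·n is an integer, which keeps the empirical CVaR a simple tail average):

```python
# Empirical CVaR and Interval-CVaR of a finite sample, using the identity
#   In-CVaR_[a,b](Z) = ( (1-a)*CVaR_a(Z) - (1-b)*CVaR_b(Z) ) / (b - a)

def cvar(sample, alpha):
    """Mean of the largest (1 - alpha) fraction of the sample (assumed integral tail size)."""
    n = len(sample)
    k = round((1.0 - alpha) * n)
    tail = sorted(sample, reverse=True)[:k]
    return sum(tail) / k

def in_cvar(sample, alpha, beta):
    """Interval CVaR between levels alpha < beta via the two-CVaR identity."""
    return ((1.0 - alpha) * cvar(sample, alpha)
            - (1.0 - beta) * cvar(sample, beta)) / (beta - alpha)

Z = [1.0, 2.0, 3.0, 4.0]
print(cvar(Z, 0.5))           # mean of the top half {4, 3}: 3.5
print(in_cvar(Z, 0.5, 0.75))  # averages the slice between the two tails: 3.0
```

Here in_cvar(Z, 0.5, 0.75) = 3.0 is exactly the average of the sample values lying between VaR at the two levels, matching the integral definition.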
Buffered probability of the interval (τ, u]
We have the following bound for an interval probability:
\[
\mathbb{P}(u\ge Z>\tau)\ \le\ \text{bPOE}(Z;\tau)+\text{bPOD}(Z;u)-1\ \triangleq\ \text{bPOI}^{\,u}_{\,\tau}(Z).
\]
Two risk minimization problems of a composite random functional:
\[
\mathop{\text{minimize}}_{\theta\in\Theta}\ \underbrace{\text{In-CVaR}^{\,\beta}_{\,\alpha}\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}}
\qquad\text{and}\qquad
\mathop{\text{minimize}}_{\theta\in\Theta}\ \underbrace{\text{bPOI}^{\,u}_{\,\tau}\big(Z(\theta)\big)}_{\text{a stochastic program with a difference-of-convex objective}}
\]
where
\[
Z(\theta)\ \triangleq\ c\big(\underbrace{g(\theta;Z)-h(\theta;Z)}_{g(\bullet,z)\,-\,h(\bullet,z)\ \text{is a difference of convex}}\big)
\]
with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
Comments
• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ
\[
\underbrace{\mathop{\text{minimum}}_{\eta_1\in\Upsilon_1}\ \mathbb{E}_\omega\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}\ -\ \underbrace{\mathop{\text{minimum}}_{\eta_2\in\Upsilon_2}\ \mathbb{E}_\omega\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}
\]
Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of all: what can be computed?

[Diagram, "Relationship between the stationary points": nested stationarity concepts, from weakest to sharpest: critical (for dc functions) ⊇ Clarke stationary ⊇ limiting stationary ⊇ directional stationary (for dd functions) ⊇ SONC ⊇ loc-min ⊇ SOSC.]

• Stationarity:
  - smooth case: ∇V(x) = 0;
  - nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable.
• Directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder it is to compute.
Computational Tools
• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
• A fusion of first-order and second-order algorithms:
  - dealing with non-convexity: majorization-minimization algorithm;
  - dealing with non-smoothness: nonsmooth Newton algorithm;
  - dealing with pathological constraints: (exact) penalty method.
• An integration of sampling and deterministic methods:
  - dealing with population optimization in statistics;
  - dealing with stochastic programming in operations research;
  - dealing with risk minimization in financial management and tail control.
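As a toy illustration of the majorization-minimization idea on a difference-of-convex objective (a minimal sketch; the function and starting points are hypothetical, not an algorithm from the talk): to minimize V(x) = g(x) − h(x) with g, h convex, each step linearizes h at the current iterate and minimizes the resulting convex majorant.

```python
# Majorization-minimization (the DC algorithm) on V(x) = 0.5*x^2 - |x|, whose
# critical points are x = -1, 0, +1. Linearizing h(x) = |x| at x_k gives the
# convex majorant 0.5*x^2 - s_k*x (up to a constant), with s_k a subgradient
# of |.| at x_k; that majorant is minimized in closed form at x = s_k.

def dca(x0, iters=20):
    x = x0
    for _ in range(iters):
        s = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # subgradient of |x| at x
        x = s                                           # argmin of 0.5*x^2 - s*x
    return x

print(dca(0.3))   # 1.0 : converges to the critical point x = 1
print(dca(-2.0))  # -1.0
```

The example also shows the limitation the talk emphasizes: from x0 = 0 the iteration stays at the (non-minimizing) critical point 0, so sharper stationarity concepts matter.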
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• piecewise smooth*, e.g., piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• sub-analytic and semi-analytic.
* and many more
[Venn diagram from Section 4.4 of the monograph, p. 231, summarizing the relationships among the nonsmooth function classes: the chain PC² ⊆ PC¹ ⊆ semismooth ⊆ B-differentiable; the classes PA, PLQ, PQ, dc, dc-composite, difference-of-max-convex, and implicit convex-concave (icc) functions; strongly/weakly convex and locally dc functions; and C¹, LC¹, and lower-C² functions; all within the class of locally Lipschitz plus directionally differentiable functions.]
A
will not learn from this talk
Live happily ever after
B
C1 C2 C3
D1 D2
Yes No
D2 The hard road
Strive for rigor without sacrificing practice
advance fundamental knowledge and prepare for the future
Features
Non-Smooth Non-Convex Non-Deterministic+ +
Non-Problems
Non-Separable
Non-Stylistic Non-Interbreeding
and many more non-topics middot middot middot
Examples of Non-Problems
I Structured Learning
Piecewise Affine Regression
extending classical affine regression
minimizeaαbβ
1
N
Nsums=1
ys minus
max1leilek1
(ai)gtxs + αi
minus max
1leilek2
(bi)gtxs + βi
︸ ︷︷ ︸
capturing all piecewise affine functions
2
illustrating coupled non-convexity and non-differentiability
Estimate Change PointsHyperplanes
in Linear Regression
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Logical Sparsity Learning
minimizeβ isinRd
f(β)
subject todsumj=1
aij |βj |0 le bi i = 1 middot middot middot m
where | t |0
1 if t 6= 0
0 otherwise
bull strongweak hierarchy eg |βj |0 le |βk |0
bull cost-based variable selection coefficients ge 0 but not all equal
bull hierarchicalgroup variable selection
Comments

• The $\ell_0$ function is discrete in nature; an ASC (affine sparsity constraint) can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when $A$ is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind
\[
\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\ f(\beta)+\gamma\sum_{i=1}^{m}\Big[\,b_i-\sum_{j=1}^{d}a_{ij}\,|\beta_j|_0\,\Big]_+
\]
for a given $\gamma>0$.
• When the matrix $A$ has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When $|\bullet|_0$ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
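For concreteness, a minimal sketch (my illustration, not from the talk) of one folded concave approximation of $|t|_0$, the capped-$\ell_1$ function $\min(|t|/\varepsilon,1)$, together with its difference-of-convex decomposition:

```python
def l0(t):
    # The discrete "norm" |t|_0: 1 if t is nonzero, 0 otherwise.
    return 0.0 if t == 0 else 1.0

def capped_l1(t, eps):
    # Folded concave surrogate of |t|_0: tightens to |t|_0 as eps -> 0.
    return min(abs(t) / eps, 1.0)

def capped_l1_dc(t, eps):
    # Same value, written explicitly as a difference of two convex
    # functions: |t|/eps minus max(|t|/eps - 1, 0).
    return abs(t) / eps - max(abs(t) / eps - 1.0, 0.0)
```

The second form is what turns the approximated constraint into a difference-of-convex constraint, as noted in the last bullet above.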
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:
\[
\underset{\theta=(W^\ell,\,w^\ell)_{\ell=1}^{L}}{\text{minimize}}\ \frac{1}{S}\sum_{s=1}^{S}\big[\,y_s-f_\theta(x^s)\,\big]^2,
\]
where $f_\theta$ is the $L$-fold composite function with the nonsmooth max (ReLU) activation:
\[
f_\theta(x)=\underbrace{\big(W^L\bullet+w^L\big)}_{\text{output layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\bullet+w^2\big)}_{\text{second layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{input layer}};
\]
that is: linear output ← ··· ← max ∘ linear ← max ∘ linear ← input.
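A minimal sketch of evaluating the $L$-fold composite $f_\theta$ (illustration only; the function and variable names are mine):

```python
import numpy as np

def f_theta(x, weights, biases):
    # Evaluate the L-fold composite: max(0, .) (ReLU) on every hidden
    # layer, then a purely affine output layer, matching the slide.
    for W, w in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, W @ x + w)
    return weights[-1] @ x + biases[-1]

# Tiny 2 -> 1 -> 1 network: the hidden unit computes max(0, x1 - x2).
weights = [np.array([[1.0, -1.0]]), np.array([[2.0]])]
biases = [np.array([0.0]), np.array([1.0])]
out = f_theta(np.array([2.0, 3.0]), weights, biases)  # hidden = 0, out = 1.0
```

The kinks of `np.maximum` are exactly where the nondifferentiability of the least-squares objective arises.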
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the $\ell_1$ penalization:
\[
\underset{z\in Z,\,u}{\text{minimize}}\ \zeta_\rho(z,u)\triangleq\frac{1}{S}\sum_{s=1}^{S}\varphi_s\big(u^{s,L}\big)+\rho\sum_{s=1}^{S}\sum_{\ell=1}^{L}\sum_{j=1}^{N}\underbrace{\Big|\,u^{s,\ell}_j-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\Big|}_{\text{penalizing the components}}.
\]
• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems II: Smart Planning
Power Systems Planning

$\omega_1$: uncertain renewable energy generation; $\omega_2$: uncertain demand of electricity
$x$: production from traditional thermal plants
$y$: production from backup fast-ramping generators
\[
\underset{x\in X}{\text{minimize}}\ \varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big],
\]
where the recourse function is
\[
\psi(x,\omega_1,\omega_2)\triangleq\underset{y}{\text{minimum}}\ y^\top c\big([\,\omega_2-x-\omega_1\,]_+\big)\quad\text{subject to}\ A(\omega)x+Dy\ge\xi(\omega).
\]
The unit cost of the fast-ramping generators depends on:
▸ the observed renewable production & demand of electricity;
▸ the shortfall due to the first-stage thermal power decisions.
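As a toy version of this two-stage model (my simplification: a constant back-up unit cost instead of the shortfall-dependent cost $c(\cdot)$, and no side constraints), a sample-average approximation of the objective:

```python
def recourse(x, renewable, demand, unit_cost=5.0):
    # Second stage: back-up fast-ramping generation covers the shortfall
    # [demand - x - renewable]_+ ; here at a constant unit cost.
    shortfall = max(demand - x - renewable, 0.0)
    return unit_cost * shortfall

def saa_objective(x, scenarios, thermal_cost=1.0):
    # Sample-average approximation of phi(x) + E[psi(x, w1, w2)] over
    # scenarios (w1 = renewable output, w2 = demand).
    avg_recourse = sum(recourse(x, w1, w2) for (w1, w2) in scenarios) / len(scenarios)
    return thermal_cost * x + avg_recourse
```

With scenarios `[(1.0, 4.0), (2.0, 3.0)]` and `x = 2.0`, the shortfalls are `1.0` and `0.0`, giving an objective of `2.0 + 2.5 = 4.5`.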
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
\[
\underset{x}{\text{minimize}}\ \zeta(x)\triangleq\varphi(x)+\mathbb{E}_\xi\big[\psi(x,\xi)\big]\quad\text{subject to}\ x\in X\subseteq\mathbb{R}^{n_1},
\]
where the recourse function $\psi(x,\xi)$ is the optimal objective value of the quadratic program (QP), with $Q$ symmetric positive semidefinite:
\[
\psi(x,\xi)\triangleq\underset{y}{\text{minimum}}\ \big[f(\xi)+G(\xi)x\big]^\top y+\tfrac12\,y^\top Qy\quad\text{subject to}\ y\in Y(x,\xi)\triangleq\big\{\,y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\,\big\}.
\]
• The recourse function $\psi(\bullet,\xi)$ is of the implicit convex-concave kind.
• Developed a regularization-convexification sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize $c^\top x$ subject to $Ax=b$ and $x\ge 0$.

Defender's problem: anticipating the attacker's disruption $\delta\in\Delta$, the defender undertakes the operation by solving
\[
\underset{x\ge 0}{\text{minimize}}\ (c+\delta_c)^\top x+\rho\,\underbrace{\Big\|\big[(A+\delta_A)x-b-\delta_b\big]_+\Big\|_\infty}_{\text{constraint residual}}.
\]
Attacker's problem: anticipating the defender's activity $x$, the attacker aims to disrupt the defender's operations by solving
\[
\underset{\delta\in\Delta}{\text{maximum}}\ \lambda_1\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}}+\lambda_2\underbrace{\Big\|\big[(A+\delta_A)x-b-\delta_b\big]_+\Big\|_\infty}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}.
\]
• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
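The defender's objective is directly computable for given data; a sketch (illustration only; argument names are mine):

```python
import numpy as np

def defender_objective(x, c, A, b, dc, dA, db, rho):
    # (c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf :
    # disrupted cost plus the penalized residual of the disrupted constraints.
    residual = np.maximum((A + dA) @ x - b - db, 0.0)
    return float((c + dc) @ x + rho * residual.max())

c = np.array([1.0, 1.0]); A = np.array([[1.0, 1.0]]); b = np.array([3.0])
zero_attack = (np.zeros(2), np.zeros((1, 2)), np.zeros(1))
val = defender_objective(np.array([2.0, 2.0]), c, A, b, *zero_attack, rho=10.0)
# residual = [1.0], so val = 4.0 + 10.0 * 1.0 = 14.0
```

The attacker's objective reuses the same two pieces, weighted by $\lambda_1,\lambda_2$ and net of the attack cost $C(\delta)$.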
Robustification of nonconvex problems

A general nonlinear program:
\[
\underset{x\in S}{\text{minimize}}\ \theta(x)+f(x)\quad\text{subject to}\ H(x)=0,
\]
where
• $S$ is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• $\theta$ and $f$ are real-valued functions defined on an open convex set $\mathcal{O}$ in $\mathbb{R}^n$ containing $S$;
• $H:\mathcal{O}\to\mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints $g(x)\le 0$ and equality constraints $h(x)=0$.

Robust counterpart with separate perturbations:
\[
\underset{x\in S}{\text{minimize}}\ \theta(x)+\underset{\delta\in\Delta}{\text{maximum}}\ f(x,\delta)\quad\text{subject to}\ \underbrace{H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\text{easily infeasible in the current way of modeling}}.
\]
New formulation

via a pessimistic value-function minimization: given $\gamma>0$,
\[
\underset{x\in S}{\text{minimize}}\ \theta(x)+v_\gamma(x),
\]
where $v_\gamma(x)$ models constraint violation with a cost of model attacks:
\[
v_\gamma(x)\triangleq\underset{\delta\in\Delta}{\text{maximum}}\ \varphi_\gamma(x,\delta)\triangleq f(x,\delta)+\gamma\underbrace{\|H(x,\delta)\|_\infty}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}},
\]
with the novel features:
• combined perturbation $\delta$ in a joint uncertainty set $\Delta\subseteq\mathbb{R}^L$;
• robustified constraint satisfaction via penalization.
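When $\Delta$ is a finite set, $v_\gamma$ can be evaluated by direct enumeration; a sketch (all names and the toy data are mine):

```python
def v_gamma(x, deltas, f, h_norm, cost, gamma):
    # Pessimistic value function over a finite uncertainty set:
    #   max over delta of f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta),
    # where h_norm(x, delta) returns the residual norm ||H(x, delta)||_inf.
    return max(f(x, d) + gamma * h_norm(x, d) - cost(d) for d in deltas)

# Toy data: f(x,d) = x + d, ||H(x,d)||_inf = |x - d|, C(d) = d, gamma = 2.
val = v_gamma(0.0, [0.0, 1.0],
              f=lambda x, d: x + d,
              h_norm=lambda x, d: abs(x - d),
              cost=lambda d: d,
              gamma=2.0)
# d = 0 gives 0.0; d = 1 gives 1 + 2 - 1 = 2.0; so val = 2.0
```

The slide's explicit-form results are what make the inner maximization tractable beyond such finite enumeration.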
Comments

• Related the VFOP (value-function optimization problem) to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of $f(\bullet,\delta)$ and $H(\bullet,\delta)$, a linear structure of $C(\delta)$ with a set-up component, and a special simplex-type uncertainty set $\Delta$.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:
\[
\underset{x\in S}{\text{minimize}}\ \varphi(x)\triangleq\theta(x)+v(x),\quad\text{with}\quad v(x)\triangleq\max_{1\le j\le J}\ \underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\text{maximum}}\ \big[\,g_j(x,\delta)-h_j(x)^\top\delta\,\big].
\]
• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems III: Operations Research meets Statistics
Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left and right tails of a distribution simultaneously.

Interval Conditional Value-at-Risk of a random variable $Z$ at levels $0\le\alpha<\beta\le 1$, treating two-sided estimation errors:
\[
\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z)\triangleq\frac{1}{\beta-\alpha}\int_{\text{VaR}_\alpha(Z)\,\le\, z\,<\,\text{VaR}_\beta(Z)} z\,dF_Z(z)=\frac{1}{\beta-\alpha}\Big[(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\Big].
\]
Buffered probability of exceedance/deceedance at level $\tau>0$:
\[
\text{bPOE}(Z,\tau)\triangleq 1-\min\big\{\alpha\in[0,1]\,:\,\text{CVaR}_\alpha(Z)\ge\tau\big\},\qquad
\text{bPOD}(Z,u)\triangleq\min\big\{\beta\in[0,1]\,:\,\text{In-CVaR}^{\,\beta}_{\,0}(Z)\ge u\big\}.
\]
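A crude sample version of these quantities (illustration only: it ignores the fractional tail atom of discrete-sample CVaR and scans $\alpha$ on a grid rather than solving the minimization exactly):

```python
import numpy as np

def cvar(samples, alpha):
    # Rough sample CVaR_alpha: mean of the worst (1 - alpha) fraction of
    # the sorted sample (the fractional atom at the VaR is ignored).
    z = np.sort(np.asarray(samples, dtype=float))
    k = int(np.ceil(alpha * len(z)))
    tail = z[k:] if k < len(z) else z[-1:]
    return float(tail.mean())

def bpoe(samples, tau, grid=np.linspace(0.0, 1.0, 101)):
    # bPOE(Z, tau) = 1 - min{ alpha : CVaR_alpha(Z) >= tau }, per the
    # definition above, approximated by scanning alpha over a grid.
    feasible = [a for a in grid if cvar(samples, a) >= tau]
    return 1.0 - min(feasible) if feasible else 0.0
```

For the sample `[1, 2, 3, 4]`, `cvar(., 0.5)` averages the worst half, giving `3.5`, and `bpoe(., 0.0)` is `1.0` since even $\alpha=0$ is feasible.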
Buffered probability of the interval $(\tau,u]$

We have the following bound for an interval probability:
\[
\mathbb{P}(u\ge Z>\tau)\ \le\ \text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\ \triangleq\ \text{bPOI}^{\,u}_{\,\tau}(Z).
\]
Two risk minimization problems of a composite random functional:
\[
\underset{\theta\in\Theta}{\text{minimize}}\ \underbrace{\text{In-CVaR}^{\,\beta}_{\,\alpha}\big(Z(\theta)\big)}_{\text{a difference of two convex SP value functions}}
\qquad\text{and}\qquad
\underset{\theta\in\Theta}{\text{minimize}}\ \underbrace{\text{bPOI}^{\,u}_{\,\tau}\big(Z(\theta)\big)}_{\text{a stochastic program with a difference-of-convex objective}},
\]
where
\[
Z(\theta)\triangleq c\big(\underbrace{g(\theta;Z)-h(\theta;Z)}_{g(\bullet;z)\,-\,h(\bullet;z)\ \text{is a difference of convex functions}}\big),
\]
with $c:\mathbb{R}\to\mathbb{R}$ a univariate piecewise affine function, $Z$ an $m$-dimensional random vector, and, for each $z$, $g(\bullet;z)$ and $h(\bullet;z)$ both convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over $\theta\in\Theta$
\[
\underbrace{\underset{\eta_1\in\Upsilon_1}{\text{minimum}}\ \mathbb{E}_\omega\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted }v_1(\theta)}\;-\;\underbrace{\underset{\eta_2\in\Upsilon_2}{\text{minimum}}\ \mathbb{E}_\omega\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted }v_2(\theta)}.
\]
Both $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: the hierarchy of stationarity concepts — critical points (for dc functions), Clarke stationary points, limiting stationary points, and directional stationary points (for dd functions) — and their relationship to the second-order necessary condition (SONC), local minima, and the second-order sufficient condition (SOSC).]

Relationship between the stationary points:
▸ Stationarity
– smooth case: $\nabla V(x)=0$;
– nonsmooth case: $0\in\partial V(x)$ [subdifferential; many calculus rules fail], e.g. $\partial V_1(x)+\partial V_2(x)\ne\partial(V_1+V_2)(x)$ unless one of the functions is continuously differentiable.
▸ Directional stationarity: $V'(x;d)\ge 0$ for all directions $d$.
• The sharper the concept, the harder it is to compute.
Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
▸ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: majorization-minimization algorithms;
– dealing with non-smoothness: nonsmooth Newton algorithms;
– dealing with pathological constraints: (exact) penalty methods.
▸ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics;
– dealing with stochastic programming in operations research;
– dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:
• Piecewise smooth*, e.g. piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic
* and many more
[Diagram from Section 4.4 "Classes of Nonsmooth Functions" of the monograph (p. 231): inclusion relationships among the classes — piecewise affine (PA) ⊂ piecewise linear-quadratic (PLQ) ⊂ piecewise quadratic (PQ); PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable; together with dc, dc-composite, diff-max-convex, implicit convex-concave (icc), (strongly) weakly convex, locally dc, lower-C², and locally Lipschitz + directionally differentiable functions, related under various C¹, openness, and convex-domain conditions.]
Completed Monograph

Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683
Bibliography 691
Index 721
Features

Non-Smooth + Non-Convex + Non-Deterministic + Non-Separable = Non-Problems
Non-Stylistic, Non-Interbreeding, and many more non-topics ···

Examples of Non-Problems I: Structured Learning
Piecewise Affine Regression

extending classical affine regression:
\[
\underset{a,\alpha,b,\beta}{\text{minimize}}\ \frac{1}{N}\sum_{s=1}^{N}\Big[\,y_s-\Big(\underbrace{\max_{1\le i\le k_1}\big\{(a^i)^\top x^s+\alpha_i\big\}-\max_{1\le i\le k_2}\big\{(b^i)^\top x^s+\beta_i\big\}}_{\text{capturing all piecewise affine functions}}\Big)\Big]^2,
\]
illustrating coupled non-convexity and non-differentiability.

Estimate change points/hyperplanes in linear regression.
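The model inside the least-squares loss can be sketched directly (illustration only; function names and the toy data are mine):

```python
import numpy as np

def pa_model(x, a, alpha, b, beta):
    # Difference of two max-affine functions: every piecewise affine
    # function on R^n can be written in this form.
    return np.max(a @ x + alpha) - np.max(b @ x + beta)

def fit_loss(X, y, a, alpha, b, beta):
    # The (nonconvex, nonsmooth) least-squares objective of the slide.
    preds = np.array([pa_model(x, a, alpha, b, beta) for x in X])
    return float(np.mean((y - preds) ** 2))

# With a = [[1], [-1]], alpha = 0, and a trivial second max, the model is |x|.
a = np.array([[1.0], [-1.0]]); alpha = np.zeros(2)
b = np.array([[0.0]]); beta = np.zeros(1)
val = fit_loss([np.array([2.0]), np.array([-3.0])], np.array([2.0, 3.0]),
               a, alpha, b, beta)  # exact fit: loss = 0.0
```

The coupling of the squared loss (convex quadratic) with the inner difference of maxima is exactly the piecewise linear-quadratic structure analyzed in the Comments above.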
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Logical Sparsity Learning
minimizeβ isinRd
f(β)
subject todsumj=1
aij |βj |0 le bi i = 1 middot middot middot m
where | t |0
1 if t 6= 0
0 otherwise
bull strongweak hierarchy eg |βj |0 le |βk |0
bull cost-based variable selection coefficients ge 0 but not all equal
bull hierarchicalgroup variable selection
Comments
bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative
bull Soft formulation of ASCs is possible leading to an objective of the kind
minimizeβ isinRd
f(β) + γ
msumi=1
bi minus dsumj=1
aij |βj |0
+
for a given γ gt 0
bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective
bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints
Comments
bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative
bull Soft formulation of ASCs is possible leading to an objective of the kind
minimizeβ isinRd
f(β) + γ
msumi=1
bi minus dsumj=1
aij |βj |0
+
for a given γ gt 0
bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective
bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems
minimizeθ=
(W `w`)
L
`=1
1
S
Ssums=1
[ys minus fθ(xs)
]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation
fθ(x) =(WL bull+wL
)︸ ︷︷ ︸Output Layer
middot middot middot max(
0W 2 bull+w2)
︸ ︷︷ ︸Second Layer
max(
0W 1x+ w1)
︸ ︷︷ ︸Input Layer
linearoutput middot middot middot max linear max linear input
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρ
Lsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Examples of Non-Problems
II. Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

\[
\underset{x \in X}{\text{minimize}}\ \ \varphi(x) + \mathbb{E}_{\omega_1, \omega_2}\big[ \psi(x, \omega_1, \omega_2) \big],
\]

where the recourse function is

\[
\psi(x, \omega_1, \omega_2) \;\triangleq\; \underset{y}{\text{minimum}}\ \ y^\top c\big( [\,\omega_2 - x - \omega_1\,]_+ \big) \quad \text{subject to } A(\omega)\, x + D\, y \ge \xi(\omega).
\]

The unit cost of the fast-ramping generators depends on
– the observed renewable production and demand of electricity;
– the shortfall due to the first-stage thermal power decisions.
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

\[
\underset{x}{\text{minimize}}\ \ \zeta(x) \;\triangleq\; \varphi(x) + \mathbb{E}_\xi\big[ \psi(x, \xi) \big] \quad \text{subject to } x \in X \subseteq \mathbb{R}^{n_1},
\]

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP) below, with Q symmetric positive semidefinite:

\[
\psi(x, \xi) \;\triangleq\; \underset{y}{\text{minimum}}\ \ \big[ f(\xi) + G(\xi)\, x \big]^\top y + \tfrac{1}{2}\, y^\top Q\, y \quad \text{subject to } y \in Y(x, \xi) \;\triangleq\; \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi)\, x + D\, y \ge b(\xi) \,\big\}.
\]

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a regularization-convexification method combined with sampling for computing a generalized critical point, with an almost-sure convergence guarantee.
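To make the bi-parameterization concrete, here is a one-dimensional toy sketch of the sample-average objective ζ(x) = ϕ(x) + E[ψ(x, ξ)]. Everything is hypothetical: Q is taken as 0 so the recourse QP degenerates to an LP with a closed-form minimizer, and a scenario ξ is just a tuple (f, g, a, b) with f + g·x ≥ 0 assumed.

```python
def recourse(x, xi):
    """1-D illustration of the bi-parameterized recourse (Q = 0):
    psi(x, xi) = min_y { (f + g*x) * y : y >= b - a*x, y >= 0 }.
    With f + g*x >= 0 assumed, the minimizer is y* = max(0, b - a*x).
    Note x enters BOTH the cost (the 'G(xi) x' term) and the constraint."""
    f, g, a, b = xi
    cost = f + g * x
    y_star = max(0.0, b - a * x)
    return cost * y_star

def saa_objective(x, phi, scenarios):
    # Sample-average approximation of zeta(x) = phi(x) + E[psi(x, xi)]
    return phi(x) + sum(recourse(x, xi) for xi in scenarios) / len(scenarios)
```

With g ≠ 0 the recourse value is a product of two affine pieces of x, which is exactly where the implicit convex-concave (nonconvex) structure enters.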
(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

\[
\underset{x \ge 0}{\text{minimize}}\ \ (c + \delta_c)^\top x + \rho\, \underbrace{\Big\| \big[\, (A + \delta_A)\, x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{constraint residual}}.
\]

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

\[
\underset{\delta \in \Delta}{\text{maximum}}\ \ \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2 \underbrace{\Big\| \big[\, (A + \delta_A)\, x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}}.
\]

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• "Adversarial Programming."
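The defender's residual-penalized objective is straightforward to evaluate for a fixed anticipated attack; a minimal dense-data sketch (helper names are hypothetical):

```python
def pos(v):
    # componentwise positive part [.]_+
    return [max(0.0, t) for t in v]

def defender_objective(x, c, A, b, dc, dA, db, rho):
    """Defender's penalized objective under an anticipated attack
    (delta_c, delta_A, delta_b): perturbed linear cost plus rho times the
    l-infinity norm of [(A + dA) x - b - db]_+ (the constraint residual)."""
    cost = sum((ci + dci) * xi for ci, dci, xi in zip(c, dc, x))
    resid = [sum((A[i][j] + dA[i][j]) * x[j] for j in range(len(x)))
             - b[i] - db[i] for i in range(len(b))]
    return cost + rho * max(pos(resid))
```

The attacker's objective reuses the same two ingredients with weights λ1, λ2 and subtracts the attack cost C(δ), so a single residual routine serves both players.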
Robustification of nonconvex problems

A general nonlinear program:

\[
\underset{x \in S}{\text{minimize}}\ \ \theta(x) + f(x) \quad \text{subject to } H(x) = 0,
\]

where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H : O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

\[
\underset{x \in S}{\text{minimize}}\ \ \theta(x) + \underset{\delta \in \Delta}{\text{maximum}}\ f(x, \delta) \quad \underbrace{\text{subject to } H(x, \delta) = 0 \ \ \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}.
\]
New formulation

via a pessimistic value-function minimization: given γ > 0,

\[
\underset{x \in S}{\text{minimize}}\ \ \theta(x) + v_\gamma(x),
\]

where v_γ(x) models constraint violation together with the cost of model attacks:

\[
v_\gamma(x) \;\triangleq\; \underset{\delta \in \Delta}{\text{maximum}}\ \ \varphi_\gamma(x, \delta) \;\triangleq\; f(x, \delta) + \gamma\, \underbrace{\| H(x, \delta) \|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}},
\]

with the novel features:
• a combined perturbation δ in a joint uncertainty set ∆ ⊆ ℝ^L;
• constraint satisfaction robustified via penalization.
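For a finite uncertainty set, v_γ can be evaluated by direct enumeration over ∆; a minimal sketch in which the functions f, H, C are illustrative placeholders, not the paper's:

```python
def v_gamma(x, deltas, f, H, C, gamma):
    """Pessimistic value function
    v_gamma(x) = max_{delta in Delta} f(x, delta)
                 + gamma * ||H(x, delta)||_inf - C(delta),
    evaluated by enumeration over a finite (hypothetical) set Delta."""
    best = -float("inf")
    for d in deltas:
        val = f(x, d) + gamma * max(abs(h) for h in H(x, d)) - C(d)
        best = max(best, val)
    return best
```

Note how the attack cost C(δ) caps the adversary: large perturbations are only worth paying for when the induced objective damage plus residual exceeds their cost.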
Comments

• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

\[
\underset{x \in S}{\text{minimize}}\ \ \varphi(x) \;\triangleq\; \theta(x) + v(x), \quad \text{with } v(x) \;\triangleq\; \max_{1 \le j \le J}\ \underset{\delta \in \Delta \cap \mathbb{R}^L_+}{\text{maximum}}\ \big[\, g_j(x, \delta) - h_j(x)^\top \delta \,\big].
\]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
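The majorization-minimization idea can be shown on the smallest possible difference-of-convex example. This is a sketch of the DCA-style surrogate step, not the paper's algorithm: minimize g(x) − h(x) with g(x) = x², replacing h by its affine minorant at the current iterate.

```python
def dc_mm(h_prime, x0, iters=20):
    """MM (DCA) sketch for the toy DC program: minimize x**2 - h(x).
    At x_k, majorize by the convex surrogate x**2 - h(x_k) - h'(x_k)*(x - x_k),
    whose minimizer is x_{k+1} = h'(x_k) / 2 (h' = any subgradient of h)."""
    x = x0
    for _ in range(iters):
        x = h_prime(x) / 2.0
    return x
```

For h(x) = 2|x| the objective x² − 2|x| has the two local minimizers ±1, and the iteration converges to whichever one the starting point selects; this is the expected behavior of MM, which targets critical/stationary points rather than global minimizers.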
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
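On a finite sample, CVaR (and hence In-CVaR, via the identity above) is cheap to compute. A minimal sketch using the Rockafellar-Uryasev minimization formula, whose minimum over a finite sample is attained at a sample point (the function names are illustrative):

```python
def cvar(samples, alpha):
    """Empirical CVaR at level 0 <= alpha < 1 via the Rockafellar-Uryasev
    formula CVaR_alpha(Z) = min_c { c + E[(Z - c)_+] / (1 - alpha) }."""
    n = len(samples)
    def ru(c):
        return c + sum(max(0.0, z - c) for z in samples) / ((1 - alpha) * n)
    return min(ru(c) for c in samples)

def in_cvar(samples, alpha, beta):
    # Interval CVaR via the identity combining the two one-sided CVaRs
    return ((1 - alpha) * cvar(samples, alpha)
            - (1 - beta) * cvar(samples, beta)) / (beta - alpha)
```

For the sample {1, 2, 3, 4}: CVaR at α = 0.5 is the mean of the top half (3.5), and In-CVaR over (α, β) = (0, 0.5) is the mean of the bottom half (1.5), matching the two-sided tail-control interpretation.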
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

\[
\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z; \tau) + \text{bPOD}(Z; u) - 1 \;\triangleq\; \text{bPOI}^{\,u}_{\,\tau}(Z).
\]

Two risk minimization problems of a composite random functional:

\[
\underset{\theta \in \Theta}{\text{minimize}}\ \underbrace{\text{In-CVaR}^{\,\beta}_{\,\alpha}\big(Z(\theta)\big)}_{\substack{\text{a difference of two convex} \\ \text{SP value functions}}} \qquad \text{and} \qquad \underset{\theta \in \Theta}{\text{minimize}}\ \underbrace{\text{bPOI}^{\,u}_{\,\tau}\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with a} \\ \text{difference-of-convex objective}}},
\]

where

\[
Z(\theta) \;\triangleq\; c\Big( \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet, z)\, -\, h(\bullet, z) \text{ is difference-of-convex}} \Big),
\]

with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
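Empirically, bPOE also admits a direct convex minimization representation (due to Mafusalov and Uryasev): bPOE(Z; τ) = min over a ≥ 0 of E[(a(Z − τ) + 1)_+]. On a finite sample this objective is convex piecewise linear in a, so the minimum is attained at a = 0 or at a kink; a minimal sketch:

```python
def bpoe(samples, tau):
    """Empirical buffered probability of exceedance via
    bPOE(Z; tau) = min_{a >= 0} E[(a*(Z - tau) + 1)_+].
    The sample objective is convex piecewise linear in a, so it suffices
    to check a = 0 and the kinks a = 1/(tau - z_i) for z_i < tau."""
    cands = [0.0] + [1.0 / (tau - z) for z in samples if z < tau]
    n = len(samples)
    def obj(a):
        return sum(max(0.0, a * (z - tau) + 1.0) for z in samples) / n
    return min(obj(a) for a in cands)
```

Consistency check with the CVaR definition of bPOE: for the sample {1, 2, 3, 4}, CVaR at level 0.5 equals 3.5, so bPOE at threshold τ = 3.5 should equal 1 − 0.5 = 0.5.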
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

\[
\underbrace{\underset{\eta_1 \in \Upsilon_1}{\text{minimum}}\ \mathbb{E}_\omega\big[ \varphi_1(\theta, \eta_1, \omega) \big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\underset{\eta_2 \in \Upsilon_2}{\text{minimum}}\ \mathbb{E}_\omega\big[ \varphi_2(\theta, \eta_2, \omega) \big]}_{\text{denoted } v_2(\theta)}.
\]

Both v_1 and v_2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of all: what can be computed?

[Diagram: hierarchy of stationarity concepts, from sharpest to weakest —
SOSC ⇒ local minimum ⇒ SONC ⇒ directional stationary (for dd functions) ⇒ limiting stationary ⇒ Clarke stationary ⇒ critical (for dc functions).]

Relationship between the stationary points

– Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of the functions is continuously differentiable.
– Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute.
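A tiny numerical illustration of why directional stationarity is sharper than Clarke stationarity: for f(x) = −|x|, the point x = 0 is Clarke stationary (0 lies in the Clarke subdifferential [−1, 1]) yet f′(0; d) = −|d| < 0 in every direction, so it fails the directional test. A sketch with a simple forward-difference approximation of the one-sided directional derivative:

```python
def dir_derivative(f, x, d, t=1e-8):
    # One-sided directional derivative f'(x; d) by a forward difference;
    # exact here since f is piecewise linear near x on the ray x + t*d
    return (f(x + t * d) - f(x)) / t

f = lambda x: -abs(x)
# x = 0: Clarke stationary for f, but f'(0; d) = -|d| < 0 for every d != 0,
# so x = 0 is NOT a directional stationary point (nor a local minimizer).
```

This is exactly the gap the hierarchy above describes: a Clarke-stationary point certified by an algorithm may still admit strict descent directions.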
Computational Tools

– Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
– A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithms;
  – dealing with non-smoothness: nonsmooth Newton algorithms;
  – dealing with pathological constraints: (exact) penalty methods.
– An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• piecewise smooth*, e.g., piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• sub-analytic and semi-analytic.

* and many more
[Inclusion diagram reproduced from Section 4.4 "Classes of Nonsmooth Functions" (p. 231) of the monograph, relating the classes: PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable, all within locally Lipschitz + directionally differentiable; PQ and PLQ ⊂ PA ⊂ dc; dc and implicit convex-concave (icc) within dc-composite and difference-of-max-convex; and (strongly/locally) weakly convex, lower-C², and C¹/LC¹ functions shown as locally dc under the conditions noted there (open or closed convex domains).]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to the publisher on Friday, September 11, 2020.
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minimax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.1.9) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̃_c 683
Bibliography 691
Index 721
Examples of Non-Problems
I. Structured Learning

Piecewise Affine Regression

extending classical affine regression:

\[
\underset{a, \alpha, b, \beta}{\text{minimize}}\ \ \frac{1}{N} \sum_{s=1}^{N} \Big[\, y_s - \underbrace{\Big( \max_{1 \le i \le k_1} \big( (a^i)^\top x_s + \alpha_i \big) - \max_{1 \le i \le k_2} \big( (b^i)^\top x_s + \beta_i \big) \Big)}_{\text{capturing all piecewise affine functions}} \,\Big]^2,
\]

illustrating coupled non-convexity and non-differentiability.

Application: estimating change points/hyperplanes in linear regression.
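The difference-of-max-affine model and its empirical loss are easy to state in code; a minimal evaluation sketch (helper names hypothetical), which is the objective that the iterative convex-programming algorithms below drive to a directional stationary point:

```python
def pa_loss(A, alpha, B, beta, X, Y):
    """Empirical least-squares loss of the piecewise-affine model
    m(x) = max_i((a^i)'x + alpha_i) - max_i((b^i)'x + beta_i),
    a difference of two max-affine (convex) functions, which can
    represent every piecewise affine function."""
    def dma(x):
        m1 = max(sum(aij * xj for aij, xj in zip(a, x)) + al
                 for a, al in zip(A, alpha))
        m2 = max(sum(bij * xj for bij, xj in zip(b, x)) + be
                 for b, be in zip(B, beta))
        return m1 - m2
    return sum((y - dma(x)) ** 2 for x, y in zip(X, Y)) / len(Y)
```

Sanity check: the absolute-value function |x| = max(x, −x) − 0 is representable, so data generated by it is fit with zero loss.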
Comments

• This is a convex quadratic composed with a piecewise affine function; thus it is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming-based algorithm.
• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

\[
\varphi_1\Big( \max_{1 \le i \le k_{11}} \psi^1_{1i}(x) - \max_{1 \le i \le k_{12}} \psi^1_{2i}(x) \Big) \;-\; \varphi_2\Big( \max_{1 \le i \le k_{21}} \psi^2_{1i}(x) - \max_{1 \le i \le k_{22}} \psi^2_{2i}(x) \Big),
\]

where all functions are convex.
Logical Sparsity Learning

\[
\underset{\beta \in \mathbb{R}^d}{\text{minimize}}\ \ f(\beta) \quad \text{subject to } \sum_{j=1}^{d} a_{ij}\, |\beta_j|_0 \le b_i, \quad i = 1, \ldots, m,
\]

where

\[
|t|_0 \;\triangleq\; \begin{cases} 1 & \text{if } t \ne 0 \\ 0 & \text{otherwise.} \end{cases}
\]

• strong/weak hierarchy, e.g., |β_j|_0 ≤ |β_k|_0;
• cost-based variable selection: coefficients ≥ 0 but not all equal;
• hierarchical/group variable selection.
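The affine sparsity constraints (ASCs) are purely combinatorial statements about the support of β; a minimal feasibility-check sketch (helper names hypothetical):

```python
def l0(t):
    # the 'zero norm' |t|_0: 1 if t is nonzero, else 0
    return 1 if t != 0 else 0

def asc_satisfied(A, b, beta):
    """Check the affine sparsity constraints sum_j a_ij * |beta_j|_0 <= b_i.
    E.g. the strong hierarchy |beta_j|_0 <= |beta_k|_0 is encoded by the
    row a with a_j = 1, a_k = -1, and right-hand side b_i = 0."""
    return all(sum(aj * l0(bj) for aj, bj in zip(row, beta)) <= bi
               for row, bi in zip(A, b))
```

With the hierarchy row (1, −1) and b = 0, activating β_1 while β_2 = 0 violates the constraint, whereas activating β_2 alone is feasible, which is exactly the logical "β_1 may enter only if β_2 does" relationship.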
Comments

• The ℓ0 function is discrete in nature; an ASC can be formulated either via complementarity constraints or, under boundedness of the original variables, via integer variables (the big-M formulation) when A is nonnegative.
• A soft formulation of the ASCs is possible, leading to an objective of the kind

\[
\underset{\beta \in \mathbb{R}^d}{\text{minimize}}\ \ f(\beta) + \gamma \sum_{i=1}^{m} \Big[\, \sum_{j=1}^{d} a_{ij}\, |\beta_j|_0 - b_i \,\Big]_+ \quad \text{for a given } \gamma > 0.
\]

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When |•|_0 is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

\[
\underset{\theta = (W^\ell, w^\ell)_{\ell=1}^{L}}{\text{minimize}}\ \ \frac{1}{S} \sum_{s=1}^{S} \big[\, y_s - f_\theta(x_s) \,\big]^2,
\]

where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:

\[
f_\theta(x) \;=\; \underbrace{\big( W^L \bullet + w^L \big)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\big( 0,\, W^2 \bullet + w^2 \big)}_{\text{second layer}} \circ \underbrace{\max\big( 0,\, W^1 x + w^1 \big)}_{\text{input layer}};
\]

schematically: output = linear ∘ ⋯ ∘ max ∘ linear ∘ max ∘ linear applied to the input.
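The L-fold composite itself is a few lines of code; a minimal forward-pass sketch matching the displayed structure (an affine output layer, ReLU on all hidden layers):

```python
def f_theta(layers, x):
    """Forward pass of the L-fold composite: each hidden layer maps
    u -> max(0, W u + w) componentwise, and the output layer is affine
    (no activation), matching f_theta in the slide above."""
    u = x
    for l, (W, w) in enumerate(layers):
        u = [sum(Wi[j] * u[j] for j in range(len(u))) + wi
             for Wi, wi in zip(W, w)]
        if l < len(layers) - 1:      # ReLU on all layers except the last
            u = [max(0.0, t) for t in u]
    return u
```

Each max introduces a nondifferentiable kink, and the L-fold composition of these kinked maps with the squared loss is precisely the multi-composite nonconvex nonsmooth structure the "non"-framework is built to handle.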
Examples of Non-Problems
II Smart Planning
Power Systems Planning
ω1 uncertain renewable energy generationω2 uncertain demand of electricity
x production from traditional thermal plants
y production from backup fast-ramping generators
minimizexisinX
ϕ(x) + Eω1ω2
[ψ(x ω1 ω2)
]where the recourse function is
ψ(x ω1 ω2) minimumy
ygtc([ω2 minus xminus ω1]+
)subject to A(ω)x+Dy ge ξ(ω)
Unit cost of fast-ramping generators depends on
I observed renewable production amp demand of electricity
I shortfall due to the first-stage thermal power decisions
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
(Simplified) Defender-Attacker Game with Costs
Underlying problem is minimize cgtx subject to Ax = b and x ge 0
Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem
minimizexge0
(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb
]+
∥∥∥infin︸ ︷︷ ︸
constraint residual
Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem
maximumδ isin∆
λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective
+λ2
∥∥∥ [ (A+ δA)xminus bminus δb]+
∥∥∥infin︸ ︷︷ ︸
disrupt constraints
minusC(δ)︸︷︷︸cost
bull Many variations are possible
bull Offer alternative approaches to the convexity-based robust formulations
bull Adversarial Programming
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation, via a pessimistic value-function minimization: given $\gamma > 0$,

$$\min_{x \in S}\; \theta(x) + v_\gamma(x),$$

where $v_\gamma(x)$ models constraint violation together with the cost of model attacks:

$$v_\gamma(x) \,\triangleq\, \max_{\delta \in \Delta}\; \varphi_\gamma(x, \delta) \,\triangleq\, f(x, \delta) + \gamma \underbrace{\| H(x, \delta) \|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}},$$

with the novel features:
• a combined perturbation $\delta$ in a joint uncertainty set $\Delta \subseteq \mathbb{R}^L$;
• constraint satisfaction robustified via penalization.
Comments

• Related the value-function optimization problem (VFOP) to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of $f(\bullet, \delta)$ and $H(\bullet, \delta)$, a linear structure of $C(\delta)$ with a set-up component, and a special simplex-type uncertainty set $\Delta$.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

$$\min_{x \in S}\; \varphi(x) \,\triangleq\, \theta(x) + v(x), \quad \text{with} \quad v(x) \,\triangleq\, \max_{1 \le j \le J}\; \max_{\delta \in \Delta \cap \mathbb{R}^L_+} \big[\, g_j(x, \delta) - h_j(x)^\top \delta \,\big].$$

• Implemented the algorithm and reported computational results demonstrating the viability of the robust paradigm in the "non"-framework.
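The MM idea can be illustrated on a toy instance: when $v$ is a pointwise max of differences $g_j - h_j$ with all $g_j, h_j$ convex, linearizing each concave part $-h_j$ at the current iterate gives a convex majorant of $\varphi$ that touches it there, so minimizing the surrogate guarantees descent. The sketch below is a made-up 1-D instance; the specific functions, the grid search standing in for a convex solver, and the iteration count are all illustrative assumptions, not the talk's algorithm.

```python
# Sketch of a majorization-minimization loop for min_x theta(x) + v(x),
# v(x) = max_j [ g_j(x) - h_j(x) ] with g_j, h_j convex (1-D toy example).

def mm_minimize(theta, gs, hs, dhs, x0, iters=30):
    """gs, hs: lists of convex functions; dhs: derivatives of the hs."""
    x = x0
    for _ in range(iters):
        # Convex majorant: linearize each concave piece -h_j at the current x;
        # convexity of h_j gives h_j(z) >= h_j(x) + h_j'(x)(z - x).
        def surrogate(z, xk=x):
            return theta(z) + max(
                g(z) - (h(xk) + dh(xk) * (z - xk))
                for g, h, dh in zip(gs, hs, dhs))
        # 1-D surrogate minimized by a fine grid search (stand-in solver).
        grid = [x + (t - 500) * 0.01 for t in range(1001)]
        x = min(grid, key=surrogate)
    return x

theta = lambda z: 0.5 * z * z
gs  = [lambda z: (z - 2.0) ** 2, lambda z: (z + 1.0) ** 2]
hs  = [lambda z: 0.5 * z * z,    lambda z: z * z]
dhs = [lambda z: z,              lambda z: 2.0 * z]
x_star = mm_minimize(theta, gs, hs, dhs, x0=0.0)
```

The standard MM descent chain $\varphi(x_{k+1}) \le \widehat{\varphi}(x_{k+1}) \le \widehat{\varphi}(x_k) = \varphi(x_k)$ holds because the surrogate majorizes $\varphi$ and agrees with it at $x_k$.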
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval conditional value-at-risk of a random variable $Z$ at levels $0 \le \alpha < \beta \le 1$, treating two-sided estimation errors:

$$\text{In-CVaR}_\alpha^\beta(Z) \,\triangleq\, \frac{1}{\beta - \alpha} \int_{\text{VaR}_\beta(Z) > z \ge \text{VaR}_\alpha(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta - \alpha} \big[\, (1 - \alpha)\,\text{CVaR}_\alpha(Z) - (1 - \beta)\,\text{CVaR}_\beta(Z) \,\big].$$

Buffered probability of exceedance/deceedance at level $\tau > 0$:

$$\text{bPOE}(Z, \tau) \,\triangleq\, 1 - \min\big\{\, \alpha \in [0, 1] : \text{CVaR}_\alpha(Z) \ge \tau \,\big\}, \qquad \text{bPOD}(Z, u) \,\triangleq\, \min\big\{\, \beta \in [0, 1] : \text{In-CVaR}_0^\beta(Z) \ge u \,\big\}.$$
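The CVaR identity above makes In-CVaR easy to estimate from samples: estimate each CVaR as the average of the worst tail fraction and combine. A minimal Monte Carlo sketch, with the tail-size convention (a ceiling rather than an interpolated quantile) being a simplifying assumption:

```python
import math

def cvar(samples, alpha):
    """Sample CVaR_alpha: mean of the worst (1 - alpha) fraction of samples."""
    zs = sorted(samples, reverse=True)
    k = max(1, math.ceil(len(zs) * (1.0 - alpha)))
    return sum(zs[:k]) / k

def in_cvar(samples, alpha, beta):
    """In-CVaR_alpha^beta via the identity
    [(1 - alpha) CVaR_alpha - (1 - beta) CVaR_beta] / (beta - alpha)."""
    assert 0.0 <= alpha < beta <= 1.0
    return ((1.0 - alpha) * cvar(samples, alpha)
            - (1.0 - beta) * cvar(samples, beta)) / (beta - alpha)
```

On the samples $1, \dots, 100$, for instance, $\text{CVaR}_{0.9}$ averages the top ten values (95.5), and $\text{In-CVaR}_0^{0.9}$ recovers the average of the middle band $1, \dots, 90$ (45.5), matching the integral definition.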
Buffered probability of the interval $(\tau, u]$

We have the following bound for an interval probability:

$$\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \,\triangleq\, \text{bPOI}_\tau^u(Z).$$

Two risk-minimization problems of a composite random functional:

$$\min_{\theta \in \Theta}\; \underbrace{\text{In-CVaR}_\alpha^\beta(Z(\theta))}_{\substack{\text{a difference of two convex} \\ \text{SP value functions}}} \qquad \text{and} \qquad \min_{\theta \in \Theta}\; \underbrace{\text{bPOI}_\tau^u(Z(\theta))}_{\substack{\text{a stochastic program with} \\ \text{a difference-of-convex objective}}},$$

where

$$Z(\theta) \,\triangleq\, c\big(\, \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet,\, z) \,-\, h(\bullet,\, z) \text{ is difference-of-convex}} \,\big),$$

with $c : \mathbb{R} \to \mathbb{R}$ a univariate piecewise affine function, $Z$ an $m$-dimensional random vector, and, for each $z$, $g(\bullet, z)$ and $h(\bullet, z)$ both convex functions.
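Since $\text{CVaR}_\alpha(Z)$ is nondecreasing in $\alpha$, the slide's definition of bPOE can be evaluated on samples by scanning $\alpha$ for the smallest level whose CVaR reaches $\tau$. A sketch under that definition, with the grid resolution and the sample-CVaR convention as illustrative assumptions:

```python
import math

def cvar(samples, alpha):
    zs = sorted(samples, reverse=True)
    k = max(1, math.ceil(len(zs) * (1.0 - alpha)))
    return sum(zs[:k]) / k

def bpoe(samples, tau, grid_n=1000):
    """bPOE(Z, tau) = 1 - min{alpha in [0, 1] : CVaR_alpha(Z) >= tau},
    approximated by scanning alpha on a grid (CVaR_alpha is nondecreasing)."""
    for i in range(grid_n + 1):
        alpha = i / grid_n
        if cvar(samples, alpha) >= tau:
            return 1.0 - alpha
    return 0.0   # tau above the largest sample: buffered exceedance is 0
```

A useful sanity check is that bPOE dominates the plain probability of exceedance, which is part of why it is called "buffered."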
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over $\theta \in \Theta$

$$\underbrace{\min_{\eta_1 \in \Upsilon_1}\; \mathbb{E}_\omega\big[\, \varphi_1(\theta, \eta_1, \omega) \,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\min_{\eta_2 \in \Upsilon_2}\; \mathbb{E}_\omega\big[\, \varphi_2(\theta, \eta_2, \omega) \,\big]}_{\text{denoted } v_2(\theta)}.$$

Both $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of all, what can be computed?

[Diagram: relationship between the stationary points — critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions), second-order necessary conditions (SONC), local minima, second-order sufficient conditions (SOSC).]

▸ Stationarity:
– smooth case: $\nabla V(x) = 0$;
– nonsmooth case: $0 \in \partial V(x)$ [subdifferential; many calculus rules fail], e.g. $\partial V_1(x) + \partial V_2(x) \ne \partial(V_1 + V_2)(x)$ unless one of the functions is continuously differentiable.
▸ Directional stationarity: $V'(x; d) \ge 0$ for all directions $d$.

• The sharper the concept, the harder it is to compute.
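The gap between the concepts shows up already in one dimension. For the (made-up) example $f(x) = -|x|$, the point $x = 0$ is Clarke stationary, since $0 \in \partial_C f(0) = [-1, 1]$, yet it is not directionally stationary: $f'(0; d) = -|d| < 0$ for every $d \ne 0$, so every direction is a descent direction. A numerical check via one-sided forward differences:

```python
# x = 0 is Clarke stationary for f(x) = -|x| but not directionally
# stationary: the one-sided directional derivative is negative both ways.

def ddir(f, x, d, t=1e-8):
    """One-sided directional derivative f'(x; d) by a forward difference."""
    return (f(x + t * d) - f(x)) / t

f = lambda x: -abs(x)
print(ddir(f, 0.0, 1.0))    # -1.0: descent to the right
print(ddir(f, 0.0, -1.0))   # -1.0: descent to the left too
```

This is why algorithms that only guarantee Clarke stationarity can stall at points a directional-stationarity-seeking method would reject.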
Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
▸ A fusion of first-order and second-order algorithms:
– dealing with nonconvexity: the majorization-minimization algorithm;
– dealing with nonsmoothness: the nonsmooth Newton algorithm;
– dealing with pathological constraints: the (exact) penalty method.
▸ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics;
– dealing with stochastic programming in operations research;
– dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:
• piecewise smooth*, e.g. piecewise linear-quadratic;
• difference-of-convex: facilitates algorithmic design via convex majorization;
• semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• subanalytic and semianalytic.

* and many more
[Figure (Section 4.4 of the monograph, p. 231): inclusion diagram of the classes of nonsmooth functions — PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable (locally Lipschitz + directionally differentiable); PA ⊂ PLQ; dc ⊂ dc composite; with diff-max-convex, implicit convex-concave (icc), strongly/weakly convex, locally dc, and lower-C² classes related under additional conditions such as C¹, LC¹, and an open or closed convex domain.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̃_c 683
Bibliography 691
Index 721
Piecewise Affine Regression

extending classical affine regression:

$$\min_{a, \alpha, b, \beta}\; \frac{1}{N} \sum_{s=1}^N \Big( y_s - \Big[\, \underbrace{\max_{1 \le i \le k_1} \big\{ (a^i)^\top x^s + \alpha_i \big\} \;-\; \max_{1 \le i \le k_2} \big\{ (b^i)^\top x^s + \beta_i \big\}}_{\text{capturing all piecewise affine functions}} \,\Big] \Big)^2,$$

illustrating coupled nonconvexity and nondifferentiability.

Estimates change points/hyperplanes in linear regression.
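The model class is easy to evaluate directly: a difference of two max-affine functions represents every piecewise affine map. A minimal sketch on made-up 1-D data (the parameter layout and the toy target $|x|$ are illustrative assumptions):

```python
# Evaluating the difference-of-max-affine regression model and its
# least-squares loss (1/N) sum_s (y_s - model(x_s))^2 on toy data.

def pa_model(x, a, alpha, b, beta):
    """Difference of two max-affine functions: any piecewise affine map."""
    return (max(sum(ai * xi for ai, xi in zip(arow, x)) + al
                for arow, al in zip(a, alpha))
            - max(sum(bi * xi for bi, xi in zip(brow, x)) + bl
                  for brow, bl in zip(b, beta)))

def pa_loss(data, a, alpha, b, beta):
    return sum((y - pa_model(x, a, alpha, b, beta)) ** 2
               for x, y in data) / len(data)

# |x| = max(x, -x) - max(0): representable exactly with k1 = 2, k2 = 1
a, alpha = [[1.0], [-1.0]], [0.0, 0.0]
b, beta = [[0.0]], [0.0]
data = [([t / 10.0], abs(t / 10.0)) for t in range(-10, 11)]
```

Fitting these parameters is exactly the coupled nonconvex, nondifferentiable problem on the slide; here the known-good parameters only demonstrate the model's expressiveness.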
Comments

• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming-based algorithm.
• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_1\Big( \max_{1 \le i \le k_{11}} \psi^1_{1i}(x) - \max_{1 \le i \le k_{12}} \psi^1_{2i}(x) \Big) \;-\; \varphi_2\Big( \max_{1 \le i \le k_{21}} \psi^2_{1i}(x) - \max_{1 \le i \le k_{22}} \psi^2_{2i}(x) \Big),$$

where all functions are convex.
Logical Sparsity Learning

$$\min_{\beta \in \mathbb{R}^d}\; f(\beta) \quad \text{subject to} \quad \sum_{j=1}^d a_{ij}\, |\beta_j|_0 \le b_i, \quad i = 1, \ldots, m,$$

where $|t|_0 \,\triangleq\, \begin{cases} 1 & \text{if } t \ne 0, \\ 0 & \text{otherwise.} \end{cases}$

• strong/weak hierarchy, e.g. $|\beta_j|_0 \le |\beta_k|_0$
• cost-based variable selection: coefficients $\ge 0$ but not all equal
• hierarchical/group variable selection
Comments

• The $\ell_0$ function is discrete in nature; an affine sparsity constraint (ASC) can be formulated either via complementarity constraints or, under boundedness of the original variables, via integer variables (the big-M formulation) when $A$ is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind

$$\min_{\beta \in \mathbb{R}^d}\; f(\beta) + \gamma \sum_{i=1}^m \Big[\, b_i - \sum_{j=1}^d a_{ij}\, |\beta_j|_0 \,\Big]_+$$

for a given $\gamma > 0$.
• When the matrix $A$ has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When $|\bullet|_0$ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
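The logical constraints are straightforward to check numerically, which is useful for validating any formulation. A sketch, where the hierarchy example encodes $|\beta_j|_0 \le |\beta_k|_0$ as the ASC row $x_j - x_k \le 0$; the tolerance and the toy data are illustrative assumptions:

```python
# Checking affine sparsity constraints sum_j a_ij |beta_j|_0 <= b_i, and the
# soft penalty written as on the slide (gamma and the data are made up).

def l0(t, tol=0.0):
    """The |.|_0 indicator: 1 if t is nonzero (beyond tol), else 0."""
    return 1 if abs(t) > tol else 0

def asc_satisfied(beta, A, b):
    return all(sum(aij * l0(bj) for aij, bj in zip(row, beta)) <= bi
               for row, bi in zip(A, b))

def soft_penalty(beta, A, b, gamma):
    # gamma * sum_i [ b_i - sum_j a_ij |beta_j|_0 ]_+ , as transcribed
    return gamma * sum(max(0.0, bi - sum(aij * l0(bj)
                                         for aij, bj in zip(row, beta)))
                       for row, bi in zip(A, b))

# strong hierarchy |beta_j|_0 <= |beta_k|_0 as the single ASC row x_j - x_k <= 0
A, b = [[1, -1]], [0]
```

So $\beta = (0, 2)$ respects the hierarchy (the child is zero while the parent is active) while $\beta = (3, 0)$ violates it; note this hierarchy row has a negative entry, precisely the case where the slide warns the ASC set may fail to be closed.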
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

$$\min_{\theta = (W^\ell,\, w^\ell)_{\ell=1}^L}\; \frac{1}{S} \sum_{s=1}^S \big[\, y_s - f_\theta(x^s) \,\big]^2,$$

where $f_\theta$ is the $L$-fold composite function with the nonsmooth max (ReLU) activation:

$$f_\theta(x) \;=\; \underbrace{\big( W^L \bullet + w^L \big)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\big( 0,\, W^2 \bullet + w^2 \big)}_{\text{second layer}} \circ \underbrace{\max\big( 0,\, W^1 x + w^1 \big)}_{\text{input layer}},$$

i.e., linear (output) $\circ \cdots \circ$ max $\circ$ linear $\circ$ max $\circ$ linear (input).
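The $L$-fold composition above is a short loop in code. A minimal pure-Python sketch of $f_\theta$ and the least-squares training loss, with toy sizes and the hand-picked $|x|$ network as illustrative assumptions:

```python
# Forward pass of the L-layer ReLU network f_theta and the training loss
# (1/S) sum_s [ y_s - f_theta(x^s) ]^2, in pure Python on toy data.

def relu(v):
    return [max(0.0, t) for t in v]

def affine(W, w, x):
    return [sum(Wij * xj for Wij, xj in zip(row, x)) + wi
            for row, wi in zip(W, w)]

def f_theta(layers, x):
    """layers = [(W^1, w^1), ..., (W^L, w^L)]; ReLU on all but the output."""
    for W, w in layers[:-1]:
        x = relu(affine(W, w, x))
    W, w = layers[-1]
    return affine(W, w, x)

def net_loss(layers, data):
    return sum((y - f_theta(layers, x)[0]) ** 2 for x, y in data) / len(data)

# |x| = relu(x) + relu(-x): one hidden layer of width 2, scalar output
layers = [([[1.0], [-1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]
data = [([t / 4.0], abs(t / 4.0)) for t in range(-8, 9)]
```

Each hidden layer contributes one nonsmooth max to the composition, which is exactly the multi-composite structure the slide emphasizes.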
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the $\ell_1$ penalization:

$$\min_{z \in Z,\, u}\; \zeta_\rho(z, u) \,\triangleq\, \frac{1}{S} \sum_{s=1}^S \varphi_s\big( u^{s,L} \big) + \rho \sum_{s=1}^S \sum_{\ell=1}^L \sum_{j=1}^N \underbrace{\Big|\, u^{s,\ell}_j - \psi^\ell_j\big( z^\ell, u^{s,\ell-1} \big) \,\Big|}_{\text{penalizing the components}}.$$

• Rigorous algorithms, with supporting convergence theory, for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds for nondifferentiable optimization problems.
Examples of Non-Problems
II. Smart Planning

Power Systems Planning

$\omega_1$: uncertain renewable energy generation; $\omega_2$: uncertain demand of electricity; $x$: production from traditional thermal plants; $y$: production from backup fast-ramping generators.

$$\min_{x \in X}\; \varphi(x) + \mathbb{E}_{\omega_1, \omega_2}\big[\, \psi(x, \omega_1, \omega_2) \,\big],$$

where the recourse function is

$$\psi(x, \omega_1, \omega_2) \,\triangleq\, \min_y\; y^\top c\big( [\, \omega_2 - x - \omega_1 \,]_+ \big) \quad \text{subject to } A(\omega)x + Dy \ge \xi(\omega).$$

The unit cost of the fast-ramping generators depends on
▸ the observed renewable production and the demand of electricity;
▸ the shortfall due to the first-stage thermal power decisions.
Comments

• Leads to the study of the class of two-stage, linearly bi-parameterized stochastic programs with quadratic recourse:

$$\min_x\; \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_\xi\big[\, \psi(x, \xi) \,\big] \quad \text{subject to } x \in X \subseteq \mathbb{R}^{n_1},$$

where the recourse function $\psi(x, \xi)$ is the optimal objective value of the quadratic program (QP), with $Q$ symmetric positive semidefinite:

$$\psi(x, \xi) \,\triangleq\, \min_y\; \big[\, f(\xi) + G(\xi)x \,\big]^\top y + \tfrac{1}{2}\, y^\top Q y \quad \text{subject to } y \in Y(x, \xi) \,\triangleq\, \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \ge b(\xi) \,\big\}.$$

• The recourse function $\psi(\bullet, \xi)$ is of the implicit convex-concave kind.
• Developed a combined regularization, convexification, and sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
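The bi-parameterized structure (the first stage $x$ enters both the linear objective coefficient and the constraint right-hand side of the second stage) can be seen in a toy instance with a one-dimensional second stage, where the inner QP has a closed-form solution. Everything below (scenario data, $\varphi$, $q$) is a made-up sketch, not the talk's method:

```python
# Toy sample-average approximation of the bi-parameterized recourse with a
# 1-D second-stage QP solved in closed form:
#   psi(x, xi) = min_{y >= b(xi) - a(xi) x} (f(xi) + g(xi) x) y + (q/2) y^2

def psi(x, xi, q=1.0):
    a, b, f, g = xi
    lin = f + g * x                  # x enters the linear coefficient ...
    lb = b - a * x                   # ... and the constraint bound
    y = max(lb, -lin / q)            # projected unconstrained minimizer
    return lin * y + 0.5 * q * y * y

def saa_objective(x, scenarios, phi=lambda x: x * x):
    """phi(x) + (1/K) sum_k psi(x, xi_k): the SAA of the two-stage program."""
    return phi(x) + sum(psi(x, xi) for xi in scenarios) / len(scenarios)

scenarios = [(1.0, 2.0, 0.5, -1.0), (0.5, 1.0, 1.0, 0.3)]
```

Because the optimal $y$ switches between the interior and boundary pieces as $x$ varies, $\psi(\bullet, \xi)$ is nonsmooth in $x$; this is the piecewise behavior behind its implicit convex-concave structure.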
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
(Simplified) Defender-Attacker Game with Costs
Underlying problem is minimize cgtx subject to Ax = b and x ge 0
Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem
minimizexge0
(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb
]+
∥∥∥infin︸ ︷︷ ︸
constraint residual
Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem
maximumδ isin∆
λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective
+λ2
∥∥∥ [ (A+ δA)xminus bminus δb]+
∥∥∥infin︸ ︷︷ ︸
disrupt constraints
minusC(δ)︸︷︷︸cost
bull Many variations are possible
bull Offer alternative approaches to the convexity-based robust formulations
bull Adversarial Programming
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation
via a pessimistic value-function minimization given γ gt 0
minimizexisinS
θ(x) + vγ(x)
where vγ(x) models constraint violation with cost of model attacks
vγ(x) maximum over δ isin ∆ of
φγ(x δ) f(x δ) + γ
H(x δ) infin︸ ︷︷ ︸constraint residual
minus C(δ)︸ ︷︷ ︸attack cost
with the novel features
bull combined perturbation δ in joint uncertainty set ∆ sube RL
bull robustify constraint satisfaction via penalization
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments

• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming based algorithm.

• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is
\[
\varphi_1\Big( \max_{1 \le i \le k_{11}} \psi^1_{1i}(x) - \max_{1 \le i \le k_{12}} \psi^1_{2i}(x) \Big)
\;-\;
\varphi_2\Big( \max_{1 \le i \le k_{21}} \psi^2_{1i}(x) - \max_{1 \le i \le k_{22}} \psi^2_{2i}(x) \Big),
\]
where all functions are convex.
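The iterative convex-programming idea in the last bullet can be sketched for a one-dimensional difference-of-convex function: each step replaces the concave part by its affine minorant and minimizes the resulting convex surrogate. The toy functions, grid minimization, and step count below are illustrative choices, not the talk's algorithm.

```python
import numpy as np

# Toy instance (illustrative): f(x) = g(x) - h(x) with both parts convex.
def g(x):
    return x ** 2

def h(x):
    return abs(x - 1.0)

def h_sub(x):
    """A subgradient of h at x (any element of [-1, 1] is valid at x = 1)."""
    return 1.0 if x > 1.0 else -1.0

def dca_step(xk, grid):
    """Minimize the convex surrogate g(x) - [h(xk) + h'(xk)(x - xk)] on a grid."""
    surrogate = g(grid) - (h(xk) + h_sub(xk) * (grid - xk))
    return grid[np.argmin(surrogate)]

grid = np.linspace(-3.0, 3.0, 6001)
x = 2.5
for _ in range(20):
    x = dca_step(x, grid)
# x has settled at a critical point of g - h (here the global minimizer -0.5).
```

Each surrogate is a convex upper bound of f that touches it at the current iterate, so the iterates descend monotonically to a critical point.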
Logical Sparsity Learning
\[
\operatorname*{minimize}_{\beta \in \mathbb{R}^d}\; f(\beta)
\quad \text{subject to } \sum_{j=1}^{d} a_{ij}\, |\beta_j|_0 \le b_i, \quad i = 1, \cdots, m,
\]
where \(|t|_0 := \begin{cases} 1 & \text{if } t \ne 0 \\ 0 & \text{otherwise.} \end{cases}\)

• strong/weak hierarchy, e.g., \(|\beta_j|_0 \le |\beta_k|_0\)

• cost-based variable selection: coefficients \(\ge 0\) but not all equal

• hierarchical/group variable selection
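A minimal sketch of what the constraints above express, with an invented matrix encoding a strong-hierarchy row; the function names are mine, not from the talk.

```python
import numpy as np

def l0(beta):
    """Componentwise |.|_0: 1 where an entry is nonzero, 0 otherwise."""
    return (np.asarray(beta) != 0).astype(float)

def asc_feasible(A, b, beta):
    """Check the affine sparsity constraints A @ |beta|_0 <= b."""
    return bool(np.all(A @ l0(beta) <= b))

# A strong-hierarchy row |beta_1|_0 <= |beta_2|_0 is a = e_1 - e_2, b = 0:
A = np.array([[1.0, -1.0, 0.0]])
b = np.array([0.0])
assert asc_feasible(A, b, [0.0, 3.0, 1.0])      # beta_1 inactive: feasible
assert not asc_feasible(A, b, [2.0, 0.0, 1.0])  # beta_1 active while beta_2 is not
```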
Comments

• The \(\ell_0\) function is discrete in nature; an ASC can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.

• A soft formulation of ASCs is possible, leading to an objective of the kind
\[
\operatorname*{minimize}_{\beta \in \mathbb{R}^d}\; f(\beta) + \gamma \sum_{i=1}^{m} \Big[\, b_i - \sum_{j=1}^{d} a_{ij}\, |\beta_j|_0 \,\Big]_+
\]
for a given \(\gamma > 0\).

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.

• When \(|\bullet|_0\) is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
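One common folded-concave surrogate of \(|t|_0\) of the kind mentioned above is the capped \(\ell_1\) function \(\min(|t|/\varepsilon, 1)\), which is a difference of two convex pieces; the parameter \(\varepsilon\) and the check below are illustrative.

```python
import numpy as np

def capped_l1(t, eps=0.1):
    """The folded-concave surrogate min(|t|/eps, 1) of the |.|_0 function."""
    return np.minimum(np.abs(t) / eps, 1.0)

def capped_l1_dc(t, eps=0.1):
    """The same function written as a difference of two convex functions."""
    u = np.abs(t) / eps
    return u - np.maximum(u - 1.0, 0.0)   # convex minus convex

t = np.linspace(-1.0, 1.0, 201)
assert np.allclose(capped_l1(t), capped_l1_dc(t))
# Pointwise, capped_l1(t) -> |t|_0 as eps -> 0 for every fixed t != 0.
```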
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:
\[
\operatorname*{minimize}_{\theta = (W^\ell,\, w^\ell)_{\ell=1}^{L}}\;
\frac{1}{S} \sum_{s=1}^{S} \big[\, y_s - f_\theta(x_s) \,\big]^2,
\]
where \(f_\theta\) is the L-fold composite function with the nonsmooth max (ReLU) activation:
\[
f_\theta(x) = \underbrace{\big( W^L \bullet + w^L \big)}_{\text{Output Layer}} \circ \cdots \circ
\underbrace{\max\big( 0,\, W^2 \bullet + w^2 \big)}_{\text{Second Layer}} \circ
\underbrace{\max\big( 0,\, W^1 x + w^1 \big)}_{\text{Input Layer}}
\]
linear (output) ∘ ⋯ ∘ max ∘ linear ∘ max ∘ linear (input)
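A numpy sketch of \(f_\theta\) and the empirical risk above, with invented layer sizes and random data; not the talk's code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented sizes: inputs in R^3, two hidden layers of width 4, scalar output.
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((4, 4)),
      rng.standard_normal((1, 4))]
ws = [np.zeros(4), np.zeros(4), np.zeros(1)]

def f_theta(x):
    """The L-fold composite: ReLU hidden layers, affine output layer."""
    u = x
    for W, w in zip(Ws[:-1], ws[:-1]):
        u = np.maximum(0.0, W @ u + w)   # max(0, W u + w)
    return Ws[-1] @ u + ws[-1]           # output layer has no max

def empirical_risk(X, y):
    """(1/S) * sum_s [y_s - f_theta(x_s)]^2."""
    return float(np.mean([(ys - f_theta(xs)) ** 2 for xs, ys in zip(X, y)]))

X = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
risk = empirical_risk(X, y)
assert risk >= 0.0
```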
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the \(\ell_1\) penalization:
\[
\operatorname*{minimize}_{z \in Z,\, u}\; \zeta_\rho(z, u) :=
\frac{1}{S} \sum_{s=1}^{S} \varphi_s(u^{s,L}) + \rho
\sum_{\ell=1}^{L} \sum_{j=1}^{N}
\underbrace{\Big|\, u^{s,\ell}_j - \psi_{\ell j}(z^\ell, u^{s,\ell-1}) \,\Big|}_{\text{penalizing the components}}
\]

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
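The \(\ell_1\) penalization above treats the layer outputs as decision variables and moves the layer equations into the objective. A toy sketch (variable names are mine; all layers taken as bias-free ReLU maps for brevity): when every layer equation holds exactly, the penalty term vanishes and \(\zeta_\rho\) reduces to the plain loss.

```python
import numpy as np

def psi(W, u_prev):
    """One layer map psi(z, u): here max(0, W @ u), biases dropped."""
    return np.maximum(0.0, W @ u_prev)

def penalized_objective(Ws, us, x, y, rho):
    """zeta_rho: data loss on the last layer plus rho * l1 layer residuals."""
    loss = np.sum((y - us[-1]) ** 2)
    residual = 0.0
    u_prev = x
    for W, u in zip(Ws, us):
        residual += np.sum(np.abs(u - psi(W, u_prev)))
        u_prev = u
    return float(loss + rho * residual)

rng = np.random.default_rng(1)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
x = rng.standard_normal(2)
us = [psi(Ws[0], x)]
us.append(psi(Ws[1], us[0]))
y = us[-1][0]
# With every layer equation satisfied exactly, the penalty vanishes:
assert penalized_objective(Ws, us, x, y, rho=10.0) == 0.0
```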
Examples of Non-Problems

II Smart Planning

Power Systems Planning

\(\omega_1\): uncertain renewable energy generation; \(\omega_2\): uncertain demand of electricity

x: production from traditional thermal plants

y: production from backup fast-ramping generators
\[
\operatorname*{minimize}_{x \in X}\; \varphi(x) + \mathbb{E}_{\omega_1, \omega_2}\big[ \psi(x, \omega_1, \omega_2) \big],
\]
where the recourse function is
\[
\psi(x, \omega_1, \omega_2) := \operatorname*{minimum}_{y}\; y^\top c\big( [\, \omega_2 - x - \omega_1 \,]_+ \big)
\quad \text{subject to } A(\omega) x + D y \ge \xi(\omega).
\]
The unit cost of the fast-ramping generators depends on

▸ the observed renewable production and demand of electricity

▸ the shortfall due to the first-stage thermal power decisions
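A toy sketch of evaluating \(\psi\) and the sample-average objective above. For this one-variable recourse the minimizing y equals the shortfall, so the inner problem is solved in closed form; all costs and scenarios are invented.

```python
import numpy as np

def recourse(x, omega1, omega2, base_cost=1.0, ramp_cost=0.5):
    """psi(x, w1, w2): cheapest backup generation covering the shortfall.

    The unit cost c([.]_+) grows with the observed shortfall; for this
    one-variable toy the optimal y equals the shortfall itself.
    """
    shortfall = max(omega2 - x - omega1, 0.0)
    unit_cost = base_cost + ramp_cost * shortfall
    return unit_cost * shortfall

def expected_cost(x, phi, scenarios):
    """phi(x) + E[psi(x, w1, w2)] by sample average over scenarios."""
    return phi(x) + np.mean([recourse(x, w1, w2) for w1, w2 in scenarios])

# Shortfall 6 - 3 - 1 = 2 at unit cost 1 + 0.5 * 2 = 2 costs 4:
assert recourse(3.0, 1.0, 6.0) == 4.0
```

Note the nonconvexity: the unit cost multiplies the shortfall, so \(\psi(\cdot,\omega_1,\omega_2)\) is piecewise quadratic in x rather than piecewise linear.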
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
\[
\operatorname*{minimize}_{x}\; \zeta(x) := \varphi(x) + \mathbb{E}_\xi\big[ \psi(x, \xi) \big]
\quad \text{subject to } x \in X \subseteq \mathbb{R}^{n_1},
\]
where the recourse function \(\psi(x, \xi)\) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:
\[
\psi(x, \xi) := \operatorname*{minimum}_{y}\; \big[ f(\xi) + G(\xi) x \big]^\top y + \tfrac{1}{2}\, y^\top Q y
\quad \text{subject to } y \in Y(x, \xi) := \{\, y \in \mathbb{R}^{n_2} \mid A(\xi) x + D y \ge b(\xi) \,\}.
\]

• The recourse function \(\psi(\bullet, \xi)\) is of the implicit convex-concave kind.

• Developed a combined regularization-convexification sampling method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize \(c^\top x\) subject to \(Ax = b\) and \(x \ge 0\).

Defender's problem: anticipating the attacker's disruption \(\delta \in \Delta\), the defender undertakes the operation by solving the problem
\[
\operatorname*{minimize}_{x \ge 0}\; (c + \delta_c)^\top x + \rho\,
\underbrace{\big\| \big[ (A + \delta_A) x - b - \delta_b \big]_+ \big\|_\infty}_{\text{constraint residual}}
\]

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving the problem
\[
\operatorname*{maximum}_{\delta \in \Delta}\;
\lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}}
+ \lambda_2 \underbrace{\big\| \big[ (A + \delta_A) x - b - \delta_b \big]_+ \big\|_\infty}_{\text{disrupt constraints}}
- \underbrace{C(\delta)}_{\text{cost}}
\]

• Many variations are possible.

• Offers alternative approaches to the convexity-based robust formulations.

• Adversarial Programming
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation

via a pessimistic value-function minimization: given \(\gamma > 0\),
\[
\operatorname*{minimize}_{x \in S}\; \theta(x) + v_\gamma(x),
\]
where \(v_\gamma(x)\) models constraint violation with a cost of model attacks:
\[
v_\gamma(x) := \operatorname*{maximum}_{\delta \in \Delta}\;
\varphi_\gamma(x, \delta) := f(x, \delta) + \gamma\,
\underbrace{\| H(x, \delta) \|_\infty}_{\text{constraint residual}}
- \underbrace{C(\delta)}_{\text{attack cost}}
\]
with the novel features:

• combined perturbation δ in a joint uncertainty set \(\Delta \subseteq \mathbb{R}^L\)

• robustified constraint satisfaction via penalization
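A crude sketch of evaluating \(v_\gamma(x)\) by maximizing \(\varphi_\gamma(x,\cdot)\) over a finite sample of \(\Delta\), standing in for the exact inner maximization; every model piece below is a toy choice of mine.

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    """max over sampled delta of f(x,d) + gamma * ||H(x,d)||_inf - C(d)."""
    return max(f(x, d) + gamma * np.linalg.norm(H(x, d), np.inf) - C(d)
               for d in deltas)

# Toy model pieces: f(x,d) = d*x, H(x,d) = (x - d), C(d) = d^2, gamma = 2.
deltas = np.linspace(0.0, 1.0, 101)
val = v_gamma(1.0, deltas,
              f=lambda x, d: d * x,
              H=lambda x, d: np.array([x - d]),
              C=lambda d: d ** 2,
              gamma=2.0)
# On [0, 1] the maximand is 2 - d - d^2, maximized at the no-attack point d = 0.
assert val == 2.0
```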
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of \(f(\bullet, \delta)\) and \(H(\bullet, \delta)\), a linear structure of \(C(\delta)\) with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:
\[
\operatorname*{minimize}_{x \in S}\; \varphi(x) := \theta(x) + v(x),
\quad \text{with} \quad
v(x) := \max_{1 \le j \le J}\; \operatorname*{maximum}_{\delta \in \Delta \cap \mathbb{R}^L_+}
\big[\, g_j(x, \delta) - h_j(x)^\top \delta \,\big]
\]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems

III Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels \(0 \le \alpha < \beta \le 1\), treating two-sided estimation errors:
\[
\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z) := \frac{1}{\beta - \alpha}
\int_{\text{VaR}_\beta(Z) > z \ge \text{VaR}_\alpha(Z)} z \, dF_Z(z)
= \frac{1}{\beta - \alpha} \Big[ (1 - \alpha)\, \text{CVaR}_\alpha(Z) - (1 - \beta)\, \text{CVaR}_\beta(Z) \Big]
\]

Buffered probability of exceedance/deceedance at level \(\tau > 0\):
\[
\text{bPOE}(Z, \tau) := 1 - \min\big\{\, \alpha \in [0,1] : \text{CVaR}_\alpha(Z) \ge \tau \,\big\}
\qquad
\text{bPOD}(Z, u) := \min\big\{\, \beta \in [0,1] : \text{In-CVaR}^{\,\beta}_{\,0}(Z) \ge u \,\big\}
\]
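An empirical sketch of the In-CVaR identity above, using sample quantiles for VaR and tail averages for CVaR (standard estimators, not code from the talk).

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the sample upper (1 - alpha)-tail."""
    var = np.quantile(z, alpha)           # empirical VaR_alpha
    return z[z >= var].mean()

def in_cvar(z, alpha, beta):
    """The right-hand side of the In-CVaR identity above."""
    return ((1 - alpha) * cvar(z, alpha)
            - (1 - beta) * cvar(z, beta)) / (beta - alpha)

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
mid = in_cvar(z, 0.1, 0.9)
# In-CVaR averages z between the two VaR levels, so it lies between them:
assert np.quantile(z, 0.1) <= mid <= np.quantile(z, 0.9)
# ... and (up to sampling error) it matches the direct interval average:
direct = z[(np.quantile(z, 0.1) <= z) & (z < np.quantile(z, 0.9))].mean()
assert abs(mid - direct) < 0.05
```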
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:
\[
\mathbb{P}(u \ge Z > \tau) \le \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 =: \text{bPOI}^{\,u}_{\,\tau}(Z)
\]

Two risk minimization problems of a composite random functional:
\[
\operatorname*{minimize}_{\theta \in \Theta}\;
\underbrace{\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z(\theta))}_{\substack{\text{a difference of two convex} \\ \text{SP value functions}}}
\qquad \text{and} \qquad
\operatorname*{minimize}_{\theta \in \Theta}\;
\underbrace{\text{bPOI}^{\,u}_{\,\tau}(Z(\theta))}_{\substack{\text{a stochastic program with a} \\ \text{difference-of-convex objective}}}
\]
where
\[
Z(\theta) := c\Big( \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet, z) - h(\bullet, z) \text{ is difference-of-convex}} \Big)
\]
with \(c : \mathbb{R} \to \mathbb{R}\) a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, \(g(\bullet, z)\) and \(h(\bullet, z)\) both convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over \(\theta \in \Theta\)
\[
\underbrace{\operatorname*{minimum}_{\eta_1 \in \Upsilon_1}\; \mathbb{E}_\omega\big[ \varphi_1(\theta, \eta_1, \omega) \big]}_{\text{denoted } v_1(\theta)}
\;-\;
\underbrace{\operatorname*{minimum}_{\eta_2 \in \Upsilon_2}\; \mathbb{E}_\omega\big[ \varphi_2(\theta, \eta_2, \omega) \big]}_{\text{denoted } v_2(\theta)}
\]
Both \(v_1\) and \(v_2\) are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: relationship between the stationary points — local minimum (with the second-order conditions SONC/SOSC), directional stationary (for dd functions), critical (for dc functions), limiting stationary, and Clarke stationary.]

▸ Stationarity:
– smooth case: \(\nabla V(x) = 0\)
– nonsmooth case: \(0 \in \partial V(x)\) [subdifferential; many calculus rules fail], e.g., \(\partial V_1(x) + \partial V_2(x) \ne \partial (V_1 + V_2)(x)\) unless one of them is continuously differentiable

▸ Directional stationarity: \(V'(x; d) \ge 0\) for all directions d

• The sharper the concept, the harder to compute.
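The quoted failure of the sum rule can be checked concretely with \(V_1(x) = |x|\) and \(V_2(x) = -|x|\) at \(x = 0\), representing the (interval) subdifferentials directly:

```python
def sub_abs(x):
    """Subdifferential of |x| as an interval (lo, hi)."""
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)            # the full interval [-1, 1] at the kink

def sub_neg_abs(x):
    """Clarke subdifferential of -|x|: the negated interval."""
    lo, hi = sub_abs(x)
    return (-hi, -lo)

def interval_sum(a, b):
    return (a[0] + b[0], a[1] + b[1])

# V1 + V2 is identically zero, so its subdifferential at 0 is just {0} ...
sum_of_subs = interval_sum(sub_abs(0.0), sub_neg_abs(0.0))
# ... yet the sum of the two subdifferentials is the whole interval [-2, 2]:
assert sum_of_subs == (-2.0, 2.0)
```

Neither summand is continuously differentiable at 0, which is exactly when the inclusion \(\partial(V_1+V_2)(x) \subseteq \partial V_1(x) + \partial V_2(x)\) can be strict.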
Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions

▸ A fusion of first-order and second-order algorithms
– Dealing with non-convexity: majorization-minimization algorithm
– Dealing with non-smoothness: nonsmooth Newton algorithm
– Dealing with pathological constraints: (exact) penalty method

▸ An integration of sampling and deterministic methods
– Dealing with population optimization in statistics
– Dealing with stochastic programming in operations research
– Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions

Pervasive in applications; listed from narrower to progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic

• Difference-of-convex: facilitates algorithmic design via convex majorization

• Semismooth: enables fast algorithms of the Newton type

• Bouligand differentiable: locally Lipschitz + directionally differentiable

• Sub-analytic and semi-analytic

* and many more
[Diagram (Section 4.4, "Classes of Nonsmooth Functions", of the monograph): inclusion relationships among the nonsmooth function classes — piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), and piecewise affine (PA); PC² and PC¹ piecewise smooth; difference-of-convex (dc), dc-composite, and diff-max-convex; implicit convex-concave (icc); semismooth; B-differentiable; locally Lipschitz + directionally differentiable; and the (locally) strongly weakly convex and locally dc classes, with qualifications such as +C¹, LC¹, + lower-C², and open or closed convex domains.]
Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

submitted to the publisher on Friday, September 11, 2020

Table of Contents

Preface
Acronyms
Glossary of Notation
Index
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Comments

• This is a convex quadratic composed with a piecewise affine function, thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

  φ1( max_{1≤i≤k11} ψ^1_{1i}(x) − max_{1≤i≤k12} ψ^1_{2i}(x) ) − φ2( max_{1≤i≤k21} ψ^2_{1i}(x) − max_{1≤i≤k22} ψ^2_{2i}(x) )

where all the functions φ and ψ are convex.
Logical Sparsity Learning
minimize_{β∈R^d}  f(β)
subject to  Σ_{j=1}^d a_ij |β_j|_0 ≤ b_i,  i = 1, …, m

where |t|_0 ≜ 1 if t ≠ 0, and 0 otherwise. Such affine sparsity constraints can model:

• strong/weak hierarchy, e.g. |β_j|_0 ≤ |β_k|_0
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection
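A minimal numeric sketch of these constraints (a toy instance of my own; the helper names `l0` and `asc_satisfied` are illustrative, not from the talk). A hierarchy constraint |β_j|_0 ≤ |β_k|_0 is just one row of the linear system A|β|_0 ≤ b:

```python
import numpy as np

def l0(t, tol=1e-12):
    """Componentwise |t|_0: 1 where t is (numerically) nonzero, else 0."""
    return (np.abs(np.asarray(t)) > tol).astype(int)

def asc_satisfied(A, b, beta):
    """Check the affine sparsity constraints A @ |beta|_0 <= b componentwise."""
    return bool(np.all(A @ l0(beta) <= b))

# Strong hierarchy |beta_0|_0 <= |beta_1|_0 encoded as the single row (1, -1, 0)
# with right-hand side 0: beta_0 may be active only if beta_1 is.
A = np.array([[1, -1, 0]])
b = np.array([0])
print(asc_satisfied(A, b, np.array([0.0, 2.5, 1.0])))   # True: beta_0 inactive
print(asc_satisfied(A, b, np.array([3.0, 0.0, 1.0])))   # False: beta_0 active, beta_1 not
```

The discreteness of |•|_0 is what the complementarity and big-M reformulations in the next slide are designed to handle.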
Comments

• The ℓ0 function is discrete in nature; an ASC can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind

  minimize_{β∈R^d}  f(β) + γ Σ_{i=1}^m [ b_i − Σ_{j=1}^d a_ij |β_j|_0 ]_+

for a given γ > 0.
• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semi-continuous objective.
• When |•|_0 is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:

  minimize_{θ = (W^ℓ, w^ℓ)_{ℓ=1}^L}  (1/S) Σ_{s=1}^S [ y_s − f_θ(x_s) ]²

where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:

  f_θ(x) = ( W^L • + w^L ) ∘ ⋯ ∘ max( 0, W^2 • + w^2 ) ∘ max( 0, W^1 x + w^1 )
            [output layer]          [second layer]         [input layer]

i.e. linear output ∘ ⋯ ∘ max ∘ linear ∘ max ∘ linear input.
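The L-fold composite and its empirical squared loss can be sketched directly (a minimal example of my own; the two-layer weights below are arbitrary illustrative data, not from the talk):

```python
import numpy as np

def f_theta(x, Ws, ws):
    """L-fold composite: max(0, W a + w) on hidden layers, affine output layer."""
    a = np.asarray(x, dtype=float)
    for W, w in zip(Ws[:-1], ws[:-1]):
        a = np.maximum(0.0, W @ a + w)      # nonsmooth ReLU layer
    return Ws[-1] @ a + ws[-1]              # linear output layer

def empirical_loss(X, y, Ws, ws):
    """(1/S) * sum_s (y_s - f_theta(x_s))^2 for scalar network outputs."""
    preds = np.array([float(f_theta(x, Ws, ws)) for x in X])
    return float(np.mean((np.asarray(y) - preds) ** 2))

# Tiny fixed instance: one hidden ReLU layer, scalar output.
Ws = [np.array([[1.0, -1.0], [0.0, 1.0]]), np.array([[1.0, 1.0]])]
ws = [np.zeros(2), np.array([0.0])]
X = [np.array([1.0, 2.0]), np.array([2.0, 0.0])]
y = [2.0, 0.0]
print(empirical_loss(X, y, Ws, ws))   # 2.0
```

Even in this two-sample instance the loss is nonconvex and nonsmooth in (W, w) because of the max composition, which is what the exact-penalization treatment on the next slide targets.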
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:

  minimize_{z∈Z, u}  ζ_ρ(z, u) ≜ (1/S) Σ_{s=1}^S φ_s(u^{s,L}) + ρ Σ_{ℓ=1}^L Σ_{j=1}^N | u^{s,ℓ}_j − ψ_{ℓj}(z^ℓ, u^{s,ℓ−1}) |
                                                                  [penalizing the components]

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems
II Smart Planning
Power Systems Planning
ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

  minimize_{x∈X}  φ(x) + E_{ω1,ω2}[ ψ(x, ω1, ω2) ]

where the recourse function is

  ψ(x, ω1, ω2) ≜ minimum_y  yᵀ c( [ω2 − x − ω1]_+ )  subject to A(ω)x + Dy ≥ ξ(ω)

The unit cost of the fast-ramping generators depends on
▶ the observed renewable production and demand of electricity
▶ the shortfall due to the first-stage thermal power decisions
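Evaluating ψ for one scenario amounts to solving an LP whose cost vector depends on the shortfall (a sketch of my own with toy one-dimensional data and an assumed y ≥ 0; the shortfall-dependent cost function and all numbers are illustrative, not from the talk):

```python
import numpy as np
from scipy.optimize import linprog

def recourse(x, w1, w2, A, D, xi, unit_cost):
    """psi(x, w1, w2) = min_y c^T y  s.t.  A x + D y >= xi, y >= 0,
    where c = unit_cost([w2 - x - w1]_+) depends on the renewable shortfall."""
    shortfall = np.maximum(w2 - x - w1, 0.0)
    c = unit_cost(shortfall)
    # Rewrite A x + D y >= xi as -D y <= A x - xi for linprog's A_ub convention.
    res = linprog(c, A_ub=-D, b_ub=A @ x - xi,
                  bounds=[(0, None)] * D.shape[1])
    return res.fun

# One plant, one backup generator: demand 3, thermal x = 1, renewables w1 = 1,
# demand signal w2 = 4 -> shortfall 2, so the backup unit cost is 1 + 2 = 3.
A = np.array([[1.0]]); D = np.array([[1.0]]); xi = np.array([3.0])
cost = lambda s: 1.0 + s
print(recourse(np.array([1.0]), np.array([1.0]), np.array([4.0]),
               A, D, xi, cost))   # 6.0: y = 2 units at unit cost 3
```

Because the cost vector itself depends on x through the shortfall, ψ(•, ω1, ω2) is not a standard convex recourse function, which is what motivates the bi-parameterized analysis in the comments below.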
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

  minimize_x  ζ(x) ≜ φ(x) + E_ξ[ ψ(x, ξ) ]  subject to x ∈ X ⊆ R^{n1}

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

  ψ(x, ξ) ≜ minimum_y  [ f(ξ) + G(ξ)x ]ᵀ y + ½ yᵀ Q y  subject to y ∈ Y(x, ξ) ≜ { y ∈ R^{n2} | A(ξ)x + Dy ≥ b(ξ) }

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a combined regularization, convexification, and sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs
The underlying problem is: minimize cᵀx subject to Ax = b and x ≥ 0.

Defender's problem: anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

  minimize_{x≥0}  (c + δc)ᵀ x + ρ ‖ [ (A + δA)x − b − δb ]_+ ‖_∞
                                     [constraint residual]

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

  maximum_{δ∈Δ}  λ1 (c + δc)ᵀ x + λ2 ‖ [ (A + δA)x − b − δb ]_+ ‖_∞ − C(δ)
                 [disrupt objective]  [disrupt constraints]            [cost]

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
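The defender's objective for a fixed attack is straightforward to evaluate (a sketch of my own; the function name and the no-attack toy data are illustrative, not from the talk):

```python
import numpy as np

def defender_cost(x, c, A, b, dc, dA, db, rho):
    """Defender's objective under a fixed attack (dc, dA, db):
    (c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf."""
    residual = np.maximum((A + dA) @ x - b - db, 0.0)   # violated constraints only
    return float((c + dc) @ x + rho * np.max(residual, initial=0.0))

# No attack: the cost is c^T x plus rho times the worst constraint violation.
c = np.array([1.0, 1.0]); x = np.array([1.0, 2.0])
A = np.eye(2); b = np.zeros(2)
print(defender_cost(x, c, A, b, np.zeros(2), np.zeros((2, 2)), np.zeros(2), rho=2.0))  # 7.0
```

The max-of-plus composition makes this objective piecewise smooth in x even before the attacker's inner maximization is layered on top.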
Robustification of nonconvex problems

A general nonlinear program:

  minimize_{x∈S}  θ(x) + f(x)  subject to H(x) = 0

where
• S is a closed convex set in R^n, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in R^n containing S;
• H: O → R^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

  minimize_{x∈S}  θ(x) + maximum_{δ∈Δ} f(x, δ)  subject to H(x, δ) = 0 for all δ ∈ Δ
                                                 [easily infeasible in the current way of modeling]
New formulation

via a pessimistic value-function minimization: given γ > 0,

  minimize_{x∈S}  θ(x) + v_γ(x)

where v_γ(x) models constraint violation with the cost of model attacks:

  v_γ(x) ≜ maximum_{δ∈Δ}  φ_γ(x, δ) ≜ f(x, δ) + γ ‖ H(x, δ) ‖_∞ − C(δ)
                                              [constraint residual]  [attack cost]

with the novel features:
• combined perturbation δ in a joint uncertainty set Δ ⊆ R^L
• robustified constraint satisfaction via penalization
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

  minimize_{x∈S}  φ(x) ≜ θ(x) + v(x),  with  v(x) ≜ max_{1≤j≤J} maximum_{δ∈Δ∩R^L_+} [ g_j(x, δ) − h_j(x)ᵀ δ ]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
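The majorization-minimization idea can be seen in one dimension (a toy example of my own, not the talk's unified formulation): for the dc function f(x) = x² − 2|x|, each MM step minimizes the convex majorant obtained by linearizing the concave part −2|x| at the current iterate.

```python
import numpy as np

def mm_dc(x0, iters=20):
    """Majorization-minimization (DCA) for f(x) = g(x) - h(x) with
    g(x) = x^2, h(x) = 2|x|: each step minimizes the convex majorant
    g(x) - h(x_k) - s (x - x_k), where s is a subgradient of h at x_k."""
    x = x0
    for _ in range(iters):
        s = 2.0 * np.sign(x)    # subgradient of h(x) = 2|x|
        x = s / 2.0             # closed-form argmin of x^2 - s x
    return x

print(mm_dc(0.3))    # 1.0: a stationary point with f(1) = -1
print(mm_dc(-0.5))   # -1.0: the symmetric one
```

Each subproblem is convex (here even solvable in closed form), which is exactly the appeal of surrogation for the nonconvex nonsmooth outer problem.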
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control the left and right tails of a distribution simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

  In-CVaR_α^β(Z) ≜ (1/(β − α)) ∫_{VaR_β(Z) > z ≥ VaR_α(Z)} z dF_Z(z)
                 = (1/(β − α)) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ]

Buffered probability of exceedance/deceedance at level τ > 0:

  bPOE(Z, τ) ≜ 1 − min{ α ∈ [0, 1] : CVaR_α(Z) ≥ τ }
  bPOD(Z, u) ≜ min{ β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

  P(u ≥ Z > τ) ≤ bPOE(Z, τ) + bPOD(Z, u) − 1 ≜ bPOI_τ^u(Z)

Two risk minimization problems of a composite random functional:

  minimize_{θ∈Θ}  In-CVaR_α^β(Z(θ))   [a difference of two convex SP value functions]
  minimize_{θ∈Θ}  bPOI_τ^u(Z(θ))      [a stochastic program with a difference-of-convex objective]

where

  Z(θ) ≜ c( g(θ, Z) − h(θ, Z) )

with c: R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions, so g(•, z) − h(•, z) is a difference of convex functions.
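The CVaR identity above is easy to check empirically (a sketch of my own; the helper names and the four-point sample are illustrative, not from the talk). Empirical CVaR is computed here via the Rockafellar-Uryasev minimization formula, whose minimum over a finite sample is attained at a sample point:

```python
import numpy as np

def cvar(sample, alpha):
    """Empirical CVaR_alpha(Z) = min_t { t + E[(Z - t)_+] / (1 - alpha) }."""
    s = np.asarray(sample, dtype=float)
    return min(t + np.mean(np.maximum(s - t, 0.0)) / (1.0 - alpha) for t in s)

def in_cvar(sample, alpha, beta):
    """Empirical In-CVaR_alpha^beta(Z)
    = [ (1 - alpha) CVaR_alpha(Z) - (1 - beta) CVaR_beta(Z) ] / (beta - alpha)."""
    return ((1 - alpha) * cvar(sample, alpha)
            - (1 - beta) * cvar(sample, beta)) / (beta - alpha)

Z = [1.0, 2.0, 3.0, 4.0]
print(cvar(Z, 0.5))          # 3.5: mean of the worst half (right tail)
print(in_cvar(Z, 0.0, 0.5))  # 1.5: mean of the best half (left tail)
```

The two value functions being subtracted in `in_cvar` are each convex in the underlying decision, which is the difference-of-convex structure exploited in the comments that follow.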
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

  minimum_{η1∈Υ1} E_ω[ φ1(θ, η1, ω) ]  −  minimum_{η2∈Υ2} E_ω[ φ2(θ, η2, ω) ]
  [denoted v1(θ)]                          [denoted v2(θ)]

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All: What Can be Computed?

[Diagram: hierarchy of point types, from weakest to sharpest — critical (for dc functions), Clarke stationary, limiting stationary, directional stationary (for dd functions), points satisfying the SONC, local minima, points satisfying the SOSC.]
Comments

• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.
• It has the property that every directional stationary solution is a local minimizer.
• It has finitely many stationary values.
• A directional stationary solution can be computed by an iterative convex-programming based algorithm.
• It leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

$$\varphi_1\Big(\max_{1\le i\le k_{11}}\psi^1_{1i}(x)-\max_{1\le i\le k_{12}}\psi^1_{2i}(x)\Big)-\varphi_2\Big(\max_{1\le i\le k_{21}}\psi^2_{1i}(x)-\max_{1\le i\le k_{22}}\psi^2_{2i}(x)\Big),$$

where all the functions $\varphi$ and $\psi$ are convex.
Logical Sparsity Learning

$$\begin{array}{ll}
\underset{\beta\in\mathbb{R}^d}{\text{minimize}} & f(\beta)\\[4pt]
\text{subject to} & \displaystyle\sum_{j=1}^d a_{ij}\,|\beta_j|_0\le b_i,\quad i=1,\cdots,m,
\end{array}
\qquad\text{where }\ |t|_0\triangleq\begin{cases}1 & \text{if } t\neq 0\\ 0 & \text{otherwise}\end{cases}$$

• strong/weak hierarchy, e.g., $|\beta_j|_0\le|\beta_k|_0$
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection
Comments

• The ℓ0 function is discrete in nature; an ASC can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind

$$\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\ f(\beta)+\gamma\sum_{i=1}^m\Big[\,b_i-\sum_{j=1}^d a_{ij}\,|\beta_j|_0\,\Big]_+$$

for a given γ > 0.
• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When $|\bullet|_0$ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
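As a small illustration of that last bullet (not from the slides; the function names are ours), the capped-ℓ1 function min(|t|/ε, 1) is one standard folded-concave surrogate of |t|_0, and it can be written explicitly as a difference of two convex functions:

```python
def l0(t):
    """The discrete |t|_0 function: 1 if t != 0, else 0."""
    return 1 if t != 0 else 0

def capped_l1(t, eps):
    """Folded-concave surrogate of |t|_0; agrees with |t|_0 once |t| >= eps."""
    return min(abs(t) / eps, 1.0)

def capped_l1_dc(t, eps):
    """The same surrogate as a difference of convex functions:
    |t|/eps (convex) minus max(|t|/eps - 1, 0) (convex)."""
    return abs(t) / eps - max(abs(t) / eps - 1.0, 0.0)

# The two forms coincide, and the surrogate is exact outside (-eps, eps):
for t in (-2.0, -0.25, 0.0, 0.25, 1.5):
    assert capped_l1(t, 0.5) == capped_l1_dc(t, 0.5)
assert capped_l1(1.5, 0.5) == l0(1.5) == 1
```

Replacing each $|\beta_j|_0$ by such a surrogate turns the discrete ASC into the nondifferentiable difference-of-convex constraints mentioned above.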
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems

$$\underset{\theta=(W^\ell,\,w^\ell)_{\ell=1}^L}{\text{minimize}}\ \frac{1}{S}\sum_{s=1}^S\big[\,y_s-f_\theta(x_s)\,\big]^2$$

where $f_\theta$ is the L-fold composite function with the nonsmooth max (ReLU) activation:

$$f_\theta(x)=\underbrace{\big(W^L\bullet+w^L\big)}_{\text{Output Layer}}\circ\cdots\circ\underbrace{\max\big(0,\,W^2\bullet+w^2\big)}_{\text{Second Layer}}\circ\underbrace{\max\big(0,\,W^1x+w^1\big)}_{\text{Input Layer}}$$

i.e., output = linear ∘ ··· ∘ max ∘ linear ∘ max ∘ linear (input).
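A minimal sketch of this L-fold composite in pure Python (the toy weights below are ours, purely for illustration): ReLU on every layer except the linear output layer, and the least-squares empirical risk from the slide.

```python
def relu_layer(W, w, x):
    """Componentwise max(0, W x + w)."""
    return [max(0.0, sum(Wij * xj for Wij, xj in zip(Wi, x)) + wi)
            for Wi, wi in zip(W, w)]

def affine_layer(W, w, x):
    """W x + w: the output layer carries no ReLU."""
    return [sum(Wij * xj for Wij, xj in zip(Wi, x)) + wi
            for Wi, wi in zip(W, w)]

def f_theta(layers, x):
    """layers = [(W^1, w^1), ..., (W^L, w^L)]; ReLU on all but the last."""
    for W, w in layers[:-1]:
        x = relu_layer(W, w, x)
    W_L, w_L = layers[-1]
    return affine_layer(W_L, w_L, x)

def empirical_risk(layers, data):
    """(1/S) * sum_s || y_s - f_theta(x_s) ||^2, the slide's objective."""
    return sum(sum((yi - fi) ** 2 for yi, fi in zip(y, f_theta(layers, x)))
               for x, y in data) / len(data)

# Toy scalar network: f(x) = 2 * max(0, x) + 1.
layers = [([[1.0]], [0.0]), ([[2.0]], [1.0])]
print(f_theta(layers, [3.0]), f_theta(layers, [-3.0]))  # [7.0] [1.0]
```

The nonsmoothness of `max(0, ·)` composed L times is exactly what makes this objective multi-composite, nonconvex, and nondifferentiable.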
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:

$$\underset{z\in Z,\,u}{\text{minimize}}\ \zeta_\rho(z,u)\triangleq\frac{1}{S}\sum_{s=1}^S\varphi_s\big(u^{s,L}\big)+\rho\sum_{s=1}^S\sum_{\ell=1}^L\sum_{j=1}^N\underbrace{\Big|\,u^{s,\ell}_j-\psi_{\ell j}\big(z^\ell,u^{s,\ell-1}\big)\,\Big|}_{\text{penalizing the components}}$$

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• It leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems
II Smart Planning
Power Systems Planning

ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$$\underset{x\in X}{\text{minimize}}\ \varphi(x)+\mathbb{E}_{\omega_1,\omega_2}\big[\,\psi(x,\omega_1,\omega_2)\,\big]$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2)\triangleq\ \underset{y}{\text{minimum}}\ y^\top c\big([\,\omega_2-x-\omega_1\,]_+\big)\quad\text{subject to } A(\omega)x+Dy\ge\xi(\omega).$$

The unit cost of the fast-ramping generators depends on
▸ the observed renewable production & demand of electricity
▸ the shortfall due to the first-stage thermal power decisions
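A heavily simplified scalar sketch (our own toy numbers, not from the talk): the backup cost is modeled as a unit price c times the shortfall [ω2 − x − ω1]_+, and the expectation is replaced by a sample average over scenarios, as in SAA:

```python
def shortfall_cost(x, w1, w2, c=5.0):
    """Unit backup price c times the shortfall [w2 - x - w1]_+."""
    return c * max(0.0, w2 - x - w1)

def saa_objective(x, scenarios, phi=lambda x: 1.0 * x):
    """phi(x) + (1/S) * sum_s psi(x, w1_s, w2_s): a sample-average estimate."""
    S = len(scenarios)
    return phi(x) + sum(shortfall_cost(x, w1, w2) for w1, w2 in scenarios) / S

# Three (w1, w2) scenarios; only the first leaves a shortfall at x = 2.
scenarios = [(0.0, 3.0), (1.0, 3.0), (0.0, 1.0)]
print(saa_objective(2.0, scenarios))  # 2.0 + (5.0 + 0 + 0) / 3
```

In the actual model the inner minimum over y is a constrained program whose value function is nonconvex in x, which is precisely the difficulty studied next.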
Comments

• Leads to the study of the class of two-stage, linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x}{\text{minimize}}\ \zeta(x)\triangleq\varphi(x)+\mathbb{E}_\xi\big[\,\psi(x,\xi)\,\big]\quad\text{subject to } x\in X\subseteq\mathbb{R}^{n_1},$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

$$\psi(x,\xi)\triangleq\ \underset{y}{\text{minimum}}\ \big[f(\xi)+G(\xi)x\big]^\top y+\tfrac{1}{2}\,y^\top Q\,y\quad\text{subject to } y\in Y(x,\xi)\triangleq\big\{\,y\in\mathbb{R}^{n_2}\mid A(\xi)x+Dy\ge b(\xi)\,\big\}.$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a regularization-convexification sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

$$\underset{x\ge 0}{\text{minimize}}\ (c+\delta_c)^\top x+\rho\,\underbrace{\Big\|\big[\,(A+\delta_A)x-b-\delta_b\,\big]_+\Big\|_\infty}_{\text{constraint residual}}$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\underset{\delta\in\Delta}{\text{maximum}}\ \lambda_1\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}}+\lambda_2\underbrace{\Big\|\big[\,(A+\delta_A)x-b-\delta_b\,\big]_+\Big\|_\infty}_{\text{disrupt constraints}}-\underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
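To make the penalized defender objective concrete, here is a pure-Python sketch (the tiny instance is ours, purely for illustration) of (c + δc)⊤x + ρ‖[(A + δA)x − b − δb]_+‖∞:

```python
def pos_residual_inf_norm(r):
    """|| [r]_+ ||_inf: the largest positive entry of r (0 if none)."""
    return max(0.0, max(r))

def defender_objective(c, dc, A, dA, b, db, x, rho):
    """(c + dc)^T x + rho * || [(A + dA) x - b - db]_+ ||_inf."""
    cost = sum((ci + dci) * xi for ci, dci, xi in zip(c, dc, x))
    residual = [sum((Aij + dAij) * xj for Aij, dAij, xj in zip(Ai, dAi, x))
                - bi - dbi
                for Ai, dAi, bi, dbi in zip(A, dA, b, db)]
    return cost + rho * pos_residual_inf_norm(residual)

# Toy instance: the attack inflates one cost entry and shifts one right-hand side.
c, dc = [1.0, 1.0], [0.5, 0.0]
A, dA = [[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]
b, db = [1.0, 1.0], [0.0, -0.5]
print(defender_objective(c, dc, A, dA, b, db, x=[2.0, 1.0], rho=10.0))  # 14.0
```

The max-of-positive-parts residual is convex but nonsmooth in x, so even this simplified game already sits squarely in the nondifferentiable setting.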
Robustification of nonconvex problems

A general nonlinear program:

$$\underset{x\in S}{\text{minimize}}\ \theta(x)+f(x)\quad\text{subject to } H(x)=0,$$

where
• S is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in $\mathbb{R}^n$ containing S;
• H: O → $\mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\underset{x\in S}{\text{minimize}}\ \theta(x)+\underset{\delta\in\Delta}{\text{maximum}}\ f(x,\delta)\quad\text{subject to}\ \underbrace{H(x,\delta)=0\ \ \forall\,\delta\in\Delta}_{\substack{\text{easily infeasible in the}\\ \text{current way of modeling}}}$$
New formulation

via a pessimistic value-function minimization: given γ > 0,

$$\underset{x\in S}{\text{minimize}}\ \theta(x)+v_\gamma(x)$$

where $v_\gamma(x)$ models constraint violation with a cost of model attacks:

$$v_\gamma(x)\triangleq\ \underset{\delta\in\Delta}{\text{maximum}}\ \varphi_\gamma(x,\delta)\triangleq f(x,\delta)+\gamma\,\underbrace{\|H(x,\delta)\|_\infty}_{\text{constraint residual}}-\underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:
• a combined perturbation δ in a joint uncertainty set ∆ ⊆ $\mathbb{R}^L$
• robustified constraint satisfaction via penalization
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.
• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

$$\underset{x\in S}{\text{minimize}}\ \varphi(x)\triangleq\theta(x)+v(x)\quad\text{with}\quad v(x)\triangleq\max_{1\le j\le J}\ \underset{\delta\in\Delta\cap\mathbb{R}^L_+}{\text{maximum}}\ \big[\,g_j(x,\delta)-h_j(x)^\top\delta\,\big]$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}^\beta_\alpha(Z)\triangleq\frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z)>z\ge\text{VaR}_\alpha(Z)}z\,dF_Z(z)=\frac{1}{\beta-\alpha}\Big[(1-\alpha)\,\text{CVaR}_\alpha(Z)-(1-\beta)\,\text{CVaR}_\beta(Z)\Big]$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$\text{bPOE}(Z,\tau)\triangleq 1-\min\big\{\alpha\in[0,1]:\text{CVaR}_\alpha(Z)\ge\tau\big\},\qquad
\text{bPOD}(Z,u)\triangleq\min\big\{\beta\in[0,1]:\text{In-CVaR}^\beta_0(Z)\ge u\big\}$$
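As a hedged numerical sketch (empirical estimators on a toy sample of ours, not from the talk): CVaR at level α is the mean of the worst (1 − α) tail, and In-CVaR follows from the identity above.

```python
def cvar(samples, alpha):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of samples.
    Assumes (1 - alpha) * len(samples) is (numerically) an integer."""
    zs = sorted(samples, reverse=True)
    k = round((1 - alpha) * len(zs))
    return sum(zs[:k]) / k

def in_cvar(samples, alpha, beta):
    """Interval CVaR over levels alpha < beta, via the identity
    In-CVaR^beta_alpha = [(1-alpha) CVaR_alpha - (1-beta) CVaR_beta] / (beta - alpha)."""
    return ((1 - alpha) * cvar(samples, alpha)
            - (1 - beta) * cvar(samples, beta)) / (beta - alpha)

Z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(cvar(Z, 0.8))          # mean of the worst 20%: (10 + 9) / 2 = 9.5
print(in_cvar(Z, 0.5, 0.8))  # ≈ 7.0: the mean of the band {6, 7, 8}
```

The second print illustrates the interval interpretation: In-CVaR averages exactly the samples lying between the two VaR levels, which is how it trims both tails at once.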
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

$$\mathbb{P}(u\ge Z>\tau)\ \le\ \text{bPOE}(Z,\tau)+\text{bPOD}(Z,u)-1\ \triangleq\ \text{bPOI}^u_\tau(Z)$$

Two risk minimization problems of a composite random functional:

$$\underset{\theta\in\Theta}{\text{minimize}}\ \underbrace{\text{In-CVaR}^\beta_\alpha\big(Z(\theta)\big)}_{\substack{\text{a difference of two convex}\\ \text{SP value functions}}}
\qquad\text{and}\qquad
\underset{\theta\in\Theta}{\text{minimize}}\ \underbrace{\text{bPOI}^u_\tau\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with a}\\ \text{difference-of-convex objective}}}$$

where

$$Z(\theta)\triangleq c\Big(\underbrace{g(\theta,Z)-h(\theta,Z)}_{g(\bullet,z)\,-\,h(\bullet,z)\ \text{is a difference of convex functions}}\Big)$$

with c: ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\underset{\eta_1\in\Upsilon_1}{\text{minimum}}\ \mathbb{E}_\omega\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted } v_1(\theta)}\ -\ \underbrace{\underset{\eta_2\in\Upsilon_2}{\text{minimum}}\ \mathbb{E}_\omega\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted } v_2(\theta)}$$

Both $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

Relationship between the stationary points:

[Diagram: the implication chain loc-min ⇒ directional stationary (for dd functions) ⇒ limiting stationary ⇒ Clarke stationary ⇒ critical (for dc functions), alongside the second-order conditions SOSC ⇒ loc-min ⇒ SONC.]
▸ Stationarity:
  – smooth case: ∇V(x) = 0
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the functions is continuously differentiable
▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder it is to compute.
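A toy numeric check of that sharpness gap (our own example, not from the slides): for V(x) = −|x|, the point x = 0 is Clarke stationary, since the Clarke subdifferential there is conv{−1, +1} = [−1, 1], which contains 0; yet it is not directionally stationary, because V′(0; d) = −|d| < 0 for every d ≠ 0.

```python
def V(x):
    """A dc function: V(x) = 0 - |x|."""
    return -abs(x)

def dir_deriv(f, x, d, t=1e-8):
    """One-sided directional derivative f'(x; d) by a forward difference."""
    return (f(x + t * d) - f(x)) / t

# Every direction is a strict descent direction at x = 0, so 0 is NOT
# directionally stationary even though 0 lies in the Clarke subdifferential:
for d in (1.0, -1.0, 0.5):
    assert dir_deriv(V, 0.0, d) < 0.0   # V'(0; d) = -|d| < 0
print(dir_deriv(V, 0.0, 1.0))  # -1.0
```

This is why computing a directionally stationary point is a strictly stronger guarantee than computing a Clarke stationary or critical point.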
Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions
▸ A fusion of first-order and second-order algorithms
  – Dealing with non-convexity: majorization-minimization algorithm
  – Dealing with non-smoothness: nonsmooth Newton algorithm
  – Dealing with pathological constraints: (exact) penalty method
▸ An integration of sampling and deterministic methods
  – Dealing with population optimization in statistics
  – Dealing with stochastic programming in operations research
  – Dealing with risk minimization in financial management and tail control
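A minimal majorization-minimization (DCA-style) sketch of the surrogation idea, on a one-dimensional example of ours: for the dc function V(x) = g(x) − h(x) with g(x) = x⁴ and h(x) = x² (both convex), each step linearizes h at the current iterate and minimizes the resulting convex upper surrogate x⁴ − 2·x_k·x exactly, i.e., solves 4x³ = 2x_k.

```python
import math

def dca_step(xk):
    """One MM/DCA step for V(x) = x^4 - x^2: minimize the convex surrogate
    x^4 - 2 * xk * x, whose stationarity condition is 4 x^3 = 2 xk."""
    s = xk / 2.0
    return math.copysign(abs(s) ** (1.0 / 3.0), s)

x = 1.0
for _ in range(60):
    x = dca_step(x)

# Iterates converge to a critical point of V: 4 x^3 = 2 x, i.e. x = 1/sqrt(2).
print(x, 1.0 / math.sqrt(2.0))
assert abs(x - 1.0 / math.sqrt(2.0)) < 1e-9
```

Each surrogate majorizes V at the current iterate (since the linearization minorizes the convex h), so the V-values decrease monotonically — the basic mechanism behind the surrogation algorithms of Chapter 7.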
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more
[Diagram (monograph Section 4.4, p. 231): Venn-style containments among the nonsmooth function classes — piecewise affine (PA) ⊂ piecewise linear-quadratic (PLQ) ⊂ piecewise quadratic (PQ); PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable; dc, dc composite, diff-max-convex, and icc functions; strongly/weakly convex and locally dc variants (the latter on open convex domains, under conditions (4.65) and (4.66)); C¹, LC¹, and lower-C² functions; all within locally Lipschitz + directionally differentiable.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher on Friday, September 11, 2020
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683
Bibliography 691
Index 721
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Comments
bull This is a convex quadratic composed with a piecewise affine function thus ispiecewise linear-quadratic
bull It has the property that every directional stationary solution is a localminimizer
bull It has finitely many stationary values
bull A directional stationary solution can be computed by an iterativeconvex-programming based algorithm
bull Leads to several classes of nonconvex nonsmooth functions most general is adifference-of-convex function composed with a difference-of-convex function anexample of which is
ϕ1
(max
1leilek11ψ1
1i(x)minus max1leilek12
ψ12i(x)
)minus ϕ2
(max
1leilek21ψ2
1i(x)minus max1leilek2
ψ22i(x)
)where all functions are convex
Logical Sparsity Learning
minimizeβ isinRd
f(β)
subject todsumj=1
aij |βj |0 le bi i = 1 middot middot middot m
where | t |0
1 if t 6= 0
0 otherwise
bull strongweak hierarchy eg |βj |0 le |βk |0
bull cost-based variable selection coefficients ge 0 but not all equal
bull hierarchicalgroup variable selection
Comments
bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative
bull Soft formulation of ASCs is possible leading to an objective of the kind
minimizeβ isinRd
f(β) + γ
msumi=1
bi minus dsumj=1
aij |βj |0
+
for a given γ gt 0
bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective
bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints
Comments
bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative
bull Soft formulation of ASCs is possible leading to an objective of the kind
minimizeβ isinRd
f(β) + γ
msumi=1
bi minus dsumj=1
aij |βj |0
+
for a given γ gt 0
bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective
bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems
minimizeθ=
(W `w`)
L
`=1
1
S
Ssums=1
[ys minus fθ(xs)
]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation
fθ(x) =(WL bull+wL
)︸ ︷︷ ︸Output Layer
middot middot middot max(
0W 2 bull+w2)
︸ ︷︷ ︸Second Layer
max(
0W 1x+ w1)
︸ ︷︷ ︸Input Layer
linearoutput middot middot middot max linear max linear input
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρ
Lsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Examples of Non-Problems
II. Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

  minimize_{x ∈ X}  φ(x) + E_{ω₁,ω₂}[ ψ(x, ω₁, ω₂) ]

where the recourse function is

  ψ(x, ω₁, ω₂) ≜ minimum_y  yᵀ c( [ω₂ − x − ω₁]₊ )
        subject to  A(ω) x + D y ≥ ξ(ω)

The unit cost of the fast-ramping generators depends on:
▸ the observed renewable production & demand of electricity
▸ the shortfall due to the first-stage thermal power decisions
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

  minimize_x  ζ(x) ≜ φ(x) + E_ξ[ ψ(x, ξ) ]   subject to  x ∈ X ⊆ ℝ^{n₁},

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

  ψ(x, ξ) ≜ minimum_y  [ f(ξ) + G(ξ) x ]ᵀ y + ½ yᵀ Q y
        subject to  y ∈ Y(x, ξ) ≜ { y ∈ ℝ^{n₂} | A(ξ) x + D y ≥ b(ξ) }

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a regularization-convexification-sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
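To make the two-stage structure tangible, here is a minimal sketch of a sample-average approximation of E_ξ[ψ(x, ξ)], under the simplifying assumption Q = 0 (a positive semidefinite choice), so each scenario subproblem is an LP solvable by `scipy.optimize.linprog`. All the data (`f`, `G`, `A`, `b`, the uniform ξ) are hypothetical toy choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)

# Toy instance with Q = 0, so the recourse value is an LP:
#   psi(x, xi) = min_y [f(xi) + G(xi) x]^T y  s.t.  A(xi) x + D y >= b(xi), y >= 0.
n1, n2, m = 2, 2, 3
D = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def psi(x, xi):
    f = np.array([1.0, 2.0]) + xi            # f(xi): xi-shifted recourse costs
    G = np.eye(n2, n1)                       # G(xi) x enters the linear cost
    A = np.ones((m, n1))
    b = np.array([1.0, 1.0, 2.0]) + xi       # b(xi): xi-shifted requirements
    c = f + G @ x
    # linprog uses A_ub y <= b_ub; rewrite D y >= b - A x accordingly.
    res = linprog(c, A_ub=-D, b_ub=-(b - A @ x), bounds=[(0, None)] * n2)
    return res.fun

def sample_average(x, num=200):
    """SAA estimate of E_xi[psi(x, xi)] with xi ~ Uniform(0, 1)."""
    xis = rng.uniform(0.0, 1.0, size=num)
    return np.mean([psi(x, xi) for xi in xis])

x = np.array([0.2, 0.3])
print(sample_average(x))   # an estimate of the expected recourse cost at x
```

With a genuinely quadratic Q, each subproblem becomes a convex QP and ψ(•, ξ) picks up the implicit convex-concave structure the slide refers to; the SAA outer loop is unchanged.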
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize cᵀx subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving the problem

  minimize_{x ≥ 0}  (c + δ_c)ᵀ x + ρ ‖ [ (A + δ_A) x − b − δ_b ]₊ ‖_∞    (constraint residual)

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving the problem

  maximum_{δ ∈ ∆}  λ₁ (c + δ_c)ᵀ x  (disrupt objective)  +  λ₂ ‖ [ (A + δ_A) x − b − δ_b ]₊ ‖_∞  (disrupt constraints)  −  C(δ)  (cost)

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
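Both objectives are cheap to evaluate once δ is fixed. A minimal numpy sketch, on a hypothetical 2×2 instance with a finite attack set and an illustrative attack cost C(δ) = 0.4|δ_b|, shows the penalized constraint residual at work:

```python
import numpy as np

# Hypothetical small instance; the attacker perturbs only b here (dc = dA = 0).
A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([4.0, 5.0])
c = np.array([1.0, 1.0])
rho, lam1, lam2 = 10.0, 1.0, 1.0

def defender_obj(x, dc, dA, db):
    residual = np.maximum((A + dA) @ x - b - db, 0.0)   # [.]_+ componentwise
    return (c + dc) @ x + rho * np.linalg.norm(residual, np.inf)

def attacker_obj(x, dc, dA, db, cost):
    residual = np.maximum((A + dA) @ x - b - db, 0.0)
    return lam1 * (c + dc) @ x + lam2 * np.linalg.norm(residual, np.inf) - cost

x = np.array([1.0, 1.0])
# No attack: Ax = [3, 4] <= b, so the residual term vanishes.
assert defender_obj(x, 0.0, 0.0, 0.0) == 2.0
# A crude attack that tightens b by 2 creates a unit constraint violation.
assert defender_obj(x, 0.0, 0.0, -2.0) == 2.0 + rho * 1.0

# Attacker's best response over a finite menu of b-perturbations.
deltas = [0.0, -1.0, -2.0]
best = max(deltas, key=lambda db: attacker_obj(x, 0.0, 0.0, db, cost=0.4 * abs(db)))
assert best == -2.0
```

In the actual game both players optimize rather than enumerate, but the evaluation above is exactly the inner structure the formulations share.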
Robustification of Nonconvex Problems

A general nonlinear program:

  minimize_{x ∈ S}  θ(x) + f(x)   subject to  H(x) = 0,

where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H: O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

  minimize_{x ∈ S}  θ(x) + maximum_{δ ∈ ∆} f(x, δ)
  subject to  H(x, δ) = 0  ∀ δ ∈ ∆    (easily infeasible in the current way of modeling)
New formulation

via a pessimistic value-function minimization: given γ > 0,

  minimize_{x ∈ S}  θ(x) + v_γ(x)

where v_γ(x) models constraint violation together with the cost of model attacks:

  v_γ(x) ≜ maximum over δ ∈ ∆ of  φ_γ(x, δ) ≜ f(x, δ) + γ ‖H(x, δ)‖_∞  (constraint residual)  −  C(δ)  (attack cost)

with the novel features:
• a combined perturbation δ in a joint uncertainty set ∆ ⊆ ℝ^L;
• robustified constraint satisfaction via penalization.
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

  minimize_{x ∈ S}  φ(x) ≜ θ(x) + v(x)   with   v(x) ≜ max_{1 ≤ j ≤ J}  maximum_{δ ∈ ∆ ∩ ℝ^L₊} [ g_j(x, δ) − h_j(x)ᵀ δ ]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
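The MM idea underlying this and the other surrogation algorithms in the talk can be shown on the simplest nontrivial case: a one-dimensional difference-of-convex function f = g − h, where linearizing the concave part −h at the current iterate yields a convex upper surrogate whose minimizer becomes the next iterate (the classical DC algorithm). A minimal sketch, with an illustrative choice of g and h:

```python
import numpy as np

g = lambda x: x ** 2            # convex part
h = lambda x: abs(x)            # convex part being subtracted
f = lambda x: g(x) - h(x)       # f = g - h, nonconvex and nonsmooth at 0

def dca(x, iters=50):
    """Majorization-minimization on f: minimize the convex surrogate
    g(x') - h(x_k) - v (x' - x_k), whose unique minimizer is v / 2."""
    for _ in range(iters):
        v = np.sign(x)          # a subgradient of h at the current iterate
        x = v / 2.0
    return x

x_star = dca(0.3)
assert x_star == 0.5            # a directional stationary point of f
assert f(x_star) == -0.25
assert f(x_star) < f(0.3)       # surrogation never increases the objective
```

The descent property in the last assertion is generic: the surrogate touches f at the current iterate and majorizes it everywhere, so every MM step is monotone, and accumulation points are stationary in the sense the surrogate certifies.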
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE Minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

  In-CVaR_α^β(Z) ≜ (1/(β − α)) ∫_{VaR_β(Z) > z ≥ VaR_α(Z)} z dF_Z(z)
         = (1/(β − α)) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ]

Buffered probability of exceedance/deceedance at level τ > 0:

  bPOE(Z; τ) ≜ 1 − min{ α ∈ [0, 1] : CVaR_α(Z) ≥ τ }
  bPOD(Z; u) ≜ min{ β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }
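The CVaR identity above is easy to check empirically. A minimal sketch on integer samples, using the standard empirical CVaR (mean of the worst (1 − α) fraction; the rounding convention is an assumption for when α·n is integral):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the largest (1 - alpha) fraction of
    the samples; assumes alpha * len(z) is an integer."""
    z = np.sort(z)
    k = int(round(alpha * len(z)))
    return z[k:].mean()

def in_cvar(z, alpha, beta):
    """Interval CVaR via the identity
    In-CVaR = [(1 - alpha) CVaR_alpha - (1 - beta) CVaR_beta] / (beta - alpha)."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 11.0)          # samples 1, 2, ..., 10
assert cvar(z, 0.0) == 5.5        # the plain mean
assert cvar(z, 0.5) == 8.0        # mean of the upper half {6, ..., 10}
# In-CVaR over [VaR_0, VaR_0.5) averages the lower segment {1, ..., 5}:
assert in_cvar(z, 0.0, 0.5) == 3.0
```

So In-CVaR isolates the middle (here: lower) slab of the distribution, which is exactly what makes it a difference of two convex CVaR value functions when composed with an estimation model.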
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

  P(u ≥ Z > τ) ≤ bPOE(Z; τ) + bPOD(Z; u) − 1 ≜ bPOI_τ^u(Z)

Two risk minimization problems of a composite random functional:

  minimize_{θ ∈ Θ}  In-CVaR_α^β(Z(θ))   (a difference of two convex SP value functions)

and

  minimize_{θ ∈ Θ}  bPOI_τ^u(Z(θ))   (a stochastic program with a difference-of-convex objective)

where

  Z(θ) ≜ c( g(θ; Z) − h(θ; Z) )   (g(•, z) − h(•, z) is a difference of convex functions)

with c: ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ of

  min_{η₁ ∈ Υ₁} E_ω[ φ₁(θ, η₁, ω) ]  (denoted v₁(θ))  −  min_{η₂ ∈ Υ₂} E_ω[ φ₂(θ, η₂, ω) ]  (denoted v₂(θ))

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A Touch of Mathematics: Bridging Computations with Theory

First of All, What Can Be Computed?

[Diagram: relationships among the stationarity concepts: critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions), points satisfying the SONC and SOSC, and local minimizers.]

Relationship between the stationary points:

▸ Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of them is continuously differentiable.

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder to compute.
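The gap between the concepts can be seen numerically. A minimal sketch with a one-sided finite-difference directional derivative: for f₁(x) = |x| the origin is directionally stationary, while for f₂(x) = −|x| the origin lies in the Clarke subdifferential ∂f₂(0) = [−1, 1] yet fails the sharper directional test:

```python
def ddir(f, x, d, t=1e-7):
    """One-sided directional derivative f'(x; d) via a forward difference."""
    return (f(x + t * d) - f(x)) / t

# x = 0 is directionally stationary for f1(x) = |x|: f1'(0; d) = |d| >= 0.
f1 = abs
assert ddir(f1, 0.0, 1.0) > 0 and ddir(f1, 0.0, -1.0) > 0

# For f2(x) = -|x|, 0 is in the Clarke subdifferential [-1, 1], but
# f2'(0; d) = -|d| < 0: every direction is a descent direction, so the
# sharper directional concept (correctly) rules the origin out.
f2 = lambda x: -abs(x)
assert ddir(f2, 0.0, 1.0) < 0 and ddir(f2, 0.0, -1.0) < 0
```

This is precisely why algorithms that only guarantee Clarke stationarity can stall at points a directional-stationarity method would escape.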
Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithm;
  – dealing with non-smoothness: nonsmooth Newton algorithm;
  – dealing with pathological constraints: (exact) penalty method.

▸ An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more
[Diagram from §4.4 of the monograph (p. 231): inclusion relationships among the classes of nonsmooth functions, including piecewise affine (PA) ⊂ piecewise linear-quadratic (PLQ) ⊂ piecewise smooth (PC¹, PC²); dc and dc-composite; diff-max-convex; implicit convex-concave (icc); weakly convex and locally dc (on open or closed convex domains, under conditions (4.65) and (4.66)); semismooth; B-differentiable; and locally Lipschitz + directionally differentiable.]
Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

Submitted to the publisher (Friday, September 11, 2020).
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683

Bibliography 691
Index 721
Comments

• This is a convex quadratic composed with a piecewise affine function, and thus is piecewise linear-quadratic.

• It has the property that every directional stationary solution is a local minimizer.

• It has finitely many stationary values.

• A directional stationary solution can be computed by an iterative convex-programming based algorithm.

• Leads to several classes of nonconvex nonsmooth functions; the most general is a difference-of-convex function composed with a difference-of-convex function, an example of which is

  φ₁( max_{1≤i≤k₁₁} ψ¹₁ᵢ(x) − max_{1≤i≤k₁₂} ψ¹₂ᵢ(x) ) − φ₂( max_{1≤i≤k₂₁} ψ²₁ᵢ(x) − max_{1≤i≤k₂₂} ψ²₂ᵢ(x) )

where all the functions are convex.
Logical Sparsity Learning

  minimize_{β ∈ ℝ^d}  f(β)
  subject to  Σ_{j=1}^{d} a_ij |β_j|₀ ≤ b_i,  i = 1, …, m,

where |t|₀ ≜ 1 if t ≠ 0, and 0 otherwise.

• strong/weak hierarchy, e.g., |β_j|₀ ≤ |β_k|₀
• cost-based variable selection: coefficients ≥ 0 but not all equal
• hierarchical/group variable selection
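For very small d, the constrained ℓ₀ problem can be solved exactly by enumerating supports, which makes the formulation concrete before turning to the complementarity or integer reformulations discussed next. A minimal sketch on a hypothetical least-squares instance with a single cardinality-type ASC (a = 1, b = 2):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Toy instance: f(beta) = ||X beta - y||^2 under sum_j a_j |beta_j|_0 <= b,
# solved by brute force over supports -- viable only for tiny d.
d, n = 4, 20
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, 0.0, -2.0, 0.0])
y = X @ beta_true                # noiseless data with support {0, 2}
a, b = np.ones(d), 2.0           # ASC: at most 2 nonzero coefficients

best_val, best_support = np.inf, ()
for r in range(d + 1):
    for support in itertools.combinations(range(d), r):
        if a[list(support)].sum() > b:
            continue             # this support violates the ASC
        if support:
            coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
            val = np.linalg.norm(X[:, support] @ coef - y) ** 2
        else:
            val = np.linalg.norm(y) ** 2
        if val < best_val:
            best_val, best_support = val, support

assert best_support == (0, 2)    # the true support is recovered
assert best_val < 1e-18
```

The enumeration grows combinatorially, which is exactly why the big-M, complementarity, and folded-concave reformulations below matter in practice.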
Comments

• The ℓ₀ function is discrete in nature; an ASC can be formulated either as complementarity constraints or by integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.

• A soft formulation of the ASCs is possible, leading to an objective of the kind

  minimize_{β ∈ ℝ^d}  f(β) + γ Σ_{i=1}^{m} [ b_i − Σ_{j=1}^{d} a_ij |β_j|₀ ]₊

for a given γ > 0.

• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.

• When |•|₀ is approximated by a folded concave function, one obtains some nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

  minimize_{θ = (W^ℓ, w^ℓ)_{ℓ=1}^{L}}  (1/S) Σ_{s=1}^{S} [ y_s − f_θ(x_s) ]²

where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:

  f_θ(x) = ( W^L • + w^L ) ∘ ⋯ ∘ max( 0, W² • + w² ) ∘ max( 0, W¹ x + w¹ )
       (Output Layer)          (Second Layer)      (Input Layer)

  output = linear ∘ ⋯ ∘ (max ∘ linear) ∘ (max ∘ linear)(input)
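The composite structure of f_θ is a few lines of numpy. A minimal sketch with a hypothetical random 3-layer network (names `f_theta`, `layers` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def f_theta(x, layers):
    """L-fold composite f_theta: ReLU max(0, W x + w) on the hidden layers,
    a plain affine map W x + w on the output layer."""
    *hidden, (WL, wL) = layers
    for W, w in hidden:
        x = np.maximum(0.0, W @ x + w)
    return WL @ x + wL

# A random 3-layer network mapping R^2 -> R.
layers = [(rng.normal(size=(4, 2)), rng.normal(size=4)),
          (rng.normal(size=(3, 4)), rng.normal(size=3)),
          (rng.normal(size=(1, 3)), rng.normal(size=1))]

X = rng.normal(size=(5, 2))
y = rng.normal(size=5)

# The training objective (1/S) sum_s [y_s - f_theta(x_s)]^2: nonconvex and
# nonsmooth in theta through the composition of max and affine maps.
loss = np.mean([(ys - f_theta(xs, layers).item()) ** 2 for xs, ys in zip(X, y)])
assert loss >= 0.0
assert f_theta(np.zeros(2), layers).shape == (1,)
```

Each `np.maximum` is exactly the nondifferentiability the "non"-framework confronts: the loss is a multi-composite piecewise-smooth function of θ, not a smooth one.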
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρ
Lsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Examples of Non-Problems
II Smart Planning
Power Systems Planning
ω1 uncertain renewable energy generationω2 uncertain demand of electricity
x production from traditional thermal plants
y production from backup fast-ramping generators
minimizexisinX
ϕ(x) + Eω1ω2
[ψ(x ω1 ω2)
]where the recourse function is
ψ(x ω1 ω2) minimumy
ygtc([ω2 minus xminus ω1]+
)subject to A(ω)x+Dy ge ξ(ω)
Unit cost of fast-ramping generators depends on
I observed renewable production amp demand of electricity
I shortfall due to the first-stage thermal power decisions
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
(Simplified) Defender-Attacker Game with Costs
Underlying problem is minimize cgtx subject to Ax = b and x ge 0
Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem
minimizexge0
(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb
]+
∥∥∥infin︸ ︷︷ ︸
constraint residual
Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem
maximumδ isin∆
λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective
+λ2
∥∥∥ [ (A+ δA)xminus bminus δb]+
∥∥∥infin︸ ︷︷ ︸
disrupt constraints
minusC(δ)︸︷︷︸cost
bull Many variations are possible
bull Offer alternative approaches to the convexity-based robust formulations
bull Adversarial Programming
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation
via a pessimistic value-function minimization given γ gt 0
minimizexisinS
θ(x) + vγ(x)
where vγ(x) models constraint violation with cost of model attacks
vγ(x) maximum over δ isin ∆ of
φγ(x δ) f(x δ) + γ
H(x δ) infin︸ ︷︷ ︸constraint residual
minus C(δ)︸ ︷︷ ︸attack cost
with the novel features
bull combined perturbation δ in joint uncertainty set ∆ sube RL
bull robustify constraint satisfaction via penalization
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable $Z$ at levels $0 \le \alpha < \beta \le 1$, treating two-sided estimation errors:
$$\text{In-CVaR}_\alpha^\beta(Z) \,\triangleq\, \frac{1}{\beta - \alpha} \int_{\text{VaR}_\beta(Z) > z \ge \text{VaR}_\alpha(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta - \alpha} \Big[\, (1-\alpha)\,\text{CVaR}_\alpha(Z) - (1-\beta)\,\text{CVaR}_\beta(Z) \,\Big].$$
Buffered probability of exceedance/deceedance at level $\tau > 0$:
$$\text{bPOE}(Z, \tau) \,\triangleq\, 1 - \min\big\{\, \alpha \in [0,1] : \text{CVaR}_\alpha(Z) \ge \tau \,\big\}, \qquad \text{bPOD}(Z, u) \,\triangleq\, \min\big\{\, \beta \in [0,1] : \text{In-CVaR}_0^\beta(Z) \ge u \,\big\}.$$
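On a finite sample, $\text{CVaR}_\alpha$, and hence In-CVaR through the identity above, can be computed from order statistics. A minimal plug-in sketch; the tail-averaging convention below is one common empirical choice, not necessarily the talk's:

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the upper (1 - alpha) tail of the sample."""
    z = np.sort(np.asarray(z, dtype=float))
    k = int(np.ceil(alpha * z.size))   # index of the empirical VaR_alpha
    return z[k:].mean()

def in_cvar(z, alpha, beta):
    """In-CVaR via the identity ((1-a) CVaR_a - (1-b) CVaR_b) / (b - a)."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 101.0)   # sample 1, 2, ..., 100
```

On this sample, `cvar(z, 0.9)` averages the top ten values (95.5), while `in_cvar(z, 0.8, 0.9)` recovers the average of the 81st-90th order statistics (85.5): the middle slice of the tail, which is exactly what the interval CVaR isolates.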
Buffered probability of the interval $(\tau, u]$
We have the following bound for an interval probability:
$$P(u \ge Z > \tau) \;\le\; \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \,\triangleq\, \text{bPOI}^u_\tau(Z).$$
Two risk minimization problems of a composite random functional:
$$\operatorname*{minimize}_{\theta \in \Theta}\ \underbrace{\text{In-CVaR}_\alpha^\beta(Z(\theta))}_{\substack{\text{a difference of two convex} \\ \text{SP value functions}}} \qquad \text{and} \qquad \operatorname*{minimize}_{\theta \in \Theta}\ \underbrace{\text{bPOI}^u_\tau(Z(\theta))}_{\substack{\text{a stochastic program with a} \\ \text{difference-of-convex objective}}}$$
where
$$Z(\theta) \,\triangleq\, c\big(\, \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet, z) - h(\bullet, z)\ \text{is difference-of-convex}} \,\big)$$
with $c : \mathbb{R} \to \mathbb{R}$ a univariate piecewise affine function, $Z$ an $m$-dimensional random vector, and, for each $z$, $g(\bullet, z)$ and $h(\bullet, z)$ both convex functions.
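bPOE also admits the convex representation $\text{bPOE}(Z,\tau) = \min_{a \ge 0} \mathbb{E}\,[\, a(Z - \tau) + 1 \,]_+$ (Mafusalov-Uryasev), which is convenient on samples. A grid-search sketch; the grid range for $a$ is an assumption:

```python
import numpy as np

def bpoe(z, tau, grid=None):
    """Empirical bPOE(Z, tau) = min over a >= 0 of E[ a*(Z - tau) + 1 ]_+ ,
    approximated by a grid search over the scalar multiplier a."""
    z = np.asarray(z, dtype=float)
    if grid is None:
        grid = np.linspace(0.0, 50.0, 5001)   # assumed sufficient range for a
    return min(np.maximum(a * (z - tau) + 1.0, 0.0).mean() for a in grid)

z = np.arange(1.0, 101.0)
```

Since every candidate $a \ge 0$ makes $[a(Z-\tau)+1]_+$ dominate the indicator of $\{Z > \tau\}$, bPOE upper-bounds the plain probability of exceedance, and it is nonincreasing in $\tau$: both properties are visible on the sample above.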
Comments
• The In-CVaR minimization leads to the following stochastic program:
$$\operatorname*{minimize}_{\theta \in \Theta}\ \underbrace{\min_{\eta_1 \in \Upsilon_1}\ \mathbb{E}_\omega \big[\, \varphi_1(\theta, \eta_1, \omega) \,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\min_{\eta_2 \in \Upsilon_2}\ \mathbb{E}_\omega \big[\, \varphi_2(\theta, \eta_2, \omega) \,\big]}_{\text{denoted } v_2(\theta)}.$$
Both $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
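The difference-of-convex structure above is exactly what the classic DC algorithm exploits: linearize the subtracted value function $v_2$ at the current iterate and minimize the resulting convex surrogate. A one-dimensional toy sketch with stand-ins $v_1(t) = t^2$ and $v_2(t) = |t - 1|$ (simple substitutes for the two expectation value functions, not the paper's estimators):

```python
def dca(theta, iters=20):
    """DC algorithm for min v1 - v2 with v1(t) = t**2, v2(t) = |t - 1|:
    replace v2 by its linearization at theta and minimize the convex surrogate."""
    for _ in range(iters):
        s = 1.0 if theta > 1.0 else -1.0   # a subgradient of |t - 1| at theta
        theta = s / 2.0                    # argmin_t  t**2 - s*t  (closed form)
    return theta

theta_star = dca(0.0)
```

Each iteration solves a convex problem and decreases $v_1 - v_2$; here the iterates settle at the critical point $t = -1/2$ from either starting point.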
A touch of mathematics: bridging computations with theory
First of All: What Can be Computed?
[Diagram: the hierarchy of stationarity concepts, from weakest to sharpest — critical (for dc fncs) ⇐ Clarke stat / limiting stat ⇐ directional stat (for dd fncs) ⇐ loc-min, with SONC and SOSC the second-order necessary/sufficient conditions surrounding a local minimum.]
Relationship between the stationary points
• Stationarity:
  – smooth case: $\nabla V(x) = 0$
  – nonsmooth case: $0 \in \partial V(x)$ [subdifferential; many calculus rules fail], e.g. $\partial V_1(x) + \partial V_2(x) \ne \partial (V_1 + V_2)(x)$ unless one of them is continuously differentiable
• Directional stationarity: $V'(x; d) \ge 0$ for all directions $d$
• The sharper the concept, the harder it is to compute.
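The gap between Clarke and directional stationarity is visible already for $f(x) = -|x|$ at $x = 0$: the Clarke subdifferential $[-1, 1]$ contains $0$, yet both one-sided directional derivatives equal $-1$. A finite-difference sketch of this standard example:

```python
def ddir(f, x, d, t=1e-8):
    """One-sided directional derivative f'(x; d) via a small forward step."""
    return (f(x + t * d) - f(x)) / t

f = lambda x: -abs(x)
# 0 lies in the Clarke subdifferential of f at 0 (which is [-1, 1]),
# but f'(0; d) = -|d| < 0 for every d != 0: descent exists in both
# directions, so x = 0 is Clarke stationary yet not directionally stationary.
```

This is why directional stationarity is the sharper target: it rules out spurious points like this one that the subdifferential-based concepts accept.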
Computational Tools
• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions
• A fusion of first-order and second-order algorithms
  – Dealing with non-convexity: majorization-minimization algorithm
  – Dealing with non-smoothness: nonsmooth Newton algorithm
  – Dealing with pathological constraints: (exact) penalty method
• An integration of sampling and deterministic methods
  – Dealing with population optimization in statistics
  – Dealing with stochastic programming in operations research
  – Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader:
• Piecewise smooth*, e.g. piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic
* and many more
[Figure from Section 4.4 of the monograph (p. 231): inclusion diagram of the classes of nonsmooth functions — PC², PC¹, semismooth, and B-differentiable functions in increasing generality; PA, PQ, and PLQ functions; dc, dc composite, diff-max-convex, implicit convex-concave (icc), strongly weakly convex, and locally dc functions — all contained in the locally Lipschitz + directionally differentiable functions, with the finer inclusions qualified by conditions such as (4.65)-(4.66), C¹ structure, and open or closed convex domains.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
  1.1 Smooth Calculus 1
    1.1.1 Differentiability of functions of several variables 7
    1.1.2 Mean-value theorems for vector-valued functions 12
    1.1.3 Fixed-point theorems and contraction 16
    1.1.4 Implicit-function theorems 18
  1.2 Matrices and Linear Equations 23
    1.2.1 Matrix factorizations, eigenvalues, and singular values 29
    1.2.2 Selected matrix classes 36
  1.3 Matrix Splitting Methods for Linear Equations 47
  1.4 The Conjugate Gradient Method 50
  1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
  2.1 Polyhedral Theory 59
    2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
    2.1.2 Properties of polyhedra 63
  2.2 Convex Sets and Functions 67
    2.2.1 Euclidean projection and proximality 71
  2.3 Convex Quadratic and Nonlinear Programs 79
    2.3.1 The KKT conditions and CQs 81
  2.4 The Minmax Theory of von Neumann 85
  2.5 Basic Descent Methods for Smooth Problems 86
    2.5.1 Unconstrained problems: Armijo line search 87
    2.5.2 The PGM for convex programs 91
    2.5.3 The proximal Newton method 95
  2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
  3.1 A Composite View of Statistical Estimation 111
    3.1.1 The optimization problems 113
    3.1.2 Learning classes 115
    3.1.3 Loss functions 122
    3.1.4 Regularizers for sparsity 123
  3.2 Low Rankness in Matrix Optimization 128
    3.2.1 Rank constraints: Hard and soft formulations 128
    3.2.2 Shape constrained index models 131
  3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
  4.1 B(ouligand)-Differentiable Functions 144
    4.1.1 Generalized directional derivatives 153
  4.2 Second-order Directional Derivatives 156
  4.3 Subdifferentials and Regularity 162
  4.4 Classes of Nonsmooth Functions 176
    4.4.1 Piecewise affine functions 177
    4.4.2 Piecewise linear-quadratic functions 190
    4.4.3 Weakly convex functions 205
    4.4.4 Difference-of-convex functions 214
    4.4.5 Semismooth functions 221
    4.4.6 Implicit convex-concave functions 229
    4.4.7 A summary 240
5 Value Functions 243
  5.1 Nonlinear Programming based Value Functions 244
    5.1.1 Interdiction versus enhancement 247
  5.2 Value Functions of Quadratic Programs 251
  5.3 Value Functions Associated with Convex Functions 258
  5.4 Robustification of Optimization Problems 261
    5.4.1 Attacks without set-up costs 264
    5.4.2 Attacks with set-up costs 267
  5.5 The Danskin Property 271
  5.6 Gap Functions 278
  5.7 Some Nonconvex Stochastic Value Functions 282
  5.8 Understanding Robust Optimization 289
    5.8.1 Uncertainty in data realizations 290
    5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
  6.1 First-order Theory 313
    6.1.1 The convex constrained case 314
    6.1.2 Composite ε-strong stationarity 320
    6.1.3 Subdifferential based stationarity 324
  6.2 Directional Saddle-Point Theory 332
  6.3 Difference-of-Convex Constraints 337
    6.3.1 Extension to piecewise dd-convex programs 345
  6.4 Second-order Theory 348
  6.5 Piecewise Linear-Quadratic Programs 350
    6.5.1 Review of nonconvex QPs 350
    6.5.2 Return to PLQ programs 352
    6.5.3 Finiteness of stationary values 362
    6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
  7.1 Basic Surrogation 372
    7.1.1 A broad family of upper surrogation functions 379
    7.1.2 Examples of surrogation families 381
    7.1.3 Surrogation algorithms for several problem classes 384
  7.2 Refinements and Extensions 392
    7.2.1 Moreau surrogation 392
    7.2.2 Combined BCD and ADMM for coupled constraints 400
    7.2.3 Surrogation with line searches 418
    7.2.4 Dinkelbach's algorithm with surrogation 424
    7.2.5 Randomized choice of subproblems 432
    7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
  8.1 An Introduction to Error Bounds 442
    8.1.1 The stationarity inclusions 444
  8.2 The Piecewise Polyhedral Inclusion 449
  8.3 Error Bounds for Subdifferential-Based Stationarity 454
  8.4 Sequential Convergence Under Sufficient Descent 462
    8.4.1 Using the local error bound (8.19) 462
    8.4.2 The Kurdyka-Łojasiewicz theory 467
    8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
  8.5 Error Bounds and Linear Regularity 476
    8.5.1 Piecewise systems 483
    8.5.2 Stationarity for convex composite dc optimization 488
    8.5.3 dd-convex systems 491
    8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
  9.1 Exact Penalization of Optimal Solutions 503
  9.2 Exact Penalization of Stationary Solutions 506
  9.3 Pulled-out Penalization of Composite Functions 511
  9.4 Two Applications of Pull-out Penalization 517
    9.4.1 B-stationarity of dc composite diff-max programs 518
    9.4.2 Local optimality of composite twice semidiff programs 521
  9.5 Exact Penalization of Affine Sparsity Constraints 527
  9.6 A Special Class of Cardinality Constrained Problems 531
  9.7 Penalized Surrogation for Composite Programs 533
    9.7.1 Solving the convex penalized subproblems 539
    9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
  10.1 Expectation Functions 548
    10.1.1 Sample average approximations 555
  10.2 Basic Stochastic Surrogation 560
    10.2.1 Expectation constraints 567
  10.3 Two-stage SPs with Quadratic Recourse 568
    10.3.1 The positive definite case 569
    10.3.2 The positive semidefinite case 571
  10.4 Probabilistic Error Bounds 588
  10.5 Consistency of Directional Stationary Solutions 591
    10.5.1 Convergence of stationary solutions 592
    10.5.2 Asymptotic distribution of the stationary values 605
    10.5.3 Convergence rate of the stationary points 609
  10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
  11.1 Definitions and Preliminaries 627
    11.1.1 Applications 631
    11.1.2 Related models 637
  11.2 Existence of a QNE 640
  11.3 Computational Algorithms 648
    11.3.1 A Gauss-Seidel best-response algorithm 649
  11.4 The Role of Potential 658
    11.4.1 Existence of a potential 665
  11.5 Double-Loop Algorithms 669
  11.6 The Nikaido-Isoda Optimization Formulations 678
    11.6.1 Surrogation applied to the NI function φ_c 681
    11.6.2 Difference-of-convexity and surrogation of φ̃_c 683
Bibliography 691
Index 721
Logical Sparsity Learning
$$\operatorname*{minimize}_{\beta \in \mathbb{R}^d}\ f(\beta) \quad \text{subject to} \quad \sum_{j=1}^d a_{ij}\, |\beta_j|_0 \le b_i, \quad i = 1, \dots, m,$$
where $|t|_0 \,\triangleq\, \begin{cases} 1 & \text{if } t \ne 0 \\ 0 & \text{otherwise} \end{cases}$
• strong/weak hierarchy, e.g. $|\beta_j|_0 \le |\beta_k|_0$
• cost-based variable selection: coefficients $\ge 0$ but not all equal
• hierarchical/group variable selection
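For tiny $d$, the $\ell_0$-constrained estimator can be computed exactly by enumerating support patterns and checking the affine sparsity constraints on each — exponential in $d$, but it makes the combinatorial nature of the feasible set concrete. A sketch with a least-squares loss (an assumed choice of $f$):

```python
import itertools
import numpy as np

def l0_constrained_ls(X, y, A, b):
    """Minimize ||y - X beta||^2 subject to A @ |beta|_0 <= b, by brute force
    over the 2^d support patterns (illustration only; exponential in d)."""
    n, d = X.shape
    best_val, best_beta = np.inf, np.zeros(d)
    for r in range(d + 1):
        for supp in itertools.combinations(range(d), r):
            z = np.zeros(d)
            z[list(supp)] = 1.0
            if np.any(A @ z > b + 1e-12):   # support pattern violates an ASC row
                continue
            beta = np.zeros(d)
            if supp:
                cols = list(supp)
                beta[cols] = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            val = float(np.sum((y - X @ beta) ** 2))
            if val < best_val:
                best_val, best_beta = val, beta
    return best_beta, best_val

# Example ASC: at most one nonzero coefficient (a_{1j} = 1 for all j, b_1 = 1).
X, y = np.eye(3), np.array([3.0, 2.0, 1.0])
beta, val = l0_constrained_ls(X, y, np.ones((1, 3)), np.array([1.0]))
```

With $X = I$ the best single-coordinate fit keeps the largest response, which is why the exact formulations below (complementarity or integer variables) matter: no continuous relaxation sees this selection directly.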
Comments
• The $\ell_0$ function is discrete in nature; an ASC can be formulated either via complementarity constraints or via integer variables under boundedness of the original variables (the big-M formulation) when $A$ is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind
$$\operatorname*{minimize}_{\beta \in \mathbb{R}^d}\ f(\beta) + \gamma \sum_{i=1}^m \Big[\, b_i - \sum_{j=1}^d a_{ij}\, |\beta_j|_0 \,\Big]_+$$
for a given $\gamma > 0$.
• When the matrix $A$ has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When $|\bullet|_0$ is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems:
$$\operatorname*{minimize}_{\theta = (W^\ell, w^\ell)_{\ell=1}^L}\ \frac{1}{S} \sum_{s=1}^S \big[\, y_s - f_\theta(x_s) \,\big]^2$$
where $f_\theta$ is the $L$-fold composite function with the nonsmooth max (ReLU) activation:
$$f_\theta(x) \;=\; \underbrace{\big( W^L \bullet + w^L \big)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\big( 0,\, W^2 \bullet + w^2 \big)}_{\text{second layer}} \circ \underbrace{\max\big( 0,\, W^1 x + w^1 \big)}_{\text{input layer}}$$
linear output $\circ \cdots \circ$ max $\circ$ linear $\circ$ max $\circ$ linear input
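The composite map and its nonsmooth least-squares objective take only a few lines of NumPy. This sketch fixes a scalar output for simplicity; the little architecture below, which realizes $f(x) = |x|$ exactly, is an illustrative assumption:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def f_theta(x, weights):
    """L-fold composite: max-ReLU on the hidden layers, linear output layer.
    weights = [(W1, w1), ..., (WL, wL)]; scalar output assumed."""
    *hidden, (WL, wL) = weights
    for W, w in hidden:
        x = relu(W @ x + w)
    return (WL @ x + wL).item()

def ls_loss(weights, xs, ys):
    """The nonconvex, nonsmooth objective (1/S) * sum_s [y_s - f_theta(x_s)]^2."""
    return np.mean([(y - f_theta(x, weights)) ** 2 for x, y in zip(xs, ys)])

# A two-layer network realizing f(x) = relu(x) + relu(-x) = |x|:
weights = [(np.array([[1.0], [-1.0]]), np.zeros(2)),
           (np.array([[1.0, 1.0]]), np.zeros(1))]
```

Even this tiny instance shows the "non" features: the objective is a composition of piecewise-linear maps inside a square, hence nonconvex and nondifferentiable in the weights.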
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the $\ell_1$ penalization:
$$\operatorname*{minimize}_{z \in Z,\, u}\ \zeta_\rho(z, u) \,\triangleq\, \frac{1}{S} \sum_{s=1}^S \varphi_s(u^{s,L}) + \rho \sum_{s=1}^S \sum_{\ell=1}^L \sum_{j=1}^N \underbrace{\Big|\, u_j^{s,\ell} - \psi_{\ell j}\big( z^\ell, u^{s,\ell-1} \big) \Big|}_{\text{penalizing the components}}$$
• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
Examples of Non-Problems
II Smart Planning
Power Systems Planning
$\omega_1$: uncertain renewable energy generation; $\omega_2$: uncertain demand of electricity
$x$: production from traditional thermal plants
$y$: production from backup fast-ramping generators
$$\operatorname*{minimize}_{x \in X}\ \varphi(x) + \mathbb{E}_{\omega_1, \omega_2} \big[\, \psi(x, \omega_1, \omega_2) \,\big]$$
where the recourse function is
$$\psi(x, \omega_1, \omega_2) \,\triangleq\, \operatorname*{minimum}_{y}\ y^\top c\big( [\omega_2 - x - \omega_1]_+ \big) \quad \text{subject to } A(\omega) x + D y \ge \xi(\omega).$$
The unit cost of the fast-ramping generators depends on:
• the observed renewable production & demand of electricity
• the shortfall due to the first-stage thermal power decisions
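A sample-average sketch of the first-stage objective, with the recourse problem collapsed to a closed form: the backup cost is charged on the shortfall $[\omega_2 - x - \omega_1]_+$ at a unit price that grows with the shortfall. Both this closed form and the distributions are assumptions for illustration; the talk's recourse is a full LP in $y$:

```python
import numpy as np

rng = np.random.default_rng(0)

def recourse(x, w1, w2, c0=5.0, c1=2.0):
    """Closed-form stand-in for psi(x, w1, w2): shortfall priced at c0 + c1*shortfall."""
    s = np.maximum(w2 - x - w1, 0.0)   # unmet demand after thermal + renewables
    return (c0 + c1 * s) * s

def saa_objective(x, n=20000, thermal_cost=1.0):
    """Sample-average approximation of  thermal_cost * x + E[recourse]."""
    w1 = rng.uniform(0.0, 1.0, n)   # renewable generation
    w2 = rng.uniform(1.0, 2.0, n)   # electricity demand
    return thermal_cost * x + recourse(x, w1, w2).mean()
```

With demand at most 2 and renewables nonnegative, first-stage production $x = 2$ leaves no shortfall, so the objective reduces to the thermal cost alone, while producing nothing up front pushes all cost into the expensive recourse: the trade-off the two-stage model balances.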
Comments
• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
$$\operatorname*{minimize}_{x}\ \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_\xi \big[\, \psi(x, \xi) \,\big] \quad \text{subject to } x \in X \subseteq \mathbb{R}^{n_1},$$
where the recourse function $\psi(x, \xi)$ is the optimal objective value of the quadratic program (QP), with $Q$ symmetric positive semidefinite:
$$\psi(x, \xi) \,\triangleq\, \operatorname*{minimum}_{y}\ \big[\, f(\xi) + G(\xi) x \,\big]^\top y + \tfrac{1}{2}\, y^\top Q y \quad \text{subject to } y \in Y(x, \xi) \,\triangleq\, \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi) x + D y \ge b(\xi) \,\big\}.$$
• The recourse function $\psi(\bullet, \xi)$ is of the implicit convex-concave kind.
• Developed a combined regularization-convexification sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs
Underlying problem: minimize $c^\top x$ subject to $Ax = b$ and $x \ge 0$.
Defender's problem: anticipating the attacker's disruption $\delta \in \Delta$, the defender undertakes the operation by solving
$$\operatorname*{minimize}_{x \ge 0}\ (c + \delta_c)^\top x + \rho\, \underbrace{\Big\| \big[\, (A + \delta_A) x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{constraint residual}}.$$
Attacker's problem: anticipating the defender's activity $x$, the attacker aims to disrupt the defender's operations by solving
$$\operatorname*{maximum}_{\delta \in \Delta}\ \lambda_1\, \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2\, \underbrace{\Big\| \big[\, (A + \delta_A) x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}}.$$
• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
Robustification of nonconvex problems
A general nonlinear program:
$$\operatorname*{minimize}_{x \in S}\ \theta(x) + f(x) \quad \text{subject to } H(x) = 0,$$
where
• $S$ is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• $\theta$ and $f$ are real-valued functions defined on an open convex set $\mathcal{O}$ in $\mathbb{R}^n$ containing $S$;
• $H : \mathcal{O} \to \mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints $g(x) \le 0$ and equality constraints $h(x) = 0$.
Robust counterpart with separate perturbations:
$$\operatorname*{minimize}_{x \in S}\ \theta(x) + \operatorname*{maximum}_{\delta \in \Delta}\ f(x, \delta) \quad \text{subject to } \underbrace{H(x, \delta) = 0 \ \ \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}.$$
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to publisher on Friday, September 11, 2020
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̂_c 683
Bibliography 691
Index 721
Comments
• The ℓ0 function is discrete in nature; an ASC can be formulated either via complementarity constraints or via integer variables under boundedness of the original variables (the big-M formulation) when A is nonnegative.
• A soft formulation of ASCs is possible, leading to an objective of the kind
$$\underset{\beta \,\in\, \mathbb{R}^d}{\text{minimize}}\;\; f(\beta) \;+\; \gamma \sum_{i=1}^{m} \Big[\, b_i - \sum_{j=1}^{d} a_{ij}\, |\beta_j|_0 \,\Big]_+$$
for a given γ > 0.
• When the matrix A has negative entries, the resulting ASC set may not be closed; its optimization formulation may not have a lower semicontinuous objective.
• When |·|0 is approximated by a folded concave function, one obtains nondifferentiable difference-of-convex constraints.
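The positive-part penalty above is straightforward to evaluate; here is a minimal NumPy sketch (the function name and the toy constraint row are illustrative, not from the monograph):

```python
import numpy as np

def asc_soft_penalty(beta, A, b, gamma, tol=1e-12):
    """Evaluate gamma * sum_i [ b_i - sum_j a_ij * |beta_j|_0 ]_+ , where
    |beta_j|_0 is the 0/1 indicator of beta_j being nonzero (`tol` decides
    numerical nonzeroness).  Names and toy data are illustrative."""
    support = (np.abs(beta) > tol).astype(float)   # the vector |beta|_0
    return gamma * np.sum(np.maximum(b - A @ support, 0.0))

# One ASC row encoding |beta_2|_0 <= |beta_1|_0 (beta_2 active only if beta_1 is):
A = np.array([[1.0, -1.0]]); b = np.array([0.0])
print(asc_soft_penalty(np.array([0.0, 1.0]), A, b, gamma=2.0))  # 2.0 (violated)
print(asc_soft_penalty(np.array([1.0, 0.0]), A, b, gamma=2.0))  # 0.0 (satisfied)
```

The penalty vanishes exactly on the feasible supports, which is what makes the soft formulation a relaxation of the hard ASC.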
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems
$$\underset{\theta \,=\, (W^\ell,\, w^\ell)_{\ell=1}^{L}}{\text{minimize}}\;\; \frac{1}{S} \sum_{s=1}^{S} \big[\, y_s - f_\theta(x_s) \,\big]^2$$
where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:
$$f_\theta(x) \;=\; \underbrace{\big(W^L \bullet + w^L\big)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\big(0,\, W^2 \bullet + w^2\big)}_{\text{second layer}} \circ \underbrace{\max\big(0,\, W^1 x + w^1\big)}_{\text{input layer}}$$
i.e., output = linear ∘ max ∘ linear ∘ · · · ∘ max ∘ linear (input).
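A minimal NumPy sketch of the L-fold composite f_θ and the least-squares training objective (the helper names and the tiny one-dimensional weights are illustrative):

```python
import numpy as np

def f_theta(x, weights):
    """L-fold composite: ReLU layers max(0, W u + w) followed by a linear
    output layer W^L u + w^L, as on the slide.  Sketch with toy sizes."""
    u = x
    for W, w in weights[:-1]:
        u = np.maximum(0.0, W @ u + w)   # hidden layer: max(0, W u + w)
    W_L, w_L = weights[-1]
    return W_L @ u + w_L                  # linear output layer

def training_loss(weights, xs, ys):
    """(1/S) * sum_s [y_s - f_theta(x_s)]^2 for 1-dimensional outputs."""
    return sum(float((y - f_theta(x, weights))[0]) ** 2
               for x, y in zip(xs, ys)) / len(xs)

# Two layers: ReLU(1*x + 0), then the linear output 2*u + 1.
weights = [(np.array([[1.0]]), np.array([0.0])),
           (np.array([[2.0]]), np.array([1.0]))]
print(f_theta(np.array([3.0]), weights))    # [7.]
print(f_theta(np.array([-3.0]), weights))   # [1.]  (ReLU kills the negative input)
```

The kink of max(0, ·) at 0 is exactly what makes the composite nondifferentiable, hence the "non" framework.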
Comments
• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.
• Note the ℓ1 penalization:
$$\underset{z \,\in\, Z,\; u}{\text{minimize}}\;\; \zeta_\rho(z, u) \;\triangleq\; \frac{1}{S} \sum_{s=1}^{S} \varphi_s\big(u^{s,L}\big) \;+\; \rho \underbrace{\sum_{\ell=1}^{L} \sum_{j=1}^{N} \Big|\, u^{s,\ell}_j - \psi_{\ell j}\big(z^\ell, u^{s,\ell-1}\big) \Big|}_{\text{penalizing the components}}$$
• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.
• The framework allows the treatment of data perturbation.
• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
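One possible reading of ζ_ρ in code, with the auxiliary layer variables u held explicitly and every layer map ψ_ℓ taken to be a ReLU layer for brevity (an assumption for illustration; on the slide the output layer is linear):

```python
import numpy as np

def layer_map(W, w, u_prev):
    """psi_l(z_l, u_{l-1}) with z_l = (W, w); every layer is taken to be a
    ReLU layer here for brevity (the slide's output layer is linear)."""
    return np.maximum(0.0, W @ u_prev + w)

def zeta_rho(z, u, xs, loss, rho):
    """Sample-average loss on the last-layer variables plus the l1 penalty
    |u^{s,l}_j - psi_l(z^l, u^{s,l-1})| tying the auxiliary variables u to
    the layer maps -- an illustrative reading of the slide's zeta_rho."""
    total, penalty = 0.0, 0.0
    for s, x in enumerate(xs):
        u_prev = x
        for l, (W, w) in enumerate(z):
            penalty += np.sum(np.abs(u[s][l] - layer_map(W, w, u_prev)))
            u_prev = u[s][l]
        total += loss(u[s][-1])
    return total / len(xs) + rho * penalty

z = [(np.array([[1.0]]), np.array([0.0]))]   # one 1x1 ReLU layer
loss = lambda v: float(v[0] ** 2)
print(zeta_rho(z, [[np.array([2.0])]], [np.array([2.0])], loss, rho=1.0))  # 4.0
```

When u reproduces the layer maps exactly, the penalty vanishes and ζ_ρ reduces to the original training loss; exact penalization says that, for ρ large enough, stationary points of ζ_ρ correspond to those of the constrained formulation.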
Examples of Non-Problems
II Smart Planning
Power Systems Planning
ω1: uncertain renewable energy generation; ω2: uncertain demand for electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators
$$\underset{x \,\in\, X}{\text{minimize}}\;\; \varphi(x) + \mathbb{E}_{\omega_1, \omega_2}\big[\, \psi(x, \omega_1, \omega_2) \,\big]$$
where the recourse function is
$$\psi(x, \omega_1, \omega_2) \;\triangleq\; \underset{y}{\text{minimum}}\;\; y^\top c\big(\, [\, \omega_2 - x - \omega_1 \,]_+ \big) \quad \text{subject to} \quad A(\omega)\, x + D\, y \ge \xi(\omega)$$
The unit cost of the fast-ramping generators depends on
▸ the observed renewable production & demand for electricity
▸ the shortfall due to the first-stage thermal power decisions
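To make the recourse function concrete: in the special case D = I with y ≥ 0 and a nonnegative cost vector, the inner LP has the closed-form minimizer y* = [ξ − Ax]_+, which the sketch below exploits (A and ξ are held fixed rather than ω-dependent; all names are illustrative):

```python
import numpy as np

def recourse(x, omega1, omega2, A, xi, unit_cost):
    """psi(x, w1, w2): min_y c([w2 - x - w1]_+)^T y  s.t.  A x + y >= xi, y >= 0,
    i.e. the slide's LP specialized to D = I with a nonnegative cost vector,
    so the minimizer is the closed form y* = [xi - A x]_+ ."""
    shortfall = np.maximum(omega2 - x - omega1, 0.0)   # drives the unit cost
    c = unit_cost(shortfall)                            # fast-ramping cost vector
    y = np.maximum(xi - A @ x, 0.0)                     # cheapest feasible backup
    return float(c @ y)

A = np.array([[1.0]]); xi = np.array([5.0])
cost = lambda s: np.array([1.0 + s.sum()])   # pricier when the shortfall grows
# Thermal x = 2, renewables w1 = 1, demand w2 = 6 -> shortfall 3, backup y = 3:
print(recourse(np.array([2.0]), np.array([1.0]), np.array([6.0]), A, xi, cost))  # 12.0
```

Because the cost vector itself depends on the decision x through the shortfall, ψ(·, ω1, ω2) is nonconvex in x even though the inner problem is a linear program: this is the "non" feature of the model.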
Comments
• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:
$$\underset{x}{\text{minimize}}\;\; \zeta(x) \;\triangleq\; \varphi(x) + \mathbb{E}_\xi\big[\, \psi(x, \xi) \,\big] \quad \text{subject to} \quad x \in X \subseteq \mathbb{R}^{n_1}$$
where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP) below, with Q symmetric positive semidefinite:
$$\psi(x, \xi) \;\triangleq\; \underset{y}{\text{minimum}}\;\; \big[\, f(\xi) + G(\xi)\, x \,\big]^\top y + \tfrac{1}{2}\, y^\top Q\, y \quad \text{subject to} \quad y \in Y(x, \xi) \;\triangleq\; \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi)\, x + D\, y \ge b(\xi) \,\big\}$$
• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.
• Developed a regularization-convexification sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs
Underlying problem: minimize c⊤x subject to Ax = b and x ≥ 0.
Defender's problem: Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving the problem
$$\underset{x \,\ge\, 0}{\text{minimize}}\;\; (c + \delta_c)^\top x \;+\; \rho \underbrace{\Big\| \big[\, (A + \delta_A)\, x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{constraint residual}}$$
Attacker's problem: Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving the problem
$$\underset{\delta \,\in\, \Delta}{\text{maximum}}\;\; \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} \;+\; \lambda_2 \underbrace{\Big\| \big[\, (A + \delta_A)\, x - b - \delta_b \,\big]_+ \Big\|_\infty}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}$$
• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
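The defender's penalized objective is easy to evaluate for a given attack δ; a minimal NumPy sketch (names illustrative; the attacker's objective combines the same two terms with weights λ1, λ2 and subtracts C(δ)):

```python
import numpy as np

def defender_objective(x, c, A, b, dc, dA, db, rho):
    """(c + dc)^T x + rho * || [ (A + dA) x - b - db ]_+ ||_inf : operating
    cost plus the penalized constraint residual under the attack (dc, dA, db),
    following the slide's positive-part residual."""
    residual = np.maximum((A + dA) @ x - b - db, 0.0)
    return float((c + dc) @ x + rho * np.max(residual))

c = np.array([1.0, 0.0]); A = np.array([[1.0, 1.0]]); b = np.array([1.0])
no_attack = (np.zeros(2), np.zeros((1, 2)), np.zeros(1))
print(defender_objective(np.array([2.0, 0.0]), c, A, b, *no_attack, rho=10.0))  # 12.0
print(defender_objective(np.array([0.5, 0.5]), c, A, b, *no_attack, rho=10.0))  # 0.5
```

The max-of-positive-parts residual is piecewise affine in x, so the defender's problem is convex for fixed δ but the coupled game is not: hence the directional-stationarity machinery.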
Robustification of nonconvex problems
A general nonlinear program:
$$\underset{x \,\in\, S}{\text{minimize}}\;\; \theta(x) + f(x) \quad \text{subject to} \quad H(x) = 0$$
where
• S is a closed convex set in R^n, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in R^n containing S;
• H : O → R^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.
Robust counterpart with separate perturbations:
$$\underset{x \,\in\, S}{\text{minimize}}\;\; \theta(x) + \underset{\delta \,\in\, \Delta}{\text{maximum}}\; f(x, \delta) \quad \text{subject to} \quad \underbrace{H(x, \delta) = 0 \;\; \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}$$
New formulation
via a pessimistic value-function minimization: given γ > 0,
$$\underset{x \,\in\, S}{\text{minimize}}\;\; \theta(x) + v_\gamma(x)$$
where v_γ(x) models constraint violation with a cost of model attacks:
$$v_\gamma(x) \;\triangleq\; \underset{\delta \,\in\, \Delta}{\text{maximum}}\;\; \varphi_\gamma(x, \delta) \;\triangleq\; f(x, \delta) + \gamma \underbrace{\big\| H(x, \delta) \big\|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}}$$
with the novel features:
• a combined perturbation δ in a joint uncertainty set ∆ ⊆ R^L
• robustified constraint satisfaction via penalization
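When the uncertainty set ∆ is replaced by a finite list of attack scenarios, v_γ becomes a plain maximum over scenarios; an illustrative sketch (all names and toy data are assumptions for the example):

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    """Pessimistic value function v_gamma(x) = max over delta in Delta of
    f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta), with Delta given
    here as a finite list of scenarios."""
    return max(f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in deltas)

# Toy data: two attack scenarios on a scalar decision x.
f = lambda x, d: x * d                   # perturbed objective term
H = lambda x, d: np.array([x - d])       # perturbed equality constraint
C = lambda d: 0.5 * d                    # attack cost
print(v_gamma(1.0, [0.0, 1.0], f, H, C, gamma=2.0))   # 2.0 (delta = 0 wins)
```

Note how the attack cost C(δ) can make a milder attack optimal for the attacker; v_γ is a pointwise maximum of nonconvex functions of x, which is where the "non" analysis enters.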
Comments
• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.
• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:
$$\underset{x \,\in\, S}{\text{minimize}}\;\; \phi(x) \;\triangleq\; \theta(x) + v(x) \quad \text{with} \quad v(x) \;\triangleq\; \max_{1 \le j \le J}\; \underset{\delta \,\in\, \Delta \cap \mathbb{R}^L_+}{\text{maximum}}\; \big[\, g_j(x, \delta) - h_j(x)^\top \delta \,\big]$$
• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
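The MM idea for a dc objective f = g − h is to majorize −h by its linearization at the current iterate and minimize the resulting convex surrogate; the sketch below shows this classical DCA-style scheme on a one-dimensional toy function (an illustration of the principle, not the paper's unified formulation):

```python
import numpy as np

def dca(x0, grad_g_inv, subgrad_h, iters=50):
    """Majorization-minimization for f = g - h (both convex): linearize h at
    x_k with a subgradient s_k and set x_{k+1} = argmin_x g(x) - s_k * x,
    i.e. solve grad g(x_{k+1}) = s_k.  Each step does not increase f;
    specialized to scalar x so the surrogate minimizes in closed form."""
    x = x0
    for _ in range(iters):
        x = grad_g_inv(subgrad_h(x))   # minimize the convex surrogate
    return x

# f(x) = 0.5 x^2 - |x|:  g = 0.5 x^2 (so grad_g_inv is the identity map),
# h = |x| with subgradient sign(x); the d-stationary points are x = +/- 1.
print(dca(3.0, grad_g_inv=lambda s: s, subgrad_h=np.sign))   # 1.0
```

On this toy function the iteration lands on the stationary point x = 1 in a single step; in general MM only guarantees a descent sequence converging to a critical point, which is why the sharper directional-stationarity analysis matters.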
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation, to control left- and right-tail distributions simultaneously.
Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:
$$\text{In-CVaR}^{\beta}_{\alpha}(Z) \;\triangleq\; \frac{1}{\beta - \alpha} \int_{\text{VaR}_\beta(Z) \,>\, z \,\ge\, \text{VaR}_\alpha(Z)} z \; dF_Z(z) \;=\; \frac{1}{\beta - \alpha} \Big[\, (1 - \alpha)\, \text{CVaR}_\alpha(Z) - (1 - \beta)\, \text{CVaR}_\beta(Z) \,\Big]$$
Buffered probability of exceedance/deceedance at level τ > 0:
$$\text{bPOE}(Z; \tau) \;\triangleq\; 1 - \min\big\{\, \alpha \in [0, 1] : \text{CVaR}_\alpha(Z) \ge \tau \,\big\}, \qquad \text{bPOD}(Z; u) \;\triangleq\; \min\big\{\, \beta \in [0, 1] : \text{In-CVaR}^{\beta}_{0}(Z) \ge u \,\big\}$$
Buffered probability of the interval (τ, u]
We have the following bound for an interval probability:
$$\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z; \tau) + \text{bPOD}(Z; u) - 1 \;\triangleq\; \text{bPOI}^{u}_{\tau}(Z)$$
Two risk-minimization problems of a composite random functional:
$$\underset{\theta \,\in\, \Theta}{\text{minimize}}\;\; \underbrace{\text{In-CVaR}^{\beta}_{\alpha}\big(Z(\theta)\big)}_{\substack{\text{a difference of two convex} \\ \text{SP value functions}}} \qquad \text{and} \qquad \underset{\theta \,\in\, \Theta}{\text{minimize}}\;\; \underbrace{\text{bPOI}^{u}_{\tau}\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with} \\ \text{a difference-of-convex objective}}}$$
where
$$Z(\theta) \;\triangleq\; c\big(\underbrace{\, g(\theta; Z) - h(\theta; Z) \,}_{g(\bullet,\, z) \,-\, h(\bullet,\, z) \text{ is a difference of convex functions}}\big)$$
with c : R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
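bPOE can be estimated directly from its definition by scanning α over a grid of empirical CVaR values (a brute-force sketch for intuition, not one of the paper's optimization formulations; bPOD is computed analogously from In-CVaR, and bPOI then follows as bPOE + bPOD − 1):

```python
import numpy as np

def bpoe(z, tau, grid=999):
    """bPOE(Z; tau) = 1 - min{ alpha : CVaR_alpha(Z) >= tau }, estimated by
    scanning alpha over a grid of empirical CVaR values (CVaR_alpha is
    nondecreasing in alpha); returns 0.0 when no alpha qualifies."""
    for alpha in np.linspace(0.0, 1.0, grid, endpoint=False):
        t = np.quantile(z, alpha)                            # VaR_alpha
        if t + np.mean(np.maximum(z - t, 0.0)) / (1.0 - alpha) >= tau:
            return 1.0 - alpha
    return 0.0

z = np.array([1.0, 2.0, 3.0, 4.0])
print(bpoe(z, 2.5))    # 1.0  (CVaR_0 = mean = 2.5 already reaches tau)
print(bpoe(z, 100.0))  # 0.0  (unreachable threshold)
```

Unlike the ordinary probability of exceedance, bPOE varies continuously with τ, which is what makes the composite bPOI objective amenable to difference-of-convex methods.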
Comments
• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ
$$\underbrace{\underset{\eta_1 \,\in\, \Upsilon_1}{\text{minimum}}\; \mathbb{E}_\omega\big[\, \varphi_1(\theta, \eta_1, \omega) \,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\underset{\eta_2 \,\in\, \Upsilon_2}{\text{minimum}}\; \mathbb{E}_\omega\big[\, \varphi_2(\theta, \eta_2, \omega) \,\big]}_{\text{denoted } v_2(\theta)}$$
Both v_1 and v_2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory
First of All: What Can Be Computed?
[Diagram: hierarchy of stationarity concepts, from weakest to sharpest — critical points (for dc functions), Clarke stationary points, limiting stationary points, directional stationary points (for dd functions) — shown alongside second-order necessary conditions (SONC), local minimizers (loc-min), and second-order sufficient conditions (SOSC).]
Relationship between the stationary points
▸ Stationarity:
– smooth case: ∇V(x) = 0
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable
▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.
• The sharper the concept, the harder to compute.
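The gap between the concepts can be probed numerically: a finite-difference estimate of V′(x; d) distinguishes a directionally stationary point from a point that is merely Clarke stationary (a toy sketch; V, W, and the direction set are illustrative):

```python
import numpy as np

def dir_deriv(V, x, d, t=1e-7):
    """Finite-difference estimate of the one-sided directional derivative
    V'(x; d) = lim_{t -> 0+} (V(x + t d) - V(x)) / t."""
    return (V(x + t * np.asarray(d)) - V(x)) / t

def d_stationary(V, x, directions, tol=1e-4):
    """Check V'(x; d) >= 0 over a finite set of test directions (a numerical
    probe over sampled directions, not a proof of d-stationarity)."""
    return all(dir_deriv(V, x, d) >= -tol for d in directions)

dirs = [np.array([1.0]), np.array([-1.0])]
V = lambda x: abs(float(x[0]))     # V(x) = |x|:  V'(0; d) = |d| >= 0
W = lambda x: -abs(float(x[0]))    # W(x) = -|x|: W'(0; d) = -|d| < 0
print(d_stationary(V, np.array([0.0]), dirs))  # True:  0 is d-stationary for V
print(d_stationary(W, np.array([0.0]), dirs))  # False: yet 0 lies in the Clarke subdifferential of W
```

The second example is the standard illustration of why directional stationarity is sharper: 0 ∈ ∂W(0) holds for the Clarke subdifferential even though every direction strictly decreases W.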
Computational Tools
▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
▸ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: majorization-minimization algorithms
– dealing with non-smoothness: nonsmooth Newton algorithms
– dealing with pathological constraints: (exact) penalty methods
▸ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics
– dealing with stochastic programming in operations research
– dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Comments
bull The `0 function is discrete in nature an ASC can be formulated either ascomplementarity constraints or by integer variables under boundedness of theoriginal variables (the big M-formulation) when A is nonnegative
bull Soft formulation of ASCs is possible leading to an objective of the kind
minimizeβ isinRd
f(β) + γ
msumi=1
bi minus dsumj=1
aij |βj |0
+
for a given γ gt 0
bull When the matrix A has negative entries the resulting ASC set may not closedits optimization formulation may not have a lower semi-continuous objective
bull When | bull |0 is approximated by a folded concave function obtain somenondifferentiable difference-of-convex constraints
Deep Neural Network with Feed-Forward Structure
leading to multi-composite nonconvex nonsmooth optimization problems
minimizeθ=
(W `w`)
L
`=1
1
S
Ssums=1
[ys minus fθ(xs)
]2where fθ is the L-fold composite function with the non-smooth max ReLUactivation
fθ(x) =(WL bull+wL
)︸ ︷︷ ︸Output Layer
middot middot middot max(
0W 2 bull+w2)
︸ ︷︷ ︸Second Layer
max(
0W 1x+ w1)
︸ ︷︷ ︸Input Layer
linearoutput middot middot middot max linear max linear input
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρ
Lsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
Comments

• Problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ1 penalization:

\[
\operatorname*{minimize}_{z \in Z,\; u}\ \zeta_\rho(z,u) \,\triangleq\, \frac{1}{S}\sum_{s=1}^{S} \varphi_s\!\left(u^{s,L}\right) \;+\; \rho \sum_{\ell=1}^{L}\sum_{j=1}^{N} \underbrace{\left|\, u^{s,\ell}_{j} - \psi_{\ell j}\!\left(z^{\ell}, u^{s,\ell-1}\right) \right|}_{\text{penalizing the components}}
\]

• Rigorous algorithms with supporting convergence theory for computing a directional stationary point can be developed.

• Framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
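A minimal numerical sketch of this penalized objective (our toy instance, not from the talk): the layer maps ψℓ are taken to be ReLU layers, the penalty is summed over the samples s as well, and all names are illustrative.

```python
import numpy as np

def layer_map(W, w, u_prev):
    """psi_l: the layer map, here assumed to be a ReLU layer."""
    return np.maximum(0.0, W @ u_prev + w)

def zeta_rho(weights, u, x, y, rho):
    """l1-penalized objective: average loss on the last-layer variables
    plus rho * sum of |u^{s,l} - psi_l(z^l, u^{s,l-1})| over samples,
    layers, and components.  `u[s]` is the list of per-layer auxiliary
    variables for sample s; a toy sum-output and squared loss are used."""
    S = len(x)
    loss, penalty = 0.0, 0.0
    for s in range(S):
        u_prev = x[s]
        for (W, w), u_sl in zip(weights, u[s]):
            penalty += np.abs(u_sl - layer_map(W, w, u_prev)).sum()
            u_prev = u_sl
        loss += (y[s] - u_prev.sum()) ** 2   # toy output layer + squared loss
    return loss / S + rho * penalty
```

When the auxiliary variables u^{s,ℓ} equal the exact layer outputs, the penalty vanishes and ζρ reduces to the plain empirical loss; otherwise the ℓ1 term charges every violated layer equation.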
Examples of Non-Problems
II. Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

\[
\operatorname*{minimize}_{x \in X}\ \varphi(x) + \mathbb{E}_{\omega_1,\omega_2}\!\left[\psi(x,\omega_1,\omega_2)\right]
\]

where the recourse function is

\[
\psi(x,\omega_1,\omega_2) \,\triangleq\, \operatorname*{minimum}_{y}\ y^\top c\!\left([\omega_2 - x - \omega_1]_+\right) \quad \text{subject to}\ A(\omega)x + Dy \ge \xi(\omega).
\]

Unit cost of fast-ramping generators depends on:
▸ observed renewable production & demand of electricity
▸ shortfall due to the first-stage thermal power decisions
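A sketch of a sample-average approximation of this model under a deliberate simplification (ours, not the talk's general A(ω), D, ξ(ω)): the only recourse constraint is meeting residual demand, x + ω1 + y ≥ ω2 with y ≥ 0, so with a nonnegative unit cost the optimal backup production is exactly the shortfall. The cost function `phi` and the price parameters are illustrative.

```python
import numpy as np

def recourse(x, w1, w2, price=1.0, slope=0.5):
    """psi(x, w1, w2) in the simplified model: the backup generator runs
    exactly the shortfall, at a unit cost that rises with the shortfall."""
    shortfall = max(w2 - x - w1, 0.0)        # [w2 - x - w1]_+
    c = price + slope * shortfall            # shortfall-dependent unit cost
    return c * shortfall

def saa_objective(x, samples, phi=lambda x: 0.1 * x):
    """Sample-average approximation of phi(x) + E[psi(x, w1, w2)]
    over observed (w1, w2) scenario pairs."""
    return phi(x) + np.mean([recourse(x, w1, w2) for w1, w2 in samples])
```

With enough thermal capacity x the recourse cost is zero; as x shrinks, the expectation term grows faster than linearly because the unit cost itself depends on the shortfall.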
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

\[
\operatorname*{minimize}_{x}\ \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_{\xi}\!\left[\psi(x,\xi)\right] \quad \text{subject to}\ x \in X \subseteq \mathbb{R}^{n_1},
\]

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

\[
\psi(x,\xi) \,\triangleq\, \operatorname*{minimum}_{y}\ \left[f(\xi) + G(\xi)x\right]^\top y + \tfrac{1}{2}\, y^\top Q y \quad \text{subject to}\ y \in Y(x,\xi) \,\triangleq\, \{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \ge b(\xi) \,\}.
\]

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a regularization-convexification sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem: anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

\[
\operatorname*{minimize}_{x \ge 0}\ (c+\delta_c)^\top x + \rho\, \underbrace{\left\| \left[ (A+\delta_A)x - b - \delta_b \right]_+ \right\|_\infty}_{\text{constraint residual}}
\]

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

\[
\operatorname*{maximum}_{\delta \in \Delta}\ \lambda_1 \underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2 \underbrace{\left\| \left[ (A+\delta_A)x - b - \delta_b \right]_+ \right\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}}
\]

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
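The two payoffs above are straightforward to evaluate for given (x, δ); a small numpy sketch (ours, with illustrative data) makes the shared residual term explicit:

```python
import numpy as np

def residual(A, x, b):
    """inf-norm of the positive part of the constraint residual [Ax - b]_+."""
    r = np.maximum(A @ x - b, 0.0)
    return float(np.max(r)) if r.size else 0.0

def defender_obj(x, c, A, b, dc, dA, db, rho):
    """(c + dc)^T x + rho * ||[(A + dA)x - (b + db)]_+||_inf."""
    return (c + dc) @ x + rho * residual(A + dA, x, b + db)

def attacker_obj(x, c, A, b, dc, dA, db, lam1, lam2, cost):
    """lam1 * disrupted objective + lam2 * disrupted residual - attack cost."""
    return lam1 * (c + dc) @ x + lam2 * residual(A + dA, x, b + db) - cost
```

Note that the same residual term appears in both problems with opposite intent: the defender pays ρ to suppress it, the attacker earns λ2 for inflating it.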
Robustification of nonconvex problems

A general nonlinear program:

\[
\operatorname*{minimize}_{x \in S}\ \theta(x) + f(x) \quad \text{subject to}\ H(x) = 0,
\]

where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H : O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

\[
\operatorname*{minimize}_{x \in S}\ \theta(x) + \operatorname*{maximum}_{\delta \in \Delta}\, f(x,\delta) \quad \underbrace{\text{subject to}\ H(x,\delta) = 0 \ \ \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}
\]
New formulation

via a pessimistic value-function minimization: given γ > 0,

\[
\operatorname*{minimize}_{x \in S}\ \theta(x) + v_\gamma(x),
\]

where v_γ(x) models constraint violation together with the cost of model attacks:

\[
v_\gamma(x) \,\triangleq\, \operatorname*{maximum}_{\delta \in \Delta}\ \varphi_\gamma(x,\delta) \,\triangleq\, f(x,\delta) + \gamma \underbrace{\left\| H(x,\delta) \right\|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}}
\]

with the novel features:
• combined perturbation δ in a joint uncertainty set Δ ⊆ ℝᴸ;
• robustified constraint satisfaction via penalization.
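For intuition, v_γ can be approximated by brute force over a finite discretization of Δ (our illustration; the talk's explicit forms and MM algorithm avoid this enumeration):

```python
import numpy as np

def v_gamma(x, deltas, f, H, cost, gamma):
    """Pessimistic value function: max over a finite sample `deltas` of
    the uncertainty set of  f(x,d) + gamma*||H(x,d)||_inf - C(d)."""
    return max(f(x, d) + gamma * np.max(np.abs(H(x, d))) - cost(d)
               for d in deltas)
```

The attack cost C(δ) is what keeps the inner maximum finite and meaningful: an expensive perturbation can lose to a cheap one even if it violates the constraints more.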
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

\[
\operatorname*{minimize}_{x \in S}\ \varphi(x) \,\triangleq\, \theta(x) + v(x) \quad \text{with} \quad v(x) \,\triangleq\, \max_{1 \le j \le J}\ \operatorname*{maximum}_{\delta \in \Delta \cap \mathbb{R}^L_+}\ \left[\, g_j(x,\delta) - h_j(x)^\top \delta \,\right].
\]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

\[
\text{In-CVaR}^{\beta}_{\alpha}(Z) \,\triangleq\, \frac{1}{\beta - \alpha} \int_{\text{VaR}_\beta(Z) > z \ge \text{VaR}_\alpha(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta - \alpha} \Big[\, (1-\alpha)\,\text{CVaR}_\alpha(Z) - (1-\beta)\,\text{CVaR}_\beta(Z) \,\Big]
\]

Buffered probability of exceedance/deceedance at level τ > 0:

\[
\text{bPOE}(Z;\tau) \,\triangleq\, 1 - \min\{\, \alpha \in [0,1] \mid \text{CVaR}_\alpha(Z) \ge \tau \,\}, \qquad \text{bPOD}(Z;u) \,\triangleq\, \min\{\, \beta \in [0,1] \mid \text{In-CVaR}^{\beta}_{0}(Z) \ge u \,\}.
\]
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

\[
\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z;\tau) + \text{bPOD}(Z;u) - 1 \,\triangleq\, \text{bPOI}^{u}_{\tau}(Z).
\]

Two risk-minimization problems of a composite random functional:

\[
\operatorname*{minimize}_{\theta \in \Theta}\ \underbrace{\text{In-CVaR}^{\beta}_{\alpha}\big(Z(\theta)\big)}_{\substack{\text{a difference of two convex} \\ \text{SP value functions}}} \qquad \text{and} \qquad \operatorname*{minimize}_{\theta \in \Theta}\ \underbrace{\text{bPOI}^{u}_{\tau}\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with} \\ \text{a difference-of-convex objective}}}
\]

where

\[
Z(\theta) \,\triangleq\, c\Big( \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet,\, z)\, -\, h(\bullet,\, z)\ \text{is difference-of-convex}} \Big),
\]

with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
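A small empirical sketch (ours, not from the talk) of the In-CVaR identity above, with CVaR computed directly from sorted samples:

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of the
    sample (assumes (1 - alpha) * len(z) is an integer, for simplicity)."""
    z = np.sort(np.asarray(z, dtype=float))
    k = int(round((1 - alpha) * z.size))
    return z[-k:].mean()

def in_cvar(z, alpha, beta):
    """Interval CVaR via the identity on the slide:
    (1/(beta - alpha)) * [(1 - alpha) CVaR_alpha - (1 - beta) CVaR_beta]."""
    return ((1 - alpha) * cvar(z, alpha)
            - (1 - beta) * cvar(z, beta)) / (beta - alpha)
```

On the sample z = 1, ..., 100 with α = 0.9 and β = 0.95, CVaR values are 95.5 and 98.0, and the identity gives In-CVaR ≈ 93.0, which is exactly the mean of the five samples lying between the two empirical VaRs.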
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

\[
\underbrace{\operatorname*{minimum}_{\eta_1 \in \Upsilon_1}\ \mathbb{E}_\omega\!\left[\varphi_1(\theta,\eta_1,\omega)\right]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\operatorname*{minimum}_{\eta_2 \in \Upsilon_2}\ \mathbb{E}_\omega\!\left[\varphi_2(\theta,\eta_2,\omega)\right]}_{\text{denoted } v_2(\theta)}
\]

Both v_1 and v_2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: hierarchy of computable points — critical (for dc functions), Clarke stationary, limiting stationary, directional stationary (for dd functions), SONC, loc-min, SOSC.]

Relationship between the stationary points:

▸ Stationarity:
– smooth case: ∇V(x) = 0;
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g. ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the functions is continuously differentiable.

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder to compute.
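A one-line illustration (ours) of the failed sum rule: take V₁(x) = |x| and V₂(x) = −|x| on ℝ. Then

```latex
\partial V_1(0) = [-1,1], \qquad \partial V_2(0) = [-1,1], \qquad
\partial V_1(0) + \partial V_2(0) = [-2,2],
\quad\text{yet}\quad
\partial (V_1 + V_2)(0) = \partial\, 0 \,(0) = \{0\}.
```

The Clarke sum rule holds only as the inclusion ∂(V₁ + V₂)(0) ⊆ ∂V₁(0) + ∂V₂(0), which here is strict: neither summand is continuously differentiable at the origin.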
Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:
– dealing with non-convexity: majorization-minimization algorithm;
– dealing with non-smoothness: nonsmooth Newton algorithm;
– dealing with pathological constraints: (exact) penalty method.

▸ An integration of sampling and deterministic methods:
– dealing with population optimization in statistics;
– dealing with stochastic programming in operations research;
– dealing with risk minimization in financial management and tail control.
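A minimal sketch of the majorization-minimization idea for a difference-of-convex function f = g − h (our toy instance, not the book's general scheme): at each iterate, majorize f by linearizing the concave part −h at x_k, then minimize the resulting convex surrogate.

```python
import numpy as np

def dca(x0, surrogate_min, subgrad_h, iters=20):
    """MM for f = g - h with g, h convex: at x_k pick s_k in the
    subdifferential of h and minimize the convex surrogate
    g(x) - s_k * x, which majorizes f up to a constant."""
    x = x0
    for _ in range(iters):
        s = subgrad_h(x)          # linearize the concave part -h at x
        x = surrogate_min(s)      # exact minimizer of the convex surrogate
    return x

# Toy instance: f(x) = x**2 - 2|x|, minimizers x = +/-1 with f = -1.
# g(x) = x**2, so argmin_x x**2 - s*x = s/2 in closed form.
x_star = dca(3.0,
             surrogate_min=lambda s: s / 2.0,
             subgrad_h=lambda x: 2.0 * np.sign(x))  # subgradient of 2|x|
```

On this instance the iteration reaches the stationary point x = 1 (a global minimizer here) from any positive start in a single step, which is the best case; in general MM only guarantees descent and convergence to a critical point.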
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic;
• Difference-of-convex: facilitates algorithmic design via convex majorization;
• Semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• Subanalytic and semianalytic.

* and many more
[Diagram from Section 4.4, p. 231 of the monograph, relating the classes of nonsmooth functions: dc, dc-composite, diff-max-convex, locally Lipschitz + directionally differentiable, PC², PC¹, semismooth, B-differentiable, PA, PLQ, icc, weakly convex, locally dc, C¹, LC¹, lower-C², on open or closed convex domains, with qualifications (4.65) and (4.66).]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher on Friday, September 11, 2020
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: one negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̃c 683

Bibliography 691
Index 721
Deep Neural Network with Feed-Forward Structure

leading to multi-composite nonconvex nonsmooth optimization problems:

\[
\operatorname*{minimize}_{\theta \,=\, (W^\ell,\, w^\ell)_{\ell=1}^{L}}\ \frac{1}{S} \sum_{s=1}^{S} \left[\, y^s - f_\theta(x^s) \,\right]^2,
\]

where f_θ is the L-fold composite function with the nonsmooth max (ReLU) activation:

\[
f_\theta(x) \,=\, \underbrace{\left( W^L \bullet + w^L \right)}_{\text{output layer}} \circ \cdots \circ \underbrace{\max\!\left( 0,\, W^2 \bullet + w^2 \right)}_{\text{second layer}} \circ \underbrace{\max\!\left( 0,\, W^1 x + w^1 \right)}_{\text{input layer}}
\]

linear (output) ∘ ⋯ ∘ max ∘ linear ∘ max ∘ linear (input)
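The composite structure above can be sketched as a plain forward pass (our minimal version, with illustrative weights; the talk's point is that this objective, not the code, is multi-composite, nonconvex, and nonsmooth):

```python
import numpy as np

def f_theta(x, weights):
    """L-fold composition: max(0, W u + w) on each hidden layer,
    a plain affine map on the output layer."""
    u = x
    for W, w in weights[:-1]:
        u = np.maximum(0.0, W @ u + w)   # ReLU layer: max(0, W u + w)
    W_L, w_L = weights[-1]
    return W_L @ u + w_L                 # output layer: affine, no max

def empirical_loss(xs, ys, weights):
    """(1/S) * sum over samples of the squared error [y^s - f_theta(x^s)]^2."""
    return float(np.mean([(y - f_theta(x, weights)) ** 2
                          for x, y in zip(xs, ys)]))
```

The nonsmoothness enters through every interior `np.maximum`, which is why the exact-penalization treatment in the Comments slide rewrites each layer equation as a constraint and penalizes its violation.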
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρ
Lsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Examples of Non-Problems
II Smart Planning
Power Systems Planning
ω1 uncertain renewable energy generationω2 uncertain demand of electricity
x production from traditional thermal plants
y production from backup fast-ramping generators
minimizexisinX
ϕ(x) + Eω1ω2
[ψ(x ω1 ω2)
]where the recourse function is
ψ(x ω1 ω2) minimumy
ygtc([ω2 minus xminus ω1]+
)subject to A(ω)x+Dy ge ξ(ω)
Unit cost of fast-ramping generators depends on
I observed renewable production amp demand of electricity
I shortfall due to the first-stage thermal power decisions
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
(Simplified) Defender-Attacker Game with Costs
Underlying problem is minimize cgtx subject to Ax = b and x ge 0
Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem
minimizexge0
(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb
]+
∥∥∥infin︸ ︷︷ ︸
constraint residual
Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem
maximumδ isin∆
λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective
+λ2
∥∥∥ [ (A+ δA)xminus bminus δb]+
∥∥∥infin︸ ︷︷ ︸
disrupt constraints
minusC(δ)︸︷︷︸cost
bull Many variations are possible
bull Offer alternative approaches to the convexity-based robust formulations
bull Adversarial Programming
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation
via a pessimistic value-function minimization given γ gt 0
minimizexisinS
θ(x) + vγ(x)
where vγ(x) models constraint violation with cost of model attacks
vγ(x) maximum over δ isin ∆ of
φγ(x δ) f(x δ) + γ
H(x δ) infin︸ ︷︷ ︸constraint residual
minus C(δ)︸ ︷︷ ︸attack cost
with the novel features
bull combined perturbation δ in joint uncertainty set ∆ sube RL
bull robustify constraint satisfaction via penalization
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of interval (τ u]
We have the following bound for an interval probability
P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)
Two risk minimization problems of a composite random functional
minimizeθisinΘ
In-CVaR βα (Z(θ))︸ ︷︷ ︸
a difference of two convexSP value functions
and minimizeθisinΘ
bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective
where
Z(θ) c
g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex
with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

$$\underbrace{\min_{\eta_1\in\Upsilon_1}\; \mathbb{E}_{\omega}\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\min_{\eta_2\in\Upsilon_2}\; \mathbb{E}_{\omega}\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted } v_2(\theta)}$$

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
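A difference-of-convex objective v₁ − v₂ of this kind can be attacked by the classical DC algorithm, one instance of the surrogation idea: majorize the concave part −v₂ by its linearization at the current iterate and minimize the resulting convex surrogate. A one-dimensional toy sketch (a hypothetical problem of my own, not the In-CVaR program):

```python
# Toy DC algorithm (a majorization-minimization instance):
# minimize f(θ) = v1(θ) - v2(θ) with v1(θ) = θ² and v2(θ) = |θ|, both convex.
# Step k: pick s ∈ ∂v2(θ_k) and minimize the convex surrogate θ² - s·θ,
# which majorizes f up to a constant; its minimizer is s/2 in closed form.
def dca(theta, iters=20):
    for _ in range(iters):
        s = 1.0 if theta >= 0 else -1.0   # a subgradient of |·| at theta
        theta = s / 2.0                   # argmin of the convex surrogate
    return theta

print(dca(0.3), dca(-0.7))   # ±0.5, the two global minimizers of f
```

Each iterate is a critical point candidate; here the method lands on a global minimizer of f(θ) = θ² − |θ| from either side.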
A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: hierarchy of computable points. Directional stationary points (for dd functions) ⊆ limiting stationary ⊆ Clarke stationary ⊆ critical points (for dc functions); local minima satisfy the second-order necessary condition (SONC) and are implied by the second-order sufficient condition (SOSC).]
Relationship between the stationary points

▶ Stationarity:
– smooth case: ∇V(x) = 0
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g. ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of them is continuously differentiable

▶ Directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder it is to compute.
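The gap between the concepts is visible already in one dimension (my own illustration). For f(x) = −|x|, the Clarke subdifferential at the origin is ∂f(0) = [−1, 1], so x = 0 is Clarke stationary; yet f′(0; d) = −|d| < 0 for every d ≠ 0, so it is not directionally stationary:

```python
# f(x) = -|x| at x = 0: Clarke stationary (0 ∈ ∂f(0) = [-1, 1]) but not
# directionally stationary, since f'(0; d) < 0 in every direction d ≠ 0.
def f(x):
    return -abs(x)

def dir_deriv(x, d, t=1e-8):
    # one-sided directional derivative f'(x; d) ≈ (f(x + t·d) - f(x)) / t
    return (f(x + t * d) - f(x)) / t

print(dir_deriv(0.0, 1.0), dir_deriv(0.0, -1.0))   # both ≈ -1: descent both ways
```

An algorithm that certifies only Clarke stationarity can stop at 0 even though every direction is a descent direction; this is why the sharper directional concept is the desirable target.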
Computational Tools

▶ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions

▶ A fusion of first-order and second-order algorithms
– Dealing with non-convexity: majorization-minimization algorithms
– Dealing with non-smoothness: nonsmooth Newton algorithms
– Dealing with pathological constraints: (exact) penalty methods

▶ An integration of sampling and deterministic methods
– Dealing with population optimization in statistics
– Dealing with stochastic programming in operations research
– Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more
[Figure (Section 4.4, p. 231 of the monograph): an inclusion diagram of the nonsmooth function classes: piecewise affine (PA) ⊂ piecewise linear-quadratic (PLQ) ⊂ PC¹, PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable, together with the dc, diff-max-convex, implicit convex-concave (icc), and (locally) strongly weakly convex classes, all inside the locally Lipschitz and directionally differentiable functions.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683

Bibliography 691
Index 721
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ₁ penalization:

$$\min_{z\in Z,\; u}\;\; \zeta_{\rho}(z,u) \;\triangleq\; \frac{1}{S}\sum_{s=1}^{S} \varphi_s\big(u^{s,L}\big) \;+\; \rho \sum_{s=1}^{S}\sum_{\ell=1}^{L} \sum_{j=1}^{N} \underbrace{\Big|\, u^{s,\ell}_j - \psi_{\ell j}\big(z^{\ell}, u^{s,\ell-1}\big) \Big|}_{\text{penalizing the components}}$$

• Rigorous algorithms, with supporting convergence theory, for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds for nondifferentiable optimization problems.
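The exactness phenomenon behind such an ℓ₁ penalty can be seen on a two-variable toy problem (my own illustration, unrelated to the slide's multi-stage model): minimize x² + y subject to the nonsmooth coupling y = |x|. Above a finite threshold of the penalty parameter ρ, the penalized minimizer satisfies the constraint exactly:

```python
import numpy as np

# Exact ℓ1 penalty on a toy problem: minimize x² + y subject to y = |x|,
# replaced by   min  x² + y + ρ·|y − |x||   over the box [-2, 2]².
# Brute-force grid minimization; for ρ > 1 the penalty is exact and the
# penalized minimizer is feasible for the coupling constraint.
xs = np.linspace(-2.0, 2.0, 401)
X, Y = np.meshgrid(xs, xs)

def solve(rho):
    P = X**2 + Y + rho * np.abs(Y - np.abs(X))
    i = np.unravel_index(np.argmin(P), P.shape)
    return X[i], Y[i]

for rho in (0.5, 2.0):
    x, y = solve(rho)
    print(rho, x, y, abs(y - abs(x)))   # constraint violation drops to 0 once ρ > 1
```

With ρ = 0.5 the penalized problem escapes to an infeasible corner of the box; with ρ = 2 it returns the feasible minimizer (0, 0), exactly the behavior that an exact-penalty theory certifies without driving ρ to infinity.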
Examples of Non-Problems

II Smart Planning

Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand for electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$$\min_{x\in X}\;\; \varphi(x) + \mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big]$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2) \;\triangleq\; \min_{y}\;\; y^{\top} c\big(\,[\,\omega_2 - x - \omega_1\,]_+\big) \quad \text{subject to } A(\omega)x + Dy \ge \xi(\omega)$$

The unit cost of the fast-ramping generators depends on
▶ the observed renewable production and demand for electricity
▶ the shortfall due to the first-stage thermal power decisions
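A sample-average-approximation caricature of this planning model (my own one-dimensional simplification: scalar production, a constant backup cost in place of the state-dependent c(·), and no side constraints, so the recourse has the closed form y* = [ω₂ − x − ω₁]₊):

```python
import numpy as np

# SAA caricature of the two-stage planning model (hypothetical numbers).
rng = np.random.default_rng(1)
w1 = rng.uniform(0.0, 2.0, size=20_000)   # renewable generation scenarios
w2 = rng.uniform(3.0, 5.0, size=20_000)   # electricity demand scenarios
c_thermal, c_backup = 1.0, 3.0            # backup power costs more per unit

def saa_cost(x):
    shortfall = np.maximum(w2 - x - w1, 0.0)   # closed-form recourse y*
    return c_thermal * x + c_backup * shortfall.mean()

grid = np.linspace(0.0, 5.0, 501)
x_star = grid[np.argmin([saa_cost(x) for x in grid])]
print(x_star)   # first-stage production balancing thermal vs backup cost
```

The first-stage decision hedges a newsvendor-style trade-off: cheap thermal capacity bought in advance against expensive fast-ramping recourse; the slide's model replaces this closed form with a genuine second-stage program, which is where the nonconvexity and nonsmoothness enter.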
Comments

• Leads to the study of the class of two-stage, linearly bi-parameterized stochastic programs with quadratic recourse:

$$\min_{x}\;\; \zeta(x) \;\triangleq\; \varphi(x) + \mathbb{E}_{\xi}\big[\psi(x,\xi)\big] \quad \text{subject to } x \in X \subseteq \mathbb{R}^{n_1}$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP) below, with Q symmetric positive semidefinite:

$$\psi(x,\xi) \;\triangleq\; \min_{y}\;\; \big[f(\xi) + G(\xi)x\big]^{\top} y + \tfrac{1}{2}\, y^{\top} Q y \quad \text{subject to } y \in Y(x,\xi) \;\triangleq\; \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \ge b(\xi) \,\big\}$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a combined regularization, convexification, and sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

$$\min_{x\ge 0}\;\; (c+\delta_c)^{\top}x \;+\; \rho\, \underbrace{\big\| \big[\, (A+\delta_A)x - b - \delta_b \,\big]_+ \big\|_{\infty}}_{\text{constraint residual}}$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

$$\max_{\delta\in\Delta}\;\; \lambda_1 \underbrace{(c+\delta_c)^{\top}x}_{\text{disrupt objective}} \;+\; \lambda_2 \underbrace{\big\| \big[\, (A+\delta_A)x - b - \delta_b \,\big]_+ \big\|_{\infty}}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
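When the attack set Δ is finite, the attacker's best response can be found by enumeration. A toy sketch (all data hypothetical) evaluating the attacker's objective λ₁(c+δ_c)⊤x + λ₂‖[(A+δ_A)x − b − δ_b]₊‖∞ − C(δ) over three candidate attacks:

```python
import numpy as np

# Attacker's best response by enumeration over a finite attack set Δ
# (hypothetical data). Underlying LP: min c^T x s.t. Ax = b, x >= 0.
A = np.array([[1.0, 1.0]])
b = np.array([2.0])
c = np.array([1.0, 2.0])
x = np.array([2.0, 0.0])          # the defender activity being attacked
lam1, lam2 = 1.0, 1.0

# Each attack: (delta_c, delta_A, delta_b, set-up cost C(delta)).
deltas = [
    (np.zeros(2), np.zeros((1, 2)), np.zeros(1), 0.0),           # no attack
    (np.array([0.5, 0.0]), np.zeros((1, 2)), np.zeros(1), 0.4),  # raise costs
    (np.zeros(2), np.zeros((1, 2)), np.array([-1.0]), 0.3),      # shift rhs
]

def attacker_value(attack):
    dc, dA, db, cost = attack
    residual = np.max(np.maximum((A + dA) @ x - b - db, 0.0))  # ||[.]_+||_inf
    return lam1 * (c + dc) @ x + lam2 * residual - cost

values = [attacker_value(d) for d in deltas]
print(values)   # the rhs attack wins here: it creates a constraint violation
```

With a continuous Δ the inner maximization becomes a value function of the kind studied in the talk, which is exactly what makes the defender's problem a "non"-problem.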
Robustification of nonconvex problems

A general nonlinear program:

$$\min_{x\in S}\;\; \theta(x) + f(x) \quad \text{subject to } H(x) = 0$$

where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H : O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\min_{x\in S}\;\; \theta(x) + \max_{\delta\in\Delta}\, f(x,\delta) \quad \text{subject to } \underbrace{H(x,\delta) = 0 \;\; \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}$$
New formulation

via a pessimistic value-function minimization: given γ > 0,

$$\min_{x\in S}\;\; \theta(x) + v_{\gamma}(x)$$

where v_γ(x) models constraint violation together with the cost of model attacks:

$$v_{\gamma}(x) \;\triangleq\; \max_{\delta\in\Delta}\;\; \varphi_{\gamma}(x,\delta) \;\triangleq\; f(x,\delta) \;+\; \gamma\, \underbrace{\big\| H(x,\delta) \big\|_{\infty}}_{\text{constraint residual}} \;-\; \underbrace{C(\delta)}_{\text{attack cost}}$$

with the novel features:
• a combined perturbation δ in a joint uncertainty set Δ ⊆ ℝᴸ;
• constraint satisfaction robustified via penalization.
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

$$\min_{x\in S}\;\; \varphi(x) \;\triangleq\; \theta(x) + v(x), \quad\text{with}\quad v(x) \;\triangleq\; \max_{1\le j\le J}\;\; \max_{\delta\in\Delta\cap\mathbb{R}^L_+}\;\; \big[\, g_j(x,\delta) - h_j(x)^{\top}\delta \,\big]$$

• Implemented the algorithm and reported computational results demonstrating the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of interval (τ u]
We have the following bound for an interval probability
P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)
Two risk minimization problems of a composite random functional
minimizeθisinΘ
In-CVaR βα (Z(θ))︸ ︷︷ ︸
a difference of two convexSP value functions
and minimizeθisinΘ
bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective
where
Z(θ) c
g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex
with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
A touch of mathematicsbridging computations with theory
First of All What Can be Computed
critical(for dc fncs)
Clarke statlimiting stat
directional stat(for dd fncs)
SONC
loc-min
SOSC
Relationship between the stationary points
I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]
eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable
I directional stationarity V prime(x d) ge 0 for all the directions d
bull The sharper the concept the harder to compute
Computational Tools
I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions
I A fusion of first-order and second-order algorithms
ndash Dealing with non-convexity majorization-minimization algorithm
ndash Dealing with non-smoothness nonsmooth Newton algorithm
ndash Dealing with pathological constraints (exact) penalty method
I An integration of sampling and deterministic methods
ndash Dealing with population optimization in statistics
ndash Dealing with stochastic programming in operations research
ndash Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Examples of Non-Problems

II. Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation
ω2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$$\underset{x \in X}{\text{minimize}}\ \ \varphi(x) + \mathbb{E}_{\omega_1,\omega_2}\big[\,\psi(x,\omega_1,\omega_2)\,\big]$$

where the recourse function is

$$\psi(x,\omega_1,\omega_2) \,\triangleq\, \underset{y}{\text{minimum}}\ \ y^\top c\big([\,\omega_2 - x - \omega_1\,]_+\big) \quad \text{subject to } A(\omega)x + Dy \ge \xi(\omega)$$

Unit cost of fast-ramping generators depends on:
• observed renewable production & demand of electricity
• shortfall due to the first-stage thermal power decisions
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$$\underset{x}{\text{minimize}}\ \ \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_\xi\big[\,\psi(x,\xi)\,\big] \quad \text{subject to } x \in X \subseteq \mathbb{R}^{n_1},$$

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

$$\psi(x,\xi) \,\triangleq\, \underset{y}{\text{minimum}}\ \ \big[\,f(\xi) + G(\xi)x\,\big]^\top y + \tfrac{1}{2}\,y^\top Q y \quad \text{subject to } y \in Y(x,\xi) \,\triangleq\, \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \ge b(\xi) \,\big\}$$

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a regularization-convexification-sampling combined method for computing a generalized critical point with an almost-sure convergence guarantee.
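A minimal numerical sketch of this problem class (my own toy instance, not the talk's model): with a scalar second-stage variable and Q reducing to a scalar q > 0 (the positive definite case), the recourse QP has a closed form, and the expectation can be approximated by a sample average.

```python
import numpy as np

q = 2.0  # Q reduces to a scalar q > 0 here (the positive definite case)

def recourse(x, xi):
    """psi(x, xi) = min_y [f(xi) + g(xi)*x]*y + 0.5*q*y^2  s.t.  a(xi)*x + y >= b(xi).
    Scalar closed form: the unconstrained minimizer, clipped at the constraint."""
    f, g, a, b = xi
    lin = f + g * x                     # linearly bi-parameterized linear term
    y_star = max(-lin / q, b - a * x)   # project onto the feasible ray
    return lin * y_star + 0.5 * q * y_star ** 2

def saa_objective(x, xis):
    """Sample average approximation of zeta(x) = phi(x) + E[psi(x, xi)], phi(x) = x^2."""
    return x ** 2 + np.mean([recourse(x, xi) for xi in xis])

xis = np.random.default_rng(0).normal(size=(200, 4))  # samples of xi = (f, g, a, b)
print(saa_objective(0.5, xis))
```

Even in this scalar case one can see why ψ(•, ξ) is nonsmooth in x: the max in `recourse` switches between the interior and boundary solutions as x varies.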
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving the problem

$$\underset{x \ge 0}{\text{minimize}}\ \ (c+\delta_c)^\top x + \rho\,\underbrace{\Big\|\,\big[\,(A+\delta_A)x - b - \delta_b\,\big]_+\Big\|_\infty}_{\text{constraint residual}}$$

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving the problem

$$\underset{\delta \in \Delta}{\text{maximum}}\ \ \lambda_1\,\underbrace{(c+\delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2\,\underbrace{\Big\|\,\big[\,(A+\delta_A)x - b - \delta_b\,\big]_+\Big\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}}$$

• Many variations are possible.
• Offer alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
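To make the defender's penalized objective concrete, here is a direct evaluation in NumPy (toy data of my own; c, A, b, δ, and ρ are all illustrative):

```python
import numpy as np

def defender_objective(x, c, A, b, delta, rho):
    """(c + dc)^T x + rho * || [(A + dA) x - b - db]_+ ||_inf :
    disrupted operating cost plus an inf-norm penalty on the
    positive part of the disrupted constraint residual."""
    dc, dA, db = delta
    residual = np.maximum((A + dA) @ x - (b + db), 0.0)
    return (c + dc) @ x + rho * residual.max()

c = np.array([1.0, 1.0]); A = np.eye(2); b = np.ones(2); x = np.ones(2)
no_attack = (np.zeros(2), np.zeros((2, 2)), np.zeros(2))
attack    = (np.zeros(2), np.zeros((2, 2)), np.array([-0.5, 0.0]))
print(defender_objective(x, c, A, b, no_attack, rho=10.0))  # 2.0: feasible, no penalty
print(defender_objective(x, c, A, b, attack,    rho=10.0))  # 7.0: residual 0.5, penalized
```

The penalty term is convex in x but nonsmooth (a max of piecewise-linear pieces), which is exactly where the "non"-machinery of the talk comes in.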
Robustification of nonconvex problems

A general nonlinear program:

$$\underset{x \in S}{\text{minimize}}\ \ \theta(x) + f(x) \quad \text{subject to } H(x) = 0,$$

where
• S is a closed convex set in ℝⁿ, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H : O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

$$\underset{x \in S}{\text{minimize}}\ \ \theta(x) + \underset{\delta \in \Delta}{\text{maximum}}\ f(x,\delta) \quad \text{subject to } \underbrace{H(x,\delta) = 0\ \ \forall\, \delta \in \Delta}_{\substack{\text{easily infeasible in the}\\ \text{current way of modeling}}}$$
New formulation

via a pessimistic value-function minimization: given γ > 0,

$$\underset{x \in S}{\text{minimize}}\ \ \theta(x) + v_\gamma(x),$$

where v_γ(x) models constraint violation together with the cost of model attacks:

$$v_\gamma(x) \,\triangleq\, \underset{\delta \in \Delta}{\text{maximum}}\ \ \varphi_\gamma(x,\delta) \,\triangleq\, f(x,\delta) + \gamma\,\underbrace{\big\| H(x,\delta) \big\|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}},$$

with the novel features:
• combined perturbation δ in a joint uncertainty set Δ ⊆ ℝᴸ;
• robustified constraint satisfaction via penalization.
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

$$\underset{x \in S}{\text{minimize}}\ \ \varphi(x) \,\triangleq\, \theta(x) + v(x), \quad \text{with } v(x) \,\triangleq\, \max_{1 \le j \le J}\ \underset{\delta \in \Delta \cap \mathbb{R}^L_+}{\text{maximum}}\ \big[\, g_j(x,\delta) - h_j(x)^\top \delta \,\big].$$

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
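A single MM step for such difference-of-convex structure can be sketched on a one-dimensional toy (my own example, not the talk's formulation): for f = g − h with g(x) = x²/2 and h(x) = |x − 1| both convex, linearizing the concave part −h at the current iterate yields a convex upper surrogate whose minimizer defines the next iterate, with guaranteed descent.

```python
def mm_dc(x0, steps=25):
    """MM for f(x) = 0.5 x^2 - |x - 1|: majorize the concave part -h
    by its linearization at x_k, then minimize the convex surrogate."""
    f = lambda x: 0.5 * x * x - abs(x - 1.0)
    x = x0
    for _ in range(steps):
        s = 1.0 if x > 1.0 else -1.0     # a subgradient of h(x) = |x - 1| at x
        x_new = s                        # argmin of 0.5 x^2 - s*x  is  x = s
        assert f(x_new) <= f(x) + 1e-12  # MM descent: surrogate sits above f
        x = x_new
    return x

print(mm_dc(0.0))   # -1.0, the global minimizer here (f(-1) = -1.5)
```

The descent guarantee follows because convexity of h gives h(x) ≥ h(x_k) + s(x − x_k), so the surrogate majorizes f and touches it at x_k; this is the mechanism behind the surrogation theory in the monograph.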
Examples of Non-Problems

III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

$$\text{In-CVaR}^{\,\beta}_{\,\alpha}(Z) \,\triangleq\, \frac{1}{\beta-\alpha}\int_{\text{VaR}_\beta(Z) > z \,\ge\, \text{VaR}_\alpha(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta-\alpha}\Big[\,(1-\alpha)\,\text{CVaR}_\alpha(Z) - (1-\beta)\,\text{CVaR}_\beta(Z)\,\Big]$$

Buffered probability of exceedance/deceedance at level τ > 0:

$$\text{bPOE}(Z,\tau) \,\triangleq\, 1 - \min\big\{\, \alpha \in [0,1] : \text{CVaR}_\alpha(Z) \ge \tau \,\big\}$$
$$\text{bPOD}(Z,u) \,\triangleq\, \min\big\{\, \beta \in [0,1] : \text{In-CVaR}^{\,\beta}_{\,0}(Z) \ge u \,\big\}$$
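The In-CVaR identity above is easy to probe numerically on an empirical distribution. A sketch (helper names are my own), using the Rockafellar-Uryasev representation CVaR_α(Z) = min_t { t + E[(Z − t)_+]/(1 − α) }, whose minimum over an empirical sample is attained at a sample point:

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha via Rockafellar-Uryasev:
    min over t of t + E[(Z - t)_+] / (1 - alpha), t ranging over the sample."""
    return min(t + np.mean(np.maximum(z - t, 0.0)) / (1.0 - alpha) for t in z)

def in_cvar(z, alpha, beta):
    """In-CVaR_alpha^beta via the identity
    [(1 - alpha) CVaR_alpha - (1 - beta) CVaR_beta] / (beta - alpha)."""
    top = 0.0 if beta == 1.0 else (1.0 - beta) * cvar(z, beta)
    return ((1.0 - alpha) * cvar(z, alpha) - top) / (beta - alpha)

z = np.random.default_rng(1).normal(size=500)
print(cvar(z, 0.9), in_cvar(z, 0.5, 0.9))
```

Sanity checks that follow from the definitions: CVaR_0 is the plain mean, In-CVaR over the full interval [0, 1) is again the mean, and In-CVaR over a lower interval never exceeds that over an upper one.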
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

$$\mathbb{P}(u \ge Z > \tau) \,\le\, \text{bPOE}(Z,\tau) + \text{bPOD}(Z,u) - 1 \,\triangleq\, \text{bPOI}^{\,u}_{\,\tau}(Z)$$

Two risk minimization problems of a composite random functional:

$$\underset{\theta \in \Theta}{\text{minimize}}\ \underbrace{\text{In-CVaR}^{\,\beta}_{\,\alpha}\big(Z(\theta)\big)}_{\substack{\text{a difference of two convex}\\ \text{SP value functions}}} \qquad \text{and} \qquad \underset{\theta \in \Theta}{\text{minimize}}\ \underbrace{\text{bPOI}^{\,u}_{\,\tau}\big(Z(\theta)\big)}_{\substack{\text{a stochastic program with a}\\ \text{difference-of-convex objective}}}$$

where

$$Z(\theta) \,\triangleq\, c\big(\underbrace{g(\theta,Z) - h(\theta,Z)}_{g(\bullet,\,z)\, -\, h(\bullet,\,z)\ \text{is difference-of-convex}}\big),$$

with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program:

$$\underset{\theta \in \Theta}{\text{minimize}}\ \ \underbrace{\underset{\eta_1 \in \Upsilon_1}{\text{minimum}}\ \mathbb{E}_\omega\big[\,\varphi_1(\theta,\eta_1,\omega)\,\big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\underset{\eta_2 \in \Upsilon_2}{\text{minimum}}\ \mathbb{E}_\omega\big[\,\varphi_2(\theta,\eta_2,\omega)\,\big]}_{\text{denoted } v_2(\theta)}$$

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All: What Can be Computed?

[Diagram: relationships among the computable stationarity concepts — critical points (for dc functions), Clarke stationary points, limiting stationary points, and directional stationary points (for dd functions) — together with the second-order necessary condition (SONC), local minima, and the second-order sufficient condition (SOSC).]
Relationship between the stationary points

• Stationarity:
  – smooth case: ∇V(x) = 0
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable

• Directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder it is to compute.
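The failure of the sum rule quoted above can be seen already at V1(x) = |x| and V2(x) = −|x|; a tiny interval check (illustrative only, with the subdifferentials at 0 written down by hand as intervals):

```python
# Clarke subdifferentials at x = 0, represented as intervals (lo, hi):
dV1 = (-1.0, 1.0)          # subdifferential of V1(x) = |x| at 0
dV2 = (-1.0, 1.0)          # Clarke subdifferential of V2(x) = -|x| at 0
minkowski_sum = (dV1[0] + dV2[0], dV1[1] + dV2[1])   # the interval [-2, 2]

# But V1 + V2 is identically zero, so its subdifferential at 0 is {0}:
d_sum = (0.0, 0.0)
print(minkowski_sum, d_sum)  # the inclusion d(V1+V2) in dV1 + dV2 is strict here
```

So 0 ∈ ∂V1(x) + ∂V2(x) can hold at points that are not even remotely stationary for V1 + V2, which is why the sharper directional concept matters.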
Computational Tools

• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

• A fusion of first-order and second-order algorithms:
  – Dealing with non-convexity: majorization-minimization algorithm
  – Dealing with non-smoothness: nonsmooth Newton algorithm
  – Dealing with pathological constraints: (exact) penalty method

• An integration of sampling and deterministic methods:
  – Dealing with population optimization in statistics
  – Dealing with stochastic programming in operations research
  – Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more
[Diagram from Section 4.4 of the monograph: inclusion relationships among the classes of nonsmooth functions — piecewise affine (PA) ⊂ piecewise linear-quadratic (PLQ) ⊂ piecewise quadratic (PQ); PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable; together with dc, dc composite, diff-max-convex, implicit convex-concave (icc), (locally) strongly weakly convex, and locally dc functions, all within the locally Lipschitz and directionally differentiable functions.]
Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

Submitted to the publisher (Friday, September 11, 2020)
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683
Bibliography 691
Index 721
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Examples of Non-Problems
II Smart Planning
Power Systems Planning
ω1 uncertain renewable energy generationω2 uncertain demand of electricity
x production from traditional thermal plants
y production from backup fast-ramping generators
minimizexisinX
ϕ(x) + Eω1ω2
[ψ(x ω1 ω2)
]where the recourse function is
ψ(x ω1 ω2) minimumy
ygtc([ω2 minus xminus ω1]+
)subject to A(ω)x+Dy ge ξ(ω)
Unit cost of fast-ramping generators depends on
I observed renewable production amp demand of electricity
I shortfall due to the first-stage thermal power decisions
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
(Simplified) Defender-Attacker Game with Costs
Underlying problem is minimize cgtx subject to Ax = b and x ge 0
Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem
minimizexge0
(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb
]+
∥∥∥infin︸ ︷︷ ︸
constraint residual
Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem
maximumδ isin∆
λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective
+λ2
∥∥∥ [ (A+ δA)xminus bminus δb]+
∥∥∥infin︸ ︷︷ ︸
disrupt constraints
minusC(δ)︸︷︷︸cost
bull Many variations are possible
bull Offer alternative approaches to the convexity-based robust formulations
bull Adversarial Programming
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation
via a pessimistic value-function minimization given γ gt 0
minimizexisinS
θ(x) + vγ(x)
where vγ(x) models constraint violation with cost of model attacks
vγ(x) maximum over δ isin ∆ of
φγ(x δ) f(x δ) + γ
H(x δ) infin︸ ︷︷ ︸constraint residual
minus C(δ)︸ ︷︷ ︸attack cost
with the novel features
bull combined perturbation δ in joint uncertainty set ∆ sube RL
bull robustify constraint satisfaction via penalization
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of interval (τ u]
We have the following bound for an interval probability
P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)
Two risk minimization problems of a composite random functional
minimizeθisinΘ
In-CVaR βα (Z(θ))︸ ︷︷ ︸
a difference of two convexSP value functions
and minimizeθisinΘ
bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective
where
Z(θ) c
g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex
with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ the function v1(θ) − v2(θ), where

   v_i(θ) := minimum_{η_i ∈ Υ_i}  E_ω[ ϕ_i(θ, η_i, ω) ],   i = 1, 2.

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
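The structure in the first bullet can be reproduced on a toy instance. A hedged sketch, with invented stand-ins for ϕ1 and ϕ2 (each convex jointly in (θ, η), so each value function v_i is convex and their difference is a dc objective):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
omega = rng.normal(size=2000)            # i.i.d. samples of omega

def phi1(theta, eta):
    # sample average of a random functional, convex jointly in (theta, eta)
    return np.mean((eta - theta * omega) ** 2)

def phi2(theta, eta):
    return np.mean((eta - 0.5 * theta * omega) ** 2) + 0.1

def value_fn(phi, theta):
    # v(theta) = min over eta of the sample-average of phi(theta, eta, omega)
    return minimize_scalar(lambda eta: phi(theta, eta),
                           bounds=(-10, 10), method="bounded").fun

def dc_objective(theta):
    # v1(theta) - v2(theta): a difference of two convex value functions
    return value_fn(phi1, theta) - value_fn(phi2, theta)
```

Each evaluation of the dc objective solves two inner univariate minimizations, exactly the value-function structure described on the slide.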
A touch of mathematics: bridging computations with theory

First of all: what can be computed?

[Diagram: the hierarchy of stationarity concepts, from weakest to sharpest: critical points (for dc functions); Clarke stationary and limiting stationary points; directional stationary points (for dd functions); alongside the second-order necessary condition (SONC), local minima, and the second-order sufficient condition (SOSC).]

Relationship between the stationary points

▸ Stationarity:
  – smooth case: ∇V(x) = 0;
  – nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail],
    e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable.

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute.
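The gap between the concepts is visible already for V(x) = −|x| at x = 0: the Clarke subdifferential there is [−1, 1], so the point is Clarke stationary, yet V′(0; d) = −|d| < 0 for every d ≠ 0, so it is not directionally stationary. A small numerical illustration (finite-difference directional derivatives; ours, not from the talk):

```python
def V(x):
    # V(x) = 0 - |x|, a dc function
    return -abs(x)

def dir_deriv(f, x, d, t=1e-8):
    # one-sided directional derivative f'(x; d) via a small forward step
    return (f(x + t * d) - f(x)) / t

# x = 0 is Clarke stationary: the Clarke subdifferential of V at 0 is [-1, 1],
# which contains 0.  Yet every direction is a strict descent direction:
slopes = [dir_deriv(V, 0.0, d) for d in (1.0, -1.0)]
# both slopes equal -1, so x = 0 is not directionally stationary
```

A directionally stationary point would have both one-sided slopes nonnegative; the sharper concept correctly excludes this spurious point.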
Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▸ A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithm;
  – dealing with non-smoothness: nonsmooth Newton algorithm;
  – dealing with pathological constraints: (exact) penalty method.

▸ An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
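For dc functions the majorization-minimization idea is concrete: with f = g − h and g, h convex, replacing h by its linearization at the current iterate yields a convex upper surrogate, and minimizing the surrogate gives descent (the classical dc algorithm). A toy sketch on the invented example f(x) = x² − 2|x|:

```python
def f(x):
    # dc function f = g - h with g(x) = x^2 and h(x) = 2|x|
    return x * x - 2 * abs(x)

def dca_step(x):
    # linearize h at x_k: pick s in the subdifferential of h, s = 2*sign(x_k),
    # then minimize the convex surrogate x^2 - s*x, whose minimizer is s/2
    s = 2.0 if x >= 0 else -2.0
    return s / 2.0

x = 0.3
for _ in range(50):
    x_next = dca_step(x)
    if abs(x_next - x) < 1e-12:
        break
    x = x_next
# the iterates settle at x = 1, a directionally stationary point with f(1) = -1
```

Each step decreases f, since f lies below the surrogate everywhere and the two agree at the current iterate.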
Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic.
• Difference-of-convex: facilitates algorithmic design via convex majorization.
• Semismooth: enables fast algorithms of the Newton type.
• Bouligand differentiable: locally Lipschitz + directionally differentiable.
• Sub-analytic and semi-analytic.

* and many more
[Diagram reproduced from Section 4.4 ("Classes of Nonsmooth Functions", p. 231) of the monograph: the inclusion relations among the classes: PC², PC¹, semismooth, and B-differentiable functions; PQ, PLQ, and PA functions among the dc functions; dc-composite, diff-max-convex, icc, and locally dc functions; strongly weakly convex, C¹, LC¹, and lower-C² functions; all within the locally Lipschitz and directionally differentiable functions, on open or closed convex domains, under conditions (4.6.5) and (4.6.6).]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to the publisher on Friday, September 11, 2020.
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.1.9) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683

Bibliography 691
Index 721
Comments

• The problem can be treated by exact penalization, a theory that addresses stationary solutions rather than minimizers.

• Note the ℓ1 penalization:

   minimize_{z ∈ Z, u}  ζ_ρ(z, u) := (1/S) Σ_{s=1}^S ϕ_s(u^{s,L}) + ρ Σ_{s=1}^S Σ_{ℓ=1}^L Σ_{j=1}^N | u^{s,ℓ}_j − ψ_{ℓj}(z^ℓ, u^{s,ℓ−1}) |,

  where the last term penalizes the componentwise constraint residuals.

• Rigorous algorithms, with supporting convergence theory, for computing a directional stationary point can be developed.

• The framework allows the treatment of data perturbation.

• Leads to an advanced study of exact penalization and error bounds of nondifferentiable optimization problems.
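The exactness phenomenon behind the first bullet can be seen on a one-dimensional toy problem (ours, not the talk's model): minimize x² subject to x = 1. The ℓ1-penalized problem min x² + ρ|x − 1| recovers the constrained minimizer exactly once ρ reaches the multiplier's magnitude (here ρ = 2), not merely as ρ → ∞:

```python
import numpy as np

def penalized_min(rho):
    # minimize x^2 + rho * |x - 1| over a fine grid (brute force, toy scale)
    xs = np.linspace(-2.0, 2.0, 400001)
    return xs[np.argmin(xs ** 2 + rho * np.abs(xs - 1.0))]

# below the threshold rho = 2 the minimizer is x = rho/2, short of feasibility;
# at rho >= 2 it sits exactly at the feasible point x = 1 (exact penalty)
x_low, x_high = penalized_min(1.0), penalized_min(3.0)
```

The finite threshold is what makes the penalization "exact": a single fixed ρ suffices, which is what the stationarity-based theory exploits.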
Examples of Non-Problems
II. Smart Planning

Power Systems Planning

ω1: uncertain renewable energy generation; ω2: uncertain demand of electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

   minimize_{x ∈ X}  ϕ(x) + E_{ω1,ω2}[ ψ(x, ω1, ω2) ],

where the recourse function is

   ψ(x, ω1, ω2) := minimum_y  yᵀ c( [ω2 − x − ω1]_+ )
                   subject to  A(ω) x + D y ≥ ξ(ω).

The unit cost of the fast-ramping generators depends on:
▸ the observed renewable production and demand of electricity;
▸ the shortfall due to the first-stage thermal power decisions.
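A sample-average prototype of this two-stage structure is straightforward: draw scenarios, evaluate the recourse cost per scenario, and average. In the sketch below the recourse problem is collapsed to its closed-form solution (produce exactly the shortfall), and all numbers (costs, distributions) are invented stand-ins for ϕ, c, A(ω), D, ξ(ω):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4000
renew = rng.uniform(0.0, 2.0, N)       # scenarios of omega1 (renewable output)
demand = rng.normal(6.0, 1.0, N)       # scenarios of omega2 (demand)

def saa_objective(x):
    # first-stage thermal cost phi(x) plus the sample-average recourse cost;
    # the backup unit cost grows with the shortfall [omega2 - x - omega1]_+
    shortfall = np.maximum(demand - x - renew, 0.0)
    unit_cost = 5.0 + 2.0 * shortfall
    return 1.0 * x + np.mean(unit_cost * shortfall)

# crude search over first-stage production levels
grid = np.arange(0.0, 10.01, 0.25)
x_best = min(grid, key=saa_objective)
```

The shortfall-dependent unit cost is what makes the recourse value nonconvex in x, which is precisely why this example falls outside classical two-stage stochastic linear programming.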
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

   minimize_x  ζ(x) := ϕ(x) + E_ξ[ ψ(x, ξ) ]   subject to  x ∈ X ⊆ R^{n1},

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

   ψ(x, ξ) := minimum_y  [ f(ξ) + G(ξ) x ]ᵀ y + (1/2) yᵀ Q y
              subject to  y ∈ Y(x, ξ) := { y ∈ R^{n2} | A(ξ) x + D y ≥ b(ξ) }.

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a combined regularization-convexification sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize cᵀx subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ ∆, the defender undertakes the operation by solving

   minimize_{x ≥ 0}  (c + δ_c)ᵀ x + ρ ‖ [ (A + δ_A) x − b − δ_b ]_+ ‖_∞,

where the second term is the constraint residual.

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

   maximize_{δ ∈ ∆}  λ1 (c + δ_c)ᵀ x + λ2 ‖ [ (A + δ_A) x − b − δ_b ]_+ ‖_∞ − C(δ),

where the three terms respectively disrupt the objective, disrupt the constraints, and charge the cost of the attack.

• Many variations are possible.
• These offer alternative approaches to the convexity-based robust formulations.
• "Adversarial Programming."
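Both players' objectives are cheap to evaluate for a fixed pair (x, δ), which is handy when prototyping best-response schemes. A minimal sketch on tiny invented data, with λ1 = λ2 = 1:

```python
import numpy as np

A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])
rho, lam1, lam2 = 10.0, 1.0, 1.0

def residual(x, dA, db):
    # || [ (A + dA) x - b - db ]_+ ||_inf : the worst one-sided violation
    r = (A + dA) @ x - b - db
    return np.maximum(r, 0.0).max()

def defender_obj(x, dc, dA, db):
    return (c + dc) @ x + rho * residual(x, dA, db)

def attacker_obj(x, dc, dA, db, cost):
    return lam1 * (c + dc) @ x + lam2 * residual(x, dA, db) - cost

x = np.array([0.5, 0.5])                 # a candidate defender activity
dc = np.array([0.2, 0.0])                # cost disruption delta_c
dA = np.array([[0.1, 0.0]])              # matrix disruption delta_A
db = np.array([-0.1])                    # rhs disruption delta_b
d_obj = defender_obj(x, dc, dA, db)
a_obj = attacker_obj(x, dc, dA, db, 0.5)  # attack cost C(delta) = 0.5
```

The plus-part composed with the ∞-norm is piecewise affine in x, so the defender's objective is nonsmooth but directionally differentiable: a prototypical "non"-problem.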
Robustification of nonconvex problems

A general nonlinear program:

   minimize_{x ∈ S}  θ(x) + f(x)   subject to  H(x) = 0,

where
• S is a closed convex set in R^n, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in R^n containing S;
• H : O → R^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

   minimize_{x ∈ S}  θ(x) + maximum_{δ ∈ ∆} f(x, δ)
   subject to  H(x, δ) = 0 for all δ ∈ ∆,

a constraint that is easily infeasible in this way of modeling.
New formulation

via a pessimistic value-function minimization: given γ > 0,

   minimize_{x ∈ S}  θ(x) + v_γ(x),

where v_γ(x) models constraint violation together with the cost of model attacks:

   v_γ(x) := maximum_{δ ∈ ∆}  φ_γ(x, δ) := f(x, δ) + γ ‖ H(x, δ) ‖_∞ − C(δ),

with γ ‖H(x, δ)‖_∞ the constraint residual and C(δ) the attack cost, and with the novel features:

• a combined perturbation δ in a joint uncertainty set ∆ ⊆ R^L;
• constraint satisfaction robustified via penalization.
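With a finite uncertainty set, the pessimistic value function v_γ is directly computable, and minimizing θ + v_γ trades nominal cost against the penalized, cost-adjusted worst-case violation. A toy sketch; f, H, C, and ∆ below are invented for illustration:

```python
import numpy as np

gamma = 10.0
deltas = [(0.0, 0.0), (0.2, 0.3), (0.5, 1.0), (1.0, 3.0)]   # (delta, C(delta))

def f(x, d):
    return (1.0 + d) * x                  # perturbed objective piece

def H(x, d):
    return x + d - 1.0                    # perturbed equality constraint

def v(x):
    # pessimistic value function: max over delta of
    #   f(x, delta) + gamma * |H(x, delta)| - C(delta)
    return max(f(x, d) + gamma * abs(H(x, d)) - cost for d, cost in deltas)

def total(x):
    theta = (x - 2.0) ** 2                # nominal objective theta(x)
    return theta + v(x)

xs = np.linspace(-1.0, 3.0, 4001)
x_star = min(xs, key=total)
```

Note that v is a pointwise maximum of nonsmooth functions of x, so the unified problem is itself a "non"-problem; only its structure (piecewise, dc) makes it computable.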
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set ∆.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

   minimize_{x ∈ S}  ϕ(x) := θ(x) + v(x),   with   v(x) := max_{1 ≤ j ≤ J}  maximum_{δ ∈ ∆ ∩ R^L_+}  [ g_j(x, δ) − h_j(x)ᵀ δ ].

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of interval (τ u]
We have the following bound for an interval probability
P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)
Two risk minimization problems of a composite random functional
minimizeθisinΘ
In-CVaR βα (Z(θ))︸ ︷︷ ︸
a difference of two convexSP value functions
and minimizeθisinΘ
bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective
where
Z(θ) c
g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex
with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
A touch of mathematicsbridging computations with theory
First of All What Can be Computed
critical(for dc fncs)
Clarke statlimiting stat
directional stat(for dd fncs)
SONC
loc-min
SOSC
Relationship between the stationary points
I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]
eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable
I directional stationarity V prime(x d) ge 0 for all the directions d
bull The sharper the concept the harder to compute
Computational Tools
I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions
I A fusion of first-order and second-order algorithms
ndash Dealing with non-convexity majorization-minimization algorithm
ndash Dealing with non-smoothness nonsmooth Newton algorithm
ndash Dealing with pathological constraints (exact) penalty method
I An integration of sampling and deterministic methods
ndash Dealing with population optimization in statistics
ndash Dealing with stochastic programming in operations research
ndash Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Comments
bull Problem can be treated by exact penalization a theory that addressesstationary solutions rather than minimizers
bull Note the `1 penalization
minimizezisinZu
ζρ(zu) 1
S
Ssums=1
ϕs(u
sL) + ρLsum`=1
Nsumj=1
∣∣∣us`j minus ψ`j(z` us`minus1)∣∣∣︸ ︷︷ ︸
penalizing the components
bull Rigorous algorithms with supporting convergence theory for computing adirectional stationary point can be developed
bull Framework allows the treatment of data perturbation
bull Leads to an advanced study of exact penalization and error bounds ofnondifferentiable optimization problems
Examples of Non-Problems
II Smart Planning
Power Systems Planning
ω1 uncertain renewable energy generationω2 uncertain demand of electricity
x production from traditional thermal plants
y production from backup fast-ramping generators
minimizexisinX
ϕ(x) + Eω1ω2
[ψ(x ω1 ω2)
]where the recourse function is
ψ(x ω1 ω2) minimumy
ygtc([ω2 minus xminus ω1]+
)subject to A(ω)x+Dy ge ξ(ω)
Unit cost of fast-ramping generators depends on
I observed renewable production amp demand of electricity
I shortfall due to the first-stage thermal power decisions
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize cᵀx subject to Ax = b and x ≥ 0.

Defender's problem. Anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving the problem

minimize_{x ≥ 0}  (c + δ_c)ᵀ x + ρ ‖ [ (A + δ_A)x − b − δ_b ]_+ ‖_∞   (constraint residual).

Attacker's problem. Anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving the problem

maximum_{δ ∈ Δ}  λ1 (c + δ_c)ᵀ x (disrupt objective) + λ2 ‖ [ (A + δ_A)x − b − δ_b ]_+ ‖_∞ (disrupt constraints) − C(δ) (cost).

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
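A minimal numerical reading of the two objectives, assuming a finite perturbation set Δ so the attacker can be handled by enumeration; the data, the weights ρ, λ1, λ2, and the set Δ are all invented for illustration.

```python
import numpy as np

# Toy data for: minimize c^T x  s.t.  Ax = b, x >= 0  (values illustrative)
A = np.array([[1.0, 1.0]])
b = np.array([2.0])
c = np.array([1.0, 3.0])

def residual_inf(A_, b_, x):
    # || [A_ x - b_]_+ ||_inf : largest one-sided violation of the
    # perturbed equalities, mirroring the [.]_+ on the slide
    return np.max(np.maximum(A_ @ x - b_, 0.0))

def defender_obj(x, dc, dA, db, rho=10.0):
    return (c + dc) @ x + rho * residual_inf(A + dA, b + db, x)

def attacker_obj(x, dc, dA, db, lam1=1.0, lam2=5.0, cost=0.0):
    return lam1 * (c + dc) @ x + lam2 * residual_inf(A + dA, b + db, x) - cost

x = np.array([2.0, 0.0])                 # defender's nominal optimum
# Finite Delta: each entry is (delta_c, delta_A, delta_b, attack cost C(delta))
Delta = [(np.zeros(2), np.zeros((1, 2)), np.zeros(1), 0.0),
         (np.array([0.5, 0.0]), np.zeros((1, 2)), np.array([-1.0]), 1.0)]
best_attack = max(Delta, key=lambda d: attacker_obj(x, d[0], d[1], d[2], cost=d[3]))
```

The second perturbation shrinks the right-hand side, making the defender's chosen x infeasible for the perturbed system; its residual term outweighs its set-up cost, so enumeration selects it.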
Robustification of nonconvex problems

A general nonlinear program:

minimize_{x ∈ S}  θ(x) + f(x)  subject to  H(x) = 0,

where
• S is a closed convex set in R^n, not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in R^n containing S;
• H: O → R^m is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

minimize_{x ∈ S}  θ(x) + maximum_{δ ∈ Δ} f(x, δ)  subject to  H(x, δ) = 0 for all δ ∈ Δ   (easily infeasible in the current way of modeling).
New formulation

via a pessimistic value-function minimization: given γ > 0,

minimize_{x ∈ S}  θ(x) + v_γ(x),

where v_γ(x) models constraint violation together with the cost of model attacks:

v_γ(x) ≜ maximum over δ ∈ Δ of  φ_γ(x, δ) ≜ f(x, δ) + γ ‖H(x, δ)‖_∞ (constraint residual) − C(δ) (attack cost),

with the novel features:
• combined perturbation δ in a joint uncertainty set Δ ⊆ R^L;
• robustified constraint satisfaction via penalization.
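When Δ is finite (or finitely sampled), v_γ can be evaluated by direct enumeration, which makes the pessimistic formulation easy to prototype. A toy one-dimensional sketch with invented f, H, C, and Δ:

```python
import numpy as np

# v_gamma(x) = max_{delta in Delta} f(x, delta) + gamma*||H(x, delta)||_inf - C(delta)
gamma = 5.0
Delta = np.linspace(-1.0, 1.0, 21)        # finite stand-in uncertainty set

f = lambda x, d: (x - d) ** 2             # perturbed objective (illustrative)
H = lambda x, d: np.array([x + d - 1.0])  # perturbed equality constraint
C = lambda d: 2.0 * abs(d)                # attacker's cost of perturbation

def v(x):
    # pessimistic value function by enumeration over Delta
    return max(f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in Delta)

def robust_obj(x, theta=lambda x: 0.5 * x ** 2):
    return theta(x) + v(x)

grid = np.linspace(-2.0, 2.0, 81)         # crude outer minimization
x_star = min(grid, key=robust_obj)
```

Replacing the equality H(x) = 0 by the penalized residual inside v_γ is what keeps the formulation feasible, in contrast to the "H(x, δ) = 0 for all δ" counterpart above.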
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• Introduced a majorization-minimization (MM) algorithm for solving a unified formulation:

minimize_{x ∈ S}  φ(x) ≜ θ(x) + v(x),  with  v(x) ≜ max_{1 ≤ j ≤ J} maximum_{δ ∈ Δ ∩ R^L_+} [ g_j(x, δ) − h_j(x)ᵀ δ ].

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
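The MM idea is easiest to see on a one-dimensional difference-of-convex instance: since h is convex, replacing −h by its linearization at the current iterate gives an upper surrogate that is tight at that iterate, so minimizing the surrogate can only decrease the objective. The instance and the crude grid-based inner solve below are illustrative assumptions, not the algorithm of the talk.

```python
import numpy as np

# MM for phi(x) = theta(x) + g(x) - h(x), h convex (toy 1-D instance)
theta = lambda x: 0.5 * x ** 2
g = lambda x: abs(x - 1.0)
h = lambda x: 0.1 * x ** 2
dh = lambda x: 0.2 * x                    # derivative of the subtracted part

def mm(x0, iters=30):
    x = x0
    for _ in range(iters):
        xk = x
        # convexity of h gives h(x) >= h(xk) + dh(xk)(x - xk), hence the
        # surrogate below majorizes phi and touches it at xk
        surrogate = lambda x: theta(x) + g(x) - (h(xk) + dh(xk) * (x - xk))
        grid = xk + np.linspace(-1.0, 1.0, 2001)   # crude inner minimization
        x = min(grid, key=surrogate)
    return x

x_hat = mm(0.0)
phi = lambda x: theta(x) + g(x) - h(x)
```

For this instance φ(x) = 0.4x² + |x − 1| has its minimizer at x = 1, where the kink of g makes the point a nonsmooth stationary solution; the MM iterates reach it and stay there.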
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

In-CVaR_α^β(Z) ≜ (1/(β − α)) ∫_{VaR_β(Z) > z ≥ VaR_α(Z)} z dF_Z(z)
             = (1/(β − α)) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ].

Buffered probability of exceedance/deceedance at level τ > 0:

bPOE(Z, τ) ≜ 1 − min{ α ∈ [0, 1] : CVaR_α(Z) ≥ τ },
bPOD(Z, u) ≜ min{ β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }.
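The identity defining In-CVaR can be checked empirically on a finite sample when (1 − α)n and (1 − β)n are integers, so that the empirical tail averages are exact; the sample values below are arbitrary.

```python
import numpy as np

# Empirical check of
#   In-CVaR_alpha^beta(Z) = [(1-alpha) CVaR_alpha(Z) - (1-beta) CVaR_beta(Z)] / (beta - alpha)
z = np.sort(np.array([0.3, 1.1, 2.4, 2.9, 3.6, 4.2, 5.0, 6.3, 7.7, 9.1]))[::-1]
n, alpha, beta = len(z), 0.5, 0.9
k_a = round((1 - alpha) * n)       # 5: size of the worst-50% tail
k_b = round((1 - beta) * n)        # 1: size of the worst-10% tail

cvar_a = z[:k_a].mean()            # average of the worst 50% of outcomes
cvar_b = z[:k_b].mean()            # average of the worst 10% of outcomes
in_cvar_direct = z[k_b:k_a].mean() # average between the two VaR levels
in_cvar_formula = ((1 - alpha) * cvar_a - (1 - beta) * cvar_b) / (beta - alpha)
```

With CVaR at level α read as the average of the worst (1 − α) fraction of outcomes, the interval average between the two tails matches the stated combination of the two CVaRs exactly.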
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

P(u ≥ Z > τ) ≤ bPOE(Z, τ) + bPOD(Z, u) − 1 ≜ bPOI_τ^u(Z).

Two risk minimization problems of a composite random functional:

minimize_{θ ∈ Θ}  In-CVaR_α^β(Z(θ))   (a difference of two convex SP value functions)

and

minimize_{θ ∈ Θ}  bPOI_τ^u(Z(θ))   (a stochastic program with a difference-of-convex objective),

where

Z(θ) ≜ c( g(θ, Z) − h(θ, Z) ),   with g(•, z) − h(•, z) a difference of convex functions,

c: R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
Comments

• The In-CVaR minimization leads to the following stochastic program:

minimize over θ ∈ Θ of  minimum_{η1 ∈ Υ1} E_ω[ φ1(θ, η1, ω) ] (denoted v1(θ))  −  minimum_{η2 ∈ Υ2} E_ω[ φ2(θ, η2, ω) ] (denoted v2(θ)).

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All: What Can be Computed?

[Diagram: relationship between the stationary points. Implications run from the sharpest to the weakest notion: local minimum ⟹ directional stationary (for dd functions) ⟹ Clarke stationary / limiting stationary ⟹ critical (for dc functions); the second-order conditions bracket local minimality: SOSC ⟹ local minimum ⟹ SONC.]
– Stationarity:
  smooth case: ∇V(x) = 0;
  nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g. ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of them is continuously differentiable.
– Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute.
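A one-line example separating the concepts: for V(x) = −|x| (a dc function), the point x = 0 is Clarke stationary, since 0 ∈ ∂V(0) = [−1, 1], yet every direction is a descent direction, so x = 0 is not directionally stationary. A numerical check, using a forward-difference approximation of V′(0; d) as an illustrative shortcut:

```python
# V(x) = -|x|: Clarke stationary at 0 (0 lies in [-1, 1] = conv{-1, +1}),
# but V'(0; d) = -|d| < 0 for every d != 0, so NOT directionally stationary.
V = lambda x: -abs(x)

def dir_deriv(x, d, t=1e-8):
    # one-sided directional derivative via a small forward step
    return (V(x + t * d) - V(x)) / t

descent_found = any(dir_deriv(0.0, d) < -0.5 for d in (-1.0, 1.0))
```

This is exactly why directional stationarity is the sharper target: a method that stops at Clarke-stationary points can halt at x = 0 even though moving in either direction strictly decreases V.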
Computational Tools

– Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
– A fusion of first-order and second-order algorithms:
  dealing with non-convexity: majorization-minimization algorithm;
  dealing with non-smoothness: nonsmooth Newton algorithm;
  dealing with pathological constraints: (exact) penalty method.
– An integration of sampling and deterministic methods:
  dealing with population optimization in statistics;
  dealing with stochastic programming in operations research;
  dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Subanalytic and semianalytic

* and many more
[Figure (from Section 4.4, "Classes of Nonsmooth Functions", of the monograph): a diagram of inclusions among the classes — PC² ⊂ PC¹ ⊂ semismooth ⊂ B-differentiable; piecewise affine (PA) ⊂ piecewise linear-quadratic (PLQ); with dc, dc-composite, diff-max-convex, implicit convex-concave (icc), weakly convex, and locally dc functions all contained among the locally Lipschitz, directionally differentiable functions.]
Completed Monograph

Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to the publisher on Friday, September 11, 2020.
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683

Bibliography 691
Index 721
Examples of Non-Problems
II Smart Planning
Power Systems Planning
ω1 uncertain renewable energy generationω2 uncertain demand of electricity
x production from traditional thermal plants
y production from backup fast-ramping generators
minimizexisinX
ϕ(x) + Eω1ω2
[ψ(x ω1 ω2)
]where the recourse function is
ψ(x ω1 ω2) minimumy
ygtc([ω2 minus xminus ω1]+
)subject to A(ω)x+Dy ge ξ(ω)
Unit cost of fast-ramping generators depends on
I observed renewable production amp demand of electricity
I shortfall due to the first-stage thermal power decisions
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
(Simplified) Defender-Attacker Game with Costs
Underlying problem is minimize cgtx subject to Ax = b and x ge 0
Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem
minimizexge0
(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb
]+
∥∥∥infin︸ ︷︷ ︸
constraint residual
Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem
maximumδ isin∆
λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective
+λ2
∥∥∥ [ (A+ δA)xminus bminus δb]+
∥∥∥infin︸ ︷︷ ︸
disrupt constraints
minusC(δ)︸︷︷︸cost
bull Many variations are possible
bull Offer alternative approaches to the convexity-based robust formulations
bull Adversarial Programming
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation
via a pessimistic value-function minimization given γ gt 0
minimizexisinS
θ(x) + vγ(x)
where vγ(x) models constraint violation with cost of model attacks
vγ(x) maximum over δ isin ∆ of
φγ(x δ) f(x δ) + γ
H(x δ) infin︸ ︷︷ ︸constraint residual
minus C(δ)︸ ︷︷ ︸attack cost
with the novel features
bull combined perturbation δ in joint uncertainty set ∆ sube RL
bull robustify constraint satisfaction via penalization
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of interval (τ u]
We have the following bound for an interval probability
P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)
Two risk minimization problems of a composite random functional
minimizeθisinΘ
In-CVaR βα (Z(θ))︸ ︷︷ ︸
a difference of two convexSP value functions
and minimizeθisinΘ
bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective
where
Z(θ) c
g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex
with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
A touch of mathematicsbridging computations with theory
First of All What Can be Computed
critical(for dc fncs)
Clarke statlimiting stat
directional stat(for dd fncs)
SONC
loc-min
SOSC
Relationship between the stationary points
I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]
eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable
I directional stationarity V prime(x d) ge 0 for all the directions d
bull The sharper the concept the harder to compute
Computational Tools
I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions
I A fusion of first-order and second-order algorithms
ndash Dealing with non-convexity majorization-minimization algorithm
ndash Dealing with non-smoothness nonsmooth Newton algorithm
ndash Dealing with pathological constraints (exact) penalty method
I An integration of sampling and deterministic methods
ndash Dealing with population optimization in statistics
ndash Dealing with stochastic programming in operations research
ndash Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minimax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683
Bibliography 691
Index 721
Power Systems Planning

ω₁: uncertain renewable energy generation; ω₂: uncertain demand for electricity
x: production from traditional thermal plants
y: production from backup fast-ramping generators

$\displaystyle\operatorname*{minimize}_{x \in X}\; \phi(x) + \mathbb{E}_{\omega_1,\omega_2}\big[\psi(x,\omega_1,\omega_2)\big]$,

where the recourse function is

$\psi(x,\omega_1,\omega_2) \triangleq \operatorname*{minimum}_{y}\; y^\top c\big(\,[\,\omega_2 - x - \omega_1\,]_+\,\big)$ subject to $A(\omega)x + Dy \ge \xi(\omega)$.

The unit cost of the fast-ramping generators depends on
• the observed renewable production and demand for electricity;
• the shortfall due to the first-stage thermal power decisions.
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

$\displaystyle\operatorname*{minimize}_{x}\; \zeta(x) \triangleq \phi(x) + \mathbb{E}_\xi\big[\psi(x,\xi)\big]$ subject to $x \in X \subseteq \mathbb{R}^{n_1}$,

where the recourse function $\psi(x,\xi)$ is the optimal objective value of the quadratic program (QP), with $Q$ symmetric positive semidefinite:

$\psi(x,\xi) \triangleq \operatorname*{minimum}_{y}\; \big[\,f(\xi) + G(\xi)x\,\big]^\top y + \tfrac{1}{2}\, y^\top Q y$ subject to $y \in Y(x,\xi) \triangleq \{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \ge b(\xi) \,\}$.

• The recourse function $\psi(\bullet,\xi)$ is of the implicit convex-concave kind.

• Developed a regularization-convexification sampling combined method for computing a generalized critical point, with an almost-sure convergence guarantee.
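The two-stage structure can be made concrete on a toy instance. The sketch below is illustrative only (it is not the method of the talk): it uses a one-dimensional recourse $\psi(x,\xi) = \min_{y \ge \xi - x}\, qy + \tfrac{1}{2}Qy^2$, whose optimal value has a closed form, a hypothetical first-stage cost $\phi(x) = x^2$, and a sample-average approximation of $\zeta$:

```python
import numpy as np

def recourse(x, xi, q=1.0, Q=2.0):
    """Optimal value of min_y { q*y + 0.5*Q*y^2 : y >= xi - x } (1-D toy QP).

    The unconstrained minimizer is y* = -q/Q; if it is feasible the value is
    -q^2/(2Q); otherwise the constraint y = xi - x is active."""
    y_unc = -q / Q
    y = y_unc if y_unc >= xi - x else xi - x
    return q * y + 0.5 * Q * y * y

def saa_objective(x, xis, phi=lambda x: x**2):
    """Sample-average approximation of zeta(x) = phi(x) + E_xi[psi(x, xi)]."""
    return phi(x) + np.mean([recourse(x, xi) for xi in xis])

rng = np.random.default_rng(0)
xis = rng.normal(loc=1.0, scale=0.5, size=1000)   # samples of the uncertainty
vals = {x: saa_objective(x, xis) for x in np.linspace(0.0, 2.0, 21)}
x_best = min(vals, key=vals.get)                  # crude grid minimizer of the SAA
```

The closed-form inner solve stands in for the QP subproblems that a real implementation would pass to a convex solver.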
(Simplified) Defender-Attacker Game with Costs

The underlying problem is: minimize $c^\top x$ subject to $Ax = b$ and $x \ge 0$.

Defender's problem. Anticipating the attacker's disruption $\delta \in \Delta$, the defender undertakes the operation by solving

$\displaystyle\operatorname*{minimize}_{x \ge 0}\; (c + \delta_c)^\top x + \rho\, \underbrace{\big\|\, \big[\, (A + \delta_A)x - b - \delta_b \,\big]_+ \big\|_\infty}_{\text{constraint residual}}$.

Attacker's problem. Anticipating the defender's activity $x$, the attacker aims to disrupt the defender's operations by solving

$\displaystyle\operatorname*{maximum}_{\delta \in \Delta}\; \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} + \lambda_2 \underbrace{\big\|\, \big[\, (A + \delta_A)x - b - \delta_b \,\big]_+ \big\|_\infty}_{\text{disrupt constraints}} - \underbrace{C(\delta)}_{\text{cost}}$.

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
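The penalized residual in the defender's objective is easy to evaluate for a given attack. A minimal sketch (toy numbers of my own choosing, not from the talk):

```python
import numpy as np

def defender_objective(x, c, A, b, delta_c, delta_A, delta_b, rho):
    """(c + delta_c)^T x + rho * || [ (A + delta_A) x - b - delta_b ]_+ ||_inf."""
    residual = (A + delta_A) @ x - b - delta_b      # signed constraint violation
    penalty = np.max(np.maximum(residual, 0.0))     # inf-norm of the positive part
    return (c + delta_c) @ x + rho * penalty

# Toy instance: x satisfies Ax = b, so with no attack only the linear cost remains;
# an attack on b creates a residual of 0.5 that the penalty term picks up.
x = np.array([1.0, 1.0]); c = np.array([1.0, 1.0])
A = np.eye(2); b = np.array([1.0, 1.0])
no_attack = defender_objective(x, c, A, b, np.zeros(2), np.zeros((2, 2)), np.zeros(2), rho=10.0)
attack_b = defender_objective(x, c, A, b, np.zeros(2), np.zeros((2, 2)), np.array([-0.5, 0.0]), rho=10.0)
```

Note that only one-sided violations are penalized, matching the $[\,\cdot\,]_+$ in the slide.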
Robustification of nonconvex problems

A general nonlinear program:

$\displaystyle\operatorname*{minimize}_{x \in S}\; \theta(x) + f(x)$ subject to $H(x) = 0$, where

• $S$ is a closed convex set in $\mathbb{R}^n$, not subject to alterations;
• $\theta$ and $f$ are real-valued functions defined on an open convex set $\mathcal{O}$ in $\mathbb{R}^n$ containing $S$;
• $H : \mathcal{O} \to \mathbb{R}^m$ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints $g(x) \le 0$ and equality constraints $h(x) = 0$.

Robust counterpart with separate perturbations:

$\displaystyle\operatorname*{minimize}_{x \in S}\; \theta(x) + \operatorname*{maximum}_{\delta \in \Delta}\; f(x,\delta)$ subject to $H(x,\delta) = 0$ for all $\delta \in \Delta$ (easily infeasible in the current way of modeling).
New formulation

via a pessimistic value-function minimization: given $\gamma > 0$,

$\displaystyle\operatorname*{minimize}_{x \in S}\; \theta(x) + v_\gamma(x)$,

where $v_\gamma(x)$ models constraint violation with the cost of model attacks:

$v_\gamma(x) \triangleq \operatorname*{maximum}_{\delta \in \Delta}\; \varphi_\gamma(x,\delta) \triangleq f(x,\delta) + \gamma \underbrace{\|\, H(x,\delta) \,\|_\infty}_{\text{constraint residual}} - \underbrace{C(\delta)}_{\text{attack cost}}$,

with the novel features:
• a combined perturbation $\delta$ in a joint uncertainty set $\Delta \subseteq \mathbb{R}^L$;
• robustified constraint satisfaction via penalization.
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of $f(\bullet,\delta)$ and $H(\bullet,\delta)$, a linear structure of $C(\delta)$ with a set-up component, and a special simplex-type uncertainty set $\Delta$.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

$\displaystyle\operatorname*{minimize}_{x \in S}\; \varphi(x) \triangleq \theta(x) + v(x)$, with $v(x) \triangleq \max_{1 \le j \le J}\; \operatorname*{maximum}_{\delta \in \Delta \cap \mathbb{R}^L_+}\; \big[\, g_j(x,\delta) - h_j(x)^\top \delta \,\big]$.

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III. Operations Research meets Statistics
Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable $Z$ at levels $0 \le \alpha < \beta \le 1$, treating two-sided estimation errors:

$\text{In-CVaR}_\alpha^\beta(Z) \triangleq \dfrac{1}{\beta - \alpha} \displaystyle\int_{\text{VaR}_\alpha(Z) \le z < \text{VaR}_\beta(Z)} z\, dF_Z(z) = \dfrac{1}{\beta - \alpha} \big[\, (1 - \alpha)\,\text{CVaR}_\alpha(Z) - (1 - \beta)\,\text{CVaR}_\beta(Z) \,\big]$.

Buffered probability of exceedance/deceedance at level $\tau > 0$:

$\text{bPOE}(Z, \tau) \triangleq 1 - \min\{\, \alpha \in [0,1] : \text{CVaR}_\alpha(Z) \ge \tau \,\}$,
$\text{bPOD}(Z, u) \triangleq \min\{\, \beta \in [0,1] : \text{In-CVaR}_0^\beta(Z) \ge u \,\}$.
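The two-CVaR identity for In-CVaR can be checked directly on an empirical sample. A minimal sketch (sample-based definitions of my own, with $\alpha$ and $\beta$ assumed to be multiples of $1/n$ so the quantile bands are exact):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the largest (1 - alpha) fraction of samples."""
    z = np.sort(np.asarray(z, dtype=float))
    k = int(round((1.0 - alpha) * len(z)))   # number of upper-tail samples
    return z[-k:].mean() if k > 0 else z[-1]

def in_cvar(z, alpha, beta):
    """In-CVaR via the identity [(1 - a) CVaR_a - (1 - b) CVaR_b] / (b - a)."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 11.0)           # samples 1, 2, ..., 10
band = np.sort(z)[2:8].mean()      # direct average over the (0.2, 0.8] quantile band
identity = in_cvar(z, 0.2, 0.8)    # same quantity via the CVaR identity
```

On this sample both routes give 5.5, the average of the middle six observations.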
Buffered probability of the interval $(\tau, u]$

We have the following bound for an interval probability:

$\mathbb{P}(u \ge Z > \tau) \le \text{bPOE}(Z, \tau) + \text{bPOD}(Z, u) - 1 \eqqcolon \text{bPOI}_\tau^u(Z)$.

Two risk minimization problems of a composite random functional:

$\displaystyle\operatorname*{minimize}_{\theta \in \Theta}\; \underbrace{\text{In-CVaR}_\alpha^\beta(Z(\theta))}_{\text{a difference of two convex SP value functions}}$ and $\displaystyle\operatorname*{minimize}_{\theta \in \Theta}\; \underbrace{\text{bPOI}_\tau^u(Z(\theta))}_{\text{a stochastic program with a difference-of-convex objective}}$,

where

$Z(\theta) \triangleq c\big(\underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet, z) - h(\bullet, z)\ \text{is a difference of convex functions}}\big)$,

with $c : \mathbb{R} \to \mathbb{R}$ a univariate piecewise affine function, $Z$ an $m$-dimensional random vector, and, for each $z$, $g(\bullet, z)$ and $h(\bullet, z)$ both convex functions.
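The bPOE definition above can be evaluated on a sample by scanning levels $\alpha$, since empirical $\text{CVaR}_\alpha$ is nondecreasing in $\alpha$. A small sketch (my own sample-based implementation, scanning the grid $\alpha = 0, 1/n, \ldots, 1$):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: average of the largest (1 - alpha) fraction of samples."""
    z = np.sort(np.asarray(z, dtype=float))
    k = int(round((1.0 - alpha) * len(z)))
    return z[-k:].mean() if k > 0 else z[-1]

def bpoe(z, tau):
    """bPOE(Z, tau) = 1 - min{alpha : CVaR_alpha(Z) >= tau}, found by scanning
    alpha upward on the grid j/n (valid because CVaR_alpha is nondecreasing)."""
    n = len(z)
    for j in range(n + 1):
        alpha = j / n
        if cvar(z, alpha) >= tau:
            return 1.0 - alpha
    return 0.0   # tau exceeds the largest sample: no level qualifies

z = np.arange(1.0, 11.0)
tail = bpoe(z, 9.5)   # smallest alpha with CVaR_alpha >= 9.5 is 0.8, so bPOE = 0.2
```

The grid scan is for transparency; a bisection on $\alpha$ would do the same work in $O(\log n)$ evaluations.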
Comments

• The In-CVaR minimization leads to the following stochastic program:

minimize over $\theta \in \Theta$ of $\underbrace{\operatorname*{minimum}_{\eta_1 \in \Upsilon_1}\; \mathbb{E}_\omega\big[\varphi_1(\theta, \eta_1, \omega)\big]}_{\text{denoted } v_1(\theta)} - \underbrace{\operatorname*{minimum}_{\eta_2 \in \Upsilon_2}\; \mathbb{E}_\omega\big[\varphi_2(\theta, \eta_2, \omega)\big]}_{\text{denoted } v_2(\theta)}$.

Both $v_1$ and $v_2$ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex programming based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of All: What Can Be Computed?

[Diagram: the computable stationarity concepts — critical points (for dc functions), Clarke stationary points, limiting stationary points, and directional stationary points (for dd functions) — and their relation to the second-order necessary condition (SONC), local minima, and the second-order sufficient condition (SOSC).]

Relationship between the stationary points:

• Stationarity:
  – smooth case: $\nabla V(x) = 0$;
  – nonsmooth case: $0 \in \partial V(x)$ [subdifferential; many calculus rules fail], e.g., $\partial V_1(x) + \partial V_2(x) \ne \partial (V_1 + V_2)(x)$ unless one of the two functions is continuously differentiable.

• Directional stationarity: $V'(x; d) \ge 0$ for all directions $d$.

• The sharper the concept, the harder it is to compute.
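A tiny numeric illustration of why directional stationarity is sharper (my example, not from the talk): for $V(x) = -|x|$, the point $x = 0$ is Clarke stationary, since $0 \in \partial_C V(0) = [-1, 1]$, yet $V'(0; d) = -|d| < 0$ for every $d \ne 0$, so it fails the directional test:

```python
def dir_deriv(f, x, d, t=1e-7):
    """One-sided directional derivative f'(x; d) by a forward difference."""
    return (f(x + t * d) - f(x)) / t

V = lambda x: -abs(x)
# x = 0 is Clarke stationary for V, yet both directions strictly descend:
down_right = dir_deriv(V, 0.0, 1.0)    # approximately -1.0
down_left = dir_deriv(V, 0.0, -1.0)    # approximately -1.0
```

A subgradient-based method can stall at such a point, while a method certifying directional stationarity would move away from it.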
Computational Tools

• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

• A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithm;
  – dealing with non-smoothness: nonsmooth Newton algorithm;
  – dealing with pathological constraints: (exact) penalty method.

• An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics;
  – dealing with stochastic programming in operations research;
  – dealing with risk minimization in financial management and tail control.
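The majorization-minimization idea for a difference-of-convex objective fits in a few lines. Below is a generic illustration (the toy objective $f(x) = x^2 - |x|$ and the code are mine, not the monograph's): at iterate $x_k$, the concave part $-|x|$ is replaced by its linearization $-s_k x$ with $s_k \in \partial |x_k|$, giving a convex upper surrogate that is tight at $x_k$ and minimized in closed form:

```python
def dca(x0, iters=50):
    """dc algorithm for f(x) = x**2 - abs(x): linearize abs at x_k with a
    subgradient s_k, then minimize the convex surrogate x**2 - s_k*x exactly."""
    x = x0
    for _ in range(iters):
        s = 1.0 if x > 0 else (-1.0 if x < 0 else 1.0)  # subgradient of |.| at x
        x = s / 2.0                                      # argmin of x^2 - s*x
    return x

x_star = dca(2.0)   # converges to the critical point 0.5, where f = -0.25
```

Since $|s_k| \le 1$, the surrogate $x^2 - s_k x$ indeed majorizes $f$, so each step decreases $f$; the iterates reach $\pm 1/2$, which are (global, in this toy case) d-stationary points.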
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic;
• Difference-of-convex: facilitates algorithmic design via convex majorization;
• Semismooth: enables fast algorithms of the Newton type;
• Bouligand differentiable: locally Lipschitz + directionally differentiable;
• Sub-analytic and semi-analytic.

* and many more
[Diagram (Section 4.4 of the monograph, "Classes of Nonsmooth Functions"): inclusion relationships among nonsmooth function classes, relating PC², PC¹, semismooth, and B-differentiable functions; piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), and piecewise affine (PA) functions; dc, dc composite, and diff-max-convex functions; implicit convex-concave (icc), weakly convex, locally weakly convex, locally dc, LC¹, and lower-C² functions; and locally Lipschitz + directionally differentiable functions, under side conditions such as C¹ smoothness and open or closed convex domains.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
(Simplified) Defender-Attacker Game with Costs
Underlying problem is minimize cgtx subject to Ax = b and x ge 0
Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem
minimizexge0
(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb
]+
∥∥∥infin︸ ︷︷ ︸
constraint residual
Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem
maximumδ isin∆
λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective
+λ2
∥∥∥ [ (A+ δA)xminus bminus δb]+
∥∥∥infin︸ ︷︷ ︸
disrupt constraints
minusC(δ)︸︷︷︸cost
bull Many variations are possible
bull Offer alternative approaches to the convexity-based robust formulations
bull Adversarial Programming
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation
via a pessimistic value-function minimization given γ gt 0
minimizexisinS
θ(x) + vγ(x)
where vγ(x) models constraint violation with cost of model attacks
vγ(x) maximum over δ isin ∆ of
φγ(x δ) f(x δ) + γ
H(x δ) infin︸ ︷︷ ︸constraint residual
minus C(δ)︸ ︷︷ ︸attack cost
with the novel features
bull combined perturbation δ in joint uncertainty set ∆ sube RL
bull robustify constraint satisfaction via penalization
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of interval (τ u]
We have the following bound for an interval probability
P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)
Two risk minimization problems of a composite random functional
minimizeθisinΘ
In-CVaR βα (Z(θ))︸ ︷︷ ︸
a difference of two convexSP value functions
and minimizeθisinΘ
bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective
where
Z(θ) c
g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex
with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

\[
\underbrace{\operatorname*{minimum}_{\eta_1 \in \Upsilon_1}\; \mathbb{E}_{\omega}\big[\,\varphi_1(\theta,\eta_1,\omega)\,\big]}_{\text{denoted } v_1(\theta)}
\;-\;
\underbrace{\operatorname*{minimum}_{\eta_2 \in \Upsilon_2}\; \mathbb{E}_{\omega}\big[\,\varphi_2(\theta,\eta_2,\omega)\,\big]}_{\text{denoted } v_2(\theta)}
\]

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
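The difference-of-convex structure v₁ − v₂ is what makes convex surrogation possible: linearizing the subtracted convex function at the current iterate yields a convex upper model whose minimizer is the next iterate. A minimal deterministic DCA-style iteration on a toy one-dimensional d.c. function (an illustration of the principle only, not the sampling method of the talk):

```python
def dca_1d(g_argmin_shifted, h_subgrad, x0, iters=50):
    """Minimize g(x) - h(x) by repeatedly solving the convex surrogate
    x_{k+1} = argmin_x  g(x) - s_k * x,   with  s_k in ∂h(x_k)."""
    x = x0
    for _ in range(iters):
        s = h_subgrad(x)          # subgradient of the subtracted convex part
        x = g_argmin_shifted(s)   # exact minimizer of the convex surrogate
    return x

# toy d.c. function: f(x) = x^2 - 2|x|, with global minima at x = ±1
g_argmin = lambda s: s / 2.0               # argmin_x (x^2 - s x) = s/2
h_sub = lambda x: 2.0 if x >= 0 else -2.0  # a subgradient of 2|x|
x_star = dca_1d(g_argmin, h_sub, x0=0.5)
# from x0 = 0.5 the iteration reaches the critical point x = 1 in one step
```

Note that the method converges to a critical point whose sign is inherited from the starting point, which is exactly the "critical (for dc functions)" notion discussed later in the talk.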
A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: relationship between the stationary points — directional stationarity (for dd functions) implies Clarke and limiting stationarity, which imply criticality (for dc functions); local minimizers satisfy the second-order necessary condition (SONC), and the second-order sufficient condition (SOSC) implies local minimality.]

▸ Stationarity
  - smooth case: ∇V(x) = 0
  - nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the functions is continuously differentiable

▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder it is to compute.
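The gap between the concepts is visible already in one dimension: at x = 0 the function −|x| is Clarke stationary, since 0 ∈ ∂C(−|·|)(0) = [−1, 1], yet it fails directional stationarity because its directional derivative is negative in every direction. A finite-difference check (illustrative):

```python
def dir_deriv(f, x, d, t=1e-8):
    # one-sided directional derivative f'(x; d) by forward difference
    return (f(x + t * d) - f(x)) / t

f = lambda x: -abs(x)   # Clarke stationary at 0, but NOT d-stationary
g = lambda x: abs(x)    # d-stationary (indeed a minimizer) at 0

fd_plus, fd_minus = dir_deriv(f, 0.0, 1.0), dir_deriv(f, 0.0, -1.0)
gd_plus, gd_minus = dir_deriv(g, 0.0, 1.0), dir_deriv(g, 0.0, -1.0)
# f'(0; ±1) = -1 < 0: descent directions exist even though 0 ∈ ∂_C f(0)
# g'(0; ±1) = +1 ≥ 0: no first-order descent direction at the minimizer
```

This is the sense in which directional stationarity is the sharper, and computationally more demanding, target.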
Computational Tools

▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions

▸ A fusion of first-order and second-order algorithms
  - Dealing with non-convexity: majorization-minimization algorithm
  - Dealing with non-smoothness: nonsmooth Newton algorithm
  - Dealing with pathological constraints: (exact) penalty method

▸ An integration of sampling and deterministic methods
  - Dealing with population optimization in statistics
  - Dealing with stochastic programming in operations research
  - Dealing with risk minimization in financial management and tail control
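The majorization-minimization idea in the first bullet can be sketched in a few lines: for an L-smooth nonconvex f, the quadratic f(xₖ) + f′(xₖ)(y − xₖ) + (L/2)(y − xₖ)² is an upper surrogate, and minimizing it gives the next iterate, so the objective can never increase. (Toy instance; the function and the bound L = 5 on |f″| are chosen purely for illustration. The resulting step is plain gradient descent with step 1/L, derived as an MM scheme.)

```python
import math

f = lambda x: x * x + 3.0 * math.sin(x)    # nonconvex: f'' = 2 - 3 sin x changes sign
df = lambda x: 2.0 * x + 3.0 * math.cos(x)
L = 5.0                                     # global bound on |f''|

x = 2.0
values = [f(x)]
for _ in range(200):
    # minimizing the quadratic majorizer f(x) + df(x)(y - x) + (L/2)(y - x)^2
    # over y gives y = x - df(x)/L; majorization forces f(y) <= f(x)
    x = x - df(x) / L
    values.append(f(x))
# `values` is nonincreasing and x approaches a stationary point (df(x) -> 0)
```

The descent guarantee f(xₖ₊₁) ≤ f(xₖ) − f′(xₖ)²/(2L) holds at every iterate, convex or not; this is the workhorse behind the surrogation algorithms of Chapter 7 of the monograph.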
Classes of Nonsmooth Functions

Pervasive in applications. Progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more
[Figure from Section 4.4 of the monograph (p. 231): inclusion diagram of the classes of nonsmooth functions, relating piecewise affine (PA), piecewise quadratic (PQ), and piecewise linear-quadratic (PLQ) functions, PC² and PC¹ functions, semismooth and B(ouligand)-differentiable functions (locally Lipschitz + directionally differentiable), dc and dc-composite functions, diff-max-convex functions, implicit convex-concave (icc) functions, strongly weakly convex, locally dc, and lower-C² functions, with qualifications involving C¹/LC¹ smoothness, open or closed convex domains, and conditions (4.65)–(4.66) of the text.]
Completed Monograph

Modern Nonconvex Nondifferentiable Optimization

Ying Cui and Jong-Shi Pang

Submitted to the publisher on Friday, September 11, 2020.
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̃c 683
Bibliography 691
Index 721
Comments

• Leads to the study of the class of two-stage linearly bi-parameterized stochastic programs with quadratic recourse:

\[
\operatorname*{minimize}_{x}\; \zeta(x) \,\triangleq\, \varphi(x) + \mathbb{E}_{\xi}\big[\,\psi(x,\xi)\,\big]
\quad \text{subject to } x \in X \subseteq \mathbb{R}^{n_1},
\]

where the recourse function ψ(x, ξ) is the optimal objective value of the quadratic program (QP), with Q symmetric positive semidefinite:

\[
\psi(x,\xi) \,\triangleq\, \operatorname*{minimum}_{y}\; \big[\, f(\xi) + G(\xi)x \,\big]^{\top} y + \tfrac{1}{2}\, y^{\top} Q\, y
\quad \text{subject to } y \in Y(x,\xi) \,\triangleq\, \big\{\, y \in \mathbb{R}^{n_2} \mid A(\xi)x + Dy \geq b(\xi) \,\big\}.
\]

• The recourse function ψ(•, ξ) is of the implicit convex-concave kind.

• Developed a combined regularization-convexification-sampling method for computing a generalized critical point, with an almost-sure convergence guarantee.
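Since ψ(x, ξ) is itself the optimal value of a convex QP, a sample-average approximation of E_ξ[ψ(x, ξ)] simply averages QP optimal values over sampled scenarios. A minimal sketch with a one-dimensional recourse variable, using `scipy.optimize.minimize` as a stand-in QP solver (all problem data here are made-up illustrations, not from the talk):

```python
import numpy as np
from scipy.optimize import minimize

def psi(x, xi, Q=1.0, D=1.0, b=0.0):
    """Optimal value of the recourse QP
    min_y (f(xi) + G(xi) x)^T y + 0.5 y^T Q y   s.t.  A(xi) x + D y >= b(xi),
    here with scalar y, f(xi) = xi, G = 0, A = 1 (illustrative data)."""
    lin = xi                                   # f(xi) + G(xi) x with G = 0
    obj = lambda y: lin * y[0] + 0.5 * Q * y[0] ** 2
    con = {"type": "ineq", "fun": lambda y: x + D * y[0] - b}
    res = minimize(obj, x0=[0.0], method="SLSQP", constraints=[con])
    return res.fun

def saa(x, xis):
    # sample-average approximation of E_xi[psi(x, xi)]
    return np.mean([psi(x, xi) for xi in xis])

# with x = 5 the constraint y >= -5 is inactive for xi = 1, so psi(5, 1)
# equals the unconstrained QP value -xi^2 / 2 = -0.5
val = psi(5.0, 1.0)
```

The parametric dependence of the optimal value on x (through the constraint set Y(x, ξ)) is what creates the implicit convex-concave structure noted above.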
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize cᵀx subject to Ax = b and x ≥ 0.

Defender's problem: anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving

\[
\operatorname*{minimize}_{x \geq 0}\; (c + \delta_c)^{\top} x \,+\, \rho\, \underbrace{\Big\| \big[\, (A + \delta_A)x - b - \delta_b \,\big]_{+} \Big\|_{\infty}}_{\text{constraint residual}}
\]

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving

\[
\operatorname*{maximum}_{\delta \in \Delta}\; \lambda_1 \underbrace{(c + \delta_c)^{\top} x}_{\text{disrupt objective}}
\,+\, \lambda_2 \underbrace{\Big\| \big[\, (A + \delta_A)x - b - \delta_b \,\big]_{+} \Big\|_{\infty}}_{\text{disrupt constraints}}
\,-\, \underbrace{C(\delta)}_{\text{cost}}
\]

• Many variations are possible.
• Offers alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
bull Adversarial Programming
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation

via a pessimistic value-function minimization: given γ > 0,

\[
\operatorname*{minimize}_{x \in S}\; \theta(x) + v_{\gamma}(x),
\]

where v_γ(x) models constraint violation together with the cost of model attacks:

\[
v_{\gamma}(x) \,\triangleq\, \operatorname*{maximum}_{\delta \in \Delta}\;
\varphi_{\gamma}(x,\delta) \,\triangleq\, f(x,\delta) \,+\, \gamma\, \underbrace{\big\| H(x,\delta) \big\|_{\infty}}_{\text{constraint residual}} \,-\, \underbrace{C(\delta)}_{\text{attack cost}},
\]

with the novel features:
• a combined perturbation δ in a joint uncertainty set Δ ⊆ Rᴸ;
• robustified constraint satisfaction via penalization.
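When the uncertainty set Δ is finite (or has been discretized), v_γ(x) is a pointwise maximum over candidate attacks and can be evaluated directly. A small numpy sketch with made-up data (the specific f, H, C below are illustrative only):

```python
import numpy as np

def v_gamma(x, deltas, f, H, C, gamma):
    """v_gamma(x) = max over delta in Delta of
    f(x, delta) + gamma * ||H(x, delta)||_inf - C(delta)."""
    vals = [f(x, d) + gamma * np.max(np.abs(H(x, d))) - C(d) for d in deltas]
    return max(vals)

# toy instance: scalar x; perturbations shift both the cost and the constraint
f = lambda x, d: (1.0 + d) * x            # perturbed cost
H = lambda x, d: np.array([x - 2.0 - d])  # perturbed equality residual x - 2 - d = 0
C = lambda d: 0.5 * d                     # attack cost, linear in d
deltas = [0.0, 1.0]

val = v_gamma(1.0, deltas, f, H, C, gamma=10.0)
# delta = 0: 1*1 + 10*|1 - 2| - 0   = 11
# delta = 1: 2*1 + 10*|1 - 3| - 0.5 = 21.5  ->  v_gamma(1) = 21.5
```

Note how the penalization moves constraint satisfaction into the objective, so the worst-case problem stays feasible in x even when no single x satisfies H(x, δ) = 0 for every δ.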
Comments

• Related the VFOP to the two game formulations in terms of directional solutions.

• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.

• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

\[
\operatorname*{minimize}_{x \in S}\; \varphi(x) \,\triangleq\, \theta(x) + v(x),
\quad \text{with} \quad
v(x) \,\triangleq\, \max_{1 \leq j \leq J}\; \operatorname*{maximum}_{\delta \in \Delta \cap \mathbb{R}^{L}_{+}}\, \big[\, g_j(x,\delta) - h_j(x)^{\top}\delta \,\big].
\]

• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of interval (τ u]
We have the following bound for an interval probability
P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)
Two risk minimization problems of a composite random functional
minimizeθisinΘ
In-CVaR βα (Z(θ))︸ ︷︷ ︸
a difference of two convexSP value functions
and minimizeθisinΘ
bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective
where
Z(θ) c
g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex
with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
A touch of mathematicsbridging computations with theory
First of All What Can be Computed
critical(for dc fncs)
Clarke statlimiting stat
directional stat(for dd fncs)
SONC
loc-min
SOSC
Relationship between the stationary points
I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]
eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable
I directional stationarity V prime(x d) ge 0 for all the directions d
bull The sharper the concept the harder to compute
Computational Tools
I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions
I A fusion of first-order and second-order algorithms
ndash Dealing with non-convexity majorization-minimization algorithm
ndash Dealing with non-smoothness nonsmooth Newton algorithm
ndash Dealing with pathological constraints (exact) penalty method
I An integration of sampling and deterministic methods
ndash Dealing with population optimization in statistics
ndash Dealing with stochastic programming in operations research
ndash Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Comments
bull Leads to the study of the class of two-stage linearly bi-parameterizedstochastic program with quadratic recourse
minimizex
ζ(x) ϕ(x) + Eξ
[ψ(x ξ)
]subject to x isin X sube Rn1
where the recourse function ψ(x ξ) is the optimal objective value of thequadratic program (QP) with Q being symmetric positive semidefinite
ψ(x ξ) minimumy
[f(ξ) +G(ξ)x
]gty + 1
2 ygtQy
subject to y isin Y (x ξ) y isin Rn2 | A(ξ)x+Dy ge b(ξ)
bull The recourse function ψ(bull ξ) is of the implicit convex-concave kind
bull Developed a Regularization Convexification sampling combined method forcomputing a generalized critical point with almost sure convergence guarantee
(Simplified) Defender-Attacker Game with Costs
Underlying problem is minimize cgtx subject to Ax = b and x ge 0
Defenderrsquos problem Anticipating the attackerrsquos disruption δ isin ∆ the defenderundertakes the operation by solving the problem
minimizexge0
(c+ δc)gtx+ ρ∥∥∥ [ (A+ δA)xminus bminus δb
]+
∥∥∥infin︸ ︷︷ ︸
constraint residual
Attackerrsquos problem Anticipating the defenderrsquos activity x the attacker aimsto disrupt the defenderrsquos operations by solving the problem
maximumδ isin∆
λ1 (c+ δc)gtx︸ ︷︷ ︸disrupt objective
+λ2
∥∥∥ [ (A+ δA)xminus bminus δb]+
∥∥∥infin︸ ︷︷ ︸
disrupt constraints
minusC(δ)︸︷︷︸cost
bull Many variations are possible
bull Offer alternative approaches to the convexity-based robust formulations
bull Adversarial Programming
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation
via a pessimistic value-function minimization given γ gt 0
minimizexisinS
θ(x) + vγ(x)
where vγ(x) models constraint violation with cost of model attacks
vγ(x) maximum over δ isin ∆ of
φγ(x δ) f(x δ) + γ
H(x δ) infin︸ ︷︷ ︸constraint residual
minus C(δ)︸ ︷︷ ︸attack cost
with the novel features
bull combined perturbation δ in joint uncertainty set ∆ sube RL
bull robustify constraint satisfaction via penalization
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization

for risk-based statistical estimation, to control the left- and right-tail distributions simultaneously.

Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

\[ \text{In-CVaR}_{\alpha}^{\beta}(Z) \,\triangleq\, \frac{1}{\beta - \alpha} \int_{\mathrm{VaR}_\alpha(Z) \,\le\, z \,<\, \mathrm{VaR}_\beta(Z)} z \, dF_Z(z) \;=\; \frac{1}{\beta - \alpha} \Big[ \, (1-\alpha)\,\mathrm{CVaR}_\alpha(Z) - (1-\beta)\,\mathrm{CVaR}_\beta(Z) \, \Big]. \]

Buffered probability of exceedance/deceedance at level τ > 0:

\[ \mathrm{bPOE}(Z,\tau) \,\triangleq\, 1 - \min\big\{ \alpha \in [0,1] : \mathrm{CVaR}_\alpha(Z) \ge \tau \big\}, \qquad \mathrm{bPOD}(Z,u) \,\triangleq\, \min\big\{ \beta \in [0,1] : \text{In-CVaR}_{0}^{\beta}(Z) \ge u \big\}. \]
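The In-CVaR identity above can be checked numerically on an empirical distribution. A minimal sketch on a hypothetical sample (it assumes α·n and β·n are integers, so the empirical CVaR needs no tail splitting):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of the
    sample.  Assumes alpha * len(z) is an integer (no tail splitting)."""
    z = np.sort(z)
    return z[int(round(alpha * len(z))):].mean()

def in_cvar(z, alpha, beta):
    """In-CVaR via the identity
    [(1 - alpha) CVaR_alpha - (1 - beta) CVaR_beta] / (beta - alpha)."""
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 11.0)      # empirical distribution on 1, 2, ..., 10
print(cvar(z, 0.5))           # -> 8.0  (mean of 6..10)
print(in_cvar(z, 0.5, 0.9))   # approx 7.5: the mean of the middle band 6..9
```

Averaging only the band between VaR_α and VaR_β is exactly what discards both left- and right-tail outliers in the estimation problems that follow.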
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

\[ \mathbb{P}(u \ge Z > \tau) \;\le\; \mathrm{bPOE}(Z,\tau) + \mathrm{bPOD}(Z,u) - 1 \,\triangleq\, \mathrm{bPOI}_{\tau}^{u}(Z). \]

Two risk minimization problems of a composite random functional:

\[ \min_{\theta \in \Theta} \; \underbrace{\text{In-CVaR}_{\alpha}^{\beta}(Z(\theta))}_{\substack{\text{a difference of two convex} \\ \text{SP value functions}}} \qquad \text{and} \qquad \min_{\theta \in \Theta} \; \underbrace{\mathrm{bPOI}_{\tau}^{u}(Z(\theta))}_{\substack{\text{a stochastic program with a} \\ \text{difference-of-convex objective}}}, \]

where

\[ Z(\theta) \,\triangleq\, c\big( \underbrace{g(\theta, Z) - h(\theta, Z)}_{g(\bullet, z) \,-\, h(\bullet, z) \text{ is a difference of convex functions}} \big), \]

with c: ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions.
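A discrete sketch of the bPOE definition above, using the empirical CVaR and searching α over the sample grid (hypothetical data; a coarse approximation, not the talk's method):

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha (assumes alpha * len(z) is an integer)."""
    z = np.sort(z)
    return z[int(round(alpha * len(z))):].mean()

def bpoe(z, tau):
    """Buffered probability of exceedance on the empirical distribution:
    1 - min{alpha : CVaR_alpha(Z) >= tau}, alpha searched on the grid k/n.
    CVaR_alpha is nondecreasing in alpha, so the first hit is the minimum."""
    n = len(z)
    for k in range(n):
        if cvar(z, k / n) >= tau:
            return 1 - k / n
    return 0.0

z = np.arange(1.0, 11.0)
print(bpoe(z, 9.0))   # approx 0.3: CVaR_0.7 = mean(8, 9, 10) = 9 first reaches tau
```

Since bPOE(Z, τ) ≥ P(Z > τ) (here P(Z > 9) = 0.1 ≤ 0.3), it is the conservative, convexity-friendly surrogate for the exceedance probability that the bPOI bound exploits.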
Comments

• The In-CVaR minimization leads to the following stochastic program:

\[ \min_{\theta \in \Theta} \; \underbrace{\min_{\eta_1 \in \Upsilon_1} \; \mathbb{E}_\omega\big[ \, \varphi_1(\theta, \eta_1, \omega) \, \big]}_{\text{denoted } v_1(\theta)} \;-\; \underbrace{\min_{\eta_2 \in \Upsilon_2} \; \mathbb{E}_\omega\big[ \, \varphi_2(\theta, \eta_2, \omega) \, \big]}_{\text{denoted } v_2(\theta)}. \]

Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference of two convex functions.
• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in handling outliers in regression and classification problems.
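The univariate expectation minimizations behind v₁ and v₂ can be sketched with the standard Rockafellar-Uryasev representation, (1 − α)·CVaR_α(Z) = min over η of (1 − α)η + E[(Z − η)₊]; the sample below is a hypothetical stand-in, not the talk's estimation problem:

```python
import numpy as np

def ru_value(z, level):
    """(1 - level) * CVaR_level(Z) as a univariate expectation minimization
    (Rockafellar-Uryasev):  min over eta of (1 - level)*eta + E[(Z - eta)_+].
    An optimal eta is attained at a sample point, so a grid over z suffices."""
    return min((1 - level) * eta + np.maximum(z - eta, 0.0).mean() for eta in z)

z = np.arange(1.0, 11.0)
alpha, beta = 0.5, 0.9
v1, v2 = ru_value(z, alpha), ru_value(z, beta)   # two convex value functions
print((v1 - v2) / (beta - alpha))                # approx 7.5: the In-CVaR
```

Each value function is convex (a minimum of affine-plus-convex expressions in the data), and the objective is their difference, which is what the convex-programming-based sampling method exploits.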
A touch of mathematics: bridging computations with theory
First of All What Can be Computed
[Diagram: the hierarchy of computable stationarity concepts, from weakest to sharpest: critical points (for dc functions), Clarke stationary, limiting stationary, and directional stationary points (for dd functions), shown alongside the classical conditions SONC, local minimum, and SOSC.]
Relationship between the stationary points

▶ Stationarity:
  - smooth case: ∇V(x) = 0;
  - nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g. ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the two functions is continuously differentiable.
▶ Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder it is to compute.
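A toy illustration (not from the talk) of why sharpness matters: for the dc function f(x) = x² − |x|, the point x = 0 is critical for the dc pair (∇g(0) = 0 lies in ∂h(0) = [−1, 1]), yet it fails directional stationarity, since f′(0; d) = −|d| < 0 for every d ≠ 0:

```python
def fprime(f, x, d, t=1e-6):
    """One-sided directional derivative f'(x; d), by a forward difference."""
    return (f(x + t * d) - f(x)) / t

f = lambda x: x * x - abs(x)     # dc decomposition: g(x) = x**2, h(x) = |x|

# x = 0 is "critical" for the dc pair (g, h), yet both directions descend:
print(fprime(f, 0.0, 1.0))       # approx -1
print(fprime(f, 0.0, -1.0))      # approx -1
# x = 0.5 is directionally stationary: f'(0.5; d) >= 0 for d = +/- 1.
print(fprime(f, 0.5, 1.0))       # approx 0
```

A solver that certifies only criticality could stop at x = 0 and miss the descent directions that a directionally stationary point rules out.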
Computational Tools

▶ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.
▶ A fusion of first-order and second-order algorithms:
  - dealing with non-convexity: the majorization-minimization algorithm;
  - dealing with non-smoothness: the nonsmooth Newton algorithm;
  - dealing with pathological constraints: the (exact) penalty method.
▶ An integration of sampling and deterministic methods:
  - dealing with population optimization in statistics;
  - dealing with stochastic programming in operations research;
  - dealing with risk minimization in financial management and tail control.
Classes of Nonsmooth Functions

Pervasive in applications; listed progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more
[Diagram from Section 4.4 of the monograph (p. 231): inclusion relationships among the classes of nonsmooth functions, including piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), and piecewise affine (PA) functions, PC¹ and PC² functions, dc and dc-composite functions, diff-max-convex functions, implicit convex-concave (icc) functions, semismooth functions, B-differentiable functions, locally Lipschitz + directionally differentiable functions, and (locally) strongly weakly convex and locally dc functions, with qualifications such as C¹ smoothness, lower-C² structure, and open or closed convex domains governing some inclusions.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to the publisher on Friday, September 11, 2020.
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
  1.1 Smooth Calculus 1
    1.1.1 Differentiability of functions of several variables 7
    1.1.2 Mean-value theorems for vector-valued functions 12
    1.1.3 Fixed-point theorems and contraction 16
    1.1.4 Implicit-function theorems 18
  1.2 Matrices and Linear Equations 23
    1.2.1 Matrix factorizations, eigenvalues, and singular values 29
    1.2.2 Selected matrix classes 36
  1.3 Matrix Splitting Methods for Linear Equations 47
  1.4 The Conjugate Gradient Method 50
  1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
  2.1 Polyhedral Theory 59
    2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
    2.1.2 Properties of polyhedra 63
  2.2 Convex Sets and Functions 67
    2.2.1 Euclidean projection and proximality 71
  2.3 Convex Quadratic and Nonlinear Programs 79
    2.3.1 The KKT conditions and CQs 81
  2.4 The Minmax Theory of von Neumann 85
  2.5 Basic Descent Methods for Smooth Problems 86
    2.5.1 Unconstrained problems: Armijo line search 87
    2.5.2 The PGM for convex programs 91
    2.5.3 The proximal Newton method 95
  2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
  3.1 A Composite View of Statistical Estimation 111
    3.1.1 The optimization problems 113
    3.1.2 Learning classes 115
    3.1.3 Loss functions 122
    3.1.4 Regularizers for sparsity 123
  3.2 Low Rankness in Matrix Optimization 128
    3.2.1 Rank constraints: Hard and soft formulations 128
    3.2.2 Shape constrained index models 131
  3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
  4.1 B(ouligand)-Differentiable Functions 144
    4.1.1 Generalized directional derivatives 153
  4.2 Second-order Directional Derivatives 156
  4.3 Subdifferentials and Regularity 162
  4.4 Classes of Nonsmooth Functions 176
    4.4.1 Piecewise affine functions 177
    4.4.2 Piecewise linear-quadratic functions 190
    4.4.3 Weakly convex functions 205
    4.4.4 Difference-of-convex functions 214
    4.4.5 Semismooth functions 221
    4.4.6 Implicit convex-concave functions 229
    4.4.7 A summary 240

5 Value Functions 243
  5.1 Nonlinear Programming based Value Functions 244
    5.1.1 Interdiction versus enhancement 247
  5.2 Value Functions of Quadratic Programs 251
  5.3 Value Functions Associated with Convex Functions 258
  5.4 Robustification of Optimization Problems 261
    5.4.1 Attacks without set-up costs 264
    5.4.2 Attacks with set-up costs 267
  5.5 The Danskin Property 271
  5.6 Gap Functions 278
  5.7 Some Nonconvex Stochastic Value Functions 282
  5.8 Understanding Robust Optimization 289
    5.8.1 Uncertainty in data realizations 290
    5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
  6.1 First-order Theory 313
    6.1.1 The convex constrained case 314
    6.1.2 Composite ε-strong stationarity 320
    6.1.3 Subdifferential based stationarity 324
  6.2 Directional Saddle-Point Theory 332
  6.3 Difference-of-Convex Constraints 337
    6.3.1 Extension to piecewise dd-convex programs 345
  6.4 Second-order Theory 348
  6.5 Piecewise Linear-Quadratic Programs 350
    6.5.1 Review of nonconvex QPs 350
    6.5.2 Return to PLQ programs 352
    6.5.3 Finiteness of stationary values 362
    6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
  7.1 Basic Surrogation 372
    7.1.1 A broad family of upper surrogation functions 379
    7.1.2 Examples of surrogation families 381
    7.1.3 Surrogation algorithms for several problem classes 384
  7.2 Refinements and Extensions 392
    7.2.1 Moreau surrogation 392
    7.2.2 Combined BCD and ADMM for coupled constraints 400
    7.2.3 Surrogation with line searches 418
    7.2.4 Dinkelbach's algorithm with surrogation 424
    7.2.5 Randomized choice of subproblems 432
    7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
  8.1 An Introduction to Error Bounds 442
    8.1.1 The stationarity inclusions 444
  8.2 The Piecewise Polyhedral Inclusion 449
  8.3 Error Bounds for Subdifferential-Based Stationarity 454
  8.4 Sequential Convergence Under Sufficient Descent 462
    8.4.1 Using the local error bound (8.19) 462
    8.4.2 The Kurdyka-Łojasiewicz theory 467
    8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
  8.5 Error Bounds and Linear Regularity 476
    8.5.1 Piecewise systems 483
    8.5.2 Stationarity for convex composite dc optimization 488
    8.5.3 dd-convex systems 491
    8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
  9.1 Exact Penalization of Optimal Solutions 503
  9.2 Exact Penalization of Stationary Solutions 506
  9.3 Pulled-out Penalization of Composite Functions 511
  9.4 Two Applications of Pull-out Penalization 517
    9.4.1 B-stationarity of dc composite diff-max programs 518
    9.4.2 Local optimality of composite twice semidiff programs 521
  9.5 Exact Penalization of Affine Sparsity Constraints 527
  9.6 A Special Class of Cardinality Constrained Problems 531
  9.7 Penalized Surrogation for Composite Programs 533
    9.7.1 Solving the convex penalized subproblems 539
    9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
  10.1 Expectation Functions 548
    10.1.1 Sample average approximations 555
  10.2 Basic Stochastic Surrogation 560
    10.2.1 Expectation constraints 567
  10.3 Two-stage SPs with Quadratic Recourse 568
    10.3.1 The positive definite case 569
    10.3.2 The positive semidefinite case 571
  10.4 Probabilistic Error Bounds 588
  10.5 Consistency of Directional Stationary Solutions 591
    10.5.1 Convergence of stationary solutions 592
    10.5.2 Asymptotic distribution of the stationary values 605
    10.5.3 Convergence rate of the stationary points 609
  10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
  11.1 Definitions and Preliminaries 627
    11.1.1 Applications 631
    11.1.2 Related models 637
  11.2 Existence of a QNE 640
  11.3 Computational Algorithms 648
    11.3.1 A Gauss-Seidel best-response algorithm 649
  11.4 The Role of Potential 658
    11.4.1 Existence of a potential 665
  11.5 Double-Loop Algorithms 669
  11.6 The Nikaido-Isoda Optimization Formulations 678
    11.6.1 Surrogation applied to the NI function φc 681
    11.6.2 Difference-of-convexity and surrogation of φ̂c 683

Bibliography 691
Index 721
(Simplified) Defender-Attacker Game with Costs

Underlying problem: minimize c⊤x subject to Ax = b and x ≥ 0.

Defender's problem: anticipating the attacker's disruption δ ∈ Δ, the defender undertakes the operation by solving the problem

\[ \min_{x \ge 0} \; (c + \delta_c)^\top x \;+\; \rho \, \underbrace{\Big\| \big[ (A + \delta_A) x - b - \delta_b \big]_+ \Big\|_\infty}_{\text{constraint residual}}. \]

Attacker's problem: anticipating the defender's activity x, the attacker aims to disrupt the defender's operations by solving the problem

\[ \max_{\delta \in \Delta} \; \lambda_1 \underbrace{(c + \delta_c)^\top x}_{\text{disrupt objective}} \;+\; \lambda_2 \underbrace{\Big\| \big[ (A + \delta_A) x - b - \delta_b \big]_+ \Big\|_\infty}_{\text{disrupt constraints}} \;-\; \underbrace{C(\delta)}_{\text{cost}}. \]

• Many variations are possible.
• These offer alternative approaches to the convexity-based robust formulations.
• Adversarial Programming.
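A minimal sketch of evaluating the two players' objectives for a fixed pair (x, δ). The weighting and the residual follow the formulas above; the numeric instance is hypothetical:

```python
import numpy as np

def residual(A, b, x, dA, db):
    """||[ (A + dA) x - b - db ]_+ ||_inf, the perturbed constraint residual."""
    return np.max(np.maximum((A + dA) @ x - b - db, 0.0))

def defender_obj(c, A, b, x, dc, dA, db, rho):
    return (c + dc) @ x + rho * residual(A, b, x, dA, db)

def attacker_obj(c, A, b, x, dc, dA, db, lam1, lam2, cost):
    return lam1 * (c + dc) @ x + lam2 * residual(A, b, x, dA, db) - cost

# Hypothetical instance: minimize x1 + x2 subject to x1 + x2 = 1, x >= 0,
# with an attack that perturbs one entry of A.
c = np.array([1.0, 1.0]); A = np.array([[1.0, 1.0]]); b = np.array([1.0])
x = np.array([0.5, 0.5])
dc = np.zeros(2); dA = np.array([[0.0, 1.0]]); db = np.zeros(1)
print(defender_obj(c, A, b, x, dc, dA, db, rho=10.0))            # -> 6.0
print(attacker_obj(c, A, b, x, dc, dA, db, 1.0, 1.0, cost=0.0))  # -> 1.5
```

Because the equality constraint enters both objectives only through the penalized residual, the defender's problem stays feasible under every attack, which is the point of this formulation.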
Robustification of nonconvex problems

A general nonlinear program:

\[ \min_{x \in S} \; \theta(x) + f(x) \quad \text{subject to } H(x) = 0, \]

where
• S is a closed convex set in ℝⁿ not subject to alterations;
• θ and f are real-valued functions defined on an open convex set O in ℝⁿ containing S;
• H: O → ℝᵐ is a vector-valued (nonsmooth) function that can be used to model both inequality constraints g(x) ≤ 0 and equality constraints h(x) = 0.

Robust counterpart with separate perturbations:

\[ \min_{x \in S} \; \theta(x) + \max_{\delta \in \Delta} f(x,\delta) \quad \text{subject to } \underbrace{H(x,\delta) = 0 \;\; \forall\, \delta \in \Delta}_{\text{easily infeasible in the current way of modeling}}. \]
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation
via a pessimistic value-function minimization given γ gt 0
minimizexisinS
θ(x) + vγ(x)
where vγ(x) models constraint violation with cost of model attacks
vγ(x) maximum over δ isin ∆ of
φγ(x δ) f(x δ) + γ
H(x δ) infin︸ ︷︷ ︸constraint residual
minus C(δ)︸ ︷︷ ︸attack cost
with the novel features
bull combined perturbation δ in joint uncertainty set ∆ sube RL
bull robustify constraint satisfaction via penalization
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of interval (τ u]
We have the following bound for an interval probability
P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)
Two risk minimization problems of a composite random functional
minimizeθisinΘ
In-CVaR βα (Z(θ))︸ ︷︷ ︸
a difference of two convexSP value functions
and minimizeθisinΘ
bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective
where
Z(θ) c
g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex
with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
A touch of mathematicsbridging computations with theory
First of All What Can be Computed
critical(for dc fncs)
Clarke statlimiting stat
directional stat(for dd fncs)
SONC
loc-min
SOSC
Relationship between the stationary points
I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]
eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable
I directional stationarity V prime(x d) ge 0 for all the directions d
bull The sharper the concept the harder to compute
Computational Tools
I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions
I A fusion of first-order and second-order algorithms
ndash Dealing with non-convexity majorization-minimization algorithm
ndash Dealing with non-smoothness nonsmooth Newton algorithm
ndash Dealing with pathological constraints (exact) penalty method
I An integration of sampling and deterministic methods
ndash Dealing with population optimization in statistics
ndash Dealing with stochastic programming in operations research
ndash Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
Robustification of nonconvex problems
A general nonlinear program
minimizexisinS
θ(x) + f(x)
subject to H(x) = 0 where
bull S is a closed convex set in Rn not subject to alterationsbull θ and f are real-valued functions defined on an open convex set O in Rncontaining Sbull H O rarr Rm is a vector-valued (nonsmooth) function that can be used tomodel both inequality constraints g(x) le 0 and equality constraints h(x) = 0
Robust counterpart with separate perturbations
minimizexisinS
θ(x) + maximumδ isin ∆
f(x δ)
subject to H(x δ) = 0 forall δ isin ∆︸ ︷︷ ︸easily infeasible
in current way
of modeling
New formulation
via a pessimistic value-function minimization given γ gt 0
minimizexisinS
θ(x) + vγ(x)
where vγ(x) models constraint violation with cost of model attacks
vγ(x) maximum over δ isin ∆ of
φγ(x δ) f(x δ) + γ
H(x δ) infin︸ ︷︷ ︸constraint residual
minus C(δ)︸ ︷︷ ︸attack cost
with the novel features
bull combined perturbation δ in joint uncertainty set ∆ sube RL
bull robustify constraint satisfaction via penalization
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Examples of Non-Problems
III. Operations Research meets Statistics

Composite CVaR and bPOE minimization,
for risk-based statistical estimation to control left- and right-tail distributions simultaneously.
Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:

    In-CVaR_α^β(Z) := (1/(β − α)) ∫_{VaR_α(Z) ≤ z < VaR_β(Z)} z dF_Z(z)
                    = (1/(β − α)) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ]
Buffered probability of exceedance/deceedance at level τ > 0:

    bPOE(Z, τ) := 1 − min{ α ∈ [0, 1] : CVaR_α(Z) ≥ τ }
    bPOD(Z, u) := min{ β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }
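A quick empirical check of the In-CVaR identity above, using a naive tail-average estimator of CVaR from an equally weighted sample; this estimator is exact only when (1 − α)n is an integer, which the example respects (the data are illustrative).

```python
import numpy as np

# Empirical In-CVaR via the slide's identity
#   In-CVaR_{alpha}^{beta}(Z) = [(1-alpha) CVaR_alpha(Z) - (1-beta) CVaR_beta(Z)] / (beta - alpha),
# with CVaR_alpha estimated as the average of the worst (1-alpha) fraction of outcomes.

def cvar(z, alpha):
    z = np.sort(np.asarray(z, dtype=float))[::-1]     # descending: worst outcomes first
    k = max(1, int(np.ceil((1.0 - alpha) * len(z))))  # size of the upper tail
    return z[:k].mean()

def in_cvar(z, alpha, beta):
    return ((1 - alpha) * cvar(z, alpha) - (1 - beta) * cvar(z, beta)) / (beta - alpha)

z = np.arange(1.0, 11.0)            # ten equally likely outcomes 1..10
print(cvar(z, 0.8))                 # -> 9.5  (mean of the worst 20%: {10, 9})
print(in_cvar(z, 0.0, 1.0))         # -> 5.5  (In-CVaR_0^1 recovers the mean)
print(in_cvar(z, 0.0, 0.5))        # -> 3.0  (mean of the best half: {1,...,5})
```

The last two lines illustrate the two-sided control: shrinking β from 1 discards the largest outcomes, while raising α from 0 discards the smallest.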
Buffered probability of the interval (τ, u]. We have the following bound for an interval probability:

    P(u ≥ Z > τ) ≤ bPOE(Z, τ) + bPOD(Z, u) − 1 =: bPOI_τ^u(Z)

Two risk minimization problems of a composite random functional:

    minimize_{θ∈Θ}  In-CVaR_α^β(Z(θ))   — a difference of two convex SP value functions

and

    minimize_{θ∈Θ}  bPOI_τ^u(Z(θ))   — a stochastic program with a difference-of-convex objective

where Z(θ) := c( g(θ, Z) − h(θ, Z) ), so that g(•, z) − h(•, z) is a difference of convex functions, with c : R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex.
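bPOE also admits a convex representation, bPOE(Z, τ) = min_{a ≥ 0} E[(a(Z − τ) + 1)_+], which is what makes these tail quantities optimizable; the sketch below estimates it from a sample by a crude grid search over a, purely for illustration (the grid and data are not from the talk).

```python
import numpy as np

# Sample-based buffered probability of exceedance via the representation
#   bPOE(Z, tau) = min_{a >= 0} E[ (a*(Z - tau) + 1)_+ ],
# minimized here by brute-force grid search over the scalar a.

def bpoe(z, tau, grid=np.linspace(0.0, 50.0, 50001)):
    z = np.asarray(z, dtype=float)
    vals = [np.mean(np.maximum(a * (z - tau) + 1.0, 0.0)) for a in grid]
    return min(vals)

z = np.array([0.0, 1.0, 2.0, 3.0])
print(bpoe(z, 2.5))   # -> 0.5; indeed CVaR_{0.5} = mean of the worst half = 2.5 = tau
```

The printed value is consistent with the level-set definition on the slide: the smallest tail fraction whose CVaR reaches τ = 2.5 is 0.5.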
Comments
• The In-CVaR minimization leads to the following stochastic program:

    minimize over θ ∈ Θ:   min_{η1∈Υ1} E_ω[ φ1(θ, η1, ω) ]  −  min_{η2∈Υ2} E_ω[ φ2(θ, η2, ω) ]
                           (denoted v1(θ))                      (denoted v2(θ))

  Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function.
• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
A touch of mathematics: bridging computations with theory

First of all, what can be computed?

Relationship between the stationary points:
[Diagram: critical (for dc functions), Clarke stationary, limiting stationary, directional stationary (for dd functions), second-order necessary conditions (SONC), local minimum, second-order sufficient conditions (SOSC)]
• Stationarity:
  – smooth case: ∇V(x) = 0
  – nonsmooth case: 0 ∈ ∂V(x)  [subdifferential; many calculus rules fail],
    e.g. ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of the functions is continuously differentiable
• Directional stationarity: V′(x; d) ≥ 0 for all directions d
• The sharper the concept, the harder it is to compute.
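The gap between the concepts can be seen numerically on a one-dimensional example: −|x| has 0 in its Clarke subdifferential at the origin, yet fails the sharper directional test there, while |x| passes it (the forward-difference check below is an illustration, not a general-purpose routine).

```python
# Numerical check of directional stationarity V'(x; d) >= 0 for all d,
# on two nonsmooth one-dimensional functions.

def ddir(V, x, d, t=1e-8):
    # one-sided directional derivative V'(x; d) approximated by a forward difference
    return (V(x + t * d) - V(x)) / t

V1 = abs                      # |x|: 0 is directionally stationary (a minimizer)
V2 = lambda x: -abs(x)        # -|x|: 0 is Clarke stationary but NOT d-stationary

print(ddir(V1, 0.0, 1.0), ddir(V1, 0.0, -1.0))   # both +1 -> d-stationary
print(ddir(V2, 0.0, 1.0), ddir(V2, 0.0, -1.0))   # both -1 -> descent in every direction
```

This is the precise sense in which directional stationarity is "sharper": it rejects the spurious point at 0 for −|x| that the Clarke condition accepts.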
Computational Tools
• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions
• A fusion of first-order and second-order algorithms:
  – dealing with non-convexity: majorization-minimization algorithm
  – dealing with non-smoothness: nonsmooth Newton algorithm
  – dealing with pathological constraints: (exact) penalty method
• An integration of sampling and deterministic methods:
  – dealing with population optimization in statistics
  – dealing with stochastic programming in operations research
  – dealing with risk minimization in financial management and tail control
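A familiar special case of the majorization-minimization principle listed above: for the lasso-type objective 0.5‖Ax − b‖² + λ‖x‖₁, majorizing the smooth part by its Lipschitz quadratic upper bound yields the soft-thresholded gradient step. The data A, b, λ below are illustrative.

```python
import numpy as np

# Majorization-minimization for F(x) = 0.5*||Ax - b||^2 + lam*||x||_1:
# at x_k, replace the smooth part by its quadratic majorant with curvature L;
# the surrogate's minimizer is a soft-thresholded gradient step.

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def mm_lasso(A, b, lam, iters=500):
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft(x - grad / L, lam / L)    # exact minimizer of the majorant
    return x

A = np.eye(2)
b = np.array([3.0, 0.5])
print(mm_lasso(A, b, lam=1.0))   # -> [2. 0.]: soft-threshold of b at level 1
```

Every iteration decreases the true objective because the surrogate lies above it and touches it at the current point, which is the defining property of the MM schemes the talk relies on.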
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• Piecewise smooth*, e.g. piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Subanalytic and semianalytic
* and many more
[Figure (monograph §4.4, p. 231): inclusion diagram among the classes of nonsmooth functions — PA, PQ, PLQ, PC¹, PC², dc, dc-composite, diff-max-convex, implicit convex-concave (icc), (locally) strongly-weakly convex, locally dc, lower-C², C¹/LC¹, semismooth, B-differentiable, locally Lipschitz + directionally differentiable]
Completed Monograph

Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to the publisher on Friday, September 11, 2020.
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i

1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54

2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103

3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137

4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240

5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304

6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367

7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435

8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497

9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543

10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613

11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̃c 683

Bibliography 691
Index 721
New formulation
via a pessimistic value-function minimization given γ gt 0
minimizexisinS
θ(x) + vγ(x)
where vγ(x) models constraint violation with cost of model attacks
vγ(x) maximum over δ isin ∆ of
φγ(x δ) f(x δ) + γ
H(x δ) infin︸ ︷︷ ︸constraint residual
minus C(δ)︸ ︷︷ ︸attack cost
with the novel features
bull combined perturbation δ in joint uncertainty set ∆ sube RL
bull robustify constraint satisfaction via penalization
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Comments
bull Related the VFOP to the two game formulations in terms of directionalsolutions
bull Showed that the value function admits an explicit form under piecewisestructures of f(bull δ) and H(bull δ) a linear structure of C(δ) with a set-upcomponent and special simplex-type uncertainty set ∆
bull Introduced a Majorization-Minimization (MM) algorithm for solving a unifiedformulation
minimizexisinS
ϕ(x) θ(x) + v(x) with
v(x) max1lejleJ
maximumδisin∆capRL+
[gj(x δ)minus hj(x)gtδ
]
bull Implemented algorithm and reported computation results to demonstrate theviability of the robust paradigm in the ldquononrdquo-framework
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of the interval (τ, u]
We have the following bound for an interval probability:

P(u ≥ Z > τ) ≤ bPOE(Z, τ) + bPOD(Z, u) − 1 =: bPOI^u_τ(Z)

Two risk minimization problems of a composite random functional:

minimize_{θ ∈ Θ} In-CVaR^β_α(Z(θ))  [a difference of two convex SP value functions]
and minimize_{θ ∈ Θ} bPOI^u_τ(Z(θ))  [a stochastic program with a difference-of-convex objective]

where Z(θ) := c( g(θ, Z) − h(θ, Z) ), with c : R → R a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions, so g(•, z) − h(•, z) is a difference of convex functions.
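To see why such composites stay difference-of-convex, take the piecewise affine piece c(t) = max(t, 0): then c(g − h) = max(g, h) − h, a difference of two convex functions, since a pointwise max of convex functions is convex. A quick numeric check of this identity with toy convex choices of g and h (hypothetical, for illustration only):

```python
import numpy as np

g = lambda x: x**2                 # convex
h = lambda x: 2.0 * np.abs(x)      # convex

xs = np.linspace(-3.0, 3.0, 101)
lhs = np.maximum(g(xs) - h(xs), 0.0)     # c(g - h) with c(t) = max(t, 0)
rhs = np.maximum(g(xs), h(xs)) - h(xs)   # convex minus convex
assert np.allclose(lhs, rhs)
print("max(g - h, 0) == max(g, h) - h verified on the grid")
```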
Comments
• The In-CVaR minimization leads to the following stochastic program:

minimize over θ ∈ Θ:  min_{η1 ∈ Υ1} E_ω[ ϕ1(θ, η1, ω) ]  (denoted v1(θ))  −  min_{η2 ∈ Υ2} E_ω[ ϕ2(θ, η2, ω) ]  (denoted v2(θ))

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.
• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in handling outliers in regression and classification problems.
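The difference-of-convex structure v1 − v2 is what makes a convex-programming-based method workable: at each step the concave part −v2 is linearized at the current iterate and a convex subproblem is solved. A toy deterministic sketch of this majorization step (the classical DC algorithm on a one-dimensional instance, not the paper's sampling method):

```python
def dca(solve_convex, subgrad_h, x0, iters=20):
    """Minimize g(x) - h(x) by repeatedly linearizing h at the current iterate:
    x_{k+1} = argmin_x  g(x) - subgrad_h(x_k) * x, a convex subproblem."""
    x = x0
    for _ in range(iters):
        x = solve_convex(subgrad_h(x))
    return x

# Toy instance: g(x) = 0.5 x^2, h(x) = |x|; f = g - h has minimizers x = ±1.
subgrad_h = lambda x: 1.0 if x >= 0 else -1.0   # a subgradient of h(x) = |x|
solve_convex = lambda s: s                       # argmin_x 0.5 x^2 - s x  is  x = s
print(dca(solve_convex, subgrad_h, 2.0))         # 1.0
```

Starting from x0 = 2 the iterates reach the stationary point x = 1 in one step and stay there; starting from a negative x0 they settle at −1 instead.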
A touch of mathematics: bridging computations with theory
First of All, What Can be Computed?

[Diagram: relationship between the stationary points — critical points (for dc functions), Clarke stationary, limiting stationary, and directional stationary points (for dd functions), together with SONC, local minimizers, and SOSC.]
▸ Stationarity
– smooth case: ∇V(x) = 0
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g., ∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of the functions is continuously differentiable
▸ Directional stationarity: V′(x; d) ≥ 0 for all directions d
• The sharper the concept, the harder it is to compute.
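A one-line numeric illustration of the gap between these concepts: for f(x) = −|x|, the point x = 0 is Clarke stationary (0 lies in the Clarke subdifferential [−1, 1]), yet it is not directionally stationary, since f′(0; d) = −|d| < 0 for every d ≠ 0:

```python
def f(x):
    return -abs(x)

def dir_deriv(f, x, d, t=1e-8):
    """One-sided finite-difference estimate of the directional derivative f'(x; d)."""
    return (f(x + t * d) - f(x)) / t

# Descent directions exist at x = 0 even though it is Clarke stationary:
print(dir_deriv(f, 0.0, 1.0))    # -1.0
print(dir_deriv(f, 0.0, -1.0))   # -1.0
```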
Computational Tools
▸ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions
▸ A fusion of first-order and second-order algorithms
– Dealing with non-convexity: majorization-minimization algorithm
– Dealing with non-smoothness: nonsmooth Newton algorithm
– Dealing with pathological constraints: (exact) penalty method
▸ An integration of sampling and deterministic methods
– Dealing with population optimization in statistics
– Dealing with stochastic programming in operations research
– Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• Piecewise smooth*, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic
* and many more
[Figure (Section 4.4 of the monograph, "Classes of Nonsmooth Functions"): a diagram of inclusions among the classes — piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), piecewise affine (PA), PC², PC¹, dc and dc-composite, diff-max-convex, implicit convex-concave (icc), semismooth, B-differentiable, locally Lipschitz + directionally differentiable, and (locally) strongly weakly convex functions, related under various smoothness and domain conditions.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683
Bibliography 691
Index 721
A touch of mathematicsbridging computations with theory
First of All What Can be Computed
critical(for dc fncs)
Clarke statlimiting stat
directional stat(for dd fncs)
SONC
loc-min
SOSC
Relationship between the stationary points
I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]
eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable
I directional stationarity V prime(x d) ge 0 for all the directions d
bull The sharper the concept the harder to compute
Computational Tools
I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions
I A fusion of first-order and second-order algorithms
ndash Dealing with non-convexity majorization-minimization algorithm
ndash Dealing with non-smoothness nonsmooth Newton algorithm
ndash Dealing with pathological constraints (exact) penalty method
I An integration of sampling and deterministic methods
ndash Dealing with population optimization in statistics
ndash Dealing with stochastic programming in operations research
ndash Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φ_c 681
11.6.2 Difference-of-convexity and surrogation of φ̃_c 683
Bibliography 691
Index 721
Comments
• Related the VFOP to the two game formulations in terms of directional solutions.
• Showed that the value function admits an explicit form under piecewise structures of f(•, δ) and H(•, δ), a linear structure of C(δ) with a set-up component, and a special simplex-type uncertainty set Δ.
• Introduced a Majorization-Minimization (MM) algorithm for solving a unified formulation:

minimize_{x ∈ S} φ(x) ≜ θ(x) + v(x), with v(x) ≜ max_{1 ≤ j ≤ J} maximum_{δ ∈ Δ ∩ ℝ^L_+} [ g_j(x, δ) − h_j(x)^⊤ δ ]
• Implemented the algorithm and reported computational results to demonstrate the viability of the robust paradigm in the "non"-framework.
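One way to see why a simplex-type Δ yields an explicit value function (a sketch under assumptions added here; the names are illustrative): if each g_j(x, ·) is affine in δ and Δ is the unit simplex, then the inner maximum over δ is attained at a vertex e_l, so v(x) collapses to a finite max of maxes.

```python
def v(g_at_vertices, h):
    # g_at_vertices[j][l] = g_j(x, e_l); h[j][l] = l-th entry of h_j(x).
    # For g_j(x, .) affine in delta, the max over the simplex equals the
    # max over its vertices e_l, so v(x) is an explicit finite maximum.
    return max(max(g_l - h_l for g_l, h_l in zip(gj, hj))
               for gj, hj in zip(g_at_vertices, h))

# J = 2 pieces, L = 2 simplex vertices:
print(v([[1.0, 3.0], [2.0, 0.0]], [[0.5, 0.5], [1.0, 1.0]]))  # -> 2.5
```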
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left- and right-tail distributions simultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels 0 ≤ α < β ≤ 1, treating two-sided estimation errors:
In-CVaR_α^β(Z) ≜ (1/(β − α)) ∫_{VaR_β(Z) > z ≥ VaR_α(Z)} z dF_Z(z) = (1/(β − α)) [ (1 − α) CVaR_α(Z) − (1 − β) CVaR_β(Z) ]
Buffered probability of exceedance/deceedance at level τ > 0:

bPOE(Z; τ) ≜ 1 − min{ α ∈ [0, 1] : CVaR_α(Z) ≥ τ }
bPOD(Z; u) ≜ min{ β ∈ [0, 1] : In-CVaR_0^β(Z) ≥ u }
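The identity expressing In-CVaR through two CVaRs can be checked empirically (a sketch with discretization conventions assumed here: the empirical CVaR_α is the mean of the worst (1 − α) fraction of the sorted sample):

```python
import random

random.seed(0)
n = 100
z = sorted(random.gauss(0.0, 1.0) for _ in range(n))
alpha, beta = 0.90, 0.95

def cvar(a):
    # empirical CVaR_a: mean of the worst (1 - a) tail of the sample
    k = round(n * a)
    return sum(z[k:]) / (n * (1 - a))

lhs = ((1 - alpha) * cvar(alpha) - (1 - beta) * cvar(beta)) / (beta - alpha)
rhs = sum(z[90:95]) / 5   # average of samples between VaR_.90 and VaR_.95
print(abs(lhs - rhs) < 1e-9)   # True: In-CVaR is the trimmed inter-VaR mean
```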
Buffered probability of the interval (τ, u]
We have the following bound for an interval probability:
P(u ≥ Z > τ) ≤ bPOE(Z; τ) + bPOD(Z; u) − 1 ≜ bPOI_τ^u(Z)
Two risk minimization problems of a composite random functional:

minimize_{θ ∈ Θ} In-CVaR_α^β(Z(θ))   [a difference of two convex SP value functions]

and   minimize_{θ ∈ Θ} bPOI_τ^u(Z(θ))   [a stochastic program with a difference-of-convex objective]
where Z(θ) ≜ c( g(θ, Z) − h(θ, Z) ), in which g(•, z) − h(•, z) is a difference of convex functions, c: ℝ → ℝ is a univariate piecewise affine function, Z is an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) are both convex functions.
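A brute-force empirical reading of bPOE (a grid-scan sketch assumed here, not the authors' method): scan levels α and take the smallest one whose empirical CVaR clears the threshold τ.

```python
import random

random.seed(0)
n = 1000
z = sorted(random.gauss(0.0, 1.0) for _ in range(n))

def cvar(a):
    # empirical CVaR_a: mean of the worst (1 - a) tail
    k = round(n * a)
    tail = z[k:] or [z[-1]]
    return sum(tail) / len(tail)

def bpoe(tau, steps=1000):
    # bPOE(Z; tau) = 1 - min{ a in [0, 1] : CVaR_a(Z) >= tau }
    feasible = [i / steps for i in range(steps + 1) if cvar(i / steps) >= tau]
    return 1.0 - min(feasible) if feasible else 0.0

poe = sum(1 for x in z if x > 1.0) / n   # plain probability of exceedance
print(bpoe(1.0) >= poe)                  # True: bPOE upper-bounds POE
```

The printed inequality is the defining property that makes bPOE a conservative, and better-behaved, surrogate for the probability of exceedance.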
Comments
• The In-CVaR minimization leads to the following stochastic program:

minimize over θ ∈ Θ:   min_{η₁ ∈ Υ₁} E_ω[ φ₁(θ, η₁, ω) ]  −  min_{η₂ ∈ Υ₂} E_ω[ φ₂(θ, η₂, ω) ]
(the first term is denoted v₁(θ), the second v₂(θ))
Both functions v₁ and v₂ are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex function.
• Developed a convex-programming-based sampling method for computing a critical point of the objective.
• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
Examples of Non-Problems
III Operations Research meets Statistics
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of the interval (τ, u]

We have the following bound for an interval probability:

\[
\mathbb{P}(u \ge Z > \tau) \;\le\; \text{bPOE}(Z,\tau) + \text{bPOD}(Z,u) - 1 \;\triangleq\; \text{bPOI}_\tau^u(Z)
\]

Two risk minimization problems of a composite random functional:

\[
\operatorname*{minimize}_{\theta\in\Theta}\; \text{In-CVaR}_\alpha^\beta(Z(\theta))
\quad \text{(a difference of two convex SP value functions)}
\]
and
\[
\operatorname*{minimize}_{\theta\in\Theta}\; \text{bPOI}_\tau^u(Z(\theta))
\quad \text{(a stochastic program with a difference-of-convex objective)},
\]
where
\[
Z(\theta) \;\triangleq\; c\big(g(\theta,Z)-h(\theta,Z)\big),
\]
with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•, z) and h(•, z) both convex functions, so that g(•, z) − h(•, z) is a difference of convex functions.
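For intuition, bPOE has a simple one-pass empirical form: it is the largest tail fraction whose tail average still reaches the threshold. A minimal sketch of this (the scan below is my own illustration, not the talk's algorithm):

```python
import numpy as np

def bpoe(z, tau):
    """Empirical bPOE(Z, tau) = 1 - min{alpha in [0,1] : CVaR_alpha(Z) >= tau}:
    the largest fraction k/n whose top-k average is still >= tau."""
    z = np.sort(z)[::-1]                                  # descending
    tail_means = np.cumsum(z) / np.arange(1, len(z) + 1)  # mean of top k, k = 1..n
    hits = np.nonzero(tail_means >= tau)[0]
    return 0.0 if hits.size == 0 else (hits.max() + 1) / len(z)

z = np.arange(1.0, 101.0)
# P(Z > 95) = 0.05, while bPOE(Z, 95) = 0.11: bPOE upper-bounds the
# plain probability of exceedance, which is what makes it "buffered"
```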
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ

\[
\underbrace{\min_{\eta_1\in\Upsilon_1} \mathbb{E}_\omega\big[\varphi_1(\theta,\eta_1,\omega)\big]}_{\text{denoted } v_1(\theta)}
\;-\;
\underbrace{\min_{\eta_2\in\Upsilon_2} \mathbb{E}_\omega\big[\varphi_2(\theta,\eta_2,\omega)\big]}_{\text{denoted } v_2(\theta)}
\]

Both v1 and v2 are (convex) value functions of a univariate expectation minimization; thus the overall objective is a difference-of-convex objective.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
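The difference-of-convex structure v1 − v2 is exactly what the dc algorithm (DCA) exploits: linearize the concave part at the current iterate and solve the remaining convex subproblem. A toy one-dimensional sketch (my own illustrative choice of g and h, unrelated to the In-CVaR application):

```python
def h_subgrad(x):
    """A subgradient of h(x) = |x|."""
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def dca(x0, iters=50):
    """DCA for f(x) = g(x) - h(x) with g(x) = x**2 and h(x) = |x|:
    x_{k+1} = argmin_x g(x) - s_k * x, where s_k is a subgradient of h
    at x_k. For g(x) = x**2 the convex subproblem has the closed-form
    solution x = s_k / 2."""
    x = x0
    for _ in range(iters):
        x = h_subgrad(x) / 2.0
    return x

dca(0.7)   # -> 0.5, a critical point of f (the local minima are x = +-1/2)
```

Each iteration decreases f, and the limit is a critical point in the dc sense (a subgradient of h belongs to the subdifferential of g), which is in general weaker than directional stationarity.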
A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: the hierarchy of stationarity concepts, from sharpest to weakest — SOSC ⇒ local minimum ⇒ SONC; a local minimum is directionally stationary (for dd functions), which in turn implies limiting stationarity, Clarke stationarity, and criticality (for dc functions).]
Relationship between the stationary points

▶ Stationarity:
– smooth case: ∇V(x) = 0
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g.,
∂V1(x) + ∂V2(x) ≠ ∂(V1 + V2)(x) unless one of the two functions is continuously differentiable

▶ Directional stationarity: V′(x; d) ≥ 0 for all directions d

• The sharper the concept, the harder to compute.
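A one-line example makes both points concrete (my own illustration): take V1(x) = |x| and V2(x) = −|x|.

```latex
% Failure of the sum rule at x = 0:
\partial V_1(0) + \partial V_2(0) = [-1,1] + [-1,1] = [-2,2]
\;\neq\; \{0\} = \partial\, 0 = \partial (V_1 + V_2)(0).
% Clarke stationarity is weaker than directional stationarity:
% for V = V_2, we have 0 \in \partial_C V(0) = [-1,1], so x = 0 is
% Clarke stationary; yet V'(0; d) = -|d| < 0 for every d \neq 0,
% so x = 0 is not directionally stationary (indeed it is a maximum).
```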
Computational Tools

▶ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions.

▶ A fusion of first-order and second-order algorithms:
– Dealing with non-convexity: majorization-minimization algorithms
– Dealing with non-smoothness: nonsmooth Newton algorithms
– Dealing with pathological constraints: (exact) penalty methods

▶ An integration of sampling and deterministic methods:
– Dealing with population optimization in statistics
– Dealing with stochastic programming in operations research
– Dealing with risk minimization in financial management and tail control
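To illustrate the "nonsmooth Newton" ingredient: for a piecewise smooth equation one replaces the Jacobian by an element of the generalized (Bouligand) derivative. A minimal one-dimensional sketch, with the equation and names chosen purely for illustration:

```python
def F(x):
    """Piecewise smooth equation F(x) = x + max(x, 0) - 1 = 0; root at x = 0.5."""
    return x + max(x, 0.0) - 1.0

def gen_deriv(x):
    """An element of the generalized derivative of F at x
    (F has slope 2 on x > 0 and slope 1 on x < 0; pick 1 at the kink)."""
    return 2.0 if x > 0 else 1.0

def semismooth_newton(x, tol=1e-12, max_iter=50):
    """Newton iteration with the Jacobian replaced by a generalized derivative."""
    for _ in range(max_iter):
        fx = F(x)
        if abs(fx) < tol:
            break
        x = x - fx / gen_deriv(x)
    return x

semismooth_newton(-1.0)   # converges to 0.5 in finitely many steps,
                          # since F is piecewise linear
```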
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Subanalytic and semianalytic

* and many more
[Figure (monograph Section 4.4, "Classes of Nonsmooth Functions", p. 231): inclusion diagram relating the classes — piecewise affine (PA), piecewise quadratic (PQ), piecewise linear-quadratic (PLQ), PC¹ and PC² functions, semismooth, B-differentiable, locally Lipschitz + directionally differentiable, dc and dc-composite, diff-max-convex, implicit convex-concave (icc), strongly/locally weakly convex, and locally dc functions, with some inclusions holding under additional C¹, lower-C², or convex-domain assumptions.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to the publisher (Friday, September 11, 2020)
Table of Contents

Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minmax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.1.9) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683
Bibliography 691
Index 721
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of interval (τ u]
We have the following bound for an interval probability
P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)
Two risk minimization problems of a composite random functional
minimizeθisinΘ
In-CVaR βα (Z(θ))︸ ︷︷ ︸
a difference of two convexSP value functions
and minimizeθisinΘ
bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective
where
Z(θ) c
g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex
with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
A touch of mathematicsbridging computations with theory
First of All What Can be Computed
critical(for dc fncs)
Clarke statlimiting stat
directional stat(for dd fncs)
SONC
loc-min
SOSC
Relationship between the stationary points
I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]
eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable
I directional stationarity V prime(x d) ge 0 for all the directions d
bull The sharper the concept the harder to compute
Computational Tools
I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions
I A fusion of first-order and second-order algorithms
ndash Dealing with non-convexity majorization-minimization algorithm
ndash Dealing with non-smoothness nonsmooth Newton algorithm
ndash Dealing with pathological constraints (exact) penalty method
I An integration of sampling and deterministic methods
ndash Dealing with population optimization in statistics
ndash Dealing with stochastic programming in operations research
ndash Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Composite CVaR and bPOE minimization
for risk-based statistical estimation to control left and right-tail distributionssimultaneously
Interval Conditional Value-at-Risk of a random variable Z at levels0 le α lt β le 1 treating two-sided estimation errors
In-CVaR βα (Z)
1
β minus α
intVaRβ(Z)gtzgeVaRα(Z)
z dFZ(z)
=1
β minus α[
(1minus α)CVaRα(Z)minus (1minus β)CVaRβ(Z)]
Buffered probability of exceedancedeceedance at level τ gt 0
bPOE(Z τ) 1minusminα isin [0 1] CVaRα(Z) ge τ
bPOD(Zu) min
β isin [0 1] In-CVaR β
0 (Z) ge u
Buffered probability of interval (τ u]
We have the following bound for an interval probability
P(u ge Z gt τ) le bPOE(Z τ) + bPOD(Zu)minus 1 bPOIuτ (Z)
Two risk minimization problems of a composite random functional
minimizeθisinΘ
In-CVaR βα (Z(θ))︸ ︷︷ ︸
a difference of two convexSP value functions
and minimizeθisinΘ
bPOIuτ (Z(θ))︸ ︷︷ ︸a stochastic program witha difference-of-convex objective
where
Z(θ) c
g(θZ)minus h(θZ)︸ ︷︷ ︸g(bull z)minus h(bull z) is difference of convex
with c Rrarr R being a univariate piecewise affine function Z is am-dimensional random vector and for each z g(bull z) and h(bull z) are bothconvex functions
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
Comments
bull The In-CVaR minimization leads to the following stochastic program
minimize over θ isin Θ of
minimumη1isinΥ1
Eω
[ϕ1(θ η1 ω)
]︸ ︷︷ ︸
denoted v1(θ)
minusminimumη2isinΥ2
Eω
[ϕ2(θ η2 ω)
]︸ ︷︷ ︸
denoted v2(θ)
Both functions v1 and v2 are (convex) value functions of a univariateexpectation minimization thus overall objective is a difference-of-convexobjective
bull Developed a convex programming based sampling method for computing acritical point of the objective
bull Numerical results demonstrated superior performance in the handling ofoutliers in regression and classification problems
A touch of mathematicsbridging computations with theory
First of All What Can be Computed
critical(for dc fncs)
Clarke statlimiting stat
directional stat(for dd fncs)
SONC
loc-min
SOSC
Relationship between the stationary points
I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]
eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable
I directional stationarity V prime(x d) ge 0 for all the directions d
bull The sharper the concept the harder to compute
Computational Tools
I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions
I A fusion of first-order and second-order algorithms
ndash Dealing with non-convexity majorization-minimization algorithm
ndash Dealing with non-smoothness nonsmooth Newton algorithm
ndash Dealing with pathological constraints (exact) penalty method
I An integration of sampling and deterministic methods
ndash Dealing with population optimization in statistics
ndash Dealing with stochastic programming in operations research
ndash Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̃c 683
Bibliography 691
Index 721
Buffered probability of interval (τ, u]

We have the following bound for an interval probability:
\[
P(u \ge Z > \tau) \;\le\; \mathrm{bPOE}(Z;\tau) + \mathrm{bPOD}(Z;u) - 1 \;\triangleq\; \mathrm{bPOI}_{u,\tau}(Z).
\]

Two risk minimization problems of a composite random functional:
\[
\operatorname*{minimize}_{\theta \in \Theta}\; \operatorname{In\text{-}CVaR}_{\alpha}^{\beta}\bigl(Z(\theta)\bigr)
\quad \text{(a difference of two convex SP value functions)}
\]
and
\[
\operatorname*{minimize}_{\theta \in \Theta}\; \mathrm{bPOI}_{u,\tau}\bigl(Z(\theta)\bigr)
\quad \text{(a stochastic program with a difference-of-convex objective)},
\]
where
\[
Z(\theta) \;\triangleq\; c\bigl(\,\underbrace{g(\theta;Z) - h(\theta;Z)}_{g(\bullet;z)\,-\,h(\bullet;z)\ \text{is difference-of-convex}}\,\bigr),
\]
with c : ℝ → ℝ a univariate piecewise affine function, Z an m-dimensional random vector, and, for each z, g(•; z) and h(•; z) both convex functions.
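The buffered probability of exceedance can be estimated from samples via the standard one-dimensional minimization formula bPOE(Z; τ) = min over a ≥ 0 of E[(a(Z − τ) + 1)₊], which always dominates the plain exceedance probability P(Z > τ). A minimal Monte Carlo sketch (the grid search over a and the normal test distribution are illustrative choices, not from the talk):

```python
import numpy as np

def bpoe(z, tau, grid=None):
    """Monte Carlo buffered probability of exceedance:
    bPOE(Z; tau) = min_{a >= 0} E[(a*(Z - tau) + 1)_+],
    minimized here by a simple grid search over a."""
    if grid is None:
        grid = np.linspace(0.0, 20.0, 401)   # candidate values of a >= 0
    vals = [np.mean(np.maximum(a * (z - tau) + 1.0, 0.0)) for a in grid]
    return min(vals)

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)        # samples of Z ~ N(0, 1)
tau = 1.0
poe = np.mean(z > tau)              # plain probability of exceedance
print(poe, bpoe(z, tau))            # bPOE upper-bounds POE: bPOE >= P(Z > tau)
```

Since (a(z − τ) + 1)₊ ≥ 1{z > τ} pointwise for every a ≥ 0, the sample estimate of bPOE is guaranteed to sit above the sample POE, which makes the bound easy to sanity-check numerically.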
Comments

• The In-CVaR minimization leads to the following stochastic program: minimize over θ ∈ Θ of
\[
\underbrace{\min_{\eta_1 \in \Upsilon_1}\; \mathbb{E}_\omega\bigl[\varphi_1(\theta,\eta_1,\omega)\bigr]}_{\text{denoted } v_1(\theta)}
\;-\;
\underbrace{\min_{\eta_2 \in \Upsilon_2}\; \mathbb{E}_\omega\bigl[\varphi_2(\theta,\eta_2,\omega)\bigr]}_{\text{denoted } v_2(\theta)}.
\]
Both v₁ and v₂ are (convex) value functions of a univariate expectation minimization; the overall objective is thus a difference-of-convex objective.

• Developed a convex-programming-based sampling method for computing a critical point of the objective.

• Numerical results demonstrated superior performance in the handling of outliers in regression and classification problems.
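The difference-of-convex structure of such objectives is typically exploited by surrogation: linearize the concave part −h at the current iterate (using any subgradient of h), then minimize the resulting convex upper surrogate. A minimal sketch on a toy one-dimensional DC function of my own choosing (not the In-CVaR value functions above), where each convex subproblem has a closed-form solution:

```python
import numpy as np

# Toy DC objective f = g - h with g, h convex:
#   g(x) = x^2,  h(x) = 2*|x - 1|
g  = lambda x: x**2
h  = lambda x: 2.0 * abs(x - 1.0)
dh = lambda x: 2.0 * np.sign(x - 1.0) if x != 1.0 else 0.0  # a subgradient of h

x = 3.0                                    # starting point
for _ in range(50):
    s = dh(x)
    # Convex surrogate: g(x) - h(x_k) - s*(x - x_k); its minimizer solves
    # 2x - s = 0, i.e. x = s/2.
    x = s / 2.0
print(x, g(x) - h(x))
```

On this instance the iterates 3 → 1 → 0 → −1 settle at x = −1 with value −3, which is a critical point of the DC objective (here, in fact, its global minimum).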
A touch of mathematics: bridging computations with theory

First of All, What Can be Computed?

[Diagram: hierarchy of stationarity concepts — local minimum (with the second-order conditions SONC and SOSC), directional stationarity (for dd functions), limiting and Clarke stationarity, and criticality (for dc functions).]
Relationship between the stationary points

• Stationarity:
– smooth case: ∇V(x) = 0;
– nonsmooth case: 0 ∈ ∂V(x) [subdifferential; many calculus rules fail], e.g. ∂V₁(x) + ∂V₂(x) ≠ ∂(V₁ + V₂)(x) unless one of the functions is continuously differentiable.

• Directional stationarity: V′(x; d) ≥ 0 for all directions d.

• The sharper the concept, the harder to compute.
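Both the failure of the sum rule and the gap between Clarke and directional stationarity show up already on absolute-value functions (a standard illustration, added here for concreteness):

```latex
V_1(x) = |x|, \quad V_2(x) = -|x|: \qquad
\partial V_1(0) + \partial V_2(0) = [-1,1] + [-1,1] = [-2,2],
\quad\text{yet}\quad
(V_1 + V_2) \equiv 0 \ \Rightarrow\ \partial (V_1 + V_2)(0) = \{0\}.
```

And for V(x) = −|x| the point x = 0 is Clarke stationary, since 0 ∈ ∂V(0) = [−1, 1], yet V′(0; d) = −|d| < 0 for every d ≠ 0, so it is not directionally stationary (indeed it is a local maximum): the directional concept really is sharper.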
Computational Tools

• Built on a theory of surrogation, supplemented by exact penalization of stationary solutions

• A fusion of first-order and second-order algorithms
– Dealing with non-convexity: majorization-minimization algorithm
– Dealing with non-smoothness: nonsmooth Newton algorithm
– Dealing with pathological constraints: (exact) penalty method

• An integration of sampling and deterministic methods
– Dealing with population optimization in statistics
– Dealing with stochastic programming in operations research
– Dealing with risk minimization in financial management and tail control
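The exact-penalty idea can be seen on a one-variable toy problem of my own choosing: minimize x² subject to x ≥ 1. The KKT multiplier is λ* = 2, so the nonsmooth penalty x² + ρ·max(0, 1 − x) recovers the constrained minimizer exactly once ρ > 2, with no need to drive ρ to infinity:

```python
import numpy as np

# Toy exact-penalty illustration: minimize x^2 s.t. x >= 1.
# Constrained minimizer x* = 1 with multiplier lambda* = 2, so the
# penalty x^2 + rho*max(0, 1 - x) is exact for any rho > 2.
xs = np.linspace(-2.0, 3.0, 500_001)

def penalized_argmin(rho):
    vals = xs**2 + rho * np.maximum(0.0, 1.0 - xs)
    return xs[np.argmin(vals)]

print(penalized_argmin(1.0))   # rho too small: minimizer x = rho/2 = 0.5, infeasible
print(penalized_argmin(10.0))  # rho > 2: minimizer x = 1, the constrained solution
```

The kink of max(0, 1 − x) at the constraint boundary is what makes finite penalties exact; a smooth quadratic penalty would only approach x = 1 as ρ → ∞.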
Classes of Nonsmooth Functions

Pervasive in applications; progressively broader:

• Piecewise smooth*, e.g. piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic

* and many more
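B-differentiability (locally Lipschitz plus directionally differentiable) can be probed numerically: one-sided difference quotients of a piecewise-linear function converge to well-defined directional derivatives at a kink, even though no linear (Fréchet) derivative exists there. A small sketch on an illustrative function of my own choosing:

```python
import numpy as np

f = lambda x: abs(x[0]) + max(x[1], 0.0)   # piecewise linear, hence B-differentiable

def dir_deriv(f, x, d, t=1e-8):
    """One-sided directional derivative f'(x; d) ~ (f(x + t*d) - f(x)) / t."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    return (f(x + t * d) - f(x)) / t

x0 = np.array([0.0, 0.0])                  # a kink of f
print(dir_deriv(f, x0, [ 1.0,  1.0]))      # |1| + max(1, 0)  = 2
print(dir_deriv(f, x0, [-1.0, -1.0]))      # |-1| + max(-1, 0) = 1
```

Since f′(x0; d) ≠ −f′(x0; −d), the directional derivative is not linear in d and f is not differentiable at x0; yet every one-sided directional derivative exists, which is exactly what the B-differentiable class requires.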
[Diagram from Section 4.4 of the monograph (p. 231): inclusion map of the nonsmooth function classes — piecewise affine (PA) ⊂ piecewise linear-quadratic (PLQ) ⊂ PC¹ (and PC²), difference-of-convex (dc) and dc-composite, diff-max-convex, implicit convex-concave (icc), semismooth, B-differentiable, (locally) strongly weakly convex, locally Lipschitz, and directionally differentiable — with side conditions such as C¹ smoothness, conditions (4.65)–(4.66), lower-C², and open or closed convex domains.]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to publisher on Friday, September 11, 2020.
A touch of mathematicsbridging computations with theory
First of All What Can be Computed
critical(for dc fncs)
Clarke statlimiting stat
directional stat(for dd fncs)
SONC
loc-min
SOSC
Relationship between the stationary points
I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]
eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable
I directional stationarity V prime(x d) ge 0 for all the directions d
bull The sharper the concept the harder to compute
Computational Tools
▶ Built on a theory of surrogation, supplemented by exact penalization of stationary solutions
▶ A fusion of first-order and second-order algorithms
– Dealing with non-convexity: the majorization-minimization algorithm
– Dealing with non-smoothness: the nonsmooth Newton algorithm
– Dealing with pathological constraints: the (exact) penalty method
▶ An integration of sampling and deterministic methods
– Dealing with population optimization in statistics
– Dealing with stochastic programming in operations research
– Dealing with risk minimization in financial management and tail control
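A minimal sketch of the majorization-minimization idea, on a hypothetical smooth example rather than one from the talk: for an L-smooth (possibly nonconvex) f, the quadratic f(y) + f′(y)(x − y) + (L/2)(x − y)² majorizes f, and minimizing this surrogate at each iterate reduces to a gradient step of size 1/L, which drives the iterates to a stationary point:

```python
import math

f  = lambda x: x * x + 3.0 * math.sin(x)    # nonconvex: f'' = 2 - 3 sin(x)
fp = lambda x: 2.0 * x + 3.0 * math.cos(x)  # gradient of f
L  = 5.0                                    # |f''| <= 5, so L = 5 is a valid modulus

x = 3.0
for _ in range(200):
    # Exact minimizer of the quadratic majorizer
    # q(t) = f(x) + fp(x) * (t - x) + (L / 2) * (t - x) ** 2
    x = x - fp(x) / L
print(abs(round(fp(x), 6)))                 # 0.0: a stationary point of f
```

Each step minimizes an upper model that touches f at the current point, so the objective decreases monotonically; nonsmooth surrogation generalizes this by majorizing only the difficult (e.g., concave or composite) part.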
Classes of Nonsmooth Functions
Pervasive in applications; progressively broader:
• Piecewise smooth*: e.g., piecewise linear-quadratic
• Difference-of-convex: facilitates algorithmic design via convex majorization
• Semismooth: enables fast algorithms of the Newton type
• Bouligand differentiable: locally Lipschitz + directionally differentiable
• Sub-analytic and semi-analytic
* and many more
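To give the semismooth bullet some substance, here is a hypothetical one-dimensional nonsmooth equation solved by a semismooth Newton method: the classical derivative is replaced by an element of the B-subdifferential, and because this F is piecewise affine the iteration terminates in finitely many steps:

```python
def F(x):
    """Nonsmooth, piecewise affine map with a kink at 0; its root is x = 0.5."""
    return x + abs(x) - 1.0

def b_deriv(x):
    # An element of the B-subdifferential of F: 2 for x > 0, 0 for x < 0.
    # At and below the kink we substitute 1.0 (an element of the Clarke
    # subdifferential [0, 2] at 0) as a safeguard against a singular step.
    return 2.0 if x > 0 else 1.0

def semismooth_newton(x, tol=1e-12, itmax=20):
    for k in range(itmax):
        r = F(x)
        if abs(r) < tol:
            return x, k
        x = x - r / b_deriv(x)  # Newton step with a generalized derivative
    return x, itmax

x, k = semismooth_newton(2.0)
print(x, k)                     # 0.5 1 — one step, since F is affine on x > 0
```

In general, semismoothness is what guarantees local superlinear (often quadratic) convergence of such generalized Newton iterations even though F has no classical Jacobian at the solution.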
[Diagram (Section 4.4 of the monograph, "Classes of Nonsmooth Functions", p. 231): inclusion map of nonsmooth function classes — PA, PLQ, PQ; PC², PC¹, semismooth, B-differentiable; dc, dc composite, diff-max-convex, icc; weakly convex, locally dc, lower-C², C¹, LC¹ — all within the locally Lipschitz + directionally differentiable functions, with the inclusions holding under various conditions on the (open or closed, convex) domain]
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
Submitted to the publisher on Friday, September 11, 2020
A touch of mathematicsbridging computations with theory
First of All What Can be Computed
critical(for dc fncs)
Clarke statlimiting stat
directional stat(for dd fncs)
SONC
loc-min
SOSC
Relationship between the stationary points
I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]
eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable
I directional stationarity V prime(x d) ge 0 for all the directions d
bull The sharper the concept the harder to compute
Computational Tools
I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions
I A fusion of first-order and second-order algorithms
ndash Dealing with non-convexity majorization-minimization algorithm
ndash Dealing with non-smoothness nonsmooth Newton algorithm
ndash Dealing with pathological constraints (exact) penalty method
I An integration of sampling and deterministic methods
ndash Dealing with population optimization in statistics
ndash Dealing with stochastic programming in operations research
ndash Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
First of All What Can be Computed
critical(for dc fncs)
Clarke statlimiting stat
directional stat(for dd fncs)
SONC
loc-min
SOSC
Relationship between the stationary points
I Stationarity- smooth case nablaV (x) = 0- nonsmooth case 0 isin partV (x) [subdifferential many calculus rules fail]
eg partV1(x) + partV2(x) 6= part(V1 + V2)(x)unless one of them is continuously differentiable
I directional stationarity V prime(x d) ge 0 for all the directions d
bull The sharper the concept the harder to compute
Computational Tools
I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions
I A fusion of first-order and second-order algorithms
ndash Dealing with non-convexity majorization-minimization algorithm
ndash Dealing with non-smoothness nonsmooth Newton algorithm
ndash Dealing with pathological constraints (exact) penalty method
I An integration of sampling and deterministic methods
ndash Dealing with population optimization in statistics
ndash Dealing with stochastic programming in operations research
ndash Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Computational Tools
I Built on a theory of surrogation supplemented by exactpenalization of stationary solutions
I A fusion of first-order and second-order algorithms
ndash Dealing with non-convexity majorization-minimization algorithm
ndash Dealing with non-smoothness nonsmooth Newton algorithm
ndash Dealing with pathological constraints (exact) penalty method
I An integration of sampling and deterministic methods
ndash Dealing with population optimization in statistics
ndash Dealing with stochastic programming in operations research
ndash Dealing with risk minimization in financial management and tail control
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Classes of Nonsmooth Functions
Pervasive in applications
Progressively broader
bull Piecewise smoothlowast eg piecewise linear-quadratic
bull Difference-of convex facilitates algorithmic design via convex majorization
bull Semismooth enables fast algorithms of the Newton-type
bull Bouligand differentiable locally Lipschitz + directionally differentiable
bull Sub-analytic and semi-analytic
lowast and many more
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
1.1 Smooth Calculus 1
1.1.1 Differentiability of functions of several variables 7
1.1.2 Mean-value theorems for vector-valued functions 12
1.1.3 Fixed-point theorems and contraction 16
1.1.4 Implicit-function theorems 18
1.2 Matrices and Linear Equations 23
1.2.1 Matrix factorizations, eigenvalues, and singular values 29
1.2.2 Selected matrix classes 36
1.3 Matrix Splitting Methods for Linear Equations 47
1.4 The Conjugate Gradient Method 50
1.5 Elements of Set-Valued Analysis 54
2 Optimization Background 59
2.1 Polyhedral Theory 59
2.1.1 The Fourier-Motzkin elimination in linear inequalities 60
2.1.2 Properties of polyhedra 63
2.2 Convex Sets and Functions 67
2.2.1 Euclidean projection and proximality 71
2.3 Convex Quadratic and Nonlinear Programs 79
2.3.1 The KKT conditions and CQs 81
2.4 The Minimax Theory of von Neumann 85
2.5 Basic Descent Methods for Smooth Problems 86
2.5.1 Unconstrained problems: Armijo line search 87
2.5.2 The PGM for convex programs 91
2.5.3 The proximal Newton method 95
2.6 Nonlinear Programming Methods: A Survey 103
3 Structured Learning via Statistics and Optimization 109
3.1 A Composite View of Statistical Estimation 111
3.1.1 The optimization problems 113
3.1.2 Learning classes 115
3.1.3 Loss functions 122
3.1.4 Regularizers for sparsity 123
3.2 Low Rankness in Matrix Optimization 128
3.2.1 Rank constraints: Hard and soft formulations 128
3.2.2 Shape constrained index models 131
3.3 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
4.1 B(ouligand)-Differentiable Functions 144
4.1.1 Generalized directional derivatives 153
4.2 Second-order Directional Derivatives 156
4.3 Subdifferentials and Regularity 162
4.4 Classes of Nonsmooth Functions 176
4.4.1 Piecewise affine functions 177
4.4.2 Piecewise linear-quadratic functions 190
4.4.3 Weakly convex functions 205
4.4.4 Difference-of-convex functions 214
4.4.5 Semismooth functions 221
4.4.6 Implicit convex-concave functions 229
4.4.7 A summary 240
5 Value Functions 243
5.1 Nonlinear Programming based Value Functions 244
5.1.1 Interdiction versus enhancement 247
5.2 Value Functions of Quadratic Programs 251
5.3 Value Functions Associated with Convex Functions 258
5.4 Robustification of Optimization Problems 261
5.4.1 Attacks without set-up costs 264
5.4.2 Attacks with set-up costs 267
5.5 The Danskin Property 271
5.6 Gap Functions 278
5.7 Some Nonconvex Stochastic Value Functions 282
5.8 Understanding Robust Optimization 289
5.8.1 Uncertainty in data realizations 290
5.8.2 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
6.1 First-order Theory 313
6.1.1 The convex constrained case 314
6.1.2 Composite ε-strong stationarity 320
6.1.3 Subdifferential based stationarity 324
6.2 Directional Saddle-Point Theory 332
6.3 Difference-of-Convex Constraints 337
6.3.1 Extension to piecewise dd-convex programs 345
6.4 Second-order Theory 348
6.5 Piecewise Linear-Quadratic Programs 350
6.5.1 Review of nonconvex QPs 350
6.5.2 Return to PLQ programs 352
6.5.3 Finiteness of stationary values 362
6.5.4 Testing copositivity: One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
7.1 Basic Surrogation 372
7.1.1 A broad family of upper surrogation functions 379
7.1.2 Examples of surrogation families 381
7.1.3 Surrogation algorithms for several problem classes 384
7.2 Refinements and Extensions 392
7.2.1 Moreau surrogation 392
7.2.2 Combined BCD and ADMM for coupled constraints 400
7.2.3 Surrogation with line searches 418
7.2.4 Dinkelbach's algorithm with surrogation 424
7.2.5 Randomized choice of subproblems 432
7.2.6 Inexact surrogation 435
8 Theory of Error Bounds 441
8.1 An Introduction to Error Bounds 442
8.1.1 The stationarity inclusions 444
8.2 The Piecewise Polyhedral Inclusion 449
8.3 Error Bounds for Subdifferential-Based Stationarity 454
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.19) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̂c 683
Bibliography 691
Index 721
44 CLASSES OF NONSMOOTH FUNCTIONS 231
dcdc composite
diff-max-cvx
local Lip+
dir diff
PC2 PC1 semismooth B-diff
PQ PLQ PA dc icc
s-weak-cve locally dc
PLQ
+ C1LC1 + lower-C2
s-weak-cvxlocally
s-weak-cvx
onopen
convexdomain by def
under(465) +
(466)
+C1
+ openor closed
convexdomain
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Completed Monograph
Modern Nonconvex Nondifferentiable Optimization
Ying Cui and Jong-Shi Pang
submitted to publisher (Friday September 11 2020)
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
Table of Contents
Preface i
Acronyms i
Glossary of Notation i
Index i
1 Primer of Mathematical Prerequisites 1
11 Smooth Calculus 1
111 Differentiability of functions of several variables 7
112 Mean-value theorems for vector-valued functions 12
113 Fixed-point theorems and contraction 16
114 Implicit-function theorems 18
12 Matrices and Linear Equations 23
121 Matrix factorizations eigenvalues and singular values 29
122 Selected matrix classes 36
13 Matrix Splitting Methods for Linear Equations 47
14 The Conjugate Gradient Method 50
15 Elements of Set-Valued Analysis 54
iii
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
iv TABLE OF CONTENTS
2 Optimization Background 59
21 Polyhedral Theory 59
211 The Fourier-Motzkin elimination in linear inequalities 60
212 Properties of polyhedra 63
22 Convex Sets and Functions 67
221 Euclidean projection and proximality 71
23 Convex Quadratic and Nonlinear Programs 79
231 The KKT conditions and CQs 81
24 The Minmax Theory of von Neumann 85
25 Basic Descent Methods for Smooth Problems 86
251 Unconstrained problems Armijo line search 87
252 The PGM for convex programs 91
253 The proximal Newton method 95
26 Nonlinear Programming Methods A Survey 103
3 Structured Learning via Statistics and Optimization 109
31 A Composite View of Statistical Estimation 111
311 The optimization problems 113
312 Learning classes 115
313 Loss functions 122
314 Regularizers for sparsity 123
32 Low Rankness in Matrix Optimization 128
321 Rank constraints Hard and soft formulations 128
322 Shape constrained index models 131
33 Affine Sparsity Constraints 137
4 Nonsmooth Analysis 143
41 B(ouligand)-Differentiable Functions 144
411 Generalized directional derivatives 153
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
84 Sequential Convergence Under Sufficient Descent 462
841 Using the local error bound (819) 462
842 The Kurdyka-983040Lojaziewicz theory 467
843 Connections between Theorems 841 and 844 475
85 Error Bounds and Linear Regularity 476
851 Piecewise systems 483
852 Stationarity for convex composite dc optimization 488
853 dd-convex systems 491
854 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
91 Exact Penalization of Optimal Solutions 503
92 Exact Penalization of Stationary Solutions 506
93 Pulled-out Penalization of Composite Functions 511
94 Two Applications of Pull-out Penalization 517
941 B-stationarity of dc composite diff-max programs 518
942 Local optimality of composite twice semidiff programs 521
95 Exact Penalization of Affine Sparsity Constraints 527
96 A Special Class of Cardinality Constrained Problems 531
97 Penalized Surrogation for Composite Programs 533
971 Solving the convex penalized subproblems 539
972 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
101 Expectation Functions 548
1011 Sample average approximations 555
102 Basic Stochastic Surrogation 560
1021 Expectation constraints 567
103 Two-stage SPs with Quadratic Recourse 568
1031 The positive definite case 569
viii TABLE OF CONTENTS
1032 The positive semidefinite case 571
104 Probabilisitic Error Bounds 588
105 Consistency of Directional Stationary Solutions 591
1051 Convergence of stationary solutions 592
1052 Asymptotic distribution of the stationary values 605
1053 Convergence rate of the stationary points 609
106 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
111 Definitions and Preliminaries 627
1111 Applications 631
1112 Related models 637
112 Existence of a QNE 640
113 Computational Algorithms 648
1131 A Gauss-Seidel best-response algorithm 649
114 The Role of Potential 658
1141 Existence of a potential 665
115 Double-Loop Algorithms 669
116 The Nikaido-Isoda Optimization Formulations 678
1161 Surrogation applied to the NI function φc 681
1162 Difference-of-convexity and surrogation of 983141φc 683
Bibliography 691
Index 721
TABLE OF CONTENTS v
42 Second-order Directional Derivatives 156
43 Subdifferentials and Regularity 162
44 Classes of Nonsmooth Functions 176
441 Piecewise affine functions 177
442 Piecewise linear-quadratic functions 190
443 Weakly convex functions 205
444 Difference-of-convex functions 214
445 Semismooth functions 221
446 Implicit convex-concave functions 229
447 A summary 240
5 Value Functions 243
51 Nonlinear Programming based Value Functions 244
511 Interdiction versus enhancement 247
52 Value Functions of Quadratic Programs 251
53 Value Functions Associated with Convex Functions 258
54 Robustification of Optimization Problems 261
541 Attacks without set-up costs 264
542 Attacks with set-up costs 267
55 The Danskin Property 271
56 Gap Functions 278
57 Some Nonconvex Stochastic Value Functions 282
58 Understanding Robust Optimization 289
581 Uncertainty in data realizations 290
582 Uncertainty in the probabilities of realizations 304
6 Stationarity Theory 311
61 First-order Theory 313
611 The convex constrained case 314
612 Composite ε-strong stationarity 320
vi TABLE OF CONTENTS
613 Subdifferential based stationarity 324
62 Directional Saddle-Point Theory 332
63 Difference-of-Convex Constraints 337
631 Extension to piecewise dd-convex programs 345
64 Second-order Theory 348
65 Piecewise Linear-Quadratic Programs 350
651 Review of nonconvex QPs 350
652 Return to PLQ programs 352
653 Finiteness of stationary values 362
654 Testing copositivity One negative eigenvalue 367
7 Computational Algorithms by Surrogation 371
71 Basic Surrogation 372
711 A broad family of upper surrogation functions 379
712 Examples of surrogation families 381
713 Surrogation algorithms for several problem classes 384
72 Refinements and Extensions 392
721 Moreau surrogation 392
722 Combined BCD and ADMM for coupled constraints 400
723 Surrogation with line searches 418
724 Dinkelbachrsquos algorithm with surrogation 424
725 Randomized choice of subproblems 432
726 Inexact surrogation 435
8 Theory of Error Bounds 441
81 An Introduction to Error Bounds 442
811 The stationarity inclusions 444
82 The Piecewise Polyhedral Inclusion 449
83 Error Bounds for Subdifferential-Based Stationarity 454
TABLE OF CONTENTS vii
8.4 Sequential Convergence Under Sufficient Descent 462
8.4.1 Using the local error bound (8.1.9) 462
8.4.2 The Kurdyka-Łojasiewicz theory 467
8.4.3 Connections between Theorems 8.4.1 and 8.4.4 475
8.5 Error Bounds and Linear Regularity 476
8.5.1 Piecewise systems 483
8.5.2 Stationarity for convex composite dc optimization 488
8.5.3 dd-convex systems 491
8.5.4 Linear regularity and directional Slater 497
9 Theory of Exact Penalization 501
9.1 Exact Penalization of Optimal Solutions 503
9.2 Exact Penalization of Stationary Solutions 506
9.3 Pulled-out Penalization of Composite Functions 511
9.4 Two Applications of Pull-out Penalization 517
9.4.1 B-stationarity of dc composite diff-max programs 518
9.4.2 Local optimality of composite twice semidiff programs 521
9.5 Exact Penalization of Affine Sparsity Constraints 527
9.6 A Special Class of Cardinality Constrained Problems 531
9.7 Penalized Surrogation for Composite Programs 533
9.7.1 Solving the convex penalized subproblems 539
9.7.2 Semismooth Newton methods 543
10 Nonconvex Stochastic Programs 547
10.1 Expectation Functions 548
10.1.1 Sample average approximations 555
10.2 Basic Stochastic Surrogation 560
10.2.1 Expectation constraints 567
10.3 Two-stage SPs with Quadratic Recourse 568
10.3.1 The positive definite case 569
10.3.2 The positive semidefinite case 571
10.4 Probabilistic Error Bounds 588
10.5 Consistency of Directional Stationary Solutions 591
10.5.1 Convergence of stationary solutions 592
10.5.2 Asymptotic distribution of the stationary values 605
10.5.3 Convergence rate of the stationary points 609
10.6 Noisy Amplitude-based Phase Retrieval 613
11 Nonconvex Game Problems 625
11.1 Definitions and Preliminaries 627
11.1.1 Applications 631
11.1.2 Related models 637
11.2 Existence of a QNE 640
11.3 Computational Algorithms 648
11.3.1 A Gauss-Seidel best-response algorithm 649
11.4 The Role of Potential 658
11.4.1 Existence of a potential 665
11.5 Double-Loop Algorithms 669
11.6 The Nikaido-Isoda Optimization Formulations 678
11.6.1 Surrogation applied to the NI function φc 681
11.6.2 Difference-of-convexity and surrogation of φ̃c 683
Bibliography 691
Index 721