
The Reflectron: Exploiting geometry for learning generalized linear models

Nicholas M. Boffi
Paulson School of Engineering and Applied Sciences
Harvard University
Cambridge, MA 02138
[email protected]

Jean-Jacques E. Slotine
Nonlinear Systems Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139
[email protected]

Abstract

Generalized linear models (GLMs) extend linear regression by generating the dependent variables through a nonlinear function of a predictor in a Reproducing Kernel Hilbert Space. Despite nonconvexity of the underlying optimization problem, the GLM-tron algorithm of Kakade et al. (2011) provably learns GLMs with guarantees of computational and statistical efficiency. We present an extension of the GLM-tron to a mirror descent or natural gradient-like setting, which we call the Reflectron. The Reflectron enjoys the same statistical guarantees as the GLM-tron for any choice of potential function ψ. We show that ψ can be used to exploit the underlying optimization geometry and improve statistical guarantees, or to define an optimization geometry and thereby implicitly regularize the model. The implicit bias of the algorithm can be used to impose advantageous priors, such as sparsity-promoting ones, on the learned weights. Our results extend to the case of multiple outputs with or without weight sharing, and we further show that the Reflectron can be used for online learning of GLMs in the realizable or bounded-noise settings. We primarily perform our analysis in continuous time, which leads to simple derivations; we subsequently prove matching guarantees for a discrete implementation. We supplement our theoretical analysis with simulations on real and synthetic datasets demonstrating the validity of our theoretical results.

1 Introduction

Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y_i are assumed to be given by a known nonlinear "link" function u of a linear predictor of the covariates, E[y_i | x_i] = u(w^T x_i), for some fixed vector of parameters w. GLMs are readily kernelizable, which captures the more flexible setting E[y_i | x_i] = u(⟨α(x_i), w⟩), where α(·) is a feature map in a Reproducing Kernel Hilbert Space (RKHS) and ⟨·, ·⟩ denotes the RKHS inner product. A prominent example of a GLM that arises in practice is logistic regression, which has wide-reaching applications in the natural, social, and medical sciences (Sur and Candès, 2019). Extensive details on GLMs can be found in the standard reference McCullagh and Nelder (1989).

The GLM-tron of Kakade et al. (2011) is the first computationally and statistically efficient algorithm for learning GLMs. Inspired by the Isotron of Kalai and Sastry (2009), it is a simple and intuitive Perceptron-like algorithm applicable to learning arbitrary GLMs with a nondecreasing and Lipschitz link function. In this work, we revisit the GLM-tron from a new perspective, leveraging recent developments in continuous-time optimization and adaptive control theory (Boffi and Slotine, 2019). We consider the continuous-time limit of the GLM-tron and generalize the resulting continuous-time dynamics to a mirror descent-like setting (Beck and Teboulle, 2003; Krichene et al., 2015), which we call the Reflectron. By their equivalence in continuous time, our analysis also applies

Preprint. Under review.

arXiv:2006.08575v2 [cs.LG] 26 Aug 2020


to natural gradient variants of the GLM-tron (Amari, 1998; Pascanu and Bengio, 2013). We prove non-asymptotic generalization error bounds for the resulting family of continuous-time dynamics – parameterized by the choice of potential function ψ – and we further prove convergence rates in the realizable setting. Our continuous-time Reflectron immediately gives rise to a wealth of discrete-time algorithms by choice of discretization method, allowing us to leverage the vast body of work in numerical analysis (Butcher, 2001) and the widespread availability of off-the-shelf black-box ordinary differential equation solvers. As a concrete example, we prove that a simple forward-Euler discretization preserves the guarantees of the continuous-time dynamics.

We connect the Reflectron algorithm with the growing body of literature on the implicit bias of optimization algorithms by applying a recent continuous-time limit (Boffi and Slotine, 2019) of a simple proof methodology for analyzing the implicit regularization of stochastic mirror descent (Azizan et al., 2019; Azizan and Hassibi, 2019). We show that, in the realizable setting, the choice of potential function ψ implicitly biases the learned weights to minimize ψ. We extend our results to a vector-valued setting which allows for weight sharing between output components. We prove that convergence, implicit regularization, and similar generalization error bounds hold in this setting. We further consider the problem of online learning of GLMs in the realizable and bounded-noise settings, and prove O(1/T) and O(1/√T) bounds on the generalization error in these two regimes, respectively.

1.1 Related work and significance

The GLM-tron has recently seen impressive applications in both statistical learning and learning-based control theory. The original work applied the GLM-tron to efficiently learn Single Index Models (SIMs) (Kakade et al., 2011). A recent extension (the BregmanTron) uses Bregman divergences to obtain improved guarantees for learning SIMs, though its use of Bregman divergences is different from ours (Nock and Menon, 2020). Frei et al. (2020) use similar proof techniques to those of Kakade et al. (2011) to show that gradient descent can be used to learn generalized linear models. Foster et al. (2020) utilized the GLM-tron to develop an adaptive control law for stochastic, nonlinear, and discrete-time dynamical systems. Goel and Klivans (2017) use the kernelized GLM-tron, the Alphatron, to provably learn two-hidden-layer neural networks, while Goel et al. (2018) generalize the Alphatron to the Convotron for provable learning of one-hidden-layer convolutional neural networks. Orthogonally, but much like Foster et al. (2020), GLM-tron-like update laws have been developed in the adaptive control literature (Tyukin et al., 2007), along with mirror descent and momentum-like variants (Boffi and Slotine, 2019), where they can be used for provable control of unknown and nonlinearly parameterized dynamical systems. Our work extends this line of research by allowing for the incorporation of local geometry into the GLM-tron update for implicit regularization and improved statistical efficiency in a statistical learning context.

Continuous-time approaches in machine learning and optimization have become increasingly fruitful tools for analysis. Su et al. (2016) derive a continuous-time limit of Nesterov's celebrated accelerated gradient method (Nesterov, 1983), and show that this limit enables intuitive proofs via Lyapunov stability theory. Krichene et al. (2015) perform a similar analysis for mirror descent, while Zhang et al. (2018) show that using standard Runge-Kutta integrators on the second-order dynamics of Su et al. (2016) preserves acceleration. Lee et al. (2016) show via dynamical systems theory that saddle points are almost surely avoided by gradient descent with a random initialization. Boffi and Slotine (2020) use a continuous-time view of distributed stochastic gradient descent methods to analyze the effect of distributed coupling on SGD noise. Remarkably, Wibisono et al. (2016), Betancourt et al. (2018), and Wilson et al. (2016) show that many accelerated optimization algorithms can be generated by discretizing the Euler-Lagrange equations of a functional known as the Bregman Lagrangian. In deep learning, Chen et al. (2018) show that residual networks can be interpreted as a forward-Euler discretization of a continuous-time dynamical system, and use higher-order integrators to arrive at alternative architectures. Our work continues in this promising direction, and highlights that continuous-time analysis offers clean and intuitive proofs that can later be discretized to obtain guarantees for discrete-time algorithms.

As exemplified by the field of deep learning, modern machine learning frequently takes place in a high-dimensional regime with more parameters than examples. It is now well-known that deep networks will interpolate noisy data, yet exhibit low generalization error despite interpolation when trained on meaningful data (Zhang et al., 2016). Defying classical statistical wisdom, an explanation for this apparent paradox has been given in the implicit bias of optimization algorithms and the


double-descent curve (Belkin et al., 2019a). The notion of implicit bias captures the preference of an optimization algorithm to converge to a particular kind of interpolating solution – such as a minimum norm solution – when many options exist. Surprisingly, similar "harmless" or "benign" interpolation phenomena have been observed even in much simpler settings such as overparameterized linear regression (Bartlett et al., 2019; Muthukumar et al., 2019; Hastie et al., 2019) and random feature models (Belkin et al., 2019b; Mei and Montanari, 2019). Understanding implicit bias has thus become an important area of research, with applications ranging from modern deep learning to pure statistics.

Implicit bias has been characterized for separable classification problems (Soudry et al., 2018; Nacson et al., 2018), regression problems using mirror descent (Gunasekar et al., 2018b), and multilayer models (Gunasekar et al., 2018a; Woodworth et al., 2020; Gunasekar et al., 2017). Approximate results and empirical evidence are also available for nonlinear deep networks (Azizan et al., 2019). Our work contributes to the understanding of implicit bias in a practically relevant class of nonconvex learning problems where proofs of convergence and bounds on the generalization error are attainable: GLM regression.

2 Problem setting and background

Our setup follows the original work of Kakade et al. (2011). We assume the dataset {x_i, y_i}_{i=1}^m is sampled i.i.d. from a distribution D supported on X × [0, 1], where E[y_i | x_i] = u(⟨α(x_i), θ⟩) for α(x) ∈ R^p a finite-dimensional feature map with associated kernel K(x_1, x_2) = ⟨α(x_1), α(x_2)⟩ and θ ∈ R^p a fixed unknown vector of parameters. Here u : R → [0, 1] is a known, nondecreasing, and L-Lipschitz link function. We assume that ‖θ‖_2 ≤ W for some fixed bound W, and that ‖α(x)‖_2 ≤ C for all x ∈ X for some fixed bound C. Our goal is to approximate E[y_i | x_i] as measured by the expected squared loss. To this end, for a hypothesis h(·) we define the quantities

err(h) = E_{x,y}[(h(x) − y)²],   (1)

ε(h) = err(h) − err(E[y | x]) = E_{x,y}[(h(x) − u(⟨α(x), θ⟩))²],   (2)

and we denote their empirical counterparts over the dataset by êrr(h) and ε̂(h). Above, err(h) measures the generalization error of h(·), while ε measures the excess risk compared to the Bayes-optimal predictor. Towards minimizing err(h), we present a family of mirror descent-like algorithms for minimizing ε̂(h) over parametric hypotheses of the form h(x) = u(⟨α(x), θ̂⟩). Via standard statistical bounds (Bartlett and Mendelson, 2002), we transfer our guarantees on ε̂(h) to ε(h), which in turn implies a small err(h). The starting point of our analysis is the GLM-tron of Kakade et al. (2011). The GLM-tron is an iterative update law of the form

θ̂_{t+1} = θ̂_t − (1/m) ∑_{i=1}^m (u(⟨α(x_i), θ̂_t⟩) − y_i) α(x_i),   (3)

with initialization θ̂_1 = 0. (3) is a gradient-like update law, obtained from gradient descent on the square loss êrr(h) by replacing the derivative of u by 1. It admits a natural continuous-time limit,

(d/dt) θ̂ = −(1/m) ∑_{i=1}^m (u(⟨α(x_i), θ̂⟩) − y_i) α(x_i),   (4)

where (3) is recovered from (4) via a forward-Euler discretization with a timestep Δt = 1. Throughout this paper, we will use the notation (d/dt)x = ẋ interchangeably for any time-dependent signal x(t).
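To make the discrete update concrete, the following is a minimal NumPy sketch of the GLM-tron iteration (3), equivalently a forward-Euler step of (4) with Δt = 1. The function and variable names are ours, and the small demo at the end uses synthetic data rather than any dataset from the paper.

```python
import numpy as np

def glmtron(features, y, u, n_steps=500, dt=1.0):
    """features: (m, p) array of alpha(x_i); y: (m,) targets in [0, 1];
    u: elementwise nondecreasing, Lipschitz link function."""
    m, p = features.shape
    theta = np.zeros(p)                          # theta_1 = 0, as in the paper
    for _ in range(n_steps):
        residual = u(features @ theta) - y       # (m,) prediction errors
        theta -= dt * features.T @ residual / m  # gradient-like step: u' replaced by 1
    return theta

# Small synthetic demo (ours): a realizable logistic-link GLM.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 10))
features /= np.linalg.norm(features, axis=1, keepdims=True)  # enforce ||alpha(x)||_2 <= 1
theta_star = rng.normal(size=10)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
y = sigmoid(features @ theta_star)
theta_hat = glmtron(features, y, sigmoid)
```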

3 The Reflectron Algorithm

We define the Reflectron algorithm in continuous time as the mirror descent-like dynamics

(d/dt) ∇ψ(θ̂) = −(1/m) ∑_{i=1}^m (u(⟨α(x_i), θ̂⟩) − y_i) α(x_i),   (5)

for ψ : M → R, M ⊆ R^p, a convex function. The parameters of the hypothesis h_t at time t are obtained by applying the inverse gradient of ψ to the output of the Reflectron.
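As an illustration of how the dynamics (5) can be implemented, the sketch below performs one forward-Euler Reflectron step with the potential ψ(w) = (1/2)‖w‖²_q, whose gradient and inverse gradient (the gradient of the conjugate potential (1/2)‖·‖²_{q*}, with 1/q + 1/q* = 1) have simple closed forms. This is our own hedged example: the experiments in Section 5 use ψ(·) = ‖·‖²_{1.1}, which differs from the choice here only by a constant factor.

```python
import numpy as np

def grad_psi(w, q):
    """Gradient of psi(w) = 0.5 * ||w||_q^2 (q > 1)."""
    norm = np.linalg.norm(w, ord=q)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (q - 1.0) * norm ** (2.0 - q)

def grad_psi_inv(z, q):
    """Inverse of grad_psi: the gradient of the conjugate 0.5 * ||.||_{q*}^2."""
    return grad_psi(z, q / (q - 1.0))

def reflectron_step(theta_hat, features, y, u, q=1.1, dt=0.05):
    """One forward-Euler step of (5): update in mirror coordinates, then map back."""
    m = features.shape[0]
    residual = u(features @ theta_hat) - y
    z = grad_psi(theta_hat, q) - dt * features.T @ residual / m
    return grad_psi_inv(z, q)
```

Taking q = 2 makes both maps the identity, and the step reduces to a GLM-tron step.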


3.1 Statistical guarantees

The following theorem gives our statistical guarantees for the Reflectron. It implies that for any choice of strongly convex function ψ, the Reflectron finds a nearly Bayes-optimal predictor when run for sufficiently long time.

Theorem 3.1 (Statistical guarantees for the Reflectron). Suppose that {x_i, y_i}_{i=1}^m are drawn i.i.d. from a distribution D supported on X × [0, 1] with E[y | x] = u(⟨θ, α(x)⟩) for a known nondecreasing and L-Lipschitz link function u : R → [0, 1], a kernel function K with corresponding finite-dimensional feature map α(x) ∈ R^p, and an unknown vector of parameters θ ∈ R^p. Assume that d_ψ(θ ‖ 0) ≤ (σ/2)W², where ψ is σ-strongly convex with respect to ‖·‖_2, and that ‖α(x)‖_2 ≤ C for all x ∈ X. Then, for any δ ∈ (0, 1), with probability at least 1 − δ over the draws of the {x_i, y_i}, there exists some time t < O((σW/C) √(m / log(1/δ))) such that the hypothesis h_t = u(⟨θ̂(t), α(x)⟩) satisfies

max{ε(h_t), ε̂(h_t)} ≤ O(LCW √(log(1/δ) / m)),

where θ̂(t) is output by the Reflectron at time t with θ̂(0) = 0.

Proof. Consider the rate of change of the Bregman divergence (Bregman, 1967)

d_ψ(θ_1 ‖ θ_2) = ψ(θ_1) − ψ(θ_2) − ⟨∇ψ(θ_2), θ_1 − θ_2⟩

between the parameters of the Bayes-optimal predictor and θ̂(t),

(d/dt) d_ψ(θ ‖ θ̂) = ⟨θ̂ − θ, ∇²ψ(θ̂) (d/dt)θ̂⟩.

Observe that (d/dt) ∇ψ(θ̂) = ∇²ψ(θ̂) (d/dt)θ̂, so that

(d/dt) d_ψ(θ ‖ θ̂) = (1/m) ∑_{i=1}^m (u(⟨α(x_i), θ⟩) − u(⟨α(x_i), θ̂⟩)) ⟨α(x_i), θ̂ − θ⟩ + (1/m) ∑_{i=1}^m (y_i − u(⟨α(x_i), θ⟩)) ⟨α(x_i), θ̂ − θ⟩.

Using that u is L-Lipschitz and nondecreasing,

(d/dt) d_ψ(θ ‖ θ̂) ≤ −(1/L) ε̂(h_t) + (1/m) ∑_{i=1}^m (y_i − u(⟨α(x_i), θ⟩)) ⟨α(x_i), θ̂ − θ⟩.   (6)

Now, note that each (1/C)(y_i − u(⟨α(x_i), θ⟩)) α(x_i) is a zero-mean i.i.d. random variable with norm bounded by 1 almost surely. Then, by Lemma C.1,

‖(1/(Cm)) ∑_{i=1}^m (y_i − u(⟨α(x_i), θ⟩)) α(x_i)‖_2 ≤ η

with probability at least 1 − δ, where η = 2(1 + √(log(1/δ)/2))/√m. Assuming that ‖θ̂(t) − θ‖_2 ≤ W at time t, we conclude that

(d/dt) d_ψ(θ ‖ θ̂) ≤ −(1/L) ε̂(h_t) + CWη.

Hence, either (d/dt) d_ψ(θ ‖ θ̂) < −CWη, or ε̂(h_t) ≤ 2LCWη. In the latter case, our result is proven. In the former, by our assumptions d_ψ(θ ‖ 0) = d_ψ(θ ‖ θ̂(0)) ≤ (σ/2)W², and hence ‖θ̂(0) − θ‖_2 ≤ W by σ-strong convexity of ψ with respect to ‖·‖_2. Furthermore, again by σ-strong convexity and by the decrease condition on the Bregman divergence,

‖θ̂(t) − θ‖_2 ≤ √((2/σ) d_ψ(θ ‖ θ̂(t))) ≤ √((2/σ) d_ψ(θ ‖ θ̂(0))) ≤ W.

Thus it can take until at most

t_f = d_ψ(θ ‖ θ̂(0)) / (CWη) = (σW²/2) / (CWη) = σW / (2Cη)

to satisfy ε̂(h_t) ≤ 2LCWη. Hence there is some h_t with t < t_f such that ε̂(h_t) ≤ O(LCW √(log(1/δ)/m)). To transfer this bound on ε̂ to ε, we need to bound the quantity |ε(h_t) − ε̂(h_t)|. Combining Theorems C.1 and C.2 gives a bound on the Rademacher complexity

R_m(F) ≤ O(LCW √(1/m)), where F = {u(⟨w, α(x)⟩) : x ∈ X, ‖w‖_2 ≤ 2W, ‖α(x)‖_2 ≤ C},

and clearly h_t ∈ F. Application of Theorem C.3 to the square loss¹ immediately implies

|ε(h_t) − ε̂(h_t)| ≤ O(LCW/√m) + O(√(log(1/δ)/m))

with probability at least 1 − δ. The conclusion of the theorem follows by a union bound.

Because ε(h_t) = err(h_t) up to a constant, we can find a good predictor h_t by using a hold-out set to estimate err(h_t) and by taking the best predictor on the hold-out set.

Several comments on our results are now in order. Our proof of Theorem 3.1 is similar to the corresponding proofs for the GLM-tron (Kakade et al., 2011) and the Alphatron (Goel and Klivans, 2017), but has two primary modifications. First, we consider the Bregman divergence under ψ between the Bayes-optimal parameters and the current parameter estimates, rather than the squared Euclidean distance. Second, rather than analyzing the iteration on ‖θ̂_t − θ‖²_2 as in the discrete-time case, we analyze the dynamics of the Bregman divergence. Taking ψ = (1/2)‖·‖²_2 recovers the guarantee of the Alphatron², and taking α(x) = x recovers the guarantee of the GLM-tron (up to forward-Euler discretization-specific details). Frei et al. (2020) recently considered using gradient descent for learning GLMs, and obtained statistical guarantees in the agnostic setting using proof techniques similar to those used for the GLM-tron. Combining their proof techniques with our own, matching guarantees for the use of mirror descent to learn GLMs can be obtained.

In Appendix E, we consider ψ strongly convex with respect to ‖·‖_1, where an improved statistical guarantee can be obtained if bounds are known on ‖α(x)‖_∞ rather than ‖α(x)‖_2.

As our analysis applies in the continuous setting, many discrete variants may be obtained by choice of discretization method. As a concrete example, Appendix G considers a simple forward-Euler discretization with projection, where matching statistical guarantees are obtained.

3.2 Implicit regularization

We now consider an alternative setting, and probe how the choice of ψ impacts the parameters learned by the Reflectron. We require the following assumption.

Assumption 3.1. The dataset {x_i, y_i}_{i=1}^m is realizable. That is, there exists some fixed parameter vector θ ∈ R^p such that y_i = u(⟨θ, α(x_i)⟩) for all i.

Assumption 3.1 allows us to understand both overfitting and interpolation by the Reflectron. In many cases, even the noisy dataset of Section 3.1 may satisfy Assumption 3.1. We begin by proving convergence of the Reflectron in the realizable setting.

Lemma 3.1 (Convergence of the Reflectron for a realizable dataset). Suppose that {x_i, y_i}_{i=1}^m are drawn i.i.d. from a distribution D supported on X × [0, 1] and that Assumption 3.1 is satisfied, where u is a known, nondecreasing, and L-Lipschitz function. Let ψ be any convex function with invertible Hessian over the trajectory θ̂(t). Then ε̂(h_t) → 0, where h_t(x) = u(⟨θ̂(t), α(x)⟩) is the hypothesis with parameters output by the Reflectron at time t with θ̂(0) arbitrary. Furthermore, inf_{t'∈[0,t]} {ε̂(h_{t'})} ≤ O(1/t).

Proof. Under the assumptions of the lemma, (6) shows that

(d/dt) d_ψ(θ ‖ θ̂) ≤ −(1/L) ε̂(h_t) ≤ 0.

Integrating both sides of the above gives the bound

∫_0^t ε̂(h_{t'}) dt' ≤ L (d_ψ(θ ‖ θ̂(0)) − d_ψ(θ ‖ θ̂(t))) ≤ L d_ψ(θ ‖ θ̂(0)).

¹Note that while the square loss is not bounded or Lipschitz in general, it is both over the domain [0, 1], with bound b = 1 and Lipschitz constant L′ = 1.

²Our setting corresponds to the case where ξ = 0 in the notation of Goel and Klivans (2017).


Explicit computation shows that (d/dt) ε̂(h_t) is bounded, so that ε̂(h_t) is uniformly continuous in t. By Barbalat's Lemma (Lemma C.2), this implies that ε̂ → 0 as t → ∞.

Furthermore, this simple analysis immediately gives us a convergence rate. Indeed, one can write

inf_{t'∈[0,t]} {ε̂(h_{t'})} · t = ∫_0^t (inf_{t''∈[0,t]} {ε̂(h_{t''})}) dt' ≤ ∫_0^t ε̂(h_{t'}) dt' ≤ L d_ψ(θ ‖ θ̂(0)),

so that inf_{t'∈[0,t]} {ε̂(h_{t'})} ≤ L d_ψ(θ ‖ θ̂(0)) / t.

Lemma 3.1 shows that the Reflectron will converge to an interpolating solution in the realizable setting, and that the best hypothesis up to time t does so at an O(1/t) rate. Note that Lemma 3.1 allows for arbitrary initialization, while Theorem 3.1 requires θ̂(0) = 0.

In general, there may be many possible vectors θ consistent with the data. The following theorem provides insight into the parameters learned by the Reflectron.

Theorem 3.2 (Implicit regularization of the Reflectron). Consider the setting of Lemma 3.1. Let A = {θ : u(⟨θ, α(x_i)⟩) = y_i, i = 1, . . . , m} be the set of parameters that interpolate the data, and assume that θ̂(t) → θ̂_∞ ∈ A. Further assume that u(·) is invertible. Then θ̂_∞ = arg min_{θ∈A} d_ψ(θ ‖ θ̂(0)). In particular, if θ̂(0) = arg min_{w∈M} ψ(w), then θ̂_∞ = arg min_{θ∈A} ψ(θ).

Proof. Let θ ∈ A be arbitrary. Then,

(d/dt) d_ψ(θ ‖ θ̂(t)) = −(1/m) ∑_{i=1}^m (u(⟨θ̂(t), α(x_i)⟩) − y_i) ⟨θ̂(t) − θ, α(x_i)⟩

= −(1/m) ∑_{i=1}^m (u(⟨θ̂(t), α(x_i)⟩) − y_i) (⟨θ̂(t), α(x_i)⟩ − u^{-1}(y_i)).

Above, we used that θ ∈ A and that u(·) is invertible, so that u(⟨θ, α(x_i)⟩) = y_i implies that ⟨θ, α(x_i)⟩ = u^{-1}(y_i). For clarity, define the error on example i as ỹ_i(θ̂(t)) = u(⟨θ̂(t), α(x_i)⟩) − y_i. Integrating both sides of the above from 0 to ∞, we find that

d_ψ(θ ‖ θ̂_∞) = d_ψ(θ ‖ θ̂(0)) − (1/m) ∑_{i=1}^m ∫_0^∞ ỹ_i(θ̂(t)) (⟨θ̂(t), α(x_i)⟩ − u^{-1}(y_i)) dt.

The above relation is true for any θ ∈ A. Furthermore, the integral on the right-hand side is independent of θ. Hence the arg min over A of the two Bregman divergences must be equal, which shows that θ̂_∞ = arg min_{θ∈A} d_ψ(θ ‖ θ̂(0)).

Theorem 3.2 elucidates the implicit bias of the Reflectron. Out of all possible interpolating parameters, the Reflectron finds those that minimize the Bregman divergence between the manifold of interpolating parameters and the initialization.

Note that the difference between the Reflectron and mirror descent on the square loss is the appearance of u′(⟨θ̂(t), α(x_i)⟩) in the sum, which is independent of θ. Hence the presence of this link derivative does not affect our argument, which shows that gradient descent on GLMs exhibits the same implicit bias.
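The following self-contained toy script (ours, not an experiment from the paper) gives a rough numerical check of Theorem 3.2: on realizable, overparameterized data with a sparse ground truth, an Euler-discretized Reflectron started from θ̂(0) = 0 with ψ = (1/2)‖·‖²_{1.1} should find an interpolator with a markedly smaller l_1 norm than the one found with ψ = (1/2)‖·‖²_2. Problem sizes, step sizes, and iteration counts are arbitrary choices.

```python
import numpy as np

def mirror(w, q):
    """Gradient of 0.5 * ||w||_q^2; its inverse is mirror(., q / (q - 1))."""
    norm = np.linalg.norm(w, ord=q)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (q - 1.0) * norm ** (2.0 - q)

def train_reflectron(q, features, y, u, steps=20000, dt=2.0):
    theta = np.zeros(features.shape[1])          # theta(0) = 0 = argmin psi
    for _ in range(steps):
        grad = features.T @ (u(features @ theta) - y) / len(y)
        theta = mirror(mirror(theta, q) - dt * grad, q / (q - 1.0))
    return theta

rng = np.random.default_rng(1)
m, p = 50, 200                                   # overparameterized: many interpolators
features = rng.normal(size=(m, p)) / np.sqrt(p)
theta_star = np.zeros(p)
theta_star[:5] = 3.0                             # sparse ground truth
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
y = sigmoid(features @ theta_star)               # realizable labels

for q in (1.1, 2.0):
    th = train_reflectron(q, features, y, sigmoid)
    mse = np.mean((sigmoid(features @ th) - y) ** 2)
    print(f"q = {q}: ||theta||_1 = {np.linalg.norm(th, 1):.2f}, train mse = {mse:.2e}")
```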

4 Vector-valued GLMs

In this section, we consider an extension to the case of vector-valued target variables y_i ∈ R^n. We assume that u(x)_i = u_i(x_i), where each u_i is L_i-Lipschitz and nondecreasing in its argument. We


define the expected and empirical error measures in this setting by replacing squared terms with squared Euclidean norms in the definitions (1) and (2).

In many cases, it is desirable to allow for weight sharing between output variables in a model. For instance, if a vector-valued GLM estimation problem originates in a control or system identification context, parameters can have physical meaning and may appear in multiple equations. Similarly, convolutional neural networks exploit weight sharing as a beneficial prior for imposing translation equivariance. We can provably learn and implicitly regularize weight-shared GLMs via a vector-valued Reflectron. Define the dynamics

(d/dt) ∇ψ(θ̂) = −(1/m) ∑_{i=1}^m A^T(x_i) (u(A(x_i) θ̂) − y_i)   (7)

with θ̂ ∈ R^p, A(x_i) ∈ R^{n×p}, and ψ : M → R, M ⊆ R^p, convex. Note that (7) encompasses models of the form h(x) = u(Θ α(x)) with Θ ∈ R^{n×q} and α(x) ∈ R^q by unraveling Θ into a vector of size nq and defining A(x) appropriately in terms of α(x), as illustrated in the sketch below. Appendix D discusses this case explicitly, where tighter bounds are attainable and matrix-valued regularizers can be used. Our model is similar in spirit to the Convotron of Goel et al. (2018), but exploits mirror descent and applies to vector-valued outputs. (7) could in principle be extended to provably learn regularized single-layer convolutional networks with multiple outputs via the distributional assumptions of Goel et al. (2018), which allow an application of average pooling.
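The sketch below makes the unraveling concrete under one particular convention (ours): with a row-major flattening of Θ, the matrix A(x) = I_n ⊗ α(x)^T satisfies A(x)θ = Θα(x). Appendix D may use a different but equivalent convention.

```python
import numpy as np

n, q_feat = 3, 4                      # output dimension and feature dimension
rng = np.random.default_rng(0)
Theta = rng.normal(size=(n, q_feat))  # shared weight matrix
alpha_x = rng.normal(size=q_feat)     # feature vector alpha(x)

theta = Theta.reshape(-1)             # unravel Theta into a vector of size n * q
A_x = np.kron(np.eye(n), alpha_x)     # A(x) of shape (n, n * q), block structure
assert np.allclose(A_x @ theta, Theta @ alpha_x)
```

Genuine weight sharing corresponds to an A(x) whose rows reuse the same entries of θ, rather than the block-diagonal structure above.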

We can state guarantees analogous to those in the scalar-valued case. We begin with convergence.

Lemma 4.1 (Convergence of the vector-valued Reflectron for a realizable dataset). Suppose that {x_i, y_i}_{i=1}^m are drawn i.i.d. from a distribution D supported on X × [0, 1]^n and that Assumption 3.1 is satisfied, where the component functions u(x)_i = u_i(x_i) are known, nondecreasing, and L_i-Lipschitz. Let ψ be any convex function with invertible Hessian over the trajectory θ̂(t). Then ε̂(h_t) → 0, where h_t(x) = u(A(x) θ̂(t)) is the hypothesis with parameters output by the vector-valued Reflectron (7) at time t with θ̂(0) arbitrary. Furthermore, inf_{t'∈[0,t]} {ε̂(h_{t'})} ≤ O(1/t).

The proof is given in Appendix A.1. As in the scalar-valued case, the choice of ψ implicitly biases θ̂.

Theorem 4.1 (Implicit regularization of the vector-valued Reflectron). Consider the setting of Lemma 4.1. Assume that θ̂(t) → θ̂_∞, where θ̂_∞ interpolates the data, and assume that u(·) is invertible. Then θ̂_∞ = arg min_{θ∈A} d_ψ(θ ‖ θ̂(0)), where A is defined analogously to Theorem 3.2. In particular, if θ̂(0) = arg min_{w∈M} ψ(w), then θ̂_∞ = arg min_{θ∈A} ψ(θ).

The proof is given in Appendix A.2. Finally, we may also state a statistical guarantee for (7).

Theorem 4.2 (Statistical guarantees for the vector-valued Reflectron). Suppose that {x_i, y_i}_{i=1}^m are drawn i.i.d. from a distribution supported on X × [0, 1]^n with E[y | x] = u(A(x) θ) for a known function u(x) and an unknown vector of parameters θ ∈ R^p. Assume that u(x)_i = u_i(x_i), where each u_i : X_i → [0, 1] is L_i-Lipschitz and nondecreasing in its argument. Assume that A : X → R^{n×p} is a known matrix-valued function with ‖A(x)‖_2 ≤ C for all x ∈ X. Further assume that ‖α_i(x)‖_2 ≤ C_r for i = 1, . . . , n, where α_i(x) is the i-th row of A(x). Let d_ψ(θ ‖ 0) ≤ (σ/2)W², where ψ is σ-strongly convex with respect to ‖·‖_2. Then, for any δ ∈ (0, 1), with probability at least 1 − δ over the draws of {x_i, y_i}, there exists some time t < O((σW/(C√n)) √(m / log(1/δ))) such that the hypothesis h_t(x) = u(A(x) θ̂(t)) satisfies

ε̂(h_t) ≤ O(max_k{L_k} CW √(n log(1/δ) / m)),

ε(h_t) ≤ O((max_k{L_k} W / √m)(C_r n^{3/2} + C √(n log(1/δ)))),

where θ̂(t) is output by the vector-valued Reflectron (7) at time t with θ̂(0) = 0.

The proof is given in Appendix A.3.


Figure 1: (A) Trajectories of the empirical risk êrr(h_t) and generalization error err(h_t) for ψ(·) = ‖·‖²_{1.1} and ψ(·) = ‖·‖²_2. êrr(h_t) is shown solid, while err(h_t) is shown dashed. (B) Trajectories of the training and test set accuracy. Training is shown solid, while testing is shown dashed. (C) A comparison of the empirical risk and generalization error dynamics as a function of the integration method (euler, dopri5, dop853, lsoda) with ψ(·) = ‖·‖²_{1.1}. All solid curves and all dashed curves lie directly on top of each other and are shifted for visual clarity.

5 Simulations

As a simple illustration of our theoretical results, we perform classification on the MNIST dataset using a single-layer multiclass classification model. Details of the simulation setup can be found in Appendix B.1. In Figure 1A, we show the empirical risk and generalization error trajectories for the Reflectron with ψ(·) = ‖·‖²_{1.1} and ψ(·) = ‖·‖²_2. The latter choice reduces to the GLM-tron, while the former, following Theorem 3.2, approximates the l_1 norm and imposes sparsity. The dynamics are integrated with the dop853 integrator from scipy.integrate.ode. Both choices converge to similar values of êrr and err. In Figure 1B, we show the training and test set accuracy; both choices of norm converge to similar accuracy values. In Figure 1C, we show the curves êrr(h_t) and err(h_t) with ψ(·) = ‖·‖²_{1.1} for four integration methods. The curves lie directly on top of each other and are shifted arbitrarily for clarity. This agreement highlights the validity of our continuous-time analysis, and shows that many possible discrete-time algorithms are captured by the continuous-time dynamics.
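For concreteness, the sketch below shows one way (ours) to integrate Reflectron-style dynamics with a black-box solver from scipy.integrate.ode, using the dop853 integrator mentioned above. The state handed to the solver is the mirror variable z = ∇ψ(θ̂); the problem is a small synthetic scalar-output GLM, not the MNIST model of this section.

```python
import numpy as np
from scipy.integrate import ode

rng = np.random.default_rng(0)
m, p, q = 200, 50, 1.1
features = rng.normal(size=(m, p)) / np.sqrt(p)
theta_star = rng.normal(size=p)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
y = sigmoid(features @ theta_star)

def grad_psi(w, r):
    """Gradient of 0.5 * ||w||_r^2; its inverse is grad_psi(., r / (r - 1))."""
    norm = np.linalg.norm(w, ord=r)
    return np.zeros_like(w) if norm == 0.0 else np.sign(w) * np.abs(w)**(r - 1.0) * norm**(2.0 - r)

def rhs(t, z):
    """(d/dt) grad_psi(theta) = -(1/m) sum_i (u(<alpha(x_i), theta>) - y_i) alpha(x_i)."""
    theta = grad_psi(z, q / (q - 1.0))       # recover theta from the mirror variable
    return -features.T @ (sigmoid(features @ theta) - y) / m

solver = ode(rhs).set_integrator("dop853")
solver.set_initial_value(np.zeros(p), 0.0)   # theta(0) = 0 corresponds to z(0) = 0
z_final = solver.integrate(200.0)            # integrate the dynamics up to t = 200
theta_hat = grad_psi(z_final, q / (q - 1.0))
```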

In Figure 2, we show histograms of the final parameter matrices learned by the Reflectron with ψ(·) = ‖·‖²_{1.1} and ψ(·) = ‖·‖²_2. The histograms validate the prediction of Theorem 3.2. A sparse parameter vector is found for ψ(·) = ‖·‖²_{1.1}, which obtains accuracy values similar to those for ψ(·) = ‖·‖²_2, as seen in Fig. 1B. 73.51% of parameters have magnitude less than 10^{-3} for ψ(·) = ‖·‖²_{1.1}, while only 12.27% do for ψ(·) = ‖·‖²_2. Future work will apply the Reflectron to models that combine sparse learning with a fixed expressive representation of the data, such as the scattering transform (Mallat, 2011; Bruna and Mallat, 2012; Talmon et al., 2015; Oyallon et al., 2019; Zarka et al., 2019).

6 Broader Impact

In this work, we developed mirror descent-like variants of the GLM-tron algorithm of Kakade et al. (2011). We proved guarantees on convergence and generalization, and characterized the implicit bias of our algorithms in terms of the potential function ψ. We generalized our results to the case of vector-valued target variables while allowing for the possibility of weight sharing between outputs. Our algorithms have applications in several settings. Using the techniques in Foster et al. (2020), they may be generalized to the adaptive control context for provably regularized online learning


Figure 2: Histograms of parameter values for (A) the parameters found by the Reflectron with ψ = ‖·‖²_{1.1}, and (B) the parameters found by the Reflectron with ψ = ‖·‖²_2 (this choice reduces to the GLM-tron). 73.51% of parameters have magnitude less than 10^{-3} with ψ(·) = ‖·‖²_{1.1}, while only 12.27% do for ψ(·) = ‖·‖²_2. Lowering this threshold shows that the l_{1.1}-regularized solution has many near-zero parameters, while the l_2 solution is diffuse.

and control of stochastic dynamical systems. Applications in control will advance automation, but may have negative downstream consequences for those working in areas that can be replaced by adaptive control or robotic systems. Our algorithms can also be used for recovering the weights of a continuous- or discrete-time recurrent neural network from online data, which may have applications in recurrent network pruning (via sparsity-promoting biases such as an l_1 approximation), or in computational neuroscience.

Acknowledgments and Disclosure of Funding

We thank Stephen Tu for many helpful discussions.

References

Amari, S.-i. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.

Azizan, N. and Hassibi, B. (2019). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. In International Conference on Learning Representations.

Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv:1906.03830.

Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2019). Benign overfitting in linear regression. arXiv:1906.11300.

Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482.

Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019a). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854.

Belkin, M., Hsu, D., and Xu, J. (2019b). Two models of double descent for weak features. arXiv:1903.07571.

Betancourt, M., Jordan, M. I., and Wilson, A. C. (2018). On symplectic optimization. arXiv:1802.03653.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandit algorithms with supervised learning guarantees. In Proceedings of Machine Learning Research, volume 15, pages 19–26, Fort Lauderdale, FL, USA. JMLR Workshop and Conference Proceedings.


Boffi, N. M. and Slotine, J.-J. E. (2019). Higher-order algorithms and implicit regularization for nonlinearly parameterized adaptive control. arXiv:1912.13154.

Boffi, N. M. and Slotine, J.-J. E. (2020). A continuous-time analysis of distributed stochastic gradient. Neural Computation, 32(1):36–96.

Bregman, L. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217.

Bruna, J. and Mallat, S. (2012). Invariant scattering convolution networks. arXiv:1203.1513.

Butcher, J. (2001). Numerical methods for ordinary differential equations in the 20th century.

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. (2018). Neural ordinary differential equations. arXiv:1806.07366.

Foster, D. J., Rakhlin, A., and Sarkar, T. (2020). Learning nonlinear dynamical systems from a single trajectory. arXiv:2004.14681.

Frei, S., Cao, Y., and Gu, Q. (2020). Agnostic learning of a single neuron with gradient descent. arXiv:2005.14426.

Goel, S. and Klivans, A. (2017). Learning neural networks with two nonlinear layers in polynomial time. arXiv:1709.06010.

Goel, S., Klivans, A., and Meka, R. (2018). Learning one convolutional layer with overlapping patches. arXiv:1802.02547.

Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. (2018a). Characterizing implicit bias in terms of optimization geometry. arXiv:1802.08246.

Gunasekar, S., Lee, J. D., Soudry, D., and Srebro, N. (2018b). Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems 31, pages 9461–9471. Curran Associates, Inc.

Gunasekar, S., Woodworth, B., Bhojanapalli, S., Neyshabur, B., and Srebro, N. (2017). Implicit regularization in matrix factorization. arXiv:1705.09280.

Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv:1903.08560.

Ji, Z. and Telgarsky, M. (2019). Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. arXiv:1909.12292.

Kakade, S., Kalai, A. T., Kanade, V., and Shamir, O. (2011). Efficient learning of generalized linear and single index models with isotonic regression. arXiv:1104.2018.

Kakade, S. M., Sridharan, K., and Tewari, A. (2009). On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 793–800. Curran Associates, Inc.

Kalai, A. T. and Sastry, R. (2009). The Isotron algorithm: High-dimensional isotonic regression. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT).

Krichene, W., Bayen, A., and Bartlett, P. L. (2015). Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems 28, pages 2845–2853. Curran Associates, Inc.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. (2016). Gradient descent converges to minimizers. arXiv:1602.04915.

Mallat, S. (2011). Group invariant scattering. arXiv:1101.2286.


Maurer, A. (2016). A vector-contraction inequality for Rademacher complexities. arXiv:1605.00251.

McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, Second Edition. CRC Press.

Mei, S. and Montanari, A. (2019). The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv:1908.05355.

Muthukumar, V., Vodrahalli, K., and Sahai, A. (2019). Harmless interpolation of noisy data in regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2299–2303.

Nacson, M. S., Lee, J. D., Gunasekar, S., Savarese, P. H. P., Srebro, N., and Soudry, D. (2018). Convergence of gradient descent on separable data. arXiv:1803.01905.

Nesterov, Y. (1983). A method for solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 26:367–372.

Nock, R. and Menon, A. K. (2020). Supervised learning: No loss no cry. arXiv:2002.03555.

Oyallon, E., Zagoruyko, S., Huang, G., Komodakis, N., Lacoste-Julien, S., Blaschko, M., and Belilovsky, E. (2019). Scattering networks for hybrid representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2208–2221.

Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. arXiv:1301.3584.

Slotine, J.-J. and Li, W. (1991). Applied Nonlinear Control. Prentice Hall.

Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. (2018). The implicit bias of gradient descent on separable data. J. Mach. Learn. Res., 19(1):2822–2878.

Su, W., Boyd, S., and Candès, E. J. (2016). A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1–43.

Sur, P. and Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525.

Talmon, R., Mallat, S., Zaveri, H., and Coifman, R. R. (2015). Manifold learning for latent variable inference in dynamical systems. IEEE Transactions on Signal Processing, 63(15):3843–3856.

Tyukin, I. Y., Prokhorov, D. V., and van Leeuwen, C. (2007). Adaptation and parameter estimation in systems with unstable target dynamics and nonlinear parametrization. IEEE Transactions on Automatic Control, 52(9):1543–1559.

Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.

Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358.

Wilson, A. C., Recht, B., and Jordan, M. I. (2016). A Lyapunov analysis of momentum methods in optimization. arXiv:1611.02635.

Woodworth, B., Gunasekar, S., Lee, J. D., Moroshko, E., Savarese, P., Golan, I., Soudry, D., and Srebro, N. (2020). Kernel and rich regimes in overparametrized models. arXiv:2002.09277.

Zarka, J., Thiry, L., Angles, T., and Mallat, S. (2019). Deep network classification by scattering and homotopy dictionary learning. arXiv:1910.03561.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv:1611.03530.

Zhang, J., Mokhtari, A., Sra, S., and Jadbabaie, A. (2018). Direct Runge-Kutta discretization achieves acceleration. arXiv:1805.00521.


A Omitted proofs

A.1 Proof of Lemma 4.1

Proof. The proof is analogous to that of Lemma 3.1. We have that

(d/dt) θ̂ = −(1/m) ∑_{i=1}^m (∇²ψ(θ̂))^{-1} A^T(x_i) (u(A(x_i) θ̂) − u(A(x_i) θ)),

so that

(d/dt) d_ψ(θ ‖ θ̂) = (θ̂ − θ)^T (−(1/m) ∑_{i=1}^m A^T(x_i) (u(A(x_i) θ̂) − u(A(x_i) θ)))

= ∑_k (θ̂_k − θ_k) (−(1/m) ∑_{i=1}^m ∑_j A_{jk}(x_i) (u_j(∑_{k'} A_{jk'}(x_i) θ̂_{k'}) − u_j(∑_{k'} A_{jk'}(x_i) θ_{k'})))

= −(1/m) ∑_{i=1}^m ∑_j [(u_j(∑_k A_{jk}(x_i) θ̂_k) − u_j(∑_k A_{jk}(x_i) θ_k)) (∑_k A_{jk}(x_i) (θ̂_k − θ_k))]

≤ −(1/m) ∑_{i=1}^m ∑_j (1/L_j) (u_j(∑_k A_{jk}(x_i) θ̂_k) − u_j(∑_k A_{jk}(x_i) θ_k))²

≤ −(1/(m max_k{L_k})) ∑_{i=1}^m ‖u(A(x_i) θ̂) − u(A(x_i) θ)‖²_2

= −(1/max_k{L_k}) ε̂(h_t) ≤ 0.

The conclusions of the lemma follow identically by the machinery of the proof of Lemma 3.1.

A.2 Proof of Theorem 4.1

Proof. The proof follows the same structure as in the scalar-valued case. Let θ ∈ A. Then,

(d/dt) d_ψ(θ ‖ θ̂) = (θ̂ − θ)^T (−(1/m) ∑_{i=1}^m A^T(x_i) (u(A(x_i) θ̂) − y_i))

= Tr[(θ̂ − θ)^T (−(1/m) ∑_{i=1}^m A^T(x_i) (u(A(x_i) θ̂) − y_i))]

= −(1/m) ∑_{i=1}^m Tr[(θ̂ − θ)^T A^T(x_i) (u(A(x_i) θ̂) − y_i)]

= −(1/m) ∑_{i=1}^m Tr[(u(A(x_i) θ̂) − y_i) (θ̂ − θ)^T A^T(x_i)]

= −(1/m) ∑_{i=1}^m (u(A(x_i) θ̂) − y_i)^T A(x_i) (θ̂ − θ)

= −(1/m) ∑_{i=1}^m (u(A(x_i) θ̂) − y_i)^T (A(x_i) θ̂ − u^{-1}(y_i)).

In the derivation above, we have replaced A(x_i) θ by u^{-1}(y_i), following our assumptions that θ ∈ A and u is invertible. Integrating both sides of the above from 0 to ∞, we find that

d_ψ(θ ‖ θ̂_∞) = d_ψ(θ ‖ θ̂(0)) − (1/m) ∑_{i=1}^m ∫_0^∞ (u(A(x_i) θ̂(t)) − y_i)^T (A(x_i) θ̂(t) − u^{-1}(y_i)) dt.

The above relation is true for any θ ∈ A. Furthermore, the integral on the right-hand side is independent of θ. Hence the arg min over A of the two Bregman divergences must be equal, which shows that

θ̂_∞ = arg min_{θ∈A} d_ψ(θ ‖ θ̂(0)).

Initializing θ̂(0) = arg min_{w∈M} ψ(w) completes the proof.

A.3 Proof of Theorem 4.2

Proof. Consider the rate of change of the Bregman divergence between the parameters of the Bayes-optimal predictor and the parameters produced by the Reflectron at time t. By the same method as in Lemma 4.1, we immediately have

(d/dt) d_ψ(θ ‖ θ̂) ≤ −(1/max_k{L_k}) ε̂(h_t) + (1/m) ∑_{i=1}^m (y_i − u(A(x_i) θ))^T A(x_i) (θ̂ − θ).

Now, note that each (1/(C√n)) (y_i − u(A(x_i) θ))^T A(x_i) is a zero-mean i.i.d. random variable with Euclidean norm bounded by 1 almost surely. Then, by Lemma C.1,

‖(1/(C√n m)) ∑_{i=1}^m (y_i − u(A(x_i) θ))^T A(x_i)‖_2 ≤ η

with probability at least 1 − δ, where η = 2(1 + √(log(1/δ)/2))/√m. Assuming that ‖θ̂(t) − θ‖_2 ≤ W at time t, we conclude that

(d/dt) d_ψ(θ ‖ θ̂) ≤ −(1/max_k{L_k}) ε̂(h_t) + √n CWη.

Hence, either (d/dt) d_ψ(θ ‖ θ̂) < −√n CWη, or ε̂(h_t) ≤ 2 max_k{L_k} √n CWη. In the latter case, our result is proven. In the former, by our assumptions d_ψ(θ ‖ 0) = d_ψ(θ ‖ θ̂(0)) ≤ (σ/2)W², and hence ‖θ̂(0) − θ‖_2 ≤ W by σ-strong convexity of ψ with respect to ‖·‖_2. Furthermore,

‖θ̂(t) − θ‖_2 ≤ √((2/σ) d_ψ(θ ‖ θ̂(t))) ≤ √((2/σ) d_ψ(θ ‖ θ̂(0))) ≤ W.

Thus it can take until at most

t_f = d_ψ(θ ‖ θ̂(0)) / (√n CWη) = (σW²/2) / (√n CWη) = σW / (2√n Cη)

to satisfy ε̂(h_t) ≤ 2 max_k{L_k} √n CWη. Hence there is some h_t with t < t_f such that

ε̂(h_t) ≤ O(√n max_k{L_k} CW √(log(1/δ)/m)).

Combining Theorems C.1 and C.2, we obtain a bound on the Rademacher complexity of the function class for each component of u(A(x) θ̂),

R_m(F_k) ≤ O(L_k C_r W / √m).

Application of Theorem C.5 to bound the generalization error, noting that the square loss is √n-Lipschitz and √n-bounded over [0, 1]^n, gives the bound

|ε(h_t) − ε̂(h_t)| ≤ O(max_k{L_k} C_r W n^{3/2} / √m) + O(√(n log(1/δ)/m)),

which completes the proof.


B Further simulation details and results

B.1 MNIST simulation details

We implement the vector-valued Reflectron without weight sharing. Rather than implement the continuous-time dynamics (7), we directly utilize the dynamics (9), as they are more efficient without weight sharing. In Section 5, we show results for the mirror descent-like dynamics; Appendix B.2 shows similar results for the natural gradient-like dynamics. Our hypothesis at time t is given by

h_t(x) = u(Θ(t) x),

where u(·) is an elementwise sigmoid, x ∈ R^784 is an image from the MNIST dataset, and Θ ∈ R^{10×784} is a matrix of parameters to be learned. We use a one-hot encoding of the class labels, and the predicted class is obtained by taking an arg max over the components of h_t(x). A training set of size 5000 and a test set of size 10000 are both randomly selected from the overall dataset, so that the number of parameters np > m. The training data is pre-processed to have zero mean and unit variance; the testing data is shifted by the mean of the training data and normalized by the standard deviation of the training data. A value of Δt = 0.05 is used for all cases, up to the adaptive timestepping performed by the black-box ODE solvers. Convergence speed can differ for different choices of ψ. To ensure convergence over similar timescales, we use a learning rate of 1/(q − 1) in the continuous-time dynamics with ψ = ‖·‖²_q.

Because we use an elementwise sigmoid, the output of our model is not required to be a probability distribution over classes. This is equivalent to relaxing the requirement that the output classes be exclusive. However, the Reflectron works directly on the square loss rather than the negative log-likelihood, so it is not necessary for our model to output a probability distribution.

The square loss is not typically used in multiclass classification. Designing similar provable algorithms for other loss functions – such as the negative log-likelihood – and relaxing the elementwise Lipschitz condition – for applications to activations such as the softmax – is an interesting direction for future work.

B.2 Natural gradient

In this section, we show that the same qualitative phenomena as displayed in Section 5 are observed when discretizing the natural gradient-like dynamics

(d/dt) Θ = −(1/m) ∑_{i=1}^m (∇²ψ(Θ))^{-1} (u(Θ α(x_i)) − y_i) α^T(x_i).   (8)

Formally, (8) is equivalent in continuous time to (7) without weight sharing (handled explicitly in Appendix D), but the two will differ in an implementation due to discretization errors. We utilize the same simulation setup as described in Appendix B.1. Results are shown in Figure 3, where we display convergence of the empirical and expected risk in (A), convergence of the classification accuracy in (B), a histogram of the sparse solution found by choosing ψ(·) = ‖·‖^{1.1}_{1.1} in (C), and a comparison of integration methods in (D). ψ(·) is chosen as ‖·‖^{1.1}_{1.1} rather than ‖·‖²_{1.1} because the former admits a diagonal Hessian, while the Hessian of the latter is dense; conversely, the latter admits a simple inversion formula for the gradient, while the former does not. As predicted by Theorem 4.1, the natural gradient system implicitly regularizes the learned model, as indicated by the sparse matrix displayed in Figure 3C. While implicit regularization is only approximate for natural gradient systems in discrete time for linear models (Gunasekar et al., 2018a), higher-order integration can ensure that the discrepancy between the natural gradient and mirror descent dynamics is small. This allows the choice of whichever method is computationally more efficient for the application – inverting the Hessian matrix ∇²ψ(·) or inverting the gradient function ∇ψ(·). Nevertheless, that implicit regularization is only approximate for this system can be seen by noting that 54.11% of parameters have magnitude less than 10^{-3} here, compared with 73.51% for the mirror descent-like dynamics.

B.3 Synthetic dataset

Training examples {x_i, y_i}_{i=1}^m are generated according to the model

y_i = σ(θ^T α(x_i)) + ξ_i,


Figure 3: (A) Trajectories of the empirical risk êrr(h_t) and generalization error err(h_t) for the natural gradient-like dynamics (8), comparing the sparsity-promoting Reflectron with ψ = ‖·‖^{1.1}_{1.1} to the GLM-tron. êrr(h_t) is shown solid, while err(h_t) is shown dashed. (B) Classification accuracy over time. The training curve is shown solid, while the testing curve is shown dashed. (C) Histogram of parameters found by the l_{1.1}-regularized Reflectron via natural gradient. The resulting parameter vector has 54.11% of entries with absolute value below 10^{-3}. (D) A comparison of integration methods for the dynamics. All solid curves and all dashed curves lie directly on top of each other and are shifted for clarity.

where σ(·) is the sigmoid function and ξ_i ∼ Unif(a_ξ, b_ξ) is a uniform noise term with −a_ξ = b_ξ = 0.05. The y_i values are clamped to lie in the interval [0, 1] in the case that ξ_i pushes them over the boundary. α(x_i) is taken to be α(x_i) = cos(V x_i), with V ∈ R^{p×k} a random matrix with entries V_{ij} ∼ N(0, 1). Each x_i ∈ R^k is drawn i.i.d. from a Gaussian x_i ∼ N(0, σ²_x I) with σ_x = 2. Each component of θ is also generated i.i.d. from a standard normal distribution, θ_i ∼ N(0, 1). These values were chosen so as to not saturate the sigmoid and to obtain a range of y_i ∈ [0, 1]. We take m = 100, k = 25, and p = 75, so that k < p < m but p/m = 3/4, representing the modern regime of high-dimensional logistic regression where the classical assumption p ≪ m is violated (Sur and Candès, 2019). We similarly generate a test set with 400 examples for estimation of err(h_t). We assume access to the real-valued y_i during model training rather than binary labels. A value of Δt = 0.05 is used for all cases, up to the adaptive timestepping performed by the black-box ODE solvers. As in Appendix B.1, we rescale the learning rate to plot on similar timescales.
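For reproducibility, here is a short re-implementation (ours) of the synthetic data generation described above; variable names and the random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, m_test, k, p = 100, 400, 25, 75
sigma_x, b_xi = 2.0, 0.05

V = rng.normal(size=(p, k))                      # random feature matrix, V_ij ~ N(0, 1)
theta = rng.normal(size=p)                       # Bayes-optimal parameters, theta_i ~ N(0, 1)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def sample(n):
    X = rng.normal(scale=sigma_x, size=(n, k))   # x_i ~ N(0, sigma_x^2 I)
    feats = np.cos(X @ V.T)                      # alpha(x_i) = cos(V x_i), shape (n, p)
    noise = rng.uniform(-b_xi, b_xi, size=n)     # xi_i ~ Unif(-0.05, 0.05)
    y = np.clip(sigmoid(feats @ theta) + noise, 0.0, 1.0)
    return feats, y

train_feats, train_y = sample(m)
test_feats, test_y = sample(m_test)
```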

In Figure 4A, we show the empirical risk and generalization error trajectories for the Reflectron with ψ(·) = ‖·‖²_{1.1} and ψ(·) = ‖·‖²_2. The dynamics are integrated with the lsoda integrator from scipy.integrate.ode. Both choices converge to similar values of êrr and err. In Figure 4B, we show the curves êrr(h_t) and err(h_t) with ψ(·) = ‖·‖²_{1.1} for four integration methods. The curves lie directly on top of each other and are shifted arbitrarily for clarity.

In Figure 5, we show histograms of the parameter values for the Bayes-optimal predictor θ, as well as the final parameter values found by the Reflectron with ψ(·) = ‖·‖²_{1.1} and ψ(·) = ‖·‖²_2. The histograms validate the prediction of Theorem 3.2. A sparse vector is found for ψ = ‖·‖²_{1.1}, which obtains values of err(h_t) similar to those for ‖·‖²_2, as seen in Fig. 4A.

A similar discussion applies here as in Appendix B.1. The Reflectron is designed for minimizing the square loss, which is not typically used in logistic regression. Designing similar provable algorithms for other loss functions – such as the negative log-likelihood – and tailoring the choice of ψ to the loss function is an interesting opportunity for future research.


Figure 4: Trajectories of the empirical risk êrr(h_t) and generalization error err(h_t). êrr(h_t) is shown solid, while err(h_t) is shown dashed. (A) Comparison of the sparsity-promoting Reflectron with ψ = ‖·‖²_{1.1} to the GLM-tron. (B) A comparison of the empirical risk and generalization error dynamics as a function of the integration method with ψ(·) = ‖·‖²_{1.1}. All solid curves and all dashed curves lie directly on top of each other and are shifted for clarity.

Figure 5: Histograms of parameter values for (A) the Bayes-optimal predictor θ, (B) the parameters found by the Reflectron with ψ = ‖·‖²_{1.1}, and (C) the parameters found by the Reflectron with ψ = ‖·‖²_2 (this choice reduces to the GLM-tron).

C Required results

In this section, we present some results required for the analysis in the main text.

The following lemma is well-known, and gives a concentration inequality for random variables in a Hilbert space.

Lemma C.1. Suppose z_1, . . . , z_m are i.i.d. zero-mean random variables in a Hilbert space such that Pr(‖z_i‖ ≤ 1) = 1. Then, with probability at least 1 − δ,

‖(1/m) ∑_{i=1}^m z_i‖ ≤ 2(1 + √(log(1/δ)/2)) / √m.

The following theorem gives a bound on the Rademacher complexity of a linear predictor.

Theorem C.1 (Kakade et al. (2009)). Let X be a subset of a Hilbert space equipped with an inner product ⟨·, ·⟩ such that ⟨x, x⟩ ≤ X² for each x ∈ X. Let W = {x ↦ ⟨x, w⟩ : ⟨w, w⟩ ≤ W²} be a class of linear functions. Then

R_m(W) ≤ XW √(1/m),

where R_m(W) is the Rademacher complexity of the function class W,

R_m(W) = E_{x_i, ε_i}[sup_{f∈W} (1/m) ∑_{i=1}^m ε_i f(x_i)],

with ε_i Rademacher-distributed random variables.

The following theorem is useful for bounding the Rademacher complexity of the generalized linearmodels considered in this work. It may also be used to bound the generalization error in terms of theRademacher complexity of a function class.Theorem C.2 (Bartlett and Mendelson (2002)). Let φ : R→ R be Lφ-Lipschitz and suppose thatφ(0) = 0. Let F be a class of functions. ThenRm(φ ◦ F) ≤ 2LφRm(F).

The following theorem allows for a bound on the generalization error if bounds on the empirical riskand the Rademacher complexity of the function class are known.Theorem C.3 (Generalization error via Rademacher complexity (Bartlett and Mendelson, 2002)).Let {xi, yi}mi=1 be an i.i.d. sample from a distribution P over X × Y and let L : Y ′ × Y → R be anL-Lipschitz and b-bounded loss function in its first argument. Let F = {f | f : X → Y ′} be a classof functions. For any positive integer m ≥ 0 and any scalar δ ≥ 0,

supf∈F

∣∣∣∣∣ 1

m

m∑i=1

L(f(xi), yi)− E(x,y)∼P [L(f(x), y)]

∣∣∣∣∣ ≤ 4LRm(F) + 2b

√2

mlog

(1

δ

)with probability at least 1− δ over the draws of the {xi,yi}.

The following theorem comes from the main result and the example in Section 4.2 of Maurer (2016).It enables us to bound the Rademacher complexity of the square loss composed with the link functionu in the vector-valued output case handled in Appendix D.Theorem C.4 (Vector contraction inequality (Maurer, 2016)). Let X be a subset of a Hilbert spaceequipped with an inner product 〈·, ·〉 such that for each x ∈ X , 〈x,x〉 ≤ X2. Let B (X ,Rn) bethe set of bounded linear transformations from X into Rn. Define a class of functions F = {x 7→Wx|W ∈ B (X ,Rn) , ‖W‖ ≤W}. Let hi : Rn → R have Lipschitz constant L. Then

Eεi

[supf∈F

1

m

m∑i=1

εihi (f(xi))

]≤ O

(LXW

√n

m

).

where the εi are Rademacher random variables.

The following theorem is similar to Theorem C.4, but gives a weaker bound linear in the outputdimension. See the main theorem and Section 4.1 in (Maurer, 2016).Theorem C.5 (Weak vector contraction inequality (Maurer, 2016)). Let X be any set, let F bea class of functions f : X → Rn, and let hi : Rn → R have Lipschitz constant L. DefineFk = {x 7→ fk(x) : f(x) = (f1(x), f2(x), . . . , fn(x)) ∈ F} as the projection onto the kthcoordinate class of F . Then

Eεi

[supf∈F

1

m

m∑i=1

εihi (f(xi))

]≤√

2L

n∑k=1

Eεi

[supf∈Fk

1

m

m∑i=1

εif(xi)

].

The following lemma is a technical result from functional analysis which has seen widespreadapplication in adaptive control theory (Slotine and Li, 1991).

Lemma C.2 (Barbalat’s Lemma). Assume that limt→∞∫ t

0|x(τ)|dτ < ∞. If x(t) is uniformly

continuous, then limt→∞ x(t) = 0.

Note that a sufficient condition for uniform continuity of x(t) is that x(t) is bounded.

17

Page 18: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

D Vector-valued Reflectron without weight sharing

In this section, we handle the vector-valued Reflectron without weight sharing. We define thevector-valued Reflectron without weight sharing via the continuous-time dynamics

d

dt∇ψ

(Θ(t)

)= − 1

m

m∑i=1

(u(Θα(xi)

)− yi

)αT (xi) (9)

where ψ :M→ R,M⊆ Rn×p is a strictly convex function on matrices. (9) is similar to runningthe scalar-valued Reflectron (5) on each component individually. However, analyzing it in the form(9) enables more general results on implicit regularization – which apply to the matrix Θ directlyrather than its rows – and tighter bounds on the generalization error.

We begin with a convergence theorem.

Lemma D.1 (Convergence of the vector-valued Reflectron without weight sharing for a realizabledataset). Suppose that {xi,yi}mi=1 are drawn i.i.d. from a distribution D supported on X × [0, 1]n

and that Assumption 3.1 is satisfied where the component functions u(x)i = ui(xi) are known,nondecreasing, and Li-Lipschitz functions. Let ψ be any convex function with invertible Hessiantensor over the trajectory Θ(t). Then ε(ht)→ 0 where ht(x) = u

(Θ(t)α(x)

)is the hypothesis

with parameters output by the vector-valued Reflectron at time t with Θ(0) arbitrary. Furthermore,inft′∈[0,t] {ε(ht′)} ≤ O

(1t

).

Proof. The proof is analogous to Lemma 3.1, but requires more technical manipulation of indices.Define the Bregman divergence in a natural way

(Θ∥∥∥ Θ

)= ψ (Θ)− ψ

(Θ)−∑ij

∂ψ(Θ)

∂Θij

(Θij − Θij

).

Then we have the equality

d

dtdψ

(Θ∥∥∥ Θ

)=(Θ−Θ

): ∇2ψ

(Θ)

:˙Θ

where for a symmetric four-tensor A, the real number C : A : B =∑ijkl CijAijklBkl and

∇2ψ(Θ)ijkl =∂2ψ(Θ)∂Θkl∂Θij

. Note that the inverse Hessian tensor is a map(∇2ψ(·)

)−1: Rn×p →

Rn×p. By definition, (9) then implies that

˙Θ = − 1

m

m∑i=1

(∇2ψ

(Θ))−1 (

u(Θα(xi)

)− u (Θα (xi))

)αT (xi),

18

Page 19: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

so that

d

dtdψ

(Θ∥∥∥ Θ

)=(Θ−Θ

):

(− 1

m

m∑i=1

(u(Θα(xi)

)− u (Θα (xi))

)αT (xi)

)

=∑kl

(Θ−Θ

)kl

(− 1

m

m∑i=1

(u(Θα(xi)

)− u (Θα (xi))

)αT (xi)

)kl

=∑kl

(Θkl −Θkl

)(− 1

m

m∑i=1

(uk

(∑q

Θkqαq(xi)

)− uk

(∑q

Θkqαq (xi)

))αl(xi)

)

= − 1

m

∑i

(∑kl

(Θkl −Θkl

)(uk

(∑q

Θkqαq(xi)

)− uk

(∑q

Θkqαq (xi)

))αl(xi)

)

= − 1

m

∑i

(∑k

(uk

(∑q

Θkqαq(xi)

)− uk

(∑q

Θkqαq (xi)

))(∑q

(Θkq −Θkq

)αq(xi)

))

≤ − 1

m

∑i

∑k

1

Lk

(uk

(∑q

Θkqαq(xi)

)− uk

(∑q

Θkqαq (xi)

))2

= − 1

mmaxk{Lk}∑i

∥∥∥u(Θα (xi))− u (Θα (xi))

∥∥∥2

2

= − 1

maxk{Lk}ε (ht) ≤ 0.

The conclusions of the Lemma follow identically by the machinery of the proof of Lemma 3.1.

In a similar manner, we can state a result on the implicit regularization of the matrix Θ.

Theorem D.1 (Implicit regularization of the vector-valued Reflectron without weight sharing). Con-sider the setting of Lemma D.1. Assume that Θ(t) → Θ∞ where Θ∞ interpolates the data, andassume that u(·) is invertible. Then

Θ∞ = arg minΘ∈A

(Θ∥∥∥ Θ(0)

)

where A is defined analogously as in the proof of Theorem 3.2. In particular, if Θ(0) =

arg minW∈M ψ(W), then Θ∞ = arg minΘ∈A ψ(Θ).

Proof. The proof follows the same structure as in the scalar-valued case. Let Θ ∈ A. Then,

d

dtdψ

(Θ∥∥∥ Θ(t)

)=(Θ− Θ

):

(− 1

m

m∑i=1

(u(Θα(xi)

)− yi

)αT (xi)

)

19

Page 20: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

Note that for two matrices A,B ∈ Rn×p we have the equality A : B = Tr(ATB

). Hence,

= Tr

[(Θ− Θ

)T (− 1

m

m∑i=1

(u(Θα(xi)

)− yi

)αT (xi)

)]

= Tr

[(− 1

m

m∑i=1

(u(Θα(xi)

)− yi

)αT (xi)

)(Θ− Θ

)T]

= − 1

m

m∑i=1

Tr

[(u(Θα(xi)

)− yi

)αT (xi)

(Θ− Θ

)T]

= − 1

m

m∑i=1

Tr

[(u(Θα(xi)

)− yi

)((Θ− Θ

)α(xi)

)T]

= − 1

m

m∑i=1

Tr

[(u(Θα(xi)

)− yi

)(Θα(xi)− u−1 (yi)

)T]

= − 1

m

m∑i=1

(u(Θα(xi)

)− yi

)T (Θα(xi)− u−1 (yi)

)In the derivation above, we have replaced Θα(xi) by u−1(yi) following our assumptions thatΘ ∈ A and that u is invertible. Integrating both sides of the above from 0 to∞, we find that

(Θ∥∥∥ Θ∞

)= dψ

(Θ∥∥∥ Θ(0)

)− 1

m

m∑i=1

∫ ∞0

(u(Θ(t)α(xi)

)− yi

)T (Θ(t)α(xi)− u−1 (yi)

)dt

The above relation is true for any Θ ∈ A. Furthermore, the integral on the right-hand side isindependent of Θ. Hence the arg min of the two Bregman divergences must be equal, which showsthat

Θ∞ = arg minΘ∈A

(Θ∥∥∥ Θ(0)

)Initializing Θ(0) = arg minW∈M ψ(W) completes the proof.

By leveraging a recent contraction inequality for vector-valued output classes (Maurer, 2016), we canstrengthen the guarantees of Theorem 4.2 without weight sharing by lowering the dependence fromn3/2 to n.Theorem D.2 (Statistical guarantees for the vector-valued Reflectron without weight sharing).Suppose that {xi,yi}mi=1 are drawn i.i.d. from a distribution supported on X × [0, 1]n withE [y|x] = u (Θα(x)) for a known function u(x) and an unknown matrix of parameters Θ ∈ Rn×p.Assume that u(x)i = ui(xi) where each ui : Xi → [0, 1] is Li-Lipschitz and nondecreasing in itsargument. Assume that α : X → Rp is a known finite-dimensional feature map with ‖α(x)‖2 ≤ Cfor all x ∈ X . Let dψ (Θ ‖ 0) ≤ σ

2W2 where ψ is σ-strongly convex with respect to ‖ · ‖F . Then,

for any δ ∈ (0, 1), with probability at least 1− δ over the draws of (xi,yi), there exists some time

t < O(σWC√n

√m/ log

(1δ

))such that the hypothesis ht(x) = u

(Θ(t)α(x)

)satisfies

ε(ht) ≤ O(

maxk{Lk}CW√m

√n log(1/δ)

),

ε(ht) ≤ O(

maxk{Lk}CW√m

(n+

√n log(1/δ)

)),

where Θ(t) is output by the Reflectron at time t with Θ(0) = 0.

Proof. Consider the rate of change of the Bregman divergence between the parameters for theBayes-optimal predictor and the parameters produced by the Reflectron at time t,

d

dtdψ

(Θ∥∥∥ Θ

)=(Θ−Θ

): ∇2ψ(Θ) :

˙Θ.

20

Page 21: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

Following the proofs of Lemma D.1 and Theorem 3.1, we immediately have the inequality

d

dtdψ

(Θ∥∥∥ Θ

)≤ − 1

maxk{Lk}ε (ht) +

1

m

(Θ−Θ

):

(m∑i=1

(yi − u (Θα (xi)))αT (xi)

)Now, note that each 1

C√n

(yi − u (Θα (xi)))αT (xi) is a zero-mean i.i.d. random

variable with Frobenius norm bounded by 1 almost surely. Then, by Lemma C.1,‖ 1C√nm

∑mi=1 (yi − u(Θα(xi))α

T (xi)‖F ≤ η with probability at least 1 − δ where η =

2(

1+√

log(1/δ)/2)

√m

. Assuming that ‖Θ(t)−Θ‖F ≤W at time t, we conclude that

d

dtdψ

(Θ∥∥∥ Θ

)≤ − 1

maxk{Lk}ε(ht) +

√nCWη.

Hence, either ddtdψ

(Θ∥∥∥ Θ

)< −√nCWη, or ε(ht) ≤ 2 (maxk{Lk})

√nCWη. In the latter case,

our result is proven. In the former, by our assumptions dψ (Θ ‖ 0) = dψ

(Θ∥∥∥ Θ(0)

)≤ σ

2W2,

and hence ‖Θ(0) − Θ‖ ≤ W by σ-strong convexity of ψ with respect to ‖ · ‖F . Furthermore,

‖Θ(t)−Θ‖ ≤√

2σdψ

(Θ∥∥∥ Θ(t)

)≤√

2σdψ

(Θ∥∥∥ Θ(0)

)≤W . Thus it can be until at most

tf =dψ

(Θ∥∥∥ Θ(0)

)√nCWη

=σ/2W 2

√nCWη

=σW

2√nCη

until ε(ht) ≤ 2 maxk{Lk}√nCWη. Hence there is some ht with t < tf such that

ε(ht) ≤ O

(√nmax

k{Lk}CW

√log(1/δ)

m

).

Note that u(·) is Lipschitz with constant maxk{Lk}. Let ly(y) = ‖y − y‖2 denote the square losson example y. Then ly ◦ u : Rn → R is

√nmaxk{Lk} Lipschitz and

√n-bounded over the domain

[0, 1]n. We want to bound ε(ht)− ε(ht), which can be performed via

supf∈F

[EX [(ly ◦ u) (f(X))]− 1

m

m∑i=1

(ly ◦ u) (f(xi))

]

≤ O

(Exi,εi

[supf∈F

1

m

m∑i=1

εi (ly ◦ u) (f(xi))

])+O

(√n log(1/δ)

m

)

with probability at least 1− δ, where F ={

Θα(x) : ‖Θ‖F ≤ 2W, ‖α(x)‖2 ≤ C}

and the εi areRademacher random variables (Wainwright, 2019). Application of Theorem C.4 to bound the firstterm on the right-hand side above gives the bound

ε(ht) ≤ ε(ht) +O(

maxk{Lk}CWn√m

)+O

(√n log(1/δ)

m

)which completes the proof.

E Relative entropy and optimization over the simplex

Choosing ψ(x) =∑i xi log(xi) to be the relative entropy for xi > 0,

∑ni=1 xi = 1 gives the

well-known exponential weights algorithm for learning parameters on the probability simplex. Herewe show that by a straightforward modification of the assumptions and proof of Theorem 3.1, theReflectron can similarly be run with the potential function chosen as the relative entropy.Theorem E.1. Suppose that {xi, yi}mi=1 are drawn i.i.d. from a distributionD supported onX×[0, 1]with E [y|x] = u (〈θ,α(x)〉) for a known nondecreasing and L-Lipschitz link function u : R→ [0, 1],

21

Page 22: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

a kernel function K with corresponding finite-dimensional feature map α(x) ∈ Rp, and an unknown

vector of parameters θ ∈ Rp. Assume that dψ(θ∥∥∥ θ(0)

)≤ σ

2W2 where ψ is σ-strongly convex

with respect to ‖·‖1, and that ‖α(x)‖∞ ≤ C for all x ∈ X . Then, for any δ ∈ (0, 1), with probability

at least 1 − δ over the draws of the {xi, yi}, there exists some time t < O(σWC

√m/ log (p/δ)

)such that the hypothesis ht = u

(⟨θ(t),α(x)

⟩)satisfies

max {ε(ht), ε(ht)} ≤ O

(LCW

√log(p/δ)

m

).

Proof. The proof is nearly identical to Theorem 3.1. As in (6), we immediately have that

d

dtdψ

(θ∥∥∥ θ) ≤ − 1

Lε(ht) +

1

m

m∑i=1

(yi − u (〈α(xi),θ〉))⟨α(xi), θ − θ

⟩By Holder’s inequality,

1

m

m∑i=1

(yi − u (〈α(xi),θ〉))⟨α(xi), θ − θ

⟩≤

∥∥∥∥∥ 1

m

m∑i=1

(yi − u (〈α(xi),θ〉))α(xi)

∥∥∥∥∥∞

∥∥∥θ − θ∥∥∥

1

Each component of each vector 1C (yi − u(〈α(xi),θ〉))α(xi) is a zero mean i.i.d. random variable

with absolute value bounded by one almost surely. Hence by a Hoeffding bound and a unionbound,

∥∥ 1Cm

∑mi=1 (yi − u(〈α(xi),θ〉))α(xi)

∥∥∞ ≤ η with probability at least 1 − δ where η =

2(

1+√

log(p/δ)/2)

√m

. Assuming that ‖θ(t)− θ‖1 ≤W at time t, we conclude that

d

dtdψ

(θ∥∥∥ θ) ≤ − 1

Lε(ht) + CWη.

From here, the proof is identical to that of Theorem 3.1, exploiting σ-strong convexity of the Bregmandivergence with respect to ‖ · ‖1.

Noting by Pinsker’s theorem that the relative entropy is 1-strongly convex with respect to ‖ · ‖1 onthe probability simplex, Theorem E.1 immediately shows that the Reflectron may be run with therelative entropy (and corresponding Bregman divergence the KL-divergence) if the learned weightsare constrained to lie on the simplex.

More generally, Theorem E.1 shows that if the feature map admits a dimension-independent boundin l∞ norm but a dimension-dependent bound in l2 norm, potential functions strongly convex withrespect to the l1 norm can reduce the dimension dependence to logarithmic. Other choices of stronglyconvex potentials with respect to l1 norm include, for example, 1

q‖ · ‖qq with q = 1 + 1

log(p)

F von Neumann entropy and optimization over the spectrahedron

Similar to Appendix E, choosing ψ(Θ) in the vector-valued output case (without weight sharing)to be the von Neumann entropy ψ(Θ) =

∑i λi log(λi) for Θ > 0 enables us to optimize over the

spectrahedron S = {X > 0 : ‖X‖∗ = 1} where ‖X‖∗ denotes the nuclear norm of X . We note thatthe von Neumann entropy is 1

2 -strongly convex with respect to ‖ · ‖∗ on the spectrahedron.

Theorem F.1. Suppose that {xi,yi}mi=1 are drawn i.i.d. from a distribution supported on X × [0, 1]n

with E [y|x] = u (Θα(x)) for a known function u(x) and an unknown matrix of parametersΘ ∈ Rn×n. Assume that u(x)i = ui(xi) where each ui : Xi → [0, 1] is Li-Lipschitz andnondecreasing in its argument. Suppose that α : X → Rn is a known finite-dimensional featuremap with ‖α(x)‖2 ≤ C for all x ∈ X . Let dψ

(Θ∥∥∥ Θ(0)

)≤ σ

2W2 where ψ is σ-strongly

convex with respect to ‖ · ‖∗. Then, for any δ ∈ (0, 1), with probability at least 1 − δ over the

draws of the {xi,yi}, there exists some time t < O(σWC√n

√m/ log (1/δ)

)such that the hypothesis

22

Page 23: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

ht(x) = u(Θ(t)α(x)

)satisfies

ε(ht) ≤ O(

maxk{Lk}CW√m

√n log(1/δ)

),

ε(ht) ≤ O(

maxk{Lk}CW√m

(n+

√n log(1/δ)

)).

Proof. As in Theorem D.2, we have the inequality

d

dtdψ

(Θ∥∥∥ Θ

)≤ − 1

maxk{Lk}ε (ht) +

1

m

(Θ−Θ

):

(m∑i=1

(yi − u (Θα (xi)))αT (xi)

)Note that we can bound

1

m

(Θ−Θ

):

(m∑i=1

(yi − u (Θα (xi)))αT (xi)

)≤

∥∥∥∥∥ 1

m

(m∑i=1

(yi − u (Θα (xi)))αT (xi)

)∥∥∥∥∥F

∥∥∥Θ−Θ∥∥∥F

∥∥∥∥∥ 1

m

(m∑i=1

(yi − u (Θα (xi)))αT (xi)

)∥∥∥∥∥F

∥∥∥Θ−Θ∥∥∥∗

where we have exploited monotonicity of the Schatten p-norms to bound ‖ · ‖F ≤ ‖ · ‖∗.By Lemma C.1, ‖ 1

C√nm

∑mi=1 (yi − u(Θα(xi))α

T (xi)‖F ≤ η with probability at least 1 − δ

where η =2(

1+√

log(1/δ)/2)

√m

. Assuming that ‖Θ(t)−Θ‖∗ ≤W at time t, we conclude that

d

dtdψ

(Θ∥∥∥ Θ

)≤ − 1

maxk{Lk}ε(ht) +

√nCWη.

From here, the proof is identical to that of Theorem D.2, now exploiting σ-strong convexity of ψ withrespect to ‖ · ‖∗.

G Discrete-time implementation

In this section, we consider a forward-Euler discretization of the Reflectron algorithm with projection.Let C denote a closed and convex constraint set – for example, the probability simplex ∆p = {θ ∈Rp|θi > 0,

∑ni=1 θi = 1}. Let the Bregman projection under ψ be denoted by Πψ

C ,

ΠψC (z) = arg min

x∈C∩Mdψ (x ‖ z) .

We then define the discrete-time iteration with projection,

∇ψ(φt+1

)= ∇ψ

(θt

)− λ

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi), (10)

θt+1 = ΠψC (φt+1). (11)

To analyze this discrete-time iteration, we require two standard facts about the Bregman divergence

Lemma G.1 (Bregman three-point identity). Let ψ :M→ Rp denote a σ-strongly convex function.Then for all x,y, z ∈M,

〈∇ψ(x)−∇ψ(y),x− z〉 = dψ (x ‖ y) + dψ (z ‖ x)− dψ (z ‖ y) . (12)

Lemma G.2 (Generalized Pythagorean Theorem). Let ψ : M → R denote a σ-strongly convexfunction. Let x0 ∈M and let x∗ = Πψ

C (x0) be its projection onto a closed and convex set C. Thenfor any y ∈ C,

dψ (y ‖ x0) ≥ dψ (y ‖ x∗) + dψ (x∗ ‖ x0) . (13)

23

Page 24: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

We now state an analogue of Theorem 3.1 for the discrete-time algorithm (10) & (11), which showsthat the discrete-time iteration preserves the statistical guarantees of the continuous-time dynamics(5). For simplicity, we handle the scalar output case and assume that ψ is σ-strongly convex withrespect to ‖ · ‖2. The proof may be readily extended to the case of strong convexity with respect to‖ · ‖1 as in Theorems E.1 and F.1, and may be extended to the cases of vector valued outputs with orwithout weight sharing as in Theorems 4.2 and D.2.Theorem G.1 (Statistical guarantees for the discrete-time Reflectron). Suppose that {xi, yi}mi=1are drawn i.i.d. from a distribution D supported on X × [0, 1] with E [y|x] = u (〈θ,α(x)〉) fora known nondecreasing and L-Lipschitz link function u : R → [0, 1], a kernel function K withcorresponding finite-dimensional feature map α(x) ∈ Rp, and an unknown vector of parametersθ ∈ C ⊆ Rp. Assume that dψ (θ ‖ 0) ≤ σ

2W2 where ψ is σ-strongly convex with respect to ‖ · ‖2,

and that ‖α(x)‖2 ≤ C for all x ∈ X . Then for any δ ∈ (0, 1), with probability at least 1− δ over

the draws of the {xi, yi}, if λ ≤ σC2L there exists some time t < O

(LCW

√m/ log (1/δ)

)such

that the hypothesis ht = u(⟨

θ(t),α(x)⟩)

satisfies

max {ε(ht), ε(ht)} ≤ O

(LCW

√log(1/δ)

m

),

where θt is output by the discrete-time Reflectron (10) & (11) at time t with θ1 = 0.

Proof. By the Bregman three-point identity (12), with z = θ, x = φt+1, and y = θt,

(θ∥∥∥ φt+1

)= dψ

(θ∥∥∥ θt)− dψ (φt+1

∥∥∥ θt)+⟨∇ψ(φt+1)−∇ψ(θt), φt+1 − θ

⟩,

= dψ

(θ∥∥∥ θt)− dψ (φt+1

∥∥∥ θt)−⟨ λ

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi), φt+1 − θ

⟩,

= dψ

(θ∥∥∥ θt)− dψ (φt+1

∥∥∥ θt)−⟨ λ

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi), θt − θ

⟩,

⟨λ

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi), φt+1 − θt

⟩.

Now consider grouping the second and final terms in the last expression as follows

λ

m

⟨m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi), θt − φt+1

⟩− dψ

(φt+1

∥∥∥ θt)=

λ

m

⟨m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi), θt − φt+1

⟩−(ψ(φt+1)− ψ(θt)−

⟨∇ψ(θt), φt+1 − θt

⟩),

=

⟨λ

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi)−∇ψ(θt), θt − φt+1

⟩− ψ(φt+1) + ψ(θt),

= −⟨∇ψ(φt+1), θt − φt+1

⟩− ψ(φt+1) + ψ(θt),

= dψ

(θt

∥∥∥ φt+1

).

With this simplified expression, the iteration becomes

(θ∥∥∥ φt+1

)= dψ

(θ∥∥∥ θt)+ dψ

(θt

∥∥∥ φt+1

)+λ

m

⟨m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi),θ − θt

⟩.

By the generalized Pythagorean Theorem (13),

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)+ dψ

(θt

∥∥∥ φt+1

)+

λ

m

⟨m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi),θ − θt

⟩. (14)

24

Page 25: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

By duality, we may rewrite the second Bregman divergence on the right-hand side as a Bregmandivergence with respect to the Fenchel conjugate of ψ, denoted as ψ∗,

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)+ dψ∗

(∇ψ

(φt+1

) ∥∥∥ ∇ψ (θt))+λ

m

⟨m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi),θ − θt

⟩.

Because ψ is σ-strongly convex with respect to ‖ · ‖2, ψ∗ is 1σ -smooth with respected to ‖ · ‖2. Thus,

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)+

1

∥∥∥∇ψ (φt+1

)−∇ψ

(θt

)∥∥∥2

2

m

⟨m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi),θ − θt

⟩,

= dψ

(θ∥∥∥ θt)+

λ2

∥∥∥∥∥ 1

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi)

∥∥∥∥∥2

2

m

⟨m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi),θ − θt

⟩. (15)

Above, we applied 1σ -smoothness and then used (10) to express the increment in ∇ψ. The second

term in (15) can be bounded as∥∥∥∥∥ 1

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi)

∥∥∥∥∥2

2

=

∥∥∥∥∥ 1

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− u (〈α(xi),θ〉)

)α(xi) +

1

m

m∑i=1

(u (〈α(xi),θ〉)− yi)α(xi)

∥∥∥∥∥2

2

,

∥∥∥∥∥ 1

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− u (〈α(xi),θ〉)

)α(xi)

∥∥∥∥∥2

2

+

∥∥∥∥∥ 1

m

m∑i=1

(u (〈α(xi),θ〉)− yi)α(xi)

∥∥∥∥∥2

2

+ 2

∥∥∥∥∥ 1

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− u (〈α(xi),θ〉)

)α(xi)

∥∥∥∥∥2

∥∥∥∥∥ 1

m

m∑i=1

(u (〈α(xi),θ〉)− yi)α(xi)

∥∥∥∥∥2

.

By Lemma C.1, ‖ 1Cm

∑mi=1 (yi − u(〈α(xi),θ〉))α(xi)‖2 ≤ η with probability at least 1− δ where

η =2(

1+√

log(1/δ)/2)

√m

< 1. Furthermore, by Jensen’s inequality∥∥∥∥∥ 1

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− u (〈α(xi),θ〉)

)α(xi)

∥∥∥∥∥2

2

≤ C2ε(ht).

Assuming that W ≥ 1 ≥ η so that η2 ≤ ηW , we then have the upper bound∥∥∥∥∥ 1

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi)

∥∥∥∥∥2

2

≤ C2 (ε(ht) + ηW (1 + 2CL)) .

Suppose that ‖θ − θt‖2 ≤W at iteration t. We may then also bound the final term in (15) as in theproof of Theorem 3.1,

1

m

⟨m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi),θ − θt

⟩≤ − 1

Lε(ht) + CWη.

Combining these bounds, the iteration of the Bregman divergence between the Bayes-optimalparameters and the output of the discrete-time Reflectron becomes

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)+

λ2C2

2σ(ε(ht) + ηW (1 + 2CL))− λ

Lε(ht) + λCWη,

= dψ

(θ∥∥∥ θt)+ λ

((λC2

2σ− 1

L

)ε(ht) + CWη

(1 +

λC

2σ(1 + 2CL)

)).

25

Page 26: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

For λ ≤ σC2L , we have

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)+

σ

C2L

(− 1

2Lε(ht) + 2CWη +

2L

).

Thus, at each iteration either dψ(θ∥∥∥ θt+1

)− dψ

(θ∥∥∥ θt) ≤ −σWη

CL or ε(ht) ≤ (6CL+ 1)Wη.Following the same reasoning as in the proof of Theorem 3.1, in the latter case our result is proven,and in the former case there can be at most

(σW 2

2

)(CLσWη

)= CLW

2η = tf iterations before ε(ht) ≤

(6CL+ 1)Wη. That is, there must be some ht with t ≤ tf such that ε(ht) ≤ O(CLW

√log(1/δ)

m

).

The conclusion of the theorem now follows identically as in Theorem 3.1, through application of auniform law of large numbers to transfer the bound on ε(ht) to ε(ht) and a union bound.

We now prove that the discrete-time dynamics (10) & (11) preserve the interpolation and implicitregularization properties of the continuous-time dynamics (5) exactly. We begin with convergence toan interpolating solution.Lemma G.3 (Convergence of the discrete-time Reflectron for a realizable dataset). Suppose that{xi, yi}mi=1 are drawn i.i.d. from a distribution D supported on X × [0, 1] and that Assumption 3.1is satisfied where u is a known, nondecreasing, and L-Lipschitz function, and where θ ∈ C. Let ψbe any σ-strongly convex function with respect to ‖ · ‖2. Choose λ < 2σ

C2L where ‖α(x)‖2 ≤ C.

Then ε(ht) → 0 where ht(x) = u(⟨

θt,α(x)⟩)

is the hypothesis with parameters output by the

discrete-time Reflectron (10) & (11) at iteration t with θ1 arbitrary. Furthermore,

mint′∈[1,t]

{ε(ht′)} ≤ O(

1

t

).

Proof. From (15), we have a bound on the iteration for the Bregman divergence between the interpo-lating parameters and the current parameter estimates,

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)+

λ2

∥∥∥∥∥ 1

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi)

∥∥∥∥∥2

2

m

⟨m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi),θ − θt

⟩.

Under the realizability assumption of the lemma, we may bound the second term above as

λ2

∥∥∥∥∥ 1

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi)

∥∥∥∥∥2

2

≤ λ2C2

2σε(ht).

We may similarly bound the final term, exploiting monotonicity and Lipschitz continuity of u(·), as

λ

m

⟨m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi),θ − θt

⟩≤ −λ

Lε(ht).

Putting these together, we have the refined bound on the iteration

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)+ λ

(λC2

2σ− 1

L

)ε(ht).

Let 0 < ζ < 1. For λ ≤ 2σ(ζ+1)C2L ,

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)− ζ

Lε(ht).

Summing both sides of the above inequality from t = 1 to T reveals thatT∑t=1

ε(ht) ≤L

ζ

(dψ (θ ‖ θ1)− dψ

(θ∥∥∥ θT+1

))≤ L

ζdψ (θ ‖ θ1) .

26

Page 27: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

Because T was arbitrary and the upper bound is independent of T ,∑∞t=1 ε(ht) exists and hence

ε(ht)→ 0 as t→∞. Furthermore,

mint′∈[1,T ]

{ε(ht′)}T =

T∑t=1

mint′∈[1,T ]

{ε(ht′)} ≤T∑t=1

ε(ht) ≤L

ζdψ (θ ‖ θ1) ,

so that mint′∈[1,T ] {ε(ht′)} ≤Ldψ(θ ‖ θ1)

ζT . To conclude the proof, observe that because ζ wasarbitrary, we may take the limit as ζ → 0 for the maximum allowed learning rate.

Now that convergence to an interpolating solution in the realizable setting has been established, wecharacterize the implicit regularization of the iteration (10) in the following theorem. The proofstrategy is similar to that of Azizan and Hassibi (2019).Theorem G.2 (Implicit regularization of the discrete-time Reflectron). Suppose that {xi, yi}mi=1 aredrawn i.i.d. from a distribution D supported on X × [0, 1] and that Assumption 3.1 is satisfied with ua known, nondecreasing, and L-Lipschitz function and with θ ∈ C. Let ψ be σ-strongly convex withrespect to ‖ · ‖2. Let A = {θ ∈ C : u

(⟨θ,α(xi)

⟩)= yi, i = 1, . . . ,m} be the set of parameters

that interpolate the data, and assume that θt → θ∞ ∈ A. Further assume that u(·) is invertible, and

that λ < 2σC2L . Then θ∞ = arg minθ∈A dψ

(θ∥∥∥ θ1

). In particular, if θ1 = arg minw∈C∩M ψ(w),

then θ∞ = arg minθ∈A ψ(w).

Proof. Let θ ∈ A be arbitrary. From (14),

(θ∥∥∥ θt+1

)= dψ

(θ∥∥∥ θt)+ dψ

(θt

∥∥∥ θt+1

)+λ

m

⟨m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)α(xi),θ − θt

⟩,

= dψ

(θ∥∥∥ θt)+ dψ

(θt

∥∥∥ θt+1

)+λ

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)⟨α(xi),θ − θt

⟩,

= dψ

(θ∥∥∥ θt)+ dψ

(θt

∥∥∥ θt+1

)+λ

m

m∑i=1

(u(⟨

α(xi), θt

⟩)− yi

)(u−1 (yi)−

⟨α(xi), θt

⟩),

where we have used that θ ∈ A and applied invertibility of u(·). The proof of Lemma G.3 has alreadyestablished summability of an upper bound on the final two terms of the above expression. Hence,summing both sides from t = 1 to∞ and rearranging

(θ∥∥∥ θ∞) = dψ

(θ∥∥∥ θ1

)+

∞∑t=1

(θt

∥∥∥ θt+1

)− λ

m

m∑i=1

∞∑t=1

(yi − u

(⟨α(xi), θt

⟩))(u−1 (yi)−

⟨α(xi), θt

⟩).

The above relation is true for any θ ∈ A. Furthermore, the only dependence of the right-hand sideon θ is through the first Bregman divergence. Hence the arg min of the two Bregman divergencesinvolving θ must be equal, which shows that θ∞ = arg minθ∈A dψ

(θ∥∥∥ θ1

). Choosing θ1 =

arg minw∈C∩M ψ(w) completes the proof.

H Online learning

We now consider two cases of an online learning setting similar to stochastic gradient descent. Wefirst prove a theorem for the realizable setting, where we can prove fast O (1/T ) convergence ratesfor the generalization error. We then consider the bounded noise setting, where the rate is reduced toO(

1/√T)

. As in Appendix G, we focus on the case where ψ is σ-strongly convex with respect to‖ · ‖2. Our results readily generalize to strong convexity with respect to ‖ · ‖1.

27

Page 28: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

Concretely, at each iteration t we receive a new data point xt and corresponding label yt. The onlineReflectron iteration is then given by

∇ψ(φt+1

)= ∇ψ

(θt

)− λ

(u(⟨

α(xt), θt

⟩)− yt

)α(xt), (16)

θt+1 = ΠψC (φt+1), (17)

with C a convex set of constraints.

Our proofs utilize the following martingale bound, which has been used in similar analyses prior tothis work (Ji and Telgarsky, 2019; Frei et al., 2020).

Lemma H.1 (Beygelzimer et al. (2011)). Let {Yt}∞t=1 be a martingale adapted to the filtration{Ft}∞t=1. Let {Dt}∞t=1 be the corresponding martingale difference sequence. Define

Vt =

t∑k=1

E[D2k|Fk−1

],

and assume that Dt ≤ R almost surely. Then for any δ ∈ (0, 1), with probability at least 1− δ,

Yt ≤ R log(1/δ) + (e− 2)Vt/R.

With Lemma H.1 in hand, we may state our first result.

Theorem H.1 (Online learning with realizable data). Suppose that {xi}∞i=1 are drawn i.i.d. froma distribution D supported on X and let yi = u (〈θ,α(x)〉) for a known nondecreasing and L-Lipschitz link function u : R → [0, 1], a kernel function K with corresponding finite-dimensionalfeature map α(x) ∈ Rp, and an unknown vector of parameters θ ∈ C ⊆ Rp. Suppose that ψ isσ-strongly convex with respect to ‖ · ‖2, and that ‖α(x)‖2 ≤ C for all x ∈ X . Let λ < 2σ

LC2 . Thenfor any δ ∈ (0, 1), with probability at least 1− δ over the draws of the {xi},

mint<T

ε(ht) ≤ O

L2C2dψ

(θ∥∥∥ θ1

)log(1/δ)

σT

,

where ht is the hypothesis output by the online Reflectron (16) & (17) at time t.

Proof. From (15), we have the bound

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)+

λ2

∥∥∥(u(⟨α(xt), θt

⟩)− u (〈α(xt),θ〉)

)α(xt)

∥∥∥2

2

+ λ⟨(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)α(xt),θ − θt

⟩,

≤ dψ(θ∥∥∥ θt)+

λ2C2

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

− λ

L

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

,

= dψ

(θ∥∥∥ θt)− λ

L

(1− λLC2

)(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

.

Let 0 < β < 1. Taking λ = (1−β)2σLC2 ,

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)− 2σ(1− β)β

L2C2

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

,

so that dψ(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt) ≤ . . . ≤ dψ (θ ∥∥∥ θ1

). Summing both sides from 1 to T − 1

leads to the inequality

(θ∥∥∥ θT) ≤ dψ (θ ∥∥∥ θ1

)− 2σ(1− β)β

L2C2

T−1∑t=1

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

.

28

Page 29: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

Rearranging, using positivity of the Bregman divergence, and defining εt =

12

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

, we conclude that

T−1∑t=1

εt ≤L2C2

2σ(1− β)βdψ

(θ∥∥∥ θ1

). (18)

We would now like to transfer the bound (18) to a bound on ε(ht). Define Dt = ε (ht) − εt,and note that {Dt}∞t=1 is a martingale difference sequence adapted to the filtration {Ft =σ(x1,x2, . . . ,xt)}∞t=1 given by the sequence of sigma algebras generated by the first t draws. Notethat, almost surely,

Dt ≤ ε(ht) =1

2Ex∼D

[u(⟨

θt,α(x)⟩)− u (〈θ,α(x)〉)2

]≤ 1

2L2C2‖θt − θ‖22 ≤

L2C2

σdψ

(θ∥∥∥ θ1

),

where we have applied σ-strong convexity of ψ with respect to ‖ · ‖2 to upper bound ‖θt − θ‖22 by aBregman divergence. Now, consider the bound on the conditional variance

E[D2t |Ft−1

]= E

[ε(ht)

2 − 2ε(ht)εt + ε2t |Ft−1

],

= ε(ht)2 − 2ε(ht)

2 + E[ε2t |Ft−1

],

≤ E[ε2t |Ft−1

],

=1

4E[(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)4

|Ft−1

],

≤L2C2dψ

(θ∥∥∥ θ1

ε(ht).

From these two bounds, we conclude by Lemma H.1 that with probability at least 1− δ,t∑

τ=1

(ε(hτ )− ετ ) ≤ L2C2

σdψ

(θ∥∥∥ θ1

)log(1/δ) + (e− 2)

t∑τ=1

ε(hτ ).

Rearranging terms,

(3− e)t∑

τ=1

ε(hτ ) ≤ L2C2

σdψ

(θ∥∥∥ θ1

)log(1/δ) +

t∑τ=1

ετ .

Applying the bound from (18),

(3− e)t∑

τ=1

ε(hτ ) ≤ L2C2

σdψ

(θ∥∥∥ θ1

)(log(1/δ) +

1

2(1− β)β

).

We then conclude

mint<T

ε(ht) ≤1

T

T∑τ=1

ε(hτ ) ≤L2C2dψ

(θ∥∥∥ θ1

)σ(3− e)T

(log(1/δ) +

1

σ(1− β)β

),

which completes the proof.

We now consider the case of online learning with bounded noise. The analysis combines techniquesfrom Theorems H.1 and G.1, and concludes a slower O

(1/√T)

rate than Theorem H.1.

Theorem H.2 (Online learning with bounded noise). Suppose that {xi, yi}mi=1 are drawn i.i.d. froma distributionD supported onX×[0, 1] with E [y|x] = u (〈θ,α(x)〉) for a known nondecreasing andL-Lipschitz link function u : R→ [0, 1], a kernel function K with corresponding finite-dimensionalfeature map α(x) ∈ Rp, and an unknown vector of parameters θ ∈ C ⊆ Rp. Assume that C iscompact, and let R = Diam(C). Suppose that ψ is σ-strongly convex with respect to ‖ · ‖2, and that

‖α(x)‖2 ≤ C for all x ∈ X . Fix a horizon T , and choose λ < min{

2σC2L ,

1√T

}. Then for any

δ ∈ (0, 1), with probability at least 1− δ over the draws of the {xi, yi},

mint<T

ε(ht) ≤ O(L√T

(dψ

(θ∥∥∥ θ1

)+√CR log(6/δ) +

C2

σ

)).

29

Page 30: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

Proof. From (15), we have the bound

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)+

λ2

∥∥∥(u(⟨α(xt), θt

⟩)− yt

)α(xt)

∥∥∥2

2

+ λ⟨(u(⟨

α(xt), θt

⟩)− yt

)α(xt),θ − θt

⟩.

Note that we can write(u(⟨

α(xt), θt

⟩)− yt

)2

=(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

+ (u (〈α(xt),θ〉)− yt)2

+ 2(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)(u (〈α(xt),θ〉)− yt) ,

and using that u is nondecreasing and L-Lipschitz,⟨(u(⟨

α(xt), θt

⟩)− yt

)α(xt),θ − θt

⟩≤ − 1

L

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

+ (u (〈α(xt),θ〉)− yt)⟨α(xt),θ − θt

⟩.

Putting these together, we conclude the bound,

(θ∥∥∥ θt+1

)≤ dψ

(θ∥∥∥ θt)− λ( 1

L− λC2

)(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

+ λ (u (〈α(xt),θ〉)− yt)(⟨

α(xt),θ − θt

⟩+λC2

σ

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

))+λ2C2

2σ(u (〈α(xt),θ〉)− yt)2

.

Summing both sides from t = 1 to T ,

(θ∥∥∥ θT+1

)≤ dψ

(θ∥∥∥ θ1

)− λ

(1

L− λC2

) T∑t=1

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

+ λ

T∑t=1

(u (〈α(xt),θ〉)− yt)(⟨

α(xt),θ − θt

⟩+λC2

σ

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

))

+λ2C2

T∑t=1

(u (〈α(xt),θ〉)− yt)2.

Define the filtration {Ft = σ(x1, y1,x2, y2, . . . ,xt, yt,xt+1)}, and note that α(xt) and θt are bothmeasurable with respect to Ft−1. Using that E [yt|xt] = u (〈α(xt),θ〉), this shows that the twosequences

D(1)t = (u (〈α(xt),θ〉)− yt)

⟨α(xt),θ − θt

⟩,

D(2)t = (u (〈α(xt),θ〉)− yt)

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

),

are martingale difference sequences adapted to {Ft}. Furthermore, note that |D(1)t | ≤ CR and

|D(2)t | ≤ LCR almost surely where R = Diam(C). Hence, by an Azuma-Hoeffding bound, with

probability at least 1− δ/3,T∑t=1

D(1)t ≤

√CRT log(6/δ),

T∑t=1

D(2)t ≤

√LCRT log(6/δ).

We also have the almost sure bound on the remaining variance term,T∑t=1

(u (〈α(xt),θ〉)− yt)2 ≤ T.

30

Page 31: The Reflectron: Exploiting geometry for learning ... · Generalized linear models (GLMs) represent a powerful extension of linear regression. In a GLM, the dependent variables y

Putting these bounds together and rearranging, we conclude that with probability at least 1− 2δ/3,

λ

(1

L− λC2

) T∑t=1

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

≤ dψ(θ∥∥∥ θ1

)− dψ

(θ∥∥∥ θT+1

)+ λ√CRT log(6/δ) +

λ2C2

σ

√LCRT log(6/δ) +

λ2C2T

2σ.

Let β ∈ (0, 1) and take λ = min{

2σ(1−β)C2L , 1√

T

}. Define β′ = 1 − 2σC2L√

T, and define β =

max{β, β′}. Then 1L −

λC2

2σ = βL . Defining εt =

(u(⟨

α(xt), θt

⟩)− u (〈α(xt),θ〉)

)2

, we find

T∑t=1

εt ≤L

βmax

{√T ,

C2L

2σ(1− β)

}dψ

(θ∥∥∥ θ1

)+L

β

√CRT log(6/δ)

+C2L

σβ

√LCR log (6/δ) +

C2L√T

2σβ. (19)

As in the proof of Theorem H.1, we now want to transfer this bound to a bound on ε(ht) viaLemma H.1. Define D(3)

t = ε(ht)− εt, and note that this is a martingale difference sequence adaptedto the filtration {Ft = σ(x1, y1,x2, y2, . . . ,xt, yt)}. D(3)

t satisfies the following inequalities almostsurely,

D(3)t ≤

L2C2R

2,

E[(D

(3)t

)2

|Ft−1

]≤ L2C2R

2ε(ht).

Thus, by Lemma H.1, with probability at least 1− δ/3,

T∑τ=1

ε(hτ ) ≤ L2C2R

2(3− e)log(3/δ) +

1

3− e

T∑τ=1

ετ .

Using (19), with probability at least 1− δ,

T∑τ=1

ε(hτ ) ≤ L2C2R

2(3− e)log(3/δ) +

L

β(3− e)max

{√T ,

C2L

2σ(1− β)

}dψ

(θ∥∥∥ θ1

)+

L

β(3− e)√CRT log(6/δ)

+C2L

σβ(3− e)√LCR log (6/δ) +

C2L√T

2σβ(3− e).

Noting that mint<T ε(hτ ) ≤ 1T

∑Tτ=1 ε(hτ ) completes the proof.

31