The unreasonable effectiveness of mathematics, revisited
Big data and neuroscience
Jaime Gómez-Ramírez
Fundación Reina Sofía. Centre for Research in Neurodegenerative Diseases
April 11, 2018
Jaime Gómez-Ramírez · The unreasonable effectiveness of mathematics, revisited
-
The effectiveness of mathematics
Einstein: "The most incomprehensible thing about the world is that it is comprehensible."
Wigner: "The unreasonable effectiveness of mathematics."
Gelfand: "The unreasonable ineffectiveness of mathematics in biology."
-
The effectiveness of mathematics
Heat loss in coffee: dQ/dt = A_s (T_coffee − T_room)
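The cooling law on this slide can be checked with a few lines of numerical integration. This is a minimal sketch: the rate constant k, the time step and the temperatures are made-up illustrative values, and `cool` is a hypothetical helper, not code from the talk.

```python
# Forward-Euler integration of the cooling law, written as
# dT/dt = -k (T_coffee - T_room); all constants below are illustrative.
def cool(T0, T_room, k, dt, steps):
    T = T0
    for _ in range(steps):
        T += -k * (T - T_room) * dt     # heat loss proportional to the gap
    return T

T_hour = cool(T0=90.0, T_room=20.0, k=0.1, dt=0.1, steps=600)
# the temperature decays exponentially toward T_room
assert 20.0 < T_hour < 90.0
```

The coffee never cools below room temperature; the gap to T_room shrinks by a constant factor per step, which is the discrete version of exponential decay.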
-
The effectiveness of mathematics
Wigner's 1960 essay: "the enormous usefulness of mathematics in natural science is something bordering on the mysterious".
The typical interpretation of Wigner's text is as follows:
premise: math concepts arise from an aesthetic impulse in humans
premise: it is unreasonable to think that those same impulses are effective
observation: nevertheless, it so happens that they are effective
consequence: it follows that math concepts are unreasonably effective (assuming the aesthetic premise is valid)
e.g. imaginary numbers, tensors. Math concepts appear and propagate.
-
The effectiveness of mathematics
Wigner did seminal work on group theory, applied to discover symmetry principles.
Group theory replaced previous methods of analysis in quantum mechanics (the "Gruppenpest"), finding invariants instead of seeking explicit solutions by calculus.
The goal of science is not to explain nature (the black box) but to explain the regularities in the behavior of the object: "Not the things in themselves but the relationships between the things" (Poincaré).
The search for causal explanation in terms of mathematical principles necessitates the belief in the mathematical structure of the universe (the c-word).
-
The effectiveness of mathematics
We are "lucky" that regularities exist and that we can grasp them mathematically.
This is Newton's contribution, and this is in essence why deep learning works.
Regularities are invariant with respect to space and time: A, B, ... → X, Y, ... implies T(A), T(B) → T(X), T(Y) under a transformation T.
Convolutional networks exploit image invariance to work (a cat is a cat is a cat).
-
t = √(2s/g)
What makes it possible for us to discover regularities is the division between initial conditions and regularities.
Laws of nature have the form: IF initial conditions THEN event.
That's why causality is so hard: we need to include/exclude all possible combinations of antecedents (initial conditions).
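The free-fall time on this slide is a concrete instance of the law/initial-condition split: the regularity t = √(2s/g) plus the initial condition (dropped from rest over distance s) fixes the outcome. A minimal numeric check, with an assumed helper name:

```python
import math

# Free-fall time from rest over distance s under gravity g: t = sqrt(2s/g).
def fall_time(s, g=9.81):
    return math.sqrt(2.0 * s / g)

t = fall_time(4.905)   # with g = 9.81 m/s^2, falling 4.905 m takes exactly 1 s
assert abs(t - 1.0) < 1e-12
```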
-
God doesn't play dice, e.g. stochastic Brownian motion.
Our knowledge of nature contains "a strange hierarchy": events we observe → laws (regularities to discover) → symmetry (invariance principles).
The future is always uncertain, but nevertheless there are correlations (laws) that we can discover.
-
AI, Machine Learning, Deep Learning
AI ⊃ Machine Learning ⊃ Deep learning
ANN are nonlinear mapping systems whose functioning principles are vaguely based on the nervous systems of mammals.
Data is the most valuable asset and computation is a cheap commodity ("information wants to be free").
-
Perceptron
y = f(Σ_k w_k x_k)   (1)
"A Logical Calculus of the Ideas Immanent in Nervous Activity" (McCulloch and Pitts, 1943).
"If it doesn't rain (x1, w1) and homework is done (x2, w2), go to the movies, y (output)."
Neurons with a binary threshold activation function are analogous to first-order logic sentences.
By itself a neuron (or an ANN) does very little, but a sufficiently large network with appropriate structure and properly chosen weights can approximate any function with arbitrary accuracy.
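The movies example on this slide can be written as a threshold unit in a few lines. This is a sketch: the weights and threshold below are one illustrative choice (any weights realizing AND would do), and the helper names are assumptions.

```python
# A McCulloch-Pitts style threshold unit: fire (1) when the weighted sum of
# the binary inputs reaches the threshold, otherwise stay silent (0).
def neuron(inputs, weights, threshold):
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1 if z >= threshold else 0

# go to the movies only if it doesn't rain (x1 = 1) AND homework is done (x2 = 1)
def go_to_movies(x1, x2):
    return neuron([x1, x2], weights=[1, 1], threshold=2)

assert [go_to_movies(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 0, 0, 1]
```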
-
Perceptron
A perceptron is any feedforward network of nodes with responses like equation (1).
y = f(Σ_k w_k x_k) = f(z)   (2)
In general, f is a bounded, nondecreasing, nonlinear squashing function, e.g. the sigmoid:
f(z) = 1 / (1 + e^(−z)),   f'(z) = e^(−z) / (1 + e^(−z))^2
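The sigmoid and its derivative on this slide can be verified numerically; the test points below are arbitrary assumptions.

```python
import math

def f(z):       # logistic sigmoid
    return 1.0 / (1.0 + math.exp(-z))

def fprime(z):  # closed form from the slide: e^{-z} / (1 + e^{-z})^2
    return math.exp(-z) / (1.0 + math.exp(-z)) ** 2

# The derivative also factors as f(z)*(1 - f(z)), the form usually coded in
# backprop; check both agree and match a central finite difference.
for z in (-2.0, 0.0, 3.5):
    assert abs(fprime(z) - f(z) * (1.0 - f(z))) < 1e-12
    h = 1e-6
    assert abs(fprime(z) - (f(z + h) - f(z - h)) / (2 * h)) < 1e-6
```

The factored form f(z)(1 − f(z)) is the one worth remembering: it lets backprop reuse the forward activation instead of recomputing exponentials.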
-
Perceptron
Other choices are tanh, the step function and, more recently, the ReLU:
y = ReLU(z) = max(0, z),   y' = 1 for z > 0
ReLU works better and faster (the gradient is constant); it can be smoothly approximated by y = ln(1 + e^z).
Reduced likelihood of the gradient vanishing.
Sparsity is produced when z ≤ 0; sigmoids, on the other hand, tend to produce denser representations.
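The ReLU and its smooth approximation from this slide, as a minimal sketch (function names are assumptions):

```python
import math

def relu(z):
    return max(0.0, z)

def softplus(z):   # the smooth surrogate y = ln(1 + e^z) mentioned on the slide
    return math.log1p(math.exp(z))

# ReLU is exactly zero for z <= 0 (sparsity) and has constant gradient 1 for
# z > 0; softplus smooths the kink at 0 and its derivative is the sigmoid.
assert relu(-3.0) == 0.0 and relu(2.5) == 2.5
assert abs(softplus(10.0) - 10.0) < 1e-4       # softplus(z) ≈ z for large z
assert abs(softplus(0.0) - math.log(2.0)) < 1e-15
```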
-
What can and can't perceptrons do?
(Single-layer) perceptrons can correctly classify only data sets that are linearly separable (separable by a hyperplane).
The XOR function is famously not linearly separable, and this matters because many classification problems are not linearly separable.
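The non-separability of XOR can be exhibited by brute force. A sketch under stated assumptions: the small integer search range is an illustrative choice (it suffices for 2-input threshold functions up to scaling), and the helper name is made up.

```python
from itertools import product

# Search for a separating rule w1*x1 + w2*x2 >= b realizing a given truth
# table over the four binary input points.
PTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def separable(truth_table, span=range(-3, 4)):
    for w1, w2, b in product(span, span, span):
        out = [1 if w1 * x1 + w2 * x2 >= b else 0 for x1, x2 in PTS]
        if out == truth_table:
            return True
    return False

assert separable([0, 0, 0, 1])        # AND: w1 = w2 = 1, b = 2 works
assert not separable([0, 1, 1, 0])    # XOR: no hyperplane separates it
```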
-
What can and can't perceptrons do?
There are 2^(2^d) boolean functions of d boolean input variables, and only O(2^(d^2)) of them are linearly separable.
For d = 2, 14/16 are linearly separable (XOR and its complement are the exceptions), but for d = 4, only 1882/65536 are linearly separable.
Although at the time it was known that multilayer networks were more powerful than single-layer ones, the learning algorithms for multilayer architectures were not known.
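The 14/16 figure for d = 2 can be reproduced by brute force. A sketch, assuming that small integer weights suffice at d = 2 (a larger range would be needed for bigger d):

```python
from itertools import product

# Enumerate threshold units with small integer weights and collect the
# distinct truth tables they realize on the 2^d binary input points.
d = 2
points = list(product([0, 1], repeat=d))
span = range(-3, 4)
realizable = set()
for *w, b in product(*([span] * (d + 1))):
    table = tuple(1 if sum(wi * xi for wi, xi in zip(w, p)) >= b else 0
                  for p in points)
    realizable.add(table)

assert len(realizable) == 14              # 14 of the 2^(2^2) = 16 functions
assert (0, 1, 1, 0) not in realizable     # XOR and XNOR are the two misses
```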
-
Deep networks
ANN learn by example and use backpropagation.
If data are well behaved, the network will learn not only the training examples but also the underlying relationships.
ANN are adaptive and self-repairing; they also have some fault tolerance due to their redundant parallel structure (dense connectivity makes them resilient to minor damage: graceful degradation).
Units within a layer are independent, so they can be evaluated simultaneously; e.g. a network with 2,000 nodes in two layers produces a response in 2 time steps rather than in 2,000 steps if each neuron had to be processed serially.
Until the advent of GPUs this advantage was not fully exploited by computers.
-
Deep networks
Table: ANN versus real nervous system
MLP                      | Nervous system
feedforward              | recurrent
dense (fully connected)  | sparse (local)
O(10^2) to O(10^4) units | O(10^10) neurons, O(10^15) synapses
static                   | dynamic: spike trains, synchronization, fatigue
-
Why is an MLP better than one layer?
y = mx is a system with one parameter, m. What kind of data sets can it separate? Only the linearly separable ones.
y = sin(kx) also has one parameter, the frequency k, but it can separate any arbitrary distribution of points on the x-axis.
-
Universality of MLP
Any bounded function can be approximated with arbitrary accuracy if enough hidden units are available: multilayer perceptrons are universal approximators.
How many layers do we need for this astounding property? Kolmogorov showed that one hidden layer is sufficient.
Any continuous function from n variables to an m-dimensional output can be implemented by a network with one hidden layer.
Unfortunately the proof is not constructive; that is, it does not tell us how the weights should be chosen to produce such a function.
-
How important is the universality of MLP?
Is universal approximation a rare property? Not really: many other systems, such as polynomials, trigonometric polynomials (e.g. Fourier series), wavelets and kernel regression systems (SVM), also have universality properties.
-
Architecture
The first layer detects edges, and the second holds the abstract concepts of loops and straight lines; this is the hope behind having a layered structure, and it works because of what Wigner already said.
-
Gradient descent
Cost C(w); we seek dC(w)/dw = 0, where w is a huge column vector (784·16 + 16·16 + 16·10 weights plus 16 + 16 + 10 biases, i.e. 13,002 dimensions for a 784-16-16-10 network).
The gradient points in the direction of steepest increase, so its negative gives the direction to take to decrease the error (cost) most quickly.
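The recipe is the same in 2 dimensions as in 13,002, so a toy sketch shows it whole. The cost C(w) = (w0 − 3)^2 + (w1 + 1)^2, the learning rate and the helper name are all made-up illustrative choices.

```python
# One gradient-descent step: move against the gradient, the direction of
# steepest decrease of the cost.
def grad_step(w, grad, lr):
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# C(w) = (w0 - 3)^2 + (w1 + 1)^2 has gradient (2(w0 - 3), 2(w1 + 1))
w = [0.0, 0.0]
for _ in range(200):
    g = [2 * (w[0] - 3), 2 * (w[1] + 1)]   # ∇C at the current w
    w = grad_step(w, g, lr=0.1)
# w converges to the unique minimizer (3, -1)
```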
-
Backprop
Backprop is the method used to calculate the gradient vector, which tells you which direction to take and how big the step is:
1. compute ∇C
2. take a step in the −∇C direction
3. repeat
Learning is finding the weights that minimize the cost function. Backprop is the algorithm used within gradient descent. Learning is 'just' finding the right weights and biases.
-
Backprop in action, chain rule
The cost of one training example is C_0 = (a^L − y)^2, and the last activation is a^L = σ(w^L a^(L−1) + b^L) = σ(z^L).
How sensitive is the cost function to small changes in the weight?
∂C_0/∂w^L = (∂z^L/∂w^L) (∂a^L/∂z^L) (∂C_0/∂a^L)
∂C_0/∂a^L = 2(a^L − y),   ∂a^L/∂z^L = σ'(z^L),   ∂z^L/∂w^L = a^(L−1)
Average over all training examples: ∂C/∂w^L = (1/n) Σ_{k=0}^{n−1} ∂C_k/∂w^L
∇C = [∂C/∂w^1, ∂C/∂b^1, ..., ∂C/∂w^L]
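The single-neuron chain rule on this slide can be checked against a finite difference; the concrete values of w, a_prev, b and y below are arbitrary assumptions.

```python
import math

# C0 = (a - y)^2 with a = σ(w * a_prev + b), for one neuron and one example.
sigma = lambda z: 1.0 / (1.0 + math.exp(-z))

def cost(w, a_prev=0.7, b=0.1, y=1.0):
    return (sigma(w * a_prev + b) - y) ** 2

w, a_prev, b, y = 0.5, 0.7, 0.1, 1.0
z = w * a_prev + b
a = sigma(z)
# chain rule: ∂C0/∂w = (∂z/∂w)(∂a/∂z)(∂C0/∂a) = a_prev * σ'(z) * 2(a - y)
analytic = a_prev * (sigma(z) * (1.0 - sigma(z))) * 2.0 * (a - y)
h = 1e-6
numeric = (cost(w + h) - cost(w - h)) / (2 * h)
assert abs(analytic - numeric) < 1e-8
```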
-
Curse of dimensionality
The curse of dimensionality refers to the apparent intractability of systematically searching through a high-dimensional space.
As n gets bigger it gets harder and harder to sample all the boxes: with n dimensions, each allowing m states, we have m^n possible combinations.
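The m^n growth is worth seeing in numbers; the helper name is an assumption.

```python
# With n dimensions and m states per dimension there are m**n cells to sample,
# so the sampling burden grows exponentially with n.
def cells(m, n):
    return m ** n

assert cells(10, 2) == 100            # a 10x10 grid: easy to cover
assert cells(10, 20) == 10 ** 20      # 20 dimensions: hopeless to cover
```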
-
Blessing of dimensionality
In an MLP the approximation error decreases with the number of training samples, error O(1/√N), and also with the number of hidden units, error O(1/M); unlike other systems (e.g. polynomials), this is independent of the input size and avoids the curse of dimensionality.
From these results we can build bounds, for example
N > O(Mp/ε)   (3)
where N is the number of samples, M the number of hidden nodes, p the input dimension (Mp the number of parameters) and ε the desired approximation error.
More layers are better and do no harm.
-
Bias-variance tradeoff
The bias-variance tradeoff is the problem of simultaneously minimizing two sources of error in an estimator. The bias-variance decomposition:
MSE = E[(θ̂ − θ)^2] = (E[θ̂] − θ)^2 + Var(θ̂) = (Bias(θ̂))^2 + Var(θ̂)   (4)
The bias/variance tradeoff in deep learning is not exactly a tradeoff: it can be tackled algorithmically.
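The decomposition can be verified by Monte Carlo on a deliberately biased estimator. A sketch: the estimator (0.9 times the sample mean), the true θ and all other constants are illustrative assumptions.

```python
import random

# Empirical check of MSE = Bias^2 + Var: repeat the experiment many times,
# record the estimate, and compare the two sides of the identity.
random.seed(0)
theta, n, trials = 2.0, 10, 50_000
estimates = []
for _ in range(trials):
    sample = [theta + random.gauss(0.0, 1.0) for _ in range(n)]
    estimates.append(0.9 * sum(sample) / n)     # shrunken (biased) sample mean

mean_est = sum(estimates) / trials
mse  = sum((e - theta) ** 2 for e in estimates) / trials
bias = mean_est - theta                          # ≈ 0.9*theta - theta = -0.2
var  = sum((e - mean_est) ** 2 for e in estimates) / trials
assert abs(mse - (bias ** 2 + var)) < 1e-9       # the decomposition holds
```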
-
Bias-variance tradeoff
Table: Bias variance
               | high variance | high bias | high bias and variance | low bias and variance
training error | 2%            | 15%       | 15%                    | 0.5%
dev error      | 11%           | 15%       | 30%                    | 1%
You don't have a dialectical tension of one thing or the other: the table shows 4 cases rather than a tradeoff, and luckily we can take action that fits every case.
-
Bias-variance tradeoff
A bigger network will improve your fit without hurting the variance problem, with the caveat that you regularize properly.
Before, we couldn't make one better without hurting the other; now we can get both better.
-
Ensemble models
Idea: you don't want an organization where everyone is the same ('good'); you may want to introduce variability.
Decision trees are grown by introducing a random element, e.g. at each node randomly choose the features on which to split.
Random forests (randomly constructed trees), each tree voting for a class. Bagging = bootstrap + aggregation.
Great predictors, but interpretability is obscured by the complexity of the model: accuracy generally requires more complex prediction methods.
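Bagging can be sketched end to end in a few lines. This is a toy under stated assumptions: the 1-D data set, the weak "stump" learner and the ensemble size are all made-up illustrative choices, not the random-forest algorithm itself.

```python
import random

# Bagging = bootstrap + aggregation: resample the training set, fit a weak
# threshold classifier ("stump") to each resample, predict by majority vote.
random.seed(1)
data = [(x / 50.0, int(x >= 25)) for x in range(50)]   # true rule: x >= 0.5

def train_stump(sample):
    # pick the threshold minimizing training error on this bootstrap sample
    err, t = min((sum((x >= t) != bool(y) for x, y in sample), t)
                 for t in [i / 50.0 for i in range(51)])
    return t

stumps = [train_stump([random.choice(data) for _ in data]) for _ in range(25)]

def predict(x):
    votes = sum(x >= t for t in stumps)                # aggregate by voting
    return int(votes > len(stumps) / 2)

assert predict(0.9) == 1 and predict(0.1) == 0
```

Each stump sees a slightly different bootstrap sample, so the thresholds vary; the vote averages that variability away, which is the point of the ensemble.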
-
Computational Topology
Topology is concerned with the properties of space that are preserved under continuous deformations: stretching, crumpling and bending, but not tearing or gluing.
Topology is an intermediate analysis medium that focuses on coarse structures.
Why use topology on big data?
It studies the invariants of continuous deformations of the shape of data (resistant to the threshold-selection problem).
It allows measures of shape (clumps, holes and voids) which are invariant across scales.
-
Persistent homology
Edges in a graph capture dyadic relationships.
Graphs can't capture higher-order relationships, but simplicial complexes can.
A simplicial complex is a generalized graph consisting of vertices, edges, triangles and simplices of higher dimension glued together.
-
Persistent homology
C_0(X) = ⟨v1, v2, v3, v4⟩,   C_1(X) = ⟨e1, e2, e3, e4, e5⟩,   C_2(X) = ⟨σ1⟩
Boundary operators: ρ_1 : C_1(X) → C_0(X), ρ_2 : C_2(X) → C_1(X). Applied to an edge, ρ_1 yields a difference of vertices; the higher-order operator ρ_2 acts on triangles (2-simplices).
A loop is a chain with zero boundary: ρ_1(e1 + e2 + e3) = 0 = ρ_1(e1 + e5 + e4), so both loops are in the kernel of ρ_1, Ker(ρ_1) = {x ∈ C_1(X) : ρ_1(x) = 0}.
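These boundary computations can be done concretely over GF(2). A sketch under stated assumptions: the edge labelling (e1 = v1v2, e2 = v2v3, e3 = v3v1, e4 = v4v1, e5 = v2v4) is my choice, made consistent with the two loops e1+e2+e3 and e1+e5+e4 named on the slide, and `rank_gf2` is a hypothetical helper.

```python
# Gaussian elimination over GF(2) (entries 0/1, addition = XOR).
def rank_gf2(rows):
    rows = [r[:] for r in rows]
    rank, ncols = 0, len(rows[0])
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

# ρ_1: rows = vertices v1..v4, columns = edges e1..e5 (1 = incident)
d1 = [[1, 0, 1, 1, 0],   # v1
      [1, 1, 0, 0, 1],   # v2
      [0, 1, 1, 0, 0],   # v3
      [0, 0, 0, 1, 1]]   # v4
# ρ_2: rows = edges, one column for the filled triangle σ1 = e1 + e2 + e3
d2 = [[1], [1], [1], [0], [0]]

dim_H1 = (5 - rank_gf2(d1)) - rank_gf2(d2)   # dim Ker(ρ_1) - rank Im(ρ_2)
assert dim_H1 == 1   # the loop e1+e5+e4 survives; e1+e2+e3 is filled in
```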
-
Persistent homology
e1 + e2 + e3 is obtained as the image of the triangle σ1 under the map ρ_2, whereas e1 + e5 + e4 is not the image of any triangle; in other words, with Im(ρ_2) = {y ∈ C_1(X) : ∃x ∈ C_2(X), ρ_2(x) = y}, we have e1 + e2 + e3 ∈ Im(ρ_2) and e1 + e5 + e4 ∉ Im(ρ_2).
The 1-dimensional homology is the quotient space H_1(X) = Ker(ρ_1)/Im(ρ_2).
H_i(X) = Ker(ρ_i) / Im(ρ_{i+1})   (5)
-
Conclusions
With enough imagination, a classifier (or regression) can be useful for solving a large number of problems.
Deep learning works because there is structure in the world, but we don't know why, because we don't know anything about the initial conditions: "laws of nature are precise beyond anything reasonable; we know virtually nothing about the initial conditions" (Wigner).
There are other ways to reduce complexity in big data while preserving maximal intrinsic information: computational topology.
Occam's dilemma (lex parsimoniae): accuracy generally requires more complex prediction methods; simple and interpretable functions do not make the most accurate predictions.
The curse of dimensionality can be a blessing.
-
Thanks!