
Page 1:

Machine Learning con Matlab

Scuola di Calcolo Scientifico con MATLAB (SCSM) 2017 Palermo 31 Luglio - 4 Agosto 2017

Machine Learning con Matlab

www.u4learn.it Ing. Giuseppe La Tona

Page 2:

Sommario

• Machine Learning definition
• Machine Learning Problems
• Artificial Neural Networks (ANN)
• Nearest Neighbor classification
• Mixture Models and k-means
• Graphical Models

Page 3:

Machine Learning

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.” (Tom M. Mitchell)

Page 4:

Example

Figure 1.1: The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed, then the features extracted and finally the classification emitted (here either “salmon” or “sea bass”). Although the information flow is often chosen to be from the source to the classifier (“bottom-up”), some systems employ “top-down” flow as well, in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction.

… salmon — i.e., the more costly this type of error — the lower we should set the decision threshold x∗ in Fig. 1.3.

Such considerations suggest that there is an overall single cost associated with our decision, and our true task is to make a decision rule (i.e., set a decision boundary) so as to minimize such a cost. This is the central task of decision theory, of which pattern classification is perhaps the most important subfield.

Even if we know the costs associated with our decisions and choose the optimal decision boundary x∗, we may be dissatisfied with the resulting performance. Our first impulse might be to seek yet a different feature on which to separate the fish. Let us assume, though, that no other single visual feature yields better performance than that based on lightness. To improve recognition, then, we must resort to the use of more than one feature at a time.

Figure 1.2: Histograms for the length feature for the two categories. No single threshold value l∗ (decision boundary) will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value l∗ marked will lead to the smallest number of errors, on average.

Figure 1.3: Histograms for the lightness feature for the two categories. No single threshold value x∗ (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x∗ marked will lead to the smallest number of errors, on average.

Page 5:

Example

Figure 1.4: The two features of lightness and width for sea bass and salmon. The dark line might serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors.

In our search for other features, we might try to capitalize on the observation that sea bass are typically wider than salmon. Now we have two features for classifying fish — the lightness x1 and the width x2. If we ignore how these features might be measured in practice, we realize that the feature extractor has thus reduced the image of each fish to a point or feature vector x in a two-dimensional feature space, where x = [x1, x2]ᵀ.

Our problem now is to partition the feature space into two regions, where for all patterns in one region we will call the fish a sea bass, and all points in the other we call it a salmon. Suppose that we measure the feature vectors for our samples and obtain the scattering of points shown in Fig. 1.4. This plot suggests the following rule for separating the fish: Classify the fish as sea bass if its feature vector falls above the decision boundary shown, and as salmon otherwise.
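A tiny MATLAB sketch of such a linear decision rule on the two features; the boundary coefficients and the test fish below are invented for illustration and are not the line of Fig. 1.4:

% Classify one fish from its (lightness, width) features with a straight-line boundary.
w = [0.6; -1.0]; b = 2.0;     % hypothetical boundary coefficients: w'*x + b = 0
x = [4.0; 16.5];              % feature vector x = [lightness; width] of a new fish
if w.'*x + b > 0
    label = 'sea bass';       % one side of the line
else
    label = 'salmon';         % the other side
end
disp(label)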

This rule appears to do a good job of separating our samples and suggests that perhaps incorporating yet more features would be desirable. Besides the lightness and width of the fish, we might include some shape parameter, such as the vertex angle of the dorsal fin, or the placement of the eyes (as expressed as a proportion of the mouth-to-tail distance), and so on. How do we know beforehand which of these features will work best? Some features might be redundant: for instance if the eye color of all fish correlated perfectly with width, then classification performance need not be improved if we also include eye color as a feature. Even if the difficulty or computational cost in attaining more features is of no concern, might we ever have too many features?

Suppose that other features are too expensive to measure, or provide little improvement (or possibly even degrade the performance) in the approach described above, and that we are forced to make our decision based on the two features in Fig. 1.4. If our models were extremely complicated, our classifier would have a decision boundary more complex than the simple straight line. In that case all the training patterns would be separated perfectly, as shown in Fig. 1.5.

Page 6:

Machine Learning Sub-Problems

• Overfitting
• Noise
• Feature Extraction
• Model Selection
• Prior Knowledge
• Missing Features

Figure 1.5: Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be misclassified as a sea bass.

With such a “solution,” though, our satisfaction would be premature because the central aim of designing a classifier is to suggest actions when presented with novel patterns, i.e., fish not yet seen. This is the issue of generalization. It is unlikely that the complex decision boundary in Fig. 1.5 would provide good generalization, since it seems to be “tuned” to the particular training samples, rather than some underlying characteristics or true model of all the sea bass and salmon that will have to be separated.

Naturally, one approach would be to get more training samples for obtaining a better estimate of the true underlying characteristics, for instance the probability distributions of the categories. In most pattern recognition problems, however, the amount of such data we can obtain easily is often quite limited. Even with a vast amount of training data in a continuous feature space though, if we followed the approach in Fig. 1.5 our classifier would give a horrendously complicated decision boundary — one that would be unlikely to do well on novel patterns.

Rather, then, we might seek to “simplify” the recognizer, motivated by a belief that the underlying models will not require a decision boundary that is as complex as that in Fig. 1.5. Indeed, we might be satisfied with the slightly poorer performance on the training samples if it means that our classifier will have better performance on novel patterns.∗ But if designing a very complex recognizer is unlikely to give good generalization, precisely how should we quantify and favor simpler classifiers? How would our system automatically determine that the simple curve in Fig. 1.6 is preferable to the manifestly simpler straight line in Fig. 1.4 or the complicated boundary in Fig. 1.5? Assuming that we somehow manage to optimize this tradeoff, can we then predict how well our system will generalize to new patterns? These are some of the central problems in statistical pattern recognition.

For the same incoming patterns, we might need to use a drastically different cost …

∗ The philosophical underpinnings of this approach derive from William of Occam (1284-1347?), who advocated favoring simpler explanations over those that are needlessly complicated — Entia non sunt multiplicanda praeter necessitatem (“Entities are not to be multiplied without necessity”). Decisions based on overly complex models often lead to lower accuracy of the classifier.

Page 7:

Styles of Machine Learning

• Supervised Learning
• Unsupervised Learning
• Anomaly detection
• On-line learning
• Semi-supervised learning

Page 8:

Supervised Learning

Given a set of data D = {(xn,yn),n=1,...,N} the task is to learn the relationship between the input x and output y such that, when given a novel input x∗ the predicted output y∗ is accurate. The pair (x∗,y∗) is not in D but assumed to be generated by the same unknown process that generated D. To specify explicitly what accuracy means one defines a loss function L(ypred, ytrue) or, conversely, a utility function U = −L.
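A minimal MATLAB illustration of the loss-function view (my own sketch, not part of the slides; the squared-error and 0-1 losses below are standard choices and the tiny vectors are made-up data):

ypred = [1.9; 3.1; 4.8];          % predicted outputs for three novel inputs
ytrue = [2.0; 3.0; 5.0];          % true outputs
Lsq   = mean((ypred - ytrue).^2)  % average squared-error loss (regression)
cpred = [1; 2; 2];                % predicted class labels
ctrue = [1; 2; 1];                % true class labels
L01   = mean(cpred ~= ctrue)      % average 0-1 loss, i.e. the error rate (classification)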

Page 9:

Supervised Learning

Example: A father decides to teach his young son what a sports car is. Finding it difficult to explain in words, he decides to give some examples. They stand on a motorway bridge and, as each car passes underneath, the father cries out ‘that’s a sports car!’ when a sports car passes by. After ten minutes, the father asks his son if he’s understood what a sports car is. The son says, ‘sure, it’s easy’. An old red VW Beetle passes by, and the son shouts – ‘that’s a sports car!’. Dejected, the father asks – ‘why do you say that?’. ‘Because all sports cars are red!’, replies the son.

Page 10:

Unsupervised Learning

Given a set of data D = {xn, n=1,...,N}, in unsupervised learning we aim to find a plausible compact description of the data. An objective is used to quantify the accuracy of the description. In unsupervised learning there is no special prediction variable so that, from a probabilistic perspective, we are interested in modelling the distribution p(x). The likelihood of the data under the model is a popular measure of the accuracy of the description.
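A small sketch of this idea in MATLAB (my own illustration, assuming a simple univariate Gaussian model for p(x); the data are synthetic):

x      = randn(100,1)*2 + 5;     % unlabelled data D = {x_n}
mu     = mean(x);                % maximum-likelihood mean of the Gaussian model
sigma  = std(x,1);               % maximum-likelihood standard deviation (1/N normalisation)
loglik = sum(-0.5*log(2*pi*sigma^2) - (x - mu).^2/(2*sigma^2))
% A higher log-likelihood means the fitted p(x) describes the data better.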

Page 11:

Unsupervised Learning

Page 12:

Other Types of Learning

• Anomaly Detection
  • Detecting anomalous events in industrial processes (plant monitoring), engine monitoring and unexpected buying behaviour patterns in customers all fall under the area of anomaly detection.

• Online Learning (supervised and unsupervised)
  • In online learning data arrives sequentially and we continually update our model as new data becomes available.

•  Semi-supervised learning

Page 13:

Machine Learning Problems

• Classification
• Regression
• Clustering
• Density Estimation
• Dimensionality Reduction

Page 14:

Exercise

• A blog platform needs an automatic tagging service.

• From the text of a blog article, recommend a list of tags

• How would you proceed?

• Which questions should you first ask?

Page 15:

Machine Learning Steps

Page 16:

Datasets

Training set
Validation set
Test set

Page 17:

Artificial Neural Networks

Neuron or network node

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

… not understand fully. Each individual neuron is as complex or more complex than any of our computers. For this reason, we will call the elementary components of artificial neural networks simply “computing units” and not neurons. In the mid-1980s, the PDP (Parallel Distributed Processing) group already agreed to this convention at the insistence of Francis Crick [95].

1.3 Artificial neural networks

The discussion in the last section is only an example of how important it is to define the primitive functions and composition rules of the computational model. If we are computing with a conventional von Neumann processor, a minimal set of machine instructions is needed in order to implement all computable functions. In the case of artificial neural networks, the primitive functions are located in the nodes of the network and the composition rules are contained implicitly in the interconnection pattern of the nodes, in the synchrony or asynchrony of the transmission of information, and in the presence or absence of cycles.

1.3.1 Networks of primitive functions

Figure 1.14 shows the structure of an abstract neuron with n inputs. Each input channel i can transmit a real value xi. The primitive function f computed in the body of the abstract neuron can be selected arbitrarily. Usually the input channels have an associated weight, which means that the incoming information xi is multiplied by the corresponding weight wi. The transmitted information is integrated at the neuron (usually just by adding the different signals) and the primitive function is then evaluated.

f(w1x1 + w2x2 + · · · + wnxn)

Fig. 1.14. An abstract neuron

If we conceive of each node in an artificial neural network as a primitive function capable of transforming its input in a precisely defined output, then artificial neural networks are nothing but networks of primitive functions. Different models of artificial neural networks differ mainly in the assumptions about the primitive functions used, the interconnection pattern, and the timing of the transmission of information.

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996
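A minimal MATLAB sketch of the abstract neuron just described, directly implementing f(w1x1 + ... + wnxn); the sigmoid activation and the particular numbers are my own choices for illustration:

x = [0.5; -1.2; 2.0];            % n = 3 input values x_i
w = [0.4;  0.1; 0.7];            % associated weights w_i
f = @(s) 1./(1 + exp(-s));       % primitive (activation) function, here a sigmoid
output = f(w' * x)               % f(w1*x1 + w2*x2 + ... + wn*xn)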

Black box representation

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

2 Threshold Logic

2.1 Networks of functions

We deal in this chapter with the simplest kind of computing units used to build artificial neural networks. These computing elements are a generalization of the common logic gates used in conventional computing and, since they operate by comparing their total input with a threshold, this field of research is known as threshold logic.

2.1.1 Feed-forward and recurrent networks

Our review in the previous chapter of the characteristics and structure of biological neural networks provides us with the initial motivation for a deeper inquiry into the properties of networks of abstract neurons. From the viewpoint of the engineer, it is important to define how a network should behave, without having to specify completely all of its parameters, which are to be found in a learning process. Artificial neural networks are used in many cases as a black box: a certain input should produce a desired output, but how the network achieves this result is left to a self-organizing process.

Fig. 2.1. A neural network as a black box

In general we are interested in mapping an n-dimensional real input (x1, x2, . . . , xn) to an m-dimensional real output (y1, y2, . . . , ym). A neural …

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

Page 18:

Artificial Neural Networks

General network node

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

We will avoid giving a general definition of a neural network at this point. So many models have been proposed which differ in so many respects that any definition trying to encompass this variety would be unnecessarily clumsy. As we show in this chapter, it is not necessary to start building neural networks with “high powered” computing units, as some authors do [384]. We will start our investigations with the general notion that a neural network is a network of functions in which synchronization can be considered explicitly or not.

2.1.2 The computing units

The nodes of the networks we consider will be called computing elements or simply units. We assume that the edges of the network transmit information in a predetermined direction and the number of incoming edges into a node is not restricted by some upper bound. This is called the unlimited fan-in property of our computing units.

Fig. 2.4. Evaluation of a function of n arguments

The primitive function computed at each node is in general a function of n arguments. Normally, however, we try to use very simple primitive functions of one argument at the nodes. This means that the incoming n arguments have to be reduced to a single numerical value. Therefore computing units are split into two functional parts: an integration function g reduces the n arguments to a single value and the output or activation function f produces the output of this node taking that single value as its argument. Figure 2.5 shows this general structure of the computing units. Usually the integration function g is the addition function.

f(g(x1, x2, . . . , xn))

Fig. 2.5. Generic computing unit

McCulloch–Pitts networks are even simpler than this, because they use solely binary signals, i.e., ones or zeros. The nodes produce only binary results …

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

Binary threshold function

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

Fig. 2.7. The step function with threshold θ

In the following subsection we assume provisionally that there is no delay in the computation of the output.

2.2 Synthesis of Boolean functions

The power of threshold gates of the McCulloch–Pitts type can be illustrated by showing how to synthesize any given logical function of n arguments. We deal firstly with the simpler kind of logic gates.

2.2.1 Conjunction, disjunction, negation

Mappings from {0, 1}^n onto {0, 1} are called logical or Boolean functions. Simple logical functions can be implemented directly with a single McCulloch–Pitts unit. The output value 1 can be associated with the logical value true and 0 with the logical value false. It is straightforward to verify that the two units of Figure 2.8 compute the functions AND and OR respectively.

Fig. 2.8. Implementation of AND and OR gates (threshold 2 for the AND unit, threshold 1 for the OR unit)

A single unit can compute the disjunction or the conjunction of n arguments as is shown in Figure 2.9, where the conjunction of three and four arguments is computed by two units. The same kind of computation requires several conventional logic gates with two inputs. It should be clear from this simple example that threshold logic elements can reduce the complexity of the circuit used to implement a given logical function.

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996
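A short MATLAB sketch of McCulloch–Pitts threshold units computing AND and OR with the thresholds of Fig. 2.8 (the anonymous-function style is my own choice):

% McCulloch-Pitts threshold unit: fires (outputs 1) when the total input reaches the threshold.
mcp = @(x, theta) double(sum(x) >= theta);
AND = @(x1, x2) mcp([x1 x2], 2);             % threshold 2, as in Fig. 2.8
OR  = @(x1, x2) mcp([x1 x2], 1);             % threshold 1, as in Fig. 2.8
[AND(1,1) AND(1,0) OR(1,0) OR(0,0)]          % displays 1 0 1 0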

Page 19:

Artificial Neural Networks

Input space separation Binary threshold function

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

3.2.2 The XOR problem

We can now deal with the problem of determining which logical functions can be implemented with a single perceptron. A perceptron network is capable of computing any logical function, since perceptrons are even more powerful than unweighted McCulloch–Pitts elements. If we reduce the network to a single element, which functions are still computable?

Taking the functions of two variables as an example we can gain some insight into this problem. Table 3.1 shows all 16 possible Boolean functions of two variables f0 to f15. Each column fi shows the value of the function for each combination of the two variables x1 and x2. The function f0, for example, is the zero function whereas f14 is the OR-function.

Table 3.1. The 16 Boolean functions of two variables

x1 x2 | f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15
 0  0 |  0  1  0  1  0  1  0  1  0  1   0   1   0   1   0   1
 0  1 |  0  0  1  1  0  0  1  1  0  0   1   1   0   0   1   1
 1  0 |  0  0  0  0  1  1  1  1  0  0   0   0   1   1   1   1
 1  1 |  0  0  0  0  0  0  0  0  1  1   1   1   1   1   1   1

Perceptron-computable functions are those for which the points whose function value is 0 can be separated from the points whose function value is 1 using a line. Figure 3.6 shows two possible separations to compute the OR and the AND functions.

Fig. 3.6. Separations of input space corresponding to OR and AND

It is clear that two of the functions in the table cannot be computed in this way. They are the function XOR and identity (f6 and f9). It is intuitively evident that no line can produce the necessary separation of the input space. This can also be shown analytically.

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

Fig. 3.8. Non-linear separation of input space

In the general case we want to distinguish between regions of space. A neural network must learn to identify these regions and to associate them with the correct response. The main problem is determining whether the free parameters of these decision regions can be found using a learning algorithm. In the next chapter we show that it is always possible to find these free parameters for linear decision curves, if the patterns to be classified are indeed linearly separable. Finding learning algorithms for other kinds of decision curves is an important research topic not dealt with here [45, 4].

3.4 Applications and biological analogy

The appeal of the perceptron model is grounded on its simplicity and the wide range of applications that it has found. As we show in this section, weighted threshold elements can play an important role in image processing and computer vision.

3.4.1 Edge detection with perceptrons

A good example of the pattern recognition capabilities of perceptrons is edge detection (Figure 3.9). Assume that a method of extracting the edges of a figure darker than the background (or the converse) is needed. Each pixel in the figure is compared to its immediate neighbors and in the case where the pixel is black and one of its neighbors white, it will be classified as part of an edge. This can be programmed sequentially in a computer, but since the decision about each point uses only local information, it is straightforward to implement the strategy in parallel.

Assume that the figures to be processed are projected on a screen in which each pixel is connected to a perceptron, which also receives inputs from its immediate neighbors. Figure 3.10 shows the shape of the receptive field (a so-called Moore neighborhood) and the weights of the connections to the perceptron. The central point is weighted with 8 and the rest with −1. In the field of image processing this is called a convolution operator, because it is …

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996
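The edge-detection operator just described (weight 8 on the central pixel, −1 on its eight Moore neighbours) can be sketched in a few lines of MATLAB; the toy image is a bright square on a dark background (the book's convention is a dark figure, the converse case it also mentions) and the zero threshold is my assumption:

img = zeros(10,10);  img(3:7,3:7) = 1;        % toy binary image: a bright 5x5 square
kernel = [-1 -1 -1; -1 8 -1; -1 -1 -1];       % central pixel weighted 8, Moore neighbours -1
excitation = conv2(img, kernel, 'same');      % weighted sum seen by the perceptron at each pixel
edges = excitation > 0;                       % the threshold unit fires only on edge pixels
imagesc(edges); colormap(gray); axis image    % interior and background stay dark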

Page 20:

Feed-Forward ANN

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

7.3.1 Extended network

We will consider a network with n input sites, k hidden, and m output units. The weight between input site i and hidden unit j will be called w(1)ij. The weight between hidden unit i and output unit j will be called w(2)ij. The bias −θ of each unit is implemented as the weight of an additional edge. Input vectors are thus extended with a 1 component, and the same is done with the output vector from the hidden layer. Figure 7.17 shows how this is done. The weight between the constant 1 and the hidden unit j is called w(1)n+1,j and the weight between the constant 1 and the output unit j is denoted by w(2)k+1,j.

Fig. 7.17. Notation for the three-layered network (n input sites, k hidden units and m output units, with connection matrices W1 and W2 and an extra constant-1 site feeding the biases)

There are (n + 1) × k weights between input sites and hidden units and (k + 1) × m between hidden and output units. Let W1 denote the (n + 1) × k matrix with component w(1)ij at the i-th row and the j-th column. Similarly let W2 denote the (k + 1) × m matrix with components w(2)ij. We use an overlined notation to emphasize that the last row of both matrices corresponds to the biases of the computing units. The matrix of weights without this last row will be needed in the backpropagation step. The n-dimensional input vector o = (o1, . . . , on) is extended, transforming it to o = (o1, . . . , on, 1). The excitation netj of the j-th hidden unit is given by

netj = Σ_{i=1}^{n+1} w(1)ij oi.

The activation function is a sigmoid and the output o(1)j of this unit is thus …

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996
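A minimal MATLAB sketch of the forward pass described above, using the extended (bias) weight matrices; the layer sizes and random weights are placeholders, and the sigmoid is the activation named in the excerpt:

n = 4; k = 3; m = 2;                       % input sites, hidden units, output units
W1 = randn(n+1, k);                        % (n+1) x k; last row holds the hidden-layer biases
W2 = randn(k+1, m);                        % (k+1) x m; last row holds the output-layer biases
sigmoid = @(s) 1./(1 + exp(-s));
o  = rand(1, n);                           % an input vector o = (o1, ..., on)
o1 = sigmoid([o 1] * W1);                  % hidden outputs; input extended with a 1
o2 = sigmoid([o1 1] * W2)                  % network outputs; hidden output extended with a 1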

Page 21:

Recurrent ANN

Page 22:

Recurrent ANN

• Dealing with Time Series
  • Meteorological forecast
  • Energy consumption
  • Order request forecast
  • Traffic forecast
  • Financial market forecast

Page 23:

Nonlinear Autoregressive Exogenous model (NARX)

• Exogenous input
  • Temperature
  • Hour of day
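A hedged sketch of setting up a NARX model in MATLAB, assuming the Neural Network / Deep Learning Toolbox is available; the delays, hidden-layer size and the simplenarx_dataset example series are illustrative choices, not from the slides:

[X, T] = simplenarx_dataset;               % toolbox example series standing in for real data
net = narxnet(1:2, 1:2, 10);               % input delays, feedback delays, 10 hidden units
[Xs, Xi, Ai, Ts] = preparets(net, X, {}, T);
net = train(net, Xs, Ts, Xi, Ai);          % open-loop (series-parallel) training
netc = closeloop(net);                     % closed-loop form for multi-step-ahead prediction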

Page 24:

Self Organizing Maps

Nature-inspired
Autonomous units organizing to adapt to an input space
Organization maintaining topology

Page 25:

Kohonen’s model

Multi-dimensional lattices of computing units
Each unit has an associated weight w, also called a prototype vector
w has the dimension of the input space
Each unit has lateral connections to several neighbors

Page 26:

Kohonen’s model

We have a training set D of vectors sampled from the input space
The network learns to adapt to the input space by updating the weights of its computing units
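As an aside, MATLAB ships a ready-made self-organizing map in the Neural Network / Deep Learning Toolbox; a hedged sketch of its use (the grid size and the random training data are my own choices):

X = rand(2, 500);                 % 500 random 2-D points as the training set D
net = selforgmap([8 8]);          % an 8-by-8 lattice of computing units
net = train(net, X);              % adapt the prototype vectors to the input space
plotsompos(net, X)                % plot prototypes and lattice topology over the data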

Page 27:

Learning algorithm

Consider an n-dimensional input space
A one-dimensional SOM is a chain of computing units
When an input x is received, each unit i computes the Euclidean distance between x and its weight wi
The unit k with the smallest value (highest excitation) is selected (fires)

Page 28:

Learning algorithm

• The neighbors of k are also updated

• We define a neighborhood function ϕ(i,k)

•  i.e. ϕ(i,k) = 1 if d(i,k) < r, otherwise ϕ(i,k) = 0

… connections to several neighbors. Examples of this kind of lateral coupling are the inhibitory connections used by von der Malsburg in his self-organizing models [284]. Connections of this type are also present, for example, in the human retina, as we discussed in Chap. 3.

15.2 Kohonen’s model

In this section we deal with some examples of the ordered structures known as Kohonen networks. The grid of computing elements allows us to identify the immediate neighbors of a unit. This is very important, since during learning the weights of computing units and their neighbors are updated. The objective of such a learning approach is that neighboring units learn to react to closely related signals.

15.2.1 Learning algorithm

Consider the problem of charting an n-dimensional space using a one-dimensional chain of Kohonen units. The units are all arranged in sequence and are numbered from 1 to m (Figure 15.4). Each unit receives the n-dimensional input x and computes the corresponding excitation. The n-dimensional weight vectors w1, w2, . . . , wm are used for the computation. The objective of the charting process is that each unit learns to specialize on different regions of input space. When an input from such a region is fed into the network, the corresponding unit should compute the maximum excitation. Kohonen’s learning algorithm is used to guarantee that this effect is achieved.

Fig. 15.4. A one-dimensional lattice of computing units (units 1, 2, . . . , m with weight vectors w1, . . . , wm; the neighborhood of unit 2 with radius 1 is highlighted)

A Kohonen unit computes the Euclidian distance between an input x and its weight vector w. This new definition of excitation is more appropriate for certain applications and also easier to visualize. In the Kohonen one-dimensional network, the neighborhood of radius 1 of a unit at the k-th position consists of the units at the positions k − 1 and k + 1. Units at both …

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

Page 29:

Learning algorithm

Init: a learning constant η and a neighborhood function ϕ are selected; the m weight vectors are initialized randomly.
Step 1: Select an input vector ξ using the desired probability distribution over the input space.
Step 2: The unit k with the maximum excitation is selected (that is, for which the distance between wi and ξ is minimal, i = 1,...,m).
Step 3: The weight vectors are updated using the neighborhood function and the update rule.
Step 4: Stop if the maximum number of iterations has been reached; otherwise modify η and ϕ as scheduled and continue with step 1.

… ends of the chain have asymmetrical neighborhoods. The neighborhood of radius r of unit k consists of all units located up to r positions from k to the left or to the right of the chain.

Kohonen learning uses a neighborhood function φ, whose value φ(i, k) represents the strength of the coupling between unit i and unit k during the training process. A simple choice is defining φ(i, k) = 1 for all units i in a neighborhood of radius r of unit k and φ(i, k) = 0 for all other units. We will later discuss the problems that can arise when some kinds of neighborhood functions are chosen. The learning algorithm for Kohonen networks is the following:

Algorithm 15.2.1 Kohonen learning

start: The n-dimensional weight vectors w1, w2, . . . , wm of the m computing units are selected at random. An initial radius r, a learning constant η, and a neighborhood function φ are selected.

step 1: Select an input vector ξ using the desired probability distribution over the input space.

step 2: The unit k with the maximum excitation is selected (that is, for which the distance between wi and ξ is minimal, i = 1, . . . , m).

step 3: The weight vectors are updated using the neighborhood function and the update rule

wi ← wi + ηφ(i, k)(ξ − wi), for i = 1, . . . , m.

step 4: Stop if the maximum number of iterations has been reached; otherwise modify η and φ as scheduled and continue with step 1.

The modifications of the weight vectors (step 3) attract them in the direction of the input ξ. By repeating this simple process several times, we expect to arrive at a uniform distribution of weight vectors in input space (if the inputs have also been uniformly selected). The radius of the neighborhood is reduced according to a previous plan, which we call a schedule. The effect is that each time a unit is updated, neighboring units are also updated. If the weight vector of a unit is attracted to a region in input space, the neighbors are also attracted, although to a lesser degree. During the learning process both the size of the neighborhood and the value of φ fall gradually, so that the influence of each unit upon its neighbors is reduced. The learning constant controls the magnitude of the weight updates and is also reduced gradually. The net effect of the selected schedule is to produce larger corrections at the beginning of training than at the end.

Figure 15.5 shows the results of an experiment with a one-dimensional Kohonen network. Each unit is represented by a dot. The input domain is a triangle. At the end of the learning process the weight vectors reach a distribution which transforms each unit into a “representative” of a small region of input space. The unit in the lower corner, for example, is the one …

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996
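A from-scratch MATLAB sketch of Algorithm 15.2.1 for a one-dimensional chain charting a 2-D input space; the uniform input distribution, chain length and decay schedules are illustrative assumptions (element-wise expansion needs MATLAB R2016b or later):

m = 20; n = 2;                              % number of units, input dimension
W = rand(m, n);                             % start: random weight vectors w1..wm
eta = 0.5; r = 3;                           % initial learning constant and radius
for t = 1:5000
    xi = rand(1, n);                        % step 1: sample an input xi (uniform square)
    [~, k] = min(sum((W - xi).^2, 2));      % step 2: winner = unit with minimal distance
    phi = double(abs((1:m)' - k) <= r);     % neighborhood function phi(i,k)
    W = W + eta * phi .* (xi - W);          % step 3: w_i <- w_i + eta*phi(i,k)*(xi - w_i)
    eta = eta * 0.999;                      % step 4: reduce the learning constant ...
    r = max(0, r - (mod(t,1000) == 0));     % ... and shrink the radius on a simple schedule
end
plot(W(:,1), W(:,2), '-o')                  % the chain should have unfolded over the square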

Page 30:

Learning algorithm

Each step attracts the weight of the excited unit toward the input.
Repeating this process, we expect to arrive at a uniform distribution of weight vectors in input space (if the inputs have also been uniformly selected).

Page 31:

Effect on neighbors

The radius of the neighborhood is reduced according to a schedule.
Each time a unit is updated, neighboring units are also updated.
If the weight vector of a unit is attracted to a region in input space, the neighbors are also attracted, but to a lesser degree.
During the learning process both the size of the neighborhood and the value of φ fall gradually, so that the influence of each unit upon its neighbors is reduced.

Page 32:

Schedule and learning constant

The learning constant controls the magnitude of the weight updates and is reduced gradually.
The net effect of the selected schedule is to produce larger corrections at the beginning of training than at the end.

Page 33:

Linear SOM example

Fig. 15.5. Map of a triangular region

… which responds with the largest excitation for vectors in the shaded region. If we adopt a “winner-takes-all” activation strategy, then it will be the only one to fire.

The same experiment can be repeated for differently shaped domains. The chain of Kohonen units will adopt the form of a so-called Peano curve. Figure 15.6 is a series of snapshots of the learning process from 0 to 25000 iterations [255]. At the beginning, before training starts, the chain is distributed randomly in the domain of definition. The chain unwraps little by little and the units distribute gradually in input space. Finally, when learning approaches its end, only small corrections affect the unit’s weights. At that point, the neighborhood radius has been reduced to zero and the learning constant has reached a small value.

Fig. 15.6. Mapping a chain to a triangle

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

•  The weight vectors reach a distribution which transforms each unit into a “representative” of a small region of input space.

•  The unit in the lower corner responds with the largest excitation to vectors in the shaded region.

Page 34:

Bi-dimensional networks

Kohonen networks can be arranged in multidimensional grids. An interesting choice is a planar network, as shown in Figure 15.7. The neighborhood of radius r of unit k consists, in this case, of all other units located at most r places to the left or right, up or down in the grid. With this convention, the neighborhood of a unit is a quadratic portion of the network. Of course we can define more sophisticated neighborhoods, but this simple approach is all that is needed in most applications.

Figure 15.7 shows the flattening of a two-dimensional Kohonen network in a quadratic input space. The four diagrams display the state of the network after 100, 1000, 5000, and 10000 iterations. In the second diagram several iterations have been overlapped to give a feeling of the iteration process. Since in this experiment the dimension of the input domain and of the network are the same, the learning process reaches a very satisfactory result.

Fig. 15.7. Mapping a square with a two-dimensional lattice. The diagram on the upper right shows some overlapped iterations of the learning process. The diagram below it is the final state after 10000 iterations.

Settling on a stable state is not so easy in the case of multidimensional networks. There are many factors which play a role in the convergence process, such as the size of the selected neighborhood, the shape of the neighborhood function and the scheduling selected to modify both. Figure 15.8 shows an example of a network which has reached a state very difficult to correct. A knot has appeared during the training process and, if the plasticity of the network has reached a low level, the knot will not be undone by further training, as the overlapped iterations in the diagram on the right in Figure 15.8 show.

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

Fig. 15.8. Planar network with a knot

Several proofs of convergence have been given for one-dimensional Kohonen networks in one-dimensional domains. There is no general proof of convergence for multidimensional networks.

15.2.2 Mapping high-dimensional spaces

Usually, when an empirical data set is selected, we do not know its real dimension. Even if the input vectors are of dimension n, it could be that the data concentrates on a manifold of lower dimension. In general it is not obvious which network dimension should be used for a given data set. This general problem led Kohonen to consider what happens when a low-dimensional network is used to map a higher-dimensional space. In this case the network must fold in order to fill the available space. Figure 15.9 shows, in the middle, the result of an experiment in which a two-dimensional network was used to chart a three-dimensional box. As can be seen, the network extends in the x and y dimensions and folds in the z direction. The units in the network try as hard as possible to fill the available space, but their quadratic neighborhood poses some limits to this process. Figure 15.9 shows, on the left and on the right, which portions of the network approach the upper or the lower side of the box. The black and white stripes resemble a zebra pattern.

Remember that earlier we discussed the problem of adapting the planar brain cortex to a multidimensional sensory world. There are some indications that a self-organizing process of the kind shown in Figure 15.9 could also be taking place in the human brain, although, of course, much of the brain structure emerges pre-wired at birth. The experiments show that the foldings of the planar network lead to stripes of alternating colors, that is, stripes which …

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

Animation: https://www.youtube.com/watch?v=QvI6L-KqsT4

Page 35:

Mapping high-dimensional spaces

• How a network of dimension n adapts to an input space of higher dimension

•  It must fold to fill the space

Fig. 15.9. Two-dimensional map of a three-dimensional region

… map alternately to one side or the other of input space (for the z dimension). A commonly cited example for this kind of structure in the human brain is the visual cortex. The brain actually processes not one but two visual images, one displaced with respect to the other. In this case the input domain consists of two planar regions (the two sides of the box of Figure 15.9). The planar cortex must fold in the same way in order to respond optimally to input from one or other side of the input domain. The result is the appearance of the stripes of ocular dominance studied by neurobiologists in recent years. Figure 15.10 shows a representation of the ocular dominance columns in LeVays’ reconstruction [205]. It is interesting to compare these stripes with the ones found in our simple experiment with the Kohonen network.

Fig. 15.10. Diagram of eye dominance in the visual cortex. Black stripes represent one eye, white stripes the other.

These modest examples show the kinds of interesting consequence that can be derived from Kohonen’s model. In principle Kohonen networks resemble the unsupervised networks we discussed in Chap. 5. Here and there we try to chart an input space distributing computing units to define a vector quantization. The main difference is that Kohonen networks have a predefined topology. They …

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

Page 36:

What dimension for the network?

In many cases we have experimental data which is coded using n real values, but whose effective dimension is much lower. Consider points on the surface of a sphere in three-dimensional space: the input vectors have three components, but a two-dimensional Kohonen network will do a better job of charting this input space.

Page 37:

Application: function approximation

• Adapt a planar grid to the surface P = {(x, y, f(x,y)) | x, y ∈ [0,1]}

• After the learning algorithm is started, the planar network moves in the direction of P and distributes itself to cover the domain.

Page 38:

Application: function approximation

Fig. 15.17. Control surface of the balancing pole [Ritter et al. 1990]

Fig. 15.18. A balancing pole

… function f. The network is therefore a kind of look-up table of the values of f. The table can be made as sparse or as dense as needed for the application at hand. Using a table is in general more efficient than computing the function each time from scratch. If the function is not analytically given, but has been learned using some input-output examples, Kohonen networks resemble backpropagation networks. The Kohonen network can continue adapting and, in this way, if some parameters of the system change (because some parts begin to wear), such a modification is automatically taken into account. The Kohonen network behaves like an adaptive table, which can be built using a minimal amount of hardware. This has made Kohonen networks an interesting choice in robotic applications.

15.4.2 Inverse kinematics

A second application of Kohonen networks is mapping the configuration space of a mechanical arm using a two-dimensional grid. Assume that a robot arm …

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996

•  The network is a kind of look-up table of the values of f. The table can be made as sparse or as dense as needed

Finding the appropriate power law means that in some cases a function has to be fitted to the measurements. Since the dimension of the data can sometimes best be approximated by a fraction, we speak of the fractal dimension of the data. It has been shown experimentally that if the dimension of the Kohonen network approaches the fractal dimension of the data, the interpolation error is smaller than for other network sizes [410]. This can be understood as meaning that the units in the network are used optimally to approximate input space. Other methods of measuring the fractal dimension of data are discussed by Barnsley [42].

15.4 Applications

Kohonen networks can adapt to domains with the most exotic structures and have been applied in many different fields, as will be shown in the following sections.

15.4.1 Approximation of functions

A Kohonen network can be used to approximate the continuous real function f in the domain of definition [0, 1] × [0, 1]. The set P = {(x, y, f(x, y)) | x, y ∈ [0, 1]} is a surface in three-dimensional space. We would like to adapt a planar grid to this surface. In this case the set P is the domain which we try to map with the Kohonen network.

After the learning algorithm is started, the planar network moves in the direction of P and distributes itself to cover the domain. Figure 15.17 shows the result of an experiment in which the function z = 5 sin x + y had to be learned. The combinations of x and y were generated in a small domain. At the end of the training process the network has “learned” the function f in the region of interest.

The function used in the example above is the one which guarantees optimal control of a pole balancing system. Such a system consists of a pole (of mass 1) attached to a moving car. The pole can rotate at the point of attachment and the car can only move to the left or to the right. The pole should be kept in equilibrium by moving the car in one direction or the other. The necessary force to keep the pole in equilibrium is given by f(θ) = α sin θ + β dθ/dt [368], where θ represents the angle between the pole and the vertical, and α and β are constants (Figure 15.18). For small values of θ a linear approximation can be used. Since a Kohonen network can learn the function f, it can also be used to provide the automatic control for the pole balancing system.

When a combination of x and y is given (in this case θ and dθ/dt), the unit (i, j) is found for which the Euclidian distance between its associated weights w(i,j)1 and w(i,j)2 and (θ, dθ/dt) is minimal. The value of the function at this point is the learned w(i,j)3, that is, an approximation to the value of the …

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996
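A small MATLAB sketch of using a trained grid as such a look-up table; the grid contents below are synthetic placeholders standing in for learned weights w1, w2, w3, not the result of actual Kohonen training:

[G1, G2] = meshgrid(linspace(-1,1,20), linspace(-1,1,20));   % placeholder grids of learned w1, w2
alpha = 2; beta = 0.5;
G3 = alpha*sin(G1) + beta*G2;                                % placeholder learned w3 ~ f(theta, dtheta/dt)
theta = 0.3; dtheta = -0.1;                                  % query point (theta, dtheta/dt)
d2 = (G1 - theta).^2 + (G2 - dtheta).^2;                     % squared distance to every unit's weights
[~, idx] = min(d2(:));                                       % winning unit (i, j)
f_approx = G3(idx)                                           % table entry approximating f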

Page 39:

Nearest Neighbour Classification

Figure 14.1: In nearest neighbour classification a new vector is assigned the label of the nearest vector in the training set. Here there are three classes, with training points given by the circles, along with their class. The dots indicate the class of the nearest training vector. The decision boundary is piecewise linear with each segment corresponding to the perpendicular bisector between two datapoints belonging to different classes, giving rise to a Voronoi tessellation of the input space.

Algorithm 14.1 Nearest neighbour algorithm to classify a vector x, given train data D = {(xn, cn), n = 1, . . . , N}:

1: Calculate the dissimilarity of the test point x to each of the train points, dn = d(x, xn), n = 1, . . . , N.
2: Find the train point xn* which is nearest to x: n* = argmin_n d(x, xn).
3: Assign the class label c(x) = cn*.
4: In the case that there are two or more nearest neighbours with different class labels, the most numerous class is chosen. If there is no one single most numerous class, we use the K-nearest-neighbours.
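A minimal MATLAB sketch of Algorithm 14.1 with Euclidean dissimilarity; the tiny training set is made up for illustration (element-wise expansion needs R2016b or later):

Xtrain = [0 0; 0 1; 5 5; 6 5];        % training inputs x_n (one per row)
ctrain = [1; 1; 2; 2];                % training class labels c_n
x = [4.5 4.0];                        % novel input to classify
d = sqrt(sum((Xtrain - x).^2, 2));    % step 1: d_n = d(x, x_n) for every training point
[~, nstar] = min(d);                  % step 2: index n* of the nearest training point
c = ctrain(nstar)                     % step 3: assigned class label c(x), here 2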

• The whole dataset needs to be stored to make a classification since the novel point must be compared to all of the train points. This can be partially addressed by a method called data editing in which datapoints which have little or no effect on the decision boundary are removed from the training dataset. Depending on the geometry of the training points, finding the nearest neighbour can also be accelerated by examining the values of each of the components xi of x in turn. Such an axis-aligned space-split is called a KD-tree [218] and can reduce the possible set of candidate nearest neighbours in the training set to the novel x*, particularly in low dimensions.

• Each distance calculation can be expensive if the datapoints are high dimensional. Principal Components Analysis, see chapter(15), is one way to address this and replaces x with a low dimensional projection p. The Euclidean distance of two datapoints ‖xa − xb‖² is then approximately given by ‖pa − pb‖², see section(15.2.4). This is both faster to compute and can also improve classification accuracy since only the large scale characteristics of the data are retained in the PCA projections.

• It is not clear how to deal with missing data or incorporate prior beliefs and domain knowledge.

14.2 K-Nearest Neighbours

If your neighbour is simply mistaken (has an incorrect training class label), or is not a particularly representative example of his class, then these situations will typically result in an incorrect classification. By including more than the single nearest neighbour, we hope to make a more robust classifier with a smoother decision boundary (less swayed by single neighbour opinions). If we assume the Euclidean distance as the dissimilarity measure, the K-Nearest Neighbour algorithm considers a hypersphere centred on the test point x. The radius of the hypersphere is increased until it contains exactly K train inputs. The class label c(x) is then given by the most numerous class within the hypersphere, see fig(14.2).
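A sketch of K-nearest-neighbour classification in plain MATLAB; the toy data are made up, and mode() breaks ties toward the smaller label, a simplification of the tie-handling described above:

Xtrain = [0 0; 0 1; 1 0; 5 5; 6 5; 5 6];   % training inputs (one per row)
ctrain = [1; 1; 1; 2; 2; 2];               % training class labels
x = [4 4]; K = 3;                          % test point and number of neighbours
d = sqrt(sum((Xtrain - x).^2, 2));         % Euclidean distances to every training point
[~, order] = sort(d);                      % training points sorted by distance
c = mode(ctrain(order(1:K)))               % most numerous class among the K nearest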


• Supervised method
• Assign to a new input the class of the nearest input in the training set
• Distances:
  • Euclidean
  • Mahalanobis

Page 40: Scuola di Calcolo Scientifico con MATLAB (SCSM) 2017 · Given a set of data D = {(xn,yn),n=1,...,N} the task is to learn the relationship between the input x and output y such that,

Machine Learning con Matlab

Nearest Neighbor Classification

•  Entire dataset must be stored
•  Distance calculation may be expensive
•  How to deal with missing data?
•  How to incorporate prior knowledge?

Page 41: Scuola di Calcolo Scientifico con MATLAB (SCSM) 2017 · Given a set of data D = {(xn,yn),n=1,...,N} the task is to learn the relationship between the input x and output y such that,

Machine Learning con Matlab

K Nearest Neighbors

K-Nearest Neighbours

Figure 14.2: In K-nearest neighbours, we centre a hypersphere around the point we wish to classify (here the central dot). The inner circle corresponds to the nearest neighbour, a square. However, using the 3 nearest neighbours, we find that there are two round-class neighbours and one square-class neighbour, and we would therefore classify the central point as round-class. In the case of a tie, one may increase K until the tie is broken.

Choosing K

Whilst there is some sense in making K > 1, there is certainly little sense in making K = N (N being the number of training points). For K very large, all classifications will become the same: simply assign each novel x to the most numerous class in the train data. This suggests that there is an optimal intermediate setting of K which gives the best generalisation performance. This can be determined using cross-validation, as described in section(13.2.2).
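
One way to carry out this cross-validation in MATLAB is sketched below, assuming the Statistics and Machine Learning Toolbox is available; the candidate grid of K values, the 10 folds and the variable names are illustrative choices, not prescribed by the text.

% Choose K by 10-fold cross-validation on the training set.
Ks = 1:2:15;                                   % candidate values of K (hypothetical grid)
cvErr = zeros(size(Ks));
for i = 1:numel(Ks)
    mdl = fitcknn(Xtrain, ctrain, 'NumNeighbors', Ks(i), 'Distance', 'euclidean');
    cvmdl = crossval(mdl, 'KFold', 10);        % 10-fold cross-validated model
    cvErr(i) = kfoldLoss(cvmdl);               % estimated generalisation error for this K
end
[~, best] = min(cvErr);
Kbest = Ks(best);                              % K with the lowest cross-validated error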

Figure 14.3: Some of the train examples of the digit zero (a), one (b) and seven (c). There are 300 train examples of each of these three digit classes.

Example 14.1 (Handwritten Digit Example). Consider two classes of handwritten digits, zeros and ones. Each digit contains 28 × 28 = 784 pixels. The train data consists of 300 zeros and 300 ones, a subset of which are plotted in fig(14.3a,b). To test the performance of the nearest neighbour method (based on Euclidean distance) we use an independent test set containing a further 600 digits. The nearest neighbour method, applied to this data, correctly predicts the class label of all 600 test points. The reason for the high success rate is that examples of zeros and ones are sufficiently different that they can be easily distinguished.

A more difficult task is to distinguish between ones and sevens. We repeat the above experiment, now using 300 training examples of ones and 300 training examples of sevens, fig(14.3b,c). Again, 600 new test examples (containing 300 ones and 300 sevens) were used to assess the performance. This time, 18 errors are found using nearest neighbour classification: a 3% error rate for this two class problem. The 18 test points on which the nearest neighbour method makes errors are plotted in fig(14.4). If we use K = 3 nearest neighbours, the classification error reduces to 14, a slight improvement. As an aside, the best machine learning methods classify real world digits (over all 10 classes) to an error of less than 1 per cent, better than the performance of an 'average' human.


• More robust classifier
• Consider a hypersphere centered on the test point that contains k train inputs
• How to choose k?
  • Cross-validation

Page 42: Scuola di Calcolo Scientifico con MATLAB (SCSM) 2017 · Given a set of data D = {(xn,yn),n=1,...,N} the task is to learn the relationship between the input x and output y such that,

Machine Learning con Matlab

Mixture models

• A mixture model is one in which a set of component models is combined to produce a richer model:

  p(v) = Σ_{h=1}^H p(v|h) p(h)

The Gaussian Mixture Model


Figure 20.8: (a): A Gaussian mixture model with H = 4 components. There is a component (purple) with large variance and small weight that has little effect on the distribution close to where the other three components have appreciable mass. As we move further away this additional component gains in influence. (b): The GMM probability density function from (a). (c): Plotted on a log scale, the influence of each Gaussian far from the origin becomes clearer.

component breaks away and takes responsibility for explaining the data in its vicinity, see fig(20.7). The origin of this initial jostling is an inherent symmetry in the solution: it makes no difference to the likelihood if we relabel what the components are called. The symmetries can severely handicap EM in fitting a large number of component models in the mixture since the number of permutations increases dramatically with the number of components. A heuristic is to begin with a small number of components, say two, for which symmetry breaking is less problematic. Once a local broken solution has been found, more models are included into the mixture, initialised close to the currently found solutions. In this way, a hierarchical scheme is envisaged. Another popular method for initialisation is to center the means to those found by the K-means algorithm, see section(20.3.5); however, this itself requires a heuristic initialisation.

20.3.3 Classification using Gaussian mixture models

We can use GMMs as part of a class conditional generative model, in order to make a powerful classifier. Consider data drawn from two classes, c ∈ {1, 2}. We can fit a GMM p(x|c = 1, X1) to the data X1 from class 1, and another GMM p(x|c = 2, X2) to the data X2 from class 2. This gives rise to two class-conditional GMMs,

p(x|c, Xc) = Σ_{i=1}^H p(i|c) N(x | m_i^c, S_i^c)    (20.3.21)

For a novel point x*, the posterior class probability is

p(c|x*, X) ∝ p(x*|c, Xc) p(c)    (20.3.22)

where p(c) is the prior class probability. The maximum likelihood setting is that p(c) is proportional to the number of training points in class c.
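
A sketch of this class-conditional construction in MATLAB, assuming the Statistics and Machine Learning Toolbox's fitgmdist is used for the GMM fitting; X1, X2, the number of components H and the test point xstar are hypothetical names.

% Fit one GMM per class and classify a novel point via the posterior (20.3.22).
H = 3;                                        % number of mixture components per class (illustrative)
gm1 = fitgmdist(X1, H);                       % p(x | c = 1, X1)
gm2 = fitgmdist(X2, H);                       % p(x | c = 2, X2)

N1 = size(X1, 1);  N2 = size(X2, 1);
prior = [N1 N2] / (N1 + N2);                  % maximum likelihood class priors p(c)

lik  = [pdf(gm1, xstar), pdf(gm2, xstar)];    % class-conditional densities at the novel point
post = lik .* prior / sum(lik .* prior);      % posterior p(c | xstar, X)
[~, chat] = max(post);                        % predicted class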

Overconfident classification

Consider a test point x* a long way from the training data for both classes. For such a point, the probability that either of the two class models generated the data is very low. Nevertheless, one probability will be exponentially higher than the other (since the Gaussians drop exponentially quickly at different rates), meaning that the posterior probability will be confidently close to 1 for that class which has a component closest to x*. This is an unfortunate property since we would end up confidently predicting the class of novel data that is not similar to anything we've seen before. We would prefer the opposite effect: that for novel data far from the training data, the classification confidence drops and all classes become equally likely.

A remedy for this situation is to include an additional component in the Gaussian mixture for each class that is very broad. We first collect the input data from all classes into a dataset X, and let m be the mean of all this data and S the covariance. Then for the model of each class c data we include an additional …


CHAPTER 20

Mixture Models

Mixture models assume that the data is essentially clustered, with each component in the mixture representing a cluster. In this chapter we view mixture models from the viewpoint of learning with missing data and discuss some of the classical algorithms, such as EM training of Gaussian mixture models. We also discuss more powerful models which allow for the possibility that an object can be a member of more than one cluster. These models have applications in areas such as document modelling.

20.1 Density Estimation Using Mixtures

A mixture model is one in which a set of component models is combined to produce a richer model:

p(v) = Σ_{h=1}^H p(v|h) p(h)    (20.1.1)

The variable v is 'visible' or 'observable' and the discrete variable h with dom(h) = {1, . . . , H} indexes each component model p(v|h), along with its weight p(h). The variable v can be either discrete or continuous. Mixture models have natural application in clustering data, where h indexes the cluster. This interpretation can be gained from considering how to generate a sample datapoint v from the model equation (20.1.1). First we sample a cluster h from p(h), and then draw a visible state v from p(v|h).
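
This generative view translates directly into a short MATLAB sketch; the one-dimensional Gaussian components, their weights and the sample size below are illustrative assumptions, not taken from the text.

% Ancestral sampling from a mixture model: first h ~ p(h), then v ~ p(v|h).
ph    = [0.5 0.3 0.2];                       % mixture weights p(h)
mu    = [-2  0   3];                         % component means
sigma = [0.5 1   0.7];                       % component standard deviations

N = 1000;
v = zeros(N, 1);
for n = 1:N
    h    = find(rand <= cumsum(ph), 1);      % sample the cluster index h from p(h)
    v(n) = mu(h) + sigma(h)*randn;           % sample v from p(v|h) = N(mu_h, sigma_h^2)
end
histogram(v, 50)                             % empirical density of the mixture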

For a set of i.i.d. data v1, . . . , vN , a mixture model is of the form, fig(20.1),

p(v1, . . . , vN, h1, . . . , hN) = Π_{n=1}^N p(vn|hn) p(hn)    (20.1.2)

from which the observation likelihood is given by

p(v1, . . . , vN) = Π_{n=1}^N Σ_{hn} p(vn|hn) p(hn)    (20.1.3)

Finding the most likely assignment of datapoints to clusters is achieved by inference of

argmax_{h1,...,hN} p(h1, . . . , hN | v1, . . . , vN)    (20.1.4)

which, thanks to the factorised form of the distribution, is equivalent to computing argmax_{hn} p(hn|vn) for each datapoint.


Page 43: Scuola di Calcolo Scientifico con MATLAB (SCSM) 2017 · Given a set of data D = {(xn,yn),n=1,...,N} the task is to learn the relationship between the input x and output y such that,

Machine Learning con Matlab

K-means clustering

Partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
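
A brief k-means sketch in MATLAB, assuming the Statistics and Machine Learning Toolbox's kmeans function; the synthetic two-cluster data and the value k = 2 are illustrative assumptions.

% Partition the rows of X into k clusters around their nearest means.
rng(1);                                     % fix the random seed for reproducibility (hypothetical)
X = [randn(100,2); randn(100,2) + 4];       % two synthetic, well separated clusters
k = 2;
[idx, C] = kmeans(X, k);                    % idx: cluster of each observation, C: cluster means
gscatter(X(:,1), X(:,2), idx)               % plot observations coloured by cluster
hold on
plot(C(:,1), C(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2)   % cluster means (prototypes)
hold off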

Page 44: Scuola di Calcolo Scientifico con MATLAB (SCSM) 2017 · Given a set of data D = {(xn,yn),n=1,...,N} the task is to learn the relationship between the input x and output y such that,

Machine Learning con Matlab

Graphical models