Chapter 7 Optimization Methods
Introduction
• Examples of optimization problems
  – IC design (placement, wiring)
  – Graph theoretic problems (partitioning, coloring, vertex covering)
  – Planning
  – Scheduling
  – Other combinatorial optimization problems (knapsack, TSP)
• Approaches
  – AI: state space search
  – NN
  – Genetic algorithms
  – Mathematical programming
Introduction
• NN models to cover
  – Continuous Hopfield model (combinatorial optimization)
  – Simulated annealing (escape from local minima)
  – Boltzmann machine (§6.4)
  – Evolutionary computing (genetic algorithms)
Introduction
• Formulating optimization problems in NN
  – System state: S(t) = (x1(t), …, xn(t)), where xi(t) is the current value of node i at time/step t
  – State space: the set of all possible states
  – The state changes as any node changes its value, based on inputs from other nodes, inter-node weights, and the node function
  – A state is feasible if it satisfies all constraints without further modification (e.g., a legal tour in TSP)
  – A solution state is a feasible state that optimizes the given objective function (e.g., a legal tour with minimum length)
  – Global optimum: the best in the entire state space
  – Local optimum: the best in a subspace of the state space (e.g., cannot be improved by changing the value of any SINGLE node)
Introduction
• Energy minimization
  – A popular basis for NN-based optimization methods
  – Sum up problem constraints, cost functions, and other considerations into an energy function E:
    • E is a function of the system state
    • Lower energy states correspond to better solutions
    • Penalty for constraint violation (a big energy increase)
  – Work out the node function and weights so that the energy can only be reduced when the system moves
  – The hard part is to ensure that
    • every solution state corresponds to a (local) minimum energy state
    • the optimal solution corresponds to a globally minimum energy state
Hopfield Model for Optimization
• Constraint-satisfaction combinatorial optimization.
• A solution must
  – satisfy a set of given constraints (strong), and
  – be optimal w.r.t. a cost or utility function (weak)
• Uses the node functions defined in the Hopfield model
• What we need:
  – An energy function derived from the cost function
    • Must be quadratic
  – A representation of the constraints
    • Relative importance between constraints
    • Penalty for constraint violation
  – Extract the weights
Hopfield Model for TSP
• Constraints:
  1. Each city can be visited no more than once
  2. Every city must be visited
  3. The traveling salesman can only visit one city at a time
  4. The tour should be the shortest
  – Constraints 1–3 are hard constraints (they must be satisfied for the result to qualify as a legal tour, i.e., a Hamiltonian circuit)
  – Constraint 4 is soft; it is the objective function for optimization, and suboptimal but good results may be acceptable
• Design the network structure. Different possible ways to represent TSP by NN:
  – node = city: hard to represent the order of cities in forming a circuit (the SOM solution)
  – node = edge: n out of n(n-1)/2 nodes must become activated, and they must form a circuit
• Hopfield's solution:
  – n-by-n network; each node is connected to every other node
  – Node outputs approach 0 or 1
  – Row: city; column: position in the tour
  – Example (n = 5), tour B-A-E-C-D-B:

        1 2 3 4 5
    A   0 1 0 0 0
    B   1 0 0 0 0
    C   0 0 0 1 0
    D   0 0 0 0 1
    E   0 0 1 0 0

  – The output (state) of each node is denoted v_xi, where x indexes the city and i indexes the tour position:
    • v_xi = 1: city x is visited as the i-th city in a tour
    • v_xi = 0: city x is not visited as the i-th city in a tour
  – u_xi: internal activation of node (x, i)
  – d_xy: cost/distance between cities x and y
Energy function

  E = (A/2) Σ_x Σ_i Σ_{j≠i} v_xi v_xj                          (penalty for the row constraint: no city shall be visited more than once)
    + (B/2) Σ_i Σ_x Σ_{y≠x} v_xi v_yi                          (penalty for the column constraint: only one city can be visited at a time)
    + (C/2) ( Σ_x Σ_i v_xi − n )^2                             (global penalty: the tour must have exactly n cities)
    + (D/2) Σ_x Σ_{y≠x} Σ_i d_xy v_xi (v_{y,i+1} + v_{y,i−1})  (penalty for the tour length; position indices are taken mod n)

– A, B, C, D are constants, to be determined by trial and error.
– If each v_xi approaches either 0 or 1, and if those with v_xi = 1 represent a legal tour, then the first three terms in E become zero, and the last term gives the tour length. This is because

  d_xy v_xi (v_{y,i+1} + v_{y,i−1}) = d_xy  if city y is either before or after city x in the circuit
                                    = 0     otherwise
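The energy function above can be computed directly. Below is a minimal sketch (the function name `tsp_energy` and the default constants, taken from Hopfield's experiment later in the notes, are illustrative choices, not part of the original formulation):

```python
import numpy as np

def tsp_energy(v, d, A=500, B=500, C=200, D=500):
    """Hopfield TSP energy for an n-by-n node-output matrix v.

    v[x][i] ~ 1 if city x occupies tour position i; d[x][y] is the
    distance between cities x and y. A, B, C, D are the penalty
    constants determined by trial and error in the text.
    """
    n = v.shape[0]
    # Row term: penalize a city appearing in more than one position.
    row = sum(v[x, i] * v[x, j]
              for x in range(n) for i in range(n) for j in range(n) if j != i)
    # Column term: penalize two cities sharing one position.
    col = sum(v[x, i] * v[y, i]
              for i in range(n) for x in range(n) for y in range(n) if y != x)
    # Global term: exactly n nodes should be on.
    glob = (v.sum() - n) ** 2
    # Tour-length term: d[x][y] counts when x and y sit in adjacent positions.
    tour = sum(d[x, y] * v[x, i] * (v[y, (i + 1) % n] + v[y, (i - 1) % n])
               for x in range(n) for y in range(n) if y != x for i in range(n))
    return (A / 2) * row + (B / 2) * col + (C / 2) * glob + (D / 2) * tour
```

For a legal tour the first three terms vanish and, since each leg is counted twice in the D-term, the energy equals D times the tour length.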
Obtaining the weight matrix
• Note
  – In the continuous HM, du_xi/dt = Σ_{y,j} w_{xi,yj} v_yj
  – We want u_xi to change in the way that always reduces E
  – Try to extract w_{xi,yj} and θ_xi from E
  – Determine du_xi/dt from E so that dE(t)/dt ≤ 0 (the gradient descent approach again)
(1) Let du_xi/dt = −∂E/∂v_xi. Then

  dE/dt = Σ_{x,i} (∂E/∂v_xi)(dv_xi/du_xi)(du_xi/dt) = −Σ_{x,i} (dv_xi/du_xi)(∂E/∂v_xi)^2 ≤ 0,

  since dv_xi/du_xi ≥ 0 for the sigmoid node function. From E,

  du_xi/dt = −∂E/∂v_xi = −A Σ_{j≠i} v_xj                       (row inhibition: x = y, i ≠ j)
             − B Σ_{y≠x} v_yi                                   (column inhibition: x ≠ y, i = j)
             − C ( Σ_y Σ_j v_yj − n )                           (global inhibition: xi ≠ yj)
             − D Σ_{y≠x} d_xy (v_{y,i+1} + v_{y,i−1})           (tour length)

(2) Since du_xi/dt = net_xi = Σ_{y,j} w_{xi,yj} v_yj + θ_xi, the weights thus should include the following:
  – −A: between nodes in the same row
  – −B: between nodes in the same column
  – −C: between any two nodes
  – −D·d_xy: between nodes in different rows but adjacent columns

  w_{xi,yj} = −A δ_xy (1 − δ_ij) − B δ_ij (1 − δ_xy) − C − D d_xy (δ_{j,i+1} + δ_{j,i−1}),

  where δ_xy = 1 if x = y and 0 otherwise
  – each node also has a positive bias θ_xi = Cn
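The weight formula can be spelled out as code. This is a sketch (the name `tsp_weights` is illustrative; self-connections are zeroed out, matching the assumption w_{xi,xi} = 0 in the notes):

```python
import numpy as np

def tsp_weights(d, A=500, B=500, C=200, D=500):
    """Weight tensor w[x,i,y,j] and bias for the Hopfield TSP network,
    following w_{xi,yj} = -A*dxy(1-dij) - B*dij(1-dxy) - C
                          - D*d_xy*(d_{j,i+1} + d_{j,i-1})."""
    n = d.shape[0]
    w = np.zeros((n, n, n, n))
    for x in range(n):
        for i in range(n):
            for y in range(n):
                for j in range(n):
                    if x == y and i == j:
                        continue  # no self-connection (w_{xi,xi} = 0)
                    val = -C      # global inhibition between any two nodes
                    if x == y and i != j:
                        val -= A  # row inhibition
                    if i == j and x != y:
                        val -= B  # column inhibition
                    if j == (i + 1) % n or j == (i - 1) % n:
                        val -= D * d[x, y]  # adjacent columns (d[x,x] = 0)
                    w[x, i, y, j] = val
    bias = C * n  # every node gets the same positive bias Cn
    return w, bias
```

As note 3 below observes, in practice these values can be computed on demand instead of stored.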
Notes
1. Since w_{xi,yj} = w_{yj,xi} (assuming w_{xi,xi} = 0), W can also be used for the discrete model.
2. Initialization: randomly assign v_xi(0) between 0 and 1 such that Σ_{x,i} v_xi(0) = n
3. No need to store an explicit weight matrix; weights can be computed when needed
4. Hopfield's own experiments:
   A = B = D = 500, C = 200, n = 15
   20 trials (with different distance matrices): all trials converged: 16 to legal tours, 8 to shortest tours, 2 to second-shortest tours.
5. Termination: when the output of every node is
   – either close to 0 and decreasing, or
   – close to 1 and increasing
6. Problems of the continuous HM for optimization
   – Only guarantees a local minimum state (E is always decreasing)
   – No general guiding principles for determining the parameters (e.g., A, B, C, D in TSP)
   – Energy functions are hard to come up with, and different functions may result in different solution qualities
• Another energy function for TSP:

  E = (A/2) Σ_x Σ_i Σ_{j≠i} v_xi v_xj
    + (B/2) Σ_i Σ_x Σ_{y≠x} v_xi v_yi
    + (C/2) [ Σ_x (1 − Σ_i v_xi)^2 + Σ_i (1 − Σ_x v_xi)^2 ]
    + (D/2) Σ_x Σ_{y≠x} Σ_i d_xy v_xi (v_{y,i+1} + v_{y,i−1})
Simulated Annealing
• A general purpose global optimization technique
• Motivation
  – BP/HM: gradient descent to a minimum of the error/energy function E.
  – Iterative improvement: each step improves the solution.
  – As optimization: stops when no improvement is possible without making things worse first.
  – Problem: trapped at a local minimum.
    Key: ΔE = E(t+1) − E(t) ≤ 0 for all t
  – Possible solution for escaping from local minima: allow E to increase occasionally (by adding random noise).
Annealing Process in Metallurgy
• Used to improve the quality of metal works.
• Energy of a state (a configuration of atoms in a metal piece)
  – depends on the relative locations of the atoms.
  – minimum energy state: crystal lattice; durable, less fragile/crisp
  – many atoms dislocated from the crystal lattice cause higher (internal) energy.
• Each atom is able to move randomly
  – How easily and how far an atom moves depends on the temperature (T), and on whether the atom is in the right place
  – Dislocations and other disruptions from the minimum energy state can be eliminated by the atoms' random moves: thermal agitation.
  – This takes too long if done at room temperature
• Annealing (to shorten the agitation time):
  – start at a very high T, then gradually reduce T
• SA: apply the idea of annealing to NN optimization
Statistical Mechanics
• System of many particles
  – Each particle can change its state
  – It is hard to know the system's exact state/configuration and its energy.
  – Statistical approach: the probability that the system is at a given state.
    Assume all possible states obey the Boltzmann-Gibbs distribution:

    P_α = (1/z) e^{−E_α / (K_B T)}

    where E_α is the energy when the system is at state α, P_α is the probability that the system is at state α, z = Σ_α e^{−E_α/(K_B T)} is the normalization factor (so Σ_α P_α = 1), K_B is the Boltzmann constant, and T is the absolute temperature.
  – Ignoring K_B and using T for an artificial temperature:

    P_α = (1/z) e^{−E_α/T},  and  P_β / P_α = e^{(E_α − E_β)/T} = e^{−ΔE/T}

  Let ΔE = E_β − E_α ≥ 0. Then:
  (1) P_β / P_α = e^{−ΔE/T} ≤ 1: lower-energy states are always the more probable ones.
  (2) P_α and P_β differ little at high T: more opportunity to change state at the beginning of annealing.
      P_α and P_β differ a lot at low T: this helps keep the system at a low-E state at the end of annealing.
      When T → 0, P_β/P_α → 0: the system is infinitely more likely to be in the global minimum energy state than in any other state.
  (3) Based on the B-G distribution, at equilibrium the transition probabilities satisfy

      P(S) P(S → S′) = P(S′) P(S′ → S),  so  P(S → S′) / P(S′ → S) = P(S′)/P(S) = e^{−(E(S′) − E(S))/T}

      It is easier to move from a higher energy state to a lower energy state.
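The effect of T on the B-G distribution can be checked numerically. A minimal sketch (the helper name `boltzmann_probs` is an illustrative choice):

```python
import math

def boltzmann_probs(energies, T):
    """Boltzmann-Gibbs probabilities P_a = exp(-E_a/T) / z,
    with the Boltzmann constant folded into the artificial temperature T."""
    weights = [math.exp(-E / T) for E in energies]
    z = sum(weights)                      # normalization factor
    return [w / z for w in weights]
```

At high T the states are nearly equiprobable (easy to move around); at low T almost all probability mass sits on the global minimum.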
• Metropolis algorithm for optimization (1953):
  – S: the current state
  – S′: a new state that differs from S by a small random displacement; S′ will be accepted with the probability

    P(S → S′) = 1             if ΔE = E(S′) − E(S) ≤ 0
              = e^{−ΔE/T}     otherwise

    (the random noise is introduced here)

  1. The system moves to state S′ if the move reduces E.
  2. The system is allowed to occasionally move to S′ (with probability e^{−ΔE/T}) even if the move increases E. This probability approaches 0 when T approaches 0.
  3. The system still obeys the B-G distribution.
Simulated Annealing in NN
• Algorithm (very close to the Metropolis algorithm)
  1. Set the network to an initial state S.
     Set the initial temperature T >> 1.
  2. Do the following steps many times until quasi-thermal equilibrium is reached at the current T:
     2.1. randomly select a state displacement δS
     2.2. compute ΔE = E(S + δS) − E(S)
     2.3. if ΔE ≤ 0, then S := S + δS
          else generate a random number p between 0 and 1;
               if p ≤ e^{−ΔE/T}, then S := S + δS
  3. Reduce T according to the cooling schedule.
  4. If T > T-lower-bound, then go to 2; else stop.
• Comments
  – Thermal equilibrium (step 2) is hard to test; usually a pre-set iteration number/time is used.
  – The displacement δS may be randomly generated:
    • choose one component of S to change, or
    • change all components of the entire state vector
    • δS should be small
  – Cooling schedule
    • Initial T: 1/T ≈ 0 (so any state change can be accepted)
    • Simple example: T := αT, where 0 < α < 1
    • Another example: T(k) = T(0) / log(1 + k); non-linear, slower at the end
  – You may store the state with the lowest energy among all states generated so far.
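The steps above can be sketched as a generic SA loop. This is a hedged illustration, not the notes' own code: the function names, the geometric cooling constants, and the fixed sweep count standing in for "quasi-thermal equilibrium" are all assumptions:

```python
import math
import random

def simulated_anneal(E, neighbor, s0, T0=10.0, alpha=0.95, T_min=1e-3,
                     sweeps=100, seed=0):
    """Generic SA: random displacements, Metropolis acceptance, geometric
    cooling T <- alpha*T, keeping the best state seen so far (per the
    'store the lowest-energy state' comment above)."""
    rng = random.Random(seed)
    s, T = s0, T0
    best, best_E = s0, E(s0)
    while T > T_min:
        for _ in range(sweeps):           # stand-in for equilibrium at this T
            s_new = neighbor(s, rng)      # small random displacement
            dE = E(s_new) - E(s)
            if dE <= 0 or rng.random() < math.exp(-dE / T):
                s = s_new
            if E(s) < best_E:
                best, best_E = s, E(s)
        T *= alpha                        # cooling schedule
    return best, best_E
```

Example use on a simple multimodal function: f(x) = (x^2 - 4)^2 + x has a local minimum near x = +2 and its global minimum near x = -2; SA started at 0 should escape the local basin.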
SA for the Discrete Hopfield Model
• In step 2, each time only one node, say node i, is selected for possible update; all other nodes are fixed.
  – E_a: energy with x_i = 1; E_b: energy with x_i = 0
  – P_i: the probability to set x_i = 1:

    P_i = e^{−E_a/T} / (e^{−E_a/T} + e^{−E_b/T}) = 1 / (1 + e^{−(E_b − E_a)/T}) = 1 / (1 + e^{−ΔE_i/T}),  where ΔE_i = E_b − E_a

  – This acceptance criterion is different from the Metropolis algorithm:
    • if ΔE_i > 0 (x_i = 1 is better), then P_i > 0.5
    • if ΔE_i < 0 (x_i = 0 is better), then P_i < 0.5
    • when T is small: P_i ≈ 1 if ΔE_i > 0, and P_i ≈ e^{ΔE_i/T} → 0 if ΔE_i < 0
• Localize the computation
  – With E = −(1/2) Σ_i Σ_j w_ij x_i x_j − Σ_k θ_k x_k,

    ΔE_k = E_b − E_a = Σ_j w_kj x_j + θ_k = net_k,  so  P_k = 1 / (1 + e^{−net_k/T})

    There is no need to compute E_a and E_b.
• It can be shown that both acceptance criteria guarantee the B-G distribution if a thermal equilibrium is reached.
• When applying SA to TSP, use the energy function designed for the continuous HM.
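The localized single-node update is tiny in code. A sketch (function name and injectable `rng` are illustrative; the bias term θ_k is omitted for brevity):

```python
import math
import random

def anneal_hopfield_node(x, w, k, T, rng=random.random):
    """One SA update for node k of a discrete Hopfield net, all other
    nodes fixed: set x_k = 1 with probability 1/(1 + exp(-net_k/T)),
    where net_k = sum_j w[k][j]*x[j] = E(x_k=0) - E(x_k=1)."""
    net_k = sum(w[k][j] * x[j] for j in range(len(x)) if j != k)
    p_on = 1.0 / (1.0 + math.exp(-net_k / T))
    x[k] = 1 if rng() < p_on else 0
    return x
```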
Variations of SA
• Gauss machine: a more general framework

  net_k = Σ_j w_kj x_j + ε,  where ε is random noise

  – if ε = 0: we have the HM
  – if ε obeys a Gaussian distribution with mean 0 and SD √(8/π)·T: very close to SA with acceptance criterion 1/(1 + e^{−ΔE/T})
  – if ε obeys a logistic distribution: exactly SA with acceptance criterion 1/(1 + e^{−ΔE/T})
• Cauchy machine
  – ε obeys a Cauchy distribution
  – acceptance criterion: P = 1/2 + (1/π) arctan(ΔE/T)
  – faster cooling: T(k) = T(0)/k
• Need a special random number generator for each particular distribution, e.g. the Gauss density function

  P(x) = (1/(σ√(2π))) e^{−x²/(2σ²)}
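The "special random number generator" point can be illustrated by inverse-CDF sampling, which is how a Cauchy variate is commonly generated from a uniform one (the helper name `cauchy_noise` is an illustrative assumption):

```python
import math
import random

def cauchy_noise(T, rng=random.random):
    """Inverse-CDF sampler for Cauchy noise at temperature T:
    if u ~ Uniform(0, 1), then T*tan(pi*(u - 0.5)) follows the Cauchy
    density T / (pi*(T^2 + x^2)) -- much heavier-tailed than a Gaussian,
    which is what lets the Cauchy machine take occasional long jumps."""
    u = rng()
    return T * math.tan(math.pi * (u - 0.5))
```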
Boltzmann Machine (BM) (§6.4)
• Hopfield model + hidden nodes + simulated annealing
• BM architecture
  – A set of visible nodes: nodes that can be accessed from outside
  – A set of hidden nodes:
    • adding hidden nodes increases the computing power
    • increases the capacity when used as associative memory (increases the distance between patterns)
  – Connections between nodes
    • fully connected between any two nodes (not layered)
    • symmetric connections: w_ij = w_ji, w_ii = 0
  – Nodes are the same as in the discrete HM: net_i = Σ_j w_ij x_j
  – Energy function: E = −(1/2) Σ_i Σ_j w_ij x_i x_j
• BM computing (SA), with a given set of weights
  1. Apply an input pattern to the visible nodes.
     – Some components may be missing or corrupted: pattern completion/correction;
     – Some components may be permanently clamped to the input values (as recall keys or problem input parameters).
  2. Randomly assign 0/1 to all unknown nodes (including all hidden nodes and the visible nodes with missing input values).
  3. Perform the SA process according to a given cooling schedule:
     – a randomly picked non-clamped node i is assigned the value 1 with probability 1/(1 + e^{−net_i/T}), and 0 with probability 1 − 1/(1 + e^{−net_i/T})
  – This yields a Boltzmann distribution over states.
• BM learning (obtaining weights from exemplars)
  – What is to be learned?
    • The probability distribution of visible vectors in the environment.
    • Exemplars: assumed to be randomly drawn from the entire population of possible visible vectors.
    • Construct a model of the environment that gives the visible nodes the same probability distribution as the one in the exemplar set.
  – There may be many models satisfying this condition, because the model involves hidden nodes.
    • Example: visible node 1, hidden nodes 2 and 3. Suppose P(x1 = 1) = 0.4 (from the exemplars). We need a model with P(1,0,0) + P(1,0,1) + P(1,1,0) + P(1,1,1) = 0.4, and there are infinitely many ways to assign probabilities to the individual states.
  – Possible choices:
    • let the model have equal probabilities for these states (maximum entropy);
    • let these states obey the B-G distribution (probability proportional to e^{−E}).
– BM learning rule:
  • V_a: the set of training exemplars (visible vectors)
  • H_b: the set of vectors appearing on the hidden nodes
  • Two phases:
    – clamping phase: each exemplar V_a is clamped to the visible nodes (associates a state H_b with V_a)
    – free-run phase: none of the visible nodes is clamped (makes each (H_b, V_a) pair a minimum-energy state)
  • P⁺(V_a): probability that exemplar V_a is applied in the clamping phase (determined by the training set)
  • P⁻(V_a): probability that the system stabilizes with V_a at the visible nodes in free runs (determined by the model)
  • Learning is to construct the weight matrix such that
    – P⁻(V_a) is as close to P⁺(V_a) as possible; and
    – each V_a is associated with a minimum-energy state
• A measure of the closeness of two probability distributions (called maximum likelihood, asymmetric divergence, Kullback-Leibler distance, or relative entropy):

  H = Σ_a P⁺(V_a) ln( P⁺(V_a) / P⁻(V_a) )

• It can be shown that H ≥ 0, and H = 0 only if P⁺(V_a) = P⁻(V_a) for all a.
• BM learning takes the gradient descent approach to minimize H.
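The divergence measure is straightforward to compute. A sketch (the helper name `asymmetric_divergence` is an illustrative choice):

```python
import math

def asymmetric_divergence(p_plus, p_minus):
    """Kullback-Leibler distance H = sum_a P+(Va) * ln(P+(Va)/P-(Va)).
    H >= 0, with H = 0 only when the two distributions agree; terms with
    P+(Va) = 0 contribute nothing, by the usual 0*ln(0) = 0 convention."""
    return sum(p * math.log(p / q) for p, q in zip(p_plus, p_minus) if p > 0)
```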
Derivation of the BM learning rule

  P⁻(V_a ∧ H_b) = e^{−E_ab/T} / Z                                             (1)

  Z = Σ_s e^{−E_s/T}                                                          (2)

  E_s = −Σ_{i<j} w_ij x_i^s x_j^s − Σ_i θ_i x_i^s,  so  ∂E_s/∂w_ij = −x_i^s x_j^s   (3)

  H = Σ_a P⁺(V_a) ln( P⁺(V_a) / P⁻(V_a) )                                     (4)

Since P⁺(V_a) does not depend on the weights,

  ∂H/∂w_ij = Σ_a P⁺(V_a) ∂(ln P⁺(V_a) − ln P⁻(V_a))/∂w_ij
           = −Σ_a [ P⁺(V_a)/P⁻(V_a) ] ∂P⁻(V_a)/∂w_ij                          (5)

For any system state r,

  ∂P⁻_r/∂w_ij = ∂( e^{−E_r/T} / Z )/∂w_ij
              = −(1/T)(∂E_r/∂w_ij) e^{−E_r/T}/Z − (e^{−E_r/T}/Z²) ∂Z/∂w_ij
              = (1/T) [ P⁻_r x_i^r x_j^r − P⁻_r Σ_s P⁻_s x_i^s x_j^s ]        (6)

where the last equality comes from Eqs. (1)–(3). With P⁻(V_a) = Σ_b P⁻(V_a ∧ H_b), substituting (6) into (5) (with r = V_a ∧ H_b) and ignoring the constant 1/T:

  ∂H/∂w_ij ∝ −Σ_a [ P⁺(V_a)/P⁻(V_a) ] Σ_b [ P⁻(V_a ∧ H_b) x_i^{ab} x_j^{ab} − P⁻(V_a ∧ H_b) Σ_s P⁻_s x_i^s x_j^s ]   (7)

Let p⁻_ij = Σ_s P⁻_s x_i^s x_j^s. Because p⁻_ij is a constant for given i and j, and Σ_b P⁻(V_a ∧ H_b)/P⁻(V_a) = 1 with Σ_a P⁺(V_a) = 1, the second part of (7) reduces to p⁻_ij.   (9)

For the first part, note that P⁻(V_a ∧ H_b)/P⁻(V_a) = P⁻(H_b | V_a), and P⁻(H_b | V_a) = P⁺(H_b | V_a) (both are determined by the model, conditioned on the same V_a). Hence

  Σ_a [ P⁺(V_a)/P⁻(V_a) ] Σ_b P⁻(V_a ∧ H_b) x_i^{ab} x_j^{ab}
    = Σ_{a,b} P⁺(V_a) P⁺(H_b | V_a) x_i^{ab} x_j^{ab}
    = Σ_{a,b} P⁺(V_a ∧ H_b) x_i^{ab} x_j^{ab} = p⁺_ij                         (8)

Finally, substituting (8) and (9) into (7), we have

  ∂H/∂w_ij = −(1/T) ( p⁺_ij − p⁻_ij ),  and  Δw_ij = −η ∂H/∂w_ij ∝ p⁺_ij − p⁻_ij

Notation:
  – V_a: the visible nodes, collectively in state a; H_b: the hidden nodes, collectively in state b
  – x_i: either a visible or a hidden node
  – p⁺_ij = Σ_{a,b} P⁺(V_a ∧ H_b) x_i^{ab} x_j^{ab}: how likely x_i and x_j are both "on", over all possible visible states V_a and hidden states H_b, in clamped runs
  – p⁻_ij = Σ_s P⁻_s x_i^s x_j^s: how likely x_i and x_j are both "on" in free runs
  – Weight update: Δw_ij = η ( p⁺_ij − p⁻_ij )
BM Learning Algorithm (for auto-association)
1. Compute p⁺_ij:
   1.1. Clamp one training vector to the visible nodes of the network.
   1.2. Anneal the network according to the annealing schedule until equilibrium is reached at a pre-set low temperature T1 (close to 0).
   1.3. Continue to run the network for many cycles at T1. After each cycle, determine which pairs of connected nodes are "on" simultaneously.
   1.4. Average the co-occurrence results from 1.3.
   1.5. Repeat steps 1.1 to 1.4 for all training vectors and average the co-occurrence results to estimate p⁺_ij for each pair of connected nodes.
2. Compute p⁻_ij:
   the same steps as 1.2 to 1.4 (no visible node is clamped).
3. Calculate and apply the weight change Δw_ij = η ( p⁺_ij − p⁻_ij ).
4. Repeat steps 1 to 3 until p⁺_ij − p⁻_ij is sufficiently small.
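The co-occurrence estimate and the weight update in steps 1–3 can be sketched as follows (the function names and the use of NumPy are illustrative assumptions; the sampled states would come from the annealing runs described above):

```python
import numpy as np

def co_occurrence(states):
    """Estimate p_ij: the fraction of sampled network states in which
    nodes i and j are simultaneously 'on' (steps 1.3-1.5 / step 2)."""
    s = np.asarray(states, dtype=float)   # one sampled 0/1 state per row
    return s.T @ s / len(s)

def bm_weight_update(p_plus, p_minus, eta=0.1):
    """Gradient-descent step on H: delta_w_ij = eta * (p+_ij - p-_ij)."""
    dw = eta * (p_plus - p_minus)
    np.fill_diagonal(dw, 0.0)             # no self-connections (w_ii = 0)
    return dw
```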
• Comments on BM learning
  – BM is a stochastic machine, not a deterministic one.
    • The solution is obtained at the end of the annealing process.
  – It has higher representational/computational power than HM + SA (due to the existence of hidden nodes).
  – Since learning takes the gradient descent approach, only a locally optimal result is guaranteed (H may not be reduced to 0).
  – Learning can be extremely slow, due to the repeated SA involved.
  – There are a number of variations of BM learning:
    • The one in the text is for hetero-association.
    • Alternatively, let p_ij be the probability of i and j being both "on" or both "off".
  – There are a number of variations of BM computing:
    • Clamp the input vector (it may contain noise)
    • Transient input (the resulting minimum-energy state may be irrelevant to the input)
    • Clamp the input only down to an intermediate temperature
• Speed up
  – Hardware implementation
  – Mean field theory: turn the BM deterministic by replacing each random variable x_i with its expected (mean) value x̄_i
    • No need to reach quasi-equilibrium in annealing
    • Bipolar nodes (x_i ∈ {−1, 1}):

      x̄_i = tanh( (1/T) Σ_j w_ij x̄_j )

      because the average is

      x̄_i = p_i·1 + (1 − p_i)·(−1) = 2p_i − 1 = 2/(1 + e^{−2·net_i/T}) − 1 = tanh(net_i/T),

      where p_i = 1/(1 + e^{−ΔE_i/T}) and ΔE_i = 2·net_i for bipolar nodes.
    • Binary nodes: use the logistic function x̄_i = 1/(1 + e^{−net_i/T})
  – Approximate p⁻_ij by x̄_i x̄_j
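The deterministic mean-field iteration for bipolar nodes is a simple fixed-point loop. A sketch (the function name and the fixed iteration count are illustrative; in practice one would also anneal T):

```python
import math

def mean_field_update(x_bar, w, T, iters=50):
    """Deterministic mean-field iteration replacing SA sampling:
    x_bar_i <- tanh((1/T) * sum_j w_ij * x_bar_j)  (bipolar nodes)."""
    n = len(x_bar)
    for _ in range(iters):
        x_bar = [math.tanh(sum(w[i][j] * x_bar[j] for j in range(n)) / T)
                 for i in range(n)]
    return x_bar
```

With two mutually excitatory nodes, a small positive initial bias is amplified to a fixed point near +1 for both nodes.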
• Example: associative memory
  – Store 5 vectors of dimension 4 (beyond the capacity of the HM)
    • Using 4 hidden nodes: 8 nodes in total
    • Total # of system states: 2^8 = 256
  – Suppose the training is successful (H ≈ 0):
    • P⁺(V_a) = 0.2 if V_a is one of the 5 stored patterns, and 0 otherwise
    • P⁻(V_a ∧ H_b) = 0.2 if V_a is one of the 5 stored patterns (with its associated hidden state H_b), and 0 otherwise
    • The 5 system states, one for each of the 5 stored patterns, have the globally lowest energy
  – Computing with a given recall input:
    • Clamp the input pattern to the visible nodes, anneal to an intermediate temperature (the system moves to the vicinity of one of the 5 low-energy states)
    • Unclamp the visible nodes, anneal until the temperature is close to zero for pattern completion/correction
Evolutionary Computing (§7.5)
• Another expensive method for global optimization
• Stochastic state-space search emulating biological evolutionary mechanisms
  – Biological reproduction
    • Most properties of the offspring are inherited from the parents; each parent contributes a different part of the offspring's chromosome/gene structure (crossover)
    • Some result from random perturbations of gene structures (mutation)
  – Biological evolution: survival of the fittest
    • Individuals of greater fitness have more offspring
    • Genes that contribute to greater fitness become more predominant in the population
Overview
  population → selection of parents for reproduction (based on a fitness function) → parents → reproduction (crossover + mutation) → next generation of population
• Variations of evolutionary computing:
  – Genetic algorithms
  – Evolutionary strategies (using real-valued representations and relying more on mutation and selection)
  – Genetic programming (find an optimal computer program from a population of programs by evolutionary computing)
Basics
• Individual:
  – corresponds to a state
  – represented as a string of symbols (genes and chromosomes), similar to a feature vector
• Population of individuals (at the current generation): P(t)
• Fitness function f: estimates the goodness of individuals
• Selection for reproduction:
  – randomly select a pair of parents from the current population
  – individuals with higher fitness values have higher probabilities of being selected
• Reproduction:
  – crossover allows offspring to inherit and combine good features from their parents (1-point crossover)
  – mutation (randomly altering genes) may produce new (hopefully good) features
• Replacement:
  – wholesale: replace the parent generation by the child generation
  – throw away bad individuals when the limit of the population size is reached
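The loop above fits in a small sketch. This is an illustrative toy GA, not the notes' own algorithm: the function name, population/generation sizes, and mutation rate are assumptions, and selection is simple roulette-wheel (fitness-proportional):

```python
import random

def genetic_algorithm(fitness, n_genes, pop_size=30, generations=60,
                      p_mut=0.02, seed=0):
    """Minimal GA: fitness-proportional parent selection, 1-point
    crossover, per-gene mutation, wholesale replacement."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        def pick():                                 # roulette-wheel selection
            return rng.choices(pop, weights=scores, k=1)[0]
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, n_genes)         # 1-point crossover
            child = p1[:cut] + p2[cut:]
            child = [g ^ 1 if rng.random() < p_mut else g for g in child]
            next_pop.append(child)
        pop = next_pop                              # wholesale replacement
    return max(pop, key=fitness)
```

On the classic "onemax" toy fitness (count of 1-bits), selection steadily drives the population toward the all-ones string.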
Comments
• Initialization:
  – random
  – plus sub-optimal states generated by fast heuristic methods
• Termination:
  – all individuals in the population are almost identical (converged)
  – fitness values stop improving over many generations
  – a pre-set max # of iterations is exceeded
• To ensure good results:
  – the population must be large (but how large?)
  – allow it to run for a long time (but how long?)
  – very expensive
• Combining GA with NN:
  – use GA to obtain optimal values for NN parameters (# of hidden nodes, learning rate, momentum, weights)
  – use NN to obtain good-quality initial populations for GA