Heterogeneous adaptive systems


Page 1: Heterogeneous adaptive systems

Heterogeneous adaptive systems

Włodzisław Duch & Krzysztof Grąbczewski

Department of Informatics, Nicholas Copernicus University, Torun, Poland.

http://www.is.umk.pl

Page 2: Heterogeneous adaptive systems

Why is this important?

MLPs are universal approximators, but are they the best choice? Wrong bias => poor results and complex networks. No single method may achieve the best results for all datasets.

2-class problems, two situations:

Class 1 inside the sphere, Class 2 outside.

MLP: at least N+1 hyperplanes, O(N²) parameters.

RBF: 1 Gaussian, O(N) parameters.

Class 1 in the corner defined by the (1,1,...,1) hyperplane, Class 2 outside.

MLP: 1 hyperplane, O(N) parameters.

RBF: many Gaussians, O(N²) parameters, poor approximation.

Combination: needs both hyperplane and hypersphere!
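A quick numerical illustration of the two situations (a sketch of my own, not the authors' code; the radius, threshold, and dimensionality are arbitrary choices): a single Gaussian unit handles the spherical class with O(N) parameters, while a single sigmoidal unit handles the corner.

```python
# Contrast the two geometries: one RBF unit vs. one sigmoidal unit,
# each perfect on "its own" class structure (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
N = 10                                   # dimensionality (arbitrary)
X = rng.uniform(-1, 1, size=(2000, N))

# Situation 1: Class 1 inside a sphere of radius r around the origin.
r = 0.9
y_sphere = np.linalg.norm(X, axis=1) < r
g = np.exp(-np.linalg.norm(X, axis=1)**2 / r**2)   # one Gaussian, O(N) params
print(np.mean((g > np.exp(-1)) == y_sphere))       # 1.0: threshold at the radius

# Situation 2: Class 1 in the corner cut off by the (1,1,...,1) hyperplane.
theta = 0.5 * N
y_corner = X.sum(axis=1) > theta
s = 1 / (1 + np.exp(-(X.sum(axis=1) - theta)))     # one sigmoid, O(N) params
print(np.mean((s > 0.5) == y_corner))              # 1.0
```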

Page 3: Heterogeneous adaptive systems

Inspirations

Logical rule: IF x1 > 0 & x2 > 0 THEN Class 1 ELSE Class 2

is not properly represented by either MLP or RBF!

Result: on some datasets (cf. hypothyroid), decision trees and logical rules perform significantly better than MLPs!

Speed of learning and network complexity depend on the transfer functions (TF).

Fast learning requires flexible "brain modules" - transfer functions.

• Biological inspirations: sigmoidal neurons are crude approximation to the lowest neural level.

• Interesting brain functions are done by interacting minicolumns, implementing complex functions.

• Human categorization: never so simple.

• Modular networks: networks of networks.

• First step beyond single neurons: transfer functions providing flexible decision borders.

Page 4: Heterogeneous adaptive systems

Heterogeneous systems

Homogeneous systems use one type of "building block" and the same type of decision borders. Ex: neural networks, SVMs, decision trees, kNN, ...

Committees combine many models together, but lead to complex models that are difficult to understand.

Discovering the simplest class structures and their inductive bias requires heterogeneous adaptive systems (HAS). Ockham's razor: simpler systems are better.

HAS examples:

• NN with many types of neuron transfer functions.
• k-NN with different distance functions.
• DT with different types of test criteria.

Page 5: Heterogeneous adaptive systems

TF in Neural Networks

Choices for the selection of optimal functions:

1. Homogeneous NN: select the best TF, try several types.

Ex: RBF networks; SVM kernels (switching kernels may change accuracy from 50% to 80%).

2. Heterogeneous NN: one network, several types of TF. Ex: Adaptive Subspace SOM (Kohonen 1995), linear subspaces. Projections onto a space of various basis functions.

3. Input enhancement: adding functions f_i(X) to achieve separability. Ex: functional link networks (Pao 1989), tensor products of inputs; the D-MLP model.

Heterogeneous network construction:

1. Start from a large network with different TFs, use regularization to prune.

2. Construct a network by adding nodes selected from a pool of candidates.

3. Use very flexible TFs and force them to specialize.

Page 6: Heterogeneous adaptive systems

Taxonomy - activation functions

[Slide shows a taxonomy diagram of activation functions.]

Page 7: Heterogeneous adaptive systems

Taxonomy - output functions

[Slide shows a taxonomy diagram of output functions.]

Page 8: Heterogeneous adaptive systems

Taxonomy - transfer functions

[Slide shows a taxonomy diagram of transfer functions.]

Page 9: Heterogeneous adaptive systems

Most flexible TFs

Conical functions - mixed activations:

C_A(X; W, R, ω, θ) = W·(X − R) + ω ||X − R|| − θ

Lorentzian - mixed activations:

C_GL(X; W, R, ω, θ) = 1 / (1 + (W·X + θ + ω ||X − R||)²)

Bicentral - separable functions:

SBi(X; D, b, s) = ∏_{i=1..N} σ(e^{s_i}(x_i − D_i + e^{b_i})) · (1 − σ(e^{s_i}(x_i − D_i − e^{b_i})))
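A NumPy sketch of these three functions as reconstructed above (my own rendering of the formulas, not the authors' code; the parameter values in the demo are arbitrary):

```python
# Flexible transfer functions: conical and Lorentzian mix inner-product and
# distance activations; the bicentral function is a separable product of
# soft window functions (illustrative sketch).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conical(X, W, R, omega, theta):
    # C_A = W.(X-R) + omega*||X-R|| - theta
    d = X - R
    return W @ d + omega * np.linalg.norm(d) - theta

def lorentzian(X, W, R, omega, theta):
    # C_GL = 1 / (1 + (W.X + theta + omega*||X-R||)^2)
    a = W @ X + theta + omega * np.linalg.norm(X - R)
    return 1.0 / (1.0 + a * a)

def bicentral(X, D, b, s):
    # Per dimension: sigma rising near D - e^b, falling near D + e^b, slope e^s.
    left = sigmoid(np.exp(s) * (X - D + np.exp(b)))
    right = sigmoid(np.exp(s) * (X - D - np.exp(b)))
    return np.prod(left * (1.0 - right))

X = np.array([0.2, -0.1]); W = np.array([1.0, 1.0]); R = np.zeros(2)
print(conical(X, W, R, omega=0.5, theta=0.0),
      lorentzian(X, W, R, omega=0.5, theta=0.0),
      bicentral(X, D=np.zeros(2), b=np.zeros(2), s=np.zeros(2)))
```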

Page 10: Heterogeneous adaptive systems

Optimal Transfer Function network

OTF-NN, based on the IncNet ontogenic network architecture (N. Jankowski), with statistical criteria for pruning/growth plus Kalman filter learning.

XOR solution with:

2 Gaussian functions

1 Gaussian + 1 sigmoidal function

2 sigmoidal functions.

1 Gaussian with G(W·X) activation.
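The last solution is worth a worked check: a single Gaussian applied to the scalar projection W·X separates XOR. A minimal sketch with hand-picked weights (my illustration, not the trained IncNet values):

```python
# XOR with one Gaussian over a projected activation G(W.X).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])            # XOR labels

W = np.array([1.0, -1.0])             # hand-picked projection weights
G = np.exp(-(X @ W) ** 2)             # G(W.X): 1.0 for class 0, exp(-1) for class 1

pred = (G < 0.5).astype(int)          # threshold between exp(-1)~0.37 and 1.0
print(pred, (pred == y).all())        # [0 1 1 0] True
```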

Page 11: Heterogeneous adaptive systems

OTF for half sphere/subspace

2D and 10D problem considered, 2000 points.

OTF starts with 3 Gaussian + 3 sigmoidal functions.

2-3 neuron solutions found, 97.5-99% accuracy.

Simplest solution:

1 Gaussian + 1 sigmoid

3 sigmoidal functions – acceptable solution.

Page 12: Heterogeneous adaptive systems

Heterogeneous FSM

Feature Space Mapping: neurofuzzy ontogenic network, selects a separable localized transfer function from a pool of several types of functions.
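A minimal sketch of such a separable localized node (my own illustration of the idea, not the FSM code; the pool here contains just a Gaussian and a rectangular factor):

```python
# FSM-style separable node: a product of 1-D factors, each factor type
# chosen per input dimension from a small pool (illustrative sketch).
import numpy as np

def gauss_factor(x, c, w):
    return np.exp(-((x - c) / w) ** 2)

def rect_factor(x, c, w):
    return (np.abs(x - c) < w).astype(float)

def separable_node(X, factors, centers, widths):
    # factors: one callable per input dimension, drawn from the pool above
    vals = [f(X[i], c, w) for i, (f, c, w) in enumerate(zip(factors, centers, widths))]
    return np.prod(vals)

x = np.array([0.1, 0.4])
print(separable_node(x, [gauss_factor, rect_factor],
                     centers=[0.0, 0.5], widths=[0.5, 0.2]))   # ~0.961 * 1.0
```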

Test problem: rotated half-space + Gaussian.

Simplest solution found:

1 Gaussian + 1 rectangular function.

In 5D and 10D it needs many training points.

Page 13: Heterogeneous adaptive systems

Similarity-based HAS

Local distance functions optimized differently in different regions of feature space.

Weighted Minkowski distance functions (ex: α = 20):

D(X, R)^α = Σ_{i=1..N} W_i |X_i − R_i|^α,   W_i ≥ 0

and other types of functions, including probabilistic functions, changing piecewise linear decision borders.

RBF networks with different transfer functions;

LVQ with different local functions.
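The weighted Minkowski distance written out as code (a sketch following the formula above; the test values are arbitrary). Large α, such as the slide's α = 20, approaches the Chebyshev (L∞) metric, so iso-distance surfaces become nearly rectangular:

```python
# Weighted Minkowski distance; large alpha approaches max_i |X_i - R_i|.
import numpy as np

def weighted_minkowski(X, R, W, alpha):
    # D(X,R)^alpha = sum_i W_i * |X_i - R_i|^alpha,  W_i >= 0
    return np.sum(W * np.abs(X - R) ** alpha) ** (1.0 / alpha)

X = np.array([0.3, 0.9]); R = np.zeros(2); W = np.ones(2)
for alpha in (1, 2, 20):
    print(alpha, weighted_minkowski(X, R, W, alpha))
# alpha=1: 1.2, alpha=2: ~0.949, alpha=20: ~0.900 (close to max|X_i - R_i|)
```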

Page 14: Heterogeneous adaptive systems

HAS decision trees

Decision trees select the best feature/threshold value; tests for univariate and multivariate trees:

X_i ≤ θ_k   or   Σ_i W_i X_i ≤ θ_k

Decision borders: hyperplanes.

Introducing tests based on the L_α Minkowski metric:

D(X, R) = (Σ_i |X_i − R_i|^α)^{1/α} ≤ θ_R

For L_2, spherical decision borders are produced.

For L_∞, rectangular borders are produced (a numerical check follows below).

Many choices, for example Fisher linear discriminant decision trees.
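A quick numerical check of the border shapes (my own sketch; grid, center, and threshold are arbitrary choices): the acceptance region of the test D(X, R) ≤ θ is a disk for α = 2 and nearly the square [−θ, θ]² for large α.

```python
# Fraction of a 2-D grid accepted by the Minkowski test for two alphas:
# ~pi*theta^2/4 for the disk vs. ~theta^2 for the near-square region.
import numpy as np

g = np.linspace(-1, 1, 201)
xx, yy = np.meshgrid(g, g)
pts = np.stack([xx.ravel(), yy.ravel()], axis=1)

def minkowski(P, R, alpha):
    return np.sum(np.abs(P - R) ** alpha, axis=1) ** (1.0 / alpha)

R, theta = np.zeros(2), 0.8
inside_l2 = minkowski(pts, R, 2) <= theta     # disk of radius theta
inside_l20 = minkowski(pts, R, 20) <= theta   # nearly the square [-0.8, 0.8]^2
print(inside_l2.mean(), inside_l20.mean())    # ~0.50 vs ~0.63 of the unit box
```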

Page 15: Heterogeneous adaptive systems

SSV HAS DT

Define left and right areas for test T with threshold s:

Count how many pairs of vectors from different classes are separated and how many vectors from the same class are separated.

LS(s, T, D) = { X ∈ D : T(X) < s }  for continuous tests,
LS(s, T, D) = { X ∈ D : T(X) ∈ s }  for discrete tests,
RS(s, T, D) = D − LS(s, T, D)

SSV(s) = 2 · Σ_{c∈C} |LS(s, T, D) ∩ D_c| · |RS(s, T, D) ∩ (D − D_c)|
       − Σ_{c∈C} min(|LS(s, T, D) ∩ D_c|, |RS(s, T, D) ∩ D_c|)
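A compact sketch of this criterion for a continuous test (my NumPy rendering of the definition above, not the authors' implementation):

```python
# SSV criterion: reward pairs from different classes that the split separates,
# penalize vectors of the same class torn apart by the split.
import numpy as np

def ssv(feature, labels, s):
    left = feature < s                       # LS(s, T, D)
    right = ~left                            # RS(s, T, D)
    classes = np.unique(labels)
    sep_pairs = sum(np.sum(left & (labels == c)) * np.sum(right & (labels != c))
                    for c in classes)        # different-class pairs separated
    same_split = sum(min(np.sum(left & (labels == c)), np.sum(right & (labels == c)))
                     for c in classes)       # same-class vectors split apart
    return 2 * sep_pairs - same_split

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
print(ssv(x, y, 2.5), ssv(x, y, 1.5))        # 8 vs 3: the clean split wins
```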

Page 16: Heterogeneous adaptive systems

SSV HAS algorithm

A compromise between complexity and flexibility:

• Use training vectors as reference vectors R.
• Calculate T_R(X) = D(X, R) for all data vectors, i.e. the distance matrix.
• Use T_R(X) as additional test conditions.
• Calculate SSV(s) for each condition and select the best split.

Different distance functions lead to different decision borders.

Several distance functions are used simultaneously.
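Put together, the procedure might look like the following sketch (my illustration; it reuses the ssv() function from the previous sketch, and the grid of candidate thresholds and the test data are arbitrary choices):

```python
# SSV HAS split search: ordinary feature tests plus distance tests to every
# training vector used as a reference (illustrative sketch).
import numpy as np

def best_has_split(X, y, candidate_thresholds=20):
    tests = [("feature", j, X[:, j]) for j in range(X.shape[1])]
    # T_R(X) = D(X, R) for every training vector R taken as reference:
    for r in range(X.shape[0]):
        tests.append(("distance", r, np.linalg.norm(X - X[r], axis=1)))
    best = None
    for kind, idx, vals in tests:
        for s in np.linspace(vals.min(), vals.max(), candidate_thresholds)[1:-1]:
            score = ssv(vals, y, s)
            if best is None or score > best[0]:
                best = (score, kind, idx, s)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (np.linalg.norm(X, axis=1) < 1.5).astype(int)   # spherical class structure
print(best_has_split(X, y))   # a distance test should score best here
```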

2000 points: a noisy 10-dimensional plane rotated by 45°, plus a half-sphere centered on the plane.

Standard SSV tree: 44 rules, 99.7%

HAS SSV tree (Euclidean): 15 rules, 99.9%

Page 17: Heterogeneous adaptive systems

SSV HAS Iris

Iris data: 3 classes, 50 samples/class.

SSV solution with the usual conditions (6 errors, 96%), or with a distance test using vectors from a given node only:

if petal length < 2.45 then class 1

if petal length > 2.45 and petal width < 1.65 then class 2

if petal length > 2.45 and petal width > 1.65 then class 3

SSV with Euclidean distance tests using all training vectors as reference (5 errors, 96.7%)

1. if petal length < 2.45 then class 1

2. if petal length > 2.45 and ||X-R15|| < 4.02 then class 2

3. if petal length > 2.45 and ||X-R15|| > 4.02 then class 3

||X-R15|| is the Euclidean distance to the vector R15.
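Written out as code, the second rule set is simply (a sketch only: R15 stands for the 15th training vector, whose actual coordinates come from the data and are not shown on the slide):

```python
# The distance-based Iris rules as a function (illustrative sketch).
import numpy as np

def classify_iris(x, R15):
    # x = (sepal length, sepal width, petal length, petal width)
    if x[2] < 2.45:                                  # petal length test
        return 1
    return 2 if np.linalg.norm(x - R15) < 4.02 else 3
```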

Page 18: Heterogeneous adaptive systems

SSV HAS Wisconsin

Wisconsin breast cancer dataset (UCI): 699 cases, 9 features (cell parameters, 1..10). Classes: benign 458 (65.5%) & malignant 241 (34.5%).

Single rule gives simplest known description of this data:

IF ||X-R303|| < 20.27 then malignant

else benign

18 errors, 97.4% accuracy. Good prototype for malignant!

Simple thresholds, that’s what MDs like the most!

Best leave-one-out (L1O) accuracy: 98.3% (FSM);

best 10-fold CV accuracy: around 97.5% (Naïve Bayes + kernel, SVM);

C4.5 gives 94.7±2.0%;

SSV without distances: 96.4±2.1%

Several simple rules of similar accuracy are created in CV tests.

Page 19: Heterogeneous adaptive systems

Conclusions

Heterogeneous systems are worth investigating.

Good biological justification of HAS approach.

Better learning cannot repair the wrong bias of a model. StatLog report: large differences between RBF and MLP results on many datasets.

Networks, trees, kNN should select/optimize their functions.

Radial and sigmoidal functions in NN are not the only choice.

Simple solutions may be discovered by HAS systems.

Open questions:

How to train heterogeneous systems? How to find the optimal balance between complexity and flexibility? Ex: complexity of nodes vs. interactions (weights)? Hierarchical, modular networks: nodes that are networks themselves.

Page 20: Heterogeneous adaptive systems

The End?

Perhaps still the beginning ...