
Adaptive Edge Histogram Descriptor for Landmine Detection using GPR*

Anis Hamdi

May 19th, 2010

Advisor: Dr. Hichem Frigui
Ensemble Learning Method for Hidden Markov Models

Outline
Introduction

Hidden Markov Models

Ensemble HMM classifier
  Motivations
  Ensemble HMM architecture
  Similarity matrix computation
  Hierarchical clustering
  Model training
  Decision level fusion

Application to Landmine Detection

Proposed Future Work

After the introduction, I will give some background material on HMMs.

Then I will present the proposed ensemble method for HMMs: first the motivations, then the details of its four steps.

In Part 4, I will show the results of the eHMM on a real-world data set, namely landmine detection.

Finally, I will present conclusions and proposed future work.

Introduction

Classification is one of the key tasks in data mining.

Statistical Learning problems in many fields involve sequential data: speech signals, stock market prices, protein sequences, etc.

The scope of this work is the classification of sequential data.

The standard approach in model-based classification is to learn a model for each class. The main challenge for complex classification problems is how to account for the intra-class variability.

For static data, Gaussian mixture models have been widely used. For sequential data, we intend to use a mixture of Hidden Markov Models to model the potential intra-class variability.

Introduction (cont.)

S: the sequences; πi: the HMM mixture probabilities; λi: the HMM parameters.


Given a set of N states {s1, s2, ..., sN} and a set of M observation symbols {v1, v2, ..., vM}:

The process moves from one state to another, generating a sequence of states q1, q2, ..., qt, such that P(qt = sj | qt-1 = si) = aij, 1 ≤ i, j ≤ N (the state transition probabilities).

States are not visible, but at each state the model randomly generates one observation ot according to P(ot = vk | qt = si) = bik, 1 ≤ i ≤ N, 1 ≤ k ≤ M (the state emission probabilities).

The probability that the system starts at state i is P(q1 = si) = πi, 1 ≤ i ≤ N (the initial state probabilities).

The compact representation of the HMM is λ = (A, B, π).
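As a concrete illustration (not part of the original slides), here is a minimal Python sketch of a discrete HMM λ = (A, B, π) together with the scaled forward recursion for log P(O | λ); the class and variable names are illustrative only.

import numpy as np

class DiscreteHMM:
    """Minimal discrete HMM lambda = (A, B, pi); illustrative sketch only."""
    def __init__(self, A, B, pi):
        self.A = np.asarray(A, dtype=float)    # N x N transition probabilities a_ij
        self.B = np.asarray(B, dtype=float)    # N x M emission probabilities b_ik
        self.pi = np.asarray(pi, dtype=float)  # N initial state probabilities

    def log_likelihood(self, obs):
        """log P(O | lambda) via the scaled forward procedure."""
        alpha = self.pi * self.B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
        c = alpha.sum()
        log_prob = np.log(c)
        alpha /= c
        for o_t in obs[1:]:
            alpha = (alpha @ self.A) * self.B[:, o_t]  # forward induction step
            c = alpha.sum()
            log_prob += np.log(c)
            alpha /= c
        return log_prob

# Example with N = 2 states and M = 3 symbols
hmm = DiscreteHMM(A=[[0.7, 0.3], [0.4, 0.6]],
                  B=[[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
                  pi=[0.6, 0.4])
print(hmm.log_likelihood([0, 1, 2, 1]))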

Related work: Discrete HMMs

[Diagram: hidden states q1, q2, q3 generating observations o1, o2, o3 through the transition matrix A and the emission matrix B.]

Related work: Discrete HMMs (cont.)

Evaluation problem. Given the model λ = (A, B, π) and the observation sequence O = o1 o2 ... oT, calculate the probability that the model λ has generated the sequence O. Solved with the forward-backward procedure.

Decoding problem. Given the HMM λ = (A, B, π) and the observation sequence O = o1 o2 ... oT, find the most likely sequence of hidden states that produced the observation sequence O. Solved with the Viterbi algorithm.

Learning problem. Given K training observation sequences O = [O(1), O(2), ..., O(K)] and the general structure of the HMM (number of hidden states and number of codewords), determine the HMM parameters λ = (A, B, π) that best fit the training data. Solved with Maximum Likelihood (ML), Minimum Classification Error (MCE), or Variational Bayesian (VB) training.

Outline

Introduction

Hidden Markov Models

Ensemble HMM classifier
  Motivations
  Ensemble HMM architecture
  Similarity matrix computation
  Hierarchical clustering
  Model training
  Decision level fusion

Application to Landmine Detection

Proposed Future Work

Ensemble HMM: Motivations

Using all sequences to train a single model for class 1 may lead to:
  Too much averaging of the sequences
  Loss of discriminative characteristics within class 1

One model needs to be learned for each group of similar sequences

How to group sequences? Ground truth is not sufficient.

[Illustration: sequences belonging to class 1 vs. sequences belonging to class 0.]

Ensemble HMM: Overview

We assume that the data is generated by K HMM models.

These different models reflect the natural partitions within the data, regardless of the ground truth labels.

Partitioning and model identification are achieved through clustering in the log-likelihood space.

Resulting clusters can vary:
  Different sizes
  Homogeneous or heterogeneous
Adapt the learning to the different clusters.

Fuse the multiple HMM outputs.

eHMM: Block Diagram

[Block diagram: training data feeds the similarity matrix computation, followed by hierarchical clustering into clusters 1..K. Homogeneous clusters (1..J) are trained with Baum-Welch (models 1..J); mixed clusters (J+1..E) are trained with MCE, one model per class (models J+1,1 .. E,C, combined with a max); small clusters (E+1..K) are trained with VB (models E+1..K). All model outputs feed a decision-level fusion stage that produces the final confidence.]

eHMM: Similarity Matrix Computation

Fitting individual models to sequences:
Initial HMM for each sequence: fix the number of states, N. Cluster the sequence elements into N clusters; each cluster center is a state representative. Define the codebook symbols as the sequence vectors.

Training: the Baum-Welch algorithm is used to learn the HMM parameters that best fit a particular sequence. Overfitting is intended here: we want each model to fit its corresponding sequence as closely as possible; we are not looking to use these individual models for generalization.
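A hedged sketch of this per-sequence initialization step follows; the function name, the use of scikit-learn's KMeans, and the distance-based emission initialization are assumptions, since the slides only state that the sequence elements are clustered into N states and that the codebook symbols are the sequence vectors themselves.

import numpy as np
from sklearn.cluster import KMeans

def init_model_for_sequence(seq, n_states=4):
    """seq: (T, d) array of feature vectors of one training sequence.
    Returns an initial HMM (pi0, A0, B0) and the N state representatives."""
    seq = np.asarray(seq, dtype=float)
    km = KMeans(n_clusters=n_states, n_init=10, random_state=0).fit(seq)
    state_reps = km.cluster_centers_              # one representative per state
    # Codebook = the T sequence vectors themselves; initialize the emissions
    # from the distance of each symbol (vector) to each state representative.
    dist = np.linalg.norm(seq[None, :, :] - state_reps[:, None, :], axis=2)
    B0 = np.exp(-dist)
    B0 /= B0.sum(axis=1, keepdims=True)           # N x T emission matrix
    A0 = np.full((n_states, n_states), 1.0 / n_states)
    pi0 = np.full(n_states, 1.0 / n_states)
    # Baum-Welch re-estimation would then be run on this single sequence so
    # that the model deliberately overfits it, as described above.
    return pi0, A0, B0, state_reps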

[Diagram: one HMM model λr is fitted to each of the R training sequences, r = 1..R.]

eHMM: Similarity Matrix Computation

Computing the similarity matrix:
Test each training sequence with each learned model.
Construct a pair-wise penalized log-likelihood matrix L, where
  Pr(Oi | λj): the probability of sequence Oi being generated by model j
  s_q(i): the representative of state q in model i
  q(ij) = q1(ij) ... qT(ij): the most likely hidden state sequence that generated the sequence Oi from model j
  α: mixing factor

L is not symmetric; we use the following scheme to transform it into a similarity matrix:

eHMM: Similarity Matrix Computation

Penalized log-likelihood:
Log-likelihood of sequence Oi being generated from model j: two similar sequences should have high likelihood values for being generated from their respective HMM models.

Viterbi path mismatch term: two similar sequences should have similar Viterbi paths.

Mixing factor α: a trade-off parameter between the likelihood-based similarity and the Viterbi-path-mismatch-based similarity.
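A hedged sketch of this step is given below. The exact penalized log-likelihood and symmetrization scheme are not reproduced in this transcript, so the α-weighted combination, the representative-based mismatch term, and the assumed model interface (log_likelihood, viterbi_path, state_reps) are illustrative stand-ins.

import numpy as np

def penalized_similarity_matrix(sequences, models, alpha=0.5):
    """Pairwise penalized log-likelihood matrix L and a symmetrized version S."""
    R = len(sequences)
    L = np.zeros((R, R))
    for i, seq in enumerate(sequences):
        path_i = models[i].viterbi_path(seq)          # best path of O_i in its own model
        reps_i = models[i].state_reps[path_i]         # representatives s_q(i) along that path
        for j, model_j in enumerate(models):
            loglik = model_j.log_likelihood(seq)      # log Pr(O_i | lambda_j)
            path_j = model_j.viterbi_path(seq)        # q(ij): best path of O_i in model j
            reps_j = model_j.state_reps[path_j]
            mismatch = np.linalg.norm(reps_i - reps_j, axis=1).mean()
            L[i, j] = alpha * loglik - (1.0 - alpha) * mismatch
    S = 0.5 * (L + L.T)                               # one simple symmetrization scheme
    return S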

eHMM: Similarity-based Clustering

The previous step resulted in a penalized-log-likelihood-based similarity matrix.

Since the data is available in relational form, we use a standard hierarchical clustering algorithm with the complete link inter-cluster distance.

Agglomerative hierarchical clustering is a bottom-up approach that starts with each data point as a singleton cluster. It then iteratively merges the most similar clusters according to an inter-cluster distance, as sketched below.
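A minimal SciPy sketch of this step, assuming the similarity matrix S from the previous step and using the complete-link criterion described next; converting similarities to dissimilarities by subtracting from the maximum is one simple choice, not necessarily the one used in the slides.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_sequences(S, n_clusters=20):
    """Complete-link agglomerative clustering of sequences given similarities S."""
    D = S.max() - S                          # turn similarities into dissimilarities
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='complete')
    return fcluster(Z, t=n_clusters, criterion='maxclust')   # one cluster label per sequence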

In the complete-link algorithm, the distance between two clusters is the maximum of all pair-wise distances between sequences in the two clusters. It produces compact clusters.

eHMM: Models Construction

Models initialization. For each model k:
The initial values of the initial state and state transition probabilities (πk and Ak) of model k are obtained by averaging the initial state and state transition probabilities of the individual models of the sequences belonging to cluster k.

The state representatives, s(k), of model k are obtained by clustering the observation vectors of the sequences belonging to cluster k into N clusters.

The codebook symbols, V(k), of model k are obtained by clustering the observation vectors of the sequences belonging to cluster k into M clusters. For each symbol v(k)m, its membership in each state s(k)n is computed using:

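A hedged sketch of this per-cluster initialization follows; the exact membership formula is not reproduced in this transcript, so the distance-based soft assignment for B, like the helper names, is an assumption.

import numpy as np
from sklearn.cluster import KMeans

def init_cluster_model(seq_models, cluster_obs, n_states=4, n_symbols=16):
    """seq_models: list of (pi, A) pairs from the individual models in cluster k.
    cluster_obs: (n, d) array of all observation vectors of the cluster's sequences."""
    pi_k = np.mean([pi for pi, _ in seq_models], axis=0)    # average initial probabilities
    A_k = np.mean([A for _, A in seq_models], axis=0)       # average transition matrices
    states = KMeans(n_clusters=n_states, n_init=10).fit(cluster_obs).cluster_centers_
    symbols = KMeans(n_clusters=n_symbols, n_init=10).fit(cluster_obs).cluster_centers_
    # Soft membership of each codebook symbol v_m in each state s_n (assumed form)
    dist = np.linalg.norm(states[:, None, :] - symbols[None, :, :], axis=2)
    B_k = np.exp(-dist)
    B_k /= B_k.sum(axis=1, keepdims=True)                   # N x M emission matrix
    return pi_k, A_k, B_k, states, symbols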

Once π, A, and B are initialized, we proceed to the training.

eHMM: Models Construction

Models training:
For clusters whose sequences are similar and mainly belong to the same ground-truth class, it is expected that the class-conditional posterior probability is unimodal and peaked around the maximum likelihood estimate of the parameters. A maximum likelihood estimation would result in HMM parameters that best fit this particular class.

For clusters with a mixture of sequences belonging to different classes, it is expected that the posterior distribution is multimodal. We initialize a model for each class within this cluster and then focus on finding the class boundaries within the posterior probability. The models' parameters are jointly optimized such that the overall misclassification error is minimized.

The MLE and MCE approaches need a large number of data points to give good estimates of the model parameters. For small clusters, a Bayesian approach is used instead to approximate the class-conditional posterior distribution: variational Bayesian training is suitable for clusters with a small number of sequences. (The current implementation of this step is not ready yet.)

eHMM: Models Construction

Models training:
For clusters that are dominated by sequences from only one class, we use the standard Baum-Welch re-estimation procedure, giving models λj^BW, j = 1..J.

For clusters with a mixture of observations belonging to different classes, we use discriminative training based on minimizing the misclassification error to learn a model for each class, giving models λi,c^MCE, i = J+1..E, c = 1..C.

For clusters containing a small number of sequences, we use a variational Bayesian method to update the model parameters given the observed data, giving models λk^VB, k = E+1..K. A sketch of this cluster-dependent training policy is given below.
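In the sketch, the size and purity thresholds and the trainer callables (standing in for the actual Baum-Welch, MCE, and variational Bayesian routines) are assumptions.

def train_cluster_model(cluster_seqs, cluster_labels, init_model, trainers,
                        small_size=10, purity_threshold=0.9):
    """trainers: dict with 'bw', 'mce', 'vb' callables (the actual training routines).
    Returns the list of HMM models trained for one cluster."""
    n = len(cluster_seqs)
    purity = max(cluster_labels.count(c) for c in set(cluster_labels)) / n
    if n < small_size:                      # small cluster -> variational Bayesian update
        return [trainers['vb'](init_model, cluster_seqs)]
    if purity >= purity_threshold:          # dominated by a single class -> Baum-Welch
        return [trainers['bw'](init_model, cluster_seqs)]
    # mixed cluster -> one model per class, jointly trained with MCE
    return trainers['mce'](init_model, cluster_seqs, cluster_labels)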

eHMM: Decision Level Fusion

Let Λ = {λj^BW, λi,c^MCE, λk^VB}, where j = 1..J, i = J+1..E, c = 1..C, and k = E+1..K, be the resulting mixture model after the eHMM training.

To test a new sequence O, we compute its log-likelihood under each model in Λ and fuse the resulting confidence values, as described next.

eHMM: Decision Level Fusion

Let F(r, k) = log Pr(Or | λk), 1 ≤ r ≤ R, 1 ≤ k ≤ K, be the R-by-K log-likelihood matrix.

Each row Fi, i = 1 .. R, of F represents the feature vector of the sequence i in the decision space.

Thus, the set of sequences is mapped to a Euclidean confidence space via a mapping function.
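A minimal sketch of this mapping (the model interface is an assumption): each sequence Or becomes the K-dimensional vector of its log-likelihoods under the trained models.

import numpy as np

def loglik_features(sequences, mixture_models):
    """F[r, k] = log Pr(O_r | lambda_k): R x K matrix of decision-space features."""
    F = np.zeros((len(sequences), len(mixture_models)))
    for r, seq in enumerate(sequences):
        for k, model in enumerate(mixture_models):
            F[r, k] = model.log_likelihood(seq)
    return F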

eHMM: Decision Level Fusion - ANN combination

Simple combination methods could be used, such as the mean, the maximum, or majority voting. However, these fixed rules are not trainable and require the proper identification of cluster-to-class associations.

Thus, we use a simple neural network to model the potentially nonlinear mapping between the individual confidence values and the predicted output confidence/class.

The combination function is:

And the final output is a sigmoid function:
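A hedged sketch of this fusion stage follows; the slides only state that a simple neural network with a sigmoid output is used, so the one-hidden-layer architecture and hyperparameters below (using scikit-learn's MLPClassifier) are assumptions.

from sklearn.neural_network import MLPClassifier

def train_ann_combiner(F_train, y_train):
    """F_train: R x K log-likelihood features; y_train: 0/1 (background/mine) labels."""
    ann = MLPClassifier(hidden_layer_sizes=(10,), activation='logistic',
                        max_iter=2000, random_state=0)
    ann.fit(F_train, y_train)
    return ann

# Confidence of new sequences: ann.predict_proba(F_test)[:, 1]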

For each expert network, the expert output is computed from the input F with a weight vector U and a link function f; f is the identity function for regression problems and the logistic function for binary classification.

The output of each expert network is

eHMM: Decision Level Fusion - HME combination

The input to the HME network is a K-dimensional vector F.

The network is composed of expert networks and gating networks.

eHMM: Decision Level Fusion - HME combination

For the gating networks,

with vi a weight vector.

The weight vectors U and vi are the HME parameters and can be learned using a gradient descent method or an EM-like approach.
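A minimal sketch of a single-level HME forward pass consistent with the description above (logistic experts, softmax gating); training by EM or gradient descent is omitted and the weight shapes are assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hme_output(F, U, V):
    """F: K-dim input vector; U: (n_experts, K) expert weights; V: (n_experts, K) gating weights."""
    experts = sigmoid(U @ F)          # expert outputs with a logistic link (classification)
    z = V @ F
    gates = np.exp(z - z.max())
    gates /= gates.sum()              # softmax gating probabilities g_i
    return gates @ experts            # gated mixture of the expert outputs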

Outline
Introduction

Hidden Markov Models

Ensemble HMM classifier

Application to Landmine Detection
  GPR data
  EHD feature extraction
  Baseline HMM classifier
  Ensemble HMM classifier
  Experimental results

Proposed Future Work

Application to Landmine Detection: GPR data

Ground Penetrating Radar (GPR) offers the promise of detecting landmines with little or no metal content, at the expense of a higher false alarm rate.

A GPR signature is a 3-dimensional matrix of sample values S(z, x, y), where (z, x, y) represent depth, cross-track position, and down-track position, respectively.

The down-track position is treated as the time variable in our HMM modeling.

[Figures: NIITEK vehicle-mounted GPR system, GPR scans, and a GPR signature.]

Application to Landmine Detection: EHD feature extraction

Simple edge detector operators are used to identify edges and group them into five categories: horizontal, vertical, diagonal, anti-diagonal, and isotropic (non-edge).

Illustration of the EHD feature extraction process
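A hedged sketch of a five-category edge histogram in the spirit of this step; the 2x2 operators, block size, and edge threshold are assumptions rather than the exact values used in the slides.

import numpy as np

# 2x2 edge operators: horizontal, vertical, diagonal, anti-diagonal (assumed forms)
EDGE_FILTERS = [
    np.array([[ 1.0,  1.0], [-1.0, -1.0]]),              # horizontal
    np.array([[ 1.0, -1.0], [ 1.0, -1.0]]),              # vertical
    np.array([[ np.sqrt(2), 0.0], [0.0, -np.sqrt(2)]]),  # diagonal
    np.array([[0.0,  np.sqrt(2)], [-np.sqrt(2), 0.0]]),  # anti-diagonal
]

def edge_histogram(patch, edge_threshold=10.0):
    """patch: 2-D (depth x cross-track) slice at one down-track position.
    Returns a 5-bin histogram: 4 edge orientations + isotropic (non-edge)."""
    hist = np.zeros(5)
    for r in range(0, patch.shape[0] - 1, 2):
        for c in range(0, patch.shape[1] - 1, 2):
            block = patch[r:r + 2, c:c + 2]
            strengths = [abs((f * block).sum()) for f in EDGE_FILTERS]
            if max(strengths) > edge_threshold:
                hist[int(np.argmax(strengths))] += 1     # strongest edge orientation
            else:
                hist[4] += 1                             # isotropic / non-edge block
    return hist / max(hist.sum(), 1.0)                   # normalized 5-D EHD vector

Computing one such 5-dimensional vector at each of the 15 down-track positions yields an observation sequence of the kind used by the HMMs described below.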

Application to Landmine Detection: Baseline HMM classifier

[Figure: illustration of the baseline HMM mine model.]

The baseline HMM classifier has two HMM models, one for mines and one for background. Each model has four states.

The mine model assumes that mine signatures have a hyperbolic shape.

Each model produces a likelihood value, computed with the forward-backward procedure, and a most likely state path, obtained with the Viterbi algorithm by backtracking through the model states.

The confidence value assigned to each observation sequence, O, is

[Figure: illustration of the baseline HMM architecture.]

Application to Landmine Detection: eHMM landmine detector

Application to Landmine Detection: eHMM landmine detector

(1) Feature extraction: results in a set of R sequences of length T = 15 each.

(2) Similarity matrix computation: fit a model to each sequence, compute the likelihood and Viterbi path of each sequence in each model, and deduce the pair-wise similarity matrix.

(3) Pair-wise similarity-based clustering, using the standard hierarchical algorithm with the complete-link distance and K = 20 clusters.

Application to Landmine Detection: eHMM landmine detector (cont.)

(4) Models initialization and training: for each cluster k, initialization of λk = (A, B, π) is done using the sequences (and their corresponding models λr) belonging to the cluster. For clusters that will be trained using MCE, one model is initialized for each class: λk^mine and λk^background.

Training is done according to the procedure described earlier:
  Large clusters dominated by mine or by clutter signatures are trained using maximum likelihood estimation.
  Large clusters containing a mixture of signatures from both classes are trained using MCE-based discriminative training.
  Small clusters are trained using the variational Bayesian method.

(5) Decision level fusion: done using the ANN and HME fusion methods, as detailed for the general ensemble HMM classifier.

Application to Landmine Detection: The dataset

The eHMM was trained and tested on GPR data collected by a NIITEK system.

Data was collected from 3 different locations, over a total of 12 lanes, yielding 1616 signatures: 605 mine signatures and 1011 clutter signatures.

The EHD features are used. Each signature is represented by a sequence of 15 5-dimensional vectors.

Application to Landmine Detection: eHMM clustering results

[Figure: (a) similarity matrix after clustering, (b) dendrogram.]

As sketched in figure (a), the diagonal blocks of the matrix are darker, which corresponds to higher intra-cluster similarities.

The dendrogram in figure (b) shows that, at a certain threshold, we can identify two main groups of clusters.

On the left-hand side of the dendrogram, clusters (3, 17, ..., 13) have mainly clutter signatures. On the right-hand side, clusters (1, 15, ..., 5) are dominated by mine signatures.

As can be seen in the next figure:

Application to Landmine Detection: eHMM clustering results

[Figure: distribution of the alarms in each cluster: (a) per class, (b) per type, (c) per depth.]

Some clusters (e.g., clusters 1, 5, and 15) have a large number of mines and few or no clutter alarms.

Some clusters are dominated by clutter with few mines (e.g., clusters 2, 13, and 18). The few mines included in these clusters typically have weak signatures: they are either low-metal mines or mines buried at deep depths.

Other clusters are composed of a mixture of mine and clutter signatures (e.g., clusters 3, 6, and 11). The mines within these clusters are either low-metal mines (figure (b), cluster 6) or mines buried at deep depths (figure (c), clusters 3 and 11).

Application to Landmine Detection: eHMM clustering results

[Figure: sample sequences from cluster 5 and cluster 18.]

State representatives and transition matrices of the models learned for clusters 5 and 18 (H, V, D, AD, NE denote the horizontal, vertical, diagonal, anti-diagonal, and non-edge components of the state representatives):

Cluster 5 state representatives:
      H     V     D     AD    NE
S1   0.05  0.05  0.18  0.22  0.50
S2   0.04  0.05  0.46  0.27  0.19
S3   0.05  0.03  0.44  0.45  0.04
S4   0.04  0.05  0.24  0.52  0.15

Cluster 5 transition matrix A:
      S1    S2    S3    S4
S1   0.69  0.31  0.00  0.00
S2   0.00  0.47  0.53  0.00
S3   0.00  0.00  0.85  0.15
S4   0.38  0.00  0.00  0.62

Cluster 18 transition matrix A:
      S1    S2    S3    S4
S1   0.77  0.23  0.00  0.00
S2   0.00  0.76  0.16  0.08
S3   0.00  0.00  0.84  0.16
S4   0.28  0.00  0.00  0.72

Cluster 18 state representatives:
      H     V     D     AD    NE
S1   0.06  0.03  0.10  0.11  0.69
S2   0.06  0.03  0.11  0.18  0.62
S3   0.06  0.03  0.19  0.15  0.56
S4   0.06  0.04  0.20  0.10  0.60

Application to Landmine Detection: Individual models performances

[Figure 1: (a) a sample signature from cluster 1, (b) model responses to the signature in (a).]
[Figure 2: (a) a sample signature from cluster 2, (b) model responses to the signature in (a).]

As expected, the highest likelihood occurs when testing the sequence with the HMM model of cluster 1. Moreover, the higher likelihood values correspond to the mine-dominated clusters (1, 4, 5, 6, ..., 14, 15, 17, ...).

Figure 2 shows that a test sequence belonging to cluster 2 has high likelihood in the clutter-dominated clusters' models.

Application to Landmine Detection: Individual models performances

[Figure: scatter plot of the log-likelihoods of the training data in model 5 (strong mines) versus model 1 (weak mines). Clutter, low-metal (LM), and high-metal (HM) signatures at different depths are shown with different symbols and colors.]

Even though the two models are dominated by mine signatures, we see that not all confidence values are highly correlated.

In fact, some strong mine signatures have high likelihoods in model 5 and lower likelihoods in model 1 (upper left side of the scatter plot, region R1). This can be attributed to the fact that cluster 5 contains mainly strong mines and is more likely to yield a high log-likelihood when testing a strong mine signature.

On the other hand, in region R2, the performance of the cluster 1 model is better, as it gave higher likelihood values to the "weak" mines in that region.

In the proposal report, a similar scatter plot between model 5 (strong mines) and model 2 (clutter) is presented; it shows the decorrelation between the two models' outputs.

Application to Landmine Detection: Individual models performances

[Figure: individual ROCs of some models. Solid lines: clusters dominated by mines. Dashed lines: clusters dominated by clutter.]

The individual ROCs show that the models perform differently at different false alarm rates.

We also notice that no model consistently outperforms the other models.

Application to Landmine Detection: eHMM performance

For the remainder of the experiments, we use a 4-fold cross-validation technique to average the results of the eHMM on unseen data.

[Figure: comparison of the eHMM with the best 3 cluster models (1, 2, and 12).]

In each fold, the eHMM is trained on a subset of the original data (three-fourths of the data samples) and tested on the remaining samples.

Application to Landmine Detection: eHMM performance

[Figure: comparison of the eHMM with the baseline HMM.]

ANN-based eHMM.

Application to Landmine Detection: eHMM performance

[Figure: scatter plot of the confidence values of the test data in the eHMM vs. the baseline HMM classifier. Clutter, low-metal (LM), and high-metal (HM) signatures at different depths are shown with different symbols and colors.]

The eHMM outperforms the baseline HMM, as the majority of mine signatures are located above the diagonal of the scatter plot.

Mine signatures belonging to region R1 are assigned relatively high confidence values by the eHMM, but the baseline HMM assigns them low confidence values. Those signatures are all weak mines (low metal and buried at 3" or more).

Region R2 contains signatures that are assigned relatively high confidence values by both classifiers.

Region R3 contains mainly background signatures and weak mine signatures (low-metal mines buried at deep depths) that are assigned low confidence values by both classifiers.

Conclusions

An ensemble HMM classifier is proposed:
  Learn one model per training sequence.
  Cluster the sequences in the log-likelihood space.
  Learn an HMM model for each cluster using optimized training techniques.

The multiple models are expected to capture the intra-class variations.

The outputs of the multiple models are fused using an ANN or an HME.

In an application to the landmine detection problem, the eHMM steps are individually analyzed, and the overall performance is significantly better than that of the baseline DHMM.

Proposed Future Work

eHMM implementation improvements:
  Joint optimization of the clustering, training, and fusion steps
  Use variational Bayesian learning for small clusters
  Use BIC to optimize the HMM model structures
  Use BIC to optimize the number of clusters

Applications:
  Identify potential cross-domain applications to evaluate the eHMM
  Compare the eHMM performance to other ensemble methods, such as the AdaBoost algorithm with HMMs as weak classifiers

Thank you!

Questions?