Branch And Bound and Beam Search Feature Selection Algorithms



Page 1

L.G.H.C. Nalinda | 2011CS005 | 11000058

Feature Selection Algorithms

Branch and Bound & Beam Search

Dr. Ajantha Athukorala

Page 2

The Necessity

Page 3

Data in the Digital Era
▷ How to turn mountains of data into "nuggets"?

Databases, Machine Learning, Data Mining, Statistics, …

▷ An effective way of processing: "FEATURE SELECTION"

▷ Feature Selection:
- Reduces the number of features
- Removes noise
- Speeds up data mining algorithms

Page 4

Feature Selection
▷ The process of selecting an optimal subset of features.
- Feature selection is NP-hard.
- It is used across data mining tasks: classification, clustering, association rules, regression.

▷ Subset generation: a heuristic search, with each state in the search space specifying a candidate subset for evaluation. With N features, the search space contains 2^N subsets.
- Search strategies: complete, sequential, random.

▷ Complete search (no optimal subset is missed): guaranteed to find the optimal result according to the evaluation criterion. Order of the search space: O(2^N). Examples: Branch and Bound and Beam Search.
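To see why complete search scales as O(2^N), here is a tiny illustrative Python snippet (my own, not from the slides) counting the candidate subsets for N = 10 features:

```python
from itertools import combinations
from math import comb

N = 10
features = range(1, N + 1)

# A complete search over all non-empty subsets has 2^N - 1 candidates.
total_subsets = 2 ** N - 1

# Fixing the target size d still leaves C(N, d) candidates to evaluate.
d = 6
size_d_subsets = list(combinations(features, d))

print(total_subsets, comb(N, d), len(size_d_subsets))  # 1023 210 210
```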

Page 5

Fundamentals

Page 6

Objective
▷ How to find the best subset of features that optimizes the criterion function J(X), given a set of measurements of feature variables.

▷ The optimization is over all possible subsets X_d of size d drawn from the p available measurements x_1, …, x_p.

▷ Goal: find the subset of features that maximizes J(X):

J(X*) = max { J(X) : X ⊆ {x_1, …, x_p}, |X| = d }

Page 7

Monotonicity
▷ Exhaustive search is exponential; Branch and Bound requires instead that the feature selection criterion be monotonic.

▷ Monotonicity identifies the branches that cannot contain the optimal solution for feature selection.

▷ Given the full set, only a subset of the features contributes to the optimal solution.

▷ If the feature selection criterion yields smaller values on a subset, that subset and its derivations cannot be optimal:

J(X_1) ≥ J(X_2) ≥ … ≥ J(X_j), where X_j = Y \ {y_1, y_2, …, y_j}

[X_j – the set of features obtained by removing the j features y_1, y_2, …, y_j from Y]
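As a side illustration (not from the slides), the chain of inequalities above follows from the single-step condition J(S) ≥ J(S \ {y}); the sketch below checks it exhaustively for a toy additive criterion whose per-feature scores are made-up values:

```python
from itertools import combinations

def satisfies_monotonicity(J, features):
    """Verify J(S) >= J(S minus {y}) for every subset S and every y in S.
    Chaining these single-step inequalities yields J(X_1) >= ... >= J(X_j)."""
    feats = list(features)
    for r in range(1, len(feats) + 1):
        for S in combinations(feats, r):
            for y in S:
                smaller = tuple(f for f in S if f != y)
                if J(S) < J(smaller):
                    return False
    return True

# Toy criterion: sum of nonnegative per-feature scores (hypothetical values).
# Dropping a feature can only lower J, so this criterion is monotonic.
scores = {1: 4.0, 2: 1.5, 3: 3.0, 4: 0.5}
J = lambda S: sum(scores[f] for f in S)
print(satisfies_monotonicity(J, scores))  # True
```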

Page 8

Branch and Bound

Page 9

▷ Selects the best subset of features from a set of N features.

▷ Introduced by A. H. Land and A. G. Doig for discrete programming and combinatorial optimization problems.

▷ Narendra and Fukunaga applied the concept to feature selection.

▷ Follows a divide-and-conquer approach.

▷ An optimal search that does not require exhaustive enumeration.

▷ Assumption: the criterion function satisfies the monotonicity condition.

Page 10

▷ Operations:
- Branch: partition the full set of features into smaller subsets.
- Bound: maintain a bound on the best solution in a subset, and discard the subset when the bound shows it cannot contain an optimal solution.

▷ Applications:
- Travelling salesman problem
- False noise analysis
- 0/1 knapsack problem
- K-nearest neighbor search
- Integer programming
- Set inversions
- Feature selection in machine learning

Page 11

Pseudocode

Page 12

Flow Chart for Branch and Bound

Page 13

Terminology
• Z_j – Index of the discarded feature.
• S_j – List of successors of the considered node.
• N – Number of features in the full feature set.
• M – Desired number of features to be selected.
• J_feature – Feature subset selection criterion (with the monotonicity property).
• α – Bound.
• j – Level of the tree.


Page 14

1. Root: j = 0, α = −∞ (treat the root level as level zero and initialize the bound to minus infinity).

2. Create the successor list S_j for the current level.

S_j contains all possible values that Z_j can take at level j, with the maximum possible feature index being (m + j); i.e., successor nodes at this level hold subsets with one feature deleted (in ascending order) from the previous level's parent node.

Analysis of Pseudo-code

Page 15

3. Select a new node from the current level:
- If S_j (the successor list) is empty, go to step 5.
- Otherwise, find the value k with the maximum J_feature value; set Z_j = k and delete k from S_j.

4. Check the bound:
- If the criterion value of the selected node is less than the bound α, go to step 5.
- If the last level has been reached, move to step 6 (we now have the desired number of features).
- Otherwise, step to the next level (j = j + 1) and move to step 2.

Page 16

5. Return to the previous level (j = j − 1):
- If j = 0, terminate.
- Otherwise, continue with step 3.

Whenever the criterion evaluated at a node is less than the bound α, all successors of that node also have criterion values less than α (by monotonicity), so we prune them.

6. (At the last level) Set α = J_feature(Z_1, Z_2, …, Z_d) and continue from step 5. The objective is to optimize the criterion function while updating the lower bound.
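Putting steps 1–6 together, the following is a minimal Python sketch of Branch and Bound over the feature-removal tree. It is an illustration, not the presenter's code: the criterion J is passed in by the caller and assumed monotonic.

```python
def branch_and_bound(features, d, J):
    """Return the size-d feature subset maximizing a monotonic criterion J.
    Branches by deleting one feature at a time in ascending position order
    (so no subset is generated twice) and prunes any node whose criterion
    value already falls to or below the current bound alpha."""
    best = {"alpha": float("-inf"), "subset": None}

    def recurse(subset, start):
        j_val = J(subset)
        if j_val <= best["alpha"]:       # bound check: prune this branch
            return
        if len(subset) == d:             # leaf reached: update the bound
            best["alpha"], best["subset"] = j_val, subset
            return
        # Successors delete one more feature, ascending positions only,
        # mirroring the ordered, asymmetric tree described above.
        for i in range(start, len(subset)):
            recurse(subset[:i] + subset[i + 1:], i)

    recurse(tuple(features), 0)
    return best["subset"], best["alpha"]
```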

Page 17

Overview – Branch and Bound
▷ Construct an ordered tree preserving the following property: J_k denotes the criterion value with k variables eliminated, and the order of the Z_p is determined by a discrimination criterion.

▷ Traverse the tree using depth-first search.

▷ At each level, evaluate the criterion and sort the nodes.

▷ Prune the tree (any node whose criterion value is less than the bound α is pruned).

Page 18

Demonstration
▷ Selecting the best six features from ten features.

▷ Number of levels = (10 − 6) = 4

▷ Number of leaf nodes = C(10, 4) = C(10, 6) = 210

▷ Assumption: initial feature set = {1,2,3,4,5,6,7,8,9,10}

At the root: level number j = 0, Z_j = 0, α = −∞

Page 19

▷ Level 0: [1,2,3,4,5,6,7,8,9,10]

▷ Level 1
- Contains subsets of the Level 0 set with one variable removed.
- Create the successor list for the current level, consisting of all possible values.
- Successor list (S_j) = {1,2,3,4,5,6,7,8,9}


Page 20

▷Level 0 [1,2,3,4,5,6,7,8,9,10]

▷Level 1: 9 features

▷Level 2: 8 features

▷Level 3: 7 features

▷Level 4: 6 features

At level 4 there will be 210 leaf nodes.
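As a rough cross-check of these numbers, the Branch and Bound sketch from the pseudocode section can be run on a ten-feature example; the per-feature scores below are invented purely for illustration:

```python
# Hypothetical per-feature scores for features 1..10 (illustration only).
scores = dict(zip(range(1, 11), [5, 3, 8, 2, 7, 1, 4, 6, 9, 2]))

best, alpha = branch_and_bound(range(1, 11), 6,
                               lambda S: sum(scores[f] for f in S))
print(best, alpha)  # (1, 3, 5, 7, 8, 9) with J = 39 under these toy scores
```

With an additive criterion the winner is trivially the six highest-scoring features; the point is that B&B reaches it while pruning parts of the removal tree along the way, rather than evaluating all C(10, 4) = 210 leaves blindly.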

Page 21

Page 22

Page 23

[Slides 21–23: diagrams of the constructed search tree; images not transcribed.]

Page 24

Constructed Tree
▷ The constructed tree is NOT symmetric.

▷ Features are removed ONLY in ascending order. This avoids subsets being replicated; e.g., removing 4 and then 5 to give the subset (1,2,3) yields the same result as removing 5 and then 4, so only the ascending version is generated.

▷ Applying this removes unnecessary repetition in the calculations.

Page 25

Backtracking and Feature Selection

Page 26

Backtracking
▷ The search algorithm used on the constructed tree is depth-first search.

▷ The search proceeds from right to left: from the least dense part of the tree to the part with the most branches.

▷ Start from the rightmost set (1,2,3,4,5,6), with a J value of 80.5.

▷ The search backtracks to the nearest branching node and proceeds down the rightmost branch, evaluating all nodes until a leaf node is reached.

▷ If a node's value is less than the stored J value, that branch is not traversed any further.

▷ If a node's value is greater than the stored J value, traversal continues until a leaf node is found; if the J value at the new leaf is higher than the stored value, the stored J value is updated. This process repeats recursively.

Page 27

Optimal Feature Subset
▷ The optimal J value, i.e. the final updated bound of the backtracking algorithm, is α = 82.6.

▷ The corresponding feature subset = [1,2,3,4,5,10]

▷ Hence the optimal feature set = [1,2,3,4,5,10]

▷ Classified as a slow algorithm:
- Worst case: exponential time complexity.
- Average case: reasonably fast.

Algorithm characteristics

Page 28

Branch and Bound in the Research Domain

Research: "Evaluation of Feature Selection Techniques for Analysis of Functional MRI and EEG"

Citation: Burrell, L., Smart, O., Georgoulas, G. K., Marsh, E., & Vachtsevanos, G. J. (2007, June). Evaluation of feature selection techniques for analysis of functional MRI and EEG. In DMIN (pp. 256–262).

Page 29

From the paper: in order to classify pathological events in the human body, the Branch and Bound algorithm was applied to functional MRI and EEG data.

[Figures: fMRI data and iEEG data]

Page 30

▷ Extracted from each patient dataset:
- fMRI data: 12 features
- iEEG data: 14 features

▷ Features were expressed mathematically; several analysis domains were considered in constructing the mathematical expressions (time, frequency, statistics, information theory).

▷ The algorithm was executed for varying feature subset sizes, with the objective function, feature vector, and classification vector as inputs.

▷ The extracted features were then used for classification. Evaluation was done using a k-Nearest Neighbour (k-NN) classifier, quantifying the accuracy of the extracted feature set.

Page 31

Observations of the research

▷ For patients with a high signal-to-noise ratio, only a few features are needed.

▷ For patients with a poor signal-to-noise ratio, the Branch and Bound algorithm achieves the best classification accuracy.

▷ However, it still requires 13 of the 14 features (iEEG data) to achieve that optimal accuracy.

▷ Sequential Forward Floating Selection requires only 6–8 features to achieve its optimal classification accuracy.

▷ Fewer features mean lower computational cost.

▷ B&B achieves its optimal classification accuracy at a higher computational cost (more features must be extracted for classification). The B&B algorithm does not outperform the other feature selection methods on either the fMRI or the iEEG data.

Page 32

Recommendations for Branch and Bound
▷ When a complete, exhaustive traversal is needed, Branch and Bound comes in handy, since it omits the construction of certain search-tree branches.

Limitations of the Branch and Bound Algorithm
▷ In certain circumstances it can be slower than exhaustive search.
▷ Weak-performance circumstances:
- Criterion evaluation is slow (the evaluated feature subsets are large).
- Subtree cut-offs are less frequent near the root (criterion values there are high).

Page 33

Suggested Recommendations
▷ Petr Somol and Pavel Pudil introduced the Fast Branch & Bound principle.

▷ It incorporates a prediction mechanism:
- Inaccuracy of this mechanism must NOT affect the optimality of the result, while remaining acceptable speed-wise.
- Information about each feature's contribution to the criterion value is gathered while the algorithm runs.

Page 34

Beam Search

Page 35

Introduction
▷ A heuristic method for solving combinatorial optimization problems.

▷ The most promising nodes at each level of the search tree are selected for further branching, while the remaining nodes are pruned off permanently.

▷ Only a predetermined number of best partial solutions are kept as candidates at each level.

▷ Traverses the tree using breadth-first search.

▷ Examines a number of alternatives, or beams, in parallel.

▷ The beam width can be either fixed or variable.

▷ A solution to the excessive memory requirements of best-first search.

Page 36

Applications of Beam Search
▷ Speech recognition via artificial-intelligence approaches.

▷ Image processing.

▷ The job-shop problem with both makespan and mean-tardiness performance measures.

▷ The single-machine early/tardy problem.

▷ Feature extraction in machine learning.

Page 37

How Beam Search Works in Feature Extraction
▷ It amounts to a truncated Branch and Bound in which only the β most promising feature nodes are retained (instead of all feature nodes).

▷ The parameter β is known as the beam width; it is fixed to a value before feature extraction starts.

▷ The other feature nodes (those outside the β-node set) are simply discarded.

▷ Unlike the Branch and Bound algorithm, no backtracking mechanism is used, since the intent of this technique is to extract the features quickly.

Page 38

Pseudo Code | Beam Search

Page 39

Terminology
• k_bw – Beam width.
• S_bsf – Predefined threshold for feature subset pruning.
• B – The set of partial feature solutions.
• C – The children of the partial solutions in B.
• HEURISTIC – The feature criterion heuristic function.


Page 40

Algorithm Analysis
1. The algorithm maintains a set B of partial solutions. In the beginning, B contains only the empty partial feature solution.

2. The feature set C contains all of the children of the partial feature solutions in B.

3. After each of the n features is evaluated individually, select the best k_bw of them.

4. Add a new feature to each of these k_bw features, forming k_bw(n − 1) 2-tuples of features.

Page 41

5. Each partial feature solution is then retrieved from C and evaluated.

6. Form all possible tuples by appending to the k_bw retained tuples the features not already in them.

7. If the feature criterion value (J) of a partial feature subset is lower than the threshold, it is discarded; if it is higher, it is appended to B.
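Collecting steps 1–7, here is a minimal Python sketch of beam-search feature selection. It is an illustration under assumed inputs, not the presenter's code: J, k_bw, and the threshold are supplied by the caller, and the toy scores in the demo run are invented.

```python
def beam_search(features, d, J, k_bw, threshold):
    """Grow feature subsets one feature at a time; at each level keep only
    the k_bw best partial solutions whose criterion value exceeds the
    pruning threshold. No backtracking: pruned branches are gone for good."""
    beam = [()]  # B: the partial solutions, initially just the empty one
    for _ in range(d):
        # C: all distinct children of the partial solutions in B
        children = {tuple(sorted(p + (f,)))
                    for p in beam for f in features if f not in p}
        # Discard children at or below the threshold, then keep the k_bw best.
        ranked = sorted(((J(c), c) for c in children if J(c) > threshold),
                        reverse=True)
        beam = [c for _, c in ranked[:k_bw]]
        if not beam:  # every candidate fell below the threshold
            return None, float("-inf")
    best = max(beam, key=J)
    return best, J(best)

# Toy run shaped like the demonstration that follows (scores are made up):
scores = {1: 40, 2: 35, 3: 30, 4: 20, 5: 45}
J = lambda S: sum(scores[f] for f in S)
print(beam_search([1, 2, 3, 4, 5], d=3, J=J, k_bw=2, threshold=0))
# -> ((1, 2, 5), 120) under these toy scores
```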

Page 42

Demonstration
▷ Initial feature set: [1,2,3,4,5]

▷ Goal: extract the optimal 3 features out of the initial five.

▷ Assume the beam width k_bw = 2.

▷ Predefined threshold for the feature criterion value (J) = 88.

Page 43

[Beam-search tree diagram for the demonstration; image not transcribed.]

Page 44

Tree Analysis
▷ For each of the five features, an individual criterion value is computed.
- Feature 1 and Feature 5 surpass the predefined J threshold of 88.
- With k_bw = 2, only the Feature 1 and Feature 5 subsets branch out further.

▷ All possible tuples are formed by appending each remaining feature.

▷ Each branched node is evaluated using the criterion function J.

▷ The subsets (1,4) and (1,5), branched through the Feature 1 node, and (3,5) and (4,5), branched through the Feature 5 node, have the highest corresponding J values; hence those are branched out next.

Page 45

Optimal Feature Subset

▷ The search finally arrives at the desired feature set size (the leaf nodes). The optimal feature set (highest J value) is (2,3,5).

▷ The optimal J value at the leaf-node stage of the beam algorithm is J = 100.

▷ The corresponding feature subset = [2,3,5]

▷ Hence the optimal set = [2,3,5]

Page 46

Algorithm Characteristics
▷ No backtracking is available within the algorithm.

▷ Pruned branches might contain the optimal solution (the algorithm does not always return the optimum).

Page 47

Beam Search in the Research Context

Research: "Beam Search for Feature Selection in Automatic SVM Defect Classification"

Citation: Gupta, P., Doermann, D., & DeMenthon, D. (2002). Beam search for feature selection in automatic SVM defect classification. In Proceedings of the 16th International Conference on Pattern Recognition (Vol. 2, pp. 212–215). IEEE.

Page 48

▷ Beam search is used with an SVM-based classifier for automatic defect classification:
- It reduces the dimensionality of the feature space substantially.
- It improves classifier performance.

▷ Uses the heuristic nature of beam search to reduce the search space.

▷ Beam search was implemented with an SVM classifier to select candidate subsets for automatic defect classification.

▷ The performance of the classifier depends on the quality of the features extracted.

Page 49

Data & Features
▷ The semiconductor industry uses automatic defect classification.

▷ Wafer defects are categorized into classes based on information provided by sensing and imaging devices.

▷ Each defect is described by a high-dimensional feature vector of about 100 features.

▷ The aim is to capture features that show high variability between different classes and thus help in distinguishing between them.

▷ A spread factor (η) is defined to measure each feature's power to distinguish between classes.

Page 50

Research Results
▷ A significant reduction in the size of the feature vector was achieved with beam search.

▷ The time taken to train the SVM classifier was reduced.

▷ The size of the feature subset was reduced by at least 70% for all the binary classification tasks.

Page 51

Research Results (Cont.)
▷ The reduction in computation and memory comes at a cost: the algorithm is not guaranteed to find an optimal solution and cannot recover from wrong decisions.

▷ If a node leading to the optimal solution is discarded during the search, there is no longer any way to reach that optimal solution.

▷Optimal solution is NOT Guaranteed.

Page 52

Recommendations for Beam Search
▷ Varying the beam-width parameter trades off the risk of missing the optimal goal state against the computational cost of the search.
- A wider beam considers more candidate solutions, while taking up more memory and processing power.
- A narrower beam considers fewer candidate solutions, risking missing the optimal solution.

▷ Hence a wider beam width allows greater safety, but at the cost of increased computational effort.

Page 53

THANK YOU