Hierarchical Classification by Jurgen Van Gael
Transcript of Hierarchical Classification by Jurgen Van Gael
Hierarchical Classification (Jurgen Van Gael)
About
• Computer Scientist with a background in ML.
• London Machine Learning Meetup.
• Founder of the Math.NET numerical library.
• Previously @ Microsoft Research.
• Data science team lead at Rangespan.
Taxonomy Classification
• Input: raw product data
• Output: classification models, classified product data
ROOT
├─ Electronics
│  ├─ Audio: Audio Cables, Amps, …
│  └─ Computers, …
├─ Clothing: Pants, T-Shirts, …
├─ Toys: Model Rockets, …
└─ …
Pipeline: Data Collection → Feature Extraction → Training & Testing → Labelling
Feature Extraction
Name: INK-M50 Black Ink Cartridge (600 pages)
Manufacturer: Samsung
Description: null
Label: toner-inkjet-cartridges

"category": "toner-inkjet-cartridges",
"features": ["cartridge", "samsung", "black", "ink", "ink-m50", "pages"]
Feature Extraction:
• Text cleaning (stopword removal, lemmatisation)
• Unigram + bigram features
• LDA topic features
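The unigram + bigram step above can be sketched in a few lines of Python. This is an illustrative sketch, not Rangespan's actual code: the stopword list and the tokenisation regex are simplified assumptions.

```python
# Illustrative unigram + bigram feature extraction for a product record.
# STOPWORDS and the tokeniser are stand-in assumptions.
import re

STOPWORDS = {"the", "a", "an", "of", "with"}

def extract_features(name, manufacturer):
    text = f"{name} {manufacturer}".lower()
    # Keep alphanumeric tokens (hyphens allowed, e.g. "ink-m50"),
    # dropping stopwords.
    tokens = [t for t in re.findall(r"[a-z0-9-]+", text) if t not in STOPWORDS]
    # Bigrams are adjacent token pairs joined with "_".
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = extract_features("INK-M50 Black Ink Cartridge (600 pages)", "Samsung")
```

Run on the example item above, this yields unigrams such as `ink-m50` and `cartridge` plus bigrams such as `black_ink`.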
http://radimrehurek.com/gensim
Training, Testing & Labelling
Hierarchical Classification
[Diagram: classes A, B, C, D, E flattened into a single level under one root]
Flat approach: one 4 (5) way multiclass classification.
Hierarchical Classification
[Diagram: the same classes arranged as a tree, classified level by level]
Hierarchical approach: 2 + 3 way multiclass classification.
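The hierarchical approach amounts to a cascade: run a small multiclass classifier at the root, descend into the predicted subtree, and repeat until a leaf. The toy taxonomy and the lambda "classifiers" below are illustrative stand-ins, not the talk's actual models.

```python
# Illustrative classifier cascade over a toy taxonomy.
# Internal nodes map to their children; leaves are absent from the dict.
TAXONOMY = {
    "ROOT": ["Electronics", "Clothing"],
    "Electronics": ["Audio", "Computers"],
}

def classify(node, features, classifiers):
    """Descend the tree: at each internal node, pick a child with that
    node's multiclass classifier until a leaf is reached."""
    while node in TAXONOMY:
        node = classifiers[node](features)
    return node

# Stub per-node classifiers (real ones would be trained models).
clfs = {
    "ROOT": lambda f: "Electronics" if "ink" in f else "Clothing",
    "Electronics": lambda f: "Audio" if "amp" in f else "Computers",
}

leaf = classify("ROOT", {"ink", "cartridge"}, clfs)  # ROOT → Electronics → Computers
```

Each node's classifier only has to separate that node's children, which is the "2 + 3 way" decomposition above.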
Naïve Bayes? Neural Network? Logistic Regression? Support Vector Machines? … ?
Logistic Regression - Model

word       printer-ink   printer-hardware
cartridge  4.0           0.3
the        0.0           0.0
samsung    0.5           0.5
black      0.5           0.3
printer    -1.0          2.0
ink        5.0           -1.7
…          …             …

For each class, for each feature: add the weight, then exponentiate & normalize.
Σ  = 10.0      -0.6
Pr = 0.99997   0.0003
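The add-then-exponentiate-and-normalize step is a softmax over the per-class weight sums. A minimal sketch, using the weights from the table above for the features present in the example item (the table gives no weights for `ink-m50` or `pages`, so they are omitted here):

```python
# Softmax scoring for the example item, weights copied from the table.
import math

weights = {
    "printer-ink":      {"cartridge": 4.0, "samsung": 0.5, "black": 0.5, "ink": 5.0},
    "printer-hardware": {"cartridge": 0.3, "samsung": 0.5, "black": 0.3, "ink": -1.7},
}
features = ["cartridge", "samsung", "black", "ink"]

# For each class, add the weights of the active features...
scores = {c: sum(w[f] for f in features) for c, w in weights.items()}
# ...then exponentiate and normalize to get class probabilities.
z = sum(math.exp(s) for s in scores.values())
probs = {c: math.exp(s) / z for c, s in scores.items()}
```

This reproduces the sums Σ = 10.0 and -0.6, and gives essentially all the probability mass to printer-ink.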
Logistic Regression - Inference
• Optimise using Wapiti.
• Hyperparameter optimisation using grid search.
• Use a development set to stop training?
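The grid-search-with-development-set idea above can be sketched generically. This is an illustrative sketch under assumptions: the grid values and the `train`/`dev_error` callables are stand-ins (the talk does the actual optimisation with Wapiti).

```python
# Illustrative hyperparameter grid search: train one model per setting,
# keep whichever scores best on a held-out development set.
def grid_search(train, dev_error, grid):
    """train(hyper) -> model; dev_error(model) -> float (lower is better)."""
    best = None
    for hyper in grid:
        model = train(hyper)
        err = dev_error(model)
        if best is None or err < best[0]:
            best = (err, hyper, model)
    return best

# Toy example: "training" just returns the hyperparameter, and the dev
# error happens to be smallest at 0.3.
best_err, best_hyper, _ = grid_search(
    train=lambda h: h,
    dev_error=lambda m: abs(m - 0.3),
    grid=[0.01, 0.1, 0.3, 1.0, 3.0],
)
```

The held-out development set is what also enables early stopping: stop training once the dev error stops improving.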
http://wapiti.limsi.fr/
[Diagram: ROOT with children Electronics and Clothing]
Cross Validation & Calibration
• Estimate classifier errors.
• DO NOT test on training data; leave data aside instead.
• Are my probability estimates correct?
• Computation:
  o Take the data points x with p(·|x) = 0.9,
  o check that about 90% of their labels were correct.
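The calibration check described above fits in a few lines. A minimal sketch, with made-up predictions; the bucket width is an assumption.

```python
# Illustrative calibration check: among predictions made with confidence
# near `target_conf`, the empirical accuracy should be about that value.
def calibration_accuracy(preds, target_conf=0.9, tol=0.05):
    """preds: list of (predicted_prob, was_correct) pairs.
    Returns the fraction correct among predictions whose confidence
    falls within `tol` of `target_conf` (None if the bucket is empty)."""
    bucket = [ok for p, ok in preds if abs(p - target_conf) <= tol]
    return sum(bucket) / len(bucket) if bucket else None

# Toy data: 10 predictions at confidence 0.9, of which 9 were correct,
# so the classifier is well calibrated at that confidence level.
preds = [(0.9, True)] * 9 + [(0.9, False)]
acc = calibration_accuracy(preds)  # 0.9
```

In practice one repeats this over several confidence buckets (0.5, 0.6, …, 0.9) to see the whole calibration curve.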
Training Data (cross validation)
Fold errors: 1.2%, 1.1%, 1.2%, 1.2%, 1.3%
= average error 1.2%
[Diagram: ROOT with children Electronics and Clothing]
Using Bayes rule to chain classifiers:
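The equation itself is not in the transcript; by the chain rule of probability, the chaining presumably has the form (category names taken from the taxonomy above, notation assumed):

```latex
% Probability of a leaf category = product of the per-level classifier
% outputs along its path from the root:
p(\text{audio} \mid x)
  = p(\text{audio} \mid \text{electronics}, x)\; p(\text{electronics} \mid x)
```

So a low-confidence decision anywhere on the path discounts the final leaf probability.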
Active Learning
[Diagram: ROOT with children Electronics and Clothing; p(electronics | {text}) = 0.1]
• High-probability datapoints:
  o Upload to production.
• Low-probability datapoints:
  o Subsample.
  o Acquire more labels.
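The routing rule above can be sketched as follows. The threshold and the subsampling scheme are illustrative assumptions, not the talk's actual values.

```python
# Illustrative active-learning split: confident predictions go straight
# to production; low-confidence ones are subsampled for extra labelling.
def route(datapoints, threshold=0.8, keep_every=2):
    """datapoints: list of (item, predicted_prob) pairs.
    Returns (production, needs_labels)."""
    production = [item for item, p in datapoints if p >= threshold]
    low = [item for item, p in datapoints if p < threshold]
    needs_labels = low[::keep_every]  # crude deterministic subsample
    return production, needs_labels

prod, relabel = route([("printer", 0.95), ("shirt?", 0.1), ("toy?", 0.2)])
```

Items in `relabel` are the ones sent out for human labels; once labelled, they feed back into the training data.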
[Diagram: ROOT with children Electronics and Clothing; p(electronics | {text}) = 0.1]
Extra labels acquired via, e.g., Mechanical Turk.
Implementation
Stores: MongoDB, S3 Raw, S3 Training Data, S3 Models
1. JSON export
2. Feature Extraction
3. Training
4. Classification
Training MapReduce
• Dumbo on Hadoop
• 2000 classifiers
• 5-fold CV (+ full)
• 20 hyperparameter settings on the grid
= 200,000 training runs (2000 × 5 × 20)
Labelling
• 128 chunks
• Full cascade on each chunk
[Diagram: the classifier cascade over classes A–E applied to Chunk 1, Chunk 2, Chunk 3, … Chunk N]
Thoughts
• Extras:
  o Partial labelling: stop when the probability becomes low.
  o Data ensemble learning.
• Most time was spent on feature engineering.
• Tie the parameters of the classifiers?
  o "Frustratingly Easy Domain Adaptation", Hal Daumé III
• Partially flatten the hierarchy for training?