Transcript of Rich Caruana's talk - IJCLab Events Directory (Indico)
75% of Machine Learning is preparing to do machine learning
and 15% is what you do afterwards…
most ML research is about the middle 10%
Goals For This Talk
•Foster research on the complete ML pipeline
•Describe a few open problems in AutoML
•Suggest future challenge/competition problems
•Ultimate goal is to make the practice of ML more reliable so you don't need a Ph.D. in ML + 10 years experience to do ML well
•How/Where do we start?
•Start by looking at the difference between ML in the lab and ML in the field
UC-Irvine/CIFAR vs. Real-World
UC-Irvine/CIFAR:
•Download data
•No collection, cleaning, …
•Know how well others did
•Metric(s) pre-defined
•Change algs, params, and coding until doing well on metric
•Sometimes add data
Real-World:
•Problem undefined
•Don't know how well you can do
•Add new features and data feeds
•Clean, clean, clean
•Most effort goes into the data!
•Coding of data is critical
•Choose practical algorithms
•Debug, debug, debug
•Wash, rinse, repeat
•month after month after month!
Machine Learning (Research) Pipeline
•Modify Learning Algorithm
•Up-Sample Minority Class
•Feature Selection
•Re-Code Features
•Tweak Hyper-Parameters
•Deal With Missing Values
•Check Calibration
•Discover Leakage
Machine Learning (Engineering) Pipeline
•By real engineers, teams of engineers, …
•On real data, to real metrics, …
•On schedule, on budget, …
•Must be maintainable, repeatable, documentable, …
Goals For This Talk
•Foster research on the complete ML pipeline
•Describe a few open problems in AutoML
•Suggest future challenge/competition problems
•So let's just jump in…
Importance of Hyper-Parameter Optimization
•Hyper-Parameter Optimization is the most mature subarea in AutoML
•Manual heuristic search: surprisingly sub-optimal
•Grid search: effective with small number of parameters
•Random search: better than grid with larger number of parameters
•Bayesian Optimization: better than random with very large # of parameters
•…
•With modern algorithms (boosting, deep neural nets, …) parameter optimization is much more critical than you might think…
•… because modern high-flying algorithms are all low-bias, high-variance
•How many people here use automatic hyper-parameter optimization?
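The contrast between grid and random search can be illustrated on a toy problem. A minimal sketch (the loss surface and the parameter names `lr` and `reg` are invented for illustration, not from the talk) comparing the two strategies at the same evaluation budget:

```python
import random

def loss(lr, reg):
    """Toy loss surface (hypothetical): minimum at lr=0.1, reg=0.01."""
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2

def grid_search(n_per_axis):
    # n_per_axis evenly spaced values per axis -> n_per_axis**2 evaluations,
    # but only n_per_axis distinct values are ever tried on each axis.
    pts = [i / (n_per_axis - 1) for i in range(n_per_axis)]
    return min(loss(lr, reg) for lr in pts for reg in pts)

def random_search(n_trials, seed=0):
    # Same budget, points drawn uniformly: every trial probes a fresh
    # value on every axis, which matters when only some parameters matter.
    rng = random.Random(seed)
    return min(loss(rng.random(), rng.random()) for _ in range(n_trials))

best_grid = grid_search(5)     # 25 evaluations on a 5x5 grid
best_rand = random_search(25)  # 25 random evaluations
```

With only 5 grid points per axis the grid can never get closer to the optimum than its spacing allows, which is the usual argument for random search in higher dimensions.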
Importance of Hyper-Parameter Optimization
•Around 2000-2005, some thought supervised learning was done
•Quiz: they thought the best algorithm was:
•Neural Nets?
•Boosting?
•Random Forests?
•SVMs?
Importance of Hyper-Parameter Optimization
✓SVMs (circa 2000-2005)
✓Bing Ranker: FastRank vs. NeuralNet Ranker (circa 2010)
✓Best DNNs on CIFAR-10 and -100 use massive parameter optimization
✓Optimize usual hyper-parameters such as learning rate, initialization, drop-out
✓Optimize hyper-parameters per layer(s)
✓Optimize augmentations
✓Optimize network architecture
✓Our results: +1-4% on DNNs by doing careful Bayesian Optimization
✓TIMIT benefits from careful hyper-parameter optimization
✓Why didn't deep nets get discovered in the mid 90's?
✓Didn't explore the space and hyper-parameters thoroughly enough?
ML Algorithm is an Important Hyper-Parameter

Columns: Threshold Metrics (Acc = Accuracy, F-Sc = F-Score, Lift), Rank/Ordering Metrics (ROC = ROC Area, AvgP = Average Precision, BEP = Break-Even Point), Probability Metrics (SqEr = Squared Error, CrEn = Cross-Entropy, Cal = Calibration), and the row Mean.

Model      Acc    F-Sc   Lift   ROC    AvgP   BEP    SqEr   CrEn   Cal    Mean
BEST       0.928  0.918  0.975  0.987  0.958  0.958  0.919  0.944  0.989  0.953
BST-DT     0.860  0.854  0.956  0.977  0.958  0.952  0.929  0.932  0.808  0.914
RND-FOR    0.866  0.871  0.958  0.977  0.957  0.948  0.892  0.898  0.702  0.897
ANN        0.817  0.875  0.947  0.963  0.926  0.929  0.872  0.878  0.826  0.892
SVM        0.823  0.851  0.928  0.961  0.931  0.929  0.882  0.880  0.769  0.884
BAG-DT     0.836  0.849  0.953  0.972  0.950  0.928  0.875  0.901  0.637  0.878
KNN        0.759  0.820  0.914  0.937  0.893  0.898  0.786  0.805  0.706  0.835
BST-STMP   0.698  0.760  0.898  0.926  0.871  0.854  0.740  0.783  0.678  0.801
DT         0.611  0.771  0.856  0.871  0.789  0.808  0.586  0.625  0.688  0.734
LOG-REG    0.602  0.623  0.829  0.849  0.732  0.714  0.614  0.620  0.678  0.696
NAÏVE-B    0.536  0.615  0.786  0.833  0.733  0.730  0.539  0.565  0.161  0.611
Importance of Hyper-Parameter Optimization
•Hyper-parameter optimization is an example of what AutoML can achieve
•20 years ago selecting hyper-parameters was part of the craft of ML
•Neural nets: number of hidden units, learning rate, momentum, …
•Knowing how to select hyper-parameters is part of what made you an expert
•Now, multiple papers and algorithms for hyper-parameter optimization
•Thriving research community with multiple workshops
•Makes a significant difference in accuracy of trained models
•Open source code
•Need to view other steps in the ML pipeline as new research opportunities
AutoML Tools to Better Understand Data
•Auto variable type determination
•0, 1, 2, 3, 4, 5: nominal, ordinal, integer, continuous?
•Are there dates in fields?
•Is a field a unique identifier or sequence number?
•Auto coding
•Different coding needed for NNs, SVMs, KNN vs. decision tree-based methods
•Auto missing value detector
•0, 1, 2
•-1, 0, +1
•Can't just try everything --- missing variables often cause leakage!
•Auto anomaly detection
•spurious strings, missing entries in table, …
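A crude version of the variable-type determination above can be sketched as follows. The heuristics and thresholds here are assumptions for illustration, not from the talk; a real tool would also probe for dates and domain-specific codes:

```python
def infer_type(values):
    """Heuristic column-type guesser (a sketch; thresholds are assumptions).

    Returns one of: 'identifier', 'nominal', 'integer', 'continuous'.
    """
    n = len(values)
    distinct = set(values)
    # All-distinct integer values suggest a unique ID or sequence number.
    if n > 1 and len(distinct) == n and all(isinstance(v, int) for v in values):
        return "identifier"
    # Few distinct values relative to n suggest a nominal/ordinal code.
    if len(distinct) <= max(2, n // 10):
        return "nominal"
    if all(isinstance(v, int) for v in values):
        return "integer"
    if all(isinstance(v, (int, float)) for v in values):
        return "continuous"
    return "nominal"  # fall back: treat strings etc. as categories
```

Note the ambiguity the slide points at: a column of 0s and 1s lands in "nominal" here, but only the downstream coding step can decide whether it is a flag, an ordinal level, or a count.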
First Real Data Set I Worked With (1995)
•Pneumonia data from 1992-1995
•14,199 patients
•< 200 features
•mix of Booleans, categorical, and continuous variables
•missing values
•MAR --- Missing At Random
•missing correlated with target class (caused leakage!)
•Quickly wrote a simple unix utility to help better understand the data
DataDiff
•Automatically recognize changes in data
•changes in DB design, broken sensors, new semantics, new feeds, …
•In real world, DBs and data sources are living, breathing, evolving entities
•Humans make mistakes, forget what they did the 1st time, retire, …
•C-Section 1993-1995 vs. 1996-1998 data (missing values recoded, …)
•30-day Hospital Re-Admission 2011-2013 vs. new 2014 sample
•dDiff is not as trivial as it might seem:
•Density estimation is hard in high dimensions (but this is a special case)
•Don't care about simple drift if the learned model can handle it
•E.g., from 50-50 male-female to 40-60 male-female
•Care most about changes that affect model accuracy or utility
•Warning flags, default to more robust model, auto-retrain/adapt, …
Model Protection Wrappers
•Model trained to predict 30-day re-admission was deployed at a children's hospital
•Real-world:
•world is always changing
•test data often looks very different from train data
•Model should know the data it was trained on and raise red flags when it detects that run-time (test) data looks meaningfully different
•Can make this part of standard practice
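A minimal sketch of such a wrapper, assuming a simple per-feature z-score check against statistics remembered from training (the 4-sigma threshold is an assumption; a real system would use richer density estimates than per-feature means):

```python
import math

class ProtectedModel:
    """Wrapper sketch: remember per-feature training statistics and raise
    a red flag when a runtime input looks unlike the training data."""

    def __init__(self, model, train_rows, sigmas=4.0):
        self.model = model
        self.sigmas = sigmas
        cols = list(zip(*train_rows))
        self.means = [sum(c) / len(c) for c in cols]
        # Fall back to 1.0 for constant features to avoid division by zero.
        self.stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
                     for c, m in zip(cols, self.means)]

    def predict(self, row):
        # Return the prediction plus indices of out-of-range features.
        flags = [i for i, (v, m, s) in enumerate(zip(row, self.means, self.stds))
                 if abs(v - m) / s > self.sigmas]
        return self.model(row), flags
```

The caller decides what a non-empty flag list triggers: a warning, a fallback to a more robust model, or a retraining request, as the DataDiff slide suggests.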
Feedback --- the Future Curse of ML!
•If you train a model on patient data
•And the model is used to change the practice of medicine (intervention)
•Next time you collect data it is affected by the model…
•… so how do you collect unbiased data the 2nd time around?
•This is a deep, fundamental problem that in some domains is not easy to solve (ethically, or efficiently) --- a problem with non-causal learning
•Could approach this as a dDiff problem that looks not just at input features but at labels and the relationship between inputs and outputs
Post-Processing: Calibrated Probabilities
•Probabilities make complex systems easier to engineer
•Uniform language that is easy to explain/understand
•Consistent from rev to rev (eliminates threshold effects)
•Where do probabilities come from?
•Careful choice of learning algorithm?
•Most learning algorithms do NOT generate good probabilities
•Even the best can often be improved
•Post-Calibration?
Platt Scaling by Fitting a Sigmoid
•Linear scaling of SVM [-∞, +∞] predictions to [0, 1] is bad
•Platt's Method [Platt 1999]:
•scale predictions by fitting a sigmoid on a validation set using 3-fold CV and Bayes-motivated smoothing to avoid overfitting
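A sketch of the fit on held-out scores, using the smoothed targets from Platt's paper. Note this is an illustration, not Platt's implementation: his paper uses a Newton-style optimizer, while this sketch uses plain gradient descent:

```python
import math

def platt_fit(scores, labels, lr=0.01, iters=2000):
    """Fit p = 1 / (1 + exp(A*s + B)) by gradient descent on cross-entropy,
    with Platt's smoothed targets (the Bayes-motivated smoothing)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)  # smoothed target for positives
    t_neg = 1.0 / (n_neg + 2.0)            # smoothed target for negatives
    targets = [t_pos if y else t_neg for y in labels]
    A, B = 0.0, 0.0
    for _ in range(iters):
        gA = gB = 0.0
        for s, t in zip(scores, targets):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            gA += (t - p) * s   # d(cross-entropy)/dA for this example
            gB += (t - p)       # d(cross-entropy)/dB
        A -= lr * gA / len(scores)
        B -= lr * gB / len(scores)
    return lambda s: 1.0 / (1.0 + math.exp(A * s + B))
```

The smoothed targets keep the sigmoid from collapsing to 0/1 on small validation sets, which is exactly the overfitting the slide warns about.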
Auto-Calibrate
•Not as easy to make bulletproof as you might think
•Depends on sample size
•Depends on data skew
•Depends on ROC
•Probably depends on source model that generated scores in the 1st place
•Try multiple methods and reliably pick best…
•Automatically select sample to be used for post-training calibration?
•Use cross-validation for calibration samples when small data?
•Easy to use tool for automatic calibration would see widespread use
•Current tools require expertise and careful use
•Data mining challenge problem on calibration
•Foster new research on new calibration methods
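The "try multiple methods and reliably pick best" step could be sketched as comparing candidate calibration functions by held-out Brier score (a single held-out split here; a bulletproof tool would cross-validate, as the slide suggests for small data):

```python
def brier(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def pick_calibrator(candidates, held_out_scores, held_out_labels):
    """Return the name of the candidate with the best held-out Brier score.

    candidates: dict mapping name -> already-fit function (score -> probability).
    """
    best_name, best_loss = None, float("inf")
    for name, cal in candidates.items():
        loss = brier([cal(s) for s in held_out_scores], held_out_labels)
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name
```

The Brier score is a proper scoring rule, so it rewards both discrimination and calibration; selecting on it is one reasonable, but not the only, notion of "best."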
AutoML Open Problems
•robust attribute typing and coding --- the spec is never right
•dDiff --- because the world never stops changing
•runtime wrappers --- a model has to know its limitations
•feedback cycle detection --- and we never stop changing the world
•auto calibration --- probabilities are good, but not easy to automate
•auto leakage detection --- because data is never good enough
•skewed data expert --- because rare classes are very common
•auto cross-validate --- because cross-validation isn't really as simple as you think
•auto metric selection --- which metrics are sensitive to changes
•auto compression --- make the model as small and fast as possible
•auto transfer --- sometimes transfer helps, sometimes transfer hurts
Leakage and other "accidents"
•50% of data mining competitions have leakage!!!
•exploiting leakage can win data mining competitions
•KDD 2011 best paper award: "Leakage in Data Mining: Formulation, Detection, and Avoidance", Shachar Kaufman, Saharon Rosset, Claudia Perlich, Ori Stitelman
•Pneumonia leakage 1: missing values
•Pneumonia leakage 2: 4k features (AUC = 0.99)
•important to have expectations and know when they are violated
•Automatic leakage detection:
•sequential analysis, missing value analysis, feature analysis, dDiff train to real test, …
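One simple automatic screen in the feature-analysis spirit above: check whether any single feature alone predicts the label almost perfectly, as in the pneumonia example with AUC = 0.99. A minimal sketch (the 0.98 threshold is an assumption):

```python
def auc(scores, labels):
    """ROC area via the rank-sum (Mann-Whitney) formulation."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def leakage_screen(features, labels, threshold=0.98):
    """Flag features that alone nearly perfectly rank the label ---
    a common symptom of leakage. Checks both orientations, since a
    leak can be inversely coded.

    features: dict mapping feature name -> list of numeric values.
    """
    flagged = []
    for name, col in features.items():
        a = auc(col, labels)
        if max(a, 1.0 - a) > threshold:
            flagged.append(name)
    return flagged
```

A flagged feature is not proof of leakage, only a violated expectation: the point of the slide is to know when expectations are violated and let a human investigate.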
Want to do New Research that Gets Cited?
•Pick a part of the ML pipeline that's still largely manual
•Define what it would mean to make it (more) automatic
•Develop and publish methods
•Fully-automatic "robot" that solves the problem
•Assistant that helps a human recognize and solve the problem
•Tools that alert when a problem (probably) exists
•Make data sets publicly available
•Make code available for use as a baseline (and possibly open source)
•Organize a challenge competition on that part of the pipeline
•Good way to pick a thesis topic!
Summary
•AutoML is a growth research area
•Community has neglected 85% of the challenges of doing real ML
•Many independent sub-problems all worthy of attention
•Every time you stub your toe on a real problem => opportunity for new research
•Hyper-Parameter Optimization often critical --- start using it!
•Suggest we all do research and write a paper on dDiff this year
•dDiff workshop in 1-2 years?
•dDiff challenge/competition in 1-2 years?
•Make dDiff tools open source and available in R and Linux
•Tools will immediately see widespread use