Machine Learning Primer (transcript)
Ashvin Bashyam, Anika Gupta
What is driving success in machine learning?
CONFIDENTIAL 9/5/19
Drivers of progress in machine learning today
1. Access to large, structured datasets
2. Improved algorithms for learning from large datasets
3. Specialized computing hardware & compute power to accelerate training on large datasets
Where is ML deployed today?
• Image classification and captioning → self-driving cars
• Language translation → Google Translate
• Speech recognition → iPhone/Android speech-to-text
• Recommendation engines → Netflix
[Chart: AI startup funding growth in the US, contrasting traditional vs. modern algorithms]
Dissecting the jargon
[Diagram: nested circles showing deep learning within machine learning within artificial intelligence, with examples such as a linear fit]

Artificial intelligence: computers accomplishing tasks that previously required human intelligence
Machine learning: using data to build a system that identifies patterns and makes decisions
Deep learning: a type of machine learning that iteratively extracts complex patterns from data
“When you’re fundraising, it’s AI. When you’re hiring, it’s ML.”
– @xaprb on Twitter
Building blocks to learn from data
[Diagram: data matrix of individuals/samples (rows) by attributes 1…n (columns)]

1. Access high-quality data
2. Frame the problem: is the goal to classify or cluster?
3. Select a model class
4. Quantify prediction quality
What defines a high quality dataset for ML?
- Quantity of data – more data improves prediction accuracy
- Ideally: Number of samples ≥ Number of features/attributes
- Representation of population – diversity of training data should span the entire space you’d like to predict over
- Quality of data – data should be free from defects and confounding factors (e.g. sequencing data from same machine; normalization of values to same scale)
- Free from bias – data should not introduce bias due to imbalanced sampling
- Reliably labeled – data should be accurately labeled, especially if human labeling was required
More data leads to improved predictions, especially for more complex models
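The checklist above can be turned into simple programmatic sanity checks. A minimal Python sketch, assuming the dataset arrives as a list of feature rows plus a label list; the `dataset_checks` helper and its 0.8 majority-class threshold are illustrative choices, not from the slides:

```python
def dataset_checks(X, y):
    """Basic quality checks on a dataset (illustrative only).

    X: list of feature rows, y: list of labels.
    Returns a dict of check-name -> bool.
    """
    n_samples = len(X)
    n_features = len(X[0]) if X else 0
    # Quantity: number of samples >= number of features is the rule of thumb
    enough_samples = n_samples >= n_features
    # Bias: no single class should dominate the labels (threshold arbitrary)
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    majority = max(counts.values()) / n_samples
    balanced = majority <= 0.8
    return {"enough_samples": enough_samples, "balanced": balanced}
```

Checks like representation of the population and reliable labeling are harder to automate and still need human review.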
Types of ML problems
Examples:
• Clustering scRNA-seq to identify cell types
• Identifying neoplasm histology images as malignant or benign
• Predicting binding affinity of antibodies in silico
Supervised learning
Goal: build a model that can predict a label from features
Data: features and labels
Training: learning a model that accurately predicts labels across the training dataset; requires a ‘loss function’ to define how to penalize errors
Deployment: using the model to predict the label from a new data point
Examples:
• Disease diagnosis from biopsies
• Image-based phenotyping from microscopy images
• Arrhythmia detection on Apple Watch
• Protein structure prediction from sequence
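The goal/data/training/deployment loop above can be sketched end to end. A toy Python example, assuming a one-feature regression problem; `train_linear` minimizes the squared-error loss by gradient descent (the learning rate and iteration count are arbitrary choices):

```python
def train_linear(xs, ys, lr=0.05, iters=2000):
    """Training: fit y ≈ m*x + b by gradient descent on the
    squared-error loss, sum((m*x + b - y)**2)."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        # Gradients of the loss with respect to m and b
        gm = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * gm
        b -= lr * gb
    return m, b

def predict(m, b, x):
    """Deployment: predict the label for a new, unseen data point."""
    return m * x + b
```

Training on points drawn from y = 2x + 1 recovers a slope near 2 and intercept near 1, which can then be applied to new data points.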
Unsupervised learning
Goal: build a model that can identify clusters of data with similar features
Data: features only (no labels)
Training: learning a model that separates data into distinct clusters; requires a ‘loss function’ which defines how to penalize errors
Deployment: using the model to predict cluster membership from a new data point
Examples:
• Identifying cell subsets from scRNA-seq data
• Identifying similar patients from EMRs
• Identifying proteins and genes with similar functions
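The setup above can be sketched with a minimal k-means loop, a standard unsupervised algorithm (a bare-bones illustration, not a method from the slides; the random initialization and fixed iteration count are simplifying assumptions):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Cluster unlabeled feature vectors into k groups.

    Returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        assign = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assign
```

Note that no labels appear anywhere: the algorithm groups points purely by similarity of their features.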
Classes of algorithms
With increasing model complexity comes:
• Greater computational requirements
• Greater risk of overfitting (reduced ability to generalize)
• More data required
• Less interpretability
• Higher potential accuracy
[Diagram: algorithms ordered by increasing complexity, spanning linear regression (y = mx + b) through K-means clustering, decision trees, and the perceptron to deep learning … and many more!]
Deep learning
- Uses: classification or regression
- How it works:
  - Weighted combination of inputs
  - Inputs are linearly combined within a layer
  - A nonlinear function is applied between layers
  - Multiple functions are applied at each layer
  - Output can be the final product OR binary (threshold)
- Training: repeat for many iterations, until predictions are within a desired range of the true values
- Why it’s exciting:
  - Ability to model extremely complex (nonlinear) functions
  - Continuously improves with more data
  - Learns without a human explicitly telling it how to
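The layer-by-layer description above can be written out directly. A sketch in Python with NumPy, assuming a small feed-forward network with a ReLU nonlinearity between layers and a threshold on the final output (the weights below are hand-picked for illustration, not learned):

```python
import numpy as np

def relu(x):
    """Nonlinear function applied between layers."""
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    """One forward pass: inputs are linearly combined within each
    layer, a nonlinearity is applied between layers, and the final
    output is thresholded into a binary decision."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)          # weighted combination, then ReLU
    logits = weights[-1] @ a + biases[-1]
    return (logits > 0).astype(int)  # binary output via threshold
```

With hand-set weights a two-layer network like this can reproduce a logical AND of its two inputs; in practice the weights are learned by the iterative training loop described above.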
[Diagram: example neural network, with the features of the input data feeding the input layer]
Evaluating predictions
Loss = ∑(predicted – actual)²
[Chart: validation of model, with metrics to quantify goodness of fit for Model 1 vs. Model 2]
Classification goal: distinguish true positives from true negatives
Regression goal: minimize the difference between predicted & real
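Both goals above map onto a few lines of metric code. A Python sketch, assuming binary labels for the classification case and numeric values for regression (the function names are mine, not from the slides):

```python
def squared_error_loss(predicted, actual):
    """Regression: Loss = sum over samples of (predicted - actual)^2,
    matching the formula on this slide."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

def confusion_counts(predicted, actual):
    """Classification: count true/false positives and negatives,
    the raw ingredients for distinguishing TPs from TNs."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}
```

From the confusion counts one can derive the usual summary metrics (accuracy, sensitivity, specificity) used to compare candidate models.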
How do we segment applications of ML?
Goal: identify what type of problem ML is being applied towards. Focus on the type of problem being solved and the type of dataset being used.

Type of problem: classification, regression, clustering, dimensionality reduction, decision making
Type of dataset: biological datasets, images, chemical structures, text data (e.g. EMR), data that varies over time (e.g. wearables or ECG)
Common risks & solutions when using machine learning
Risk: Bias
Explanation: Generalization error due to the model being too simple (underfitting)
Mitigation: Increase data variability (e.g. merge datasets), use a larger set of features, increase model complexity
Deal breaker? No

Risk: Variance
Explanation: Error due to hypersensitivity to small fluctuations in the data; too complex of a model (overfitting)
Mitigation: Regularization (enforce simplicity on the model), increase dataset size, use fewer features
Deal breaker? No

Risk: Poor interpretability
Explanation: The more complicated the model is, the less interpretable it becomes (hard to identify which features are most important for predictions)
Mitigation: Use simpler models
Deal breaker? Depends on context

Risk: Poor data quality
Explanation: “Garbage in, garbage out” (imaging example)
Mitigation: Be selective when deciding which data to input into the model
Deal breaker? Yes

Risk: Correlation, not causation
Explanation: Predictions simply identify patterns; they say nothing about the direction of causality
Mitigation: Causal inference methods, experimental validation (for biological hypotheses)
Deal breaker? No

Risk: No ground truth
Explanation: If there is no ‘ground truth’ for comparison, as is often the case in clustering, evaluating a method can’t use the typical loss/error benchmarks
Mitigation: Use data from known entities to map new candidate inputs and assess the similarity of their profiles
Deal breaker? Depends on context
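The Variance row’s regularization mitigation can be illustrated in a few lines. A sketch of one-feature ridge regression without an intercept, assuming the closed-form solution m = Σxy / (Σx² + λ); the penalty λ shrinks the slope toward zero, enforcing simplicity on the model:

```python
def ridge_slope(xs, ys, lam):
    """One-feature ridge regression (no intercept): a larger lam
    (the regularization penalty) shrinks the fitted slope toward 0,
    trading a little bias for reduced variance."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
```

With lam = 0 this reduces to ordinary least squares; increasing lam pulls the slope steadily toward zero.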
Questions to ask when evaluating machine learning
• Data
  • Where did the data (and labels) come from? What are the features? What are they trying to predict?
  • What assumptions are they making about the data? Do those assumptions hold true in the real world?
• Framing the problem
  • Is their biological question accurately represented by their data?
  • Is the ML framing appropriate for the true question at hand, or have they solved a proxy for the question?
• Model
  • Do their model evaluation metrics make sense for the type of data used?
  • Do their models generalize well to new data that is sufficiently challenging?
• Validation
  • How did they validate their findings? Did they use an independent dataset?
  • Did they withhold access to this dataset while training their model?
  • Did they avoid over-interpreting the model predictions?
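The validation questions above hinge on a dataset that was genuinely withheld during training. A minimal Python sketch of a random holdout split (the 20% test fraction and fixed seed are arbitrary choices):

```python
import random

def train_test_split(data, test_frac=0.2, seed=0):
    """Withhold a test set before training; evaluate on it only
    after the model is final (a simple independent holdout)."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = int(len(data) * test_frac)
    test_idx = set(idx[:n_test])
    train = [d for i, d in enumerate(data) if i not in test_idx]
    test = [d for i, d in enumerate(data) if i in test_idx]
    return train, test
```

A split like this only answers the second bullet under Validation; an independent dataset from a different source is a stronger test of generalization.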
If you are confused: Replace ‘machine learning’ with ‘statistics’ and ask the same logical questions
Cultural differences & similarities in ML vs. biology
Differences
• Open source: everything is open access / open source; methods are rarely proprietary
• Differences in IP: IP is not a patent on the code or algorithm; IP is the skillset, a working build system, and access to data
• Barriers to entry: very low for new ML practitioners; a few top labs or institutions dominate new ideas
• Data is everything: access to data is the #1 barrier to entry; the biopharma industry is traditionally a data-first field, and machine learning is its natural evolution

Similarities
• Both strive to separate noise from signal: importance of filtering true vs. false positives
• Both are largely empirical sciences: deep learning and experimental biology are both empirical (change one piece of the puzzle and observe the change in outcome); importance of validation data (experiments) for novel findings