Machine Learning Primer (transcript)
Ashvin Bashyam, Anika Gupta
What is driving success in machine learning?
CONFIDENTIAL 9/5/19
Drivers of progress in machine learning today
1. Access to large, structured datasets
2. Improved algorithms for learning from large datasets
3. Specialized computing hardware & compute power to accelerate training on large datasets
Where is ML deployed today?
• Image classification and captioning → self-driving cars
• Language translation → Google Translate
• Speech recognition → iPhone/Android speech-to-text
• Recommendation engines → Netflix
[Chart: AI startup funding growth in the US, contrasting traditional vs. modern algorithms]
Dissecting the jargon
[Diagram: nested circles showing deep learning within machine learning within artificial intelligence, with examples such as a linear fit]

Artificial intelligence: computers accomplishing tasks that previously required human intelligence
Machine learning: using data to build a system that identifies patterns and makes decisions
Deep learning: a type of machine learning that iteratively extracts complex patterns from data
“When you’re fundraising, it’s AI. When you’re hiring, it’s ML.”
– @xaprb on Twitter
Building blocks to learn from data
[Diagram: data matrix of individuals/samples (rows) by attributes 1…n (columns)]

1. Access high-quality data
2. Frame the problem: is the goal to classify or cluster?
3. Select a model class
4. Quantify prediction quality
What defines a high quality dataset for ML?
- Quantity of data – more data improves prediction accuracy
- Ideally: Number of samples ≥ Number of features/attributes
- Representation of population – diversity of training data should span the entire space you’d like to predict over
- Quality of data – data should be free from defects and confounding factors (e.g. sequencing data from same machine; normalization of values to same scale)
- Free from bias – data should not introduce bias due to imbalanced sampling
- Reliably labeled – data should be accurately labeled, especially if human labeling was required
More data leads to improved predictions, especially for more complex models
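The checklist above can be turned into simple programmatic sanity checks. A minimal Python sketch, assuming the dataset arrives as a list of feature rows plus a label list; the `dataset_checks` helper and its 0.8 majority-class threshold are illustrative choices, not from the slides:

```python
def dataset_checks(X, y):
    """Basic quality checks on a dataset (illustrative only).

    X: list of feature rows, y: list of labels.
    Returns a dict of check-name -> bool.
    """
    n_samples = len(X)
    n_features = len(X[0]) if X else 0
    # Quantity: number of samples >= number of features is the rule of thumb
    enough_samples = n_samples >= n_features
    # Bias: no single class should dominate the labels (threshold arbitrary)
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    majority = max(counts.values()) / n_samples
    balanced = majority <= 0.8
    return {"enough_samples": enough_samples, "balanced": balanced}
```

Checks like representation of the population and reliable labeling are harder to automate and still need human review.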
Types of ML problems
Examples:
• Clustering scRNA-seq to identify cell types
• Identifying neoplasm histology images as malignant or benign
• Predicting binding affinity of antibodies in silico
Supervised learning
Goal: build a model that can predict a label from features
Data: features and labels
Training: learning a model that accurately predicts labels across the training dataset; requires a ‘loss function’ to define how to penalize errors
Deployment: using the model to predict the label from a new data point
Examples:
• Disease diagnosis from biopsies
• Image-based phenotyping from microscopy images
• Arrhythmia detection on Apple Watch
• Protein structure prediction from sequence
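The goal/data/training/deployment loop above can be sketched end to end. A toy Python example, assuming a one-feature regression problem; `train_linear` minimizes the squared-error loss by gradient descent (the learning rate and iteration count are arbitrary choices):

```python
def train_linear(xs, ys, lr=0.05, iters=2000):
    """Training: fit y ≈ m*x + b by gradient descent on the
    squared-error loss, sum((m*x + b - y)**2)."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iters):
        # Gradients of the loss with respect to m and b
        gm = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * gm
        b -= lr * gb
    return m, b

def predict(m, b, x):
    """Deployment: predict the label for a new, unseen data point."""
    return m * x + b
```

Training on points drawn from y = 2x + 1 recovers a slope near 2 and intercept near 1, which can then be applied to new data points.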
Unsupervised learning
Goal: build a model that can identify clusters of data with similar features
Data: features only (no labels)
Training: learning a model that separates data into distinct clusters; requires a ‘loss function’ which defines how to penalize errors
Deployment: using the model to predict cluster membership from a new data point
Examples:
• Identifying cell subsets from scRNA-seq data
• Identifying similar patients from EMRs
• Identifying proteins and genes with similar functions
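The setup above can be sketched with a minimal k-means loop, a standard unsupervised algorithm (a bare-bones illustration, not a method from the slides; the random initialization and fixed iteration count are simplifying assumptions):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Cluster unlabeled feature vectors into k groups.

    Returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        assign = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assign
```

Note that no labels appear anywhere: the algorithm groups points purely by similarity of their features.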
Classes of algorithms
With increasing model complexity comes:
• Greater computational requirements
• Greater risk of overfitting (reduced ability to generalize)
• More data required
• Less interpretability
• Higher potential accuracy
[Diagram: algorithms ordered by increasing complexity, spanning linear regression (y = mx + b) through K-means clustering, decision trees, and the perceptron to deep learning … and many more!]
Deep learning
- Uses: classification or regression
- How it works:
  - Weighted combination of inputs
  - Inputs are linearly combined within a layer
  - A nonlinear function is applied between layers
  - Multiple functions are applied at each layer
  - Output can be the final product OR binary (threshold)
- Training: repeat for many iterations, until predictions are within a desired range of the true values
- Why it’s exciting:
  - Ability to model extremely complex (nonlinear) functions
  - Continuously improves with more data
  - Learns without a human explicitly telling it how to
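The layer-by-layer description above can be written out directly. A sketch in Python with NumPy, assuming a small feed-forward network with a ReLU nonlinearity between layers and a threshold on the final output (the weights below are hand-picked for illustration, not learned):

```python
import numpy as np

def relu(x):
    """Nonlinear function applied between layers."""
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    """One forward pass: inputs are linearly combined within each
    layer, a nonlinearity is applied between layers, and the final
    output is thresholded into a binary decision."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)          # weighted combination, then ReLU
    logits = weights[-1] @ a + biases[-1]
    return (logits > 0).astype(int)  # binary output via threshold
```

With hand-set weights a two-layer network like this can reproduce a logical AND of its two inputs; in practice the weights are learned by the iterative training loop described above.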
[Diagram: example neural network, with the features of the input data feeding the input layer]
Evaluating predictions
Loss = ∑(predicted – actual)²
[Chart: validation of model, with metrics to quantify goodness of fit for Model 1 vs. Model 2]
Classification goal: distinguish true positives from true negatives
Regression goal: minimize the difference between predicted & real
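Both goals above map onto a few lines of metric code. A Python sketch, assuming binary labels for the classification case and numeric values for regression (the function names are mine, not from the slides):

```python
def squared_error_loss(predicted, actual):
    """Regression: Loss = sum over samples of (predicted - actual)^2,
    matching the formula on this slide."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

def confusion_counts(predicted, actual):
    """Classification: count true/false positives and negatives,
    the raw ingredients for distinguishing TPs from TNs."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}
```

From the confusion counts one can derive the usual summary metrics (accuracy, sensitivity, specificity) used to compare candidate models.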
How do we segment applications of ML?
Goal: identify what type of problem ML is being applied towards. Focus on the type of problem being solved and the type of dataset being used.

Type of problem: classification, regression, clustering, dimensionality reduction, decision making
Type of dataset: biological datasets, images, chemical structures, text data (e.g. EMR), data that varies over time (e.g. wearables or ECG)
Common risks & solutions when using machine learning
Risk: Bias
Explanation: Generalization error due to the model being too simple (underfitting)
Mitigation: Increase data variability (e.g. merge datasets), use a larger set of features, increase model complexity
Deal breaker? No

Risk: Variance
Explanation: Error due to hypersensitivity to small fluctuations in the data; too complex of a model (overfitting)
Mitigation: Regularization (enforce simplicity on the model), increase dataset size, use fewer features
Deal breaker? No

Risk: Poor interpretability
Explanation: The more complicated the model is, the less interpretable it becomes (hard to identify which features are most important for predictions)
Mitigation: Use simpler models
Deal breaker? Depends on context

Risk: Poor data quality
Explanation: “Garbage in, garbage out” (imaging example)
Mitigation: Be selective when deciding which data to input into the model
Deal breaker? Yes

Risk: Correlation, not causation
Explanation: Predictions simply identify patterns; they say nothing about the direction of causality
Mitigation: Causal inference methods, experimental validation (for biological hypotheses)
Deal breaker? No

Risk: No ground truth
Explanation: If there is no ‘ground truth’ for comparison, as is often the case in clustering, evaluating a method can’t use the typical loss/error benchmarks
Mitigation: Use data from known entities to map new candidate inputs and assess the similarity of their profiles
Deal breaker? Depends on context
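The Variance row’s regularization mitigation can be illustrated in a few lines. A sketch of one-feature ridge regression without an intercept, assuming the closed-form solution m = Σxy / (Σx² + λ); the penalty λ shrinks the slope toward zero, enforcing simplicity on the model:

```python
def ridge_slope(xs, ys, lam):
    """One-feature ridge regression (no intercept): a larger lam
    (the regularization penalty) shrinks the fitted slope toward 0,
    trading a little bias for reduced variance."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
```

With lam = 0 this reduces to ordinary least squares; increasing lam pulls the slope steadily toward zero.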
Questions to ask when evaluating machine learning
• Data
  • Where did the data (and labels) come from? What are the features? What are they trying to predict?
  • What assumptions are they making about the data? Do those assumptions hold true in the real world?
• Framing the problem
  • Is their biological question accurately represented by their data?
  • Is the ML framing appropriate for the true question at hand, or have they solved a proxy for the question?
• Model
  • Do their model evaluation metrics make sense for the type of data used?
  • Do their models generalize well to new data that is sufficiently challenging?
• Validation
  • How did they validate their findings? Did they use an independent dataset?
  • Did they withhold access to this dataset while training their model?
  • Did they avoid over-interpreting the model predictions?
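The validation questions above hinge on a dataset that was genuinely withheld during training. A minimal Python sketch of a random holdout split (the 20% test fraction and fixed seed are arbitrary choices):

```python
import random

def train_test_split(data, test_frac=0.2, seed=0):
    """Withhold a test set before training; evaluate on it only
    after the model is final (a simple independent holdout)."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = int(len(data) * test_frac)
    test_idx = set(idx[:n_test])
    train = [d for i, d in enumerate(data) if i not in test_idx]
    test = [d for i, d in enumerate(data) if i in test_idx]
    return train, test
```

A split like this only answers the second bullet under Validation; an independent dataset from a different source is a stronger test of generalization.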
If you are confused: Replace ‘machine learning’ with ‘statistics’ and ask the same logical questions
Cultural differences & similarities in ML vs. biology
Differences
• Open source: everything is open access / open source; methods are rarely proprietary
• Differences in IP: IP is not a patent on the code or algorithm; IP is the skillset, a working build system, and access to data
• Barriers to entry: very low for new ML practitioners; a few top labs or institutions dominate new ideas
• Data is everything: access to data is the #1 barrier to entry; the biopharma industry is traditionally a data-first field, and machine learning is its natural evolution

Similarities
• Both strive to separate noise from signal: importance of filtering true vs. false positives
• Both are largely empirical sciences: deep learning and experimental biology are both empirical (change one piece of the puzzle and observe the change in outcome); importance of validation data (experiments) for novel findings