2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Clinical Grade Annotations: Public Data Resources for Interpreting Genomic Variants February 19, 2015 Gabe Rudy @gabeinformatics VP Product Management and Engineering Golden Helix

Transcript of 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Page 1: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Clinical Grade Annotations: Public Data Resources for Interpreting Genomic Variants

February 19, 2015

Gabe Rudy

@gabeinformatics

VP Product Management and Engineering

Golden Helix

Page 2: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

My Background

Golden Helix

- Founded in 1998

- Genetic association software

- Analytic services

- Thousands of users worldwide

- Over 800 customer citations in journals

Products I Build with My Team

- SNP & Variation Suite (SVS)

- SNP, CNV, NGS tertiary analysis

- Import and deal with all flavors of upstream data

- VarSeq

- Annotate and filter variants in gene panels, exomes and genomes for clinical labs and researchers.

- GenomeBrowse (Free!)

- Visualization of everything with genomic coordinates. All standardized file formats.

Page 3: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Agenda

1. How I Met My Exomes
2. Getting High Quality Variant Calls
3. Clinical Grade Candidate Variant Identification
4. Data Sharing and the Maturing of Public Resources
5. NGS Clinical Utopia: Are We There Yet?

Page 4: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Exome Sequencing in Consumer Genomics

Exomes done as part of Pilot Program

80x coverage

Raw data with no interpretation

Erin

JIA

Gabe (me)

Ethan

Page 5: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Research or clinical grade?

Total Reads 140M

Unique Align 87%

Mean Target 105x

% Target at 2x 97%

% Target at 10x 94%

% Target at 20x 89%

% Target at 30x 83%
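
Metrics like the ones on this slide can be derived from per-base depths over the capture targets. A minimal sketch, assuming the depths have already been extracted into a list of integers; the function name `coverage_summary` and the toy depths are illustrative and not from the talk:

```python
from statistics import mean


def coverage_summary(depths, thresholds=(2, 10, 20, 30)):
    """Summarise per-base read depths over the target regions.

    depths: iterable of integer depths, one value per targeted base.
    Returns (mean depth, {threshold: percent of bases at or above it}).
    """
    depths = list(depths)
    if not depths:
        raise ValueError("no target bases supplied")
    pct_at = {
        t: 100.0 * sum(d >= t for d in depths) / len(depths)
        for t in thresholds
    }
    return mean(depths), pct_at


# Toy input: ten targeted bases with hypothetical depths.
toy_depths = [0, 1, 5, 12, 25, 31, 40, 80, 120, 150]
mean_depth, pct_at = coverage_summary(toy_depths)
```

A high mean depth can hide poorly covered targets, which is why the "% Target at Nx" rows matter more than the mean when asking whether an exome is clinical grade.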

Page 6: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Agenda

1. How I Met My Exomes
2. Getting High Quality Variant Calls
3. Clinical Grade Candidate Variant Identification
4. Data Sharing and the Maturing of Public Resources
5. NGS Clinical Utopia: Are We There Yet?

Page 7: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

PSPH mis-alignment

Page 8: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Splice Mutation

Page 9: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants
Page 10: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants
Page 11: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants
Page 12: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

GRCh38 – Here Now, but no hurry

A better human reference

- Revised Cambridge Reference Sequence (rCRS) MT

- Has centromere models

- ~2000 incorrect alleles fixed

- ~100 assembly gaps updated

NCBI Annotation Release 106 on GRCh38

- dbSNP 141, ClinVar, RefSeqGene

- Ensembl 76 on both

No Population Catalogs

- Some being ported (by Ensembl, dbSNP)

My Exome

         GRCh37     GRCh38
Ts/Tv    2.06558    2.10171

[Figure: bar chart of variant counts by type (snps, mnps, indels, complex) for GRCh37 and GRCh38, y-axis from 270,000 to 340,000; bar totals labeled 331,824 and 319,442]
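
The Ts/Tv values on this slide are a quick sanity check on a call set. A minimal sketch of how such a ratio can be computed from (ref, alt) pairs, assuming the SNVs have already been pulled out of the VCF; the function names and toy calls are illustrative, not from the talk:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(calls):
    """calls: iterable of (ref, alt) strings; anything that is not a simple SNV is skipped."""
    ts = tv = 0
    for ref, alt in calls:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue  # indel, MNP, or not a variant
        if ref not in "ACGT" or alt not in "ACGT":
            continue  # ambiguous or symbolic allele
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions seen; ratio is undefined")
    return ts / tv


# Toy call set: four transitions, two transversions, one deletion (skipped).
toy_calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"),
             ("A", "C"), ("G", "T"), ("AT", "A")]
ratio = ts_tv_ratio(toy_calls)
```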

Page 13: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Blog Post

Page 14: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Agenda

1. How I Met My Exomes
2. Getting High Quality Variant Calls
3. Clinical Grade Candidate Variant Identification
4. Data Sharing and the Maturing of Public Resources
5. NGS Clinical Utopia: Are We There Yet?

Page 15: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Baylor Workflow - Clinical Exomes Paper

Disease gene related

Medically actionable deleterious variants

Deleterious variants in ACMG gene list

Deleterious variants

VUS in dominant gene or homozygous in recessive gene

Deleterious variant in gene with no known disease

Page 16: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Annotate, Then Filter and Interpret

Page 17: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Data Sources to Replicate Workflow

1000 Genomes (Phase 1)

“ESP” (NHLBI 6500 Exomes v2)

HGMD (Public vs Professional)

Variant’s Protein Coding Effect

RNA Splicing Effect (dbscSNV)

- −3 to +8 at the 5’ splice site, −12 to +2 at the 3’ splice site

Gene Lists:

- Single-Gene Disorder (OMIM with Inheritance)

- Medically Actionable (114 genes NHLBI study)

- Dominant Inheritance (MedGen)

- ACMG Carrier Panel (ACMG Incidental Findings guidelines)
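
The splice windows listed above (−3 to +8 at the 5’ site, −12 to +2 at the 3’ site) can be turned into a simple position check. A rough sketch, assuming 1-based inclusive genomic coordinates and an internal exon (first and last exons lack one of the two sites); `splice_region` is a hypothetical helper, not part of dbscSNV or any tool named in the talk:

```python
def splice_region(pos, exon_start, exon_end, strand="+"):
    """Classify a 1-based genomic position relative to one internal exon.

    Returns "donor" for the -3..+8 window around the 5' splice site,
    "acceptor" for the -12..+2 window around the 3' splice site,
    or None if the position is outside both windows.
    """
    if strand == "+":
        # Donor: last 3 exonic bases plus first 8 intronic bases downstream.
        donor = range(exon_end - 2, exon_end + 9)
        # Acceptor: last 12 intronic bases upstream plus first 2 exonic bases.
        acceptor = range(exon_start - 12, exon_start + 2)
    elif strand == "-":
        # On the minus strand the transcript runs right to left on the genome.
        donor = range(exon_start - 8, exon_start + 3)
        acceptor = range(exon_end - 1, exon_end + 13)
    else:
        raise ValueError("strand must be '+' or '-'")
    if pos in donor:
        return "donor"
    if pos in acceptor:
        return "acceptor"
    return None


# Hypothetical exon at 1000-1100 on the plus strand.
examples = {p: splice_region(p, 1000, 1100, "+") for p in (987, 988, 1001, 1002, 1098, 1108, 1109)}
```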

Page 18: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

My Exome Analyzed

Start: 235,689

847

234,842

224,914

9,928

9,069

807

859

40

242 13

59 565

0

624

624

255

20

20

20

0

0

598

644

Page 19: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

• Pathogenic OTC Variant

• What if I got this through BabySeq?

Page 20: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Agenda

1. How I Met My Exomes
2. Getting High Quality Variant Calls
3. Clinical Grade Candidate Variant Identification
4. Data Sharing and the Maturing of Public Resources
5. NGS Clinical Utopia: Are We There Yet?

Page 21: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Annotating against Transcripts

RefSeqGenes – Versioned on RNA sequence

- Annotated against human reference by “Annotation Releases”

- Last on GRCh37 was 105 (2013-08-20) – GRCh38 release 106 (2014-01-17)

- 84,950 transcripts, most are “predicted” (“XM_” and non-coding)

- Standard in US for reporting variation (NM_016335.4:c.123C>T etc)

- UCSC grabs RNA from RefSeq directly and maps to their genome references “continuously”

Ensembl – Versioned on Alignment

- GENCODE: Well curated subset of high-quality, validated transcripts

- V75 last version of GRCh37, 2014-06-27

- Many specific bio-types, but protein_coding used for annotation

- Has mappings to RefSeq IDs, but
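
Because RefSeq transcripts are versioned on the RNA sequence, the version number is part of the variant's identity and should be kept when parsing reported variants such as NM_016335.4:c.123C>T. A minimal sketch that handles only simple exonic substitutions in c. notation (intronic offsets, UTR positions, indels and everything else in HGVS are deliberately out of scope); the names below are illustrative:

```python
import re
from typing import NamedTuple, Optional


class CodingSubstitution(NamedTuple):
    accession: str  # e.g. "NM_016335"
    version: int    # e.g. 4
    position: int   # coding position (c. numbering)
    ref: str
    alt: str


# Accession, mandatory version, then a single-base coding substitution.
_HGVS_C_SUB = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r":c\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_coding_substitution(hgvs: str) -> Optional[CodingSubstitution]:
    """Return the parsed parts, or None if the string is not a simple c. substitution."""
    m = _HGVS_C_SUB.match(hgvs.strip())
    if m is None:
        return None
    return CodingSubstitution(
        m["accession"], int(m["version"]), int(m["position"]), m["ref"], m["alt"]
    )


parsed = parse_coding_substitution("NM_016335.4:c.123C>T")
unversioned = parse_coding_substitution("NM_016335:c.123C>T")  # rejected: no version
```

Rejecting unversioned accessions is a design choice in this sketch: without the version there is no way to know which mRNA sequence the coordinate refers to.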

Page 22: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Reference Sequence Versus Gene Sequence

EMG1 on GRCh37

“Gap” of the mRNA coding sequence versus reference seq:

Handled differently by 3 different “gene alignments”

Page 23: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Reference Sequence Versus Gene Sequence

EMG1 on GRCh38

Reference sequence patched, no gap

Alignments agree

Page 24: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

RefSeq Accession Not Sufficient for Var-Tx Interaction

RefSeq defines transcripts as mRNA sequence

NCBI “Annotation Releases” (like v105) provides alignments using “Splign”

UCSC pulls RefSeq mRNA and aligns it themselves using “BLAT”

They can choose equally valid but different alignments for the same accession

This alignment of NM_052814.3 places the exon at dramatically different loci.

Will result in different annotations of any variant overlapping these exons

Page 25: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

COSMIC

Does not provide data in an easy-to-use form for NGS

Just announced a change in licensing, effective in March

- Access to the COSMIC website will stay free for all users.

- The new licensing strategy will charge for-profit organisations to download COSMIC datasets.

- Download by academic and non-profit organisations will remain free

2015 Roadmap:

- GRCh38

- More curation

- Visualization improvements

Page 26: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

ClinVar

Submitters:

- OMIM: Johns Hopkins

- Samuels

- Lab for Molecular Medicine

- Invitae

- Emory Genetics Lab

Star rating system

- 0-4 stars – level of review

ClinVar is designed to provide a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.

Page 27: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

ClinVitae: ClinVar and Friends by Invitae

Sources:

- ClinVar (62,913)

- Emory (13,365)

- ARUP (2,850)

- Carver Mut (199)

- K Cunningham (581)

79,907 variants, 9,189 genes

- 32,523 Pathogenic

- 38,796 Likely Pathogenic

Provided in HGVS

- 59,878 after mapping to genomic space

Page 28: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

BRCA: The back door to Myriad’s database

1995 – Patent issued to Myriad Genetics

June 2013 – Patents invalidated by ruling

A lab setting up Dx has a lot of catching up to do

“Free the Data” and other ways in which Myriad’s data is in ClinVar, etc.

Sharing Clinical Reports Project

Page 29: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

BRCA: In my wife

Page 30: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

HGMD

Data mines academic papers for reported functional variants

Also takes submissions, corrections reviewed by team

First available in 1996

- Originally 10k variants

- 105k in Public (2014)

- 148k in “Pro” (2014)

Page 31: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Left-Align Delta F508 to Make it Match

Page 32: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Left-Align Annotations

Using a Smith-Waterman algorithm to left-align variants from public databases shows non-obvious differences

NGS alignment and variant calling are always left-aligned

Left-align your databases so they can be annotated
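
To make the left-alignment idea concrete, here is a rough sketch. It is not the Smith-Waterman approach named on the slide; it is the simpler shift-and-trim normalization commonly used for VCF records, and it assumes the reference sequence is available as a plain string with 0-based offsets. `left_align` and the short sequence are illustrative; the context string is a toy chosen to resemble the region around the Delta F508 deletion, not taken from the slides:

```python
def left_align(seq, pos, ref, alt):
    """Left-align and trim one variant against a reference string.

    seq: reference sequence; pos: 0-based offset where `ref` starts in `seq`.
    Returns (pos, ref, alt) in the leftmost representation with one anchor base.
    """
    if seq[pos:pos + len(ref)] != ref:
        raise ValueError("ref allele does not match the reference sequence")
    if ref == alt:
        raise ValueError("ref and alt are identical; nothing to align")
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pull in one reference base on the left.
        if not ref or not alt:
            if pos == 0:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            ref, alt = seq[pos] + ref, seq[pos] + alt
            changed = True
    # Trim shared leading bases, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


context = "ATCATCTTTGGT"
# The same 3-base deletion written further to the right ("TCTT" -> "T" at offset 4).
aligned = left_align(context, 4, "TCTT", "T")
```

Applying the same function to both the database record and the NGS call gives one canonical representation, so a plain (position, ref, alt) match is enough to annotate.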

Page 33: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Changes in Monthly Updates

• 36 variants went missing from the December to the January release

• Some were Pathogenic
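
A check like this can be automated by diffing consecutive releases. A minimal sketch, assuming each release has been loaded into a list of dicts with `chrom`, `pos`, `ref`, `alt` and `significance` fields; the field names, helper names and toy records are hypothetical, not the database's actual schema:

```python
def variant_key(record):
    """Identity of a variant for comparison across releases."""
    return (record["chrom"], record["pos"], record["ref"], record["alt"])


def diff_releases(old_records, new_records):
    """Return (removed, added, reclassified) between two releases."""
    old = {variant_key(r): r for r in old_records}
    new = {variant_key(r): r for r in new_records}
    removed = [old[k] for k in sorted(old.keys() - new.keys())]
    added = [new[k] for k in sorted(new.keys() - old.keys())]
    reclassified = [
        (old[k], new[k])
        for k in sorted(old.keys() & new.keys())
        if old[k]["significance"] != new[k]["significance"]
    ]
    return removed, added, reclassified


# Toy releases with made-up records.
december = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "significance": "Pathogenic"},
    {"chrom": "2", "pos": 200, "ref": "C", "alt": "T", "significance": "Benign"},
    {"chrom": "3", "pos": 300, "ref": "G", "alt": "A", "significance": "Uncertain significance"},
]
january = [
    {"chrom": "2", "pos": 200, "ref": "C", "alt": "T", "significance": "Likely benign"},
    {"chrom": "3", "pos": 300, "ref": "G", "alt": "A", "significance": "Uncertain significance"},
    {"chrom": "4", "pos": 400, "ref": "T", "alt": "C", "significance": "Pathogenic"},
]
removed, added, reclassified = diff_releases(december, january)
lost_pathogenic = [r for r in removed if r["significance"] == "Pathogenic"]
```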

Page 34: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

ClinVar’s VCF File

• ClinVar currently relies on their dbSNP identifier mappings to “build” VCF files

• There are ~14,000 small variants in their database without dbSNP identifiers, and thus missing from the VCF

• ~5K Pathogenic

• Often these variants are in newer dbSNP builds, and the ClinVar mappings are just not updated.

• This variant was in ClinVar, with genomic coordinates, but no RSID:

- HGVS(c.): NM_002894.2:c.298C>T

- Chromosome:Start:Stop: 18:20548818:20548818

- (Recently RSID was added)

Page 35: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

dbSNP 141 Had Allele Errors

I reported the issue 7/22/2014

Confirmed 8/12; generated better VCF and placed in “test” folder

Found more issues

Replaced official VCF on 02/09/2014

We waited until fixed to publish official support

Page 36: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Agenda

1. How I Met My Exomes
2. Getting High Quality Variant Calls
3. Clinical Grade Candidate Variant Identification
4. Data Sharing and the Maturing of Public Resources
5. NGS Clinical Utopia: Are We There Yet?

Page 37: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants


Page 38: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants
Page 39: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

NM_002626.4:c.1877G>C in PFKL

NP_002617.3:p.Arg626Pro missense mutation

Predicted damaging by 4/5 functional predictions

VEST3: 0.948, GERP++: 4.59

ExAC and 1kG have a G>A, but G>C is novel

Variants in region are extremely rare (G>C ExAC 4 of 122,364 alleles) – 0.003%

No ClinVar variants for gene

OMIM entry has no known disease association

PubMed search shows few recent articles: Most recent 1998 paper showed

- phosphofructokinase (PFKL) overexpressed in Down syndrome (DS)

- Transgenic PFKL mice had an abnormal glucose metabolism with reduced clearance rate from blood and enhanced metabolic rate in brain.

Page 40: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants


Page 41: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants


35 LoF Variants, None Homozygous

Page 42: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Training

Most variants are rare or novel

- Training to interpret these is extensive

MD/Pathology background is insufficient

Need a PhD in molecular genetics

There are only 500 board-certified Clinical Molecular Geneticists since certification started

Let’s share in the learning process

Baylor Exome Sign-Out

Page 43: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Phenotyping and Matchmaking Portals

Diagnosis often requires finding another family to confirm a novel gene to phenotype association

Finding a second family:

- Social media

- PhenoDB

- PhenomeCentral.org

- Orphanet – Resources on over 6000 rare diseases and orphan drugs.

- European centric: GEN2PHEN (G2P)

Matt Might found a second family with NGLY1 deficiency through a blog post that went viral.

Page 44: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

N-Glycanase Deficiency

http://www.ngly1.org/

Matthew Might and Matt Wilsey. The shifting model in clinical diagnostics: how next-generation sequencing and families are altering the way rare diseases are discovered, studied, and treated. Genetics in Medicine. March 2014.

Page 45: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Thank you

Heidi Rehm – Chief Laboratory Director at Laboratory for Molecular Medicine, PCPGM

Joel Parker – Cancer Genetics, UNC Chapel Hill

Gerry Higgins – VP, Pharmacogenomic Science, Assure Rx Health

Frank Schacherer – Chief Technical Officer, BIOBASE

Reece Hart – Computational Biologist, Invitae (now 23andMe)

Greta Linse Peterson – Director of Product Management and Quality, Golden Helix

Page 46: 2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpreting Genomic Variants

Questions?