MES7594-01 Genome Informatics I - Lecture VIII. Interpreting variants Sangwoo Kim, Ph.D. Assistant...

Post on 29-Dec-2015

Genome Informatics I (2015 Spring)

MES7594-01 Genome Informatics I

- Lecture VIII. Interpreting variants

Sangwoo Kim, Ph.D., Assistant Professor,

Severance Biomedical Research Institute, Yonsei University College of Medicine

Overview

• Goal of this lecture
  – You will learn how to interpret discovered variants, filtering and prioritizing them for an associated phenotype (e.g. disease), and practice doing so
• Predicting functional impact of variants
  – Utilizing sequence features
  – Utilizing protein features
• Popular methods and practice
  – PolyPhen-2
  – MutationAssessor
  – SeattleSeq

FUNCTIONAL IMPACT OF VARIANTS

We usually have too many variants

Saksena et al., “Developing Algorithms to Discover Novel Cancer Genes: A look at the challenges and approaches”

We want to narrow the set of “called” variants down to as few as possible

Simple mutation calling does not give you the final answer

Mutation calling (NGS) yields a lot of candidate variants:
• some from sequencing error
• some from polymorphisms
• some from mapping error
• some are passengers
• a few real pathogenic variants

Gold mining

Bunch of candidate variants: many variants → a few variants

Strategy I: Do they really exist?
– Any mistakes in sequencing and variant calling?
– Any non-disease-causing polymorphisms?

Strategy II: Are they functional?
– Are they damaging? Pathogenic?
– Are they related to phenotypes?

Five ways to narrow down

1. Include control data
   – eliminate germline variants
2. Use a stricter variant quality threshold
   – work only on confident variants
3. Filter out polymorphisms
   – remove non-damaging polymorphisms
4. Predict functional impacts
   – estimate how damaging each variant is
5. Use disease-specific knowledge
   – to acquire final candidates

Strategy I: ways 1 to 3
Strategy II: ways 4 and 5

1. Include control data

[Figure: number of variants per sample. Germline: 100,000~500,000; somatic: 100~1000]

We should eliminate unwanted germline variants

When controls are unavailable

• Single nucleotide polymorphism rate = 1/100~1/1000 per base
• Whole Genome Sequencing
  – Total DNA length = 3 billion bp
  – Expected SNP numbers = 3~30 million
• Whole Exome Sequencing
  – Total DNA length = 50 million bp
  – Expected SNP numbers = 50~500 thousand
• Targeted Sequencing (Panel)
  – Total DNA length = 100~1000 thousand bp
  – Expected SNP numbers = 1000~10,000
• Hotspot Panel (only for very well known variants)
  – Controls can be omitted
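The expected SNP numbers above are the region length multiplied by the polymorphism rate. A minimal sketch of that arithmetic follows; the function name `expected_snp_range` is made up for this example, and the panel size of 1,000,000 bp is simply the upper end of the slide's range.

```python
# Expected SNP counts for a region, using the slide's rate range of
# 1/1000 to 1/100 polymorphisms per base.

def expected_snp_range(length_bp):
    """Return (low, high) expected SNP counts for a region of length_bp bases."""
    low = length_bp // 1000   # rate = 1/1000
    high = length_bp // 100   # rate = 1/100
    return low, high

# Region sizes from the slide; the panel entry uses the upper end
# (1000 thousand bp) of the stated panel size range.
regions = {
    "whole genome": 3_000_000_000,
    "whole exome": 50_000_000,
    "targeted panel": 1_000_000,
}

for name, length_bp in regions.items():
    low, high = expected_snp_range(length_bp)
    print(f"{name}: {low:,} to {high:,} expected SNPs")
```

Even a small panel is expected to carry far more polymorphisms than true somatic mutations, which is why polymorphism filtering matters when no control is sequenced.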

2. Use a stricter quality threshold

• Variant quality
  – Low variant quality: this variant (although it has been called) can be false
  – Causes of low quality
    • Low read depth (insufficient observation)
    • Bad basecall/mapping quality
    • Low allele frequency
• Possible actions
  – Discard variants based on
    • Variant quality (e.g. QUAL < 10)
    • Total read depth (e.g. < 20)
    • Alt-allele read depth (e.g. < 5)
    • Allele frequency (e.g. < 0.1)
  – Prioritize variants
    • Sort by variant quality and inspect from the top
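The two actions above (threshold filtering, then ranking by quality) can be sketched as follows. This is only an illustration: the `Variant` class, its field names and the sample records are invented here, and the default thresholds are the example values from the slide rather than recommended settings.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str        # made-up label for the example
    qual: float      # variant quality (QUAL)
    depth: int       # total read depth
    alt_depth: int   # reads supporting the alternate allele

    @property
    def allele_freq(self):
        # Fraction of reads that support the alternate allele.
        return self.alt_depth / self.depth if self.depth else 0.0

def passes(v, min_qual=10, min_depth=20, min_alt_depth=5, min_af=0.1):
    """True if the variant clears every example threshold from the slide."""
    return (v.qual >= min_qual and v.depth >= min_depth
            and v.alt_depth >= min_alt_depth and v.allele_freq >= min_af)

def filter_and_rank(variants, **thresholds):
    """Drop low-confidence calls, then sort the rest by quality, best first."""
    kept = [v for v in variants if passes(v, **thresholds)]
    return sorted(kept, key=lambda v: v.qual, reverse=True)

# Made-up records, one per failure mode plus two that pass.
variants = [
    Variant("var_a", qual=55.0, depth=80, alt_depth=30),
    Variant("var_b", qual=8.0, depth=60, alt_depth=20),     # low QUAL
    Variant("var_c", qual=40.0, depth=12, alt_depth=6),     # low total depth
    Variant("var_d", qual=90.0, depth=100, alt_depth=4),    # low alt depth
    Variant("var_e", qual=70.0, depth=200, alt_depth=15),   # allele freq 0.075
    Variant("var_f", qual=99.0, depth=50, alt_depth=25),
]

ranked = filter_and_rank(variants)
print([v.name for v in ranked])
```

Keeping the thresholds as keyword arguments makes it easy to tighten or relax one criterion at a time and see how many candidates remain.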

3. Filter out polymorphisms

• When you had no control data (panel)
  – Check whether the variants have been reported as polymorphisms
• When you had control data
  – You may not have polymorphisms
    • Because somatic mutation callers remove germline calls
  – However, in some cases polymorphisms can be reported (as somatic mutations)
    • For example, low read depth in the control sample

[Figure labels: low depth / bad region: variant undetected; variant detected]

dbSNP

• Database of SNPs

chr7:11584142 A>T
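Checking whether a called variant has already been reported as a polymorphism amounts to a lookup keyed on chromosome, position and alleles. The sketch below uses a small in-memory set as a hypothetical stand-in for a polymorphism table; it does not query dbSNP, and the slide's example variant is placed in the set purely for illustration.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalize a variant into a hashable lookup key."""
    return (chrom, int(pos), ref.upper(), alt.upper())

# Hypothetical stand-in for a table of reported polymorphisms.
# The slide's example is included here only to demonstrate the lookup.
known_polymorphisms = {
    variant_key("chr7", 11584142, "A", "T"),
}

# Called variants; the second and third entries are made up.
called = [
    ("chr7", 11584142, "A", "T"),
    ("chr7", 11584142, "A", "G"),   # same position, different allele
    ("chr22", 1000, "C", "T"),
]

def split_by_polymorphism(calls, known):
    """Separate calls into (reported polymorphisms, remaining candidates)."""
    reported, remaining = [], []
    for call in calls:
        target = reported if variant_key(*call) in known else remaining
        target.append(call)
    return reported, remaining

reported, remaining = split_by_polymorphism(called, known_polymorphisms)
print("reported:", reported)
print("remaining:", remaining)
```

Matching on the alternate allele as well as the position keeps a different substitution at a known polymorphic site among the candidates.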

4. Predict functional impacts

• Types of point mutations
  – Coding mutations
    • Synonymous (silent): amino acid unchanged
    • Missense: amino acid changed
    • Nonsense: stop codon gained
    • Readthrough: stop codon lost
  – Non-coding mutations
    • Intron
    • Splice variants
    • Variants in regulatory elements
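The four coding categories above follow directly from translating the reference and mutated codons. Here is a minimal sketch using the standard genetic code; the function name `classify_codon_change` is invented for this example, and it assumes the codons are already given in the coding-strand reading frame.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_codon_change(ref_codon, alt_codon):
    """Classify a point mutation by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"     # amino acid unchanged
    if alt_aa == "*":
        return "nonsense"       # stop codon gained
    if ref_aa == "*":
        return "readthrough"    # stop codon lost
    return "missense"           # amino acid changed

for ref, alt in [("GAA", "GAG"), ("GAA", "GTA"), ("TGG", "TGA"), ("TAA", "CAA")]:
    print(ref, "->", alt, classify_codon_change(ref, alt))
```

Non-coding variants (intronic, splice-site, regulatory) cannot be classified this way because they do not change a codon.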

Functional impacts

• Types of indels
  – Inframe
    • Insertion or deletion in a multiple of 3 base pairs
  – Frameshift
    • Insertion or deletion whose length is not a multiple of 3 base pairs
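Whether a coding indel is inframe or frameshift depends only on whether the number of inserted or deleted bases is a multiple of 3. A small sketch follows, assuming VCF-style reference and alternate allele strings; the function name `classify_indel` is made up here.

```python
def classify_indel(ref, alt):
    """Classify a coding indel from its reference and alternate alleles."""
    length_change = abs(len(alt) - len(ref))
    if length_change == 0:
        raise ValueError("not an indel: ref and alt have the same length")
    # A change in a multiple of 3 bp preserves the downstream reading frame.
    return "inframe" if length_change % 3 == 0 else "frameshift"

print(classify_indel("A", "ATGC"))   # 3 bp insertion
print(classify_indel("ATG", "A"))    # 2 bp deletion
```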

General classification (priority)

[Figure labels: high-impact, low-incidence, low-confidence; high incidence]

Functional impact prediction of missense mutations

• How critical is an AA change to its protein function?
  – Amino acid conservation
    • If the AA is essential, it would be conserved through evolution
  – Amino acid in protein conformation
    • Substitution of an AA in the active site would be more damaging
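To make the conservation idea concrete, one very simple measure is the fraction of aligned sequences that share the most common residue at a position. The sketch below is a toy illustration with a made-up alignment; it is not the scoring used by PolyPhen-2 or MutationAssessor, and the function name `column_conservation` is invented for this example.

```python
from collections import Counter

def column_conservation(aligned_seqs, column):
    """Fraction of non-gap residues equal to the most common residue."""
    residues = [seq[column] for seq in aligned_seqs if seq[column] != "-"]
    if not residues:
        return 0.0
    most_common_count = Counter(residues).most_common(1)[0][1]
    return most_common_count / len(residues)

# Made-up alignment of the same protein region from four species.
alignment = [
    "MKTAH",
    "MKSAH",
    "MRTAH",
    "MKTGH",
]

scores = [column_conservation(alignment, i) for i in range(len(alignment[0]))]
print(scores)
```

Under this toy measure, a substitution at a fully conserved position (score 1.0) would be ranked as more likely damaging than one at a variable position.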

Amino acid conservation

Protein Structure

5. Use disease-specific knowledge

• Your knowledge about the disease
  – e.g. cancer
  – “Has it been reported in other previous samples?”
  – Search for it in COSMIC; if you find it is recurrent, it is likely to be functional
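The recurrence argument can be illustrated by counting in how many previous samples each candidate variant was seen. This sketch works on a made-up in-memory dictionary and does not access COSMIC; the gene and variant labels are placeholders, and the threshold of two samples is an arbitrary assumption.

```python
from collections import Counter

def recurrence_counts(samples):
    """Count, for each variant, the number of samples that carry it."""
    counts = Counter()
    for variants in samples.values():
        counts.update(set(variants))   # count each variant once per sample
    return counts

def recurrent(candidates, counts, min_samples=2):
    """Keep candidates seen in at least min_samples previous samples."""
    return [v for v in candidates if counts[v] >= min_samples]

# Placeholder data: sample id -> variants reported in that sample.
previous_samples = {
    "sample_1": ["GENE_A:var1", "GENE_B:var2"],
    "sample_2": ["GENE_A:var1", "GENE_C:var3"],
    "sample_3": ["GENE_A:var1", "GENE_B:var2", "GENE_B:var2"],
}

counts = recurrence_counts(previous_samples)
candidates = ["GENE_A:var1", "GENE_C:var3", "GENE_D:var4"]
print(recurrent(candidates, counts))
```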

Five ways to narrow down

1. Include control data
   – eliminate germline variants
2. Use a stricter variant quality threshold
   – work only on confident variants
3. Filter out polymorphisms
   – remove non-damaging polymorphisms
4. Predict functional impacts
   – estimate how damaging each variant is
5. Use disease-specific knowledge
   – to acquire final candidates

Many, uncertain variants → a few, reliable variants → functional study, mechanism study

SUMMARY OF PART I

- Connect to Linux cluster, job script writing and submission
- NGS technologies, NGS data
- Short read alignment
- Variant calling, CNV, SV calling
- Interpretation of discovered variants

In the remaining classes

• Genomic data to expression data
  – Gene → mRNA → Protein → Pathways and Networks → Phenotype
• Use high-throughput data for your study
• Don’t forget your project

PRACTICE - FUNCTIONAL VARIANT ANNOTATION WITH SEATTLESEQ

Today’s data

• Somatic variants in chr22 of an anonymous cancer sample, called with Virmid
• Data location
  – /scratch/2015_GenomeInformatics/{yourdir}/virmidoutput
  – If you did not complete the somatic calling practice, copy it from /scratch/2015_GenomeInformatics/public

Data download to local PC
① Move to your Virmid output directory
② Check your Virmid output
③ Click FTP
④ Double-click

SeattleSeq
Search for it, then click the result

SeattleSeq
① Enter your email
② Input your VCF file
③ Check
④ Check

Open the annotated file
① Click File > Open...
② Select ‘All files’
③ Select the annotated file

Filtering phase

• accession (column H)
  – for filtering curated isoforms
    • NM: mRNA
    • XM: predicted mRNA model (filter out)
• functionGVS (column I)
  – for filtering damaging mutation type
    • missense, missense-near-splice
    • stop-gain, stop-loss
    • splice-donor, splice-acceptor
    • The others (filter out)
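The same two filters can also be applied in a script instead of a spreadsheet. The sketch below assumes a tab-delimited table with `accession` and `functionGVS` columns, as named on the slide; the rows and accession numbers are made-up placeholders, and the exact category spellings in a real SeattleSeq output file should be checked before reusing the set.

```python
import csv
import io

# Damaging mutation types to keep, spelled as on the slide (assumed spellings).
DAMAGING = {
    "missense", "missense-near-splice",
    "stop-gain", "stop-loss",
    "splice-donor", "splice-acceptor",
}

def keep_row(row):
    """Keep curated (NM) isoforms that carry a damaging mutation type."""
    return row["accession"].startswith("NM") and row["functionGVS"] in DAMAGING

# Made-up annotated table; a real file would be opened with open(path).
text = (
    "position\taccession\tfunctionGVS\n"
    "1001\tNM_000000.0\tmissense\n"
    "1002\tXM_000000.0\tmissense\n"
    "1003\tNM_000000.0\tintron\n"
    "1004\tNM_000000.0\tstop-gain\n"
)

rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
kept = [row for row in rows if keep_row(row)]
print([row["position"] for row in kept])
```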

IGV download
Search for it, then click the result
Download, then double-click

IGV view
① Input disease BAM file
② Input normal BAM file
③ Input VCF file