26th International Mammalian Genome Conference 2012

26th International Mammalian Genome Conference 2012 Bioinformatics Workshop Sunday, October 21, 2012 09.00 – 12.00 Location: Tarpon Room @IMGC2012 #IMGC2012 Wi-Fi: twgroup / password: group5500


Transcript of 26th International Mammalian Genome Conference 2012

Page 1: 26th International Mammalian  Genome Conference 2012

26th International Mammalian Genome Conference 2012

Bioinformatics Workshop

Sunday, October 21, 2012, 09.00 – 12.00

Location: Tarpon Room

@IMGC2012 #IMGC2012

Wi-Fi: twgroup / password: group5500

Page 2: 26th International Mammalian  Genome Conference 2012

IMGS 2012 Bioinformatics Workshop

Deanna Church, NCBI

Carol Bult, The Jackson Laboratory

Page 3: 26th International Mammalian  Genome Conference 2012

Tutorial Resources

• Galaxy: https://main.g2.bx.psu.edu/

• Genome Analysis for Biologists: http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/

• NCBI 1000 Genomes Browser: http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/

• Genome Reference Consortium: http://genomereference.org/

Page 4: 26th International Mammalian  Genome Conference 2012

Schedule

9-10 am: Intro
• Genome Assembly Basics
• Alignment Basics

10-11 am: Getting Stuff Done
• File formats (sequences, alignments, annotations)

11 am-12 pm: Doing stuff
• Typical RNA-Seq workflow
• RNA-Seq in Galaxy
• Differential Gene Expression with RNA-Seq data

Page 5: 26th International Mammalian  Genome Conference 2012

Assembly Basics

19 Oct 2012

Page 6: 26th International Mammalian  Genome Conference 2012

Some assembly required…

Page 7: 26th International Mammalian  Genome Conference 2012

Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb

End-sequence all clones and retain pairing information (“mate-pairs”)

Find sequence overlaps

Each end sequence is referred to as a read

WGS contig

tails

WGS: Sanger Reads, Overlap-Layout-Consensus
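
The "find sequence overlaps" step can be illustrated with a minimal sketch. It assumes exact suffix/prefix matches between two error-free reads; real assemblers tolerate sequencing errors and use mate-pair information. The function names and the two reads are made up for illustration.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def merge_reads(a: str, b: str, min_len: int = 3):
    """Join two overlapping reads into one contig, or return None if they do not overlap."""
    overlap = suffix_prefix_overlap(a, b, min_len)
    if overlap == 0:
        return None
    return a + b[overlap:]


# Hypothetical reads, not taken from any real data set.
read_1 = "ATGGCGTGCA"
read_2 = "GTGCAATCGT"
contig = merge_reads(read_1, read_2)
```

Here the last five bases of the first read match the first five bases of the second, so the two reads collapse into a single 15-base contig.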

Page 8: 26th International Mammalian  Genome Conference 2012

http://schatzlab.cshl.edu/teaching/2010/Lecture%203%20-%20Graphs%20and%20Genomes.pdf

Page 9: 26th International Mammalian  Genome Conference 2012

Alignable trace count in frameshift window vs control in Opossum: 51nt window, >95% identity

23,894 genes

452 models with >1 exon, sym. best hit, and one frameshift

334 cases have 3 or fewer hits

Alexander Souvorov, NCBI

Page 10: 26th International Mammalian  Genome Conference 2012

Fragmented genomes tend to have fewer frameshifts

Alexander Souvorov, NCBI

Page 11: 26th International Mammalian  Genome Conference 2012

Fragmented genomes tend to have more partial models

Alexander Souvorov, NCBI

Page 12: 26th International Mammalian  Genome Conference 2012

BAC insert

BAC vector

Shotgun sequence

Assemble

Fold sequence

Gaps

Deeper sequence coverage rarely resolves all gaps

“Finishers” go in to manually fill the gaps, often by PCR

Clone based assemblies

Page 13: 26th International Mammalian  Genome Conference 2012

Scaffold N50 by chromosome
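
For readers unfamiliar with the statistic: N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembled bases. A minimal sketch follows; the scaffold lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical scaffold lengths (kb): total 290, so half is 145.
scaffolds = [80, 70, 50, 40, 30, 20]
scaffold_n50 = n50(scaffolds)  # 80 + 70 = 150 >= 145, so N50 is 70
```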

Page 14: 26th International Mammalian  Genome Conference 2012

7 May 2010

Spanned Gaps by Assembly

Page 15: 26th International Mammalian  Genome Conference 2012

Church et al., 2011 PLoS Biology

http://genomereference.org

Page 16: 26th International Mammalian  Genome Conference 2012

NCBI36 (hg18)

GRCh37 (hg19)

Page 17: 26th International Mammalian  Genome Conference 2012

NCBI35 (hg17)

GRCh37 (hg19)

AL139246.20

AL139246.21

Page 18: 26th International Mammalian  Genome Conference 2012

Build sequence contigs based on contigs defined in TPF (Tiling Path File).

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

Switch point

Consensus sequence

Page 19: 26th International Mammalian  Genome Conference 2012

NCBI36

Page 20: 26th International Mammalian  Genome Conference 2012

nsv832911 (nstd68) Submitted on NCBI35 (hg17)

Page 21: 26th International Mammalian  Genome Conference 2012

NCBI35 (hg17) Tiling Path

GRCh37 (hg19) Tiling Path

Gap Inserted

Moved approximately 2 Mb distal on chr15

NC_000015.8 (chr15)

NC_000015.9 (chr15)

Removed from assembly

Added to assembly

HG-24

Page 22: 26th International Mammalian  Genome Conference 2012

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

Page 23: 26th International Mammalian  Genome Conference 2012

AC074378.4 AC079749.5

AC134921.2 AC147055.2

AC140484.1 AC019173.4

AC093720.2 AC021146.7

NCBI36 NC_000004.10 (chr4) Tiling Path

Xue Y et al., 2008

TMPRSS11E TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path

AC074378.4 AC079749.5

AC134921.1 AC147055.2

AC093720.2 AC021146.7

TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)

AC074378.4 AC140484.1

AC019173.4 AC226496.2

AC021146.7

TMPRSS11E2

nsv532126 (nstd37)

Page 24: 26th International Mammalian  Genome Conference 2012

GRCh37 (hg19)

http://genomereference.org

7 alternate haplotypes at the MHC

Alternate loci released as:
• FASTA
• AGP
• Alignment to chromosome

UGT2B17 MHC MAPT

Page 25: 26th International Mammalian  Genome Conference 2012

Assembly (e.g. GRCh37.p2)

Primary Assembly

Non-nuclear assembly unit (e.g. MT)

ALT 1

ALT 2

ALT 3

ALT 4

ALT 5

ALT 6

ALT 7

ALT 8

ALT 9

PAR

Patches…

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Genomic Region (ABO)

Genomic Region (SMA)

Genomic Region (PECAM1)

Page 26: 26th International Mammalian  Genome Conference 2012

MHC (chr6)

Chr 6 representation (PGF)

Alt_Ref_Locus_2 (COX)

Page 27: 26th International Mammalian  Genome Conference 2012

Richa Agarwala

Eugene Yaschenko

Page 28: 26th International Mammalian  Genome Conference 2012
Page 29: 26th International Mammalian  Genome Conference 2012

GenBank

Data Archives

• Data in a common format
• Data in a single location (and mirrored)
• Most quality checked prior to deposition
• Robust data tracking mechanism (accession.version)
• Data owned by submitter

Page 30: 26th International Mammalian  Genome Conference 2012

Data tracking

ABC14-1065514J1

Gaps Phase Length Date

FP565796.1 1 1 21-Oct-2009

FP565796.2 1 0 14-Oct-2010

FP565796.3 3 0 07-Nov-2010
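
The accession.version scheme can be handled mechanically: the part before the final dot names the record and the integer after it increases with every sequence update. A minimal sketch, assuming identifiers always carry a version suffix; the helper names are illustrative and not part of any NCBI library.

```python
from typing import NamedTuple


class Accession(NamedTuple):
    accession: str
    version: int


def parse_accession(identifier: str) -> Accession:
    """Split an 'accession.version' identifier such as 'FP565796.3'."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return Accession(accession, int(version))


def latest_versions(identifiers):
    """Map each accession to the highest version seen in `identifiers`."""
    latest = {}
    for identifier in identifiers:
        parsed = parse_accession(identifier)
        if parsed.version > latest.get(parsed.accession, 0):
            latest[parsed.accession] = parsed.version
    return latest


# Identifiers that appear on the slides.
seen = ["FP565796.1", "FP565796.3", "FP565796.2", "AL139246.20", "AL139246.21"]
current = latest_versions(seen)
```

Comparing versions as integers rather than strings matters once a record passes version 9, as AL139246 does.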

Page 31: 26th International Mammalian  Genome Conference 2012

Mouse chrX: 35,000,000-36,000,000

Page 32: 26th International Mammalian  Genome Conference 2012

Mouse chrX: 35,000,000-36,000,000

X

MGSCv3 MGSCv36

Page 33: 26th International Mammalian  Genome Conference 2012

Unique Identification

NC_000086.6: chrX in MGSCv36

List of scaffolds and gaps (AGP)

List of components and gaps (AGP)
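
An AGP file describes how an object (a chromosome or scaffold) is built from components and gaps, one tab-delimited line per part. The sketch below assumes the nine-column layout of the AGP 2.x specification, where component types N and U mark gaps and column 6 then holds the gap length instead of a component ID. The object name, component names and coordinates are hypothetical.

```python
from typing import NamedTuple, Optional

GAP_TYPES = {"N", "U"}  # N: gap of known size, U: gap of unknown size


class AgpRecord(NamedTuple):
    obj: str
    obj_beg: int
    obj_end: int
    part_number: int
    component_type: str
    component_id: Optional[str]  # None on gap lines
    gap_length: Optional[int]    # None on component lines


def parse_agp_line(line: str) -> AgpRecord:
    """Parse one non-comment AGP line into an AgpRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    obj, beg, end, part, ctype = fields[:5]
    if ctype in GAP_TYPES:
        return AgpRecord(obj, int(beg), int(end), int(part), ctype, None, int(fields[5]))
    return AgpRecord(obj, int(beg), int(end), int(part), ctype, fields[5], None)


# Hypothetical three-part object: component, gap, component.
EXAMPLE = [
    "chrT\t1\t1000\t1\tF\tCOMP1.1\t1\t1000\t+",
    "chrT\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "chrT\t1101\t1600\t3\tF\tCOMP2.1\t1\t500\t-",
]
records = [parse_agp_line(line) for line in EXAMPLE]
gap_bases = sum(r.gap_length for r in records if r.gap_length is not None)
```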

Page 34: 26th International Mammalian  Genome Conference 2012

hg19 GRCh37

mm8 MGSCv37

NCBIM37

danRer5 Zv7

What’s in a name?

Page 35: 26th International Mammalian  Genome Conference 2012

What’s in a name?

Page 36: 26th International Mammalian  Genome Conference 2012

Assemblies with the same name aren’t always the same

chr21:8,913,216-9,246,964

Page 37: 26th International Mammalian  Genome Conference 2012

Assemblies with the same name aren’t always the same

Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX

Page 38: 26th International Mammalian  Genome Conference 2012

hg19 GRCh37

GRCh37.p2

GCA_000001405.1

Assembly Database to the rescue

GCA_000001405.3

Page 39: 26th International Mammalian  Genome Conference 2012

http://www.ncbi.nlm.nih.gov/genome/assembly

GRCh37 hg19

Page 40: 26th International Mammalian  Genome Conference 2012
Page 41: 26th International Mammalian  Genome Conference 2012

Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17

Primary Assembly: GCA_000001305.1 / GCF_000001305.13

ALT 1: GCA_000001315.1 / GCF_000001315.1

ALT 2: GCA_000001325.1 / GCF_000001325.2

ALT 3: GCA_000001335.1 / GCF_000001335.1

ALT 4: GCA_000001345.1 / GCF_000001345.1

ALT 5: GCA_000001355.1 / GCF_000001355.1

ALT 6: GCA_000001365.1 / GCF_000001365.2

ALT 7: GCA_000001375.1 / GCF_000001375.1

ALT 8: GCA_000001385.1 / GCF_000001385.1

ALT 9: GCA_000001395.1 / GCF_000001395.1

Patches: GCA_000005045.5 / GCF_000005045.4

Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1

Page 42: 26th International Mammalian  Genome Conference 2012

GenBank vs RefSeq

GenBank: Submitter owned; Redundant; Updated rarely; INSDC

RefSeq: RefSeq owned; Non-redundant; Curated; Not INSDC

BRCA1

GenBank: 83 genomic records, 31 mRNA records, 27 protein records

RefSeq: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records

Page 43: 26th International Mammalian  Genome Conference 2012

Sequence Alignments Basics

Page 44: 26th International Mammalian  Genome Conference 2012

Hypothesis

Page 45: 26th International Mammalian  Genome Conference 2012

• The biological basis of sequence alignment is evolution

• Sequences that share a common ancestor are homologous
– Sequence similarity is evidence of homology
– Sequences, genes, etc. are homologous or not; there is no “percent homology”

Page 46: 26th International Mammalian  Genome Conference 2012

Homology

• Orthologous sequences
– Common ancestor; speciation

• Paralogous sequences
– Gene duplication within a species (lineage-specific expansion)

http://www.nature.com/nrd/journal/v2/n8/box/nrd1152_BX2.html

Page 47: 26th International Mammalian  Genome Conference 2012

Alignment to NR -> Homology

Alignment to an Assembly -> Mapping

Page 48: 26th International Mammalian  Genome Conference 2012
Page 49: 26th International Mammalian  Genome Conference 2012

Global and local alignments

Optimal global alignment

Needleman-Wunsch

Sequences align essentially from end to end

Optimal local alignment

Smith-Waterman

Sequences align only in small, isolated regions
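
The difference between the two approaches comes down to two small changes in the same dynamic programming recurrence. The sketch below computes scores only (no traceback) and assumes a simple scoring scheme chosen for illustration: match +1, mismatch -1, linear gap penalty -2. Real tools use substitution matrices and affine gap penalties.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Score of the optimal global (Needleman-Wunsch) or local (Smith-Waterman) alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the ends of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below zero.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning both sequences end to end. Local: best cell anywhere.
    return best if local else score[-1][-1]


# Hypothetical sequences sharing the core "GATTACA".
global_score = alignment_score("ACGT", "AGT")
local_score = alignment_score("AAAGATTACATTT", "CCGATTACAGG", local=True)
```

With these parameters the global alignment of ACGT and AGT scores 1 (three matches, one gap), and the local alignment recovers the shared seven-base core with a score of 7 regardless of the unrelated flanks.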

References

Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453.

Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.

Page 50: 26th International Mammalian  Genome Conference 2012
Page 51: 26th International Mammalian  Genome Conference 2012

http://en.wikipedia.org/wiki/Sequence_alignment

Page 52: 26th International Mammalian  Genome Conference 2012

Hashing methods

Query sequence: MVRRLPERTSTPACE

Word size = 3 (configurable)

Words: MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE
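
A minimal sketch of the hashing step: break the query into overlapping words, index them by position, then look up each word of a subject sequence to find seed hits. The query is the one on the slide; the subject sequence and function names are made up, and exact word matching is assumed (tools in this family also handle similar, non-identical words).

```python
from collections import defaultdict


def words(sequence, word_size=3):
    """Return all overlapping words of length `word_size`, in order."""
    return [sequence[i:i + word_size] for i in range(len(sequence) - word_size + 1)]


def build_index(sequence, word_size=3):
    """Map each word to the list of positions where it starts in `sequence`."""
    index = defaultdict(list)
    for position, word in enumerate(words(sequence, word_size)):
        index[word].append(position)
    return dict(index)


def seed_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where the two sequences share a word."""
    index = build_index(query, word_size)
    hits = []
    for subject_pos, word in enumerate(words(subject, word_size)):
        for query_pos in index.get(word, []):
            hits.append((query_pos, subject_pos))
    return hits


query = "MVRRLPERTSTPACE"
subject = "AAPERTSTGG"  # hypothetical subject sequence
hits = seed_hits(query, subject)
```

All four hits here lie on the same diagonal (query position minus subject position is 3), which is the signal a hashing aligner extends into a full alignment.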

References

Wilbur & Lipman (1983), PNAS 80, 726-30

Lipman & Pearson (1985), Science 227, 1435-1441

Pearson & Lipman (1988), PNAS 85, 2444-2448

Page 53: 26th International Mammalian  Genome Conference 2012
Page 54: 26th International Mammalian  Genome Conference 2012
Page 55: 26th International Mammalian  Genome Conference 2012
Page 56: 26th International Mammalian  Genome Conference 2012

http://wwwdev.ebi.ac.uk/fg/hts_mappers/

Fonseca et al., 2012

Page 57: 26th International Mammalian  Genome Conference 2012

Sensitivity vs. Specificity

Sensitivity = fraction of actual positives that are identified as positive (true positives, TP)

Specificity = fraction of actual negatives that are identified as negative (true negatives, TN)

                    Predicted positives   Predicted negatives
Actual positives    TP                    FN
Actual negatives    FP                    TN

Sensitivity = TP/(TP+FN)

Specificity = TN/(TN+FP)
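
A small worked example of the two formulas. The truth and prediction labels are hypothetical (think of them as "read truly maps here" versus "aligner reported a mapping here"), and the function names are illustrative.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP and TN from two equal-length lists of booleans."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    """Fraction of actual positives that were called positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were called negative."""
    return tn / (tn + fp)


# Hypothetical labels: 4 actual positives followed by 4 actual negatives.
actual = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]
tp, fn, fp, tn = confusion_counts(actual, predicted)
sens = sensitivity(tp, fn)  # 3 / (3 + 1)
spec = specificity(tn, fp)  # 3 / (3 + 1)
```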

Page 58: 26th International Mammalian  Genome Conference 2012

• Aligner technology specific?
• Gapped vs. ungapped alignments?
• Spliced alignments (cDNAs/RNA-Seq)?
• Can use paired-end data?

Page 59: 26th International Mammalian  Genome Conference 2012

Ruffalo et al., 2012

Page 60: 26th International Mammalian  Genome Conference 2012

Li and Homer, 2010

Page 61: 26th International Mammalian  Genome Conference 2012

Indels have correct and consistent alignment in reads after multiple sequence local realignment

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.

Phase 1: NGS data processing

Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!

Page 62: 26th International Mammalian  Genome Conference 2012

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

CDC27

Page 63: 26th International Mammalian  Genome Conference 2012
Page 64: 26th International Mammalian  Genome Conference 2012

Richa Agarwala

MHC Alternate locus

Alignment to chr6

Page 65: 26th International Mammalian  Genome Conference 2012

Mouse Ren1 chr1 (NC_000067.6): 133350674-133360320

NM_031192.3: transcript from C57BL/6J

NM_031193.2: transcript from FVB/N

Page 66: 26th International Mammalian  Genome Conference 2012
Page 67: 26th International Mammalian  Genome Conference 2012

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

CEPH: A=1.000 G=0

APOL1

Page 68: 26th International Mammalian  Genome Conference 2012

YRI: A=0.5852 G=0.4148

Multiple submissions

Frequency Data

1000G

Suspect

Sudmant et al., 2010