VectorBase

Frank Collins, Scott Emrich, Dan Lawson, Greg Madey

BRC PI/PM Meeting

Bethesda, MD, April 27, 2012

Genome Sizes

• Pediculus humanus: ~110 Mb, N50 = 488 kb
• Anopheles gambiae S: ~260 Mb, N50 = 1,505 kb
• Culex quinquefasciatus: ~580 Mb, N50 = 487 kb
• Aedes aegypti: ~1.3 Gb, N50 = 1,500 kb
• Ixodes scapularis: ~1.8 Gb, N50 = 72 kb
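
The N50 values quoted above summarise assembly contiguity: N50 is the length of the scaffold at which the cumulative length of scaffolds, sorted longest first, reaches half of the total assembly length. A minimal Python sketch of that calculation follows. The function name `n50` and the toy scaffold lengths are illustrative assumptions, not VectorBase data or code.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are sorted longest first; N50 is the length of the scaffold
    at which the running total first reaches half of the assembly size.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths in kb (not taken from any real assembly).
toy_scaffolds_kb = [1500, 900, 500, 300, 200, 100]
print(n50(toy_scaffolds_kb))
```

Because a few long scaffolds can dominate the running total, N50 is best read alongside the total assembly size, as the list above does.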

Future genomes

White papers

Sandflies: Lutzomyia longipalpis, Phlebotomus papatasi

Anopheles (AGCC): Anopheles arabiensis, Anopheles quadriannulatus, Anopheles merus, Anopheles melas, Anopheles christyi, Anopheles epiroticus, Anopheles stephensi, Anopheles maculatus, Anopheles funestus, Anopheles minimus, Anopheles culicifacies, Anopheles farauti, Anopheles dirus, Anopheles atroparvus, Anopheles albimanus

Glossina: Glossina palpalis, Glossina fuscipes, Glossina pallidipes, Glossina brevipalpis, Glossina austeni, Stomoxys calcitrans, Musca domestica

Simulium: Simulium vittatum, Simulium sirbanum, Simulium damnosum, Simulium ochraceum, Simulium squamosum, Simulium thyolense, Simulium sanctipauli, Simulium woodi, Simulium exiguum, Simulium yahense

Ticks & Mites: Leptotrombidium deliense, Ixodes scapularis*, Dermacentor variabilis, Ornithodoros turicata

Anopheles: Anopheles darlingi*, Anopheles stephensi

Others

Aedes: Aedes albopictus

i5K initiative

First New Release in New Contract

Challenges of vector genomes

• Relatively large genomes from species that are hard to inbreed

• Heterozygosity in sequencing samples (up to 80 different males were sequenced for the new gambiae genomes) causes dubious scaffolds.

• Inversions and heterochromatic regions induce gaps

• Newer-generation sequencing has reduced cost but has not yet matched the overall quality of earlier methods

• Non-trivial annotations

An. gambiae forms

M-form

• More permanent
• Available year-round
• Allows slower development
• Predator-rich

S-form

• Ephemeral
• Rainy-season dependent
• Requires rapid development
• Largely predator-free

Divergence across chromosome arms

[Figure: divergence plotted for chromosome arms 2L, 2R, 3L, 3R and X. Source: C. Cheng et al., unpublished]

Optical mapping DBP: Wisconsin

Size matters

Genome             MB optically mapped   Genes found
S Sanger           145,837.97            14162
S Illumina         58,192.13             14124
PEST               60,239.6              14324
Sanger + Illumina  204,030.1             14224

Annotation strategies

• Speeding up computational annotation
  • Use of MAKER system
  • Prediction by projection from ‘high quality’ reference

• Expanded use of RNA-Seq
  • Scripture, Trinity & Cufflinks/Bowtie

• Community engagement
  • Primarily deployed for new genomes (Glossina, Rhodnius)
  • Works for all other VectorBase genomes

de novo annotation

MAKER with RNA-Seq & reference proteomes

Aim:
• Gene prediction pipeline for the masses.
• Used for a number of arthropod genome projects
• Touted as the default pipeline for many more (part of the GMOD toolkit)

Overview
• ab initio gene predictions from SNAP, Augustus & FGENESH
• Final gene models from MAKER
• EST alignments from both EXONERATE and BLASTN
• Protein alignments from EXONERATE and BLASTX
• Repeats from RepeatFinder & RepeatMasker
• Additional data sets integrated via GFF3 files (RNA-Seq)
• Uses MPI for parallelization over a compute farm
• Optimization for long scaffolds

Summary
• Iterative runs give acceptable reference gene sets.
• Used for Glossina and An. stephensi
• Used by others for Strigamia, Manduca, published ant genomes

Community annotation

• Simple tool to capture community annotation
  • Makes gene prediction and evidence available as GFF3
  • Compatible with Artemis and Apollo tools
  • Submissions in GFF3 format
    • Gene structure corrections
    • Gene metadata (symbol, description, citations)

• Glossina annotation effort (Nov 2011 – Apr 2012)
  • 790 GFF submissions
  • 2670 items of metadata
    • Gene symbols, descriptions
    • Structure confirmation

[Workflow diagram. Inputs: GFF3, FASTA. Tools: ARTEMIS, APOLLO. Steps: Identify gene, Modify model, Submit. Target: CAP]

Example GFF3 evidence shown on the slide:

scf7180000638805 ptn2genome ptn_match 52 605 892 + . ID=xxxx;Name=tr|Q3UIQ2|
scf7180000638805 ptn2genome ptn_match 78 205 960 + . ID=xxxx2;Name=tr|Q3TIU7|
scf7180000638805 ptn2genome ptn_match 52 305 696 + . ID=xxxx3;Name=sp|Q91VD9|
scf7180000638805 ptn2genome ptn_match 78 205 950 + . ID=xxxx2;Name=tr|Q3VIU732|

scf7180000638805 ptn2genome ptn_match 52 605 892 + . ID=xxxx;Name=tr|Q3UIQ2|
scf7180000638805 ptn2genome ptn_match 78 205 960 + . ID=xxxx2;Name=tr|Q3TIU7|
scf7180000638805 ptn2genome ptn_match 78 205 950 + . ID=xxxx2;Name=tr|Q3VIU732|

Example FASTA shown on the slide:

>MY SUPERCONTIG
ATATATGCGTTGAGCTGCGTTACGTTCGGGATGCGTTAGGCTTGTGAGCTGGATCGGTCCTGCCTGCGTCGATATAAACGACCT…
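
The evidence lines on this slide follow the nine-column GFF3 layout (seqid, source, type, start, end, score, strand, phase, attributes). As a rough sketch of how such a line could be read before editing or submission, the following Python parses one of the slide's example records. The function name `parse_gff3_line` is hypothetical, and the example is rebuilt with tab separators on the assumption that the spaces on the slide stand in for the tabs that GFF3 requires.

```python
from urllib.parse import unquote

GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    if record["start"] > record["end"]:
        raise ValueError("start must not exceed end")
    # "." marks an undefined score in GFF3.
    record["score"] = None if record["score"] == "." else float(record["score"])
    # Attributes are semicolon-separated key=value pairs, URL-escaped.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)
    record["attributes"] = attributes
    return record


# First example record from the slide, with tabs assumed between columns.
example = "\t".join([
    "scf7180000638805", "ptn2genome", "ptn_match",
    "52", "605", "892", "+", ".", "ID=xxxx;Name=tr|Q3UIQ2|",
])
rec = parse_gff3_line(example)
print(rec["seqid"], rec["start"], rec["end"], rec["attributes"]["Name"])
```

Checking the column count and coordinate order up front is a cheap way to catch malformed lines before a file is loaded into an editor such as Artemis or Apollo.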

Population biology

• Chado Natural Diversity schema
  • 183 projects, 15,190 samples
  • Incorporates IRbase samples

• Ensembl variation schema
  • 1,511,335 SNP calls
  • Visualization through browser
  • Data downloads through browser
  • Queries via BioMart interface