Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short...

46
Reference Genome Sequencing Of Conifers PineRefSeq: An adaptive approach to the sequencing of large conifer genomes Nicholas Wheeler: University of California, Davis

Transcript of Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short...

Page 1: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Reference Genome Sequencing Of Conifers

PineRefSeq: An adaptive approach to the sequencing of large conifer genomes

Nicholas  Wheeler:    University  of  California,  Davis  

Page 2: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

University of California, Davis

Children’s Hospital Oakland Research Institute

Johns Hopkins University

University of Maryland

Indiana University

Texas A&M University

Washington State University

United States Department of Agriculture

National Institute of Food and Agriculture

pinegenome.org/pinerefseq

Page 3: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Our Guiding Principles

EMPOWERMENT. Our goal is to develop the technologies, platforms and bioinformatics infrastructures required to rapidly and inexpensively sequence large and complex genomes of coniferous forest trees. This will allow the forestry community to begin sequencing the many genomes of economic and ecological importance without a dependence on centralized genome centers.

ADAPTIVE. We recognize that sequencing technologies are developing rapidly and that we must have the expertise and flexibility to rapidly adopt new approaches into our overall sequencing strategy.

COMPARATIVE. We recognize the power of comparative genomics approaches in assembling and annotating genome sequences and will use this approach throughout the project.

OPEN ACCESS. We have a policy of sharing all data generated from this project with the research community

Page 4: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Reference Genome Sequence: For any given organism (species), the complete and ordered “assembly” of DNA, as denoted by the nucleotides A, T, C, and G.

Page 5: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Genome Sequencing – A Short History

A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome Project was presented to members of congressional appropriations committees in mid-February, 1990. According to the document, "a centrally coordinated project, focused on specific objectives, is believed to be the most efficient and least expensive way" to obtain the 3-billion-bp map of the human genome. In the course of the project, especially in the early years, the plan states that "much new technology will be developed that will facilitate biomedical and a broad range of biological research, bring down the cost of many [mapping and sequencing] experiments, and find application in numerous other fields."

Human Genome News, May 1990; 2(1) Five-Year Plan Goes to Capitol Hill

James Watson Francis Collins Craig Venter

Source: WikiPedia Source: Michigan St University Source: WikiPedia

Page 6: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Genome Sequencing – A Short History (continued)

The rate of publication of plant genomes, updated in late 2011

Figures courtesy of the CoGePedia and Phytozome.org websites.

A phylogenetic tree of all plants with published full genomes as of May 13, 2012

Page 7: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Existing & Planned Angiosperm Tree Genome Sequences As of mid-2012

1 Genome size: Approximate total size, not completely assembled. 2 Number of Genes: Approximate number of loci containing protein coding sequence. 3 Status: Assembly / Annotation versions

Species Genome Size1

(Mbp) # of Genes2 Status3

In Progress with Draft Assemblies

Populus trichocarpa Black Cottonwood 500 ~40,000 2.0 / 2.2

Eucalyptus grandis Rose Gum 691 ~36,000 1.0 / 1.1

Malus domestica Apple 881 ~26,000 1.0 / 1.0

Prunus persica Peach 227 ~28,000 1.0 / 1.0

Citrus sinensis Sweet Orange 319 ~25,000 1.0 / 1.0

Carica papaya Papaya 372 -

Amborella trichopoda Amborella 870 -

Betula nana Dwarf Birch 450 - 1.0 / -

In Progress or Planned – No Published Assemblies

Castanea mollissima Chinese Chestnut 800 -

Salix purpurea Purple Willow 327 -

Quercus robur Pedunculate Oak 740 -

Populus spp. and ecotypes Various Various -

Azadirachta indica Neem 384 -

Page 8: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Existing and Planned Gymnosperm Tree Genome Sequences As of mid-2012

1 Genome size: Approximate total size, not completely assembled. 2 Number of Genes: Approximate number of loci containing protein coding sequence. 3 Status: Assembly / Annotation versions; See http://www.phytozome.net for all publicly released tree genomes. Conifer genomes will also be posted here as they are completed.

Species Genome Size1

(Mbp) # of Genes2 Status3

Gymnosperms

Picea abies Norway Spruce 20,000 ? Pending

Picea glauca White Spruce 22,000 ? Pending

Pinus taeda Loblolly Pine 24,000 ? Pending

Pinus lambertiana Sugar Pine 33,500 ? Pending

Pseudotsuga menziesii Douglas-fir 18,700 ? Pending

Larix sibirica Siberian Larch 12,030 ? Pending

Pinus pinaster Maritime Pine 23,810 ? Pending

Pinus sylvestris Scots Pine 23,000 ? Pending

Page 9: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Technological Advances Facilitate Sequence Acquisition

Figure credit: modified from Jill Wegrzyn, UC Davis

Page 10: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Why Do We Need a Conifer Genome Sequence?

Fundamental Genetic Information Phylogenetic Representation Ecological Representation Development of Genomic Technologies Economic Importance

Photo credit: Nicholas Wheeler, UC Davis

Page 11: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Challenges to Sequencing a Conifer Genome

Image Credit: Modified from Daniel Peterson, Mississippi State University

Page 12: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Elements of a Conifer Genome Sequencing Project Approaches to Resolving Challenges

Figure  Credit:  Nicholas  Wheeler,  University  of  California,  Davis  

Page 13: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Assembling the Reference Sequence Based on Whole Genome Shotgun Sequencing

Figure  Credit:  Nicholas  Wheeler,  University  of  California,  Davis  

Page 14: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Acquiring the Sequence Target Genome, Appropriate Tissues for DNA & RNA

Haploid Haploid megagametophyte tissue 1N Shotgun sequenced

Diploid Diploid needle tissue 2N 40 Kb cloned fosmids, pooled and sequenced

Figure  Credit:  Nicholas  Wheeler,  University  of  California,  Davis  

Page 15: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Sequencing Strategy

40X to 60X 1X to 5X 1X to 5X 1X to 5X

Genome Equivalents Figure  Credit:  Nicholas  Wheeler,  University  of  California,  Davis  

Page 16: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Visit the Broad Institute for details on DNA preparation, library construction, and sequencing technology of Illumina HiSeq

Whole Genome Shotgun Sequencing Millions of Short “Reads”

Figure  Credit:  Nicholas  Wheeler,  University  of  California,  Davis  

Page 17: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Sequencing DNA from Pools of Fosmid Clones Fosmid Library Construction

Genomic DNA cloned into fosmid vectors represents a source of stable genomic fragments of approximately 30 to 40 kb.

Figure  Credit:  Modified  from  Maxim  Koriabine,  Children’s  Hospital  Oakland  Research  InsHtute    

Page 18: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Sequencing DNA from Pools of Fosmid Clones Preparing Fosmids for Sequencing

Assembly of complex genomes with a high level of repetitive DNA is facilitated by reducing the complexity of the “puzzle”.

Figure  Credit:  Modified  from  Maxim  Koriabine,  Children’s  Hospital  Oakland  Research  InsHtute    

Page 19: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Jumping Libraries Illumina Mate-Pair, Clone-Free

Page 20: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Jumping Libraries Fosmid Di-Tag Cloned

Figure  Credit:  Modified  from  Maxim  Koriabine,  Children’s  Hospital  Oakland  Research  InsHtute    

Page 21: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Assembling the Reference Sequence Based on Whole Genome Shotgun Sequencing

Figure  Credit:  Nicholas  Wheeler,  University  of  California,  Davis  

Page 22: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Assembling the Reference Sequence The Essence of Assembly

This general approach is called OLC or Overlap-Layout-Consensus.

Figure  Credit:  Nicholas  Wheeler,  University  of  California,  Davis  

Page 23: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Find all k-mers (short DNA sequences of length k) and build a graph. • Every k-mer is a node • Two nodes are linked with an edge if they share k-1 nucleotides

De Bruijn Graph Assembly Approach

Figure  Credit:  Modified  from  Aleksey  Zimin  

Page 24: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

OLC

Benefits

•  Can deal with variable length reads and reads from different sequencing platforms

•  Overlaps can be long and thus more reliable

•  Overlaps do not have to be exact

•  Can resolve repeats of up to read size

Drawbacks

•  Computationally intensive, number of overlaps grows quadratically with the number of reads

Comparing OLC and Graph Assembly Approaches

Graph

Benefits

•  Computationally efficient to find paths in the graph

•  Don’t have to find overlaps; they are implicit in the de Bruijn graph

Drawbacks

•  The graph is very large; approximately one node per base

•  Errors in the reads create spurious branches in the graph – requires error correction

•  Max. size of k-mer is limited by the shortest read size

Page 25: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Assembly Strategy

Figure  Credit:  Nicholas  Wheeler,  University  of  California,  Davis  

Page 26: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Dense Genetic Maps Aid Assembly

snp0_14875_01-457 snp0_18789_01-316snp0_6869_01-53 snp0_2615_02-1900.0snpCL1912Contig1_01-7012.5snp0_14614_01-530 snpCL1590Contig1_01-149snpCL4371Contig1_02-2445.0rflpPitaIFG_2899_4 snp0_17343_02-417snp0_15011_02-197 snpUMN_4478_02-286.7rflpPitaIFG_2530_2 snp0_13608_01-303snpUMN_822_01-99 snpCL343Contig1_01-1347.7snp2_6100_01-4911.5snp2_6100_02-10612.5snp0_5239_02-321 snpCL4270Contig1_06-172snp0_8457_01-191 snpUMN_2867_01-6914.1rflpPitaIFG_1916_315.0snpUMN_1409_01-323 snpUMN_819_01-4419.7snpCL745Contig1_03-203 snpUMN_2867_01-36120.4snp0_13835_01-34123.4rflpPitaIFG_2697_A25.5snp0_18258_02-276 snpCL317Contig2_05-32329.6snp0_648_01-97 snpCL1400Contig1_02-14430.9rflpPitaIFG_2006_A33.9estPitaIFG_20G2_a estPitaIFG_9053_arflpPitaIFG_1D9_138.9snp0_6709_02-58842.2snp0_12488_01-289 snpCL14Contig4_03-193snpUMN_2913_01-11843.0snp0_16257_01-349 snp0_16546_01-159snpUMN_CL58Contig1_03-17445.3snp0_5898_01-51 snp2_792_01-319snpCL229Contig1_03-7245.8snp2_3444_01-5147.3snp0_12074_01-41147.7snp0_12074_01-66 snpCL1198Contig1_04-129snpCL2342Contig1_01-40748.1snp0_15329_01-7153.0snp0_18284_01-76155.5snp2_4841_01-7257.1snp2_3447_02-51657.2snp0_9482_01-34858.8estPitaIFG_8537_a snp0_15417_01-13863.3snp2_3494_01-19668.1estPitaIFG_1576_a estPitaIFG_2253_arflpPitaIFG_1576_1 rflpPitaIFG_2253_ArflpPitaIFG_3008_2

70.5

snpUMN_2740_01-228 snp2_9199_01-49572.5snp0_12156_02-134 snp0_7494_02-31373.9snp0_12133_02-42775.0snp2_6033_01-35776.0snpCL3492Contig1_02-18877.8snp0_1302_01-3979.2snp0_10693_01-146 snpUMN_2993_01-167snpUMN_7192_01-17483.6snp0_4265_02-5385.4snp2_1833_01-23585.5snp0_11588_02-103 snpCL4725Contig1_01-135snp0_13063_02-47486.4snpCL3817Contig1_05-29687.7snp0_7457_01-27795.2snp0_17383_01-149 snp0_15466_02-169snp2_5499_02-136 snpCL4036Contig1_03-10899.6snpCL4255Contig1_03-125100.7snpUMN_3450_01-409101.8snpUMN_3450_01-72102.9snpUMN_1142_01-250105.3estPitaIFG_9102_a snp0_15901_01-587108.3snpCL77Contig1_04-124108.9rflpPitaIFG_701_1109.1snpCL2101Contig1_04-233113.9snpCL1894Contig1_03-38114.6snpUMN_3332_01-383115.1snp0_6617_01-108117.4snpCL3054Contig1_01-82117.5snpUMN_2770_01-367118.7snp0_4344_01-218122.7snp2_3934_01-32123.2rflpPitaIFG_658_A snp0_8304_02-414snpUMN_579_01-348 snp2_3934_01-356123.8snp0_13957_02-302125.5snp0_10712_02-123127.3snp0_2366_01-98129.8estPitaIFG_2053_b130.5snp2_4150_02-20134.1rflpPitaIFG_2441_1136.2snp2_9100_01-505 snpUMN_1023_01-376snpUMN_CL290Contig1_08-336 snp0_11003_02-50snpUMN_7016_02-75

138.3

snpUMN_188_01-55 snpUMN_4702_01-557139.2snp0_14744_01-155 snp2_2559_01-303snpCL1917Contig1_03-86140.0rflpPitaIFG_975_2140.7snpUMN_4702_01-130142.7snpUMN_2001_01-235143.0snp0_1722_01-574 snpCL199Contig3_02-149144.5snp0_17123_02-380147.0snp0_3964_01-453148.3snpCL2037Contig1_03-241149.7rflpPitaIFG_2882_1 rflpPitaIFG_2897_ArflpPitaIFG_2931_A snpUMN_5830_01-60snpCL358Contig2_01-76 snpUMN_828_01-73

150.2

rflpPitaIFG_2393_1 rflpPitaIFG_2963_2150.8snp0_15361_01-225154.4rflpPitaIFG_1636_6 snp0_14824_01-185155.2snp2_6541_01-385 snp0_1145_01-361156.7snp2_10215_01-53168.2

cromosome9

snpCL3786Contig1_02-700.0snpCL3715Contig1_01-1281.1snp0_17113_01-873 snpCL4664Contig1_05-1012.5snp0_16084_01-134 snp0_17254_01-5023.4estPitaIFG_9156_a4.9rflpPitaIFG_2479_16.3snp0_11865_01-66 snp0_13026_02-3309.3estPitaIFG_1956_a snp0_14069_01-49snpCL647Contig1_04-50 snpUMN_2253_01-136snp0_17756_02-191

11.8

snp0_11344_02-107 snpUMN_3214_01-16515.9snp2_6397_01-90 snpCL519Contig2_06-195snp2_1618_01-52416.9snp2_10236_01-5420.7rflpPitaIFG_2323_A snp2_6240_02-230snpCL1451Contig1_02-60 snpCL4272Contig1_03-63snp0_14492_01-284 snp0_7497_02-94

22.0

snp0_2223_01-10623.1snpUMN_CL148Contig1_02-21924.2snpCL1747Contig1_04-7626.4snp0_1984_02-21626.6snp0_17933_01-41127.6rflpPitaIFG_2994_228.7rflpPitaIFG_669_4 rflpPitaIFG_669_A32.7snpCL634Contig1_03-19635.0snp0_13755_03-52 snpCL747Contig1_06-15139.5snpCL4321Contig1_01-7639.8snp0_1443_01-80 snp0_7321_01-4240.1snp0_14724_01-327 snpCL2411Contig1_05-7342.3snp0_11214_01-408 snp0_6699_01-33743.4snp0_11214_01-8844.5snp0_18619_01-15345.6snpCL2651Contig1_03-10347.6snpCL1595Contig1_02-62 snp0_3787_01-5848.6snp0_15140_01-592 snp2_7278_02-128snp0_6323_01-240 snpCL3399Contig1_05-10051.8snp0_16837_02-42153.2snp0_2751_01-35353.7snp2_2020_03-78 snp0_17143_02-7156.0snpCL1694Contig1_04-199 snpUMN_487_01-10065.2snp2_5548_01-65466.2estPitaIFG_2290_a rflpPitaIFG_125_2rflpPitaIFG_459_1 snp0_8286_03-183snpCL4592Contig1_03-180 snp0_13646_01-394snp0_16697_01-76 snp0_4022_01-328snp2_9699_01-160

71.7

snp2_5438_01-465 snp0_10903_02-555snp0_16697_01-338 snp0_783_01-233snp2_9226_01-67

72.8

snp0_14575_01-153 snp2_3770_02-282snpCL4041Contig1_02-11174.0snpCL529Contig1_04-89 snp0_17521_01-385snp2_7619_01-19174.5snp0_10446_01-24076.2snp2_1638_01-77 snpCL148Contig1_02-53snpUMN_5551_01-21977.9snp0_13841_01-11880.3estPitaIFG_0500_a81.3snpUMN_CL309Contig1_03-208 snpCL1057Contig1_03-5183.1snp2_4405_01-8185.8snp0_18710_03-17887.0snp0_17881_01-276 snpCL2431Contig1_01-459snpUMN_CL228Contig1_03-18187.8snp2_684_01-374 snpCL2253Contig1_01-187snp0_3461_01-618 snpUMN_3353_01-43890.9snpUMN_5161_01-258 snpUMN_CL228Contig1_03-49291.8snp2_974_02-19892.4snpCL3116Contig1_03-207 snp2_9294_01-29493.9snp2_7852_01-52597.4snpCL2416Contig1_06-36099.2snp0_4172_02-205 snp0_711_01-591101.9snpCL3094Contig1_03-125105.5snp0_13455_01-122 snp2_4854_03-202106.3snp2_2160_02-645 snp0_18187_02-220107.0snp2_9930_01-56108.2snp0_7272_02-199112.3snp0_12149_01-86 snpUMN_1342_01-244113.7snp0_9276_01-32 snpCL305Contig1_05-249116.3estPitaIFG_8415_a snp0_1854_01-189snpCL3370Contig1_03-158 snpCL4425Contig1_04-168snp0_18287_01-124

117.6

snp2_8491_01-672 snpCL1868Contig1_02-211119.0snp0_17947_02-90120.1rflpPitaIFG_2899_A snp0_4755_01-301snp2_7196_01-423 snpCL3954Contig1_01-95122.2snp0_14873_01-155123.3snp0_1644_01-792123.4snpUMN_CL22Contig1_02-466125.5snpUMN_CL326Contig1_05-421126.6snp0_2137_01-354128.4snp0_2137_01-66 snpUMN_5911_02-179129.6snp2_1534_02-96 snp0_5111_02-211snp2_6379_03-345 snpUMN_6365_02-31130.8snp0_17641_01-156 snpCL1999Contig1_03-156snpUMN_4383_01-585 snp0_10591_01-289snp0_12134_02-264 snp0_12973_02-367snpCL22Contig1_03-218

131.9

snp2_4191_01-104 snpUMN_239_01-90133.0estPitaIFG_8725_a rflpPitaIFG_2370_1135.1snp0_13620_01-163137.1snp0_7371_01-187 snp2_4569_01-615snp0_3665_02-347 snp2_7751_02-211138.2snp0_13311_02-812139.3snp0_13311_02-182140.4

cromosome10

snp0_4885_01-62 snp2_1728_01-3010.0snp2_945_01-771.7snp2_7707_02-3873.3snp2_7707_02-674.5snp0_7933_01-715.5snp0_18750_01-42 snp2_3973_01-105snpCL1757Contig1_01-187 snpCL4689Contig1_02-2755.6snp2_3475_02-1626.4snp0_12681_01-532 snp0_17543_01-196snp2_6504_01-4237.4snp0_630_01-86 snp2_10306_01-3349.1snpUMN_1633_02-5010.9snpUMN_1633_02-36511.3snp2_9060_01-46011.7snp0_18810_02-148 snp2_1808_02-5912.6snp0_10373_01-711 snp0_12076_01-31014.9snp2_4959_01-16519.4snp0_2217_01-143 snp0_378_01-442snp0_18318_02-50 snp0_2999_02-35021.7snp0_13185_01-12923.5snpCL2046Contig1_03-21324.4snp0_9473_02-33424.6snp0_18300_01-7325.9snpUMN_3258_01-33327.5snp0_16732_01-38628.3snp0_8795_01-33434.4snpCL4156Contig1_04-11535.8snp2_5095_02-155 snpCL4082Contig1_01-44237.1snpUMN_1866_01-41738.2snp0_9511_02-38839.2snp0_9511_02-4839.9snp0_5559_01-41242.0snp0_576_01-32847.0snp0_3203_01-231 snp0_14613_01-110snp0_8695_01-17248.0snpCL1864Contig1_04-18649.2snp0_1576_01-40751.6snp0_18075_03-9053.9snp0_10663_01-21358.6rflpPitaIFG_2986_A snpCL1536Contig1_05-12059.7snpCL1360Contig1_05-460 snp2_4137_02-672snpCL2311Contig1_01-7960.8snp2_6439_03-10661.8snp0_8823_01-30662.8snpCL1080Contig1_03-9064.2snp2_6453_03-24066.3snpCL1799Contig1_04-73767.3snpCL444Contig1_02-25668.4snp0_17386_01-8170.4snp2_7373_01-26471.3snpUMN_6889_01-6372.1snp0_1423_01-39073.8rflpPitaIFG_66_1 snp0_17822_02-15674.6snp0_7580_01-264 snpCL2136Contig1_01-6575.3snp0_7326_01-6376.7snpUMN_3932_01-24077.3rflpPitaIFG_1623_A77.9snp2_8315_02-24578.9snp0_10899_03-173 snp0_7881_01-38279.8snp2_795_02-13080.8snp2_8611_01-9681.9snp0_14750_02-25682.5snp0_17832_01-104 snpUMN_5395_01-43083.3snp0_2716_01-512 snp0_18624_02-6684.2snp2_394_01-25186.0snp0_15010_01-25987.2snp0_2302_01-41 snpUMN_4149_01-111snp0_7427_01-34088.7snp0_17194_01-11490.1snpUMN_3001_01-5690.7estPitaIFG_8939_a rflpPitaIFG_3006_193.0snp0_3671_02-181 snp0_9106_02-5594.5snp0_5688_01-443 snp0_17758_01-50098.2snpCL2160Contig1_01-6399.9rflpPitaIFG_48_1 snpUMN_5214_01-52103.2snp2_8909_02-75107.1snp2_2119_01-61112.3snpUMN_157_01-119114.9snpCL1778Contig1_04-72117.1snpUMN_CL78Contig1_01-192118.7snp0_4441_01-196 snpCL1027Contig1_04-35122.0snp0_886_02-324122.6snpCL3367Contig1_01-189123.7rflpPitaIFG_2536_1 rflpPitaIFG_2564_A127.8rflpPitaIFG_1A7_A134.8snp0_12494_03-189137.4snpCL1989Contig1_01-345140.1snpCL3466Contig1_01-91 snp0_7496_01-373143.0snp0_1675_01-136143.9snp2_7562_01-464152.3rflpPitaIFG_2538_B154.5snpUMN_3689_01-427156.1estPitaIFG_8569_a156.6rflpPitaIFG_2150_A rflpPitaIFG_2885_1rflpPitaIFG_2885_B snpCL352Contig1_03-122161.5rflpPitaIFG_2994_5 snp0_13929_02-186snp2_5345_01-159 snpUMN_6063_01-143snp0_15958_01-145 snp2_3141_01-131snp2_9087_01-114

164.4

snp0_4698_01-53165.7snp0_16524_02-57 snp2_5026_01-268166.8snp0_11199_01-376 snp0_1682_01-580snpCL1566Contig1_03-200 snpUMN_706_01-330170.2snp0_4924_01-230 snpCL133Contig2_06-75snpUMN_5166_01-35171.3snp2_9455_01-122173.6snp0_14991_01-174 snpCL193Contig1_03-104snp0_1078_01-477174.7

cromosome11

snpUMN_5143_01-4480.0snp0_10384_02-2741.1snp0_5193_01-86 snp0_17197_01-2203.4snp2_4484_02-622 snpCL104Contig2_02-46snp0_16860_01-85 snp2_6804_01-2894.5snp0_3980_03-200 snp0_8331_01-292snpCL4059Contig1_03-2255.0snp2_6804_01-5715.8snp0_14606_01-77 snp2_5005_01-7512.5snp0_14730_01-354 snpCL4330Contig1_03-120snp0_17456_01-15214.7snp0_5784_01-234 snpUMN_1422_01-26814.8snpUMN_3368_01-139 snp0_14093_01-19015.4snp0_17253_02-86 snpCL631Contig1_05-25816.9snp0_18830_01-22318.6rflpPitaIFG_1869_218.7snp0_9732_03-5720.9snp2_2505_02-39523.2rflpPitaIFG_1889_1 snp0_12998_01-3425.6estPitaIFG_8580_a26.6snp1_2647_01-41127.9snp0_10401_01-12630.3snpCL206Contig1_03-12333.7snpCL3061Contig1_03-99 snpCL4268Contig1_01-21333.8snpCL1468Contig1_01-184 snp2_10399_01-42534.9snpCL4144Contig1_06-16036.1estPitaIFG_4CH2_a rflpPitaIFG_503_1snp0_881_01-54337.3snp2_8987_01-6543.0snp0_13058_01-6843.8snp0_15106_02-49246.4snpCL64Contig1_07-13446.7snp0_15106_02-153 snp0_3570_01-43747.0snp0_13181_01-56648.4snp0_13898_01-4749.8snpUMN_592_01-404 snp2_3801_02-19452.8snp2_9629_02-40353.6snpCL1196Contig1_03-5455.4snp0_15238_01-44 snp0_13484_01-190snp0_572_02-5457.2snp2_10034_01-484 snp0_12535_01-57058.9snp0_12535_01-11759.7rflpPitaIFG_1919_1 snp2_10034_01-17760.6snp0_12396_01-7460.7snp0_7512_01-31265.4snpCL3925Contig1_03-16368.2snp2_6387_01-71169.0estPitaIFG_1635_a rflpPitaIFG_1635_2rflpPitaIFG_1635_A rflpPitaIFG_1635_C70.6snpCL996Contig1_03-6674.9snp0_15003_01-72 snp0_16653_02-19577.0snp0_2069_01-29080.1snp2_9997_02-44681.5snp0_10113_01-11984.0snpUMN_CL306Contig1_04-261 snp0_18411_02-105snpCL2215Contig1_03-40886.6snp2_6339_01-20387.2snp2_2356_02-88 snpCL2007Contig1_05-16892.8snp0_9867_01-94 snp2_4375_01-168snp0_1319_02-7993.9snpCL1099Contig1_03-89 snpCL1734Contig1_06-234snp0_6544_01-50 snp2_1402_02-59694.9snp0_17603_01-12595.0snpCL2034Contig1_03-312 snp0_7654_01-19296.2snp0_3251_01-442 snp2_2270_01-7998.4snp2_5574_02-450100.3snpCL3862Contig1_06-229100.8snp2_6130_01-57101.5snp0_13098_01-68 snp0_18322_01-117102.3snp2_4921_01-90105.6snp2_10071_02-104108.1snp0_14039_01-47114.4snp2_212_01-295 snp0_16706_02-326snp0_3657_02-90115.6snp0_16565_02-112117.2snp0_12163_01-286 snpCL3978Contig1_03-565118.8snp0_10950_02-148 snp2_6263_01-358snp0_2542_01-126 snpCL4404Contig1_01-145snpUMN_2192_02-42

120.1

snpCL1129Contig1_01-649 snpCL4404Contig1_01-464snpUMN_3522_01-441121.7snpCL4629Contig1_04-197122.5snp2_8019_01-33123.9snp0_489_01-89125.7snp0_4834_01-60126.5snp0_6047_02-245127.3snp0_15484_02-97128.1snp0_7115_02-341129.5snp2_3312_01-422130.2snp0_13202_01-360130.4snp2_6698_02-408131.3snp0_9993_01-50131.4snp0_1441_02-74132.4snpUMN_1956_02-75 snp0_17893_01-332133.2snp0_9296_01-189135.9snpCL3402Contig1_04-408138.8snpUMN_1397_01-416140.2snp2_7803_01-212141.0snp0_18301_01-115141.8rflpPitaIFG_2020_1 rflpPitaIFG_2361_1144.0snp2_10057_01-63 snp2_4323_01-429snp0_9314_01-64 snp2_4944_01-109146.9snpCL1453Contig1_01-181 snp0_9314_01-367148.0snp0_8544_01-128151.4rflpPitaIFG_3021_1 snp0_2234_01-128snp0_5740_01-61 snp2_4342_03-79snp2_6052_01-347

154.8

snp0_18225_02-54 snpUMN_5833_01-392156.2snp2_3140_01-95 snpUMN_4856_01-462157.6snp2_8381_01-116158.2rflpPitaIFG_1A7_6 rflpPitaIFG_2361_3159.7rflpPitaIFG_2994_3 snpCL1052Contig1_03-62snpCL4264Contig1_01-164162.1snp0_9868_01-383164.3snp0_17971_01-642 snp2_5974_02-51166.6snp2_4724_01-136168.0

cromosome12LG9 LG10 LG11 LG12

Figure  Credit:  Courtesy  of    Andrew  Eckert  and  Pedro  MarHnez-­‐Garcia,  University  of  California,  Davis    

Page 27: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Assembling the Reference Sequence Based on Whole Genome Shotgun Sequencing, Jumping Libraries, & Genetic Maps

Figure  Credit:  Nicholas  Wheeler,  University  of  California,  Davis  

Page 28: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Transcriptome (RNA) sequencing defines the genes expressed in different pine tissues

Figure  Credit:  Modified  from  Keithanne  MockaiHs,  Indiana  University  

Page 29: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

How Does the Transcriptome Inform the Reference Genome?

Figure  Credit:  Modified  from  Jill  Wegrzyn,  University  of  California,  Davis  

Page 30: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

What Else Can We Learn From Transcriptomics ?

• Genes and Pathways • Gene Function • Diagnostics • Gene Regulation • Proxy for other Omics

Page 31: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Library N, quality filtered

Nucleotides Transcripts Assembled

Mean Contig Length

Unique Transcripts

Pila Needles & Candles 454 (Newbler) 1,096,017 387,174,063 28,910

955 49,035

Pila Needle RNASeq (Trinity) 33,961

Psme Needles and Candles 454 (Newbler) 1,216,156 419,643,998 25,041

961 92,897 Psme Needle RNASeq (Trinity) 99,936

Pita Shoot 454 (Newbler) 874,971 205,284,775 62,342

1,124 48,842 Pita Callus 454 (Newbler) 882,199 344,842,307 37,322

Pita Stem 454 (Newbler) 934,760 310,498,816 43,234

Preliminary Results of Transcriptome Reference Sequencing for Three Conifer Species

Updated summary of transcriptome assemblies from 454 (CCG, JGI) and RNASeq (FS) in Psme (Douglas-fir), Pila (sugar pine), and Pita (loblolly pine).

Page 32: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Genome Annotation

Figure  Credit:  Modified  from  Jill  Wegrzyn,  University  of  California,  Davis  

Page 33: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

The Functional Annotation Process

Figure  Credit:  Modified  from  Jill  Wegrzyn,  University  of  California,  Davis  

Page 34: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Gene Ontology

Biological  Process   Molecular  Func=on   Cellular  Component  A  commonly  recognized  

series  of  events  An  elemental  acHvity  

or  task  or  job  Where  a  gene  product  

is  located  

Cell  division  Mitosis  

Organelle  fission  

Protein  kinase  acHvity  Insulin  binding  

Insulin  receptor  acHvity  

Mitochondrion    Mitochondrial  matrix  

Mitochondrial  membrane    

Figure  Credit:  Modified  from  Jill  Wegrzyn,  University  of  California,  Davis  

Page 35: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Transcriptome Ontology

Figure  Credit:  Modified  from  Barakat  et  al.  BMC  Plant  Biology  2012,  12:38  -­‐    hZp://www.biomedcentral.com/1471-­‐2229/12/38  

Page 36: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

More Functional Annotation

Identifying and classifying regulatory sequences

Identifying and classifying genes that produce functional RNAs

•  tRNA – Protein Synthesis

•  rRNA - Protein Synthesis

•  U snRNA – Splicing

•  snoRNA - rRNA modification

•  miRNA – Gene regulation

Page 37: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

A Complete Annotation

Figure  Credit:  Modified  from  Jill  Wegrzyn,  University  of  California,  Davis  

Page 38: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Annotating Structural Features of Genome Sequences

•  Repetitive sequence content & distribution

•  Pseudogenes/gene families

•  GC content

•  Segmental duplication

•  Centromere and telomere structure

Page 39: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Database Resource Requirements Project Level Genome Browsers

Goal: Provide capacity to capture, archive, curate, distribute, and analyze genomic information.

TreeGenes

Genome Database for Rosaceae

Identifying and classifying genes that produce functional RNAs

Page 40: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Public Databases Nucleic Acid Sequence Databases

ENA / EBI – European Bioinformatics Institute DDBJ – DNA Data Bank of Japan NCBI /GenBank – National Center for Biotechnology Info

Page 41: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

NCBI Entrez

NCBI Entrez – The Life Sciences Search Engine

Page 42: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Version  0.6  Release:  June,  2012  Version  0.8  Release:  January,  2013  

0

50

100

150

200

250

NCBI BLAST Against Loblolly Draft Reference

 Jun  -­‐12      Jul-­‐12        Aug-­‐12        Sep-­‐12        Oct-­‐12        Nov-­‐12  0  

100  

200  

300  

400  

500  

600  

700  

800  

Jun-­‐12   Jul-­‐12   Aug-­‐12   Sep-­‐12   Oct-­‐12   Nov-­‐12  

FTP Downloads of Loblolly Draft Reference

The Pine Reference Sequence in Use

Unique users

Figure  Credit:  Modified  from  Jill  Wegrzyn,  University  of  California,  Davis  

Page 43: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Year   Common  Name   Scien=fic  Name  Assembly  Size  

(GB)  Predicted  Size  

(GB)  N50  Con=g  

(KB)  N50  Scaffold  

(KB)  

2011   Potato  Solanum  tuberosum  L.   0.7   0.8   31.4   1320.0  

2011   Orangutan  Pongo  abelii/pygmaeus   3.1   3.1   15.5   740.0  

2011   Nake  Mole  Rat  Heterocephalus  glaber   2.7       19.3   1590.0  

2011   AtlanHc  Cod   Gadus  morhua   0.8       2.8   690.0  

2011   Coral  Reef   Acropora  digi<fera   0.4   0.4   10.7   190.0  

2012   Gorilla   Gorilla  gorilla  gorilla   2.9       11.9   914.0  

2012   Oyster   Crassostrea  gigas   0.6   0.6   19.4   400.0  

2013   Radish   Raphanus  sa<vus  L   0.4   0.5   25.0      

2012   Wheat   Tri<cum  aes<vum   5.5   17.0   0.6   0.6  

2013   Loblolly  Pine   Pinus  taeda   20.1   22.0   8.2   30.7  

Genome  Assembly  StaHsHcs  for  Recently  Sequenced  Species  

Page 44: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Key Website Resources

Human  Genome  Sites  hZp://www.ornl.gov/sci/techresources/Human_Genome/project/hgp.shtml  hZp://www.nature.com/nature/supplements/collecHons/humangenome/index.html  hZp://www.sciencemag.org/content/300/5617/286.abstract  

Resource  –  Broad  Ins=tute  for  Illumina  Sequencing  Technology  hZp://www.broadinsHtute.org/scienHfic-­‐community/science/plaeorms/genome-­‐sequencing/broadillumina-­‐genome-­‐analyzer-­‐boot-­‐camp  

Photo  Credits:  Slide  5  hZp://en.wikipedia.org/wiki/Craig_Venter  hZp://www.humaniHesandhealth.wordpress.com  hZp://en.wikipedia.org/wiki/James_D._Watson  .   Figure  Credits:  Slide  6  hZp://www.phytozome.net/  hZp://genomevoluHon.org/wiki/index.php/Sequenced_plant_genomes  

Page 45: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

Key Website Resources

Slide  38  hZp://dendrome.ucdavis.edu/treegenes/  hZp://www.rosaceae.org/  

Slide  39  hZp://www.ebi.ac.uk/ena/  hZp://ddbj.sakura.ne.jp/  hZp://www.ncbi.nlm.nih.gov/genbank/  

  Slide  40  hZp://www.ncbi.nlm.nih.gov/sites/gquery  

Side  43  Download:  hZp://loblolly.ucdavis.edu/bipod/fp/Genome_Data/genome/pinerefseq/Pita/v0.9/  BLAST:  hZp://dendrome.ucdavis.edu/resources/blast/  Data  Release  Policy:  hZp://www.pinegenome.org/pinerefseq/Data_Use_Policy.pdf    

Page 46: Reference Genome Sequencing Of Conifers - eOrganic · 2019-01-21 · Genome Sequencing – A Short History A 5-year plan (FY 1991 to 1995) detailing the goals of the U.S. Human Genome

References Cited

 

Gibson,  G.,  and  S.  V.  Muse.  2004.  A  primer  of  genome  science.  (2nd  Ed).  Sinauer    Associates,  Sunderland,  MA.      Lesk,  A.  M.  2012.  IntroducHon  to  genomics.  (2nd  Ed).  Oxford  University  Press,    New  York.