
    NATURE GENETICS VOLUME 45 | NUMBER 1 | JANUARY 2013 59

Citrus is a large genus that includes several major cultivated species, including C. sinensis (sweet orange), Citrus reticulata (tangerine and mandarin), Citrus limon (lemon), Citrus grandis (pummelo) and Citrus paradisi (grapefruit). In 2009, the global citrus acreage was 9 million hectares and citrus production was 122.3 million tons (FAO statistics, see URLs), which ranks first among all fruit crops. Among the 10.9 million tons (valued at $9.3 billion) of citrus products traded in 2009, sweet orange accounted for approximately 60% of citrus production for both fresh fruit and processed juice consumption (FAO statistics, see URLs). Moreover, citrus fruits and juice are the prime human source of vitamin C, an important component of human nutrition.

Citrus fruits also have some unique botanical features, such as nucellar embryony (nucellus cells can develop into apomictic embryos that are genetically identical to the mother plant). Consequently, somatic embryos grow much more vigorously than the zygotic embryos in seeds, such that seedlings are essentially clones of the maternal parent. Such citrus-unique characteristics have hindered the study of citrus genetics and breeding improvement 1,2. Complete genome sequences would provide valuable genetic resources for improving citrus crops.

Citrus is believed to be native to southeast Asia 3–5, and cultivation of fruit crops occurred at least 4,000 years ago 3,6. The genetic origin of the sweet orange is not clear, although there is some speculation that sweet orange might be derived from interspecific hybridization of some primitive citrus species 7,8. Citrus is also in the order Sapindales, a sister order to the Brassicales in the Malvidae, making it valuable for comparative genomics studies with the model plant Arabidopsis.

We aimed to sequence the genome of Valencia sweet orange (C. sinensis cv. Valencia), one of the most important sweet orange varieties cultivated worldwide and grown primarily for orange juice production. Normal sweet oranges are diploids, with nine pairs of chromosomes and an estimated genome size of ~367 Mb 9. To reduce the complexity of the sequenced genome, we obtained a double-haploid (dihaploid) line derived from the anther culture of Valencia sweet orange 10. We first generated whole-genome shotgun paired-end-tag sequence reads from the dihaploid genomic DNA and built a de novo assembly as the citrus reference genome; we then produced shotgun sequencing reads from the parental diploid DNA and mapped the sequences to the haploid reference genome to obtain the complete genome information for Valencia sweet orange. In addition, we conducted comprehensive transcriptome sequencing analyses for four representative tissues using shotgun RNA sequencing (RNA-Seq) to capture all transcribed sequences and paired-end-tag RNA sequencing (RNA-PET) to demarcate the 5′ and 3′ ends of all transcripts. On the basis of the DNA and RNA sequencing data, we characterized the orange genome for its gene content, heterozygosity and evolutionary features. The genome and transcriptome analyses presented in this study yield new insights into the origin of sweet orange and the genomic basis of vitamin C metabolism and provide a rich resource of genetic information for citrus breeding and genetic improvement.

The draft genome of sweet orange (Citrus sinensis)

Qiang Xu 1,6, Ling-Ling Chen 2,6, Xiaoan Ruan 3,6, Dijun Chen 2, Andan Zhu 1, Chunli Chen 1,2, Denis Bertrand 3, Wen-Biao Jiao 2, Bao-Hai Hao 2, Matthew P Lyon 4, Jiongjiong Chen 1, Song Gao 3, Feng Xing 2, Hong Lan 1, Ji-Wei Chang 2, Xianhong Ge 5, Yang Lei 2, Qun Hu 1, Yin Miao 2, Lun Wang 1, Shixin Xiao 1, Manosh Kumar Biswas 1, Wenfang Zeng 1, Fei Guo 1, Hongbo Cao 1, Xiaoming Yang 1, Xi-Wen Xu 2, Yun-Jiang Cheng 1, Juan Xu 1, Ji-Hong Liu 1, Oscar Junhong Luo 3, Zhonghui Tang 2, Wen-Wu Guo 1, Hanhui Kuang 1, Hong-Yu Zhang 2, Mikeal L Roose 4, Niranjan Nagarajan 3, Xiu-Xin Deng 1 & Yijun Ruan 2,3

Oranges are an important nutritional source for human health and have immense economic value. Here we present a comprehensive analysis of the draft genome of sweet orange (Citrus sinensis). The assembled sequence covers 87.3% of the estimated orange genome, which is relatively compact, as 20% is composed of repetitive elements. We predicted 29,445 protein-coding genes, half of which are in the heterozygous state. With additional sequencing of two more citrus species and comparative analyses of seven citrus genomes, we present evidence to suggest that sweet orange originated from a backcross hybrid between pummelo and mandarin. Focused analysis on genes involved in vitamin C metabolism showed that GalUR, encoding the rate-limiting enzyme of the galacturonate pathway, is significantly upregulated in orange fruit, and the recent expansion of this gene family may provide a genomic basis. This draft genome represents a valuable resource for understanding and improving many important citrus traits in the future.

1 Key Laboratory of Horticultural Plant Biology (Ministry of Education), Huazhong Agricultural University, Wuhan, China. 2 College of Life Science and Technology, Huazhong Agricultural University, Wuhan, China. 3 Genome Institute of Singapore, Singapore. 4 Botany and Plant Sciences, University of California, Riverside, California, USA. 5 College of Plant Science and Technology, Huazhong Agricultural University, Wuhan, China. 6 These authors contributed equally to this work. Correspondence should be addressed to X.-X.D. ([email protected]) or Y.R. ([email protected]).

    Received 30 April; accepted 24 October; published online 25 November 2012; doi:10.1038/ng.2472

    OPEN

ARTICLES

http://www.nature.com/doifinder/10.1038/ng.2472

RESULTS

Genome sequencing and assembly

From the dihaploid DNA of sweet orange (Supplementary Fig. 1), we generated 785 million high-quality paired-end-tag sequencing reads (2 × 100 bp) from various DNA fragment sizes (~300 bp, 2 kb, 10 kb and 20 kb) using the Illumina GAII sequencer, representing 214-fold base-pair coverage and 4,889-fold physical coverage of the estimated citrus genome (Supplementary Table 1). The sequence reads were assembled by SOAPdenovo 11 and Opera 12, resulting in 16,890 assembled sequence contigs (N50, 49.89 kb) and 4,811 scaffolds (N50, 1.69 Mb) (Table 1). The total contig sequence length (320.5 Mb) covers 87.3% of the estimated sweet orange genome, and more than 80% of the genome assembly is in 135 scaffolds that are larger than 713 kb and up to 8.2 Mb. To evaluate the accuracy of the citrus genome assembly, we assessed the quality of the scaffolds by BAC clone analysis, alignment to the established citrus genetic linkage map and cytogenetic analysis.
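The coverage and assembly figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the 785 million figure counts individual 100-bp reads (this assumption reproduces the reported 214-fold value), and the `n50` helper follows the usual definition of N50 applied to a toy list of lengths, not to the actual contig set.

```python
def fold_coverage(n_reads, read_length_bp, genome_size_bp):
    """Base-pair coverage = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


GENOME_SIZE_BP = 367e6  # estimated genome size from the text

# Assumption: 785 million individual reads of 100 bp each.
coverage = fold_coverage(785_000_000, 100, GENOME_SIZE_BP)
assembled_fraction = 320.5e6 / GENOME_SIZE_BP  # total contig length / estimate

print(f"{coverage:.0f}-fold base-pair coverage")
print(f"{assembled_fraction:.1%} of the estimated genome assembled")
print(n50([8, 5, 4, 2, 1]))  # toy lengths, for illustration only
```

Under these assumptions the first two lines reproduce the 214-fold coverage and 87.3% figures reported in the text.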

A BAC clone library of sweet orange has previously been constructed and characterized 13, and the estimated size range of BAC clone inserts was 120 kb. We picked individual clones for BAC-end capillary sequencing. Of the 735 pairs of BAC-end sequence reads in which both ends of a pair were mapped within the same contig, the average distance was 125,875 ± 26,788 bp (s.d.). This is consistent with the previously estimated length of BAC clones 13, suggesting the correctness of the contig assembly. To further evaluate the accuracy of the assembly on a local scale, we analyzed seven complete BAC sequences of sweet orange that are available in GenBank and also anchored in the sweet orange genetic map. The alignments between these BAC sequences and the citrus genome assembly showed an overall identity of 97.6% (Supplementary Fig. 2 and Supplementary Table 2). Together, the BAC clone sequencing analyses validated that the assembly is of high quality.
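The BAC-end consistency check amounts to comparing the spacing of mapped end pairs with the expected insert size. A minimal sketch follows, assuming each BAC is summarized by the mapped coordinates of its two ends; the function names and the toy coordinates are hypothetical, and the one-standard-deviation criterion is an illustrative choice, not the authors' stated test.

```python
from statistics import mean, stdev


def insert_sizes(end_pairs):
    """Distance spanned by each BAC, from the mapped coordinates of its two ends."""
    return [abs(right - left) for left, right in end_pairs]


def within_one_sd(observed_mean, observed_sd, expected):
    """True if the expected insert size lies within one s.d. of the observed mean."""
    return abs(observed_mean - expected) <= observed_sd


# Values reported in the text: 125,875 +/- 26,788 bp observed, 120 kb expected.
print(within_one_sd(125_875, 26_788, 120_000))

# Hypothetical coordinates for three BACs, for illustration only.
toy_pairs = [(1_000, 121_000), (50_000, 180_000), (10_000, 125_000)]
sizes = insert_sizes(toy_pairs)
print(round(mean(sizes)), round(stdev(sizes)))
```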

We then asked whether the assembly of the genome sequences matches well with the established sweet orange genetic linkage map 14. We mapped the scaffolds to 768 markers with known sequences in the genetic linkage groups, and 160 scaffolds (each >10 kb) were anchored, accounting for 239 Mb of the assembled citrus genome (Supplementary Table 3). Furthermore, 83 scaffolds that accounted for the majority of the length of the anchored scaffolds (178.7 Mb, or 75% of total anchored scaffolds) were in matched orientation within the genetic map (Supplementary Note), suggesting high alignment accordance between the anchored genetic markers and the sequence scaffolds. The discordance in alignment could reflect technical limitations, such as insufficient marker density or genotyping errors, or individual variations between the sequenced genome (Valencia sweet orange) and the genetic map that was based on three crosses between three sweet orange cultivars 14. Valencia orange is known to have chromosome rearrangements relative to seedy sweet oranges 15. Combining the assembled genome sequences and the genetic linkage groups, we constructed the pseudomolecules for each of the nine chromosomes and ordered them on the basis of genetic length (Fig. 1).

We then took some cytological markers, including two ribosomal genes (45S and 5S) and a tandem repeat sequence (107 bp), for FISH analyses to further validate the accordance of the assembled citrus genome sequences. The cytologically marked chromosomes showed tentative consistency with the genomically defined pseudochromosomes (Supplementary Table 4). For example, pseudochromosomes 1 and 8, bearing the 45S region at proximal locations, correspond to two type-B chromosomes, with one in the fragile site (Supplementary Fig. 3); pseudochromosome 6, with the mapped scaffolds containing 45S and 5S ribosomal DNA sequences at distal regions, corresponds to the type-D chromosome with a fragile site; and chromosome 7 contains a large number of copies of the 107-bp tandem repeat sequence at two distal locations, which corresponds to the type-C chromosome (Supplementary Fig. 3). Collectively, our cytological experiments provided sequence-matched evidence for chromosomes 1, 6, 7 and 8. Additional cytological evidence is needed for other chromosomes.

Genome characterization

Most plant genomes feature highly repetitive regions with transposable elements and unique gene-coding regions. We analyzed the citrus genome in these two aspects to reveal the basic structural and functional features of the genome.

Transposable elements are major components of plant genomes and important contributors to genome evolution. However, very few transposable elements have previously been identified in Citrus. To analyze the transposable elements of the orange genome, we first constructed a custom digital transposable element library using ab initio, homology- and structure-based transposable element prediction tools and then applied this dataset to annotate the citrus reference genome. Such analysis led to the identification of 61.7 Mb of sequence in the assembled contigs, accounting for 20.5% of the C. sinensis genome (Supplementary Table 5). This level of repetitiveness is similar to those of Arabidopsis 16 and rice 17, suggesting that sweet orange has a relatively compact genome.

Class I long terminal repeat retrotransposons occupy over 89% of the transposable elements, consisting of 198 Copia families and 130 Gypsy families. Similar to other compact genomes such as those of Arabidopsis 18 and rice 19, the intact elements in each family have fewer than 250 members, whereas more than 90% of long terminal repeat retrotransposon families contain only one to five full-length elements. Estimation of insertion time indicated that more than 80% of the intact elements were amplified within the last 2 million years, and 10% are younger than 50,000 years (Supplementary Table 6). In contrast, class II DNA transposons only comprised ~11% of the transposable elements, similar to the apple 20 and papaya 21 genomes. Among the class II DNA transposons, miniature inverted-repeat transposable elements (MITEs) occupied 67% of the sequence. Notably, we identified a new type of MITE, named MiM (MITE inserted in microsatellite), in the citrus genome (Supplementary Table 5).
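The text does not state how insertion times were estimated. A commonly used approach for intact long terminal repeat retrotransposons converts the sequence divergence K between the two LTRs of one element into an age, T = K / (2r). The sketch below illustrates that relation only; the substitution rate `ASSUMED_RATE` is a placeholder chosen for illustration and is not taken from this study. The last lines check the reported repeat fraction, which is reproduced when the 301.02 Mb scaffold total from Table 1 is used as the denominator (an inference, not something the text states).

```python
def ltr_insertion_time_years(ltr_divergence, subs_per_site_per_year):
    """Age of an intact element from the divergence between its two LTRs: T = K / (2r)."""
    return ltr_divergence / (2 * subs_per_site_per_year)


# Assumption: illustrative substitution rate (per site per year), not from the paper.
ASSUMED_RATE = 1.3e-8

# Divergence that would correspond to 2 million years under the assumed rate.
age = ltr_insertion_time_years(0.052, ASSUMED_RATE)
print(f"{age:,.0f} years")

# Repeat content check against Table 1 values.
te_fraction = 61.7 / 301.02
print(f"{te_fraction:.1%}")
```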

Table 1 Statistics of the sweet orange draft genome

Estimate of genome size                  367 Mb
Chromosome number (2n)                   18
Total size of assembled contigs          320.5 Mb
Number of contigs (>500 bp)              16,890
Largest contig                           323.34 kb
N50 length (contig)                      49.89 kb
Number of scaffolds (>500 bp)            4,811
Total size of assembled scaffolds        301.02 Mb
N50 length (scaffolds)                   1.69 Mb
Longest scaffold                         8.16 Mb
GC content                               34.06%
Number of gene models                    29,445/44,387
Mean transcript length                   1,817 bp
Mean coding sequence length              1,255 bp
Mean exon length                         312 bp
Mean intron length                       359 bp
Number of predicted miRNA genes          227
Total size of transposable elements      61.7 Mb

To annotate the citrus genome for protein-coding genes, we used a comprehensive strategy that combined ab initio gene predictions, protein-based homology searches and experimental support (ESTs, RNA-Seq and RNA-PET) (Supplementary Fig. 4). We used four ab initio gene-finding programs (Fgenesh, GeneID, Genscan and GlimmerHMM) on the repeat-masked genome sequences. In addition, we collected 917,610 proteins from the UniProt database 22 and 958,121 Citrus EST reads from GenBank and generated 964.6 million RNA-Seq reads from four citrus tissues (callus, leaf, flower and fruit) (Supplementary Table 1). In total, we identified 29,445 protein-coding loci (gene models) with 44,387 transcripts (Fig. 2 and Supplementary Table 7), and 23,421 gene models were located on the nine pseudochromosomes. The gene transcripts have an average length of 1,817 bp, a mean coding sequence size of 1,255 bp and an average of 5.8 exons per gene. The RNA-Seq and EST data suggested that 26% of the gene models (7,640 gene loci) encoded two or more transcript isoforms (Supplementary Fig. 5a). Overall, 85% of the gene models contain multiple exons (Supplementary Fig. 5b), and 99% of the predicted coding sequences and 80% of the splice junction sites were supported by the RNA-Seq data (Fig. 2b and Supplementary Table 7), showing the high accuracy of gene annotation in the orange genome. We evaluated and corrected all predicted gene models using the RNA-PET sequencing data (which is complementary to the RNA-Seq data) for demarcating the gene boundaries (Supplementary Fig. 6). We generated approximately 100 million RNA-PETs from four tissues (callus, leaf, flower and fruit) and used them for mapping, clustering and annotation of the genome (Supplementary Table 1), confirming 70% of the 5′ and 3′ boundaries of the gene transcripts (Fig. 2c and Supplementary Table 8). Notably, we identified 967 transcripts for 744 putative gene models solely on the basis of the combined RNA-Seq and RNA-PET data (Supplementary Table 9 and Supplementary Fig. 7), representing a collection of potentially new and citrus-specific genes. In addition, we identified a total of 181 ribosomal RNAs, 451 transfer RNAs, 39 small nucleolar RNAs and 104 small nuclear RNAs in the citrus genome (Supplementary Note).

We also identified 40 conserved microRNA (miRNA) families with 227 members in the orange genome on the basis of previous small RNA sequencing data 23 (Supplementary Table 10), and most of the miRNA genes are distributed evenly across the genome, with several small clusters (Supplementary Fig. 8).

To better understand citrus-specific biology, we identified citrus-specific genes. We performed comparative analyses of the orange genome against 21 other plant genomes and screened them by EST and peptide sequences in the public database (Supplementary Note). Of the 29,445 protein-coding genes in the orange genome, 23,804 were grouped into 14,348 gene families, with 1,691 genes being specific to citrus (Supplementary Fig. 9). Most (96%) of the citrus-specific genes are hypothetical and have unknown function. Of the 58 annotated genes, the over-represented protein domains are the zinc finger transcription factor domain and the nucleotide-binding-site and leucine-rich-repeat (NBS-LRR) domain, which contain potential disease-resistance genes (Supplementary Table 11).

Figure 1 Alignment of the genome sequence assembly with the genetic map of C. sinensis. Assembled scaffolds (blue; 239 Mb, or 75% of the assembled genome sequence) were anchored in the nine linkage groups (LG1–LG9, yellow) with the corresponding genetic markers (black bars). The pseudochromosome numbers are assigned on the basis of the estimated length of the genetic linkage groups 14. [Panels: pseudochromosomes Chr1 to Chr9, each with marker, linkage group and scaffold tracks; scaffold lengths are labeled in kb.]

Heterozygosity and hybrid origin of sweet orange

Heterozygosity is a common feature in most eukaryotic organisms and has important biological functions in crops. Although citrus trees are generally perceived to have highly heterozygous traits 2, the amount of heterozygosity in the whole genome is not clear. Sequencing of the haploid and diploid genomes provided the genomic data (Supplementary Table 12) for us to comprehensively assess the heterozygosity of sweet orange. From the diploid genome sequencing data, we identified 1.06 million SNPs and 176,953 small insertions/deletions (indels). More than 80% of the SNPs and indels were anchored on the nine pseudochromosomes. The overall polymorphism density is 3.6 SNPs and 0.6 indels per kb in the genome (Supplementary Table 13). Approximately one-third of the SNPs were located in genic regions (Supplementary Table 14), and 16,843 gene loci (72%) contained at least one SNP (Supplementary Fig. 10). Considering a 0.52% error rate in detecting SNPs (Supplementary Note), we set up a stringent criterion of five or more SNPs per gene to reliably call variant alleles; using this criterion, 12,297 gene loci (53%) were considered to harbor variant alleles (Supplementary Table 15). This observation suggests a notably high heterozygosity rate in the gene loci.
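The five-or-more-SNPs criterion can be expressed as a simple threshold on per-gene SNP counts. The sketch below is an illustration under stated assumptions: the gene names and counts are invented, and the denominator used for the percentages is taken to be the 23,421 gene models on the nine pseudochromosomes, which reproduces the reported 72% and 53% but is an inference from the numbers.

```python
def call_variant_loci(snps_per_gene, min_snps=5):
    """Return the genes whose SNP count meets the stringent threshold."""
    return {gene for gene, count in snps_per_gene.items() if count >= min_snps}


# Hypothetical per-gene SNP counts, for illustration only.
toy_counts = {"geneA": 0, "geneB": 1, "geneC": 5, "geneD": 12}
print(sorted(call_variant_loci(toy_counts)))

# Assumed denominator: gene models placed on the nine pseudochromosomes.
ANCHORED_GENES = 23_421
with_any_snp = 16_843 / ANCHORED_GENES
with_variant_alleles = 12_297 / ANCHORED_GENES
print(f"{with_any_snp:.0%} with at least one SNP")
print(f"{with_variant_alleles:.0%} with five or more SNPs")
```

Raising `min_snps` trades sensitivity for robustness against false SNP calls, which is the motivation the text gives for choosing a stringent cutoff.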

To verify this observation, we genotyped 307 simple sequence repeat (SSR) markers on the dihaploid and parental diploid of sweet orange and identified 155 (50%) SSR loci in a heterozygous state (Fig. 3a and Supplementary Table 16), which is similar to the heterozygosity rate in the protein-coding gene loci found by SNP analysis. Moreover, chromosome karyotyping analysis showed that 6 of the 18 chromosomes have no clear counterpart in the parental diploid, whereas the chromosomes were easily grouped into nine pairs in the dihaploid line (Fig. 3b), further supporting the observation that the diploid genome of sweet orange is highly heterozygous.

We thus investigated the speculated hybrid origin of sweet orange from pummelo (C. grandis) and mandarin (C. reticulata) 7,8, which are two primitive species of Citrus. To measure the possible genetic contribution to sweet orange from pummelo and mandarin, we surveyed the genotype of sweet orange, two pummelo cultivars and two mandarin cultivars using sweet orange–derived SSR markers and identified 202 polymorphic markers, 55 of which are pummelo hereditary and 147 of which are mandarin hereditary (Fig. 3c and Supplementary Table 16). It is remarkable that the numbers of pummelo-hereditary markers and mandarin-hereditary markers (55 and 147, respectively) fit well with a 1:3 ratio ($\chi^2 = 0.54 < \chi^2_{0.95}(2-1) = 3.84$), indicating an interesting genetic relationship between sweet orange, pummelo and mandarin.
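The goodness-of-fit test for the 1:3 ratio can be reproduced directly from the marker counts. The sketch below uses the standard Pearson chi-square statistic with one degree of freedom; the helper name is hypothetical, and the critical value 3.841 is the standard 95% quantile for one degree of freedom. With these inputs the statistic evaluates to about 0.53, close to the reported 0.54 and well below the critical value.

```python
def chi_square_gof(observed, expected_ratio):
    """Pearson chi-square statistic for observed counts against an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * part / ratio_sum for part in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# 55 pummelo-hereditary and 147 mandarin-hereditary markers against a 1:3 ratio.
statistic = chi_square_gof([55, 147], [1, 3])

CRITICAL_095_DF1 = 3.841  # 95% quantile of chi-square with 1 degree of freedom
fits_ratio = statistic < CRITICAL_095_DF1

print(f"chi-square = {statistic:.3f}, fits 1:3 at the 5% level: {fits_ratio}")
```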

To comprehensively verify the genome composition of sweet orange inferred from SSR markers, we extended our analysis using high-density SNP markers. For this purpose, we generated whole-genome shotgun short sequence reads for three pummelo cultivars and three mandarin cultivars, each at more than 30-fold coverage of the citrus genome (Supplementary Table 12); we then mapped the sequences to the haploid orange reference genome and identified 475,450 SNPs common in three individual pummelos and 308,837 SNPs common in three individual mandarins (Supplementary Table 17). Using a 100-kb genomic window, we calculated the SNP density preference for pummelo or mandarin and then determined the hereditary origin (Supplementary Note), which allowed us to assign 39.7 Mb of the sweet orange genome from pummelo and 118.2 Mb from mandarin, which approximately fits a ratio of 1:3 in the genomic contributions from pummelo and mandarin (Fig. 3d and Supplementary Fig. 11), consistent with the SSR analysis results.
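The exact window rule is given only in the Supplementary Note, so the following is a hypothetical sketch of how a window-based assignment could work. It assumes that a window of the orange reference carrying few pummelo SNPs but many mandarin SNPs resembles pummelo (and vice versa), and it uses an arbitrary two-fold threshold; the function names, the threshold and the toy positions are all illustrative assumptions.

```python
from collections import Counter

WINDOW_BP = 100_000  # window size stated in the text


def window_counts(positions, window_bp=WINDOW_BP):
    """Count SNP positions falling in each fixed-size window."""
    return Counter(pos // window_bp for pos in positions)


def assign_origin(pummelo_snps, mandarin_snps, min_fold=2.0, window_bp=WINDOW_BP):
    """Label each window by the species whose reads differ least from the reference."""
    p_counts = window_counts(pummelo_snps, window_bp)
    m_counts = window_counts(mandarin_snps, window_bp)
    calls = {}
    for window in sorted(set(p_counts) | set(m_counts)):
        n_p, n_m = p_counts[window], m_counts[window]
        if n_m >= min_fold * max(n_p, 1):
            calls[window] = "pummelo"    # reference differs mainly from mandarin
        elif n_p >= min_fold * max(n_m, 1):
            calls[window] = "mandarin"   # reference differs mainly from pummelo
        else:
            calls[window] = "undetermined"
    return calls


# Hypothetical SNP positions (bp) relative to the orange reference.
toy_pummelo = [10, 20_000, 50_000, 150_000]
toy_mandarin = [120_000, 130_000, 160_000, 170_000, 250_000]
calls = assign_origin(toy_pummelo, toy_mandarin)
print(calls)

# Reported assignment totals: mandarin-to-pummelo ratio in Mb.
reported_ratio = 118.2 / 39.7
print(f"{reported_ratio:.2f}")
```

The reported totals give a mandarin-to-pummelo ratio of about 2.98, in line with the 1:3 contribution described in the text.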

On the basis of the high heterozygosity rate (50%), the 1:3 ratio of pummelo and mandarin genetic input and the evidence that the chloroplast of sweet orange is probably derived from pummelo 7 (Supplementary Fig. 12), we hypothesized that sweet orange originated from an interspecific hybridization with pummelo as the female parent and mandarin as the male parent followed by a backcross with a male mandarin (sweet orange = (pummelo × mandarin) × mandarin) (Fig. 3e).
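The expectations behind this hypothesis follow from single-locus Mendelian segregation. The sketch below enumerates gametes for the proposed pedigree, assuming that the pure-species parents are homozygous for species-diagnostic alleles (written P and M) and ignoring linkage and selection; these are simplifying assumptions for illustration, not a description of the authors' analysis.

```python
from fractions import Fraction
from itertools import product


def cross(parent_a, parent_b):
    """Genotype probabilities at one locus; each parent is a two-allele string."""
    offspring = {}
    for allele_a, allele_b in product(parent_a, parent_b):
        genotype = "".join(sorted(allele_a + allele_b))
        offspring[genotype] = offspring.get(genotype, 0) + Fraction(1, 4)
    return offspring


# Step 1: pummelo (PP) x mandarin (MM) gives the interspecific hybrid.
f1 = cross("PP", "MM")

# Step 2: backcross of the hybrid to mandarin (MM).
backcross = {}
for genotype, prob in f1.items():
    for child, child_prob in cross(genotype, "MM").items():
        backcross[child] = backcross.get(child, 0) + prob * child_prob

pummelo_share = sum(
    prob * Fraction(genotype.count("P"), 2) for genotype, prob in backcross.items()
)
heterozygous_share = backcross["MP"]

print(backcross)
print(pummelo_share, heterozygous_share)
```

Under these assumptions the backcross carries one quarter pummelo alleles and is heterozygous at half of its loci, matching the 1:3 ratio and the roughly 50% heterozygosity cited in the text.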

Figure 2 Genome characterization. (a) Circular diagram depicting the genomic landscape of the nine sweet orange pseudochromosomes (Chr1–Chr9 on a Mb scale). The denotation of each track is listed on the right: I, repeat coverage (%); II, gene density (genes per Mb); III, transcription levels (×100 RPKM, bin = 1 Mb); IV, concordant RNA-PETs (concord%, bin = 1 Mb); V, GC content (GC%, bin = 1 Mb); VI, SNP density (×100 SNPs, bin = 1 Mb); VII, indel density (×100 indels, bin = 1 Mb). RPKM, reads per kilobase exon model per million mapped reads. (b) Experimental data support for gene model predictions. Left, box plot of RNA-Seq reads aligned over gene features. The top and bottom of the boxes indicate the upper and lower quartiles, respectively. The middle red bars represent the median, and the red dots denote the mean value for each feature. CDS, coding sequence. Right, Venn diagram of the predicted RNA splice junctions supported by various evidence (EST, protein and RNA-Seq data). Numbers in parentheses are the number of splice junctions in each category: EST (76,722), protein (87,153), prediction (121,546) and RNA-Seq (97,219). (c) Demarcation of gene model boundaries by RNA-PET data. Histogram plot of RNA-PET data in association with aligned gene models in relation to the 5′ (putative transcription start site, TSS) and 3′ (putative poly(A) site, PAS) boundaries.

Genome evolution

Characterization and annotation of the citrus genome provided the basic features for us to further investigate the evolutionary history of sweet oranges. Most plant species have gone through whole-genome duplication (WGD) during evolution 24. The size and frequency of paralogous gene families accumulated in each genome serve as a record for such evolutionary process. Self-alignment of the citrus genome sequences on the basis of the 23,421 gene models located in the nine pseudochromosomes identified 1,296 paralogous gene groups (Fig. 4a). The mean synonymous substitution rates (Ks) in the duplicated genes (Ks = 1.27) suggested that an ancient WGD occurred in citrus (Fig. 4b and Supplementary Fig. 13). Furthermore, we analyzed intergenome collinearity between orange and cacao 25, Arabidopsis 16, apple 20, strawberry 26 and grape 27 genomes and identified orange orthologous genes in syntenic blocks with respect to each of these genomes (Supplementary Table 18). A one-to-one orthologous relationship was predominant for the orange-to-cacao, orange-to-grape and orange-to-strawberry comparisons, indicating that there were no recent WGDs except the shared ancient triplication, called the γ event, which was shared by all core eudicots 24. In contrast, we found an approximately one-to-two orthologous relationship for the orange-to-apple comparison and a one-to-four relationship for the orange-to-Arabidopsis comparison, supporting that an additional one and two rounds of recent WGD occurred in apple and Arabidopsis, respectively (Fig. 4c and Supplementary Fig. 14). Furthermore, on the basis of cross-species synteny block boundaries, we estimated that at least 49 events of chromosome translocations and fusions occurred in the nine citrus chromosomes, which evolved from chromosomes of a proposed paleohexaploid ancestor that is common to all sequenced eudicot genomes 28. It is noteworthy that citrus pseudochromosome 4 had only one chromosome translocation and fusion event, whereas other pseudochromosomes underwent multiple (3–12) interchromosome fusions during their evolutionary history (Supplementary Fig. 15).
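The step from orthology ratios to rounds of duplication is a base-2 logarithm: each additional whole-genome duplication doubles the number of syntenic copies matched to one orange gene. The sketch below encodes that reasoning under strong simplifying assumptions (duplications only, no triplications, no gene loss); the function name is hypothetical, and the ratios are those quoted in the text.

```python
import math


def extra_wgd_rounds(copies_per_orange_gene):
    """Rounds of lineage-specific duplication implied by a one-to-N syntenic ratio."""
    rounds = math.log2(copies_per_orange_gene)
    if not rounds.is_integer():
        raise ValueError("ratio is not a power of two; duplication-only model fails")
    return int(rounds)


# One-to-N orthologous relationships reported for each comparison.
reported_ratios = {"cacao": 1, "grape": 1, "strawberry": 1, "apple": 2, "Arabidopsis": 4}
implied_rounds = {species: extra_wgd_rounds(n) for species, n in reported_ratios.items()}
print(implied_rounds)
```

The one-to-one comparisons imply no additional duplication relative to orange, while apple and Arabidopsis come out at one and two extra rounds, as the text concludes.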

This whole-genome sequence also provides comprehensive information for phylogenetic analysis. We applied the maximum likelihood method for genome-scale phylogenetic analysis using 103 shared single-copy nuclear genes from orange, Arabidopsis 16, cacao 25, poplar 29, grape 27, apple 20, strawberry 26, papaya 21 and castor bean 30 (Fig. 4d and Supplementary Note). The phylogenetic tree showed that Citrus is phylogenetically closer to cacao, Arabidopsis and papaya, although it is more ancient than they are within the Malvidae, and is estimated to have diverged from Malvales 85 million years ago. Thus, our analysis provides new information on the phylogeny of citrus and other plant species on the basis of genome-scale data.

Figure 3 Heterozygosity and hybrid origin of sweet orange. (a) The heterozygosity rate of the sweet orange genome. Homozygous and heterozygous states of genic regions were detected using SNP data from the parental diploid genome; repeat regions were evaluated by comparison of SSR marker patterns between the diploid and dihaploid. The number in the top pie chart indicates the gene number, and the number in the bottom pie chart is the number of SSR markers. (b) Chromosome karyotyping of dihaploid (left) and parental diploid (right). The 18 chromosomes (chrs) in each nucleus were easily identified into nine matched pairs in the dihaploid line, whereas six chromosomes in the diploid nucleus have no clear counterpart, indicating high heterozygosity in the diploid state. The cytological types of chromosomes were traditionally named as Bf, B, C, Df, D and F according to the 4′,6-diamidino-2-phenylindole (blue) and chromomycin A3 (yellow) banding regions, as well as the fragile site (circle), as previously described 41 (Supplementary Fig. 3). (c) SSR marker genotyping of two pummelo cultivars (P1 and P2), sweet orange (SO) and two mandarin cultivars (M1 and M2). Of the 307 SSR markers, 105 were shared among pummelo, orange and mandarin (middle), 55 were pummelo hereditary (left) and 147 were mandarin hereditary (right). One marker showed an orange-specific pattern and was excluded in this analysis. (d) The genetic origin of the sweet orange dihaploid genome determined by high-density SNP markers of pummelo and mandarin. OP, origin pattern (pummelo origin in yellow, mandarin origin in orange and undetermined in gray); SC, assembled sequence scaffold; LG, linkage group map. Pseudochromosome 1 is shown here as an example, and the other chromosomes are shown in Supplementary Figure 11. (e) A model of the sweet orange origin. With pummelo (PP) as the female parent crossed with mandarin (MM), the interspecific hybrid was backcrossed with mandarin (MM) and produced the ancient sweet orange. [Panel values: genic regions (SNP markers), 11,124 homozygous (47%) and 12,297 heterozygous (53%); repeat regions (SSR markers), 152 homozygous (50%) and 155 heterozygous (50%); pummelo origin, 39.7 Mb; mandarin origin, 118.2 Mb.]

Genes specially involved in citrus fruit biology

Fruit biology is an interesting topic for citrus not only because citrus fruits are an important nutrition source for humans but also because they involve unique metabolic pathways. This citrus reference genome provides a comprehensive genomic framework of genes to help address citrus fruit biology. A clear strategy is to identify genes specifically expressed in fruit tissues. We compared the RNA-Seq data derived from callus, leaf, flower and fruit and identified 2,697 (9.2%) genes that were significantly upregulated (P < 0.01) in fruits (Supplementary Table 19). Gene Ontology analysis of upregulated genes in citrus fruits revealed that genes for oxidoreductase, oxidation reduction and vitamin binding were over-represented (Supplementary Table 20).

    Fruit ripening is an important biological feature and has tremendous economic value. Genes involved in ripening have been well studied in tomato 31. Of the 16 tomato ripening genes, we found that only 3 (ETR3 (also known as NR), MADS-RIN and BP0353 (also known as PHYF)) are differentially expressed in orange fruits, whereas the majority (13) are not (Supplementary Table 21), possibly reflecting the existence of two distinct ripening mechanisms for the climacteric (tomato) and nonclimacteric (orange) fruits. Notably, MADS-RIN, a transcription factor previously shown to be necessary for the normal development and ripening of both tomato 32 (climacteric) and strawberry 33 (nonclimacteric), is highly upregulated in orange fruits (Supplementary Table 21), providing further support for the notion that MADS-RIN may function as a key ripening regulator with broad impacts in both climacteric and nonclimacteric fruit ripening.

    We were specifically interested in the biosynthesis of vitamin C (L-ascorbic acid (AsA), a strong antioxidant agent), as orange fruits are known to be rich in this vitamin. We analyzed the gene transcription of key enzymes involved in four known biosynthesis branch pathways upstream of AsA 34,35 (Supplementary Fig. 16) and found that many of these genes are upregulated specifically in fruit. Remarkably, all three genes for the enzymes (PG, PME and GalUR) involved in the

    [Figure 4 graphic not reproduced. Recoverable labels: panel a, sweet orange chromosomes Chr1 to Chr9 colored by ancestral protochromosomes A1, A4, A7, A10, A13, A16 and A19; panel b, percentage of syntenic genes (y axis) against Ks (x axis) for apple, cacao and Arabidopsis gene pairs compared with orange; panel c, one-to-one, one-to-two and one-to-four syntenic relationships among Citrus sinensis, Theobroma cacao, Arabidopsis thaliana, Vitis vinifera, Fragaria vesca and Malus domestica; panel d, phylogenetic tree of Citrus, Theobroma, Arabidopsis, Carica, Populus, Ricinus, Fragaria, Malus, Vitis, Oryza and Sorghum, grouped into Malvidae and Fabidae, with node ages in MYA.]

    Figure 4 Evolutionary analysis of the sweet orange genome. (a) Schematic representation of major interchromosomal relationships within the 1,294 paralogous gene groups in the sweet orange genome. Syntenic blocks derived from the seven ancestral protochromosomes A1, A4, A7, A10, A13, A16 and A19 (ref. 28) are color coded as indicated. (b) Distribution of synonymous substitution rates (Ks) for homologous gene groups for intrachromosome and interchromosome comparisons. Gene duplication analysis in comparison to apple, Arabidopsis and cacao indicates that no recent WGDs occurred in the sweet orange genome. (c) Schematic representation of the syntenic relationship of sweet orange chromosome 9 and the corresponding chromosomes in Arabidopsis, Theobroma cacao, Malus domestica, Fragaria vesca and Vitis vinifera. Syntenic regions derived from the seven ancestral protochromosomes are color coded as in a. Among the 59 syntenic blocks between sweet orange chromosome 9 and the Arabidopsis chromosomes, 22 (37%) show primary correspondence to four Arabidopsis blocks, 10 to three blocks, 24 to two blocks and 3 to one block; among the 79 syntenic blocks between sweet orange chromosome 9 and the apple chromosomes, 48 (61%) show primary correspondence to two apple blocks, 20 to three blocks, 6 to four blocks and 2 to one block. Therefore, on the basis of these percentages, one-to-four and one-to-two relationships were determined to be the most probable for the orange-to-Arabidopsis and orange-to-apple comparisons, respectively. (d) Citrus phylogeny on the basis of 103 single-copy genes shared among Arabidopsis, cacao, poplar, grape, apple, strawberry, papaya and castor bean from nuclear genome data. MYA, million years ago.


    galacturonate pathway branch were substantially upregulated (Fig. 5a), of which GalUR encodes D-galacturonic acid reductase, the enzyme that catalyzes the rate-limiting step in this pathway 36. Coincidentally, we identified 18 GalUR paralogous genes in the citrus genome (Supplementary Table 22), which is the largest number among the plant genomes compared (17 in apple and grapevine, 15 in strawberry, 13 in papaya, 12 in rice, maize and cacao, 7 in Arabidopsis, 6 in Brachypodium and none in Chlamydomonas) (Supplementary Table 23). Further evolutionary analysis using the four closely related plant species within the Malvidae clade (orange, cacao, papaya and Arabidopsis) revealed two clusters (one with four genes and the other with seven genes) of recent GalUR gene expansion in the citrus genome, probably by tandem duplication (Fig. 5b). Notably, these two clusters also showed expansion in papaya, another tree crop with high vitamin C content in its fruit. In addition, orange GalUR genes within these two clusters showed diverse expression patterns across different developmental stages. One of them (GalUR-12) was significantly upregulated in fruit (P = 5 × 10^-7), whereas the other three members in the same cluster did not show a similar expression pattern (Fig. 5a,b), indicating diversification of gene transcription regulation in AsA biosynthesis. GalUR-12 shares high sequence identity (E value < 10^-106) and shows a close phylogenetic relationship (Supplementary Fig. 17) with the strawberry GalUR gene, which has been further tested for specific vitamin C production 36. Collectively, our data indicate that the galacturonate pathway is highly upregulated in orange fruit and GalUR may be the most important contributor to AsA accumulation in orange fruit.

    DISCUSSION

    We sequenced the sweet orange genome using a two-step approach: first, we sequenced and assembled a dihaploid genome, and second, we mapped the parental diploid genomic DNA sequence reads to the haploid reference genome to complete the construction of the heterozygous genome map. The availability of the sweet orange genome sequence provides a valuable genomic resource for citrus genetics and breeding improvement. The sequence of sweet orange, as a member of the Rutaceae family and the Sapindales, also provides a valuable reference for plant genome evolutionary analyses.

    Our genome analysis indicated that there was only an ancient (γ) triplication event and no recent WGD events in the orange genome, which partially explains the small size of the citrus genome. Phylogenetic analysis indicated that citrus is rather ancient, evolutionarily close to cacao and estimated to have diverged 85 million years ago from Malvales. An infrequent reproductive cycle in some taxa because of apomixis, male and/or female sterility, long juvenility and vegetative propagation may contribute to the restriction of genome expansion and evolution.

    Heterozygosity is of great interest in crop plants. Hybridization causing a high degree of heterozygosity could create hybrid vigor, known as heterosis 37. The remarkably high degree of heterozygosity in the genome of cultivar Valencia sweet orange (C. sinensis cv. Valencia), as evidenced in our genomic and cytological analyses, hinted that sweet orange is an interspecific hybrid between pummelo and mandarin. On the basis of the collective evidence, we reconstructed the scenario regarding the ancient primary events of the origin of sweet orange: female pummelo crossed with male mandarin to create the initial interspecific hybrid, which was then crossed again with male mandarin to produce sweet orange. This event might have happened at least 2,300 years ago, or much earlier, as sweet orange was recorded in Chinese literature as long ago as 314 BC 3,38. Although additional genetic changes might have occurred afterward, it is still remarkable that this ancient hybrid genotype seems to be preserved in today's sweet orange, probably because of its strictly asexual reproduction (apomixis through nucellar embryony) and human selection and propagation by grafting. This scenario explains why most of today's sweet orange cultivars are genetically one biotype and highly heterozygous, with diversification occurring mostly through somatic mutations 7,39,40.
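
The proposed pedigree implies a simple expectation that can be compared with the SSR counts given in the Figure 3c legend. The sketch below is a back-of-envelope calculation, not the authors' analysis: it assumes Mendelian inheritance without selection, it treats the share of parent-specific markers as a rough proxy for genome share, and the function name `expected_donor_fraction` is hypothetical.

```python
from fractions import Fraction


def expected_donor_fraction(backcrosses):
    """Expected genome share of the donor parent after an F1 cross
    followed by `backcrosses` rounds of crossing to the other parent.

    The F1 carries 1/2 of the donor genome; each backcross halves it.
    """
    if backcrosses < 0:
        raise ValueError("backcrosses must be non-negative")
    return Fraction(1, 2) ** (backcrosses + 1)


# Model from the text: (pummelo x mandarin) x mandarin, i.e. one backcross.
expected_pummelo = expected_donor_fraction(1)

# Parent-specific SSR markers reported in the Figure 3c legend.
pummelo_markers, mandarin_markers = 55, 147
observed_pummelo = Fraction(pummelo_markers, pummelo_markers + mandarin_markers)

print(f"expected pummelo share: {float(expected_pummelo):.2f}")
print(f"observed pummelo share: {float(observed_pummelo):.2f}")
```

Under these assumptions the expected pummelo share is one quarter, and 55 of the 202 parent-specific markers is about 27%, which is in the same range.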

    Characterization of the citrus genome for protein-coding genes and comparative genomic analysis with other plant genomes, as well as transcriptome analyses, have yielded several insights into citrus-specific genes, particularly those that are involved in fruit biology and metabolic pathways. Our analysis suggests that the rate-limiting enzyme

    [Figure 5 graphic not reproduced. Recoverable labels: panel a, heat map of normalized expression level (scale 0 to 1.0) in callus, leaf, flower and fruit for AsA-related genes arranged in groups 1 to 4; panel b, phylogenetic tree of GalUR genes from Citrus sinensis, Arabidopsis thaliana, Carica papaya and Theobroma cacao (top) and positions of GalUR-1 to GalUR-18 on chromosomes Chr1, Chr3, Chr5, Chr6, Chr7 and Chr9 and scaffolds Sc011, Sc094, Sc132 and Sc167 (bottom).]

    Figure 5 Genes involved in vitamin C metabolism. (a) Heat map of the normalized RNA-Seq data for genes involved in AsA metabolism. Genes with a minimum expression level (RNA-Seq RPKM > 1 (ref. 42)) are shown. (b) Phylogenetic analysis of the GalUR gene family in C. sinensis, Arabidopsis thaliana, Carica papaya and T. cacao (top); two recent expansions are shaded with two ovals (one cluster with GalUR-6, GalUR-11, GalUR-12 and GalUR-13 (orange oval) and the other cluster with GalUR-1, GalUR-2, GalUR-3, GalUR-4, GalUR-8, GalUR-10 and GalUR-14 (gray oval)). Note that GalUR-12 is a recent member of the family and is highly upregulated in orange fruits. The genomic organization of the 18 GalUR genes (black arrows) in the sweet orange genome is shown (bottom). Sc, scaffold.


    GalUR 36 in the galacturonate pathway could be an important contributor to the high vitamin C accumulation in citrus fruit, and the recent expansion of the GalUR gene family in the citrus genome may suggest a genomic basis for regulating vitamin C production.

    The availability of the orange genome and deep mining of orange-specific genes with known or unknown biochemical functions will facilitate a better understanding of many economically and agronomically important traits, including fruit color, flavor, sugar, organic acid and disease resistance, in years to come. Future comparison of this sweet orange genome with the recently released high-quality clementine mandarin genome assembly draft (see URLs) should provide new insights into citrus biological and evolutionary questions.

    URLs. Food and Agriculture Organization of the United Nations (FAO statistics), http://faostat.fao.org/default.aspx; clementine mandarin genome assembly, http://www.phytozome.net/clementine.php; sweet orange genome website, http://citrus.hzau.edu.cn/orange.

    METHODS

    Methods and any associated references are available in the online version of the paper.

    Accession codes. Genome assembly data have been deposited at DDBJ/EMBL/GenBank under the accession AJPS00000000 and the BioProject ID PRJNA86123. The version described in this paper is the first version, AJPS01000000. DNA sequence data, assembly, annotation and the browser of the sweet orange genome are available through our website at http://citrus.hzau.edu.cn/orange.

    Note: Supplementary information is available in the online version of the paper.

    ACKNOWLEDGMENTS

    The sweet orange genome project was supported by the Ministry of Science and Technology of China (2011CB100600), the National Natural Science Foundation of China (30921002, 30830078 and 31071779) and the Ministry of Agriculture (2011-g21 and CARS-27). Sweet orange mapping was funded by the Citrus Research Board (5200-125) and University of California Discovery grant program (itl-bio-10155). We thank W. Xie and Z. Li for their assistance in bioinformatic analysis and cytological experiments.

    AUTHOR CONTRIBUTIONS

    X.-X.D. and Y.R. conceived the project and the strategy. Q.X. coordinated the overall project. X.R. directed sequencing data generation. L.-L.C. led the genome annotation analysis. N.N., D.B. and S.G. assembled the draft genome. D.C., W.-B.J., B.-H.H., J.-W.C., Z.T., F.X., O.J.L., Y.L., X.-W.X. and H.-Y.Z. performed gene annotation, SNP analysis, transcriptome analysis and database management. A.Z., Q.X., H.K., J.C., L.W., Q.H., M.K.B., W.Z., J.-H.L., F.G., H.C., Y.-J.C., J.X., X.Y. and W.-W.G. performed repetitive elements, genome anchor, gene synteny and evolutionary analyses. C.C., H.L., X.G., Y.M. and S.X. performed cytological analyses. M.P.L. and M.L.R. developed the sweet orange linkage map. Q.X. and Y.R. wrote the manuscript with contributions from X.R., L.-L.C., D.C., A.Z., N.N., D.B., C.C., W.-B.J., F.X., H.K., M.L.R. and X.-X.D.

    COMPETING FINANCIAL INTERESTS

    The authors declare no competing financial interests.

    Published online at http://www.nature.com/doifinder/10.1038/ng.2472. Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

    This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/

    1. Roose, M.L. & Close, T.J. Genomics of citrus, a major fruit crop of tropical and subtropical regions. in Genomics of Tropical Crop Plants (eds. Moore, P.H. & Ming, R.) 187-201 (Springer Press, 2008).

    2. Gmitter, F.G. et al. Citrus genomics. Tree Genet. Genomes 8, 611-626 (2012).

    3. Webber, H.J. History and development of the Citrus industry. in The Citrus Industry (eds. Reuther, W. et al.) Chap. 1, 1-39 (University of California Press, 1967).

    4. Scora, R.W. On the history and origin of Citrus. Bull. Torrey Bot. Club 102, 369-375 (1975).

    5. Gmitter, F. & Hu, X. The possible role of Yunnan province, China, in the origin of contemporary Citrus species (Rutaceae). Econ. Bot. 44, 267-277 (1990).

    6. Legge, J. The shoo king and The tribute of Yu. in The Chinese Classics, Vol. 3, Pt. 1 and Pt. 3, Bk. 1, Chap. 6, 111-112 (Trubner & Co., London, 1865).

    7. Moore, G.A. Oranges and lemons: clues to the taxonomy of Citrus from molecular markers. Trends Genet. 17, 536-540 (2001).

    8. Nicolosi, E. et al. Citrus phylogeny and genetic origin of important species as investigated by molecular markers. Theor. Appl. Genet. 100, 1155-1166 (2000).

    9. Arumuganathan, K. & Earle, E.D. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9, 208-218 (1991).

    10. Cao, H. et al. Doubled haploid callus lines of Valencia sweet orange recovered from anther culture. Plant Cell Tissue Organ Cult. 104, 415-423 (2011).

    11. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265-272 (2010).

    12. Gao, S., Sung, W.K. & Nagarajan, N. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J. Comput. Biol. 18, 1681-1691 (2011).

    13. Baig, M.N., Yu, A., Guo, W. & Deng, X. Construction and characterization of two Citrus BAC libraries and identification of clones containing the phytoene synthase gene. Genome 52, 484-489 (2009).

    14. Lyon, M.P. A Genomic Genetic Map of the Common Sweet Orange and Poncirus trifoliata. PhD dissertation, University of California, Riverside (2008).

    15. Iwamasa, M. Reciprocal translocation in the Valencia orange. Chromosome Inform. Service 4, 9-10 (1963).

    16. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796-815 (2000).

    17. Goff, S.A. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92-100 (2002).

    18. Du, J. et al. Evolutionary conservation, diversity and specificity of LTR-retrotransposons in flowering plants: insights from genome-wide analysis and multi-specific comparison. Plant J. 63, 584-598 (2010).

    19. Gao, L., McCarthy, E.M., Ganko, E.W. & McDonald, J.F. Evolutionary history of Oryza sativa LTR retrotransposons: a preliminary survey of the rice genome sequences. BMC Genomics 5, 18 (2004).

    20. Velasco, R. et al. The genome of the domesticated apple (Malus x domestica Borkh.). Nat. Genet. 42, 833-839 (2010).

    21. Ming, R. et al. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452, 991-996 (2008).

    22. Dimmer, E.C. et al. The UniProt-GO Annotation database in 2011. Nucleic Acids Res. 40, D565-D570 (2012).

    23. Xu, Q. et al. Discovery and comparative profiling of microRNAs in a sweet orange red-flesh mutant and its wild type. BMC Genomics 11, 246 (2010).

    24. Jiao, Y. et al. Ancestral polyploidy in seed plants and angiosperms. Nature 473, 97-100 (2011).

    25. Argout, X. et al. The genome of Theobroma cacao. Nat. Genet. 43, 101-108 (2011).

    26. Shulaev, V. et al. The genome of woodland strawberry (Fragaria vesca). Nat. Genet. 43, 109-116 (2011).

    27. Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463-467 (2007).

    28. Abrouk, M. et al. Palaeogenomics of plants: synteny-based modelling of extinct ancestors. Trends Plant Sci. 15, 479-487 (2010).

    29. Tuskan, G.A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596-1604 (2006).

    30. Chan, A.P. et al. Draft genome sequence of the oilseed species Ricinus communis. Nat. Biotechnol. 28, 951-956 (2010).

    31. The Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485, 635-641 (2012).

    32. Vrebalov, J. et al. A MADS-box gene necessary for fruit ripening at the tomato ripening-inhibitor (rin) locus. Science 296, 343-346 (2002).

    33. Seymour, G.B. et al. A SEPALLATA gene is involved in the development and ripening of strawberry (Fragaria x ananassa Duch.) fruit, a non-climacteric tissue. J. Exp. Bot. 62, 1179-1188 (2011).

    34. Cruz-Rus, E., Botella, M.A., Valpuesta, V. & Gomez-Jimenez, M.C. Analysis of genes involved in L-ascorbic acid biosynthesis during growth and ripening of grape berries. J. Plant Physiol. 167, 739-748 (2010).

    35. Bulley, S.M. et al. Gene expression studies in kiwifruit and gene over-expression in Arabidopsis indicates that GDP-L-galactose guanyltransferase is a major control point of vitamin C biosynthesis. J. Exp. Bot. 60, 765-778 (2009).

    36. Agius, F. et al. Engineering increased vitamin C levels in plants by overexpression of a D-galacturonic acid reductase. Nat. Biotechnol. 21, 177-181 (2003).

    37. Lippman, Z.B. & Zamir, D. Heterosis: revisiting the magic. Trends Genet. 23, 60-66 (2007).

    38. Chinese Society of Citriculture. Citrus Industry in China. (China Agriculture Press, Beijing, China, 2008).

    39. Barrett, H.C. & Rhodes, A.M. A numerical taxonomic study of affinity relationships in cultivated Citrus and its close relatives. Syst. Bot. 1, 105-136 (1976).

    40. Mendel, K. Bud mutations in citrus and their potential commercial value. Proc. Int. Soc. Citricult. 1, 86-89 (1981).

    41. Guerra, M. Cytogenetics of Rutaceae. V. High chromosomal variability in Citrus species revealed by CMA/DAPI staining. Heredity 71, 234-241 (1993).

    42. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621-628 (2008).


    NATURE GENETICS doi:10.1038/ng.2472

    ONLINE METHODS

    Plant material and DNA sequencing. The dihaploid line derived from anther culture of C. sinensis cv. Valencia 10 was used for paired-end-tag DNA sequencing using an Illumina GAII platform. Approximately 785 million 2 × 100 bp reads were generated from libraries with different fragment sizes (300 bp, 2 kb, 10 kb and 20 kb). DNA from the heterozygous diploid parent of the dihaploid line (the tree from which the anther was harvested and the dihaploid line was derived), three pummelos (C. grandis) and three mandarins (C. reticulata) was subjected to whole-genome shotgun sequencing of paired-end-tag reads from 300-bp genomic DNA fragments (Supplementary Table 12).

    Genome assembly. We first applied the software QUAKE 43 to correct the reads from each library. Contig assembly, scaffolding and gap filling of sequencing data were done using the assembler SOAPdenovo 11, with the three DNA-PET libraries (2 kb, 10 kb and 20 kb) being used to link contigs into scaffolds. Scaffolds and contigs were refined further with the gap-filling module in SOAPdenovo (GapCloser), which is used for bridging scaffold gaps, to produce an assembly with a scaffold N50 of 427 kb that covered 286 Mb of the citrus genome. To further improve the assembly, we used the optimal scaffolder Opera 12 with DNA-PET reads (10 kb and 20 kb) and BAC ends (~125 kb, 5,136 BAC-end sequences) in order of increasing library size to construct larger scaffolds and fill gaps using GapCloser (final genome size of 320.5 Mb and N50 of 1.69 Mb). DNA-PET reads and BAC ends were mapped to the SOAPdenovo assembly with BOWTIE 44 and BWA-SW 45, respectively (Supplementary Note).
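
The N50 values quoted above summarize assembly contiguity. As a minimal sketch of how that statistic is usually defined (this is the standard definition, not code from the assemblers named in the text, and the scaffold lengths are made-up toy values):

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths in kb (illustrative only, not from the assembly).
toy_scaffolds_kb = [900, 700, 400, 300, 200, 100]
print(n50(toy_scaffolds_kb))
```

For the toy list the total is 2,600 kb, and the two longest scaffolds already cover more than half of it, so the function returns the length of the second one.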

    Assembled scaffolds anchoring to genetic map. The C. sinensis genetic map by Lyon 14 was used for anchoring the genome assembly. A total of 768 markers with known sequences in the genetic map were aligned to the scaffolds by BLASTN analysis at an E value cutoff of 1 × 10^-20. Hits with coverage >50% and identity >90% were considered mapped markers. The pseudomolecules were constructed on the basis of the anchoring results, and the order of chromosomal numbers was based on the linkage group length in cM.
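
The marker-filtering rule can be sketched as follows. This is an illustration only: the `Hit` record and both function names are hypothetical, coverage is assumed to mean aligned length divided by marker length (the text does not define it), and keeping the lowest-E-value hit per marker is an added assumption rather than a stated step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One marker-to-scaffold alignment (hypothetical record layout)."""
    marker: str
    scaffold: str
    identity_pct: float
    align_len: int
    marker_len: int
    evalue: float


def is_mapped(hit, max_evalue=1e-20, min_coverage=0.5, min_identity=90.0):
    """Apply the thresholds from the text: E value cutoff, coverage >50%, identity >90%."""
    coverage = hit.align_len / hit.marker_len  # assumed definition of coverage
    return (hit.evalue <= max_evalue
            and coverage > min_coverage
            and hit.identity_pct > min_identity)


def anchor_markers(hits):
    """Map each marker to the scaffold of its best passing hit (lowest E value)."""
    best = {}
    for hit in hits:
        if not is_mapped(hit):
            continue
        current = best.get(hit.marker)
        if current is None or hit.evalue < current.evalue:
            best[hit.marker] = hit
    return {marker: hit.scaffold for marker, hit in best.items()}


# Toy hits (illustrative values only).
toy_hits = [
    Hit("m1", "sc1", 98.0, 180, 200, 1e-50),  # passes all thresholds
    Hit("m1", "sc2", 95.0, 150, 200, 1e-30),  # passes, but weaker than sc1
    Hit("m2", "sc3", 85.0, 190, 200, 1e-40),  # identity too low
    Hit("m3", "sc4", 99.0, 90, 200, 1e-25),   # coverage too low
    Hit("m4", "sc5", 99.0, 190, 200, 1e-10),  # E value too high
]
print(anchor_markers(toy_hits))
```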

    Identification of repetitive elements in the citrus genome. Transposable element libraries were constructed by integrating the results of different signature-based, homology-based and ab initio-based methods (Supplementary Note). Results from different methods were combined to build transposable element families as described by Wicker et al. 46. At least one member for each family was selected to construct the custom transposable element libraries, which were applied to mask the citrus genome using the RepeatMasker software (http://www.repeatmasker.org/).

    Gene prediction and annotation. A hybrid approach was used, which combined de novo predictions with evidence-based data (ESTs, protein homology and RNA-Seq) analysis using the PASA and EVM 47 pipeline (Supplementary Note). The predicted gene models from EVM were then updated by PASA assembly alignments. Gene functions were assigned according to the best alignment using BLASTP (E value < 10^-5) to the UniProt database (including the SWISS-PROT and TrEMBL databases). miRNAs were predicted on the basis of small RNA sequencing data 23, and the criteria applied to characterize a miRNA locus are described in the Supplementary Note.

    RNA-Seq analysis. RNA-Seq reads were generated using Illumina and SOLiD platforms and aligned to the reference genome using TopHat 48 (v1.2.0) and BioScope (v1.3, with default parameters), respectively (Supplementary Table 1 and Supplementary Fig. 4), to identify exons and splice junctions ab initio. The expression level of each gene in each RNA-Seq library was calculated as the RPKM 42 (Supplementary Note). Biological and technical replicates were performed for gene expression analysis (Supplementary Fig. 18). For hierarchical clustering analysis, we used the clustering software Cluster 3.0 (ref. 49) to perform complete linkage hierarchical clustering on genes using uncentered Pearson's correlation as the distance measure.
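
Two quantities named above have compact definitions. The sketch below shows the usual RPKM formula (reads per kilobase of transcript per million mapped reads) and the uncentered correlation, which is the cosine of the angle between two expression profiles. It is an illustration with hypothetical function names and toy numbers, not the pipeline used in the paper, and converting the correlation to a distance as 1 - r is an assumption.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def uncentered_correlation(x, y):
    """Pearson-type correlation computed without subtracting the means."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for an all-zero profile")
    return dot / (norm_x * norm_y)


def uncentered_distance(x, y):
    """Distance derived from the uncentered correlation (assumed as 1 - r)."""
    return 1.0 - uncentered_correlation(x, y)


# Toy example: 500 reads on a 2,000-bp gene in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
# Two toy expression profiles across callus, leaf, flower and fruit.
print(uncentered_distance([1.0, 2.0, 0.5, 8.0], [0.8, 2.5, 0.4, 7.0]))
```

Because the means are not subtracted, two profiles that differ only by a constant scale factor have distance 0, whereas a constant offset changes the result.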

    RNA-PET analysis. RNA-PET analysis for full-length mRNA transcripts was performed as described in Ruan and Ruan 50. Approximately 100 million RNA-PETs (Supplementary Table 1) were generated from four tissues (callus, flower, leaf and fruit) and were used for mapping, clustering and annotation of the reference genome (Supplementary Fig. 6).

    Identification of potential species-specific genes. The OrthoMCL method 51 was adopted to construct gene families for the genes in 22 plant species (Supplementary Fig. 19). Proteins with no homologs in the other 21 plant genomes were used for TBLASTN searches against the Plant Unique Transcripts Assembly database (PUTs, http://www.plantgdb.org/) to exclude false-positive genes caused by incomplete genome annotation. The potential species-specific genes were rescreened by BLAST searches against the nonredundant protein database in GenBank.

    SNPs and small indels analysis. Illumina reads from heterozygous diploid orange (the diploid parent of the dihaploid line), three mandarin cultivars (C. reticulata) and three pummelo cultivars (C. grandis) were used for SNP and small indel analysis. All the sequencing reads were mapped using BWA v0.5.9 with default parameters for paired-end reads. Two variant calling tools, SAMtools 52 and the Genome Analysis Toolkit (GATK 53), were used (Supplementary Note). The sequence data from the dihaploid line were used as a negative control for the SNP analysis as described above to estimate the error rate of SNP detection (Supplementary Note).

    Analyses for the hybrid origin of sweet orange. The SSR and SNP markers were used for genotyping analysis to test the potential hybrid origin of sweet orange.

    For SSR genotyping, SSR regions were predicted using the whole-genome sequence. A total of 480 pairs of SSR primers (primers are listed in Supplementary Table 24) were selected on the basis of three primer pairs per scaffold (>10 kb). The genomic locations of these SSR loci are provided in Supplementary Figure 20. PCR amplification was conducted in a MJ-PTC-200 thermal PCR cycler (MJ Research, Waltham, Massachusetts) using the following program: 94 °C for 3 min; 32 cycles of 94 °C for 30 s, 55 °C for 30 s and 72 °C for 45 s; and a final extension at 72 °C for 10 min. After PCR, products were separated on polyacrylamide gels and SSR bands were visualized with silver staining as described by Chen et al. 54.

    For SNP genotyping, the genome-wide patterns of the sweet orange genetic origin were investigated on the basis of the whole-genome SNP data for mandarins, pummelos and sweet orange. We used a sliding window method 55 to detect the ratio (denoted as R_M/P) of SNP numbers from mandarin (M) to those from pummelo (P), using a 100-kb window and a 10-kb sliding step. Only windows with SNP number ≥30 were used for the genetic origin analysis. If R_M/P was larger than 4, we marked the region as from pummelo, and if R_M/P was smaller than 0.25, we marked the region as from mandarin. Otherwise the region was regarded as undetermined.
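
The window rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: SNP positions are taken to be sites where mandarin or pummelo reads differ from the sweet orange reference, so a window with many mandarin SNPs and few pummelo SNPs is read as resembling pummelo; the minimum of 30 SNPs is applied to the combined count in the window; and the function names and toy positions are hypothetical.

```python
import math
from bisect import bisect_left


def count_in_window(sorted_positions, start, end):
    """Count positions p with start <= p < end in a sorted list."""
    return bisect_left(sorted_positions, end) - bisect_left(sorted_positions, start)


def classify_windows(mandarin_snps, pummelo_snps, chrom_length,
                     window=100_000, step=10_000, min_snps=30,
                     upper=4.0, lower=0.25):
    """Label each sliding window as pummelo, mandarin or undetermined."""
    m_sorted = sorted(mandarin_snps)
    p_sorted = sorted(pummelo_snps)
    calls = []
    last_start = max(chrom_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        m = count_in_window(m_sorted, start, end)
        p = count_in_window(p_sorted, start, end)
        if m + p == 0 or m + p < min_snps:
            continue  # too few SNPs to call this window
        ratio = math.inf if p == 0 else m / p
        if ratio > upper:
            origin = "pummelo"
        elif ratio < lower:
            origin = "mandarin"
        else:
            origin = "undetermined"
        calls.append((start, end, origin))
    return calls


# Toy data on a 200-kb chromosome: the first half differs mostly from
# mandarin, the second half mostly from pummelo (illustrative only).
toy_mandarin = list(range(0, 100_000, 2_000)) + list(range(101_000, 200_000, 20_000))
toy_pummelo = list(range(1_000, 100_000, 20_000)) + list(range(100_000, 200_000, 2_000))
for start, end, origin in classify_windows(toy_mandarin, toy_pummelo, 200_000):
    print(start, end, origin)
```

Windows that straddle the switch point between the two halves have a ratio near 1 and are left undetermined, which is how a breakpoint between parental segments would appear with this rule.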

    Estimation of genome duplication. We used two methods to detect the signature of genome duplications. The multicollinearity alignment algorithm MCscan 56, with the gap size set to 15 genes and requiring at least 5 syntenic genes, was used to define a duplicated block. For each duplicate gene pair, the synonymous site divergence value (Ks) was calculated using Bioperl modules under the Jukes-Cantor corrected model (Supplementary Note).
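
The Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below shows only that correction step; it assumes the numbers of synonymous sites and synonymous differences have already been counted upstream, and the function names are hypothetical rather than taken from Bioperl.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def ks_jukes_cantor(syn_differences, syn_sites):
    """Ks for one gene pair from counts of synonymous differences and synonymous sites."""
    if syn_sites <= 0:
        raise ValueError("syn_sites must be positive")
    return jukes_cantor(syn_differences / syn_sites)


# Toy gene pair: 30 synonymous differences over 100 synonymous sites.
print(round(ks_jukes_cantor(30, 100), 4))
```

The corrected value is always at least as large as the raw proportion, because the correction accounts for multiple substitutions at the same site; it is undefined once p reaches 0.75, the saturation point of the model.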

    Genomic synteny analysis. MCscan software 56 was used to identify syntenic regions between C. sinensis and two genomes (A. thaliana and T. cacao) in Malvidae, two genomes (F. vesca and M. domestica) in Fabidae and V. vinifera with the same parameters as used for the above-mentioned genome duplication analysis.

    Phylogenetic analysis. Maximum parsimony-based phylogeny construction was performed using heuristic searches that consisted of tree bisection and reconnection (TBR) branch swapping in the PAUP* 4.0b10 software. Maximum likelihood-based analysis was conducted in RAxML version 7.2.8 (ref. 57) using the shared single-copy genes (Supplementary Note). DNA sequences of 103 single-copy genes were concatenated, and the GTRGAMMA substitution model was used for the phylogenetic tree construction.


    Chromosome karyotype analysis. Root tips (3-5 mm) from germinated seeds of Valencia sweet orange and suspended cells from calli of the Valencia sweet orange dihaploid line were pretreated with saturated p-dichlorobenzene solution for 2 h at 20 °C, fixed in 3:1 ethanol and acetic acid (v/v) for 24 h and stored at 4 °C in 70% ethanol solution. Roots were treated with an enzyme mixture containing 0.25% pectinase (Sigma), 0.25% pectolyase Y-23 (Yakult) and 0.5% cellulase RS (Onozuka) for 60-70 min at 37 °C. The meristem of each individual root tip was then squashed on a clean slide in a drop of 60% acetic acid and then in a drop of fixation solution with a fine needle and was then dried over an alcohol flame.

    Chromomycin A3 (CMA) and 4′,6-diamidino-2-phenylindole (DAPI) banding pattern analysis was performed according to Guerra 41. The slides were double stained with 0.5 mg/ml of CMA (1 h) and 5 µg/ml of DAPI (30 min) and mounted in Vectashield antifade solution (Vector Laboratories, Peterborough, UK). Slides were examined under a Zeiss Scope A1 fluorescence microscope (Zeiss, Germany) equipped with a Hamamatsu digital camera. Gray-scale images were captured for each color channel. Chromosomes in at least ten metaphase cells were measured using Image-Pro Plus 5.0 software (Media Cybernetics).

    43. Kelley, D.R., Schatz, M.C. & Salzberg, S.L. Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11, R116 (2010).

    44. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).

    45. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).

    46. Wicker, T. et al. A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982 (2007).

    47. Haas, B.J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).

    48. Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009).

    49. de Hoon, M.J., Imoto, S., Nolan, J. & Miyano, S. Open source clustering software. Bioinformatics 20, 1453–1454 (2004).

    50. Ruan, X. & Ruan, Y. Genome wide full-length transcript analysis using 5′ and 3′ paired-end-tag next generation sequencing (RNA-PET). Methods Mol. Biol. 809, 535–562 (2012).

    51. Li, L., Stoeckert, C.J. Jr. & Roos, D.S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003).

    52. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

    53. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

    54. Chen, C. et al. EST-SSR genetic maps for Citrus sinensis and Poncirus trifoliata. Tree Genet. Genomes 4, 1–10 (2008).

    55. Lai, J. et al. Genome-wide patterns of genetic variation among elite maize inbred lines. Nat. Genet. 42, 1027–1030 (2010).

    56. Tang, H. et al. Synteny and collinearity in plant genomes. Science 320, 486–488 (2008).

    57. Stamatakis, A., Ludwig, T. & Meier, H. RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21, 456–463 (2005).