Supplemental Table 1. Size of 32 families in the four Oryza genomes, O. sativa ssp. japonica (Oj), O. sativa ssp. indica (Oi), O. glaberrima (Og), O. brachyantha (Ob), and Brachypodium distachyon (Bd).
Pfam id   Oj   Oi   Og   Ob   Bd  Family
PF00632    7    8    8    9    7  HECT-domain (ubiquitin-transferase)
PF02386    7    6    6    6    6  Cation transport protein
PF06480    9    9    7    6    9  FtsH Extracellular, membrane-bound ATP-dependent protease
PF07839    9   10    8    9    6  Plant calmodulin-binding domain
PF00888   16   16   16   11   11  Cullin family
PF03321   17   16   14   13   11  GH3 auxin-responsive promoter
PF03110   21   22   19   19   18  SBP domain (for SQUAMOSA-pROMOTER BINDING PROTEIN)
PF07887   22   26   19   19   12  Calmodulin binding protein-like
PF01476   23   30   20   19   17  LysM (lysin motif) domain
PF01466   32   36   25   23   16  Skp1 family, dimerisation domain
PF02701   32   31   28   12   28  Dof domain, zinc finger
PF03552   35   39   26   31   27  Cellulose synthase
PF00535   37   41   34   38   30  Glycosyl transferase family 2
PF00232   41   53   41   35   30  Glycoside hydrolase family 1
PF02309   52   55   45   53   44  AUX/IAA family
PF00462   60   71   61   40   57  Glutaredoxin
PF03514   62   64   58   56   46  GRAS (GAI, RGA, SCR) family transcription factor
PF02519   66   67   62   59   53  Auxin responsive protein
PF00319   77   70   64   60   52  SRF-type transcription factor (DNA-binding and dimerisation domain)
PF04564   84   82   84   72   71  U-box domain
PF03106   98  113   90   90   83  WRKY DNA-binding domain
PF00046  107  103   85   88   86  Homeobox domain (Homeodomain fold)
PF00026  111  110   98   69   70  Eukaryotic aspartyl protease
PF00096  122  134  110   98  101  Zinc finger, C2H2 type
PF00234  138  126  113   84   93  Protease inhibitor/seed storage/LTP family (Tryp-alpha-amyl)
PF00010  159  172  150  133  138  Helix-loop-helix DNA-binding domain
PF00651  166  200  134   83  110  BTB (for BR-C, ttk and bab)/POZ (for Pox virus and Zinc finger) domain
PF00249  283  338  272  273  247  Myb-like DNA-binding domain
PF00067  366  418  302  293  263  Cytochrome P450
PF00097  422  420  368  301  380  Zinc finger, C3HC4 type (RING finger)
PF00931  610  687  398  373  364  NB-ARC domain
PF00646  735  767  649  249  535  F-box domain
Supplemental Table 2. Sub-family copy number. In gray are the families that show minor size variation between Ob and the IRGSP RefSeq.
subfamily total Oj Oi Og Ob (Oj-Ob) (Oj-Ob)/size %
F-box-19 9 3 2 3 1 2 22.2
F-box-22 14 4 3 4 3 1 7.1
F-box-20 18 6 5 4 3 3 16.7
F-box-6 25 8 6 9 2 6 24.0
F-box-4 26 9 9 6 2 7 26.9
F-box-18 35 11 9 10 5 6 17.1
F-box-21 45 14 14 13 4 10 22.2
F-box-9 46 10 12 11 13 -3 -6.5
F-box-5 55 15 17 13 10 5 9.1
F-box-10 63 16 17 15 15 1 1.6
F-box-14 71 22 24 17 8 14 19.7
F-box-23 72 23 28 15 6 17 23.6
F-box-7 85 24 24 23 14 10 11.8
F-box-2 91 31 29 23 8 23 25.3
F-box-3 110 34 34 32 10 24 21.8
F-box-16 117 33 36 38 10 23 19.7
F-box-15 122 39 41 31 11 28 23.0
F-box-17 143 38 48 38 19 19 13.3
F-box-8 186 61 56 47 22 39 21.0
F-box-13 206 63 64 60 19 44 21.4
F-box-11 261 83 83 73 22 61 23.4
F-box-12 269 84 87 79 19 65 24.2
F-box-1 331 104 119 85 23 81 24.5
Total 2400 735 767 649 249 486 20.3
Asp-2 20 7 4 6 3 4 20.0
Asp-3 21 7 7 5 2 5 23.8
Asp-4 24 6 6 7 5 1 4.2
Asp-5 97 25 29 28 15 10 10.3
Asp-6 109 32 30 24 23 9 8.3
Asp-1 117 34 34 28 21 13 11.1
Total 388 111 110 98 69 42 10.8
BTB-9 4 1 1 1 1 0 0.0
BTB-4 6 1 2 2 1 0 0.0
BTB-8 7 2 2 1 2 0 0.0
BTB-10 7 2 2 2 1 1 14.3
BTB-5 8 2 3 2 1 1 12.5
BTB-6 12 4 3 2 3 1 8.3
BTB-13 17 5 6 4 2 3 17.6
BTB-11 38 13 17 4 4 9 23.7
BTB-1 46 13 16 10 7 6 13.0
BTB-3 67 22 28 17 0 22 32.8
BTB-7 94 25 30 25 14 11 11.7
BTB-2 100 28 43 20 9 19 19.0
BTB-12 177 48 47 44 38 10 5.6
Total 583 166 200 134 83 83 14.2
GRX-2 54 12 17 14 11 1 1.9
GRX-1 178 48 54 47 29 19 10.7
Total 232 60 71 61 40 20 8.6
NB-ARC-5 28 9 11 4 4 5 17.9
NB-ARC-1 33 9 14 4 6 3 9.1
NB-ARC-17 51 15 19 11 6 9 17.6
NB-ARC-12 68 19 24 11 14 5 7.4
NB-ARC-14 71 20 28 12 11 9 12.7
NB-ARC-4 80 16 27 16 21 -5 -6.3
NB-ARC-9 84 24 29 17 14 10 11.9
NB-ARC-6 95 31 33 16 15 16 16.8
NB-ARC-11 102 28 35 15 24 4 3.9
NB-ARC-15 104 31 40 21 12 19 18.3
NB-ARC-3 147 39 41 30 37 2 1.4
NB-ARC-7 155 48 57 24 26 22 14.2
NB-ARC-10 157 51 45 32 29 22 14.0
NB-ARC-2 189 53 62 41 33 20 10.6
NB-ARC-8 208 62 60 46 40 22 10.6
NB-ARC-16 214 67 73 43 31 36 16.8
NB-ARC-13 282 88 89 55 50 38 13.5
Total 2068 610 687 398 373 237 11.5
Tryp-α-amyl-6 44 14 11 10 9 5 11.4
Tryp-α-amyl-1 45 12 12 10 11 1 2.2
Tryp-α-amyl-4 52 17 10 13 12 5 9.6
Tryp-α-amyl-2 59 20 19 16 4 16 27.1
Tryp-α-amyl-3 99 29 30 21 19 10 10.1
Tryp-α-amyl-5 162 46 44 43 29 17 10.5
Total 461 138 126 113 84 54 11.7
Supplemental Table 3. Number and percentage of orthologous groups and orthologous genes for each gene family.
Asp BTB F-box GRX NB-ARC Tryp-α-amyl Zf-Dof
orthologous groups 105 160 738 59 567 137 31
Oj-Oi-Og-Ob 46 51 163 30 149 46 9
Oj-Oi-Og-Ob % 44 32 22 51 26 34 29
Oj-Oi-Og 32 58 362 20 130 45 13
Oj-Oi-Og % 31 36 49 34 23 33 42
Oj-Oi 12 25 113 2 105 14 5
Oj-Oi % 11 16 15 3 19 10 16
others 15 26 100 7 183 32 4
others % 14 16 14 12 32 23 13
orthologous pairs Asp BTB F-box GRX NB-ARC Tryp-α-amyl Zf-Dof
Oj-Oi 94 143 667 54 449 112 27
% Oj genes in OP Oj-Oi 84.7 86.1 90.7 90 73.6 81.2 84.4
% Oi genes in OP Oj-Oi 85.4 71.5 87 76 65.3 88.9 87.1
Oj-Og 85 116 560 52 321 106 25
% Oj genes in OP Oj-Og 76.6 69.9 76.2 86.7 52.6 76.8 78.1
% Og genes in OP Oj-Og 86.7 86.6 86.3 85.2 80.6 93.8 89.3
Oj-Ob 52 64 197 34 244 65 10
% Oj genes in OP Oj-Ob 46.8 38.5 26.8 56.7 40 47.1 31.2
% Ob genes in OP Oj-Ob 75.4 77.1 79.1 85 65.4 77.4 83.3
Supplemental Table 6. Results of the Poisson regression analysis for the relationship between recombination rate and gene family number.
Family Coefficient Exp(Coeff) Std Error P-value
Asp 0.102 1.11 0.031 0.001
BTB -0.002 1 0.02 0.912
F-box 0.063 1.06 0.012 0
GRX 0.143 1.15 0.041 0
NB-ARC 0.01 1.01 0.015 0.509
Tryp-α-amyl 0.067 1.07 0.029 0.02
Zf-Dof 0.085 1.09 0.059 0.151
Supplemental Table 8. List of loci with signature of positive selection. Gene IDs are simplified compared to the initial annotations so only the species (Oj, Oi, Og or Ob) and the unique number are indicated.
Results of branch-site test of positive selection Genes with the mutation
Family and groups OG Locus Mutation Gene 1 Gene 2 Gene 3
Og01_00472.Oj01_08330.Oi_02885 426 K to T Oi_02885
Og01_03244.Oi_04874.Oj01_64830 54 M to I Oj01_64830
132 G/S to L Og01_03244
Og02_01220.Oi_06611.Oj02_20690 369 R/P to T Oj02_20690
Og08_00579.Oj08_13030.Oi_28227 362 S to K Oi_28227
Og08_00584.Oj08_13090.Oi_16035 303 V to M Oi_16035 Og08_00584
Og08_01244.Oi_26997.Oj08_31430 62 T/R/K to M Og08_01244
174 A to E Og08_01244
Og09_00453.Oj09_16870.Oi_30541 193 R to C Oj09_16870
219 A to V Oi_30541
Og10_00916.Oi_32948.Oj10_28820 319 C to Y Oj10_28820
Og10_00920.Oi_32953.Oj10_28870
52 S to T Og10_00920
105 G to A Og10_00920
148 N to S Oi_32953
210 L to V Og10_00920
219 L to F Og10_00920
242 S to L Oj10_28870 Oi_32953
243 S to L Oj10_28870
278 A to D Oi_32953
313 N to K Oj10_28870 Oi_32953
Og10_00937.Oj10_29340.Oi_32974
487 C to Y and F Oj10_29340 Og10_00937
501 Y to L and F Oj10_29340 Oi_32974 Og10_00937
517 I/V to L Oi_32974
518 E to D Oi_32974
Og11_01646.Oj11_40490.Oi_35629 339 P to L Oj11_40490 Oi_35629
Oj08_13020.Oi_28224.Og08_00578 314 L to P Oj08_13020
Oi_114.Oj_970 135 A/S to I Oj_970
968 N/S to S Oj_970
Og_2114.Oi_68.Oj_929 130 W to Y Og_2114
Og_2125.Oi_83.Oj_941 74 W to L Og_2125
Og_2334.Oi_324.Oj_1165 241 L to R/Y/V Oj_1165 Oi_324 Og_2334
Og_2347.Oi_338.Oj_1176 388 E to Y Og_2347
Og_2412.Oi_415.Oj_1252 96 L to K Og_2412
Og_2498.Oi_514.Oj_1346 358 D to Q Oj_1346 Oi_514 Og_2498
Oi_01920.Oj01_16390
202 A/R to G/Q Oj01_16390 Oi_01920
204 K to Q Oj01_16390
205 T to S Oj01_16390
Oi_03230.Oj01_16370 157 A to T/L Oj01_16370 Oi_03230
159 S to K Oi_03230
Oi_06618.Oj02_20460
434 I to T Oj02_20460
701 P to L Oj02_20460
811 E to K Oi_06618
Oi_34858.Oj11_06230 408 G to S Oj11_06230 Oi_34858
Oi_35038.Oj11_12330
753 N to K/T Oj11_12330 Oi_35038
760 A to N Oj11_12330
764 */R/I/E/K/M/S to S/A Oj11_12330 Oi_35038
820 */T/V/A to V Oi_35038
823 */R/K/S to K/S Oj11_12330 Oi_35038
Oi_35437.Oj11_35210
388 R to C Oj11_35210
395 A to V/G Oj11_35210 Oi_35437
505 E to K Oj11_35210
523 Y to N Oi_35437
Oj01_49190.Oi_04222 497 F to S Oi_04222
Oj01_57270.Oi_04573
786 I to M Oj01_57270
916 D/N to N/P Oj01_57270
947 H/C/F/L to Y Oj01_57270 Oi_04573
1397 L to P Oj01_57270 Oi_04573
Oi_21844.Oj06_06860
447 S to T/Y Oj06_06860 Oi_21844
779 E to V/K Oj06_06860
781 T to G/M Oj06_06860
785 I to T/G Oj06_06860
868 R to H Oj06_06860
872 V to R Oj06_06860
Oi_21869.Oj06_06390 322 M to L Oi_21869
Oi_33964.Oj11_30050 77 L to V Oj11_30050
Oi_34292.Oj11_10770 90 A to N Oj11_10770
Oi_35431.Oj11_34970 626 L to T/M Oj11_34970 Oi_35431
870 A to P/S Oj11_34970 Oi_35431
Oi_36183.Oj12_30070
471 Q to R/D Oj12_30070 Oi_36183
634 L to G Oj12_30070
945 R to A Oj12_30070
Oi_36194.Oj12_29710 467 K to D/N Oj12_29710 Oi_36194
1377 V to D/F Oj12_29710
Oi_36218.Oj12_28250 517 L to I Oj12_28250
Oj04_25900.Oi_15350 251 E to D Oi_15351 (IP)
Oj11_14380.Oi_35077 285 L to V/H Oj11_14380 Oi_35077
Oj12_33160.Oi_36029
152 I to N Oj12_33160 Oi_36029
504 Y to H Oj12_33160 Oi_36029
923 to D/W Oj12_33160 Oi_36029
Oj12_37760.Oi_37604 1222 D to E Oi_37604
Oi_06629.Oj02_20210.Og02_01204 744 C/S to L Oj02_20210
Oi_07774.Oj11_46210.Og11_01857
390 N to A Og11_01857
470 K to S Og11_01857
767 S to H Og11_01857
859 N/S to E Og11_01857
921 K to Y Og11_01857
Oj12_29290.Og12_01078.Oi_37381 74 F to L Oj12_29290 Oi_37381
Asp Expansion Oj-Oi-Og
BTB Expansion Oj-Oi-Og
F-box Expansion Oj-Oi
F-box Expansion Oj-Oi-Og
NB-ARC Expansion Oj-Oi
NB-ARC Expansion Oj-Oi-Og
Supplemental Table 9. Genomic sequences and annotation information.
O. sativa ssp. japonica
  cultivar/strain: Nipponbare
  institution: MSU Rice Genome Annotation Project
  release: v6.1
  reference: International Rice Genome Sequencing Project, 2005
  length of all chromosomes: 372,317,567 bp
  access: http://rice.plantbiology.msu.edu

O. sativa ssp. indica
  cultivar/strain: 93-11
  institution: Beijing Genomics Institute (BGI)
  release: 2003, annotations updated in 2010
  reference: Yu et al., 2002
  length of all chromosomes: 352,234,171 bp
  access: http://rise2.genomics.org.cn/page/rice/index.jsp

O. glaberrima
  cultivar/strain: IRGC:96717
  institution: Arizona Genomics Institute (AGI)
  release: v1.1
  reference: unpublished
  length of all chromosomes: 285,037,524 bp
  access: http://www.genome.arizona.edu; GenBank: ADWL01000000

O. brachyantha
  cultivar/strain: IRGC:101232
  institution: Chinese Academy of Sciences, IGDB
  release: v1.4
  reference: Chen et al., 2013
  length of all chromosomes: 250,927,218 bp
  access: http://www.gramene.org; GenBank: AGAT00000000.1
Supplemental Table 11. Information about the population resequencing data used for selection pressure validation.
AGI resequencing populations
Run name Accession Country of origin
O. sativa ssp. japonica
Oj_58398 IRGC 58398 Brunel
Oj_58716 IRGC 58716 Bangladesh
Oj_J102 IRGC 3340 Dominican Republic
O. sativa ssp. indica
Oi_58267 IRGC 58267 Afghanistan
Oi_12880 IRGC 12880 Iran
Oi_13793 IRGC 13793 Brunel
O. glaberrima
Og_101049 IRGC 101049 Senegal
Og_67563 IRGC 67563 Ghana
Og_100854 IRGC 100854 Congo
Og_68939 IRGC 68939 Madagascar
Og_96841 IRGC 96841 Zimbabwe

Xu et al., 2012 resequencing populations
Run name Accession Country of origin
O. sativa ssp. japonica
ORIwwRAADBAAPE IRGC 32399 Bhutan
SZC08001BHDCAAPE IRGC 55471 South-Korea
O. sativa ssp. indica
SZC08001ATDCAAPE IRGC 8231 Vietnam
SZC08001ADDCAAPE IRGC 27762 Thailand
(a) [Figure panel: chromosome maps of orthologous pairs]

(b)
Columns: OP = orthologous pairs; Coord = OP with coordinates; Syn-NC = syntenic, non-collinear; Syn-C = syntenic, collinear; Non-syn = non-syntenic; %C/OP = % collinear on total OP; %Oj-C = % Oj collinear*; %Ox-C = % Ox collinear*.

Oj-Oi
Family            OP  Coord  Syn-NC  Syn-C  Non-syn  %C/OP  %Oj-C  %Ox-C
Asp               94     93       1     86        6   91.5   77.5   78.2
BTB              143    143       6    128        9   89.5   77     64
F-box            667    659      21    599       36   89.8   81.5   78
Glutaredoxin      54     54       0     53        1   98     88.3   74.6
NB-ARC           449    443      18    385       37   85.7   63     56
Tryp-alpha-amyl  112    111       1    106        4   94.6   76.8   84
Zf-Dof            27     26       0     26        0   96.3   81.3   84

Oj-Og
Family            OP  Coord  Syn-NC  Syn-C  Non-syn  %C/OP  %Oj-C  %Ox-C
Asp               85     77       1     75        1   88.2   67.6   76.5
BTB              116    113       1    112        0   96.5   67.5   83.6
F-box            560    560      24    528        5   94.3   72     81.4
Glutaredoxin      52     52       0     52        0  100     86.7   85.2
NB-ARC           321    305       7    295        1   92     48.4   74.1
Tryp-alpha-amyl  106     98       0     98        0   92.4   71     86.7
Zf-Dof            25     24       0     24        0   96     75     85.7

Oj-Ob
Family            OP  Coord  Syn-NC  Syn-C  Non-syn  %C/OP  %Oj-C  %Ox-C
Asp               52     52       0     52        0  100     46.8   75.4
BTB               64     62       1     59        2   92.2   35.5   71
F-box            197    196      10    175        8   88.8   23.8   70.3
Glutaredoxin      34     34       1     32        1   94     53.3   80
NB-ARC           244    243      18    214        9   87.7   35     57.4
Tryp-alpha-amyl   65     65       2     59        2   90.8   42.8   70.2
Zf-Dof            10     10       0     10        0  100     31.2   83.3
Supplemental Figure 1. (a) Relationships of orthologous pairs between Oryza sativa ssp. japonica and both O. glaberrima and O. brachyantha for the F-box, NB-ARC, Asp, BTB, GRX, Tryp-α-amyl, and Zf-Dof gene families. Gray genes and lines correspond to collinear relationships. Red genes and lines correspond to non-syntenic and non-collinear relationships. Blue genes on the chromosomes are absent from orthologous pairs. (b) Number and percentage of syntenic, collinear, and non-syntenic orthologous pairs (OP) for each pair of Oryza species.
CAFE parameters

No outgroup
Tree: (((Oi,Oj)n1,Og)n2,Ob)n3
λ tree: (((1,1)1,1)1,2)
λ1 = 6.94 x 10^-6; λ2 = 6.67 x 10^-6
Average expansion: Oi = 9.750000; Oj = -0.062500; n1 = 6.156250; Og = -13.031250; n2 = 18.156250; Ob = -16.437500

With outgroup B. distachyon
Tree: ((((Oi,Oj)n1,Og)n2,Ob)n3,Bd)n4
λ tree: (((((1,1)1,1)1,1)1,2)3,3)
λ1 = 6.99 x 10^-6; λ2 = 6.67 x 10^-6; λ3 = 2.09 x 10^-6
Average expansion: Oi = 7.419355; Oj = -0.225806; n1 = 4.193548; Og = -8.935484; n2 = 17.483871; Ob = -12.903226; n3 = 0.903226; Bd = -2.129032

Family sizes at extant species (Oj, Oi, Og, Ob, Bd) and ancestral nodes (n1 to n4)
Family                 Oj   Oi   Og   Ob   Bd   n1   n2   n3   n4
Asp                   111  110   98   69   70  108  105   82   77
BTB                   166  200  134   83  110  170  158  116  114
F-box                 735  767  649  249  535  731  701  468  495
GRX                    60   71   61   40   57   64   63   52   54
Tryp-α-amyl           138  126  113   84   93  128  123  100   98
Zf-Dof                 32   31   28   12   28   31   30   22   25
NB-ARC (no outgroup)  610  687  398  373    /  605  537  452    /

Change in family size along the branch leading to each node
Family                 Oj   Oi   n1    Og    n2    Ob   n3   Bd
Asp                    +3   +2   +3    -7   +23   -13   +5   -7
BTB                    -4  +30  +12   -24   +42   -33   +2   -4
F-box                  +4  +36  +30   -52  +233  -219  -27  +40
GRX                    -4   +7   +1    -2   +11   -12   -2   +3
Tryp-α-amyl           +10   -2   +5   -10   +23   -16   +2   -5
Zf-Dof                 +1    0   +1    -2    +8   -10   -3   +3
NB-ARC (no outgroup)   +5  +82  +68  -139   +85   -79    /    /

Branch P-values below 0.05 displayed on the trees (branch assignment not shown here)
Asp: 0.025
BTB: 0, 0, 0, 0.008, 0.024
F-box: 0, 0, 0, 0, 0
GRX: 0.001, 0.03
Zf-Dof: 0, 0.04, 0.004
NB-ARC: 0, 0, 0, 0.006, 0.007
Supplemental Figure 2. Results of CAFE analysis. The first row of the table gives the Newick-formatted tree, the global birth and death parameter λ (gain-loss/gene/MYA) for the specified subset of branches in the tree, and the average size of expansions/contractions along each branch among all families. For each family, the ancestral family sizes at each node and the direction of the changes are then displayed, along with the branch P-values when below 0.05.
[Figure: six histograms, one per type of orthologous group (Oj-Oi-Og-Ob, Oj-Oi-Og, Oj-Oi) for each of the F-box and NB-ARC families. x axis: ω ratio, from 0.05 to 2; y axis: frequency.]
Supplemental Figure 3. Frequency distribution of ω ratios of all pairwise comparisons for the different types of orthologous groups for the F-box and NB-ARC gene families. The last bar on the x axis represents all values greater than 2, including the infinite values (ω = 99).
(a)
               Oj in Oj-Oi-Og-Ob      Oj in Oj-Oi-Og         Oj in Oj-Oi
Family          mean    se  median     mean    se  median    mean    se  median
all families    7635   845    1725     2284   261     568    2813   786     582
Asp            12513  4132    5181     2949  1104    1180    6101  3334    1611
BTB             8395  1665    6220     1784   864     391     458   115     305
F-box           5900   734    2076     1616   218     568    1489   381     558
GRX            19236  7870    6376    10057  3336    7655       /     /       /
NB-ARC          5130  1163    1115     1035   229     461    3621  2431     438
Tryp-α-amyl    18100  5627    4600    11726  3338    5354    9642  3479    5766
Zf-Dof          6272  1513    6491     3802  1351    1825    1397   630     975

(b)
Oj-Oi-Og-Ob vs Oj-Oi
Family         n1   n2   w (U)   pvalue     H
all families  327  133   29018   1.841e-08  H1
Asp            28    5      86   0.4482     H0
BTB            29    8     197   0.001796   H1
F-box         129   75    6743   2.77e-06   H1
GRX            20    1      15   0.51714    H0
NB-ARC        111   38    2644   0.01981    H1
Tryp-α-amyl    22    9     100   0.983      H0
Zf-Dof          6    3      17   0.04762    H1

Oj-Oi-Og-Ob vs Oj-Oi-Og
Family         n1   n2   w (U)   pvalue     H
all families  327  492  109624   2.2e-16    H1
Asp            28   21     440   0.002714   H1
BTB            29   50    1153   1.342e-05  H1
F-box         129  267   23495   1.583e-09  H1
GRX            20   13     146   0.573      H0
NB-ARC        111  110    8411   1.223e-06  H1
Tryp-α-amyl    22   22     232   0.8253     H0
Zf-Dof          6   11      48   0.149      H0

Oj-Oi vs Oj-Oi-Og
Family         n1   n2   w (U)   pvalue     H
all families  133  492   33938   0.509      H0
Asp             5   21      58   0.7526     H0
BTB             8   50     167   0.4636     H0
F-box          75  267  9745.5   0.7984     H0
GRX             1   13       5   0.8571     H0
NB-ARC         38  110    2309   0.3364     H0
Tryp-α-amyl     9   22     101   0.949      H0
Zf-Dof          3   11      11   0.456      H0

(c) [Figure: three histograms of mean expression intensity, one per class of OGs (Oj-Oi-Og-Ob, Oj-Oi-Og, Oj-Oi), with values combined for all gene families.]
Supplemental Figure 4. Results of the global gene-level expression analysis: (a) Parameters of the mean expression intensity distribution, (b) Results of the Mann-Whitney-Wilcoxon test between the distributions of the different classes of OGs, (c) Distribution histograms for each class of OGs with values combined for all gene families.
Supplemental Method 1. More detailed description of the methods applied in the study
Identification of gene families, orthology relationships and collinearity
We initially targeted 32 gene families (Supplemental Table 1) across the four Oryza species for which
genome sequences were available. The HMM-profile of each family domain (see Supplemental Table 1
for Pfam identifiers), downloaded from Pfam 26.0 (www.pfam.sanger.ac.uk; Punta et al., 2012), was
used as query to search the annotated proteomes of Oryza sativa ssp. japonica and indica, O.
glaberrima and O. brachyantha (E-value≤1.0) with HMMER 3.0 (www.hmmer.janelia.org).
Information on genome sequences and annotation data for the four Oryza genomes is displayed in
Table 4. Alternative transcripts and TE-related proteins were filtered out. For the large F-box family, to
identify additional genes missed in the annotation, a similarity-based re-annotation was performed. The
collection of F-box proteins from the four genomes was remapped to the unmasked whole-genome
sequences in all pairwise combinations using tblastn. From these alignments, gene models were
predicted via spliced alignments with GenomeThreader (Gremme et al., 2005). The newly predicted
genes were retained only if they contained an F-box domain. This combined analysis led to the
manual assignment of 58, 43, 127, and 29 additional F-box genes in japonica, indica, O. glaberrima
and O. brachyantha, respectively. The F-box genes in the four genomes were given new ids to account
for these newly predicted genes. The new nomenclature and its correspondence with the initial
annotations are provided in Supplemental Table 11. Among the 32 families, 7 families were selected
(Supplemental Table 9) with the largest size differences between O. brachyantha and the O. sativa ssp.
japonica reference sequence (Ob/Oj < 0.70 and at least 10 more genes in Oj compared to Ob). The
families were divided into sub-families using an initial neighbor-joining tree constructed with Clustal
2.0 (Larkin et al., 2007). To identify domain composition of the AA-specific sub-family BTB-3, we
searched the PfamA database using HMMER3 (E-value≤ 1.0).
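
The family selection rule stated above (Ob/Oj < 0.70 and at least 10 more genes in Oj than in Ob) can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `is_selected` and the dictionary layout are hypothetical, and only a subset of the 32 families from Supplemental Table 1 is included.

```python
# (Oj, Ob) family sizes copied from Supplemental Table 1 for a subset of families.
FAMILY_SIZES = {
    "PF00646 F-box": (735, 249),
    "PF00931 NB-ARC": (610, 373),
    "PF00026 Asp": (111, 69),
    "PF00651 BTB": (166, 83),
    "PF00462 Glutaredoxin": (60, 40),
    "PF00234 Tryp-alpha-amyl": (138, 84),
    "PF02701 Zf-Dof": (32, 12),
    "PF00888 Cullin": (16, 11),
    "PF00097 RING finger": (422, 301),
    "PF00067 Cytochrome P450": (366, 293),
}


def is_selected(oj, ob, max_ratio=0.70, min_diff=10):
    """Return True if a family passes both thresholds quoted in the text."""
    return ob / oj < max_ratio and (oj - ob) >= min_diff


selected = [name for name, (oj, ob) in FAMILY_SIZES.items() if is_selected(oj, ob)]
print(selected)
```

In this subset, Cullin passes the ratio threshold but has only 5 more genes in Oj, and the RING finger family has a large absolute difference but a ratio just above 0.70, so both are rejected.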
OrthoMCL (Li et al., 2003) was used to cluster the proteins into orthologous groups (OGs). The
similarity search was conducted with an all-versus-all BLASTP analysis (E-value<1e-5). Inparalogous
and co-orthologous pairs were removed in order to obtain strict orthologous relationships. In specific
cases where proteins in the first species had several mutual hits of equal significance with proteins in
the second species, OrthoMCL clustered several OGs together. Phylogenetic information was used to
manually divide these large groups into single OGs, and identify additional inparalogs and co-
orthologs. To construct trees, proteins from each sub-family were aligned with MUSCLE (v3.7, Edgar,
2004). ProtTest3 (Darriba et al., 2011) was used to select the best-fitting substitution model for each
alignment among JTT, LG, WAG and VT. Maximum likelihood (ML) trees were generated with
RAxML using default settings (v7.2.8, Stamatakis, 2006) and the corresponding amino acid model.
Once the final OGs were obtained, the genes without chromosome coordinates were excluded, and
syntenic (genes located on homologous chromosomes), collinear (syntenic genes in corresponding
orders) and non-syntenic orthologous genes were defined for each pair of species. If one or both of the
genes in the non-syntenic pairs had inparalogs, we checked if one of those inparalogs could be the real
syntenic ortholog by comparing the list of genes in non-syntenic pairs to the lists of inparalogs: {Iff
OPx = (Ax-a; By-b) && IP(By-b; Bx-b') && (Bx-b') ∉ OGn, so OPx' = (Ax-a; Bx-b')}. Here OP =
orthologous pair, OG = orthologous group, (A; B) = species, (x; y) = chromosomes, (a; b; b') =
genes, Iff = if and only if, && = and, IP() = inparalogous pair, and ∉ = does not belong to. We also
checked whether a true collinear ortholog had been missed because it was not classified as part of the
family under investigation. For each non-syntenic pair between species A and B, the gene in A was
compared by blastp to the entire proteome of species B. If a better or equivalent hit was found with a
different gene in species B, in comparison to the initial pair, and this gene was collinear but not present
in the family, it was considered the true ortholog. The initial pair was discarded from the final
collinearity results, although such pairs were retained in the OrthoMCL results, as they are the best hits we
could find within a family. Circos was used to visualize the gene distribution and collinearity
(Krzywinski et al., 2009).
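
The inparalog rule given in the paragraph above (replace a non-syntenic pair by a pair involving a syntenic inparalog that does not already belong to an orthologous group) can be sketched as below. The `Gene` type, the function name and the use of identical chromosome labels as a stand-in for "homologous chromosomes" are assumptions made for illustration only.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Gene(NamedTuple):
    species: str  # e.g. "Oj" or "Oi"
    chrom: str    # chromosome label; equal labels are assumed to be homologous
    name: str


Pair = Tuple[Gene, Gene]


def rescue_syntenic_ortholog(pair: Pair,
                             inparalogs: Dict[Gene, List[Gene]],
                             genes_in_ogs: Set[Gene]) -> Pair:
    """Apply the rule OPx = (Ax-a; By-b) -> OPx' = (Ax-a; Bx-b')."""
    a, b = pair
    if a.chrom == b.chrom:
        return pair  # already syntenic, nothing to rescue
    for candidate in inparalogs.get(b, []):
        # b' must lie on the chromosome of a and must not belong to any OG
        if candidate.chrom == a.chrom and candidate not in genes_in_ogs:
            return (a, candidate)
    return pair  # no suitable inparalog: the pair stays non-syntenic


# Hypothetical genes, for illustration only.
a = Gene("Oj", "1", "gene_a")
b = Gene("Oi", "5", "gene_b")
b_prime = Gene("Oi", "1", "gene_b_prime")
print(rescue_syntenic_ortholog((a, b), {b: [b_prime]}, set()))
```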
Evolution of gene family sizes
To compute the global birth and death rate (λ, probability of both gene gain and loss per gene per MY,
assuming that they are equally probable) of gene families in a group of taxa, CAFE (Computational
Analysis of gene Family Evolution, De Bie et al., 2006) requires gene family sizes in extant species and
their phylogenetic relationships, including their time of divergence. CAFE calculates the most likely
ancestral family sizes (internal nodes), from which we can infer if the variation is the consequence of
expansion or contraction. The dataset for 32 families (Supplemental Table 1) and the species tree
displayed in Figure 1 were used in the analyses. Divergence times were based on data from the
literature (Zhu and Ge, 2005; Tang et al., 2010; Sanderson, 1997). CAFE computes a P-value
(probability of observing the data given random gain and loss of genes) for each family to compare
their evolutionary dynamic to the global birth and death model. If p<0.05, the family deviates from the
model and CAFE computes P-values for each branch to identify where the largest changes have taken
place. The number of random samples for the Monte Carlo re-sampling procedure was 1000. As CAFE
failed to provide accurate λ estimation because of the large gene family size variation of the NB-ARC
family on a single branch (between Og and Oj/Oi), this family had to be removed in the first round of
analysis. Running a second analysis round without the outgroup B. distachyon allowed us to obtain
values for the internal nodes in Oryza for the NB-ARC family. For the 6 other families under
investigation, as both analyses gave the same results for the Oryza branches, the results of the first
round were presented. In the first round of analysis, we proposed a model with different λ rates, one for
B. distachyon and the branch leading to Oryza, one for O. brachyantha, and one for the remaining
branches. In the second round for NB-ARC, we had CAFE find two λ values, one for the O.
brachyantha branch and one for all other branches. All analyses were run in triplicate to ensure that
CAFE converged into a single optimum for the parameter λ.
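
Once ancestral family sizes are available, the direction and size of the change on each branch follow by subtraction. The sketch below does this for the Asp family using the values shown in Supplemental Figure 2; the assignment of the four ancestral values to nodes n1 to n4 is our reading of the figure and should be treated as an assumption, and the code is not part of CAFE.

```python
# Parent of each node in the tree ((((Oi,Oj)n1,Og)n2,Ob)n3,Bd)n4.
PARENT = {
    "Oj": "n1", "Oi": "n1",
    "n1": "n2", "Og": "n2",
    "n2": "n3", "Ob": "n3",
    "n3": "n4", "Bd": "n4",
}

# Asp family sizes: extant species from Supplemental Table 1,
# ancestral nodes as read from Supplemental Figure 2 (assumed assignment).
ASP_SIZES = {
    "Oj": 111, "Oi": 110, "Og": 98, "Ob": 69, "Bd": 70,
    "n1": 108, "n2": 105, "n3": 82, "n4": 77,
}


def branch_changes(sizes, parent):
    """Return child size minus parent size for every branch of the tree."""
    return {child: sizes[child] - sizes[par] for child, par in parent.items()}


changes = branch_changes(ASP_SIZES, PARENT)
for node, delta in changes.items():
    label = "expansion" if delta > 0 else "contraction" if delta < 0 else "no change"
    print(f"{node}: {delta:+d} ({label})")
```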
Composition of tandem duplication and presence in segmental duplications
The tandem arrays were identified with DAGchainer (Haas et al., 2004). To create a list of putative
paralogous pairs, an all-versus-all BLAST search was run on the protein dataset for each family and
each species (E-value<1e-30). Arrays with at least 2 copies (-A 1) and a maximum distance between
two matches of 20 kb were marked as putative tandem arrays. To determine if any of the genes in our
gene families were part of segmental duplications, segmental duplication blocks were constructed for
the four genomes using a homology search at the DNA level. To define putative paralogous pairs, an
all-versus-all BLAST search was run on the whole CDS dataset for each species (E-value<1e-10). The
results were sorted according to the CIP (cumulative identity percentage) and CALP (cumulative
alignment length percentage) parameters of Salse et al. (2008), both defined at 70%. DAGchainer was
run with a minimum of 5 aligned pairs in one block, and a maximum of 1 Mb allowed between two
matches. If two blocks on the same chromosomes were less than 1 Mb apart (on both chromosomes),
they were concatenated. The lists of genes in our families were then compared to the list of collinear
genes in each segmental duplication block. Local recombination rates (cM/Mb) were recovered every
100 kb along each Oj chromosome using MareyMap (Rezvoy et al., 2007), by comparing rice genetic
(Muyle et al., 2011; www.rgp.dna.affrc.go.jp) and physical maps (Marey's map). Cubic splines were
chosen as the interpolation method. The mean recombination rates for 1 Mb-intervals were computed
and compared with the number of japonica genes from each family in the same interval. The
correlations were then tested using a Poisson regression analysis (Supplemental Table 6).
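
To read the Poisson regression output in Supplemental Table 6: with a log link, Exp(Coeff) is the multiplicative change in the expected gene count per unit increase in recombination rate. The sketch below recomputes Exp(Coeff) from the reported coefficients and derives a two-sided Wald P-value from the coefficient and its standard error. The Wald test is an assumption on our part (the text does not state which test produced the reported P-values), and small discrepancies are expected because the table values are rounded.

```python
import math


def rate_ratio(coef):
    """Multiplicative effect on the expected count per unit of the predictor."""
    return math.exp(coef)


def wald_p_value(coef, std_error):
    """Two-sided P-value of z = coef / SE under a standard normal distribution."""
    z = coef / std_error
    return math.erfc(abs(z) / math.sqrt(2))


# Coefficients and standard errors copied from Supplemental Table 6.
asp_ratio = rate_ratio(0.102)
grx_ratio = rate_ratio(0.143)
asp_p = wald_p_value(0.102, 0.031)
print(round(asp_ratio, 2), round(grx_ratio, 2), round(asp_p, 3))
```

Under this reading, an Exp(Coeff) of 1.11 for Asp means that the expected number of Asp genes in a 1 Mb interval is about 11% higher for each additional cM/Mb.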
Selective pressure analysis
For each gene family investigated, we first performed a filtering step in which all genes that contained
>5% Ns in their coding sequences, or lacked an initiator methionine, were removed, as such genes
can inflate the rate of false-positive detections of positive selection. Codon-based CDS alignments for each
individual OG were created from protein alignments (MUSCLE) with the BioPerl module bp_mrtrans
(Jason Stajich). ML trees were constructed with RAxML using the GTRGAMMA model and default
settings. The basic codon substitution model of Goldman and Yang (1994) in Codeml (Yang, 2007) was
used to compute pairwise ω ratios between orthologs. The branch-site codon substitution model (Yang
and Nielsen, 2002) and the branch-site test of positive selection (Zhang et al., 2005) were used to look
at the selective pressure on the orthologs in the Oj-Oi-Og or Oj-Oi groups compared to their paralogs.
For each ortholog in all Oj-Oi-Og and Oj-Oi groups, a BLAST (E-value<1e-5) analysis on all the
family proteins was performed, and the twenty top high-scoring segment pairs (HSPs) were retained,
and filtered with 50% CIP and 70% CALP. Protein and CDS alignments were constructed for each
expansion group with the initial orthologs and their respective hits; alignments composed of
3 or fewer sequences were excluded. The CDS sequences were aligned and used to construct trees as
described above. In the tree file the initial ortholog sequences were specified as foreground branches. If
inparalogs were found during the similarity search, they were also marked as foreground branches.
Alignments composed only of foreground branches were also excluded. The branch-site modified
model A (Yang et al., 2005; Zhang et al., 2005) was run in Codeml first. Three ω ratios were calculated
for the 4 site classes assumed by the model (Table 3). p2a and p2b are the proportions of 2a and 2b
sites, for which codons experienced positive selection (ω2 >1) in the foreground branches. The
percentages of OGs with different profiles of predominant site classes are summarized in Figure 7 for
each type of expanded group and each family. The branch-site likelihood ratio test (LRT) was applied
for alignments with p2a and/or p2b> 0 and at least 1 site under positive selection in one of the
foreground branches. The LRT compared model A with a null model that does not allow positive
selection on the foreground lineages (ω2 is fixed to 1). P-values were computed from LRT statistics
with df = 1. Multiple testing corrections were executed by computing q-values using QVALUE (Storey
et al., 2004). Each individual test was considered significant at the 5% level only if its q-value was ≤ 0.05.
The BEB (Bayes Empirical Bayes) method (Yang et al., 2005) gave posterior probabilities that each
site was from a particular class. For alignments with a significant test in favor of model A, only the sites
with a high posterior probability (P > 90%) of belonging to the 2a or 2b classes were selected. Finally,
each putatively positively selected site was manually checked at both the protein and DNA levels to
remove false positives due to misalignments.
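
The likelihood ratio test described above compares twice the log-likelihood difference between model A and the null model to a chi-square distribution with df = 1. A minimal sketch of that computation follows; the log-likelihood values are hypothetical, and the closed form `erfc(sqrt(x / 2))` is simply the chi-square survival function for one degree of freedom, not a call into Codeml.

```python
import math


def lrt_p_value(lnl_model_a, lnl_null):
    """Return (LRT statistic, P-value) for a chi-square test with df = 1."""
    statistic = 2.0 * (lnl_model_a - lnl_null)
    statistic = max(statistic, 0.0)  # guard against small negative values from optimisation noise
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods for one alignment.
stat, p = lrt_p_value(-1000.0, -1001.9207)
print(round(stat, 3), round(p, 3))
```

The resulting P-values would then be passed to a multiple-testing procedure such as the q-value computation mentioned in the text.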
All sites that appeared to be under positive selection in the three AA genome species were validated
using Illumina HiSeq resequencing data generated at AGI (3 O. sativa ssp. japonica, 3 O. sativa ssp.
indica, and 5 O. glaberrima accessions; Wing et al., unpublished) or obtained from Dr. Wen Wang (2
accessions each of the 2 O. sativa subspecies; Xu et al., 2012b) (Supplemental Table 10). BWA (Li and
Durbin, 2009) and SAMtools (Li et al., 2009) were used for read alignments which were visualized
using Tablet (Milne et al., 2010). A polymorphism was considered validated if 5 or more reads from a
single accession confirmed the presence of the positively selected site.
Expression analysis
Genevestigator (Hruz et al., 2008) is a gene expression database in which public microarray data are
collected, manually curated and normalized. At the time of this analysis, it contained data for 607
samples (1275 including replicates) from 63 whole genome 51K array (Affymetrix GeneChips)
experiments for the IRGSP rice reference sequence (Oryza sativa ssp. japonica cv. Nipponbare). This
database was queried for gene expression levels of Oj genes in expanded and non-expanded OGs,
across the entire set of experimental treatments. After identifying the probe set ID corresponding to the
specified genes, Genevestigator displayed signal intensities from Affymetrix probe sets. Only probes
that targeted a single gene were selected. If a gene was targeted by several probe sets, only one was
selected. Signal intensities were averaged over 1275 samples for each gene.
References
Darriba, D., Taboada, G.L., Doallo, R., and Posada, D. (2011). ProtTest 3: fast selection of best-fit
models of protein evolution. Bioinformatics. 27, 1164-1165.
De Bie, T., Cristianini, N., Demuth, J.P., and Hahn, M.W. (2006). CAFE: a computational tool for
the study of gene family evolution. Bioinformatics. 22, 1269-1271.
Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Nucleic Acids Res. 32(5), 1792-1797.
Goldman, N., and Yang, Z. (1994). A codon-based model of nucleotide substitution for protein-coding
DNA sequences. Mol. Biol. Evol. 11, 725-736.
Gremme, G., Brendel, V., Sparks, M.E., and Kurtz, S. (2005). Engineering a software tool for gene
structure prediction in higher organisms. Inform Software Tech. 47(15), 965-978.
Haas, B.J., Delcher, A.L., Wortman, J.R., and Salzberg, S.L. (2004). DAGchainer: a tool for mining
segmental genome duplications and synteny. Bioinformatics. 20(18), 3643-6.
Hruz, T., Laule, O., Szabo, G., Wessendorp, F., Bleuler, S., Oertle, L., Widmayer, P., Gruissem,
W., and Zimmermann, P. (2008). Genevestigator V3: a reference expression database for the meta-
analysis of transcriptomes. Adv. Bioinformatics. 2008, 420747.
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J., and
Marra, M.A. (2009). Circos: an Information Aesthetic for Comparative Genomics. Genome Res. 19,
1639-1645.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H.,
Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., and Higgins, D.G.
(2007). Clustal W and Clustal X version 2.0. Bioinformatics. 23, 2947-2948.
Li, L., Stoeckert, C.J. Jr., and Roos, D.S. (2003). OrthoMCL: Identification of Ortholog Groups for
Eukaryotic Genomes. Genome Res. 13, 2178-2189.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G.,
Durbin, R., and 1000 Genome Project Data Processing Subgroup. (2009). The sequence
alignment/map format and SAMtools. Bioinformatics. 25, 2078-9.
Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics. 25, 1754-60.
Milne, I., Bayer, M., Cardle, L., Shaw, P., Stephen, G., Wright, F., and Marshall, D. (2010). Tablet
- next generation sequence assembly visualization. Bioinformatics. 26(3), 401-402.
Muyle, A., Serres-Giardi, L., Ressayre, A., Escobar, J., and Glemin, S. (2011). GC-Biased Gene
conversion and Selection affect GC content in the Oryza genus (rice). Mol. Biol. Evol. 28(9), 2695-
2706.
Punta, M., Coggill, P.C., Eberhardt, R.Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund,
K., Ceric, G., Clements, J., Heger, A., Holm, L., Sonnhammer, E.L.L., Eddy, S.R., Bateman, A.,
and Finn, R.D. (2012). The Pfam protein families database. Nucleic Acids Res. 40, D290-D301.
Rezvoy, C., Charif, D., Gueguen, L., and Marais, G.A. (2007). MareyMap: an R-based tool with
graphical interface for estimating recombination rates. Bioinformatics. 23, 2188-2189.
Salse, J., Bolot, S., Throude, M., Jouffe, V., Piegu, B., Quraishi, U.M., Calcagno, T., Cooke, R.,
Delseny, M., and Feuillet, C. (2008). Identification and characterization of shared duplications
between rice and wheat provide new insights into grass genome evolution. Plant Cell. 20, 11-24.
Sanderson, M.J. (1997). A nonparametric approach to estimating divergence times in the absence of
rate constancy. Mol. Biol. Evol. 14, 1218-1231.
Stamatakis, A. (2006). RAxML-VI-HPC: Maximum Likelihood-based Phylogenetic Analyses with
Thousands of Taxa and Mixed Models. Bioinformatics. 22(21), 2688–2690.
Storey, J.D., Taylor, J.E., and Siegmund, D. (2004). Strong control, conservative point estimation,
and simultaneous conservative consistency of false discovery rates: a unified approach. J. R. Stat. Soc.
Series B 66, 187-205.
Tang, L., Zou, X., Achoundong, G., Potgieter, C., Second, G., Zhang, D., and Ge, S. (2010).
Phylogeny and biogeography of the rice tribe (Oryzeae): evidence from combined analysis of 20
chloroplast fragments. Mol. Phylogenet. Evol. 54, 266-277.
Xu, X., Liu, X., Ge, S., Jensen, J.D., Hu, F., Li, X., Dong, Y., Gutenkunst, R.N., Fang, L., Huang,
L., Li, J., He, W., Zhang, G., Zheng, X., Zhang, F., Li, Y., Yu, C., Kristiansen, K., Zhang, X.,
Wang, J., Wright, M., McCouch, S., Nielsen, R., Wang, J., and Wang, W. (2012b). Resequencing
50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes.
Nat. Biotechnol. 30(1), 105-114.
Yang, Z., and Nielsen, R. (2002). Codon-substitution models for detecting molecular adaptation at
individual sites along specific lineages. Mol. Biol. Evol. 19, 908-917.
Yang, Z., Wong, W.S.W., and Nielsen, R. (2005). Bayes Empirical Bayes Inference of Amino Acids
Sites Under Positive selection. Mol. Biol. Evol. 22(4), 1107-1118.
Yang, Z. (2007). PAML 4: a program package for phylogenetic analysis by maximum likelihood. Mol.
Biol. Evol. 24, 1586-1591.
Zhang, J., Nielsen, R., and Yang, Z. (2005). Evaluation of an improved Branch-Site Likelihood
Method for Detecting Positive Selection at the Molecular level. Mol. Biol. Evol. 22(12), 2472-2479.
Zhu, Q., and Ge, S. (2005). Phylogenetic relationships among A-genome species of the genus Oryza
revealed by intron sequences of four nuclear genes. New Phytol. 167, 249-265.