
Supplemental Table 1. Size of 32 families in the four Oryza genomes, O. sativa ssp. japonica (Oj), O. sativa ssp. indica (Oi), O. glaberrima (Og), O. brachyantha (Ob), and Brachypodium distachyon (Bd).

Pfam id    Oj   Oi   Og   Ob   Bd   Family
PF00632     7    8    8    9    7   HECT-domain (ubiquitin-transferase)
PF02386     7    6    6    6    6   Cation transport protein
PF06480     9    9    7    6    9   FtsH Extracellular, membrane-bound ATP-dependent protease
PF07839     9   10    8    9    6   Plant calmodulin-binding domain
PF00888    16   16   16   11   11   Cullin family
PF03321    17   16   14   13   11   GH3 auxin-responsive promoter
PF03110    21   22   19   19   18   SBP domain (for SQUAMOSA-PROMOTER BINDING PROTEIN)
PF07887    22   26   19   19   12   Calmodulin binding protein-like
PF01476    23   30   20   19   17   LysM (lysin motif) domain
PF01466    32   36   25   23   16   Skp1 family, dimerisation domain
PF02701    32   31   28   12   28   Dof domain, zinc finger
PF03552    35   39   26   31   27   Cellulose synthase
PF00535    37   41   34   38   30   Glycosyl transferase family 2
PF00232    41   53   41   35   30   Glycoside hydrolase family 1
PF02309    52   55   45   53   44   AUX/IAA family
PF00462    60   71   61   40   57   Glutaredoxin
PF03514    62   64   58   56   46   GRAS (GAI, RGA, SCR) family transcription factor
PF02519    66   67   62   59   53   Auxin responsive protein
PF00319    77   70   64   60   52   SRF-type transcription factor (DNA-binding and dimerisation domain)
PF04564    84   82   84   72   71   U-box domain
PF03106    98  113   90   90   83   WRKY DNA-binding domain
PF00046   107  103   85   88   86   Homeobox domain (Homeodomain fold)
PF00026   111  110   98   69   70   Eukaryotic aspartyl protease
PF00096   122  134  110   98  101   Zinc finger, C2H2 type
PF00234   138  126  113   84   93   Protease inhibitor/seed storage/LTP family (Tryp-alpha-amyl)
PF00010   159  172  150  133  138   Helix-loop-helix DNA-binding domain
PF00651   166  200  134   83  110   BTB (for BR-C, ttk and bab)/POZ (for Pox virus and Zinc finger) domain
PF00249   283  338  272  273  247   Myb-like DNA-binding domain
PF00067   366  418  302  293  263   Cytochrome P450
PF00097   422  420  368  301  380   Zinc finger, C3HC4 type (RING finger)
PF00931   610  687  398  373  364   NB-ARC domain
PF00646   735  767  649  249  535   F-box domain


Supplemental Table 2. Sub-family copy number. In gray are the families that show minor size variation between Ob and the IRGSP RefSeq.

subfamily total Oj Oi Og Ob (Oj-Ob) (Oj-Ob)/size %

F-box-19 9 3 2 3 1 2 22.2

F-box-22 14 4 3 4 3 1 7.1

F-box-20 18 6 5 4 3 3 16.7

F-box-6 25 8 6 9 2 6 24.0

F-box-4 26 9 9 6 2 7 26.9

F-box-18 35 11 9 10 5 6 17.1

F-box-21 45 14 14 13 4 10 22.2

F-box-9 46 10 12 11 13 -3 -6.5

F-box-5 55 15 17 13 10 5 9.1

F-box-10 63 16 17 15 15 1 1.6

F-box-14 71 22 24 17 8 14 19.7

F-box-23 72 23 28 15 6 17 23.6

F-box-7 85 24 24 23 14 10 11.8

F-box-2 91 31 29 23 8 23 25.3

F-box-3 110 34 34 32 10 24 21.8

F-box-16 117 33 36 38 10 23 19.7

F-box-15 122 39 41 31 11 28 23.0

F-box-17 143 38 48 38 19 19 13.3

F-box-8 186 61 56 47 22 39 21.0

F-box-13 206 63 64 60 19 44 21.4

F-box-11 261 83 83 73 22 61 23.4

F-box-12 269 84 87 79 19 65 24.2

F-box-1 331 104 119 85 23 81 24.5

Total 2400 735 767 649 249 486 20.3

Asp-2 20 7 4 6 3 4 20.0

Asp-3 21 7 7 5 2 5 23.8

Asp-4 24 6 6 7 5 1 4.2

Asp-5 97 25 29 28 15 10 10.3

Asp-6 109 32 30 24 23 9 8.3

Asp-1 117 34 34 28 21 13 11.1

Total 388 111 110 98 69 42 10.8

BTB-9 4 1 1 1 1 0 0.0

BTB-4 6 1 2 2 1 0 0.0

BTB-8 7 2 2 1 2 0 0.0

BTB-10 7 2 2 2 1 1 14.3

BTB-5 8 2 3 2 1 1 12.5

BTB-6 12 4 3 2 3 1 8.3

BTB-13 17 5 6 4 2 3 17.6

BTB-11 38 13 17 4 4 9 23.7

BTB-1 46 13 16 10 7 6 13.0

BTB-3 67 22 28 17 0 22 32.8

BTB-7 94 25 30 25 14 11 11.7

BTB-2 100 28 43 20 9 19 19.0

BTB-12 177 48 47 44 38 10 5.6

Total 583 166 200 134 83 83 14.2

GRX-2 54 12 17 14 11 1 1.9

GRX-1 178 48 54 47 29 19 10.7

Total 232 60 71 61 40 20 8.6

NB-ARC-5 28 9 11 4 4 5 17.9

NB-ARC-1 33 9 14 4 6 3 9.1

NB-ARC-17 51 15 19 11 6 9 17.6

NB-ARC-12 68 19 24 11 14 5 7.4

NB-ARC-14 71 20 28 12 11 9 12.7

NB-ARC-4 80 16 27 16 21 -5 -6.3

NB-ARC-9 84 24 29 17 14 10 11.9

NB-ARC-6 95 31 33 16 15 16 16.8

NB-ARC-11 102 28 35 15 24 4 3.9

NB-ARC-15 104 31 40 21 12 19 18.3

NB-ARC-3 147 39 41 30 37 2 1.4

NB-ARC-7 155 48 57 24 26 22 14.2

NB-ARC-10 157 51 45 32 29 22 14.0

NB-ARC-2 189 53 62 41 33 20 10.6

NB-ARC-8 208 62 60 46 40 22 10.6

NB-ARC-16 214 67 73 43 31 36 16.8

NB-ARC-13 282 88 89 55 50 38 13.5

Total 2068 610 687 398 373 237 11.5

Tryp-α-amyl-6 44 14 11 10 9 5 11.4

Tryp-α-amyl-1 45 12 12 10 11 1 2.2

Tryp-α-amyl-4 52 17 10 13 12 5 9.6

Tryp-α-amyl-2 59 20 19 16 4 16 27.1

Tryp-α-amyl-3 99 29 30 21 19 10 10.1

Tryp-α-amyl-5 162 46 44 43 29 17 10.5

Total 461 138 126 113 84 54 11.7
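
The last two columns of Supplemental Table 2 appear to be the difference Oj − Ob and that difference expressed as a percentage of the sub-family total (the "total" column, summed over the four Oryza genomes). The short sketch below reproduces two rows under that reading; the function name `size_difference` is hypothetical and only serves the illustration.

```python
def size_difference(oj: int, ob: int, total: int) -> tuple[int, float]:
    """Return (Oj - Ob, 100 * (Oj - Ob) / total rounded to one decimal)."""
    diff = oj - ob
    return diff, round(100.0 * diff / total, 1)


# Rows taken from Supplemental Table 2.
print("F-box-19", size_difference(oj=3, ob=1, total=9))
print("F-box-9", size_difference(oj=10, ob=13, total=46))
```

A negative percentage, as for F-box-9, marks a sub-family that is larger in O. brachyantha than in japonica.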


Supplemental Table 3. Number and percentage of orthologous groups and orthologous genes for each gene family.

Asp BTB F-box GRX NB-ARC Tryp-α-amyl Zf-Dof

orthologous groups 105 160 738 59 567 137 31

Oj-Oi-Og-Ob 46 51 163 30 149 46 9

Oj-Oi-Og-Ob % 44 32 22 51 26 34 29

Oj-Oi-Og 32 58 362 20 130 45 13

Oj-Oi-Og % 31 36 49 34 23 33 42

Oj-Oi 12 25 113 2 105 14 5

Oj-Oi % 11 16 15 3 19 10 16

others 15 26 100 7 183 32 4

others % 14 16 14 12 32 23 13

orthologous pairs Asp BTB F-box GRX NB-ARC Tryp-α-amyl Zf-Dof

Oj-Oi 94 143 667 54 449 112 27

% Oj genes in OP Oj-Oi 84.7 86.1 90.7 90 73.6 81.2 84.4

% Oi genes in OP Oj-Oi 85.4 71.5 87 76 65.3 88.9 87.1

Oj-Og 85 116 560 52 321 106 25

% Oj genes in OP Oj-Og 76.6 69.9 76.2 86.7 52.6 76.8 78.1

% Og genes in OP Oj-Og 86.7 86.6 86.3 85.2 80.6 93.8 89.3

Oj-Ob 52 64 197 34 244 65 10

% Oj genes in OP Oj-Ob 46.8 38.5 26.8 56.7 40 47.1 31.2

% Ob genes in OP Oj-Ob 75.4 77.1 79.1 85 65.4 77.4 83.3


Supplemental Table 6. Results of the Poisson regression analysis for the relationship between recombination rate and gene family number.

Family        Coefficient  Exp(Coeff)  Std Error  P-value
Asp                 0.102        1.11      0.031    0.001
BTB                -0.002        1         0.02     0.912
F-box               0.063        1.06      0.012    0
GRX                 0.143        1.15      0.041    0
NB-ARC              0.01         1.01      0.015    0.509
Tryp-α-amyl         0.067        1.07      0.029    0.02
Zf-Dof              0.085        1.09      0.059    0.151
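
In a Poisson regression with a log link, Exp(Coeff) is the multiplicative change in the expected gene count per unit increase of the predictor (here the recombination rate). The sketch below recomputes Exp(Coeff) from the coefficients of Supplemental Table 6; the dictionary names are illustrative, and small discrepancies are expected because the published coefficients are rounded to three decimals.

```python
import math

# Coefficients and Exp(Coeff) values as listed in Supplemental Table 6.
COEFFICIENTS = {
    "Asp": 0.102, "BTB": -0.002, "F-box": 0.063, "GRX": 0.143,
    "NB-ARC": 0.01, "Tryp-α-amyl": 0.067, "Zf-Dof": 0.085,
}
REPORTED_EXP = {
    "Asp": 1.11, "BTB": 1.0, "F-box": 1.06, "GRX": 1.15,
    "NB-ARC": 1.01, "Tryp-α-amyl": 1.07, "Zf-Dof": 1.09,
}


def rate_ratio(coefficient: float) -> float:
    """Multiplicative effect on the expected count per unit of predictor."""
    return math.exp(coefficient)


for family, coeff in COEFFICIENTS.items():
    print(f"{family:12s} exp({coeff:+.3f}) = {rate_ratio(coeff):.3f}")
```

For Asp, for example, the ratio of about 1.11 would mean roughly 11% more genes expected per additional cM/Mb, under the assumption that the predictor is expressed in those units.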

Supplemental Table 8. List of loci with signature of positive selection. Gene IDs are simplified compared to the initial annotations so only the species (Oj, Oi, Og or Ob) and the unique number are indicated.

Results of branch-site test of positive selection Genes with the mutation

Family and groups OG Locus Mutation Gene 1 Gene 2 Gene 3

Og01_00472.Oj01_08330.Oi_02885 426 K to T Oi_02885

Og01_03244.Oi_04874.Oj01_64830 54 M to I Oj01_64830

132 G/S to L Og01_03244

Og02_01220.Oi_06611.Oj02_20690 369 R/P to T Oj02_20690

Og08_00579.Oj08_13030.Oi_28227 362 S to K Oi_28227

Og08_00584.Oj08_13090.Oi_16035 303 V to M Oi_16035 Og08_00584

Og08_01244.Oi_26997.Oj08_31430 62 T/R/K to M Og08_01244

174 A to E Og08_01244

Og09_00453.Oj09_16870.Oi_30541 193 R to C Oj09_16870

219 A to V Oi_30541

Og10_00916.Oi_32948.Oj10_28820 319 C to Y Oj10_28820

Og10_00920.Oi_32953.Oj10_28870

52 S to T Og10_00920

105 G to A Og10_00920

148 N to S Oi_32953

210 L to V Og10_00920

219 L to F Og10_00920

242 S to L Oj10_28870 Oi_32953

243 S to L Oj10_28870

278 A to D Oi_32953

313 N to K Oj10_28870 Oi_32953

Og10_00937.Oj10_29340.Oi_32974

487 C to Y and F Oj10_29340 Og10_00937

501 Y to L and F Oj10_29340 Oi_32974 Og10_00937

517 I/V to L Oi_32974

518 E to D Oi_32974

Og11_01646.Oj11_40490.Oi_35629 339 P to L Oj11_40490 Oi_35629

Oj08_13020.Oi_28224.Og08_00578 314 L to P Oj08_13020

Oi_114.Oj_970 135 A/S to I Oj_970

968 N/S to S Oj_970

Og_2114.Oi_68.Oj_929 130 W to Y Og_2114

Og_2125.Oi_83.Oj_941 74 W to L Og_2125

Og_2334.Oi_324.Oj_1165 241 L to R/Y/V Oj_1165 Oi_324 Og_2334

Og_2347.Oi_338.Oj_1176 388 E to Y Og_2347

Og_2412.Oi_415.Oj_1252 96 L to K Og_2412

Og_2498.Oi_514.Oj_1346 358 D to Q Oj_1346 Oi_514 Og_2498

Oi_01920.Oj01_16390

202 A/R to G/Q Oj01_16390 Oi_01920

204 K to Q Oj01_16390

205 T to S Oj01_16390

Oi_03230.Oj01_16370 157 A to T/L Oj01_16370 Oi_03230

159 S to K Oi_03230

Oi_06618.Oj02_20460

434 I to T Oj02_20460

701 P to L Oj02_20460

811 E to K Oi_06618

Oi_34858.Oj11_06230 408 G to S Oj11_06230 Oi_34858

Oi_35038.Oj11_12330

753 N to K/T Oj11_12330 Oi_35038

760 A to N Oj11_12330

764 */R/I/E/K/M/S to S/A Oj11_12330 Oi_35038

820 */T/V/A to V Oi_35038

823 */R/K/S to K/S Oj11_12330 Oi_35038

Oi_35437.Oj11_35210

388 R to C Oj11_35210

395 A to V/G Oj11_35210 Oi_35437

505 E to K Oj11_35210

523 Y to N Oi_35437

Oj01_49190.Oi_04222 497 F to S Oi_04222

Oj01_57270.Oi_04573

786 I to M Oj01_57270

916 D/N to N/P Oj01_57270

947 H/C/F/L to Y Oj01_57270 Oi_04573

1397 L to P Oj01_57270 Oi_04573

Oi_21844.Oj06_06860

447 S to T/Y Oj06_06860 Oi_21844

779 E to V/K Oj06_06860

781 T to G/M Oj06_06860

785 I to T/G Oj06_06860

868 R to H Oj06_06860

872 V to R Oj06_06860

Oi_21869.Oj06_06390 322 M to L Oi_21869

Oi_33964.Oj11_30050 77 L to V Oj11_30050

Oi_34292.Oj11_10770 90 A to N Oj11_10770

Oi_35431.Oj11_34970 626 L to T/M Oj11_34970 Oi_35431

870 A to P/S Oj11_34970 Oi_35431

Oi_36183.Oj12_30070

471 Q to R/D Oj12_30070 Oi_36183

634 L to G Oj12_30070

945 R to A Oj12_30070

Oi_36194.Oj12_29710 467 K to D/N Oj12_29710 Oi_36194

1377 V to D/F Oj12_29710

Oi_36218.Oj12_28250 517 L to I Oj12_28250

Oj04_25900.Oi_15350 251 E to D Oi_15351 (IP)

Oj11_14380.Oi_35077 285 L to V/H Oj11_14380 Oi_35077

Oj12_33160.Oi_36029

152 I to N Oj12_33160 Oi_36029

504 Y to H Oj12_33160 Oi_36029

923 to D/W Oj12_33160 Oi_36029

Oj12_37760.Oi_37604 1222 D to E Oi_37604

Oi_06629.Oj02_20210.Og02_01204 744 C/S to L Oj02_20210

Oi_07774.Oj11_46210.Og11_01857

390 N to A Og11_01857

470 K to S Og11_01857

767 S to H Og11_01857

859 N/S to E Og11_01857

921 K to Y Og11_01857

Oj12_29290.Og12_01078.Oi_37381 74 F to L Oj12_29290 Oi_37381

Asp Expansion Oj-Oi-Og

BTB Expansion Oj-Oi-Og

F-box Expansion Oj-Oi

F-box Expansion Oj-Oi-Og

NB-ARC Expansion Oj-Oi

NB-ARC Expansion Oj-Oi-Og

Supplemental Table 9. Genomic sequences and annotation information.

O. sativa ssp. japonica
  cultivar/strain: Nipponbare
  institution: MSU Rice Genome Annotation Project
  release: v6.1
  reference: International Rice Genome Sequencing Project, 2005
  length of all chromosomes: 372,317,567 bp
  access: http://rice.plantbiology.msu.edu

O. sativa ssp. indica
  cultivar/strain: 93-11
  institution: Beijing Genomics Institute (BGI)
  release: 2003, annotations updated in 2010
  reference: Yu et al., 2002
  length of all chromosomes: 352,234,171 bp
  access: http://rise2.genomics.org.cn/page/rice/index.jsp

O. glaberrima
  cultivar/strain: IRGC:96717
  institution: Arizona Genomics Institute (AGI)
  release: v1.1
  reference: unpublished
  length of all chromosomes: 285,037,524 bp
  access: http://www.genome.arizona.edu; GenBank: ADWL01000000

O. brachyantha
  cultivar/strain: IRGC:101232
  institution: Chinese Academy of Sciences IGDB
  release: v1.4
  reference: Chen et al., 2013
  length of all chromosomes: 250,927,218 bp
  access: http://www.gramene.org; GenBank: AGAT00000000.1

Supplemental Table 11. Information about the population resequencing data used for selection pressure validation.

AGI resequencing populations

Run name            Accession      Country of origin

O. sativa ssp. japonica
Oj_58398            IRGC 58398     Brunei
Oj_58716            IRGC 58716     Bangladesh
Oj_J102             IRGC 3340      Dominican Republic

O. sativa ssp. indica
Oi_58267            IRGC 58267     Afghanistan
Oi_12880            IRGC 12880     Iran
Oi_13793            IRGC 13793     Brunei

O. glaberrima
Og_101049           IRGC 101049    Senegal
Og_67563            IRGC 67563     Ghana
Og_100854           IRGC 100854    Congo
Og_68939            IRGC 68939     Madagascar
Og_96841            IRGC 96841     Zimbabwe

Xu et al., 2012 resequencing populations

Run name            Accession      Country of origin

O. sativa ssp. japonica
ORIwwRAADBAAPE      IRGC 32399     Bhutan
SZC08001BHDCAAPE    IRGC 55471     South-Korea

O. sativa ssp. indica
SZC08001ATDCAAPE    IRGC 8231      Vietnam
SZC08001ADDCAAPE    IRGC 27762     Thailand

(b)

Columns: OP = orthologous pairs; Coord = OP with coordinates; Non-coll = syntenic, non-collinear; Coll = syntenic, collinear; Non-synt = non-syntenic; %OP = % collinear on total OP; %Oj = % Oj collinear*; %Ox = % Ox collinear*.

Pair   Family            OP   Coord  Non-coll  Coll  Non-synt   %OP    %Oj    %Ox
Oj-Oi  Asp               94     93       1      86       6      91.5   77.5   78.2
Oj-Oi  BTB              143    143       6     128       9      89.5   77     64
Oj-Oi  F-box            667    659      21     599      36      89.8   81.5   78
Oj-Oi  Glutaredoxin      54     54       0      53       1      98     88.3   74.6
Oj-Oi  NB-ARC           449    443      18     385      37      85.7   63     56
Oj-Oi  Tryp-alpha-amyl  112    111       1     106       4      94.6   76.8   84
Oj-Oi  Zf-Dof            27     26       0      26       0      96.3   81.3   84

Oj-Og  Asp               85     77       1      75       1      88.2   67.6   76.5
Oj-Og  BTB              116    113       1     112       0      96.5   67.5   83.6
Oj-Og  F-box            560    560      24     528       5      94.3   72     81.4
Oj-Og  Glutaredoxin      52     52       0      52       0     100     86.7   85.2
Oj-Og  NB-ARC           321    305       7     295       1      92     48.4   74.1
Oj-Og  Tryp-alpha-amyl  106     98       0      98       0      92.4   71     86.7
Oj-Og  Zf-Dof            25     24       0      24       0      96     75     85.7

Oj-Ob  Asp               52     52       0      52       0     100     46.8   75.4
Oj-Ob  BTB               64     62       1      59       2      92.2   35.5   71
Oj-Ob  F-box            197    196      10     175       8      88.8   23.8   70.3
Oj-Ob  Glutaredoxin      34     34       1      32       1      94     53.3   80
Oj-Ob  NB-ARC           244    243      18     214       9      87.7   35     57.4
Oj-Ob  Tryp-alpha-amyl   65     65       2      59       2      90.8   42.8   70.2
Oj-Ob  Zf-Dof            10     10       0      10       0     100     31.2   83.3


Supplemental Figure 1. (a) Relationships of orthologous pairs between Oryza sativa ssp. japonica and both O. glaberrima and O. brachyantha for the F-box, NB-ARC, Asp, BTB, GRX, Tryp-α-amyl, and Zf-Dof gene families. Gray genes and lines correspond to collinear relationships. Red genes and lines correspond to non-syntenic and non-collinear relationships. Blue genes on the chromosomes are absent from orthologous pairs. (b) Number and percentage of syntenic, collinear, and non-syntenic orthologous pairs (OP) for each pair of Oryza species.

Global parameters

No outgroup
Tree: (((Oi,Oj)n1,Og)n2,Ob)n3
λ tree: (((1,1)1,1)1,2)
λ1 = 6.94 × 10^-6, λ2 = 6.67 × 10^-6
Average expansion: Oi = 9.750000, Oj = -0.062500, n1 = 6.156250, Og = -13.031250, n2 = 18.156250, Ob = -16.437500

With outgroup B. distachyon
Tree: ((((Oi,Oj)n1,Og)n2,Ob)n3,Bd)n4
λ tree: (((((1,1)1,1)1,1)1,2)3,3)
λ1 = 6.99 × 10^-6, λ2 = 6.67 × 10^-6, λ3 = 2.09 × 10^-6
Average expansion: Oi = 7.419355, Oj = -0.225806, n1 = 4.193548, Og = -8.935484, n2 = 17.483871, Ob = -12.903226, n3 = 0.903226, Bd = -2.129032

Family sizes in extant species and at ancestral nodes (NB-ARC: analysis without outgroup)

Family         Oj    Oi    Og    Ob    Bd    n1    n2    n3    n4
Asp           111   110    98    69    70   108   105    82    77
BTB           166   200   134    83   110   170   158   116   114
F-box         735   767   649   249   535   731   701   468   495
GRX            60    71    61    40    57    64    63    52    54
Tryp-α-amyl   138   126   113    84    93   128   123   100    98
Zf-Dof         32    31    28    12    28    31    30    22    25
NB-ARC        610   687   398   373     /   605   537   452     /

Change along the branch leading to each node

Family         Oj    Oi    n1    Og    n2    Ob    n3    Bd
Asp            +3    +2    +3    -7   +23   -13    +5    -7
BTB            -4   +30   +12   -24   +42   -33    +2    -4
F-box          +4   +36   +30   -52  +233  -219   -27   +40
GRX            -4    +7    +1    -2   +11   -12    -2    +3
Tryp-α-amyl   +10    -2    +5   -10   +23   -16    +2    -5
Zf-Dof         +1     0    +1    -2    +8   -10    -3    +3
NB-ARC         +5   +82   +68  -139   +85   -79     /     /

Branch P-values below 0.05 (placed on the corresponding branches in the figure): Asp 0.025; BTB 0, 0, 0, 0.008, 0.024; F-box 0, 0, 0, 0, 0; GRX 0.001, 0.03; Zf-Dof 0, 0.04, 0.004; NB-ARC 0, 0, 0, 0.006, 0.007.

Supplemental Figure 2. Results of the CAFE analysis. The first row of the table gives the Newick-formatted tree, the global birth and death parameter λ (gain-loss/gene/MYA) for the specified subset of branches in the tree, and the average size of expansions/contractions along each branch among all families. Then, for each family, the ancestral family size at each node and the direction of the changes are displayed, along with the branch P-values when below 0.05.

[Figure: histograms of pairwise ω ratios (x axis: ω, from 0.05 to 2; y axis: frequency) for the F-box and NB-ARC families, with one panel per type of orthologous group: Oj-Oi-Og-Ob, Oj-Oi-Og and Oj-Oi.]

Supplemental Figure 3. Frequency distribution of ω ratios of all pairwise comparisons for the different types of orthologous groups for the F-box and NB-ARC gene families. The last bar on the x axis represents all values greater than 2, including the infinite values (ω = 99).

(a)

               Oj in Oj-Oi-Og-Ob         Oj in Oj-Oi-Og          Oj in Oj-Oi
Family          mean     se  median      mean     se  median     mean     se  median
all families    7635    845    1725      2284    261     568     2813    786     582
Asp            12513   4132    5181      2949   1104    1180     6101   3334    1611
BTB             8395   1665    6220      1784    864     391      458    115     305
F-box           5900    734    2076      1616    218     568     1489    381     558
GRX            19236   7870    6376     10057   3336    7655        /      /       /
NB-ARC          5130   1163    1115      1035    229     461     3621   2431     438
Tryp-α-amyl    18100   5627    4600     11726   3338    5354     9642   3479    5766
Zf-Dof          6272   1513    6491      3802   1351    1825     1397    630     975

(b)

Oj-Oi-Og-Ob vs Oj-Oi
Family          n1    n2    w (U)   pvalue      H
all families   327   133    29018   1.841e-08   H1
Asp             28     5       86   0.4482      H0
BTB             29     8      197   0.001796    H1
F-box          129    75     6743   2.77e-06    H1
GRX             20     1       15   0.51714     H0
NB-ARC         111    38     2644   0.01981     H1
Tryp-α-amyl     22     9      100   0.983       H0
Zf-Dof           6     3       17   0.04762     H1

Oj-Oi-Og-Ob vs Oj-Oi-Og
Family          n1    n2    w (U)   pvalue      H
all families   327   492   109624   2.2e-16     H1
Asp             28    21      440   0.002714    H1
BTB             29    50     1153   1.342e-05   H1
F-box          129   267    23495   1.583e-09   H1
GRX             20    13      146   0.573       H0
NB-ARC         111   110     8411   1.223e-06   H1
Tryp-α-amyl     22    22      232   0.8253      H0
Zf-Dof           6    11       48   0.149       H0

Oj-Oi vs Oj-Oi-Og
Family          n1    n2    w (U)   pvalue      H
all families   133   492    33938   0.509       H0
Asp              5    21       58   0.7526      H0
BTB              8    50      167   0.4636      H0
F-box           75   267   9745.5   0.7984      H0
GRX              1    13        5   0.8571      H0
NB-ARC          38   110     2309   0.3364      H0
Tryp-α-amyl      9    22      101   0.949       H0
Zf-Dof           3    11       11   0.456       H0

(c)

[Figure: distribution histograms of mean expression intensity for the Oj-Oi-Og-Ob, Oj-Oi-Og and Oj-Oi classes of OGs, with values combined for all gene families.]

Supplemental Figure 4. Results of the global gene-level expression analysis: (a) Parameters of the mean expression intensity distribution, (b) Results of the Mann-Whitney-Wilcoxon test between the distributions of the different classes of OGs, (c) Distribution histograms for each class of OGs with values combined for all gene families.

Supplemental Method 1. More detailed description of the methods applied in the study

Identification of gene families, orthology relationships and collinearity

We initially targeted 32 gene families (Supplemental Table 1) across the four Oryza species for which

genome sequences were available. The HMM-profile of each family domain (see Supplemental Table 1

for Pfam identifiers), downloaded from Pfam 26.0 (www.pfam.sanger.ac.uk; Punta et al., 2012), was

used as query to search the annotated proteomes of Oryza sativa ssp. japonica and indica, O.

glaberrima and O. brachyantha (E-value≤1.0) with HMMER 3.0 (www.hmmer.janelia.org).

Information on genome sequences and annotation data for the four Oryza genomes is displayed in

Table 4. Alternative transcripts and TE related proteins were filtered. For the large F-box family, to

identify additional genes missed in the annotation, a similarity-based re-annotation was performed. The

collection of F-box proteins from the four genomes was remapped to their unmasked whole genome

sequences in all pairwise combinations using tblastn. From these alignments, gene models were

predicted via spliced alignments with GenomeThreader (Gremme et al., 2005). The newly predicted

genes were retained only if they contained an F-box domain. This combined analysis has led to the

manual assignment of 58, 43, 127, and 29 additional F-box genes in japonica, indica, O. glaberrima

and O. brachyantha, respectively. The F-box genes in the four genomes were given new ids to account

for these newly predicted genes. The new nomenclature and its correspondence with the initial

annotations is provided in Supplemental Table 11. Among the 32 families, 7 families were selected

(Supplemental Table 9) with the largest size differences between O. brachyantha and the O. sativa ssp.

japonica reference sequence (Ob/Oj < 0.70 and at least 10 more genes in Oj compared to Ob). The

families were divided into sub-families using an initial neighbor-joining tree constructed with Clustal

2.0 (Larkin et al., 2007). To identify domain composition of the AA-specific sub-family BTB-3, we

searched the PfamA database using HMMER3 (E-value≤ 1.0).
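
The family selection rule stated above (Ob/Oj < 0.70 and at least 10 more genes in Oj than in Ob) can be checked against the sizes in Supplemental Table 1. The sketch below is only an illustration of that rule: the dictionary `SIZES` holds the (Oj, Ob) counts from the table, and the names `SIZES` and `select_families` are hypothetical.

```python
# (Oj, Ob) family sizes from Supplemental Table 1, keyed by Pfam id.
SIZES = {
    "PF00632": (7, 9), "PF02386": (7, 6), "PF06480": (9, 6),
    "PF07839": (9, 9), "PF00888": (16, 11), "PF03321": (17, 13),
    "PF03110": (21, 19), "PF07887": (22, 19), "PF01476": (23, 19),
    "PF01466": (32, 23), "PF02701": (32, 12), "PF03552": (35, 31),
    "PF00535": (37, 38), "PF00232": (41, 35), "PF02309": (52, 53),
    "PF00462": (60, 40), "PF03514": (62, 56), "PF02519": (66, 59),
    "PF00319": (77, 60), "PF04564": (84, 72), "PF03106": (98, 90),
    "PF00046": (107, 88), "PF00026": (111, 69), "PF00096": (122, 98),
    "PF00234": (138, 84), "PF00010": (159, 133), "PF00651": (166, 83),
    "PF00249": (283, 273), "PF00067": (366, 293), "PF00097": (422, 301),
    "PF00931": (610, 373), "PF00646": (735, 249),
}


def select_families(sizes, max_ratio=0.70, min_difference=10):
    """Keep families with Ob/Oj < max_ratio and Oj - Ob >= min_difference."""
    return sorted(
        pfam
        for pfam, (oj, ob) in sizes.items()
        if ob / oj < max_ratio and oj - ob >= min_difference
    )


print(select_families(SIZES))
```

Under these thresholds the filter keeps seven Pfam ids, corresponding to Zf-Dof, GRX, Asp, Tryp-α-amyl, BTB, NB-ARC and F-box. The Cullin family (16 vs 11) passes the ratio test but not the difference test.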

OrthoMCL (Li et al., 2003) was used to cluster the proteins into orthologous groups (OGs). The

similarity search was conducted with an all-versus-all BLASTP analysis (E-value<1e-5). Inparalogous

and co-orthologous pairs were removed in order to obtain strict orthologous relationships. In specific

cases where proteins in the first species had several mutual hits of equal significance with proteins in

the second species, OrthoMCL clustered several OGs together. Phylogenetic information was used to

manually divide these large groups into single OGs, and identify additional inparalogs and co-

orthologs. To construct trees, proteins from each sub-family were aligned with MUSCLE (v3.7, Edgar,

2004). ProtTest3 (Darriba et al., 2011) was used to select the best-fitting substitution model for each

alignment among JTT, LG, WAG and VT. Maximum likelihood (ML) trees were generated with

RAxML using default settings (v7.2.8, Stamatakis, 2006) and the corresponding amino acid model.

Once the final OGs were obtained, the genes without chromosome coordinates were excluded, and

syntenic (genes located on homologous chromosomes), collinear (syntenic genes in corresponding

orders) and non-syntenic orthologous genes were defined for each pair of species. If one or both of the

genes in the non-syntenic pairs had inparalogs, we checked if one of those inparalogs could be the real

syntenic ortholog by comparing the list of genes in non-syntenic pairs to the lists of inparalogs: {Iff

OPx = (Ax-a;By-b) && IP(By-b;Bx-b') && (Bx-b') ∉ OGn, so OPx' = (Ax-a;Bx-b')}, where OP =

orthologous pair, OG = orthologous group, (A;B) = species, (x;y) = chromosomes, (a;b;b') =

genes, Iff = if and only if, && = and, IP() = inparalogous pair, and ∉ = does not belong to. We also

checked if a true collinear ortholog could not be found because it was not classified as part of the

family under investigation. For each non-syntenic pair between species A and B, the gene in A was

compared by blastp to the entire proteome of species B. If a better or equivalent hit was found with a

different gene in species B, in comparison to the initial pair, and this gene was collinear but not present

in the family, it was considered as the true ortholog. The initial pair was discarded in the collinearity

final results, although such pairs were included in the OrthoMCL results, as they are the best hits we

could find within a family. Circos was used to visualize the gene distribution and collinearity

(Krzywinski et al., 2009).
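
The inparalog check described above can be pictured with a small sketch. It assumes, for illustration only, that homologous chromosomes carry the same label in both species and that genes are simple (species, chromosome, name) records; `Gene`, `rescue_syntenic_ortholog` and the example identifiers are hypothetical and not part of the original pipeline.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Gene(NamedTuple):
    species: str
    chrom: str
    name: str


def rescue_syntenic_ortholog(
    pair: Tuple[Gene, Gene],
    inparalogs: Dict[Gene, List[Gene]],
    genes_in_groups: Set[Gene],
) -> Tuple[Gene, Gene]:
    """Replace a non-syntenic partner by a syntenic inparalog when possible.

    The pair (Ax-a, By-b) becomes (Ax-a, Bx-b') if b' is an inparalog of b,
    lies on the chromosome homologous to that of a, and does not already
    belong to an orthologous group.
    """
    gene_a, gene_b = pair
    if gene_a.chrom == gene_b.chrom:
        return pair  # already syntenic, nothing to rescue
    for candidate in inparalogs.get(gene_b, []):
        if candidate.chrom == gene_a.chrom and candidate not in genes_in_groups:
            return gene_a, candidate
    return pair


# Hypothetical example: b sits on chromosome 2, its inparalog b2 on chromosome 1.
a = Gene("A", "1", "a")
b = Gene("B", "2", "b")
b2 = Gene("B", "1", "b2")
print(rescue_syntenic_ortholog((a, b), {b: [b2]}, genes_in_groups=set()))
```

If `b2` were already assigned to another orthologous group, the initial pair would be returned unchanged.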

Evolution of gene family sizes

To compute the global birth and death rate (λ, probability of both gene gain and loss per gene per MY,

assuming that they are equally probable) of gene families in a group of taxa, CAFE (Computational

Analysis of gene Family Evolution, De Bie et al., 2006) requires gene family sizes in extant species and

their phylogenetic relationships, including their time of divergence. CAFE calculates the most likely

ancestral family sizes (internal nodes), from which we can infer if the variation is the consequence of

expansion or contraction. The dataset for 32 families (Supplemental Table 1) and the species tree

displayed in Figure 1 were used in the analyses. Divergence times were based on data from the

literature (Zhu and Ge, 2005; Tang et al., 2010; Sanderson, 1997). CAFE computes a P-value

(probability of observing the data given random gain and loss of genes) for each family to compare

their evolutionary dynamic to the global birth and death model. If p<0.05, the family deviates from the

model and CAFE computes P-values for each branch to identify where the largest changes have taken

place. The number of random samples for the Monte Carlo re-sampling procedure was 1000. As CAFE

failed to provide accurate λ estimation because of the large gene family size variation of the NB-ARC

family on a single branch (between Og and Oj/Oi), this family had to be removed in the first round of

analysis. Running a second analysis round without the outgroup B. distachyon allowed us to obtain

values for the internal nodes in Oryza for the NB-ARC family. For the 6 other families under

investigation, as both analyses gave the same results for the Oryza branches, the results of the first

round were presented. In the first round of analysis, we proposed a model with different λ rates, one for

B. distachyon and the branch leading to Oryza, one for O. brachyantha, and one for the remaining

branches. In the second round for NB-ARC, we had CAFE find two λ values, one for the O.

brachyantha branch and one for all other branches. All analyses were run in triplicate to ensure that

CAFE converged into a single optimum for the parameter λ.
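
Once ancestral family sizes are available, expansions and contractions follow from the difference between each node and its parent. The sketch below shows this bookkeeping for the F-box family, using the extant sizes of Supplemental Table 1 and ancestral sizes as they appear to read from the F-box panel of Supplemental Figure 2 (an interpretation of the figure, not CAFE output). It does not reimplement CAFE's likelihood model, and the names `PARENT`, `FBOX_SIZES` and `branch_changes` are illustrative.

```python
# Child -> parent relationships of the tree ((((Oi,Oj)n1,Og)n2,Ob)n3,Bd)n4.
PARENT = {
    "Oj": "n1", "Oi": "n1", "n1": "n2", "Og": "n2",
    "n2": "n3", "Ob": "n3", "n3": "n4", "Bd": "n4",
}

# F-box sizes: extant species from Supplemental Table 1, ancestral nodes
# as read from Supplemental Figure 2 (assumed assignment of values to nodes).
FBOX_SIZES = {
    "Oj": 735, "Oi": 767, "Og": 649, "Ob": 249, "Bd": 535,
    "n1": 731, "n2": 701, "n3": 468, "n4": 495,
}


def branch_changes(sizes, parent):
    """Return the gain (+) or loss (-) of genes along each branch."""
    return {child: sizes[child] - sizes[par] for child, par in parent.items()}


for node, change in branch_changes(FBOX_SIZES, PARENT).items():
    label = "expansion" if change > 0 else "contraction" if change < 0 else "no change"
    print(f"{node:3s} {change:+5d} {label}")
```

Read this way, the largest F-box changes are the gain of 233 genes on the branch leading to n2 and the loss of 219 genes on the O. brachyantha branch.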

Composition of tandem duplication and presence in segmental duplications

The tandem arrays were identified with DAGchainer (Haas et al., 2004). To create a list of putative

paralogous pairs, an all-versus-all BLAST search was run on the protein dataset for each family and

each species (E-value<1e-30). Arrays with at least 2 copies (-A 1) and a maximum distance between

two matches of 20 kb were marked as putative tandem arrays. To determine if any of the genes in our

gene families were part of segmental duplications, segmental duplication blocks were constructed for

the four genomes using a homology search at the DNA level. To define putative paralogous pairs, an

all-versus-all BLAST search was run on the whole CDS dataset for each species (E-value<1e-10). The

results were sorted according to the CIP (cumulative identity percentage) and CALP (cumulative

alignment length percentage) parameters of Salse et al. (2008), both defined at 70%. DAGchainer was

run with a minimum of 5 aligned pairs in one block, and a maximum of 1 Mb allowed between two

matches. If two blocks on the same chromosomes were less than 1 Mb apart (on both chromosomes),

they were concatenated. The lists of genes in our families were then compared to the list of collinear

genes in each segmental duplication block. Local recombination rates (cM/Mb) were recovered every

100 kb along each Oj chromosome using MareyMap (Rezvoy et al., 2007), by comparing rice genetic

(Muyle et al., 2011; www.rgp.dna.affrc.go.jp) and physical maps (Marey's map). Cubic splines were

chosen as the interpolation method. The mean recombination rates for 1 Mb-intervals were computed

and compared with the number of japonica genes from each family in the same interval. The

correlations were then tested using a Poisson regression analysis (Supplemental Table 6).
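
As an illustration of the last step, a Poisson regression of gene counts per interval on recombination rate can be fitted with a few lines of Newton-Raphson. This is a minimal sketch with a single predictor and made-up data; it is not the software used in the study, and `fit_poisson` is a hypothetical helper.

```python
import math


def fit_poisson(x, y, iterations=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson.

    Returns (b0, b1, se_b1), where se_b1 is the standard error of the
    slope taken from the inverse Fisher information at the last iteration.
    """
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2 x 2 Fisher information matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


# Made-up example: recombination rate (cM/Mb) and gene counts in 1 Mb intervals.
rates = [0.0, 1.0, 2.0, 3.0]
counts = [1, 2, 4, 8]
b0, b1, se = fit_poisson(rates, counts)
print(f"coefficient = {b1:.3f}, exp(coefficient) = {math.exp(b1):.2f}, se = {se:.3f}")
```

The slope plays the role of the "Coefficient" column of Supplemental Table 6, and its exponential that of "Exp(Coeff)".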

Selective pressure analysis

For each gene family investigated, we first performed a filtering step whereby all genes that contained

>5% N’s in their coding sequences, or that lacked an initiator methionine, were removed, as they

can inflate the false-positive detection of positive selection. Codon-based CDS alignments for each

individual OG were created from protein alignments (MUSCLE) with the BioPerl module bp_mrtrans

(Jason Stajich). ML trees were constructed with RAxML using the GTRGAMMA model and default

settings. The basic codon substitution model of Goldman and Yang (1994) in Codeml (Yang, 2007) was

used to compute pairwise ω ratios between orthologs. The branch-site codon substitution model (Yang

and Nielsen, 2002) and the branch-site test of positive selection (Zhang et al., 2005) were used to look

at the selective pressure on the orthologs in the Oj-Oi-Og or Oj-Oi groups compared to their paralogs.

For each ortholog in all Oj-Oi-Og and Oj-Oi groups, a BLAST (E-value<1e-5) analysis on all the

family proteins was performed, and the twenty top high-scoring segment pairs (HSPs) were retained,

and filtered with 50% CIP and 70% CALP. Protein and CDS alignments were constructed for each

expansion group with the initial orthologs and their respective hits, but then excluded if they were

composed of 3 or fewer sequences. The CDS sequences were aligned and used to construct trees as

described above. In the tree file the initial ortholog sequences were specified as foreground branches. If

inparalogs were found during the similarity search, they were also marked as foreground branches.

Alignments composed only of foreground branches were also excluded. The branch-site modified

model A (Yang et al., 2005; Zhang et al., 2005) was run in Codeml first. Three ω ratios were calculated

for the 4 site classes assumed by the model (Table 3). p2a and p2b are the proportions of 2a and 2b

sites, for which codons experienced positive selection (ω2 >1) in the foreground branches. The

percentages of OGs with different profiles of predominant site classes are summarized in Figure 7 for

each type of expanded group and each family. The branch-site likelihood ratio test (LRT) was applied

for alignments with p2a and/or p2b> 0 and at least 1 site under positive selection in one of the

foreground branches. The LRT compared model A with a null model that does not allow positive

selection on the foreground lineages (ω2 is fixed to 1). P-values were computed from LRT statistics

with df = 1. Multiple testing corrections were executed by computing q-values using QVALUE (Storey

et al., 2004). Each individual test was considered significant at the 5% level, only if the q-value ≤ 0.05.

The BEB (Bayes Empirical Bayes) method (Yang et al., 2005) gave posterior probabilities that each

site was from a particular class. For alignments with a significant test in favor of model A, only the sites

with a high posterior probability (P > 90%) of belonging to the 2a or 2b classes were selected. Finally,

each putatively positively selected site was manually checked at both the protein and DNA levels to

remove false positives due to misalignments.
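
The likelihood ratio test above compares twice the log-likelihood difference between model A and the null model with a chi-square distribution with df = 1. For one degree of freedom the upper-tail probability has a closed form, P = erfc(sqrt(x/2)), which the sketch below uses. The log-likelihood values are invented for the example and `branch_site_lrt` is a hypothetical helper; the q-value correction is not reproduced here.

```python
import math


def branch_site_lrt(lnl_null: float, lnl_alt: float) -> tuple[float, float]:
    """Return (LRT statistic, P-value) for a chi-square test with df = 1."""
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Upper tail of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Invented log-likelihoods for the null model (ω2 fixed to 1) and model A.
stat, p = branch_site_lrt(lnl_null=-1000.0, lnl_alt=-998.0)
print(f"2ΔlnL = {stat:.2f}, P = {p:.4f}")
```

A statistic of about 3.84 corresponds to P = 0.05 under this distribution.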

All sites that appeared to be under positive selection in the three AA genome species were validated

using Illumina HiSeq resequencing data generated at AGI (3 O. sativa ssp. japonica, 3 O. sativa ssp.

indica, 5 O. glaberrima accessions: Wing et al. unpublished), or obtained from Dr. Wen Wang (2

accessions for each of the 2 O. sativa subspecies; Xu et al., 2012b) (Supplemental Table 10). BWA (Li and

Durbin, 2009) and SAMtools (Li et al., 2009) were used for read alignments which were visualized

using Tablet (Milne et al., 2010). A polymorphism was considered validated if 5 or more reads from a

single accession confirmed the presence of the positively selected site.
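
The validation rule can be written as a simple threshold on per-accession read support. In the sketch below the read counts are invented, the accession labels follow the run names used for the resequencing data, and `validated_accessions` and `is_validated` are hypothetical names.

```python
def validated_accessions(read_support: dict[str, int], min_reads: int = 5) -> list[str]:
    """Return the accessions with at least `min_reads` reads carrying the variant."""
    return [acc for acc, n_reads in read_support.items() if n_reads >= min_reads]


def is_validated(read_support: dict[str, int], min_reads: int = 5) -> bool:
    """A polymorphism is validated if a single accession reaches the threshold."""
    return bool(validated_accessions(read_support, min_reads))


# Invented read counts supporting one candidate site in three accessions.
support = {"Og_101049": 7, "Og_67563": 2, "Og_100854": 0}
print(is_validated(support), validated_accessions(support))
```

Note that reads are not pooled across accessions: three accessions with four supporting reads each would not validate the site under this rule.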

Expression analysis

Genevestigator (Hruz et al., 2008) is a gene expression database in which public microarray data are

collected, manually curated and normalized. At the time of this analysis, it contained data for 607

samples (1275 including replicates) from 63 whole genome 51K array (Affymetrix GeneChips)

experiments for the IRGSP rice reference sequence (Oryza sativa ssp. japonica cv. Nipponbare). This

database was queried for gene expression levels of Oj genes in expanded and non-expanded OGs,

across the entire set of experimental treatments. After identifying the probe set ID corresponding to the

specified genes, Genevestigator displayed signal intensities from Affymetrix probe sets. Only probes

that targeted a single gene were selected. If a gene was targeted by several probe sets, only one was

selected. Signal intensities were averaged over 1275 samples for each gene.
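
The per-gene averaging and the per-class summary statistics (mean, standard error, median) can be sketched as follows. The intensities are invented, only three samples per gene are shown instead of 1275, and `gene_means` and `summarize` are hypothetical helpers rather than Genevestigator functions.

```python
import math
import statistics


def gene_means(signals: dict[str, list[float]]) -> dict[str, float]:
    """Average the signal intensity of each gene over all samples."""
    return {gene: statistics.mean(values) for gene, values in signals.items()}


def summarize(intensities: list[float]):
    """Return (mean, standard error, median); the standard error is None for n < 2."""
    n = len(intensities)
    mean = statistics.mean(intensities)
    median = statistics.median(intensities)
    se = statistics.stdev(intensities) / math.sqrt(n) if n > 1 else None
    return mean, se, median


# Invented signal intensities for four Oj genes of one class of OGs.
signals = {
    "gene1": [900.0, 1000.0, 1100.0],
    "gene2": [1900.0, 2000.0, 2100.0],
    "gene3": [2900.0, 3000.0, 3100.0],
    "gene4": [3900.0, 4000.0, 4100.0],
}
per_gene = gene_means(signals)
print(summarize(list(per_gene.values())))
```

A class containing a single gene yields no standard error, which would be consistent with the "/" entries reported for GRX in the Oj-Oi class.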

References

Darriba, D., Taboada, G.L., Doallo, R., and Posada, D. (2011). ProtTest 3: fast selection of best-fit

models of protein evolution. Bioinformatics. 27, 1164-1165.

De Bie, T., Cristianini, N., Demuth, J.P., and Hahn, M.W. (2006). CAFE: a computational tool for

the study of gene family evolution. Bioinformatics. 22, 1269-1271.

Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Nucleic Acids Res. 32(5), 1792-1797.

Goldman, N., and Yang, Z. (1994). A codon-based model of nucleotide substitution for protein-coding

DNA sequences. Mol. Biol. Evol. 11, 725-736.

Gremme, G., Brendel, V., Sparks, M.E., and Kurtz, S. (2005). Engineering a software tool for gene

structure prediction in higher organisms. Inform Software Tech. 47(15), 965-978.

Haas, B.J., Delcher, A.L., Wortman, J.R., and Salzberg, S.L. (2004). DAGchainer: a tool for mining

segmental genome duplications and synteny. Bioinformatics. 20(18), 3643-6.

Hruz, T., Laule, O., Szabo, G., Wessendorp, F., Bleuler, S., Oertle, L., Widmayer, P., Gruissem,

W., and Zimmermann, P. (2008). Genevestigator V3: a reference expression database for the meta-

analysis of transcriptomes. Adv. Bioinformatics. 2008, 420747.

Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J., and

Marra, M.A. (2009). Circos: an Information Aesthetic for Comparative Genomics. Genome Res. 19,

1639-1645.

Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H.,

Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., and Higgins, D.G.

(2007). Clustal W and Clustal X version 2.0. Bioinformatics. 23, 2947-2948.

Li, L., Stoeckert, C.J. Jr., and Roos, D.S. (2003). OrthoMCL: Identification of Ortholog Groups for

Eukaryotic Genomes. Genome Res. 13, 2178-2189.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G.,

Durbin, R., and 1000 Genome Project Data Processing Subgroup. (2009). The sequence

alignment/map format and SAMtools. Bioinformatics. 25, 2078-2079.

Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler

transform. Bioinformatics. 25, 1754-1760.

Milne, I., Bayer, M., Cardle, L., Shaw, P., Stephen, G., Wright, F., and Marshall, D. (2010). Tablet

- next generation sequence assembly visualization. Bioinformatics. 26(3), 401-402.

Muyle, A., Serres-Giardi, L., Ressayre, A., Escobar, J., and Glemin, S. (2011). GC-Biased Gene

conversion and Selection affect GC content in the Oryza genus (rice). Mol. Biol. Evol. 28(9),

2695-2706.

Punta, M., Coggill, P.C., Eberhardt, R.Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund,

K., Ceric, G., Clements, J., Heger, A., Holm, L., Sonnhammer, E.L.L., Eddy, S.R., Bateman, A.,

and Finn, R.D. (2012). The Pfam protein families database. Nucleic Acids Res. 40, D290-D301.

Rezvoy, C., Charif, D., Gueguen, L., and Marais, G.A. (2007). MareyMap: an R-based tool with

graphical interface for estimating recombination rates. Bioinformatics. 23, 2188-2189.

Salse, J., Bolot, S., Throude, M., Jouffe, V., Piegu, B., Quraishi, U.M., Calcagno, T., Cooke, R.,

Delseny, M., and Feuillet, C. (2008). Identification and characterization of shared duplications

between rice and wheat provide new insights into grass genome evolution. Plant Cell. 20, 11-24.

Sanderson, M.J. (1997). A nonparametric approach to estimating divergence times in the absence of

rate constancy. Mol. Biol. Evol. 14, 1218-1231.

Stamatakis, A. (2006). RAxML-VI-HPC: Maximum Likelihood-based Phylogenetic Analyses with

Thousands of Taxa and Mixed Models. Bioinformatics. 22(21), 2688-2690.

Storey, J.D., Taylor, J.E., and Siegmund, D. (2004). Strong control, conservative point estimation,

and simultaneous conservative consistency of false discovery rates: a unified approach. J. R. Stat. Soc.

Series B 66, 187-205.

Tang, L., Zou, X., Achoundong, G., Potgieter, C., Second, G., Zhang, D., and Ge, S. (2010).

Phylogeny and biogeography of the rice tribe (Oryzeae): evidence from combined analysis of 20

chloroplast fragments. Mol. Phylogenet. Evol. 54, 266-277.

Xu, X., Liu, X., Ge, S., Jensen, J.D., Hu, F., Li, X., Dong, Y., Gutenkunst, R.N., Fang, L., Huang,

L., Li, J., He, W., Zhang, G., Zheng, X., Zhang, F., Li, Y., Yu, C., Kristiansen, K., Zhang, X.,

Wang, J., Wright, M., McCouch, S., Nielsen, R., Wang, J., and Wang, W. (2012b). Resequencing

50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes.

Nat. Biotechnol. 30(1), 105-114.

Yang, Z., and Nielsen, R. (2002). Codon-substitution models for detecting molecular adaptation at

individual sites along specific lineages. Mol. Biol. Evol. 19, 908-917.

Yang, Z., Wong, W.S.W., and Nielsen, R. (2005). Bayes Empirical Bayes Inference of Amino Acids

Sites Under Positive selection. Mol. Biol. Evol. 22(4), 1107-1118.

Yang, Z. (2007). PAML 4: a program package for phylogenetic analysis by maximum likelihood. Mol.

Biol. Evol. 24, 1586-1591.

Zhang, J., Nielsen, R., and Yang, Z. (2005). Evaluation of an improved Branch-Site Likelihood

Method for Detecting Positive Selection at the Molecular level. Mol. Biol. Evol. 22(12), 2472-2479.

Zhu, Q., and Ge, S. (2005). Phylogenetic relationships among A-genome species of the genus Oryza

revealed by intron sequences of four nuclear genes. New Phytol. 167, 249-265.
