Mouse Genomes Poster - Genetics 2010



Whole genome sequencing and analysis of 17 laboratory and wild-derived mouse strains

The Mouse Genomes Project

Thomas M. Keane1, Jim Stalker1, Binnaz Yalchin5, Martin Goodson5, Petr Danecek1, Sendu Bala1, Kim Wong1, Guy Slater1, Avigail Agam5, Ian Jackson2, Laura Reinholdt3, Leah Rae Donahue3, Steve Brown4, Andreas Heger5, Chris Ponting5, Ewan Birney6, Allan Bradley1, Richard Durbin1, Jonathan Flint5, David J. Adams1

1 The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK.
2 MRC-HGU, Edinburgh, UK.
3 The Jackson Laboratory, Bar Harbor, Maine, USA.
4 MRC-Harwell, Oxford, UK.
5 Wellcome Trust Centre for Human Genetics, Oxford, UK.
6 European Bioinformatics Institute, Hinxton, Cambridgeshire, UK.

http://www.sanger.ac.uk/mousegenomes

Contact: [email protected]

Inbred strains

NOD/ShiLtJ: This strain is a polygenic model for type 1 (non-obese) diabetes. Progenitor strain of the collaborative cross.

A/J: An asthma model. Progenitor strain of the collaborative cross and of the heterogeneous stock (HS) cross.

BALB/cJ: Prone to develop mammary and kidney cancer. Progenitor strain of the collaborative cross and HS cross.

CBA/J: Renal tubulointerstitial lesions observed at a high frequency. Prone to exocrine pancreatic insufficiency syndrome. Progenitor strain of the HS cross.

C3H/HeJ: Spontaneously develops mammary tumours. Highly susceptible to Gram-negative bacterial infections. Progenitor strain of the HS cross.

DBA/2J: Develops aggressive early hearing loss. Extreme intolerance to alcohol and morphine. Aging DBA/2J mice develop progressive eye abnormalities.

CAST/EiJ: Resistant to cancer and infections.

AKR/J: High leukemia incidence. Hyporesponsive to diets containing high levels of fat and cholesterol, and resistant to aortic lesion formation. Progenitor strain of the HS cross.

LP/J: High susceptibility to audiogenic seizures. This strain is also reported to have a fairly high incidence of tumors that develop later in life. Progenitor strain of the HS cross.

129S5/SvEvBrd: Commonly used to make embryonic stem cell lines.

129P2/OlaHsd: Commonly used to make embryonic stem cell lines.

SPRET/EiJ: Resistant to cancer and infections.

C57BL/6NJ: Used in the KOMP and EUCOMM programmes to knock out every gene in the mouse genome.

PWK/PhJ: Susceptibility to type I diabetes and various behavioral traits. Progenitor strain of the collaborative cross.

NZO/HlLtJ: New Zealand Obese. Susceptible to type II diabetes. Progenitor strain of the collaborative cross.

WSB/EiJ: Displays extremely long life-span. Progenitor strain of the collaborative cross.

129S1/SvImJ: Commonly used to make embryonic stem cell lines. Progenitor strain of the collaborative cross.

Introduction

The Mouse Genomes Project is currently sequencing the genomes of 17 inbred mouse strains with the aim of generating a complete map of the nucleotide and structural variation, and ultimately a de novo genome assembly, of each strain.

The ability to manipulate the mouse genome, combined with the wealth of disease models and genetic studies available, makes the mouse the premier model organism for genetic approaches to mammalian biology. The clonal nature of inbred mouse strains means that sequence information for a strain can be directly applied to all experiments: past, present and future. Access to complete sequence of multiple inbred strains can therefore provide a permanent foundation for a systems biology approach to phenotypic variation in the mouse.

Short Indels

Calling Strategy

We used three approaches to call short indels: calling directly from the BAM files with Samtools, local realignment around potential indels with Dindel, and alignment of de novo assembled contigs. We are currently finalising the short indel callset.

SNPs

Sequencing and Alignment

Sequencing

The sequencing for the project was carried out at the Wellcome Trust Sanger Institute in 2009. All of the sequencing was done on the Illumina GAII platform with 54, 76, and 108 bp reads and insert sizes of 200-600 bp. All of the strains have been sequenced to >20x coverage (fig. 1).

Alignment

The reads were aligned to the mm9 reference using MAQ. The alignments are stored in the BAM alignment format. Each lane is aligned individually and merged to the sequencing library level; PCR duplicates are then removed, and the libraries are merged to produce a single BAM file per strain. The BAM files are available from the project FTP site.
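The lane, library, and strain merging order described above can be illustrated with a small sketch. This is not the project's pipeline: the `ReadPair` record, the function names, and the rule of keeping the highest mapping quality pair at each position are assumptions made for illustration, and a real workflow would operate on BAM files rather than in-memory tuples.

```python
from collections import namedtuple

# Hypothetical minimal record for an aligned read pair (not a real BAM record).
ReadPair = namedtuple("ReadPair", "name library chrom start end strand mapq")


def merge_lanes(lanes):
    """Group the read pairs from all lanes by their sequencing library."""
    libraries = {}
    for lane in lanes:
        for pair in lane:
            libraries.setdefault(pair.library, []).append(pair)
    return libraries


def remove_duplicates(pairs):
    """Keep one pair per (chrom, start, end, strand) within a library.

    Assumption: pairs with identical outer coordinates in the same library
    are treated as PCR duplicates, and the highest mapping quality is kept.
    """
    best = {}
    for pair in pairs:
        key = (pair.chrom, pair.start, pair.end, pair.strand)
        if key not in best or pair.mapq > best[key].mapq:
            best[key] = pair
    return list(best.values())


def build_strain_alignment(lanes):
    """Merge lanes to libraries, de-duplicate each library, then merge all."""
    libraries = merge_lanes(lanes)
    merged = []
    for library in sorted(libraries):
        merged.extend(remove_duplicates(libraries[library]))
    return sorted(merged, key=lambda p: (p.chrom, p.start, p.end))


lane1 = [ReadPair("r1", "libA", "chr1", 100, 350, "+", 30),
         ReadPair("r2", "libA", "chr1", 100, 350, "+", 40)]
lane2 = [ReadPair("r3", "libA", "chr1", 100, 350, "+", 20),
         ReadPair("r4", "libB", "chr1", 100, 350, "+", 10)]
strain = build_strain_alignment([lane1, lane2])
print([p.name for p in strain])
```

Duplicates are removed per library rather than per strain because two pairs at the same coordinates in different libraries come from independent preparations and need not be PCR copies of each other.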


The Mouse Genomes Project has generated 65M SNPs (currently dbSNP has 10M mouse SNPs).

SNP Calling

We called SNPs from the BAM files using four SNP callers: Samtools, GATK, QCALL, and our own local-realignment-based approach. To create the final list of SNPs, we merged the four callsets in several different ways (fig. 2) and then compared the resulting SNPs against 10 Mbp of manually finished sequence from the NOD/ShiLtJ strain.
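As a rough illustration of merging callsets and scoring them against finished sequence, the sketch below keeps sites reported by at least a chosen number of callers and computes error rates against a truth set. The poster does not state which merging rules were tried, so the `min_callers` threshold, the function names, and the definition of the false positive rate (fraction of calls absent from the truth set) are assumptions; the sites are toy data.

```python
from collections import Counter


def merge_callsets(callsets, min_callers):
    """Return the sites reported by at least `min_callers` of the callers.

    `callsets` maps a caller name to a set of (chrom, pos, alt) sites.
    min_callers=1 gives the union; min_callers=len(callsets) the intersection.
    """
    counts = Counter()
    for calls in callsets.values():
        counts.update(set(calls))
    return {site for site, n in counts.items() if n >= min_callers}


def error_rates(called, truth):
    """Return (false positive rate, false negative rate).

    Assumption: both sets are already restricted to the finished region.
    """
    false_pos = called - truth
    false_neg = truth - called
    fp_rate = len(false_pos) / len(called) if called else 0.0
    fn_rate = len(false_neg) / len(truth) if truth else 0.0
    return fp_rate, fn_rate


a, b, c, d = [("chr1", pos, "T") for pos in (100, 200, 300, 400)]
callsets = {
    "samtools": {a, b, c},
    "gatk": {a, b},
    "qcall": {a, d},
    "realign": {a, b, c},
}
truth = {a, b, c}

for k in (1, 2, 4):
    print(k, error_rates(merge_callsets(callsets, k), truth))
```

Raising the threshold trades false positives for false negatives, which is the kind of trade-off a comparison of merging strategies would need to weigh.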

Endogenous Retroviral Elements

It has been estimated that endogenous retroviral elements (ERVs) are a significant source (~10%) of spontaneous germline mutations among laboratory mouse strains. Two high-copy families of ERVs in particular, IAP and ETns, have been found to be responsible for the vast majority of these mutations. We have catalogued the full repertoire of ERV insertions across the set of 17 strains.


Fig 1: Sequencing and mapped coverage over the 17 strains

Fig 2: False positive/negative rates in the SNP calls based on 10Mbp of NOD/ShiLtJ

Fig 3: Total and private SNP numbers across the strains

Fig 4: Total and private number of IAP insertions

Structural Variation

We call structural variations (SVs) from the data by looking for patterns in the alignment of the read pairs that deviate from the expected fragment size distribution. We use a combination of SV programs, including Breakdancer, Pindel, CND, and single-end read clustering, to call the full range of SVs.

The automated caller runs a local assembly step as an initial computational validation of the SV calls.

To validate our calls, we manually annotated chr19 for SVs and then compared the manual annotations against the calls from our automated caller. We have also validated a subset of each type of SV by PCR.
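The read-pair signal behind this kind of SV calling can be sketched as follows. This is a simplified illustration, not the logic of Breakdancer or any other tool named above: the median and MAD based cut-offs, the `n_mad` parameter, and the use of the distance between the two leftmost mapped positions as a proxy for fragment size are all assumptions.

```python
import statistics


def insert_size_bounds(insert_sizes, n_mad=5.0):
    """Estimate the expected fragment size range as median +/- n_mad * MAD."""
    med = statistics.median(insert_sizes)
    mad = statistics.median([abs(x - med) for x in insert_sizes])
    return med - n_mad * mad, med + n_mad * mad


def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2, low, high):
    """Label one read pair by the SV type its alignment pattern suggests.

    Assumes standard paired-end orientation: the left read maps to '+'
    and the right read to '-' when the pair is concordant.
    """
    if chrom1 != chrom2:
        return "translocation"
    if strand1 == strand2:
        return "inversion"
    # Order the two reads so that the first is the leftmost on the reference.
    if pos1 > pos2:
        pos1, strand1, pos2, strand2 = pos2, strand2, pos1, strand1
    if strand1 == "-" and strand2 == "+":
        return "tandem_duplication"  # reads point away from each other
    span = pos2 - pos1
    if span > high:
        return "deletion"  # pair maps further apart than expected
    if span < low:
        return "insertion"  # pair maps closer together than expected
    return "concordant"


sizes = [290, 300, 310, 295, 305, 300, 298, 302]
low, high = insert_size_bounds(sizes)
print(low, high)
print(classify_pair("chr1", 1000, "+", "chr1", 6000, "-", low, high))
```

A caller built on this idea would cluster many discordant pairs supporting the same event before reporting an SV, since a single discordant pair is weak evidence on its own.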

[Figure: per-strain counts, Total and Shared (axis 0 to 300,000), for strains 129P2, 129S1_SvImJ, 129S5, A_J, AKR_J, BALBc_J, C3H_HeJ, C57BL_6N, CBA_J, DBA_2J, LP_J, NOD, NZO, CAST_Ei, PWK_Ph, Spretus_Ei, WSB_Ei]

Strain Distribution Patterns

The vast majority of SVs are shared between the strains owing to the common origins of the classical laboratory strains.
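The total, shared, and private counts reported per strain can be derived from a simple presence table. The sketch below is a minimal illustration on toy data; the function names and the integer variant identifiers are hypothetical, and "private" is taken to mean a variant seen in exactly one strain.

```python
from collections import Counter


def distribution_patterns(variants_by_strain):
    """Map each variant to the sorted tuple of strains that carry it."""
    patterns = {}
    for strain, sites in variants_by_strain.items():
        for site in set(sites):
            patterns.setdefault(site, []).append(strain)
    return {site: tuple(sorted(strains)) for site, strains in patterns.items()}


def count_total_and_private(variants_by_strain):
    """Count total, private (one strain only), and shared variants per strain."""
    site_counts = Counter()
    for sites in variants_by_strain.values():
        site_counts.update(set(sites))
    summary = {}
    for strain, sites in variants_by_strain.items():
        sites = set(sites)
        private = sum(1 for site in sites if site_counts[site] == 1)
        summary[strain] = {
            "total": len(sites),
            "private": private,
            "shared": len(sites) - private,
        }
    return summary


# Toy data: integers stand in for variant identifiers.
toy = {"A_J": {1, 2, 3}, "CAST_Ei": {2, 3, 4, 5}, "LP_J": {3}}
print(count_total_and_private(toy))
print(distribution_patterns(toy)[3])
```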

From Variation to Function

All of the variants called are being compared against the known QTLs and the ongoing mouse knockout projects such as KOMP/EUCOMM in order to determine potential functional consequences.
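One simple way to relate a large deletion to genes is an interval comparison that separates genes lying entirely inside the deletion from genes only partly overlapped by it. The sketch below assumes 1-based inclusive coordinates; the function name, the gene names, and the coordinates are hypothetical and are not taken from the project's annotation.

```python
def classify_gene_hits(deletion, genes):
    """Label each gene overlapping a deletion.

    `deletion` is (chrom, start, end); `genes` maps a gene name to
    (chrom, start, end). Coordinates are 1-based and inclusive.
    Genes that do not overlap the deletion are omitted from the result.
    """
    del_chrom, del_start, del_end = deletion
    hits = {}
    for name, (chrom, start, end) in genes.items():
        if chrom != del_chrom or end < del_start or start > del_end:
            continue  # different chromosome or no overlap
        if del_start <= start and end <= del_end:
            hits[name] = "completely_deleted"
        else:
            hits[name] = "partially_disrupted"
    return hits


# Hypothetical deletion and gene coordinates for illustration only.
deletion = ("chr19", 1000, 5000)
genes = {
    "GeneA": ("chr19", 1500, 2500),
    "GeneB": ("chr19", 4500, 6000),
    "GeneC": ("chr19", 7000, 8000),
    "GeneD": ("chr1", 1500, 2500),
}
print(classify_gene_hits(deletion, genes))
```

Genes flagged as completely deleted are natural candidates to compare with the corresponding engineered knockouts, while partially disrupted genes need a closer look at which exons are affected.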

Fig 5: Visualisation of a large deletion

Fig 6: Visualisation of a large inversion

Fig 7: Corresponding knockout of genes completely deleted

Fig 8: Number of genes disrupted by a large deletion