Wiggans ARS Big Data Workshop – July 16, 2015 (1)

George R. Wiggans
Animal Genomics and Improvement Laboratory
Agricultural Research Service, USDA
Beltsville, MD 20705-2350, USA
[email protected]

Big data in support of genetic improvement of dairy cattle


Mission

Genetic improvement of dairy cattle for economically important traits:

  Yield (milk, fat, and protein)
  Conformation (overall and individual traits)
  Longevity (productive life)
  Fertility (conception and pregnancy rates)
  Calving (dystocia and stillbirth)
  Disease resistance (mastitis)

Data types

Identification information for animal, sire, and dam:

  Name
  ID number
  Birth date
  Breed
  Herd
  Country

Animal genotypes from marker panels that range from 2,900 to 777,962 markers

Courtesy of Illumina, Inc.

Data types (continued)

Records for milk yield, fat percentage, protein percentage, and somatic cell count (1/month)

Appraiser-assigned scores for 16 body and udder characteristics related to conformation (e.g., stature)

Breeding records that include indicator for conception success

Calving difficulty scores and stillbirth occurrences

Data amounts

Pedigree records: 71,974,045

Animal genotypes: 1,035,590

Lactation records (since 1960): 132,629,200

Daily yield records (since 1990): 641,864,015

Reproduction event records: 176,559,035

Calving difficulty scores: 29,528,607

Stillbirth scores: 19,567,198

Computing environment

Computation server:

  2.27 GHz CPU (32 cores, 64 threads)
  660 GB RAM
  2.7 TB local storage

Database server:

  3.4 GHz CPU (12 cores, 24 threads)
  264 GB RAM
  1.3 TB local storage

Shared storage: 38 TB

Data management

Variable length segments for database rows to minimize space and overhead in identifying data

All marker genotypes for an animal stored in a character large object (CLOB), one byte per genotype

All breedings and monthly milk yield and component information for a cow’s lactation stored in variable character data types
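
As a rough illustration of the one-byte-per-genotype layout, the sketch below packs an animal's genotypes into a single character string and reads one marker back by position. The genotype coding (0, 1, 2 for allele counts, 5 for a missing call) and the function names are assumptions made for this example, not the database's actual schema.

```python
# Hypothetical genotype codes: 0, 1, 2 = copies of one allele; 5 = missing call.
VALID_CODES = {0, 1, 2, 5}

def pack_genotypes(codes):
    """Pack a list of genotype codes into one string, one character (byte) per marker."""
    for c in codes:
        if c not in VALID_CODES:
            raise ValueError(f"unexpected genotype code: {c}")
    return "".join(str(c) for c in codes)

def genotype_at(packed, marker_index):
    """Read one marker's genotype by position, without parsing the rest."""
    return int(packed[marker_index])

packed = pack_genotypes([1, 0, 2, 5, 1])
print(packed, len(packed.encode("ascii")))
```

Because every genotype occupies exactly one byte, a marker's position in the string is its index, and the storage for one animal equals the number of markers on its panel.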

Programming languages

C: database interface, including data editing

FORTRAN: calculation of genetic merit estimates

SAS: data preparation, checking, and delivery

Calculation schedule

Triannual genetic merit estimates from processed phenotypic data

Monthly genomic evaluations based on estimates of marker effects using genotypic data and triannual phenotype-based evaluations

Weekly evaluations using marker effect estimates from monthly evaluations

[Calendar graphic: APR, AUG, DEC]
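
The weekly step reuses marker effects rather than re-estimating them. A minimal sketch of that idea follows, assuming a simple additive model in which a prediction is a mean plus the sum of allele counts times marker effects. The model form, the function name, and all numbers are illustrative assumptions, not the production code.

```python
def genomic_prediction(genotypes, marker_effects, mean=0.0):
    """Sum allele counts (0, 1, 2) times previously estimated marker effects."""
    if len(genotypes) != len(marker_effects):
        raise ValueError("genotype and effect vectors must have equal length")
    return mean + sum(g * e for g, e in zip(genotypes, marker_effects))

# Made-up effects standing in for the most recent monthly estimates.
monthly_effects = [0.5, -0.25, 1.0, 0.0]

# A newly genotyped animal is scored in the weekly run without re-estimating effects.
new_animal = [2, 1, 0, 1]
print(genomic_prediction(new_animal, monthly_effects))
```

Scoring an animal this way is a single pass over its markers, which is what makes a weekly turnaround cheap compared with re-estimating the effects themselves.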

Transition to industry

Council on Dairy Cattle Breeding:

  Database maintenance
  Calculation and distribution of genetic merit estimates
  Interface with evaluation users and data suppliers

ARS:

  Research and development using data made available by Council

Research resource

Massive amount of genomic data

  Location of causal genetic variants

Investigation of haplotypes never found in a homozygous state

  Discovery of chromosomal abnormalities resulting in early embryonic death

Investigation of sons of heterozygous sires

  Specific markers associated with differences between sons by haplotype
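
The search for haplotypes never found in a homozygous state can be sketched as a comparison of observed with expected homozygote counts. The example below assumes random mating, so that the expected number of homozygotes is the number of animals times the squared haplotype frequency. The function name, the threshold, and the toy data are hypothetical, and a real analysis would use pedigree information rather than this simplification.

```python
from collections import Counter

def missing_homozygotes(animal_haplotypes, min_expected=5.0):
    """Flag haplotypes with zero observed homozygotes but many expected.

    animal_haplotypes: list of (hap_a, hap_b) pairs, one pair per animal.
    Expected homozygotes assume random mating: n_animals * freq**2.
    """
    n = len(animal_haplotypes)
    copies = Counter()
    homozygotes = Counter()
    for a, b in animal_haplotypes:
        copies[a] += 1
        copies[b] += 1
        if a == b:
            homozygotes[a] += 1
    flagged = {}
    for hap, count in copies.items():
        freq = count / (2 * n)
        expected = n * freq ** 2
        if homozygotes[hap] == 0 and expected >= min_expected:
            flagged[hap] = expected
    return flagged

# Toy data: haplotype "B" is common but never appears in two copies.
toy = [("A", "B")] * 50 + [("A", "A")] * 50
print(missing_homozygotes(toy))
```

A haplotype flagged this way is only a candidate; the shortage of homozygotes still has to be confirmed against fertility or embryonic loss records.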

Genetic merit of marketed Holstein bulls

[Figure: average net merit ($) of marketed Holstein bulls by year entered AI (00 to 14); vertical axis from -300 to 600. Annotated average gains: $19.42/year, $47.95/year, and $87.49/year.]

Working with sequence data

Sequence available from 1000 Bull Genomes Project hosted in Australia

Project funded by industry to sequence over 200 bulls to create a haplotype library

A posteriori granddaughter design to locate chromosomal segments of interest from 71 bulls each with over 100 genotyped and progeny-tested sons

Imputing sequence data

Haplotype library supports imputation

Genotypes from genotyping chips can be imputed to full sequence

Lower accuracy of sequence data compared with chip genotypes accommodated by dealing in dosages to represent allele content

Findhap v4 (VanRaden) is fast and more accurate than Beagle at low sequencing coverage (low ×)
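
One way to work in dosages rather than hard genotype calls is to weight each possible genotype by its likelihood given the reads at a site. The sketch below assumes a binomial read model with a fixed per-read error rate and a flat prior over the three genotypes. These modeling choices and the function name are assumptions for illustration; this is not Findhap's algorithm.

```python
from math import comb

def allele_dosage(ref_reads, alt_reads, error_rate=0.01):
    """Expected count (0 to 2) of the alternate allele given read counts."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads: a flat prior over 0, 1, 2 gives an expected dosage of 1
    # Probability that a single read shows the alternate allele, per genotype.
    p_alt = {0: error_rate, 1: 0.5, 2: 1.0 - error_rate}
    likelihood = {
        g: comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for g, p in p_alt.items()
    }
    total = sum(likelihood.values())
    return sum(g * lk for g, lk in likelihood.items()) / total

# Two reference reads leave a heterozygote plausible; eight nearly rule it out.
print(allele_dosage(2, 0), allele_dosage(8, 0))
```

At low depth the dosage stays well away from 0 or 2 because a heterozygote cannot be excluded, which is the uncertainty a hard genotype call would hide.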

Alignment of sequence data

Alignment – determining location of chromosomal segments provided by sequencer

Findmap – matches segment against library of haplotypes

Preserves low-frequency variants

Does not identify new variants

Uses a hash table to find variant enabling rapid processing
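
The hash-table idea can be sketched with a dictionary that maps fixed-length substrings of the library haplotypes to their locations, so that placing a segment is a single lookup rather than a scan. The index layout, the function names, and the toy sequences below are assumptions for illustration, not the actual Findmap implementation.

```python
def build_index(library, k):
    """Map every length-k substring of each library haplotype to its locations."""
    index = {}
    for hap_id, sequence in library.items():
        for pos in range(len(sequence) - k + 1):
            index.setdefault(sequence[pos:pos + k], []).append((hap_id, pos))
    return index

def find_segment(segment, index, k):
    """Return (haplotype id, position) pairs whose k-mer matches the segment start."""
    return index.get(segment[:k], [])

# Toy library of two haplotypes that differ at one base.
library = {"hap1": "ACGTACGGTA", "hap2": "ACGTTCGGTA"}
index = build_index(library, k=5)
print(find_segment("TTCGGTA", index, k=5))
```

A segment carrying a variant that is absent from the library simply returns no match, which is consistent with the approach not identifying new variants.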

Accuracy of Findhap vs. Beagle*

                  Sequence + HD               Imputed from HD
Program  Depth    Correct (%)  Correlation    Correct (%)  Correlation
Findhap  8×       98.7         0.981          95.0         0.926
         4×       95.8         0.939          93.1         0.897
         2×       91.3         0.879          89.2         0.837
Beagle   8×       99.0         0.984          97.1         0.956
         4×       95.0         0.918          78.2         0.582
         2×       79.5         0.602          63.5         0.100

*250 bulls had sequence + HD; 250 others were imputed from HD
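
Both accuracy measures in the table can be computed by comparing imputed genotypes with true genotypes. The sketch below assumes that "Correct" is the percentage of genotypes that match after rounding an imputed dosage to 0, 1, or 2, and that "Correlation" is the Pearson correlation between true and imputed values. Those definitions, the function names, and the toy data are assumptions, since the slide does not define them.

```python
from math import sqrt

def percent_correct(true_genotypes, imputed):
    """Percentage of markers whose imputed value, rounded to 0, 1, or 2, matches the truth."""
    matches = sum(1 for t, i in zip(true_genotypes, imputed) if round(i) == t)
    return 100.0 * matches / len(true_genotypes)

def correlation(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy data: six markers, one of which (the fifth) is imputed to the wrong genotype.
true_genotypes = [0, 1, 2, 1, 0, 2]
imputed_dosages = [0.1, 0.9, 1.8, 1.4, 0.7, 2.0]
print(percent_correct(true_genotypes, imputed_dosages))
print(correlation(true_genotypes, imputed_dosages))
```

The two measures can diverge: rounding hides how far off a dosage was, while the correlation penalizes every deviation.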

Data storage and backup

Disk storage being added

  Compression option being investigated

Back up to tape with weekly submission to off-site storage

Expect to have Internet2 connection

  Facilitate sharing of sequence data

Summary

Highly successful program leading to annual increases in genetic merit for production efficiency

Large database of phenotypic and genomic data provided by industry

Research projects to determine mechanism of genetic control of economically important traits

Data processing techniques developed so that rapid turnaround could be realized