MHC region and its related disease study

71
UNIVERSITY OF COPENHAGEN PhD thesis Hongzhi Cao MHC region and its related disease study Method for decipher the polymorphism and fine-mapping disease causal mutations of MHC region Academic advisor: Anders Krogh University of Copenhagen, Denmark Jun Wang University of Copenhagen, Denmark BGI-Shenzhen, China The PhD School of Science, Faculty of Science, University of Copenhagen Submitted: 23/03/2015

Transcript of MHC region and its related disease study

Page 1: MHC region and its related disease study

U N I V E R S I T Y O F C O P E N H A G E N

PhD thesis Hongzhi Cao

MHC region and its related disease study Method for decipher the polymorphism and fine-mapping disease causal mutations of MHC region

Academic advisor:

Anders Krogh

University of Copenhagen, Denmark

Jun Wang

University of Copenhagen, Denmark

BGI-Shenzhen, China

The PhD School of Science, Faculty of Science, University of Copenhagen

Submitted: 23/03/2015

Page 2: MHC region and its related disease study

2

Dissertation for the degree of philosophiae doctor (PhD)

Department of Biology, University of Copenhagen

Copenhagen, Denmark

And

BGI-Shenzhen

Shenzhen, China

March 2015 Author: Hongzhi Cao Title: MHC region and its related disease study Subtitle: Method for decipher the polymorphism and fine-mapping disease causal mutations of MHC region Academic advisors: Anders Krogh University of Copenhagen, Denmark Jun Wang University of Copenhagen, Denmark BGI-Shenzhen, China The PhD School of Science, Faculty of Science, University of Copenhagen

Submitted: March 2015

Page 3: MHC region and its related disease study

3

PREFACE This PhD project started in 2011 as collaboration between the Department of Biology, University of

Copenhagen and BGI-Shenzhen. The work presented here has been performed at both institutions

by supervision of Prof. Anders Krogh and Prof. Jun Wang.

Page 4: MHC region and its related disease study

4

ACKNOWLEGEMENTS I would like to express my sincere gratitude to everyone who has helped me during my time as a PhD student. First of all, I would like to thank my supervisors Prof. Anders Krogh and Prof Jun Wang for giving me this opportunity and introducing me to the research field within human genomics and genetics, exploring the mystery of the most complex genomic region – Major histocompatibility complex (MHC), and making the time of my PhD exciting and interesting, in addition to all help, guidance and motivation. Thanks to my colleagues, friends at BGI and University of Copenhagen for all the help, guidance, collaborations and making me feel very easy and pleasure to work and study here. I wish to thank all my co-authors, without your helps, comments and guidance with data analysis, interpretation of results and writing, I won’t make it so smooth. Last, but not least, thanks to the endless support from my family, my gentle and considerate wife Yijie, the kindness and great parents of my wife’s and mine. You are my source of power and happiness. Mostly, to my little princess Little September/小九. Your appearance is the happiest thing of my life and makes my life much more wonderful.

Page 5: MHC region and its related disease study

5

ABSTRACT The major histocompatibility complex (MHC) is one of the most gene dense regions in the human genome and many disorders, including primary immune deficiencies, autoimmune conditions, infections, cancers and mental disorder have been found to be associated with this region. However, due to a high degree of polymorphisms within the MHC, the detection of variants in this region, and diagnostic HLA typing, still lacks a coherent, standardized, cost effective and high coverage protocol of clinical quality and reliability. And owing to marked linkage disequilibrium of MHC region, genes/mutations involved in the above diseases have not, with very few exceptions, been identified. Currently popular next-generation sequencing (NGS) technology, comprising massively parallel single-molecule sequencing, can generate billions of bases from amplified single DNA molecules within several days, holding great promise for achieving cost-effective and high-throughput and high accurate genotyping. Based on the NGS platform, by using delicately designed bioinformatics methods, we systematically developed new tools/methods, including high throughput HLA genotyping method – RCHSBT, targeted sequencing based SNV detection as well as HLA gene typing and large structural variation detection using optical mapping technic, to provide comprehensive and accurate information of the MHC region and apply them into disease causal mutation’s fine-mapping.

Page 6: MHC region and its related disease study

6

SAMMENDRAG Major histocompatibility complex (MHC) er et af de mest gen-tætte områder i det menneskelige genom. Mange lidelser - herunder primære immundefekter, autoimmune sygdomme, infektioner, cancer. og anden sindslidelse - er blevet fundet forbundet med denne region. Men på grund af en høj grad af polymorfi i MHC, påvisning af varianter i denne region, og diagnostiske HLA typebestemmelse, mangler der stadig en sammenhængende, standardiseret, omkostningseffektiv og høj dækning protokol med god klinisk kvalitet og pålidelighed. Og på grund af markant bindingsuligevægt af MHC-regionen, er gener / mutationer involveret i disse sygdomme endnu ikke, med meget få undtagelser, blevet identificeret. De nu populære næste-generation sekventering (NGS) teknologier omfatter massivt parallel enkelt molekyle sekventering, og kan generere milliarder af baser udfra amplificerede enkelt DNA-molekyler inden for få dage, og er lovende for at opnå omkostningseffektive og høj-hastighed og høj-nøjagtighed genotypebestemmelse. Baseret på NGS platformen, ved at bruge fint designede bioinformatik metoder, vi systematisk udviklet nye værktøjer / metoder, herunder high throughput HLA genotypebestemmelse metode - RCHSBT, målrettet sekventering baseret SNV detektion samt HLA-gen maskinskrivning og stort strukturelt variation detektion bruger optisk kortlægning teknik , at yde en omfattende og præcis information om MHC-regionen og anvende dem i sygdom kausal mutation bøde-mapping.

Page 7: MHC region and its related disease study

7

Table of Contents 1  Introduction  ..........................................................................................................................................  10  1.1  Major  histocompatibility  complex  ........................................................................................................  10  1.1.1  Function  of  MHC  region  gene  in  human  immune  system  ......................................................................  11  1.1.2  MHC  related  diseases  ............................................................................................................................................  11  

1.2  High  throughput  sequencing  technology  ............................................................................................  18  1.2.1  Roche  454  ...................................................................................................................................................................  19  1.2.2  Illunina  Solexa/Hiseq/Miseq  .............................................................................................................................  19  1.2.3  Life  Technologies  Ion  Torrent  PGM/Proton  ................................................................................................  20  1.2.3  Complete  Genomics  ................................................................................................................................................  20  1.2.4  PacBio  ...........................................................................................................................................................................  20  1.2.6  Oxford  Nanopore  .....................................................................................................................................................  20  1.2.7  Others  ...........................................................................................................................................................................  21  

2  Aims  .........................................................................................................................................................  22  2.1  Current  problem  for  MHC  region  related  study  ................................................................................  22  2.2  Specific  aims  of  this  project  .....................................................................................................................  22  

3  Results  ....................................................................................................................................................  23  3.1  Novel  method  for  MHC  region  study  .....................................................................................................  23  3.1.1  Paper  I:  High  throughput  HLA  typing  Method  ...........................................................................................  23  3.1.2  Paper  II:  Integrated  tools  to  study  whole  MHC  region  ...........................................................................  25  

3.2  Construction  Chinese  MHC  database  ....................................................................................................  27  3.3  Fine-­‐mapping  disease  related  mutation  .............................................................................................  30  3.4  Paper  III:  Novel  method  and  technology  in  genome  study  ............................................................  30  

4  Conclusions  ...........................................................................................................................................  32  

5  Future  Perspective  .............................................................................................................................  33  6  References  .............................................................................................................................................  34  

7  Appendix  ................................................................................................................................................  42  

Page 8: MHC region and its related disease study

8

ABBREVIATIONS MHC Major Histocompatibility Complex HLA Human Leukocyte Antigen BAC Bacterial Artificial Chromosome APC Antigen Presenting Cell IST Immunosuppression UNOS United Network for Organ Sharing DGF Delayed Graft Function PID Primary Immune Deficiency IgAD IgA Deficiency RA Rheumatoid Arthritis SLE Systemic Lupus Erythematous GD Grave’s Disease AIDS Acquired Immune Deficiency Syndrome HIV Human Immunodeficiency Virus HBV Hepatitis B Virus HCV Hepatitis C Virus HGP Human Genome Project NGS Next Generation Sequencing SBS Sequence By Synthesis CG Complete Genomics DNB DNA Nano Ball SBH Sequencing By Hybridization SMRT Single Molecule Real Time sequencing ZMW Zero-Mode Waveguide SBT Sequencing Based Typing SNP Single Nuclear Polymorphism LD Linkage Disequilibrium SV Structural Variation GWAS Genome Wide Association Study HSCT Hematopoietic Stem Cell Transplantation PE Paired End CNV Copy Number Variation KIR Killer cell Immunoglobulin-like Receptor TRA T cell Receptor Alpha

Page 9: MHC region and its related disease study

9

TRB T cell Receptor Beta IGH Immunoglobulin Heavy IGL Immunoglobulin Light MAM Maximum Allowable Mismatch

Page 10: MHC region and its related disease study

10

1 Introduction

1.1 Major histocompatibility complex The major histocompatibility complex (MHC) is a group of genes that found on the cell surface, and they control a major part of the immune system in all vertebrates and help the immune system recognize foreign materials. In humans, the MHC is also called as human leukocyte antigen (HLA). The genomic region encoding MHC locates on the short arm of chromosome 6 on human and its essential for the human immune system [1]. It is also the densest gene region of the human genome, with over 200 identified loci and more than 120 expressed genes. Most of the genes in this region play key roles in human immune system. The typical MHC region is a 3.6M region, which further divided into three major regions according to the genes’ function (Fig. 1). The class I and class II genes play essential role in antigen presentation and main related with adaptive immunity, including the well-known HLA-A, -B, -C in class I region and DRB and DQB gene in class II region. Gene in class III region mainly related with innate immunity [2].

Figure 1. Location and organization of genes for MHC [2]. Due to its importance, MHC was the first multi-megabase region of the human genome to be completely sequenced [1]. In 2004, the sequences of the two common and disease associated MHC haplotypes, the PGF (HLA-A*03:01-B*07:02-C*07:02-DRB1*15:01-DQB1*06:02) and COX (HLA-A*01:01-B*08:01-C*07:02-DRB1*03:01-DQB1:02:01, also named as 8.1 haplotype) in Caucasian have been provided by Sanger sequencing the bacterial artificial chromosome (BAC) using cell lines from consanguineous individual [3]. Subsequently, sequence of another disease

ADVANCES IN IMMUNOLOGY

Volume 343 Number 10

·

703

cytoplasm. The bags are then pinched off as endocyt-ic vesicles, and they fuse with primary lysosomes,which are loaded with an assortment of proteolyticenzymes, to form endosomes.

5

Within the endo-somes, the ingested proteins are degraded, first topeptides and then, in some vesicles, into amino acidsfor reuse.

This mode of protein processing must have evolvedat an early stage in the history of living forms. Later,the jawed vertebrates found a new use for it by en-listing it in the defense of the body against patho-gens. Normally, the proteins that undergo recyclingare the organism’s own, but in infected cells, proteins

originating from the pathogen are also routed intothe processing pathways. With the exception of jawedvertebrates, no organisms appear to make a distinctionbetween peptides derived from their own (self ) pro-teins and those derived from foreign (nonself ) pro-teins. Jawed vertebrates, by contrast, use the peptidesderived from foreign (usually microbial) proteins tomark infected cells for destruction.

In an uninfected cell, “housekeeping” proteasomescontinuously churn out self peptides, some of whichare picked up by molecules called “transporters asso-ciated with antigen processing,” or TAPs, encoded bythe

TAP1

and

TAP2

genes.

5

The TAP1 and TAP2

Figure 1.

Location and Organization of the HLA Complex on Chromosome 6.The complex is conventionally divided into three regions: I, II, and III. Each region contains numerous loci (genes), only some ofwhich are shown. Of the class I and II genes, only the expressed genes are depicted. Class III genes are not related to class I andclass II genes structurally or functionally.

BF

denotes complement factor B;

C2

complement component 2;

C21B

cytochrome P-450,subfamily XXI;

C4A

and

C4B

complement components 4A and 4B, respectively;

HFE

hemochromatosis;

HSP

heat-shock protein;

LMP

large multifunctional protease;

LTA

and

LTB

lymphotoxins A and B, respectively;

MICA

and

MICB

major-histocompatibility-complex class I chain genes A and B, respectively;

P450

cytochrome P-450;

PSMB8

and

9

proteasome

b

8 and 9, respectively;

TAP1

and

TAP2

transporter associated with antigen processing 1 and 2, respectively;

TAPBP

TAP-binding protein (tapasin); TNF-

a

tumornecrosis factor

a

;

and

HSPA1A, HSPA1B,

and

HSPA1L

heat-shock protein 1A A-type, heat-shock protein 1A B-type, and heat-shockprotein 1A–like, respectively.

Telomere Telomere

Chromosome 6

Class II Class III Class I

HLA class III!region loci

Regions

HLA class II!region loci

HLA class I !region loci

q pCentromere

6p21

.31

DRA

DRB

3

DRB

2

DRB

1

DQ

A1

DQ

B1

DO

B

DM

B

DM

A

DO

A

DPA

1D

PB1

DPB

2

P450

, C21

BC4

B

BF C2 HSP

A1B

HSP

A1A

HSP

A1L

C4A

LTB

TNF-

a-

LTA

HLA

-E

HLA

-C

HLA

-BM

ICA

MIC

B

HLA

-A

HLA

-G

HLA

-F

HFE

TAPB

P

TAP2

PSM

B8 (L

MP1

)TA

P1PS

MB9

(LM

P2)

The New England Journal of Medicine Downloaded from nejm.org at NIH on May 14, 2011. For personal use only. No other uses without permission.

Copyright © 2000 Massachusetts Medical Society. All rights reserved.

Page 11: MHC region and its related disease study

11

associated MHC haplotype QBL (HLA-A*26:01-B*18:01-C*05:01-DRB1*03:01-DQB1*02:01) was provided by employing the same strategy [4]. To the end of 2008, sequence of eight different MHC haplotypes [5] were provided by the MHC haplotype project in total and the haplotype sequence of PGF has been introduced into the reference sequence of human genome (since NCBI35), and becomes the largest single haplotype sequence of the human genome so far. Pairwise comparison of these eight MHC haplotypes has revealed dramatic variants, showing the great polymorphism of this region and the great diverse of differences between different haplotypes.

1.1.1 Function of MHC region gene in human immune system As illustrated in the review paper [2], a man dies after he transplants a heart; a woman is crippled because of her rheumatoid arthritis; a child becomes a coma after she gets cerebral malaria; another child dies in an infection as he has immunodeficiency; an elderly man gets serious hepatic cirrhosis because of iron overload. Though these five different clinical situations are different, it all involves in human MHC system. The importance of MHC to human immune system main due to two classic gene clusters within the region: HLA class I and class II genes. The class I molecules are expressed on the surface of most human cells although the expression level were different between each tissue. While constitutive expression of class II molecules are normally only expressed in a specific group of immune cells called antigen presenting cells (APC) that includes dendritic cells, macrophages, B cells, activated T cells and thymic epithelial cells [6]. One of the most important functions of class I and class II MHC molecules are to present short, pathogen-derived peptides to T cells to initiate the human adaptive immune responses. Furthermore, class III region comprises important gene involved in innate immunity which including the complement genes and genes encoding for some of the inflammatory cytokines [1]. The key to a healthy immune system is its remarkable ability to distinguish between “self” and “non-self”. Among these process, MHC molecules play a fundamental role. Processing endogenous protein and loading the generated short peptides onto class I molecules are happening all the way in body’s cells to mark as “self” and avoid tricking immune reaction main by cytotoxic CD8+ T cell. Meanwhile, the processed exogenous proteins, presented as small peptides, will load onto class II molecules to mark as “non-self”, hence tricking the immune reaction by CD4+ T cell and B cell [7].

1.1.2 MHC related diseases Due to the importance of genes in this region with immune system, many disorders were discovered to relate with MHC (Fig. 2). According to their mechanism, they can be classified into following groups:

Page 12: MHC region and its related disease study

12

Figure 2: Partial of diseases that found to be related with MHC region [8]. 1.1.2.1 Rejection after transplantation 1.1.2.1.1 Graft and Patient Survival Despite significant advances in immunosuppression (IST) and post-operative patient management

GG14CH14-Trowsdale ARI 26 July 2013 14:56

CD4+

T helpercell

Gene annotationsStrand

HCG23HLA-DRA

C6orf10

BTNL2

HLA-DQA2

PSMB9

BRD2

HLA-DPB1

HCG24HLA-DPB2

HLA-DQB2HLA-DOB

TAP2PSMB8

TAP1

HLA-DMBHLA-DMAHLA-DOA

HLA-DPA1

HLA-DQB3

HLA-DPA2COL11A2PHLA-DPA3

PBX2NOTCH4

HLA-DQA1

HLA-DRB5

HLA-DRB1

HLA-DQB1

HLA-DRB9

HLA-DRB6

HLA-F

HLA-G

HLA-AHCG9

ZNRD1PPP1R11

TRIM40TRIM15

TRIM39RPP21

HLA-E

PRR3ABCF1MRPS18BATAT1

TUBBHCG20

DDR1GTF2H4VARS2DPCR1MUC21

HCG22

ZFP57

ZNRD1

RNF39TRIM31TRIM10TRIM26

HCG17HCG18

GNL1

PPP1R10DHX16

NRMMDC1FLOT1

IER3

TIGD1L

SFTA2HCG21

MICA

HCP5MICBNFKBIL1

LTATNF LST1AIF1APOMCSNK2B

LY6G6DMSH5HSPA1AHSPA1B

C2CFB

HLA-C

HLA-B

BAT1ATP6V1G2

LTBNCR3BAT3BAT4BAT5

LY6G6CDDAH2

CLIC1VARSLSM2

HSPA1LNEU1

SLC44A4EHMT2

RDBPDOM3Z STK19

C4AC4BCYP21A2

TNXBATF6B

HCG27TCF19CCHCR1

POU5F1PSORS1C3

PSORS1C1CDSN

PSORS1C2

PPT2EGFL8RNF5

PRRT1AGER

AGPAT1

LY6G6BLY6G6F

-AS1

Antigen processingand presentation

BacteriaParasites

Phagosome

CLIP

DM

Endosomal/lysosomal

compartment

Virus

TAP

Golgiapparatus

ER

α3

α2

β2

α1

microglobulin

CD8+

cytotoxicT cell

Peptide-loadingcomplexPeptide

antigen

MHC class Imolecules

MHC class IImolecules

HLA-A, HLA-B,HLA-C

HLA-DR, HLA-DQ,HLA-DP

CD4

C1 C4C2

Hsp70

Proteins requiringchaperone

Antigen processingTAP1, TAP2, PSMB8,

HLA-DM, TAPBP

Immunoglobulinsuperfamily

AGER, BTNL2

Classical pathwaytriggered by Ag-Ab complex

Complement system C2, BF, C4A, C4B

Leukocyte maturationLY6G5B, LY6G5C, LY6G6D,LY6G6E, LY6G6C, DDAH2

Stress responseHSPA1A, HSPA1B, HSPA1L,

MICA, MICB

In!ammationABCF1, AIF1, IER3, LST1,

LTA, LTB, NCR3, TNF

Immune regulationNFKBIL1, FKBPL

Examples of the role of MHC genes in immune function Examples of MHC-disease associations

PsoriasisPSORS1 (HLA-C*0602)

Ankylosingspondylitis

HLA-B27

SarcoidosisBTNL2

Type 1 diabetesDRB1*04–DQA1*0301–

DQB1*0302, DRB1*03–DQA1*0501–

DQB1*0201

Rheumatoid arthritisHLA-DRB1 (*0401), HLA-DQA1 (*0301)

Multiple sclerosisHLA-DRB1 (*0501)

NarcolepsyHLA-DQB1 (*0602)

Celiac diseaseHLA-DQA1 (*0501),HLA-DQB1 (*0201)

Bare lymphocytesyndrome

TAP1, TAP2, TABPJuvenile

myoclonic epilepsyBRD2

Pemphigus vulgarisHLA-DQB1 (*0301)

MalariaHLA-B53

Graves’ diseaseHLA-DRB1 (*0301), HLA-DQA1 (*0501)

Abacavir drughypersensitivity

HLA-B (*5701)

Complementde"cienciesC2, C4A, C4B

Congenital adrenalhyperplasia

CYP21A2Ehlers-Danlos

syndromeTNXB

Dengue shock syndromeMICB

SialidosisNEU1

Systemic lupuserythematosus

HLA-DRB1 (*0301)

Ulcerative colitisHLA-DRB1 (*1101)

LeprosyHLA-DRB1, DQA1

Myasthenia gravisHLA-C (*0701)

Immunoregulatoryfunctions

Selective IgA de"ciencyHLA-DQB1 (*0201)

CD8

HIV/AIDSHLA-B, HLA-C

α2β2

β1 α1

308 Trowsdale · Knight

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 201

3.14

:301

-323

. Dow

nloa

ded

from

ww

w.a

nnua

lrevi

ews.o

rgby

Uni

vers

itat Z

uric

h- H

aupt

bibl

ioth

ek Ir

chel

on

09/0

5/13

. For

per

sona

l use

onl

y.

Page 13: MHC region and its related disease study

13

after transplantation, the 5-year survival rates are still significantly low in lung transplant recipients (51%), and better rates are observed in heart (73% for male, 69% female); liver ~78% and Kidney ~89% (http://www.srtr.org/annual_reports). Graft survival records from the united network for organ sharing (UNOS) registry assessing the fraction of graft failures attributed to immunological or non-immunological factors shows that 10-year graft survival rates for living HLA-mismatched donors to be 66%. Analyses of the overall 10-year graft failure rate for cadaver donors (60%) showed that18% of failures were due to HLA factors, as observed through mismatched living donor grafts [9]. 1.1.2.1.2 Delayed Graft Function Delayed Graft Function (DGF) is a commonly observed adverse post-transplant that impacts graft survival [10]. Risk factors for DGF include non-immunologic factors such as donor-age, cold ischemia time, and recipient ancestry. Immunologic factors have been shown to be associated with DGF including amino-acid differences at over 60 polymorphic sites of HLA-A in a prospective study involve ~700 kidney transplant recipients showed that combinations of amino acid mismatches at crucial HLA-A sites were related with DGF [11]. In European ancestry recipients, amino acid difference at position 62, 95 or 163, which known to play important function as located in the antigen recognition site were found to relate with increased DGF risk. Additionally, decreased risk for DGF was associated with mismatches at HLA-A sites (149, 184, 193 or 246), indicating that evolutionary features of HLA-A variants separating HLA-A families and lineages among D-R pairs may be intertwined with the strength of allogenicity underpinning DGF. While these findings indicate that amino acid variants at important function sites in the antigen recognition site of the HLA-A molecule have great relationship with DGF, there may also be additional common and rare variants with similar protective and deleterious influences in MHC region. 1.1.2.2 Primary immune deficiency Primary immune deficiency (PID), a kind of disorder that part of the body’s immune system is missing or impaired. Most primary immune deficiencies are genetic disorders. Some of selected diseases include: 1.1.2.2.1 IgA deficiency Selective IgA deficiency (IgAD) is the most common form of primary immunodeficiency disorders in the Caucasian population. It is defined as serum IgA levels <= 0.07 g/l with normal serum levels of IgG and IgM in individuals >= 4 years old [12]. It affects approximately 1 in 600 of Caucasians [13]. However, its prevalence varies among different ethnical populations ranging from 1:143 in Saudi Arabia [14] to 1:23,255 in Japan [15]. Though intense researches, the genetic basis of IgAD remains elusive. However, it is well known that IgAD is strongly associated with MHC, especially the 8.1 haplotype [16]. 1.1.2.2.2 Complement deficiency Complement deficiency is a disorder caused by one of proteins involving the complement system is absent or does not work properly. There are a group of genes in the MHC class III region relate with human complement system. The most common complement deficiency was in C2. Other deficiencies were also found in C1q, C1r, C1s, C4 and also have been reported to link with immune disease [17]. 1.1.2.2.3 MHC Class I deficiency

Page 14: MHC region and its related disease study

14

Major histocompatibility complex class I deficiency is a disorder with severe expression reduce of MHC class I genes in the cell surface. It is a rare disease with high biological heterogeneity and has a variable clinical phenotype. Mutations in gene TAP1 or TAP2 which coding for the transporter associated with antigen presentation in the MHC region have been found in patients with this disorder [18]. 1.1.2.3 Autoimmune condition So far, autoimmune diseases comprise the largest number of diseases that found to be related with MHC region (Table 1). It characterize as the body’s immune responses directly against one’s own tissues. This may be restricted to specific organs or involve a particular tissue in different places. Some represented autoimmune diseases are given below. 1.1.2.3.1 Psoriasis Psoriasis is a common autoimmune and hyper proliferative skin disease, characterized by thick, silvery-scale patches. The prevalence of psoriasis is 0.1-0.3% in Asian and 2-5% in Caucasian [19]. Psoriasis has a strong hereditary component and the major determinant that defined before is PSORS1, which located in the MHC region. It could explains 35%-50% of psoriasis’ heritability [20]. 1.1.2.3.2 Rheumatoid arthritis Rheumatoid arthritis (RA) is a chronic inflammatory disorder, mainly affects the joints of feet and hands. The prevalence of RA is approximately 1% worldwide [21]. Genetic studies reflect this disease is influenced by both genetic and environment factors [22]. The strongest genetic risk factor for RA is HLA-DRB1 gene that located in MHC region. Risk alleles that identified in MHC region include DR4, DR1, and B27. 1.1.2.3.3 System lupus erythematous Systemic lupus erythematous (SLE) is a chronic inflammatory and classical autoimmune disease in which the body’s immune system attacks its healthy tissue by mistake. It characterized by a diverse array of autoantibodies, immune complex deposition and complement activation, and influenced by both genetic and environmental factors. Most of the cases were women and frequently starting at childbearing age. The MHC class II gene DRB1 was reported to relate with SLE (Table 1). 1.1.2.3.3 Graves disease Grave’s disease (GD) is an autoimmune disease and characterized by lymphocytic infiltration of the thyroid gland and the production of thyrotropin receptor autoantibodies (TRAbs), leading to produce excess thyroxin, hence to hyperthyroidism. The incidence of GD shows a geographical variation as well as the seasonal trends, suggesting that environmental triggers such as infections might be an important contributor to the GD pathogenesis [23]. However, twin and sibling studies have reported an increased risk ratio for GD [24], indicating genetic plays an important rule on the development of this disease. The most genetic associated loci is the HLA-DRB1 (DRB1*03,DRB1*11,DRB1*01) gene that is locating in MHC region.

Page 15: MHC region and its related disease study

15

Table 1 Autoimmune disease related loci within MHC Disease Related variants Ankylosing spondylitis B*27, DPA1*01:03, A*02:01, B*13:01 Autoimmune hepatitis rs2187668, DRB1*03:01/DRB1*0301, DRB1*0401 Autoimmune hepatitis type-1 DRB1*0301, DQA1*0101/DRB1*1301 Behçet's disease B*51/rs116799036, rs12525170, rs114854070, C*1602 Celiac disease DQA1*05,DQB1*02/rs2647044 Crohn disease A*33, B*45, C*3, A26, DRB1*03, DRB1*08/DRB1*07, DQB1*0303 Graves disease DRB1*03, DRB1*11, DRB1*01/DRB1*09:01, DRB1*12:02,

DPB1*05:01, DQB1*03:02, DRB1*15:01, DRB1*16:02 / DRB1*0301, DRB1*0802, DRB1*1403, DRB1*0701, DRB1*1302

IgA nephropathy DRB1*03, DRB1*10/DRB1*04/rs660895, rs1794275, rs2523946 Multiple sclerosis DRB1*15:01, DRB1*03:01, DRB1*13:03, DRB1*04:04, DRB1*04:01,

DRB1*14:01, A*02:01, rs9277489, rs2516489, B*37 Myasthenia Gravis DQA1*01, DQB1*0502 /DRB1*04, DRB1*03, DQB1*02, DQB1*03 /

DQB1∗05:02 Narcolepsy DRB1*1501, DQB1*0602/rs2858884/DQB1*06:01, DQB1*03:02,

rs3117242, DPB1*05:01 Primary biliary cirrhosis DRB1*0801, DRB1*11, DRB1*13/DRB1*0405, DRB1*0803/DRB1*08,

DRB1*11, DRB1*14 /rs2856683, rs9275312, rs9275390, rs7775228, rs2395148, rs9277535, rs3806156, rs9357152, rs3135363, rs9277565, rs2281389, rs660895, rs9501626

Psoriasis C∗06:02, C∗12:03, B(AA 67 ,9), A(AA 95), DQA1(AA 53) Rheumatic heart disease B*51, C*04, DRB1*01, DQB1*08/DRB1*07/B*13, DRB1*01, DRB1*04,

DRB1*07, DQB1*02, DRB1*13 Rheumatoid arthritis DRB1(AA 9), B(AA 9)/DRB1(AA 11,71,74), B(AA 9), DPB1(AA 9) Systemic lupus erythematous

rs3131379, rs1270942/rs9271366, rs9275328/ DRB1*15:01, DRB1*13:02, DRB1*14:03

Systemic sclerosis rs6457617/rs9275224, rs6457617, rs9275245, rs3130573 /-DRB1*15:02, DRB5*01:02

Ulcerative colitis rs9268480, rs660895/rs2395185/DRB1(AA 11)/DRB1*13, DRB1*08 Vitiligo rs11966200, rs9468925/rs7758128/A*33:01, B*44:03, DRB1*07:01 /

rs9468925 1.1.2.4 Infection disease Infection disease is one of the leading reasons for human morbidity and mortality, especially for children [25]. Genes involved in the immune response locate within regions that experience the most selective pressure. It is generally assumed that resistance to infection exerts selection pressure fuels the generation of MHC variation and, since its discovery, MHC has stood out as the leading candidates for infection disease susceptibility. However, currently, only several infection diseases have been clearly identified to be associated with this region (Table 2). Some of the selected infection diseases were given below.

Page 16: MHC region and its related disease study

16

Table 2 Infection disease related loci within MHC Disease Relavent variants SARS A*0201 HIV-1 control AA_B_97 HIV-TB B*4006 HCV - sustained response to therapy B*04 Hepatitis C virus-associated hypertrophic cardiomyopathy DPB1*0901 Mycosis fungoides DQB1* 03 Malaria DRB1*04 Leprosy DRB1*15 Allergic bronchopulmonary aspergillosis DRB1*1501, DRB1*1503 Melioidosis DRB1*1602 AIDS rs2395029 Dengue shock syndrome rs3132468 Hepatitis B vaccine response rs3135363(C) Kaposi's sarcoma rs6902982(G) HCV-induced liver cirrhosis rs910049(A) Hepatitis B virus-related hepatocellular carcinoma rs9275319(G) HPV rs9357152(G)

1.1.2.4.1 Acquired immune deficiency syndrome Acquired immune deficiency syndrome (AIDS) is the final as well as the most serious stage after infected with human immunodeficiency virus (HIV). The HIV infection and HLA genes’ information of MHC region are one of the most convincing correlations of human infection and offer the opportunity to develop novel interventions and therapies. Gao et al. [26] reported specific HLA-B*35-Px subtypes as being responsible for the correlation between HLA-B*35 and rapid progression to AIDS, and they also have confirmed the known protective effect of the HLA-B*27 and B*57 subtypes against progression to AIDS. 1.1.2.4.2 Hepatitis virus infection Chronic hepatitis virus infection is an important healthy issue, especially in Asian region like China that cause huge public health concern. Hepatitis B and C virus infections are estimated to account for 70% of the global burden of liver disease [27]. The clinical outcomes after infection were different, from clearance of infection to hepatocellular carcinoma. Many researches proved that genetic information play important role in determining the clinical phenotypes. Most of these researches focused on the relationship between HLA genes of MHC region and infection and HLA class II heterozygote advantage has been demonstrated for clearance of hepatitis B virus (HBV) infection [28]. 1.1.2.5 Cancer Human immune system and HLA genes in MHC region plays important role in the development of cancer but the detailed mechanism still poorly understood. Cells of cancer tissue could express some genes that the normal pairs do not, and the short peptides derives from these specific expressed genes will load to MHC class I molecules. These peptides may sever as the target for the immune system. One of the first cancer that reported to relate with MHC genes was Hodgkin’s lymphoma [29]. Many tumors could modulate MHC class I genes’ expression to escape the immune system’ recognition and attack. Cancers that have been showed to related with MHC

Page 17: MHC region and its related disease study

17

region so far were given below. Table 3: Cancer related loci within MHC Disease Related variants Abdominal aortic aneurysm DQA1*0102 Acute lymphoblastic leukemia DQB1*0201, DQB1*0501, DQB1*0301/DRB1*15/rs2227956,

rs1043618, rs1061581 Basal cell carcinoma DRB1*07/DRB1*01 Breast cancer DRB1*1301/A*30,A*31, A*24 /DQA1*0301,DRB1*1303,

DQA1*0505, DRB1*1301/DQB1* 02 Cervical cancer DRB1*1501/rs4282438/DRB1*13,DRB1*3,DRB1*09,DRB1*1201 Chronic lymphocytic leukemia A*02:01,DRB4*01:01 Diffuse large B cell Lymphoma DRB1*01,C*03 Esophageal squamous cell carcinoma

rs35399661, rs3763338, rs2844695

Follicular lymphoma DQB1*05,DQB1*06/rs10484561, rs6457327/DPB1*03:01,rs2647012 /DRB1(AA 13)/rs17203612, rs3130437

Hepatocellular carcinoma DRB1*07, DRB1*12/ DQB1*0502, DQB1*0602/rs2596542, rs9275319

Hodgkin lymphoma rs6903608, rs2281389, DQA1*02:01 Large B-cell lymphoma DRB1*01,C*03 Lung adenocarcinoma DQA1*03 Lung cancer A*0201, A*2601, B*1518, B*3802, DRB1*0401, DRB1*0402,

DRB1*1201, DRB1*1302/rs1052486, rs3117582 Nasopharyngeal carcinoma rs2517713, rs2975042, rs29232, rs3129055 , rs9258122/A*31,

A*33, A*19, DRB1*01, DQB1*05, DQB1*02/ B∗46, A∗24/rs3869062

Ovarian cancer A*02 Papillary thyroid carcinoma B*51:01 Testicular germ cell tumors DRB1*0405, DQB1*0401/ B*41

1.1.2.6 Neurological disorder Immune system abnormal during brain development can cause serious effects on neuronal function and there are more and more evidences show that immune dysregulation is associated with neurodevelopmental disorders [30]. Other neurological disorders, including Alzheimer’s disease [31], Parkinson diseases [32], Schizophrenia [33] and Autism spectrum disorder (ASD) [30] were reported to related with MHC region, which indicated the importance of immune system with these disorders. 1.1.2.7 Others Remarkably, MHC is also found to be associated with some other conditions, such as smoking behavior [34] and drag sensitivity which includes abacavir induced hypersensitivity with HLA-B*57:01 [35] and carbamazepine sensitivity with B*15:02 and A*31:01 [36]. It has also been reported that MHC is related with mate choice [37].

Page 18: MHC region and its related disease study

18

1.2 High throughput sequencing technology DNA sequencing is a process of reading and determining the precise order of nucleotides in a DNA molecule. Sanger sequencing method has dominated the sequencing filed for decades since it first developed in 1977 and has paved the road for our understanding of the human genome as well as many other species. Sequencing technology has exploded since the completion of the Human Genome Project (HGP). The rapid development of sequencing methods, especially the arises of next generation sequencing (NGS) methods (Fig. 3) which enable to produce thousands or millions of sequences parallel has greatly accelerated the medical and biological research and discovery. The most advantage of NGS method is its high-throughput, thus lead to produce an enormous volume of data cheaply.

Figure 3: Development of sequencing methods Currently, the most popular NGS technology are Illumina’s Hiseq, Roche’s 454, Life’s Ion Torrent/Proton, PacBio’s SMRT sequencing and Complete Genome’s CG. Detailed review of the mechanism and comparison of these technologies have been published before [38-43]. Along with the emergence of massively parallel sequencing technology, the cost of sequencing a single human genome is dramatically decreased (Fig. 4) and there has been a drive to improve. The newest sequencers, named as ‘third generation’ or ‘single molecular’ sequencing method [44], are able to sequencing DNA directly. Although the throughputs are not comparable to current NGS methods, they have great potential and advantage in clinical usage currently, and may dominate the sequence filed in the future after persistent technical improvement.

1950� 1960� 1970� 1980� 2010� 2013�1990� 2000�

The structure of DNA was

established in 1953�

DNA sequencing with chain-terminating inhibitors (1977)�

DNA sequencing by chemical degradation

(1977)�

The first semi-automated DNA sequencing machine

in 1986�

Real-time DNA sequencing using

detection of pyrophosphate release in

1996�

The Polony sequencing (2005)�

Massively parallel signature sequencing

(MPSS) (1992)�

454 GS 20 in 2005; 454 GS FLX in 2007; 454 GS FLX Titanium in 2009; 454 GS FLX+ in 2011;

Illumina GA lanuch in 2006 Illumina GA II in 2007; Illumina GAIIx in 2009; Illumina Hiseq2000 in 2010; Illumina Miseq in 2011; Illumina Hiseq2500 in 2012;

Life Technologies SOLiD in 2008; Life Technologies SOLiD 5500xl in 2010; Life Technologies Ion Torrent PGM 314/316/318 chip in 2011; Life Technologies Ion Proton in 2012;

��

Intelligent BioSystems (2006)�

Helicos (2008)�

PacBio RS system (2010)�

Oxford Nanopore Technologies (2012)�

Page 19: MHC region and its related disease study

19

Figure 4. Cost for sequencing one human genome [45].

1.2.1 Roche 454 Roche 454 DNA sequencer was the first next-generation sequencing method that has been commercialized (Fig. 3). The method was developed by 454 life Sciences and was acquired by Roche in 2007. The technology is derived from the technological convergence of pyrosequencing and emulsion PCR [46]. DNA was first sheared into small fragments and ligated with adapter. Then the adapter-ligated NDA fragments were fixed to beads in a water-in-oil emulsion and amplified by PCR. Finally, these DNA-bound beads were placed to a fiber chip and placed into the system for sequencing.

1.2.2 Illunina Solexa/Hiseq/Miseq This sequencing technology is based on sequencing by synthesis (SBS) and was first released in 2006 by Solexa [47]. It employs a modified dNTPs that contains a terminator to blocks further synthesis that similar to the Sanger sequencing. The whole sequence process constitutes by three major steps: 1) Sample preparation. DNA is sheared to an appropriate size, i.e., 500bp (according to the sequencing strategy). Then, the ends of fragmented DNA are polished and ligated with unique adapters. 2) Cluster Generation by Bridge PCR. The constructed DNA library is loaded to the flow cell. Using a specific designed process, DNA fragment ligate with the oligonucleotides of the flow cell surface and amplified. 3) Sequencing by synthesis. Each sequencing cycle consists of loading a single fluorescent nucleotide and imaging by a high-resolution camera. In 2007, Illumina acquired Solexa and have provided a series of next-generation sequencing machines based on the same theory which including Genome Analyzer IIx, Hiseq, Miseq and so on. As the development of this technology, the output of raw sequencing data increased from 1G (GA) at the first begin to near 600G per run (Hiseq2000) and read length increased from single-end 35bp to current pair-end 300bp (Miseq).

Page 20: MHC region and its related disease study

20

1.2.3 Life Technologies Ion Torrent PGM/Proton Ion Torrent semiconductor sequencing [48] also uses sequencing by synthesis technology, but has moved away from optics altogether. Library preparation involves coating beads in individual library fragments and each bead is then deposited in its own well of a semiconductor plate. The wells are flooded sequentially with each of the four DNA base pairs. If one is incorporated into the growing strand, the hydrogen ions released causes a change in pH and is recorded by the semiconductor chip at the bottom of each well. Homopolymer runs are detected as larger spikes in pH change. Ion Torrent runs are faster due to the absent of optics and lack of modified bases which requiring chemical alteration. Currently, Ion Torrent provides two different sequencing systems: Proton and PGM. The Ion Proton system enables the highest throughput with up to 10 Gb reads in 2-4 hours with read length up to 200bp. While for Ion PGM, although the throughput is only ~2G, the read length could be as long as 400bp.

1.2.3 Complete Genomics By brought diverse technologies together, Complete Genomics (CG) created a comprehensive platform [49-50] for large-scale human genomics studies. Like other NGS method, the whole sequencing process can also be divide into three major process: 1) Library Construction. Besides shearing DNA into an appreciated size, CG inserts four adaptors into each DNA fragment during the library construction. 2) DNA library amplification and loading to patterned surface. Unlike other sequencing methods which DNA amplification is performed on surfaces or in emulsions, CG’s amplification process performed in solution and produces amplified DNA clusters which named as DNA nano-balls (DNBs). These produced DNBs are loaded on the patterned surface. 3) Sequencing by hybridization (SBH). Pools of probes labeled with four different dyes were hybridized and ligated to the template to produce high accuracy reads.

1.2.4 PacBio The PacBio sequencing system provided by Pacific Biosciences is one of current next-generation sequencing technologies which uses a single molecule real time sequencing (SMRT) method [51]. This method is also SBS technology that based on real-time imaging of fluorescently marked nucleotides when synthesizing along the DNA template molecules. The DNA sequencing is performed on a chip that contains many zero-mode waveguide (ZMW). Inside each ZMW, a single active DNA polymerase with a single molecule DNA template is immobilized to the bottom through which light can penetrate and create a visualization chamber that allows monitoring of the activity of the DNA polymerase at a single molecule level. The signal from a phospho-linked nucleotide incorporated by the DNA polymerase is detected as the DNA synthesis proceeds which results in the DNA sequencing in real time. As a result, the read length is not uniform but has an approximately lognormal distribution. Meanwhile, sequencing error could be reduced after the process of circular consensus sequencing and assembly.

1.2.6 Oxford Nanopore The basic idea of Nanopore based sequencing method is threading the DNA molecule through a small hole and measuring the physical property which is related with each nucleotide that passed the hole. It has great potential to become a direct, fast and inexpensive sequencing technology.

Page 21: MHC region and its related disease study

21

Oxford Nanopore was founded in 2005. It translate the academic nanopore research into a commercial, electronics-based sensing technology [52] which could be used into analysis of single molecules, including DNA, RNA and proteins.

1.2.7 Others Besides these commercial sequencing methods showed above, there were some other high throughput sequencing methods that appeared and used before. Such as Applied Biosystems SOLiD system [53] which using 2-based encoding and ligation sequencing rather than SBS and Helios Biosciences’ Helicos Genetic Analysis System [54], which was the first commercial NGS method to use the principle of single molecule fluorescent sequencing. However, these technologies were gradually withdrawn from the sequencing market because of technical or commercial problems.

Page 22: MHC region and its related disease study

22

2 Aims

2.1 Current problem for MHC region related study 1) Choosing a reliable and user-friendly genotyping method is one of the most important issues

for researches or studies which mainly rely on HLA typing information. Sanger sequencing, distinguished for its long read length and accuracy, has dominated high-quality sequencing for decades since it adapted as the gold standard of genotyping. However, Sanger sequencing based typing (SBT) is labor intensive and low throughput, hindering it being used as a routine method, especially in large-scale genotyping projects.

2) Despite the impressive cumulative list of MHC associations of different disorders during recent decades [8], there has been a surprising lacking of progress in identifying causal genes and variants behind these disease associations. This is likely to be due to the complex allelic and genetic structure of this region with an extended linkage disequilibrium (LD) and a high degree of polymorphisms [55].

3) Effective imputation methods and large-scale reference panels are available to determine the high polymorphism HLA genes in silico and fine-map diseases associations within classical HLA genes [56–58]. However, adequately sized reference panels are still not available for many ethnic groups. Moreover, current available reference panels lack MHC-wide ascertainment of variation, and may thus not allow fine mapping of the causal mutation or even miss parts of disease associations. In addition, it has been shown that MHC haplotypes, polymorphisms and recombination rates are highly divergent among individuals and populations [59].

4) Structural variation (SV) have been proven to be an important source of human genetic diversity and disease susceptibility [60-63]. Base pair differences arising from SVs occur on a significantly higher order (>100 fold) than point mutations [64-65], and data from the 1000 Genomes Project show population-specific patterns of SV prevalence [66]. The MHC is one of the most polymorphism regions on genome, there exists abundant SVs between different MHC haplotypes [5], [67]. However, due to its complexity, current methods could hardly reveal this important information.

2.2 Specific aims of this project Take the advantage of next generation sequencing methods together with advanced bioinformatics analysis to:

1) Develop novel, accurate and cost-effective HLA typing method for routinely usage, especially in large-scale genotype projects.

2) Develop novel tools that could efficiently obtain information of the whole MHC region which could be used to solve the problem that hard to identify the disease causal mutation due to the high polymorphism and linkage dis-equilibrium in genome wide association study (GWAS) for complex disease study in MHC region.

3) Construct a MHC reference panel database that contains comprehensive variants information, including SNPs, Indels and HLA typing, for Han Chinese population, which could serve as a basis for studies on a variety of MHC associated disorders.

4) Adopt new technology in revealing SVs of MHC region as well as in whole genome.

Page 23: MHC region and its related disease study

23

3 Results 3.1 Novel method for MHC region study

3.1.1 Paper I: High throughput HLA typing Method Sanger based PCR-SBT is the golden standard for genotyping but has been underused because of its low-throughput, labor intensive and high cost of sequencing. NGS methods provide high throughput and inexpensive sequencing, holding great promise for applying to PCR-SBT. However, the short read length is a major challenge. HLA gene typing is one of the most important applications of genotyping as high-resolution donor-recipient HLA matching is essential for the success of unrelated donor hematopoietic stem cell transplantation (HSCT). However the extremely polymorphic and complex of this region poses a great challenge for obtaining high accurate and resolution result. Given this situation, we developed a novel method RCHSBT (Fig. 5), a NGS based PCR-SBT assays that takes the advantage of massively parallel high-throughput and paired-end sequencing, which provides accurate HLA gene typing result using assembled haploid sequences. Utilizing this method, amplicons with length as long as Sanger sequencing reads can be readily genotyped.

Figure 5. Outline of RCHSBT procedures. In order to develop a high throughput, high accurate and low cost HLA typing methods, we adapted steps as described below:

1) First, each amplicon was amplified with Index PCR. Then, amplicons from different sample were pooled together and send for NGS sequencing. By using the index information within each sequenced read, it enables us to group the sequenced reads into each sample’ corresponding amplicon.

Page 24: MHC region and its related disease study

24

Figure 6. Amplicon information of each target gene. (A) For HLA-A,-B and -C, exon 1 to exon 7 were amplified by four reactions. For DQB1, exon 2 and exon 3 were amplified together. For DRB1, Only part of exon 2 was amplified. Q stands for DQB1, R stands for DRB1; (B) Agarose gel showing amplicons from PCR. The amplicons ranged between 250 bp and 1,000 bp in length. Amplicon Q2 and Q3 were loaded in the same lane.

2) After align the sequenced reads to corresponding reference, we adopted a Maximum Allowable Mismatches (MAM) cutoff that used for filtering out the aligned low-quality reads to get accurate variants of each amplicon (Table 4). 3) Detected heterozygous variants were phased into haplotype by using reads’ pair-end information. 4) Re-construction of each amplicon’s two haplotype sequences. Sequenced reads were classified into two groups according to previous phased heterozygous markers and assembled into consensus sequences.

Page 25: MHC region and its related disease study

25

Table 4. Variants calling accuracy statistic

Stage Gene Sanger variants NO.

True Positive (TP) NO. (%)

False Positive (FP) NO.(%)

False Negative (FN) NO.(%)

Total variants NO.

Raw A 4100 4086 (99.51) 20 (0.49) 0 (0.00) 4106

Filtered A 4100 4092 (99.71) 12 (0.29) 0 (0.00) 4104

Phasing A 4100 4090 (99.68) 10 (0.24) 3 (0.07) 4103 Raw B 4350 4513(97.73)/4339 (98.75)* 55 (1.19)/55 (1.25)* 50 (1.08)/0 (0.00)* 4618/4394* Filtered B 4350 4471 (97.75)/4350 (100.00)* 0 (0.00)/0 (0.00)* 103 (2.25)/0 (0.00)* 4574/4350* Phasing B 4350 4475(97.84)/4350(100%)* 0 (0.00)/0 (0.00)* 99 (2.16)/0 (0.00)* 4574/4350* Raw C 2822 2786 (73.43) 1008 (26.57) 0 (0.00) 3794

Filtered C 2822 2809 (99.43) 5 (0.18) 11 (0.39) 2825

Phasing C 2822 2819 (99.79) 5 (0.18) 1 (0.04) 2825

Raw DRB1 2427 2405 (98.28) 42 (1.72) 0 (0.00) 2447

Filtered DRB1 2427 2406 (98.45) 38 (1.55) 0 (0.00) 2444

Phasing DRB1 2427 2406 (98.36) 40 (1.64) 0 (0.00) 2446

Raw DQB1 3973 3960 (98.98) 41 (1.02) 0 (0.00) 4001

Filtered DQB1 3973 3972 (99.60) 16 (0.40) 0 (0.00) 3988

Phasing DQB1 3973 3973 (100.00) 0 (0.00) 0 (0.00) 3973

Variants calling accuracy of most frequently studied HLA exons of HLA-A,-B,-C,DQB1 and DRB1 by RCHSBT was evaluated before MAM filtering(raw), after MAM filtering(filtered) and after haploid assembly(phasing). 5) HLA gene typing by using haplotype sequence information. All in all, the key to provide accurate HLA gene typing result is by taking the advantage of advanced bioinformatics analysis method to construct each genotyped site’s two-haplotype sequences using the pair-end information from NGS sequencing data. It fully takes the advantage of high throughput NGS sequencing technology which enable ultra-high sequence depth and high polymorphic of HLA genes to phase the haplotype sequence.

3.1.2 Paper II: Integrated tools to study whole MHC region Although previous studies have broaden the database of variations in MHC regions, detecting the variations in human MHC region remains a challenge due to the relatively high cost for whole genome sequencing or PCR-based Sanger sequencing. High throughput NGS technologies have decreased the cost for genome sequencing dramatically. But, sequencing a genome with high coverage (~30-fold, about 100 Gb in total) would still costs several thousands dollars. In addition to the cost for sequencing, analyzing and storage of these vast amounts sequencing data would place a substantial burden on a research center’s bioinformatics infrastructure. For researchers who only interesting in MHC region or HLA gene’s typing information only, targeted sequencing stands out as a promising approach. Target capture sequencing mainly used to enrich and sequence the interested region, such as exons of genes or the whole exome of entire human genome [68-69]. This strategy could be a cost-effective solution for analyzing MHC region as well. However, It was thought to be difficult to enrichment of large and high variable genomic region, as there exists ubiquitous repetitive sequence and diverse between different haplotypes. In 2011, Pröll et al [70] used microarray capture together with 454 sequencing to analyze the 3.5 million base pair MHC region of an acute myeloid leukemia patient who underwent unrelated HLA-matched allogeneic stem cell transplantation. It proved the feasibility of capture-based method in studying genomic region as complex as MHC. However, the low capture efficiency hinders it from routine usage.

Page 26: MHC region and its related disease study

26

Given this situation, taking the usage of the eight different MHC haplotype sequences in the hg19, together with NimbleGen, we sophisticated design probes that could efficiently capture the whole MHC region (Fig. 7).

Figure 7. Variant and probe density of MHC. The figure in middle indicates the variant density along MHC. In the figure on the bottom, a short black vertical line stands for a probe. Additionally, we align all the probes to the eight haplotypes of MHC and find all the regions with higher depth have more probes than other regions. Based on the captured data, not only we could provide comprehensive variants information of the covered region, but also, can provide the typing result of the 26 most polymorphism HLA genes that recorded in the IMGT/HLA database. These enable researchers to study the relationship between genetic variants with relevant traits [8] at different levels and fine-map the disease causal variant (Fig. 8).

Page 27: MHC region and its related disease study

27

Figure 8. Three-dimensional ribbon models for the HLA-DR, HLA-B and HLA-DP proteins that related with seropositive Rheumatoid arthritis [56].

3.2 Construction Chinese MHC database It has been shown that MHC haplotypes, polymorphisms and recombination rates are highly divergent among individuals and populations [59], [71]. However, our knowledge of sequence features and diseases/traits associations in MHC region was mainly based on European population. To better understand the genomic feature and to fine mapping disease associated loci in MHC region, constructing a reference database based on each studied population is essential. Using our developed MHC capture array [72], we set out to deep sequence the entire MHC region which close to 5 Mb, from upstream of HLA-A to downstream of HLA-DP, in 9,906 normal Chinese donors in order to provide a population wide database (Table 5), which including SNP (Fig. 9), Indel, HLA gene type (Fig. 10) and MHC haplotype, that will serve as a basis for studies on a variety of MHC associated disorders. Table 5 Basic data statistics. There are 9,906 samples across MHC 5M region. For example the number 1.35 in the formula “1.46±0.38” represents the mean value while the 0.38 represents the standard deviation. The average depth of MHC is about 55X and the coverage of MHC is about 94.67%. It’s adequate for us to do a credible result. Capture sequencing data #of individuals 9906 Raw bases(G) 1.46±0.38

Mapped bases(G) 1.16±0.27

Mapped bases on target(G) 0.63±0.14

Capture specificity (%) 55±6.14

Average sequence depth(fold) 55.15±12.15

Fraction of target covered >= 1X (%) 94.67±0.58

Fraction of gene covered >= 1X (%) 97.98±0.12

Fraction of target covered >= 5X (%) 90.22±1.77

Fraction of gene covered >= 5X (%) 94.78±1.00

Fraction of target covered >= 10X (%) 84.51±2.66

Fraction of gene covered >= 10X (%) 90.52±5.88

©20

12 N

atur

e A

mer

ica,

Inc.

All

righ

ts r

eser

ved.

4 ADVANCE ONLINE PUBLICATION NATURE GENETICS

L E T T E R S

of the classical HLA-B*08 allele (P > 0.68). Similar to positions 11, 71 and 74 in HLA-DR 1, position 9 in HLA-B is also located in the peptide-binding groove (Fig. 4). Many of the previously described associations across the MHC, including markers in the TNF region, are in LD with Asp9 (ref. 10).

As previously observed associations of HLA-B*08 to autoimmune diseases, including to rheumatoid arthritis, have been attributed specifically to the long ancestral 8.1 haplotype, which contains HLA-B*08 on the HLA-DRB1*03 background9,11, we tested whether the HLA-B*08-Asp9 effect is common to all HLA-DRB1 backgrounds. Because HLA-B*08 and HLA-DRB1*03 are not in perfect LD and are both seen independent of the 8.1 hap-lotype, we were able to apply a conditional haplotype analysis to show that HLA-B*08-Asp9 increases disease risk roughly twofold regardless of the HLA-DRB1 background (Fig. 5). Therefore, this risk effect is not restricted to the 8.1 haplotype. Risk alleles for HLA-B and HLA-DRB1 contribute risk additively (on a log-odds scale) even though they are in strong (but incomplete) LD.

Conditioning on the effects of HLA-DRB1 and HLA-B, we observed the most signifi-cant association at HLA-DPB1 in the class II HLA region (P < 10−20; Fig. 1c), which cor-

responds to Phe9 in HLA-DP 1 (OR = 1.40 relative to His9 or Tyr9; Table 1, Fig. 3 and Supplementary Table 4). The Phe9 effect was significantly stronger than that of any two- or four-digit HLA-DPB1 classical allele; Phe9 is in LD with and is indistinguishable from the Val8 allele in HLA-DPB1. Amino acid position 9 is within the binding groove of HLA-DP (Fig. 4).

We observed no residual signals across the MHC after conditioning on the effects of HLA-DRB1, HLA-B*08 Asp9 in HLA-B and Phe9 in HLA-DP 1 (P > 3 × 10−6; Fig. 1d). We also did not observe any evi-dence of epistatic interactions between known risk loci13,20,21 and any of the HLA alleles described here (P > 0.0003; Supplementary Note).

Table 1 Effect estimates for the five amino acids associated with risk of rheumatoid arthritis

HLA-DR 1 amino acid at position MultivariateUnadjusted allele

frequency

11 13 71 74 OR 95% CI Controls Cases Classical HLA-DRB1 alleles

Val His Lys Ala 4.44 4.02–4.91 0.106 0.316 *04:01Val His or Phe Arg Ala 4.22 3.75–4.75 0.056 0.141 *04:08, *04:05, *04:04, *10:01Leu Phe Arg Ala 2.17 1.94–2.42 0.109 0.143 *01:02, *01:01Pro Arg Arg Ala 2.04 1.59–2.62 0.013 0.012 *16:01Val His Arg Glu 1.65 1.24–2.19 0.010 0.009 *04:03, *04:07Asp Phe Arg Glu 1.65 1.29–2.10 0.011 0.013 *09:01Val His Glu Ala 1.43 1.04–1.96 0.011 0.006 *04:02Ser Ser Lys Ala 1.04 0.76–1.41 0.012 0.006 *13:03Pro Arg Ala Ala 1.00 Reference 0.142 0.092 *15:01, *15:02Gly Tyr Arg Gln 0.91 0.80–1.03 0.133 0.064 *07:01Ser Ser or Gly Arg Ala 0.88 0.77–1.00 0.103 0.049 *11:01, *11:04, *12:01Ser Ser Arg Glu 0.84 0.67–1.05 0.025 0.012 *14:01Leu Phe Glu Ala 0.73 0.42–1.27 0.004 0.002 *01:03Ser Gly Arg Leu 0.71 0.57–0.89 0.028 0.013 *08:01, *08:04Ser Ser Lys Arg 0.63 0.54–0.73 0.128 0.083 *03:01Ser Ser Glu Ala 0.59 0.51–0.68 0.112 0.041 *11:02, *11:03, *13:01, *13:02

HLA-B amino acid at position 9 Classical HLA-B alleleAsp 2.12 1.89–2.38 0.118 0.130 *08His, Tyr 1.00 Reference 0.882 0.870 *07, *13, *14, *15, *18, *27, *35, *37, *38, *39, *40,

*41, *44, *45, *47, *49, *50, *51, *52, *53, *55, *56, *57, *58, *73

HLA-DP 1 amino acid at position 9 Classical HLA-DPB1 alleles

Phe 1.40 1.31–1.50 0.728 0.799 *02:01, *02:02, *04:01, *04:02, *05:01, *16:01, *19:01, *23:01

His, Tyr 1.00 Reference 0.272 0.201 *01:01, *03:01, *06:01, *09:01, *10:01, *11:01, *13:01, *14:01, *15:01, *17:01, *20:01

Estimated effects for haplotypes of HLA-DRB1, HLA-B and HLA-DPB1. Classical alleles of HLA-DRB1 are grouped based on the amino acid residues present at positions 11 (or 13), 71 and 74 within DR 1. The classical shared epitope alleles are shown in bold. For each haplotype, the multivariate effect is given as an odds ratio (OR), taking the most frequent haplotype (ProArgAlaAla) in the control samples as the reference (that is, giving that haplotype an OR of 1). All effects are conditional on Asp9 in HLA-B and Phe9 in HLA-DP 1. Unadjusted haplotype frequencies are given for cases and controls. HLA-DRB1 haplotypes in aggregate explain 9.7% of the phenotypic variance of rheu-matoid arthritis. The multivariate effect sizes, allele frequencies and classical alleles corresponding to Asp9 in HLA-B and Phe9 in HLA-DP 1 are also listed. 95% CI, 95% confidence interval.

13

HLA-DR

HLA-DR

HLA-DP

HLA-DP

HLA-B

119 9

7174

Figure 4 Three-dimensional ribbon models for the HLA-DR, HLA-B and HLA-DP proteins. These structures are based on Protein Data Bank entries 3pdo, 2bvp and 3lqz, respectively, with a direct view of the peptide-binding groove. Key amino acid positions identified by the association analysis are highlighted. This figure was prepared using UCSF Chimera25.

Page 28: MHC region and its related disease study

28

a

b

Figure 9. Heterozygosity and functional catalog of common SNPs in the MHC region. (a) The fraction of variants and their contribution to heterozygosity at different allele frequencies. 27% variants are relatively low frequent (MAF<0.1), but 85% of heterozygous sites were due to common variants (MAF>0.05). (b) The number of synonymous SNPs and nonsynonmous SNPs as the increase of minor allele frequency. The nsSNPs have greater diversity among individuals than sSNPs, especially in low frequency variants.

0  

20  

40  

60  

80  

100  

120  

140  

synonymous  

nonsynonymous  

0.1          0.15          0.2          0.25          0.3          0.35        0.4            0.45        0.5 Minor  allele  frequency

SNP  

numbe

r

Page 29: MHC region and its related disease study

29

a.

b.

c.

d.

e.

Figure 10. Allelic frequencies at five classic HLA sites. Allele frequency for HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DQB1 in Chinese population were given in the figure. Only the top 20 alleles at each locus were showed in diagram. The common alleles (Frequency >5%) were showed in crimson pillars, the low-frequency alleles were showed in light blue pillars and the rare alleles were showed in green pillars.

Page 30: MHC region and its related disease study

30

3.3 Fine-mapping disease related mutation One of the most important usages of the constructed database is fine mapping of disease causing mutations in the MHC region. For example, rs2281388 (located 5.1 kb downstream of HLA-DPB1) were previously reported to be associated with Graves’ disease in the Han Chinese population [73] but were in insufficient LD (r2 <0.8) with currently known nsSNP variants. However, using our database, rs2281388 shows a strong LD (r2=0.91) with the nearby nsSNP rs11551421, which alters a Valine to a Methionine in the fourth exon (HLA-DPB1:NM_002121:exon4:c.G700A:p.V215M) of HLA-DPB1. Rs2281388 was an almost perfect predictor of the DPB1*0501 allele (tagging LD=0.96), suggesting that nsSNP rs11551421 in the DPB1*0501 allele is a potential or even causal functional variant underlying Graves’ disease (Fig. 11).

Figure 11. Fine map disease causal variants using constructed database. The SNP rs2281388 was reported to be related with Grave disease in a previous study, based on our constructed databased, this snp were in high LD (r2>0.8) with a functional non-synonymous SNP in HLA-DPB1.

3.4 Paper III: Novel method and technology in genome study As the development of NGS sequencing technology, it’s much easier and faster to obtain information of a single [46-47], [53-54], [74] or population’s genome [66], [75-76] information than before [77-78]. The well-developed bioinformatics [79–85] could provide each genome’s variants accurately. However, due to the disadvantage of all currently NGS technology – short read length, most of the analyses were focus on the small variants including SNP, Indel as well as copy number variation (CNV) which main takes the usage of depth information. It has little power to detect and analysis the large SVs of human genome, especially for complex region like MHC, T cell receptor alpha/beta (TRA/TRB), Killer cell immunoglobulin-like receptor (KIR), and immunoglobulin heavy/light locus (IGH/IGL) region. Optical mapping and new genome mapping technologies based on nicking enzymes provide low resolution but long range genomic information and has been successfully used for assisting the genome de novo assembly [86] and for detecting large-scale structural variants and rearrangements

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●

●●

●●●●●●●●●●●●●●

●●●●●●

●●●●

●●

●●●

●●●●

●●

●●●●●●●●●●

●●●●●

●●●●●●

●●

●●●●●●

●●

●●●

●●●●●●

●●●●

●●●●

●●

●●

●●●●

●●●

●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●

●●●●●●●●●●●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●

●●●

●●●●●●

●●

●●●●●

●●●

●●●●●

●●●●●●●●●●●●●

●●●

●●

●●●●

●●

●●●●●

●●●●●●●●●●

●●

●●●

●●●

●●●●

●●

●●●●●●

●●●●●●

●●

●●●●●●●

●●●●●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●●●●●

●●

●●

●●

●●●●●●

●●

●●

●●●

●●●●●●

●●●

●●

●●

●●●

●●●

●●●

●●●

●●

●●

●●●

●●

●●●●

●●

●●●

●●●●●●

●●

●●

●●●

●●●

●●●●●●●

●●

●●●●

●●●●

●●

●●

●●

●●●

●●

●●●●●●●●

●●●

●●●

●●●●

●●●

●●

●●

●●

●●●

●●●

●●●

●●

●●

●●

●●

●●●●●●●

●●

●●●●●●

●●●●●●

●●

●●●●●●

●●●

●●

●●

●●●●●●●●

●●

●●●

●●

●●●●

●●●●

●●

●●

●●●●●●●●●

●●●●●●●●●●●

●●●●

●●●●

●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●

●●●●●●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●●

●●

●●●

●●

●●●

●●

●●

●●●

●●●●●●●●

●●

●●

●●●●●●●●●●●●●

●●

●●●●

●●●●●

●●

●●

●●

●●●●●●

●●●●●●●

●●●●

●●

●●●●●●●●●●

●●

●●

●●●●

●●●●●●

●●

●●●

●●●

●●●●●●●●●●●●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●●●●

●●

●●●

●●●

●●

●●●

●●

●●●●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●●●●●

●●●●

●●

●●●

●●

●●

●●

●●●●

●●●●●●●●●●

●●

●●●●

●●●

●●

●●●●●

●●●●●●●

●●

●●

●●●●

●●

●●●●

●●

●●●

●●●●

●●

●●

●●

●●●●●

●●

●●

●●●●●●●●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●●●●●

●●●●●●

●●

●●●●

●●

●●●●

●●●●●

●●

●●

●●●

●●●●●●●●

●●●●

●●●

●●●●

●●

●●●

●●●

●●●●

●●●●

●●

●●●

●●●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●●

●●

●●

●●●●●●

●●

●●

●●

●●

●●●●

●●●●

●●●

●●●

●●●●●

●●●●

●●

●●●

●●

●●

●●●●●

●●

●●

●●

●●

●●

●●●●

●●

●●●●

●●

●●

●●●●●

●●●●

●●

●●

●●●●

●●●

●●

●●●

●●

●●●●

●●

●●●●

●●●●

●●●●●●●●●

●●●●●●●●

●●

●●

●●

●●●●

●●●●●●

●●

●●●

●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●●●●

●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●

●●

●●

●●

●●●●●

●●

●●●●●●●●●

●●●●●●●●●

●●

●●●●●●

●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●

●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●

●●●●●

●●●●●

●●●●●

●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●

●●●●●●

●●●

●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●0.

00.2

0.4

0.6

0.8

1.0

1.2

Position

LD

33.1M 33.2M 33.3M 33.4M

HLA−DPB1 COL11A2

HLA−DPA1HLA−DPB2 RXRB

SLC39A7

rs2281388

rs11551421

r2 0.8 − 1r2 0.6 − 0.8r2 0.4 − 0.6r2 0.2 − 0.4r2 0 − 0.2

Page 31: MHC region and its related disease study

31

[67] that cannot be detected using the paired end information provide by current NGS sequencing methods. Thus, taking the advantage of a high throughput optical map technique – Irys platform that based on Nano-channel, we provided the comprehensive large SVs information in the human genome. Moreover, it also gave out valuable information for the most complex genomic regions (Fig. 12).

Figure 12. Consensus genome maps compared to hg19 in the MHC region [87]. The green bars represent the hg19 in silico motif map; the blue bars represent consensus genome maps. Large SVs can be seen in the RCCX, HLA-D and HLA-A regions. The Cox and PGF genome maps are shown below for the HLA-A region. HLA: human leukocyte antigen; RCCX: RP-C4-CYP21-TNX module.

Page 32: MHC region and its related disease study

32

4 Conclusions The demand of developing novel and advanced methods along with emerged new technology is the eternal motivation for genomic study as well as other fields. NGS methods, which is well known for its high throughput, and hence, low sequencing cost, has great potential utility in the sequence field ever since their emergence. However, the relatively shorter read length and higher sequence errors, compare to Sanger sequencing, are two major challenges before using them as a routine method. Due to the importance of human MHC to immune system, it's one of the most intensive studied genomic regions. However, due to the limitation of previous methods, there was a lot of room for improvement. SBT was the golden standard for HLA gene typing. Sanger based sequencing method, well known for its long read length and accuracy, has dominated this fields for decades. The High throughput NGS sequencing technologies have great potential of usage in genome analyze, especially in genotyping which has been dominated by Sanger sequencing method. However, the two major challenges NGS method need to be solved before it could be applied in genotyping. Based on sophisticate designed bioinformatics analysis method and making the usage of specific feature of HLA genes – extremely high polymorphism, we achieved nearly the same accuracy as Sanger sequencing based method for HLA genotyping. Situation was the same for the entail MHC and genome region analysis. Results of the studies showed above fully demonstrate the great advantage of new technology and advanced bioinformatics analysis method in genomic related analysis and applications.

Page 33: MHC region and its related disease study

33

5 Future Perspective Being able to obtain accurate genotyping information of HLA gene, MHC as well as the entail genome was just the beginning. The ultimate goal for genomic research is to reveal the relationship between genotype and phenotype, then utilize this information in personalized medicine, including disease diagnosis, organ transplantation guidance, disease risk prediction, particularly, for complex diseases. To reach this aim, a lot of work still needed to do in the future which includes, but not limited to, novel technology development, new bioinformatics method conception and new biological mechanism discovery.

Page 34: MHC region and its related disease study

34

6 References [1] The MHC sequencing consortium, “Complete sequence and gene map of a human major

histocompatibility complex,” Nature, vol. 401, no. October, pp. 921–923, 1999.

[2] J. Klein, A, Sato, “Advances in Immunology, The HLA system,” N Engl J Med, vol. 343, pp. 702-709, September. 2000.

[3] C. A. Stewart, R. Horton, R. J. N. Allcock, J. L. Ashurst, A. M. Atrazhev, P. Coggill, I. Dunham, S. Forbes, K. Halls, J. M. M. Howson, S. J. Humphray, S. Hunt, A. J. Mungall, K. Osoegawa, S. Palmer, A. N. Roberts, J. Rogers, S. Sims, Y. Wang, L. G. Wilming, J. F. Elliott, P. J. de Jong, S. Sawcer, J. a Todd, J. Trowsdale, and S. Beck, “Complete MHC haplotype sequencing for common disease gene mapping.,” Genome Res., vol. 14, no. 6, pp. 1176–87, Jun. 2004.

[4] J. a Traherne, R. Horton, A. N. Roberts, M. M. Miretti, M. E. Hurles, C. A. Stewart, J. L. Ashurst, A. M. Atrazhev, P. Coggill, S. Palmer, J. Almeida, S. Sims, L. G. Wilming, J. Rogers, P. J. de Jong, M. Carrington, J. F. Elliott, S. Sawcer, J. a Todd, J. Trowsdale, and S. Beck, “Genetic analysis of completely sequenced disease-associated MHC haplotypes identifies shuffling of segments in recent human history.,” PLoS Genet., vol. 2, no. 1, p. e9, Jan. 2006.

[5] R. Horton, R. Gibson, P. Coggill, M. Miretti, R. J. Allcock, J. Almeida, S. Forbes, J. G. R. Gilbert, K. Halls, J. L. Harrow, E. Hart, K. Howe, D. K. Jackson, S. Palmer, A. N. Roberts, S. Sims, C. A. Stewart, J. a Traherne, S. Trevanion, L. Wilming, J. Rogers, P. J. de Jong, J. F. Elliott, S. Sawcer, J. a Todd, J. Trowsdale, and S. Beck, “Variation analysis and gene annotation of eight MHC haplotypes: the MHC Haplotype Project.,” Immunogenetics, vol. 60, no. 1, pp. 1–18, Jan. 2008.

[6] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. a R. Ferreira, D. Bender, J. Maller, P. Sklar, P. I. W. de Bakker, M. J. Daly, and P. C. Sham, “PLINK: a tool set for whole-genome association and population-based linkage analyses.,” Am. J. Hum. Genet., vol. 81, no. 3, pp. 559–75, Sep. 2007.

[7] M. Kronenberg and A. Rudensky, “Regulation of immunity by self-reactive T cells.,” Nature, vol. 435, no. 7042, pp. 598–604, Jun. 2005.

[8] J. Trowsdale and J. C. Knight, “Major Histocompatibility Complex Genomics and Human Disease,” 2013.

[9] Terasaki PI, Deduction of the fraction of immunologic and non-immunologic failure in cadaver donor transplants, Clin Transpl. 2003:449052.

[10] J. M. Nogueira, A. Haririan, S. C. Jacobs, M. R. Weir, H. A. Hurley, H. S. Al-qudah, M. Phelan, C. B. Drachenberg, S. T. Bartlett, M. Cooper, and J. M. Nogueira, “The Detrimental Effect of Poor Early Graft Function After Laparoscopic Live Donor Nephrectomy on Graft Outcomes,” no. 53, pp. 337–347, 2009.

Page 35: MHC region and its related disease study

35

[11] M. Kamoun, J. H. Holmes, A. K. Israni, J. D. Kearns, V. Teal, W. Peter, S. E. Rosas, M. M. Joffe, H. Li, and H. I. Feldman, “HLA-A amino acid polymorphism and delayed kidney allograft function,” vol. 105, no. 48, pp. 18883–18888, 2008.

[12] L. D. Notarangelo, A. Fischer, R. S. Geha, J.-L. Casanova, H. Chapel, M. E. Conley, C. Cunningham-Rundles, A. Etzioni, L. Hammartröm, S. Nonoyama, H. D. Ochs, J. Puck, C. Roifman, R. Seger, and J. Wedgwood, “Primary immunodeficiencies: 2009 update.,” J. Allergy Clin. Immunol., vol. 124, no. 6, pp. 1161–78, Dec. 2009.

[13] Q. Pan-Hammarström and L. Hammarström, “Antibody deficiency diseases.,” Eur. J. Immunol., vol. 38, no. 2, pp. 327–33, Feb. 2008.

[14] R. A. Al-attas and A. H. S. Rahi, “Primary Antibody Deficiency in Arabs  : First Report from Eastern Saudi Arabia,” vol. 18, no. 5, pp. 1996–1999, 2000.

[15] Kanoh T, Mizumoto T, Yasuda N, Koya M, Ohno Y, Uchino H, Yoshimura K, Ohkubo Y, Yamaguchi H. Selective IgA deficiency in Japanese blood donors: frequency and statistical analysis. Vox Sang. 1986;50(2):81-6.

[16] R. C. Ferreira, Q. Pan-Hammarström, R. R. Graham, G. Fontán, A. T. Lee, W. Ortmann, N. Wang, E. Urcelay, M. Fernández-Arquero, C. Núñez, G. Jorgensen, B. R. Ludviksson, S. Koskinen, K. Haimila, L. Padyukov, P. K. Gregersen, L. Hammarström, and T. W. Behrens, “High-density SNP mapping of the HLA region identifies multiple independent susceptibility loci associated with selective IgA deficiency.,” PLoS Genet., vol. 8, no. 1, p. e1002476, Jan. 2012.

[17] C. Liu, S. Manzi, A. H. Kao, J. S. Navratil, M. J. Ruffing, and J. M. Ahearn, “Reticulocytes Bearing C4d as Biomarkers of Disease Activity for Systemic Lupus Erythematosus,” vol. 52, no. 10, pp. 3087–3099, 2005.

[18] S. D. G. A. D. O. La, H. T. M. E. Nc, J. T. Ale, W. L. G. R. Oss, V. C. E. R. U. N. D. Olo, and B. Bramstedt, “IMMUNODEFICIENCY REVIEW TAP deficiency syndrome,” 2000.

[19] R. Parisi, D. P. M. Symmons, C. E. M. Griffiths, and D. M. Ashcroft, “Global epidemiology of psoriasis: a systematic review of incidence and prevalence.,” J. Invest. Dermatol., vol. 133, no. 2, pp. 377–85, Mar. 2013.

[20] F. O. Nestle, D. H. Kaplan, J. Barker, “Mechanisms of disease, Psoriasis,” N Engl J Med, vol. 361, pp. 496–509, 2009.

[21] D. L. Scott, F. Wolfe, and T. W. J. Huizinga, “Rheumatoid arthritis.,” Lancet, vol. 376, no. 9746, pp. 1094–108, Sep. 2010.

[22] J. E. Oliver and A. J. Silman, “Risk factors for the development of rheumatoid arthritis,” no. 13, pp. 169–174, 2006.

[23] B. S. Prabhakar, R. S. Bahn, and T. J. Smith, “Current perspective on the pathogenesis of Graves’ disease and ophthalmopathy.,” Endocr. Rev., vol. 24, no. 6, pp. 802–35, Dec. 2003.

Page 36: MHC region and its related disease study

36

[24] E. M. Jacobson, A. Huber, and Y. Tomer, “The HLA gene complex in thyroid autoimmunity: from epidemiology to etiology.,” J. Autoimmun., vol. 30, no. 1–2, pp. 58–62, 2008.

[25] D. Burgner, S. E. Jamieson, and J. M. Blackwell, “Genetic susceptibility to infectious diseases  : big is beautiful , but will bigger be even better  ?,” vol. 6, no. October, pp. 653–663, 2006.

[26] S. Applications and I. C. Frederick, “Effect of a single amino acid change in MHC class I molecules on the rate of progression to AIDS,” vol. 344, no. 22, pp. 1668–1675, 2001.

[27] R. Singh, R. Kaul, A. Kaul, and K. Khan, “A comparative review of HLA associations with hepatitis B and C viral infections across global populations,” vol. 13, no. 12, pp. 1770–1787, 2007.

[28] M. R. Thursz, H. C. Thomas, B. M. Greenwood, A. V.S. Hill, “Heterozygote advantage for HLA class-II type in hepatitis B virus infection.” Nat. Genet, vol. 17, pp. 11-12, 1997

[29] R. Küppers, “The biology of Hodgkin’s lymphoma.,” Nat. Rev. Cancer, vol. 9, no. 1, pp. 15–27, Jan. 2009.

[30] L. a Needleman and a K. McAllister, “The major histocompatibility complex and autism spectrum disorder.,” Dev. Neurobiol., vol. 72, no. 10, pp. 1288–301, Oct. 2012.

[31] F. Listì, G. Candore, C. R. Balistreri, M. P. Grimaldi, V. Orlando, S. Vasto, G. Colonna-Romano, D. Lio, F. Licastro, C. Franceschi, and C. Caruso, “Association between the HLA-A2 allele and Alzheimer disease.,” Rejuvenation Res., vol. 9, no. 1, pp. 99–101, Jan. 2006.

[32] Y. Guo, X. Deng, W. Zheng, H. Xu, Z. Song, H. Liang, J. Lei, X. Jiang, Z. Luo, and H. Deng, “HLA rs3129882 variant in Chinese Han patients with late-onset sporadic Parkinson disease.,” Neurosci. Lett., vol. 501, no. 3, pp. 185–7, Sep. 2011.

[33] M. Debnath, D. M. Cannon, and G. Venkatasubramanian, “Variation in the major histocompatibility complex [MHC] gene family in schizophrenia: associations and functional implications.,” Prog. Neuropsychopharmacol. Biol. Psychiatry, vol. 42, pp. 49–62, Apr. 2013.

[34] G. J. Arason, J. Kramer, C. Szalai, G. Fu, Y. Yang, E. K. Chung, B. Zhou, C. A. Blanchong, M. Lokki, S. Bo, G. Thorgeirsson, C. Y. Yu, and M. Kova, “Genetic basis of tobacco smoking  : strong association of a specific major histocompatibility complex haplotype on chromosome 6 with smoking behavior,” vol. 1940, no. May, 2004.

[35] P. T. Illing, J. P. Vivian, N. L. Dudek, L. Kostenko, Z. Chen, M. Bharadwaj, J. J. Miles, L. Kjer-Nielsen, S. Gras, N. a Williamson, S. R. Burrows, A. W. Purcell, J. Rossjohn, and J. McCluskey, “Immune self-reactivity triggered by drug-modified HLA-peptide repertoire.,” Nature, vol. 486, no. 7404, pp. 554–8, Jun. 2012.

[36] J. J. Farrell, D. Kasperavi, D. Ph, M. Carrington, M. Molokhia, B. Ch, M. R. Johnson, D. Phil, E. L. Heinzen, N. Walley, M. Pandolfo, W. Pichler, B. K. Park, C. Depondt, S. M.

Page 37: MHC region and its related disease study

37

Sisodiya, and D. B. Goldstein, “HLA-A*3101 and Carbamazepine-Induced Hypersensitivity Reactions in Europeans,” 2011.

[37] C. Cao and P. Donnelly, “Is Mate Choice in Humans MHC-Dependent  ?,” vol. 4, no. 9, pp. 1–5, 2008.

[38] M. L. Metzker, “Sequencing technologies — the next generation,” Nat. Rev. Genet., vol. 11, no. 1, pp. 31–46, 2009.

[39] E. R. Mardis, “Next-Generation DNA Sequencing Methods,” 2008.

[40] J. Shendure and H. Ji, “Next-generation DNA sequencing,” vol. 26, no. 10, pp. 1135–1145, 2008.

[41] K. V Voelkerding, S. A. Dames, and J. D. Durtschi, “Next-Generation Sequencing  : From Basic Research to Diagnostics CONTENT  : SUMMARY  :,” vol. 658, pp. 641–658, 2009.

[42] N. J. Loman, R. V Misra, T. J. Dallman, C. Constantinidou, S. E. Gharbia, J. Wain, and M. J. Pallen, “Performance comparison of benchtop high-throughput sequencing platforms,” vol. 30, no. 5, 2012.

[43] M. A. Quail, M. Smith, P. Coupland, T. D. Otto, S. R. Harris, T. R. Connor, A. Bertoni, H. P. Swerdlow, and Y. Gu, “A tale of three next generation sequencing platforms  : comparison of Ion Torrent , Pacific Biosciences and Illumina MiSeq sequencers,” BMC Genomics, vol. 13, no. 1, p. 1, 2012.

[44] B. Division, C. Biology, and S. Cruz, “Characterization of individual polynucleotide molecules using a membrane channel,” vol. 93, no. November, pp. 13770–13773, 1996.

[45] E. C. Hayden, "Technology: The $ 1,000 genome," Nature, vol. 507, pp. 294-295, March. 2014.

[46] D. A. Wheeler, M. Srinivasan, M. Egholm, Y. Shen, L. Chen, A. Mcguire, W. He, Y. Chen, V. Makhijani, G. T. Roth, X. Gomes, K. Tartaro, F. Niazi, C. L. Turcotte, G. P. Irzyk, J. R. Lupski, C. Chinault, X. Song, Y. Liu, Y. Yuan, L. Nazareth, X. Qin, D. M. Muzny, M. Margulies, G. M. Weinstock, R. A. Gibbs, and J. M. Rothberg, “The complete genome of an individual by massively parallel DNA sequencing,” vol. 452, no. April, pp. 872–877, 2008.

[47] S. Fig and T. G. Analyzer, “Accurate whole human genome sequencing using reversible terminator chemistry,” vol. 456, no. November, 2008.

[48] D. Rank, P. Baybayan, B. Bettman, A. Bibillo, K. Bjornson, B. Chaudhuri, F. Christians, R. Cicero, S. Clark, R. Dalal, J. Dixon, M. Foquet, A. Gaertner, P. Hardenbol, C. Heiner, K. Hester, D. Holden, G. Kearns, X. Kong, R. Kuse, Y. Lacroix, S. Lin, P. Lundquist, C. Ma, P. Marks, M. Maxham, D. Murphy, I. Park, T. Pham, M. Phillips, J. Roy, R. Sebra, G. Shen, J. Sorenson, A. Tomaney, K. Travers, M. Trulson, J. Vieceli, J. Wegener, D. Wu, A. Yang, D. Zaccarin, P. Zhao, F. Zhong, J. Korlach, and S. Turner, “Single Polymerase Molecules,” no. January, pp. 133–138, 2009.

Page 38: MHC region and its related disease study

38

[49] R. Drmanac, K. P. Pant, J. Baccash, A. P. Borcherding, A. Brownley, R. Cedeno, L. Chen, D. Chernikoff, A. Cheung, R. Chirita, B. Curson, J. C. Ebert, C. R. Hacker, R. Hartlage, B. Hauser, S. Huang, Y. Jiang, V. Karpinchyk, M. Koenig, C. Kong, T. Landers, C. Le, J. Liu, C. E. Mcbride, M. Morenzoni, R. E. Morey, K. Mutch, H. Perazich, K. Perry, B. A. Peters, J. Peterson, C. L. Pethiyagoda, K. Pothuraju, C. Richter, A. M. Rosenbaum, S. Roy, J. Shafto, U. Sharanhovich, K. W. Shannon, C. G. Sheppy, M. Sun, J. V Thakuria, A. Tran, D. Vu, A. W. Zaranek, X. Wu, D. G. Ballinger, G. M. Church, and C. A. Reid, “Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays Radoje Drmanac,” vol. 78, no. 2009, 2010.

[50] P. Carnevali, J. Baccash, A. L. Halpern, I. Nazarenko, G. B. Nilsen, K. P. Pant, J. C. Ebert, A. Brownley, M. Morenzoni, V. Karpinchyk, B. Martin, D. G. Ballinger, and R. Drmanac, “Computational Techniques for Human Genome Resequencing Using Mated Gapped Reads,” vol. 19, no. 3, pp. 279–292, 2012.

[51] M. J. Levene, J. Korlach, S. W. Turner, M. Foquet, H. G. Craighead, and W. W. Webb, “Zero-mode waveguides for single-molecule analysis at high concentrations.,” Science, vol. 299, no. 5607, pp. 682–6, Jan. 2003.

[52] D. Branton, D. W. Deamer, A. Marziali, H. Bayley, S. A. Benner, T. Butler, M. Di Ventra, S. Garaj, A. Hibbs, X. Huang, S. B. Jovanovich, P. S. Krstic, S. Lindsay, X. S. Ling, C. H. Mastrangelo, A. Meller, J. S. Oliver, Y. V Pershin, J. M. Ramsey, R. Riehn, G. V Soni, V. Tabard-cossa, M. Wanunu, and M. Wiggin, “The potential and challenges of nanopore sequencing,” vol. 26, no. 10, pp. 1146–1153, 2008.

[53] J. Shendure, M. D. Wang, K. Zhang, R. D. Mitra, and G. M. Church, “Genome Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome,” vol. 1728, no. 2005, 2012.

[54] D. Pushkarev, N. F. Neff, and S. R. Quake, “Single-molecule sequencing of an individual human genome,” Nat. Biotechnol., vol. 27, no. 9, pp. 847–850, 2009.

[55] P. I. W. de Bakker and S. Raychaudhuri, “Interrogating the major histocompatibility complex with high-throughput genomics.,” Hum. Mol. Genet., vol. 21, no. R1, pp. R29–36, Oct. 2012.

[56] S. Raychaudhuri, C. Sandor, E. a Stahl, J. Freudenberg, H.-S. Lee, X. Jia, L. Alfredsson, L. Padyukov, L. Klareskog, J. Worthington, K. a Siminovitch, S.-C. Bae, R. M. Plenge, P. K. Gregersen, and P. I. W. de Bakker, “Five amino acids in three HLA proteins explain most of the association between MHC and seropositive rheumatoid arthritis.,” Nat. Genet., vol. 44, no. 3, pp. 291–6, Mar. 2012.

[57] N. a Patsopoulos, L. F. Barcellos, R. Q. Hintzen, C. Schaefer, C. M. van Duijn, J. a Noble, T. Raj, P.-A. Gourraud, B. E. Stranger, J. Oksenberg, T. Olsson, B. V Taylor, S. Sawcer, D. a Hafler, M. Carrington, P. L. De Jager, and P. I. W. de Bakker, “Fine-mapping the genetic association of the major histocompatibility complex in multiple sclerosis: HLA and non-HLA effects.,” PLoS Genet., vol. 9, no. 11, p. e1003926, Nov. 2013.

[58] Y. Sheng, X. Jin, J. Xu, J. Gao, X. Du, D. Duan, B. Li, J. Zhao, W. Zhan, H. Tang, X. Tang, Y. Li, H. Cheng, X. Zuo, J. Mei, F. Zhou, B. Liang, G. Chen, C. Shen, H. Cui, X. Zhang, C.

Page 39: MHC region and its related disease study

39

Zhang, W. Wang, X. Zheng, X. Fan, Z. Wang, F. Xiao, Y. Cui, Y. Li, J. Wang, and S. Yang, “new susceptibility loci for psoriasis,” Nat. Commun., vol. 5, pp. 1–6, 2014.

[59] P. I. W. de Bakker, G. McVean, P. C. Sabeti, M. M. Miretti, T. Green, J. Marchini, X. Ke, A. J. Monsuur, P. Whittaker, M. Delgado, J. Morrison, A. Richardson, E. C. Walsh, X. Gao, L. Galver, J. Hart, D. a Hafler, M. Pericak-Vance, J. a Todd, M. J. Daly, J. Trowsdale, C. Wijmenga, T. J. Vyse, S. Beck, S. S. Murray, M. Carrington, S. Gregory, P. Deloukas, and J. D. Rioux, “A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC.,” Nat. Genet., vol. 38, no. 10, pp. 1166–72, Oct. 2006.

[60] E. Tuzun, A. J. Sharp, J. A. Bailey, R. Kaul, V. A. Morrison, L. M. Pertz, E. Haugen, H. Hayden, D. Albertson, D. Pinkel, M. V Olson, and E. E. Eichler, “Fine-scale structural variation of the human genome,” vol. 37, no. 7, pp. 727–732, 2005.

[61] L. Feuk, A. R. Carson, and S. W. Scherer, “Structural variation in the human genome,” vol. 7, no. February, pp. 85–97, 2006.

[62] J. M. Kidd, G. M. Cooper, W. F. Donahue, H. S. Hayden, N. Sampas, T. Graves, N. Hansen, B. Teague, C. Alkan, F. Antonacci, E. Haugen, T. Zerr, N. A. Yamada, Z. Cheng, H. M. Ebling, N. Tusneem, R. David, P. Tsang, T. L. Newman, E. Tu, W. Gillett, K. A. Phelps, M. Weaver, D. Saranga, A. Brand, W. Tao, E. Gustafson, K. Mckernan, L. Chen, M. Malig, J. D. Smith, J. M. Korn, S. A. Mccarroll, D. A. Altshuler, D. A. Peiffer, M. Dorschner, J. Stamatoyannopoulos, D. Schwartz, D. A. Nickerson, J. C. Mullikin, R. K. Wilson, L. Bruhn, M. V Olson, R. Kaul, D. R. Smith, and E. E. Eichler, “Mapping and sequencing of structural variation from eight human genomes,” vol. 453, no. May, 2008.

[63] Y. Li, H. Zheng, R. Luo, H. Wu, H. Zhu, R. Li, H. Cao, B. Wu, S. Huang, H. Shao, H. Ma, F. Zhang, S. Feng, W. Zhang, H. Du, G. Tian, J. Li, X. Zhang, S. Li, L. Bolund, K. Kristiansen, A. J. De Smith, A. I. F. Blakemore, L. J. M. Coin, H. Yang, J. Wang, and J. Wang, “A n a ly s ly i s s Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly,” Nat. Biotechnol., vol. 29, no. 8, pp. 725–732, 2011.

[64] Z. Cheng, M. Ventura, X. She, P. Khaitovich, T. Graves, K. Osoegawa, M. Rocchi, E. E. Eichler, D. Church, P. Dejong, R. K. Wilson, and S. Pa, “ARTICLES A genome-wide comparison of recent chimpanzee and human segmental duplications,” vol. 437, no. September, 2005.

[65] J. R. Lupski, “Genomic rearrangements and sporadic disease,” vol. 39, no. July, pp. 43–47, 2007.

[66] The 1000 Genomes Project Consortium, “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, pp. 1061-1073, October, 2010.

[67] E. T. Lam, A. Hastie, C. Lin, D. Ehrlich, S. K. Das, M. D. Austin, P. Deshpande, H. Cao, N. Nagarajan, M. Xiao, and P.-Y. Kwok, “Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly.,” Nat. Biotechnol., vol. 30, no. 8, pp. 771–6, Aug. 2012.

Page 40: MHC region and its related disease study

40

[68] A. Gnirke, A. Melnikov, J. Maguire, P. Rogov, E. M. Leproust, W. Brockman, T. Fennell, G. Giannoukos, S. Fisher, C. Russ, S. Gabriel, D. B. Jaffe, E. S. Lander, and C. Nusbaum, “Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing,” vol. 27, no. 2, pp. 182–189, 2009.

[69] A. E. Shearer, A. P. Deluca, M. S. Hildebrand, K. R. Taylor, J. Gurrola, and S. Scherer, “Comprehensive genetic testing for hereditary hearing loss using massively parallel sequencing,” vol. 107, no. 49, pp. 21104–21109, 2010.

[70] M. A. Danzer, S. T. Stabentheiner, N. O. Niklas, C. H. Hackl, P. E. Hufnagl, C. H. Gu, H. A. Hauser, O. T. T. O. Krieger, and C. H. Gabriel, “Sequence Capture and Next Generation Resequencing of the MHC Region Highlights Potential Transplantation Determinants in HLA Identical Haematopoietic Stem Cell Transplantation,” no. May, pp. 201–210, 2011.

[71] P.-A. Gourraud, P. Khankhanian, N. Cereb, S. Y. Yang, M. Feolo, M. Maiers, J. D. Rioux, S. Hauser, and J. Oksenberg, “HLA diversity in the 1000 genomes dataset.,” PLoS One, vol. 9, no. 7, p. e97282, Jan. 2014.

[72] H. Cao, J. Wu, Y. Wang, H. Jiang, T. Zhang, X. Liu, Y. Xu, D. Liang, P. Gao, Y. Sun, B. Gifford, M. D’Ascenzo, X. Liu, L. C. a M. Tellier, F. Yang, X. Tong, D. Chen, J. Zheng, W. Li, T. Richmond, X. Xu, J. Wang, and Y. Li, “An Integrated Tool to Study MHC Region: Accurate SNV Detection and HLA Genes Typing in Human MHC Region Using Targeted High-Throughput Sequencing,” PLoS One, vol. 8, no. 7, p. e69388, Jan. 2013.

[73] X. Chu, C.-M. Pan, S.-X. Zhao, J. Liang, G.-Q. Gao, X.-M. Zhang, G.-Y. Yuan, C.-G. Li, L.-Q. Xue, M. Shen, W. Liu, F. Xie, S.-Y. Yang, H.-F. Wang, J.-Y. Shi, W.-W. Sun, W.-H. Du, C.-L. Zuo, J.-X. Shi, B.-L. Liu, C.-C. Guo, M. Zhan, Z.-H. Gu, X.-N. Zhang, F. Sun, Z.-Q. Wang, Z.-Y. Song, C.-Y. Zou, W.-H. Sun, T. Guo, H.-M. Cao, J.-H. Ma, B. Han, P. Li, H. Jiang, Q.-H. Huang, L. Liang, L.-B. Liu, G. Chen, Q. Su, Y.-D. Peng, J.-J. Zhao, G. Ning, Z. Chen, J.-L. Chen, S.-J. Chen, W. Huang, and H.-D. Song, “A genome-wide association study identifies two new risk loci for Graves’ disease.,” Nat. Genet., vol. 43, no. 9, pp. 897–901, Sep. 2011.

[74] J. Wang, W. Wang, R. Li, Y. Li, G. Tian, L. Goodman, W. Fan, J. Zhang, J. Li, J. Zhang, Y. Guo, B. Feng, H. Li, Y. Lu, X. Fang, H. Liang, Z. Du, D. Li, Y. Zhao, Y. Hu, Z. Yang, H. Zheng, I. Hellmann, M. Inouye, J. Pool, X. Yi, J. Zhao, J. Duan, Y. Zhou, J. Qin, L. Ma, G. Li, Z. Yang, G. Zhang, B. Yang, C. Yu, F. Liang, W. Li, S. Li, D. Li, P. Ni, J. Ruan, Q. Li, H. Zhu, D. Liu, Z. Lu, N. Li, G. Guo, J. Zhang, J. Ye, L. Fang, Q. Hao, Q. Chen, Y. Liang, Y. Su, a San, C. Ping, S. Yang, F. Chen, L. Li, K. Zhou, H. Zheng, Y. Ren, L. Yang, Y. Gao, G. Yang, Z. Li, X. Feng, K. Kristiansen, G. K.-S. Wong, R. Nielsen, R. Durbin, L. Bolund, X. Zhang, S. Li, H. Yang, and J. Wang, “The diploid genome sequence of an Asian individual.,” Nature, vol. 456, no. 7218, pp. 60–5, Nov. 2008.

[75] The Genome of the Netherlands Consortium, “Whole-genome sequence variation, population structure and demographic history of the Dutch population.,” Nat. Genet., Jun. 2014.

[76] J. Kaye, M. Hurles, H. Griffin, J. Grewal, M. Bobrow, N. Timpson, C. Smee, P. Bolton, R. Durbin, S. Dyke, D. Fitzpatrick, K. Kennedy, A. Kent, D. Muddyman, F. Muntoni, L. F.

Page 41: MHC region and its related disease study

41

Raymond, R. Semple, and T. Spector, “Managing clinically significant findings in research: the UK10K example.,” Eur. J. Hum. Genet., vol. 22, no. 9, pp. 1100–4, Sep. 2014.

[77] International Human Genome Sequencing Consortium, “Initial sequencing and analysis of the human genome,” Nature, vol. 409, no. February, 2001.

[78] International Human Genome Sequencing Consortium, “Finishing the euchromatic sequence of the human genome,” Nature, pp. 931–945, 2004.

[79] R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignment program.,” Bioinformatics, vol. 24, no. 5, pp. 713–4, Mar. 2008.

[80] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, and M. a DePristo, “The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.,” Genome Res., vol. 20, no. 9, pp. 1297–303, Sep. 2010.

[81] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform.,” Bioinformatics, vol. 25, no. 14, pp. 1754–60, Jul. 2009.

[82] R. Li, Y. Li, X. Fang, H. Yang, J. Wang, K. Kristiansen, and J. Wang, “SNP detection for massively parallel whole-genome resequencing.,” Genome Res., vol. 19, no. 6, pp. 1124–32, Jun. 2009.

[83] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, “The Sequence Alignment/Map format and SAMtools.,” Bioinformatics, vol. 25, no. 16, pp. 2078–9, Aug. 2009.

[84] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads.,” Bioinformatics, vol. 25, no. 21, pp. 2865–71, Nov. 2009.

[85] K. Chen, J. W. Wallis, M. D. McLellan, D. E. Larson, J. M. Kalicki, C. S. Pohl, S. D. McGrath, M. C. Wendl, Q. Zhang, D. P. Locke, X. Shi, R. S. Fulton, T. J. Ley, R. K. Wilson, L. Ding, and E. R. Mardis, “BreakDancer: an algorithm for high-resolution mapping of genomic structural variation.,” Nat. Methods, vol. 6, no. 9, pp. 677–81, Sep. 2009.

[86] A. R. Hastie, L. Dong, A. Smith, J. Finklestein, E. T. Lam, N. Huo, H. Cao, P.-Y. Kwok, K. R. Deal, J. Dvorak, M.-C. Luo, Y. Gu, and M. Xiao, “Rapid genome mapping in nanochannel arrays for highly complete and accurate de novo sequence assembly of the complex Aegilops tauschii genome.,” PLoS One, vol. 8, no. 2, p. e55864, Jan. 2013.

[87] H. Cao, A. R. Hastie, D. Cao, E. T. Lam, Y. Sun, H. Huang, X. Liu, L. Lin, W. Andrews, S. Chan, S. Huang, X. Tong, M. Requa, T. Anantharaman, A. Krogh, H. Yang, H. Cao, and X. Xu, “Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology,” vol. 3, no. 1, pp. 1–11, 2014.

Page 42: MHC region and its related disease study

42

7 Appendix

I. Hongzhi Cao et, al. A Short-Read Multiplex Sequencing Method for Reliable, Cost-Effective and

High-Throughput Genotyping in Large-Scale Studies. Hum Mutat. 2013 Dec; 34(12): 1715-20. Doi:

10.1002/humu.22439..

II. Hongzhi Cao et, al. An integrated Tool to study MHC Region: Accurate SNV Detection and HLA

Genes Typing in Human MHC Region Using Targeted High-Throughput Sequencing. PLoS ONE,

2013. 8(7): e69388: doi:10.1371/journal.pone.0069388.

III. Hongzhi Cao et, al. Rapid detection of structural variation in a human genome using nanochannel-

based genome mapping technology. GigaScience 2014, 3:34

Page 43: MHC region and its related disease study

43

Paper I:

A Short-Read Multiplex Sequencing Method for Reliable, Cost-

Effective and High-Throughput Genotyping in Large-Scale Studies

By Hongzhi Cao1,2, Yu Wang1,3, Wei Zhang1, Xianghua Chai1, Xiandong Zhang1 ,Shiping Chen1, Fan Yang1, Caifen Zhang1, Yulai Guo1, Ying Liu1, Zhoubiao Tang1, Caifen Chen1, Yaxin Xue1, Hefu Zhen1, Yinyin Xu1, Bin Rao1, Tao Liu1, Meiru Zhao1, Wenwei Zhang1, Yingrui Li1, Xiuqing Zhang1, Laurent C. A. M. Tellier1,2, Anders Krogh2, Karsten Kristiansen2, Jun Wang1,2,4,5 and Jian Li1,2 1BGI-Shenzhen, Shenzhen, China; 2Department of Biology, University of Copenhagen, Copenhagen, Denmark; 3School of Bioscience and Biotechnology, South China University of Technology, Guangzhou, China; 4King Abdulaziz University, Jeddah, Saudi Arabia; 5The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark Publication details Published in Human mutation 35(2):270 (2014)

Page 44: MHC region and its related disease study

METHODSOFFICIAL JOURNAL

www.hgvs.org

A Short-Read Multiplex Sequencing Method for Reliable,Cost-Effective and High-Throughput Genotyping inLarge-Scale Studies

Hongzhi Cao,1,2 ∗ † Yu Wang,1,3 † Wei Zhang,1 † Xianghua Chai,1 † Xiandong Zhang,1 † Shiping Chen,1 Fan Yang,1 Caifen Zhang,1

Yulai Guo,1 Ying Liu,1 Zhoubiao Tang,1 Caifen Chen,1 Yaxin Xue,1 Hefu Zhen,1 Yinyin Xu,1 Bin Rao,1 Tao Liu,1 Meiru Zhao,1

Wenwei Zhang,1 Yingrui Li,1 Xiuqing Zhang,1 Laurent C. A. M. Tellier,1,2 Anders Krough,2 Karsten Kristiansen,2

Jun Wang,1,2,4,5 and Jian Li1,2 ∗ †

1BGI-Shenzhen, Shenzhen, China; 2Department of Biology, University of Copenhagen, Copenhagen, Denmark; 3School of Bioscience andBiotechnology, South China University of Technology, Guangzhou, China; 4King Abdulaziz University, Jeddah, Saudi Arabia; 5The Novo NordiskFoundation Center for Basic Metabolic Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark

Communicated by Ian N. M. DayReceived 17 April 2013; accepted revised manuscript 29 August 2013.Published online 6 September 2013 in Wiley Online Library (www.wiley.com/humanmutation). DOI: 10.1002/humu.22439

ABSTRACT: Accurate genotyping is important for genetictesting. Sanger sequencing-based typing is the gold stan-dard for genotyping, but it has been underused, due toits high cost and low throughput. In contrast, short-readsequencing provides inexpensive and high-throughput se-quencing, holding great promise for reaching the goal ofcost-effective and high-throughput genotyping. However,the short-read length and the paucity of appropriate geno-typing methods, pose a major challenge. Here, we presentRCHSBT—reliable, cost-effective and high-throughputsequence based typing pipeline—which takes short se-quence reads as input, but uses a unique variant call-ing, haploid sequence assembling algorithm, can accuratelygenotype with greater effective length per amplicon thaneven Sanger sequencing reads. The RCHSBT method wastested for the human MHC loci HLA-A, HLA-B, HLA-C,HLA-DQB1, and HLA-DRB1, upon 96 samples using Il-lumina PE 150 reads. Amplicons as long as 950 bp werereadily genotyped, achieving 100% typing concordancebetween RCHSBT-called genotypes and genotypes pre-viously called by Sanger sequence. Genotyping through-put was increased over 10 times, and cost was reducedover five times, for RCHSBT as compared with Sangersequence genotyping. We thus demonstrate RCHSBT tobe a genotyping method comparable to Sanger sequencing-based typing in quality, while being more cost-effective,and higher throughput.

Additional Supporting Information may be found in the online version of this article.†These authors contributed equally to this work.∗Correspondence to: Jian Li, BGI-Shenzhen. Main Building, Beishan Industrial Zone,

Yantian District, Shenzhen 518083, China. E-mail: [email protected]; Hongzhi Cao, BGI-

Shenzhen. Main Building, Beishan Industrial Zone, Yantian District, Shenzhen 518083,

China. E-mail: [email protected]

Contract grant sponsors: National High Technology Research and Development

Program of China – 863 Program (2012AA02A201); The Guangdong Enterprise Key

Laboratory of Human Disease Genomics, ShenZhen Engineering Laboratory for Clinical

Molecular Diagnostic

Hum Mutat 00:1–6, 2013. C© 2013 Wiley Periodicals, Inc.

KEY WORDS: short-read next-generation sequencing; high-throughput; genotyping; HLA

IntroductionGenetic testing plays an important role in physician decision

making [Carrington and O’Brien, 2003; Napolitano et al., 2005],and population genetic screening for common inherited single-gene disorders is an economical and effective way to reduce theincidence of disease [Bell et al., 2011; Castellani et al., 2009; Liaoet al., 2005]. One of the most common types of genetic testingis genotyping. A variety of genotyping tests are available, includ-ing Restriction Fragment Length Polymorphism, PCR-sequence-specific oligonucleotides [Ng et al., 1993], PCR-sequence-specificprimers [Natio et al., 2001], reverse dot-blot [Gravitt et al., 1998],and PCR-sequencing based typing (PCR-SBT) [Bettinotti et al.,1997]. Among these methods, PCR-SBT based on Sanger sequenc-ing offers the highest resolution, and is the accepted genotypingreference method [Danzer et al., 2007, 2008].

Sanger sequencing, well known for its accuracy and long-readlength (800–1000bp), has dominated high-quality sequencing forabout 30 years since its introduction as the gold standard of geno-typing. But high cost and low throughput has made it underused,particularly for large-scale genotyping projects. Currently popu-lar small-size next-generation sequencing (NGS) technology, com-prising massively parallel single-molecule sequencing, can gener-ate billions of bases from amplified single DNA molecules withindays, holding great promise for achieving cost-effective and high-throughput genotyping. However, short-read length and error-prone reading, are significant weaknesses of the method [Metzker,2009; Voelkerding et al., 2009].

454 sequencing—a sequencing technique developed by RocheNimblegen with the advantage of longer reads—was the first NGSused for genotyping. Several groups have reported successful use of454 sequencing for HLA genotyping [Bentley et al., 2009; Danzeret al., 2013; Gabriel et al., 2009; Galan et al., 2010; Holcombet al., 2011; Lank et al., 2010]. Similar in strategy, they combine

C© 2013 WILEY PERIODICALS, INC.

Page 45: MHC region and its related disease study

multiplex identifier sequence (MIDs) tagged PCR with direct paral-lel deep sequencing of the pooled amplicons. Genotypes are calledby comparing sorted sequencing data to allele sequences in the genedatabase. Deep sequencing minimizes the importance of randomsequencing errors, but this is still not much of an improvement overSanger SBT in cost and throughput. At the same time, the ampliconsinvolved are restricted to having a length within the maximum-readlength of the sequencing machine, which increases the complexityand number of routes to failure in the method.

Few reports about the success or failure of short-read NGS basedgenotyping methods have been published, though many short-readNGS platforms, such as Illumina sequencing, have lower sequencingcosts and much higher sequencing throughput compared with 454sequencing [Bamshad et al., 2011]. The short sequence-read length,and the lack of a coherent genotyping methodology, is a majorchallenge. Currently, most short-read NGS methods have a readlength of no more than 150 bp (sequence with paired-end read,150 × 2), so applying these methods to genotyping with the strategymentioned above can only involve amplicons shorter than 300 bp,otherwise many ambiguous genotypes will be obtained. For targetDNA fragments longer than 300 bp, multiple iterations of PCR willbe needed to control the amplicon length, increasing the complexityand the cost of the test.

No previously reported NGS-based genotyping method has sofar showed significant improvement in cost and throughput, whencompared with Sanger SBT.

It is for this reason that we have developed RCHSBT: an NGS-based genotyping method that uses a massively parallel paired-end

(PE) sequencing, effective variant phasing, and haploid sequence as-sembling pipeline. To rigorously assess its performance, we tested itfor HLA-A (MIM #142800), HLA-B (MIM #142830), HLA-C (MIM#142840), HLA-DQB1 (MIM #604305), and HLA-DRB1 (MIM#142857), upon 96 previously diagnosed samples using IlluminaPE 150 reads. We demonstrate RCHSBT to be a genotyping methodcomparable to Sanger sequencing based typing in quality, whilebeing more cost effective, and higher throughput.

Material and Methods

Samples

A total of 96 purified DNA samples with known HLA types fromvolunteers were selected. Written consent forms were collected fromall volunteers who were involved in this study. These samples hadbeen extensively characterized by a variety of HLA testing methods,including Sanger SBT. The DNA sample information is listed inSupp.Table S1.

RCHSBT follows four steps (Fig. 1A and B):

(1) Amplification, shearing, and sequencing15 HLA Class I and Class II exonic regions were amplifiedby tag PCR (Supp. Fig. S1a). In this assay, both forward andreverse PCR primers respectively contain an identifier 14 PCRamplifications were performed for each of the 96 samples (Supp.Fig. S1b). PCR products of these 96 samples were then pooled

Figure 1. Schematic representation of RCHSBT pipeline. A: The illustration of RCHSBT. Amplicons with read length comparable to Sangersequencing reads, tagged with MIDs at both ends. They are pooled together and sheared, in pools with fragments ranging from NGS sequencingread length, to complete amplicon length, which are then recovered and sequenced by PE sequencing. Amplicons from different samples havedifferent MIDs, and sequencing data are grouped to the specific loci of individual samples. Consensus sequences of amplicons are assembledby mapping grouped reads to the reference sequence, and used for genotyping. B: An overview of the bioinformatics pipeline of RCHSBT. Afterexcluding low quality reads, sequencing reads are grouped to specific loci of individual samples. Grouped reads are mapped to the reference, andvariants are called. Variants are phased, and used for constructing haploid sequences. The genotypes of amplicons are obtained by comparinghaploid sequences to the gene database. The final genotype can then be obtained by logical deduction, as illustrated.

2 HUMAN MUTATION, Vol. 00, No. 00, 1–6, 2013

Page 46: MHC region and its related disease study

together. The pooled amplicons were fragmented, recoveredand used for library construction. The library was sequencedwith PE 150 bp reads on Illumina Miseq (See Supp. Methodsfor detail).

(2) GroupingDNA sequence reads of low quality were filtered out. Thecleaned PE reads were grouped to individual samples, basedon the MIDs, and grouped to the specific loci, by genomicprimer sequence.

(3) Mapping and haploid sequence assemblingThe MIDs and genomic primer sequence were trimmed beforegrouped PE reads were aligned locally to the reference sequence(Supp. Table S2) by BWA. Variants were called by using SAM-tools. The aligned PE reads with a number of mismatches largerthan its maximum allowable mismatches (MAM) cutoff (seeSupp. Methods for detail) was filtered. After filtering, the vari-ants were called again by SAMtools. The number of aligned PEreads supporting linkage of any two particular, heterozygousSNPs in the amplicon region was investigated (Fig. 2A,I). Allthe reliable blocks were integrated to phase the heterozygousSNPs of the amplicon. The alignments were sorted into twoparts (heterozygous alleles) or one part (homozygous allele),according to each given pair of complete SNP haploids. Consen-sus sequences were assembled for each part by using SAMtoolsin haploid mode. Pairs of haploid sequence were used for thegenotyping which ensued.

(4) GenotypingGenotypes were obtained by comparing the thus determinedhaploid sequences to allele sequences already described in thegene database (IMGT, http://www.ebi.ac.uk/imgt/hla/). Utiliz-ing RCHSBT, amplicons with the length of Sanger sequencingreads were readily genotyped by short PE sequencing.

Results

PCR Amplification and Miseq Sequencing

A total of 96 purified DNA samples with known HLA type wereamplified with a unique set of PCR primers at five loci (HLA-A,HLA-B, HLA-C, HLA-DQB1, and HLA-DRB1), using tag PCR. Theamplicons ranged from 290 to 950 bp in length. For most PCR, onlya single band was observed following amplification and gel elec-trophoresis. Multiple bands, apparently weak bands, or no bandswere observed by gel electrophoresis when there was nonspecificamplification, poor amplification, or amplification failure. All am-plicons of the 96 samples were pooled, and randomly sheared.

Sequencing libraries with fragments including library adaptersbetween 400 and 700 bp in length were recovered by gel slicing. Thelength range of the recovered DNA fragments was confirmed by useof DNA bioanalyzer 2100. Sequencing was performed on IlluminaMiseq with PE 150 bp reads, in a single run. A total of 1.89 Gb ofread data, 6.30 M PE reads was generated. 97.08% and 85.23% of thesequence fragments had an average quality value over Q20 and Q30,respectively. The average percentage of GC content was 59.94%.

Sequencing Data Grouping and Variants Calling

Of the original 1.89 Gb of raw data, 1.86 Gb of high qualitysequence data remained for analysis after all low quality PE readswere purged (See Supp. Methods for detail). For the cleaned PEreads, 70.45% contained MIDs and genomic primer sequence inat least one end, and could be grouped to specific amplicons of

specific samples. 0.59% had a MID/genomic primer mis-pairing,likely caused by template switching during PCR cycles [Odelberget al., 1995]. The remaining 28.96% reads, without MIDs or genomicprimer sequences, could not be readily utilized.

The poor amplification and amplification failure of some PCRoutput can be confirmed by grouped reads. Groups with apparentlylow or entirely absent reads will be excluded from the followinganalysis. Grouped reads were aligned to reference (Supp. Table S2)at HLA-A, HLA-B, HLA-C, HLA-DQB1, and HLA-DRB1 by BWA[Li and Durbin, 2009; Li et al., 2008]. As the alignment parametersset were not strict (See Supp. Methods for detail), reads with manysequencing errors could still be mapped to the reference. 93.05% ofall the grouped PE reads, with both ends intact, were successfullymapped to the reference sequence, under the initial alignment pa-rameters. The unaligned and single-end aligned reads were purgedfrom mapping.

Variants were called by SAMtools [Li et al., 2009], in diploidmode (See Supp. Methods for detail). The ratios of each ampli-con’s heterozygous variants’ two alleles were analyzed. 94.16% hada ratio above 1:6 (the limit of detection for Sanger sequencing) onaverage. We evaluated variant calling accuracy of RCHSBT at thisstep by comparing consistency of variants on all 96 samples de-tected by RCHSBT and Sanger SBT (Supp. Table S3). The rate offalse positive and false negative variant calling was 6.21% and 0%,respectively—loci with allele amplification failure were not includedin this statistic.

Based on the high accuracy of the initial variant calling, we usedthe MAM strategy to individually filter aligned PE reads (See Supp.Methods for detail). After this filtering, 18.38% of the alignmentswith mismatch counts larger than MAM were removed from variantcalling.

The rate of heterozygous variants with allele ratios above 1:6 wasincreased to 98.87%, and the rate of false positive calling reduced to0.48%, but the rate of false negative calling was increased to 0.08%,as the result of this filtering. This increase in the rate of heterozygousvariants with an allele ratio above 1:6, and this concurrent reductionin false positives, mainly came from amplicons of HLA-C (partic-ularly C4). Before MAM filtering, 21.10% of heterozygous variantsfrom amplicons of HLA-C had allele ratios below 1:6, and the rateof false positive variant calling was 26.57%. After MAM filtering,the ratio of heterozygous variants from amplicons of HLA-C withan allele ratio below 1:6 was reduced to 0.22% and the rate of falsepositive variant calling was reduced to 0.18% (Supp. Table S3).

Depth of each site, for each amplicon of each of the 96 samples,was plotted in Supp. Figure S2. Average depth was 789× for allamplicons. For the exonic region, 99.9% have depths above 100×,and 86.67% above 700×. For all grouped PE reads, because they haveat least one end containing MIDs and genomic primer sequences,all amplicons appeared to have the highest depth at the two ends,except for HLA-DRB1, which had highest depth in the middle. Thiswas because HLA-DRB1 was the only amplicon shorter than 300 bp,a length that was within the sequence length of the Miseq PE 150read.

Variant Phasing and Haploid Sequence Assembly

In order to further improve variant calling accuracy, we chose torefine sequence reads aligned to the reference sequence. We con-structed a series of blocks for every amplicon, using the PE reads.Each block reflected the linkage pattern of two particular heterozy-gous variants. We integrated all of those blocks to phase heterozy-gous variants of the amplicons (Fig. 2A).

HUMAN MUTATION, Vol. 00, No. 00, 1–6, 2013 3

Page 47: MHC region and its related disease study

Figure 2. Schematic representation of variant phasing and haploid sequence assembly. A: Schematic of variant phasing: A,I: Record linkagerelationships between bases located at heterozygous variant sites, according to paired-end reads. A,II: Blocking: All types of linkage relationshipsbetween heterozygous variants (blocks) are determined, as supported by PE reads. A,III: Phasing: Haploid variants are obtained, by integrating allthe blocks to phase. B: Haploid sequence assembly: B,I: Before sorting, reads from both alleles are mixed. B,II: Sorting: According to the haploidvariants, reads from the same alleles are sorted and clustered. B,III: Haploid sequences of the alleles are assembled according to their assignedread cluster.

For amplicons shorter than 770 bp, 99.87% were completelyphased. 0.09% had one phasing breakpoint, 0.04% had two phasingbreakpoints. For amplicons longer than 770 bp, few amplicons werecompletely phased, due to there not being enough long DNA frag-ments sequenced to support the linkage of heterozygous variants.We analyzed the length distribution of the DNA fragments, usingthe aligned PE reads (Supp. Fig. S3). Over 85.02% of the recoveredfragments were within a length range of 150–400 bp, and only 2.52%had a length over 400 bp. The incomplete haploid pairs within eachamplicon were cross matched with each other to construct completehaploid pairs. With pairs of haploid variants, we sorted the alignedsequence reads of every amplicon into two parts, and constructedconsensus haploid sequences for each portion of the sorted sequencereads (Fig. 2B).

Variants were called by SAMtools for each haploid sequence,running in the haploid mode (See Supp. Methods for detail). Weevaluated the improvement of variant calling accuracy by variantphasing and haploid sequence assembly (Supp. Table S3). Throughvariant phasing and haploid sequence assembly, the rate of falsepositive and false negative variant calling was reduced to 0.41%and 0.02%, respectively. Pairs of haploid variants were used for theensuing genotyping.

Genotyping

The resulting haploid variants of each exon were compared withthe local HLA variants database (Supp. Table S4), in order to deter-

mine exonal genotypes. For each individual sample, the genotypesof the most frequently studied exons (See Supp. Methods for detail)of the given HLA gene were defined as the candidate HLA allelegenotype, and used for method comparison.

The genotypes of other, less frequently studied exons were usedto filter the candidate genotypes. After filtering, in cases where therewere more than two possible HLA allele genotypes, two of the prob-able allele types were randomly paired. The genotypes, as definedby these paired alleles, were then confirmed by a logical algorithm.By logical deduction, if there was any ambiguity in the choice ofgenotype, information was inferred from intron variants to resolvethe ambiguity.

Of the 480 genotypes from all 96 samples considered, assign-ment was possible in 97.50% of the cases, and of those 97.50%,100% of assignments were concordant with the original Sangergenotypes attached to the samples. Notably, among the assignedgenotypes, there were several InDel genotypes, such as C∗04:82,which has nine bases inserted in exon 5, compared with the ref-erence. The assigned genotypes for HLA-A, HLA-B, HLA-C, andHLA-DQB1 of all samples were 100% unambiguous (Table 1).Though one of the assigned genotypes for HLA-B of sample L002was still ambiguous after logical judgment (B∗15:02:01/15:01:01:01,B∗15:25:01/15:15), the ambiguity was resolved by inference, usingheterozygous variant information of intron 2 from the same gene.Because we only amplified parts of exon 2 of HLA-DRB1 for geno-typing, the unambiguous assignment rate of HLA-DRB1 was only48.96%.

4 HUMAN MUTATION, Vol. 00, No. 00, 1–6, 2013

Page 48: MHC region and its related disease study

Table 1. Genotyping Consistence Statistic Result of HLA Genes for 96 Samples

HLA-A HLA-B HLA-C HLA-DQB1 HLA-DRB1

Number of sample with unambiguous genotype assignment 94 92 93 96 47Number of sample with ambiguous genotype assignment 0 0 0 0 46Number of sample with no genotype assignment 2 4 3 0 3Consistence of assigned genotype with Sanger result (%) 100 100 100 100 100

We performed an analysis to discover the reason why some alleleswere not assigned genotypes. For the 12 unassigned alleles, half ofthem were caused by nonspecific amplification, 1/3 of them werecaused by allele amplification failure, and the rest were caused byinsufficient data, resulting from poor PCR amplification efficiency.

DiscussionUsing RCHSBT, whichtakes short sequencing reads as input, but

uses a unique variant calling, haploid sequence assembling algo-rithm, we can accurately genotype with greater effective length peramplicon than even Sanger sequencing reads. In the study, ampli-cons as long as 950 bp—over sixfold the length of the sequence-readlength were readily genotyped, using Illumina PE 150 sequencereads. According to the principle of RCHSBT, the longest sequenceassembled can be two times the length of the longest PE sequencinginsert. As the maximum DNA length in Illumina sequencing flow-cells is ca. 800 bp, the longest amplicon that can be assembled byRCHSBT is 1,600 bp with a single amplification. This high maxi-mum length could greatly increase the flexibility of primer design,and consequently reduce the complexity of the assay.

Besides comparatively short-read length, a high rate of sequencingerror is another major difficulty for NGS-based genotyping whencompared with Sanger, especially for clinical genotyping, whereresultant diagnosis errors are poorly tolerated. This is why the re-liability of RCHSBT, reliant on more accurate methods for variantcalling, variant phasing, and haploid assembly, is important. InRCHSBT, the MAM strategy effectively filters out both reads withexcess sequencing errors, and reads which are mapped improperly.This filtering facilitates accurate variant calling.

In the pilot study, 18.38% of aligned reads with mismatch countslarger than their MAM were removed from variant calling, by MAMfiltering. Using each sample’s known genotype corresponding vari-ants as a reference, overall rates of false positive variant calling werereduced from 6.21% to 0.48%, while overall rates of false negativevariant calling increased from 0% to 0.08%—this latter increase dueto MAM filtering removing reads with both an extraordinarily higherror rate, but also utile information content, particularly impact-ing variant calling in low depth regions (Supp. Table S3). However,the majority of false negatives can be reduced by using the haploidassembly process, hence the only moderate increase in FN. In thestudy, the overall rate of false negative variant calling was reducedfrom 0.08% to 0.02%, after application of the haploid assembly pro-cess. Besides this, the overall rate of false positive variant calling wasreduced to 0.41% (Supp. Table S3).

Beyond this—and without considering nonspecific amplifica-tion, amplification failure, and poor amplification factors—the finalvariant calling was 100% consistent with the previous Sanger SBTresults, as previously described. Combined with deep sequencing(789× on average), the analysis pipeline provides extreme reliabil-ity, effectively comparable to the best-of-class, as assigned genotypeswere 100% concordant with the Sanger results.

The key point of RCHSBT is to construct the haploid sequenceaccurately. The haploid sequence construction process is indepen-

dent of allele database and allele frequency information, so, it can beused for the genotyping of genes without allele database and allelefrequency information, opening a wider application than previousreported methods. Consequently, RCHSBT can be run indepen-dently of allele database and frequency information, in order togenotype new and rare alleles.

As other NGS-based genotyping methods mainly adopt the 454sequencing platform, the short-read NGS based RCHSBT has asubstantive advantage in terms of sequencing costs and sequenc-ing throughput. RCHSBT capacity to genotype long amplicons (upto 1.6 kb for Illumina) with short reads enables genotyping withfewer amplicons than current 454 based genotyping methods, re-ducing cost and improving throughput. A multisite study, usinghigh-resolution HLA genotyping by 454 GS FlX, reported that theycould genotype 20 samples for 10 HLA genes (17 exons involved)in a single sequencing run, with reagent costs approximately equiv-alent to Sanger SBT [Holcomb et al., 2011]. In our pilot study, wegenotyped 96 samples on 5 HLA genes (24 exons involved) in asingle Illumina Miseq sequencing run, enabling us to cut HLA typ-ing costs by 80%, and to improve HLA typing throughput by 10×compared with Sanger SBT.

Compared with other short-read NGS based genotyping meth-ods, RCHSBT constitutes an immense improvement. In a previouslyreported Illumina genotyping study [Wang et al., 2012], encoded se-quencing adaptors were used to identify the source of sequence readsfor each sample, requiring library construction for each sample. Inthis study, both primer MIDs and encoded sequencing adaptorswere used to trace the source of sequence reads for each sample.Amplicons with different primer MIDs were pooled together whenconstructing library, which greatly reduced the number of librariesconstructed. Although the initial synthesis of MID primers may beexpensive, the cost of primer synthesis for each test will be compar-atively lower in large-scale projects using RCHSBT, and the numberof primers synthesized can be adjusted according to the scale of theproject. The previous study reported genotyping four HLA genes(25 exons in total) in 180 samples, for a single lane of Illumina Hiseq2000. With the same rate of output (30G), we estimate that at least1,500 samples, genotyping on five HLA genes, can be processed atthe same time in a single Hiseq 2000 lane using RCHSBT, based onthe result of our pilot study. In unpublished data, we have randomlypicked half of the 1.89 Gb raw data of the pilot study for genotyp-ing, run RCHSBT on that data, and the results were comparableto results obtained by running RCHSBT on the whole set of rawdata. This implies potential for further improvement of cost andthroughput of genotyping.

Although the performance of RCHSBT was obtained from testingof HLA genes, similar result can be predicted when testing nonpoly-morphic genes. As few variants will be called for nonpolymorphicgenes, variants phasing and haploid sequence assembly will have lit-tle effect on genotyping accuracy. But as each sequence reads containfew SNPs, their initial mapping are accurate, and deep sequencinghelp remove the influence of sequencing error, people can still geno-type long amplicons accurately by RCHSBT using short sequenceread.

HUMAN MUTATION, Vol. 00, No. 00, 1–6, 2013 5

Page 49: MHC region and its related disease study

RCHSBT can meet the requirements of both medium-size geno-typing projects, and large-scale genotyping projects. Combined withone of the very rapid NGS platforms, such as Illumina Miseq, hun-dreds of samples can be genotyped in under a week, at levels suitablefor clinical genotyping. When reporting time is not very impor-tant, people can choose high-throughput NGS platforms, such asIllumina Hiseq 2000. Tens of thousands of samples can be geno-typed in a single Hiseq 2000 run, at costs lower than Miseq, makingthis method very suitable for large-scale genotyping projects, suchas HLA genotyping for global scale bone marrow registries, genescreening for common monogenic disease like thalassemia, and thelike.

Although the advantages of RCHSBT were obvious, the limita-tions must be considered. The longer the amplicon, the less per-centage of raw data can be used. PE reads having no MIDs andgenomic primer sequence in at least one end are useless, as theycannot be grouped to specific samples for the following analysis.In unpublished data, we tested amplicons with a length of 1.4 kbusing RCHSBT, and we found only about 40% of the raw data wereuseful. Also, as we found the longer the amplicon, the lower thedepth in the middle of amplicon was, much more data are neededfor genotyping long amplicons using RCHSBT. Another limitationis about SNP phasing. For SNPs within the same amplicons, wecan phase them well, and the maximum length of amplicon can betwo times the length of the longest PE sequencing insert. For SNPswithin different amplicons, we can phase them well only in the caseof the two amplicons are overlapped and they have common SNPsbetween the two SNPs we are going to phase.

In this assay, most of our time and labor was spent on PCR am-plification. We know that the key to scaling PCR-based sequenceenrichment involves automation, miniaturization, and multiplex-ing of reactions. There are commercially available platforms whichenable miniaturized PCR by using microfluidics (Fluidigm). Thecombination of RCHSBT and microfluidic PCR can further im-prove genotyping throughput and cost, making it highly suitablefor studies involving large numbers of amplifications, or large num-bers of genes for each sample.

Acknowledgments

The authors thank the members of BGI-Shenzhen genome sequencing teamfor help with sequencing.

Disclosure statement: The authors declare no conflict of interest.

References

Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J.2011. Exome sequencing as a tool for Mendelian disease gene discovery. Nat RevGenet 12(11):745–755.

Bell CJ, Dinwiddie DL, Miller NA, Hateley SL, Ganusova EE, Mudge J, Langley RJ, ZhangL, Lee CC, Schilkey FD, Sheth V, Woodward JE, et al. 2011. Carrier testing for severechildhood recessive diseases by next-generation sequencing. Sci Transl Med 3(65):65ra4.

Bentley G, Higuchi R, Hoglund B, Goodridge D, Sayer D, Trachtenberg EA, ErlichHA. 2009. High-resolution, high-throughput HLA genotyping by next-generationsequencing. Tissue Antigens 74(5):393–403.

Bettinotti MP, Mitsuishi Y, Bibee K, Lau M, Terasaki PI. 1997. Comprehensive methodfor the typing of HLA-A, B, and C alleles by direct sequencing of PCR productsobtained from genomic DNA. J Immunother 20(6):425–430.

Carrington M, O’Brien SJ. 2003. The influence of HLA genotype on AIDS. Annu RevMed 54:535–551.

Castellani C, Picci L, Tamanini A, Girardi P, Rizzotti P, Assael BM. 2009. Associationbetween carrier screening and incidence of cystic fibrosis. JAMA 302(23):2573–2579.

Danzer M, Niklas N, Stabentheiner S, Hofer K, Proll J, Stuckler C, Raml E, Polin H,Gabriel C. 2013. Rapid, scalable and highly automated HLA genotyping using next-generation sequencing: a transition from research to diagnostics. BMC Genomics14(1):221.

Danzer M, Polin H, Hofer K, Proll J, Gabriel C. 2008. Characterisation of two novelHLA alleles, HLA-Cw∗0429 and HLA-DRB3∗0223. Tissue Antigens 72(5):498–499.

Danzer M, Polin H, Proll J, Hofer K, Fae I, Fischer GF, Gabriel C. 2007. High-throughputsequence-based typing strategy for HLA-DRB1 based on real-time polymerasechain reaction. Hum Immunol 68(11):915–917.

Gabriel C, Danzer M, Hackl C, Kopal G, Hufnagl P, Hofer K, Polin H, Stabentheiner S,Proll J. 2009. Rapid high-throughput human leukocyte antigen typing by massivelyparallel pyrosequencing for high-resolution allele identification. Hum Immunol70(11):960–964.

Galan M, Guivier E, Caraux G, Charbonnel N, Cosson J-F. 2010. A 454 multiplexsequencing method for rapid and reliable genotyping of highly polymorphic genesin large-scale studies. BMC Genomics 11(1):296.

Gravitt PE, Peyton CL, Apple RJ, Wheeler CM. 1998. Genotyping of 27 human papil-lomavirus types by using L1 consensus PCR products by a single-hybridization,reverse line blot detection method. J Clin Microbiol 36(10):3020–3027.

Holcomb CL, Hoglund B, Anderson MW, Blake LA, Bohme I, Egholm M, Ferriola D,Gabriel C, Gelber SE, Goodridge D, Hawbecker S, Klein R, et al. 2011. A multi-site study using high-resolution HLA genotyping by next generation sequencing.Tissue Antigens 77(3):206–217.

Lank SM, Wiseman RW, Dudley DM, O’Connor DH. 2010. A novel single cDNAamplicon pyrosequencing method for high-throughput, cost-effective sequence-based HLA class I genotyping. Hum Immunol 71(10):1011–1017.

Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheelertransform. Bioinformatics 25(14):1754–1760.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G,Durbin R. 2009. The Sequence Alignment/Map format and SAMtools. Bioinfor-matics 25(16):2078–2079.

Li H, Ruan J, Durbin R. 2008. Mapping short DNA sequencing reads and calling variantsusing mapping quality scores. Genome Res 18(11):1851–1858.

Liao C, Mo QH, Li J, Li LY, Huang YN, Hua L, Li QM, Zhang JZ, Feng Q, Zeng R,Zhong HZ, Jia SQ, et al. 2005. Carrier screening for alpha- and beta-thalassemia inpregnancy: the results of an 11-year prospective program in Guangzhou Maternaland Neonatal hospital. Prenat Diagn 25(2):163–171.

Metzker ML. 2009. Sequencing technologies—the next generation. Nat Rev Genet11(1):31–46.

Naito H, Hayashi S, Abe K. 2001. Rapid and specific genotyping system for hepatitis Bvirus corresponding to six major genotypes by PCR using type-specific primers. JClin Microbiol 39(1):362–364.

Napolitano C, Priori SG, Schwartz PJ, Bloise R, Ronchetti E, Nastoli J, Bottelli G, CerroneM, Leonardi S. 2005. Genetic testing in the long QT syndrome: developmentand validation of an efficient approach to genotyping in clinical practice. JAMA294(23):2975–2980

Ng J, Hurley CK, Baxter-Lowe LA, Chopek M, Coppo PA, Hegland J, KuKuruga D,Monos D, Rosner G, Schmeckpeper B, Yang SY, Dupont B, et al. 1993. Large-scaleoligonucleotide typing for HLA-DRB1/3/4 and HLA-DQB1 is highly accurate,specific, and reliable. Tissue Antigens 42(5):473–479.

Odelberg SJ, Weiss RB, Hata A, White R. 1995. Template-switching during DNA synthe-sis by Thermus aquaticus DNA polymerase I. Nucleic Acids Res 23(11):2049–2057.

Voelkerding KV, Dames SA, Durtschi JD. 2009. Next-generation sequencing: from basicresearch to diagnostics. Clin Chem 55(4):641–658.

Wang C, Krishnakumar S, Wilhelmy J, Babrzadeh F, Stepanyan L, Su LF, Levin-son D, Fernandez-Vina MA, Davis RW, Davis MM, Mindrinos M. 2012. High-throughput, high-fidelity HLA genotyping with deep sequencing. Proc Natl AcadSci USA 109(22):8676–8681.

6 HUMAN MUTATION, Vol. 00, No. 00, 1–6, 2013

Page 50: MHC region and its related disease study

Paper II

An integrated Tool to study MHC Region: Accurate SNV

Detection and HLA Genes Typing in Human MHC Region

Using Targeted High-Throughput Sequencing.

By Hongzhi Cao1,3, Jinghua Wu1, Yu Wang1, Hui Jiang1,3, Tao Zhang1, Xiao Liu1,3, Yinyin Xu1, Dequan Liang1, Peng Gao1, Yepeng Sun1, Benjamin Gifford2, Mark D’Ascenzo2, Xiaomin Liu1, Laurent C. A. M. Tellier1,3, Fang Yang1, Xin Tong1, Dan Chen1, Jing Zheng1, Weiyang Li1, Todd Richmond2, Xun Xu1, Jun Wang1,3,4,5, Yingrui Li1 1Science and Technology Department, BGI-Shenzhen, Shenzhen, China, 2Roche NimbleGen, Inc., Madison, Wisconsin, United States of America, 3Department of Biology, University of Copenhagen, Copenhagen, Denmark, 4King Abdulaziz University, Jeddah, Saudi Arabia, 5The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark Publication details Published in Plos One 8(7):e69388 (2013)

Page 51: MHC region and its related disease study

An Integrated Tool to Study MHC Region: Accurate SNVDetection and HLA Genes Typing in Human MHC RegionUsing Targeted High-Throughput SequencingHongzhi Cao1,3., Jinghua Wu1., Yu Wang1., Hui Jiang1,3., Tao Zhang1., Xiao Liu1,3, Yinyin Xu1,

Dequan Liang1, Peng Gao1, Yepeng Sun1, Benjamin Gifford2, Mark D’Ascenzo2, Xiaomin Liu1,

Laurent C. A. M. Tellier1,3, Fang Yang1, Xin Tong1, Dan Chen1, Jing Zheng1, Weiyang Li1,

Todd Richmond2, Xun Xu1, Jun Wang1,3,4,5*, Yingrui Li1*

1 Science and Technology Department, BGI-Shenzhen, Shenzhen, China, 2 Roche NimbleGen, Inc., Madison, Wisconsin, United States of America, 3Department of Biology,

University of Copenhagen, Copenhagen, Denmark, 4 King Abdulaziz University, Jeddah, Saudi Arabia, 5 The Novo Nordisk Foundation Center for Basic Metabolic Research,

Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark

Abstract

The major histocompatibility complex (MHC) is one of the most variable and gene-dense regions of the human genome.Most studies of the MHC, and associated regions, focus on minor variants and HLA typing, many of which have beendemonstrated to be associated with human disease susceptibility and metabolic pathways. However, the detection ofvariants in the MHC region, and diagnostic HLA typing, still lacks a coherent, standardized, cost effective and high coverageprotocol of clinical quality and reliability. In this paper, we presented such a method for the accurate detection of minorvariants and HLA types in the human MHC region, using high-throughput, high-coverage sequencing of target regions. Aprobe set was designed to template upon the 8 annotated human MHC haplotypes, and to encompass the 5 megabases(Mb) of the extended MHC region. We deployed our probes upon three, genetically diverse human samples for probe setevaluation, and sequencing data show that ,97% of the MHC region, and over 99% of the genes in MHC region, arecovered with sufficient depth and good evenness. 98% of genotypes called by this capture sequencing prove consistentwith established HapMap genotypes. We have concurrently developed a one-step pipeline for calling any HLA typereferenced in the IMGT/HLA database from this target capture sequencing data, which shows over 96% typing accuracywhen deployed at 4 digital resolution. This cost-effective and highly accurate approach for variant detection and HLA typingin the MHC region may lend further insight into immune-mediated diseases studies, and may find clinical utility intransplantation medicine research. This one-step pipeline is released for general evaluation and use by the scientificcommunity.

Citation: Cao H, Wu J, Wang Y, Jiang H, Zhang T, et al. (2013) An Integrated Tool to Study MHC Region: Accurate SNV Detection and HLA Genes Typing in HumanMHC Region Using Targeted High-Throughput Sequencing. PLoS ONE 8(7): e69388. doi:10.1371/journal.pone.0069388

Editor: Xi (Erick) Lin, Emory Univ. School of Medicine, United States of America

Received April 1, 2013; Accepted June 7, 2013; Published July 24, 2013

Copyright: � 2013 Cao et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricteduse, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The study was supported by the National High Technology Research and Development Program of China (NO.2012AA02A201), and the funder’swebsite is: http://www.863.gov.cn/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: This manuscript was finished by collaboration between Roche NimbleGen and BGI, in which Roche NimbleGen was in charge of thedesign and production of probes and BGI was in charge of the selection of target region and test of the probes. The collaboration has been published as news inwebsite of Roche NimbleGen (http://www.roche.com/media/media_releases/med_dia_2011-11-28.htm) and the website of BGI (http://www.genomics.cn/en/news/show_news?nid = 98954). Three persons in the author list are employee of Roche NimbleGen, they are Benjamin Gifford, who was the primary designer ofthe probes; Mark D’Ascenzo, who was in charge of design optimization; and Todd Richmond, who was consulted on the design. Roche NimbleGen has the right toproduce and sale the product freely and this does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials.

* E-mail: [email protected] (YL); [email protected] (J. Wang)

. These authors contributed equally to this work.

Introduction

The MHC region, one of the most gene-dense regions of the

human genome, is located on the short arm of human

chromosome 6. It covers over 200 genes, 128 of which are

predicted to be expressed [1]. Most MHC genes play a

fundamental role in immunity, and show a close relationship

with immune-mediated diseases [2,3]. Over 30 years, numerous

studies have demonstrated association between some MHC

alleles and some disease susceptibilities. Today, more than 100

diseases, - with relation to infection, inflammation, autoimmu-

nity, drug sensitivity, and transplantation medicine -have been

reported to be associated with proteins coded by the genes in

MHC region [4].

In addition to its high gene density, the MHC region is also one

of the most complex regions in the human genome, due to the

extremely high density of polymorphism and linkage disequilib-

rium (LD). This inherent complexity has made identification of the

underlying, causative variants contributing to disease phenotypes

by genome-wide association study (GWAS) a challenge. The single

nucleotide variations (SNV) which distinguish PGF and COX, the

two most significant MHC haplotypes in the European popula-

tion, are at variant densities of ,3.46 10–3, higher than the

estimated average heterozygosity between any other two haplo-

PLOS ONE | www.plosone.org 1 July 2013 | Volume 8 | Issue 7 | e69388

Page 52: MHC region and its related disease study

types in the human genome [5]. Recognizing the importance of

fully informative polymorphism and haplotype maps of the MHC

region, as pertaining to MHC-related-diseases, the MHC Haplo-

type Consortium has conducted the MHC Haplotype Project

between 2000 and 2006, and provided the sequence and

annotations of eight different HLA-homozygous-typing haplotypes

(PGF, COX, QBL, APD, DBB, MANN, MCF and SSTO) [6].

The Single Nucleotide Polymorphism (SNP) count in dbSNP for

the MHC region has increased rapidly from 69,076 SNPs in

database version 130 (Apr 30, 2009) to 169,279 in database

version 135 (Oct 13, 2011), due in large part to the progress of the

1000 genome project. Although next generation sequencing (NGS)

has made the cost of human genome sequencing decrease

dramatically, the detection and analysis of all variants in the

human MHC region for a large human cohort by whole genome

sequencing (WGS) is still a challenge for researchers. Particularly

for researchers who focus on genomic variations in MHC region,

and for researchers who use genomic information to do HLA

typing, targeted region sequencing stands out as a promising and

neotypal approach.

The human leukocyte antigen (HLA) genes, the most studied

genes in the MHC region, encode cell-surface proteins responsible

for antigen peptide presentation in adaptive immune response.

Considering the key role of HLA genes in immunology, HLA

typing is widely used for matching donors and recipients in organ

and hematopoietic stem cell transplantation in the clinical context

[7,8], particularly for variants at the four gene loci HLA-A, -B, -C

and -DRB1. High resolution HLA matching is required for

unrelated donor hematopoietic cell transplants [9]. Although

precise HLA matching improves overall transplant survival and

reduces the risk of rejection, graft-versus-host disease (GVHD)

remains a significant and potentially life-threatening complication

after hematopoietic cell transplantation (HCT), even when using

HLA-matched transplants. Besides the five classic HLA genes

(HLA-A, -B, -C, -DRB1 and -DQB1), additional genes in the

MHC region, encoding unidentified transplantation antigens, are

proposed to be responsible for GVHD [10]. Proll et al. found

3,025 small variations between recipients and donors undergoing

unrelated HLA-matched allogeneic stem cell transplantation, and

proposed that various differences causing non-synonymous amino

acid exchanges may lead to GVHD [11]. Therefore, choosing an

HLA matched unrelated donor with the smallest number of

differences to the recipient may help to reduce risk of rejection and

increase odds of transplantee survival.

Target region capture sequencing was developed mainly to

enrich and sequence specific regions of particular interest, such as

specific exons or entire exomes [12,13]. Target Region Sequenc-

ing (TRS) may be a cost-effective solution for studying MHC

region; however, successful enrichments of large, contiguous,

highly variable genomic regions has previously been thought to be

problematic, because of ubiquitous repetitive regions and other-

wise high diversity between haplotypes. In this study, a probe set

was carefully designed for the target regions, using sequences of

eight MHC haplotypes, to capture DNA fragments from the

human MHC region, and then processed via high throughput

sequencing. Using this probe set, we are able to sequence,97% of

the defined MHC region with high SNV calling accuracy by this

capture sequencing technology. We have also developed a HLA

typing pipeline to type all of the genes in the IMGT/HLA

database (http://www.ebi.ac.uk/imgt/hla,Release 3.9.0) with

high accuracy, using this capture sequencing data or whole

genome sequencing data. This toolset will improve and extend the

usage of both capture sequencing and whole genome sequencing

data in related studies which rely on HLA typing.

Results

Evaluation of Target Capture Sequencing, in Terms ofCoverageWe generate sequence libraries from genomic DNA (gDNA) of

the three samples mentioned in the method. After filtering reads

with low sequence quality or sequencing adaptor, the purged data

are mapped to the human genome reference sequence hg19, and

more than 60% of the mapped reads are proved to align to the

MHC region (Table 1). Although the probes themselves cover only

72.8% of the defined MHC region (3,620,871 bp of

4,970,558 bp), the alignment result proves that 97% of the

defined MHC region is covered by at least one read, and over

95% by at least four reads (Table 1).

Several hundred samples have been captured and sequenced

using this method, and the coverage of the MHC region is over

96% for all samples (data not shown). We investigated the 3–4% of

uncovered regions and found that more than 99% of the

uncovered bases were located in a long repeating region with

length.2000 base per repeat (data not shown), which was difficult

to solve with capture method due to NGS reads typically being too

short to span the breadth of the long repeat region. The MHC

region depth distribution showed a similar Poisson distribution for

all three samples (Figure 1), indicating an even enrichment of the

MHC region.

For further evaluation of the coverage of the genes in the MHC

region, RefSeq genes are downloaded from the UCSC table

browser (Hg19) [14], of which 213 genes and 380 transcripts are

annotated in the defined MHC region. The average coverage of

the longest transcripts of the 213 genes in the MHC region was

99.69% (Table S1). As to the coverage of the exons, the average

coverage of all exons was 99.94% and 210 of 213 genes were

100% covered for all exons for all three samples (Table S1). This

high coverage of the genes in MHC region makes it possible to

detect all variations therein.

Evaluation of Variation Detection AccuracyUse of the capture sequencing method for mutation discovery is

critically dependent on the accurate detection of polymorphisms

and genotypes. Based on the target capture sequencing data, using

GATK tools (v1.43) for YH, NA18532 and NA18555, respective-

ly, we detected 19,002, 19,007 and 20,022 SNPs, and 2,332,

2,077, and 2,513 insertions and deletions (InDels).

When analyzing the distribution of variations, we found that

most of those SNPs were located in HLA-A, -B, -C, -DR, -DP and

-DQ loci (Figure 2), leading to high degree of polymorphism

within those genes. An average 3% of the SNPs identified in our

study were determined to be novel, defined by their not being

present in dbSNP132, or not being extant in the 1000 Genomes

project (1000 Genomes Project Consortium, http://www.

1000genomes.org, released on 23 November 2010). Using the

ANNOVAR bioinformatics toolset (http://www.

openbioinformatics.org/annovar) to functionally annotate the

genetic variants in MHC region for the three case samples, we

showed that about 2% of all detected variants were within exonic

regions, 16% within intronic regions, and 70% within regions

defined as intergenic (Table S2). Further investigation showed that

most of the nonsynonymous variants resulting in changes to amino

acid sequences - and hence probable functional differences in the

protein - were also located in HLA-A, -B, -C, -DR, -DP and -DQ

genes, putatively resulting in different peptide-binding interactions

with T cell receptors.

In order to evaluate the accuracy of SNP detection, we

compared our genotype result with the Illumina 2.5 M BeadChip

An Integrated Tool to Study MHC Region

PLOS ONE | www.plosone.org 2 July 2013 | Volume 8 | Issue 7 | e69388

Page 53: MHC region and its related disease study

genotype result for YH, and with HapMap genotype result for

NA18555 and NA18532. There were 6,099 sites on the Illumina

2.5 M BeadChip which are within the MHC region, and 13,920

and 14,036 genotypes in the HapMap database for NA18532 and

NA18555 respectively. 6,066 (99.5%), 13,775 (99.0%) and 13,895

(99.0%) of these genotypes are identified by target capture

sequencing data with an accuracy of 98.83%, 97.82% and

98.26%, respectively, for the three study samples (Table S3, S4

and S5). The rate of divergence (about 2%) of our genotype results

from the HapMap genotypes is calculated to be higher than the

estimated average error rate of HapMap genotypes of 0.5% [15].

Two-thirds of the divergent loci show missing data at one allele in

the HapMap genotype, which indicating that these inconsistent

sites might result from failure of amplification, caused by the high

background heterozygosity of the MHC region. Due to this high

degree of divergence between any given pair of MHC haplotypes,

biased enrichment data may likewise lead to an erroneous result.

In order to evaluate the bias of our reads on two different MHC

haplotypes, we therefore count the support read number for each

allele at the heterozygous locus in the BeadChip for YH, and

HapMap data for NA18532 and NA18555. This count shows that

most of the sites had equal support reads for the non-reference

allele and reference allele, by which we infer that our probes

display good enrichment balance for the two haplotypes (Figure 3).

In order to validate the authenticity of the novel SNPs and

InDels, two already proved fosmids covering the HLA–B gene

locus were generated, based on our previous project using YH

gDNA. Both of the selected fosmids covered the MHC region of

chr6 :31,317,939 – 31,344,876 (hg19), but they each are

templated on different haplotypes. Using the sequence data from

fosmids, we validated the accuracy of 379 of the 382 SNPs

(99.21%) (including 14 of the 15 novel SNPs) and 35 of the 37

InDels (94.59%) (data not shown).

Description of HLA Typing MethodSince the target capture sequencing data can be expected to

cover more than 99% of genes in MHC region, it has the potential

to give all MHC HLA gene types. Here, based on the above cited

TRS data, we have developed an HLA typing pipeline which

enables typing all of the HLA genes in the IMGT/HLA database

until the date of publishing. The pipeline takes the aligned BAM

files as input, and outputs the most reliable HLA types for each

gene (Figure 4).

A brief summary of the pipeline was as follow:

(1) Read preparation: Paired reads with either end mapped

to target gene region, and unmapped paired reads with high

quality, are categorized by gene, and extracted from the BAM. (2)

Quality control and error correction: The collected reads of

each gene are aligned to the IMGT/HLA database, which

contains the sequence of all currently known types, using Basic

Local Alignment Search Tool (BLAST, version 2.2.24). Reads

with a maximum of two discordant (SNP/InDel) to the most

similar reference sequence in the database are kept and corrected

to the closest reference sequence, while maintaining and adding to

a record of the discordant SNP/InDel variants. (3) Haplotypesequence assemble: For each exon, reads are randomly

assembled, with a minimum overlap region length of 10 bp for

reads with no difference allowed in the overlap region, so as to

construct haplotypes. Artificial haplotypes that do not exist in

IMGT/HLA database are filtered out, and remaining haplotypes

are given quality scores as defined below (see method). (4) Typingaccording to assembled haplotype sequence: The final

haplotype of the sample is determined by rank of quality score (See

method).

Figure 1. Distribution of per-base coverage depths in the MHC region for three samples. X-axis denotes coverage depth, Y-axis indicatespercentage of total target region with a given sequencing depth. The fraction of target bases with zero coverage is not shown in the figure.doi:10.1371/journal.pone.0069388.g001

Table 1. Data production and mapping results for the threesamples used.

Samples YH NA18532 NA18555

Target region ( bp) 4970558 4970558 4970558

Raw reads 7988210 5377408 5457998

Raw data (Mb) 719 484 491

Mapped reads 7766430 4875514 4991960

Uniquely mapped reads 7317777 4559340 4672521

Reads uniquely mapped to target 4794113 3023307 3100770

Capture specificity 65.51% 66.31% 66.36%

Mean fold coverage depth (x) 87.32 55.09 56.50

Percent bases with coverage $1x 97.29 96.95 97.20

Percent bases with coverage $4x 96.52 95.66 95.95

Percent bases with coverage $10x 95.50 93.84 94.13

Target regions here refer to contiguous MHC regions, rather than to the regionactually covered by the designed probes. Capture specificity is defined as thepercentage of uniquely mapped reads aligning to the target region.doi:10.1371/journal.pone.0069388.t001

An Integrated Tool to Study MHC Region

PLOS ONE | www.plosone.org 3 July 2013 | Volume 8 | Issue 7 | e69388

Page 54: MHC region and its related disease study

Evaluation of HLA Typing AccuracyFirstly, we test our HLA typing pipeline on simulated data.

Based on the sequence of eight known haplotypes, sequenced by

Sanger’s MHC haplotype project in 2008 with a foreknown

HLA type result, we simulated regional sequence data for a

diploid sample MHC, and ran the HLA typing pipeline upon it.

For each sample, we typed 26 HLA genes, as recorded in

IMGT/HLA database. The simulated data showed the accuracy

of our method at 2 and 4 digital resolutions, and at sequencing

depths ranging from 206 to 1006 with sequencing error rates

ranging from 0–2% (Table S6). The results of this typing were

then validated against the HLA type given by the MHC

haplotype project.

We then used this pipeline on three real human genome

samples selected for this study. For each sample, we typed the

same 26 HLA genes recorded in the IMGT/HLA database. In

order to estimate the accuracy of the HLA typing result, we typed

the nine HLA genes with the highest diversity in the IMGT/HLA

database using the gold standard: Polymerase Chain Reaction

(PCR) based Sanger sequencing, at 4 digital resolution. When our

pipeline was compared with this standard, 100% of the 54 HLA

alleles typed by our method were found consistent with those of

the Sanger standard at 4 digital resolution (Table S7).

In a second test project, 190 human samples which had already

been typed by traditional SSO or PCR based Sanger sequencing

method (the same gold standard) were captured and sequenced

using our pipeline, comparing in particular on five genes (HLA-A,

-B, -C, -DRB1 and -DQB1). The comparison showed that the

concordance of our method with the traditional gold standard was

96.4% at 4 digital resolution, and 98.8% at 2 digital resolution

(data not shown). This result indicates the high performance and

reliability of our targeted region pipeline.

Discussion

In our study, we have developed an efficient approach for

targeted region sequencing of the MHC region of the human

genome, focusing on detection of minor variants and HLA types,

in a high-coverage, high-reliability and low-cost framework.

Beyond the 3.4 Mb of the standard MHC region, we designed

probes for a continuous 5 Mb genomic region, to include further

information from the extended high linkage disequilibrium area of

the broader MHC region.

Our data presents 97% coverage for the broader MHC region

and 99.69% for the transcripts of the crucial MHC genes, in all

three testing samples. Compared with the method of Proll [11],

our study showed a significantly higher coverage of the MHC

region. Moreover, the accuracy of SNVs identified by our target

capture sequencing data was in particularly high concordance with

genotype data established by the most reliable methodology for

this purpose, indicating the high degree of reliability of our

approach.

To overcome the challenge of accurately genotyping the

extremely high density of polymorphisms and linkage disequilibria

in the human MHC region, our method optimized the standard

target capture sequencing pipeline. Unlike traditional probes,

which are designed only based on one of the haplotypes of the

human reference sequence, the probe design for the MHC region

in this study is based on all eight major MHC haplotype

sequences. This enables capture of less common haplotypes in

this region, ultimately reducing capture bias and failure. We also

Figure 2. Single nucleotide variation distribution across the whole MHC region for three samples. The MHC region was split into 4971parts, with 1000 bp in each part. X-axis denotes the 4971 parts and Y-axis indicates the number of SNPs in each part.doi:10.1371/journal.pone.0069388.g002

An Integrated Tool to Study MHC Region

PLOS ONE | www.plosone.org 4 July 2013 | Volume 8 | Issue 7 | e69388

Page 55: MHC region and its related disease study

shear DNA fragments to ,500 bp, increasing the coverage of the

MHC region compared to the shorter fragments length around

,200 bp (Table S8) which is more typically used in conventional

exon capture libraries. This is particularly useful in the MHC

because probes must be annealed to unique regions, but some

repetitive regions, which are not covered by the probes, can be

enriched by flanking to unique genome regions close to them, by

using a longer and more unique genome fragment.

Based on this high-coverage, low-bias capture data, we achieve

exceptionally accurate SNP and InDel results, and exceptionally

accurate HLA gene type results in MHC region, at a level which

before could only be achieved by a more specialized and expensive

experimental method, or based on a large database. The

traditionally utilized HLA typing methods - such as sequence

specific oligonucleotide (SSO), sequence specific primer (SSP) [16],

PCR based capillary sequencing and next generation sequencing

method [17] - only detect variants in two or three exons, and it is

very possible for new alleles in other regions of the gene to be

missed when using these traditional methods. Another forwar-

dlooking method has been reported for the imputation classical

HLA alleles and corresponding amino acid based on the SNP

information in the MHC region [18], but it demands a huge well

studied reference panel. Our data covers essentially all regions of

known HLA genes, and has the potential for more accuracy and

higher resolution in typing, as well the potential for the detection

of novel types (a not infrequent occurrence in this highly

polymorphic region), and an important clinical clue in the event

of tissue rejection and GVHD. High coverage data allows us to

detect all variants and type all HLA genes in the MHC region in

one-step with low cost. Some researchers suggest that the

transplant barrier is defined not only by classical HLA genes,

but also by non-HLA genetic variation within the MHC [10], but

many previously dominant methods have been unable to gather

data of sufficient breadth to verify or refute this contention.

One limitation of our target region method, as well as most

other methods, is the inability to fully cover long repeated regions,

and present 100% coverage of MHC region. This difficulty may

later be overcome by shearing gDNA to longer fragments

(5000 bp, or longer), capturing using probes of this size, and

sequencing on the third generation sequencing platforms. A

solution to this difficulty is important and may open the possibility

of phasing long MHC haplotype blocks at low cost, ultimately

reducing the risk of transplantation rejection and GVHD for those

less able to afford a more expensive solution.

ConclusionThe MHC region harbors the highest density of immunity-

crucial genes in the human genome, and is arguably the region

Figure 3. Capture bias evaluation using heterozygous genotypes in the Beadchip for three samples. X-axis denotes the support readsnumber of the reference allele and Y-axis denotes the support reads number of the non-reference allele.doi:10.1371/journal.pone.0069388.g003

An Integrated Tool to Study MHC Region

PLOS ONE | www.plosone.org 5 July 2013 | Volume 8 | Issue 7 | e69388

Page 56: MHC region and its related disease study

most central to immunogenetics and tissue transplantation. The

ability to accurately call variants and HLA types is very important

to immunology related studies in general, and to clinical

application in particular. In this paper, we outline our develop-

ment of a novel, low cost pipeline for the identification of variants

and typing of HLA alleles in the MHC region in a one step

protocol with high accuracy. This pipeline has great potential for

use in the study of immune disease and transplantation medicine.

Materials and Methods

Design of MHC Capture ProbesThe classical MHC region is defined as a ,3.4 Mb region

stretching between the gene C6orf40 (ZFP57) and the gene

HCG24 [19]. In this study, due to extreme linkage disequilibrium

of the region, and due to the presence of MHC-relevant genes

nearby [19], we extended the MHC region to the telomeric gene

GPX5 and the centromeric gene ZBTB9 in the probes design,

corresponding to the genomic region from chr6:28,477,797 to

chr6:33,448,354 in the human reference genome (NCBI release

GRCh37, UCSC release hg19), encompassing a total of

4,970,558 bp.

Unlike previous studies of the MHC region, we also include the

other seven MHC haplotypes aside from PGF - namely COX,

QBL, APD, DBB, MANN, MCF and SSTO - in our probes

design so as to improve the coverage of the MHC region and

minimize the effect of the regionally high polymorphism density. A

maximum allowance of greater than 5 close matches is set to get

optimal capture efficiency of such high polymorphic regions. In

total, 3,620,871 bp (about 72.8% of the 5 Mb MHC region) was

covered by the capture probes for PGF, and the coverage rose to

4150331 bp (83.5%) when probes extended by 100 bp at both

end. The design name is 110729_HG19_MHC_L2R_D03_EZ

and design file can be downloaded from Roche NimbleGen

website (http://www.nimblegen.com/products/seqcap/ez/

designs/index.html) in the product ‘‘human MHC design’’. Five

Figure 4. The workflow of HLA typing method.doi:10.1371/journal.pone.0069388.g004

An Integrated Tool to Study MHC Region

PLOS ONE | www.plosone.org 6 July 2013 | Volume 8 | Issue 7 | e69388

Page 57: MHC region and its related disease study

capture arrays were initially produced for testing the uniformity of

coverage, and the regions with extremely low or high sequencing

depth were rebalanced by adjusting the number of probes - probe

count for regions with extremely low sequencing depth was

increased, and for regions with extremely high sequencing depth

was decreased. Target region sequencing of the MHC was

conducted for three samples using the set of rebalanced probes.

Samples CollectionThree samples were used to evaluate the performance of the

MHC region capture strategy. One was YH, which had previously

been whole genome deep sequenced in 2008 [20], and genotyped

using Illumina 2.5 M BeadChip. The other two samples were cell

line DNA, NA18532 and NA18555, which were purchased from

Coriell Institute, and which had previously been analyzed using

high-throughput SNP genotyping in the HapMap project [21].

Nine HLA allele (HLA-A, HLA–B, HLA–C, HLA-DRB1,

HLA-DQB1, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-G)

types for these samples were already detected using the gold

standard sequence based typing method, in which the exonic

regions of HLA genes were initially amplified by PCR, and the

amplicons were subsequently sequenced using Sanger method.

Whole Genome Shotgun Library ConstructionHigh molecular weight gDNA was extracted using DNeasy

Blood & Tissue Kits (QIAGEN, 69581). Shotgun libraries were

generated from 3 micrograms ( mg) of genomic DNA followed by

the manufacturer’s instruments (Illumina). gDNA in Tris-EDTA

was sheared into 400–600 bp fragments using a Covaris S2

(Covaris). The overhangs were then converted to blunt ends, using

T4 Polynucleotide Kinase, T4 DNA polymerase and Klenow

polymerase. The fragments were then A-tailed using Klenow (39–

59 exo-). Next, Illumina sequencing adapters with a single ‘‘T’ base

overhang were linked to the A-tailed sample using T4 DNA

Ligase. Finally, the fragments with adapters were enriched via four

cycles of PCR.

MHC Region Capture and High Throughput SequencingOne mg of prepared sample library was hybridized to the

capture probes following the manufacturer’s protocols (Roche

NimbleGen). The hybridization mixture was incubated for 70 h at

65uC. After the hybridization was finished, the target fragments

were captured with M-280 Streptavidin Dynabeads (Invitrogen),

and then the captured sample was washed twice at 47uC and three

more times at room temperature using the manufacturer’s buffers.

The captured target fragments were amplified using PlatinumHPfx DNA Polymerase (Invitrogen) at the following condition: 94uCfor 2 min, followed by 15 cycles of 94uC for 15 s, 58uC for 30 s,

72uC for 50 s and a final extension of 72uC for 5 min. The PCR

products were purified and sequenced with standard 26 91

paired-end reads on the Illumina HiSeq2000 sequencer following

manufacturer’s instructions.

Sequencing Read Mapping and Variant CallingLow quality and adapter contaminated reads were filtered out

to get the clean reads and then they were aligned to the reference

assembly (UCSC Hg19) with haplotypes sequence removed using

Burrows-Wheeler Aligner (BWA, version 0.5.9) with the param-

eters -o 1 -e 63 -i 15 -L -k 2 -l 31 -q 10–I [22]. PCR duplicates

were removed by Samtools (version 0.1.17) [23]. Base qualities

were recalibrated and reads were realigned around potential

insertions and/or deletions using The Genome AnalysisToolkit

(GATK) software (version 1.43). SNPs and InDel were also

detected using GATK for the regions with at least 46 sequencing

depth as described previously [24,25].

HLA Genes Haplotype Sequences AssembleAs showed in Figure 4, De novo assembling method was used to

construct the haploid sequence for each exon by using overlapped

reads. The assembled sequence was randomly paired if the whole

exon was broken by low depth region. We filtered those sequences

that did not exist in the IMGT/HLA database. As for the

remaining haplotype sequences, we scored them according to the

following formula:

Score~C2|N= log (R)

C: Coverage of this assembled sequence to its corresponding

exon;

N: Number of reads used to construct the assembled sequence;

R: Reliability of this assembled sequence. It was calculated using

the following formula:

R~

PXi{Xð Þ27X

h i

log dfð Þ :

Xi: The depth of each base of the assembled sequence;

X: The mean depth of the assembled sequence;

df: The degrees of freedom of the assembled sequence.

df =N21, N equals the length of assembled sequence.

Typing According to Assembled Haplotype SequenceWe combined the assembled sequences (called haplotypes) of all

exons together to produce a group of types, and then scored them

using the following formula:

TScore~N|S

N: The number of reads that support the haplotype.

S: The score of the haplotype.

Based on the assumption that number of reads mapped to the

correct haplotype is larger than that mapped to the incorrect

haplotype, we first select the two types with the highest TScore,

which can explain the highest number of the sequence reads for all

exons as candidate types, and then we divide the sequence reads

into two corresponding groups. We judge whether a HLA gene

would be a homozygous or heterozygous, depending on the ratio

of unique supporting reads number of the sub-optimal type,

compared to the supporting reads number of the best type. The

threshold of a heterozygote is 0.1. If the ratio is smaller then 0.1,

the gene is defined as a homozygote, and the type with the largest

score is chosen as the dominant type. Otherwise, the typed gene is

defined as heterozygote, and the two types with the highest score

are chosen as the final type result. Finally, we assess the reliability

of the final type by comparing it with the closed type using AScore:

AScore~S0{S1ð Þ2

S0{S1ð Þ2z S0{S2ð Þ2

An Integrated Tool to Study MHC Region

PLOS ONE | www.plosone.org 7 July 2013 | Volume 8 | Issue 7 | e69388

Page 58: MHC region and its related disease study

S0: The full score of the all assembled haplotypes of the typed

gene;

S1: The score of the closed type;

S2: The score of the final type.

On the other hand, all the discordant information recorded in

the process of ‘error correction’ was judged using variant calling

methods to check whether there is a novel SNP or InDel

comparing to an extant type. If so, the relationship of the novel

variant to the given type was also described.

Data Simulation and HLA TypingBasing on the sequence of eight MHC haplotypes given by the

MHC haplotype project in 2008, we randomly simulated one

diploid sample’s MHC region sequence data using wgsim (v 0.2.3)

provided by the Samtools package. First, we randomly selected

two from the eight haploids, at various sequencing depths and

various sequencing error rates using parameters –e and –N. The

insertion size was set to 500 bp, which was the same as our target

capture sequencing data, and the SD was 10. For each of the given

conditions, we simulated 36 human MHC region data sets. We

then worked out each simulated sample’s 26 HLA gene’s type

result, using the method we describe above. The statistical

accuracy result is given in Table S6.

Data AccessThe raw target capture sequencing data of MHC region for

sample YH, NA18526 and NA18555 has been deposited at the

NCBI Sequence Read Archive (SRA) under accession no.

SRA065846. The HLA typing pipeline is implemented in Perl

script and can be freely downloaded at http://soap.genomics.org.

cn/SOAP-HLA.html.

Supporting Information

Table S1 Coverage of whole gene body and exome of theMHC genes by capture sequencing data.(XLSX)

Table S2 Distribution of variations within differentgenomics functional regions for three samples.(XLSX)

Table S3 Comparison of MHC capture SNPs andIllumina 2.5 M genotyping alleles for sample YH. We

classified both the MHC capture alleles and the alleles that were

called by genotyping into three categories: (1) Hom ref

(homozygotes where both alleles are identical to the reference);

(2) Hom mut (homozygotes where both alleles differ from the

reference); (3) Het ref (heterozygotes where only one allele is

identical to the reference); The number of MHC capture

sequencing sites that are consistent with genotyping at both

alleles, at one allele, or that are inconsistent at both alleles were

categorized as 2, 1, and 0, respectively.

(XLSX)

Table S4 Comparison of MHC capture genotypes andHapMap genotypes for sample NA18532.

(XLSX)

Table S5 Comparison of MHC capture genotypes andHapMap genotypes for sample NA18555.

(XLSX)

Table S6 The accuracy of HLA typing method usingsimulated data at different sequencing depth withdifferent sequencing error rate.

(XLSX)

Table S7 HLA alleles typed by PCR based Sangersequence method and target capture sequence method.

(XLSX)

Table S8 Coverage of the MHC region using 200 bpinsert size library.

(XLSX)

Author Contributions

Conceived and designed the experiments: XX J. Wang YL. Performed the

experiments: J. Wu HJ Xiao Liu JZ WL. Analyzed the data: HC J. Wu YW

TZ YX DL PG Xiaomin Liu FY XT DC. Wrote the paper: J. Wu.

Designed and produced the probe: HC HJ BG MD’A TR. Developed

HLA typing pipeline: HC TZ. Assisted in writing manuscript: YW TZ.

Revised the manuscript: HC HJ Xiao Liu YS LCAMT.

References

1. The MHC Sequencing Consortium (1999) Complete sequence and gene map of

a human major histocompatibility complex. Nature 401: 921–923.

2. Fernando MM, Stevens CR, Walsh EC, De Jager PL, Goyette P, et al. (2008)

Defining the role of the MHC in autoimmunity: a review and pooled analysis.

PLoS Genet 4(4): e1000024.

3. Rioux JD, Goyette P, Vyse TJ, Hammarstrom L, Fernando MM, et al. (2009)

Mapping of multiple susceptibility variants within the MHC region for 7

immune-mediated diseases. Proc Natl Acad Sci USA 106(44): 18680–18685.

4. Trowsdale J (2011) The MHC, disease and selection. Immunol Lett 137: 1–8.

5. Stewart CA, Horton R, Allcock RJ, Ashurst JL, Atrazhev AM, et al. (2004)

Complete MHC haplotype sequencing for common disease gene mapping.

Genome Res 14(6): 1176–1187.

6. Horton R, Gibson R, Coggill P, Miretti M, Allcock RJ, et al. (2008) Variation

analysis and gene annotation of eight MHC haplotypes: the MHC Haplotype

Project. Immunogenetics 60(1): 1–18.

7. Morishima Y, Sasazuki T, Inoko H, Juji T, Akaza T, et al. (2002) The clinical

significance of human leukocyte antigen (HLA) allele compatibility in patients

receiving a marrow transplant from serologically HLA-A, HLA–B, and HLA-

DR matched unrelated donors. Blood 99(11): 4200–4206.

8. Lee SJ, Klein J, Haagenson M, Baxter-Lowe LA, Confer DL, et al. (2007) High-

resolution donor-recipient HLA matching contributes to the success of unrelated

donor marrow transplantation. Blood 110(13): 4576–4583.

9. Bray RA, Hurley CK, Kamani NR, Woolfrey A, Muller C, et al. (2008) National

marrow donor program HLA matching guidelines for unrelated adult donor

hematopoietic cell transplants. Biol Blood Marrow Transplant 14(9 Suppl): 45–

53.

10. Petersdorf EW, Malkki M, Gooley TA, Spellman SR, Haagenson MD, et al.

(2012) MHC-Resident Variation Affects Risks After Unrelated Donor

Hematopoietic Cell Transplantation. Sci Transl Med 4: 144ra101.

11. Proll J, Danzer M, Stabentheiner S, Niklas N, Hackl C, et al. (2011) Sequence

capture and next generation resequencing of the MHC region highlights

potential transplantation determinants in HLA identical haematopoietic stem

cell transplantation. DNA Res 18(4): 201–210.

12. Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, et al. (2009) Solution

hybrid selection with ultra-long oligonucleotides for massively parallel targeted

sequencing. Nat Biotechnol 27(2): 182–189.

13. Shearer AE, DeLuca AP, Hildebrand MS, Taylor KR, Gurrola J, et al. (2010)

Comprehensive genetic testing for hereditary hearing loss using massively

parallel sequencing. Proc Natl Acad Sci U S A 107(49): 21104–21109.

14. Pruitt KD, Tatusova T, Maglott DR (2007) NCBI reference sequences (RefSeq):

a curated non-redundant sequence database of genomes, transcripts and

proteins. Nucleic Acids Res 35(Database issue): D61–65.

15. The International HapMap Consortium (2007) A second generation human

haplotype map of over 3.1 million SNPs. Nature 449(7164): 851–861.

16. Dunckley H (2012) HLA typing by SSO and SSP methods. Methods Mol Biol

882: 9–25.

17. Erlich RL, Jia X, Anderson S, Banks E, Gao X, et al. (2011) Next-generation

sequencing for HLA typing of class I loci. BMC Genomics 12: 42.

18. de Bakker PI, McVean G, Sabeti PC, Miretti MM, Green T, et al. (2006) A

high-resolution HLA and SNP haplotype map for disease association studies in

the extended human MHC. Nat Genet 38(10): 1166–1172.

19. Horton R, Wilming L, Rand V, Lovering RC, Bruford EA, et al. (2004) Gene

map of the extended human MHC. Nat Rev Genet 5: 889–899.

An Integrated Tool to Study MHC Region

PLOS ONE | www.plosone.org 8 July 2013 | Volume 8 | Issue 7 | e69388

Page 59: MHC region and its related disease study

20. Wang J, Wang W, Li R, Li Y, Tian G, et al. (2008) The diploid genome

sequence of an Asian individual. Nature 456(7218): 60–65.21. The International HapMap Consortium (2003) The International HapMap

Project. Nature 426(6968): 789–796.

22. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760.

23. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al. (2009) The SequenceAlignment/Map format and SAMtools. Bioinformatics 25(16): 2078–2079.

24. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, et al. (2010) The

Genome Analysis Toolkit: a MapReduce framework for analyzing next-

generation DNA sequencing data. Genome Res 20(9): 1297–1303.

25. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, et al. (2011) A

framework for variation discovery and genotyping using next-generation DNA

sequencing data. Nat Genet 43(5): 491–498.

An Integrated Tool to Study MHC Region

PLOS ONE | www.plosone.org 9 July 2013 | Volume 8 | Issue 7 | e69388

Page 60: MHC region and its related disease study

Paper III Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology By Hongzhi Cao1,3,4, Alex R Hastie2, Dandan Cao1,3, Ernest T Lam2, Yuhui Sun1,5, Haodong Huang1,5, Xiao Liu1,4, Liya Lin1,5, Warren Andrews2, Saki Chan2, Shujia Huang1,5, Xin Tong1, Michael Requa2, Thomas Anantharaman2, Anders Krogh4, Huanming Yang1,3, Han Cao2 and Xun Xu1,3

1BGI-Shenzhen, Shenzhen 518083, China. 2BioNano Genomics, San Diego, California 92121, USA. 3Shenzhen Key Laboratory of Transomics Biotechnologies, Shenzhen 518083, China. 4Department of Biology, University of Copenhagen, Copenhagen 2200, Denmark. 5School of Bioscience and Biotechnology, South China University of Technology, Guangzhou 511400, China. Publication details Published in GigaScience 3:34 (2014)

Page 61: MHC region and its related disease study

Cao et al. GigaScience 2014, 3:34http://www.gigasciencejournal.com/content/3/1/34

RESEARCH Open Access

Rapid detection of structural variation in ahuman genome using nanochannel-basedgenome mapping technologyHongzhi Cao1,3,4†, Alex R Hastie2†, Dandan Cao1,3†, Ernest T Lam2†, Yuhui Sun1,5, Haodong Huang1,5, Xiao Liu1,4,Liya Lin1,5, Warren Andrews2, Saki Chan2, Shujia Huang1,5, Xin Tong1, Michael Requa2, Thomas Anantharaman2,Anders Krogh4, Huanming Yang1,3, Han Cao2* and Xun Xu1,3*

Abstract

Background: Structural variants (SVs) are less common than single nucleotide polymorphisms and indels in thepopulation, but collectively account for a significant fraction of genetic polymorphism and diseases. Base pairdifferences arising from SVs are on a much higher order (>100 fold) than point mutations; however, none of thecurrent detection methods are comprehensive, and currently available methodologies are incapable of providingsufficient resolution and unambiguous information across complex regions in the human genome. To addressthese challenges, we applied a high-throughput, cost-effective genome mapping technology to comprehensivelydiscover genome-wide SVs and characterize complex regions of the YH genome using long single molecules(>150 kb) in a global fashion.

Results: Utilizing nanochannel-based genome mapping technology, we obtained 708 insertions/deletions and 17inversions larger than 1 kb. Excluding the 59 SVs (54 insertions/deletions, 5 inversions) that overlap with N-basegaps in the reference assembly hg19, 666 non-gap SVs remained, and 396 of them (60%) were verified by paired-enddata from whole-genome sequencing-based re-sequencing or de novo assembly sequence from fosmid data. Of theremaining 270 SVs, 260 are insertions and 213 overlap known SVs in the Database of Genomic Variants. Overall, 609 outof 666 (90%) variants were supported by experimental orthogonal methods or historical evidence in public databases.At the same time, genome mapping also provides valuable information for complex regions with haplotypes in astraightforward fashion. In addition, with long single-molecule labeling patterns, exogenous viral sequences weremapped on a whole-genome scale, and sample heterogeneity was analyzed at a new level.

Conclusion: Our study highlights genome mapping technology as a comprehensive and cost-effective method fordetecting structural variation and studying complex regions in the human genome, as well as deciphering viralintegration into the host genome.

Keywords: Genome mapping, Structural variation, Repeat units, Epstein-Barr virus (EBV) integration

* Correspondence: [email protected]; [email protected]†Equal contributors2BioNano Genomics, San Diego, California 92121, USA1BGI-Shenzhen, Shenzhen 518083, ChinaFull list of author information is available at the end of the article

© 2014 Cao et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the CreativeCommons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, andreproduction in any medium, provided the original work is properly credited. The Creative Commons Public DomainDedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article,unless otherwise stated.

Page 62: MHC region and its related disease study

Cao et al. GigaScience 2014, 3:34 Page 2 of 11http://www.gigasciencejournal.com/content/3/1/34

BackgroundA structural variant (SV) is generally defined as a regionof DNA 1 kb and larger in size that is different withrespect to another DNA sample [1]; examples includeinversions, translocations, deletions, duplications and in-sertions. Deletions and duplications are also referred toas copy number variants (CNVs). SVs have proven to bean important source of human genetic diversity anddisease susceptibility [2-6]. Base pair differences arisingfrom SVs occur on a significantly higher order (>100fold) than point mutations [7,8], and data from the 1000Genomes Project show population-specific patterns ofSV prevalence [9,10]. Also, recent studies have firmlyestablished that SVs are associated with a number ofhuman diseases ranging from sporadic syndromes andMendelian diseases to common complex traits, particu-larly neurodevelopmental disorders [11-13]. Chromosomalaneuploidies, such as trisomy 21 and monosomy X havelong been known to be the cause of Down’s and Turnersyndromes, respectively. A microdeletion at 15q11.2q12has been shown causal for Prader-Willi syndrome [14],and many submicroscopic SV syndromes have been re-vealed since then [15]. In addition, rare, large de novoCNVs were identified to be enriched in autism spectrumdisorder (ASD) cases [16], and other SVs were describedas contributing factors for other complex traits includingcancer, schizophrenia, epilepsy, Parkinson’s disease andimmune diseases, such as psoriasis (reviewed in [11] and[12]). With the increasing recognition of the importantrole of genomic aberrations in disease and the needfor improved molecular diagnostics, comprehensivecharacterization of these genomic SVs is vital for, not onlydifferentiating pathogenic events from benign ones, butalso for rapid and full-scale clinical diagnosis.While a variety of experimental and computational

approaches exist for SV detection, each has its distinctbiases and limitations. Hybridization-based approaches[17-19] are subject to amplification, cloning andhybridization biases, incomplete coverage, and low dy-namic range due to hybridization saturation. Moreover,detection of CNV events by these methods provides nopositional context, which is critical to deciphering theirfunctional significance. More recently, high-throughputnext generation sequencing (NGS) technologies have beenheavily applied to genome analysis based on alignment/mapping [20-22] or de novo sequence assembly (SA) [23].Mapping methods include paired-end mapping (PEM)[20], split-read mapping (SR) [21] and read depth analysis(RD) [22]. These techniques can be powerful, but are tedi-ous and biased towards deletions owing to typical NGSshort inserts and short reads [24,25]. De novo assemblymethods are more versatile and can detect a larger rangeof SV types and sizes (0 ~ 25 kb) by pair-wise genomecomparison [23-25]. All such NGS-based approaches lack

power for comprehensiveness and are heavily biased againstrepeats and duplications because of short-read mappingambiguity and assembly collapse [9,10,26]. David C.Schwartz’s group promoted optical mapping [27] as analternative to detect SVs along the genome with restrictionmapping profiles of stretched DNA, highlighting the useof long single-molecule DNA maps in genome analysis.However, since the DNA is immobilized on glass surfacesand stretched, the technique suffers from low throughputand non-uniform DNA stretching, resulting in impreciseDNA length measurement and high error rate, hinderingits utility and adoption [24,27-29]. Thus, an effectivemethod to help detect comprehensive SVs and revealcomplex genomic regions is needed.The nanochannel-based genome mapping technology,

commercialized as the “Irys” platform, automatically im-ages fluorescently labeled DNA molecules in a massivelyparallel nanochannel array, and was introduced as an ad-vanced technology [30] compared to other restrictionmapping methods because of high-throughput data col-lection and its robust and highly uniform linearization ofDNA in nanochannels. This technology has previouslybeen described and used to map the 4.7-Mb highly vari-able human major histocompatibility complex (MHC)region [31], as well as for de novo assembly of a 2.1-Mbregion in the highly complex Aegilops tauschii genome[32], lending great promise for use in complete genomesequence analysis. Here, we apply this rapid and high-throughput genome mapping method to discern genomewide SVs, as well as explore complex regions based onthe YH (first Asian genome) [33] cell line. The workflowfor mapping a human genome on Irys requires no libraryconstruction; instead, whole genomic DNA is labeled,stained and directly loaded into nanochannels for im-aging. With the current throughput, one can collectenough data for de novo assembly of a human genomein less than three days. Additionally, comprehensive SVdetection can be accomplished with genome mappingalone, without the addition of orthogonal technologiesor multiple library preparations. Utilizing genome map-ping, we identified 725 SVs including insertions/dele-tions, inversions, as well as SVs involved in N-base gapregions that are difficult to assess by current methods.For 50% of these SVs, we detected a signal of variation byre-sequencing and an additional 10% by fosmid sequence-based de novo assembly whereas the remainder had nosignal by sequencing, hinting at the intractability of detec-tion by sequencing. Detailed analyses showed most of theundetected SVs (80%, 213 out of 270) could be foundoverlapped in the Database of Genomic Variant (DGV)database indicating their reliability. Genome mappingalso provides valuable haplotype information on complexregions, such as MHC, killer cell Immunoglobulin-like re-ceptor (KIR), T cell receptor alpha/beta (TRA/TRB) and

Page 63: MHC region and its related disease study

Cao et al. GigaScience 2014, 3:34 Page 3 of 11http://www.gigasciencejournal.com/content/3/1/34

immunoglobulin light/heavy locus (IGH/IGL), which canhelp determine these hyper-variable regions’ sequencesand downstream functional analyses. In addition, withlong molecule labeling patterns, we were able to accuratelymap the exogenous virus’ sequence that integrated intothe human genome, which is useful for the study of themechanism of how virus sequence integration leads toserious diseases like cancer.

Data descriptionHigh-molecular weight DNA was extracted from the YHcell line, and high-quality DNA was labeled and run onthe Irys system. After excluding DNA molecules smallerthan 100 kb for analysis, we obtained 303 Gb of data giv-ing 95× depth for the YH genome (Table 1). For subse-quent analyses, only molecules larger than 150 kb (223Gb, ~70X) were used. De novo assembly resulted in a setof consensus maps with an N50 of 1.03 Mb. We per-formed “stitching” of neighboring genome maps thatwere fragmented by fragile sites associated with nicksites immediately adjacent to each other. After fragilesite stitching, the N50 improved to 2.87 Mb, and the as-sembly covered 93.0% of the non-N base portion of thehuman genome reference assembly hg19. Structural vari-ation was classified as a significant discrepancy betweenthe consensus maps and the hg19 in silico map. Furtheranalyses were performed for highly repetitive regions,complex regions and Epstein-Barr virus (EBV) integra-tion. Supporting data is available from the GigaSciencedatabase, GigaDB [34-36].

AnalysesGeneration of single-molecule sequence motif mapsGenome maps were generated for the YH cell line by puri-fying high-molecular weight DNA in a gel plug and label-ing at single-strand nicks created by the Nt.BspQI nickingendonuclease. Molecules were then linearized in nano-channel arrays etched in silicon wafers for imaging [31,32].From these images, a set of label locations on each DNAmolecule defined an individual single-molecule map. Single

Table 1 Molecule collection statistics under differentlength thresholds

Lengthcutoff (kb)

No. Molecules Total length (Gb) Estimateddepth (X)*

100 1,568,969 303 95

120 1,313,832 275 86

150 932,855 223 70

180 659,149 179 56

200 523,146 153 48

300 170,432 68 21

500 22,944 14 4

*Estimated depth based on 3.2 Gb genome size.

molecules have, on average, one label every 9 kb and wereup to 1 Mb in length. A total of 932,855 molecules largerthan 150 kb were collected for a total length of 223 Gb(~70-fold average depth) (Table 1). Molecules can bealigned to a reference to estimate the error rates in the sin-gle molecules. Here, we estimated the missing label rate is10%, and the extra label rate is 17%. Most of the error asso-ciated with these reference differences are averaged out inthe consensus de novo assembly. Distinct genetic featuresintractable to sequencing technologies, such as long arraysof tandem repeats were observed in the raw single mole-cules (Additional file 1: Figure S1).

De novo assembly of genome maps from single-moleculedataSingle molecules were assembled de novo into consensusgenome maps using an implementation of the overlap-layout-consensus paradigm [37]. An overlap graph wasconstructed by an initial pairwise comparison of all mol-ecules >150 kb, by pattern matching using commercialsoftware from BioNano Genomics. Thresholds for thealignments were based on a p-value appropriate for thegenome size (thresholds can be adjusted for differentgenome sizes and degrees of complexity) to preventspurious edges. This graph was used to generate a draftconsensus map set that was improved by alignment ofsingle molecules and recalculation of the relative labelpositions. Next, the consensus maps were extended byaligning overhanging molecules to the consensus mapsand calculating a consensus in the extended regions.Finally, the consensus maps were compared and mergedwhere patterns matched (Figure 1). The result of this denovo assembly is a genome map set entirely independentof known reference or external data. In this case, YH wasassembled with an N50 of 1.03 Mb in 3,565 maps and anN50 of 2.87 Mb in 1,634 maps after stitching fragile sites(Additional file 1: Figure S2 and Additional file 1: TableS1). These genome maps define motif positions that occuron every 9 kb on average, and these label site positionshave a resolution of 1.45 kb. The standard deviation forinterval measurements between two labels varies withlength. For example, for a 10 kb interval, the standard de-viation (SD) is 502 bp, and for a 100 kb interval, it is1.2 kb. Consensus genome maps were aligned to an insilico Nt.BspQI sequence motif map of hg19. Ninety-ninepercent of the genome maps could align to hg19 andthey overlap 93% of the non-gap portion of hg19.

Structural variation analysisUsing the genome map assembly as input, we performedstructural variation detection (Figure 1), and the genomemaps were compared to hg19. Strings of intervals be-tween labels/nick motifs were compared and when theydiverged, an outlier p-value was calculated and SVs were

Page 64: MHC region and its related disease study

Figure 1 Flowchart of consensus genome map assembly and structural variant discovery using genome mapping data.

Cao et al. GigaScience 2014, 3:34 Page 4 of 11http://www.gigasciencejournal.com/content/3/1/34

called at significant differences (See Methods for details),generating a list of 725 SVs including 59 that overlappedwith N-base gaps in hg19 (Additional file 2, Spreadsheet3). Based on the standard deviation of interval measure-ments, 1.5 kb is the smallest insertion or deletion thatcan be confidently measured for an interval of about10 kb if there is no pattern change. However, if label pat-terns deviate from the reference, SVs with a net size dif-ference less than 1.5 kb can be detected. Additional file1: Figure S1 shows three mapping examples (one dele-tion, one insertion, and one inversion) of gap region SVs.We present these 59 events separately although technic-ally, in those cases, genome mapping detected structuraldifferences between the genome maps and reference re-gions. For the remaining 666 SVs, 654 of them were

Figure 2 Size distribution of total detected large insertions (green) anhistogram bars in red and blue respectively represent deletions and insertio

insertions/deletions (Figure 2) while 12 were inversions(Additional file 2, Spreadsheet 1 & 2). Out of the 654 in-sertions/deletions, 503 were defined as insertions and151 were deletions, demonstrating an enrichment ofinsertions for this individual with respect to the hg19reference (Figure 2). Of the 59 SV events that span N-gap regions, 5 of them were inversions. Of the remaining54 events, 51 were estimated to be shorter than indicatedand 3 longer. These gap-region related SVs indicate a spe-cific structure of gap regions of the YH genome comparedto the hg19 reference.In order to validate our SVs, we first cross-referenced

them with the public SV database DGV (http://dgv.tcag.ca/dgv/app/home) [38]. For each query SV, we required50% overlap with records in DGV. We found that the

d deletions (purple) using genome mapping. The comparativens supported by NGS. NGS: next-generation sequencing.

Page 65: MHC region and its related disease study

Cao et al. GigaScience 2014, 3:34 Page 5 of 11http://www.gigasciencejournal.com/content/3/1/34

majority of the SVs (583 out of 666; 87.5%) could befound (Additional file 2, Spreadsheet 1 & 2), confirmingtheir reliability. Next, we applied the NGS discordantpaired-end mapping and read depth-based methods, aswell as fosmid-based de novo assembly (See Methods fordetail), and as a result, detected an SV signal in 396(60%, Figure 2) out of 666 SVs by at least one of the twomethods (Figure 2, Additional file 2, Spreadsheet 1 & 2).For the remaining 270 SVs, 79% (213 out of 270, Additionalfile 2, Spreadsheet 1 & 2) were found in the DGV database.Overall, 91% (609 out of 666, Additional file 2, Spreadsheet1 & 2) of SVs had supporting evidence by retrospectivelyapplied sequencing-based methods or database entries.We wanted to determine if SVs revealed by genome

mapping, but without an NGS supported signal, hadunique properties. We firstly investigated the distribu-tion of NGS-supported SVs and NGS-unsupported SVsin repeat-rich and segmental duplication regions. How-ever we did not find significant differences betweenthem (data not shown) which was in concordance withprevious findings [27]. We also compared the distribu-tion of insertions and deletions of different SV categoriesand found that SV events that were not supported by se-quencing evidence were 97% (260 out of 268) insertions;in contrast, the SVs that were supported by sequencing evi-dence were only 61% (243 out of 396, Figure 2, Additionalfile 2, Spreadsheet 1) insertions showing insertion en-richment (p = 2.2e-16 Chi-squared test, Figure 2) inSVs without sequencing evidence. In addition, we fur-ther investigated the novel 57 SVs without either se-quencing evidence or database supporting evidence.We found that the genes they covered had importantfunctions, such as ion binding, enzyme activating andso forth, indicating their important role in cellular bio-chemical activities. Some of the genes like ELMO1,HECW1, SLC30A8, SLC16A12, JAM3 are reported tobe associated with diseases like diabetic nephropathy,lateral sclerosis, diabetes mellitus and cataracts [39],providing valuable foundation for clinical application(Additional file 2, Spreadsheet 1 & 2).

Highly repetitive regions of the human genomeHighly repetitive regions of the human genome areknown to be nearly intractable by NGS because shortreads are often collapsed, and these regions are often re-fractory to cloning. We have searched for and analyzedone class of simple tandem repeats (unit size rangingfrom 2-13 kb) in long molecules derived from the ge-nomes of YH (male) and CEPH-NA12878 (female). Thefrequencies of these repeating units from both genomeswere plotted in comparison with hg19 (Figure 3). Wefound repeat units across the entire spectrum of sizes inYH and NA12878 while there were only sporadic peaks inhg19, implying an under representation of copy number

variation as described in the current reference assembly.Furthermore, we have found a very large peak of approxi-mately 2.5-kb repeats in YH (male, 691 copies) but not inNA19878 (female, 36 copies; Figure 3). This was furthersupported by additional genome mapping in other malesand females demonstrating a consistent and significantquantity of male-specific repeats of 2.5 kb (unpublished).As an example, Additional file 1: Figure S3 shows a rawimage of an intact long molecule of 630 kb with two tractsof at least 53 copies and at least 21 copies of 2.5-kb tan-dem repeats (each 2.5-kb unit has one nick label site, cre-ating the evenly spaced pattern) physically linked byanother label-absent putative tandem repeat spanningover 435 kb, and Additional file 1: Figure S4 shows con-vincing mapping information. Unambiguously elucidatingthe absolute value and architecture of such complex re-peat regions is not possible with other short fragment orhybridization-based methods.

Complex region analysis using genome mappingBesides SV detection, genome mapping data also provideabundant information about other complex regions inthe genome. For complex regions that are functionallyimportant, an accurate reference map is critical for precisesequence assembly and integration for functional analysis[40-43]. We analyzed the structure of some complexregions of the human genome. They include MHCalso called Human leukocyte antigen (HLA), KIR,IGL/IGH, as well as TRA/TRB [44-48]. In the highlyvariable HLA-A and –C loci, the YH genome sharedone haplotype with the previously typed PGF genome(used in hg19) and also revealed an Asian/YH-specificvariant on maps 209 and 153 (Additional file 1: FigureS5), respectively. In the variant haplotype (Map ID153), there is a large insertion at the HLA-A locuswhile at the HLA-D and RCCX loci, YH had an Asian/YH-specific insertion and a deletion. In addition to theMHC region, we also detected Asian/YH-specific struc-tural differences in KIR (Additional file 1: Figure S6),IGH/IGL (Additional file 1: Figure S7), and TRA/TRB(Additional file 1: Figure S8), compared to the referencegenome.

External sequence integration detection using genomemappingExternal viral sequence integration detection is import-ant for the study of diseases such as cancer, but currenthigh-throughput methods are limited in discoveringintegration break points [49-51]. Although fiber fluores-cence in situ hybridization (FISH) was used to discrimin-ate between integration and episomal forms of virusutilizing long dynamic DNA molecules [52], this methodwas laborious, low-resolution and low-throughput. Thus,long, intact high-resolution single-molecule data provided

Page 66: MHC region and its related disease study

Figure 3 A plot of repeat units in two human genomes as seen in single molecules. A repeat unit is defined as five or more equidistantlabels. Total units in bins are normalized to the average coverage depth in the genome.

Figure 4 Circos plot of distribution of integration eventsthroughout YH genome. The genome was divided intonon-overlapping windows of 200 kb. The number of molecules withevidence of integration in each window is plotted with each concentricgrey circle representing a two-fold increment in virus detection.

Cao et al. GigaScience 2014, 3:34 Page 6 of 11http://www.gigasciencejournal.com/content/3/1/34

by genome mapping allows for rapid and effective analysisof which part of the virus sequence has been integratedinto the host genome and its localization. We detectedEBV integration into the genome of the cell line sample.The EBV virus map was assembled de novo during the

whole genome de novo assembly of the YH cell linegenome. We mapped the de novo EBV map to in silicomaps from public databases to determine the strain thatwas represented in the cell line. We found that the YHstrain was most closely related, although not identical,to strain B95-8 (GenBank: V01555.2). To detect EBV in-tegration, portions of the aligned molecules extendingbeyond the EBV map were extracted and aligned withhg19 to determine potential integration sites (Additionalfile 1: Figure S9). There are 1,340 EBV integration eventsacross the genome (Figure 4). We found that the frequencyof EBV integration mapping was significantly lower thanthe average coverage depth (~70X), implying the DNAsample derived from a clonal cell population is potentiallymore diverse than previously thought, and that this methodcould reveal the heterogeneity of a very complex samplepopulation at the single-molecule level. Also, the integratedportion of the EBV genome sequence was detected with alarger fraction towards the tail (Additional file 1: FigureS10). Besides integration events, we also found EBVepisome molecules whose single-molecule map could bemapped to the EBV genome, free of flanking humangenomic regions.

DiscussionStructural variants are increasingly frequently shown toplay important roles in human health. However, availabletechnologies, such as array-CGH, SNP array and NGSare incapable of cataloguing them in a comprehensiveand unbiased manner. Genome mapping, a technologysuccessfully applied to the assembly of complexregions of a plant genome and characterization of

structural variation and haplotype differences in thehuman MHC region, has been adopted to capture thegenome-wide structure of a human individual in thecurrent study. Evidence for over 600 SVs in this indi-vidual has been provided. Despite the difficulty of SVdetection by sequencing methods, the majority of gen-ome map-detected SVs were retrospectively found tohave signals consistent with the presence of an SV,validating genome mapping for SV discovery. Approxi-mately 75% of the SVs discovered by genome mapping

Page 67: MHC region and its related disease study

Cao et al. GigaScience 2014, 3:34 Page 7 of 11http://www.gigasciencejournal.com/content/3/1/34

were insertions; this interesting phenomenon may be amethod bias or a genuine representation of the additionalcontent in this genome of Asian descent that is notpresent in hg19, which was compiled based on genomicmaterials presumably derived from mostly non-Asians.Analysis of additional genomes is necessary for compari-son. Insertion detection is refractory to many existingmethodologies [24,25], so to some extent, genome map-ping revealed its distinct potential to address this chal-lenge. Furthermore, functional annotation results of thedetected SVs show that 30% of them (Additional file 2,Spreadsheet 1 & 2) affect exonic regions of relevant geneswhich may cause severe effects on gene function. Geneontology (GO) analysis demonstrates that these SVs areassociated with genes that contribute to important bio-logical processes (Additional file 2, Spreadsheet 1 & 2 andAdditional file 1: Figure S11), reflecting that the SVs de-tected here are likely to affect a large number of genes andmay have a significant impact on human health. Genomemapping provides us with an effective way to study theimpact of genome-wide SV on human conditions. SomeN-base gaps are estimated to have longer or shorter lengthor more complex structurally compared to hg19, demon-strating that genome mapping is useful for improving thehuman and other large genome assemblies. We alsopresent a genome-wide analysis of short tandem repeatsin individual human genomes and structural informationand differences for some of the most complex regions inthe YH genome. Independent computational analysis hasbeen performed to discern exogenous viral insertions, aswell as exogenous episomes. All of these provide invaluableinsights into the capacity of genome mapping as a promis-ing new strategy for research and clinical application.The basis for the genome mapping technology that en-

ables us to effectively address shortcomings of existingmethodologies is the use of motif maps derived from ex-tremely long DNA molecules hundreds of kb in length.Using these motif maps, we are able to also accesschallenging loci where existing technologies fail. Firstly,global structural variations were easily and quickly de-tected. Secondly, evidence for a deletion bias which iscommonly observed with both arrays and NGS technol-ogy, is absent in genome mapping. In fact, we observemore insertions than deletions in this study. Thirdly, forthe first time, we are able to measure the length of re-gions of the YH genome that represent gaps in thehuman reference assembly. Fourthly, consensus mapscould be assembled in highly variable regions in the YHgenome which are important for subsequent functionalanalysis. Finally, both integrated and un-integrated EBVmolecules are identified, and potential sub-strains differ-entiated, and the EBV genome sequence that integratedinto the host genome was obtained directly. This infor-mation was previously inaccessible without additional

PCR steps or NGS approaches [50]. All in all, we dem-onstrated advantages and strong potential of the genomemapping technology based on nanochannel arrays tohelp overcome problems that have severely limited ourunderstanding of the human genome.In addition to the advantages this study reveals about

the genome mapping technology, aspects that need to beimproved also are highlighted. As genome mapping tech-nology generates sequence-specific motif-labeled DNAmolecules and analyzes these motif maps using an overlap-layout-consensus algorithm, subsequent performance andresolution largely depends on motif density (any individualevent endpoints can only be resolved to the nearest restric-tion sites). For example, the EBV integration analysis in thisstudy was more powerful in the high-density regions(Additional file 1: Figure S10). Hence, higher density label-ing methods to increase the information density that maypromote even higher accuracy and unbiased analysis of ge-nomes are currently being further developed. When datafrom genome mapping is combined with another source ofinformation, one can achieve even higher resolution foreach event. In addition, reducing random errors like extrarestriction sites, missing restriction sites and size measure-ment is important for subsequent analysis. Finally, improve-ments to the SV detection algorithm will provide furtherdiscovery potential, and balanced reciprocal translocationscan be identified in genome maps generated from cancermodel genomes (personal communication, Michael Rossi).The throughput and speed of a technology remains

one of the most important factors for routine use inclinical screening as well as scientific research. At thetime of manuscript submission, genome mapping of ahuman individual could be accomplished with fewerthan three nanochannel array chips in a few days. It isanticipated that a single nanochannel chip would covera human size genome in less than one day within6 months, facilitating new studies aimed at unlockingthe inaccessible portions of the genome. In this way,genome mapping has an advantage over the use of mul-tiple orthogonal methods that are often used to detectglobal SVs. Thus, it is now feasible to conduct largepopulation-based comprehensive SV studies efficientlyon a single platform.

MethodsHigh-molecular weight DNA extractionHigh-molecular weight (HMW) DNA extraction wasperformed as recommended for the CHEF MammalianGenomic DNA Plug Kit (BioRad #170-3591). Briefly,cells from the YH or NA12878 cell lines were washedwith 2x with PBS and resuspended in cell resuspensionbuffer, after which 7.5 × 105 cells were embedded ineach gel plug. Plugs were incubated with lysis buffer andproteinase K for four hours at 50°C. The plugs were

Page 68: MHC region and its related disease study

Cao et al. GigaScience 2014, 3:34 Page 8 of 11http://www.gigasciencejournal.com/content/3/1/34

washed and then solubilized with GELase (Epicentre).The purified DNA was subjected to four hours ofdrop dialysis (Millipore, #VCWP04700) and quantifiedusing Nanodrop 1000 (Thermal Fisher Scientific) and/or the Quant-iT dsDNA Assay Kit (Invitrogen/Mo-lecular Probes).

DNA labelingDNA was labeled according to commercial protocolsusing the IrysPrep Reagent Kit (BioNano Genomics,Inc). Specifically, 300 ng of purified genomic DNA wasnicked with 7 U nicking endonuclease Nt.BspQI (NewEngland BioLabs, NEB) at 37°C for two hours in NEBBuffer 3. The nicked DNA was labeled with afluorescent-dUTP nucleotide analog using Taq polymer-ase (NEB) for one hour at 72°C. After labeling, the nickswere ligated with Taq ligase (NEB) in the presence ofdNTPs. The backbone of fluorescently labeled DNA wasstained with YOYO-1 (Invitrogen).

Data collectionThe DNA was loaded onto the nanochannel array ofBioNano Genomics IrysChip by electrophoresis of DNA.Linearized DNA molecules were then imaged automatic-ally followed by repeated cycles of DNA loading usingthe BioNano Genomics Irys system.The DNA molecules backbones (YOYO-1 stained) and

locations of fluorescent labels along each molecule weredetected using the in-house software package, IrysView.The set of label locations of each DNA molecule definesan individual single-molecule map.

De novo genome map assemblySingle-molecule maps were assembled de novo into con-sensus maps using software tools developed at BioNanoGenomics. Briefly, the assembler is a custom implemen-tation of the overlap-layout-consensus paradigm with amaximum likelihood model. An overlap graph was gen-erated based on pairwise comparison of all molecules asinput. Redundant and spurious edges were removed.The assembler outputs the longest path in the graph andconsensus maps were derived. Consensus maps are fur-ther refined by mapping single-molecule maps to theconsensus maps and label positions are recalculated. Re-fined consensus maps are extended by mapping singlemolecules to the ends of the consensus and calculatinglabel positions beyond the initial maps. After merging ofoverlapping maps, a final set of consensus maps wasgenerated and used for subsequent analysis. Further-more, we applied a “stitching” procedure to join neigh-boring genome maps. Two adjacent genome mapswould be joined together if the junction a) was within50 kb apart, b) contained at most 5 labels, c) contained,or was within 50 kb from, a fragile site, and d) also

contained no more than 5 unaligned end labels. If thesecriteria were satisfied, the two genome maps would bejoined together with the intervening label patterns takenfrom the reference in silico map.

Structural variation detectionAlignments between consensus genome maps and thehg19 in silico sequence motif map were obtained using adynamic programming approach where the scoringfunction was the likelihood of a pair of intervals beingsimilar [53]. Likelihood is calculated based on a noisemodel which takes into account fixed sizing error, sizingerror which scales linearly with the interval size, mis-aligned sites (false positives and false negatives), andoptical resolution. Within an alignment, an interval orrange of intervals whose cumulative likelihood formatching the reference map is worse than 0.01 percentchance is classified as an outlier region. If such a regionoccurs between highly scoring regions (p-value of 10e−6),an insertion or deletion call is made in the outlier re-gion, depending on the relative size of the region on thequery and reference maps. Inversions are defined if adja-cent match-groups between the genome map and refer-ence are in reverse relative orientation.

Signals refined by re-sequencing and de novo assemblybased methodsIn order to demonstrate the capacity of genome mappingfor the detection of large SVs, we tested the candidateSVs using whole-genome paired-end 100 bp sequencing(WGS) data with insert sizes of 500 bp and fosmid se-quence based de novo assembly result. SVs were testedbased on the expectation that authentic SVs would besupported by abnormally mapped read pairs, and thatdeletions with respect to the reference should have lowermapped read depth than average [20,22,23]. We per-formed single-end/(paired-end + single-end) reads ratio(sp ratio) calculations at the whole-genome level to assignan appropriate threshold for abnormal regions as well asdepth coverage. We set sp ratio and depth cutoff thresh-olds based on the whole genome data to define SV signals.Insertions with aberrant sp ratio and deletions with eithersp ratio or abnormal depth were defined to be a supportedcandidate.We also utilized fosmid-based de novo assembly data

to search for signals supporting candidate SVs. We usedcontigs and scaffolds assembled from short reads tocheck for linearity between a given assembly and hg19using LASTZ [54]. WGS-based and fosmid-based SVvalidation showed inconsistency and/or lack of satur-ation as each supported unique variants (Additional file1: Figure S2) [24].

Page 69: MHC region and its related disease study

Cao et al. GigaScience 2014, 3:34 Page 9 of 11http://www.gigasciencejournal.com/content/3/1/34

EBV integration detectionSingle-molecule maps were aligned with a map generatedin silico based on the EBV reference sequence (strain B95-8; GenBank: V01555.2). Portions of the aligned moleculesextending beyond the EBV map were extracted andaligned with hg19 to determine potential integration sites.

Availability of supporting dataThe data sets supporting the results of this article isavailable in the GigaScience GigaDB, repository [55]. Seethe individual GigaDB entries for the YH Bionano data[35] and YH fosmid validation data [36], which is alsoavailable in the SRA [PRJEB7886].

Additional files

Additional file 1: Figure S1. Comparison of consensus genome mapsand hg19 reference across gap regions. Figure S2. Consensus genomemap coverage of human reference assembly (hg19). Figure S3. Examplesof repetitive sequence detected in intact single molecules by genomemapping. Figure S4. Consensus genome map compared to hg19 in along tandem repeat region. Figure S5. Consensus genome mapscompared to hg19 in the MHC region. Figure S6. Consensus genomemaps compared to hg19 in the KIR region. Figure S7. Consensusgenome maps compared to hg19 in the IGH and IGL regions. Figure S8.Consensus genome maps compared to hg19 in the TRA and TRB regions.Figure S9. Single-molecule alignment to EBV in silico motif map (strainB95-8) showing evidence of strain variation and heterogeneous integration.Figure S10. Distribution of integrated portions of the EBV genome.Figure S11. GO annotations of genes within called SVs. Table S1. Summaryof consensus genome map assembly.

Additional file 2: Spreadsheet 1. Summary of insertions and deletionsdetected in BioNano genome maps. Spreadsheet 2: Summary ofinversions detected in BioNano genome maps. Spreadsheet 3: Validationof insertions and deletions detected in BioNano genome maps.

AbbreviationsArray-CGH: Array-based comparative genomic hybridization; AS: De novosequence assembly; ASD: Autism spectrum disorder; BCR: B cell receptor;CNV: Copy number variant; DGV: Database of genomic variants; EBV:Epstein-Barr virus; FISH: Fluorescence in situ hybridization; GO: Geneontology; HLA: Human leukocyte antigen; HMW: High-molecular weight;IGH: Immunoglobulin heavy locus; IGL: Immunoglobulin light locus; KIR: Killercell immunoglobulin-like receptor; LRC: Leukocyte Receptor Complex;MHC: Major histocompatibility complex; NGS: Next-generation sequencing;PCR: Polymerase chain reaction; PEM: Pair-end mapping; RD: Read depth;SNP: Single nucleotide polymorphism; SR: Split read; SV: Structural variation;TCR: T cell receptor; TRA: T cell receptor alpha locus; TRB: T cell receptor betalocus; WGS: Whole-genome sequencing; YH: YanHuang.

Competing interestsThe authors have the following interests: Alex R. Hastie, Ernest T. Lam,Warren Andrews, Saki Chan, Michael Requa, Thomas Anantharaman and HanCao were employees of BioNano Genomics at the time of the study, andthey owned company stock options.

Authors’ contributionsXX, HC and HZC managed the project. HC, HZC, ARH, DC, ETL and SCdesigned the analyses. HC, ARH, ETL, W A, MR and TA produced genomemapping data. HZC, ARH, DC, ETL, YS, HH, XL, LY, WA, SC, SH, XT, MR, TA,AK and HY performed the data analyses. HZC, ARH and DC did most of thewriting with contributions from all authors. All authors read and approvedthe final manuscript.

AcknowledgementsWe wish to recognize BGI-Shenzhen’s sequencing platform for generatingthe data in this study. We thank the faculty and staff at BGI-Shenzhen whocontributed to this project. We also thank everyone at BioNano Genomicswho contributed to the development of the Irys system and the team’shelpful discussion and analysis. This work was supported by the State KeyDevelopment Program for Basic Research of China—973 Program(NO.2011CB809202); the National High Technology Research and DevelopmentProgram of China - 863 Program (NO.2012AA02A201); the Shenzhen MunicipalGovernment of China (NO.JC201005260191A); Shenzhen Key Laboratory ofTransomics Biotechnologies (NO.CXB201108250096A).

Author details1BGI-Shenzhen, Shenzhen 518083, China. 2BioNano Genomics, San Diego,California 92121, USA. 3Shenzhen Key Laboratory of TransomicsBiotechnologies, Shenzhen 518083, China. 4Department of Biology, Universityof Copenhagen, Copenhagen 2200, Denmark. 5School of Bioscience andBiotechnology, South China University of Technology, Guangzhou 511400,China.

Received: 21 May 2014 Accepted: 28 November 2014Published: 30 December 2014

References1. Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM,

Aburatani H, Jones KW, Tyler-Smith C, Hurles ME, Carter NP, Scherer SW,Lee C: Copy number variation: new insights in genome diversity.Genome Res 2006, 16:949–961.

2. Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E,Hayden H, Albertson D, Pinkel D, Olson MV, Eichler EE: Fine-scale structuralvariation of the human genome. Nat Genet 2005, 37:727–732.

3. Feuk L, Carson AR, Scherer SW: Structural variation in the human genome.Nat Rev Genet 2006, 7:85–97.

4. Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, HansenN, Teague B, Alkan C, Antonacci F, Haugen E, Zerr T, Yamada NA, Tsang P,Newman TL, Tüzün E, Cheng Z, Ebling HM, Tusneem N, David R, Gillett W,Phelps KA, Weaver M, Saranga D, Brand A, Tao W, Gustafson E, McKernan K,Chen L, Malig M, et al: Mapping and sequencing of structural variationfrom eight human genomes. Nature 2008, 453:56–64.

5. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S,Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A,Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scalecopy number polymorphism in the human genome. Science 2004,305:525–528.

6. Itsara A, Cooper GM, Baker C, Girirajan S, Li J, Absher D, Krauss RM, MyersRM, Ridker PM, Chasman DI, Mefford H, Ying P, Nickerson DA, Eichler EE:Population analysis of large copy number variants and hotspots ofhuman genetic disease. Am J Hum Genet 2009, 84:148–161.

7. Cheng Z, Ventura M, She X, Khaitovich P, Graves T, Osoegawa K, Church D,DeJong P, Wilson RK, Pääbo S, Rocchi M, Eichler EE: A genome-widecomparison of recent chimpanzee and human segmental duplications.Nature 2005, 437:88–93.

8. Lupski JR: Genomic rearrangements and sporadic disease. Nat Genet 2007,39:S43–S47.

9. Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA,Hurles ME, McVean GA: A map of human genome variation frompopulation-scale sequencing. Nature 2010, 467:1061–1073.

10. Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A,Yoon SC, Ye K, Cheetham RK, Chinwalla A, Conrad DF, Fu Y, Grubert F,Hajirasouliha I, Hormozdiari F, Iakoucheva LM, Iqbal Z, Kang S, Kidd JM,Konkel MK, Korn J, Khurana E, Kural D, Lam HY, Leng J, Li R, Li Y, Lin CY,Luo R, et al: Mapping copy number variation by population-scalegenome sequencing. Nature 2011, 470:59–65.

11. Stankiewicz P, Lupski JR: Structural variation in the human genome andits role in disease. Annu Rev Med 2010, 61:437–455.

12. Girirajan S, Campbell CD, Eichler EE: Human copy number variation andcomplex genetic disease. Annu Rev Genet 2011, 45:203–226.

13. Weischenfeldt J, Symmons O, Spitz F, Korbel JO: Phenotypic impact ofgenomic structural variation: insights from and for human disease.Nat Rev Genet 2013, 14:125–138.

Page 70: MHC region and its related disease study

Cao et al. GigaScience 2014, 3:34 Page 10 of 11http://www.gigasciencejournal.com/content/3/1/34

14. Ledbetter DHRV, Airhart SD, Strobel RJ, Keenan BS, Crawford JD: Deletionsof chromosome 15 as a cause of the Prader-Willi syndrome. N Engl J Med1981, 304:325–329.

15. Vissers LE, Stankiewicz P: Microdeletion and microduplication syndromes.Methods Mol Biol 2012, 838:29–75.

16. Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B,Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee YH, Hicks J,Spence SJ, Lee AT, Puura K, Lehtimäki T, Ledbetter D, Gregersen PK,Bregman J, Sutcliffe JS, Jobanputra V, Chung W, Warburton D, King MC,Skuse D, Geschwind DH, Gilliam TC, et al: Strong association of de novocopy number mutations with autism. Science 2007, 316:445–449.

17. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H,Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, GonzálezJR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR,Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F,Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, et al: Globalvariation in copy number in the human genome. Nature 2006,444:444–454.

18. McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A,Shapero MH, de Bakker PI, Maller JB, Kirby A, Elliott AL, Parkin M, Hubbell E,Webster T, Mei R, Veitch J, Collins PJ, Handsaker R, Lincoln S, Nizzari M,Blume J, Jones KW, Rava R, Daly MJ, Gabriel SB, Altshuler D: Integrateddetection and population-genetic analysis of SNPs and copy numbervariation. Nat Genet 2008, 40:1166–1174.

19. Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM,Clark RA, Schwartz S, Segraves R, Oseroff VV, Albertson DG, Pinkel D, EichlerEE: Segmental duplications and copy-number variation in the humangenome. Am J Hum Genet 2005, 77:78–88.

20. Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD,Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L,Mardis ER: BreakDancer: an algorithm for high-resolution mapping ofgenomic structural variation. Nat Methods 2009, 6:677–681.

21. Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, Rusch MC,Chen K, Harris CC, Ding L, Holmfeldt L, Payne-Turner D, Fan X, Wei L, ZhaoD, Obenauer JC, Naeve C, Mardis ER, Wilson RK, Downing JR, Zhang J:CREST maps somatic structural variation in cancer genomes withbase-pair resolution. Nat Methods 2011, 8:652–654.

22. Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: an approach todiscover, genotype, and characterize typical and atypical CNVsfrom family and population genome sequencing. Genome Res 2011,21:974–984.

23. Li Y, Zheng H, Luo R, Wu H, Zhu H, Li R, Cao H, Wu B, Huang S, Shao H,Ma H, Zhang F, Feng S, Zhang W, Du H, Tian G, Li J, Zhang X, Li S, Bolund L,Kristiansen K, de Smith AJ, Blakemore AI, Coin LJ, Yang H, Wang J, Wang J:Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly. Naturebiotechnology 2011, 29:723–730.

24. Alkan C, Coe BP, Eichler EE: Genome structural variation discovery andgenotyping. Nat Rev Genet 2011, 12:363–376.

25. Zhao M, Wang Q, Jia P, Zhao Z: Computational tools for copy numbervariation (CNV) detection using next-generation sequencing data:features and perspectives. BMC Bioinformatics 2013, 14(11):S1.

26. Alkan C, Sajjadian S, Eichler EE: Limitations of next-generation genomesequence assembly. Nat Methods 2011, 8:61–65.

27. Teague B, Waterman MS, Goldstein S, Potamousis K, Zhou S, Reslewic S,Sarkar D, Valouev A, Churas C, Kidd JM, Kohn S, Runnheim R, Lamers C,Forrest D, Newton MA, Eichler EE, Kent-First M, Surti U, Livny M, SchwartzDC: High-resolution human genome structure by single-moleculeanalysis. Proc Natl Acad Sci U S A 2010, 107:10848–10853.

28. Dong Y, Xie M, Jiang Y, Xiao N, Du X, Zhang W, Tosser-Klopp G, Wang J,Yang S, Liang J, Chen W, Chen J, Zeng P, Hou Y, Bian C, Pan S, Li Y, Liu X,Wang W, Servin B, Sayre B, Zhu B, Sweeney D, Moore R, Nie W, Shen Y,Zhao R, Zhang G, Li J, Faraut T, et al: Sequencing and automatedwhole-genome optical mapping of the genome of a domestic goat(Capra hircus). Nat Biotechnol 2013, 31:135–141.

29. Levy-Sakin M, Ebenstein Y: Beyond sequencing: optical mapping of DNAin the age of nanotechnology and nanoscopy. Curr Opin Biotechnol 2013,24:690–698.

30. Das SK, Austin MD, Akana MC, Deshpande P, Cao H, Xiao M: Singlemolecule linear analysis of DNA in nano-channel labeled with sequencespecific fluorescent probes. Nucleic Acids Res 2010, 38:e177.

31. Lam ET, Hastie A, Lin C, Ehrlich D, Das SK, Austin MD, Deshpande P, Cao H,Nagarajan N, Xiao M, Kwok PY: Genome mapping on nanochannel arraysfor structural variation analysis and sequence assembly. Nat Biotechnol2012, 30:771–776.

32. Hastie AR, Dong L, Smith A, Finklestein J, Lam ET, Huo N, Cao H, Kwok PY,Deal KR, Dvorak J, Luo MC, Gu Y, Xiao M: Rapid genome mapping innanochannel arrays for highly complete and accurate de novo sequenceassembly of the complex Aegilops tauschii genome. PLoS One 2013, 8:e55864.

33. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, ZhangJ, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, YangZ, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, QinJ, et al: The diploid genome sequence of an Asian individual. Nature 2008,456:60–65.

34. Sneddon TP, Li P, Edmunds SC: GigaDB: announcing the GigaSciencedatabase. biotechnology 2012, 1:11.

35. Cao HZ, Hastie AR, Cao D, Lam ET, Sun Y, Huang H, Liu X, Lin L, Andrews W,Chan S, Huang S, Tong X, Requa M, Anantharaman T, Krogh A, Yang H, CaoH, Xu X: Supporting material for: "Rapid detection of structural variationin a Human genome using nanochannel-based genome mappingtechnology". GigaScience Database 2014, http://dx.doi.org/10.5524/100097.

36. Cao HZ, Wu H, Luo R, Huang S, Sun Y, Tong X, Xie Y, Liu B, Yang H, ZhengH, Li J, Li B, Wang Y, Yang F, Sun P, Liu S, Gao P, Huang H, Sun J, Chan D,Ha G, Huang W, Huang Z, Li Y, Tellier LCAM, Liu X, Feng Q, Xu X, Zhang X,et al: Supporting material for: De novo assembly of a haplotype-resolvedhuman genome. GigaScience Database 2014, http://dx.doi.org/10.5524/100096.

37. Valouev A, Schwartz DC, Zhou S, Waterman MS: An algorithm for assemblyof ordered restriction maps from single DNA molecules. Proc Nat Acad SciUSA 2006, 103:15770–15775.

38. Macdonald JR, Ziman R, Yuen RK, Feuk L, Scherer SW: The Database ofGenomic Variants: a curated collection of structural variation in thehuman genome. Nucleic Acids Res 2014, 42:D986–D992.

39. McKusick VA: Mendelian Inheritance in Man and its online version, OMIM.Am J Hum Genet 2007, 80:588–604.

40. Horton R, Gibson R, Coggill P, Miretti M, Allcock RJ, Almeida J, Forbes S,Gilbert JG, Halls K, Harrow JL, Hart E, Howe K, Jackson DK, Palmer S, RobertsAN, Sims S, Stewart CA, Traherne JA, Trevanion S, Wilming L, Rogers J, deJong PJ, Elliott JF, Sawcer S, Todd JA, Trowsdale J, Beck S: Variation analysisand gene annotation of eight MHC haplotypes: the MHC HaplotypeProject. Immunogenetics 2008, 60:1–18.

41. Tesch HMP, Krueger GR, Fischer R, Diehl V: Analysis of immunoglobulin, Tcell receptor and bcr rearrangements in human malignant lymphomaand Hodgkin's disease. Oncology 1990, 47:215–223.

42. Rajagopalan S, Long EO: Understanding how combinations of HLA andKIR genes influence disease. J Exp Med 2005, 201:1025–1029.

43. Gonzalez D, van der Burg M, Garcia-Sanz R, Fenton JA, Langerak AW,Gonzalez M, van Dongen JJ, San Miguel JF, Morgan GJ: Immunoglobulingene rearrangements and the pathogenesis of multiple myeloma. Blood2007, 110:3112–3121.

44. Beck S, Geraghty D, Inoko H, Rowen L, Aguado B, Bahram S, Campbell RD,Forbes SA, Guillaudeux T, Hood L: Complete sequence and gene map of ahuman major histocompatibility complex. Nature 1999, 401:921–923.

45. Vilches C, Parham P: KIR: diverse, rapidly evolving receptors of innate andadaptive immunity. Annu Rev Immunol 2002, 20:217–251.

46. Katzmann JA, Clark RJ, Abraham RS, Bryant S, Lymp JF, Bradwell AR, Kyle RA:Serum reference intervals and diagnostic ranges for free kappa and freelambda immunoglobulin light chains: relative sensitivity for detection ofmonoclonal light chains. Clin Chem 2002, 48:1437–1444.

47. Tomlinson IM, Cook GP, Walter G, Carter NP, Riethman H, Buluwela L,Rabbitts TH, Winter G: A complete map of the human immunoglobulinVH locus. Ann N Y Acad Sci 1995, 764:43–46.

48. Haynes MR, Wu GE: Gene discovery at the human T-cell receptor alpha/delta locus. Immunogenetics 2007, 59:109–121.

49. Li WZX, Lee NP, Liu X, Chen S, Guo B, Yi S, Zhuang X, Chen F, Wang G,Poon RT, Fan ST, Mao M, Li Y, Li S, Wang J, Jianwang, Xu X, Jiang H, ZhangX: HIVID: an efficient method to detect HBV integration using lowcoverage sequencing. Genomics 2013, 102:338–344.

50. Kan Z, Zheng H, Liu X, Li S, Barber TD, Gong Z, Gao H, Hao K, Willard MD,Xu J, Hauptschein R, Rejto PA, Fernandez J, Wang G, Zhang Q, Wang B,Chen R, Wang J, Lee NP, Zhou W, Lin Z, Peng Z, Yi K, Chen S, Li L, Fan X,

Page 71: MHC region and its related disease study

Cao et al. GigaScience 2014, 3:34 Page 11 of 11http://www.gigasciencejournal.com/content/3/1/34

Yang J, Ye R, Ju J, Wang K, et al: Whole-genome sequencing identifiesrecurrent mutations in hepatocellular carcinoma. Genome Res 2013,23:1422–1433.

51. Zhao G, Krishnamurthy S, Cai Z, Popov VL, da Rosa Travassos AP, Guzman H,Cao S, Virgin HW, Tesh RB, Wang D: Identification of Novel Viruses UsingVirusHunter – an Automated Data Analysis Pipeline. PLoS One 2013,8:e78470.

52. Reisinger J, Rumpler S, Lion T, Ambros PF: Visualization of episomal andintegrated Epstein-Barr virus DNA by fiber fluorescence in situhybridization. Int J Cancer 2006, 118:1603–1608.

53. Anantharaman T, Mishra B: A probabilistic analysis of false positives inoptical map alignment and validation. In Proc. of WABI; 2001:27–40.

54. Harris RS: Improved pairwise alignment of genomic DNA; 2007.55. Sneddon TP, Zhe XS, Edmunds SC, Li P, Goodman L, Hunter CI: GigaDB:

promoting data dissemination and reproducibility. Database 2014,2014:bau018.

doi:10.1186/2047-217X-3-34Cite this article as: Cao et al.: Rapid detection of structural variation in ahuman genome using nanochannel-based genome mapping technology.GigaScience 2014 3:34.

Submit your next manuscript to BioMed Centraland take full advantage of:

• Convenient online submission

• Thorough peer review

• No space constraints or color figure charges

• Immediate publication on acceptance

• Inclusion in PubMed, CAS, Scopus and Google Scholar

• Research which is freely available for redistribution

Submit your manuscript at www.biomedcentral.com/submit