Automatic Identification of Antibodies in the Protein Data Bank


Chinese Journal of Chemistry, 2009, 27, 23–28 Communication

* E-mail: [email protected]; Tel.: 0086-021-54925128

Received June 19, 2008; revised and accepted October 9, 2008.

Project supported by the National Natural Science Foundation of China (No. 20502031) and the Chinese Ministry of Science and Technology (the 863 Project, No. 2006AA02Z337).

† Dedicated to Professor Qingyun Chen on the occasion of his 80th birthday.

© 2009 SIOC, CAS, Shanghai, & WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

Automatic Identification of Antibodies in the Protein Data Bank†

LI, Xun(李勋) WANG, Renxiao*(王任小)

State Key Laboratory of Bioorganic Chemistry, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, Shanghai 200032, China

An automatic method has been developed for identifying antibody entries in the Protein Data Bank (PDB). Our method, called KIAb (Keyword-based Identification of Antibodies), parses PDB-format files to search for particular keywords relevant to antibodies and classifies each entry accordingly. Applied to the entire PDB, our method identified 780 entries as antibodies. Among them, 767 entries were confirmed by manual inspection, indicating a high success rate of 98.3%. Our method recovered essentially all of the entries compiled in the Summary of Antibody Crystal Structures (SACS) database. It also identified a number of entries missed by SACS. Our method thus provides a more complete mining of antibody entries in PDB with a very low false positive rate.

Keywords antibody database, keyword-based searching, Keyword-based Identification of Antibodies (KIAb), Protein Data Bank (PDB)

Introduction

Antibodies, also called immunoglobulins, are produced by B cells in living organisms. They play a critical role in responding to exogenous stimuli introduced by antigens or immunogens. An in-depth understanding of the unique structures and biological functions of antibodies has many significant applications, such as antibody engineering,1-3 development of antibody-based therapies4-6 and catalytic antibodies.7-9

A compilation of known three-dimensional structures of antibodies will be essential for this purpose. At present, the Protein Data Bank (PDB)10 provides the primary source of experimentally determined three-dimensional structures of biological macromolecules. At the time this article was drafted, over 44000 entries were already available in PDB. Currently, PDB classifies all of its entries roughly as proteins, nucleic acids, and protein-nucleic acid complexes. It does not, however, provide a dedicated compilation of its antibody entries. While it is possible to retrieve from PDB the entries relevant to a specific type of protein, for example by using its name or EC number, this approach does not apply to antibodies because no systematic nomenclature currently exists for them. Simply using keywords like “antibody” or “immunoglobulin” as search queries returns far too many false hits and is not a useful solution.

To the best of our knowledge, the Summary of Antibody Crystal Structures (SACS) database,11 maintained at University College London in England, is the only actively updated compilation of the antibody entries in PDB at present. However, the method used by SACS for selecting antibody entries has not been described clearly in the literature, which leaves it unclear how SACS is compiled and whether it provides a complete compilation of the antibody entries in PDB. We were thus motivated to develop a new method for identifying the antibody entries in PDB. Our method, called KIAb (Keyword-based Identification of Antibodies), uses a simple strategy that can be easily reproduced by other researchers. It is also fully automatic, so the entire PDB can be screened in a short time. Details of our method and its results are given in the following sections.

Methods

Our method classifies a given PDB entry by analyzing the contents of its structural file in the PDB format. The PDB format12 has been used by the Protein Data Bank since its advent in the 1970s. Each entry in PDB is nowadays also provided in the mmCIF format and the PDBML/XML format. Those two formats, however, are optimized for certain computer applications, whereas the PDB format is friendlier to human readers. The PDB format uses a series of records to store the descriptions and structural information of macromolecules as well as other miscellaneous components. Each record begins with a specific keyword that defines its contents. By the current definition of the PDB format, however, no particular record is designated to give an unambiguous classification of the presented macromolecule by its biological origin. Thus, a straightforward judgment based on a single record will not identify antibody entries satisfactorily. To solve this problem, our method analyzes the contents of multiple relevant records, including “HEADER”, “COMPND”, “KEYWDS”, and “SEQRES”, to judge whether the given entry is an antibody.
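As an illustration of the record-based layout described above, the following Python sketch groups the lines of a PDB-format file by the record name found in columns 1 to 6. It is not the authors' ANSI C program; the function name `collect_records` and the sample fragment are assumptions made for this example only.

```python
from collections import defaultdict

# Record names the method inspects, as listed in the text above.
RELEVANT_RECORDS = ("HEADER", "COMPND", "KEYWDS", "SEQRES")


def collect_records(pdb_text):
    """Group lines of a PDB-format file by record name (columns 1-6)."""
    records = defaultdict(list)
    for line in pdb_text.splitlines():
        name = line[:6].strip()
        if name in RELEVANT_RECORDS:
            records[name].append(line)
    return records


# Illustrative fragment: the COMPND lines follow Table 1 (entry 1ADQ);
# the REMARK line is a made-up placeholder showing that other records are skipped.
SAMPLE = "\n".join([
    "COMPND    MOL_ID: 1;",
    "COMPND   2 MOLECULE: IGG4 REA;",
    "COMPND   3 CHAIN: A;",
    "REMARK   1 PLACEHOLDER LINE, NOT FROM A REAL ENTRY",
])

records = collect_records(SAMPLE)
print({name: len(lines) for name, lines in records.items()})
```

Keeping the raw lines, rather than pre-parsed fields, lets each later judgment read the columns it needs from its own record type.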

Our method is implemented as a computer program written in ANSI C. A flowchart of our method is given in Figure 1. The details are given below.

Figure 1 Flowchart of the KIAb algorithm.

Judgment I As the first step, the HEADER record is analyzed. This one-line record provides a concise description of the given structure regarding its molecular origin, biological function, cellular location and so on. The given entry will be considered an entry of interest if any of the following conditions is met.

The HEADER record contains the term “ANTIBODY” or “ANTIBODIES”.

The HEADER record is exactly “IMMUNOGLOBULIN”. This requirement is necessary to exclude immunoglobulin-related non-antibody proteins, such as immunoglobulin receptors or immunoglobulin-binding proteins.

The HEADER record contains the terms “COMPLEX” and “/IMMUNOGLOBULIN” (or “IMMUNOGLOBULIN/”), which means the given structure is a complexed immunoglobulin.
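The three HEADER conditions can be sketched in Python as follows. This is an illustrative reading, not the authors' ANSI C code: it assumes that "the HEADER record" refers to the classification field in columns 11 to 50 of the PDB format and that "contains" means plain substring matching. The helper `make_header` and the example classifications are hypothetical and exist only to build test lines.

```python
def header_classification(header_line):
    """Return the classification field (columns 11-50) of a HEADER record."""
    return header_line[10:50].strip().upper()


def judgment_i(header_line):
    """Apply the three HEADER conditions of Judgment I."""
    text = header_classification(header_line)
    if "ANTIBODY" in text or "ANTIBODIES" in text:
        return True
    if text == "IMMUNOGLOBULIN":  # exact match only, to skip receptors etc.
        return True
    if "COMPLEX" in text and ("/IMMUNOGLOBULIN" in text or "IMMUNOGLOBULIN/" in text):
        return True
    return False


def make_header(classification):
    """Hypothetical helper: build a HEADER line with placeholder date and ID."""
    return "HEADER    " + classification.ljust(40) + "01-JAN-00   XXXX"


for example in ("IMMUNOGLOBULIN",
                "IMMUNOGLOBULIN RECEPTOR",
                "COMPLEX (HORMONE/IMMUNOGLOBULIN)",
                "HYDROLASE"):
    print(example, "->", judgment_i(make_header(example)))
```

The exact-match test in the second condition is what keeps a classification such as "IMMUNOGLOBULIN RECEPTOR" from being accepted at this step.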

Judgment II Very limited information is given in the HEADER record. Thus, the KEYWDS records are also analyzed. A given entry will be considered an entry of interest if the KEYWDS records contain the term “ANTIBODY” or “ANTIBODIES”. Note that the term “IMMUNOGLOBULIN” is not used here as a valid query since it would return many false hits. For example, the terms included in the KEYWDS records of PDB entry 1KSR are “ACTIN BINDING PROTEIN”, “STRUCTURE”, “IMMUNOGLOBULIN”, “GELATION FACTOR”, and “ABP-120”. In fact, this protein is not an immunoglobulin but an actin-binding protein with an immunoglobulin fold.

Judgment III More details of the given structure can be found in the COMPND records of a PDB-format file. Two major styles of COMPND records can be observed in PDB-format files (Table 1). The first style presents each macromolecule in the given entry as a “component”. Each component is characterized by a set of “token: value” descriptions, specifying molecule name, synonyms, EC number, and other relevant information. In this case, a given entry will be considered an entry of interest if any of the following conditions is met.

Table 1 Examples of the COMPND records in PDB-format files

Style I PDB entry 1ADQ

COMPND MOL_ID: 1;

COMPND 2 MOLECULE: IGG4 REA;

COMPND 3 CHAIN: A;

COMPND 4 FRAGMENT: FC;

COMPND 5 BIOLOGICAL_UNIT: DIMER;

COMPND 6 MOL_ID: 2;

COMPND 7 MOLECULE: RF-AN IGM/LAMBDA;

COMPND 8 CHAIN: H, L;

COMPND 9 FRAGMENT: FAB;

COMPND 10 BIOLOGICAL_UNIT: MONOMER

Style II PDB entry 1ACY

COMPND IGG1 FAB FRAGMENT (59.1) COMPLEXED WITH HIV-1 GP120

COMPND 2 (MN ISOLATE) FRAGMENT (RESIDUES 308-332)

The “MOLECULE” token, which gives the name of the component, contains any of “ANTIBODY”, “#FAB#” (fragment antigen binding), “#MAB#” (monoclonal antibody), and “BENCE” plus “JONES” (Bence-Jones proteins, i.e. immunoglobulin free light chains). Here, the “#” sign denotes an optional prefix or suffix.

The “MOLECULE” token contains “IMMUNOGLOBULIN”, “#IG#” (immunoglobulin), or any of “#IGG#”, “#IGA#”, “#IGM#”, “#IGD#” and “#IGE#” (the five classes of human immunoglobulins), but none of “RECEPTOR”, “BINDING PROTEIN”, “IMMUNOGLOBULIN-LIKE”, and “IG-LIKE”, in order to exclude immunoglobulin-related non-antibody proteins.

The “MOLECULE” token contains “#FC#” (fragment crystallizable) but not “RECEPTOR”, in order to exclude Fc receptors.

The “FRAGMENT” token, which specifies a domain or fragment of a macromolecule, contains any of “ANTIBODY”, “#FAB#” and “#FC#”.
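The Style I conditions above can be sketched in Python as follows, using the 1ADQ records from Table 1 as input. Several points are assumptions of this example rather than statements from the authors: the “#” sign is read as "the keyword is not embedded in a longer run of letters" (so that “IGG4” and “IGM/LAMBDA” match but a word such as “LIGAND” would not match “IG”), the description text is taken from column 11 onward, and all function names are hypothetical. Matching chains are collected from the “CHAIN” token of each accepted component.

```python
import re

ANTIBODY_TERMS = ("FAB", "MAB")
IG_TERMS = ("IG", "IGG", "IGA", "IGM", "IGD", "IGE")
EXCLUSIONS = ("RECEPTOR", "BINDING PROTEIN", "IMMUNOGLOBULIN-LIKE", "IG-LIKE")


def has_token(text, keyword):
    """Assumed reading of '#KEY#': KEY not embedded in a longer run of letters."""
    pattern = r"(?<![A-Z])" + re.escape(keyword) + r"(?![A-Z])"
    return re.search(pattern, text) is not None


def parse_components(compnd_lines):
    """Split Style I COMPND records into one token->value dict per MOL_ID."""
    text = " ".join(line[10:].strip() for line in compnd_lines)
    components = []
    for spec in text.split(";"):
        token, sep, value = spec.partition(":")
        if not sep:
            continue
        token, value = token.strip().upper(), value.strip().upper()
        if token == "MOL_ID" or not components:
            components.append({})
        components[-1][token] = value
    return components


def molecule_matches(name):
    """Conditions on the MOLECULE token."""
    if "ANTIBODY" in name or any(has_token(name, k) for k in ANTIBODY_TERMS):
        return True
    if "BENCE" in name and "JONES" in name:
        return True
    if "IMMUNOGLOBULIN" in name or any(has_token(name, k) for k in IG_TERMS):
        if not any(term in name for term in EXCLUSIONS):
            return True
    return has_token(name, "FC") and "RECEPTOR" not in name


def fragment_matches(fragment):
    """Condition on the FRAGMENT token."""
    return ("ANTIBODY" in fragment or has_token(fragment, "FAB")
            or has_token(fragment, "FC"))


def antibody_chains(compnd_lines):
    """Return chain identifiers of all components that meet any condition."""
    chains = []
    for comp in parse_components(compnd_lines):
        if (molecule_matches(comp.get("MOLECULE", ""))
                or fragment_matches(comp.get("FRAGMENT", ""))):
            chains += [c.strip() for c in comp.get("CHAIN", "").split(",") if c.strip()]
    return chains


# COMPND records of PDB entry 1ADQ as shown in Table 1.
COMPND_1ADQ = [
    "COMPND    MOL_ID: 1;",
    "COMPND   2 MOLECULE: IGG4 REA;",
    "COMPND   3 CHAIN: A;",
    "COMPND   4 FRAGMENT: FC;",
    "COMPND   5 BIOLOGICAL_UNIT: DIMER;",
    "COMPND   6 MOL_ID: 2;",
    "COMPND   7 MOLECULE: RF-AN IGM/LAMBDA;",
    "COMPND   8 CHAIN: H, L;",
    "COMPND   9 FRAGMENT: FAB;",
    "COMPND  10 BIOLOGICAL_UNIT: MONOMER",
]

print(antibody_chains(COMPND_1ADQ))
```

Under these assumptions both components of 1ADQ are accepted, so chains A, H and L would be passed on to the later size check.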

If the given entry is identified as an entry of interest, the chain identifiers following the “CHAIN” token in this component will be recorded. This procedure is repeated until all components in the COMPND records have been examined.

The second style of COMPND records does not provide “token: value” descriptions but free-style descriptions instead (Table 1). In this case, our method will consider a given entry an entry of interest if any of the following conditions is met.

The COMPND record contains any of “ANTIBODY”, “#FAB#”, “#MAB#”, and “BENCE” plus “JONES”.

The COMPND record contains any of “IMMUNOGLOBULIN”, “#IG#”, “#IGG#”, “#IGA#”, “#IGM#”, “#IGD#” and “#IGE#”, and does not contain any of “RECEPTOR”, “BINDING PROTEIN”, “IMMUNOGLOBULIN-LIKE”, and “IG-LIKE”.

The COMPND record contains “#FC#” but not “RECEPTOR”.
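For the free-style COMPND records, a Python sketch might first decide which style is present and then apply the three conditions to the joined text. The style test used here (looking for “MOL_ID:” or “MOLECULE:”) is a heuristic assumed for this example, the “#” sign is again read as "not embedded in a longer run of letters", and the function names are hypothetical. The input is the 1ACY example from Table 1.

```python
import re

IG_TERMS = ("IG", "IGG", "IGA", "IGM", "IGD", "IGE")
EXCLUSIONS = ("RECEPTOR", "BINDING PROTEIN", "IMMUNOGLOBULIN-LIKE", "IG-LIKE")


def has_token(text, keyword):
    """Assumed reading of '#KEY#': KEY not embedded in a longer run of letters."""
    pattern = r"(?<![A-Z])" + re.escape(keyword) + r"(?![A-Z])"
    return re.search(pattern, text) is not None


def compnd_text(compnd_lines):
    """Join the description part (column 11 onward) of all COMPND lines."""
    return " ".join(line[10:].strip() for line in compnd_lines).upper()


def is_style_i(text):
    """Heuristic (an assumption): Style I carries 'MOL_ID:' or 'MOLECULE:'."""
    return "MOL_ID:" in text or "MOLECULE:" in text


def style_ii_matches(text):
    """Apply the three free-text conditions to the joined COMPND text."""
    if "ANTIBODY" in text or has_token(text, "FAB") or has_token(text, "MAB"):
        return True
    if "BENCE" in text and "JONES" in text:
        return True
    if "IMMUNOGLOBULIN" in text or any(has_token(text, k) for k in IG_TERMS):
        if not any(term in text for term in EXCLUSIONS):
            return True
    return has_token(text, "FC") and "RECEPTOR" not in text


# COMPND records of PDB entry 1ACY as shown in Table 1.
COMPND_1ACY = [
    "COMPND    IGG1 FAB FRAGMENT (59.1) COMPLEXED WITH HIV-1 GP120",
    "COMPND   2 (MN ISOLATE) FRAGMENT (RESIDUES 308-332)",
]

text_1acy = compnd_text(COMPND_1ACY)
print(is_style_i(text_1acy), style_ii_matches(text_1acy))
```

Because Style II gives no “CHAIN” token, this sketch returns only a yes/no answer and leaves the choice of chains to the subsequent step.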

Structural completeness check

As implied in the Introduction, the ultimate purpose of compiling the antibody entries in PDB is to study the unique three-dimensional structures and biological functions of antibodies. A complete antibody molecule is fairly complicated. Figure 2 gives a schematic illustration of immunoglobulin G (IgG), a representative class of antibody. A complete IgG molecule consists of four peptide chains: two identical light chains and two identical heavy chains. Each chain can be further divided into two regions: the variable region (VH and VL) and the constant region (CH and CL). Each region has one or more peptide loops formed via disulfide bonds, each containing about 60–70 amino acid residues. Cleaving IgG with papain produces three fragments: two antigen-binding fragments (Fab) and one crystallizable fragment (Fc).

Figure 2 Structure of immunoglobulin G (C=constant domain; V=variable domain; H=heavy chain; L=light chain; Fab=antigen-binding fragment; Fc=crystallizable fragment).

Due to the structural complexity of antibodies, a PDB entry normally does not present a complete antibody molecule. In most cases, a particular domain of an antibody molecule, such as Fab, is presented. Such entries are considered “antibodies” in our study. In some other cases, only a segment of a particular antibody domain is presented. Oligopeptides are not of interest to us since they may not form and maintain stable secondary or tertiary structures. Therefore, our method uses a size cutoff of 60 amino acid residues to determine whether a given entry contains a reasonably complete segment of an antibody domain. If a given entry is suggested by Judgments I to III to be relevant to antibodies, our method will examine the SEQRES records to determine the length of each peptide chain. If at least one peptide chain that is part of an antibody molecule consists of at least 60 amino acid residues, the given entry will be accepted by our method as a valid antibody.
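The size check can be sketched in Python as follows. The sketch assumes the standard SEQRES column layout (chain identifier in column 12, residue count of the chain in columns 14 to 17) and reads the chain length from that count; the text does not say how the authors' program derives the length. The helper `make_seqres` and the sample chains are hypothetical and serve only to build correctly aligned test lines.

```python
SIZE_CUTOFF = 60  # residues, the cutoff stated in the text


def make_seqres(serial, chain, num_res, residues):
    """Hypothetical helper: lay out one SEQRES line in the standard columns."""
    return f"SEQRES {serial:>3} {chain} {num_res:>4}  {' '.join(residues)}"


def chain_lengths(seqres_lines):
    """Map each chain identifier to the residue count given in columns 14-17."""
    lengths = {}
    for line in seqres_lines:
        lengths[line[11]] = int(line[13:17])
    return lengths


def passes_size_check(seqres_lines, antibody_chains, cutoff=SIZE_CUTOFF):
    """True if at least one antibody chain has at least `cutoff` residues."""
    lengths = chain_lengths(seqres_lines)
    return any(lengths.get(chain, 0) >= cutoff for chain in antibody_chains)


# Made-up example: a 120-residue chain H and a 12-residue peptide P
# (only the first few residue names are written out).
seqres = [
    make_seqres(1, "H", 120, ["GLU", "VAL", "GLN", "LEU"]),
    make_seqres(1, "P", 12, ["ARG", "LYS", "ARG", "ILE"]),
]

print(chain_lengths(seqres))
print(passes_size_check(seqres, ["H"]), passes_size_check(seqres, ["P"]))
```

An entry whose only antibody-related chain is a short peptide would therefore be rejected even if it passed Judgments I to III.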

Results and discussion

We downloaded the entire Protein Data Bank release of November 21st, 2006 from ftp://ftp.rcsb.org/, which consists of a total of 40261 entries. Our method was applied to all of these entries and identified 780 of them as antibodies. In order to verify the results, we manually inspected the structures of all 780 entries. A total of 767 entries were confirmed to be true antibodies, yielding a high success rate of 98.3%. All of them are listed in Table 2. The remaining 13 entries were found to be miscellaneous types of proteins rather than antibodies, i.e. false positives. They all contain antibody-related keywords that meet the conditions set in our method. This indicates that there is still room for refining our keyword-based method to achieve an even higher success rate.

The results of our method were then compared with the SACS database. The latest update of SACS, a compilation of 786 entries, is based on exactly the same release of PDB as the one used in our study. A total of 733 entries among our 780 hits are also included in the SACS database. All of them are confirmed to be true antibodies. A total of 53 entries in SACS are not among our 780 hits. Among them, however, 50 entries are theoretical models, which were not treated as valid inputs in our study. When our method was applied to these entries, all of them were correctly identified as antibodies. As for the other three entries (1DVF, 1NIZ, and 1NJ0), entry 1DVF is an idiotopic antibody D1.3 Fv fragment (fragment of variable domain) in complex with an anti-idiotopic antibody E5.2 Fv fragment. It was missed by our method and thus is a genuine false negative. Entries 1NIZ and 1NJ0 are V3 peptide fragments of the exterior membrane glycoprotein. Evidently, they should not be classified as antibodies.

A total of 47 entries among our 780 hits are not included in SACS. Thirty-four of them are confirmed to be valid antibodies, namely 2A6I, 2A6J, 2A6K, 2ATY, 2BX5, 2FAT, 2GJ7, 1H3X, 2HFF, 2HFG, 1I1C, 2I5Y, 2I60, 2IH1, 2IH3, 1IIS, 1IIX, 1KXQ, 1KXT, 1KYO, 1MEX, 1NAK, 1PKQ, 1R70, 1RI8, 1RJC, 1VPO, 1WZ1, 1XFP, 1ZA6, 1ZV5, 1ZVH, 1ZVY and 1JV5. Our first thought was that these entries might


Table 2 The 767 PDB entries identified as antibodies by our KIAb method

PDB code

32C2 12E8 43C9 43CA 15C8 25C8 35C8 1A0Q 1A14 2A1W

1A2Y 1A3L 1A3R 1A4J 1A4K 1A5F 1A6T 1A6U 1A6V 1A6W

2A6D 1A7N 1A7O 1A7P 1A7Q 1A7R 2A77 1A8J 2A9M 2A9N

2AAB 1ACY 1AD0 1AD9 1ADQ 2ADF 2ADG 2ADI 2ADJ 1AE6

2AEP 2AEQ 1AFV 2AGJ 1AHW 1AI1 1AIF 2AI0 1AJ7 2AJ3

2AJS 2AJU 2AJV 2AJX 2AJY 2AJZ 2AK1 1AP2 2AP2 1AQK

1AR1 1AR2 2ARJ 2ATK 1AXS 1AXT 1AY1 1B0W 2B0S 2B1A

2B1H 1B2W 2B2X 1B4J 2B4C 1B6D 1BAF 1BBD 1BBJ 2BDN

1BEY 1BFO 1BFV 2BFV 1BGX 1BJ1 1BJM 2BJM 3BJL 4BJL

1BLN 1BM3 2BMK 1BOG 2BOB 2BOC 1BQL 1BRE 2BRR 2BSE

1BVK 1BVL 1BWW 1BZ7 1BZQ 1C08 1C12 1C1E 2C1O 2C1P

1C5B 1C5C 1C5D 1CBV 1CD0 2CD0 1CE1 1CF8 1CFN 1CFQ

1CFS 1CFT 1CFV 1CGS 2CGR 1CIC 2CJU 2CJV 1CK0 2CK0

3CK0 1CL7 1CLO 1CLY 1CLZ 1CQK 1CR9 1CT8 1CU4 1CZ8

2D03 1D5B 1D5I 1D6V 1DBA 1DBB 1DBJ 1DBK 1DBM 2DBL

1DCL 2DD8 2DDQ 1DEE 1DFB 1DL7 1DLF 2DLF 1DN0 1DN2

1DQD 1DQJ 1DQL 1DQM 1DQQ 2DQT 2DQU 1DSF 2DTG 1DZB

1E4K 1E4W 1E4X 1E6J 1E6O 1EAP 1EEQ 1EEU 1EFQ 1EGJ

1EHL 1EJO 1EK3 1EMT 1EO8 2ESG 1ETZ 2EXW 2EXY 1EZV

2EZ0 1F11 2F19 1F2X 1F3D 1F3R 1F4W 1F4X 1F4Y 1F58

2F58 2F5A 2F5B 3F58 1F6A 1F6L 1F8T 1F90 1FAI 4FAB

6FAB 7FAB 8FAB 1FBI 2FB4 2FBJ 1FC1 1FC2 1FCC 3FCT

1FDL 2FD6 1FE8 2FEC 2FED 2FEE 1FGN 1FGV 2FGW 1FH5

1FIG 1FJ1 2FJF 2FJG 2FJH 1FL3 1FL5 1FL6 1FLR 2FL5

1FN4 1FNS 1FOR 1FP5 1FPT 1FRG 1FRT 3FRU 1FSK 1FVC

1FVD 1FVE 2G5B 1G6V 2G60 1G7H 1G7I 1G7J 1G7L 1G7M

2G75 1G84 1G9M 1G9N 1GAF 1GC1 2GFB 1GGB 1GGC 1GGI

1GHF 2GHW 1GIG 2GJJ 2GKI 1GPO 2GSI 1H0D 2H1P 2H2P

2H2S 1H3P 1H3T 1H3U 1H3V 1H3W 1H3Y 1H8N 1H8O 1H8S

2H8P 2H9G 1HCV 1HEZ 2HFE 3HFM 2HG5 1HH6 1HH9 1HI6

1HIL 1HIM 1HIN 1HKL 2HLF 2HMI 1HQ4 2HRP 2HT2 2HT3

2HT4 2HTK 2HTL 1HYS 1HZH 1I1A 1I3G 1I3U 1I3V 1I7Z

1I8I 1I8K 1I8M 1I9I 1I9J 1I9R 1IAI 1IBG 1IC4 1IC5

1IC7 1IEH 1IFH 2IFF 1IGA 1IGC 1IGF 1IGI 1IGJ 1IGM

1IGT 1IGY 2IG2 2IGF 1IKF 1IL1 2IMM 2IMN 1IND 1INE

1IQD 1IQW 1IT9 1IVL 1J05 1J1O 1J1P 1J1X 1J5O 2JEL

1JFQ 1JGL 1JGU 1JGV 1JHK 1JHL 1JN6 1JNH 1JNL 1JNN

1JP5 1JPS 1JPT 1JRH 1JTO 1JTP 1JTT 1JVK 1K4C 1K4D

1K6Q 1KB5 1KC5 1KCR 1KCS 1KCU 1KCV 1KEG 1KEL 1KEM

1KEN 1KFA 1KIP 1KIQ 1KIR 1KN2 1KN4 1KNO 1KTR 1L6X

1L7I 1L7T 1LGV 1LHZ 1LIL 1LK3 1LMK 1LO0 1LO2 1LO3

1LO4 1LVE 2LVE 3LVE 4LVE 5LVE 1M71 1M7D 1M7I 1MAJ

1MAK 1MAM 1MCB 1MCC 1MCD 1MCE 1MCF 1MCH 1MCI 1MCJ

1MCK 1MCL 1MCN 1MCO 1MCP 1MCQ 1MCR 1MCS 1MCW 2MCG

2MCP 3MCG 1MEL 1MF2 1MFA 1MFB 1MFC 1MFD 1MFE 1MH5

1MHH 1MHP 1MIE 1MIM 1MJ7 1MJ8 1MJJ 1MJU 1MLB 1MLC

1MNU 1MOE 1MPA 2MPA 1MQK 1MRC 1MRD 1MRE 1MRF 1MVF


1MVU 1N0X 1N4X 1N5Y 1N64 1N6Q 1N7M 1N8Z 1NBV 1NBY

1NBZ 1NC2 1NC4 1NCA 1NCB 1NCC 1NCD 1NCW 1ND0 1NDG

1NDM 1NFD 1NGP 1NGQ 1NGW 1NGX 1NGY 1NGZ 1NJ9 1NL0

1NLB 1NLD 1NMA 1NMB 1NMC 1NQB 1NSN 1NTL 1O0V 1OAK

1OAQ 1OAR 1OAU 1OAX 1OAY 1OAZ 1OB1 1OCW 1OHQ 1OL0

1OM3 1OP3 1OP5 1OP9 1OPG 1OQO 1OQX 1ORQ 1ORS 1OSP

1OTS 1OTT 1OTU 1OW0 1P2C 1P4B 1P4I 1P7K 2PCP 1PEW

1PFC 1PG7 1PLG 1PSK 1PW3 1PZ5 1Q0X 1Q0Y 1Q1J 1Q72

1Q9K 1Q9L 1Q9O 1Q9Q 1Q9R 1Q9T 1Q9V 1Q9W 1QAC 1QBL

1QBM 1QD0 1QFU 1QFW 1QGC 1QKZ 1QLE 1QLR 1QNZ 1QOK

1QP1 1QYG 1R0A 1R24 1R3I 1R3J 1R3K 1R3L 2RCS 1REI

1RFD 1RHH 2RHE 1RIH 1RIU 1RIV 1RJL 1RMF 1RU9 1RUA

1RUK 1RUL 1RUM 1RUP 1RUQ 1RUR 1RVF 1RZ7 1RZ8 1RZF

1RZG 1RZI 1RZJ 1RZK 1S3K 1S5H 1S5I 1S78 1SBS 1SEQ

1SHM 1SJV 1SJX 1SM3 1SVZ 1SY6 1T03 1T04 1T2Q 1T3F

1T4K 1T66 1T83 1T89 1TET 1TJG 1TJH 1TJI 1TPX 1TQB

1TQC 1TXV 1TY3 1TY5 1TY6 1TY7 1TZG 1TZH 1TZI 1U0Q

1U6A 1U8H 1U8I 1U8J 1U8K 1U8L 1U8M 1U8N 1U8O 1U8P

1U8Q 1U91 1U92 1U93 1U95 1UA6 1UAC 1UB5 1UB6 1UCB

1UJ3 1UM4 1UM5 1UM6 1UWE 1UWG 1UWX 1UYW 1UZ6 1UZ8

1V7M 1V7N 1VFA 1VFB 1VGE 1VHP 2VIR 2VIS 2VIT 1W72

1WC7 1WCB 1WEJ 1WT5 1WTL 1X9Q 1XCQ 1XCT 1XF2 1XF3

1XF4 1XF5 1XGP 1XGQ 1XGY 1XIW 1Y0L 1Y18 1YC7 1YC8

1YEC 1YED 1YEE 1YEF 1YEG 1YEH 1YEI 1YEJ 1YEK 1YJD

1YMH 1YNK 1YNL 1YNT 1YQV 1YUH 1YY8 1YY9 1YYL 1YYM

1YZZ 1Z3G 1ZA3 1ZAN 1ZEA 1ZLS 1ZLU 1ZLV 1ZLW 1ZMY

1ZTX 1ZVO 1ZWI 2A6I 2A6J 2A6K 2ATY 2BX5 2FAT 2GJ7

1H3X 2HFF 2HFG 1I1C 2I5Y 2I60 2IH1 2IH3 1IIS 1IIX

1KXQ 1KXT 1KYO 1MEX 1NAK 1PKQ 1R70 1RI8 1RJC 1VPO

1WZ1 1XFP 1ZA6 1ZV5 1ZVH 1ZVY 1JV5

a These entries were selected from the Protein Data Bank release of Nov. 21st, 2006, which contains a total of 40261 entries with experimentally determined structures.

belong to a particular group which did not meet the definition of “antibodies” used by SACS. After careful inspection, we found that these 34 entries did not exhibit any common features with respect to chain length, sequence, molecular structure, biological source, resolution and so on. It appears that they were simply missed by SACS for unknown reasons. The other 13 entries are not valid antibodies. The false positive rate of our method is thus 13/780=1.7%.
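The counts reported in this section fit together as a simple bookkeeping exercise, which the following short Python check makes explicit. All numbers are taken from the text; only the variable names are introduced here.

```python
# Counts reported in the text.
hits = 780             # entries identified by KIAb
confirmed = 767        # hits confirmed as antibodies by manual inspection
sacs_total = 786       # entries compiled in SACS
shared = 733           # hits that are also in SACS
new_valid = 34         # valid antibodies found by KIAb but absent from SACS
theoretical = 50       # SACS-only entries that are theoretical models

false_positives = hits - confirmed       # hits that are not antibodies
hits_not_in_sacs = hits - shared         # KIAb hits absent from SACS
sacs_not_in_hits = sacs_total - shared   # SACS entries absent from the hits

success_rate = round(100 * confirmed / hits, 1)
false_positive_rate = round(100 * false_positives / hits, 1)

print(false_positives, hits_not_in_sacs, sacs_not_in_hits)
print(success_rate, false_positive_rate)
```

The 47 hits outside SACS split into the 34 valid antibodies and the 13 false positives, and the 53 SACS entries outside the hits split into the 50 theoretical models and the three entries discussed individually.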

Besides a high level of accuracy, our method is also fully automatic. This feature is important since it eliminates human interference and guarantees consistency in the final results. In addition, PDB is a fairly large database and is still growing constantly. One needs to screen the latest release of PDB from time to time to provide an up-to-date compilation of its antibody entries. Because it adopts a simple keyword-based strategy, our method is fast. In fact, it processed the entire PDB release used in our study (>40000 entries) in less than an hour on an entry-level desktop PC (single Intel Pentium D CPU at 3.00 GHz; 1 GB memory). This level of efficiency makes it convenient to refine our method and re-apply it to PDB when necessary.

Conclusion

A complete compilation of antibody entries in PDB may be useful for exploring the relationship between the three-dimensional structures and biological functions of various antibodies. We have developed an automatic method, named KIAb, for identifying antibody entries in PDB. It adopts a keyword-based strategy, which can be easily reproduced by anyone with the necessary programming skills. Although simple, our method is very effective. As tested on the entire PDB, it recovers essentially all antibody entries included in the SACS database. It also identifies a number of antibody entries missed by SACS. Our method thus provides a more complete mining of antibody entries in PDB with a very low false positive rate.

Acknowledgements

The authors are grateful for the technical assistance provided by Yuan Zhao, Tiejun Cheng, Chunni Lu, and Weiqi Zhang.

References

1 Hayden, M. S.; Gilliland, L. K.; Ledbetter, J. A. Curr. Opin. Immunol. 1997, 9, 201.

2 Presta, L. Curr. Opin. Struct. Biol. 2003, 13, 519.

3 Kim, S. J.; Park, Y.; Hong, H. J. Mol. Cells 2005, 20, 17.

4 Casadevall, A.; Scharff, M. D. Clin. Infect. Dis. 1995, 21, 150.

5 Scott, A. M.; Welt, S. Curr. Opin. Immunol. 1997, 9, 717.

6 Multani, P. S.; Grossbard, M. L. J. Clin. Oncol. 1998, 16, 3691.

7 MacBeath, G.; Hilvert, D. Chem. Biol. 1996, 3, 433.

8 Hilvert, D. Annu. Rev. Biochem. 2000, 69, 751.

9 Schultz, P. G.; Yin, J.; Lerner, R. A. Angew. Chem., Int. Ed. 2002, 41, 4427.

10 Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. Nucleic Acids Res. 2000, 28, 235.

11 Allcorn, L. C.; Martin, A. C. R. Bioinformatics 2002, 18, 175.

12 Bernstein, F. C.; Koetzle, T. F.; Williams, G. J. B.; Meyer, E. F.; Brice, M. D.; Rodgers, J. R.; Kennard, O.; Shimanouchi, T.; Tasumi, M. J. Mol. Biol. 1977, 112, 535.

(E0806192 LU, Y. J.)