A Prlic - BioJava update

How to use BioJava to calculate one billion protein structure alignments at the RCSB PDB website Andreas Prlić

description

Presentation by Prlic at BOSC2012 "BioJava Update"

Transcript of A Prlic - BioJava update

Page 1: A Prlic - BioJava update

How to use BioJava to calculate one billion protein structure alignments at the RCSB PDB website

Andreas Prlić

Page 2: A Prlic - BioJava update

My Two Hats

RCSB PDB

BioJava

Page 3: A Prlic - BioJava update

www.pdb.org

Overview

[Figure: number of released entries per year]

Page 4: A Prlic - BioJava update

Some of the things you can do at the RCSB PDB site

• Advanced queries

• Custom reports

• Visualization

• Education section

• Comparisons across PDB, based on sequence and 3D structure similarities

Jmol

LigandExplorer

Custom report

Page 5: A Prlic - BioJava update


Systematic Structural Alignment

Objective: Find novel relationships

Example: Green Fluorescent Protein

• Nidogen-1: similar 11-stranded beta-barrel and internal helices

• 3 Å RMSD, only 9% sequence identity

• Nidogen-1: component of basement membrane, no chromophore

• GFP and NID-1 may share a common ancestor

Page 6: A Prlic - BioJava update

Open Science Grid

Based on the FATCAT (rigid) algorithm

Yuzhen Ye & Adam Godzik. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 2003, vol. 19, suppl. 2, ii246-ii255.

Systematic comparisons of representative chains from 40% sequence identity clusters

22000 sequence clusters

33000 representative domains
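The scale of an all-against-all comparison follows from simple counting. The sketch below is only an illustration: the class and method names are made up for this example, and the slides do not say whether the published total counts each pair once or in both directions.

```java
/** Counts pairwise comparisons for an all-against-all run over n representatives. */
final class PairCount {

    /** Unordered pairs {a, b} with a != b: n * (n - 1) / 2. */
    static long unorderedPairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Ordered pairs (a, b) with a != b: n * (n - 1). */
    static long orderedPairs(long n) {
        return n * (n - 1);
    }

    public static void main(String[] args) {
        long representatives = 33_000L; // figure quoted on the slide
        System.out.println("unordered pairs: " + unorderedPairs(representatives));
        System.out.println("ordered pairs:   " + orderedPairs(representatives));
    }
}
```

For 33000 representatives this gives about 544 million unordered pairs and about 1.09 billion ordered pairs, so the count is in the range of the one billion alignments named in the title.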

Page 7: A Prlic - BioJava update

PDB

Custom Job Management

Java clients can run anywhere

Open Science Grid

Sends out instructions to clients

Writes results to disk
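The division of labor on this slide (a server that sends out instructions and writes results, plus clients that can run anywhere) can be mimicked in a few lines. This is a toy model under stated assumptions, not the RCSB PDB job manager: worker threads stand in for grid clients, a `StringWriter` stands in for the disk, the chain names are hypothetical, and `fakeAlign` is a dummy score rather than FATCAT.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model of a server that hands out alignment jobs and collects results. */
final class JobServerSketch {

    /** One unit of work: compare chain a with chain b (names are placeholders). */
    static final class Job {
        final String a;
        final String b;
        Job(String a, String b) { this.a = a; this.b = b; }
    }

    private final ConcurrentLinkedQueue<Job> pending = new ConcurrentLinkedQueue<>();
    private final Writer out;
    private int written = 0;

    JobServerSketch(List<String> chains, Writer out) {
        this.out = out;
        // Enqueue every unordered pair exactly once.
        for (int i = 0; i < chains.size(); i++) {
            for (int j = i + 1; j < chains.size(); j++) {
                pending.add(new Job(chains.get(i), chains.get(j)));
            }
        }
    }

    /** "Sends out instructions": returns the next job, or null when none are left. */
    Job nextJob() {
        return pending.poll();
    }

    /** "Writes results": synchronized so concurrent clients do not interleave lines. */
    synchronized void submit(Job job, double score) throws IOException {
        out.write(job.a + "\t" + job.b + "\t" + score + "\n");
        written++;
    }

    synchronized int resultsWritten() {
        return written;
    }

    /** Placeholder for a real alignment; NOT FATCAT, just a deterministic dummy score. */
    static double fakeAlign(Job job) {
        return job.a.length() + job.b.length();
    }

    /** Runs nClients worker threads until the queue is drained; returns results written. */
    static int run(List<String> chains, int nClients, Writer out) throws InterruptedException {
        final JobServerSketch server = new JobServerSketch(chains, out);
        ExecutorService clients = Executors.newFixedThreadPool(nClients);
        for (int c = 0; c < nClients; c++) {
            clients.submit(() -> {
                Job job;
                while ((job = server.nextJob()) != null) {
                    try {
                        server.submit(job, fakeAlign(job));
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        clients.shutdown();
        clients.awaitTermination(1, TimeUnit.MINUTES);
        return server.resultsWritten();
    }

    public static void main(String[] args) throws Exception {
        StringWriter disk = new StringWriter(); // stands in for a results file
        int n = run(Arrays.asList("chainA", "chainB", "chainC", "chainD"), 3, disk);
        System.out.println(n + " results written");
        System.out.print(disk);
    }
}
```

Because clients only pull jobs and push results, they need no shared state with each other, which is what makes it possible to run them on whatever machines are available.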

Page 8: A Prlic - BioJava update

Initial calculation of frozen snapshot of PDB

~170k CPU hours on OSG

Incremental weekly updates (~1-2 million alignments)

<1000 CPU hours

Code: www.biojava.org

1 billion alignments freely available at

www.rcsb.org
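A back-of-the-envelope check puts these figures in perspective. The sketch below assumes that the ~170k CPU hours cover roughly the one billion alignments, which the slides do not state explicitly; the class name is invented for this example.

```java
/** Rough CPU cost per alignment from the totals quoted on the slide. */
final class CpuBudget {

    /** Average CPU seconds per alignment for a given number of CPU hours. */
    static double secondsPerAlignment(double cpuHours, double alignments) {
        return cpuHours * 3600.0 / alignments;
    }

    public static void main(String[] args) {
        // Initial run: ~170k CPU hours, assumed to cover ~1 billion alignments.
        System.out.println("initial: " + secondsPerAlignment(170_000, 1e9) + " s");
        // Weekly update: under 1000 CPU hours for 1 to 2 million alignments,
        // so these two values are upper bounds, not measurements.
        System.out.println("weekly, 1M alignments: < " + secondsPerAlignment(1_000, 1e6) + " s");
        System.out.println("weekly, 2M alignments: < " + secondsPerAlignment(1_000, 2e6) + " s");
    }
}
```

Under that assumption the initial run averages about 0.6 CPU seconds per alignment, and the weekly updates stay below a few CPU seconds per alignment.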

Page 9: A Prlic - BioJava update

BioJava

• Major rewrite - BioJava 3

Page 10: A Prlic - BioJava update

BioJava 1 BioJava 3

core data model

symbols/alphabets, counts, distributions

Genome/sequencing

Multiple sequence alignment

Structure alignment

Modfinder

AA Properties

Protein Disorder

Hmmer3 WS

NCBI WS

Parsers: Genbank/Embl/Blast

Page 11: A Prlic - BioJava update

Acknowledgments

• Spencer Bliven

• Peter Rose

• Phil Bourne

• all contributors

• A. Yates, J. Jacobsen, P. Troshin, M. Chapman, J. Gao, C.H. Koh, S. Foisy, R. Holland, G. Rimsa, M. Heuer, H. Brandstaetter-Mueller, S. Willis


Funding

• RCSB PDB

• Google Summer of Code

• Open Science Grid