CADD meeting 08-30-2016

26
Compact representation of 3D macromolecular structures from the PDB

Transcript of CADD meeting 08-30-2016

Page 1: CADD meeting 08-30-2016

Compact representation of 3D macromolecular

structures from the PDB

Page 2: CADD meeting 08-30-2016

Presented by Yana ValasatavaPostdoctoral Researcher

Structural Bioinformatics GroupSan Diego Supercomputer Center

Page 3: CADD meeting 08-30-2016

The PDB evolving complexity

PDB archive

> 30 GB

~250 MB in mmCIF format

Structural biology efforts meet a big-data era:

● Growing size: ~ 120K structures with an

annual growth by ~10K structures

● Evolving complexity: growing

compositional heterogeneity and size

● Increasing usage: > 300,000 users per

month from over 160 countries

3J3Q

3J3Q has more than 1 million atoms

The PDB has more than 1 billion atoms

Page 4: CADD meeting 08-30-2016

★ Interactive visualization○ slow network transfer○ slow parsing○ slow rendering

★ Mobile visualization○ limited bandwidth○ limited memory

★ Large-scale structural analysis○ slow repeated I/O○ slow repeated parsing

Scalability issues

Page 5: CADD meeting 08-30-2016

PDBx/mmCIF

Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes.

repetitive information

redundant annotations

inefficient representation

Page 6: CADD meeting 08-30-2016

PDB/MMTF

The MacroMolecular Transmission Format

MMTF has the following advantages:

❏ it occupies less space (less disk I/O) ❏ it is faster to read (no time-consuming string parsing)❏ it contains precalculated information useful for structural analysis

and visualisation (covalent bonds and bond orders)

Fields:

○ Format data (e.g. the version number of the specification)○ Metadata (e.g. rFree and resolution)○ Structure data (e.g. number of models, chains, groups, atoms)○ Chain data (e.g. list of chain IDs, chain names)○ Group data (e.g. list of group names, formal charges, bonds)○ Atom data (e.g. B-factors, coordinates, occupancies)

https://github.com/rcsb/mmtf/blob/master/spec.md

Page 7: CADD meeting 08-30-2016

MMTF compression pipeline

integer encodingdictionary encodingrun-length encoding

delta encoding

GZIPrecursive indexing

extract structural datacalculate bonds, SSE

The binary container format of MMTF

Page 8: CADD meeting 08-30-2016

Compression pipeline: dictionary encoding

Group Id Symb. AtmId ResId ChainIds x, y, z coordinates (A) Occ. B-factor

ATOM 1 N N ARG A 18 14.699 61.369 62.050 1.00 39.19 ATOM 2 C CA ARG A 18 14.500 62.241 60.856 1.00 38.35 ATOM 3 C C ARG A 18 13.762 61.516 59.729 1.00 36.05

{ "groupName": "ARG",

"singleLetterCode": "R",

"chemCompType": "L-PEPTIDE LINKING",

"atomNameList": [ "N", "CA", "C" ],

"elementList": [ "N", "C", "C"] }

index: 1SER-GLY-ARG-SER-SER

groupTypeList: [ 2, 0, 1, 2, 2 ]

Page 9: CADD meeting 08-30-2016

Compression pipeline: encodings

Group Id Symb. AtmId ResId ChainIds x, y, z coordinates (A) Occ. B-factor

ATOM 1 N N ARG A 18 14.699 61.369 62.050 1.00 39.19 ATOM 2 C CA ARG A 18 14.500 62.241 60.856 1.00 38.35 ATOM 3 C C ARG A 18 13.762 61.516 59.729 1.00 36.05

14.699 -> 1469914.500 -> 14500 169

1,2,3->1,1,1->1,3(delta + run-length) -> (integer + delta)

integer encoding: map floating point numbers to integer

run-length encoding: stretches of equal values are represented by the value itself and the occurrence count

delta encoding: differences (deltas) between the numbers are stored

Page 10: CADD meeting 08-30-2016

Compression pipeline: Recursive Indexing

Group Id Symb. AtmId ResId ChainIds x, y, z coordinates (A) Occ. B-factor

ATOM 1 N N ARG A 18 14.699 61.369 62.050 1.00 39.19 ATOM 2 C CA ARG A 18 14.500 62.241 60.856 1.00 38.35 ATOM 3 C C ARG A 18 13.762 61.516 59.729 1.00 36.05

Recursive Indexing: [-50, -128, 7, 127, 268] -> [-50, -128, 0, 7, 127, 0, 127, 127, 14]

Array of 8-bit integer values, so the open interval is (127, -128):

Page 11: CADD meeting 08-30-2016

Overview of data

Full format• all atoms (useful for structural bioinformatics analysis)• coordinates with 3 decimal place precision (no loss after decoding)

Reduced format• C-alpha/phosphate backbone atoms and ligands (useful for

visualisation and some structural bioinformatics)• coordinates with 1 decimal place precision (almost further 40 %

reduction in size)• exactly same data structure as full (parsers work for both)

Page 12: CADD meeting 08-30-2016

MMTF size and parsing speed

* Parsing using Java libraries

Page 13: CADD meeting 08-30-2016

Using MMTF

To efficiently store, transmit, and visualize the 3D structures of biological macromolecules

To perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory

Page 14: CADD meeting 08-30-2016

Presented by Anthony BradleyPostdoctoral Researcher

Structural Bioinformatics GroupSan Diego Supercomputer Center

Page 15: CADD meeting 08-30-2016

Using MMTF

To efficiently store, transmit, and visualize the 3D structures of biological macromolecules

To perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory

Page 16: CADD meeting 08-30-2016

Goals

• Analysis should be easy and simple

• Whole archive analysis of the PDB should be trivial AND fast

• Big Data tools (e.g. Spark and Hadoop) are available

Page 17: CADD meeting 08-30-2016

mmtf-python

mmtf-java

Nobody should (have to) write their own parser. Ever.

Page 18: CADD meeting 08-30-2016

MMTF-Spark - Simple API

Page 19: CADD meeting 08-30-2016

Continued…..

Page 20: CADD meeting 08-30-2016

Data mining - speed advantage

Page 21: CADD meeting 08-30-2016

Contact finding

Page 22: CADD meeting 08-30-2016

Contact finding

Page 23: CADD meeting 08-30-2016

Pros and consPros:

● Looping through the whole library performing simple analyses

● Simple to parallelize code● Much more complete data

Cons:

● Tied to Java ● Not a magic unicorn

Page 24: CADD meeting 08-30-2016

Pros and consPros:

● Looping through the whole library performing simple analyses

● Simple to parallelize code● Much more complete data

Cons:

● Tied to Java ● Not a magic unicorn

Page 25: CADD meeting 08-30-2016

Thanks!• http://mmtf.rcsb.org/

• https://github.com/rcsb/mmtf-javascript

• https://github.com/rcsb/mmtf-java

• https://github.com/rcsb/mmtf-python

• http://spark.apache.org/

Page 26: CADD meeting 08-30-2016

Acknowledgements

NCI/NIH (U01 CA198942)