CADD meeting 08-30-2016

Compact representation of 3D macromolecular structures from the PDB

Presented by Yana Valasatava, Postdoctoral Researcher

Structural Bioinformatics Group, San Diego Supercomputer Center

The PDB evolving complexity

PDB archive

> 30 GB

~250 MB in mmCIF format

Structural biology efforts meet a big-data era:

● Growing size: ~120K structures with an annual growth of ~10K structures

● Evolving complexity: growing compositional heterogeneity and size

● Increasing usage: > 300,000 users per month from over 160 countries

3J3Q

3J3Q has more than 1 million atoms

The PDB has more than 1 billion atoms

Scalability issues

★ Interactive visualization
    ○ slow network transfer
    ○ slow parsing
    ○ slow rendering

★ Mobile visualization
    ○ limited bandwidth
    ○ limited memory

★ Large-scale structural analysis
    ○ slow repeated I/O
    ○ slow repeated parsing

PDBx/mmCIF

Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes.

repetitive information

redundant annotations

inefficient representation

PDB/MMTF

The MacroMolecular Transmission Format

MMTF has the following advantages:

❏ it occupies less space (less disk I/O)

❏ it is faster to read (no time-consuming string parsing)

❏ it contains precalculated information useful for structural analysis and visualisation (covalent bonds and bond orders)

Fields:

○ Format data (e.g. the version number of the specification)

○ Metadata (e.g. rFree and resolution)

○ Structure data (e.g. number of models, chains, groups, atoms)

○ Chain data (e.g. list of chain IDs, chain names)

○ Group data (e.g. list of group names, formal charges, bonds)

○ Atom data (e.g. B-factors, coordinates, occupancies)

https://github.com/rcsb/mmtf/blob/master/spec.md



MMTF compression pipeline

Figure labels:

○ extract structural data

○ calculate bonds, SSE

○ dictionary encoding

○ integer encoding

○ delta encoding

○ run-length encoding

○ recursive indexing

○ GZIP

The binary container format of MMTF

Compression pipeline: dictionary encoding

Group Id Symb. AtmId ResId ChainIds x, y, z coordinates (Å) Occ. B-factor

ATOM 1 N N ARG A 18 14.699 61.369 62.050 1.00 39.19
ATOM 2 C CA ARG A 18 14.500 62.241 60.856 1.00 38.35
ATOM 3 C C ARG A 18 13.762 61.516 59.729 1.00 36.05

{ "groupName": "ARG",

"singleLetterCode": "R",

"chemCompType": "L-PEPTIDE LINKING",

"atomNameList": [ "N", "CA", "C" ],

"elementList": [ "N", "C", "C"] }

index: 1

SER-GLY-ARG-SER-SER

groupTypeList: [ 2, 0, 1, 2, 2 ]
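The dictionary-encoding idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the MMTF reference implementation: the function names are hypothetical, plain group names stand in for the full group records (atom names, elements, and so on), and the order of `group_list` is chosen so that the indices match the slide.

```python
def dictionary_encode(groups, group_list):
    """Replace each group by its index in group_list.

    group_list plays the role of the dictionary: every distinct group
    is stored once, and unseen groups are appended on first use.
    """
    indices = []
    for group in groups:
        if group not in group_list:
            group_list.append(group)
        indices.append(group_list.index(group))
    return indices


def dictionary_decode(indices, group_list):
    """Look each index up in the dictionary to restore the sequence."""
    return [group_list[i] for i in indices]


# Assumption: names stand in for full group records, ordered as on the slide.
group_list = ["GLY", "ARG", "SER"]
sequence = ["SER", "GLY", "ARG", "SER", "SER"]

group_type_list = dictionary_encode(sequence, group_list)
restored = dictionary_decode(group_type_list, group_list)
```

Each residue then costs one small integer, while the repeated per-group information (atom names, elements, chemical component type) is stored once per distinct group.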

Compression pipeline: encodings



14.699 -> 14699
14.500 -> 14500 169
(integer + delta)

1,2,3 -> 1,1,1 -> 1,3
(delta + run-length)

integer encoding: map floating point numbers to integers by multiplying by a fixed factor and rounding (14.699 -> 14699 with a factor of 1000)

run-length encoding: stretches of equal values are represented by the value itself and the occurrence count

delta encoding: differences (deltas) between the numbers are stored
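The three encodings defined above can be sketched as small Python functions. This is a minimal illustration under stated assumptions, not the MMTF library code: the function names are hypothetical, and the factor of 1000 is assumed from the 14.699 -> 14699 example (3 decimal places).

```python
def integer_encode(values, factor=1000):
    """Map floats to integers by scaling and rounding (factor is an assumption)."""
    return [round(v * factor) for v in values]


def integer_decode(ints, factor=1000):
    return [i / factor for i in ints]


def delta_encode(ints):
    """Keep the first value, then store differences between neighbours."""
    if not ints:
        return []
    return [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]


def delta_decode(deltas):
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(ints):
    """Store each stretch of equal values as a (value, count) pair, flattened."""
    out = []
    for v in ints:
        if out and out[-2] == v:
            out[-1] += 1          # extend the current run
        else:
            out.extend([v, 1])    # start a new run
    return out


def run_length_decode(pairs):
    out = []
    for value, count in zip(pairs[0::2], pairs[1::2]):
        out.extend([value] * count)
    return out


coords = integer_encode([14.699, 14.500])                # integer encoding
serials = run_length_encode(delta_encode([1, 2, 3]))     # delta + run-length
```

Delta encoding pays off when consecutive values are close (coordinates along a chain) and combines well with run-length encoding when the differences are constant (consecutive serial numbers).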

Compression pipeline: Recursive Indexing



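Array of 8-bit integer values, so the open interval is (-128, 127): a value that reaches or exceeds a limit is written as repeated limit values followed by the remainder.

Recursive Indexing: [-50, -128, 7, 127, 268] -> [-50, -128, 0, 7, 127, 0, 127, 127, 14]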
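A minimal Python sketch of recursive indexing, written to reproduce the example on this slide. The function names and the `upper`/`lower` parameters are hypothetical; the defaults assume the 8-bit limits of 127 and -128.

```python
def recursive_index_encode(values, upper=127, lower=-128):
    """Split values that reach a limit into repeated limits plus a remainder."""
    out = []
    for v in values:
        while v >= upper:        # emit the upper limit until the rest fits
            out.append(upper)
            v -= upper
        while v <= lower:        # same for the lower limit
            out.append(lower)
            v -= lower
        out.append(v)            # remainder, strictly inside (lower, upper)
    return out


def recursive_index_decode(encoded, upper=127, lower=-128):
    """Sum up limit values until a non-limit value closes the number."""
    out, acc = [], 0
    for v in encoded:
        acc += v
        if v != upper and v != lower:
            out.append(acc)
            acc = 0
    return out


encoded = recursive_index_encode([-50, -128, 7, 127, 268])
decoded = recursive_index_decode(encoded)
```

Because a limit value always signals that more follows, a value exactly equal to a limit needs a trailing 0, which is why 127 becomes 127, 0 in the example.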

Overview of data

Full format

• all atoms (useful for structural bioinformatics analysis)

• coordinates with 3 decimal place precision (no loss after decoding)

Reduced format

• C-alpha/phosphate backbone atoms and ligands (useful for visualisation and some structural bioinformatics)

• coordinates with 1 decimal place precision (a further size reduction of almost 40%)

• exactly the same data structure as the full format (parsers work for both)

MMTF size and parsing speed

* Parsing using Java libraries

Using MMTF

To efficiently store, transmit, and visualize the 3D structures of biological macromolecules

To perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory

Presented by Anthony Bradley, Postdoctoral Researcher

Structural Bioinformatics Group, San Diego Supercomputer Center

Goals

• Analysis should be easy and simple

• Whole archive analysis of the PDB should be trivial AND fast

• Big Data tools (e.g. Spark and Hadoop) are available

mmtf-python

mmtf-java

Nobody should (have to) write their own parser. Ever.

MMTF-Spark - Simple API

Data mining - speed advantage

Contact finding

Pros and cons

Pros:

● Looping through the whole library performing simple analyses

● Simple to parallelize code

● Much more complete data

Cons:

● Tied to Java

● Not a magic unicorn

Thanks!

• http://mmtf.rcsb.org/

• https://github.com/rcsb/mmtf-javascript

• https://github.com/rcsb/mmtf-java

• https://github.com/rcsb/mmtf-python

• http://spark.apache.org/

Acknowledgements

NCI/NIH (U01 CA198942)