Compact representation of 3D macromolecular
structures from the PDB
Presented by Yana Valasatava
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
The PDB evolving complexity
PDB archive
> 30 GB
~250 MB in mmCIF format
Structural biology efforts meet a big-data era:
● Growing size: ~120K structures, with
annual growth of ~10K structures
● Evolving complexity: growing
compositional heterogeneity and size
● Increasing usage: > 300,000 users per
month from over 160 countries
3J3Q
3J3Q has more than 1 million atoms
The PDB has more than 1 billion atoms
Scalability issues
★ Interactive visualization
  ○ slow network transfer
  ○ slow parsing
  ○ slow rendering
★ Mobile visualization
  ○ limited bandwidth
  ○ limited memory
★ Large-scale structural analysis
  ○ slow repeated I/O
  ○ slow repeated parsing
PDBx/mmCIF
Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes.
repetitive information
redundant annotations
inefficient representation
PDB/MMTF
The MacroMolecular Transmission Format
MMTF has the following advantages:
❏ it occupies less space (less disk I/O)
❏ it is faster to read (no time-consuming string parsing)
❏ it contains precalculated information useful for structural analysis and visualisation (covalent bonds and bond orders)
Fields:
○ Format data (e.g. the version number of the specification)
○ Metadata (e.g. rFree and resolution)
○ Structure data (e.g. number of models, chains, groups, atoms)
○ Chain data (e.g. list of chain IDs, chain names)
○ Group data (e.g. list of group names, formal charges, bonds)
○ Atom data (e.g. B-factors, coordinates, occupancies)
https://github.com/rcsb/mmtf/blob/master/spec.md
MMTF compression pipeline
[Figure: pipeline stages: extract structural data; calculate bonds, SSE; dictionary encoding; integer encoding; delta encoding; run-length encoding; recursive indexing; GZIP]
The binary container format of MMTF
Compression pipeline: dictionary encoding
Group  Id  Symb.  AtmId  ResId  ChainIds  x, y, z coordinates (Å)   Occ.  B-factor
ATOM   1   N      N      ARG    A 18      14.699  61.369  62.050    1.00  39.19
ATOM   2   C      CA     ARG    A 18      14.500  62.241  60.856    1.00  38.35
ATOM   3   C      C      ARG    A 18      13.762  61.516  59.729    1.00  36.05
{ "groupName": "ARG",
"singleLetterCode": "R",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": [ "N", "CA", "C" ],
"elementList": [ "N", "C", "C"] }
index: 1
SER-GLY-ARG-SER-SER
groupTypeList: [ 2, 0, 1, 2, 2 ]
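The dictionary encoding step can be sketched in Python as follows. This is only an illustration, not the MMTF library code: the names GROUP_LIST, encode_groups, and decode_groups are hypothetical, the group entries are trimmed to two fields, and the indices for GLY (0) and SER (2) are inferred from the groupTypeList on the slide (only ARG at index 1 is shown explicitly).

```python
# Hypothetical, simplified group dictionary: each distinct group is stored once.
# Index 1 is the ARG entry from the slide; GLY = 0 and SER = 2 are inferred.
GROUP_LIST = [
    {"groupName": "GLY", "singleLetterCode": "G"},
    {"groupName": "ARG", "singleLetterCode": "R"},
    {"groupName": "SER", "singleLetterCode": "S"},
]


def encode_groups(residue_names, group_list):
    """Replace each residue name by its index in the group dictionary."""
    index_of = {group["groupName"]: i for i, group in enumerate(group_list)}
    return [index_of[name] for name in residue_names]


def decode_groups(group_type_list, group_list):
    """Look each index up in the dictionary to recover the residue names."""
    return [group_list[i]["groupName"] for i in group_type_list]


sequence = ["SER", "GLY", "ARG", "SER", "SER"]
group_type_list = encode_groups(sequence, GROUP_LIST)
print(group_type_list)  # [2, 0, 1, 2, 2]
print("-".join(decode_groups(group_type_list, GROUP_LIST)))  # SER-GLY-ARG-SER-SER
```

The saving comes from storing the per-group information (atom names, elements, and so on) once per distinct group rather than once per residue.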
Compression pipeline: encodings
Group  Id  Symb.  AtmId  ResId  ChainIds  x, y, z coordinates (Å)   Occ.  B-factor
ATOM   1   N      N      ARG    A 18      14.699  61.369  62.050    1.00  39.19
ATOM   2   C      CA     ARG    A 18      14.500  62.241  60.856    1.00  38.35
ATOM   3   C      C      ARG    A 18      13.762  61.516  59.729    1.00  36.05
Atom IDs (delta + run-length): 1, 2, 3 -> 1, 1, 1 -> 1, 3
x coordinates (integer + delta): 14.699 -> 14699, 14.500 -> 14500; delta: 14500 - 14699 = -199
integer encoding: floating-point numbers are mapped to integers by multiplying by a fixed factor (e.g. 14.699 × 1000 -> 14699)
run-length encoding: stretches of equal values are represented by the value itself and the occurrence count
delta encoding: the differences (deltas) between consecutive numbers are stored
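The three encodings can be sketched in a few lines of Python, using the atom IDs and x coordinates from the example rows. The function names are hypothetical and the factor of 1000 is an assumption that matches the three-decimal coordinates shown here; this is not the reference implementation.

```python
def integer_encode(values, factor=1000):
    """Map floats to integers; factor=1000 keeps three decimal places."""
    return [round(v * factor) for v in values]


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values):
    """Flat list of (value, count) pairs for stretches of equal values."""
    out = []
    for v in values:
        if out and out[-2] == v:
            out[-1] += 1
        else:
            out.extend([v, 1])
    return out


atom_ids = [1, 2, 3]
print(run_length_encode(delta_encode(atom_ids)))  # [1, 3]

x_coords = [14.699, 14.500, 13.762]
x_ints = integer_encode(x_coords)
print(x_ints)                # [14699, 14500, 13762]
print(delta_encode(x_ints))  # [14699, -199, -738]
```

Delta encoding pays off because neighbouring atoms are close in space, so the differences are small numbers that later stages can pack into fewer bytes.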
Compression pipeline: Recursive Indexing
The values are stored in an array of 8-bit integers, so the open interval is (-128, 127):
Recursive Indexing: [-50, -128, 7, 127, 268] -> [-50, -128, 0, 7, 127, 0, 127, 127, 14]
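A minimal Python sketch of recursive indexing for 8-bit values follows, assuming the rule that a value is stored directly only if it lies strictly between -128 and 127; otherwise the nearer endpoint is stored and subtracted until the remainder fits. The function names are hypothetical.

```python
def recursive_index_encode(values, max_v=127, min_v=-128):
    """Split each value into endpoint steps plus a remainder inside (min_v, max_v)."""
    out = []
    for v in values:
        if v >= 0:
            while v >= max_v:
                out.append(max_v)
                v -= max_v
        else:
            while v <= min_v:
                out.append(min_v)
                v -= min_v
        out.append(v)  # remainder, strictly inside the open interval
    return out


def recursive_index_decode(encoded, max_v=127, min_v=-128):
    """Sum endpoint steps until a non-endpoint value closes the group."""
    out, total = [], 0
    for v in encoded:
        total += v
        if v != max_v and v != min_v:
            out.append(total)
            total = 0
    return out


original = [-50, -128, 7, 127, 268]
encoded = recursive_index_encode(original)
print(encoded)  # [-50, -128, 0, 7, 127, 0, 127, 127, 14]
print(recursive_index_decode(encoded))  # [-50, -128, 7, 127, 268]
```

The endpoints themselves need a trailing 0 (as for -128 and 127 above) so that the decoder can tell where each value ends.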
Overview of data
Full format
• all atoms (useful for structural bioinformatics analysis)
• coordinates with 3 decimal place precision (no loss after decoding)
Reduced format
• C-alpha/phosphate backbone atoms and ligands (useful for visualisation and some structural bioinformatics)
• coordinates with 1 decimal place precision (almost a further 40% reduction in size)
• exactly the same data structure as the full format (parsers work for both)
MMTF size and parsing speed
* Parsing using Java libraries
Using MMTF
To efficiently store, transmit, and visualize the 3D structures of biological macromolecules
To perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
Presented by Anthony Bradley
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
Goals
• Analysis should be easy and simple
• Whole archive analysis of the PDB should be trivial AND fast
• Big Data tools (e.g. Spark and Hadoop) are available
mmtf-python
mmtf-java
Nobody should (have to) write their own parser. Ever.
MMTF-Spark - Simple API
Data mining - speed advantage
Contact finding
Pros and cons
Pros:
● Looping through the whole library performing simple analyses
● Simple to parallelize code
● Much more complete data
Cons:
● Tied to Java
● Not a magic unicorn
Thanks!
• http://mmtf.rcsb.org/
• https://github.com/rcsb/mmtf-javascript
• https://github.com/rcsb/mmtf-java
• https://github.com/rcsb/mmtf-python
• http://spark.apache.org/
Acknowledgements
NCI/NIH (U01 CA198942)