The TEXTAL System: Automated Model-Building Using Pattern Recognition Techniques
Dr. Thomas R. Ioerger
Department of Computer Science
Texas A&M University
Collaboration with: Dr. James C. Sacchettini, Center for Structural Biology, Texas A&M Univ.
With support from: National Institutes of Health
Automated Structure Determination
• Key step to high-throughput Structural Genomics, structure-based drug design, etc.
• Many computational tools to generate a map, but...
• Given an electron density map, how to extract atomic coordinates automatically?
• Currently requires humans (+O): potential bottleneck
• Sources of difficulty: complexity, low resolution, phase errors, weak density
• Related methods: Shake&Bake, ARP/wARP, X-Powerfit, template convolution...
Overview of TEXTAL
• Apply pattern recognition techniques
• Exploit database of previously-solved maps
• Model molecular structures in local regions (e.g. spheres of 5 Angstrom radius)
• Intuitive principles:
1) Have I ever seen a region with a pattern of density like this before?
2) If so, what were the previous local atomic coordinates?
Overview (cont’d)
• Divide-and-Conquer:
1) identify alpha-carbon positions (chain-tracing)
2) model regions around alpha-carbons (CAs), including backbone and side-chain atoms
3) concatenate local models back together, resolve any conflicts
• Database contains many regions centered on CAs from previous maps
• ~5A radius right for “structural repetition”
Main Stages of TEXTAL
[Flowchart] electron density map → CAPRA → C-alpha chains → LOOKUP (build in side-chain and main-chain atoms locally around each CA) → model (initial coordinates) → Post-processing routines (example: real-space refinement) → model (final coordinates)
Other boxes in the flowchart: Human Crystallographer (editing); Reciprocal-space refinement/ML DM
Feature Extraction
• Database: ~10^5 regions from ~100 maps
• How to identify the closest match efficiently?
• Calculate numerical features that represent the pattern in each region
• Must be rotation-invariant
• Search can be very fast: just compare features
F=<1.72,-0.39,1.04,1.55...> F=<1.58,0.18,1.09,-0.25...>
F=<0.90,0.65,-1.40,0.87...> F=<1.79,-0.43,0.88,1.52...>
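Rotation-Invariant Features
• Average density: $\bar{\rho} = (1/n)\sum_i \rho_i$, where $\rho_i$ is the density at each lattice point in the region
• Other statistical features: standard deviation, kurtosis…
• Distance to center of mass:
– $\langle x_c, y_c, z_c \rangle = (1/n)\,\langle \sum_i x_i\rho_i/\bar{\rho},\ \sum_i y_i\rho_i/\bar{\rho},\ \sum_i z_i\rho_i/\bar{\rho} \rangle$
– $d_{cen} = \sqrt{x_c^2 + y_c^2 + z_c^2}$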
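As a rough illustration of the features on this slide, the sketch below computes the average density, the standard deviation, and the distance to the center of mass for one region, then checks that they do not change when the region is rotated. The function names, the random test region, and the use of NumPy are assumptions made for the example; this is not TEXTAL's own code.

```python
import numpy as np

def region_features(coords, density):
    """Average density, standard deviation, and distance to center of mass.

    coords  : (n, 3) lattice-point positions relative to the region center
    density : (n,) density value at each lattice point (assumed not all zero)
    """
    coords = np.asarray(coords, dtype=float)
    density = np.asarray(density, dtype=float)
    n = density.size
    mean = density.sum() / n                     # average density
    std = density.std()                          # another statistical feature
    # density-weighted center of mass: (1/n) * sum(x_i * rho_i) / mean
    center = (coords * density[:, None]).sum(axis=0) / (n * mean)
    d_cen = float(np.sqrt((center ** 2).sum()))  # distance from region center
    return np.array([mean, std, d_cen])

def rotation_z(angle_deg):
    """Rotation matrix about the z axis (helper for the demonstration only)."""
    t = np.radians(angle_deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical region: 200 random lattice points with positive density.
rng = np.random.default_rng(0)
coords = rng.uniform(-5.0, 5.0, size=(200, 3))
density = rng.uniform(0.1, 2.0, size=200)

original = region_features(coords, density)
rotated = region_features(coords @ rotation_z(40.0).T, density)
print(np.allclose(original, rotated))  # the features agree after rotation
```

The rotation moves every lattice point but leaves the density values and all distances from the center unchanged, which is why these three numbers are safe to compare between regions in arbitrary orientations.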
More Features
• Moments of inertia
– measures dispersion around axes of symmetry in a density distribution
– calculate 3x3 inertia matrix
– diagonalize to get eigenvalues
– sort from largest to smallest
– take magnitudes and ratios of moments
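The moment-of-inertia steps can be sketched as follows. The code assumes the moments are taken about the density-weighted center of mass and that density plays the role of mass; the slides do not state either choice, and the function and variable names are hypothetical.

```python
import numpy as np

def moment_features(coords, density):
    """Sorted moments of inertia of a density region, plus two ratios."""
    coords = np.asarray(coords, dtype=float)
    density = np.asarray(density, dtype=float)
    # Assumption: moments are computed about the density-weighted centroid.
    com = (coords * density[:, None]).sum(axis=0) / density.sum()
    r = coords - com
    # Inertia matrix: sum_i rho_i * (|r_i|^2 * Identity - r_i r_i^T)
    total = (density * (r ** 2).sum(axis=1)).sum()
    outer = (r * density[:, None]).T @ r
    inertia = total * np.eye(3) - outer
    # Diagonalize (symmetric matrix) and sort from largest to smallest.
    moments = np.sort(np.linalg.eigvalsh(inertia))[::-1]
    return {
        "moments": moments,
        "ratio_1_2": moments[0] / moments[1],
        "ratio_1_3": moments[0] / moments[2],
    }

# Hypothetical elongated region: points stretched along x.
rng = np.random.default_rng(1)
coords = rng.normal(size=(300, 3)) * np.array([4.0, 1.0, 1.0])
density = rng.uniform(0.1, 2.0, size=300)

# Rotate the same region about z to show the eigenvalues do not change.
t = np.radians(25.0)
rot = np.array([[np.cos(t), -np.sin(t), 0.0],
                [np.sin(t),  np.cos(t), 0.0],
                [0.0,        0.0,       1.0]])

feats = moment_features(coords, density)
feats_rot = moment_features(coords @ rot.T, density)
print(feats["moments"], feats["ratio_1_3"])
```

Because the eigenvalues of the inertia matrix depend only on the shape of the density distribution, not on how it is oriented, both the magnitudes and the ratios are usable as rotation-invariant features.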
More Features
• Spoke angles
– if region centered on CA, should have 3 “spokes” of density emanating from center
– find best-fit vectors; calc. angles among them
• surface area of contours
• connectivity of density/bones in region
• other geometrical features...
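To make the spoke-angle feature concrete, here is a small sketch that takes three spoke vectors as given and returns the minimum, median, and maximum angle among them (the three summaries that appear in the feature-weight table). Finding the best-fit vectors in the density is not shown; the input vectors and the function name are hypothetical.

```python
import itertools
import numpy as np

def spoke_angles(spokes):
    """Pairwise angles (degrees) among spoke vectors, summarized."""
    unit = [np.asarray(v, dtype=float) / np.linalg.norm(v) for v in spokes]
    angles = []
    for a, b in itertools.combinations(unit, 2):
        cosine = np.clip(np.dot(a, b), -1.0, 1.0)  # guard against rounding
        angles.append(np.degrees(np.arccos(cosine)))
    angles = np.sort(angles)
    return {
        "min": float(angles[0]),
        "median": float(np.median(angles)),
        "max": float(angles[-1]),
    }

# Hypothetical spokes: two opposite directions plus one perpendicular to both.
result = spoke_angles([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
print(result)  # angles of roughly 90, 90, and 180 degrees
```

Angles between vectors are unchanged by rotating the whole region, so these values satisfy the rotation-invariance requirement.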
Feature Weights
Feature                       Weight   Radius (A)
Distance to center of mass    0.183    5
ratio of moments 1 and 3      0.153    4
ratio of moments 1 and 3      0.136    5
skewness                      0.080    3
skewness                      0.055    6
ratio of moments 1 and 2      0.055    4
median spoke angle            0.052    6
minimum spoke angle           0.051    4
skewness                      0.049    5
ratio of moments 1 and 2      0.038    5
maximum spoke angle           0.037    4
ratio of moments 1 and 3      0.031    3
magnitude of moment 1         0.022    6
median spoke angle            0.019    4
minimum spoke angle           0.015    6
CAPRA: C-Alpha Pattern-Recognition Algorithm
• Tracer - remove lattice points from map (lowest density first) without breaking connectivity
• Neural network - for each pseudo atom, extract features, input to network, predict distances to CAs (1:10 in trace), trained on example points in real maps
• Linking - desire long chains, good CA predictions (not in side-chains), “structurally plausible” (e.g. linear, helical)
[Diagram] map → Density Trace → pseudo atoms → Neural Network → predictions of distance to true CA → Linking into C-alpha chains → C-alpha coordinates
Example of the CAPRA Process
Example of CAPRA chains
The LOOKUP Process
Database Construction
• Ideally would use solved MAD/MIR maps
• Using “back-transformed” maps works well
• PDB structure factors (include B-factors)
• keep reflections down to 2.8A
• Fourier transform electron density map
• 50 proteins from PDBSelect (non-homol.)
• about 50,000 regions
• Feature extraction done offline
Details of Matching Process
• Feature-based matching:
– Euclidean distance metric between feature vectors.
– $\mathrm{dist}(R_1,R_2)=\sum_i w_i\,(F_i(R_1)-F_i(R_2))^2$
• Must weight features by relevance
– less-relevant features add noise
– Slider algorithm: optimize weights by comparing features in matching regions versus mismatches
• Verify selections by density correlation
– requires search for optimal rotation
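A minimal sketch of the feature-based lookup: score every database region against a query with the weighted squared difference of features, then keep the closest candidates. The vectors are the first four components of the example feature vectors shown on the Feature Extraction slide; treating one as the query, using uniform weights, and the function names are all assumptions for illustration. The weight-optimization (Slider) and density-correlation steps are not shown.

```python
import numpy as np

def weighted_dist(f1, f2, weights):
    """dist(R1, R2) = sum_i w_i * (F_i(R1) - F_i(R2))^2"""
    f1, f2, weights = (np.asarray(a, dtype=float) for a in (f1, f2, weights))
    return float(np.sum(weights * (f1 - f2) ** 2))

def nearest_regions(query, database, weights, k=2):
    """Indices and distances of the k database regions closest to the query."""
    database = np.asarray(database, dtype=float)
    diffs = database - np.asarray(query, dtype=float)
    dists = (np.asarray(weights, dtype=float) * diffs ** 2).sum(axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

# Truncated example vectors from the slide (only the components shown there).
query = [1.72, -0.39, 1.04, 1.55]
database = [
    [1.58, 0.18, 1.09, -0.25],
    [0.90, 0.65, -1.40, 0.87],
    [1.79, -0.43, 0.88, 1.52],
]
weights = [1.0, 1.0, 1.0, 1.0]  # assumption: uniform weights for the demo

order, dists = nearest_regions(query, database, weights)
print(order[0])  # index 2: the vector <1.79,-0.43,0.88,1.52> is closest
```

Since only short feature vectors are compared, the whole database can be scanned quickly, and the more expensive density-correlation check with its rotation search is needed only for the few top candidates.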
Post-Processing Routines
• Imperfections in the initial model:
– backbone atoms not necessarily juxtaposed between adjacent residues, or in same direction
– side-chains occasionally “flipped” into backbone
– residue identities often incorrect (based on density)
• Fixing “flips” and direction - take candidate match with next highest correlation
• Real-space refinement: regularizes backbone
• Use sequence alignment to fix identities?
New Results on Real MAD Maps
protein   size (#aa)   type source   reso.
CZRA      95           MAD           2.3A
M01       317          MAD           2.4A

                   CZRA        M01
# residues built   86          286
CAs missed         11 (a)      34 (b)
incorrect CAs      2           5
# chains           4           8
longest chains     39,27,18    96,85,65
CA RMSD            0.79        0.97
overall RMSD       0.84        1.07

(a) CZRA: missed a 5-res loop (weak density) and C-terminus
(b) M01: missed a 17-res helix, 9 deletions, 5 due to breaks, 3-res false backbone
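For readers unfamiliar with the RMSD rows in the table, the sketch below shows the standard root-mean-square deviation between two sets of matched atoms. It assumes the atoms are already paired one-to-one and that both sets are in the same coordinate frame (no superposition step); how TEXTAL paired atoms for these numbers is not stated on the slide, and the coordinates here are made up.

```python
import numpy as np

def rmsd(model_xyz, true_xyz):
    """Root-mean-square deviation between paired atom coordinates."""
    model_xyz = np.asarray(model_xyz, dtype=float)
    true_xyz = np.asarray(true_xyz, dtype=float)
    squared = ((model_xyz - true_xyz) ** 2).sum(axis=1)  # per-atom squared distance
    return float(np.sqrt(squared.mean()))

# Hypothetical pair of two-atom models, each atom displaced by exactly 1 unit.
model = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
truth = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
value = rmsd(model, truth)
print(value)
```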
Histograms of Distances Between Matched Atoms
[Figure: two histograms, one panel for CZRA (count axis 0 to 160) and one for M01 (count axis 0 to 400); distance bins from 0.2 to 3 on the horizontal axis]
Analysis of Amino Acid Types
Confusion Matrix for CZRA (rows: amino acid in TEXTAL model; columns: amino acid in true structure)
    G A C S P V T I D N L Q E M H K F Y R W
G | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A | 1 3 0 2 0 2 0 0 1 0 0 1 0 0 2 0 0 0 1 0
C | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
S | 1 2 0 5 0 1 0 0 0 1 2 1 0 0 2 1 0 0 0 0
P | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V | 0 0 0 2 0 3 2 3 0 1 2 1 1 0 0 1 0 0 0 0
T | 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0
I | 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0
D | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
N | 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
L | 0 0 0 0 0 0 0 1 1 2 6 0 1 0 0 1 0 1 0 0
Q | 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
E | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
M | 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
H | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
K | 0 0 0 0 0 0 0 0 0 1 0 0 2 0 2 1 0 0 2 0
F | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Y | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
R | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
W | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

       % identity   % struct. sim.
CZRA   27.4         71.4
M01    16.3         56.2
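The % identity figure for CZRA can be reproduced from the confusion matrix: it is the number of residues on the diagonal (model type equals true type) divided by the total number of residues counted. The sketch below assumes that definition and uses the matrix values as transcribed from the slide; the variable names are hypothetical. The structural-similarity percentage depends on a grouping of amino acids that the slide does not define, so it is not computed here.

```python
import numpy as np

# Rows: amino acid in TEXTAL model; columns: amino acid in true structure.
# Order of both axes: G A C S P V T I D N L Q E M H K F Y R W
ROWS = """
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 3 0 2 0 2 0 0 1 0 0 1 0 0 2 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2 0 5 0 1 0 0 0 1 2 1 0 0 2 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 2 0 3 2 3 0 1 2 1 1 0 0 1 0 0 0 0
0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0
0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 2 6 0 1 0 0 1 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 2 0 2 1 0 0 2 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
"""

matrix = np.array([[int(x) for x in line.split()]
                   for line in ROWS.strip().splitlines()])

correct = int(np.trace(matrix))   # residues whose type was assigned correctly
total = int(matrix.sum())         # all residues counted in the matrix
identity = 100.0 * correct / total
print(correct, total, round(identity, 1))  # 23 84 27.4
```

The result matches the 27.4% identity reported for CZRA in the table above.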