The TEXTAL System: Automated Model-Building Using Pattern Recognition Techniques Dr. Thomas R....

The TEXTAL System: Automated Model-Building Using Pattern Recognition Techniques

Dr. Thomas R. Ioerger
Department of Computer Science
Texas A&M University

Collaboration with: Dr. James C. Sacchettini, Center for Structural Biology, Texas A&M Univ.
With support from: National Institutes of Health

Automated Structure Determination

• Key step to high-throughput Structural Genomics, structure-based drug design, etc.
• Many computational tools to generate a map, but...
• Given electron density map, how to extract atomic coordinates automatically?
• Currently requires humans (+O): potential bottleneck
• Sources of difficulty: complexity, low resolution, phase errors, weak density
• Related methods: Shake&Bake, ARP/wARP, X-Powerfit, template convolution...

Overview of TEXTAL

• Apply pattern recognition techniques

• Exploit database of previously-solved maps

• Model molecular structures in local regions (e.g. spheres of 5 Angstrom radius)

• Intuitive principles:

1) Have I ever seen a region with a pattern of density like this before?

2) If so, what were previous local atomic coordinates?

Overview (cont’d)

• Divide-and-Conquer:

1) identify alpha-carbon positions (chain-tracing)

2) model regions around alpha-carbons (CAs), including backbone and side-chain atoms

3) concatenate local models back together, resolve any conflicts

• Database contains many regions centered on CAs from previous maps

• A radius of ~5 Å is about right for capturing “structural repetition”

Main Stages of TEXTAL

[Flowchart] electron density map → CAPRA → C-alpha chains → LOOKUP (build in side-chain and main-chain atoms locally around each CA) → model (initial coordinates) → post-processing routines (example: real-space refinement) → model (final coordinates)

Other boxes in the flowchart: reciprocal-space refinement/ML DM; human crystallographer (editing)

Feature Extraction

• Database: ~10^5 regions from ~100 maps

• How to identify closest match (efficiently)???

• Calculate numerical features that represent the pattern in each region

• Must be rotation-invariant

• Search can be very fast: just compare features

F=<1.72,-0.39,1.04,1.55...> F=<1.58,0.18,1.09,-0.25...>

F=<0.90,0.65,-1.40,0.87...> F=<1.79,-0.43,0.88,1.52...>

Rotation-Invariant Features

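• Average density: $\bar{\rho} = (1/n)\sum_i \rho_i$, where $\rho_i$ is the density at each lattice point in the region

• Other statistical features: standard deviation, kurtosis…

• Distance to center of mass:
– $\langle x_c, y_c, z_c \rangle = (1/n)\,\langle \sum_i x_i\rho_i,\ \sum_i y_i\rho_i,\ \sum_i z_i\rho_i \rangle$
– $d_{cen} = \sqrt{x_c^2 + y_c^2 + z_c^2}$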
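The statistical features above can be sketched in a few lines of Python. This is a minimal illustration, not TEXTAL's actual code: the function name `region_features`, the `(x, y, z, rho)` tuple layout, and the toy density values are all assumptions made for the example. Coordinates are taken relative to the region center, and the center of mass uses the 1/n normalization from the slide.

```python
import math

def region_features(points):
    """Compute simple rotation-invariant features for one region.

    points: list of (x, y, z, rho) tuples (hypothetical layout), with
    coordinates relative to the region center and rho the density at
    that lattice point.
    """
    n = len(points)
    densities = [rho for _, _, _, rho in points]
    mean = sum(densities) / n
    std = math.sqrt(sum((rho - mean) ** 2 for rho in densities) / n)

    # Density-weighted center of mass, normalized by 1/n as on the slide.
    xc = sum(x * rho for x, _, _, rho in points) / n
    yc = sum(y * rho for _, y, _, rho in points) / n
    zc = sum(z * rho for _, _, z, rho in points) / n
    d_cen = math.sqrt(xc ** 2 + yc ** 2 + zc ** 2)
    return {"mean": mean, "std": std, "d_cen": d_cen}

# Toy region (made-up values) and the same region rotated 90 degrees about z.
region = [(1, 0, 0, 2.0), (0, 1, 0, 1.0), (0, 0, 1, 1.0), (-1, 0, 0, 0.0)]
rotated = [(-y, x, z, rho) for x, y, z, rho in region]

print(region_features(region))
print(region_features(rotated))
```

Both calls print the same feature values, which is the point of rotation invariance: the features depend on the shape of the density, not on how the region happens to be oriented in the map.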

More Features

• Moments of inertia
– measures dispersion around axes of symmetry in a density distribution
– calculate 3x3 inertia matrix
– diagonalize to get eigenvalues
– sort from largest to smallest
– take magnitudes and ratios of moments
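As a rough sketch of the moment-of-inertia steps, the following Python builds the 3x3 inertia matrix with density used as mass, obtains its eigenvalues, sorts them largest first, and forms the magnitude and ratio features. The function names and the `(x, y, z, rho)` layout are assumptions for illustration. The eigenvalues are computed with the standard closed-form (trigonometric) solution for symmetric 3x3 matrices so that the example needs only the standard library; the slides do not say how TEXTAL diagonalizes the matrix.

```python
import math

def inertia_matrix(points):
    """3x3 inertia matrix of a density distribution (density used as mass).

    points: list of (x, y, z, rho), coordinates relative to the region center.
    """
    ixx = sum(rho * (y * y + z * z) for x, y, z, rho in points)
    iyy = sum(rho * (x * x + z * z) for x, y, z, rho in points)
    izz = sum(rho * (x * x + y * y) for x, y, z, rho in points)
    ixy = -sum(rho * x * y for x, y, z, rho in points)
    ixz = -sum(rho * x * z for x, y, z, rho in points)
    iyz = -sum(rho * y * z for x, y, z, rho in points)
    return [[ixx, ixy, ixz], [ixy, iyy, iyz], [ixz, iyz, izz]]

def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def symmetric_eigenvalues(a):
    """Eigenvalues of a symmetric 3x3 matrix, sorted largest first."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0:  # already diagonal
        return sorted([a[0][0], a[1][1], a[2][2]], reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    r = max(-1.0, min(1.0, det3(b) / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    e2 = 3.0 * q - e1 - e3
    return [e1, e2, e3]

def moment_features(points):
    """Magnitude of the largest moment and ratios of moments.

    Assumes the two smaller moments are non-zero (non-degenerate region).
    """
    m1, m2, m3 = symmetric_eigenvalues(inertia_matrix(points))
    return {"moment_1": m1, "ratio_1_2": m1 / m2, "ratio_1_3": m1 / m3}

# Toy region (made-up values): an elongated blob lying in the xy-plane.
blob = [(2, 0, 0, 1.0), (-2, 0, 0, 1.0), (0, 1, 0, 1.0), (0, -1, 0, 1.0)]
print(moment_features(blob))
```

Because the eigenvalues of the inertia matrix do not change when the region is rotated, the magnitudes and ratios derived from them are rotation-invariant features.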

More Features

• Spoke angles
– if region centered on CA, should have 3 “spokes” of density emanating from center
– find best-fit vectors; calc. angles among them

• surface area of contours

• connectivity of density/bones in region

• other geometrical features...
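The angle step of the spoke feature can be illustrated as follows. This sketch assumes the best-fit spoke vectors have already been found (the slides do not describe the fitting procedure) and only computes the pairwise angles and their minimum, median, and maximum. The function names and the example vectors are hypothetical.

```python
import math
from itertools import combinations
from statistics import median

def angle_deg(u, v):
    """Angle between two 3D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.degrees(math.acos(cosine))

def spoke_angle_features(spokes):
    """Minimum, median and maximum of the pairwise angles among the spokes."""
    angles = [angle_deg(u, v) for u, v in combinations(spokes, 2)]
    return {"min": min(angles), "median": median(angles), "max": max(angles)}

# Three made-up spoke directions emanating from the region center.
spokes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0)]
print(spoke_angle_features(spokes))
```

Angles between vectors are unchanged by rotating all spokes together, so these three numbers are rotation-invariant.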

Feature Weights

Feature                       Weight   Radius (Å)
Distance to center of mass    0.183    5
ratio of moments 1 and 3      0.153    4
ratio of moments 1 and 3      0.136    5
skewness                      0.080    3
skewness                      0.055    6
ratio of moments 1 and 2      0.055    4
median spoke angle            0.052    6
minimum spoke angle           0.051    4
skewness                      0.049    5
ratio of moments 1 and 2      0.038    5
maximum spoke angle           0.037    4
ratio of moments 1 and 3      0.031    3
magnitude of moment 1         0.022    6
median spoke angle            0.019    4
minimum spoke angle           0.015    6

CAPRA: C-Alpha Pattern-Recognition Algorithm

• Tracer - remove lattice points from map (lowest density first) without breaking connectivity

• Neural network - for each pseudo atom, extract features, input to network, predict distances to CAs (1:10 in trace), trained on example points in real maps

• Linking - desire long chains, good CA predictions (not in side-chains), “structurally plausible” (e.g. linear, helical)

[Flowchart] map → Density Trace → pseudo atoms → Neural Network → predictions of distance to true CA → Linking into C-alpha chains → C-alpha coordinates

Example of the CAPRA Process

Example of CAPRA chains

The LOOKUP Process

Database Construction

• Ideally would use solved MAD/MIR maps
• Using “back-transformed” maps works well
• PDB structure factors (include B-factors)
• keep reflections down to 2.8 Å
• Fourier transform electron density map
• 50 proteins from PDBSelect (non-homol.)
• about 50,000 regions
• Feature extraction done offline

Details of Matching Process

• Feature-based matching:
– Euclidean distance metric between feature vectors.
– $dist(R_1,R_2) = \sum_i w_i\,(F_i(R_1) - F_i(R_2))^2$
• Must weight features by relevance
– less-relevant features add noise
– Slider algorithm: optimize weights by comparing features in matching regions versus mismatches
• Verify selections by density correlation
– requires search for optimal rotation
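The feature-based matching step can be sketched as a weighted nearest-neighbor search. This is an illustrative example only: the region labels, the uniform weights, and the function names are assumptions, and the feature vectors are just the first four components of the truncated vectors shown on the Feature Extraction slide. The distance is the weighted sum of squared differences; taking a square root would not change the ranking.

```python
def weighted_dist(f1, f2, weights):
    """Weighted sum of squared feature differences between two regions."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, f1, f2))

def top_matches(query, database, weights, k=2):
    """Return the labels of the k database regions closest to the query."""
    ranked = sorted(database.items(),
                    key=lambda item: weighted_dist(query, item[1], weights))
    return [label for label, _ in ranked[:k]]

# Hypothetical labels; vectors truncated to the four components on the slide.
database = {
    "region_A": [1.58, 0.18, 1.09, -0.25],
    "region_B": [0.90, 0.65, -1.40, 0.87],
    "region_C": [1.79, -0.43, 0.88, 1.52],
}
query = [1.72, -0.39, 1.04, 1.55]
weights = [1.0, 1.0, 1.0, 1.0]  # placeholder; real weights come from Slider

print(top_matches(query, database, weights))
```

In the process described above, the top-ranked candidates from such a search would then be checked by density correlation, which is more expensive because it requires a search for the optimal rotation.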

Post-Processing Routines

• Imperfections in the initial model:
– backbone atoms not necessarily juxtaposed between adjacent residues, or in same direction
– side-chains occasionally “flipped” into backbone
– residue identities often incorrect (based on dens.)

• Fixing “flips” and direction - take candidate match with next highest correlation

• Real-space refinement: regularizes backbone

• Use sequence alignment to fix identities?

New Results on Real MAD Maps

protein   size (#aa)   type source   reso.
CZRA      95           MAD           2.3 Å
M01       317          MAD           2.4 Å

                    CZRA        M01
# residues built    86          286
CAs missed          11 (a)      34 (b)
incorrect CAs       2           5
# chains            4           8
longest chains      39,27,18    96,85,65
CA RMSD             0.79        0.97
overall RMSD        0.84        1.07

(a) CZRA: missed a 5-res loop (weak density) and C-terminus
(b) M01: missed a 17-res helix, 9 deletions, 5 due to breaks, 3-res false backbone
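For reference, the RMSD figures reported above follow the usual root-mean-square deviation over matched atom pairs. The sketch below assumes the pairing between model atoms and reference atoms is already known and that both sets are in the same coordinate frame; the function name and the coordinates are made up for illustration.

```python
import math

def rmsd(model, reference):
    """Root-mean-square deviation between paired 3D coordinates."""
    if not model or len(model) != len(reference):
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(model, reference)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(model))

# Two made-up atom pairs: one is off by 1 unit along z, one matches exactly.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
print(rmsd(model, reference))
```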

Histograms of Distances Between Matched Atoms

[Figure: two histograms, CZRA (counts up to 160) and M01 (counts up to 400); horizontal axis ticks from 0.2 to 3]

Analysis of Amino Acid Types

Confusion Matrix for CZRA (columns: amino acid in true structure; rows: amino acid in TEXTAL model)

    G A C S P V T I D N L Q E M H K F Y R W
G | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A | 1 3 0 2 0 2 0 0 1 0 0 1 0 0 2 0 0 0 1 0
C | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
S | 1 2 0 5 0 1 0 0 0 1 2 1 0 0 2 1 0 0 0 0
P | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
V | 0 0 0 2 0 3 2 3 0 1 2 1 1 0 0 1 0 0 0 0
T | 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0
I | 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0
D | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
N | 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
L | 0 0 0 0 0 0 0 1 1 2 6 0 1 0 0 1 0 1 0 0
Q | 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
E | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
M | 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
H | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
K | 0 0 0 0 0 0 0 0 0 1 0 0 2 0 2 1 0 0 2 0
F | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Y | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
R | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
W | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

        % identity   % struct. sim.
CZRA    27.4         71.4
M01     16.3         56.2
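As a worked check on the CZRA confusion matrix, the sketch below computes percent identity as the share of residues on the diagonal, that is, residues whose modeled type equals the true type. Reading "% identity" this way is an assumption, but it reproduces the 27.4 reported for CZRA. The variable and function names are hypothetical; rows not listed in `NONZERO_ROWS` are all zeros in the matrix.

```python
AMINO_ACIDS = "GACSPVTIDNLQEMHKFYRW"

# Non-zero rows of the CZRA confusion matrix, copied from the slide.
# Row label: residue type in the TEXTAL model; columns follow AMINO_ACIDS
# and give the residue type in the true structure.
NONZERO_ROWS = {
    "A": "1 3 0 2 0 2 0 0 1 0 0 1 0 0 2 0 0 0 1 0",
    "S": "1 2 0 5 0 1 0 0 0 1 2 1 0 0 2 1 0 0 0 0",
    "V": "0 0 0 2 0 3 2 3 0 1 2 1 1 0 0 1 0 0 0 0",
    "T": "0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0",
    "I": "0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0",
    "D": "0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0",
    "N": "0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0",
    "L": "0 0 0 0 0 0 0 1 1 2 6 0 1 0 0 1 0 1 0 0",
    "Q": "0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0",
    "E": "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0",
    "M": "0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0",
    "K": "0 0 0 0 0 0 0 0 0 1 0 0 2 0 2 1 0 0 2 0",
    "F": "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0",
    "R": "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0",
}

def build_matrix():
    """Expand the sparse row listing into a full 20x20 integer matrix."""
    zeros = "0 " * len(AMINO_ACIDS)
    return [[int(v) for v in NONZERO_ROWS.get(aa, zeros).split()]
            for aa in AMINO_ACIDS]

def percent_identity(matrix):
    """Percentage of residues whose modeled type equals the true type."""
    total = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * diagonal / total

matrix = build_matrix()
print(round(percent_identity(matrix), 1))  # 27.4 (23 of 84 residues)
```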
