Cloud Computing for Chemical P roperty P rediction May 5, 2010
-
Upload
brady-rogers -
Category
Documents
-
view
35 -
download
1
description
Transcript of Cloud Computing for Chemical P roperty P rediction May 5, 2010
![Page 1: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/1.jpg)
Cloud Computing for Chemical Property Prediction
May 5, 2010
![Page 2: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/2.jpg)
• Computing Science– Paul Watson, Hugo Hiden, Simon Woodman, Jacek Cala,
Martyn Taylor
• Northern Institute for Cancer Research– David Leahy, Dominic Searson, Valdimir Sykora
• Microsoft– Christophe Poulain
The Team
![Page 3: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/3.jpg)
The problem...
What are the properties of this molecule?
Toxicity
Solubility
Biological Activity
Perform experiments
Time consuming
Expensive
Ethical constraints
![Page 4: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/4.jpg)
The alternative to Experiments
Predict likely properties based on similar molecules
CHEMBL Database: data on 622,824 compounds,collected from 33,956 publications
WOMBAT-PK Database: data on 1230 compounds,for over 13,000 clinical measurements
WOMBAT Database: data on 251,560 structures,for over 1,966 targets
What is the relationship between structure and activity?
All these databases contain structure information and numerical activity data
![Page 5: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/5.jpg)
QSAR
Quantitative Structure Activity Relationship
f( )Activity ≈
More accurately, Activity related to a quantifiable structural attribute
Activity ≈ f(number of atoms, logP, shape....)
Currently > 3,000 recognised attributeshttp://www.qsarworld.com/
![Page 6: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/6.jpg)
JUNIOR ProjectAim:
To generate models for as much of the freely available data as possible and make it available on www.openqsar.com
... so that researchers can generate predictions for their own molecules
![Page 7: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/7.jpg)
Using QSARQSAR is a Regression exercise
![Page 8: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/8.jpg)
Using QSARQSAR is a Regression exercise
Select a “training set” of compounds
Compounds Activities
9.3
6.2
4.1
3.9
5.0
Select descriptors most related to activity
Descriptor Values – logP, shape...
35.3
10.4
24.0
20.9
14.2
312
102
194
242
109
0.2
0.1
0.5
0.3
0.6Calculate descriptor values
which compounds?
which descriptors?
which model?
Calculate model parameters
X Y
y = mx + c
![Page 9: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/9.jpg)
Using QSAR
Calculate same set of descriptors10.4
20.9
102
242
0.1
0.3
Descriptors
Use model to estimate activities
4.7
5.7
EstimateSelect a new set of compounds notused during model building
Compounds
4.6
5.9
Actual
Compare the estimated activities with the actual activities
0.1
0.2
Error
Keep model if error acceptable...what measure?
![Page 10: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/10.jpg)
Using QSAR
Use the model to estimate activities for new compounds
f(212, 0.9, 9.8) = 5.7
Use
Discard
![Page 11: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/11.jpg)
• QSAR Process requires many choices– Which descriptors?– Which modelling algorithm?– What model testing strategy?
• Quality of result depends on make correct choices– All runs are different
• Discovery Bus manages this process– Apply everything approach
QSAR choices
![Page 12: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/12.jpg)
Linear regressionNeural NetworkPartial Least SquaresClassification Trees
Correlation analysisGenetic algorithmsRandom selection
Java CDK descriptorsC++ CDL descriptors
Random split80:20 split
The Discovery Bus
Add to database
Partition training & test data
Calculate descriptors
Select descriptors
Build model
Manages the many model generation paths
![Page 13: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/13.jpg)
• Legacy system– Uses Oracle stored procedures and shell scripts– Models built using R– Designed to scale using agents on multiple machines
• Hard to use and maintain– Undocumented– Specific library and OS versions
• Runs on Amazon VMs• AIM: Extend Discovery Bus using Azure
– New agent types– Make use of more computing resources
• Move towards more maintainable system
The Discovery Bus
![Page 14: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/14.jpg)
Discovery Bus Plans
“QSPRDiscover&Test”
![Page 15: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/15.jpg)
• Move computationally intensive tasks to Cloud– Descriptor calculation– Descriptor selection– Model building
• Keep Discovery Bus for management• Apply this system to CHEMBL and WOMBAT
databases• MOAQ
– Mother Of All QSAR– Model everything in the databases in one shot
Using Cloud to extend Discovery Bus
![Page 16: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/16.jpg)
Extending Discovery BusBus Master
Planner
Cal
cula
tion
Ag
ents
e-Science Central Workflows
“Proxy” Agent
Agent code executes on multiple Azure nodes
![Page 17: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/17.jpg)
Initial results
~70 Mins2 Workers
~20 Mins10 Workers
~15 Mins15 Workers
Azure CPU Utilisation:
![Page 18: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/18.jpg)
An excellent result?
0 2 4 6 8 10 12 14 160
1020304050607080
Processing Time
No
Number of Azure Nodes
Pro
cess
ing
time
(min
utes
)
![Page 19: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/19.jpg)
Initial problems
Scales to ~20 worker nodes Want scalability to at least 100
Improves calculation time, but still takes too long for a run:
Model validation and admin tasks form a long tail
![Page 20: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/20.jpg)
Analysis of problems
Bus Master
Planner
Cal
cula
tion
Ag
ents
e-Science Central Workflows
“Proxy” Agent
Agent code executes on multiple Azure nodes
Not sending enough work fast enough to Azure
Add more admin capable agents
Architecture?
No improvement
Tops out at over 400 concurrent workflows
Discovery Bus?
![Page 21: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/21.jpg)
Discovery Bus optimisation
Bus Master
Planner
Discovery Bus?Optimise Amazon VM configuration
Standard VM High CPU VM
EBS database Local disk database
Custom file transfer NFS file transfer
Significant gains Planner optimisation
Takes ~1 second to plan a request to Azure
Feature / limitation of Discovery Bus – always “active”
Multiple planning agents
Flatten / simplify plan structure
20 Nodes 40 - 50 Nodes
![Page 22: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/22.jpg)
• Current setup scales to 50 nodes– Run two parallel Discovery Bus instances– Feeds 100 Azure nodes
• Moving more of the Admin tasks to Azure / e-Science Central– Move model validation out of Discovery Bus– More co-ordination outside Discovery Bus
• Work in progress– Azure utilisation increasing (average ~60% over entire run)
Updated configuration
![Page 23: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/23.jpg)
Improved Results (CPU Utilisation)
Before Optimisation:
After Optimisation:
![Page 24: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/24.jpg)
Improved Results (Queue Length)
Before Optimisation:
After Optimisation:
![Page 25: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/25.jpg)
• Moving to the Cloud exposes architectural weaknesses– Not always where expected– Stress test individual components in isolation
• Good utilisation needs the right kind of jobs– Computationally expensive– Modest data transfer sizes– Easy to run in parallel
• Be prepared to modify architecture• However
– We are dramatically shortening calculation time– We do have a system that can expand on demand
Lessons learned
![Page 26: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/26.jpg)
• Finish runs on 100 nodes using 2 Discovery Busses– Publish results on www.openqsar.com
• Move more code away from Discovery Bus– Many tasks already on Azure– Fixing planning issue will be complex– Move planning to e-Science Central / Azure
• Generic model building / management environment in the Cloud
Future work
![Page 27: Cloud Computing for Chemical P roperty P rediction May 5, 2010](https://reader030.fdocuments.net/reader030/viewer/2022032805/5681331e550346895d99ea61/html5/thumbnails/27.jpg)
Microsoft External ResearchProject funders for 2009 / 2010Roger Barga, Savas Parastatidis, Tony HeyPaul Appleby, Mark HolmesChristophe Poulain
Acknowledgements