Life Science Software and High Performance Computing Seminar Series Part II Craig A. Stewart...

Life Science Software and High Performance Computing

Seminar Series Part II

Craig A. Stewart

Fulbright Senior Scholar at ZIH

Associate Vice President, Research & Academic Computing

License Terms

• Please cite this presentation as: Stewart, C.A. Life Science Software and High Performance Computing: Seminar Series Part II. 2006. Presentation. Presented at: Technische Universitaet Dresden (Dresden, Germany, 20 Apr 2006). Available from: http://hdl.handle.net/2022/14767

• Portions of this document that originated from sources outside IU are shown here and used by permission or under licenses indicated within this document.

• Items indicated with a © are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse.

• Except where otherwise noted, the contents of this presentation are copyright 2007 by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Life Science Software and HPC Seminar Plan as of today

• Thursday (April 20, 8 am):
– Finish up with sequence analysis applications
– Life science databases
– Discussion of application use patterns and users' needs (hopefully interactive!)
• Tuesday (April 25, 8 am):
– Systems biology
– Portals, interfaces, workbenches, grids
• Thursday (April 27, 8 am):
– Performance analysis and tuning for life science applications: Dotter, BLAST, maybe GeneIndex
• Requests?

Mopping up a few details

• T-Coffee remains open source (thankfully)

• Clarification on Matlab
• New Genbank graph (right)

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Abstracting Multiple Alignments

• Hidden Markov models can be used to describe alignments
• Called profile HMMs
• Think of them as definitions of proteins or averages
• Useful for aligning newly discovered sequences
• Search sequence databases for sequences that match the alignment profile (Consider the alternative!)
• Build databases of profiles and search for profiles that match query sequences

HMMER

• http://hmmer.wustl.edu/
• Profile HMMs for protein sequence analysis
• Builds profiles from existing alignments, can then be used to search for new matches
• Basic assumptions for HMMER:
– Position-specific scoring matrix (indicates degree of conservation)
– Assumes that each position is independent of all others
– Has stronger theoretical basis for gaps and insertions than BLAST

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids Durbin, Eddy, Krogh, and Mitchison Cambridge University Press, 1998.
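
The position-specific scoring idea on this slide can be illustrated with a minimal Python sketch. This is not HMMER's algorithm: it leaves out insert and delete states entirely, and the toy ungapped alignment, the uniform background frequency, and the pseudocount of 1 are all assumptions made for illustration. It only shows how per-column log-odds scores reflect conservation, and how treating positions as independent lets a sequence score be a simple sum over columns.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Assumes an ungapped alignment of equal-length sequences and a
    uniform background frequency (both are simplifications).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(ALPHABET)
    total = len(alignment) + pseudocount * len(ALPHABET)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        column_scores = {}
        for residue in ALPHABET:
            freq = (counts[residue] + pseudocount) / total
            column_scores[residue] = math.log2(freq / background)
        pssm.append(column_scores)
    return pssm


def score(pssm, sequence):
    """Sum per-position scores; valid only because positions are treated as independent."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must match the number of columns")
    return sum(column[residue] for column, residue in zip(pssm, sequence))


# Hypothetical four-sequence alignment; column 2 (C) is fully conserved.
toy_alignment = ["ACDK", "ACEK", "ACDR", "GCDK"]
pssm = build_pssm(toy_alignment)
consensus_score = score(pssm, "ACDK")
unrelated_score = score(pssm, "WWWW")
print(consensus_score > unrelated_score)
```

A fully conserved column gives its residue a higher score than a partly conserved one, which is the sense in which the matrix "indicates degree of conservation".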

Pieces of HMMER

• hmmalign aligns sequences to an existing model
• hmmbuild builds a model from a multiple sequence alignment
• hmmcalibrate takes an HMM and empirically determines parameters that are used to make searches more sensitive by calculating more accurate expectation value scores (E-values)
• hmmpfam searches an HMM database for matches to a query sequence
• hmmsearch searches a sequence database for matches to an HMM.
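
The tools above are typically chained as build, calibrate, search. The following Python sketch only assembles the command lines for that chain. The file names are hypothetical, and the argument order (model file first) is assumed from HMMER 2 conventions, so it should be checked against the documentation of the installed version before use.

```python
import shutil
import subprocess


def hmmer_commands(alignment_file, model_file, sequence_db):
    """Return the build -> calibrate -> search steps as argv lists.

    Argument order is an assumption based on HMMER 2 usage.
    """
    return [
        ["hmmbuild", model_file, alignment_file],   # model from alignment
        ["hmmcalibrate", model_file],               # fit E-value parameters
        ["hmmsearch", model_file, sequence_db],     # scan sequence database
    ]


def run_pipeline(commands):
    """Run each step in order, stopping at the first failure."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} not found on PATH")
        subprocess.run(argv, check=True)


# Hypothetical file names, for illustration only.
commands = hmmer_commands("globins.aln", "globins.hmm", "proteins.fasta")
for argv in commands:
    print(" ".join(argv))
```

Calling `run_pipeline(commands)` would execute the steps, but only on a machine where the HMMER programs are installed.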

Pfam databases

• Databases that have archives of HMM models usable with HMMER, incl. http://pfam.cgb.ki.se/ (Stockholm)

Excellent work on HMMER

• Edda Happ. 2004. HMMER for Vector Computers. (Doctoral thesis). Albert-Ludwigs-Universität Freiburg
• Improved I/O
• Vectorization of algorithm

Some thoughts about data

http://www.ncbi.nlm.nih.gov

Almost a real conversation

• Me: “Can you tell me what databases you want?”
• Customer: “All you can get”
• Me: “Can you be more specific?”
• Customer: “What about the word ‘all’ do you find to be not specific?”

An overview of some of the key databases

Database   Contents                                      Size
BIND       Pathways, Gene interactions                   29.6 GB
dbSNP      Single Nucleotide Polymorphisms               40.0 GB
ENZYME     Enzyme nomenclature                           0.005
ePCR       ePCR results of UniSTS vs Homo sapiens        0.2 GB
KEGG       Pathway map coordinates                       0.26
LIGAND     Pathways, Reactions, & Compounds              0.004
SGD        Saccharomyces Genome Database                 0.05
UniGene    Gene clusters                                 2.7 GB
Entrez     Basically all of the BLAST data except GSS    90 GB
PubMed     Licensed – not free!                          55 GB

2-23-2006

Microarray Data Portal

• Web application and database designed for annotation and analysis of microarray experiments.

• Annotation: Designed so that users set up the experimental design first, minimizing the time needed for sample entry while still capturing the essential information
• Analysis
– Allows user to partition data into groups based on their annotation.
– Extensive filtering, search, and display options
– T-test, Clustering, SVD, etc.
– Allows different views of data based on informatics associated with the genes (e.g. KEGG, GO, Chromosome Location)

The biggest issue with public databases is…

• Downloading the data regularly and correctly
• IU is open sourcing its scripts for doing this
• And as soon as we have them in a shape decent enough to use we’ll ship them to ZIH if you want to be our beta testers!
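
One part of downloading "correctly" is confirming that a file arrived intact. The sketch below is not taken from the IU scripts, which are not shown in this talk. It is one possible check, assuming the data provider publishes an MD5 checksum alongside each file. The function names and the demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large database dumps need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path, expected_md5):
    """Compare the local file's MD5 with the checksum published by the provider."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demo with a small temporary file standing in for a downloaded database.
payload = b">seq1\nACGT\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

published = hashlib.md5(payload).hexdigest()  # stands in for the provider's checksum
ok = download_is_intact(demo_path, published)
bad = download_is_intact(demo_path, "0" * 32)
os.remove(demo_path)
print(ok, bad)
```

A scheduled update job could run a check like this after each transfer and keep the previous copy of the database in place whenever the comparison fails.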

Some cycle usage information

• Everything we know about usage at IU right now is about a cycle-starved community

• I hope to learn as much (or more) from you as you learn from me in this discussion!

Some past usage information from IU

[Figure: usage charts for Q4 2004/2005, Q3 2004/2005, and Q3 2003/2004]

Some info on top users and usage

• Of the top 30 users, and of the top 100 users, roughly 1/3 use bio software
• Here are the top bio codes and average number of processors for the last 3 months at IU:
– Rank CODE Wallclock (hrs) ave/job # proc
– 1 Chemistry codes 7 2.8
– 2 2 4.5
– 3 0.25 1 (x 2800!)
– 4 1 2.2
– 5 C programs and R 1 1.3
– MrBAYES & PAUP 13 14.8
– MEME 0.25 2.55
– Gene family studies (?) 12 13.9
– PDB searches 19 19.2
– PHI-BLAST 7 12.2
– PAUP, CLUSTAL-W MPI 4 6

Top user group

• “We are interested in understanding reaction mechanisms and the electronic structure of molecules in atomic detail. We use theoretical methods to make realistic computer models and to identify electronic features that determine the chemical nature and the reactivity of the molecule. For example, we want to understand conceptually how different ligands control the shapes and energies of the orbitals at the metal that they are attached to and how these changes determine the chemistry of metal complexes. We then use this concept to reason how desirable behavior could be enhanced while suppressing undesirable properties. Our line of argument typically involves thinking about frontier orbital shapes and energies, understanding the local hardness/softness of the reaction center or following electron density deformations. Finally, we test our proposals in extensive computer simulations and collaborate with experimental groups to confirm or disprove our hypotheses. On the computer science part, we are interested in developing a new generation of chemical expert software that uses information management and artificial intelligence (AI) to increase both the depth and the scope of research through automated data mining and AI-enhanced analysis.”

How big does the user community need to be for the law of large numbers to save us?

• Darned if I know
• fastDNAml example
• Workflows tend to move quickly
• People tend to want to work interactively as quickly as possible when interactive is possible
• Encouraging ‘nice’ behavior

Utilization of Idle Cycles

[Figure legend: Red = total owner, Blue = total idle, Green = total Condor]

The Big Picture

We’ll discuss each part in more detail next…

The shaded box indicates components hosted on multiple desktop computers

What could these systems be used for?

• BLAST
• MEME
• fastDNAml

20.4 TFLOPS system – queue structure and job strategy

• Key issues are scalability, memory, usage of VMX capabilities

• Scalability of bio codes is a REAL issue
– Compucell NG, for example
• Memory requirements
• What an HPC expert would call basic software scalability engineering
• I don't want to be out of work…

Queue structure (tentative)

Queue type                 Queue processor count   Maximum runtime   Maximum number of simultaneously active jobs
Priority production jobs   516                     48 hours          1
                           260                     48 hours          2
                           20                      240 hours         8
Interactive                516                     1 hour            516
Debug                      64                      30 minutes        64
Preemptable                260                     12 hours          No limit
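
To make the tentative limits concrete, here is a small Python sketch that lists which queues a job could fit into. It rests on several assumptions that the slide does not confirm: the "queue processor count" is read as an upper bound on processors per job, the three rows under "Priority production jobs" are treated as three separate production queues, and the queue names in the code are hypothetical labels rather than real queue names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Queue:
    name: str                        # hypothetical label, not an actual queue name
    processors: int                  # assumed to be the per-job processor ceiling
    max_hours: float                 # maximum runtime in hours
    max_active_jobs: Optional[int]   # None means no limit


# Values copied from the tentative table; names are illustrative.
QUEUES = [
    Queue("production-516", 516, 48, 1),
    Queue("production-260", 260, 48, 2),
    Queue("production-20", 20, 240, 8),
    Queue("interactive", 516, 1, 516),
    Queue("debug", 64, 0.5, 64),
    Queue("preemptable", 260, 12, None),
]


def eligible_queues(processors, hours):
    """Return names of queues whose processor and runtime limits both cover the job."""
    return [
        queue.name
        for queue in QUEUES
        if processors <= queue.processors and hours <= queue.max_hours
    ]


long_small_job = eligible_queues(processors=16, hours=100)
short_test_job = eligible_queues(processors=64, hours=0.25)
oversized_job = eligible_queues(processors=600, hours=1)
print(long_small_job, short_test_job, oversized_job)
```

Under these assumptions a long, small job fits only the 20-processor production queue, which matches the pattern of many life science codes that run for days on few processors.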

Workflow summary thoughts

• Interactive
• Often changing
• Scripts!
• Pick large cycle consumers for focused effort
• Adapting to the user community work patterns

Acknowledgments

• Funding for projects described in this talk has come from the National Science Foundation, National Institutes of Health, Lilly Endowment, Inc., State of Indiana (particularly through support of I-light Initiative and the 21st Century Fund)

• The work described here was made possible by the faculty, students, and staff of Indiana University. Thanks especially to the staff of RAC, CPO, Telecommunications, PTL, UITS generally, the participants in the Indiana Genomics Initiative, and the participants in the METACyt Initiative.

• Several of the slides and ideas presented here were developed by colleagues or collaborators – the Research and Academic Computing Division of UITS in general, and Dick Repasky in particular.

• Stewart’s visit to Dresden is funded in part by the Center for the International Exchange of Scholars, the Technical University of Dresden, and Indiana University
