GeneWeaver: A prototype for bioinformatics
Michael Luck, University of Southampton, UK
Kevin Bryson and David Jones, UCL
Mike Joy, University of Warwick
The Structure of DNA
The Result of 15 Years' Hard Work
> contig 1TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAAAAATACTATTATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGTTAAATTATATTCAATCAATACATATATAATATTATTAAAATACTTGATAAGTATTATTTAGATATTAGACAAATACTAATTTTATATTGCTTTAATACTTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATACTAATAACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTACTAATATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAAGTTAATACTATGTGTAGAATAATAAATAATCAGATTAAAAAAATTTTATTTATCTGAAACATATTTAATCAATTGAACTGATTATTTTCAGCAGTAATAATTACATATGTACATAGTACATATGTAAAATATCATTAATTTCTGTTATATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAAGCATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTAGTTCTTATAATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATAAATATCATAACCCTTGATAACCCAGAAATTAATACTTAATCAAAAATGAAAATATTAATTAATAAAAGTGAATTGAATAAAATTTTGGGAAAAAATGAATAACGTTATTATTTCCAATAACAAAATAAAACCACATCATTCATATTTTTTAATAGAGGCAAAAGAAAAAGAAATAAACTTTTATGCTAACAATGAATACTTTTCTGTCAAATGTAATTTAAATAAAAATATTGATATTCTTGAACAAGGCTCCTTAATTGTTAAAGGAAAAATTTTTAACGATCTTATTAATGGCATAAAAGAAGAGATTATTACTATTCAAGAAAAAGATCAAACACTTTTGGTTAAAACAAAAAAAACAAGTATTAATTTAAACACAATTAATGTGAATGAATTTCCAAGAATAAGGTTTAATGAAAAAAACGATTTAAGTGAATTTAATCAATTCAAAATAAATTATTCACTTTTAGTAAAAGGCATTAAAAAAATTTTTCACTCAGTTTCAAATAATCGTGAAATATCTTCTAAATTTAATGGAGTAAATTTCAATGGATCCAATGGAAAAGAAATATTTTTAGAAGCTTCTGACACTTATAAACTATCTGTTTTTGAGATAAAGCAAGAAACAGAACCATTTGATTTCATTTTGGAGAGTAATTTACTTAGTTTCATTAATTCTTTTAATCCTGAAGAAGATAAATCTATTGTTTTTTATTACAGAAAAGATAATAAAGATAGCTTTAGTACAGAAATGTTGATTTCAATGGATAACTTTATGATTAGTTACACATCGGTTAATGAAAAATTTCCAGAGGTAAACTACTTTTTTGAATTTGAACCTGAAACTAAAATAGTTGTTCAAAAAAATGAATTAAAAGATGCACTTCAAAGAATTCAAAetc etc etc
Flow of Biological Data
DNA: … ATG GAT TTC ...
Protein Sequence: Met Asp Phe ...
Protein Structure
Protein Function
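The first step of this flow, reading DNA three bases at a time and mapping each codon to an amino acid, can be sketched as below. This is a minimal illustration only: the table holds just the three codons shown on the slide, and the names `CODON_TO_RESIDUE` and `translate` are hypothetical.

```python
# Partial codon table: only the three codons shown on the slide.
# A real translator would use the full 64-codon standard genetic code.
CODON_TO_RESIDUE = {
    "ATG": "Met",
    "GAT": "Asp",
    "TTC": "Phe",
}


def translate(dna):
    """Translate a DNA string codon by codon into three-letter residue names."""
    dna = dna.replace(" ", "").upper()
    residues = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        # Codons missing from the partial table are reported as "???".
        residues.append(CODON_TO_RESIDUE.get(codon, "???"))
    return residues


print(" ".join(translate("ATG GAT TTC")))  # Met Asp Phe
```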
Data Analysis
Lots of primary data: need to discover gene function.
  Scan databases for similar sequences.
  Collect matching sequences and alignments.
  Infer function from annotations of matched proteins.
Analysis by a range of existing programs.
Interpret results.
Additional factors:
  some programs/results available over WWW/email;
  continual updates of primary databases, hence the need for reassessment.
Biological Databases
DNA Databases (Genomes):      GenBank, EMBL, DDBJ
Protein Sequence Databases:   SwissProt, PIR
Protein Structure Databases:  PDB, SCOP, CATH
Pattern Databases:            PROSITE, PRINTS, BLOCKS
SwissProt Entry
ID   PRIO_BOVIN     STANDARD;      PRT;   264 AA.
AC   P10279;
DT   01-MAR-1989 (Rel. 10, Created)
DT   01-NOV-1991 (Rel. 20, Last sequence update)
DT   15-JUL-1998 (Rel. 36, Last annotation update)
DE   MAJOR PRION PROTEIN 1 PRECURSOR (PRP) (MAJOR SCRAPIE-ASSOCIATED FIBRIL
DE   PROTEIN 1).
GN   PRNP.
OS   Bos taurus (Bovine).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
...
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
...
SQ   SEQUENCE   264 AA;  28614 MW;  DEA01B4E CRC32;
     MVKSHIGSWI LVLFVAMWSD VGLCKKRPKP GGGWNTGGSR YPGQGSPGGN RYPPQGGGGW
     GQPHGGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPHGGGG WGQGGTHGQW NKPSKPKTNM
     KHVAGAAAAG AVVGGLGGYM LGSAMSRPLI HFGSDYEDRY YRENMHRYPN QVYYRPVDQY
     SNQNNFVHDC VNITVKEHTV TTTTKGENFT ETDIKMMERV VEQMCITQYQ RESQAYYQRG
     ASVILFSSPP VILLISFLIF LIVG
//
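An entry in this flat-file layout can be read by its two-letter line codes. The sketch below is an assumption-laden illustration rather than a full SwissProt parser: it uses a shortened copy of the entry (the elided lines are simply left out), assumes the value starts at column 6, and the names `ENTRY` and `parse_entry` are hypothetical.

```python
from collections import defaultdict

# Shortened copy of the entry shown on the slide; elided lines are omitted.
ENTRY = """\
ID   PRIO_BOVIN     STANDARD;      PRT;   264 AA.
AC   P10279;
DE   MAJOR PRION PROTEIN 1 PRECURSOR (PRP) (MAJOR SCRAPIE-ASSOCIATED FIBRIL
DE   PROTEIN 1).
GN   PRNP.
OS   Bos taurus (Bovine).
SQ   SEQUENCE   264 AA;  28614 MW;  DEA01B4E CRC32;
     MVKSHIGSWI LVLFVAMWSD VGLCKKRPKP GGGWNTGGSR YPGQGSPGGN RYPPQGGGGW
     GQPHGGGWGQ PHGGGWGQPH GGGWGQPHGG GWGQPHGGGG WGQGGTHGQW NKPSKPKTNM
     KHVAGAAAAG AVVGGLGGYM LGSAMSRPLI HFGSDYEDRY YRENMHRYPN QVYYRPVDQY
     SNQNNFVHDC VNITVKEHTV TTTTKGENFT ETDIKMMERV VEQMCITQYQ RESQAYYQRG
     ASVILFSSPP VILLISFLIF LIVG
//
"""


def parse_entry(text):
    """Group lines by their two-letter code and collect the sequence."""
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        if in_sequence:
            # Sequence lines carry residues in space-separated blocks of ten.
            sequence.append(line.replace(" ", ""))
        else:
            code, value = line[:2], line[5:].strip()
            fields[code].append(value)
            if code == "SQ":
                in_sequence = True
    # Continuation lines with the same code (e.g. DE) are joined with a space.
    record = {code: " ".join(values) for code, values in fields.items()}
    record["sequence"] = "".join(sequence)
    return record


record = parse_entry(ENTRY)
print(record["ID"].split()[0], record["OS"], len(record["sequence"]))
```

The residue count recovered from the sequence lines can be compared with the length declared on the ID line as a basic consistency check.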
PDB Entry
HEADER    PRION PROTEIN                           20-SEP-99   1QM3
TITLE     HUMAN PRION PROTEIN FRAGMENT 121-230
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: PRION PROTEIN;
COMPND   3 CHAIN: A;
COMPND   4 SYNONYM: PRP, MAJOR PRION PROTEIN, PRP27-30, PRP33-35C,
COMPND   5 (ASCR).PRP;
COMPND   6 FRAGMENT: RESIDUES 121-230;
COMPND   7 ENGINEERED: YES;
COMPND   8 MUTATION: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE   3 ORGANISM_COMMON: HUMAN;
SOURCE   4 ORGAN: BRAIN;
...
ATOM      1  N   LEU A 125       5.041  -9.143  -1.920  1.00  0.00           N
ATOM      2  CA  LEU A 125       4.764  -7.837  -1.351  1.00  0.00           C
ATOM      3  C   LEU A 125       5.308  -7.848   0.071  1.00  0.00           C
ATOM      4  O   LEU A 125       4.554  -8.101   1.013  1.00  0.00           O
ATOM      5  CB  LEU A 125       3.275  -7.484  -1.391  1.00  0.00           C
ATOM      6  CG  LEU A 125       2.781  -7.205  -2.821  1.00  0.00           C
ATOM      7  CD1 LEU A 125       1.683  -8.197  -3.182  1.00  0.00           C
ATOM      8  CD2 LEU A 125       2.266  -5.774  -2.970  1.00  0.00           C
ATOM      9  H   LEU A 125       4.919  -9.913  -1.281  1.00  0.00           H
ATOM     10  HA  LEU A 125       5.307  -7.076  -1.916  1.00  0.00           H
ATOM     11 1HB  LEU A 125       2.703  -8.290  -0.932  1.00  0.00           H
...
Analysis Tools
Homology/Similarity Searching:   PSI-BLAST, BLAST, FASTA
Sequence Alignment:              Clustal-W, GCG Pileup
Motif/Pattern Searching:         PROSITE, HMMer
Secondary Structure Prediction:  PSIPRED, PHD, DSC
BLAST Output
...
Database: pdb_seq
          14,442 sequences; 3,011,261 total letters

Searching..................................................done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

pdb|1NBD|1 cftrfragment: nbd1, first (or n-terminal) nucleotide-...    79  6e-16
pdb|1WAI|1 DNA polymerase(t4 gp43)DNA substrate (tttt)DNA              28  1.0
…

>pdb|1NBD|1 cftrfragment: nbd1, first (or n-terminal) nucleotide-binding domain;
 (cftr nbd1, cystic fibrosis transmembrane conductance regulator
 nucleotide-binding domain 1)
          Length = 214

 Score = 78.8 bits (191), Expect = 6e-16
 Identities = 37/40 (92%), Positives = 39/40 (97%)

Query: 4   TTLLVTSKMEHLKKADKILILHEGSSYFYGTFSELQNLRP 43
           T +LVTSKMEHLKKADKILILHEGSSYFYGTFSELQNL+P
Sbjct: 175 TRILVTSKMEHLKKADKILILHEGSSYFYGTFSELQNLQP 214
...
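The "Identities" figure in this output can be checked directly from the two aligned strings. The sketch below counts identical positions in the Query and Sbjct lines; the function name `identities` is hypothetical, and the "Positives" figure is not reproduced because it depends on a substitution matrix.

```python
# Aligned segments copied from the Query and Sbjct lines of the BLAST output.
QUERY = "TTLLVTSKMEHLKKADKILILHEGSSYFYGTFSELQNLRP"
SBJCT = "TRILVTSKMEHLKKADKILILHEGSSYFYGTFSELQNLQP"


def identities(query, subject):
    """Count positions where two aligned sequences carry the same residue."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    # Gap characters ("-") never count as identities.
    matches = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return matches, len(query)


matches, length = identities(QUERY, SBJCT)
percent = 100 * matches // length  # integer percentage, truncated
print(f"Identities = {matches}/{length} ({percent}%)")  # Identities = 37/40 (92%)
```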
Alignment Output
*>>>TRANSGELIN : TRANSGELIN SEQUENCE
*P1;A60598 : actin-associated protein p27 - mouse
*>>>SM22_RAT : SMOOTH MUSCLE PROTEIN 22-ALPHA (SM22-ALPHA)
....
MANKGPSYGMSREVQSKIEKKYDEELEERLVEWIIVQCGPDVGRPDRGRLGFQVWLKNGVILSKLVNSLYPDGSKPVKVPMANKGPSYGMSREVQSKIEKKYDEELEERLVEWIVVQCGPDVGAPDRGRLGFQVWLKNGVILSKLVNSLYPEGSKPVKVP ANKGPSYGMSREVQSKIEKKYDEELEERLVEWIVMQCGPDVGRPDRGRLGFQVWLKNGVILSKLVNSLYPEGSKPVKVPMANKGPSYGMSREVQSKIEKKYDEELEERLVEWIVMQCGPDVGRPDRGRLGFQVWLKNGVILSKLVNSLYPEGSKPVKVP ANKGPAYGMSRDVQSKIEKKYDDELEDRLVEWIVAQCGSSVGRPDRGRLGFQVWLKNGIVLSQLVNSLYPDGSKPVKIPMANKGPAYGMSRDVQSKIEKKYDDELEDRLVEWIVAQCGSSVGRPDRGRLGFQVWLKNGIVLSQLVNSLYPDGSKPVKIP ANKGPSYGMSREVQSKIEKKYDEELEERLVEWIIVQCGPDVGRPDRGPLGFQVWLKNGVILSKLVNSLYPDGSKPVKVPMANKGPSYGMSREVQSKIEKKYDEELEERLVEWIIVQCGPDVGRPDRGRLGFQVWLKNGVILSKLVNSLYPEGSKPVKVPMANRGPAYGLSREVQQKIEKQYDADLEQILIQWITTQCRKDVGRPQPGRENFQNWLKDGTVLCELINALYPEGQAPVKKIMANRGPSYGLSREVQEKIEQKYDADLENKLVDWIILQCAEDIEHPPPGRTHFQKWLMDGTVLCKLINSLYPPGQEPIPKI MSLERAVRAKIAGKRNPEMDKEAQEWIEAIIAEKFPAGQS YEDVLKDGQVLCKLINVLSPNA VPKV EFPPSGLSYQVKKKLEGKRDKDQENEALEWIEALTGLKLDRSKL YEDILKDGTVLCKLMNSIKPGC IKKI MELWRQCTHWLIQCRVLPPSHRVTWDGAQVCELAQALRDGVLLCQLLNNLLPHAINLREVN MELWRQCTHWLIQCRVLPPSHRVTWEGAQVCELAQALRDGVLLCQLLNNLLPQAINLREVN MSMEGISYTNSNPSATPNMEDTLLTFSMGILPITMDCDPVTQLSQLFQQGAPLCILFNSVKPQF KLP
ENPPSMVFKQMEQVAQFLKAA EDYGVTKTDMFQTVDLFEGKDMAAVQRTVMALGSLAVTKNDGHYRGDPNWFMKKAQEHENPPSMVFKQMEQVAQFLKAA EDYGVIKTDMFQTVDLYEGKDMAAVQRTLMALGSLAVTKNDGNYRGDPNWFMKKAQEHENPPSMVFKQMEQVAQFLKAA EDYGVTKTDMFQTVDLFEGKDMAAVQRTVMALGSLAVTKNDGHYRGDPNWFMKKAQEHENPPSMVFKQMEQVAQFLKAA EDYGVTKTDMFQTVDLFEGKDMAAVQRTVMALGSLAVTKNDGHYRGDPNWFMKKAQEH...
Determining Protein Function
Protein Sequence (Genome)
1. Remove regions of low complexity (SEG)
2. Rapid similarity search against all known proteins (PSI-BLAST)
3. Slower, more sensitive protein category search (HMMer)
4. Rapid protein analysis tools, i.e. motif search (ScanProsite)
5. Consistent and sensible (Human)
6. Annotate function.
Threshold shown on the diagram: E < 0.001
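One way to read the E < 0.001 label is as an acceptance threshold on similarity-search hits before an annotation is proposed. That placement is an assumption, since the flattened diagram does not show which branch the label belongs to. The sketch below uses hypothetical names (`Hit`, `propose_annotation`) and takes its two example hits from the BLAST output slide.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 0.001  # threshold shown on the slide


@dataclass
class Hit:
    name: str
    evalue: float
    annotation: str


def propose_annotation(hits, cutoff=E_VALUE_CUTOFF):
    """Return the annotation of the most significant hit below the cutoff.

    Returns None when no hit is significant, in which case the slower
    searches (HMMer, ScanProsite) would be tried next.
    """
    significant = [hit for hit in hits if hit.evalue < cutoff]
    if not significant:
        return None
    best = min(significant, key=lambda hit: hit.evalue)
    return best.annotation


hits = [
    Hit("pdb|1NBD|1", 6e-16, "nucleotide-binding domain"),
    Hit("pdb|1WAI|1", 1.0, "DNA polymerase"),
]
print(propose_annotation(hits))  # nucleotide-binding domain
```

A proposal produced this way would still pass through the human "consistent and sensible" check before the function is annotated.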
Agent Classes
Primary database agents manage remote primary sequence databases, providing up-to-date data in various common formats.
Non-redundant database agents filter and combine data from various primary database agents into non-redundant data sources.
Calculation agents encapsulate pre-existing methods or tools for the analysis of data to determine function.
Genome agents manage genome information for a particular organism and use other agents to derive annotations.
Broker agents provide information about agents registered within the agent community.
GeneWeaver Agent Community
[Diagram: agents in the community, connected to each other and to the Web]
SwissAgent (PrimaryDB)
PDBAgent (PrimaryDB)
PIRAgent (PrimaryDB)
Non-redundant Protein Agent (NRDB)
BrokerAgent (Broker)
HInfAgent (Genome)
HInfAgent (Genome)
HInfAgent (Genome)
BlastAgent (Calculation)
ClustalAgent (Calculation)
Web
BAL Performatives
register     Register with a broker.
unregister   Cancel a registration with a broker.
ask          Ask about data.
derive       Request an agent to derive particular data.
tell         Inform another agent about data.
deny         Inform another agent about lack of data.
subscribe    Obtain regular updates of certain data.
unsubscribe  Stop receiving regular updates of data.
ok           Indicates success.
sorry        Indicates failure on the agent's part.
error        Indicates problem with protocol or other error.
Example Types of BAL Data
Metadata
  AgentInfo     General information about an agent.
  ProviderInfo  Information about a provider protocol.
  SkillInfo     Information about a skill.
  PlanInfo      Information about a plan.
Data
  Genome        A genome.
  SeqFile       A sequence file.
  SeqEntry      A sequence entry.
BAL Message Example
Sender: //localhost.localdomain/7/HInf
Receiver: //localhost.localdomain/0/Broker
Transport: rmi
Language: bal
Perform: register
Ref: hinf77f001_0
Content:
AgentInfo(
  TYPE = Genome,
  OWNER = hinf77f001,
  UPD_TIME = 962601420367,
  MOD_TIME = 962601420367,
  ID = HInf,
  DESCRIPTION = "H. Influenzae Genome Agent")
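A message in this layout splits naturally into header fields and a typed content term. The sketch below is not the GeneWeaver parser: it assumes one "Key: value" header per line and content of the form TypeName(KEY = value, ...), and `parse_message` is a hypothetical name.

```python
import re

# The example message from the slide, with one header per line.
MESSAGE = """\
Sender: //localhost.localdomain/7/HInf
Receiver: //localhost.localdomain/0/Broker
Transport: rmi
Language: bal
Perform: register
Ref: hinf77f001_0
Content:
AgentInfo(TYPE = Genome, OWNER = hinf77f001, UPD_TIME = 962601420367,
MOD_TIME = 962601420367, ID = HInf, DESCRIPTION = "H. Influenzae Genome Agent")
"""


def parse_message(text):
    """Split a message into (headers, content type name, content attributes)."""
    header_part, _, content = text.partition("Content:")
    headers = {}
    for line in header_part.strip().splitlines():
        key, _, value = line.partition(":")  # split on the first colon only
        headers[key.strip()] = value.strip()
    # Assumed content shape: TypeName(KEY = value, KEY = "quoted value", ...)
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", content, re.DOTALL)
    if match is None:
        raise ValueError("unrecognised content")
    type_name, body = match.groups()
    pairs = re.findall(r'(\w+)\s*=\s*("[^"]*"|[^,]+)', body)
    attributes = {key: value.strip().strip('"') for key, value in pairs}
    return headers, type_name, attributes


headers, type_name, attributes = parse_message(MESSAGE)
print(headers["Perform"], type_name, attributes["DESCRIPTION"])
```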
Register Conversation Class
[State diagrams for the two roles in a register conversation]
Requester
  States: RStart, RRegistering, RRegistered, RUnregistering, RDeclined, RDone, RError, RTimeout
  Transition labels: > register, < ok, > unregister, < sorry, < sorry, < ok
Provider
  States: PStart, PRegistering, PRegistered, PUnregistering, PDeclined, PDone, PError, PTimeout
  Transition labels: < register, > ok, < unregister, > sorry, > sorry, > ok
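A conversation class of this kind can be modelled as a small state machine. The sketch below covers the requester role only and rests on assumptions: ">" is read as a message sent and "<" as a message received, and the pairing of labels with target states is inferred from the state names, because the arrows are not recoverable from the flattened diagram. The names `REQUESTER_TRANSITIONS` and `run` are hypothetical.

```python
# Requester-side transitions, keyed by (current state, event).
# Assumed reading: ">" is a message the requester sends, "<" one it receives.
# RError, RTimeout and the second "< sorry" label are left out because their
# source and target states cannot be recovered from the slide.
REQUESTER_TRANSITIONS = {
    ("RStart", "> register"): "RRegistering",
    ("RRegistering", "< ok"): "RRegistered",
    ("RRegistering", "< sorry"): "RDeclined",
    ("RRegistered", "> unregister"): "RUnregistering",
    ("RUnregistering", "< ok"): "RDone",
}


def run(events, state="RStart"):
    """Feed a sequence of events through the machine and return the final state."""
    for event in events:
        key = (state, event)
        if key not in REQUESTER_TRANSITIONS:
            raise ValueError(f"event {event!r} not allowed in state {state}")
        state = REQUESTER_TRANSITIONS[key]
    return state


print(run(["> register", "< ok"]))     # RRegistered
print(run(["> register", "< sorry"]))  # RDeclined
```

The provider role would mirror this table with the message directions reversed.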
Agent Interaction: Example
[Diagram sequence: a new primary database agent, FlyDBAgent, joins the community]
Initial state
  SwissAgent, PDBAgent, PIRAgent (PrimaryDB): each holds Sequences obtained from the Web and sends "tell Sequences" to the NRDB agent.
  NonRedundant Protein Agent (NRDB): subscribed to Broker for PrimaryDB info and to PrimaryDBs for Sequences; holds Swiss, PDB and PIR Sequences.
  BrokerAgent (Broker): subscribed to all agents for Info; holds SwissAgent, PDBAgent, PIRAgent and NRDBAgent Info.
Messages shown
  FlyDBAgent (PrimaryDB, holding Fly Sequences) and Broker: register FlyDBAgent / ok
  Broker and FlyDBAgent: subscribe Agent Info / tell Agent Info
  Broker and NRDB: tell FlyDBAgent Info
  NRDB and FlyDBAgent: subscribe Sequences / tell Sequences
Final state
  Broker also holds FlyDBAgent Info.
  NRDB also holds Fly Sequences.
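The subscription pattern in this example, where a non-redundant agent learns of new primary databases through the broker and then subscribes to them for sequences, can be sketched with plain callbacks. Everything here is a simplification under stated assumptions: the classes `Broker`, `PrimaryDB` and `NRDB` are hypothetical, a shared `directory` dictionary stands in for agent addressing, the broker's own "subscribe Agent Info" exchange is folded into `register`, and the sequence names are placeholders.

```python
class Broker:
    def __init__(self):
        self.agent_info = {}   # agent name -> agent type
        self.subscribers = []  # (wanted agent type, callback)

    def subscribe(self, agent_type, callback):
        self.subscribers.append((agent_type, callback))
        # Tell the new subscriber about agents that are already registered.
        for name, kind in self.agent_info.items():
            if kind == agent_type:
                callback(name)

    def register(self, name, agent_type):
        self.agent_info[name] = agent_type
        # Pass the new agent's info on to interested subscribers.
        for wanted, callback in self.subscribers:
            if wanted == agent_type:
                callback(name)
        return "ok"


class PrimaryDB:
    def __init__(self, name, sequences):
        self.name = name
        self.sequences = list(sequences)

    def subscribe(self, callback):
        callback(self.name, self.sequences)  # initial "tell Sequences"


class NRDB:
    def __init__(self, broker, directory):
        self.sequences = set()      # a set keeps the collection non-redundant
        self.sources = []
        self.directory = directory  # agent name -> PrimaryDB object
        broker.subscribe("PrimaryDB", self.on_agent_info)

    def on_agent_info(self, name):
        self.sources.append(name)
        self.directory[name].subscribe(self.on_sequences)

    def on_sequences(self, source, sequences):
        self.sequences.update(sequences)


directory = {}
broker = Broker()
initial = [("SwissAgent", ["SEQ_A", "SEQ_B"]), ("PDBAgent", ["SEQ_B"]),
           ("PIRAgent", ["SEQ_C"])]
for name, seqs in initial:
    directory[name] = PrimaryDB(name, seqs)
    broker.register(name, "PrimaryDB")

nrdb = NRDB(broker, directory)

# A new primary database joins: the NRDB agent picks it up via the broker.
directory["FlyDBAgent"] = PrimaryDB("FlyDBAgent", ["SEQ_D"])
reply = broker.register("FlyDBAgent", "PrimaryDB")
print(reply, sorted(nrdb.sequences))
```

No existing agent needs to be reconfigured when FlyDBAgent joins; the new data reaches the NRDB agent purely through the subscriptions already in place.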
Goals
Higher Level Goals
  DeriveGoal    Agent should try to derive data with particular properties.
  UpdateGoal    Agent should try to update data matching a given template.
  RelationGoal  Agent should attempt to establish the given type of relationship with another agent.
Lower Level Goals
  DoGoal, QueryGoal, TellGoal.
Agent Architecture
[Diagram: internal architecture of an agent]
Components: Communication, Interaction, Control, Goal Manager, Plan Library, Meta-Store, Motivation, Action, Data-Store, Analysis Tools
External: Other Agents
Flows between components: Messages, Goals, Interactions, MetaData, Actions, Data
Annotate Function Example
Genome Agent
1. In response to a higher level motivation, DeriveGoal(SeqFunction) is created to annotate any sequences with annotated function confidence < 0.5.
2. Using a plan from the plan library, DeriveGoal(SeqFunction) is decomposed into RelationGoal(derive), DeriveGoal(Homologue) and function assignment using the homologue if confidence > 0.5.
3. A suitable agent with DeriveProvider and 'homology' skill is located.
4. Derive requester interaction used to accomplish RelationGoal(derive).
Blast Agent
5. Skill used to satisfy DoGoal(Homologue).
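The five steps can be compressed into a short sketch of the control flow on the genome agent's side. This is an illustration under assumptions, not the GeneWeaver plan library: the Blast Agent is replaced by a local stand-in function, locating the agent and the derive conversation (steps 3 and 4) collapse into a plain function call, and all names and the lookup values are made up.

```python
CONFIDENCE_THRESHOLD = 0.5  # value used in steps 1 and 2


def blast_agent_derive_homologue(seq_id):
    """Stand-in for the Blast Agent's 'homology' skill (step 5).

    Returns (function of the homologue, confidence); the values are made up.
    """
    lookup = {"SEQ1": ("prion protein", 0.9), "SEQ2": ("unknown", 0.2)}
    return lookup.get(seq_id, (None, 0.0))


def derive_seq_function(annotations, derive_homologue):
    """Genome-agent side of DeriveGoal(SeqFunction).

    `annotations` maps a sequence id to (function, confidence).
    """
    for seq_id, (function, confidence) in list(annotations.items()):
        # Step 1: only sequences with confidence below 0.5 need work.
        if confidence >= CONFIDENCE_THRESHOLD:
            continue
        # Steps 3-5: ask the provider with the homology skill for a homologue.
        homologue_function, homologue_confidence = derive_homologue(seq_id)
        # Step 2: assign the homologue's function only if confidence > 0.5.
        if homologue_confidence > CONFIDENCE_THRESHOLD:
            annotations[seq_id] = (homologue_function, homologue_confidence)
    return annotations


annotations = {
    "SEQ1": (None, 0.0),
    "SEQ2": (None, 0.1),
    "SEQ3": ("kinase", 0.8),
}
result = derive_seq_function(annotations, blast_agent_derive_homologue)
print(result)
```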
Summary
Applications:
  bioinformatics problem not created by the technologies used to solve it
  practical developments to inform conceptual infrastructure
Tensions between biological sciences and computer science
Work remaining:
  Consolidation of existing prototype
  Inclusion of multiple calculation agents
  Evaluation of implementation infrastructure
  Staged and full deployment
Future work: agent marketplace with calculation agents competing?