Zmasek bosc2010 topsan

Connecting TOPSAN to Computational Analysis Christian M Zmasek, Kyle Ellrott, Dana Weekes, Constantina Bakolitsa, John Wooley, Adam Godzik Joint Center for Structural Genomics Sanford-Burnham Medical Research Institute, La Jolla, California, USA University of California, San Diego, La Jolla, California, USA Joint Center for Molecular Modeling

Transcript of Zmasek bosc2010 topsan

Page 1: Zmasek bosc2010 topsan

Connecting TOPSAN to Computational Analysis

Christian M Zmasek, Kyle Ellrott, Dana Weekes, Constantina Bakolitsa, John Wooley, Adam Godzik

Joint Center for Structural Genomics

Sanford-Burnham Medical Research Institute, La Jolla, California, USA

University of California, San Diego, La Jolla, California, USA

Joint Center for Molecular Modeling

Page 2: Zmasek bosc2010 topsan

Overview

• What is TOPSAN?
– TOPSAN: The Open Protein Structure Annotation Network
– community-based annotation of protein structures

• “Semantic” TOPSAN
• How to enter machine-readable, structured data
• Example: editor → entry → semantic web
• Different ways to download information
• SPARQL example
• Availability and licenses
• Acknowledgements

Page 3: Zmasek bosc2010 topsan

What is TOPSAN?

• TOPSAN: The Open Protein Structure Annotation Network
• Tens of thousands of protein structures have been determined by structural genomics (SG) centers, and many more are expected

• While these structures are available in PDB (Protein Data Bank)…

• … annotations for most of them are limited to one-line PDB titles

• TOPSAN is the first database that specifically focuses on providing extensive annotations for the thousands of structures solved by the SG centers

Page 4: Zmasek bosc2010 topsan

What is TOPSAN?

• TOPSAN’s main content is collaboratively written (“open”) articles/annotations, one for each solved protein structure

• TOPSAN combines automated and human-edited elements
• TOPSAN spans the range of analysis of
– single proteins
– characterization of protein families
– reconstruction of entire genomes

• Articles are created by structural genomics (SG) center staff and over 400 external users, so far covering 7,250 proteins

• Collaborating with PFAM to use JCSG structures to refine and create new PFAM families

Page 5: Zmasek bosc2010 topsan

TOPSAN example entry

Page 6: Zmasek bosc2010 topsan

“Semantic” TOPSAN

• Use the principles of the semantic web to turn TOPSAN into a database that can be:
– edited
– searched
– linked

• TOPSAN content is being made accessible to computational query and analysis via semantic web technologies

Page 7: Zmasek bosc2010 topsan

Entering machine-readable, structured data with the TOPSAN Protein Syntax (TPS)

• Takes the form subject, predicate, object
• Subject: the protein in question
• Predicate, examples:
– homologous
– encoded_by
– citation
– member_of
• Object: “direct value” or link to other database
• Example:
– {{ note.link( 'pfam_family_member', 'PFAM:PF07980' ) }}

• More information: http://topsan.wordpress.com/2010/06/01/96/
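
To make the subject/predicate/object idea concrete, here is a minimal Python sketch that turns a `note.link` statement (written with plain ASCII apostrophes) into a triple. The regular expression, the names `Triple` and `parse_tps_link`, and the placeholder subject are assumptions made for illustration; this is not TOPSAN's own parser, and the slides do not describe how TPS is processed internally.

```python
import re
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical pattern for the note.link(...) form shown on the slide.
# It only accepts straight single quotes around the two arguments.
_LINK_RE = re.compile(
    r"\{\{\s*note\.link\(\s*'([^']+)'\s*,\s*'([^']+)'\s*\)\s*\}\}"
)


def parse_tps_link(text: str, subject: str) -> Optional[Triple]:
    """Return a Triple for the first note.link(...) statement in text, or None."""
    match = _LINK_RE.search(text)
    if match is None:
        return None
    predicate, obj = match.groups()
    return Triple(subject, predicate, obj)


statement = "{{ note.link( 'pfam_family_member', 'PFAM:PF07980' ) }}"
# "<protein-id>" is a placeholder: the subject is whichever protein page
# contains the statement, which the slide does not name.
print(parse_tps_link(statement, subject="<protein-id>"))
```

The subject is passed in separately because, per the slide, it is implied by the page being edited rather than written in the statement itself.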

Page 8: Zmasek bosc2010 topsan

Example: in the Editor

Page 9: Zmasek bosc2010 topsan

Example: the resulting TOPSAN entry

Page 10: Zmasek bosc2010 topsan

Example: on the Semantic Web

<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#simular_structure> <http://www.pdb.org/pdb/explore/explore.do?structureId=2afb>

<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#simular_structure> <http://www.pdb.org/pdb/explore/explore.do?structureId=2var>

<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#functional_assignment> <http://purl.org/obo/owl/EC#EC_2.7.1.45>
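
The three statements above can be read programmatically as (subject, predicate, object) tuples. The following Python sketch does this with a regular expression; it assumes each statement sits on one line and consists of three angle-bracketed IRIs, as on the slide. The helper name `iter_triples` is hypothetical, and a real RDF library would be the more robust choice for anything beyond this simple shape.

```python
import re
from typing import Iterable, Iterator, Tuple

# Three <...> IRIs separated by whitespace, with an optional trailing " ."
_TRIPLE_RE = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.?\s*$")


def iter_triples(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) for each line holding three IRIs."""
    for line in lines:
        match = _TRIPLE_RE.match(line.strip())
        if match:
            yield match.groups()


# The statements exactly as they appear on the slide.
SLIDE_TRIPLES = [
    "<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#simular_structure> <http://www.pdb.org/pdb/explore/explore.do?structureId=2afb>",
    "<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#simular_structure> <http://www.pdb.org/pdb/explore/explore.do?structureId=2var>",
    "<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#functional_assignment> <http://purl.org/obo/owl/EC#EC_2.7.1.45>",
]

for subject, predicate, obj in iter_triples(SLIDE_TRIPLES):
    # Show only the part of the predicate IRI after the "#".
    print(predicate.rsplit("#", 1)[-1], "->", obj)
```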

Page 11: Zmasek bosc2010 topsan

Different ways to download information

• Generic TOPSAN page
– Semantic information embedded into every TOPSAN page
• RDFa interface
– http://topsan.org/rdfa/2A2M
– XML
• Bulk Download
– http://files.topsan.org/topsan.n3.gz
– All unique semantic triples stored in a single N3-formatted file
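
Once the bulk file has been downloaded, it can be scanned without unpacking it to disk. The Python sketch below streams a gzip-compressed dump and keeps the lines that mention a given predicate. It rests on two assumptions the slides do not confirm: that the dump writes one statement per line, and that predicates appear as full IRIs in angle brackets rather than as prefixed names (N3 permits both). The function name `iter_lines_with_predicate` is hypothetical.

```python
import gzip
from typing import Iterator


def iter_lines_with_predicate(path: str, predicate_iri: str) -> Iterator[str]:
    """Stream a gzip-compressed dump and yield lines that mention predicate_iri.

    This is a plain text scan, not an N3 parser: it assumes one statement
    per line and full <...> IRIs for predicates.
    """
    needle = f"<{predicate_iri}>"
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if needle in line:
                yield line.rstrip("\n")
```

With a local copy saved as `topsan.n3.gz`, a call such as `iter_lines_with_predicate("topsan.n3.gz", "http://purl.org/topsan/tps#molecular_weight")` would yield the matching lines one at a time, so the whole file never has to fit in memory. If the dump turns out to use prefixed names or multi-line statements, an N3-aware parser is needed instead.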

Page 12: Zmasek bosc2010 topsan

Simple SPARQL

PREFIX tps:<http://purl.org/topsan/tps#>

SELECT ?id ?weight WHERE {
    ?id tps:molecular_weight ?weight
}
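
For readers new to SPARQL, the query amounts to "for every triple whose predicate is `tps:molecular_weight`, return its subject as `?id` and its object as `?weight`". The Python sketch below performs the same selection over an in-memory list of triples. The sample triples are made-up placeholders, not TOPSAN data, and the function name `select_by_predicate` is hypothetical; a SPARQL engine does far more than this single pattern match.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Namespace taken from the PREFIX line of the query.
TPS = "http://purl.org/topsan/tps#"


def select_by_predicate(
    triples: Iterable[Triple], predicate: str
) -> List[Tuple[str, str]]:
    """Return (subject, object) pairs for every triple with the given predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]


# Made-up placeholder triples: subjects and values are NOT real TOPSAN data.
SAMPLE = [
    ("http://example.org/protein/A", TPS + "molecular_weight", "10000"),
    ("http://example.org/protein/A", TPS + "member_of", "http://example.org/family/F"),
    ("http://example.org/protein/B", TPS + "molecular_weight", "20000"),
]

rows = select_by_predicate(SAMPLE, TPS + "molecular_weight")
for protein_id, weight in rows:
    print(protein_id, weight)
```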

Page 13: Zmasek bosc2010 topsan

Availability and Licenses

• Project Site: http://www.topsan.org
• Software: http://www.topsan.org/Tools
• Open Source Licenses:
– Data: Creative Commons Attribution 3.0 License
– Software: GNU General Public License

Page 14: Zmasek bosc2010 topsan

Summary

• Structural genomics centers produce a large number of protein structures, most of which never get a publication

• TOPSAN provides a means for community annotation of such protein structures

• The TOPSAN Protein Syntax (TPS) allows annotators to easily enter machine-readable, structured data

• TOPSAN content is being made accessible to computational query and analysis via semantic web technologies

• Many aspects of TOPSAN are still under development and are planned to evolve with user needs

Page 15: Zmasek bosc2010 topsan

Acknowledgements

• Inspiration for TOPSAN/semantic web connection: DBCLS BioHackathon 2010

• Developers: Krishna Subramanian, Kyle Ellrott, Dana Weekes

• All contributors and users