2013-03-15 - Institut Jacques Monod - bioinfoclub

Date posted: 28-Jan-2015
transcript

<ul><li> 1. Doing computational science better: Some sources of inspiration; Some tools; Getting help; Your turn</li></ul><p> 2. Some sources of inspiration

3. Education: A Quick Guide to Organizing Computational Biology Projects. William Stafford Noble 1,2*. 1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America; 2 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America.

Introduction. Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues. The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting [...]

[...] understanding your work or who may be evaluating your research skills. Most commonly, however, that someone is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments. This leads to the second principle, which is actually more like a version of Murphy's Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you've been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new [...]

[...] under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own. Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts. Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final [...]

4. Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted &lt;year&gt;-&lt;month&gt;-&lt;day&gt; so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts. doi:10.1371/journal.pcbi.1000424.g001

[...] with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new [...]

The Lab Notebook. In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments [...]

These types of entries provide a complete picture of the development of the project over time. In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collabo[...]

5. In each results folder: script (getResults.rb or WHATIDID.txt); intermediates; output.

6. Best Practices for Scientific Computing. Greg Wilson, D.A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H.D. Haddock, Katy Huff, Ian M. Mitchell, Mark D. Plumbley, Ben Waugh, Ethan P. White, Paul Wilson. Software Carpentry (gvwilson@software-carpentry.org), University of Ontario Institute of Technology (Dhavide.AruState Unive...</p>
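The project layout described in the transcript (top-level data, results, doc, src and bin directories, with date-named experiment directories under results and a dated notebook at the root of results) could be scaffolded roughly as follows. This is only a minimal sketch: the function names `experiment_dir_name` and `scaffold_project` and the notebook filename `notebook.txt` are assumptions made for illustration and do not come from the slides or the paper.

```python
from datetime import date
from pathlib import Path
import tempfile

# Top-level directories named in the transcript.
TOP_LEVEL = ("data", "results", "doc", "src", "bin")


def experiment_dir_name(day: date, topic: str = "") -> str:
    """Build a name such as 2008-12-19 or 2008-12-19-topic.

    Year-month-day names sort chronologically when sorted as plain strings.
    """
    name = day.isoformat()
    return f"{name}-{topic}" if topic else name


def scaffold_project(root: Path, day: date, topic: str = "") -> Path:
    """Create the top-level layout plus one dated experiment directory."""
    for sub in TOP_LEVEL:
        (root / sub).mkdir(parents=True, exist_ok=True)

    exp = root / "results" / experiment_dir_name(day, topic)
    exp.mkdir(exist_ok=True)

    # Hypothetical notebook file at the root of results: one dated line per call.
    notebook = root / "results" / "notebook.txt"
    with notebook.open("a", encoding="utf-8") as fh:
        fh.write(f"{day.isoformat()}: started {exp.name}\n")
    return exp


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "msms"
        scaffold_project(project, date(2009, 1, 15))
        scaffold_project(project, date(2008, 12, 19), topic="baseline")
        # Listing the results directory shows the date-prefixed names in order.
        for path in sorted((project / "results").iterdir()):
            print(path.name)
```

Running the script builds the layout in a temporary directory, so it leaves nothing behind. In real use one would pass the actual project root and `date.today()`.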