
Alan Turing Institute Symposium on Reproducibility for Data-Intensive Research Dickson Poon China Centre, St. Hugh’s College, University of Oxford Wednesday 6 and Thursday 7 April 2016


Wednesday 6 April 2016

Programme | Location

09:00 – 09:30 Registration and coffee | Outside of the Lecture Theatre

09:30 – 09:40 Welcome and Introduction; outcomes of the symposium (Lucie Burgess, University of Oxford) | Lecture Theatre

09:40 – 10:10 What is reproducibility in the setting of computational data analytics? (Prof Carole Goble, Manchester University) | Lecture Theatre

10:10 – 10:30 Overview of the ATI (Prof Jared Tanner, University of Oxford) | Lecture Theatre

10:30 – 11:00 Coffee break | Wordsworth Tea Room

11:00 – 12:45 Session 1 – Data provenance to support reproducibility (Lead: Dr Paolo Missier, University of Newcastle, and Prof Tom Nichols, University of Warwick). Speakers: Prof Luc Moreau, University of Southampton; Prof Dorothy Bishop, University of Oxford; Dr Paolo Missier, University of Newcastle | Lecture Theatre; break-out rooms – see Session format for more details

12:45 – 13:45 Lunch | Wordsworth Tea Room

13:45 – 15:30 Session 2 – Computational models and simulations (Lead: Prof Jeremy Gibbons, University of Oxford). Speakers: Dr Nicola Botta, Potsdam Institute for Climate Impact Research; Prof Patrik Jansson, Chalmers University of Technology; Dr Camil Demetrescu, Sapienza University of Rome | Lecture Theatre; break-out rooms – see Session format for more details

15:30 – 16:00 Coffee break | Wordsworth Tea Room

16:00 – 17:15 Lightning talks | Lecture Theatre

17:15 – 17:45 Day 1 reportage, plans for day 2 | Lecture Theatre

19:00 Pre-dinner drinks; welcome from Lucie Burgess | At the entrance to St Hugh’s College Hall

19:30 Conference dinner | St Hugh’s College Hall


Thursday 7 April 2016

Programme | Location

08:30 Coffee available | Wordsworth Tea Room

09:00 – 09:15 Re-cap of day 1 and overview of day 2 | Lecture Theatre

09:15 – 11:00 Session 3 – Reproducibility for real-time big data (Lead: Prof David de Roure, University of Oxford). Speakers: Dr Suzy Moat, University of Warwick; Prof David de Roure, University of Oxford | Lecture Theatre; break-out rooms – see Session format for more details

11:00 – 11:15 Coffee break | Wordsworth Tea Room

11:15 – 13:00 Session 4 – Publication of Data-Intensive Research (Leads: Prof Carole Goble, Manchester University; David Crotty, Richard O’Beirne, Oxford University Press). Speakers: Dr Laurie Goodman, GigaScience; Mr Neil Chue Hong, Software Sustainability Institute | Lecture Theatre; break-out rooms – see Session format for more details

13:00 – 14:00 Lunch | Wordsworth Tea Room

14:00 – 15:45 Session 5 – Novel architectures and infrastructure to support reproducibility (Leads: Dr Richard Mortier, University of Cambridge, and Dr Adam Farquhar, The British Library). Speakers: Dr Kenji Takeda, Microsoft Research Limited; Dr Richard Mortier, University of Cambridge | Lecture Theatre; break-out rooms – see Session format for more details

15:45 – 16:30 Day 2 reportage | Lecture Theatre

16:30 End of symposium

BREAK-OUT GROUPS SESSION FORMAT

There will be 3 break-out groups per session, each with a different topic; delegates will choose on the day which group to join.

Questions for small group discussion could include:

• Current landscape, scientific challenges - What are the latest advances in research relating to the theme? What is the state of the art? What are the key research questions and scientific challenges? Where do the greatest gaps lie?

• Disciplinary and inter-disciplinary challenges - What are the data-intensive scientific disciplines that should be brought together; where do they inter-relate to support research in this area?

• Foundational and applied research challenges - What are the applied stakeholder challenges and how should these drive fundamental research? e.g. in health, finance, utilities, engineering, etc.

• Benefits and impact – What are the downstream impacts, e.g. less costly computations, more efficient and widespread data re-use, greater transparency and public trust in science?

• What could an emerging ATI programme look like in this area? What ambitious targets could we achieve within 1 year, 3 years, 5 years? What is the added value of working within the ATI framework? What partners should the ATI work with in this area?


Session abstracts

Session location, logistics, reportage

Keynotes and session talks take place in the LECTURE THEATRE. Following each session there will be three breakout groups, in the Lecture Theatre, the Louey Seminar Room on the 2nd floor, and the Ho Tim Seminar Room on the 1st floor. Each breakout group will have a different topic. Topics and breakout rooms are scheduled below.

For each breakout group, we invite one delegate to act as a scribe and to note the key points of the discussion in a Google document, for which links are provided below. Other delegates are welcome to add to the Google notes during each session. We will photograph post-it notes and flip-chart outputs for a visual record. The session chairs will feed back the key points in the reportage sessions. The outcome of the symposium will be a white paper, which we will publish online.

If you would like to do some pre-reading on the topics covered in the symposium, a list of references is available in BibTeX, RefWorks and EndNote formats on a shared Google folder here: https://drive.google.com/open?id=0B1EyUglIzGARZjFVa3dLUWExbzg

WEDNESDAY 6 APRIL 2016

Opening keynote - What is Reproducibility? - Professor Carole Goble, University of Manchester

‘When I use a word’, Humpty Dumpty said in rather a scornful tone, ‘it means just what I choose it to mean - neither more nor less.’ [1]. It is the same with Reproducible. Reusable. Repeatable. Recomputable. Replicable. Rerunnable. Regeneratable. Reviewable. It is R* mayhem. Or pride [2]. Does it matter? At least it does for computational science. Different shades of ‘reproducible’ matter in the publishing workflow, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of ‘R’ make different demands on the completeness, depth and portability of research [3]. If we view computational tools (software, scripts) as instruments – ‘datascopes’ rather than ‘telescopes’ or ‘microscopes’ – then we need to be clear when we talk about reproducible computational experiments about whether we are rerunning with the same setup on the same (preserved) instrument (say a virtual machine), or reproducing the instrument to replicate the experiment (say a description of an algorithm recoded) or repairing the instrument so we can reuse it for some other experiment (say replacing a defunct web service or a deprecated library). In this talk Professor Carole Goble will discuss the R* brouhaha and its practical consequences for computational data driven science.

[1] Lewis Carroll, Through the Looking-Glass (1872)

[2] David De Roure, More Rs than Pirates, http://www.scilogs.com/eresearch/more-rs-than-pirates/

[3] Juliana Freire, Philippe Bonnet, Dennis Shasha, Computational reproducibility: state-of-the-art, challenges, and database research opportunities, SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data: 593-596, ACM New York, NY, USA, doi:10.1145/2213836.2213908

Overview of the Alan Turing Institute – Professor Jared Tanner, University of Oxford

Professor Jared Tanner will give an overview of the mission and strategic objectives of the newly-established Alan Turing Institute, and will take questions from delegates. The Institute’s mission is to: undertake data science research at the intersection of computer science, mathematics, statistics and systems engineering; provide technically informed advice to policy makers on the wider implications of algorithms; enable researchers from industry and academia to work together to undertake research with practical applications; and act as a magnet for leaders in academia and industry from around the world to engage with the UK in data science and its applications.


Session 1 – Data provenance to support reproducibility (Wed 6 April, 11:00-12:45)

Link to Google doc for notes:

Group A – https://docs.google.com/document/d/1Ac2J7WVzXfEHeWnexWOQxBiEM3s8c7eO6FFXj-gJGZw/edit?usp=sharing

Group B – https://docs.google.com/document/d/1UMFEtudI8Rs-2VptMZc2_K24qC4gmQaHG0iE7weaa7o/edit?usp=sharing

Group C – https://docs.google.com/document/d/155NKzZ-U9jSygmyZe2yMFdc0P79HC8MksZlx7lsyFuQ/edit?usp=sharing

Session chairs

Dr Paolo Missier, Newcastle University

Professor Tom Nichols, Warwick University

Session abstract

The drive for greater reproducibility demands that every aspect of experimental design, data acquisition, pre-processing, analysis and results generation be trackable. Complete provenance would then allow an independent investigator to understand exactly what was done to the data at each step, and attempt to reproduce the result with either the original (shared) data or a new set of data. In the same spirit, the ability to capture provenance is also important to support the exploration of alternative experimental designs, by making it possible to reason about, and explain, differences in outcomes produced by different versions of an experiment. Realising this potential has been proving difficult, however.

The goal of this session is to try and understand the reality of provenance management practices with respect to reproducibility. We begin with three short presentations, which will (1) explore provenance issues in the area of neurosciences, (2) summarise the provenance standards that are available to address these requirements, and (3) provide an overview of experimental tools that leverage those standards to facilitate reproducibility.

Session speakers (in order of talks)

1.1 Professor Luc Moreau, University of Southampton

Provenance for explaining and reproducing past results

The ESRC EBook project aims to offer a multi-modal tool-suite (command line, web-based interactive portal, and interactive workflows) aiding in the use and teaching of statistical analysis techniques with a particular emphasis on their application to social science. Provenance is at the heart of this approach, capturing traces of execution steps, irrespective of their modality. Provenance can also be used as input to a workflow reconstruction component, allowing traces of previously captured steps to be edited as re-executable workflows. In this talk, I will outline the EBook approach and I will illustrate the salient aspects of the provenance PROV model.
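As an illustration only (it is not taken from the EBook project or from the talk), the idea of capturing execution steps as a provenance trace and then recovering an ordered workflow from that trace can be sketched in a few lines of Python. The relation names used, wasGeneratedBy and wasAssociatedWith come from the W3C PROV model; the Trace class, its methods, and the step, agent and file names are hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """A tiny in-memory provenance trace built on three PROV relation names."""
    used: list = field(default_factory=list)                 # (activity, entity)
    was_generated_by: list = field(default_factory=list)     # (entity, activity)
    was_associated_with: list = field(default_factory=list)  # (activity, agent)

    def record(self, activity, agent, inputs, outputs):
        """Record one execution step, whatever interface it was run from."""
        self.was_associated_with.append((activity, agent))
        self.used.extend((activity, e) for e in inputs)
        self.was_generated_by.extend((e, activity) for e in outputs)

    def lineage(self, entity):
        """Return the activities that led to `entity`, earliest first."""
        producer = dict(self.was_generated_by)  # entity -> activity that made it
        steps, seen = [], set()

        def visit(e):
            activity = producer.get(e)
            if activity is None or activity in seen:
                return  # a raw input, or a step that is already listed
            seen.add(activity)
            for a, source in self.used:
                if a == activity:
                    visit(source)  # list upstream steps before this one
            steps.append(activity)

        visit(entity)
        return steps


# Hypothetical three-step analysis (all names are made up for the example).
trace = Trace()
trace.record("clean", "alice", inputs=["survey.csv"], outputs=["clean.csv"])
trace.record("fit_model", "alice", inputs=["clean.csv"], outputs=["model.rds"])
trace.record("plot", "bob", inputs=["model.rds", "clean.csv"], outputs=["figure.png"])
print(trace.lineage("figure.png"))  # ['clean', 'fit_model', 'plot']
```

The ordered list returned by lineage is the kind of information a workflow reconstruction component could start from; a real PROV trace would also carry identifiers, timestamps and attributes that this sketch leaves out.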

1.2 Professor Dorothy Bishop, University of Oxford

Open data: unintended consequences and suggestions for averting them

Open data is typically presented as part of the solution to the reproducibility crisis, but there are situations when it could have quite the opposite effect. Any large, multivariate dataset provides ample opportunities for p-hacking – i.e. digging around in the data to look for ‘interesting’ patterns that give significant results. I will consider three possible ways of averting problems: data access agreements with pre-registered analysis, dividing data into discovery (open) and replication (restricted access) samples, and masked data.

1.3 Dr Paolo Missier, University of Newcastle

As scientific experimental research becomes increasingly data-intensive and data-driven, an ecosystem of technology and tools is slowly emerging to address the need to ensure its outcomes are reproducible. This presentation will briefly explore some of the technology where provenance features as part of the solution. These will include, amongst others, the NoWorkflow and YesWorkflow tools (think yin and yang of scientific programming), as well as a provenance-aware Matlab client developed by the DataONE ‘federated data preservation’ project, as part of its toolkit in support of the Earth Sciences.

Breakout groups


Group A – Lecture Theatre

Topic: What are the motivations, challenges and limitations in the recording and exploitation of provenance of open data?

Chair: Prof Tom Nichols

Scribe: Dr Susanna Sansone

Group B – Louey Seminar Room

Topic: Provenance is only one of the elements of reproducibility. How does it integrate with other pieces of the ‘repro puzzle’?

Chair: Dr Paolo Missier

Scribe: Dr Paolo Missier

Group C – Ho Tim Seminar Room

Topic: What kind of automated reasoning can we perform using provenance? For instance: can we design an algorithm that generates a new interesting experiment using the (detailed) provenance recordings of other experiments?

Chair: Prof Luc Moreau

Scribe: Prof Patrik Jansson


Session 2 – Computational models and simulations (Wed 6 April, 13:45-15:30)

Link to Google doc for notes:

Group A – https://docs.google.com/document/d/1pdNNwu-b1aLmLwdbJWfCSHiG0ehQPFPDdBPPdQt7FHM/edit?usp=sharing

Group B – https://docs.google.com/document/d/1isJ15RwA-KjxFnfMygEER2NlqrFMVN6izdrxqpDDU6s/edit?usp=sharing

Group C – https://docs.google.com/document/d/1lnmQxFoFIzw1spDvJwapTgIaOuOKQLcHO4CQmlVRQE0/edit?usp=sharing

Session chair

Professor Jeremy Gibbons, University of Oxford

Session abstract

Introductory presentations on artifact evaluation in computing publications (validating that accompanying code does indeed give any results reported in the paper) and, in the other direction, about computing techniques that help to yield correct computational simulations of abstract models, such as differential equations in an economics paper (domain-specific languages, code generation, program correctness, type safety, verification etc.).

For a gentle, accessible introduction to the topic, please see: https://theconversation.com/science-relies-on-computer-modelling-so-what-happens-when-it-goes-wrong-56859

Session speakers (in order of talks)

2.1 Dr Nicola Botta, Potsdam Institute for Climate Impact Research; Professor Patrik Jansson, Chalmers University of Technology

From Numerical Simulations to Rigorous Scientific Advice: It's All About Languages!

We can simulate the evolution of coupled earth system models over thousands of years, create synthetic populations of millions of agents and analyse networks of material and energy flows on very fine scales. But can we interpret and communicate our results in a rigorous, unequivocal way? Can we build upon shared, unambiguous notions of sustainability, stability, avoidability? Can we provide accountable advice to decision making? We argue that, in spite of the amazing growth of available computing power, the gap between scientific computing and rigorous scientific advice has been widening and that computing science can play a crucial role in helping to close this gap.

2.3 Dr Camil Demetrescu, Sapienza University of Rome

Artifact Evaluations for Software Conferences

We describe an evaluation process of artifacts that complement conference publications with supplementary material (software, data etc.). The process has been widely tested in several mainstream conferences in computing since 2011. We report on the motivations, implementation and outcome of this process.

Breakout groups

Group A – Lecture Theatre

Topic: Languages for Scientific Modelling

Chair: Dr Jonathan Cooper

Scribe: Catherine Jones

Group B – Louey Seminar Room

Topic: Domain-specific languages for scientists

Chair: Dr James Cheney

Scribe: Prof Patrik Jansson

Group C – Ho Tim Seminar Room

Topic: Artifact Evaluation

Chair: Dr Camil Demetrescu

Scribe: Dr Craig Anslow


LIGHTNING TALKS (Wed 6 April, 16:00-17:15)

1. Laying the groundwork for biomedical data discovery, sharing and reuse

Speaker: Dr Susanna Assunta Sansone, Associate Director, Oxford e-Research Centre, University of Oxford; Consultant, Nature Publishing Group

Slides here.

The biomedical and healthcare Data Discovery Index Ecosystem (bioCADDIE, https://biocaddie.org) is a cooperative research effort to facilitate data discovery, sharing and reuse via the development of DataMed, a Data Discovery Index prototype for the NIH Big Data to Knowledge Initiative (BD2K, https://datascience.nih.gov). We need to be transformative and impactful for data as PubMed is for the biomedical literature. In this short talk the highlights of our journey so far.

2. On the Research Objects framework

Speaker: Professor Carole Goble, School of Computer Science, University of Manchester

To be reproducible includes bundling, along with the narrative, the other stuff: experimental methods, computational codes, data, algorithms, workflows, scripts. Some of the other stuff is hosted remotely and it also has the potential to change under your feet. Folks increasingly refer to ‘Research Objects’ as a general term for ‘stuff that supplements an article or a new currency unit for research’ or ‘something other than a paper’. But Infrastructure makers who have to make, support and exchange Research Objects need more than concepts. We need frameworks, metadata specifications, reference implementations, examples. Metaphors. Researchobject.org has defined such a RO framework with reference specifications and implementations. ROs are metadata objects for explicitly describing aggregations or packages of content: boxes of components, and assembling instructions, with a shipping manifest for what is in the box and where it is from. We specify the ontologies needed to construct manifests (aggregation and annotation) and to guide their content (checklists, provenance, versioning, dependencies). The RO container needs to be implemented using off-the-shelf platforms – Zip, BagIt, Docker. I will sketch out the framework and point to some implementations.

3. Provenance in neuroimaging with NIDM Results

Speaker: Professor Thomas Nichols, Head of Neuroimaging Statistics, Department of Statistics & Warwick Manufacturing Group, Warwick University

Functional Magnetic Resonance Imaging (fMRI) is one of the early 'big data' biomedical disciplines, with individual's raw data occupying ~1GB around 2000, and growing larger with modern acquisition techniques. This large, complex data however is routinely reduced to summaries that could be written on a PostIt note: a list of x,y,z coordinates of activation locations. 20 years ago, the need for this distillation was defended as there was no way to share or publish the binary image files. In the modern era, however, there is little justification for not sharing the full and detailed representation of fMRI results. I will discuss a standardization initiative to link and describe all facets of a neuroimaging experiment. The effort is called the Neuroimaging Data Model (NIDM), and is supported by the International Neuroinformatics Coordinating Facility (INCF) Neuroimaging Data Sharing Task Force. After giving an overview of the project, I will focus on NIDM-Results, the portion of the model that represents mass univariate statistical results. This standard approach represents results from the 3 most widely used software packages, covering 80% of citations in fMRI, and facilitates sharing and reuse of results, for meta-analysis in particular.

4. Software Re-use, re-purposing and reproducibility

Speaker: Catherine Jones, Software Engineering Group Leader, Scientific Computing Department, Science and Technology Facilities Council

Catherine will give an overview of the Jisc funded project, Software Reuse, Repurposing & Reproducibility which examined issues of persistent identification of software and issues around keeping software running through use of virtual machines/Docker.

5. How to make data work for you – a data ecosystem vision by Elsevier

Speaker: Dr Wouter Haak, VP Research Data Management Solutions, Elsevier

How to get an ecosystem going where data sharing benefits all stakeholders, starting with the researcher? ‘Data sharing’ is an often heard phrase, and with the changing funder mandates to encourage open data, an immediate need. We need to look beyond data-compliance. Elsevier is building an open ecosystem for data use and data re-use, and is engaging with the scientific community to make the data work for them.

6. Bayesian uncertainty quantification for assessing model predictivity and reproducibility

Speaker: Dr Matteo Icardi, Warwick Zeeman Lecturer, Warwick University

Computer simulations of physical models are often affected by a certain degree of irreproducibility. This can be caused by the actual irreproducibility of the underlying physical problem, present also in the experiments, or it could be related to a poor mathematical formulation or parametrisation of the model. In other cases, the limited predictivity of a model is simply due to an unbalanced ratio between model complexity and information available to inform the model parameters. The first cause is often referred to as an aleatoric uncertainty and the last one as epistemic uncertainty. In this short talk, I will illustrate the potential of combining Bayesian statistical techniques with physical models to obtain quantitative information about these sources of uncertainty. An example related to the simulation of subsurface flows will be presented.

7. The impact of research data publishing: Data sharing stories from Scientific Data

Speaker: Dr Varsha Khodiyar, Data Curation Editor, Nature Publishing Group

Scientific Data (Nature Publishing Group) has been publishing peer-reviewed, open access and multidisciplinary research datasets since May 2014. Reuse of shared data by the research community is well established in many fields. However, in our experience, openly published peer-reviewed data is also used by those outside of the formal research community. We highlight examples of data reuse by both the research and non-research communities.

8. I no longer know what data is

Speaker: Dr William Kilbride, Executive Director, Digital Preservation Coalition

I no longer know what data is. For some time now I have struggled with a meaningful definition. With a background in the humanities, I have never been entirely comfortable with the idea of data without theory: ‘raw data’ seems to me an oxymoron. But I am happy to give logical positivists their space in other disciplines if that is the consensus. So when asked about data, I have ducked the philosophical question and given a softer computing definition that distinguishes data from software and hardware. Data can be shared because it can be packaged and because it is independent of the tools we apply to it. Data has no on-off switch and it is not executable. But my definition no longer works – and it never really has. Practical experience tells me that data includes all sorts of internal dependences and applications: libraries and services that exist in the space between inert data and active processes. As we hear ever greater rhetoric of data policy and data process and data value in science, so I think we need to be a bit clearer about what is in scope. I am no longer sure what data is. There is more at stake than bits and bytes.

9. OPUS - Keeping Track Of Your Research Data

Speaker: Dr Ripduman Sohan, Research Associate, Digital Technology Group, Cambridge University

EPSRC's open data requirements now mean it is necessary to track the data and metadata involved in the creation of every publication. Manual collection of this information is cumbersome, tedious and error prone, while adding automatic collection is likely to be a domain-specific, time consuming and expensive exercise. At the Cambridge Computer Laboratory we have developed a system called OPUS that assists with the task of data and metadata collection for Linux applications. OPUS seamlessly and transparently monitors the programs run on machines and records a variety of information such as the files that were accessed and the user running the program. This is achieved with negligible performance overhead. OPUS is capable of supporting reproducibility for ad-hoc workflows and individual programs. We have also adapted OPUS to satisfy the EPSRC open data requirements. During the talk we will introduce OPUS, highlight key features and outline our plans for the future.


10. Sustaining computation is key to the future of digital

Speaker: Professor Natasa Milic-Frayling, University of Nottingham / Intact Digital Venture / UNESCO

Digital is essentially computational. Without software we cannot use and experience data, content and programs. Yet, a software lifespan is short due to economic reasons. As demand for a specific software declines, it is not economically feasible to continue maintaining it. Thus, it soon becomes outdated and obsolete. However, virtualization and emulation are techniques that can provide a technologically and economically sustainable solution to support the rare use of legacy content and data. Currently we can virtualize software applications that go 25 years back. With a concerted effort to support virtualized legacy software services, we can ensure that legacy software remains functional in the far future. The UNESCO PERSIST programme aims to establish an international bank of all software needed to enable access to Digital Heritage.

11. If I can't have the data, can I please have a pointer?

Speaker: Dr Kevin Ashley, Director of the Digital Curation Centre, University of Edinburgh

More and more data collections are being made available via safe havens or other mechanisms that provide users with the means to analyse data within them, but not to see the data itself or to export it to another environment. Whilst necessary, these do not always make it easy for researchers to know exactly what they have run their analysis on. Even those who want to do the right thing with regard to reproducibility may not be given the tools to do so.

12. Reproducible Model Development with the Cardiac Electrophysiology Web Lab

Speaker: Dr Jonathan Cooper, Department of Computer Science, University of Oxford

The promise of systems biology is to use mathematical modelling to elucidate how the behaviour of large systems emerges from interactions of components at lower scales, synthesising experimental data from different scales within quantitative hypotheses (i.e. models), with the ultimate goal of providing a predictive understanding of living systems. One reason for the limited realisation of this promise is the difficulty in relating models to experimental data robustly and reproducibly, being able to easily: test how well a model captures observed behaviours and predicts novel scenarios, and update it to incorporate new data. We are aiming to make the process of producing a new model from experimental data (i.e. model selection, parameterisation and validation) documented, automated and repeatable. Our ideas are being tested in the context of cardiac electrophysiology, arguably the most mature area of systems biology, in which modelling has underpinned many advances in our basic understanding and numerous treatments. The end result will be a community resource for cardiac researchers to develop, reproduce and compare mathematical models of cardiac cell electrophysiology with full confidence in the phylogeny and robustness of those models.

13. Toward Modular Empirical Research

Speaker: Dr Aleksi Aaltonen, Assistant Professor, Warwick Business School

The nature of empirical research varies considerably between academic disciplines. While scholarly specialisation accounts for much of the tremendous advances in modern science, it also hinders practical opportunities for cross-disciplinary collaboration as researchers find it difficult to plug in and interface different ways of doing research. The problem is: how can we make empirical research more modular? Empirical research means the production of a posteriori knowledge, that is, justifying knowledge claims by reference to observation. A necessary part of any empirical study is a process that starts from acquiring, simulating or experimentally generating data about a phenomenon of interest and then proceeds by performing analytical operations with the data. This process can be conceived as a chain of research operations with specific inputs and outputs. Unfortunately, we currently lack tools to describe, model and manage the practical steps in empirical research processes in a way that would be universally acceptable. A simple, formally well-defined approach to modelling the empirical research process could be based, for instance, on graph theory. It would help researchers to think more clearly about their practices and to develop information systems that offload administrative work to digital research infrastructures. An analogue can be found in version control systems such as GitHub or Bitbucket that provide a common infrastructure across various software development projects by offering a platform on which many different developers can develop components for complex projects. A similar research infrastructure would enhance replicability and cross-disciplinary communication, and support new types of collaboration, ultimately leading to more effective scholarship.

14. F1000Research and reproducibility

Speaker: Dr Michaela Torkar, Editorial Director, F1000Research

F1000Research is an open science publishing platform for life scientists, operating a fully transparent post-publication peer review model and encouraging the sharing of all types of studies, including negative and null findings, confirmatory studies and attempts to reproduce previous studies. F1000Research operates a strict open data policy, ensuring that authors include all the source data underlying their figures and tables. Editorial checks and the peer review aim to ensure that sufficient methodological details (and data) are provided to allow others to reproduce the research; for example, authors are asked to add Research Resource Identifiers (RRIDs) in order to unambiguously identify resources used in a study. F1000Research recently launched a channel on Preclinical Reproducibility and Robustness (http://f1000research.com/channels/PRR), specifically for reproducibility studies; the first set of papers includes 3 studies published by Amgen researchers, who could not reproduce previous findings; in all cases, all the data generated in the reproducibility attempts are deposited on the Open Science Framework repository, and the referees’ reports (and their names) have been published alongside the articles.

15. Stepping towards open science with an institutional data repository

Speaker: Robin Rice, Data Librarian, EDINA and Data Library, University of Edinburgh

Edinburgh DataShare (www.ed.ac.uk/is/datashare), an open access institutional data repository, holds over 1,000 datasets from disciplines spanning the University. It is designed as a sustainable solution for those who do not have a more appropriate disciplinary repository option, and is a bulwark of the University’s Research Data Management Policy. In 2015 we received the Data Seal of Approval peer-reviewed trusted digital repository standard. We believe the repository meets the FAIR framework of Findable, Accessible, Interoperable and Re-usable, helping our academics to make their data ‘intelligently open’ on the web (Science as an Open Enterprise, 2012).

16. Project Skye: Bridging Theory and Practice for Scientific Data Curation

Speaker: Dr James Cheney, Reader, University of Edinburgh

This talk provided a brief overview of the Skye project, funded by a five-year €1.99M ERC consolidator grant. Science is increasingly data-driven. Scientific research funders now routinely mandate open publication of publicly-funded research data. Safely reusing such data currently requires labour-intensive curation. Provenance recording the history and derivation of the data is critical to reaping the benefits and avoiding the pitfalls of data sharing. There are hundreds of curated scientific databases in biomedicine that need fine-grained provenance; one important example is the IUPHAR/BPS Guide to Pharmacology database (GtoPdb), a pharmacological database developed in Edinburgh. The Skye project will build support for curation into the programming language itself, building on recent research on the Links Web programming language, including advances in language-integrated query, and on provenance and data curation. Links is a strongly-typed language that provides state-of-the-art support for language-integrated query and Web programming. This project will build on Links and other recent language designs for heterogeneous meta-programming to develop a new language, called Skye, that can express modular, reusable curation and provenance techniques. To keep focus on the real needs of scientific databases, Skye will be evaluated in the context of GtoPdb and other scientific database projects. Bridging the gap between curation research and the practices of scientific database curators will catalyse a virtuous cycle that will increase the pace of breakthrough results from data-driven science. For further information on the project, please see: http://homepages.inf.ed.ac.uk/jcheney/group/skye.html#project

THURSDAY 7 APRIL 2016

Session 3 – Reproducibility for Real-Time Big Data (Thu 7 April, 09:15-11:00)

Link to Google doc for notes:
Group A – https://docs.google.com/document/d/1MQq-4gByirOcpZI7BjSaH6_Evx_IsEpLXcC_PmJt-bE/edit?usp=sharing
Group B – https://docs.google.com/document/d/1FHIC_dDU8jRZbRzuxjXtc0L7e8V_5cXdtooc9BqeWUs/edit?usp=sharing
Group C – https://docs.google.com/document/d/1QOoXmtw_fLTo8ChGLxf4xIO1psFrnSF0URy7UayeOi8/edit?usp=sharing

Session chair: Professor David de Roure, University of Oxford

Session abstract

Reproducibility in new digital scholarship – bigger, faster, better?
New areas of scholarship are characterised by machines and people operating together at scale: widespread adoption of new technologies leads to massive data generation, while at the same time we have crowd-scale personal engagement with the data and its analysis. This democratisation and empowerment leads to entirely new social processes, which afford new opportunities to conduct scholarship, such as through citizen science, and which themselves demand scholarly examination. For example, how do we reproduce experiments in social media analytics, which examine new social processes, at the scale of the population, in real time? This session explores the changing scholarly landscape and the new challenges in reproducibility.

Session speakers (in order of talks)

3.1 Dr Suzy Moat, University of Warwick

Sensing human behaviour with online data
Our everyday usage of the Internet generates huge amounts of data on how humans exchange information. In recent work, we have investigated whether data from sources such as Google, Wikipedia and Flickr can be used to measure and even predict human behaviour in the real world. In this talk, I will give an overview of the opportunities and challenges for reproducibility created by working with these new forms of data.

3.2 Professor David de Roure, University of Oxford

The Ethics of Automation – a dystopian view of our evolving knowledge infrastructure
Today we embrace the methodological challenges of big and real time data, from the large hadron collider to the large people collider, known as social media. This is all a useful rehearsal, but the real challenges lie ahead. The Internet of Things, deployed in our cities, cars, homes and bodies, brings yet more data – machine to machine. Meanwhile, the engagement of the crowd in analytics may soon be out of its sweet spot, as we give way to humans training the machines to process at rising scale. Clearly the future is increasingly automated, but what does this mean for research, and for research communication? This talk will provide a social machines perspective on our knowledge infrastructure and look at our increasingly automated future, asking whether it is meaningful to automate reproducibility, and if and how we should keep the human in the loop.

Breakout groups

Group A – Lecture Theatre
Topic: TBC
Chair: Prof David de Roure
Scribe: Dr Jonathan Cooper

Group B – Louey Seminar Room
Topic: Reproducibility in the social sciences: does the arrival of large online data sources make the outlook better or worse?
Chair: Dr Suzy Moat
Scribe: Dr Clare Dyer-Smith and Lucie Burgess

Group C – Ho Tim Seminar Room
Topic: TBC
Chair: Prof Eric Meyer
Scribe:

Session 4 – Publication of Data-Intensive Research (Thu 7 April, 11:15-13:00)

Link to Google doc for notes:
Group A – https://docs.google.com/document/d/1_jHOHQSDG5pvExPWyNZFp70Y_eyDlUYB543i4zKMhbY/edit?usp=sharing
Group B – https://docs.google.com/document/d/1uHB--IyDRAUl8DTfafWBT2moj65s98hBsipbRTe0M24/edit?usp=sharing
Group C – https://docs.google.com/document/d/1XTd-pLfFj0sRvFIcLZfiaMTk039Eh3jLUEwy-QZXXFo/edit?usp=sharing

Session chairs:
Professor Carole Goble, University of Manchester
Dr David Crotty, Oxford University Press
Mr Richard O’Beirne, Oxford University Press

Session abstract

The publication of data-intensive research
As the methods and outputs of research change, what are the issues surrounding the publication of data-intensive research? This session will discuss the role of software with regard to reproducibility, and how this ties to skills, funder policy and publishing, as well as an editor-in-chief’s view of data citation: what works, what doesn’t, what requires more education, and what is needed to make sure data is as reusable as possible.

Session speakers (in order of talks)

4.1 Dr Laurie Goodman, GigaScience

Overcoming Hurdles to Data Publication
Although data publication is not new (Charles Darwin’s A Naturalist’s Voyage Around the World is a ‘classic’ example), the idea of data being a publishable entity has recently become a prominent part of researcher and publisher conversations. Here, as the start of a discussion on data publication, I will present what the journal GigaScience has been doing with regard to data publication, the need for data publication, and responses from the community.

4.2 Mr Neil Chue Hong, Software Sustainability Institute

There's No Such Thing As Irreproducible Research (Software Credit edition)
How often have we read a story in a newspaper, and despaired over a lack of data to back it up? The availability and accessibility of data, software and other research outputs is fundamental to good research, yet many barriers still exist. And, whilst open data is moving us forward, we risk being stalled by our software. This talk calls out these issues and silos, and examines how we have to change our attitudes to ‘shame’ if research is to survive public criticism.

Breakout groups

Group A – Lecture Theatre
Topic: Overcoming barriers to data publication
Chairs: Dr David Crotty and Dr Laurie Goodman
Scribe: Dr Rob Davidson

Group B – Louey Seminar Room
Topic: Will publication of code alongside data solve the reproducibility problem?
Chairs: Mr Richard O’Beirne and Mr Neil Chue Hong
Scribe:

Group C – Ho Tim Seminar Room
Topic: Three things we can do today: Call to Arms
Chair: Dr Simon Hodson
Scribe:

Session 5 – Novel architectures and infrastructure to support reproducibility (Thu 7 April, 14:00-15:45)

Link to Google doc for notes:
Group A – https://docs.google.com/document/d/1QwJ6MPaFWtoX1vJEVYVmvRrEls9u3p-RIjGVCeVD_68/edit?usp=sharing
Group B – https://docs.google.com/document/d/1wjIoAT8Fk54URHtrgg5natS2b-jquhnADYA5dm3cS0c/edit?usp=sharing
Group C – https://docs.google.com/document/d/1Gv-51ox41mmGF5Ewno4MUMw4VdMBWRv9Fl8WYd2Bj-E/edit?usp=sharing

Session chairs:
Dr Richard Mortier, University of Cambridge
Dr Adam Farquhar, British Library

Session abstract

Future Architectures and Infrastructures
Recent years have seen dramatic advances in computing infrastructure that supports reproducibility. Virtual machines, cloud computing and containers all provide means to capture and replicate the environment in which code must run. This session will consider how these infrastructure technologies are being used, and how some of them are currently being developed, before considering what specific new requirements arise from reproducibility for data-intensive research.

Session speakers (in order of talks)

5.1 Dr Kenji Takeda, Microsoft Research

Reproducibility and sustainability using cloud computing
We will discuss how researchers around the world are exploring and using cloud computing as a core part of their reproducibility and sustainability plans across many disciplines, and what the future looks like. We will describe how linking scholarly communications across the web also provides exciting opportunities ahead, including through our new Academic Knowledge Graph service (https://www.projectoxford.ai/academic). We will look to offer cloud computing awards to participants of the workshop, and want to make sure everyone can take advantage of this as a potential positive outcome of the event: www.azure4research.com

5.2 Dr Richard Mortier, University of Cambridge

Unikernels: Evolving Containers and Virtual Machines
Virtual machines and containers have revolutionised software development and system operations. Providing means to capture environmental dependencies is a boon to anyone wishing to reliably reproduce a software environment, whether for devops or science. However, neither is a panacea. A third, more recent option is now coming to light: unikernels. I will briefly introduce unikernels and indicate some of the ways in which they may be relevant to future development of reproducible data science systems.

Breakout groups

Group A – Lecture Theatre
Topic: Future technical requirements for reproducible data science
Chair: Dr Richard Mortier
Scribe: Dr Paolo Missier

Group B – Louey Seminar Room
Topic: War stories and tall tales: case studies of reproducibility
Chair: Dr Adam Farquhar
Scribe: Mr Brian Matthews

Group C – Ho Tim Seminar Room
Topic: Business structures and commercial pressures
Chair: Dr Kenji Takeda
Scribe:

Organisers and speakers bios

Professor Dorothy Bishop FMedSci, FBA, FRS is a Wellcome Trust Principal Research Fellow and Professor of Developmental Neuropsychology at the University of Oxford, where she heads a programme of research into children’s communication impairments. She is a supernumerary fellow of St John’s College Oxford. Her main interests are in the nature and causes of developmental language impairments, with a particular focus on psycholinguistics, neurobiology and genetics. She is also active in the field of open science and research reproducibility and she chaired a symposium on reproducibility at the Wellcome Trust last year. As well as publishing in conventional academic outlets, she writes a popular blog with personal reactions to scientific and academic matters (Bishopblog) and tweets as @deevybee.

Dr Nicola Botta is a senior scientist at the Potsdam Institute for Climate Impact Research (PIK). He received a PhD in engineering from ETH Zürich in 1994. He has worked at DLR (national aeronautics and space research centre), Göttingen, at the FU (Freie Universität) Berlin and, since 1998, at PIK. He has published in high-impact journals in computational fluid mechanics, parallel computing, agent-based modelling and program specification. His main research interests are program specification and development and dependently typed languages.

Lucie Burgess is Associate Director for Digital Libraries at the Bodleian Libraries, University of Oxford, and a Senior Research Fellow at Hertford College, Oxford. Lucie leads the Bodleian Digital Library Systems and Services team of 40 staff and is a member of the Bodleian Executive. Lucie is a member of Oxford University’s IT Committee and Digital Strategy committee, a board member of the Digital Preservation Coalition, and the Jisc representative to the ArXiv.org member advisory board and scientific advisory board. From 2007 to 2014 Lucie worked at the British Library, where as Head of Strategy and Planning she led the development of the British Library’s 2020 Vision. She also led the UK Legal Deposit libraries’ efforts to extend legal deposit to the digital domain. Lucie has also worked in publishing, business development and strategy at United Business Media, a FTSE-250 information services company, and for Arthur Andersen Business Consulting. Lucie began her career working with the United Nations Framework Convention on Climate Change secretariat in Bonn, Germany. Lucie has a Master’s degree in Physics from Hertford College, University of Oxford.

Mr Neil Chue Hong is the founding Director and PI of the Software Sustainability Institute, a collaboration between the universities of Edinburgh, Manchester, Oxford and Southampton. He enables research software users and developers to drive the continued improvement and impact of research software. From 2007-2010, he was Director of OMII-UK at the University of Southampton, which provided and supported free, open-source software for the UK e-Research community. In addition to sitting on several project advisory committees, he is the Editor-in-Chief of the Journal of Open Research Software, the current Advisory Council chair of the Software Carpentry Foundation, co-author of ‘Best Practices for Scientific Computing’ and ‘An Open Science Peer Review Oath’, and a member of the EPSRC Strategic Advisory Team on e-Infrastructure.

Dr David Crotty is the Editorial Director, Journals Policy for Oxford University Press. He oversees journal policy and contributes to strategy across OUP’s journals program, drives technological innovation, serves as an information officer, and manages a suite of research society-owned journals. David was previously an Executive Editor with Cold Spring Harbor Laboratory Press, creating and editing new science books and journals, and was the Editor in Chief for Cold Spring Harbor Protocols. David received his PhD in Genetics from Columbia University and did developmental neuroscience research at Caltech before moving from the bench to publishing. David has been elected to the STM Association Board and serves on the interim Board of Directors for CHOR Inc., a not-for-profit public-private partnership to increase public access to research. As the Executive Editor of the Society for Scholarly Publishing's Scholarly Kitchen blog, David regularly writes about the intersection of technology and publishing.

Dr Camil Demetrescu conducts research at the intersection of different areas in computing, including programming languages and systems, algorithms and data structures, and software engineering. His research activity focuses on the design of efficient algorithms, tools, and techniques for engineering the performance of software systems, with particular emphasis on performance analytics, incremental algorithms, and data streaming. He has been principal investigator and site coordinator of many research projects. Camil Demetrescu has been a visiting scientist at Microsoft Research (Silicon Valley), at the AT&T Research Laboratories (Florham Park), and at the IT University of Copenhagen. His Ph.D. thesis was awarded one of the two 2002 Prizes of the Italian Chapter of the EATCS for the best dissertations in Theoretical Computer Science. He has served as steering committee member and program committee chair of premier conferences such as the European Symposium on Algorithms. Camil Demetrescu regularly serves on the program and artifact evaluation committees of premier international conferences. He has organized several scientific events at Bertinoro and Dagstuhl and is the general chair of the 30th European Conference on Object-Oriented Programming (ECOOP 2016). He is a member of the editorial board of the "Mathematical Programming Computation" (MPC) journal.

Dr Adam Farquhar is Head of Digital Scholarship at the British Library, where he and his team focus on establishing services for researchers that take full advantage of the possibilities that digital collections and data present across all formats and subjects. He is principal investigator for the British Library Labs project; co-ordinates the THOR project that will provide seamless identifier services for researchers and data; is a member of the International Image Interoperability Framework (IIIF) Consortium executive committee; is Director of the Endangered Archives Programme that works with teams around the globe to preserve archival material that is in danger of destruction, neglect or physical deterioration; is President of DataCite, an international association dedicated to making it easier to identify, cite, and reuse scientific data; and is founder and Board member of the Open Preservation Foundation. He has been responsible for the Library’s maps, newspaper, photographic, audio and moving image collections. Before joining the Library, he was the principal knowledge management architect for Schlumberger and a research scientist at the Stanford Knowledge Systems Laboratory.

Professor Jeremy Gibbons is Professor of Computing in the Department of Computer Science at Oxford University, where he is director of the part-time professional postgraduate Software Engineering Programme, and head of the Programming Languages and Software Engineering research theme. His research interests are in domain-specific languages, functional programming and the mathematics of program construction.

Professor Carole Goble is Professor of Computer Science at the University of Manchester, UK and co-founder of the Software Sustainability Institute UK. For the past 20 years she has led a research and development team working on e-Infrastructure, platforms and tools to enable scientists to share assets of all kinds, interoperate resources and represent knowledge. She has worked in all areas of Science but particularly the Life Sciences. Her systems and services serving the needs of Reproducible Research include: Research Objects (researchobject.org), workflow systems (Apache Taverna, www.taverna.org.uk), and sharing platforms (SEEK, seek4science.org; myExperiment, myexperiment.org; BioCatalogue, biocatalogue.org). She is a founding member of the Scholarly Communications organisation Force11.org and serves on international WGs for data and software citation, dataset publishing, workflow interoperability and identifier management. She co-leads the EU RI ELIXIR for life science data interoperability stream and the FAIRDOM (fair-dom.org) initiative for reproducibility of Systems Biology projects. She has given many keynotes on the topic of reproducibility.

Dr Laurie Goodman is the Editor-in-Chief for the international open-access open-data journal GigaScience, co-published by BGI and BioMed Central. Dr Goodman received her BS and MS from Stanford University in 1986, and PhD in Biochemistry and Molecular Biology from the University of Chicago in 1991. During her graduate work, she published a novel, A Spell of Deceit, with Del Rey Books. She completed a postdoctoral fellowship at the University of Colorado at Boulder then left the bench in 1995 to work as Assistant Editor at Nature Genetics. In 1997, she moved to Cold Spring Harbor Laboratory Press to serve as the Executive Editor of Genome Research and Managing Editor of Learning & Memory. In 2006, she started her own company, Goodman Writing & Editing, which provides a variety of services including manuscript writing seminars and high-level editing of scientific manuscripts, with a specialty in editing manuscripts from non-native English speakers. ORCID ID: 0000-0001-9724-5976.

Professor Patrik Jansson is a Full Professor of Computer Science and Head of the division of Software Technology at Chalmers University of Technology, Sweden. His research area is Software Technology and he has worked on Generic Programming, Functional Programming and Dependent Type Theory. In parallel he has spent a few years building a multi-disciplinary community and research agenda called "Global Systems Science" together with researchers in economics, climate change, risk & resilience, etc. (in addition to computer science and mathematics). Currently he works on Domain Specific Languages in the Centre of excellence for Global Systems Science [1], in the GRACeFUL project [2] and in the DSLsofMath project [3]. Twitter profile [4]: Computer scientist, Haskell hacker, catalyst of research ideas, likes to connect the big picture with formal details, software & language technology advocate.
[1] http://coegss.eu/
[2] https://www.graceful-project.eu/
[3] https://github.com/DSLsofMath/DSLsofMath
[4] https://twitter.com/patrikja

Dr Paolo Missier is Reader in Large-Scale Information Management with the School of Computing Science, Newcastle University, UK. He joined academia in 2004, after a prior career as a Research Scientist at Bell Communications Research, USA (1994-2001), and as a Research Fellow at the University of Manchester, School of Computer Science (2004-2011). His current research interests are in large-scale metadata analytics (i.e. the application of predictive data analytics techniques to large corpora of metadata), and data provenance management and analysis in particular. Between 2012 and 2013, Paolo contributed to the W3C Working Group on Provenance on the Web and co-edited the PROV standard. He currently holds a 3-year project grant from EPSRC for the ReComp project. Aimed at building a metadata management infrastructure, ReComp will enable decision-making on the needs and opportunities to re-compute expensive data analytics tasks as their outcomes lose value over time. At Newcastle, Missier is also responsible for the Post-graduate (MSc) module on Big Data Analytics, part of the MSc program in Cloud Computing for Big Data Analytics. Paolo holds a PhD in Computer Science from the University of Manchester, UK (2007), an MSc in Computer Science from the University of Houston, Texas, USA (1993) and a BSc and MSc in Computer Science from Università di Udine, Italy (1990).

Dr Suzy Moat is an Associate Professor of Behavioural Science at Warwick Business School, where she co-directs the Data Science Lab. Her research investigates whether data on our usage of the Internet, from sources such as Google, Wikipedia and Flickr, can help us measure and even predict human behaviour in the real world. Moat’s work touches on problems as diverse as linking online behaviour to stock market moves (with Preis, Curme, Stanley, et al.), estimating crowd sizes (with Botta and Preis) and evaluating whether the beauty of the environment we live in might affect our health (with Seresinhe and Preis). The results of her research have been featured by television, radio and press worldwide, by outlets such as CNN, BBC, The Guardian, Wall Street Journal, New Scientist and Wired. With her collaborator and Data Science Lab co-director, Tobias Preis, she recently led an online course on using big data to measure and predict human behaviour which attracted over 15,000 learners. Moat has also acted as an advisor to government and public bodies on the predictive capabilities of big data.

Professor Luc Moreau is Professor of Computer Science and Head of the Web and Internet Science group (WAIS), in the Department of Electronics and Computer Science (ECS) at the University of Southampton. Luc was co-chair of the W3C Provenance Working Group, which resulted in four W3C Recommendations and nine W3C Notes, specifying PROV, a conceptual data model for provenance on the Web, and its serializations in various Web languages. Previously, he initiated the successful Provenance Challenge series, which saw the involvement of over 20 institutions investigating provenance inter-operability in 3 successive challenges, and which resulted in the specification of the community Open Provenance Model (OPM).

Dr Richard Mortier is a member of faculty in the Systems Research Group at the Cambridge University Computer Lab. Past work includes Internet routing, distributed system performance analysis, network management, aesthetic designable machine-readable codes, and home networking. He works at the intersection of systems and networking with human-computer interaction, and is currently focused on how to build user-centric systems infrastructure that enables people to better support themselves in a ubiquitous computing world through Human-Data Interaction.

Professor Thomas Nichols is a Professor of Neuroimaging Statistics and a Wellcome Trust Senior Research Fellow at the University of Warwick, holding a joint position between Warwick Manufacturing Group & the Department of Statistics. He is a statistician with a solitary, 20-year focus on modelling and inference methods for brain imaging research. Before graduate studies, he worked as a programmer and statistician at the University of Pittsburgh's Positron Emission Tomography Facility. He earned his PhD in Statistics at Carnegie Mellon University with cross-training in Cognitive Neuroscience, and in 2000 joined the faculty at the Department of Biostatistics at the University of Michigan. He had a 3-year sojourn in industry, working at GlaxoSmithKline's Clinical Imaging Centre, London, where he developed methods for fMRI clinical trials and imaging genetics studies. In 2009 he received the Wiley Young Investigator Award from the Organization for Human Brain Mapping in recognition of his contributions to statistical modelling & inference of neuroimaging data.

Richard O’Beirne, Digital Strategy Manager, Oxford University Press, has worked in digital publishing since 1994 and with OUP since 2004. A strong advocate for the importance of standards compliance and open collaboration, he represents OUP on a number of publishing industry bodies.

Professor David de Roure PhD FBCS MIMA is Professor of e-Research at the University of Oxford and Director of the Oxford e-Research Centre. He has strategic responsibility for Digital Humanities at Oxford within TORCH (The Oxford Research Centre in the Humanities), collaborates in Oxford's Web Science laboratory with the Oxford Internet Institute, and is a member of the Oxford Cyber Security Network. For several years he has also held ESRC roles, as director of Digital Social Research and as a strategic advisor in the area of new forms of data and real time analytics. Focused on advancing scholarship using innovative digital methods, David works closely with multiple disciplines including social sciences (studying social machines), digital humanities (computational musicology), computer science (large scale distributed systems, social computing, Internet of Things) and previously sciences and social statistics. He has extensive experience in hypertext, Web, Linked Data, and scientific workflows. Drawing on this broad interdisciplinary background he is a frequent speaker and writer on digital scholarship and the future of scholarly communications.

Dr Kenji Takeda is Solutions Architect and Technical Manager for Microsoft Research. He is currently focussed on Azure for Research, Azure Machine Learning and academic outreach in Europe, Middle East and Africa. He is working with researchers across disciplines to best understand the use of cloud computing to accelerate their research. He has extensive experience in Cloud Computing, High Performance and High Productivity Computing, Data-intensive Science, Scientific Workflows, Scholarly Communication, Engineering and Educational Outreach. He also has research expertise in the areas of aerodynamics, aeroacoustics and flight simulation. He has a passion for developing novel computational approaches to tackle fundamental and applied problems in science and engineering. Kenji advises funding agencies and government bodies on policy and innovation. He is on the editorial board for the Journal of Open Research Software, and steering committees for major research consortia and international conferences. He is an advocate of open science and reproducible research.

Professor Jared Tanner is Professor of the Mathematics of Information in the Mathematics Institute at Oxford University and a Fellow at Exeter College; Oxford University’s liaison director for the Alan Turing Institute; SIAM UKIE Vice President (2011-2013) and Founding Editor-in-Chief of Information and Inference: A Journal of the IMA. Prof Tanner’s research focus is on the design, analysis, and application of numerical algorithms for information-inspired applications in signal and image processing.