394 wade word2007-ssp2008

Alex D. Wade, Senior Research Program Manager, External Research, Microsoft Research, Microsoft Corporation

Transcript of 394 wade word2007-ssp2008

Page 1: 394 wade word2007-ssp2008

Alex D. Wade
Senior Research Program Manager

External Research
Microsoft Research

Microsoft Corporation

Page 2: 394 wade word2007-ssp2008

• Science @ Microsoft – and the role of Scholarly Communication

• Office 2007
  – File Format Overview
  – Bibliography Support
  – UI Extensibility

• A Sampling of Related Projects

Page 3: 394 wade word2007-ssp2008

• Advancement of Science
• Global Collaboration
• Technology Excellence
• Interoperability

Putting computing into science…
Applying Microsoft products and research technologies to advance the scientific research and engineering innovation process

Putting science into computing…
Ensuring that research community requirements are factored into future versions of Microsoft software

Page 4: 394 wade word2007-ssp2008

• Science + computation are not the entire equation
  – Authoring, Analysis, Publishing, Discoverability, and Data Storage/Preservation are key components to scientists’ everyday work…and Microsoft’s core businesses

• The scholarly community has made it clear to us:
  – Microsoft must improve its offerings throughout the scholarly communication lifecycle

• Our approach: Conduct prototyping projects and proofs-of-concept to evolve Microsoft’s scholarly communication offerings

Page 5: 394 wade word2007-ssp2008

• Data Acquisition and Modeling
  – Data capture from source, cleaning, storage, etc.
  – SQL Server, SQL Integration Services, Windows Workflow Foundation

• Support Collaboration
  – Allow researchers to work together, share context, facilitate interactions
  – SharePoint Server, OneNote 2007 (shared)

• Data Analysis, Modeling, and Visualization
  – Mining techniques (OLAP, cubes) and visual analytics
  – SQL Analysis Services, BI, Excel, Optima, SILK (MSR-A)

• Disseminate and Share Research Outputs
  – Publish, Present, Blog, Review and Rate
  – Word, PowerPoint

• Archiving
  – Published literature, reference data, curated data, etc.
  – SQL Server


Microsoft is the only company that can offer end-to-end support

Page 6: 394 wade word2007-ssp2008

• Optimize for data-driven research & science
  – To both data (scientific) and to information (scholarly publications)
  – Reproducible research + computational science
  – Properly document / annotate scholarly output

• Interoperability is paramount
  – Actively lobby and drive for consensus around technical standards and standardized protocols proactively adopted by the community; enable broad community engagement
  – Customers have told Microsoft that interoperability (and intellectual property) is OUR responsibility

• Data preservation (and provenance) should be baseline
  – Documentation of the data’s provenance
  – Reliable and secure long-term storage, at a massive scale
  – Preservation needs to be like “accessibility” features, i.e., assumed as required

• Social networking & semantic knowledge discovery
  – Harnessing collective intelligence must be a consideration, since accessing research is a core step in the life-cycle. Enable knowledge discovery
  – Optimize for Web 2.0 scenarios and allow end-users/experts to find things more easily

• Metadata conventions / taxonomies / ontologies
  – This is a crucial strength for libraries, and a critical component in enabling Web 2.0

Page 7: 394 wade word2007-ssp2008

• New file format
  – New file extension (DOCX)
  – All content expressed in XML (Office Open XML)
  – Contained in a zip file (OPC)

• ECMA specification (ECMA-376) & ISO Standard
  – OpenXML
  – Open Packaging Conventions
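
Because the package is an ordinary zip file, its parts can be listed with any zip library. The following is a minimal sketch in Python (a language chosen here for illustration; the slides name none). It builds a small stand-in package in memory and lists its part names. The three part names mirror those usually found in a Word package, but the part bodies are placeholders and not valid Office Open XML, and `list_parts` is a hypothetical helper name.

```python
import io
import zipfile


def list_parts(package):
    """Return the sorted part names stored in an OPC (zip) package."""
    with zipfile.ZipFile(package) as zf:
        return sorted(zf.namelist())


# Build a tiny stand-in package in memory so the sketch runs without a real file.
# The XML bodies are placeholders (assumption), not valid Office Open XML.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("_rels/.rels", "<Relationships/>")
    zf.writestr("word/document.xml", "<document/>")

buf.seek(0)
parts = list_parts(buf)
print(parts)
```

Passing the path of a real .docx file to `list_parts` instead of the in-memory buffer should work the same way, since `zipfile.ZipFile` accepts either.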

Page 8: 394 wade word2007-ssp2008

• Easy to access the different parts of the document
  – XML file
  – Images
  – Annotations

• Simpler to transform Word’s XML into other XML formats or extract relevant data

• Ability to build .docx files programmatically or through transformations

• Ability to extend Word UI (and content) to support additional or custom data
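
As an illustration of building a .docx programmatically and extracting data from it, here is a minimal Python sketch that uses only the standard library. The function names `build_docx` and `extract_paragraphs` are hypothetical. The namespace, content-type, and relationship strings are written as I understand them from the Office Open XML specification and should be checked against it. The sketch only round-trips its own output and has not been verified against Word itself.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML main namespace (assumed from the Office Open XML spec).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def build_docx(paragraphs):
    """Return the bytes of a minimal package with one paragraph per string."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        zf.writestr("word/document.xml", document)
    return buf.getvalue()


def extract_paragraphs(docx_bytes):
    """Read word/document.xml from the package and return paragraph texts."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is the concatenation of its w:t runs.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


texts = extract_paragraphs(build_docx(["Hello", "A & B"]))
print(texts)
```

The same `extract_paragraphs` approach is the starting point for transforming Word's XML into another XML format: once the part is parsed, the tree can be walked and re-serialized in whatever target vocabulary is needed.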

Page 9: 394 wade word2007-ssp2008

• Compatibility pack
  – Open and save to docx from older Word versions

• Add-in to export to PDF or XPS

• ODF Converter
  – Open Source project on SourceForge
  – Provides two-way conversion between ODF and OpenXML (WordprocessingML, SpreadsheetML, and PresentationML)
  – ‘Save As ODF’ to be included in Office 2007 SP2

Page 10: 394 wade word2007-ssp2008

• Manual Entry of Source Metadata

Page 11: 394 wade word2007-ssp2008

• Sources saved as Bibliography XML
• Sources.XML contains all sources
• Sources can be imported into new documents for easy reuse
• Sources.XML can be shared between users
• Documentation Styles are XSLTs
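
To make the Sources.XML idea concrete, the following Python sketch parses a small, invented bibliography fragment and formats an inline citation. The sample source ("Doe08") is made up for illustration. The namespace and element names follow the Office Open XML bibliography schema as I recall it and should be verified against a real Sources.xml file. Python's standard library has no XSLT processor, so the plain function `format_inline` (a hypothetical name) stands in for what a Documentation Style XSLT would do.

```python
import xml.etree.ElementTree as ET

# Bibliography namespace (assumed; check against a real Sources.xml).
B_NS = "http://schemas.openxmlformats.org/officeDocument/2006/bibliography"
NS = {"b": B_NS}

# Invented sample data, for illustration only.
SAMPLE = f"""<b:Sources xmlns:b="{B_NS}">
  <b:Source>
    <b:Tag>Doe08</b:Tag>
    <b:SourceType>JournalArticle</b:SourceType>
    <b:Author><b:Author><b:NameList>
      <b:Person><b:Last>Doe</b:Last><b:First>Jane</b:First></b:Person>
    </b:NameList></b:Author></b:Author>
    <b:Title>An Example Article</b:Title>
    <b:JournalName>Journal of Examples</b:JournalName>
    <b:Year>2008</b:Year>
  </b:Source>
</b:Sources>"""


def read_sources(xml_text):
    """Parse bibliography XML into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    sources = []
    for src in root.findall("b:Source", NS):
        authors = [
            f"{p.findtext('b:Last', '', NS)}, {p.findtext('b:First', '', NS)}"
            for p in src.findall("b:Author/b:Author/b:NameList/b:Person", NS)
        ]
        sources.append({
            "tag": src.findtext("b:Tag", "", NS),
            "type": src.findtext("b:SourceType", "", NS),
            "authors": authors,
            "title": src.findtext("b:Title", "", NS),
            "year": src.findtext("b:Year", "", NS),
        })
    return sources


def format_inline(source):
    """Stand-in for a style transform: an author-year inline citation."""
    if source["authors"]:
        first_author = source["authors"][0].split(",")[0]
    else:
        first_author = "Anon."
    return f"({first_author}, {source['year']})"


sources = read_sources(SAMPLE)
print(format_inline(sources[0]))
```

Because the sources live in plain XML, the same file can be read by tools outside Word, which is what makes sharing between users and reuse across documents straightforward.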

Page 12: 394 wade word2007-ssp2008

• Citations and Bibliographies can be inserted inline with a single click

• Automatically Formatted according to active Documentation Style

Page 13: 394 wade word2007-ssp2008

• Ribbon Control
• Research Pane
• Smart Tags

Page 14: 394 wade word2007-ssp2008
Page 15: 394 wade word2007-ssp2008

• Tools for Authors
  – Search Commands in Office
  – Ribbon for Researchers

• Semantic Information
  – Ontology-based markup of scholarly papers
  – Authoring of chemical drawings + semantic information
  – NLM DTD (Pablo Fernicola)

• Data Preservation & Access
  – File format preservation + interoperability
  – Scientific datasets for research reproducibility
  – Publisher submission workflow for dataset archiving

Page 16: 394 wade word2007-ssp2008

Search Commands in Office
Office Labs

Goals
• Office 2007 Add-in that aids in finding commands, options, wizards and galleries in Word, Excel and PowerPoint
• Includes Guided Help, which acts as a tour guide for specific tasks

Project Status
• Available now via http://www.officelabs.com/projects/searchcommands/

Page 17: 394 wade word2007-ssp2008
Page 18: 394 wade word2007-ssp2008

Ribbon for Researchers
Concept

Page 19: 394 wade word2007-ssp2008

Integration with various services

Search against the Live Search Academic service straight from within Word

One-click insert to the bibliography

Page 20: 394 wade word2007-ssp2008

Semantic Markup in Word 2007
with UC San Diego

Goals
• Semantic markup using domain-specific ontologies and controlled vocabularies
• Facilitate/automate referencing to PDB (and other resources) from manuscript
• A domain-specific ontology is downloaded and made available from within Microsoft Word 2007
• Authors can record their intention, the meaning of the terms they use based on their community’s agreed vocabulary

Project Status
• Phase 1 complete
• Beta testing with PLoS later this year

Page 21: 394 wade word2007-ssp2008

Domain-specific ontology

Support for annotations straight from within Word

Annotations travel with the document

Can be used to improve domain-specific discovery of information, cross-linking, etc.

Page 22: 394 wade word2007-ssp2008

Chemistry Drawing for Office
Preliminary investigation

Goals
• Support students/researchers in simple chemistry structure authoring/editing
• Storage and transportability of semantic chemical data, not just images, via Chemical Markup Language (CML)
• Enable automatic extraction/harvesting of chemical data

Project Status
• Early investigation stage
• Will be encouraging on-going publisher feedback
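
To show why semantic chemical data is easier to harvest than an image, here is a small Python sketch that reads a hand-written CML fragment for water and extracts element counts and bonds. The fragment is invented for illustration. The namespace and the `atomArray`, `atom`, `bondArray`, `bond`, `elementType`, and `atomRefs2` names follow CML as I understand it and are assumptions to check against the CML schema. `element_counts` and `bonds` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# CML namespace (assumed; verify against the CML schema).
CML_NS = "http://www.xml-cml.org/schema"

# Hand-written sample molecule, for illustration only.
WATER = f"""<molecule xmlns="{CML_NS}" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""


def element_counts(cml_text):
    """Count atoms per element type in a CML molecule."""
    root = ET.fromstring(cml_text)
    atoms = root.iter(f"{{{CML_NS}}}atom")
    return Counter(atom.get("elementType") for atom in atoms)


def bonds(cml_text):
    """Return each bond as a pair of atom ids."""
    root = ET.fromstring(cml_text)
    return [
        tuple(bond.get("atomRefs2").split())
        for bond in root.iter(f"{{{CML_NS}}}bond")
    ]


counts = element_counts(WATER)
pairs = bonds(WATER)
print(dict(counts), pairs)
```

None of this is possible when a structure is stored only as a picture, which is the point of carrying CML alongside the drawing.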

Page 23: 394 wade word2007-ssp2008

PLANETS
Long-term Preservation of Digital Objects

Organization
• EU Commission Project, €14M for 4 years
• Consortium of 5 national libraries, 4 national archives, 4 universities and 4 industry partners

Goals
• Tools and methods for sustainable long-term preservation of digital objects
• Preservation of Office Documents based on OpenXML

Project Status
• OpenXML conversion tools available now:
  – http://research.microsoft.com/research/rpp/projects/MSConversionTools/OpenXMLConversionTools.htm

Page 24: 394 wade word2007-ssp2008

GenePattern for Word 2007
with Broad Institute @ MIT

Goals
• Integrate data/images from GenePattern workflows into research papers
• Allow for research reproducibility by combining data with the text
• Highlight OpenXML and Office 2007 technologies and break new research ground with the integration of data & workflows with research papers
• Testing/linkage to other labs, moving beyond initial installation

Project Status
• Currently in final phase of testing
• Will move into production in June 2008
• Code to be published at http://www.codeplex.com

Page 25: 394 wade word2007-ssp2008
Page 26: 394 wade word2007-ssp2008

Data Archive Project
with Johns Hopkins University

Goals
• Mechanism for long-term preservation of data sets
• Authoring tool to support creation of relationship resource map
• Use of OAI-ORE resource maps for collection description
• Workflow for text & data linkage between publisher and data archive

Page 27: 394 wade word2007-ssp2008

author
Word 2007 OPC format contains data set(s) as well as resource map of relationships.

publisher
Publisher retains article and replaces it with the article URL. Forwards data to Data Archive

archive
Archive stores data set(s) and returns data set URL(s) to publisher as part of updated resource map
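
The author, publisher, and archive hand-off described on this slide can be sketched as a short simulation. The Python below is illustrative only. `ResourceMap` is a plain dataclass standing in for an OAI-ORE resource map (it is not an ORE serialization). The function names, part paths, and `example.org` URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceMap:
    """Simplified stand-in for a resource map: where each resource lives."""
    article: str
    datasets: dict = field(default_factory=dict)  # data set name -> location


def author_package():
    # Author: the Word package holds the article, the data set(s), and the map.
    return ResourceMap(
        article="/word/document.xml",
        datasets={"results.csv": "/data/results.csv"},
    )


def publisher_step(rmap, article_url):
    # Publisher: keep the article, point the map at its URL, forward the data.
    forwarded = dict(rmap.datasets)
    rmap.article = article_url
    return rmap, forwarded


def archive_step(rmap, forwarded, archive_base):
    # Archive: store each data set and write its new URL back into the map.
    for name in forwarded:
        rmap.datasets[name] = f"{archive_base}/{name}"
    return rmap


rmap = author_package()
rmap, forwarded = publisher_step(rmap, "https://publisher.example.org/articles/1")
rmap = archive_step(rmap, forwarded, "https://archive.example.org/datasets")
print(rmap)
```

After the last step every entry in the map is a URL rather than a part inside the package, so the article and its data can be cited and retrieved independently.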

Page 28: 394 wade word2007-ssp2008

• Direct publisher/repository submission via Word
• Research Output Repository Platform
• Conference Management Tool
• eJournal Service
• …

Alex D. Wade
[email protected]

http://www.microsoft.com/science/

Page 29: 394 wade word2007-ssp2008

Compatibility packs for older versions of Word
• http://www.microsoft.com/downloads/details.aspx?FamilyId=941B3470-3AE9-4AEE-8F43-C6BB74CD1466&displaylang=en

Add-in for saving to PDF or XPS
• http://www.microsoft.com/downloads/details.aspx?FamilyId=4D951911-3E7E-4AE6-B059-A2E79ED87041&displaylang=en

SDK for OpenXML formats
• http://msdn2.microsoft.com/en-us/library/bb448854.aspx

Developer community forum
• http://openxmldeveloper.org/

Open Source OpenXML/ODF converter (both ways)
• http://sourceforge.net/projects/odf-converter/

Page 30: 394 wade word2007-ssp2008
Page 31: 394 wade word2007-ssp2008

Microsoft ventures into open access chemistry
Royal Society of Chemistry
By Richard van Noorden
January 29th, 2008
http://www.rsc.org/chemistryworld/News/2008/January/29010803.asp

Computational chemists have secured funding from computing giant Microsoft to showcase how chemistry can benefit from open access data sharing on the internet. 

The two-year eChemistry pilot project represents 'a major test case' for proposed new protocols for sharing scholarly information over the web, said Lee Dirks, director of scholarly communications at Microsoft Research. Microsoft's support is also a boost for the small band of chemists keen to promote open access internet publishing. 

The public-private collaboration is one of many Microsoft projects to probe the potential of computing to advance scientific research, and bring back what they learn to improve the company's product line, Dirks told Chemistry World. 'But chemistry is a discipline we've not typically worked in,' he said. 'From everything I've heard, it's not as progressive a field as, say, astronomy in use of the web'. 

Most chemical information on the web is published in closed journals and databases which guarantee high quality but also require a subscription to view. Pre-print servers, collaborative documents, open databases, video sites, online lab notebooks and blogs provide other ways of communicating research. Combining the lot offers the enticing prospect of a vast, free-to-access repository. This could transform the sharing of scientific research if the disparate data sources were machine-readable, so that a search engine could automatically gather data about a particular molecule from a crystal structure, a movie, an online lab book, and an archived article, for example. 

Radical change

The international standards required for this challenge are being developed by the Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE), based at Cornell University, Ithaca, US. Their model protocols will be officially launched on 3 March at Johns Hopkins University in Maryland.

The eChemistry project, Dirks explained, was chosen as an exemplar to show that the new standards are actually useful to scientists. Chemists and computer scientists at Cambridge and Southampton universities in the UK, and Indiana, Cornell, and Penn State in the US, will search and index existing online databases and print archives; and work out how best to record chemistry data captured in lab experiments. The results will be hosted by the US National Institutes of Health open access PubChem database and other repositories. 

Page 32: 394 wade word2007-ssp2008

http://chronicle.com/daily/2008/02/1585n.htm Monday, February 11, 2008

Researchers Develop Online Tools for Science Collaborations
By LILA GUTERMAN

Blogs, wikis, and social-networking sites such as Facebook may get media buzz these days, but for scientists, engineers, and doctors, they are not even on the radar. The most effective tools of the Internet for such people tend to be efforts more narrowly targeted to their needs, such as software that helps geneticists replicate one another's experiments. That was the underlying message of many presentations at the annual conference of the Professional/Scholarly Publishing Division of the Association of American Publishers held here last week.

Philip E. Bourne, a professor of pharmacology at the University of California at San Diego, spoke about the Web site SciVee, where scientists can link videos to their research papers that appear in open-access biomedical journals (The Chronicle, August 21, 2007). Mr. Bourne, who created the site, calls the videos pubcasts; they are typically about 10 minutes long and go into more detail than an abstract but less than the full-length article.

The videos are coming in at a trickle, says Mr. Bourne. (He attributes the slow rate to the high quality: the graduate students and postdoctoral researchers who make the videos have been crafting polished presentations.) But some of the ones already online have been viewed more than 100,000 times. When the pubcasts are uploaded, Mr. Bourne has also witnessed a steep increase in downloads of the linked article.

Jill P. Mesirov described an application that she hopes will ultimately become mainstream for journals that publish computational science. Ms. Mesirov, director of computational biology and bioinformatics at the Broad Institute of Massachusetts Institute of Technology and Harvard University, has designed a way to make computational work repeatable by other scientists.

The software, called GenePattern, stores both data and analytical routines. As the researcher works to collect and analyze the data, GenePattern records the steps the scientist has taken, so that anyone else can follow the steps and check the result or expand on the method using new data. Ms. Mesirov said that more than 6,000 people from more than 100 countries use the software.

She is now working with Microsoft to link such information to manuscripts that could be published online by peer-reviewed journals, to give readers access to a researcher's computational methods. "One of the problems with publishing a paper that relies heavily on computational work," she said, "is that all of the methods that you would need to reproduce it never appear in the journal. If you're lucky, they're in the supplementary material [online]. How much better if the journal had a link to the paper which had the data and an instantiation of the method embedded right in that paper."

Page 33: 394 wade word2007-ssp2008

How can we be sure we’ll remember our digital past?
Christian Science Monitor
By Chris Gaylord
February 13th 2008
http://www.csmonitor.com/2008/0214/p13s02-stct.html

Fading media, formats

The problem of digital preservation reaches across two standards. There's the media – floppies, CDs, hard drives – and the format of the files themselves – does it run in DOS, Hypercard, ClarisWorks 2.0?

Microsoft tackles this issue of "legacy" computing by running a kind of corporate museum. The company protects its multiplatform history by preserving old copies of "every major hardware and software change," says Lee Dirks, director of Scholarly Communications at Microsoft and a task force member.

"We've got computers stored on campus that go back to the Altair, the first computer [to run Microsoft software]," he says. "In fact, we bought multiple copies of the Altair just in case."

But maintaining antique computers is a costly way to keep the past alive.

A concept that is gaining momentum, Mr. Dirks says, is emulation, where programmers trick modern computers into thinking the way their classic cousins did. This lets them run old software without retro machines. Another problem arises when the emulator itself is written for last generation's operating systems. Do you write an emulator to handle the original emulator?

A more likely approach to long-term preservation is migration, says Berman. This calls for updating the file format every generation – without changing the contents, one hopes. This method has problems, as well. Some of the original context will be lost in translation, says Dirks. Also, the scale of the conversion will snowball as the number, size, and back-catalog of the files increases with each passing generation of technology.

Page 34: 394 wade word2007-ssp2008

• ICSTI Annual 2007 – Jun07
• Nature Asia-Pacific Summit – Jun07
• CODATA Summer School – Jul07
• DCC Annual 2007 – Dec07
• iSchool Conference 2008 – Feb08
• OAI-ORE Launch – Mar08
• BioMed Central 2007 Research Awards – Mar08
• Open Repositories 2008 – Apr08
• JCDL Annual 2008 – Jun08

Page 35: 394 wade word2007-ssp2008

• “Global Research Library 2020” with University of Washington (Oct07 and Mar08)

• Participating in two applications to the final round of the NSF “DataNet” solicitation (as an unfunded partner)

• Sponsoring BioMed Central’s 2007 Research Awards (Mar08)

• Aug07 Issue of CT Watch Quarterly (v. 3, no. 3)
  “The Coming Revolution in Scholarly Communications & Cyberinfrastructure”
  http://www.ctwatch.org/quarterly/articles/2007/08/

• New Scholarly Publishing website at:
  – http://www.microsoft.com/mscorp/tc/scholarly-publishing.mspx