1

NIST Scientific Data for Data Science
United Nations Open Data / Open Government Conference, April 26-28, Abu Dhabi

http://semanticommunity.info/Data_Science/NIST_Scientific_Data_for_Data_Science

Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community
http://semanticommunity.info/

http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

April 26, 2014


2

Open Data / Open Government Conference

• Request:
– Interesting case studies about open government / open data.
– Information on relevant federal apps designed.
– A short bio.

• Response:
– AOL Government published about 80 of my roughly 200 stories at Semantic Community about open government data and activities.
– Over 250 Spotfire dashboard apps in my cloud library, including most of the major open government dashboards and new data sets.
– Helped Data.gov get started in the US, and helped open government data get started at SEMIC.EU and in Japan.

3

Speaker Bio

• Brand Niemann, former Senior Enterprise Architect & Data Scientist with the US EPA, works as a data scientist, produces data science products, and publishes data stories for Semantic Community, AOL Government, & Data Science & Data Visualization DC.

• With Kate Goodier, he co-organized the Federal Big Data Working Group Meetup, whose Data Science Teams produce big data applications for government and business and which provides a free online graduate course entitled Practical Data Science for Data Scientists.

4

Broader Context

• NIST and other agencies need to support the following Federal Government Initiatives:
– Big Data
– Digital Government Strategy
– Public access mandated for "scientific results" supported by the U.S. government
– Federal agencies have submitted their "initial plans" for public access to scientific data to OSTP
– Digital Object Architecture: One result will be to make the scientific record into a first class scientific object

• The author has suggested that all of these can be addressed with agency digital content by following the Data Mining Standard.
– See “Data Science Makes Data More Important Than Code and Ontology”

5

Data Mining Standard

• Business Understanding:
– NIST Mission
  • Standardize measurement

• Data Understanding:
– NIST Digital Archives
  • Promised to publish raw data sets

• Data Preparation:
– Knowledge Base of the Above
  • Need raw data for figures

• Modeling:
– Semantic Knowledge Base, Data Papers, and NanoPublications
  • See White Paper on “Making Big Data Small” using Data Science and Semantics

• Evaluation:
– Searchability, Discovery, and Reasoning
  • Relational Queries and Graph Traversal

• Deployment:
– Story and Knowledge Base in MindTouch, Excel, NodeXL, Spotfire, and Be Informed
  • Data ecosystem

6

NIST

• NIST supports its employees and others with the following Information Services:
– Research Library
– Publishing Services
– NIST Museum and Archives

• The NIST Digital Archives (NDA) present images of NIST Museum artifacts and full-text NIST publications:
– NBS Bulletins
– Journal of Research of NIST
– NBS-NIST Directors
– NBS-NIST Histories
– NBS Circulars and Reports

7

NIST Home Page

http://www.nist.gov/

8

NIST Virtual Library

http://www.nist.gov/nvl

9

NIST Digital Archive Interface

http://nistdigitalarchives.contentdm.oclc.org/

10

NIST Digital Archive Contents

http://nistdigitalarchives.contentdm.oclc.org/cdm/search/display/200/order/title/ad/asc

My Note: 9602 Items!

11

NIST Digital Archive Example

http://cdm16009.contentdm.oclc.org/cdm/compoundobject/collection/p13011coll6/id/153009/rec/1

My Note: Can Read PDF On-line, but Where Is the Data?

13

Modeling: Approaches by the Federal Big Data Working Group Meetup

• Semantic Medline:
– Semantic MEDLINE Query: mesothelioma and Data Science for VIVO

• Data Papers:
– Sepublica 2014: The Semantics for e-science in an intelligent Big Data Context
  • http://sepublica.mywikipaper.org/

• Nanopublications:
– The smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.
  • http://nanopub.org/wordpress/?page_id=65
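The nanopublication definition above (an assertion, uniquely identified, attributed to its author) can be sketched as a small data structure. This is only an illustrative sketch: the class names, the `example.org` URIs, the predicate, and the author string are all assumptions made for the example and are not part of any nanopublication specification or of the NIST work described here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Assertion:
    """A single subject-predicate-object statement."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Nanopublication:
    """An assertion plus the identifier and attribution that make it citable."""
    uri: str              # unique identifier (hypothetical example.org URI)
    assertion: Assertion
    author: str           # who the assertion is attributed to
    created: date


def is_publishable(pub: Nanopublication) -> bool:
    """Minimal check: the unit is uniquely identified and attributed."""
    return bool(pub.uri.strip()) and bool(pub.author.strip())


# Illustrative instance; every value below is a placeholder.
example = Nanopublication(
    uri="http://example.org/np/1",
    assertion=Assertion(
        subject="http://example.org/paper/microsphere-cross-sections",
        predicate="http://example.org/vocab/wavelengthRangeNm",
        obj="240-800",
    ),
    author="Example Author",
    created=date(2014, 4, 26),
)

print(is_publishable(example))
```

Keeping the assertion separate from its identifier and attribution mirrors the definition: the statement itself is small, and the surrounding fields are what make it citable.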

14

Modeling: Examples

Most Recent: 500 citations, Start Date: 01/01/1900, End Date: 11/30/2013, 3169 predications extracted. Summarized for Substance Interactions

Dr. Barend Mons: BRAIN

Dr. Tom Rindflesch: Semantic Medline

15

Evaluation and Deployment

• The Evaluation and Deployment examples for each approach are as follows:
– Semantic Knowledge Base: Web & PDF
– Selected Data Papers: PDF-to-MindTouch
  • Measurement of Scattering and Absorption Cross Sections of Microspheres for Wavelengths between 240 nm and 800 nm
  • OMNIDATA and the Computerization of Scientific Data
– Nanopublication: Extracts from the Data Papers-to-Excel

• My Note: Still need the NIST raw data sources to re-create the figures in the publications.
– I have been promised that NIST is going to publish their data sets as part of the Open Government Data Initiative.

16

How was the data collected?

http://semanticommunity.info/Data_Science/NIST_Scientific_Data_for_Data_Science

My Note: Unstructured Information to Structured Data, Including the Two PDF Papers, with Well-defined URLs According to the SEMIC.EU Standards.

17

Where is the unstructured and structured data stored?

http://semanticommunity.info/@api/deki/files/28860/NISTDataScience.xlsx

• Web and PDF
• Footnote and References
• Metadata and Data Sources
• Well-defined URLs for Linked Data
• Relational and Graph
• Ready for NodeXL & Spotfire

18

What are the results?: NIST Scientific Data Knowledge Base Visualization

My Note: Sections with Many Reference Links Can be Very Important!
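One way to make the note above concrete is to treat each (section, reference link) pair as an edge and rank sections by how many distinct links they carry. The sketch below assumes the knowledge base has been exported as such pairs; the section names and URLs are placeholders, and this is not the NodeXL or Spotfire workflow used for the actual visualization.

```python
from collections import Counter

# Hypothetical edge list: (section title, URL referenced from that section).
rows = [
    ("Section A", "http://example.org/ref/1"),
    ("Section A", "http://example.org/ref/2"),
    ("Section B", "http://example.org/ref/2"),
    ("Section A", "http://example.org/ref/3"),
    ("Section A", "http://example.org/ref/1"),  # duplicate, counted once
]


def link_counts(edges):
    """Count distinct reference links per section (out-degree in the graph)."""
    distinct = set(edges)
    return Counter(section for section, _ in distinct)


def rank_sections(edges):
    """Sort sections by link count, highest first, ties broken by name."""
    return sorted(link_counts(edges).items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_sections(rows)
for section, count in ranking:
    print(section, count)
```

The same edge list is the relational form of the graph, so it could be loaded unchanged into a graph tool for traversal.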

19

What are the results?: NIST Digital Archives Century of Excellence

My Note: The Featured Seminal Data Paper is the 60th out of 106 Which I Found from Doing the Index Below!

20

What are the results?: NIST Digital Archives

My Note: The NIST Digital Archive Can be an Interface to Data Papers with Data Tables and Interactive Visualizations. This Work Can be Used to Prioritize the Additional Work and Reduce Duplication.

21

What are the results?: NIST Library Catalog Search for Data

My Note: This Was a Test for Searching the Catalog for “data” and Converting the Results to a Spreadsheet (20 of 259). There is Also the Need to Search for Data Tables Within the Individual Publications.
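The search-then-convert step in the note above can be sketched as a keyword filter over catalog records followed by a CSV export that a spreadsheet can open. This does not query the real NIST catalog: the record ids are invented, the record layout is an assumption, and only two of the titles come from this deck.

```python
import csv
import io

# Hypothetical catalog records; ids are placeholders.
records = [
    {"id": "rec-1", "title": "OMNIDATA and the Computerization of Scientific Data"},
    {"id": "rec-2", "title": "Measurement of Scattering and Absorption Cross "
                             "Sections of Microspheres for Wavelengths between "
                             "240 nm and 800 nm"},
    {"id": "rec-3", "title": "Example Data Tables for Calibration"},
]


def search(catalog, term):
    """Return records whose title contains the term, ignoring case."""
    needle = term.lower()
    return [rec for rec in catalog if needle in rec["title"].lower()]


def to_csv(rows):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "title"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


matches = search(records, "data")
csv_text = to_csv(matches)
print(csv_text)
```

A title-only match like this misses data tables inside the publications themselves, which is the second need the note points out.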

22

What is our data story and product?

• Need a scientific data publishing environment that:
– Conforms to editorial policies
– Facilitates peer review
– Standardizes dissemination
– Manages references and URLs
– Promotes data publication, validation, and mining

• Semantic Community is doing that for NIST:
– More work in progress to be reported at the conference and elsewhere