Federal Big Data Working Group Meetup


Dr. Brand Niemann
Director and Senior Data Scientist

Semantic Community
http://semanticommunity.info/

http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

June 30, 2014


Mission Statement
• Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies;
• Big Data: Supports the Federal Digital Government Strategy, which is "treating all content as data", so big data = all your content;
• Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and
• Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House.

Co-organizers: Brand Niemann and Katherine Goodier


What Are We Doing?
• Leadership of the Semantic Data Science Team that produced Semantic Medline running on the YarcData Graph Appliance.
• Founding and co-organizing of the Federal Big Data Working Group Meetup.
• A graduate class prepared for GMU entitled “Practical Data Science for Data Scientists”.
• Using the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000) to build a Data Science Knowledge Base.
• Mining of the Data Science and Digital Earth scientific journals for the CODATA International Workshop on Big Data for International Scientific Programmes (June 8-9, in Beijing).
• Participation in the Data FAIRport (Findable, Accessible, Interoperable, and Reusable) with “Data Publication in Data Browsers”.
• Providing data stories that persuade and presentation materials for public education conferences like the COM.BigData Conference (August 4-6, in Washington, DC).


How Are We Doing It?
• Federating Use Cases: Data Science (Brand Niemann); Environmental and Earth Science (Joan Aron); and Astronomy (Kirk Borne)
• Federating Data Publications: Structured Scientific Content (Papers, journals, books, reports, etc.); Data FAIRports (Findable, Accessible, Interoperable); and Reusable Data Stories That Persuade (Claims and Evidence)
• Federating Solutions & Technologies: Hand-Crafted by Individuals and Teams (Mary Galvin, STEM); Data Mining Standards and Products (Brand Niemann, Data Publications in Data Browsers); Machine Processing (Fredrik Salvesen, Semantic Data Publications on YarcData Graph Appliance); Reading and Reasoning (Katherine Goodier and Chuck Rehberg, Semantic Insights on Elsevier Content Text Mining); and Data Curation at Scale (Alan Wagner, Tamr on 1000s of Spreadsheets)


Data FAIRport

http://datafairport.org/
http://semanticommunity.info/Data_Science/Euretos_BRAIN

Final Report, Interview, and Joint Hackathons Started


June 2nd Meetup: Ontology Summit 2014 Postmortem and Reading & Reasoning with Semantic Insights

• How Was the Meetup?
  – Meeting was very good.
  – This is a smart group. Thank you.
• We Listen and Respond:
  – Visual Document Mining. Simple clustering applied to text (with no ontology).
  – SIRA is much more advanced, but you might like to watch the 4 minute video.
    • https://www.overviewproject.org/

http://www.meetup.com/Federal-Big-Data-Working-Group/events/186838842/


The Overview Project





Fourth Paradigm and Fourth Question

• The Fourth Paradigm of Science (1):
  – First Paradigm. Observation, descriptions of natural phenomena, and experimentation.
  – Second Paradigm. Theoretical science such as Newton’s laws of motion and Maxwell’s equations.
  – Third Paradigm. Simulation and modelling, such as in astronomy.
  – Fourth Paradigm. Data-intensive science that exploits the large volumes of data in new ways for scientific exploration, such as the International Virtual Observatory Alliance in astronomy.
• The Fourth Question of Big Data for Science (2):
  – How was the data collected?
  – Where is the data stored?
  – What are the data results?
  – Does the data story persuade?

(1) Bell, G., Hey, T., & Szalay, A. (2009) Beyond the data deluge, Science 323, 6 March 2009, pp. 1297-1298.
(2) de Waard, Anita (2014) About Stories, that Persuade With Data, Federal Big Data Working Group Meetup, 20 May, 41 slides.


Activities
• Mentoring:
  – White House Energy Datapalooza, May 28 (In process with Alexandra Winkler, Knowledge Cities Graduate Student)
• Health Datapalooza V, June 1-3, and HHS Fellowship:
  – Story and Application for HHS 12-month External Entrepreneur Fellowship for Innovative Design, Development and Linkages of Databases
• Big Data for Government, June 16-17:
  – Keynote from Dr. George Strawn and Presentation by Dr. Tom Rindflesch and Semantic Medline/YarcData Team
• Improving Government Performance in the Era of Big Data: Opportunities and Challenges for Federal Agencies, June 19:
  – A big data workshop hosted by White House Office of Science and Technology Policy and the Georgetown University McCourt School of Public Policy’s Massive Data Institute with Mary Galvin
• EarthCube All-Hands Meeting, June 24-26:
  – ESIP Earth Science Analytics with Joan Aron, Global Environmental/Climate Change Scientist
• Keynote and Panel: COM.BigData 2014, August 4-6:
  – Katherine Goodier, Organizer and Moderator, with Joan Aron, Mary Galvin, Chuck Rehberg, Tom Rindflesch, and Kirk Borne


EarthCube Data Science Publications

http://workspace.earthcube.org/earthcube-data-science-publications
http://semanticommunity.info/Data_Science/EarthCube_Data_Science_Publications





Keynote and Panel: COM.BigData 2014

http://www.com-geo.org/conferences/2014/prog_keynotes.htm




Next Meetup: July 7
Data Science of White House Big Data Review and Brooke Aker: Big Data Lens on OpenFDA

• Katherine Goodier:
  – Legislative Data and Transparency Conference and Use Case on Privacy and Security
• Mary Galvin:
  – Improving Government Performance in the Era of Big Data: Opportunities and Challenges for Federal Agencies
• Brooke Aker:
  – Background
    • Working on data analytics since 1987, when I did my first regression analysis on surplus government cheese!! Now working on healthcare and security predictive analytics and machine learning.
  – Networking and Agenda with Announcements, Presentations, Training, and Demos
    • Here is a nice method to use if you are seeking to understand new technology, its applicability and readiness for use. It is also emblematic of good Big Data practice - turning a large, free information resource into something valuable with simple straightforward thinking and driven by sophisticated software. Enjoy. http://www.bigdatalens.com/blog/2014...ta-methodology
  – Participation in other Meetups
    • Lots of other Big Data Meetup Groups. Was at the Data Salon in Cambridge Mass last night!!
  – Do you live near a DC Metro Station and use Skype?
    • Use Skype for sure

https://cha.house.gov/2014-legislative-data-and-transparency-conference

mailto:/@api/deki/files/28996/XcelerateFederalBigDatatheUseCaseforCognitiveMetadata.pptx


http://mspp.georgetown.edu/events/big-data-and-federal-agencies


http://www.bigdatalens.com/blog/2014/6/4/technology-readiness-level-excellent-big-data-methodology


June 30th Meetup: Continue Data Science Tutorial

• Practical Data Science for Data Scientists:
  – Reading Assignments:
    • Chapter 15: The Students Speak
      – We invited the students who took Introduction to Data Science version 1.0 to contribute a chapter to the book. They chose to use their chapter to reflect on the course and describe how they experienced it.
    • Chapter 16: Next-Generation Data Scientists, Hubris, and Ethics
      – “The best minds of my generation are thinking about how to make people click ads… That sucks.” (Jeff Hammerbacher)
      – We’d like to encourage the next-gen data scientists to become problem solvers and question askers, to think deeply about appropriate design and process, and to use data responsibly and make the world better, not worse.
  – Resources: AmericasDataFest Competition
  – Team Homework Exercise:
    • Study Graph Databases, Graph Computing, and Semantic Medline
    • Review Wiki and View Videos: YarcData Videos (Schizo-7 minutes, Cancer-21 minutes).
    • Ask Me Questions and Prepare to Ask Questions Next Week

http://semanticommunity.info/AmericasDataFest



http://semanticommunity.info/Data_Science/Graph_Databases

http://semanticommunity.info/Data_Science/Bigdata_SYSTAP_Literature_Survey_of_Graph_Databases

http://semanticommunity.info/A_NITRD_Dashboard/Semantic_Medline

http://semanticommunity.info/Data_Science/Data_Science_Symposium_2013

http://yarcdata.com/Resources/video15.php



http://www.youtube.com/watch?v=ShfI4SNzNO4

http://www.youtube.com/watch?v=6frNAmPD0mo


Practical Data Science for Data Scientists

http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists

Class 8

Providing On-Line Class With Private Tutoring


Follow Ben Shneiderman's 8 Golden Rules of Data Science

• Preparation
  – Choose actionable problems & appropriate theories
  – Consult domain experts & generalists
• Exploration
  – Examine data in isolation & contextually
  – Keep cleaning & add related data
  – Apply visualizations & statistical patterns, clusters, gaps, outliers, missing & uncertain data
• Decision
  – Evaluate your efficacy, refine your theory
  – Take responsibility, own your failures
  – World is complex, proceed with humility

Source: "8 Golden Rules of Data Science"
http://semanticommunity.info/Data_Science/Ben_Shneiderman

https://twitter.com/SethGrimes/status/398539779807002624/photo/1


Agenda
• MIT Big Data Initiative: Sam Madden & Current Elephants: Michael Stonebraker
  – Background: See Workshops on Extremely Large Databases
  – 6:30 pm Welcome and Introduction
  – 6:35 pm Laura Keilson - Xcelerate Solutions is hiring!
  – 6:40 pm MIT Big Data Initiative: bigdata@CSAIL and the new Intel Science and Technology Center for Big Data, Sam Madden
  – 7:10 pm Brief Member Introductions
  – 7:15 pm Alan Wagner, Tamr Demo
  – 7:30 pm Why the current "elephants" are good at nothing, Data Tamer, and data integration issues, Michael Stonebraker
  – 8:30 pm Open Discussion
  – 8:45 pm Networking
  – 9:00 pm Depart
• July 7 and August 4: Once a month
  – Silver Line Spring Hill Metro Station Opens July 26th

http://semanticommunity.info/Data_Science/Big_Data_Science_for_CODATA/Workshops_on_Extremely_Large_Databases

http://bigdata.csail.mit.edu/

http://istc-bigdata.org/

http://en.wikipedia.org/wiki/Samuel_Madden_(computer_scientist)

http://www.data-tamer.com/

http://www.tamr.com/

http://en.wikipedia.org/wiki/Michael_Stonebraker
