![Page 1: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/1.jpg)
Chad Berkley, NCEAS
National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara
Long Term Ecological Research Network Office, University of New Mexico
University of Kansas
San Diego Supercomputer Center
Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis
http://seek.ecoinformatics.org
December 4, 2003, Edinburgh, Scotland
![Page 2: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/2.jpg)
Outline
Quick history
SEEK overview
Ecological Metadata Language
Using workflows in Ecology
Workflow editing with Kepler
Future visions
![Page 3: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/3.jpg)
History
Late 1990s – patterns noticed in the problems surrounding data synthesis at NCEAS
1999 – Michener et al. paper on ecological metadata
2000 – Knowledge Network for Biocomplexity (KNB)
  Morpho, Metacat, Ecological Metadata Language
  Some footholds into workflow creation and execution
2003 – Scientific Environment for Ecological Knowledge (SEEK) grant
  Continues the work done on the KNB grant
  Emphasis on using metadata for advanced data processing
![Page 4: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/4.jpg)
SEEK approach
General approach to specific ecological problems
Data described with adequate metadata in a grid accessible repository
Reasoning engine (ontology based) to locate and extract data and processes
Modeling system to put it all together and control execution flow
![Page 5: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/5.jpg)
SEEK Components
Ecogrid
  Analysis library
  Metadata and data repository
Semantic Mediation System
  Controlled semantic vocabulary
  Ontological discovery system
Analysis and Modeling System (Kepler)
  Workflow control system
  Utilizes resources from other components
![Page 6: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/6.jpg)
SEEK Architecture
![Page 7: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/7.jpg)
Ecological Metadata Language
Common language for archiving and transport of datasets
XML based
Designed for/by the ecological community
Describes physical and logical structure of data
Also includes project, literature and software information
SEEK will add semantic information
![Page 8: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/8.jpg)
Workflows in SEEK
In the SEEK model, data ingestion/cleaning is metadata driven (specifically with EML)
Output generation includes creating appropriate metadata
The analysis pipeline itself becomes metadata
![Page 9: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/9.jpg)
Metadata driven data ingestion
Key information needed to read and machine process a data file is in the metadata
  File descriptors (CSV, Excel, RDBMS, etc.)
  Entity (table) and Attribute (column) descriptions
    Name
    Type (integer, float, string, etc.)
    Codes (missing values, nulls, etc.)
In the future, this will include semantic typing
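As a rough illustration of metadata-driven ingestion, the sketch below reads a delimited file using nothing but a metadata description. The `METADATA` dictionary is a hypothetical, heavily simplified stand-in for the file, entity and attribute information that EML carries (real EML is XML and much richer), and the attribute names and missing-value codes are invented for the example.

```python
import csv
import io

# Hypothetical stand-in for EML metadata: a file descriptor plus one entry
# per attribute (column) giving its name, type and missing-value codes.
METADATA = {
    "delimiter": ",",
    "header_lines": 1,
    "attributes": [
        {"name": "site", "type": "string", "missing": []},
        {"name": "count", "type": "integer", "missing": ["-999"]},
        {"name": "area_m2", "type": "float", "missing": ["NA", ""]},
    ],
}

# Map the declared attribute types to Python conversion functions.
CASTS = {"string": str, "integer": int, "float": float}


def ingest(text, metadata):
    """Parse delimited text into typed records using only the metadata."""
    reader = csv.reader(io.StringIO(text), delimiter=metadata["delimiter"])
    rows = list(reader)[metadata["header_lines"]:]  # skip declared header lines
    records = []
    for row in rows:
        record = {}
        for attr, raw in zip(metadata["attributes"], row):
            value = raw.strip()
            if value in attr["missing"]:
                record[attr["name"]] = None  # declared missing-value code
            else:
                record[attr["name"]] = CASTS[attr["type"]](value)
        records.append(record)
    return records


RAW = "site,count,area_m2\nA,12,4.0\nB,-999,2.5\nC,7,NA\n"
records = ingest(RAW, METADATA)
```

The reader itself knows nothing about the columns; changing the metadata is enough to ingest a differently structured file.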
![Page 10: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/10.jpg)
Metadata revision
Metadata is revised following any transformation
Versioning of metadata and data is very important
This process results in a lineage of the data file as it has been transformed
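One way to picture this lineage is as a chain of metadata revisions, each pointing back at the version it was derived from. The sketch below is only an illustration of that idea; the `MetadataVersion` class and its fields are hypothetical and are not part of EML or of any SEEK component.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataVersion:
    """One revision of a dataset's metadata (hypothetical structure)."""
    revision: int
    description: str
    derived_from: Optional["MetadataVersion"] = None

    def revise(self, description):
        """Return a new revision that records this one as its parent."""
        return MetadataVersion(self.revision + 1, description, self)

    def lineage(self):
        """Walk back through the parents and list revisions oldest first."""
        node, steps = self, []
        while node is not None:
            steps.append(f"r{node.revision}: {node.description}")
            node = node.derived_from
        return list(reversed(steps))


raw = MetadataVersion(1, "raw field data")
cleaned = raw.revise("removed missing-value rows")
aggregated = cleaned.revise("aggregated to site means")
history = aggregated.lineage()
```

Because every transformation produces a new revision rather than overwriting the old one, the full history of the data file can be reconstructed from the latest version.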
![Page 11: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/11.jpg)
Typical ecological workflow example
Workflows can automate the integration process if data is described with adequate structured metadata
![Page 12: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/12.jpg)
Homogeneous data integration
Integration of homogeneous or mostly homogeneous data via EML metadata is relatively straightforward
![Page 13: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/13.jpg)
Heterogeneous Data integration
Integration of heterogeneous data requires much more advanced metadata and processing
  Attributes must be semantically typed
  Collection protocols must be known
  Units and measurement scale must be known
  Measurement mechanics must be known (e.g. that Density = Count/Area)
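The Density = Count/Area relationship can be sketched as a small harmonisation step: one site reports density directly, another reports a count and an area in a different unit, and both are brought to the same scale. The record layout, unit labels and the `harmonise` function are assumptions made for this example, not SEEK or EML structures.

```python
# Hypothetical conversion factors from common area units to square metres.
AREA_TO_M2 = {"m2": 1.0, "ha": 10_000.0, "km2": 1_000_000.0}


def density_per_m2(count, area, area_unit):
    """Apply Density = Count / Area after converting the area to m^2."""
    if area_unit not in AREA_TO_M2:
        raise ValueError(f"unknown area unit: {area_unit}")
    area_m2 = area * AREA_TO_M2[area_unit]
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return count / area_m2


def harmonise(record):
    """Return density per m^2 whether the record holds density or count/area."""
    if "density_per_m2" in record:
        return record["density_per_m2"]
    return density_per_m2(record["count"], record["area"], record["area_unit"])


site_a = {"density_per_m2": 3.0}                             # density reported directly
site_b = {"count": 50_000, "area": 2.0, "area_unit": "ha"}   # count over 2 hectares

densities = [harmonise(site_a), harmonise(site_b)]
```

Here the unit and the measurement relationship are hard-coded; the point of semantic typing is that a system could look both up from the metadata instead.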
![Page 14: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/14.jpg)
Semantic typing and ontologies
Label data with semantic types
Label inputs and outputs of analytical components with semantic types
Use Semantic Mediation System (SMS) to generate transformation steps
  Beware analytical constraints
Use SMS to discover relevant components
Ontology – specification of a conceptualization (a knowledge map)
[Diagram: Data, Ontology, Workflow Components]
![Page 15: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/15.jpg)
Measurement Ontology
Density is part of a larger measurement ontology
SEEK’s intent is to create one or more community-created ecological ontologies
Creates a controlled vocabulary for ecological metadata
More about this in Bertram’s talk
![Page 16: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/16.jpg)
About Kepler
Kepler is the name of the SEEK/SDM additions to the Ptolemy modeling system
Ptolemy was designed by the UC Berkeley EECS department
  Primary use is modeling EE circuits
  Free, open source, pure Java
  Flexible design
  GUI for building workflows
![Page 17: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/17.jpg)
Kepler
A Kepler model consists of linked “actors” (which correspond to workflow steps)
Timing is controlled by a “director”
All actors are written in Java but can call other applications (such as SAS and MATLAB) or native-language code via JNI
Actors can call arbitrary Web (or Grid) Services
Ptolemy already has a very large inventory of actors
Easy to use, drag ‘n drop interface
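The actor and director roles can be illustrated with a toy model. This is a conceptual sketch in Python only; it does not use or imitate the Ptolemy/Kepler Java API, and the `Actor` and `SequentialDirector` classes are hypothetical names chosen for the example.

```python
class Actor:
    """A workflow step: takes one input value and produces one output value."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.downstream = []

    def link(self, other):
        """Connect this actor's output to another actor's input."""
        self.downstream.append(other)
        return other  # returned so that links can be chained


class SequentialDirector:
    """Controls when actors fire; here each actor fires once, following the links."""

    def run(self, source, value):
        trace = []
        self._fire(source, value, trace)
        return trace

    def _fire(self, actor, value, trace):
        result = actor.func(value)
        trace.append((actor.name, result))
        for nxt in actor.downstream:
            self._fire(nxt, result, trace)


read = Actor("read", lambda _: [4, 8, 15])
mean = Actor("mean", lambda xs: sum(xs) / len(xs))
report = Actor("report", lambda m: f"mean={m:.1f}")
read.link(mean).link(report)

trace = SequentialDirector().run(read, None)
```

Keeping the firing policy in the director rather than in the actors means the same linked actors could be run under a different scheduling scheme without being rewritten.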
![Page 18: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/18.jpg)
SEEK Contributions to Kepler (so far)
EML data ingestion actor
Actor design tool
![Page 19: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/19.jpg)
EML data ingestion actor
Ingests any data format described by EML metadata
Converts raw data to Kepler format
  Data can then be operated on with other actors
Produces one output port for each attribute in the dataset
  Individual attributes can then be mapped to other actors
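The "one output port per attribute" idea amounts to turning row-oriented data into one stream per column. The sketch below shows that reshaping in plain Python; `to_ports` and the attribute names are hypothetical, and the dictionary of lists merely stands in for real Kepler ports.

```python
def to_ports(attribute_names, rows):
    """Split row-oriented data into one value stream ("port") per attribute."""
    ports = {name: [] for name in attribute_names}
    for row in rows:
        if len(row) != len(attribute_names):
            raise ValueError("row length does not match the attribute list")
        for name, value in zip(attribute_names, row):
            ports[name].append(value)
    return ports


ports = to_ports(
    ["site", "count", "area_m2"],
    [("A", 12, 4.0), ("B", 7, 2.5)],
)

# Individual ports can now be wired to a downstream step, e.g. a density step
# that only needs the "count" and "area_m2" streams.
densities = [c / a for c, a in zip(ports["count"], ports["area_m2"])]
```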
![Page 20: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/20.jpg)
Ptolemy model with EML ingestion actor
![Page 21: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/21.jpg)
SEEK Contributions to Kepler (so far)
EML data ingestion actor
Actor design tool
![Page 22: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/22.jpg)
Actor design tool
Allows “place-holder” actors to be defined on the fly by non-programmers during workflow creation
Domain scientists can thereby create workflows without programming knowledge
Workflows created with these actors can be executed once their functionality is implemented by a programmer
Allows quick prototyping of workflows by domain scientists
“Place-holder” actors can still be linked to other working actors
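A place-holder actor can be thought of as a named step with declared inputs and outputs but no behaviour yet. The following sketch illustrates the idea only; `PlaceholderActor` is a hypothetical class, not the actual Kepler actor design tool, and the species-richness step is an invented example.

```python
class PlaceholderActor:
    """A workflow step declared by name and ports, implemented later."""

    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = list(inputs)
        self.outputs = list(outputs)
        self._impl = None  # filled in later by a programmer

    @property
    def implemented(self):
        return self._impl is not None

    def implement(self, func):
        """Attach the real functionality to the place-holder."""
        self._impl = func

    def fire(self, **kwargs):
        if self._impl is None:
            raise NotImplementedError(f"{self.name} has no implementation yet")
        missing = set(self.inputs) - set(kwargs)
        if missing:
            raise ValueError(f"missing inputs: {sorted(missing)}")
        return self._impl(**kwargs)


# A domain scientist declares the step while sketching the workflow.
step = PlaceholderActor("species_richness", inputs=["counts"], outputs=["richness"])
try:
    step.fire(counts=[3, 0, 5])
    ran_before = True
except NotImplementedError:
    ran_before = False  # the workflow cannot execute this step yet

# Later, a programmer supplies the functionality and the step becomes runnable.
step.implement(lambda counts: sum(1 for c in counts if c > 0))
richness = step.fire(counts=[3, 0, 5])
```

The declared inputs and outputs are what allow the place-holder to be linked to working actors before any code exists behind it.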
![Page 23: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/23.jpg)
Ptolemy and dynamically created actor
![Page 24: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/24.jpg)
How domain scientists will benefit
More fully automated integration systems
A library of pre-defined analytical processes which can be executed on heterogeneous data
Semantic data discovery and processing
Automated unit and measurement scale conversions
A fuller understanding of cross site research implications
![Page 25: Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.](https://reader030.fdocuments.net/reader030/viewer/2022033100/56649ede5503460f94beefba/html5/thumbnails/25.jpg)
Acknowledgements
This material is based upon work supported by:
The National Science Foundation under Grant Numbers 9980154, 9904777, and 0225676 to NCEAS and its collaborators.
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
Primary Collaborators: University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)
More info: http://seek.ecoinformatics.org
Questions? IRC: irc.ecoinformatics.org #seek