Semantics Technology Demonstration Ecoinformatics International Technical Collaboration
April 9, 2008
Research Triangle Park, North Carolina, USA
Bruce Bargmeyer
Lawrence Berkeley National Laboratory and University of California, Berkeley
Tel: +1 [email protected]
Topics
Describe challenges to be addressed
Describe the demo scenarios
Describe the initial demo
Describe the technology/infrastructure
Discuss Collaboration
Challenge: Access Dispersed Data. Convey Common Understanding of meaning between Data Creators and Data Users
[Figure: data creators (EEA, USGS, DoD, EPA, others . . .) and users exchange text and data through information systems. Each system files the same numeric data under its own topic labels, some in English (environ, agriculture, climate, human health, industry, tourism, soil, water, air) and some in Spanish (ambiente, agricultura, tiempo, salud humana, industria, turismo, tierra, agua, aire).]
A common interpretation of what the data represents
Challenge: Combine Data, Metadata & Concept Systems
Data:
ID  Date      Temp  Hg
A   06-09-13  4.4   4
B   06-09-13  9.3   2
X   06-09-13  6.7   78

Metadata:
Name  Datatype  Definition                      Units
ID    text      Monitoring Station Identifier   not applicable
Date  date      Date                            yy-mm-dd
Temp  number    Temperature (to 0.1 degree C)   degrees Celsius
Hg    number    Mercury contamination           micrograms per liter

Concept system:
Contamination
    Biological
    Chemical
        lead
        cadmium
        mercury
    Radioactive

Inference Search Query: “find water bodies downstream from Fletcher Creek where chemical contamination was over 10 micrograms per liter between December 2001 and March 2003”
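A minimal Python sketch of how the data, metadata, and concept system on this slide might be combined to answer part of such a query. The dictionary layouts and the names CONCEPT_CHILDREN, COLUMN_CONCEPT, narrower_concepts, and stations_over are assumptions made for illustration; the slide does not say how XMDR links a column to a concept. The "downstream" and date-range parts of the query are left out because the sample carries no hydrography or matching dates.

```python
# Data rows from the slide (station ID, date, temperature, mercury).
DATA = [
    {"ID": "A", "Date": "06-09-13", "Temp": 4.4, "Hg": 4},
    {"ID": "B", "Date": "06-09-13", "Temp": 9.3, "Hg": 2},
    {"ID": "X", "Date": "06-09-13", "Temp": 6.7, "Hg": 78},
]

# Metadata from the slide: definition and units per column.
METADATA = {
    "Temp": {"definition": "Temperature", "units": "degrees Celsius"},
    "Hg": {"definition": "Mercury contamination", "units": "micrograms per liter"},
}

# Concept system from the slide, stored as parent -> children.
CONCEPT_CHILDREN = {
    "Contamination": ["Biological", "Chemical", "Radioactive"],
    "Chemical": ["lead", "cadmium", "mercury"],
}

# Assumption: a registry records which concept each data column measures.
COLUMN_CONCEPT = {"Hg": "mercury"}


def narrower_concepts(concept):
    """Return the concept and everything below it in the concept system."""
    found = [concept]
    for child in CONCEPT_CHILDREN.get(concept, []):
        found.extend(narrower_concepts(child))
    return found


def stations_over(concept, threshold, units):
    """Find station IDs where a column under `concept` exceeds `threshold`."""
    wanted = set(narrower_concepts(concept))
    hits = []
    for column, column_concept in COLUMN_CONCEPT.items():
        if column_concept not in wanted:
            continue
        if METADATA[column]["units"] != units:
            continue  # a real mediator would convert units instead of skipping
        hits.extend(row["ID"] for row in DATA if row[column] > threshold)
    return hits


print(stations_over("Chemical", 10, "micrograms per liter"))
```

Searching on the broad concept "Chemical" reaches station X through the narrower concept "mercury", even though the query never mentions Hg.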
Challenge: Use data from systems that record the same facts with different terms
Reduce the human toil of drawing information together and performing analysis by shifting that work to computer processing.
Same Fact, Different Terms

Data Element Concept
Name: Country Identifiers
Context:
Definition:
Unique ID: 5769
Conceptual Domain:
Maintenance Org.:
Steward:
Classification:
Registration Authority:
Others

Algeria, Belgium, China, Denmark, Egypt, France, . . ., Zimbabwe

Data Elements
Name:
Context:
Definition:
Unique ID: 4572
Value Domain:
Maintenance Org.:
Steward:
Classification:
Registration Authority:
Others

ISO 3166 English Name  ISO 3166 French Name  ISO 3166 2-Alpha Code  ISO 3166 3-Alpha Code  ISO 3166 3-Numeric Code
Algeria                L'Algérie             DZ                     DZA                    012
Belgium                Belgique              BE                     BEL                    056
China                  Chine                 CN                     CHN                    156
Denmark                Danemark              DK                     DNK                    208
Egypt                  Egypte                EG                     EGY                    818
France                 La France             FR                     FRA                    250
. . .                  . . .                 . . .                  . . .                  . . .
Zimbabwe               Zimbabwe              ZW                     ZWE                    716
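A small Python sketch of the idea on this slide: several data elements (names, alpha codes, numeric codes) are different terms for the same fact, so any one of them can be translated into any other once they are tied to a shared meaning. The field names and the build_index and translate helpers are hypothetical, and only the rows shown on the slide are included.

```python
# One row per country; the columns are the five value domains from the slide.
FIELDS = ("english_name", "french_name", "alpha2", "alpha3", "numeric3")
ROWS = [
    ("Algeria", "L'Algérie", "DZ", "DZA", "012"),
    ("Belgium", "Belgique", "BE", "BEL", "056"),
    ("China", "Chine", "CN", "CHN", "156"),
    ("Denmark", "Danemark", "DK", "DNK", "208"),
    ("Egypt", "Egypte", "EG", "EGY", "818"),
    ("France", "La France", "FR", "FRA", "250"),
    ("Zimbabwe", "Zimbabwe", "ZW", "ZWE", "716"),
]


def build_index(rows):
    """Map every permissible value (any column) to its shared country record."""
    index = {}
    for row in rows:
        record = dict(zip(FIELDS, row))
        for value in row:
            index[value.casefold()] = record
    return index


INDEX = build_index(ROWS)


def translate(value, target_field):
    """Translate a term from any value domain into the requested one."""
    return INDEX[value.casefold()][target_field]


print(translate("DZA", "numeric3"))
print(translate("Belgique", "alpha2"))
```

A flat lookup like this works only because no value on the slide is shared by two countries; a registry would instead key each value by the data element it belongs to.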
Demo with Microsoft eScience
Collaborate with Microsoft Research, San Francisco Office
Collaboration already ongoing with LBNL and UCB, Berkeley Water Center.
Somewhat like Hydroseek, but with XMDR for concept systems and metadata
Hydroseek accesses EPA STORET and USGS NWIS (Water Data)
Scenarios
Scenario 1 – Semantics enabled data access
Semantics enabled access to data and metadata that may serve as an indicator (or as input to a more complex indicator)
Scenario 2 – Data harmonization
People from different states or countries (political jurisdictions) are interested in water quality. They want to develop a particular indicator of interest based on data that crosses political jurisdictions.
Scenarios (Continued)
Scenario 3 – Simulation models
Use XMDR to document parameters (input data, output data, initialization parameters, etc.) for water, air, and subsurface models, so as to support remote simulation model integration. If a user puts a box around some geography, they can see whether models have been run for that area.
Scenario 1
Semantics enabled access to data and metadata that may serve as an indicator (or as input to a more complex indicator)
Person uses concept systems to find variables of interest, accesses the data for the variables, and views metadata describing the data.
Use concept system to identify possible variables that have data for a specific time and geographic coverage.
Use concept system to create query to access data from multiple sources.
Access/obtain the data.
System performs mediation of results from different result sets (simple transformations based on information in metadata registry).
Display data with links to metadata.
User can go to metadata to better understand the data, e.g., provenance, measurement units, collection methods.
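The mediation step could look roughly like the Python sketch below: result sets from two sources are merged after a simple unit transformation driven by registry metadata. The source names SRC_A and SRC_B, the registry layout, and the mediate function are all hypothetical; they stand in for whatever the metadata registry actually records.

```python
# Hypothetical registry entries: concept and units recorded per source variable.
REGISTRY = {
    "SRC_A:Hg": {"concept": "mercury", "units": "micrograms per liter"},
    "SRC_B:MERCURY": {"concept": "mercury", "units": "milligrams per liter"},
}

# Conversion factors into the common unit, micrograms per liter.
TO_MICROGRAMS_PER_LITER = {
    "micrograms per liter": 1.0,
    "milligrams per liter": 1000.0,
}


def mediate(result_sets):
    """Merge result sets from several sources into one list in common units."""
    merged = []
    for variable, values in result_sets.items():
        entry = REGISTRY[variable]
        factor = TO_MICROGRAMS_PER_LITER[entry["units"]]
        for site, value in values:
            merged.append({
                "site": site,
                "concept": entry["concept"],
                "value": value * factor,
                "units": "micrograms per liter",
                "metadata": variable,  # link back to the registry entry
            })
    return merged


results = mediate({
    "SRC_A:Hg": [("A", 4.0)],
    "SRC_B:MERCURY": [("Q", 0.5)],
})
for row in results:
    print(row["site"], row["value"], row["units"], "see", row["metadata"])
```

Each merged row keeps a pointer to its registry entry, which is what lets the display link data back to provenance, units, and collection methods.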
Scenario 1
Use combination of XMDR and Hydroseek-like software
XMDR holds the concept system(s) and metadata
Hydroseek-like software interacts with user, accesses data, and displays results.
Mediation tool is separate from XMDR, but draws on metadata from XMDR.
Also need what is necessary to interact with the external data source (e.g., screen scraping, database access).
Bora currently has a concept system that serves as the global ontology for variables in ~25 systems, e.g., STORET and NWIS. He used the USGS water words dictionary.
Hydroseek
Hydroseek is an ontology-aided search engine for finding scientific data on water quality and hydrology from approximately 1.9 million sites in the USA. Hydroseek creates a unified view over databases of agencies such as the US Geological Survey and the Environmental Protection Agency.
It helps researchers remove semantic, syntactic, and information system heterogeneity barriers, improves the search experience, and reduces time spent on data discovery and preparation prior to processing. Depending on the method of interaction (GUI or web services) and the function invoked, output can be provided as CUAHSI WaterML, Geography Markup Language features, or Microsoft Excel.
The system uses the Microsoft Virtual Earth map interface, with OWL ontologies providing the knowledge base used to supply auto-complete keywords and to classify search results.
Hydroseek follows a Service-Oriented Architecture (SOA), and most functionality is available via SOAP web services. The system also supports queries using NASA's Global Change Master Directory (GCMD) keywords via web services.
Linking Concept Systems to Data
[Figure: web page of a tagging application. Recoverable menu entries: Tagging Application, Admin Interface, User Management Console, Registration, Help & Credits.]
Little Demo
A little demo to show that what we talked about can be done with XMDR.
Use latitudes and longitudes for 3-4 sites and what they measure.
A small ontology with 5-6 concepts and two ontologies for data sources (let’s say USGS and EPA) with 5-6 variables (variable = name of what is being measured, i.e., parameter name), measurement method metadata, etc.
The idea is to show how these (including mappings between variables) can be stored in XMDR and how they can be discovered.
So it is a matter of getting them into XMDR and putting together some sort of web interface that takes a keyword and returns a list of sites, relevant measurements, etc.
This will be done with samples from the US at first and then with content from JRC and WISE.
Sample Data
Sites
Site Code      Site Name                                      Latitude     Longitude     Elevation  Vertical Datum  State    County
NWIS:10038000  BEAR RIVER BLW SMITHS FORK, NR COKEVILLE, WY   42.1266021   -110.973243   1872       NGVD29          Wyoming  Lincoln
NWIS:10075000  BEAR RIVER AT SODA SPRINGS, ID                 42.61381026  -111.583556   1756.1     NGVD29          Idaho    Caribou
NWIS:10020100  BEAR RIVER ABOVE RESERVOIR, NEAR WOODRUFF, UT  41.4343899   -111.0176863  1968       NGVD29          Wyoming  Uinta

Observations Catalog
Site Code      Variable Code  BeginDate  EndDate   Observation count
NWIS:10038000  NWIS:00038     1/1/1956   1/1/2008  581
NWIS:10038000  NWIS:00038     1/1/1980   1/1/2007  1254
NWIS:10075000  NWIS:00038     2/5/1969   4/8/2001  65
NWIS:10075000  NWIS:00038     1/2/1956   4/6/1990  47
NWIS:10020100  EPA:12356-1    3/3/1920   4/3/1945  16
NWIS:10020100  EPA:12356-1    2/5/1969   4/6/1990  9

Variables
Variable Code  Variable Name     Medium
NWIS:00038     Ammonia Nitrogen  Water
EPA:12356-1    Ammonia Nitrogen  Water
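A rough Python sketch of the keyword lookup the little demo describes, run against the sample rows above: a keyword is matched to variables, and the observations catalog is then filtered by time coverage. The find_sites function and the in-memory layout are assumptions for illustration, and the dates are read as month/day/year, which the slide does not state.

```python
from datetime import date

# Site code -> site name, from the Sites table.
SITES = {
    "NWIS:10038000": "BEAR RIVER BLW SMITHS FORK, NR COKEVILLE, WY",
    "NWIS:10075000": "BEAR RIVER AT SODA SPRINGS, ID",
    "NWIS:10020100": "BEAR RIVER ABOVE RESERVOIR, NEAR WOODRUFF, UT",
}

# Variable code -> variable name, from the Variables table.
VARIABLES = {
    "NWIS:00038": "Ammonia Nitrogen",
    "EPA:12356-1": "Ammonia Nitrogen",
}

# (site code, variable code, begin date, end date, observation count)
CATALOG = [
    ("NWIS:10038000", "NWIS:00038", date(1956, 1, 1), date(2008, 1, 1), 581),
    ("NWIS:10038000", "NWIS:00038", date(1980, 1, 1), date(2007, 1, 1), 1254),
    ("NWIS:10075000", "NWIS:00038", date(1969, 2, 5), date(2001, 4, 8), 65),
    ("NWIS:10075000", "NWIS:00038", date(1956, 1, 2), date(1990, 4, 6), 47),
    ("NWIS:10020100", "EPA:12356-1", date(1920, 3, 3), date(1945, 4, 3), 16),
    ("NWIS:10020100", "EPA:12356-1", date(1969, 2, 5), date(1990, 4, 6), 9),
]


def find_sites(keyword, start, end):
    """Return {site code: summed observation count} over catalog series whose
    variable name contains the keyword and whose date range overlaps
    [start, end]. Counts cover each whole series, not just the window."""
    codes = {
        code for code, name in VARIABLES.items()
        if keyword.casefold() in name.casefold()
    }
    totals = {}
    for site, variable, begin, finish, count in CATALOG:
        if variable in codes and begin <= end and finish >= start:
            totals[site] = totals.get(site, 0) + count
    return totals


for code, total in find_sites("ammonia", date(2000, 1, 1), date(2005, 12, 31)).items():
    print(code, SITES[code], total)
```

Because both the NWIS and the EPA variable carry the name "Ammonia Nitrogen", one keyword reaches series recorded under either code.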
Small Concept System
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:xm="http://www.xmdr.org/">
  <xm:concept rdf:about="http://www.xmdr.org/ammoniaNitrogen" dc:title="Ammonia Nitrogen">
    <xm:narrowerThan>
      <xm:concept rdf:about="http://www.xmdr.org/nitrogen" dc:title="Nitrogen"/>
    </xm:narrowerThan>
    <xm:hasMedium>
      <xm:medium rdf:about="http://www.xmdr.org/water"/>
    </xm:hasMedium>
  </xm:concept>
</rdf:RDF>
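As an illustration of how a client might read this small concept system, here is a Python sketch using only the standard library. The XML is repeated so the block is self-contained, with the Dublin Core namespace declared for the dc: prefix so that it parses; the read_concepts function and its output layout are hypothetical, not part of XMDR.

```python
import xml.etree.ElementTree as ET

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:xm="http://www.xmdr.org/">
  <xm:concept rdf:about="http://www.xmdr.org/ammoniaNitrogen" dc:title="Ammonia Nitrogen">
    <xm:narrowerThan>
      <xm:concept rdf:about="http://www.xmdr.org/nitrogen" dc:title="Nitrogen"/>
    </xm:narrowerThan>
    <xm:hasMedium>
      <xm:medium rdf:about="http://www.xmdr.org/water"/>
    </xm:hasMedium>
  </xm:concept>
</rdf:RDF>"""

# Namespace prefixes in the {uri}name form that ElementTree uses.
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
DC = "{http://purl.org/dc/elements/1.1/}"
XM = "{http://www.xmdr.org/}"


def read_concepts(xml_text):
    """Return {concept URI: title, broader concept URIs, medium URIs}
    for each top-level concept in the document."""
    root = ET.fromstring(xml_text)
    concepts = {}
    for concept in root.findall(XM + "concept"):
        broader = [
            c.get(RDF + "about")
            for c in concept.findall(XM + "narrowerThan/" + XM + "concept")
        ]
        media = [
            m.get(RDF + "about")
            for m in concept.findall(XM + "hasMedium/" + XM + "medium")
        ]
        concepts[concept.get(RDF + "about")] = {
            "title": concept.get(DC + "title"),
            "narrower_than": broader,
            "media": media,
        }
    return concepts


for uri, info in read_concepts(RDF_XML).items():
    print(uri, info)
```

A plain XML parser is enough for a fragment this regular; general RDF/XML allows many equivalent serializations, so a real loader would more likely go through an RDF toolkit.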
Technology Overview
[Figure: technology overview diagram. Recoverable labels: "XMDR" (three boxes) and "Metadata: Provenance, etc."]
Adapted from a slide from Bora Beran
Modular XMDR Architecture
[Figure: architecture diagram. Recoverable components:]
USERS: Web Browsers, Client Software
Human User Interface (HTML from JSP and JavaScript; Exhibit)
Application Program Interface (REST)
Authentication Service
Search & Content Serving (Jena, Lucene)
Mapping Engine
Registry Store: standard XMDR files, Logic Index, Text Index, Postgres Database
Logic Indexer (Jena & Pellet)
Text Indexer (Lucene)
Validation (XML Schema)
Content Loading & Transformation (LexGrid & custom)
Metadata Sources: concept systems, data elements
XMDR metamodel (OWL & XML Schema)
Metamodel specs (UML & Editing) (Poseidon, Protege)
XMDR data model & exchange format: XML, RDF, OWL
Third Party Software
Video and Discussion
View Video
Discuss
Acknowledgements
John McCarthy, LBNL
Kevin Keck, LBNL
Bora Beran, Microsoft Research
This material is based upon work supported by the National Science Foundation under Grant No. 0637122, USEPA and USDOD. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, USEPA or USDOD.