Post on 12-Jan-2016
Introduction to Apache OODT
Yang Li
Mar 9, 2012
What is OODT
• Object Oriented Data Technology
• Science data management
• Archiving Systems that span scientific disciplines
• Enable interoperability among data agnostic systems (astrophysics, planetary, space science data systems, open source web analytics)
History
• 2001
– Deployed to create a virtual specimen bank for the Early Detection Research Network (oncology)
• 2004
– Core architectural software of the Planetary Data System data distribution, deployed by NASA (planetary science)
• 2007
– Deployed for the Orbiting Carbon Observatory and SeaWinds missions (earth science)
• 2008
– Deployed for the National Polar-orbiting Operational Environmental Satellite System (atmospheric science)
Framework
• Catalog & Archive
• Utilities
• Grid
• Agility
Catalog & Archive
• Deal with large-scale ingest of data, metadata extraction of data, post-processing of data into derived and higher-order products, cataloging of data, searching of catalogs, versioning, and retrieval
• Components:
– Catalog, Crawling framework, Curation, File manager, Metadata, PCS, Push/Pull framework, Resource management, Workflow, CAS install, Web apps
Catalog
• Virtualize underlying catalogs for use in the CAS system
• Heterogeneous catalog models are mapped to a common dictionary and then integrated locally, so that a single query or ingest can span all of them
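The mapping idea above can be sketched in a few lines. This is a minimal illustration, not the OODT API: the catalog names, field names, and the `to_common` and `query_across` helpers are all hypothetical.

```python
# Hypothetical example: two catalogs describe the same concepts
# using different native field names.
COMMON_DICTIONARY = {
    "catalog_a": {"fname": "FileName", "mission": "Mission"},
    "catalog_b": {"file": "FileName", "project": "Mission"},
}

CATALOGS = {
    "catalog_a": [{"fname": "oco_001.h5", "mission": "OCO"}],
    "catalog_b": [
        {"file": "sw_042.dat", "project": "SeaWinds"},
        {"file": "oco_002.h5", "project": "OCO"},
    ],
}


def to_common(catalog_name, record):
    """Rename a record's native fields to the common dictionary terms."""
    mapping = COMMON_DICTIONARY[catalog_name]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def query_across(term, value):
    """Return records from every catalog whose common term equals value."""
    hits = []
    for name, records in CATALOGS.items():
        for record in records:
            common = to_common(name, record)
            if common.get(term) == value:
                hits.append(common)
    return hits


results = query_across("Mission", "OCO")
```

Because the query is phrased in common terms, the caller does not need to know which underlying catalog holds a matching record.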
CAS Crawler
• Standardize the common ingestion activities
– Identification of files and directories to crawl
– Satisfaction of ingestion pre-conditions
– Metadata extraction
• Ingestion
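The crawl steps listed above can be sketched as one loop. This is a simplified illustration under stated assumptions: the `crawl` function, the pre-condition, the extractor, and the dictionary used as an ingest target are hypothetical and do not reflect the actual CAS Crawler classes.

```python
import os
import tempfile


def crawl(root, preconditions, extract_metadata, ingest):
    """Walk root and push each qualifying file through the crawl steps."""
    ingested = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip files that fail any ingestion pre-condition.
            if not all(check(path) for check in preconditions):
                continue
            metadata = extract_metadata(path)
            ingest(path, metadata)
            ingested.append(name)
    return ingested


# Hypothetical pre-condition: only crawl files with a .dat extension.
def is_data_file(path):
    return path.endswith(".dat")


# Hypothetical extractor: derive metadata from the file system only.
def filename_extractor(path):
    return {
        "Filename": [os.path.basename(path)],
        "FileSize": [str(os.path.getsize(path))],
    }


catalog = {}  # stands in for the ingest target


def ingest_into_catalog(path, metadata):
    catalog[os.path.basename(path)] = metadata


with tempfile.TemporaryDirectory() as staging:
    for name, body in [("a.dat", b"123"), ("notes.txt", b"skip me")]:
        with open(os.path.join(staging, name), "wb") as handle:
            handle.write(body)
    crawled = crawl(staging, [is_data_file], filename_extractor, ingest_into_catalog)
```

Passing the pre-conditions, extractor, and ingest step in as functions keeps the loop itself generic, which is the point of standardizing these activities.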
Curation
• A web application for managing policy for products, files, and metadata that have been ingested via the CAS component
– Use a servlet container to deploy the web app
– Staging area: directories on the local machine holding data products
– Metadata generation area: create metadata files to associate with data products
File Manager
• Provides everything needed to catalog, archive, and manage files and directories, along with their associated metadata
• Separates data stores and metadata stores behind standard interfaces
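One way to picture the separation of data stores from metadata stores is a pair of small interfaces that a manager composes. This is a sketch only; the class and method names below are assumptions made for illustration and are not the File Manager's real interfaces.

```python
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Hypothetical interface for where file contents are archived."""

    @abstractmethod
    def put(self, product_id, data): ...

    @abstractmethod
    def get(self, product_id): ...


class MetadataStore(ABC):
    """Hypothetical interface for where product metadata is cataloged."""

    @abstractmethod
    def put(self, product_id, metadata): ...

    @abstractmethod
    def find(self, key, value): ...


class InMemoryDataStore(DataStore):
    def __init__(self):
        self._blobs = {}

    def put(self, product_id, data):
        self._blobs[product_id] = data

    def get(self, product_id):
        return self._blobs[product_id]


class InMemoryMetadataStore(MetadataStore):
    def __init__(self):
        self._records = {}

    def put(self, product_id, metadata):
        self._records[product_id] = metadata

    def find(self, key, value):
        # Metadata values are lists, so test for membership.
        return [pid for pid, met in self._records.items() if value in met.get(key, [])]


class FileManager:
    """Composes one data store and one metadata store; either can be swapped."""

    def __init__(self, data_store, metadata_store):
        self.data_store = data_store
        self.metadata_store = metadata_store

    def ingest(self, product_id, data, metadata):
        self.data_store.put(product_id, data)
        self.metadata_store.put(product_id, metadata)

    def retrieve(self, key, value):
        return [self.data_store.get(pid) for pid in self.metadata_store.find(key, value)]


fm = FileManager(InMemoryDataStore(), InMemoryMetadataStore())
fm.ingest("p1", b"granule bytes", {"Mission": ["OCO"]})
```

Replacing either in-memory class with, say, a disk-backed or database-backed one would not require any change to `FileManager`.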
Workflow
• Provides everything needed to execute workflows and science processing pipelines
• Separate workflow repositories and workflow engines as standard interfaces
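The split between a workflow repository (where workflow definitions live) and a workflow engine (what runs them) might look like the following. The class names, the task functions, and the "science_pipeline" workflow are hypothetical and serve only to illustrate the separation.

```python
class DictWorkflowRepository:
    """Hypothetical repository: maps a workflow name to an ordered task list."""

    def __init__(self, workflows):
        self._workflows = workflows

    def get_tasks(self, name):
        return self._workflows[name]


class SequentialEngine:
    """Hypothetical engine: runs a workflow's tasks one after another."""

    def __init__(self, repository):
        self.repository = repository

    def run(self, name, context):
        for task in self.repository.get_tasks(name):
            # Each task receives the current context and returns an updated copy.
            context = task(context)
        return context


def calibrate(ctx):
    return {**ctx, "calibrated": True}


def make_level2(ctx):
    return {**ctx, "level": 2}


repo = DictWorkflowRepository({"science_pipeline": [calibrate, make_level2]})
engine = SequentialEngine(repo)
result = engine.run("science_pipeline", {"granule": "g001", "level": 1})
```

The engine only depends on `get_tasks`, so definitions could be loaded from a different source without touching the execution logic.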
Resource Management
• Job management
– Execution, monitoring, tracking
• Underlying software system and hardware resources
– e.g. disk space, computational resources, and shared identity
Resource Management (Cont)
• Critical objects
– Job, Job Input, Job Spec, Job Instance, Resource Node
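To make the relationship between some of these objects concrete, here is a rough sketch covering Job, Job Spec, and Resource Node (Job Input is reduced to a plain dictionary and Job Instance is omitted). The field names, the `load` and `capacity` units, and the first-fit `schedule` function are assumptions for illustration, not the Resource Manager's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    name: str
    load: int  # assumed unit of capacity the job consumes on a node
    status: str = "QUEUED"


@dataclass
class JobSpec:
    job: Job
    job_input: dict  # stands in for the Job Input object


@dataclass
class ResourceNode:
    node_id: str
    capacity: int
    running: list = field(default_factory=list)

    def free(self):
        """Capacity not yet claimed by jobs running on this node."""
        return self.capacity - sum(spec.job.load for spec in self.running)


def schedule(spec, nodes):
    """Place the job on the first node with enough free capacity."""
    for node in nodes:
        if node.free() >= spec.job.load:
            node.running.append(spec)
            spec.job.status = "RUNNING"
            return node.node_id
    return None  # no node can take the job, so it stays queued


nodes = [ResourceNode("node-1", capacity=2), ResourceNode("node-2", capacity=4)]
first = schedule(JobSpec(Job("j1", "extract", load=2), {"file": "a.dat"}), nodes)
second = schedule(JobSpec(Job("j2", "reproject", load=3), {"file": "b.dat"}), nodes)
third = schedule(JobSpec(Job("j3", "archive", load=2), {"file": "c.dat"}), nodes)
```

In this run the third job finds no node with enough free capacity, so `schedule` returns `None` and its status remains "QUEUED".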
Metadata
• A Multi-valued, generic Metadata container class
• Internal map of string keys pointing to vectors of strings
– [std::string key] ⇒ std::vector of std::strings
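A multi-valued container of this shape is easy to sketch. The version below is a minimal Python illustration of the idea (string keys mapped to lists of strings); the method names are hypothetical and are not taken from the OODT Metadata class.

```python
from collections import defaultdict


class Metadata:
    """Multi-valued container: each string key maps to a list of strings."""

    def __init__(self):
        self._map = defaultdict(list)

    def add(self, key, value):
        """Append a value to the key, keeping any values already present."""
        self._map[key].append(str(value))

    def replace(self, key, values):
        """Overwrite all values stored under the key."""
        self._map[key] = [str(v) for v in values]

    def get_all(self, key):
        """Return a copy of every value for the key (empty list if absent)."""
        return list(self._map.get(key, []))

    def get_first(self, key):
        """Return the first value for the key, or None if absent."""
        values = self._map.get(key)
        return values[0] if values else None


met = Metadata()
met.add("Instrument", "OCO")
met.add("Instrument", "SeaWinds")
met.add("Year", 2007)
```

Storing every value as a string keeps the container generic; callers convert to richer types when they read a value back.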
Framework
• Catalog & Archive
• Common Utilities
• Grid
• Agility
Common Utilities
• Provide needed support for catalogs, archives, and grids
• Query Expression
– Platform-neutral and extensible way of posing questions
• Single Sign On
• Commons
– Lots of miscellaneous utilities, including I/O streams, logging, XML, and more
Query Expression
• Provide a way to express queries in a generic manner
• Use boolean postfix expressions to capture the domain, range, and constraint of a query, regardless of the source of the query
• Encapsulate the results of a query
– Standard way to pass a query and its results between servers, clients, nodes, and other components
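To show what evaluating a boolean postfix expression involves, here is a small stack-based sketch. The token layout (tuples for comparisons, strings for logical operators) and the relation names are simplifying assumptions and do not reproduce OODT's actual query format.

```python
import operator

# Assumed relation names for this sketch.
RELATIONS = {
    "EQ": operator.eq,
    "NE": operator.ne,
    "LT": operator.lt,
    "GT": operator.gt,
}


def matches(postfix, record):
    """Evaluate a postfix boolean expression against one record."""
    stack = []
    for token in postfix:
        if token == "AND":
            right, left = stack.pop(), stack.pop()
            stack.append(left and right)
        elif token == "OR":
            right, left = stack.pop(), stack.pop()
            stack.append(left or right)
        elif token == "NOT":
            stack.append(not stack.pop())
        else:
            # A comparison token: (element name, relation, literal).
            key, relation, value = token
            stack.append(key in record and RELATIONS[relation](record[key], value))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


# Infix form: mission EQ "OCO" AND (year GT 2007 OR level EQ 2)
query = [
    ("mission", "EQ", "OCO"),
    ("year", "GT", 2007),
    ("level", "EQ", 2),
    "OR",
    "AND",
]

records = [
    {"mission": "OCO", "year": 2008, "level": 1},
    {"mission": "OCO", "year": 2006, "level": 1},
    {"mission": "SeaWinds", "year": 2009, "level": 2},
]
selected = [rec for rec in records if matches(query, rec)]
```

Postfix order removes the need for parentheses or precedence rules, so any component that receives the query can evaluate it with a single pass and a stack.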
Framework
• Catalog & Archive
• Utilities
• Grid
• Agility
Grid
• Profile (metadata) and Product (data) services
• Product
– Retrieves resources (products) in platform-neutral formats
• Profile
– Describes and discovers resources using extensible metadata called "profiles"
• Web Grid
– Provides profile and product services over a RESTful interface
• XML Product/Profile handlers
– Provide XML-configurable, database profile and product handlers
Product
• Provide access to data products
– Datasets, images, documents, or anything with an electronic representation
• Accept standard query expressions and return zero or more matching products
• Transform products from proprietary formats into Internet standard formats without impacting local stores or operations
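The behaviour described above (accept a query, return zero or more products, convert on the way out) can be sketched as follows. The pipe-delimited local layout, the `handle_query` function, and the use of a plain predicate in place of a standard query expression are all assumptions made to keep the example short.

```python
import json

# Hypothetical local store: pipe-delimited records in a site-specific layout.
LOCAL_STORE = [
    "g001|OCO|2008",
    "g002|SeaWinds|2007",
    "g003|OCO|2009",
]


def parse_local(line):
    """Parse one record from the assumed site-specific layout."""
    granule, mission, year = line.split("|")
    return {"granule": granule, "mission": mission, "year": int(year)}


def handle_query(predicate):
    """Return zero or more matching products serialized as JSON.

    The local store is only read, never modified, so local operations
    are unaffected by serving the query.
    """
    return [
        json.dumps(record, sort_keys=True)
        for record in map(parse_local, LOCAL_STORE)
        if predicate(record)
    ]


oco_products = handle_query(lambda rec: rec["mission"] == "OCO")
no_products = handle_query(lambda rec: rec["year"] < 2000)
```

An empty list is a valid answer: a query that matches nothing is not treated as an error.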
Profile
• Describes and locates resources using metadata descriptions
– Resource's inception, composition, and location
• Catalogs metadata descriptions and provides creating, updating, and querying capabilities.
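A profile catalog with create, update, and query operations might be sketched like this. The `ProfileRegistry` class and its field layout (inception, composition, location) are hypothetical and merely mirror the three aspects named above.

```python
class ProfileRegistry:
    """Hypothetical in-memory catalog of resource profiles keyed by resource id."""

    def __init__(self):
        self._profiles = {}

    def create(self, resource_id, inception, composition, location):
        self._profiles[resource_id] = {
            "inception": inception,      # who created the resource and when
            "composition": composition,  # what the resource contains
            "location": location,        # where the resource can be found
        }

    def update(self, resource_id, **changes):
        self._profiles[resource_id].update(changes)

    def query(self, **criteria):
        """Return ids of resources whose profile matches every criterion."""
        return sorted(
            resource_id
            for resource_id, profile in self._profiles.items()
            if all(profile.get(key) == value for key, value in criteria.items())
        )


registry = ProfileRegistry()
registry.create("r1", "2007, OCO team", "spectra", "archive-a")
registry.create("r2", "2008, SeaWinds team", "wind vectors", "archive-a")
registry.update("r2", location="archive-b")
```

Note that the registry stores only descriptions; the resources themselves stay wherever their `location` field says they are.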
Framework
• Catalog & Archive
• Utilities
• Grid
• Agility
Agility
• Re-implementation of Grid in Python with a focus on high performance in the face of gargantuan data sets as well as accelerated development and integration into existing systems.
Questions