Neural, Bayesian, and Evolutionary Systems for High-Performance Computational Knowledge Management: Progress Report
University of Illinois at Urbana-Champaign
PET Program Year-End Review, 1999
Neural, Bayesian, and Evolutionary Systems for High-Performance Computational Knowledge Management: Progress Report
Wednesday, August 4, 1999
William H. Hsu, Ph.D.
Automated Learning Group
National Center for Supercomputing Applications
http://www.ncsa.uiuc.edu/People/bhsu
Overview: T&E Data Modeling
• Short-Term Objectives: Building a Data Model
– Data Integrity
– Rudimentary Entity-Relational Data Model (cf. Oracle 8)
– Definition of Prognostic Monitoring Problem
• Longer-Term Objectives: Scalable Data Mining from Time Series
– Multimodal Sensor Integration
– Relevance Determination
– Building Causal (Explanatory) Models
• Example: Super ADOCS Data Format (SDF)
– 1719-channel asynchronous data bus (General Dynamics)
– Data types: time (7), ballistics/firing control (~350), fuel (~10), hydraulics (~10), wiring harness/other electrical (~310), spatial/GPS (~60), diagnostics/feedback/command (~750), profilometer (~50), unused (~135)
– Engineering units: counts, elapsed time, rates, percent efficiency, etc.
– 33 caution/warning channels; internal diagnostics
– Analytical Applications: Learning, Inference (Decision Support)
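The SDF channel taxonomy above lends itself to a simple data dictionary. The sketch below is only an illustration: the category counts are the approximate figures from the slide (they sum to about 1682 of the 1719 channels, since several counts are approximate), and the function name is hypothetical.

```python
# Minimal data-dictionary sketch for an SDF-style channel taxonomy.
# Counts are the approximate per-category figures from the slide.
SDF_CATEGORIES = {
    "time": 7,
    "ballistics/firing control": 350,
    "fuel": 10,
    "hydraulics": 10,
    "wiring harness/other electrical": 310,
    "spatial/GPS": 60,
    "diagnostics/feedback/command": 750,
    "profilometer": 50,
    "unused": 135,
}

def total_channels(categories):
    """Sum channel counts across all categories."""
    return sum(categories.values())
```

A dictionary like this is the seed of the data model/ontology discussed later: each category would eventually carry its engineering units and abstract data types.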
Data Mining: Objectives for Testing and Evaluation
• Objectives
– Scalability: handling disparity in temporal, spatial granularity
– Data integrity: verification (formal model) or validation (testing)
– Multimodality: ability to integrate knowledge/data sources
– Efficiency: consume only the necessary bandwidth for model
• Acquisition (data warehousing)
• Maintenance (incrementality)
• Analysis (interactive, configurable data mining system)
• Visualization (transparent user interface)
• Applicable Technologies
– Selective downsampling: adapting grain size of data model
– Data model validation
– Simple relational database (RDB) model
– Ontology: knowledge base definition, units, abstract data types
– Multimodal sensor integration: mixture models for data fusion
– Data preparation: selection, synthesis, partitioning of data channels
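Selective downsampling, listed above as an applicable technology, can be illustrated with a change-driven filter: dense sampling is kept where a channel varies, sparse sampling where it is flat. This is a minimal sketch under assumed semantics, not the project's actual algorithm; the function name and threshold parameter are illustrative.

```python
def selective_downsample(series, threshold):
    """Keep a sample whenever it differs from the last kept sample
    by more than `threshold`; always keep the first point.
    A crude sketch of non-uniform (change-driven) downsampling."""
    if not series:
        return []
    kept = [series[0]]
    for x in series[1:]:
        if abs(x - kept[-1]) > threshold:
            kept.append(x)
    return kept
```

On a slowly varying channel this discards most samples, which is the point of "consuming only the necessary bandwidth" for the model.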
Data Models and Ontologies (Super ADOCS Data Format)
[Figure: ontology diagram relating the Ballistics, Hazard, and Diagnostic channel groups]
Data Mining: Data Fusion System for Testing and Evaluation
[Figure: data fusion pipeline. A multiattribute data set x undergoes attribute selection and partitioning into subsets x'1 … x'n; subproblem definition yields targets y'1 … y'n; a partition evaluator drives metric-based model selection over learning architectures and learning methods; each subproblem's learning specification (architecture, method) feeds data fusion to produce the overall prediction.]
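The final fusion step can be read, in its simplest form, as a convex mixture of per-subproblem predictions. The sketch below is only that simplest case; the project's mixture models for data fusion are more general, and the function name and weighting scheme here are assumptions for illustration.

```python
def fuse_predictions(preds, weights):
    """Combine per-subproblem predictions into an overall prediction
    as a convex (normalized, weighted) mixture.
    A minimal stand-in for mixture-model data fusion."""
    assert len(preds) == len(weights) and sum(weights) > 0
    z = sum(weights)
    return sum(p * w for p, w in zip(preds, weights)) / z
```

In the full system the weights would themselves come from the partition evaluator's metrics rather than being fixed constants.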
Data Mining: Integrated Modeling and Testing (IMT) Information Systems
• Application Testbed
– Aberdeen Test Center: M1 Abrams main battle tank (SEP data, SDF)
– Reliability testing
• T&E Information Systems: Common Characteristics
– Large-Scale Data Model
• Input (M1A2 SEP): 1.8 MB to 459 MB; minutes to hours
• Output: 33 caution/warning channels; internal diagnostics
– Data Integrity Requirements
• Specification of test objective and metrics (in progress)
• Generated by end user (e.g., author of test report, instrumentation report)
– Multimodality
• Selection of relevant data channels (given prediction objective)
• Data fusion problem: data channels from different categories
– Data Reduction Requirements
• Excess bandwidth: non-uniform downsampling (frequency reduction)
• Irrelevant data channels (e.g., targeting with respect to excess RPMs)
Relevance Determination Problems in Testing and Evaluation
• Problems
– Machine learning for decision support and monitoring
– Extraction of temporal features
– Model selection
– Sensor and data fusion
• Solutions
– Clustering and decomposition of learning tasks
– Selection, synthesis, and partitioning of data channels
• Approach
– Simple relational data model
– Relevance determination (importance ranking) for data channels
– Multimodal data fusion
– Hierarchy of time series models
– Quantitative (metric-based) model selection
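Importance ranking of data channels can be illustrated with a filter-style baseline: score each channel by correlation with the prediction target, then sort. The system described in these slides uses a genetic wrapper instead, so the sketch below (with hypothetical names) is only a simple point of comparison.

```python
import statistics

def rank_channels_by_correlation(channels, target):
    """Rank data channels by |Pearson correlation| with the target.
    `channels` maps channel name -> list of samples. A filter-style
    relevance score; illustrative baseline only, not the slides'
    genetic-wrapper approach."""
    def pearson(x, y):
        mx, my = statistics.fmean(x), statistics.fmean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

    scores = {name: abs(pearson(xs, target)) for name, xs in channels.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A wrapper method would instead evaluate whole channel subsets against the learning architecture's actual accuracy, which is what makes it quantitative (metric-based) model selection.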
Deployment of KDD and Visualization Components
• Database Access
– SDF (Super ADOCS Data File) import
– Flat file export
– Internal data model: interaction with learning modules
• Deployment
– Java stand-alone application
– Interactive management of modules, data flow
• Presentation: Web-Based Interface
– Simple, URL-based invocation system
• Common Gateway Interface (CGI) and Perl
• Alternative implementation: servlets (http://www.javasoft.com)
– Configurable using forms
• Messaging Systems (Deployment Presentation)
– Between configurators and deployment layer
– Between data management modules and visualization components
NCSA Infrastructure for High-Performance Computation in Data Mining [1]
Rapid KDD Development Environment
NCSA Infrastructure for High-Performance Computation in Data Mining [2]
Cluster (Network of Workstations) Model for Master/Slave Genetic Wrapper
[Figure: NCSA ALG 8-node Beowulf cluster; one master node coordinates Slaves 1 through 8]
Master:
• Jenesis: Java-based simple genetic algorithm (sGA) running in master virtual machine (VM)
• Load balancing task manager
• Message passing communication (TCP/IP, MPI, PVM)
Slaves (Linux PCs), connected via 100Base-T Ethernet:
• Migratable processes
• Replicated data set
• MLC++ (machine learning library written in C++)
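The simple genetic algorithm (sGA) behind the wrapper can be sketched compactly. The version below is sequential, with a comment marking where the master/slave cluster would farm out fitness evaluations; it is an illustrative sGA in the spirit of Jenesis, not its actual code, and all names and parameters are assumptions.

```python
import random

def genetic_attribute_selection(fitness, n_attrs, pop_size=8,
                                generations=20, seed=0):
    """Simple GA over bit masks (1 = attribute included).
    Sequential sketch of a genetic wrapper for attribute subset
    selection; in the cluster version, each call to `fitness`
    would be dispatched to a slave node."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_attrs)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness evaluation: the step the master distributes to slaves.
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[: pop_size // 2]        # truncation selection
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_attrs)    # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n_attrs)] ^= 1  # point mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)
```

Because fitness calls dominate the cost (each one trains a learner on the candidate subset), distributing them across the eight slaves gives near-linear speedup with a replicated data set.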
Progress to Date (Functionality Demonstrated)
• SDF Viewer/Editor
– Platform-independent user interface
• Selection, grouping of attributes
• Implementation: Common Gateway Interface/Perl
• Work in progress: downsampling; Java servlet version
– Demonstration: data format; data dictionary; integrity checking
• Data Model/Ontology
– Built from SDF data dictionary
– General process
– Future work: JDBC/SQL, MS Access, Oracle 8
• Attribute Subset Selection System
– Workstation cluster
– Genetic algorithm (Java/C++)
• D2K: Rapid Application Development System for HP Data Mining
– Visual programming system
– Supports code reuse and interfaces with existing codes (e.g., MLC++)
Current and Future Development
• Distributivity
– Cluster: Network Of Workstations (NOW)
– Supports (task/functionally) parallel, distributed applications
– Example: simple genetic algorithm for attribute subset selection
• Rapid Application Development Environment
– Simple and intuitive user interface
– NCSA Data to Knowledge (D2K): visual programming system (Java)
• Incorporation of Domain Expert Knowledge
– Knowledge engineering and interactive elicitation
– Probabilistic knowledge: relevance, inter-attribute relations
– Correlation and causality (priors in reliability testing)
• Refinement of IMT Plans
– Test plan refinement
– Instrumentation plan refinement
• Future Work: Using Refined Data Model to Improve Data Bus Specification
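The "priors in reliability testing" bullet above can be made concrete with the standard conjugate Beta-Binomial update: an expert's prior belief about a failure probability is encoded as a Beta distribution and updated with observed test outcomes. This is a textbook construction, not anything specific to the IMT plans.

```python
def update_reliability_prior(alpha, beta, failures, trials):
    """Beta-Binomial update: an expert prior Beta(alpha, beta) on the
    per-trial failure probability, updated with observed test results.
    Returns (posterior alpha, posterior beta, posterior mean)."""
    a = alpha + failures
    b = beta + (trials - failures)
    return a, b, a / (a + b)
```

For example, a prior Beta(1, 9) (expected failure rate 10%) combined with 2 failures in 10 trials yields posterior mean 3/20 = 15%, smoothly blending elicited knowledge with data.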
Time Series Analysis: Development Timeline
• Development Vision
– Prognostic capability
• Monitoring
• Data integrity checking, general data model development tools/expertise
– Decision support (especially relevance determination)
• Development Schedule: Technology Transfer
– CY4 (Fiscal 2000)
• Time series visualization tool (integrated with model identification tool)
• User interface for analytical tool, user training
– CY5 (Fiscal 2001) technology transfer: D2K, user training
– New personnel
• NCSA Automated Learning Group
• Aberdeen Test Center
• CY4 Action Items for IMT
– Model deployment: using time series modeling tools and techniques
– Elicitation: subject matter expertise (for training systems)
Summary: Model Development Process
• Model Identification
– Queries: test/instrumentation reports
– Specification of data model
– Grouping of data channels by type
• Prediction Objective Identification
– Specification of test objective: failure modes, hazard levels
– Identification of metrics
• Reduction
– Refinement of data model
– Selection of relevant data channels (given prediction objective)
• Synthesis
– New data channels that improve prediction quality
• Integration
– Multiple time series data sources
[Figure: decomposition flow. Heterogeneous data (multiple sources) passes through decomposition methods, using unsupervised subdivision of inputs and reduction of inputs, to yield relevant inputs for single and multiple objectives; supervised single-task and task-specific multistrategy learning then feed decision support systems and the definition of new data mining problems.]
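The model development steps above (identification, objective, reduction, synthesis, integration) can be sketched as a small pipeline. Everything here is illustrative: the function names, the relevance predicate, and the choice of a channel-wise mean as the synthesized feature are assumptions, not the actual IMT process.

```python
def model_development_pipeline(channels, objective, relevance_fn):
    """Sketch of the slide's model-development process:
    reduce to channels relevant to the objective, synthesize a
    derived channel, and integrate into one training table.
    `channels` maps channel name -> list of aligned samples."""
    # Reduction: keep channels deemed relevant to the prediction objective.
    reduced = {n: xs for n, xs in channels.items()
               if relevance_fn(n, objective)}
    # Synthesis: add a derived channel (here, a simple channel-wise mean).
    n_rows = min(len(xs) for xs in reduced.values())
    synthesized = [
        sum(xs[i] for xs in reduced.values()) / len(reduced)
        for i in range(n_rows)
    ]
    # Integration: the reduced channels plus the new one form the table.
    reduced["synthesized_mean"] = synthesized
    return reduced
```

In the real system, reduction would be driven by the relevance-determination machinery described earlier, and synthesis by domain knowledge about which derived quantities improve prediction quality.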