Neural, Bayesian, and Evolutionary Systems for High-Performance Computational Knowledge Management: Progress Report
University of Illinois at Urbana-Champaign
PET Program Year-End Review, 1999
Neural, Bayesian, and Evolutionary Systems for High-Performance Computational Knowledge Management: Progress Report
Wednesday, August 4, 1999
William H. Hsu, Ph.D.
Automated Learning Group
National Center for Supercomputing Applications
http://www.ncsa.uiuc.edu/People/bhsu
Overview: T&E Data Modeling
• Short-Term Objectives: Building a Data Model
– Data Integrity
– Rudimentary Entity-Relational Data Model (cf. Oracle 8)
– Definition of Prognostic Monitoring Problem
• Longer-Term Objectives: Scalable Data Mining from Time Series
– Multimodal Sensor Integration
– Relevance Determination
– Building Causal (Explanatory) Models
• Example: Super ADOCS Data Format (SDF)
– 1719-channel asynchronous data bus (General Dynamics)
– Data types: time (7), ballistics/firing control (~350), fuel (~10), hydraulics (~10), wiring harness/other electrical (~310), spatial/GPS (~60), diagnostics/feedback/command (~750), profilometer (~50), unused (~135)
– Engineering units: counts, elapsed time, rates, percent efficiency, etc.
– 33 caution/warning channels; internal diagnostics
– Analytical Applications: Learning, Inference (Decision Support)
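The SDF channel taxonomy above lends itself to a simple data dictionary. The sketch below is only an illustration: the category counts are the approximate figures from the slide (they sum to about 1682 of the 1719 channels, since several counts are approximate), and the function name is hypothetical.

```python
# Minimal data-dictionary sketch for an SDF-style channel taxonomy.
# Counts are the approximate per-category figures from the slide.
SDF_CATEGORIES = {
    "time": 7,
    "ballistics/firing control": 350,
    "fuel": 10,
    "hydraulics": 10,
    "wiring harness/other electrical": 310,
    "spatial/GPS": 60,
    "diagnostics/feedback/command": 750,
    "profilometer": 50,
    "unused": 135,
}

def total_channels(categories):
    """Sum channel counts across all categories."""
    return sum(categories.values())
```

A dictionary like this is the seed of the data model/ontology discussed later: each category would eventually carry its engineering units and abstract data types.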
Data Mining: Objectives for Testing and Evaluation
• Objectives
– Scalability: handling disparity in temporal, spatial granularity
– Data integrity: verification (formal model) or validation (testing)
– Multimodality: ability to integrate knowledge/data sources
– Efficiency: consume only the necessary bandwidth for model
• Acquisition (data warehousing)
• Maintenance (incrementality)
• Analysis (interactive, configurable data mining system)
• Visualization (transparent user interface)
• Applicable Technologies
– Selective downsampling: adapting grain size of data model
– Data model validation
– Simple relational database (RDB) model
– Ontology: knowledge base definition, units, abstract data types
– Multimodal sensor integration: mixture models for data fusion
– Data preparation: selection, synthesis, partitioning of data channels
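Selective downsampling, listed above as an applicable technology, can be illustrated with a change-driven filter: dense sampling is kept where a channel varies, sparse sampling where it is flat. This is a minimal sketch under assumed semantics, not the project's actual algorithm; the function name and threshold parameter are illustrative.

```python
def selective_downsample(series, threshold):
    """Keep a sample whenever it differs from the last kept sample
    by more than `threshold`; always keep the first point.
    A crude sketch of non-uniform (change-driven) downsampling."""
    if not series:
        return []
    kept = [series[0]]
    for x in series[1:]:
        if abs(x - kept[-1]) > threshold:
            kept.append(x)
    return kept
```

On a slowly varying channel this discards most samples, which is the point of "consuming only the necessary bandwidth" for the model.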
Data Models and Ontologies (Super ADOCS Data Format)
[Figure: ontology diagram relating the Ballistics, Hazard, and Diagnostic channel groups]
Data Mining: Data Fusion System for Testing and Evaluation
[Figure: data fusion pipeline. A multiattribute data set x undergoes attribute selection and partitioning into subsets x'1 … x'n; subproblem definition yields targets y'1 … y'n; a partition evaluator drives metric-based model selection over learning architectures and learning methods; each subproblem's learning specification (architecture, method) feeds data fusion to produce the overall prediction.]
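The final fusion step can be read, in its simplest form, as a convex mixture of per-subproblem predictions. The sketch below is only that simplest case; the project's mixture models for data fusion are more general, and the function name and weighting scheme here are assumptions for illustration.

```python
def fuse_predictions(preds, weights):
    """Combine per-subproblem predictions into an overall prediction
    as a convex (normalized, weighted) mixture.
    A minimal stand-in for mixture-model data fusion."""
    assert len(preds) == len(weights) and sum(weights) > 0
    z = sum(weights)
    return sum(p * w for p, w in zip(preds, weights)) / z
```

In the full system the weights would themselves come from the partition evaluator's metrics rather than being fixed constants.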
Data Mining: Integrated Modeling and Testing (IMT) Information Systems
• Application Testbed
– Aberdeen Test Center: M1 Abrams main battle tank (SEP data, SDF)
– Reliability testing
• T&E Information Systems: Common Characteristics
– Large-Scale Data Model
• Input (M1A2 SEP): 1.8 MB to 459 MB; minutes to hours
• Output: 33 caution/warning channels; internal diagnostics
– Data Integrity Requirements
• Specification of test objective and metrics (in progress)
• Generated by end user (e.g., author of test report, instrumentation report)
– Multimodality
• Selection of relevant data channels (given prediction objective)
• Data fusion problem: data channels from different categories
– Data Reduction Requirements
• Excess bandwidth: non-uniform downsampling (frequency reduction)
• Irrelevant data channels (e.g., targeting with respect to excess RPMs)
Relevance Determination Problems in Testing and Evaluation
• Problems
– Machine learning for decision support and monitoring
– Extraction of temporal features
– Model selection
– Sensor and data fusion
• Solutions
– Clustering and decomposition of learning tasks
– Selection, synthesis, and partitioning of data channels
• Approach
– Simple relational data model
– Relevance determination (importance ranking) for data channels
– Multimodal data fusion
– Hierarchy of time series models
– Quantitative (metric-based) model selection
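Importance ranking of data channels can be illustrated with a filter-style baseline: score each channel by correlation with the prediction target, then sort. The system described in these slides uses a genetic wrapper instead, so the sketch below (with hypothetical names) is only a simple point of comparison.

```python
import statistics

def rank_channels_by_correlation(channels, target):
    """Rank data channels by |Pearson correlation| with the target.
    `channels` maps channel name -> list of samples. A filter-style
    relevance score; illustrative baseline only, not the slides'
    genetic-wrapper approach."""
    def pearson(x, y):
        mx, my = statistics.fmean(x), statistics.fmean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

    scores = {name: abs(pearson(xs, target)) for name, xs in channels.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A wrapper method would instead evaluate whole channel subsets against the learning architecture's actual accuracy, which is what makes it quantitative (metric-based) model selection.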
Deployment of KDD and Visualization Components
• Database Access
– SDF (Super ADOCS Data File) import
– Flat file export
– Internal data model: interaction with learning modules
• Deployment
– Java stand-alone application
– Interactive management of modules, data flow
• Presentation: Web-Based Interface
– Simple, URL-based invocation system
• Common Gateway Interface (CGI) and Perl
• Alternative implementation: servlets (http://www.javasoft.com)
– Configurable using forms
• Messaging Systems (Deployment Presentation)
– Between configurators and deployment layer
– Between data management modules and visualization components
NCSA Infrastructure for High-Performance Computation in Data Mining [1]
Rapid KDD Development Environment
NCSA Infrastructure for High-Performance Computation in Data Mining [2]
Cluster (Network of Workstations) Model for Master/Slave Genetic Wrapper
[Figure: NCSA ALG 8-node Beowulf cluster; one master node coordinates Slaves 1 through 8]
Master:
• Jenesis: Java-based simple genetic algorithm (sGA) running in master virtual machine (VM)
• Load balancing task manager
• Message passing communication (TCP/IP, MPI, PVM)
Slaves (Linux PCs), connected via 100Base-T Ethernet:
• Migratable processes
• Replicated data set
• MLC++ (machine learning library written in C++)
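The simple genetic algorithm (sGA) behind the wrapper can be sketched compactly. The version below is sequential, with a comment marking where the master/slave cluster would farm out fitness evaluations; it is an illustrative sGA in the spirit of Jenesis, not its actual code, and all names and parameters are assumptions.

```python
import random

def genetic_attribute_selection(fitness, n_attrs, pop_size=8,
                                generations=20, seed=0):
    """Simple GA over bit masks (1 = attribute included).
    Sequential sketch of a genetic wrapper for attribute subset
    selection; in the cluster version, each call to `fitness`
    would be dispatched to a slave node."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_attrs)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness evaluation: the step the master distributes to slaves.
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[: pop_size // 2]        # truncation selection
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_attrs)    # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n_attrs)] ^= 1  # point mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)
```

Because fitness calls dominate the cost (each one trains a learner on the candidate subset), distributing them across the eight slaves gives near-linear speedup with a replicated data set.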
Progress to Date (Functionality Demonstrated)
• SDF Viewer/Editor
– Platform-independent user interface
• Selection, grouping of attributes
• Implementation: Common Gateway Interface/Perl
• Work in progress: downsampling; Java servlet version
– Demonstration: data format; data dictionary; integrity checking
• Data Model/Ontology
– Built from SDF data dictionary
– General process
– Future work: JDBC/SQL, MS Access, Oracle 8
• Attribute Subset Selection System
– Workstation cluster
– Genetic algorithm (Java/C++)
• D2K: Rapid Application Development System for HP Data Mining
– Visual programming system
– Supports code reuse and interfaces with existing codes (e.g., MLC++)
Current and Future Development
• Distributivity
– Cluster: Network Of Workstations (NOW)
– Supports (task/functionally) parallel, distributed applications
– Example: simple genetic algorithm for attribute subset selection
• Rapid Application Development Environment
– Simple and intuitive user interface
– NCSA Data to Knowledge (D2K): visual programming system (Java)
• Incorporation of Domain Expert Knowledge
– Knowledge engineering and interactive elicitation
– Probabilistic knowledge: relevance, inter-attribute relations
– Correlation and causality (priors in reliability testing)
• Refinement of IMT Plans
– Test plan refinement
– Instrumentation plan refinement
• Future Work: Using Refined Data Model to Improve Data Bus Specification
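The "priors in reliability testing" bullet above can be made concrete with the standard conjugate Beta-Binomial update: an expert's prior belief about a failure probability is encoded as a Beta distribution and updated with observed test outcomes. This is a textbook construction, not anything specific to the IMT plans.

```python
def update_reliability_prior(alpha, beta, failures, trials):
    """Beta-Binomial update: an expert prior Beta(alpha, beta) on the
    per-trial failure probability, updated with observed test results.
    Returns (posterior alpha, posterior beta, posterior mean)."""
    a = alpha + failures
    b = beta + (trials - failures)
    return a, b, a / (a + b)
```

For example, a prior Beta(1, 9) (expected failure rate 10%) combined with 2 failures in 10 trials yields posterior mean 3/20 = 15%, smoothly blending elicited knowledge with data.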
Time Series Analysis: Development Timeline
• Development Vision
– Prognostic capability
• Monitoring
• Data integrity checking, general data model development tools/expertise
– Decision support (especially relevance determination)
• Development Schedule: Technology Transfer
– CY4 (Fiscal 2000)
• Time series visualization tool (integrated with model identification tool)
• User interface for analytical tool, user training
– CY5 (Fiscal 2001) technology transfer: D2K, user training
– New personnel
• NCSA Automated Learning Group
• Aberdeen Test Center
• CY4 Action Items for IMT
– Model deployment: using time series modeling tools and techniques
– Elicitation: subject matter expertise (for training systems)
Summary: Model Development Process
• Model Identification
– Queries: test/instrumentation reports
– Specification of data model
– Grouping of data channels by type
• Prediction Objective Identification
– Specification of test objective: failure modes, hazard levels
– Identification of metrics
• Reduction
– Refinement of data model
– Selection of relevant data channels (given prediction objective)
• Synthesis
– New data channels that improve prediction quality
• Integration
– Multiple time series data sources
[Figure: decomposition flow. Heterogeneous data (multiple sources) passes through decomposition methods, using unsupervised subdivision of inputs and reduction of inputs, to yield relevant inputs for single and multiple objectives; supervised single-task and task-specific multistrategy learning then feed decision support systems and the definition of new data mining problems.]
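The model development steps above (identification, objective, reduction, synthesis, integration) can be sketched as a small pipeline. Everything here is illustrative: the function names, the relevance predicate, and the choice of a channel-wise mean as the synthesized feature are assumptions, not the actual IMT process.

```python
def model_development_pipeline(channels, objective, relevance_fn):
    """Sketch of the slide's model-development process:
    reduce to channels relevant to the objective, synthesize a
    derived channel, and integrate into one training table.
    `channels` maps channel name -> list of aligned samples."""
    # Reduction: keep channels deemed relevant to the prediction objective.
    reduced = {n: xs for n, xs in channels.items()
               if relevance_fn(n, objective)}
    # Synthesis: add a derived channel (here, a simple channel-wise mean).
    n_rows = min(len(xs) for xs in reduced.values())
    synthesized = [
        sum(xs[i] for xs in reduced.values()) / len(reduced)
        for i in range(n_rows)
    ]
    # Integration: the reduced channels plus the new one form the table.
    reduced["synthesized_mean"] = synthesized
    return reduced
```

In the real system, reduction would be driven by the relevance-determination machinery described earlier, and synthesis by domain knowledge about which derived quantities improve prediction quality.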