Infrastructure, Data Cleansing and Mining for Scientific...
Transcript of Infrastructure, Data Cleansing and Mining for Scientific...
11/26/03 Yingping Huang1
Infrastructure, DataCleansing and Mining for
Scientific Simulations
Yingping HuangCommittee Members:
Dr. BowyerDr. FlynnDr. MadeyDr. Uhran
11/26/03 Yingping Huang2
Agenda
uOverview
uBackground
uMulti-tier infrastructure
uData cleansing algorithms
uData mining applications
uSummarize
uTimeframe
11/26/03 Yingping Huang3
Overview
uMulti-tier infrastructure powers scientificsimulations.
uData cleansing algorithms result inbetter data quality.
uData mining applications discoverhidden knowledge in environmental andsocial science.
11/26/03 Yingping Huang4
Motivation
Infrastructure
NOM OSS
Data StorageData AnalysisReports
CollaborationPersonalizationWeb-based
Simulation Anytime and Anywhere
11/26/03 Yingping Huang5
Agenda
uOverview
uBackground
uMulti-tier infrastructure
uData cleansing algorithms
uData mining applications
uSummarize
uTimeframe
11/26/03 Yingping Huang6
Backgroundu Projects under way
n NOMu Research on natural organic matter (NOM)u Study evolution of NOM over timeu Joint work of scientists across disciplines including
chemists, biochemists, environmental scientists
n OSSu Research on the open source software (OSS)
development phenomenonu Study the behavior of OSS developers and their
motivationsu Joint work with social scientists
11/26/03 Yingping Huang7
Simulation Modelsu Standalone or traditional client-server
n Software needs to be installed on clientsn Incompatibility makes installation difficult
u Web-based using appletsn Security – file permission, firewalln Inconvenience – plug-ins downloadn Network traffic – download before executingn Incompatibility – Swarm
u What should be done?n Web-based server-side simulation modelsn Centralized simulation managementn Collaboration and personalization
11/26/03 Yingping Huang8
Data Cleansingu Known approaches
n Sorted neighborhood (Stolfo 1995/1998)u Domain dependent keys for sorting
n Record matching(Monge, 2000)u Edit distance only
n String mapping (Li, 2003)u Potential high dimensional target space
u Our approachesn Sample databasen Lipschitz mapping
11/26/03 Yingping Huang9
Data Miningu Data mining in astronomy
n SKYCAT: star/galaxy classification (Fayyad, 1996)n JARTool: detect volcanoes on Venus (Burl, 1998)n Sapphire: find galaxies (Kamath, 2001)
u Data mining in biologyn Bioinformaticsn SARS diagnosis (ehealth.org)
u What should be done?n Data mining for social science (OSS)n Data mining for environmental science (NOM)n Add intelligence to simulation models by applying data
mining results
11/26/03 Yingping Huang10
Agenda
uOverview
uBackground
uMulti-tier infrastructure
uData cleansing algorithms
uData mining applications
uSummarize
uTimeframe
11/26/03 Yingping Huang11
Physical Layout
InternetInternetNetworkSwitch
…
10.10.0.1
10.10.0.210.10.0.310.10.0.410.10.0.5
10.10.0.610.10.0.710.10.0.810.10.0.910.10.0.10
129.74.xxx.yyy
129.74.aaa.bbb
The Simulation Manager
External Servers and ClientsPrivate network
11/26/03 Yingping Huang12
Multi-tierArchitecture
HTTP Client tier
HTTP Server tier
Application Server tier
Database Server tier
Client 1
Client 2
…Client N
Server
Server 1
Server 2
Server 3
Server 4
Server 5
SD
SMDW
STDBY
11/26/03 Yingping Huang13
Two Features
u Load-balancingn Scalability achieved
n Implementation using JMS, AQ & EJB
n Implementation using Shell scripts & PL/SQL
u Simulation-resumingn Reliability achieved
n Checkpoint
n Implementation using JTA/JTS
11/26/03 Yingping Huang14
Load-balancingUsing JMS & AQ
Job queue00100
statuscheckpointresumedjob_id
Topic 1
mdb1 mdb2 mdb3 mdb4 mdb5
Loadavg queue
Judge bean Topic 2
mdb6 mdb7 mdb8 mdb9 mdb10
Invoke simulation & update job queue
11/26/03 Yingping Huang15
Shell Scripts &PL/SQL
uDispatcher (HTTP server)n Dispatch simulationsn Send KEEPALIVE messages to running
simulations
u Intelligent agent (application server)n Upload load averagesn Check simulationsn Send ACK to KEEPALIVE messages
11/26/03 Yingping Huang16
Load-balancingAlgorithm
u Instance learning approachn Based on completion time prediction
uTwo step completion time predictionn Completion time estimation
u Load average
uData amount
n Completion time predictionuNearest neighborhood
11/26/03 Yingping Huang17
Completion TimeEstimation
uCompletion time estimation formula
11/26/03 Yingping Huang18
Checkpoint
JDBC
SD SM
JTA/JTS
One transaction
11/26/03 Yingping Huang19
CheckpointIssues
uCheckpoint datan All data for restarting the simulation
n Size depends on number of agents
uCheckpoint frequencyn Checkpoint-interval
u # of MB data
n Checkpoint-timeoutu # of minutes
11/26/03 Yingping Huang20
Simulation-resuming
u To restart a terminated simulationn A new simulation with same job_id inserted into
the job queue
n A terminated simulation has smaller job_id thannew simulations, higher priority
u In case of application server failuren All simulations’ job_ids inserted into the job queue
n All simulations will be running on other applicationservers
11/26/03 Yingping Huang21
CollaborationSuite
11/26/03 Yingping Huang22
Graphical Reports
11/26/03 Yingping Huang23
XML Reports
11/26/03 Yingping Huang24
Agenda
uOverview
uBackground
uMulti-tier information system
uData mining applications
uSummarize
uTimeframe
11/26/03 Yingping Huang25
Methodologyu Traditional approach
n Form hypothesesn Verify hypotheses by finding patterns in data
u Data mining approachn Find patterns in datan Form hypothesesn Design simulation modelsn Verify hypotheses
u U. Fayyad, J. Gray at Microsoft Research
11/26/03 Yingping Huang26
Technology &Software
u Data mining technologyn Clustering
u K-means
u Orthogonal cluster
n Classificationu Decision tree
u Naïve Bayes
n Association rulesu Apriori
u Data mining softwaren Oracle Data Mining
Suiten DM4Jn JDeveloper
11/26/03 Yingping Huang27
OSSu Study behavior of open source software (OSS)
developersn Agent-basedn Stochastic
u Data mining involvingn Clusteringn Classification
u Churn predictionu Acquisition prediction
n Association rules
11/26/03 Yingping Huang28
OSS DataWarehousing
u Data from sourceforge.comn Developers
n Projects
u Data warehousingn Table partitioning
n Aggregation
n Star schema
n Analysis SQL
n ETL tools ‡ Warehouse Builder
11/26/03 Yingping Huang29
NOMu Study behavior of natural organic matter
(NOM)n Agent-basedn Stochastic
u Data mining involvingn Clustering
u Micelle formation
n Classificationu Transportation predictionu Adsorption prediction
n Association rules
11/26/03 Yingping Huang30
Agenda
uOverview
uBackground
uMulti-tier information system
uData mining applications
uSummarize
uTimeframe
11/26/03 Yingping Huang31
Summarize
uMulti-tier information system integratesn Application servers & reports server
n Database servers
n Data warehousing & data mining
n Swarm
uCollaboration suite
uData mining guided model-design
11/26/03 Yingping Huang32
Insights &Impacts
u Server-side simulation modelsn Centralized simulation managementn Centralized data repository
u Collaboration suiten Simulation sharingn Knowledge sharing
u Data mining applicationsn Find patterns in datan Model deployment for simulation-design
11/26/03 Yingping Huang33
Agenda
uOverview
uBackground
uMulti-tier information system
uData mining applications
uSummarize
uTimeframe
11/26/03 Yingping Huang34
TimeframeMay 2003 ~ May 2004
0 3 6 9 12
Implement infrastructure
Data collection & statistical analysis
Data mining model design
Data mining model evaluation
Deployment
Writing up
11/26/03 Yingping Huang35
ExpectedPublications
u Information system design for scientificsimulationsn By August 2003
u Data warehousing for scientific simulationsn By November 2003
u Data mining for OSSn By February 2004
u Data mining for NOMn By March 2004
11/26/03 Yingping Huang36
Demo
Demonstration
11/26/03 Yingping Huang37
Finally
Thank you!
11/26/03 Yingping Huang38
Featuresu Multi-tier information system
n HTTP client tier ‡ HTTP server tier ‡Applicationserver tier ‡EIS tier
u Scalability at the application server tiern Load-balancing
u Reliability at the application server tiern Simulation-resuming
u Reliability at the database tiern Standby databases
11/26/03 Yingping Huang39
Features (cont.)
uData mining modelsn Stored in databasen Stored Java proceduresn PL/SQL procedure call using JDBC
uSimulation modelsn Agent-basedn Stochasticn Data mining guided