Chan UA Faculty Forum 2008 (PP97)
8/6/2019 Chan UA Faculty Forum 2008 (PP97)
C.-C. Chan
Department of Computer Science
University of Akron
Akron, OH 44325-4003, USA
UA Faculty Forum 2008 by C.-C. Chan
Outline
• Overview of Data Mining
• Software Tools
• A Rule-Based System for Data Mining
• Concluding Remarks
Data Mining (KDD)
• From Data to Knowledge
• Process of KDD (Knowledge Discovery in Databases)
• Related Technologies
• Comparisons
Why KDD?
"We are drowning in information, but starving for knowledge" (John Naisbitt)
Growing gap between data generation and data understanding:
Automation of business activities: telephone calls, credit card charges, medical tests, etc.
Earth observation satellites: estimated to generate one terabyte (10^12 bytes) of data per day, at a rate of one picture per second.
Biology: the Human Genome database project has collected gigabytes of data on the human genetic code [Fasman, Cuticchia, Kingsbury, 1994].
US Census data:
NASA databases:
World Wide Web:
Process of KDD
[1] Fayyad, U., Editorial, Int. J. of Data Mining and Knowledge Discovery, Vol. 1, Issue 1, 1997.
[2] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: an overview," in Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), MIT Press, 1996.
Process of KDD
1. Selection
   - Learning the application domain
   - Creating a target dataset
2. Pre-Processing
   - Data cleaning and preprocessing
3. Transformation
   - Data reduction and projection
4. Data Mining
   - Choosing the functions and algorithms of data mining
   - Association rules, classification rules, clustering rules
5. Interpretation and Evaluation
   - Validate and verify discovered patterns
6. Using discovered knowledge
Typical Data Mining Tasks
• Finding Association Rules [Rakesh Agrawal et al., 1993]
  - Each transaction is a set of items.
  - Given a set of transactions, an association rule is of the form X → Y, where X and Y are sets of items.
  - e.g.: 30% of transactions that contain beer also contain diapers (the rule's confidence); 2% of all transactions contain both of these items (the rule's support).
Applications:
  - Market basket analysis and cross-marketing
  - Catalog design
  - Store layout
  - Buying patterns
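The two percentages in the beer-and-diapers example correspond to the usual support and confidence measures of an association rule. The sketch below computes both for a single rule; the transactions are made-up toy data (not from the talk), and the helper names `support` and `confidence` are illustrative only.

```python
# Toy transactions; the items are illustrative, not data from the talk.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"milk", "diapers"},
]


def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Rule {beer} -> {diapers}
rule_support = support({"beer", "diapers"}, transactions)
rule_confidence = confidence({"beer"}, {"diapers"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A full association-rule miner would enumerate candidate itemsets and keep only rules above user-chosen support and confidence thresholds; this sketch evaluates one given rule.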
• Finding Sequential Patterns
  - Each data sequence is a list of transactions.
  - Find all sequential patterns with a user-specified minimum support.
  - e.g.: Consider a book-club database
  - A sequential pattern might be
    - 5% of customers bought Harry Potter I, then Harry Potter II, and then Harry Potter III.
Applications:
  - Add-on sales
  - Customer satisfaction
  - Identify symptoms/diseases that precede certain diseases
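As a rough illustration of what "support" means for a sequential pattern, the sketch below checks how many customer sequences contain a given pattern in order. The customer data, the item codes (`HP1`, `HP2`, `HP3`), and the 0.5 threshold are assumptions made for the example; it tests one pattern rather than finding all of them.

```python
def contains_pattern(sequence, pattern):
    """True if the itemsets in `pattern` occur, in order, within `sequence`."""
    position = 0
    for transaction in sequence:
        if position < len(pattern) and pattern[position] <= transaction:
            position += 1
    return position == len(pattern)


def sequence_support(pattern, sequences):
    """Fraction of data sequences that contain the pattern."""
    return sum(contains_pattern(s, pattern) for s in sequences) / len(sequences)


# Each customer is a list of transactions; each transaction is a set of items.
customers = [
    [{"HP1"}, {"HP2"}, {"HP3"}],
    [{"HP1"}, {"cookbook"}, {"HP2"}, {"HP3"}],
    [{"HP2"}, {"HP1"}],
    [{"HP1", "atlas"}, {"HP2"}],
]

pattern = [{"HP1"}, {"HP2"}, {"HP3"}]
pattern_support = sequence_support(pattern, customers)
min_support = 0.5  # user-specified threshold (assumed value)
is_frequent = pattern_support >= min_support
print(pattern_support, is_frequent)
```

Other transactions may appear between the matched ones, which is why the second customer still counts toward the pattern.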
• Finding Classification Rules
  - Finding discriminant rules for objects of different classes.
Approaches:
  - Finding Decision Trees
  - Finding Production Rules
Applications:
  - Processing loan and credit card applications
  - Model identification
• Text Mining
• Web Usage Mining
• Etc.
Related Technologies
• Database Systems
  - MS SQL Server
    - Transaction databases
    - OLAP (Data Cubes)
    - Data Mining
      - Decision Trees
      - Clustering Tools
• Machine Learning/Data Mining Systems
  - CART (Classification And Regression Trees)
  - C 5.x (Decision Trees)
  - WEKA (Waikato Environment for Knowledge Analysis)
  - LERS
  - ROSE 2
• Rule-Based Expert System Development Environments
  - CLIPS, JESS
  - EXSYS
• Web-based Platforms
  - Java
  - MS .Net
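Comparisons

System        | Pre-Processing | Learning / Data Mining                | Inference Engine | End-User Interface | Web-Based Access | Reasoning with Uncertainties
--------------|----------------|---------------------------------------|------------------|--------------------|------------------|-----------------------------
MS SQL Server | N/A            | Decision Trees, Clustering            | N/A              | N/A                | N/A              | N/A
CART, C 5.x   | N/A            | Decision Trees                        | Built-in         | Embedded           | N/A              | N/A
WEKA          | Yes            | Trees, Rules, Clustering, Association | N/A              | Embedded           | Need programming | N/A
CLIPS, JESS   | N/A            | N/A                                   | Built-in         | Embedded           | Need programming | 3rd-party extensions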
Rule-Based Data Mining System
Objectives
• Develop an integrated rule-based data mining system that provides:
  - Synergy of database systems, machine learning, and expert systems
  - Dealing with uncertain rules
  - Delivery of web-based user interface
Structure of Rule-Based Systems
[Figure: rule-based system diagram. Components: Rule Base, Working Memory, Matcher, Selector, Execution. Decision: Answer (Yes/No). Output: Inference Result.]
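The components named on this slide (rule base, working memory, matcher, selector, execution, and a check for an answer) suggest the classic match-select-execute cycle of a forward-chaining system. The sketch below is one minimal reading of that cycle, not the talk's implementation; the rule format, the `class=` goal prefix, and the loan-style rules are hypothetical.

```python
def match(rules, memory):
    """Matcher: rules whose conditions all hold and whose conclusion is new."""
    return [r for r in rules if r["if"] <= memory and r["then"] not in memory]


def select(candidates):
    """Selector: trivial conflict resolution, take the first matching rule."""
    return candidates[0]


def run(rules, facts, goal_prefix="class="):
    """Cycle until an answer is in working memory or no rule can fire."""
    memory = set(facts)  # working memory
    while True:
        answers = sorted(f for f in memory if f.startswith(goal_prefix))
        if answers:  # Answer? Yes: return the inference result
            return answers[0]
        candidates = match(rules, memory)
        if not candidates:  # nothing left to fire, no answer found
            return None
        rule = select(candidates)
        memory.add(rule["then"])  # Execution: assert the conclusion


# Hypothetical rule base for illustration only.
rule_base = [
    {"if": {"income=high", "debt=low"}, "then": "risk=low"},
    {"if": {"risk=low"}, "then": "class=approve"},
    {"if": {"debt=high"}, "then": "class=reject"},
]

result = run(rule_base, {"income=high", "debt=low"})
print(result)
```

Real engines such as CLIPS or JESS use more elaborate matching and conflict-resolution strategies than the first-match selector assumed here.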
System Workflow
Input Data Set → Data Pre-processing → Rule Generator → User Interface Generator
• Input Data Set:
  - Text file with comma-separated values (CSV)
  - It is assumed that there are N columns of values corresponding to N variables or parameters, which may be real or symbolic values.
  - The first N - 1 variables are considered as inputs and the last one is the output variable.
• Data Preprocessing:
  - Discretize domains of real variables into a finite number of intervals
  - Discretized data file is then used to generate an attribute information file and a training data file.
• Rule Generator:
  - A symbolic learning program called BLEM2 is used to generate rules with uncertainty
• User Interface Generator:
  - Generate a web-based rule-based system from a rule file and corresponding attribute file
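To make the input and preprocessing steps concrete, here is a small sketch that reads CSV rows, treats the last column as the output, and discretizes real-valued input columns. The slide does not say which discretization method the system uses; equal-width binning, the three-interval default, the `bin0`/`bin1` labels, and the sample data are all assumptions for illustration.

```python
import csv
import io

# Made-up sample: two input columns (one real, one symbolic) and one output.
SAMPLE_CSV = """5.1,red,yes
6.3,blue,no
7.9,red,no
5.8,green,yes
"""


def is_real(value):
    """True if the text parses as a real number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def equal_width_cuts(values, bins):
    """Interior cut points that split [min, max] into `bins` equal intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [low + i * width for i in range(1, bins)]


def interval_index(value, cuts):
    """Index of the interval that holds `value`."""
    return sum(value >= c for c in cuts)


def discretize(rows, bins=3):
    """Replace real-valued input columns with interval labels."""
    columns = list(zip(*rows))
    out_columns = []
    for column in columns[:-1]:  # first N - 1 columns are inputs
        if all(is_real(v) for v in column):
            numbers = [float(v) for v in column]
            cuts = equal_width_cuts(numbers, bins)
            out_columns.append([f"bin{interval_index(x, cuts)}" for x in numbers])
        else:
            out_columns.append(list(column))  # symbolic values stay as-is
    out_columns.append(list(columns[-1]))  # last column is the output variable
    return [list(r) for r in zip(*out_columns)]


rows = list(csv.reader(io.StringIO(SAMPLE_CSV)))
training_rows = discretize(rows)
for row in training_rows:
    print(row)
```

The cut points computed per column are the kind of information an attribute information file could record, while the relabeled rows play the role of the training data file.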
Architecture of RBC generator
[Figure: Client, Middle Tier, SQL DB server; labeled flows: Requests, Responses.]
Workflow of RBC generator
[Figure: Rule set File and Metadata File; RBC Generator; SQL Rule Table and Rule Table Definition.]
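The workflow labels suggest that a rule set file is turned into a SQL rule table together with a table definition. The sketch below shows one way such a table could look; it uses Python's built-in SQLite as a stand-in for the SQL DB server, and the schema, column names, and rules are hypothetical rather than the RBC generator's actual format.

```python
import sqlite3

# Hypothetical rule set: (rule id, conditions as text, decision, certainty).
RULES = [
    (1, "income=high AND debt=low", "approve", 0.9),
    (2, "debt=high", "reject", 0.8),
]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the DB server

# Rule table definition (assumed schema).
conn.execute(
    """CREATE TABLE rule_table (
           rule_id INTEGER PRIMARY KEY,
           conditions TEXT NOT NULL,
           decision TEXT NOT NULL,
           certainty REAL NOT NULL
       )"""
)
conn.executemany("INSERT INTO rule_table VALUES (?, ?, ?, ?)", RULES)
conn.commit()


def rules_for(decision):
    """Fetch stored rules for one decision, most certain first."""
    cursor = conn.execute(
        "SELECT rule_id, conditions, certainty FROM rule_table "
        "WHERE decision = ? ORDER BY certainty DESC",
        (decision,),
    )
    return cursor.fetchall()


approve_rules = rules_for("approve")
print(approve_rules)
```

In a three-tier setup, a middle tier would run queries like `rules_for` on behalf of web clients, so rules can be updated in the database without changing client code.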
Concluding Remarks
A system for generating rule-based classifiers from data, with the following benefits:
• No need for end-user programming
• Automatic rule-based system creation
• Delivery system is web-based, which provides easy access
Future Work
• More advanced features in Data Preprocessing such as data cleansing, data transformation, and data statistics
• Learning from multi-criteria inputs with preferential rankings to support Multiple Criteria Decision Making processes
• Concept-oriented information retrieval and search
Thank You!