Chan UA Faculty Forum 2008 (PP97)


C.-C. Chan
Department of Computer Science
University of Akron
Akron, OH 44325-4003, USA

[email protected]

UA Faculty Forum 2008 by C.-C. Chan


Outline
• Overview of Data Mining
• Software Tools
• A Rule-Based System for Data Mining
• Concluding Remarks



Data Mining (KDD)
• From Data to Knowledge
• Process of KDD (Knowledge Discovery in Databases)
• Related Technologies
• Comparisons



Why KDD?
"We are drowning in information, but starving for knowledge." (John Naisbitt)
Growing Gap between Data Generation and Data Understanding:
• Automation of business activities: telephone calls, credit card charges, medical tests, etc.
• Earth observation satellites: estimated to generate one terabyte (10^12 bytes) of data per day, at a rate of one picture per second.
• Biology: the Human Genome database project has collected gigabytes of data on the human genetic code [Fasman, Cuticchia, Kingsbury, 1994].
• US Census data
• NASA databases
• World Wide Web



Process of KDD
[1] Fayyad, U., Editorial, Int. J. of Data Mining and Knowledge Discovery, Vol. 1, Issue 1, 1997.
[2] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: an overview," in Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), MIT Press, 1996.


Process of KDD
1. Selection
   • Learning the application domain
   • Creating a target dataset
2. Pre-Processing
   • Data cleaning and preprocessing
3. Transformation
   • Data reduction and projection
4. Data Mining
   • Choosing the functions and algorithms of data mining
   • Association rules, classification rules, clustering rules
5. Interpretation and Evaluation
   • Validate and verify discovered patterns
6. Using discovered knowledge



Typical Data Mining Tasks
• Finding Association Rules [Rakesh Agrawal et al., 1993]
  • Each transaction is a set of items.
  • Given a set of transactions, an association rule is of the form X → Y, where X and Y are sets of items.
  • e.g.: 30% of transactions that contain beer also contain diapers; 2% of all transactions contain both of these items.
  • Applications:
    • Market basket analysis and cross-marketing
    • Catalog design
    • Store layout
    • Buying patterns
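The two percentages in the beer and diapers example correspond to what are usually called the confidence and the support of a rule. The following is a minimal Python sketch of both measures, assuming transactions are held as sets of item names; the function names and the five sample transactions are hypothetical and are not taken from the slides.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, x, y):
    """Fraction of all transactions that contain both X and Y."""
    both = frozenset(x) | frozenset(y)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    n_x = count_containing(transactions, x)
    if n_x == 0:
        return 0.0
    both = frozenset(x) | frozenset(y)
    return count_containing(transactions, both) / n_x


# Hypothetical market-basket data, for illustration only.
transactions = [
    frozenset({"beer", "diapers", "chips"}),
    frozenset({"beer", "diapers"}),
    frozenset({"beer", "milk"}),
    frozenset({"milk", "bread"}),
    frozenset({"bread", "diapers"}),
]

rule_support = support(transactions, {"beer"}, {"diapers"})
rule_confidence = confidence(transactions, {"beer"}, {"diapers"})
print(f"beer -> diapers: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

On this toy data, two of the five transactions contain both items, and two of the three transactions with beer also contain diapers.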



• Finding Sequential Patterns
  • Each data sequence is a list of transactions.
  • Find all sequential patterns with a user-specified minimum support.
  • e.g.: Consider a book-club database. A sequential pattern might be: 5% of customers bought Harry Potter I, then Harry Potter II, and then Harry Potter III.
  • Applications:
    • Add-on sales
    • Customer satisfaction
    • Identify symptoms/diseases that precede certain diseases
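As a rough illustration of what "support" means for a sequential pattern, the Python sketch below counts the fraction of customers whose purchase history contains the pattern in order. It assumes each data sequence is a list of item sets and that pattern elements must match distinct transactions in increasing order; the names and the four sample customers are hypothetical.

```python
def contains_pattern(sequence, pattern):
    """True if the pattern's item sets occur, in order, in distinct transactions."""
    pos = 0
    for element in pattern:
        # Advance to the next transaction that includes this pattern element.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # The next element must match a strictly later transaction.
    return True


def sequence_support(sequences, pattern):
    """Fraction of data sequences that contain the pattern."""
    hits = sum(1 for s in sequences if contains_pattern(s, pattern))
    return hits / len(sequences)


# Hypothetical book-club purchase histories, one list of transactions per customer.
customers = [
    [{"HP1"}, {"HP2"}, {"HP3"}],
    [{"HP1"}, {"cookbook"}, {"HP2", "HP3"}],
    [{"HP2"}, {"HP1"}, {"HP3"}],
    [{"HP1", "atlas"}, {"HP2"}, {"novel"}, {"HP3"}],
]
pattern = [{"HP1"}, {"HP2"}, {"HP3"}]

pattern_support = sequence_support(customers, pattern)
print(f"support = {pattern_support:.2f}")
```

A pattern would then be reported only if its support meets the user-specified minimum.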



• Finding Classification Rules
  • Finding discriminant rules for objects of different classes.
  • Approaches:
    • Finding Decision Trees
    • Finding Production Rules
  • Applications:
    • Processing loan and credit card applications
    • Model identification
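To make the idea of learned production rules concrete, here is a small Python sketch of a one-attribute rule learner (in the spirit of the 1R method). This is a stand-in chosen for brevity, not an algorithm named in the slides; the attribute names and the loan records are hypothetical.

```python
from collections import Counter, defaultdict


def one_rule(rows, target):
    """Pick the single attribute whose value -> majority-class rules make the fewest errors."""
    best = None
    attributes = [a for a in rows[0] if a != target]
    for attr in attributes:
        # Count class labels per attribute value.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One production rule per value: IF attr = value THEN majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rules[row[attr]] != row[target])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


# Hypothetical loan applications, for illustration only.
applications = [
    {"income": "high", "history": "good", "decision": "approve"},
    {"income": "high", "history": "bad", "decision": "reject"},
    {"income": "low", "history": "good", "decision": "approve"},
    {"income": "low", "history": "bad", "decision": "reject"},
    {"income": "medium", "history": "good", "decision": "approve"},
]

attribute, rules, errors = one_rule(applications, "decision")
for value, label in rules.items():
    print(f"IF {attribute} = {value} THEN decision = {label}")
```

Decision-tree learners extend the same idea by splitting recursively on further attributes within each branch.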



• Text Mining
• Web Usage Mining
• Etc.



Related Technologies
• Database Systems
  • MS SQL Server
    • Transaction databases
    • OLAP (Data Cubes)
    • Data Mining
      • Decision Trees
      • Clustering Tools
• Machine Learning/Data Mining Systems
  • CART (Classification And Regression Trees)
  • C 5.x (Decision Trees)
  • WEKA (Waikato Environment for Knowledge Analysis)
  • LERS
  • ROSE 2
• Rule-Based Expert System Development Environments
  • CLIPS, JESS
  • EXSYS
• Web-based Platforms
  • Java
  • MS .Net



Comparisons

System        | Pre-Processing | Learning / Data Mining                | Inference Engine | End-User Interface | Web-Based Access | Reasoning with Uncertainties
MS SQL Server | N/A            | Decision Trees, Clustering            | N/A              | N/A                | N/A              | N/A
CART, C 5.x   | N/A            | Decision Trees                        | Built-in         | Embedded           | N/A              | N/A
WEKA          | Yes            | Trees, Rules, Clustering, Association | N/A              | Embedded           | Need Programming | N/A
CLIPS, JESS   | N/A            | N/A                                   | Built-in         | Embedded           | Need Programming | Third-party extensions



Rule-Based Data Mining System
Objectives
• Develop an integrated rule-based data mining system that provides:
  • Synergy of database systems, machine learning, and expert systems
  • Dealing with uncertain rules
  • Delivery of web-based user interface



Structure of Rule-Based Systems
[Diagram components: Rule Base, Working Memory, Matcher, Selector, Execution, Answer (Yes/No), Inference Result]
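The components named in the diagram are commonly wired together as a match, select, execute cycle. The Python sketch below shows one plausible reading of that cycle as simple forward chaining; the `Rule` class, the first-match selection strategy, and the sample rules are assumptions for illustration and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    conditions: frozenset  # facts that must all be in working memory
    conclusion: str        # fact added when the rule fires


def run(rule_base, initial_facts, goal):
    """Cycle matcher -> selector -> execution until the goal is answered or nothing matches."""
    working_memory = set(initial_facts)
    fired = []
    while goal not in working_memory:
        # Matcher: rules whose conditions hold and that would add something new.
        conflict_set = [
            r for r in rule_base
            if r.conditions <= working_memory and r.conclusion not in working_memory
        ]
        if not conflict_set:
            return False, working_memory, fired  # no answer can be derived
        # Selector: assumed strategy, take the first matching rule in rule-base order.
        chosen = conflict_set[0]
        # Execution: update working memory with the rule's conclusion.
        working_memory.add(chosen.conclusion)
        fired.append(chosen.name)
    return True, working_memory, fired


# Hypothetical rule base, for illustration only.
rule_base = [
    Rule("R1", frozenset({"fever", "cough"}), "flu_suspected"),
    Rule("R2", frozenset({"flu_suspected"}), "recommend_rest"),
]

answered, memory, fired = run(rule_base, {"fever", "cough"}, "recommend_rest")
print(answered, fired)
```

Production systems such as CLIPS use more elaborate conflict-resolution strategies than the first-match choice assumed here.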



System Workflow
[Diagram: Input Data Set, Data Pre-processing, Rule Generator, User Interface Generator]



• Input Data Set:
  • Text file with comma separated values (CSV)
  • It is assumed that there are N columns of values corresponding to N variables or parameters, which may be real or symbolic values.
  • The first N − 1 variables are considered as inputs and the last one is the output variable.
• Data Preprocessing:
  • Discretize domains of real variables into a finite number of intervals
  • Discretized data file is then used to generate an attribute information file and a training data file.
• Rule Generator:
  • A symbolic learning program called BLEM2 is used to generate rules with uncertainty
• User Interface Generator:
  • Generate a web-based rule-based system from a rule file and corresponding attribute file
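The input and preprocessing steps can be sketched in a few lines of Python. The slides do not say which discretization method the system uses, so equal-width binning is assumed here purely for illustration; the sample data, the absence of a header row, the interval labels, and all function names are likewise hypothetical. BLEM2 itself is not reproduced.

```python
import csv
import io


def is_real(column):
    """True if every value in the column parses as a number."""
    try:
        for value in column:
            float(value)
        return True
    except ValueError:
        return False


def equal_width_bins(values, k):
    """Map each value to one of k equal-width intervals (assumed method)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def discretize(rows, k=2):
    """Discretize real-valued input columns; the last column is the output variable."""
    columns = [list(col) for col in zip(*rows)]
    for i, col in enumerate(columns[:-1]):
        if is_real(col):
            bins = equal_width_bins([float(v) for v in col], k)
            columns[i] = [f"interval{b}" for b in bins]
    return [list(row) for row in zip(*columns)]


# Hypothetical CSV input with N = 3 columns: two inputs and one output.
SAMPLE = "36.6,no,no\n37.0,yes,no\n39.2,yes,yes\n40.0,no,yes\n"
rows = list(csv.reader(io.StringIO(SAMPLE)))
training = discretize(rows, k=2)
for row in training:
    print(",".join(row))
```

The discretized rows play the role of the training data file; the interval boundaries would go into the attribute information file.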



Architecture of RBC generator
[Diagram labels: Client, Middle Tier, SQL DB server, Requests, Responses]
Workflow of RBC generator
[Diagram labels: Rule set File, Metadata File, RBC Generator, SQL Rule Table, Rule Table Definition]
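One way to picture the step from a rule set and its metadata to a SQL rule table is sketched below in Python. It uses the standard-library sqlite3 module as a stand-in for the SQL DB server, and the table layout (one column per attribute, NULL meaning "any value", plus a certainty column) is an assumption made for illustration; the slides do not specify the actual schema or file formats.

```python
import sqlite3


def create_rule_table(conn, attributes, decision):
    """Rule table definition: one column per attribute, plus decision and certainty."""
    cols = ", ".join(f'"{a}" TEXT' for a in attributes)
    conn.execute(
        f'CREATE TABLE rules ({cols}, "{decision}" TEXT NOT NULL, certainty REAL NOT NULL)'
    )


def insert_rule(conn, attributes, conditions, decision_value, certainty):
    """Store one rule; attributes not mentioned in the rule are stored as NULL."""
    values = [conditions.get(a) for a in attributes]
    placeholders = ", ".join("?" for _ in range(len(attributes) + 2))
    conn.execute(
        f"INSERT INTO rules VALUES ({placeholders})",
        values + [decision_value, certainty],
    )


def classify(conn, attributes, decision, case):
    """Return (decision, certainty) for every rule matching the case, most certain first."""
    where = " AND ".join(f'("{a}" IS NULL OR "{a}" = ?)' for a in attributes)
    sql = f'SELECT "{decision}", certainty FROM rules WHERE {where} ORDER BY certainty DESC'
    return conn.execute(sql, [case[a] for a in attributes]).fetchall()


# Hypothetical metadata and rule set, for illustration only.
attributes = ["temp", "cough"]
decision = "flu"
conn = sqlite3.connect(":memory:")
create_rule_table(conn, attributes, decision)
insert_rule(conn, attributes, {"temp": "high"}, "yes", 0.9)
insert_rule(conn, attributes, {"temp": "normal", "cough": "no"}, "no", 1.0)
insert_rule(conn, attributes, {"cough": "yes"}, "yes", 0.6)

matches = classify(conn, attributes, decision, {"temp": "high", "cough": "yes"})
print(matches)
```

In a three-tier setup, a query like the one in `classify` would run in the middle tier on behalf of the web client. Attribute names are interpolated into the SQL text here, so a real generator would need to validate them first.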



Concluding Remarks
A system for generating rule-based classifiers from data, with the following benefits:
• No need for end-user programming
• Automatic rule-based system creation
• Delivery system is web-based, providing easy access



Future Work
• More advanced features in Data Preprocessing such as data cleansing, data transformation, and data statistics
• Learning from multi-criteria inputs with preferential rankings to support Multiple Criteria Decision Making processes
• Concept-Oriented information retrieval and search




Thank You!

