Chan UA Faculty Forum 2008 (PP97)


  • 8/6/2019 Chan UA Faculty Forum 2008 (PP97)


    C.-C. Chan
    Department of Computer Science
    University of Akron
    Akron, OH 44325-4003, USA

    [email protected]

    UA Faculty Forum 2008 by C.-C. Chan


    Outline
    • Overview of Data Mining
    • Software Tools
    • A Rule-Based System for Data Mining
    • Concluding Remarks


    Data Mining (KDD)
    • From Data to Knowledge
    • Process of KDD (Knowledge Discovery in Databases)
    • Related Technologies
    • Comparisons


    Why KDD?
    "We are drowning in information, but starving for knowledge." (John Naisbitt)
    Growing gap between data generation and data understanding:
    • Automation of business activities: telephone calls, credit card charges, medical tests, etc.
    • Earth observation satellites: estimated to generate one terabyte (10^12 bytes) of data per day, at a rate of one picture per second.
    • Biology: the Human Genome database project has collected gigabytes of data on the human genetic code [Fasman, Cuticchia, Kingsbury, 1994].
    • US Census data
    • NASA databases
    • World Wide Web


    Process of KDD
    [1] Fayyad, U., Editorial, Int. J. of Data Mining and Knowledge Discovery, Vol. 1, Issue 1, 1997.
    [2] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: an overview," in Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), MIT Press, 1996.


    Process of KDD
    1. Selection
        • Learning the application domain
        • Creating a target dataset
    2. Pre-Processing
        • Data cleaning and preprocessing
    3. Transformation
        • Data reduction and projection
    4. Data Mining
        • Choosing the functions and algorithms of data mining
        • Association rules, classification rules, clustering rules
    5. Interpretation and Evaluation
        • Validate and verify discovered patterns
    6. Using discovered knowledge


    Typical Data Mining Tasks
    • Finding Association Rules [Rakesh Agrawal et al., 1993]
        • Each transaction is a set of items.
        • Given a set of transactions, an association rule is of the form X → Y, where X and Y are sets of items.
        • e.g.: 30% of transactions that contain beer also contain diapers; 2% of all transactions contain both of these items.
      Applications:
        • Market basket analysis and cross-marketing
        • Catalog design
        • Store layout
        • Buying patterns
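In the beer and diapers example, the 30% figure is what is usually called the confidence of the rule and the 2% figure is its support. The sketch below computes both measures over a tiny, made-up set of baskets; the function names and the data are illustrative assumptions, not part of the talk.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    if lhs_count == 0:
        return 0.0
    both_count = sum(1 for t in transactions if (lhs | rhs) <= t)
    return both_count / lhs_count


# Hypothetical market baskets, one set of items per transaction.
baskets = [
    frozenset({"beer", "diapers", "chips"}),
    frozenset({"beer", "chips"}),
    frozenset({"beer", "diapers"}),
    frozenset({"milk", "bread"}),
    frozenset({"milk", "diapers"}),
]

rule_support = support(baskets, {"beer", "diapers"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"beer"}, {"diapers"})  # 2 of 3 beer baskets
print(f"beer -> diapers: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule miner would typically keep only rules whose support and confidence both exceed user-chosen thresholds.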


    • Finding Sequential Patterns
        • Each data sequence is a list of transactions.
        • Find all sequential patterns with a user-specified minimum support.
        • e.g.: Consider a book-club database. A sequential pattern might be: 5% of customers bought Harry Potter I, then Harry Potter II, and then Harry Potter III.
      Applications:
        • Add-on sales
        • Customer satisfaction
        • Identify symptoms/diseases that precede certain diseases
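As a rough illustration of what "support" means for a sequential pattern, the sketch below counts the fraction of customers whose purchase history contains the pattern's steps in order, each step in a strictly later transaction than the previous one. The data, the item codes (`HP1`, `HP2`, `HP3`), and the function names are assumptions made for this example only.

```python
def contains_in_order(sequence, pattern):
    """True if each itemset in pattern is covered by a later transaction than the previous one."""
    position = 0
    for wanted in pattern:
        # Scan forward to the first transaction that contains this step.
        while position < len(sequence) and not wanted <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next step must come from a strictly later transaction
    return True


def sequence_support(customers, pattern):
    """Fraction of customer sequences that contain the pattern."""
    matches = sum(1 for seq in customers if contains_in_order(seq, pattern))
    return matches / len(customers)


# Hypothetical book-club data: one list of transactions (item sets) per customer.
customers = [
    [{"HP1"}, {"HP2"}, {"HP3"}],
    [{"HP1"}, {"Cookbook"}, {"HP2"}, {"HP3"}],
    [{"HP2"}, {"HP1"}, {"HP3"}],   # wrong order
    [{"HP1", "HP2"}, {"HP3"}],     # HP1 and HP2 bought together, not in sequence
]
pattern = [{"HP1"}, {"HP2"}, {"HP3"}]

pattern_support = sequence_support(customers, pattern)
print(f"support = {pattern_support:.2f}")
```

A pattern would be reported only if this value reaches the user-specified minimum support.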


    • Finding Classification Rules
        • Finding discriminant rules for objects of different classes.
        • Approaches:
            • Finding Decision Trees
            • Finding Production Rules
      Applications:
        • Processing loan and credit card applications
        • Model identification
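To make the idea of production rules concrete, here is a deliberately simple sketch that derives one "if attribute = value then class" rule per attribute value, together with the fraction of training cases the rule gets right. It is not the algorithm used by any of the tools named in this talk; the loan data and all names are hypothetical.

```python
from collections import Counter, defaultdict


def one_attribute_rules(rows, attribute, target):
    """For each value of attribute, predict the majority class and report its accuracy."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1

    rules = {}
    for value, class_counts in counts.items():
        label, hits = class_counts.most_common(1)[0]
        rules[value] = (label, hits / sum(class_counts.values()))
    return rules


# Hypothetical loan applications with a single symbolic input attribute.
applications = [
    {"income": "high", "decision": "approve"},
    {"income": "high", "decision": "approve"},
    {"income": "high", "decision": "reject"},
    {"income": "low", "decision": "reject"},
    {"income": "low", "decision": "reject"},
]

rules = one_attribute_rules(applications, "income", "decision")
for value, (label, accuracy) in rules.items():
    print(f"IF income = {value} THEN decision = {label}  (accuracy {accuracy:.2f})")
```

The accuracy attached to each rule is one simple way to express that a rule is uncertain rather than always correct.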


    • Text Mining
    • Web Usage Mining
    • Etc.


    Related Technologies
    • Database Systems
        • MS SQL Server
            • Transaction databases
            • OLAP (Data Cubes)
            • Data Mining
                • Decision Trees
                • Clustering Tools
    • Machine Learning/Data Mining Systems
        • CART (Classification And Regression Trees)
        • C 5.x (Decision Trees)
        • WEKA (Waikato Environment for Knowledge Analysis)
        • LERS
        • ROSE 2
    • Rule-Based Expert System Development Environments
        • CLIPS, JESS
        • EXSYS
    • Web-based Platforms
        • Java
        • MS .Net


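    Comparisons

    System        | Pre-Processing | Learning / Data Mining                | Inference Engine | End-User Interface | Web-Based Access | Reasoning with Uncertainties
    --------------+----------------+---------------------------------------+------------------+--------------------+------------------+-----------------------------
    MS SQL Server | N/A            | Decision Trees, Clustering            | N/A              | N/A                | N/A              | N/A
    CART, C 5.x   | N/A            | Decision Trees                        | Built-in         | Embedded           | N/A              | N/A
    WEKA          | Yes            | Trees, Rules, Clustering, Association | N/A              | Embedded           | Need Programming | N/A
    CLIPS, JESS   | N/A            | N/A                                   | Built-in         | Embedded           | Need Programming | 3rd-party extensions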


    Rule-Based Data Mining System
    Objectives
    • Develop an integrated rule-based data mining system that provides:
        • Synergy of database systems, machine learning, and expert systems
        • Dealing with uncertain rules
        • Delivery of web-based user interface


    Structure of Rule-Based Systems
    [Figure: block diagram with the components Rule Base, Working Memory, Matcher, Selector, and Execution, and a Yes/No decision labeled Answer leading to the Inference Result.]
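The components named in the diagram correspond to the usual match, select, execute cycle of a forward-chaining rule engine. The sketch below is a minimal version of that cycle under stated assumptions: rules are (conditions, conclusion) pairs, the selector simply takes the first matching rule, and the loop stops once a goal fact is in working memory or no rule can fire. The rule contents and names are hypothetical.

```python
def run_inference(rules, facts, goal, max_cycles=100):
    """Forward-chain until goal is derived or no rule matches; returns (answered, memory)."""
    working_memory = set(facts)
    for _ in range(max_cycles):
        # Answer check: stop as soon as the goal fact is present.
        if goal in working_memory:
            return True, working_memory
        # Matcher: rules whose conditions all hold and whose conclusion is new.
        conflict_set = [
            (conditions, conclusion)
            for conditions, conclusion in rules
            if conditions <= working_memory and conclusion not in working_memory
        ]
        if not conflict_set:
            return False, working_memory
        # Selector: assumed strategy, take the first matching rule in rule-base order.
        _, conclusion = conflict_set[0]
        # Execution: fire the rule by adding its conclusion to working memory.
        working_memory.add(conclusion)
    return goal in working_memory, working_memory


# Hypothetical rule base.
rule_base = [
    (frozenset({"has_income", "good_credit"}), "low_risk"),
    (frozenset({"low_risk"}), "approve_loan"),
]

answered, memory = run_inference(rule_base, {"has_income", "good_credit"}, "approve_loan")
print(answered, sorted(memory))
```

Real engines such as CLIPS use more elaborate matching and conflict-resolution strategies; the first-match selector here is only a placeholder.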


    System Workflow
    Input Data Set → Data Pre-processing → Rule Generator → User Interface Generator


    • Input Data Set:
        • Text file with comma-separated values (CSV)
        • It is assumed that there are N columns of values corresponding to N variables or parameters, which may be real or symbolic values.
        • The first N - 1 variables are considered as inputs and the last one is the output variable.
    • Data Preprocessing:
        • Discretize domains of real variables into a finite number of intervals
        • The discretized data file is then used to generate an attribute information file and a training data file.
    • Rule Generator:
        • A symbolic learning program called BLEM2 is used to generate rules with uncertainty
    • User Interface Generator:
        • Generate a web-based rule-based system from a rule file and corresponding attribute file
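The input and preprocessing steps can be sketched as follows. The slides do not say which discretization method the system uses, so equal-width binning is assumed here purely for illustration; the function names, the bin labels, and the sample data are likewise hypothetical.

```python
import csv
import io


def load_dataset(text):
    """Parse CSV text; the first N - 1 columns are inputs, the last is the output."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    inputs = [row[:-1] for row in rows]
    outputs = [row[-1] for row in rows]
    return inputs, outputs


def is_real(value):
    """True if the string parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def discretize_column(values, bins=3):
    """Equal-width binning (an assumed method) of one real-valued column."""
    numbers = [float(v) for v in values]
    low, high = min(numbers), max(numbers)
    width = (high - low) / bins
    if width == 0:
        width = 1.0  # constant column: everything falls in the first bin
    return [f"bin{min(int((x - low) / width), bins - 1)}" for x in numbers]


def preprocess(inputs, bins=3):
    """Discretize real-valued input columns and leave symbolic columns unchanged."""
    columns = list(zip(*inputs))
    converted = [
        discretize_column(col, bins) if all(is_real(v) for v in col) else list(col)
        for col in columns
    ]
    return [list(row) for row in zip(*converted)]


sample = "5.0,red,yes\n1.0,blue,no\n9.0,red,yes\n3.0,blue,no\n"
inputs, outputs = load_dataset(sample)
training_rows = preprocess(inputs)
print(training_rows, outputs)
```

The discretized rows plus the output column play the role of the training data file that a rule generator would consume.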


    Architecture of RBC generator
    [Figure: Client, Middle Tier, and SQL DB server, with Requests and Responses passing between them.]

    Workflow of RBC generator
    [Figure: Rule set File, Metadata File, RBC Generator, SQL Rule Table, Rule Table Definition.]
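The slides name a SQL rule table and a rule table definition but do not show their schema. The sketch below is one possible layout, assumed for illustration only: one column per input attribute (NULL meaning the rule does not test that attribute), plus a decision and a certainty column. Python's built-in `sqlite3` module stands in for the SQL DB server, and all table, column, and function names are hypothetical.

```python
import sqlite3


def build_rule_table(attributes, rules):
    """Create an in-memory rule table; attribute names are assumed to be trusted identifiers."""
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{name}" TEXT' for name in attributes)
    conn.execute(f"CREATE TABLE rules ({columns}, decision TEXT, certainty REAL)")
    placeholders = ", ".join("?" for _ in range(len(attributes) + 2))
    for conditions, decision, certainty in rules:
        # Attributes a rule does not mention are stored as NULL (no condition).
        row = [conditions.get(name) for name in attributes] + [decision, certainty]
        conn.execute(f"INSERT INTO rules VALUES ({placeholders})", row)
    return conn


def classify(conn, attributes, case):
    """Return (decision, certainty) for every rule matching the case, most certain first."""
    where = " AND ".join(f'("{name}" IS NULL OR "{name}" = ?)' for name in attributes)
    query = f"SELECT decision, certainty FROM rules WHERE {where} ORDER BY certainty DESC"
    return conn.execute(query, [case[name] for name in attributes]).fetchall()


# Hypothetical attribute list and rule set.
attributes = ["income", "credit"]
rule_set = [
    ({"income": "high", "credit": "good"}, "approve", 0.9),
    ({"income": "low"}, "reject", 0.8),
    ({"credit": "good"}, "approve", 0.6),
]

conn = build_rule_table(attributes, rule_set)
matches = classify(conn, attributes, {"income": "high", "credit": "good"})
print(matches)
```

Keeping the rules in a table lets a middle tier answer a client request with a single query instead of running a separate inference engine.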


    Concluding Remarks
    A system for generating rule-based classifiers from data, with the following benefits:
    • No need for end-user programming
    • Automatic rule-based system creation
    • Web-based delivery system provides easy access


    Future Work
    • More advanced features in Data Preprocessing such as data cleansing, data transformation, and data statistics
    • Learning from multi-criteria inputs with preferential rankings to support Multiple Criteria Decision Making processes
    • Concept-Oriented information retrieval and search


    Thank You!
