Data Mining- IMT Nagpur-Manish


Transcript of Data Mining- IMT Nagpur-Manish

  • 8/8/2019 Data Mining- IMT Nagpur-Manish

    Data Mining
    &
    Its Business Applications

    MANISH GUPTA

    Principal Analytics Consultant

    Innovation Labs, 24/7 Customer Pvt. Ltd.

    Bangalore-560071

    (Email: [email protected])

    Why Data Mining?

    The data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.

    We are drowning in data, but starving for knowledge!

    The secret of success in business is knowing that which nobody else knows.

    Solution: Data Warehousing and Data Mining

    What is Data Mining? (Knowledge Discovery in Databases)

    Definition: extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases.

    Data Mining vs DBMS-SQL

    DBMS-SQL: queries based on the data held.
    Examples:
    - Last month's sales for each product
    - Sales grouped by customer age, etc.
    - List of customers whose policies lapsed

    Data Mining: infers knowledge from the data to answer queries.
    Examples:
    - What characteristics do customers have whose policies have lapsed?
    - Is the sales of this product dependent on the sales of some other product?

    Data Mining: Confluence of Multiple Disciplines

    Data mining draws on database technology, statistics, machine learning, information science, visualization, and other disciplines.

    Data Mining and Business Intelligence

    Layers of a BI stack, with increasing potential to support business decisions toward the top (typical users, top to bottom: end user, business analyst, data analyst, DBA):

    - Making decisions
    - Data presentation: visualization techniques
    - Data mining: information discovery
    - Data exploration: OLAP, statistical analysis, querying and reporting
    - Data warehouses
    - Data sources: paper, files, information providers, database systems

    Data Mining: A KDD Process

    Data mining is the core of the knowledge discovery process:

    Databases -> Data Cleaning -> Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation

    Architecture of a Typical Data Mining System

    - Databases and data warehouse (populated via data cleaning, data integration, and filtering)
    - Database or data warehouse server
    - Data mining engine
    - Pattern evaluation (supported by a knowledge base)
    - Graphical user interface

    Applications

    Business Domain
    - Market-basket databases
    - Financial databases
    - Insurance databases
    - Telecommunication databases
    - Business analytics
    - CRM

    Defence Domain
    - MSDF: ELINT data analysis
    - Emitter classification
    - Intrusion detection

    Business Applications

    Database analysis and decision support
    - Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
    - Fraud detection and management

    Other applications
    - Text mining (newsgroups, email, documents) and Web analysis

    Market Analysis & Management

    Where are the data sources for analysis?
    - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

    Target marketing
    - Find clusters of model customers who share the same characteristics: interests, income level, spending habits, etc.
    - Determine customer purchasing patterns over time (e.g., conversion of a single to a joint bank account: marriage, etc.)

    Cross-market analysis
    - Associations/correlations between product sales
    - Prediction based on the association information

    Market Analysis & Management (contd.)

    Customer profiling
    - Data mining can tell you what types of customers buy what products (clustering or classification)

    Identifying customer requirements
    - Identify the best products for different customers
    - Use prediction to find what factors will attract new customers

    Provides summary information
    - Various multidimensional summary reports
    - Statistical summary information (data central tendency and variation)

    Data Mining Techniques

    - Clustering
    - Classification
    - Association Rules Mining (Market Basket Analysis)

    Clustering

    Clustering: Basic Idea

    Clustering: grouping a set of data objects into clusters
    - Similar objects within the same cluster
    - Dissimilar objects in different clusters

    Clustering is unsupervised
    - No previous categorization known
    - Totally data driven

    Clustering: Example

    A good clustering method will produce high-quality clusters with
    - high intra-class similarity
    - low inter-class similarity

    [Scatter plot: several dense groups of points, with one isolated point marked as an outlier.]

    Similarity Computation

    Distance between objects is used as the metric. The definitions of distance functions are usually different for different types of attributes.

    A distance must satisfy the following properties:
    - d(i,j) >= 0
    - d(i,j) = d(j,i)
    - d(i,j) <= d(i,k) + d(k,j)

    Distance Calculation: objects Xi and Xj (p attributes)

    Minkowski:  d(i,j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q)^(1/q)

    Manhattan (q = 1):  d(i,j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|

    Euclidean (q = 2):  d(i,j) = (|xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2)^(1/2)
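    These formulas can be written directly in Python; a minimal sketch (the sample records L1 and L5 are taken from the k-means example later in the deck):

```python
import math

def minkowski(xi, xj, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    return sum(abs(a - b) ** q for a, b in zip(xi, xj)) ** (1.0 / q)

def manhattan(xi, xj):
    return minkowski(xi, xj, 1)   # q = 1

def euclidean(xi, xj):
    return minkowski(xi, xj, 2)   # q = 2

# Records L1 and L5 from the k-means example later in the deck
L1 = (3, 10, 23, 36)
L5 = (1, 16, 1, 28)

print(manhattan(L1, L5))             # 38.0
print(round(euclidean(L1, L5), 2))   # 24.25
```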

    Methods

    Partition methods
    - Iterative methods
    - Convergence criteria specified by the user

    Hierarchical methods
    - Agglomerative / divisive
    - Use dendrogram representation

    Partitioning Methods: K-Means Clustering

    - Decide k, the number of clusters
    - Randomly pick k seeds to use as centroids
    - Repeat until the convergence condition is met:
      - Scan the database and assign each object to the nearest cluster
      - Recompute the centroids
    - Evaluate the quality of the clustering

    Example:

    Records  Feature1  Feature2  Feature3  Feature4
    L1          3        10        23        36
    L2         12         6        12        41
    L3          5        12        17        24
    L4          4         8         7        13
    L5          1        16         1        28
    L6         18         0        22        51
    L7          6         8         6        12
    L8         15         5         2         6
    L9          0        10        15        18
    L10         9         2        24        15

    Initialization

    We take the number of cluster centers as 3, i.e., K = 3.

    Let's take the initial cluster centers as
    - L1 (3, 10, 23, 36)
    - L5 (1, 16, 1, 28)
    - L8 (15, 5, 2, 6)

    Pictorial View of the Clusters After the First Iteration

    - Cluster 1: L1, L2, L3, L6, L10 -- centroid (9.4, 6, 19.6, 33.4)
    - Cluster 2: L5, L9 -- centroid (0.5, 13, 8, 23)
    - Cluster 3: L4, L7, L8 -- centroid (8.3, 7, 5, 10.3)

    Pictorial View of the Clusters After the Second Iteration

    - Cluster 1: L1, L2, L6, L10 -- centroid (10.5, 4.5, 20.3, 35.8)
    - Cluster 2: L3, L5, L9 -- centroid (2, 12.7, 11, 23.3)
    - Cluster 3: L4, L7, L8 -- centroid (8.33, 7, 5, 10.3)

    Pictorial View of the Clusters After the Third Iteration

    - Cluster 1: L1, L2, L6, L10 -- centroid (10.5, 4.5, 20.3, 35.8)
    - Cluster 2: L3, L5, L9 -- centroid (2, 12.7, 11, 23.3)
    - Cluster 3: L4, L7, L8 -- centroid (8.33, 7, 5, 10.3)

    The cluster centers remain the same as in the second iteration, so we stop here.

    Hierarchical Methods

    - Agglomerative methods: bottom-up approach
    - Divisive methods: top-down approach

    Dendrogram: Agglomerative Approach

    Starting from the database, a distance matrix over objects A-F is computed (a lower-triangular matrix of pairwise distances d_ab, d_ac, d_bc, ..., with zeros on the diagonal). The closest clusters are merged step by step, e.g. first {A, B, C} and {E, F}, then {A, B, C, D}, until a single cluster remains.
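    A single-linkage agglomerative sketch in Python; the six 1-D coordinates below are hypothetical, chosen only so that the merge order reproduces the dendrogram's groupings ({A, B, C} and {E, F} form first, then {A, B, C, D}):

```python
# A minimal sketch of agglomerative (bottom-up) clustering with single linkage.
# The six 1-D points are hypothetical; the slide's distance matrix is schematic.
points = {"A": 1.0, "B": 1.5, "C": 2.0, "D": 4.0, "E": 8.0, "F": 8.5}

clusters = [frozenset([name]) for name in points]

def linkage(c1, c2):
    """Single linkage: distance between the closest pair of members."""
    return min(abs(points[a] - points[b]) for a in c1 for b in c2)

merges = []
while len(clusters) > 1:
    # Find the pair of clusters with the smallest single-linkage distance.
    (i, j) = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
    )
    merged = clusters[i] | clusters[j]
    merges.append(sorted(merged))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(merges)
```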

    Divisive Approach

    Starting from the database as a single cluster {A, B, C, D, E, F}, the data is split top-down, e.g. into {A, B, C, D} and {E, F}, then {A, B, C} and {D}, down to individual objects.

    Clustering: Applications

    Marketing management
    - Discover distinct groups in customer bases, and then use this knowledge to develop targeted marketing programs

    Banking
    - ATM location identification

    Text mining
    - Grouping documents with similar characteristics

    Further clustering case studies (figures omitted):

    - Clustering companies using the Dow Jones Index
    - Trading system development
    - Clustering for customer profiling
    - New product line development
    - Crime hot spot analysis
    - Clustering for medical diagnostics
      - Human Genome Project: finding relationships between diseases, cellular functions, and drugs
      - Wisconsin Breast Cancer Study: cancer diagnosis and predictions

    Classification

    Easy to agree these are sunset pictures!

    These are all As! (Handwritten characters from the NIST database.)

    In most cases it is easy for experts to attach class labels, but difficult to explain why!

    Classification

    Supervised learning method
    - Use historical data to construct a model (hypothesis formulation)
    - Discover the relationship between input attributes and the target
    - Use the model for prediction

    Major classification methods
    - Decision trees (ID3, CART, C4.5, SLIQ)
    - Neural networks (MLP)
    - Support vector machines
    - Bayesian classifiers (NBC, BBN)
    - K-nearest neighbor (KNN)

    The classification task

    Input: a training set of tuples, each labelled with one class label.

    Output: a model (classifier) which assigns a class label to each tuple based on the other attributes.

    The model can be used to predict the class of new tuples, for which the class label is missing or unknown.

    Training step

    Training data:

    NAME   AGE      INCOME  CREDIT
    Mary   20 - 30  low     poor
    James  30 - 40  low     fair
    Bill   30 - 40  high    good
    John   20 - 30  med     fair
    Marc   40 - 50  high    good
    Annie  40 - 50  high    good

    A classification algorithm learns a classifier (model) from this data, e.g.:

    IF age = 30 - 40 OR income = high THEN credit = good
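    The learned rule can be applied directly; a minimal sketch (the fallback label "fair" for tuples the rule does not cover is an assumption, chosen to match the predictions on the test-step slide):

```python
def predict_credit(age, income):
    """Apply the learned rule; 'fair' as default for uncovered tuples is an assumption."""
    if age == "30 - 40" or income == "high":
        return "good"
    return "fair"

# The three tuples from the test-step slide (the third name is truncated in the source)
test = [("Paul", "20 - 30", "high"),
        ("Jenny", "40 - 50", "low"),
        ("ick", "30 - 40", "high")]
preds = [predict_credit(age, income) for _, age, income in test]
print(preds)   # ['good', 'fair', 'good']
```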

    Test step

    Test data (with actual CREDIT labels) and the classifier's predictions:

    NAME   AGE      INCOME  CREDIT  PREDICTED
    Paul   20 - 30  high    good    good
    Jenny  40 - 50  low     fair    fair
    ick    30 - 40  high    fair    good

    Prediction

    Unseen data fed to the classifier (model):

    NAME  AGE      INCOME  PREDICTED CREDIT
    Doc   20 - 30  high    good
    Phil  30 - 40  low     good
    Kat   0 - 0    med     fair

    Classification: Approaches

    - Decision tree induction
    - Neural networks
    - Support vector machines
    - Bayesian approach
    - Rule induction

    Decision Tree Induction

    Recursive partitioning of the training set T until a stopping criterion is satisfied (purity of partition, depth of tree, etc.):
    - Decide the split criterion
    - Select the splitting attribute
    - Partition the data according to the selected attribute
    - Apply the induction method recursively on each partition

    Decision tree inducers

    - ID3 (JR Quinlan, 1986): simple, uses information gain, no pruning
    - C4.5 (JR Quinlan, 1993): uses gain ratio, handles numeric attributes and missing values, error-based pruning
    - SLIQ (Mehta et al., 1996): scalable, one scan of the database, uses Gini index
    - CART (Breiman et al., 1984): constructs binary trees, cost-complexity pruning, can generate regression trees

    Attribute Selection Criteria

    Information gain
    - Entropy(C, S) = - SUM_i p_i log(p_i)
    - Gain = Entropy(before split) - Entropy(after split)

    Gain ratio
    - Gain Ratio = Information Gain / Split Information (the entropy of the partition induced by the splitting attribute)

    Gini index
    - Measures divergence: Gini(C, S) = 1 - SUM_i p_i^2
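    A quick check of the information-gain formula in Python, using the outlook attribute of Quinlan's play-tennis training set (the full table appears on the next slide; class P = play, N = don't play):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(C, S) = -sum_i p_i * log2(p_i) over class frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Quinlan's play-tennis training set: (outlook, class) pairs
data = [("sunny", "N"), ("sunny", "N"), ("overcast", "P"), ("rain", "P"),
        ("rain", "P"), ("rain", "N"), ("overcast", "P"), ("sunny", "N"),
        ("sunny", "P"), ("rain", "P"), ("sunny", "P"), ("overcast", "P"),
        ("overcast", "P"), ("rain", "N")]

labels = [c for _, c in data]
before = entropy(labels)                      # entropy before the split

# Entropy after splitting on outlook: weighted entropy of each partition
values = {v for v, _ in data}
after = sum(
    (len(part) / len(data)) * entropy(part)
    for part in ([c for v, c in data if v == val] for val in values)
)
gain = before - after
print(round(before, 3), round(gain, 3))       # 0.94 0.247
```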

    Classical example: play tennis?

    Training set from Quinlan's book (class P = play, N = don't play):

    outlook   temperature  humidity  windy  class
    sunny     hot          high      false  N
    sunny     hot          high      true   N
    overcast  hot          high      false  P
    rain      mild         high      false  P
    rain      cool         normal    false  P
    rain      cool         normal    true   N
    overcast  cool         normal    true   P
    sunny     mild         high      false  N
    sunny     cool         normal    false  P
    rain      mild         normal    false  P
    sunny     mild         normal    true   P
    overcast  mild         high      true   P
    overcast  hot          normal    false  P
    rain      mild         high      true   N

    Decision tree obtained with ID3 (Quinlan 86):

    outlook?
    - sunny -> humidity?
      - high -> N
      - normal -> P
    - overcast -> P
    - rain -> windy?
      - true -> N
      - false -> P

    From decision trees to classification rules

    One rule is generated for each path in the tree from the root to a leaf. Rules are generally simpler to understand than trees.

    Example, from the sunny/normal path of the tree:

    IF outlook = sunny AND humidity = normal THEN play tennis
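    The path-to-rule conversion can be sketched in a few lines, with the tree stored as nested dicts (this representation is my own choice; the tree itself is the ID3 play-tennis tree from the previous slide):

```python
# A sketch of path-to-rule extraction from a decision tree stored as nested dicts
# (P = play, N = don't play).
tree = {"outlook": {
    "sunny": {"humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain": {"windy": {"true": "N", "false": "P"}},
}}

def rules(node, conditions=()):
    """Walk every root-to-leaf path, yielding (conditions, class) pairs."""
    if isinstance(node, str):           # leaf: emit the accumulated rule
        yield conditions, node
        return
    (attr, branches), = node.items()
    for value, child in branches.items():
        yield from rules(child, conditions + ((attr, value),))

for conds, label in rules(tree):
    body = " AND ".join(f"{a}={v}" for a, v in conds)
    print(f"IF {body} THEN class={label}")
```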

    Advantages & Limitations

    Advantages
    - Self-explanatory
    - Handles both numeric and categorical data
    - Non-parametric method

    Limitations
    - Most algorithms predict only categorical attributes
    - Overtraining (need for pruning)
    - Large trees

    Bayesian classification

    The classification problem may be formalized using a-posteriori probabilities:

    P(C|X) = probability that the sample tuple X = <x1, ..., xk> is of class C.

    E.g. P(class = N | outlook = sunny, windy = true, ...)

    Idea: assign to sample X the class label C such that P(C|X) is maximal.

    Estimating a-posteriori probabilities

    Bayes theorem: P(C|X) = P(X|C) P(C) / P(X)

    - P(X) is constant for all classes
    - P(C) = relative frequency of class C samples
    - The C such that P(C|X) is maximum = the C such that P(X|C) P(C) is maximum

    Problem: computing P(X|C) directly is unfeasible!

    Naive Bayesian Classification

    Naive assumption: attribute independence

    P(x1, ..., xk | C) = P(x1|C) * ... * P(xk|C)

    - If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
    - If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function

    Computationally easy in both cases!
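    A minimal naive Bayes sketch on the play-tennis data (relative-frequency estimates, no smoothing; the query day is my own example):

```python
from collections import Counter, defaultdict

# Play-tennis training set: (outlook, temperature, humidity, windy, class)
rows = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

prior = Counter(r[-1] for r in rows)          # class counts
cond = defaultdict(Counter)                   # per-class (attribute index, value) counts
for *attrs, cls in rows:
    for i, v in enumerate(attrs):
        cond[cls][(i, v)] += 1

def posterior(x, cls):
    """P(C) * prod_i P(x_i | C), with relative-frequency estimates."""
    p = prior[cls] / len(rows)
    for i, v in enumerate(x):
        p *= cond[cls][(i, v)] / prior[cls]
    return p

x = ("sunny", "cool", "high", "true")         # an unseen day
scores = {cls: posterior(x, cls) for cls in prior}
print(max(scores, key=scores.get))            # N  (don't play)
```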


    If the data set is not so large: cross-validation

    - Split the available examples into a training set (90%) and a test set (10%)
    - Repeat 10 times, each time holding out a different 10%
    - The splits are used to develop 10 different trees; tabulate the accuracies
    - Generalization: report the mean and standard deviation of accuracy
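    The 90%/10% splitting can be sketched as follows (indices only; the learning algorithm that would be trained on each split is left out):

```python
# A sketch of 10-fold cross-validation index splitting: each fold is the
# held-out test set exactly once, the other nine folds form the training set.
def ten_fold_splits(n_examples, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_examples))
    fold = n_examples // k
    for i in range(k):
        test = indices[i * fold:(i + 1) * fold]
        train = indices[:i * fold] + indices[(i + 1) * fold:]
        yield train, test

splits = list(ten_fold_splits(100))
print(len(splits))                           # 10 folds
print(len(splits[0][0]), len(splits[0][1]))  # 90 10
```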

    Classification: Applications

    Bank loan granting system
    - Decision trees constructed from bank-loan histories to produce algorithms to decide whether to grant a loan or not

    Anti money laundering system
    - KYC status

    Email classification system
    - Spam or not

    Stock Research Applications

    Efficient prediction of option prices using machine learning techniques
    - Prediction of both European and American option prices using General Regression Neural Networks and Support Vector Regression

    Stock portfolio management: prediction of risk using text classification
    - Prediction or classification of risk in investment of a particular company by text classification using Naive Bayes (NB) and K-Nearest Neighbor (KNN)

    Prediction of financial data series
    - Using the MATLAB GARCH Toolbox

    Pattern Recognition: Artificial Neural Network Applications

    - Letter recognition systems
    - Zip code identification systems (Apple's Newton uses a neural net)
    - Speech recognition systems
    - Voice dialing
    - Image processing
    - Bioinformatics

    Emitter Classification

    - ELINT data analysis
    - Identification of radars and platforms
    - Successfully delivered DAPR software to the Indian Navy (INTEG)

    Speech example: although the spoken words are the same, the recorded digital signals of Speaker 1 and Speaker 2 are very different!

    Pattern Recognition Example

    A noisy image is matched to a recognized pattern.

    Association Rule Mining (Market Basket Analysis)

    Association Rules

    Intra-record links: finding associations among sets of objects in transaction databases and relational databases.

    Rule form: Antecedent -> Consequent [support, confidence]

    Examples:
    - shirt, tie, socks -> shoes [0.5%, 60%]
    - white bread, butter -> egg [2.3%, 80%]

    Preliminaries

    Given: (1) a database of transactions; (2) each transaction is a list of items.

    Find: all rules that correlate the presence of one set of items with that of another set of items.
    - E.g., 95% of people who purchase a PC and a color printer also purchase a computer table.

    Business questions:
    - * -> electronic items (what should the store do to boost sales of electronic items?)
    - herbal health products -> * (what other products should the store stock up on?)

    Formal Definition

    If X and Y are two itemsets such that X ∩ Y = ∅, then for an association rule X -> Y:

    - Support is the probability that X and Y occur together [P(X U Y)]
    - Confidence is the conditional probability that Y occurs in a transaction, given X is present in the same transaction [P(Y|X)]

    Itemset and Support

    Trans ID  Items
    10        A, B, C
    20        A, C
    30        A, C, D
    40        B, C, E
    50        A, C, E

    Sup(A): 4 (80%), Sup(AB): 1 (20%), Sup(ABC): 1 (20%), Sup(ABCD): 0, Sup(ABCDE): 0

    (Item A appears in 4 transactions, item C in 5, and A & C together in 4.)

    Confidence

    Strength of the discovered rule, computed as P(X, Y) / P(X):

    - A -> C: confidence 4/4 (A appears in 4 transactions, A & C together in 4)
    - C -> A: confidence 4/5 (C appears in 5 transactions)
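    Support and confidence on the five transactions above can be checked in a few lines of Python:

```python
# The five transactions from the Itemset and Support slide
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "C", "D"},
                {"B", "C", "E"}, {"A", "C", "E"}]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """conf(X -> Y) = P(X and Y) / P(X)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"A"}))             # 0.8
print(confidence({"A"}, {"C"}))   # 1.0
print(confidence({"C"}, {"A"}))   # 0.8
```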

    Interestingness

    Minimum support
    - User-specified parameter (frequent itemsets)
    - For minsup of 50%: F = {A, C, AC}
    - For minsup of 30%: F = {A, B, C, E, AC, BC, CE}

    Minimum confidence
    - Report rules that satisfy the minimum confidence level
    - With minconf of 50%, some of the discovered rules are A -> C [100%], C -> A [80%], AB -> C [100%], E -> C [100%], etc.

    Trans ID  Items
    10        A, B, C
    20        A, C
    30        A, C, D
    40        B, C, E
    50        A, C, E

    The Apriori algorithm

    The best-known algorithm. Two steps:
    1. Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
    2. Use the frequent itemsets to generate rules.

    E.g., from a frequent itemset {Chicken, Clothes, Milk} [sup = 3/7], one rule is

    Clothes -> Milk, Chicken [sup = 3/7, conf = 3/3]

    Apriori Example

    Transactions:

    TID  Items
    1    Bread, Milk
    2    Beer, Diaper, Bread, Eggs
    3    Beer, Coke, Diaper, Milk
    4    Beer, Bread, Diaper, Milk
    5    Coke, Bread, Diaper, Milk

    1-itemsets (counts): Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1

    2-itemsets (counts): {Bread, Milk} 3, {Bread, Beer} 2, {Bread, Diaper} 3,
    {Milk, Beer} 2, {Milk, Diaper} 3, {Coke, Diaper} 2, {Milk, Coke} 2,
    {Beer, Coke} 1, {Bread, Coke} 1, {Beer, Diaper} 3

    3-itemsets (counts): {Milk, Coke, Diaper} 2, {Milk, Coke, Beer} 1,
    {Beer, Milk, Diaper} 2, {Bread, Beer, Diaper} 2, {Bread, Beer, Milk} 1,
    {Bread, Milk, Diaper} 2

  • 8/8/2019 Data Mining- IMT Nagpur-Manish

    75/88

    7575

    Example: Finding frequent itemsets (minsup = 0.5)

    Dataset T (itemset:count):

    TID    Items
    T100   1, 3, 4
    T200   2, 3, 5
    T300   1, 2, 3, 5
    T400   2, 5

    1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
       F1: {1}:2, {2}:3, {3}:3, {5}:3
       C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
    2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
       F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
       C3: {2, 3, 5}
    3. Scan T → C3: {2, 3, 5}:2 → F3: {2, 3, 5}
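    The level-wise procedure above can be sketched in a few lines of Python. This is an illustrative implementation, not the slides' pseudocode: one scan counts the candidates C_k, the frequent sets F_k are kept, and C_{k+1} is built by join + prune.

    ```python
    from itertools import combinations

    def apriori(transactions, minsup):
        """Level-wise Apriori: one scan per level, then join + prune."""
        n = len(transactions)
        frequent = {}                       # frozenset -> support count
        k = 1
        candidates = [frozenset([i]) for i in {x for t in transactions for x in t}]
        while candidates:
            # one scan over the data: count each candidate
            counts = {c: sum(c <= t for t in transactions) for c in candidates}
            Fk = [c for c, cnt in counts.items() if cnt / n >= minsup]
            for c in Fk:
                frequent[c] = counts[c]
            # join: union pairs of frequent k-itemsets into (k+1)-candidates;
            # prune: keep only candidates whose k-subsets are all frequent
            k += 1
            Fk_set = set(Fk)
            joined = {a | b for a in Fk for b in Fk if len(a | b) == k}
            candidates = [c for c in joined
                          if all(frozenset(s) in Fk_set
                                 for s in combinations(c, k - 1))]
        return frequent

    # Dataset T from the example, minsup = 0.5:
    T = [frozenset(t) for t in ({1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5})]
    freq = apriori(T, minsup=0.5)
    ```

    Running this reproduces the scans above: `freq` contains F1 and F2, and F3 = {2, 3, 5} with count 2.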


    An example: candidate generation

    F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}

    After join: C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}

    After pruning: C4 = {{1, 2, 3, 4}}

    because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is removed)
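    The join and prune steps above can be sketched as follows. This is an assumed implementation (the prefix-join convention is one common way to realise the join step): two k-itemsets sharing their first k-1 items are joined, then any candidate with an infrequent k-subset is pruned.

    ```python
    from itertools import combinations

    def candidate_gen(Fk):
        """Generate C_{k+1} from the frequent k-itemsets F_k by join + prune."""
        k = len(next(iter(Fk)))
        Fk = {frozenset(s) for s in Fk}
        sorted_sets = sorted(sorted(s) for s in Fk)
        joined = set()
        for a, b in combinations(sorted_sets, 2):
            if a[:-1] == b[:-1]:                  # join: same first k-1 items
                joined.add(frozenset(a) | frozenset(b))
        # prune: every k-subset of a candidate must itself be frequent
        return {c for c in joined
                if all(frozenset(s) in Fk for s in combinations(c, k))}

    F3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
    C4 = candidate_gen(F3)
    ```

    As on the slide, the join produces {1, 2, 3, 4} and {1, 3, 4, 5}, and the prune step removes {1, 3, 4, 5} because {1, 4, 5} is not in F3, leaving C4 = {{1, 2, 3, 4}}.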


    Generating rules: an example

    Suppose {2,3,4} is frequent, with sup = 50%.

    Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively.

    These generate the following association rules:

    2,3 → 4, confidence = 100%
    2,4 → 3, confidence = 100%
    3,4 → 2, confidence = 67%
    2 → 3,4, confidence = 67%
    3 → 2,4, confidence = 67%
    4 → 2,3, confidence = 67%

    All rules have support = 50%.
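    This enumeration can be sketched as below (assumed helper names, not the author's code): every proper nonempty subset of the frequent itemset becomes a rule left-hand side, and its confidence is sup(S) / sup(lhs). The `sup` table holds the supports listed above.

    ```python
    from itertools import combinations

    def gen_rules(S, sup, minconf):
        """Enumerate rules lhs -> S \\ lhs from frequent itemset S,
        keeping those whose confidence meets minconf."""
        S = frozenset(S)
        rules = []
        for r in range(1, len(S)):
            for lhs in combinations(sorted(S), r):
                lhs = frozenset(lhs)
                conf = sup[S] / sup[lhs]          # conf(lhs -> S \ lhs)
                if conf >= minconf:
                    rules.append((lhs, S - lhs, conf))
        return rules

    # Supports from the example above:
    sup = {
        frozenset({2, 3, 4}): 0.50,
        frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
        frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
    }
    rules = gen_rules({2, 3, 4}, sup, minconf=0.6)
    ```

    At minconf = 60% all six rules survive: two with confidence 100% and four with confidence 67%, as listed above.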


    On the Apriori Algorithm

    Seems to be very expensive: it is a level-wise search.

    If K is the size of the largest itemset, it makes at most K passes over the data. In practice, K is bounded (around 10).

    The algorithm is very fast. Under some conditions, all rules can be found in linear time.

    Scales up to large data sets.


    AR: Applications

    Retail marketing: floor planning, discounting, catalogue design.

    Medical diagnosis: comparison of the genotypes of people with and without a condition allowed the discovery of a set of genes that together account for many cases of diabetes.

    Geographical information systems: link analysis.


    Walmart Study



    Typical Business Decisions (For Walmart)

    What to put on sale?

    How to design coupons?

    How to place merchandise etc. on the shelf to maximise profit?


    Defence Applications

    Finding associations in terrorist activities (e.g., the 9/11 attack).

    Finding associations in studying the behaviour of the enemy during war.

    Finding associations which may lead to intrusion threats at strategic locations.


    List of Papers


    List of Papers: Published

    1. Robust Approach for Estimating Probabilities in Naive-Bayes Classifier for Gene Expression Data, Elsevier: Expert Systems with Applications, 2010, doi:10.1016/j.eswa.2010.06.076.
    2. (Best Paper Award) Ranking Police Administration Units on the Basis of Crime Prevention Measures using Data Envelopment Analysis and Clustering, 6th International Conference on E-Governance (ICEG 2008), 40-53.
    3. Towards Situation Awareness in Integrated Air Defence using Clustering and Case Based Reasoning, Springer: Lecture Notes in Computer Science, 5909, 2009, 579-584.
    4. Adaptive Query Interface for Mining Crime Data, Springer: Lecture Notes in Computer Science (LNCS), 4777, 2007, 285-296.
    5. Robust Approach for Estimating Probabilities in Naive-Bayes Classifier, Springer: Lecture Notes in Computer Science (LNCS), 4815, 2007, 11-16.
    6. A Multivariate Time Series Clustering Approach for Crime Trends Prediction, Proc. of IEEE Systems, Man & Cybernetics, 2008, 892-896.
    7. Crime Data Mining for Indian Police Information System, Proc. 5th International Conference on E-Governance (ICEG 2007), 388-397.
    8. Clustering with Varying Weights on Types of Crime, ORSI Conference, 2008.



    List of Papers (Contd.): Communicated

    1. An Efficient Statistical Feature Selection Approach for Classification of Gene Expression Data, Journal of Biomedical Informatics, July 2010, resubmission with minor modification.
    2. Towards a Framework of Intelligent Decision Support System for Indian Police, Elsevier: Decision Support Systems, May 2010.
    3. A Statistical Approach for Feature Selection and Ranking, Elsevier: Pattern Recognition, June 2010.
    4. A Novel Approach for Distance-Based Semi-Supervised Clustering using Functional Link Neural Network, Springer: Soft Computing, June 2010.
    5. An Efficient Similarity Measure based Multivariate Time Series Clustering Approach for Performance Analysis, IEEE Systems, Man and Cybernetics, May 2010.
    6. Issues and Challenges for Emitter Classification in the Context of Electronic Warfare, Defence Science Journal.
    7. A Novel Approach for Weighted Clustering using Hyperlink-Induced Topic Search (HITS) Algorithm, Defence Science Journal.


    References

    Han and Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann.
    Arun Pujari, Data Mining Techniques, University Press.
    Hand et al., Principles of Data Mining, PHI.
    J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
    J. R. Quinlan, Induction of decision trees, Machine Learning, 1:81-106, 1986.
    M. Mehta, R. Agrawal, and J. Rissanen, SLIQ: A fast scalable classifier for data mining, Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
    R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, SIGMOD'93, 207-216, Washington, D.C.
    R. Agrawal and R. Srikant, Fast algorithms for mining association rules, VLDB'94, 487-499, Santiago, Chile.
    J. MacQueen, Some methods for classification and analysis of multivariate observations, Proc. Symp. Math. Statist. and Probability, 5th, Berkeley, 1, 1967, 281-298.
    A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: a review, ACM Computing Surveys, 31(3), 1999, 264-323.


    Questions?