ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation

DAT204 Introduction to Data Mining with SQL Server 2000

description

DAT204: Introduction to Data Mining with SQL Server 2000. ZhaoHui Tang, Program Manager, SQL Server Analysis Services, Microsoft Corporation. Agenda: What is Data Mining; The Data Mining Market; OLE DB for Data Mining; Overview of the Data Mining Features in SQL Server 2000; Demo; Q&A.

Transcript

Page 1

DAT204

Introduction to Data Mining with SQL Server 2000

ZhaoHui Tang
Program Manager
SQL Server Analysis Services
Microsoft Corporation

Page 2

Agenda

• What is Data Mining
• The Data Mining Market
• OLE DB for Data Mining
• Overview of the Data Mining Features in SQL Server 2000
• Demo
• Q&A

Page 3

What Is Data Mining?

Page 4

Page 5

What is DM?

• A process of data exploration and analysis using automatic or semi-automatic means
  – Techniques originate from machine learning, statistics, and databases
  – “Exploring data”: scanning samples of known facts about “cases”
  – “Knowledge”: clusters, rules, decision trees, equations, association rules…
• Once the “knowledge” is extracted, it:
  – Can be browsed
    • Provides very useful insight into the cases' behavior
  – Can be used to predict values of other cases
    • Can serve as a key element in closed-loop analysis

Page 6

What drives high school students to attend college?

Page 7

The deciding factors for high school students to attend college are…

[Figure: decision tree predicting "Attend College"]

All Students: 55% Yes, 45% No

Split on IQ:
  IQ = High: 79% Yes, 21% No
  IQ = Low: 45% Yes, 55% No

Split on Wealth:
  Wealth = True: 94% Yes, 6% No
  Wealth = False: 69% Yes, 21% No

Split on ParentsEncourage (branches: ParentsEncourage = Yes, ParentsEncourage = No):
  70% Yes, 30% No
  31% Yes, 69% No

Page 8

Business Oriented DM Problems

• Targeted ads
  – “What banner should I display to this visitor?”
• Cross sells
  – “What other products is this customer likely to buy?”
• Fraud detection
  – “Is this insurance claim a fraud?”
• Churn analysis
  – “Who are those customers likely to churn?”
• Risk Management
  – “Should I approve the loan to this customer?”
• …

Page 9

Page 10

Mining Process - Illustrated

[Diagram: Training Data → DM Engine → Mining Model; Data To Predict + Mining Model → DM Engine → Predicted Data]

Page 11

The Data Mining Market

Page 12

The $$$: Market Size

• DM Tools Market:
  – 1999: $341.3M
  – 2000: $455.1M
  – 2001: $449.5M

* IDC

Page 13

The Players

• Leading vendors
  – SAS
  – SPSS
  – IBM
  – Angoss
  – Hundreds of smaller vendors offering DM algorithms…
• Oracle – Thinking Machines acquisition

Page 14

The Products

• End-to-end horizontal DM tools
  – Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analyst's workbench, Reporting, Charting…
• The customer is the power-analyst
  – PhD in statistics is usually required…
• Closed tools – no standard API
  – Total vendor lock-in
  – Limited integration with applications
• DM an “outsider” in the Data Warehouse
• Extensive consulting required
• Skyrocketing prices
  – $60K+ for a single user license

Page 15

What the analysts say…

• “Stand-alone Data Mining Is Dead” - Forrester
• “The demise of [stand alone] data mining” – Gartner

Page 16

The Microsoft Approach

Page 17

DataPro Users Survey 1999-2001

“Data mining will be the fastest-growing BI technology…”

Page 18

Market Size of BI

* IDC

Page 19

SQL Server 2000 - The Analysis Platform

• SQL 2000 provides a complete Analysis Platform
  – Not an isolated, stand-alone DM product
• Platform means:
  – Standards-based DM APIs (OLE DB for DM) for application development
  – Integrated vision for all technologies, tools
  – Extensible
  – Scalable

Page 20

Data Flow

[Diagram: data flows among OLTP, DW, OLAP, DM, Apps, and Reports & Analysis]

Page 21

Analysis Services 2000 – Components

[Diagram components: Manager UI; DSO; Analysis Server; Client; OLE DB; OLAP; DM; OLAP Engine; OLAP Engine (local); DM Engine; DM Engine (local); DMM; DM Wizards; DM DTS Task; Tree View Control; Cluster View Control; Lift Chart Control; Sample Query Tool]

Page 22

OLE DB for Data Mining…

Page 23

Why OLE DB for DM?

Make DM a mass market technology by:
• Leveraging existing technologies and knowledge
  – SQL and OLE DB
• Common industry-wide concepts and data presentation
• Changing DM market perception from “proprietary” to “open”
• Increasing the number of players:
  – Reduce the cost and risk of becoming a consumer – one tool works with multiple providers
  – Reduce the cost and risk of becoming a provider – focus on expertise and find many partners to complement offering

Page 24

Integration With RDBMS

• Customers would like to
  – Build DM models from within their RDBMS
  – Train the models directly off their relational tables
  – Perform predictions as relational queries (tables in, tables out)
  – Feel that DM is a native part of their database.
• Therefore…
  – Data mining models are relational objects
  – All operations on the models are relational
  – The language used is SQL (w/Extensions)
• The effect: every DBA and VB developer can become a DM developer

Page 25

Creating a Data Mining Model (DMM)

Page 26

Identifying the “Cases”

• DM algorithms analyze “cases”
• The “case” is the entity being categorized and classified
• Examples
  – Customer credit risk analysis: Case = Customer
  – Product profitability analysis: Case = Product
  – Promotion success analysis: Case = Promotion
• Each case encapsulates all we know about the entity

Page 27

A Simple Set of Cases

StudentID  Gender  Parent Income  IQ   Encouragement   College Plans
1          Male    23400          120  Not Encouraged  No
2          Female  79200          90   Encouraged      Yes
3          Male    42000          105  Not Encouraged  Yes

Page 28

More Complicated Cases

                                     Favorite Movies
Cust ID  Age  Marital Status  IQ  Title        Score
1        35   M               2   Star Wars    8
                                  Toy Story    9
                                  Terminator   7
2        20   S               3   Star Wars    7
                                  Braveheart   7
                                  The Matrix   10
3        57   M               2   Sixth Sense  9
                                  Casablanca   10
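As a rough illustration of what a nested case looks like in application code, the sketch below holds the slide's customers as Python dictionaries, with the Favorite Movies nested table as a list of (title, score) rows. The key names (`CustID`, `MaritalStatus`, `FavoriteMovies`) and the `flatten` helper are hypothetical, and the slide's fourth column is left out for brevity. Flattening shows why a nested table is convenient: the same data as flat rows repeats the customer attributes once per movie.

```python
# Each case is one customer; the nested "Favorite Movies" table is a list of rows.
cases = [
    {"CustID": 1, "Age": 35, "MaritalStatus": "M",
     "FavoriteMovies": [("Star Wars", 8), ("Toy Story", 9), ("Terminator", 7)]},
    {"CustID": 2, "Age": 20, "MaritalStatus": "S",
     "FavoriteMovies": [("Star Wars", 7), ("Braveheart", 7), ("The Matrix", 10)]},
    {"CustID": 3, "Age": 57, "MaritalStatus": "M",
     "FavoriteMovies": [("Sixth Sense", 9), ("Casablanca", 10)]},
]


def flatten(cases):
    """Yield one flat row per (customer, movie) pair, as a relational join would."""
    for case in cases:
        for title, score in case["FavoriteMovies"]:
            yield (case["CustID"], case["Age"], case["MaritalStatus"], title, score)


flat_rows = list(flatten(cases))
print(len(cases), "cases,", len(flat_rows), "flat rows")
```

With the nested form there is still exactly one case per customer, which is the unit a DM algorithm analyzes.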

Page 29

A DMM is a Table!

• A DMM structure is defined as a table
  – Training a DMM means inserting data (pattern) into the table
  – Predicting from a DMM means querying the table
• All information describing the case is contained in columns

Page 30

Creating a Mining Model

CREATE MINING MODEL [Plans Prediction]
(
  StudentID LONG KEY,
  Gender TEXT DISCRETE,
  ParentIncome LONG CONTINUOUS,
  IQ DOUBLE CONTINUOUS,
  Encouragement TEXT DISCRETE,
  CollegePlans TEXT DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
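Because the model definition is plain text in a SQL-like language, an application can assemble it from a column specification. The sketch below is one possible way to do that in Python; `create_mining_model` is a hypothetical helper, not part of any Microsoft API, and it only builds the statement string. Sending the statement to a provider is outside the sketch.

```python
def create_mining_model(name, columns, algorithm):
    """Build a CREATE MINING MODEL statement from (name, type, flags) tuples."""
    col_lines = ",\n".join(
        "  " + " ".join(part for part in col if part) for col in columns
    )
    return f"CREATE MINING MODEL [{name}]\n(\n{col_lines}\n)\nUSING {algorithm}"


statement = create_mining_model(
    "Plans Prediction",
    [
        ("StudentID", "LONG", "KEY"),
        ("Gender", "TEXT", "DISCRETE"),
        ("ParentIncome", "LONG", "CONTINUOUS"),
        ("IQ", "DOUBLE", "CONTINUOUS"),
        ("Encouragement", "TEXT", "DISCRETE"),
        ("CollegePlans", "TEXT", "DISCRETE PREDICT"),
    ],
    "Microsoft_Decision_Trees",
)
print(statement)
```

The printed statement matches the one on the slide, with one column definition per line.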

Page 31

Creating a mining model with nested table

Create Mining Model MoviePrediction
(
  CustomerId long key,
  Age long continuous,
  Gender text discrete,
  Education text discrete,
  MovieList table predict (
    MovieName text key
  )
)
using microsoft_decision_trees

Page 32

Training a DMM

Page 33

Training a DMM

• Training a DMM means passing it data for which the attributes to be predicted are known
  – Multiple passes are handled internally by the provider!
• Use an INSERT INTO statement
• The DMM will not persist the inserted data
• Instead, it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules)

INSERT [INTO] <mining model name>
[(columns list)]
<source data query>

Page 34

INSERT INTO

INSERT INTO [Plans Prediction]
  (StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans)
SELECT
  [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans]
FROM [Students]
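The point that training consumes rows without storing them can be illustrated with a toy stand-in. The class below is hypothetical and far simpler than a real provider: its "content" is just a count of outcomes per input value, where a real DMM would build a decision tree or similar. The three rows are the students from the earlier "Simple Set of Cases" slide, reduced to two columns.

```python
from collections import Counter, defaultdict


class ToyMiningModel:
    """Hypothetical stand-in for a DMM: learns counts, discards the cases."""

    def __init__(self, input_col, predict_col):
        self.input_col = input_col
        self.predict_col = predict_col
        self.content = defaultdict(Counter)  # learned pattern only

    def insert_into(self, rows):
        """Analogue of INSERT INTO: consume rows, update content, keep no rows."""
        for row in rows:
            self.content[row[self.input_col]][row[self.predict_col]] += 1


students = [
    {"StudentID": 1, "Encouragement": "Not Encouraged", "CollegePlans": "No"},
    {"StudentID": 2, "Encouragement": "Encouraged", "CollegePlans": "Yes"},
    {"StudentID": 3, "Encouragement": "Not Encouraged", "CollegePlans": "Yes"},
]
dmm = ToyMiningModel("Encouragement", "CollegePlans")
dmm.insert_into(students)
print(dict(dmm.content["Not Encouraged"]))
```

After `insert_into` returns, the object holds only the aggregated counts; the StudentID values and the rows themselves are gone.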

Page 35

When Insert Into Is Done…

• The DMM is trained
  – The model can be retrained
  – Content (rules, trees, formulas) can be explored
    – OLE DB Schema rowset
    – SELECT * FROM <dmm>.CONTENT
    – XML string (PMML)
• Prediction queries can be executed

Page 36

Predictions

Page 37

What are Predictions?

• Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values
• Predictions = queries
  – The syntax is SQL-like
  – The output is a rowset
• In order to predict you need:
  – Input data set
  – A trained DMM
  – Binding (mapping) information between the input data and the DMM
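The binding requirement is easiest to see when the input table does not use the model's column names. The sketch below assumes a hypothetical input row with columns `Sex`, `IQScore`, and `Encouraged`, and a binding that maps each model column to its input column; `bind` is an illustrative helper, not an OLE DB for DM call.

```python
def bind(input_row, binding):
    """Rename input columns to the model's column names using the binding."""
    return {model_col: input_row[input_col] for model_col, input_col in binding.items()}


# Hypothetical input table whose column names differ from the model's.
input_row = {"Sex": "Male", "IQScore": 85, "Encouraged": "Not Encouraged"}
binding = {"Gender": "Sex", "IQ": "IQScore", "Encouragement": "Encouraged"}

model_case = bind(input_row, binding)
print(model_case)
```

In a prediction query this mapping is what the ON clause of the join expresses.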

Page 38

The Truth Table Concept

Gender  Parent Income  IQ  Encouragement   College Plans  Probability
Male    20000          85  Not Encouraged  No             85%
Male    20000          85  Not Encouraged  Yes            15%
Male    20000          85  Encouraged      No             60%
Male    20000          85  Encouraged      Yes            40%
Male    20000          90  Not Encouraged  No             80%
Male    20000          90  Not Encouraged  Yes            20%
Male    20000          90  Encouraged      No             58%
…

Page 39

Prediction

Gender  Parent Income  IQ  Encouragement   College Plans  Probability
Male    20000          85  Not Encouraged  No             85%
Male    20000          85  Not Encouraged  Yes            15%
Male    20000          85  Encouraged      No             60%
Male    20000          85  Encouraged      Yes            40%
Male    20000          90  Not Encouraged  No             80%
Male    20000          90  Not Encouraged  Yes            20%
Male    20000          90  Encouraged      No             58%
Male    20000          90  Encouraged      Yes            42%
Male    20000          95  Not Encouraged  No             78%
Male    20000          95  Not Encouraged  Yes            22%
Male    20000          95  Encouraged      No             45%

It’s a JOIN!

StudentID  Gender  Parent Income  IQ   Encouragement
1          Male    43000          85   Not Encouraged
2          Male    20000          135  Not Encouraged
3          Female  25000          105  Encouraged
4          Male    96000          100  Encouraged
5          Female  56000          125  Not Encouraged
6          Female  46000          90   Not Encouraged
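The "prediction is a join" idea can be sketched directly: look up each new case in the truth table and return the most probable outcome. This is a conceptual toy only, since a real model does not materialize every attribute combination. The truth-table rows are copied from the slide; the two new cases and their IDs are hypothetical, chosen so that they match rows that are actually listed.

```python
# A few truth-table rows from the slide: (Gender, ParentIncome, IQ, Encouragement)
# -> probability of each CollegePlans value.
truth_table = {
    ("Male", 20000, 85, "Not Encouraged"): {"No": 0.85, "Yes": 0.15},
    ("Male", 20000, 85, "Encouraged"): {"No": 0.60, "Yes": 0.40},
    ("Male", 20000, 90, "Not Encouraged"): {"No": 0.80, "Yes": 0.20},
    ("Male", 20000, 90, "Encouraged"): {"No": 0.58, "Yes": 0.42},
}


def prediction_join(truth_table, new_cases):
    """Join each new case to its truth-table row; return the most likely value."""
    results = []
    for case_id, key in new_cases:
        outcomes = truth_table.get(key)
        if outcomes is None:
            results.append((case_id, None, None))  # no matching row
            continue
        best = max(outcomes, key=outcomes.get)
        results.append((case_id, best, outcomes[best]))
    return results


# Hypothetical new cases, chosen to match rows above.
new_cases = [
    (101, ("Male", 20000, 85, "Not Encouraged")),
    (102, ("Male", 20000, 90, "Encouraged")),
]
results = prediction_join(truth_table, new_cases)
print(results)
```

Each result row pairs the input case's ID with the predicted value and its probability, which is the shape of output a prediction query returns.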

Page 40

The Prediction Query Syntax

SELECT <columns to return or predict>
FROM
  <dmm> PREDICTION JOIN
  <input data set>
ON <dmm column> = <dmm input column>…

Page 41

Example

SELECT [New Students].[StudentID],
  [Plans Prediction].[CollegePlans],
  PredictProbability([CollegePlans])
FROM
  [Plans Prediction] PREDICTION JOIN
  [New Students]
ON [Plans Prediction].[Gender] =
  [New Students].[Gender] AND
  [Plans Prediction].[IQ] =
  [New Students].[IQ] AND ...
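An application that issues many prediction queries might generate the ON clause from a list of shared column names. The sketch below assumes, as in the example above, that the model and the input table use the same column names; `prediction_query` is a hypothetical helper that only builds the query text, and it covers just the two join columns the slide shows before its ellipsis.

```python
def prediction_query(model, source, select_cols, join_cols):
    """Build a PREDICTION JOIN query; join_cols are column names shared by both sides."""
    on_clause = " AND\n   ".join(
        f"[{model}].[{col}] = [{source}].[{col}]" for col in join_cols
    )
    return (
        "SELECT " + ", ".join(select_cols) + "\n"
        f"FROM [{model}] PREDICTION JOIN [{source}]\n"
        f"ON {on_clause}"
    )


query = prediction_query(
    "Plans Prediction",
    "New Students",
    ["[New Students].[StudentID]", "[Plans Prediction].[CollegePlans]",
     "PredictProbability([CollegePlans])"],
    ["Gender", "IQ"],
)
print(query)
```

Adding further input attributes is then a matter of extending the `join_cols` list.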

Page 42

Demo

Page 43

OLE DB for Data Mining Defines API

[Diagram: consumers, the OLE DB for DM (API) layer, and providers; the providers access RDBMS, cube, and miscellaneous data sources through OLE DB]

Page 44

OLE DB for DM Configuration Options Demo

[Diagram: consumers (MS Analysis Manager, ANGOSS Controls) connect through OLE DB for DM to providers (MS DM Provider, ANGOSS DM Provider); the configurations are numbered 1 to 4]

Page 45

Demo on OLE DB for DM API using Angoss Controls and Provider

Page 46

For more info…

• DM URL
  – www.microsoft.com/data/oledb
  – www.microsoft.com/data/oledb/DMResKit.htm
• News Group:
  – Microsoft.public.SQLserver.datamining
  – Communities.msn.com/AnalysisServicesDataMining
• White papers:
  – Performance paper:
    www.unisys.com/windows2000/default-07.asp
    www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp

Page 47

Questions?