
ZhaoHui Tang
Program Manager
SQL Server Analysis Services
Microsoft Corporation

DAT205
Advanced Data Mining Using SQL Server 2000

Agenda

• Microsoft Data Mining Algorithms
• OLE DB for DM Data mining query
• Data Mining Case Study: Click Stream Analysis
  – Customer Segmentation
  – Site affiliation
  – Target ads in banner
• Performance of Microsoft Data Mining Algorithm
• Q&A

Data Mining Algorithms in SQL Server 2000

Decision Tree

• Popular technique for classification and prediction tasks
  – Churn analysis
  – Credit risk analysis
  – …
• Easy to understand
  – any path from the root node to a leaf forms a rule
• Fast to build
• Prediction based on leaf node stats
• Variations: C4.5, C5, CART, CHAID

Figure: example decision tree (node: Attend College distribution)

All Students: 55% Yes, 45% No
IQ = High: 79% Yes, 21% No
IQ <> High: 35% Yes, 65% No
Parent Income = High: 94% Yes, 6% No
Parent Income = Low: 69% Yes, 31% No

How tree works

Attribute             Value    CollegePlan = Yes   CollegePlan = No
IQ                    High     300                 100
IQ                    Medium   500                 1000
IQ                    Low      200                 900
Parent Encouragement  True     700                 400
Parent Encouragement  False    300                 1600
Parent Income         High     400                 400
Parent Income         False    600                 1600
Gender                Male     500                 1100
Gender                Female   500                 900

Figure: bar charts of CollegePlan Yes/No counts by IQ (High, Medium, Low), Parent Income (PI=High, PI=FALSE), Parent Encouragement (PE=TRUE, PE=FALSE), and Gender (Male, Female).
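The Yes/No counts from the "How tree works" slide can be used to compare candidate splits. The sketch below is a minimal illustration that scores each attribute by entropy-based information gain; it is not the exact scoring used by Microsoft Decision Trees, and the names `COUNTS`, `entropy`, and `information_gain` are illustrative assumptions.

```python
import math

# (Yes, No) counts per attribute value, copied from the "How tree works" slide.
COUNTS = {
    "IQ": {"High": (300, 100), "Medium": (500, 1000), "Low": (200, 900)},
    "Parent Encouragement": {"True": (700, 400), "False": (300, 1600)},
    "Parent Income": {"High": (400, 400), "False": (600, 1600)},
    "Gender": {"Male": (500, 1100), "Female": (500, 900)},
}


def entropy(counts):
    """Shannon entropy (in bits) of a tuple of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(value_counts):
    """Entropy of the parent node minus the weighted entropy of the children."""
    yes = sum(y for y, _ in value_counts.values())
    no = sum(n for _, n in value_counts.values())
    total = yes + no
    children = sum(
        (y + n) / total * entropy((y, n)) for y, n in value_counts.values()
    )
    return entropy((yes, no)) - children


gains = {attr: information_gain(vc) for attr, vc in COUNTS.items()}
best_attribute = max(gains, key=gains.get)

for attr, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{attr}: {gain:.4f}")
print("Best split:", best_attribute)
```

Under this scoring, the attribute with the highest gain is the one chosen for the first split; the same procedure is then repeated on each child node.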

Split recursively

Figure: tree after the first split (node: College Plan distribution)

All Students: 33% Yes, 67% No
Parent Encouragement = True: 63% Yes, 37% No
Parent Encouragement = False: 16% Yes, 84% No

Attribute             Value    CollegePlan = Yes   CollegePlan = No
IQ                    High     200                 50
IQ                    Medium   400                 250
IQ                    Low      100                 100
Parent Encouragement  True     700                 400
Parent Encouragement  False    0                   0
Parent Income         High     300                 100
Parent Income         False    400                 300
Gender                Male     400                 250
Gender                Female   250                 150
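Prediction from leaf node statistics can be sketched as follows. This is a minimal illustration, assuming a tree with a single split on Parent Encouragement and using the Yes/No counts from the slides; `LEAVES` and `predict` are hypothetical names, not part of any product API.

```python
# Leaf statistics (class counts) after the first split on Parent Encouragement.
LEAVES = {
    True: {"Yes": 700, "No": 400},
    False: {"Yes": 300, "No": 1600},
}


def predict(parent_encouragement):
    """Return the most frequent class in the matching leaf and its probability."""
    stats = LEAVES[parent_encouragement]
    total = sum(stats.values())
    label = max(stats, key=stats.get)
    return label, stats[label] / total


for value in (True, False):
    label, probability = predict(value)
    print(f"Parent Encouragement = {value}: {label} ({probability:.2f})")
```

The probability returned here is simply the share of training cases of the predicted class in the leaf that the new case falls into.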

Microsoft Decision Trees

• Probabilistic Classification Tree
• Splitting methods: Bayesian score and Entropy
• Forward pruning
• Tree shape: Binary and N-ary tree
• Scalable framework

Clustering Algorithm (EM)

• A popular method for customer segmentation, mailing list, profiling…
• Algorithm process
  – Assign a set of initial points
  – Assign an initial cluster to each point
  – Assign data points to each cluster with a probability
  – Compute a new central point based on weighted computation
  – Cycle until convergence
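The process above can be sketched for one-dimensional data. This is a minimal textbook EM for a Gaussian mixture, not the Microsoft implementation; the function names, the sample data, the initial means, and the variance floor `min_var` are all assumptions made for illustration.

```python
import math


def gaussian(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_1d(points, initial_means, max_iter=100, tol=1e-6, min_var=1e-3):
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(max_iter):
        # E-step: assign each point to each cluster with a probability.
        resp = []
        for x in points:
            dens = [w * gaussian(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            if total == 0.0:
                resp.append([1.0 / k] * k)
            else:
                resp.append([d / total for d in dens])
        # M-step: recompute each center as a probability-weighted mean.
        new_means = []
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            mean_j = sum(r[j] * x for r, x in zip(resp, points)) / nj
            var_j = sum(r[j] * (x - mean_j) ** 2
                        for r, x in zip(resp, points)) / nj
            new_means.append(mean_j)
            variances[j] = max(var_j, min_var)
            weights[j] = nj / len(points)
        # Cycle until the centers stop moving.
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:
            break
    return means, variances, weights


# Hypothetical data: two groups of points around 1.0 and 5.0.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
means, variances, weights = em_1d(data, initial_means=[0.8, 5.2])
print(means, weights)
```

Unlike k-means, every point contributes to every cluster center, weighted by its membership probability.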

EM Illustration

Figure: EM illustration (three points marked X).

Microsoft Clustering Algorithm (Scalable EM)

Figure: flow diagram with the boxes Data, Fill Buffer, Build/Update Model, Identify Data to be Compressed, Compressed data / Sufficient stats, Stop?, and Final Model.
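The slide does not say which sufficient statistics are kept when data is compressed. A common choice for Gaussian clusters is the count, sum, and sum of squares of the compressed points, which is enough to recover their mean and variance without keeping the points in the buffer. The sketch below assumes that choice; the class `SufficientStats` and the function `compress` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SufficientStats:
    """Summary of a group of 1-D points that no longer need to be stored."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x: float) -> None:
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other: "SufficientStats") -> "SufficientStats":
        return SufficientStats(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    @property
    def mean(self) -> float:
        return self.total / self.count

    @property
    def variance(self) -> float:
        # Population variance: E[x^2] - (E[x])^2.
        return self.total_sq / self.count - self.mean ** 2


def compress(points):
    """Replace a list of points with its sufficient statistics."""
    stats = SufficientStats()
    for x in points:
        stats.add(x)
    return stats


# Two buffer loads compressed separately, then merged.
merged = compress([1.0, 2.0]).merge(compress([3.0, 4.0]))
print(merged.count, merged.mean, merged.variance)
```

Because the statistics are additive, summaries from successive buffer loads can be merged and the model updated without rereading earlier data.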

OLE DB for Data Mining

OLE DB for DM

• Industry standard for data mining
• Based on existing technologies
  – SQL
  – OLE DB
• Define common concepts for DM
  – Case, Nested Case
  – Mining Model
  – Model Creation
  – Model Training
  – Prediction
• Language based API

Customer Table

Customer ID   Profession   Income   Gender   Risk
1             Engineer     85       Male     No
2             Worker       40       Male     Yes
3             Doctor       90       Female   No
4             Teacher      50       Female   No
5             Worker       45       Male     No
…             …            …        …        …

DM Query Language

Create Mining Model CreditRisk
(
  CustomerID long key,
  Gender text discrete,
  Income long continuous,
  Profession text discrete,
  Risk text discrete predict
)
Using Microsoft_Decision_Trees

Insert into CreditRisk
  (CustomerId, Gender, Income, Profession, Risk)
Select
  CustomerID, Gender, Income, Profession, Risk
From Customers

Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk.Risk)
From CreditRisk Prediction Join NewCustomers
On CreditRisk.Gender = NewCustomers.Gender
  And CreditRisk.Income = NewCustomers.Income
  And CreditRisk.Profession = NewCustomers.Profession

Schema Rowsets

• Tabular data to provide metadata information
• List of Schema Rowsets in OLE DB for DM
  – Mining_Services
  – Mining_Service_Parameters
  – Mining_Models
  – Mining_Columns
  – Mining_Model_Contents
  – Model_Content_PMML

Mining Model Contents Schema Rowsets

Schema Rowsets & Thin Client Browser

Case Study: Click Stream Analysis

Schema

Customer
  – CustomerGuid
  – DayTimeOnLine
  – NightTimeOnLine
  – BrowserType
  – EmailTime
  – ChatTime
  – GeoLocation

WebClick
  – CustomerGuid
  – URLCategory
  – Time
  – Duration
  – ReferPage

Web Customer Segmentation

Web Visitors Segmentation

Segmentation based on Customer table

Create Mining Model CustomerClustering
(
  CustomerID text key,
  DayTimeOnline long continuous,
  NightTimeOnline long continuous,
  BrowserType text discrete,
  ChatTime long continuous,
  EmailTime long continuous,
  GeoLocation text discrete
)
Using Microsoft_Clustering

Segmentation based on Customer and WebClick

Create Mining Model CustomerClustering
(
  CustomerID text key,
  DayTimeOnline long continuous,
  NightTimeOnline long continuous,
  BrowserType text discrete,
  ChatTime long continuous,
  EmailTime long continuous,
  GeoLocation text discrete,
  WebClick table (
    UrlCategory text key )
)
Using Microsoft_Clustering

MSFTies Segmentation

Web Site Affiliation

Association analysis using Microsoft Decision Trees

Figure: decision trees for the URL categories Business, Insurance, Stock, and Loan. Each tree splits on whether the customer visited other categories (Insurance / No Insurance, Loan / No Loan, Stock / No Stock, Business / No Business, Shopping / No Shopping).

Site Affiliation

Create Mining Model SiteAffiliation
(
  CustomerID text key,
  WebClick table predict (
    UrlCategory text key )
)
Using Microsoft_Decision_Trees

Insert into SiteAffiliation (CustomerID, WebClick (skip, UrlCategory))
OpenRowset('MSDataShape',
  'data provider=SQLOLEDB;Server=myserver;UID=me;PWD=mypass',
  'Shape {Select CustomerID from Customer}
   Append ({Select CustomerID, URLCategory from WebClick}
     relate CustomerID to CustomerID) as WebClick'
)
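The Shape / Append / relate clause turns two flat tables into one row per customer with a nested WebClick table, which is the nested-case form the model expects. A minimal Python sketch of that reshaping follows; the function `shape_cases` and the sample rows are hypothetical and only mirror the column names used in the statement.

```python
from collections import defaultdict


def shape_cases(customers, web_clicks):
    """Attach to each customer row a nested list of its WebClick rows,
    related on CustomerID (parent) = CustomerID (child)."""
    clicks_by_customer = defaultdict(list)
    for click in web_clicks:
        clicks_by_customer[click["CustomerID"]].append(
            {"URLCategory": click["URLCategory"]}
        )
    return [
        {
            "CustomerID": customer["CustomerID"],
            "WebClick": clicks_by_customer.get(customer["CustomerID"], []),
        }
        for customer in customers
    ]


# Hypothetical sample rows.
customers = [{"CustomerID": "c1"}, {"CustomerID": "c2"}]
web_clicks = [
    {"CustomerID": "c1", "URLCategory": "Business"},
    {"CustomerID": "c1", "URLCategory": "Stock"},
]

cases = shape_cases(customers, web_clicks)
for case in cases:
    print(case)
```

Each element of `cases` corresponds to one training case: the customer key plus a variable-length set of URL categories.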

Path Prediction

Singleton Prediction

Select Flattened
  Topcount((Select URLCategory, $adjustedProbability as prob
            From Predict([Web Click], INCLUDE_STATISTICS, EXCLUSIVE)), prob, 5)
From
  WebLog Prediction Join
  (Select ((Select 'Business' as URLCategory)
           Union
           (Select 'Telecom' as URLCategory)) as WebClick) as input
On
  WebLog.[Web Click].URLCategory = input.WebClick.URLCategory
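The effect of combining EXCLUSIVE with Topcount(..., prob, 5) can be sketched in Python: categories already present in the input case are dropped, and the five remaining categories with the highest probability are returned. The probabilities below are made-up placeholders, and `top_count_exclusive` is a hypothetical helper, not a provider function.

```python
def top_count_exclusive(predictions, input_categories, n=5):
    """Return the n most probable categories that are not in the input case."""
    candidates = [
        (category, prob)
        for category, prob in predictions.items()
        if category not in input_categories
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:n]


# Hypothetical adjusted probabilities per URL category.
predictions = {
    "Business": 0.90,
    "Telecom": 0.85,
    "Stock": 0.40,
    "Insurance": 0.35,
    "Loan": 0.30,
    "Shopping": 0.20,
    "Sports": 0.10,
    "Travel": 0.05,
}

recommended = top_count_exclusive(predictions, {"Business", "Telecom"})
for category, prob in recommended:
    print(category, prob)
```

A visitor who has already seen Business and Telecom pages would therefore be offered only other categories, ranked by probability.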

Architecture

Figure: real-time prediction architecture. Components shown: Web Customer, Internet, IIS, ASP, ADO/DSO, DM Provider, DMM.

Performance of DM Algorithms

DM Performance Study

• Joint effort between Unisys & Microsoft
• Two parts of the white paper:
  – First part: Use AS2k to build DM Models for a banking business scenario
  – Second part: Performance results of DM algorithms study
• Some results in this session…
• Details in the paper and SQL Server magazine articles…

Data Source for DMMs

Training Performance Results…

Sample Business Question for Non Nested MDT

1. Identify those customers that are most likely to churn (leave) based on customer demographical information.

Non Nested: Training Times for varying Number of Input attributes

Assumptions:
• 1 million cases
• 25 states
• 1 predictable attribute

I/P Attributes   Training Time (minutes)
10               4.08
20               7.27
50               31.54
100              40.55
200              129.35

Figure: line chart of Training Time (minutes) against Number of Attributes.

Non Nested: Training Times for varying Number of Cases

Assumptions:
• 20 attributes
• 25 states
• 1 predictable attribute

Cases        Training Time (minutes)
10,000       0.38
1,000,000    11.32
5,000,000    34.19
10,000,000   100.53

Figure: line chart of Training Time (minutes) against Number of Cases.
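To compare the rows of the table on a common scale, the training times can be converted to cases processed per minute. The sketch below only does that arithmetic on the figures from the slide; the variable names are illustrative.

```python
# Training time in minutes for each number of cases, from the slide.
training_minutes = {
    10_000: 0.38,
    1_000_000: 11.32,
    5_000_000: 34.19,
    10_000_000: 100.53,
}

# Throughput: cases processed per minute of training.
cases_per_minute = {
    cases: cases / minutes for cases, minutes in training_minutes.items()
}

for cases, rate in cases_per_minute.items():
    print(f"{cases:>10,} cases: {rate:,.0f} cases/minute")
```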

Sample Business Question for Nested MDT

2. Find the list of other products that the customer may be interested in based on the products the customer has purchased.

Nested Cases: Training Times for varying Sample size of Case Table

Assumptions:
• Avg. customer purchases = 25
• States in nested = 200
• Nested key predictable

Master Cases   Training Time (minutes)
10,000         15.09
50,000         67.79
100,000        120.88
200,000        240.62

Figure: line chart of Training Time (minutes) against Number of Master Cases.

Nested Cases: Training Times for varying Number of Products purchased per customer

Assumptions:
• 200,000 cases
• 1000 products in nested

Nested Cases   Training Time
10             85.26
25             120.82
50             172.96
100            281.65

For more info…

• DM URL
  – www.microsoft.com/data/oledb
  – www.microsoft.com/data/oledb/DMResKit.htm
• News Group:
  – Microsoft.public.SQLserver.datamining
  – Communities.msn.com/AnalysisServicesDataMining
• White papers:
  – Performance paper:
    www.unisys.com/windows2000/default-07.asp
    www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp

Don’t forget to complete the on-line Session Feedback form on the Attendee Web site

https://web.mseventseurope.com/teched/