Welcome to the Microsoft SQL Data Mining / Data Warehouse Technical Seminar
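Today's Agenda
• Overview of Microsoft SQL Data Mining Technology - 左洪 (Zuo Hong), Microsoft
• Data Warehousing in Telecommunications - 贝志城 (Bei Zhicheng), 明天高科
• Data Mining in CRM - 王立军 (Wang Lijun), 中圣公司
• 灵通 IT Service Maintenance Management Service System - 邹雄文 (Zou Xiongwen), 广州灵通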
Introduction to Data Mining with SQL Server 2000
左洪 (Zuo Hong), Senior Product Marketing Manager, Microsoft (China) Co., Ltd.
Agenda
• What is Data Mining
• The Data Mining Market
• OLE DB for Data Mining
• Overview of the Data Mining Features in SQL Server 2000
• Q&A
What Is Data Mining?

What is DM?
• A process of data exploration and analysis using automatic or semi-automatic means
  - "Exploring data": scanning samples of known facts about "cases"
  - "Knowledge": clusters, rules, decision trees, equations, association rules…
• Once the "knowledge" is extracted, it:
  - Can be browsed
  - Provides useful insight into the behavior of the cases
  - Can be used to predict values of other cases
  - Can serve as a key element in closed-loop analysis
What drives high school students to attend college?
The deciding factors for high school students to attend college are…
[Figure: decision tree predicting "Attend College". Node statistics as shown on the slide:]
All Students: 55% Yes, 45% No
Split on IQ?
  IQ = High: 79% Yes, 11% No
  IQ = Low: 45% Yes, 55% No
Split on Wealth?
  Wealth = True: 94% Yes, 6% No
  Wealth = False: 69% Yes, 21% No
Split on ParentsEncourage? (branches ParentsEncourage = Yes and ParentsEncourage = No)
  Leaf statistics: 70% Yes, 30% No and 31% Yes, 69% No
Business Oriented DM Problems
• Targeted ads: "What banner should I display to this visitor?"
• Cross sells: "What other products is this customer likely to buy?"
• Fraud detection: "Is this insurance claim a fraud?"
• Churn analysis: "Which customers are likely to churn?"
• Risk management: "Should I approve the loan to this customer?"
• …
http://www.tunes.com
Mining Process - Illustrated
[Figure: Training Data is passed to the DM Engine, which produces a Mining Model. The Mining Model and the Data To Predict are then passed to the DM Engine, which produces the Predicted Data.]
The Data Mining Market

The $$$: Y2000 Market Size
• DM tools market: $250M
  - 40% license fees
  - 60% consulting
* Gartner
The Players
• Leading vendors
  - SAS
  - SPSS
  - IBM
• Hundreds of smaller vendors offering DM algorithms…
• Oracle - Thinking Machines acquisition
The Products
• End-to-end data mining tools
  - Extraction, cleansing, loading, modeling, algorithms (dozens), analyst's workbench, reporting, charting…
• The customer is the power analyst
  - A PhD in statistics is usually required…
• Closed tools, no standard API
  - Total vendor lock-in
  - Limited integration with applications
• DM is an "outsider" in the data warehouse
• Extensive consulting required
• Skyrocketing prices
  - $60K+ for a single-user license
What the analysts say…
• "Stand-alone Data Mining Is Dead" - Forrester
• "The demise of [stand alone] data mining" - Gartner
The Microsoft Approach

DataPro Users Survey 1999-2001
• "Data mining will be the fastest-growing BI technology…"

The $$$: 2000 Market Size
• DM applications market size: $1.5B
* IDC
SQL Server 2000 - The Analysis Platform
• SQL 2000 provides a complete analysis platform
  - Not an isolated, stand-alone DM product
• Platform means:
  - The infrastructure for applications
  - Not an application by itself
  - Integrated vision for all technologies and tools
  - Standards-based APIs (OLE DB for DM)
  - Extensible
  - Scalable
Data Flow
[Figure: data flow diagram. Elements shown: OLTP, DW, OLAP, DM, Apps, Reports & Analysis.]
Analysis Services 2000 - Architecture
[Figure: architecture diagram. Components shown: Manager UI, DSO, Analysis Server, Client, OLE DB OLAP, DM, OLAP Engine, OLAP Engine (local), DM Engine, DM Engine (local), DMM, DM Wizards, DM DTS Task, Ext.]
OLE DB for Data Mining…

Why OLE DB for DM?
Make DM a mass market technology by:
• Leveraging existing technologies and knowledge
  - SQL and OLE DB
  - Common industry-wide concepts and data presentation
• Changing the DM market perception from "proprietary" to "open"
• Increasing the number of players:
  - Reduce the cost and risk of becoming a consumer: one tool works with multiple providers
  - Reduce the cost and risk of becoming a provider: focus on expertise and find many partners to complement the offering
• Dramatically increasing the number of DM developers
Integration With RDBMS
• Customers would like to:
  - Build DM models from within their RDBMS
  - Train the models directly off their relational tables
  - Perform predictions as relational queries (tables in, tables out)
  - Feel that DM is a native part of their database
• Therefore…
  - Data mining models are relational objects
  - All operations on the models are relational
  - The language used is SQL (with extensions)
• The effect: every DBA and VB developer can become a DM developer
Creating a Data Mining Model (DMM)

Identifying the "Cases"
• DM algorithms analyze "cases"
• The "case" is the entity being categorized and classified
• Examples
  - Customer credit risk analysis: Case = Customer
  - Product profitability analysis: Case = Product
  - Promotion success analysis: Case = Promotion
• Each case encapsulates all we know about the entity
A Simple Set of Cases
StudentID  Gender  Parent Income  IQ   Encouragement   College Plans
1          Male    23400          120  Not Encouraged  No
2          Female  79200          90   Encouraged      Yes
3          Male    42000          105  Not Encouraged  Yes
More Complicated Cases
Cust ID  Age  Marital Status  IQ  Favorite Movies (Title, Score)
1        35   M               2   Star Wars, 8
                                  Toy Story, 9
                                  Terminator, 7
2        20   S               3   Star Wars, 7
                                  Braveheart, 7
                                  The Matrix, 10
3        57   M               2   Sixth Sense, 9
                                  Casablanca, 10
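One way to picture a case with a nested table is as a record that holds a list of child rows. The following is only an illustrative Python sketch, not part of OLE DB for Data Mining: the class and field names (`Customer`, `MovieRating`, `favorite_movies`) are assumptions made for this example, and the rows are the ones shown on the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MovieRating:
    """One row of the nested 'Favorite Movies' table."""
    title: str
    score: int


@dataclass
class Customer:
    """One case: scalar attributes plus a nested table of movie ratings."""
    cust_id: int
    age: int
    marital_status: str
    iq: int  # the column the slide labels "IQ"
    favorite_movies: List[MovieRating] = field(default_factory=list)


cases = [
    Customer(1, 35, "M", 2, [MovieRating("Star Wars", 8),
                             MovieRating("Toy Story", 9),
                             MovieRating("Terminator", 7)]),
    Customer(2, 20, "S", 3, [MovieRating("Star Wars", 7),
                             MovieRating("Braveheart", 7),
                             MovieRating("The Matrix", 10)]),
    Customer(3, 57, "M", 2, [MovieRating("Sixth Sense", 9),
                             MovieRating("Casablanca", 10)]),
]

# Each case carries everything known about the entity, including its child rows.
titles_by_customer = {c.cust_id: [m.title for m in c.favorite_movies] for c in cases}
```

The point of the nesting is that a customer stays a single case no matter how many movies are attached to it.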
A DMM is a Table!
• A DMM structure is defined as a table
  - Training a DMM means inserting data into the table
  - Predicting from a DMM means querying the table
• All information describing the case is contained in columns
Creating a Mining Model
CREATE MINING MODEL [Plans Prediction]
(
    StudentID      LONG   KEY,
    Gender         TEXT   DISCRETE,
    ParentIncome   LONG   CONTINUOUS,
    IQ             DOUBLE CONTINUOUS,
    Encouragement  TEXT   DISCRETE,
    CollegePlans   TEXT   DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
Creating a mining model with nested table
CREATE MINING MODEL MoviePrediction
(
    CustomerId  LONG KEY,
    Age         LONG CONTINUOUS,
    Gender      TEXT DISCRETE,
    Education   TEXT DISCRETE,
    MovieList   TABLE PREDICT
    (
        MovieName  TEXT KEY
    )
)
USING Microsoft_Decision_Trees
Training a DMM
• Training a DMM means passing it data for which the attributes to be predicted are known
  - Multiple passes are handled internally by the provider!
• Use an INSERT INTO statement
• The DMM will not persist the inserted data
• Instead, it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules)

INSERT [INTO] <mining model name>
[(columns list)]
<source data query>
INSERT INTO
INSERT INTO [Plans Prediction]
    (StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans)
SELECT
    [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans]
FROM [CollegePlans]
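The idea that training analyzes the inserted cases instead of storing them can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the Microsoft_Decision_Trees algorithm: the `train` function is hypothetical and merely counts CollegePlans outcomes per Encouragement value, using the three sample cases from the "A Simple Set of Cases" slide.

```python
from collections import Counter, defaultdict

# The three sample cases from "A Simple Set of Cases":
# (StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans)
training_cases = [
    (1, "Male", 23400, 120, "Not Encouraged", "No"),
    (2, "Female", 79200, 90, "Encouraged", "Yes"),
    (3, "Male", 42000, 105, "Not Encouraged", "Yes"),
]


def train(cases):
    """Build model content: CollegePlans counts per Encouragement value.

    Only these summary statistics are kept; the input rows are not stored.
    """
    content = defaultdict(Counter)
    for _student_id, _gender, _income, _iq, encouragement, plans in cases:
        content[encouragement][plans] += 1
    return {key: dict(counts) for key, counts in content.items()}


model_content = train(training_cases)
```

After the call, `model_content` holds only the learned counts, which mirrors the statement that the DMM does not persist the inserted data.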
When Insert Into Is Done…
• The DMM is trained
• The model can be retrained
• Content (rules, trees, formulas) can be explored
  - OLE DB schema rowset
  - SELECT * FROM <dmm>.CONTENT
  - XML string (PMML)
• Prediction queries can be executed
Predictions

What are Predictions?
• Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values
• Predictions = queries
  - The syntax is SQL-like
  - The output is a rowset
• In order to predict you need:
  - An input data set
  - A trained DMM
  - Binding (mapping) information between the input data and the DMM
  - A specification of what to predict
The Truth Table Concept
Gender  Parent Income  IQ  Encouragement   College Plans  Probability
Male    20000          85  Not Encouraged  No             85%
Male    20000          85  Not Encouraged  Yes            15%
Male    20000          85  Encouraged      No             60%
Male    20000          85  Encouraged      Yes            40%
Male    20000          90  Not Encouraged  No             80%
Male    20000          90  Not Encouraged  Yes            20%
Male    20000          90  Encouraged      No             58%
…
Prediction
Gender  Parent Income  IQ  Encouragement   College Plans  Probability
Male    20000          85  Not Encouraged  No             85%
Male    20000          85  Not Encouraged  Yes            15%
Male    20000          85  Encouraged      No             60%
Male    20000          85  Encouraged      Yes            40%
Male    20000          90  Not Encouraged  No             80%
Male    20000          90  Not Encouraged  Yes            20%
Male    20000          90  Encouraged      No             58%
Male    20000          90  Encouraged      Yes            42%
Male    20000          95  Not Encouraged  No             78%
Male    20000          95  Not Encouraged  Yes            22%
Male    20000          95  Encouraged      No             45%
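Conceptually, a trained model behaves like a lookup table from input attributes to a probability for each possible outcome. A minimal Python sketch of that truth-table view follows; the `truth_table` dictionary and the `predict` helper are illustrative names, and only a few rows from the slide are included.

```python
# Rows copied from the slide: inputs -> probability of each CollegePlans value.
truth_table = {
    ("Male", 20000, 85, "Not Encouraged"): {"No": 0.85, "Yes": 0.15},
    ("Male", 20000, 85, "Encouraged"): {"No": 0.60, "Yes": 0.40},
    ("Male", 20000, 90, "Not Encouraged"): {"No": 0.80, "Yes": 0.20},
    ("Male", 20000, 90, "Encouraged"): {"No": 0.58, "Yes": 0.42},
}


def predict(gender, parent_income, iq, encouragement):
    """Return (most likely CollegePlans value, its probability)."""
    distribution = truth_table[(gender, parent_income, iq, encouragement)]
    best = max(distribution, key=distribution.get)
    return best, distribution[best]


plans, probability = predict("Male", 20000, 85, "Not Encouraged")
```

A real provider does not materialize every combination of inputs; the table is a mental model for what a prediction query returns.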
It's a JOIN!
StudentID  Gender  Parent Income  IQ   Encouragement
1          Male    43000          85   Not Encouraged
2          Male    20000          135  Not Encouraged
3          Female  25000          105  Encouraged
4          Male    96000          100  Encouraged
5          Female  56000          125  Not Encouraged
6          Female  46000          90   Not Encouraged
The Prediction Query Syntax
SELECT <columns to return or predict>
FROM
    <dmm> PREDICTION JOIN
    <input data set>
    ON <dmm column> = <input data column>…
Example
SELECT [New Students].[StudentID],
       [Plans Prediction].[CollegePlans],
       PredictProbability([CollegePlans])
FROM
    [Plans Prediction] PREDICTION JOIN
    [New Students]
    ON [Plans Prediction].[Gender] = [New Students].[Gender] AND
       [Plans Prediction].[IQ] = [New Students].[IQ] AND ...
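The shape of a prediction join can be approximated in Python: each input row is bound to the model's input columns, and the model returns the predicted value with its probability. This is a sketch under stated assumptions, not the provider's implementation. `toy_model` is a hypothetical stand-in for a trained DMM and its rule and probabilities are made up; the input rows are those from the "It's a JOIN!" slide.

```python
# Input rows from the "It's a JOIN!" slide:
# (StudentID, Gender, ParentIncome, IQ, Encouragement)
new_students = [
    (1, "Male", 43000, 85, "Not Encouraged"),
    (2, "Male", 20000, 135, "Not Encouraged"),
    (3, "Female", 25000, 105, "Encouraged"),
    (4, "Male", 96000, 100, "Encouraged"),
    (5, "Female", 56000, 125, "Not Encouraged"),
    (6, "Female", 46000, 90, "Not Encouraged"),
]


def toy_model(gender, iq):
    """Hypothetical stand-in for a trained model; rule and numbers are made up."""
    return ("Yes", 0.7) if iq >= 100 else ("No", 0.7)


def prediction_join(model, rows):
    """Bind each input row's Gender and IQ to the model and emit a result row."""
    results = []
    for student_id, gender, _income, iq, _encouragement in rows:
        plans, probability = model(gender, iq)
        results.append((student_id, plans, probability))
    return results


predictions = prediction_join(toy_model, new_students)
```

Each result row corresponds to the three selected columns in the example query: StudentID, the predicted CollegePlans, and its probability.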
OLE DB DM Sample Provider with Source
• All required OLE DB objects, such as session, command, and rowset
• The OLE DB for Data Mining syntax parser
• Tokenization of input data
• Query processing engine
• A sample Naïve Bayes algorithm
• Model persistence in XML and binary formats
• Available at www.microsoft.com/data/oledb/DMResKit.htm
Integrated OLAP and DM Analysis

Why Use DM with OLAP
• Relational DM is designed for:
  - Reports of patterns
  - Batch predictions fed into an OLTP system
  - Real-time singleton prediction in an operational environment
• OLAP is designed for interactive analysis by a knowledge worker
  - Consistent and convenient navigational model
  - Pre-aggregations of OLAP allow faster performance
Understanding DM Content - Decision Trees
[Figure: decision tree predicting "Credit Risk". Node statistics as shown on the slide:]
All Customers: 65% Good, 35% Bad
Debt?
  Debt = Low: 89% Good, 11% Bad
    Employment Type?
      ET = Salaried: 94% Good, 6% Bad
      ET = SelfEmployed: 79% Good, 21% Bad
  Debt = High: 45% Good, 55% Bad
    Education?
      Education = College: 70% Good, 30% Bad
      Education = High School: 31% Good, 69% Bad
Customers having high debt and college education:
Filter([Individual Customers].Members,
    Customers.CurrentMember.Properties("Debt") = "High" And
    Customers.CurrentMember.Properties("Education") = "College")
Customers having low debt who are self employed:
Filter([Individual Customers].Members,
    Customers.CurrentMember.Properties("Debt") = "Low" And
    Customers.CurrentMember.Properties("Employment Type") = "Self Employed")
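For readers less familiar with MDX, the two Filter expressions amount to selecting the members whose properties match a tree node's conditions. A rough Python equivalent is sketched below; the customer records and the `filter_members` helper are hypothetical and exist only for this illustration.

```python
# Hypothetical customer members with their member properties.
customers = [
    {"name": "A", "Debt": "High", "Education": "College", "Employment Type": "Salaried"},
    {"name": "B", "Debt": "Low", "Education": "High School", "Employment Type": "Self Employed"},
    {"name": "C", "Debt": "High", "Education": "High School", "Employment Type": "Salaried"},
    {"name": "D", "Debt": "Low", "Education": "College", "Employment Type": "Salaried"},
]


def filter_members(members, conditions):
    """Keep the members whose properties match every (property, value) pair."""
    return [
        member for member in members
        if all(member[prop] == value for prop, value in conditions.items())
    ]


high_debt_college = filter_members(
    customers, {"Debt": "High", "Education": "College"})
low_debt_self_employed = filter_members(
    customers, {"Debt": "Low", "Employment Type": "Self Employed"})
```

Each path from the root of the decision tree to a node translates into one such set of conditions.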
…Equivalent DM Dimension
Each member is listed with its custom roll-up formula and its Credit Risk member property:
All Customers: roll-up -; Credit Risk: Good = 65%, Bad = 35%
  Customers with low debt: roll-up Aggregate(Filter(…; Credit Risk: Good = 89%, Bad = 11%
    Customers with low debt and self employed: roll-up Aggregate(Filter(…; Credit Risk: Good = 79%, Bad = 21%
    Customers with low debt and salaried: roll-up Aggregate(Filter(…; Credit Risk: Good = 94%, Bad = 6%
  Customers with high debt: roll-up Aggregate(Filter(…; Credit Risk: Good = 45%, Bad = 55%
    Customers with high debt and college education: roll-up Aggregate(Filter(…; Credit Risk: Good = 70%, Bad = 30%
    Customers with high debt and high school education: roll-up Aggregate(Filter(…; Credit Risk: Good = 31%, Bad = 69%
Tree = Dimension
• Every node on the tree is a dimension member
• The node statistics are the member properties
• All members are calculated
  - The formula aggregates the case dimension members that apply to this node
  - The MDX is generated by the DM algorithm
• Analysis Services will automatically generate the calculated dimension based on the DM content, and also a virtual cube
• Applies to:
  - Classification (decision trees)
  - Segmentation (clusters)
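The idea that every tree node is a calculated member that aggregates the cases matching its condition can be sketched in Python. This is a hedged illustration only: the six cases, the `nodes` mapping, and the `node_statistics` helper are invented for the example and do not reproduce how Analysis Services builds the dimension.

```python
from collections import Counter

# Hypothetical cases: each has a Debt property and a known Credit Risk.
cases = [
    {"Debt": "Low", "Credit Risk": "Good"},
    {"Debt": "Low", "Credit Risk": "Good"},
    {"Debt": "Low", "Credit Risk": "Bad"},
    {"Debt": "High", "Credit Risk": "Bad"},
    {"Debt": "High", "Credit Risk": "Bad"},
    {"Debt": "High", "Credit Risk": "Good"},
]

# Each tree node is a member defined by a condition over the cases.
nodes = {
    "All Customers": lambda case: True,
    "Customers with low debt": lambda case: case["Debt"] == "Low",
    "Customers with high debt": lambda case: case["Debt"] == "High",
}


def node_statistics(all_cases, condition):
    """Aggregate the Credit Risk counts of the cases that fall under a node."""
    return dict(Counter(c["Credit Risk"] for c in all_cases if condition(c)))


statistics = {name: node_statistics(cases, cond) for name, cond in nodes.items()}
```

Here the counts play the role of the node statistics that the slide describes as member properties.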
Browsing the Virtual Cube
Pivot the DM dimension:
                          WA    OR    CA
All Customers             3200  2500  8000
Customers with low debt   2320  1503  4300
Customers with high debt  880   997   4700
Customers … college       320   450   2310
Customers … high school   560   547   2390
Credit Risk: 70% Good, 30% Bad
Predictions
• You might want to view predictions for each case
• For example:
  - What is the expected profitability of a product?
  - What is the credit risk of a specific customer?
  - What are the products this customer is likely to buy?
• All of those predictions are available through MDX calculated members
• The singleton query is created automatically
Prediction Calculated Member
Measures.[Probability of High Credit Risk]:
PREDICT(Customers.CurrentMember,
    "Credit Risk Model",
    "PredictionProbability(
        PredictionHistogram("Credit Risk"),
        'High')"
)
Predictions Example
                  Probability of High Credit Risk  Probability of Low Credit Risk
Joe Smith         73%                              27%
John Dow          68%                              32%
William Clington  45%                              55%
Robert Maxwell    98%                              2%
Denis Rodman      81%                              19%
Questions?
E-Mail: [email protected]
http://www.microsoft.com/china/sql