Page 1: slides

Data Stream Mining Applications: Toward Inductive DSMS

CS240B Notes by Carlo Zaniolo, UCLA Computer Science Department, Spring 2008

Page 2: slides

21-Mar-08 2 http://wis.cs.ucla.edu

Data Stream Mining and DSMS

• Mining data streams: an emerging area of important applications
• Many fast & light algorithms developed for mining data streams: Ensembles, Moment, SWIM, etc.
• Deployment of these algorithms on data streams is a challenge
 – To deal with bursty arrivals, synopses, QoS, scheduling
• Analysts want to focus on high-level mining tasks, leaving such lower-level issues to the DSMS
• Integration of mining methods and DSMS technology is needed, but it faces difficult research challenges:
 – Data mining: a big problem for SQL-based DBMS

Page 3: slides

Road Map for Next Three Weeks

• Data Mining query languages and systems
 – The Inductive DBMS dream and the reality: Oracle, IBM DB2, MS DMX, Weka
• Fast & Light Algorithms for Mining Data Streams
 – Classifiers and Classifier Ensembles, Clustering methods, Association Rules, Time series
• Supporting these Algorithms in a DSMS
 – Data Mining Query Languages and support for the mining process

Page 4: slides

The DM Experience for DBMS: from dreams to reality

• Initial attempts to support mining queries in relational DBMS: Unsuccessful
 – OR-DBMS do not fare much better [Sarawagi’ 98].
• In 1996, a ‘high-road’ approach was proposed by Imielinski & Mannila, who called for a quantum leap in functionality based on:
 – High-level declarative languages for Data Mining (DM)
 – Technology breakthrough in DM query optimization.
• The research area of Inductive DBMS was thus born
 – Inspiring significant work: DMQL, Mine Rule, MSQL, …
 – These suffer from limited generality and performance issues.

Page 5: slides

DB2 Intelligent Miner

• Model creation
• Training:

    CALL IDMMX.DM_buildClasModelCmd(
        'IDMMX.CLASTASKS', 'TASK', 'ID', 'HeartClasTask',
        'IDMMX.CLASSIFMODELS', 'MODEL', 'MODELNAME', 'HeartClasModel');

• Prediction
 – Stored procedures and virtual mining views
• Outside the DBMS (like Cache Mining)
 – Data transfer delays

http://www-306.ibm.com/software/data/iminer/

Page 7: slides

Oracle Data Miner

• Algorithms
 – Adaptive Naïve Bayes
 – SVM regression
 – K-means clustering
 – Association rules, text mining, etc.
• PL/SQL with extensions for mining
• Models as first class objects
 – Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.

http://www.oracle.com/technology/products/bi/odm/index.html

Page 8: slides

OLE DB for DM (DMX)

• Model creation

    Create mining model MemCard_Pred (
        CustomerId long key,
        Age long continuous,
        Profession text discrete,
        Income long continuous,
        Risk text discrete predict
    ) Using Microsoft_Decision_Trees;

• Training

    Insert into MemCard_Pred
    OpenRowSet("'sqloledb', 'sa', 'mypass'",
        'SELECT CustomerId, Age, Profession, Income, Risk from Customers')

• Prediction Join

    Select C.Id, C.Risk, PredictProbability(MemCard_Pred.Risk)
    From MemCard_Pred AS MP Prediction Join Customers AS C
    On MP.Profession = C.Profession and MP.Income = C.Income AND MP.Age = C.Age;

Page 9: slides

Defining a Mining Model

• Define
 – The format of "training cases" (top-level entity)
 – Attributes, input/output type, distribution
 – Algorithms and parameters
• Example

    CREATE MINING MODEL CollegePlanModel (
        StudentID LONG KEY,
        Gender TEXT DISCRETE,
        ParentIncome LONG NORMAL CONTINUOUS,
        Encouragement TEXT DISCRETE,
        CollegePlans TEXT DISCRETE PREDICT
    ) USING Microsoft_Decision_Trees

Page 10: slides

Training

    INSERT INTO CollegePlanModel
        (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
    OPENROWSET('<provider>', '<connection>',
        'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
         FROM CollegePlansTrainData')

Page 11: slides

Prediction Join

    SELECT t.ID, CPModel.Plan
    FROM CPModel PREDICTION JOIN
        OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
    ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ

[Figure: the CPModel and NewStudents tables side by side, with columns ID, Gender, IQ and Plan]
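To make the semantics of the PREDICTION JOIN concrete, here is a minimal Python sketch. It is not how Analysis Services evaluates the query. The `train` and `prediction_join` functions, the majority-vote lookup that stands in for the mining model, and the sample rows are all assumptions made for illustration; only the column names (Gender, IQ, Plan) and the table names come from the slide.

```python
from collections import Counter, defaultdict

def train(cases, inputs, target):
    """Toy stand-in for a mining model: majority target value per input combination."""
    votes = defaultdict(Counter)
    for case in cases:
        key = tuple(case[c] for c in inputs)
        votes[key][case[target]] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}

def prediction_join(model, inputs, new_rows, target):
    """Match each new row to the model on the input columns (the ON clause)
    and emit the row's ID together with the predicted target value."""
    for row in new_rows:
        key = tuple(row[c] for c in inputs)
        yield {"ID": row["ID"], target: model.get(key)}  # None when nothing matches

# Hypothetical training cases and new students (not from the slides).
training = [
    {"ID": 1, "Gender": "F", "IQ": 110, "Plan": "college"},
    {"ID": 2, "Gender": "F", "IQ": 110, "Plan": "college"},
    {"ID": 3, "Gender": "M", "IQ": 95, "Plan": "work"},
]
new_students = [
    {"ID": 10, "Gender": "F", "IQ": 110},
    {"ID": 11, "Gender": "M", "IQ": 95},
    {"ID": 12, "Gender": "M", "IQ": 130},
]

cp_model = train(training, ["Gender", "IQ"], "Plan")
predictions = list(prediction_join(cp_model, ["Gender", "IQ"], new_students, "Plan"))
```

A real decision tree would generalize to unseen combinations such as student 12 rather than return no prediction; the lookup table only stands in for the trained model.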

Page 12: slides

OLE DB for DM (DMX) (cont.)

• Mining objects as first class objects
 – Schema rowsets: Mining_Models, Mining_Model_Content, Mining_Functions
• Other features
 – Column value distribution
 – Nested cases

http://research.microsoft.com/dmx/DataMining/

Page 13: slides

Summary of Vendors’ Approaches

• Built-in library of mining methods
• Script language or GUI tools
• Limitations
 – Closed systems (internals hidden from users)
 – Adding new algorithms or customizing old ones: difficult
 – Poor integration with SQL
 – Limited interoperability across DBMSs
• Predictive Model Markup Language (PMML) as a palliative

Page 14: slides

PMML

• Predictive Model Markup Language
 – XML-based language for vendor-independent definition of statistical and data mining models
 – Share models among PMML-compliant products
 – A descriptive language
 – Supported by all major vendors

Page 15: slides


PMML Example
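As an illustration of what a PMML document can look like (this is not the example from the slide), the following Python sketch builds a tiny decision-tree model with the standard library. The field names, the model name `RiskTree`, the threshold, and the choice of version 3.2 are assumptions for illustration, and the output has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET

# Root element and header; the version number is an assumption.
pmml = ET.Element("PMML", version="3.2")
ET.SubElement(pmml, "Header", copyright="example",
              description="Toy tree model (illustrative only)")

# Data dictionary: the fields the model talks about (hypothetical names).
dictionary = ET.SubElement(pmml, "DataDictionary")
ET.SubElement(dictionary, "DataField", name="Income",
              optype="continuous", dataType="double")
ET.SubElement(dictionary, "DataField", name="Risk",
              optype="categorical", dataType="string")

# A classification tree with one input field and one predicted field.
tree = ET.SubElement(pmml, "TreeModel", modelName="RiskTree",
                     functionName="classification")
schema = ET.SubElement(tree, "MiningSchema")
ET.SubElement(schema, "MiningField", name="Income")
ET.SubElement(schema, "MiningField", name="Risk", usageType="predicted")

# Root node predicts "low"; its child predicts "high" when Income < 20000.
root = ET.SubElement(tree, "Node", score="low")
ET.SubElement(root, "True")
high = ET.SubElement(root, "Node", score="high")
ET.SubElement(high, "SimplePredicate", field="Income",
              operator="lessThan", value="20000")

document = ET.tostring(pmml, encoding="unicode")
print(document)
```

The point of the format is that the model is described declaratively (fields, schema, tree nodes), so a PMML-compliant product other than the one that trained the model can load and score it.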

Page 16: slides

The Data Mining Software Vendors Market Competition

The Data Mining World According to

Page 17: slides

Disclaimer

This presentation contains preliminary information that may be changed substantially prior to final commercial release of the software described herein.

The information contained in this presentation represents the current view of Microsoft Corporation on the issues discussed as of the date of the presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of the presentation.

This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this presentation. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this information does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2005 Microsoft Corporation. All rights reserved.

Page 18: slides

Major Data Mining Vendors

• Platforms: IBM, Oracle, SAS
• Tools: SPSS, Angoss, KXEN, Megaputer, FairIsaac, Insightful

Page 19: slides

Competition

Products compared: SQL Server 2005, Oracle 10g, IBM, SAS

• Product
 – SQL Server 2005: SQL Server Analysis Services
 – Oracle 10g: Oracle Data Mining
 – IBM: DB2 Intelligent Miner, WebSphere
 – SAS: Enterprise Miner
• Link
 – Oracle 10g: http://otn.oracle.com/products/bi/odm/odmining.html
 – IBM: http://www-306.ibm.com/software/data/iminer/
 – SAS: http://www.sas.com/technologies/analytics/datamining/miner/factsheet.pdf
• API
 – SQL Server 2005: OLEDB/DM, DMX, XMLA, ADOMD.Net
 – Oracle 10g: Java DM, PL/SQL
 – IBM: SQL MM/6 based on UDF, SQL SPROC
 – SAS: SAS Script
• Algorithms
 – SQL Server 2005: 7 (+2)
 – Oracle 10g: 8
 – IBM: 6
 – SAS: 8+
• Text Mining
 – Yes for all four
• Marketing Pages
 – SQL Server 2005: N/A
 – Oracle 10g: 18
 – IBM: 10
 – SAS: Dozens
• Client Tools
 – Embeddable Viewers, Reporting Services; Analysis tools, Web-based targeted reports; Discoverer; WebSphere Portal (vertical solution); IM Visualization; Excel AddIn; None
• Distribution
 – SQL Server 2005: Included
 – Oracle 10g: Additional Package
 – IBM: Additional Packages
 – SAS: Separate Product
• Target
 – SQL Server 2005: Developers
 – Oracle 10g: Developers
 – IBM: DB2 IM Scoring module is for developers; other modules are for analysts.
 – SAS: Analysts
• Strengths
 – SQL Server 2005: Powerful yet simple API. Integration with other BI technologies. New GUI.
 – Oracle 10g: Good credibility with enterprise customers. New GUI, Leader of JDM API. CRM Integration.
 – IBM: Mature product (6 years). Good service model. Scoring inside relational engine. Strong partnership with SAS.
 – SAS: Mature, Market Leader. Extensive customization and modelling abilities. Robust, industry tested and accepted algorithms and methodologies. Export to DB2 Scoring.
• Weaknesses
 – SQL Server 2005: Not in-process with relational engine. Lacking statistical functions. Poor Analyst experience.
 – Oracle 10g: API overly complex. Inconsistent.
 – IBM: High price. Standard Functionality. Poor API (SQL MM). Confusing product line.
 – SAS: Expensive. Proprietary. Customer relations range from congenial to hostile.

Page 20: slides

Major DM Vendors

• Platforms: IBM, Oracle, SAS
• Tools: SPSS, Angoss, KXEN, Megaputer, FairIsaac, Insightful

• SAS Institute (Enterprise Miner)
• IBM (DB2 Intelligent Miner for Data)
• Oracle (ODM option to Oracle 10g)
• SPSS (Clementine)
• Unica Technologies, Inc. (Pattern Recognition Workbench)
• Insightful (Insightful Miner)
• KXEN (Analytic Framework)
• Prudsys (Discoverer and its family)
• Microsoft (SQL Server 2005)
• Angoss (KnowledgeServer and its family)
• DBMiner (DBMiner)
• etc…

Page 21: slides

ORACLE

Strengths

• Oracle Data Mining (ODM)
• Integrated into relational engine
 – Performance benefits
 – Management integration
 – SQL Language integration
• ODM Client
 – “Walks through” Data Mining Process
 – Data Mining tailored data preparation
 – Generates code
• Integration into Oracle CRM
 – “EZ” Data Mining for customer churn, other applications
• Full suite of algorithms
 – Typical algorithms, plus text mining and bioinformatics
• Nice marketing/user education

Page 22: slides

ORACLE

Weaknesses

• Additional Licensing Fees (base $400/user, $20K proc)
• Confusing API Story
 – Certain features only work with Java API
 – Certain features only work with PL/SQL API
 – Same features work differently with different APIs
• Difficult to use
 – Different modeling concepts for each algorithm
• Poor connectivity
 – ORACLE only

Page 23: slides

SAS

• Entrenched Data Mining Leader
 – Market Share
 – Mind Share
• “Best of Breed”
 – Always will attract the top ?% of customers
• Overall poor product
 – Only for the expert user (SAS Philosophy)
 – Integration of results generally involves source code
• Integrated with ETL, other SAS tools
• Partnership with IBM
 – Model in SAS, deploy in DB2

Page 24: slides

Our View ...

• Progress toward high-level data models and integration with SQL, but
 – Closed systems
 – Lacking in coverage and user-extensibility
• Not as popular as dedicated, stand-alone DM systems, such as Weka.

Page 25: slides

Weka

• A comprehensive set of DM algorithms and tools.
• Generic algorithms over arbitrary data sets.
 – Independent of the number of columns in tables.
• Open and extensible system based on Java.
• These are the features that we want in our Inductive DSMS, starting from SQL rather than Java!
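One way to read "independent of the number of columns" is that the learner never hard-codes a table's schema. The Python sketch below shows this with a small Naive Bayes classifier over categorical attributes. It is an illustration only: the function names, the (row id, column, value, label) tuple layout, the crude add-one smoothing and the sample tables are all hypothetical and are not taken from Weka or from the slides.

```python
from collections import Counter, defaultdict

def verticalize(rows, id_column, target):
    """Turn wide rows into (row_id, column, value, label) tuples, so the
    learner below never needs to know how many columns a table has."""
    for row in rows:
        for column, value in row.items():
            if column not in (id_column, target):
                yield row[id_column], column, value, row[target]

def train_naive_bayes(rows, id_column, target):
    class_counts = Counter(row[target] for row in rows)
    pair_counts = defaultdict(Counter)  # (column, value) -> label counts
    for _, column, value, label in verticalize(rows, id_column, target):
        pair_counts[(column, value)][label] += 1
    return class_counts, pair_counts

def classify(model, row, id_column):
    class_counts, pair_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total
        for column, value in row.items():
            if column != id_column:
                # Crude add-one smoothing so unseen values do not zero the score.
                score *= (pair_counts[(column, value)][label] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)

# Two hypothetical tables with different numbers of columns, same code.
weather = [
    {"id": 1, "outlook": "sunny", "windy": "no", "play": "yes"},
    {"id": 2, "outlook": "sunny", "windy": "no", "play": "yes"},
    {"id": 3, "outlook": "rainy", "windy": "yes", "play": "no"},
]
loans = [
    {"id": 1, "income": "low", "owns_home": "no", "employed": "no", "risk": "high"},
    {"id": 2, "income": "high", "owns_home": "yes", "employed": "yes", "risk": "low"},
]

weather_model = train_naive_bayes(weather, "id", "play")
loan_model = train_naive_bayes(loans, "id", "risk")
weather_prediction = classify(weather_model, {"id": 9, "outlook": "sunny", "windy": "no"}, "id")
loan_prediction = classify(loan_model, {"id": 7, "income": "low", "owns_home": "no", "employed": "no"}, "id")
```

The same two functions handle the two-attribute and the three-attribute table without change, which is the kind of genericity the slide asks for.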

Page 26: slides

References

• [Imielinski’ 96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Commun. ACM, 39(11):58–64, 1996.
• Carlo Zaniolo. Mining Databases and Data Streams with Query Languages and Rules. Invited Talk, Fourth International Workshop on Knowledge Discovery in Inductive Databases, KDID 2005.

Page 27: slides


Thank you!