Data Mining With Excel 2007 And SQL Server 2008

Data Mining with Excel 2007 and SQL Server 2008 Mark Tabladillo Ph.D. http://www.marktab.net November 10, 2008

description

Introduction to Excel 2007 Data Mining Plug-In using SQL Server 2008. The presentation starts with definitions and statistical theory (without equations). Then, the audience interactively participates in four demos showing the power and possibilities of the Microsoft Data Mining Algorithms.


Data Mining with Excel 2007 and SQL Server 2008

Mark Tabladillo Ph.D.

http://www.marktab.net

November 10, 2008

Approach of this Presentation

• Emphasize

– Conceptual value of data mining

– Relationship of data mining to the real world

• Reserve

– Specific procedures and mechanics

– Specific mathematics

– Production implementation

© 2008 Mark Tabladillo Ph.D. 2

Introduction

• Microsoft Data Mining (MDM) is a major branch of SQL Server Analysis Services (SSAS)

• The technology is supported by a new language within SSAS called DMX (Data Mining Extensions)

• Currently, the two promoted interfaces are BIDS (Business Intelligence Development Studio) and Excel 2007


Introduction

• SQL Server 2008 has some improvements over 2005, but the main technology is similar

• A major improvement for 2008 is the documentation (Books Online)

• Microsoft’s team releases technology information at http://www.sqlserverdatamining.com


Outline

• Main Conclusions on Data Mining

• Data Mining Definition

• Microsoft Data Mining Fundamentals

• Overview of Microsoft Data Mining Algorithms

• Conclusion


Four Interactive Demos

• Card Sorting

• Demographic Profiles

• Sports (College Football)

• Money (American Economy)


Data Mining Definitions

• Data mining is the automatic or semi-automatic process of exploring data for meaningful or useful patterns.

• Data mining algorithms typically use estimation or optimization to achieve results (as opposed to only calculations).
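The distinction between a calculation and an estimation can be sketched in a few lines. The example below is only an illustration, not part of the Microsoft tools: the data values, the `fit_line` helper, and the learning-rate settings are all assumptions chosen for the demonstration. The mean is a direct calculation; the line fit is reached by repeatedly adjusting two parameters to reduce error, which is the kind of optimization the definition refers to.

```python
# Calculation: a closed-form summary with one exact answer.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical input attribute
sales = [3.0, 5.0, 7.0, 9.0, 11.0]   # hypothetical output attribute
mean_sales = sum(sales) / len(sales)


def fit_line(xs, ys, rate=0.01, steps=20000):
    """Estimate slope and intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Error of the current line at every data point.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error for each parameter.
        grad_slope = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_intercept = 2.0 * sum(errors) / n
        # Step both parameters a little in the direction that lowers the error.
        slope -= rate * grad_slope
        intercept -= rate * grad_intercept
    return slope, intercept


# Estimation: the answer is approached step by step rather than computed directly.
slope, intercept = fit_line(hours, sales)
print("mean:", mean_sales)
print("estimated line:", round(slope, 3), round(intercept, 3))
```

With other data, step sizes, or starting points the estimate could differ slightly from run to run of the procedure, which is one reason two analysts can reach different models from the same data.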


Data Mining Provides Insight

• Business

– What reasons contribute to stock price changes?

– Why did longer-term jobless benefits hit a 25-year high?

• Entertainment

– Who is more likely to lose a civil lawsuit?

– How well will new DVD sales do in the next few months?


Data Mining Provides Insight

• Sports

– How much should a sports team offer for a proven free agent?

– What factors lead to winning a tennis championship?

• Technology

– How does Cisco know there are warning signals in the tech sector?

– What is the net loss in losing corporate secrets?


Data Mining Provides Insight

• Politics

– What priorities do American voters have for the new President?

– Why did a certain candidate win or lose a race?

• Science

– What factors contribute to ozone holes over the Antarctic?

– Why do we believe that Tyrannosaurus Rex had a good sense of smell?


Functions in Technology

• Job Titles = Rationalized System to Pay People Less or Give them More Responsibility

• “Engineer”?

• “Scientist”?


The Scientific Method

• (Suppose you are a computer scientist)

• Define the question

• Gather information and resources (observe)

• Form hypothesis

• Perform experiment and collect data


The Scientific Method

• Analyze data – data mining is an option

• Interpret data and draw conclusions that serve as a starting point for new hypotheses

• Publish results

• Retest (frequently done by other scientists)


Microsoft Data Mining

• Microsoft Data Mining refers to Microsoft’s specific implementation of certain common data mining algorithms for the DMX (Data Mining Extensions) language.

• Also called SQL Server Data Mining, the technology is implemented through tools rather than through a single, finished application interface.


Data Mining Input and Results

• Data mining input can include continuous numeric, categorized (ordinal or nominal), and text data.

• Data mining results consist of a lower-dimensional model, describing either the empirical data (unsupervised) or the relationship between named input and output attributes (supervised).


Data Explosion


Donald Farmer – May 2008

"[We don't] have all the functionality of something like a SAS or an SPSS, because that's just not our market," he conceded.

It comes down to a difference of scale, according to Farmer. SAS and SPSS typically target larger, more expensive deployments, typically with users well-versed in the usage of their tools. Microsoft is targeting a different kind of data mining consumer: the Excel analyst, for example, who might not have much (if any) experience with data mining, predictive analytics or statistical analysis, for that matter.


Donald Farmer – May 2008

"By the way, I don't mean to say we can't hit the high-end. Within Microsoft, we have our own database marketing team. We're one of the largest companies in the world. We have a huge database marketing team who do classic customer analysis. These guys were all SAS users, but when they joined Microsoft, they started using our tools. The entire process runs on our database, they actually use the Excel [data mining] add-ins to do it. It's not that there's nothing they don't miss, [it's that] they are able to achieve the same business results using our tools."

Redmond Magazine – May 7, 2008

http://redmondmag.com/news/article.asp?EditorialsID=9836


Obtaining the Add-in


Obtaining the Add-in (Nov 2008)


http://www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx

System Requirements

• Supported Operating Systems: Windows Server 2003 Service Pack 2; Windows Server 2008; Windows Vista Service Pack 1; Windows XP Service Pack 3

• Microsoft .NET Framework 2.0.

• If installing the Table Analysis Tools or Data Mining Client for Excel, Microsoft Office 2007 with .NET Programmability Support. Supported editions of Office 2007 include:

– Professional

– Professional Plus

– Ultimate

– Enterprise

• If installing the Data Mining Templates for Visio, Microsoft Visio Professional 2007 with .NET Programmability Support.

• 40 MB of available hard disk space.

• Note: The Data Mining Add-ins require a connection to one of the following versions of SQL Server 2008 Analysis Services:

– Enterprise

– Standard


Delivering Predictive Analysis to Every User

• Comprehensive – Extend the benefits of predictive analysis to all users, delivering a full data mining development life cycle through the familiar environment of the 2007 Microsoft Office system.

• Intuitive – Empower users to harness advanced data mining technologies, hiding complexity behind automated tasks that deliver actionable insight throughout the organization.

• Collaborative – Share data mining models through interactive graphical visualizations, and deliver recommendation and insight with simple and prompt publishing capabilities.


Top New Features

• Score new cases to seek most profitable customers with new Prediction Calculator.

• Discover cross-sell/up-sell opportunities to optimize offerings with new Shopping Basket Analysis.

• Validate accuracy and stability of models simultaneously with new, richly formatted Cross Validation.

• Generate summary reports to enhance referencing and collaboration with the new Document Model feature.
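Cross validation, named in the feature list above, can be sketched generically. This is not the add-in's implementation: the `k_fold_indices` and `cross_validate` helpers, the majority-label toy model, and the ten labels are all hypothetical. The idea is to hold out each fold in turn, train on the rest, and compare scores across folds. The average speaks to accuracy and the spread speaks to stability.

```python
def k_fold_indices(n_rows, k):
    """Deal row indices 0..n_rows-1 into k nearly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n_rows):
        folds[i % k].append(i)
    return folds


def cross_validate(rows, labels, k, train, predict):
    """Return one accuracy score per held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        # Train only on the rows that are NOT in the held-out fold.
        train_rows = [r for i, r in enumerate(rows) if i not in held]
        train_labels = [l for i, l in enumerate(labels) if i not in held]
        model = train(train_rows, train_labels)
        # Score the model on the rows it has not seen.
        correct = sum(predict(model, rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Hypothetical toy model: always predict the most common training label.
def train_majority(rows, labels):
    return max(set(labels), key=labels.count)


def predict_majority(model, row):
    return model


rows = list(range(10))                 # placeholder cases
labels = ["YES"] * 7 + ["NO"] * 3      # hypothetical outcomes
scores = cross_validate(rows, labels, 5, train_majority, predict_majority)
print("per-fold accuracy:", scores)
print("average accuracy:", sum(scores) / len(scores))
```

A model whose per-fold scores vary widely is less stable than one with similar scores in every fold, even when the averages match.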


SQL Server 2008 Menu Items


Asking Permission


Asking Permission Text

DBA Person,

I have downloaded and installed Microsoft SQL Server 2008 Data Mining Add-ins for Office 2007 on my machine ARCHITECT. These add-ins let me analyze my spreadsheet data in powerful ways by utilizing Microsoft SQL Server 2008 Analysis Services.

In order to use these add-ins, I will need to be connected to an instance of Microsoft SQL Server 2008 Analysis Services that has been configured to support the add-ins. This configuration needs to be carried out by an administrator by following these steps:

1. Download the add-ins package from http://www.microsoft.com/sqlserver/2008/en/us/trial-software.aspx.

2. Launch the Setup, select the Server Configuration Tool and install it.

3. Run the Server Configuration Tool and follow the wizard steps.

I would appreciate it if you could let me know whether it is possible for you to configure an instance of SQL Server 2008 Analysis Services as described above and give me access to it.

Thank you,

Data Miner


What is a model?


List the Data Mining Algorithms

• Ten Answers

• Each one is a field of academic focus


The Data Mining Algorithms

• Microsoft Decision Trees

• Microsoft Clustering

• Microsoft Time Series

• Microsoft Association Rules

• Microsoft Sequence Clustering

• Microsoft Naive Bayes

• Microsoft Neural Network

• Microsoft Linear Regression

• Microsoft Logistic Regression

• Text Mining


What is a calculation?

• Business intelligence relies on many common calculations.


A Parable of Unity and Diversity

• One day a parabola met a line. They each wondered aloud how much they had in common. They moved around to find out.


Parabola

Line

The Analyze Tab


Menu Option                      Data Mining Algorithm

Analyze Key Influencers          Naïve Bayes
Detect Categories                Clustering
Fill from Example                Logistic Regression
Forecast                         Time Series
Highlight Exceptions             Clustering
Scenario Analysis (Goal Seek)    Logistic Regression
Scenario Analysis (What If)      Logistic Regression
Prediction Calculator            Logistic Regression
Shopping Basket Analysis         Association Rules

Why Different Button Names?


Menu Option                      Data Mining Algorithm

Analyze Key Influencers          Naïve Bayes
Detect Categories                Clustering
Fill from Example                Logistic Regression
Forecast                         Time Series
Highlight Exceptions             Clustering
Scenario Analysis (Goal Seek)    Logistic Regression
Scenario Analysis (What If)      Logistic Regression
Prediction Calculator            Logistic Regression
Shopping Basket Analysis         Association Rules
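One way to approach the question of button names is to turn the table around and list the buttons under each algorithm. The sketch below simply restates the mapping above as a Python dictionary (the variable names are illustrative) and inverts it.

```python
from collections import defaultdict

# The menu-option-to-algorithm table from the slide, restated as data.
BUTTON_TO_ALGORITHM = {
    "Analyze Key Influencers": "Naive Bayes",
    "Detect Categories": "Clustering",
    "Fill from Example": "Logistic Regression",
    "Forecast": "Time Series",
    "Highlight Exceptions": "Clustering",
    "Scenario Analysis (Goal Seek)": "Logistic Regression",
    "Scenario Analysis (What If)": "Logistic Regression",
    "Prediction Calculator": "Logistic Regression",
    "Shopping Basket Analysis": "Association Rules",
}

# Invert the mapping: which buttons share an algorithm?
buttons_by_algorithm = defaultdict(list)
for button, algorithm in BUTTON_TO_ALGORITHM.items():
    buttons_by_algorithm[algorithm].append(button)

for algorithm in sorted(buttons_by_algorithm):
    print(algorithm + ":", ", ".join(buttons_by_algorithm[algorithm]))
```

Nine buttons map onto five algorithms, so each button names a business task rather than a technique. The same algorithm can serve several tasks.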

The Data Mining Tab


• The ribbon has different regions:

– Data Preparation

– Data Modeling

– Accuracy and Validation

– Model Usage

– Management

– Connection

Demo 1: Card Sorting

• Take the sample of cards you have and put them into one or more groups. Write in the area below what your groups are.
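The card-sorting exercise can be mimicked in code to show that one sample supports several equally valid groupings. This is a sketch under stated assumptions: the six cards and the `group_by`, `color`, and `face_or_not` helpers are hypothetical, and grouping by a chosen attribute is a simplification rather than the Microsoft Clustering algorithm, which finds groups without being told which attribute to use.

```python
from collections import defaultdict

# Hypothetical hand of cards as (rank, suit) pairs.
cards = [
    ("A", "Spades"), ("7", "Hearts"), ("K", "Hearts"),
    ("3", "Clubs"), ("Q", "Diamonds"), ("9", "Spades"),
]


def group_by(items, key):
    """Place each item into the group named by key(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


def color(card):
    return "Red" if card[1] in ("Hearts", "Diamonds") else "Black"


def face_or_not(card):
    return "Face" if card[0] in ("J", "Q", "K") else "Number or Ace"


# Three different answers to the same exercise.
by_suit = group_by(cards, lambda card: card[1])
by_color = group_by(cards, color)
by_face = group_by(cards, face_or_not)

for name, groups in [("suit", by_suit), ("color", by_color), ("face", by_face)]:
    print(name, {label: len(members) for label, members in groups.items()})
```

None of the three groupings is the single correct one. Which is most useful depends on the question being asked of the cards.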


Demo 2: Demographic Profiles

• Exercise 1. We will assume that each of the 10 listed people uses SQL Server technology as some part of their job. For the column marked “UserGroup”, write in YES (and NO otherwise) for people you believe would be interested in future SQL Server user group meetings.


Demo 2: Demographic Profiles

• Exercise 2: Assume an average house in your neighborhood or area is for sale. For the column marked “NewNeighbors”, write in YES (and NO otherwise) for people you believe might be a potential buyer for that average home.


What is unsupervised?

• Model of the empirical data.
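As a rough illustration of an unsupervised model, the sketch below runs a one-dimensional k-means on a handful of ages. Everything here is assumed for the example: the ages, the starting centers, and the `k_means_1d` helper. K-means is also only one of several clustering techniques and is not claimed to be what the add-in runs. No output column is supplied; the model is just two centers that summarize the data.

```python
def k_means_1d(values, centers, rounds=10):
    """Alternate between assigning values to the nearest center and re-centering."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its members.
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters


ages = [22, 25, 27, 51, 55, 58]   # hypothetical attribute, no labels given
centers, clusters = k_means_1d(ages, centers=[20.0, 60.0])
print("centers:", centers)
print("clusters:", clusters)
```

Six values are reduced to two numbers, which is the sense in which the model is lower-dimensional than the data it describes.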


What is supervised?

• Model of the process between input and output attributes.
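A supervised model, by contrast, needs a named output attribute to learn from. The sketch below fits a one-rule threshold (a "decision stump", the simplest form of decision tree) to hypothetical data shaped like the demographic exercise. The years-of-experience values, the YES/NO labels, and the `fit_stump` and `predict` helpers are all assumptions made for illustration, not the behavior of any Microsoft algorithm.

```python
def fit_stump(xs, ys):
    """Find the threshold t where the rule 'x >= t means YES' makes the fewest errors."""
    best_threshold, best_errors = None, len(xs) + 1
    for t in sorted(set(xs)):
        # Count training cases where the rule disagrees with the known label.
        errors = sum((x >= t) != (y == "YES") for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


def predict(threshold, x):
    """Apply the learned rule to a new case."""
    return "YES" if x >= threshold else "NO"


# Hypothetical input attribute (years using SQL Server) and output attribute.
years = [1, 2, 3, 6, 8, 10]
user_group = ["NO", "NO", "NO", "YES", "YES", "YES"]

threshold, errors = fit_stump(years, user_group)
print("learned threshold:", threshold, "training errors:", errors)
print("prediction for 7 years:", predict(threshold, 7))
print("prediction for 4 years:", predict(threshold, 4))
```

The learned rule models the relationship between input and output, so it can be applied to people who were not in the original list.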


Scientific Progress

• Why might two scientists come to slightly or widely different conclusions?


Demo 3: Sports

• Look at page 8C with the USA Today Coaches Poll. Based on this list (and other information on college football on this page) do you completely agree with the rankings? Why or why not?


Demo 4: Money

• Look at page 6B with the USA Today Market Trends. Choose three specific pieces of information on this chart which, to you, illustrate the current state of the American Economy.


Wittgenstein’s Duck-Rabbit


Data Mining Examples Tour


Data Mining

• “Data” precedes “Mining”

• “Data” – when is it easier?

• “Data” – when is it harder?

• “Mining” – when is it easier?

• “Mining” – when is it harder?


Regroup and Conclusion

• Main Points from this Presentation


Resources

• Microsoft SQL Server 2008 http://www.microsoft.com/sqlserver/2008/en/us/data-mining.aspx

• SQL Server Data Mining http://www.sqlserverdatamining.com/ssdm/default.aspx

• Adventure Works Tutorial – “SQL Server 2005 Data Mining Tutorial” http://www.sqlserverdatamining.com/ssdm/Home/Tutorials/tabid/57/Default.aspx

• MSDN Forums (“Katmai” = 2008, “SQL Server” = 2005 and before) http://forums.microsoft.com/MSDN/default.aspx?SiteID=1

• Data Mining with Microsoft SQL Server 2008 (Coming November 17, 2008) by Jamie MacLennan, ZhaoHui Tang, and Bogdan Crivat

• Smart Business Intelligence Solutions with Microsoft® SQL Server® 2008 (PRO-Developer) (Coming February 4, 2009) by Lynn Langit and Matthew Roche

• KD Nuggets (Data Mining and Knowledge Discovery Portal) http://www.kdnuggets.com/

• Association for Computing Machinery http://www.acm.org/


Contact Information

• Data Mining Portal and Blog http://marktab.net

• Twitter: @marktabnet

• Also on: LinkedIn, Facebook
