13 June 2013 | Virtual Business Analytics Chapter A Best Practices Framework for Data Mining Mark Tabladillo, Ph.D., Data Mining Scientist Artus Krohn-Grimberghe, Ph.D., Consultant and Assistant Professor

13 June 2013 | Virtual Business Analytics Chapter

A Best Practices Framework for Data Mining

Mark Tabladillo, Ph.D., Data Mining Scientist

Artus Krohn-Grimberghe, Ph.D., Consultant and Assistant Professor

About MarkTab

Training and consulting with http://marktab.com

Data Mining Resources and Blog at http://marktab.net

Ph.D. – Industrial Engineering, Georgia Tech

Training and consulting internationally across many industries – SAS and Microsoft

Contributed to peer-reviewed research and legislation

◦ Mentoring doctoral dissertations at the accredited University of Phoenix

Presenter

About Artus

Assistant Professor for Analytic Information Systems and Business Intelligence

PhD in computer science

Research: data mining for e-commerce and mobile business

Consultant

Section One: DATA MINING FOUNDATION

Definition 1 (Informal)

Data mining is the automated or semi-automated process of discovering patterns in data.

Definition 2

Data mining is a process using

1. Exploratory Data Analysis: statistical and visual data analysis techniques. (Forming a hypothesis)

2. Data Modeling & Predictions: describe data using probability distributions and machine learning algorithms (the “model”). (Fitting a hypothesis)

3. Statistical Learning Theory: model selection, model evaluation.

Data Mining Visualized

Target: attribute we are interested in.

Input: data available for our predictions.

Function f: describes the relationship between target and input. Regrettably, f is unknown and unknowable.

[Diagram: Input → f( ) → Target]

Data Mining Visualized

[Diagram. Real world: Input → f( ) → Target, where f is unknown. Data mining model: Input → Hypothesis h( ) → Target.]

Need to find a “good” h. h is your DM “algorithm”.

Input data has to be appropriate. Select and transform as needed.

Correct modeling of target is crucial
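
The relationship between the unknown f and a fitted hypothesis h can be sketched in a few lines. This is only an illustration under stated assumptions: the function `f` below is a made-up stand-in (in practice f cannot be written down), the noise level and sample size are arbitrary, and a straight-line least-squares fit is just one possible choice of h.

```python
import random

def f(x):
    # Stand-in for the real-world relationship. In practice f is unknown;
    # it is defined here only so the sketch can generate example data.
    return 3.0 * x + 2.0

def fit_line(xs, ys):
    """Least-squares fit of a hypothesis h(x) = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

rng = random.Random(0)
inputs = [rng.uniform(0, 10) for _ in range(200)]
# The miner only ever sees noisy (input, target) pairs, never f itself.
targets = [f(x) + rng.gauss(0, 1) for x in inputs]

slope, intercept = fit_line(inputs, targets)

def h(x):
    # The fitted hypothesis: our approximation of f learned from data.
    return slope * x + intercept
```

Comparing `h(5)` with `f(5)` is possible here only because f was invented for the example; with real data, h has to be judged on held-out observations instead.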

Top 10 Expectations

BEST PRACTICE: LEARN FROM EXPERIENCE

Expectation Ten

• Marketing: People can start data mining in 10 minutes…

• More Scientific: Better models come from days, weeks or months of iterative improvement.

Expectation Nine

• Marketing: Data miners can provide provably good models with little or zero knowledge of the specific industry…

• More Scientific: Knowing the industry and organizational goals helps orient the questions, modeling, and analysis.

Expectation Eight

• Marketing: Open source software can provide quality results worthy of peer-reviewed literature…

• More Scientific: Commercial software with years-long service options is required for enterprise scale.

Expectation Seven

• Marketing: We can learn a lot from the current data warehouses, cubes, and big data…

• More Scientific: We can improve our modeling by creating new data collection strategies.

Expectation Six

• Marketing: People can build data mining models with little or zero data cleaning…

• More Scientific: Better results happen when we organize and rearrange data for best success.

Expectation Five

• Marketing: Data mining can provide answers to problems…

• More Scientific: Most of the time we only get detailed insights toward larger problems, and sometimes we uncover more problems than we started with.

Expectation Four

• Marketing: A little data mining knowledge can provide an organization with a competitive edge…

• More Scientific: The edge grows along with experience and better study of the methodology and mathematics.

Expectation Three

• Marketing: Individual professionals can deliver excellent predictive analysis…

• More Scientific: Small teams working together can help quickly and efficiently conquer some of the most difficult analytic challenges.

Expectation Two

• Marketing: Numbers speak for themselves and can influence better decision making…

• More Scientific: Leadership strategy helps teams deliver results in the best way given the current culture.

Expectation One

• Marketing: A lot of data mining best practices and strategies can be communicated in an hour or a day…

• More Scientific: The best commitment is ongoing education on both data mining and machine learning technology.

Section Two: ANALYZING AND PREPARING DATA

Best practice: study individual attributes

Histograms and frequencies (discrete)

Kernel density estimates

Cumulative distribution function

Rank-order plots and lift charts

Summary statistics (continuous)

Box-and-whisker plots
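
Several of the single-attribute checks listed above can be tried without any plotting library. The sketch below uses only the Python standard library; the `colors` and `ages` values are invented sample data, and the quartiles are the numbers a box-and-whisker plot would draw rather than the plot itself.

```python
import statistics
from collections import Counter

# Invented sample data: one discrete and one continuous attribute.
colors = ["red", "blue", "red", "green", "blue", "red"]
ages = [23, 25, 31, 35, 35, 41, 47, 52, 58, 90]

# Frequencies for a discrete attribute (the counts behind a histogram).
frequencies = Counter(colors)

# Summary statistics for a continuous attribute.
summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),
    "min": min(ages),
    "max": max(ages),
}

# Quartiles: the box edges and middle line of a box-and-whisker plot.
q1, q2, q3 = statistics.quantiles(ages, n=4)

def ecdf(values, x):
    """Empirical cumulative distribution: share of values <= x."""
    return sum(1 for v in values if v <= x) / len(values)
```

A value such as 90 in `ages` sits far above `q3`, which is the kind of observation these summaries are meant to surface before modeling starts.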

Best practice: study combinations

Pivot tables

Scatter plots

Logarithmic plots

Naïve Bayes

Correlation matrices

False-Color plots

Scatter-Plot matrix

Co-plot
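
Two of the combination techniques above, pivot tables and correlation matrices, reduce to short computations. The following is a minimal standard-library sketch; the `records` rows and the column names `region`, `product`, `units`, and `price` are hypothetical, and Pearson correlation is assumed as the correlation measure.

```python
import math
from collections import defaultdict

# Hypothetical sales records used only for illustration.
records = [
    {"region": "North", "product": "A", "units": 10, "price": 5.0},
    {"region": "North", "product": "B", "units": 4, "price": 9.0},
    {"region": "South", "product": "A", "units": 7, "price": 6.0},
    {"region": "South", "product": "B", "units": 2, "price": 10.0},
    {"region": "North", "product": "A", "units": 12, "price": 4.5},
]

def pivot_sum(rows, row_key, col_key, value_key):
    """Pivot table: sum of value_key for each (row_key, col_key) pair."""
    table = defaultdict(lambda: defaultdict(float))
    for r in rows:
        table[r[row_key]][r[col_key]] += r[value_key]
    return {rk: dict(cols) for rk, cols in table.items()}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(rows, columns):
    """Pairwise Pearson correlations between the named numeric columns."""
    data = {c: [r[c] for r in rows] for c in columns}
    return {a: {b: pearson(data[a], data[b]) for b in columns} for a in columns}

pivot = pivot_sum(records, "region", "product", "units")
corr = correlation_matrix(records, ["units", "price"])
```

In this invented data the `units` and `price` columns are negatively correlated, the sort of pairwise relationship a correlation matrix or scatter plot makes visible.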

Section Three: MACHINE LEARNING ALGORITHMS

How to Choose an Algorithm

Choosing an algorithm or series of algorithms is an art

One algorithm could perform different tasks

Be willing to experiment with algorithms and algorithm parameters
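
Experimenting with algorithm parameters can be as simple as trying several values and comparing results on data held out from training. The sketch below is one hedged way to do that: it assumes a k-nearest-neighbours classifier (not an algorithm named in the slides), synthetic two-class data, and an arbitrary set of candidate values for `k`.

```python
import random
from collections import Counter

def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training rows."""
    nearest = sorted(
        train,
        key=lambda row: (row[0] - point[0]) ** 2 + (row[1] - point[1]) ** 2,
    )[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, holdout, k):
    """Share of holdout rows whose label is predicted correctly."""
    hits = sum(
        1 for x, y, label in holdout if knn_predict(train, (x, y), k) == label
    )
    return hits / len(holdout)

rng = random.Random(42)

def make_rows(n):
    # Synthetic data: class "a" is centred at (0, 0), class "b" at (2, 2).
    rows = []
    for _ in range(n):
        label = rng.choice(["a", "b"])
        center = 0.0 if label == "a" else 2.0
        rows.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0), label))
    return rows

train, holdout = make_rows(200), make_rows(100)

# Try several parameter values and keep the one that scores best on holdout.
results = {k: accuracy(train, holdout, k) for k in (1, 5, 15)}
best_k = max(results, key=results.get)
```

The same loop works for comparing different algorithms: swap the prediction function, keep the holdout set fixed, and record each score.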

Algorithms for Data Mining Tasks (1 of 2)

Algorithm Name: Description

Microsoft Time Series: Analyzes time-related data by using a linear decision tree. Patterns can be used to predict future values in the time series.

Microsoft Decision Trees: Makes predictions based on the relationships between columns in the dataset, and models the relationships as a tree-like series of splits on specific values. Supports the prediction of both discrete and continuous attributes.

Microsoft Linear Regression: If there is a linear dependency between the target variable and the variables being examined, finds the most efficient relationship between the target and its inputs. Supports prediction of continuous attributes.

Microsoft Clustering: Identifies relationships in a dataset that you might not logically derive through casual observation. Uses iterative techniques to group records into clusters that contain similar characteristics.

Algorithms for Data Mining Tasks (2 of 2)

Algorithm Name: Description

Microsoft Naïve Bayes: Finds the probability of the relationship between all input and predictable columns. This algorithm is useful for quickly generating mining models to discover relationships. Supports only discrete or discretized attributes. Treats all input attributes as independent.

Microsoft Logistic Regression: Analyzes the factors that contribute to an outcome, where the outcome is restricted to two values, usually the occurrence or non-occurrence of an event. Supports the prediction of both discrete and continuous attributes.

Microsoft Neural Network: Analyzes complex input data or business problems for which a significant quantity of training data is available but for which rules cannot be easily derived by using other algorithms. Can predict multiple attributes. Can be used for classification of discrete attributes and regression of continuous attributes.

Microsoft Association Rules: Builds rules that describe which items are likely to appear together in a transaction.

Microsoft Sequence Clustering: Identifies clusters of similarly ordered events in a sequence. Provides a combination of sequence analysis and clustering.
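
To make the idea behind association rules concrete, the sketch below computes support and confidence for item pairs. It is a generic illustration and not the Microsoft Association Rules implementation; the `transactions` baskets and the `min_support` and `min_confidence` thresholds are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Invented market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def pair_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for rules of the form lhs -> rhs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: of the baskets containing lhs, the share that also contain rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

rules = pair_rules(transactions)
```

Support filters out pairs that are too rare to matter, while confidence is directional, so a pair can yield a rule in one direction and not the other.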

Best practice: Document your science

Describe the business problem

Determine how to measure success (including baseline)

Document what was learned during data preparation and analysis

Justify the algorithms used during the investigation

List the assumptions that were made
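
Measuring success against a baseline, as the list above recommends, can be sketched as follows. The example assumes a classification problem, accuracy as the success metric, and a majority-class predictor as the baseline; the labels and the `model_predictions` list are hypothetical values, not output from a real model.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)

def accuracy(predicted, actual):
    """Share of predictions that match the actual labels."""
    return sum(1 for p, a in zip(predicted, actual) if p == a) / len(actual)

# Hypothetical labels for a churn problem.
train_labels = ["stay"] * 8 + ["churn"] * 2
test_labels = ["stay", "stay", "churn", "stay", "churn",
               "stay", "stay", "stay", "churn", "stay"]

# Hypothetical predictions from some model under evaluation.
model_predictions = ["stay", "stay", "churn", "stay", "stay",
                     "stay", "stay", "stay", "churn", "stay"]

baseline = majority_baseline_accuracy(train_labels, test_labels)
model = accuracy(model_predictions, test_labels)
improvement = model - baseline
```

Recording `baseline`, `model`, and `improvement` alongside the business problem gives later readers a reference point for judging whether a new model is actually better.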

Section Four: ACHIEVING BUSINESS VALUE

Leadership challenges

Build on organizational communications

Consider redoing analysis

Find results champions

Celebrate the results

Best practice: prepare the next cycle

Note strengths, weaknesses, opportunities, risks

Build consensus on model expiration dates

Encourage and improve the process

Create insight into new future data collection

Conclusion: Best Practices Framework

Provide a data mining foundation

Prepare the data

Evaluate machine learning output

Plan to move toward actionable decisions

Resources

http://www.lfd.uci.edu/~gohlke/pythonlibs/ Free Win x64 Python libs

http://www.enthought.com/products/epd.php Commercial Python

http://www.burns-stat.com/pages/Tutor/R_inferno.pdf R Tutorial

http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx SQL Server Analysis Services Data Mining

http://marktab.net Data Mining Portal

http://sqlserverdatamining.com Data Mining Team Portal

Books: “Data Mining with SQL Server 2008”, “Data Mining for Business Intelligence”, “Practical Time Series Forecasting”
