Strata: 9 laws of Data Mining

My 9 Laws of Data Mining presentation from Strata Santa Clara 2013-02-26

Page 1: Strata: 9 laws of Data Mining

Advanced Analytics

THE NINE LAWS OF DATA MINING

Duncan Ross, @duncan3ross (Teradata)

Based on the 9 Laws of Data Mining by Tom Khabaza

Page 2: Strata: 9 laws of Data Mining

04/08/2023 @duncan3ross

What you won’t get from this presentation

• The last two algorithms you need to know!
• An explanation of Bayes’ theorem
• The name of the software that will make you $ millions
  > Not even a comparison of different software!

The grave of Thomas Bayes (probably) – near “silicon roundabout” Image via Wikimedia

Page 3: Strata: 9 laws of Data Mining

THE 0TH LAW

Data Mining laws also work as Data Science laws

Page 4: Strata: 9 laws of Data Mining

What is data mining?

• This question generates more arguments than answers
• Common features
  > Predicting or classifying things
  > Based on historical cases (with or without outcomes)
  > Machine learning techniques
  > No predefined underlying model assumed

Image via Wikimedia

Page 5: Strata: 9 laws of Data Mining

What, where, why and how of data mining

[Diagram with labels: 9 Laws; CRISP-DM; What?; Where? Unified data architecture; Who?; Why?; How?]

Page 6: Strata: 9 laws of Data Mining

CRISP-DM created to help

Page 7: Strata: 9 laws of Data Mining

THE 7TH LAW

Prediction increases information locally by generalisation

Page 8: Strata: 9 laws of Data Mining

This may seem obvious

• Data mining learns from generalisations
  > Historical cases build a model of reality
• These general models then predict an outcome that is local to a case and a time
  > How likely is it that someone will purchase product ‘x’?
  > Will person A influence person B?
  > What number will the ball land on in roulette?
• The knowledge gained may have been implied in the data, but it is new and valuable
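The general-then-local idea can be shown with a minimal Python sketch. It assumes a made-up history of (segment, purchased) cases. The segment names and the build_model and score helpers are hypothetical, and a per-segment purchase rate stands in for whatever modelling technique is actually used.

```python
from collections import defaultdict

def build_model(history):
    """General step: learn a purchase rate per segment from historical cases."""
    seen = defaultdict(int)
    bought = defaultdict(int)
    for segment, purchased in history:
        seen[segment] += 1
        bought[segment] += int(purchased)
    return {segment: bought[segment] / seen[segment] for segment in seen}

def score(model, segment, default=0.0):
    """Local step: predict for one case at one point in time."""
    return model.get(segment, default)

# Hypothetical historical cases: (segment, did they purchase product 'x'?)
history = [
    ("student", True), ("student", False), ("student", True), ("student", True),
    ("retired", False), ("retired", False), ("retired", True), ("retired", False),
]

model = build_model(history)          # {'student': 0.75, 'retired': 0.25}
new_case = score(model, "student")    # 0.75
```

The probability attached to the new case was implied by the history, but it only exists as usable information once the model has been built and applied.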

Page 9: Strata: 9 laws of Data Mining

Why the 7th Law is important

• Results need to be thought of at a group level for assessment
  > Individual results may be poor even when generated from a great model
• Two levels of value
  > Prediction (what, when etc…)
  > Model (how…)
• The gap between the general and the local is the difference between model building and scoring
  > Hadoop?
  > R?
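A small Python sketch of the group-level point follows. The probabilities and outcomes are invented for illustration, and the 0.5 threshold is an assumption. The model is well calibrated for the group, yet every individual purchaser is predicted wrongly.

```python
def expected_count(probabilities):
    """Group-level prediction: how many positives the model expects in total."""
    return sum(probabilities)

def individual_hits(probabilities, outcomes, threshold=0.5):
    """Individual-level check: how many cases are called correctly."""
    return sum((p >= threshold) == bool(o) for p, o in zip(probabilities, outcomes))

# Hypothetical group of eight customers, each scored at 25% likely to purchase
probabilities = [0.25] * 8
outcomes = [1, 0, 0, 0, 1, 0, 0, 0]   # two of the eight actually purchased

group_expected = expected_count(probabilities)          # 2.0
group_actual = sum(outcomes)                            # 2
correct_calls = individual_hits(probabilities, outcomes)  # 6
buyers_predicted = sum(
    1 for p, o in zip(probabilities, outcomes) if o and p >= 0.5
)                                                       # 0
```

Judged case by case, the model misses both buyers. Judged as a group, it predicts exactly the number of purchases that occurred.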

Page 10: Strata: 9 laws of Data Mining

THE 5TH LAW

There are always patterns

Page 11: Strata: 9 laws of Data Mining

The heart of data science…

… is taking the 5th Law to heart

• A major difference between the approach of data mining and data science is in the “Field of Dreams”
  > Data mining (usually) requires measurable ROI prior to projects
  > Data science is trading on probable ROI prior to projects
• Fortunately there is still a lot of gold in those hills
  > And as technologies and data increase, the number of hills is also increasing

Page 12: Strata: 9 laws of Data Mining

Graph of hills vs gold extracted

Page 13: Strata: 9 laws of Data Mining

But…

• Just because there are always patterns doesn’t mean that they are useful
  > Algorithms can (and will) cluster a cloud
  > Without Laws 1 and 2 patterns may not be a good thing
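The "cluster a cloud" remark can be shown with a minimal Python sketch. It assumes a plain one-dimensional k-means (Lloyd's algorithm) and an evenly spread set of points with no structure at all. The kmeans_1d helper is hypothetical and written for this illustration only.

```python
def kmeans_1d(points, k, iterations=20):
    """Plain Lloyd's algorithm in one dimension."""
    # Spread the initial centres across the sorted data
    centres = sorted(points)[:: max(1, len(points) // k)][:k]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre
        clusters = [[] for _ in centres]
        for x in points:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # Update step: move each centre to the mean of its points
        centres = [
            sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)
        ]
    return centres, clusters

# A structureless "cloud": 100 evenly spaced values between 0 and 1
points = [i / 100 for i in range(100)]

centres, clusters = kmeans_1d(points, k=3)
sizes = [len(c) for c in clusters]
```

The algorithm returns three tidy, non-empty clusters even though the data contains no groups. Whether such a pattern means anything is a business question, not an algorithmic one.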

Page 14: Strata: 9 laws of Data Mining

THE 1ST LAW

Business objectives are the origin of every data mining solution

THE 2ND LAW

Business knowledge is central to every step of the data mining process

Page 15: Strata: 9 laws of Data Mining

The sad tale of churn

• This story begins with a gains curve…

Page 16: Strata: 9 laws of Data Mining

What was the business objective?

• To predict churn
• What was the definition of churn?
• What did the business actually want to do?
  > Predict “churn”?
  > Predict people who became inactive?
  > Predict people who became inactive who might not if contacted?
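A short Python sketch shows how much the definition matters. The customer records, field names and the 90-day inactivity threshold are all hypothetical. The third question on the slide (who would stay if contacted) cannot be labelled from data like this at all, since it needs evidence of what happens after contact.

```python
# Hypothetical customer records
customers = [
    {"id": 1, "cancelled": True,  "days_inactive": 120},
    {"id": 2, "cancelled": False, "days_inactive": 95},
    {"id": 3, "cancelled": False, "days_inactive": 3},
    {"id": 4, "cancelled": True,  "days_inactive": 10},
]

def churned_by_cancellation(customer):
    """Definition A: churn means the account was formally cancelled."""
    return customer["cancelled"]

def churned_by_inactivity(customer, threshold_days=90):
    """Definition B: churn means no activity for threshold_days or more."""
    return customer["days_inactive"] >= threshold_days

by_cancellation = {c["id"] for c in customers if churned_by_cancellation(c)}
by_inactivity = {c["id"] for c in customers if churned_by_inactivity(c)}
disputed = by_cancellation ^ by_inactivity   # labelled differently by A and B
```

Half of the customers here get a different target label depending on the definition, so a model built on one definition answers a different business question from a model built on the other.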

Page 17: Strata: 9 laws of Data Mining

Why the 1st and 2nd Laws are important

• Because we aren’t doing this for the fun of it
  > Or at least not just for the fun of it
• At every stage ask:
  > Does this relate to the business question?
  > Is the original business question still valid?
  > Is there a better question that could be asked of this data?
  > Can this be acted on?
  > What does this actually mean?
• Document the answers, and refer back to them

Page 18: Strata: 9 laws of Data Mining

THE 4TH LAW

There is no free lunch for the data miner

Page 19: Strata: 9 laws of Data Mining

The last algorithm you will need to learn

• Is…
• I spent a lot of time on this in the 1990s
  > Neural nets
  > Regression
  > Decision trees
• If you know in advance what technique you need to use, the problem has already been solved

Page 20: Strata: 9 laws of Data Mining

The case that worked... then didn’t

Campaign Topic: Identify fingerprint of churners

Description: SNA (social network analysis) offers an opportunity to detect potential churners earlier (possibly before they have completely ceased all on-net activity) and also identifies the individuals who are likely to have the best chance of persuading them to return. The aim of this campaign format is to use SNA to detect potential churners during the process of leaving and motivate them to stay.

[Diagram comparing the Current Approach and the New Approach, with labels Active, Inactive and Churn detected]

Page 21: Strata: 9 laws of Data Mining

Why the 4th Law is important

• Solutions are not generally reproducible
  > It may work here, but not there
• Methodologies are reproducible
• Learnings may have value
• Time will invalidate even the best models

Page 22: Strata: 9 laws of Data Mining

THE 3RD LAW

Data preparation is more than half of every data mining process

Page 23: Strata: 9 laws of Data Mining

Data preparation through a case…

Page 24: Strata: 9 laws of Data Mining

The problems of text data

Page 25: Strata: 9 laws of Data Mining

Data quality raises its head…

Page 26: Strata: 9 laws of Data Mining

What events lead up to a reboot?

-- Per server, match up to five events followed by a REBOOT
-- (in the ORDER BY evt_ts DESC sequence) and count each resulting path
CREATE DIMENSION TABLE wrk.npath_reboot_5events AS
SELECT path, COUNT(*) AS path_count
FROM nPath (
    ON wrk.w_event_f
    PARTITION BY srv_id
    ORDER BY evt_ts DESC
    MODE (NONOVERLAPPING)
    PATTERN ('X{0,5}.reboot')
    SYMBOLS (true AS X, evt_name = 'REBOOT' AS reboot)
    RESULT (FIRST(srv_id OF X) AS srv_id,
            ACCUMULATE(evt_name OF ANY (X, reboot)) AS path)
)
GROUP BY 1;

-- Draw 30 of the counted paths as a Sankey diagram
SELECT * FROM GraphGen (
    ON (SELECT * FROM wrk.npath_reboot_5events ORDER BY path_count LIMIT 30)
    PARTITION BY 1
    ORDER BY path_count DESC
    item_format('npath')
    item1_col('path')
    score_col('path_count')
    output_format('sankey')
    justify('right')
);

Note the number of paths with a reboot following another reboot!
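For readers without Aster to hand, the idea behind the path query can be approximated in plain Python. This is a rough analogue rather than a reimplementation of nPath. It takes the events that precede each reboot in time order, it lets windows overlap (unlike MODE (NONOVERLAPPING)), and the paths_before_reboot helper, the sample events and the SYNC_LOSS event name are all invented for illustration.

```python
from collections import Counter, defaultdict

def paths_before_reboot(events, max_prior=5):
    """Count event paths that end in a REBOOT.

    events: iterable of (srv_id, evt_ts, evt_name) tuples.
    Each path holds up to max_prior preceding events plus the REBOOT itself.
    """
    by_server = defaultdict(list)
    for srv_id, evt_ts, evt_name in events:
        by_server[srv_id].append((evt_ts, evt_name))

    counts = Counter()
    for rows in by_server.values():
        rows.sort()                               # time order within a server
        names = [name for _, name in rows]
        for i, name in enumerate(names):
            if name == "REBOOT":
                start = max(0, i - max_prior)
                counts[tuple(names[start:i + 1])] += 1
    return counts

# Hypothetical event log
events = [
    ("s1", 1, "SYNC_LOSS"), ("s1", 2, "REBOOT"), ("s1", 3, "REBOOT"),
    ("s2", 1, "SYNC_LOSS"), ("s2", 2, "REBOOT"),
]

path_counts = paths_before_reboot(events)
```

Even in this toy log one path contains a reboot straight after another reboot, which is the kind of pattern the Sankey diagram makes visible.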

Page 27: Strata: 9 laws of Data Mining

More data issues

Looks like an issue with the data on the 30th September and beyond: the reboot data for October seems to have been aggregated and added to the 30th of September
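A problem of this kind can often be spotted with a very simple check on daily volumes before any modelling starts. The Python sketch below uses invented daily reboot counts. The suspicious_days helper and the factor of 10 are assumptions for illustration, not the check used in the original project.

```python
from statistics import median

def suspicious_days(daily_counts, factor=10):
    """Flag days whose count is more than `factor` times the median day."""
    typical = median(daily_counts.values())
    return [day for day, n in sorted(daily_counts.items()) if n > factor * typical]

# Hypothetical daily reboot counts: one huge day followed by empty days
daily_reboots = {
    "09-27": 40, "09-28": 38, "09-29": 42,
    "09-30": 1250, "10-01": 0, "10-02": 0,
}

flagged = suspicious_days(daily_reboots)                      # ['09-30']
empty_days = sorted(day for day, n in daily_reboots.items() if n == 0)
```

A spike immediately followed by days with no events at all is a hint that later data has been rolled up into one date, rather than a real burst of reboots.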

Page 28: Strata: 9 laws of Data Mining

Data preparation is tough

• Duncan’s theorem
  > The usefulness of a variable in a model is inversely related to the amount of time you spend creating it
• Edouard’s corollary
  > If it turns out to be useful you could have created it in the time indicated by Duncan’s theorem

Page 29: Strata: 9 laws of Data Mining

Welcome to the world of big data

• Data just got noisier and less consistent
• Maintaining an analytical data dictionary just moved from vital to really, really vital

Page 30: Strata: 9 laws of Data Mining

Why the 3rd Law is important

• Because data prep is such a huge task you need to plan for it well
  > Assume that you will need to do it at least twice
    – Experimentation
    – Model building
    – Deployment
• Look for software that makes it easy
  > And repeatable
  > And documentable
    – Scripts ≠ documentation
• Documentation of your data is even more important than documentation of your models
  > Models can be very sensitive to data inputs

Page 31: Strata: 9 laws of Data Mining

THE 6TH LAW

Data mining amplifies perception in the business domain

Page 32: Strata: 9 laws of Data Mining

Look for patterns in Network Infrastructure

• Too many end customers to visualise as a graph but network has a hierarchy
  > Internet Gateway Area Hub Customer Router
• Create a table using standard SQL to join the reference data plus the Customer Hub error data into a single view

srv_id    dslam    err_cnt  srvid_cnt  nra_id  dslam_cnt  errorspersrvid
20785675  lgp44-2        2        248  MZL             2              15
22254516  ltc56-1        4        314  BOT            10              15
21059184  bch66-1        2        184  RIV            15              15
21149846  tsm83-1        2        308  LCR             3              13
20833837  did75-4       10        216  DID            23              13
22295785  gbw68-1       36        170  HRS             1              12
21807750  gmo34-1        2        117  BER            17              12
21374927  bgl93-1        2        246  G5Y             8              12
20291116  ien11-1        2        211  ALZ             2              12
21459244  pai34-1        4        210  M7C             3              11
21027647  bel60-1        4        223  TRO            10              11
20551629  pla13-1       10        332  BED             4              11
20633112  crj95-2        2        332  G5Y             8              11
20585199  bau06-1       46        349  BLA            21              10
21477790  cvl92-1        4        180  IMS            35              10
21292874  che78-1        2        163  PIT             2              10

Page 33: Strata: 9 laws of Data Mining

Visualise as a Graph using Aster GraphGen

Size of Node = number of customers
Width of Edge = number of errors

SELECT * FROM graphgen (
    ON (SELECT DISTINCT dmt_act_dslam, nra_id, nbr_of_srvid, errorspersrv, nbr_of_dslam
        FROM wrk.srvid_dslam_err)
    PARTITION BY 1
    ORDER BY errorspersrv
    item_format('cfilter')
    item1_col('dmt_act_dslam')
    item2_col('nra_id')
    score_col('errorspersrv')
    cnt1_col('nbr_of_srvid')
    cnt2_col('nbr_of_dslam')
    output_format('sigma')
    directed('false')
    width_max(10)
    width_min(1)
    nodesize_max(3)
    nodesize_min(1)
);

Page 34: Strata: 9 laws of Data Mining

Zoom in on area where the edge width/colour indicates a problem

Page 35: Strata: 9 laws of Data Mining

Add churn information

• Add churn information to find customers connected to this Hub that have cancelled their accounts

Page 36: Strata: 9 laws of Data Mining

Synch Issues by Hub Type

Page 37: Strata: 9 laws of Data Mining

Error and Complaint rates by equipment type

Page 38: Strata: 9 laws of Data Mining

Why the 6th Law is important

• We don’t exist in a vacuum
  > We need to sell the results of analysis
• This is a virtuous feedback loop

Page 39: Strata: 9 laws of Data Mining

THE 8TH LAW

The value of data mining results is not determined by the accuracy or stability of predictive models

Page 40: Strata: 9 laws of Data Mining

If your model is 98% accurate – so what?

• Or if it’s right 1 time in 35?
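To see why a headline accuracy figure says little on its own, consider a minimal Python sketch. It assumes a hypothetical population of 1,000 customers of whom 20 churn, and a "model" that simply predicts that nobody churns. The accuracy and recall helpers are written for this illustration.

```python
def accuracy(actual, predicted):
    """Share of cases where the prediction matches the outcome."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def recall(actual, predicted):
    """Share of the real positives that the model finds."""
    positives = sum(actual)
    found = sum(1 for a, p in zip(actual, predicted) if a and p)
    return found / positives

# Hypothetical population: 20 churners among 1,000 customers
actual = [1] * 20 + [0] * 980
always_no = [0] * 1000            # a "model" that never predicts churn

model_accuracy = accuracy(actual, always_no)   # 0.98
model_recall = recall(actual, always_no)       # 0.0
```

The do-nothing model is 98% accurate and finds no churners at all. Conversely, a model that is right far less often can still be worth having if it is right much more often than the base rate and the action it triggers is cheap.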

Page 41: Strata: 9 laws of Data Mining

How can you evaluate models?

• Type I and Type II errors
  > What is the cost (opportunity and actual) of a false positive?
  > What is the cost of a false negative?
• Gains curves
  > But beware the over-accurate curve
• Don’t forget the user
  > Decision trees fight back
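Both evaluation ideas fit in a few lines of Python. The sketch below assumes invented outcomes, scores and error costs. The total_cost and gains_curve helpers are hypothetical names, and the curve is computed case by case rather than by decile to keep the example short.

```python
def total_cost(actual, predicted, cost_fp, cost_fn):
    """Cost of the errors: false positives and false negatives priced separately."""
    false_pos = sum(1 for a, p in zip(actual, predicted) if p and not a)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return false_pos * cost_fp + false_neg * cost_fn

def gains_curve(actual, scores):
    """Cumulative share of positives found when working down the ranked list."""
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda t: -t[0])]
    total = sum(ranked)
    found, curve = 0, []
    for outcome in ranked:
        found += outcome
        curve.append(found / total)
    return curve

# Hypothetical scored cases
actual = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
predicted = [1, 1, 0, 0, 0]       # contact the two highest-scored customers

# Assumed costs: a wasted contact costs 5, a missed churner costs 100
cost = total_cost(actual, predicted, cost_fp=5, cost_fn=100)   # 105
curve = gains_curve(actual, scores)        # [0.5, 0.5, 1.0, 1.0, 1.0]
```

With costs this lopsided, contacting a third customer would be worth a few extra false positives, which is a decision an accuracy figure alone would never suggest.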

Page 42: Strata: 9 laws of Data Mining

THE 9TH LAW

All patterns are subject to change
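One practical consequence is that a deployed model needs watching. The Python sketch below is a deliberately crude monitor, assuming a response rate recorded when the model was built and a list of recent outcomes. The has_drifted helper, the rates and the 0.05 tolerance are all hypothetical.

```python
def has_drifted(baseline_rate, recent_outcomes, tolerance=0.05):
    """Flag when the recent response rate moves away from the build-time rate."""
    recent_rate = sum(recent_outcomes) / len(recent_outcomes)
    return abs(recent_rate - baseline_rate) > tolerance

baseline = 0.30                                  # rate observed when the model was built

stable_period = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # 30% responded
changed_period = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # 20% responded

stable_flag = has_drifted(baseline, stable_period)     # False
changed_flag = has_drifted(baseline, changed_period)   # True
```

A flag like this says only that the pattern has moved, not why. Answering the why means going back to the business, which is where the summary slide ends up too.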

Page 43: Strata: 9 laws of Data Mining

SUMMARY

0 Listen to data miners…
7 Data mining brings new knowledge
5 And there will always be new knowledge
1 Start with the business
2 Keep going back to the business
4 It won’t get easier with time
3 Especially given the state your data is in
6 But you will improve business results
8 As long as you look for the right outputs
9 Goto 0

Page 44: Strata: 9 laws of Data Mining

RESOURCES

• http://khabaza.codimension.net/index_files/9laws.htm
• The Society of Data Miners (coming soon)
  > Available on LinkedIn
• CRISP-DM