Driscoll bi sig_15_jun2010

WINNING WITH BIG DATA Michael Driscoll @dataspora SDForum BI SIG June 15, 2010 Secrets of the Successful Data Scientist


Transcript of Driscoll bi sig_15_jun2010

Page 1: Driscoll bi sig_15_jun2010

WINNING WITH BIG DATA

Michael Driscoll, @dataspora

SDForum BI SIG, June 15, 2010

Secrets of the Successful Data Scientist

Page 2: Driscoll bi sig_15_jun2010

WHY DATA MATTERS NOW

Page 3: Driscoll bi sig_15_jun2010

THE INDUSTRIAL AGE OF DATA

Page 4: Driscoll bi sig_15_jun2010

WHAT IS BIG DATA?

Data that is distributed.

class    size           manage with                    how it fits                     examples
small    < 10 GB        Excel, R                       fits in one machine’s memory    thousands of sales figures
medium   10 GB - 1 TB   indexed files, monolithic DB   fits on one machine’s disk      millions of web pages
big      > 1 TB         Hadoop, distributed DBs        stored across many machines     billions of web clicks
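
The size classes above can be sketched as a small lookup. This is only an illustration: the function name, the use of decimal units (1 GB = 10^9 bytes), and the handling of the exact boundary values are assumptions, not something the slide specifies.

```python
GB = 10**9   # assumption: decimal gigabytes
TB = 10**12  # assumption: decimal terabytes


def data_class(size_bytes: int) -> str:
    """Return the slide's size class for a dataset of the given size."""
    if size_bytes < 10 * GB:
        return "small"   # fits in one machine's memory
    if size_bytes <= 1 * TB:
        return "medium"  # fits on one machine's disk
    return "big"         # stored across many machines


print(data_class(2 * GB))    # small
print(data_class(200 * GB))  # medium
print(data_class(5 * TB))    # big
```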

Page 5: Driscoll bi sig_15_jun2010

WHAT IS DATA SCIENCE?

Page 6: Driscoll bi sig_15_jun2010

WHY DATA SCIENCE IS SEXY

Page 7: Driscoll bi sig_15_jun2010

“The sexy job in the next ten years will be statisticians…” - Hal Varian

Page 8: Driscoll bi sig_15_jun2010
Page 9: Driscoll bi sig_15_jun2010

data          model
1000 bytes    2 bytes
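
One possible reading of this slide, offered as an assumption rather than the speaker's stated point, is that a model is a compact summary of data. The sketch below fits a straight line to 125 made-up points and keeps only a slope and an intercept. The data, the choice of a linear model, and the variable names are all illustrative, and the byte counts on the slide are not reproduced exactly.

```python
import random

random.seed(0)

# Made-up data: 125 noisy points around the line y = 3x + 7.
xs = list(range(125))
ys = [3.0 * x + 7.0 + random.gauss(0, 1) for x in xs]

# Ordinary least squares for a single predictor.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# 125 observations are now summarized by two numbers.
print(f"{n} points -> slope={slope:.2f}, intercept={intercept:.2f}")
```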

Page 10: Driscoll bi sig_15_jun2010

9 WAYS TO WIN WITH DATA

Page 11: Driscoll bi sig_15_jun2010

1. CHOOSE THE RIGHT TOOL

You don’t need a chainsaw to cut butter.

Page 12: Driscoll bi sig_15_jun2010

2. COMPRESS EVERYTHING

The world is IO-bound.

# Dump the source DB, compress it, and stream it over ssh to be loaded on the target.
mysqldump -u myuser --password=mypass sourceDB | gzip | \
  ssh [email protected] "gunzip | mysql -u myuser --password=mypass targetDB"
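
The same idea in Python, as a minimal sketch: compress bytes before they travel over a slow link and decompress them on the other side. The payload is made-up, repetitive CSV-like text standing in for a database dump, and the actual ratio will depend on the real data.

```python
import gzip

# Made-up, repetitive rows standing in for a database dump.
rows = [f"{i},customer_{i % 100},2010-06-15\n" for i in range(10_000)]
raw = "".join(rows).encode("utf-8")

compressed = gzip.compress(raw)          # what would be sent over the wire
restored = gzip.decompress(compressed)   # what the receiver reconstructs

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```

Compression costs CPU time, which is usually a good trade when the disk or the network is the bottleneck.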

Page 13: Driscoll bi sig_15_jun2010

3. SPLIT UP YOUR DATA

Split, apply, combine.
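
A minimal sketch of split, apply, combine using only the Python standard library. The sales records and the function name are hypothetical; the pattern is to group records by a key, run a function on each group independently, and collect the results.

```python
from collections import defaultdict

# Hypothetical (region, amount) sales records.
sales = [
    ("west", 120.0),
    ("east", 80.0),
    ("west", 30.0),
    ("east", 20.0),
    ("north", 50.0),
]


def split_apply_combine(records, apply):
    """Group values by key, apply a function per group, return one result per key."""
    groups = defaultdict(list)
    for key, value in records:        # split
        groups[key].append(value)
    return {key: apply(values)        # apply, then combine into one dict
            for key, values in groups.items()}


totals = split_apply_combine(sales, sum)
print(totals)  # {'west': 150.0, 'east': 100.0, 'north': 50.0}
```

Because each group is processed independently, the apply step is the part that can be spread across cores or machines.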

Page 14: Driscoll bi sig_15_jun2010

4. WORK WITH SAMPLES

Big Data is heavy, samples are light.

perl -ne "print if (rand() < 0.01)" data.csv > sample.csv
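
A rough Python equivalent of the one-liner above, as a sketch: keep each line independently with probability 0.01. The function name and the optional seed (added here so a sample can be reproduced) are assumptions, not part of the original command.

```python
import random


def sample_lines(lines, rate=0.01, seed=None):
    """Yield each line with probability `rate` (Bernoulli sampling)."""
    rng = random.Random(seed)
    for line in lines:
        if rng.random() < rate:
            yield line


# Made-up input standing in for the rows of data.csv.
lines = [f"row {i}\n" for i in range(100_000)]
sample = list(sample_lines(lines, rate=0.01, seed=42))
print(len(sample))  # roughly 1% of the input; the exact count depends on the seed
```

The function accepts any iterable of lines, so an open file object can be passed in directly and the file is streamed rather than loaded into memory.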

Page 15: Driscoll bi sig_15_jun2010

5. USE STATISTICS

Page 16: Driscoll bi sig_15_jun2010

6. COPY FROM OTHERS

Use open source.

git clone git://github.com/kevinweil/hadoop-lzo

Page 17: Driscoll bi sig_15_jun2010

7. ESCHEW CHART TYPOLOGIES

Charts are compositions, not containers.

Page 18: Driscoll bi sig_15_jun2010

8. COLOR WITH CARE

Color can enhance or insult.

Page 19: Driscoll bi sig_15_jun2010

9. TELL A STORY

People are listening.

Page 20: Driscoll bi sig_15_jun2010

ONE SUCCESS STORY

Page 21: Driscoll bi sig_15_jun2010

WHY DO TELCO CUSTOMERS LEAVE?

Sign up Leave

Goal: “less churn.”

Page 22: Driscoll bi sig_15_jun2010

DATA: BILLIONS OF CALLS

… and millions of callers.

Page 23: Driscoll bi sig_15_jun2010

DOES CALL QUALITY MATTER?

… a difference, but not significant.

Page 24: Driscoll bi sig_15_jun2010

WHAT ABOUT SOCIAL NETWORKS?

Hmmm...

Page 25: Driscoll bi sig_15_jun2010

BUILD THE CALL GRAPH

… but is it predictive?
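
A minimal sketch of building a call graph from call records. The slides do not describe the data layout, so the (caller, callee) record format, the sample names, and the choice of an undirected, unweighted graph are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee).
calls = [
    ("ann", "bob"),
    ("bob", "cat"),
    ("ann", "bob"),  # repeated calls collapse into a single edge
    ("dan", "ann"),
]


def build_call_graph(records):
    """Return an undirected adjacency map: customer -> set of contacts."""
    graph = defaultdict(set)
    for caller, callee in records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return dict(graph)


graph = build_call_graph(calls)
print(sorted(graph["ann"]))  # ['bob', 'dan']
```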

Page 26: Driscoll bi sig_15_jun2010

EVOLUTION OF A CALL GRAPH

April

Page 27: Driscoll bi sig_15_jun2010

EVOLUTION OF A CALL GRAPH

May

Page 28: Driscoll bi sig_15_jun2010

EVOLUTION OF A CALL GRAPH

June

Page 29: Driscoll bi sig_15_jun2010

EVOLUTION OF A CALL GRAPH

July

Page 30: Driscoll bi sig_15_jun2010

700% INCREASE IN CHURN

when a cancellation occurs in a call network.
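
To make the arithmetic concrete: a 700% increase means the churn rate among customers who saw a cancellation in their call network is eight times the baseline rate. The sketch below shows the calculation. The two rates are made up and chosen only so that the result lands on 700%; they are not figures from the talk.

```python
def percent_increase(baseline_rate, exposed_rate):
    """Relative increase of exposed_rate over baseline_rate, in percent."""
    return (exposed_rate - baseline_rate) / baseline_rate * 100.0


# Made-up rates for illustration only.
baseline = 0.01  # churn rate with no cancellation among contacts
exposed = 0.08   # churn rate after a contact cancels

print(round(percent_increase(baseline, exposed)))  # 700
```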

Page 31: Driscoll bi sig_15_jun2010

FINAL THOUGHTS

Page 32: Driscoll bi sig_15_jun2010

THE BIG DATA STACK

Big Data Dedicated RDBMS

Analytics (R, SPSS, SAS, SAP)

Data Products (Content Filters, Rec Engines)

Data

Actions

Insights

Page 33: Driscoll bi sig_15_jun2010

THANKS! QUESTIONS?

Michael [email protected]

@dataspora on Twitter

http://www.dataspora.com/blog

SDForum BI SIG, June 15, 2010