Driscoll bi sig_15_jun2010

Post on 05-Dec-2014


WINNING WITH BIG DATA

Michael Driscoll, @dataspora

SDForum BI SIG, June 15, 2010

Secrets of the Successful Data Scientist

WHY DATA MATTERS NOW

THE INDUSTRIAL AGE OF DATA

WHAT IS BIG DATA?

Data that is distributed.

class  | size         | manage with                  | how it fits                    | examples
small  | < 10 GB      | Excel, R                     | fits in one machine’s memory   | thousands of sales figures
medium | 10 GB - 1 TB | indexed files, monolithic DB | fits on one machine’s disk     | millions of web pages
big    | > 1 TB       | Hadoop, distributed DBs      | stored across many machines    | billions of web clicks
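The size classes in the table can be written down as a small lookup. This is a minimal sketch, not something from the talk: the function name `size_class`, the use of decimal units (1 GB = 10^9 bytes), and the treatment of exactly 10 GB and exactly 1 TB as "medium" are all assumptions.

```python
GB = 10**9   # assumption: decimal gigabytes
TB = 10**12  # assumption: decimal terabytes


def size_class(num_bytes: int) -> str:
    """Map a dataset size in bytes to the small/medium/big classes (hypothetical helper)."""
    if num_bytes < 10 * GB:
        return "small"   # fits in one machine's memory (Excel, R)
    if num_bytes <= 1 * TB:
        return "medium"  # fits on one machine's disk (indexed files, monolithic DB)
    return "big"         # stored across many machines (Hadoop, distributed DBs)


for n in (2 * GB, 200 * GB, 5 * TB):
    print(n, size_class(n))
```

The thresholds are rules of thumb from 2010 hardware; the useful part is the question each class asks, namely where the data physically fits.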

WHAT IS DATA SCIENCE?

WHY DATA SCIENCE IS SEXY


“The sexy job in the next ten years will be statisticians…” - Hal Varian

data: 1000 bytes

model: 2 bytes
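One way to read the data-versus-model contrast is that a fitted model replaces many observations with a few parameters. The sketch below is an illustration of that idea only: the synthetic data, the helper `fit_line`, and the choice of a straight-line model are assumptions, and the counts (1000 observations, 2 parameters) are not meant to reproduce the slide's byte figures.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data (an assumption for illustration): a noisy line y = 3x + 5.
rng = random.Random(0)
xs = list(range(1000))
ys = [3.0 * x + 5.0 + rng.gauss(0, 1.0) for x in xs]

slope, intercept = fit_line(xs, ys)
print(len(ys), "observations summarised by 2 parameters:", slope, intercept)
```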

9 WAYS TO WIN WITH DATA

1. CHOOSE THE RIGHT TOOL

You don’t need a chainsaw to cut butter.

2. COMPRESS EVERYTHING

The world is IO-bound.

mysqldump -u myuser -pmypass sourceDB | gzip | \
  ssh mike@dataspora.com "cat - | gunzip | mysql -u myuser -pmypass targetDB"
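The payoff of compressing before moving data can be checked with Python's standard `gzip` module. This is a minimal sketch under stated assumptions: the rows are made-up, highly repetitive log lines, so the ratio it prints says nothing about any particular database dump.

```python
import gzip

# Hypothetical, repetitive rows standing in for a dump or log file.
rows = "".join(f"{i},2010-06-15,click,/home\n" for i in range(10000)).encode("utf-8")

packed = gzip.compress(rows)

# Compression is lossless: decompressing gives back the exact bytes.
assert gzip.decompress(packed) == rows

print("raw bytes:", len(rows))
print("gzipped bytes:", len(packed))
```

Fewer bytes on the wire or on disk is the point when the bottleneck is IO rather than CPU.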

3. SPLIT UP YOUR DATA

Split, apply, combine.
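Split, apply, combine can be sketched in a few lines of plain Python: split records into groups by key, apply a function to each group, and combine the results into one table. The sales figures and the helper name `split_apply_combine` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records: (region, sale amount).
sales = [("west", 120.0), ("east", 80.0), ("west", 30.0), ("east", 20.0), ("north", 50.0)]


def split_apply_combine(records, apply):
    """Group (key, value) records by key, apply a function per group, return a dict."""
    groups = defaultdict(list)
    for key, value in records:          # split
        groups[key].append(value)
    return {key: apply(values)          # apply, then combine into one result
            for key, values in groups.items()}


totals = split_apply_combine(sales, sum)
print(totals)
```

Because each group is processed independently, the apply step is the part that can be spread across cores or machines.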

4. WORK WITH SAMPLES

Big Data is heavy, samples are light.

perl -ne "print if (rand() < 0.01)" \
  data.csv > sample.csv
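The same 1% line sample can be sketched in Python. This is a hedged equivalent of the Perl one-liner, not a replacement for it: the function name `sample_lines` and the optional `seed` parameter (added so a sample can be reproduced) are assumptions, and the input here is generated rather than read from `data.csv`.

```python
import random


def sample_lines(lines, rate, seed=None):
    """Yield each line independently with probability `rate` (hypothetical helper)."""
    rng = random.Random(seed)
    for line in lines:
        if rng.random() < rate:
            yield line


# Generated stand-in for a large file; any iterable of lines works.
lines = (f"row {i}\n" for i in range(100000))
sample = list(sample_lines(lines, rate=0.01, seed=42))
print(len(sample), "of 100000 lines kept")
```

Since the function streams line by line, an open file object can be passed in directly and the sample written out with `writelines`, so the full file never has to fit in memory.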

5. USE STATISTICS

6. COPY FROM OTHERS

Use open source.

git clone git://github.com/kevinweil/hadoop-lzo

7. ESCHEW CHART TYPOLOGIES

Charts are compositions, not containers.

8. COLOR WITH CARE

Color can enhance or insult.

9. TELL A STORY

People are listening.

ONE SUCCESS STORY

WHY DO TELCO CUSTOMERS LEAVE?

Sign up Leave

Goal: “less churn.”

DATA: BILLIONS OF CALLS

… and millions of callers.

… a difference, but not significant.

DOES CALL QUALITY MATTER?
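Whether a difference in churn between two groups is significant can be checked with a two-proportion z-test. The sketch below shows one common way to do that check; it is not the analysis used in the talk, and the counts (30 of 1000 versus 25 of 1000 customers leaving) are hypothetical numbers picked for illustration.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: churners among customers with poor vs. good call quality.
z, p = two_proportion_z_test(30, 1000, 25, 1000)
print("z =", z, "p =", p)
print("significant at 5%:", p < 0.05)
```

With these made-up counts the churn rates differ (3.0% versus 2.5%), yet the test does not reject the hypothesis that the two groups churn at the same rate.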

Hmmm...

WHAT ABOUT SOCIAL NETWORKS?

… but is it predictive?

BUILD THE CALL GRAPH

EVOLUTION OF A CALL GRAPH

April, May, June, July

700% INCREASE IN CHURN

when a cancellation occurs in a call network.
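The call-graph idea can be sketched as follows: build an undirected graph from call records, then compare later churn among subscribers who called someone that had already cancelled against churn among everyone else. Everything here is a toy assumption: the names, the call list, the two cancellation sets, and the helper functions are invented for illustration and do not reproduce the talk's data or its 700% figure.

```python
from collections import defaultdict

# Hypothetical call records (caller, callee) and cancellation sets.
calls = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"),
         ("eve", "fay"), ("fay", "gus"), ("hal", "ivy")]
churned_before = {"bob"}        # cancelled in an earlier month
churned_after = {"ann", "gus"}  # cancelled in the following month


def build_call_graph(call_records):
    """Undirected adjacency sets: who has exchanged calls with whom."""
    graph = defaultdict(set)
    for a, b in call_records:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def churn_rate_by_exposure(graph, before, after):
    """Later churn rate for subscribers with and without a churned neighbour."""
    remaining = [u for u in graph if u not in before]
    exposed = [u for u in remaining if graph[u] & before]
    unexposed = [u for u in remaining if not (graph[u] & before)]

    def rate(group):
        return sum(u in after for u in group) / len(group) if group else 0.0

    return rate(exposed), rate(unexposed)


graph = build_call_graph(calls)
exposed_rate, unexposed_rate = churn_rate_by_exposure(graph, churned_before, churned_after)
print("churn with a churned neighbour:", exposed_rate)
print("churn without one:", unexposed_rate)
```

Comparing the two rates month over month is what turns the graph from a picture into a predictive signal.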

FINAL THOUGHTS

THE BIG DATA STACK

Big Data Dedicated RDBMS

Analytics (R, SPSS, SAS, SAP)

Data Products (Content Filters, Rec Engines)

Data

Actions

Insights

THANKS! QUESTIONS?

Michael Driscoll, med@dataspora.com

@dataspora on Twitter

http://www.dataspora.com/blog
