KDD-07 Invited Innovation Talk

August 12, 2007 KDD-07 Invited Innovation Talk Research Usama Fayyad, Ph.D. Chief Data Officer & Executive VP Yahoo! Inc.


Transcript of KDD-07 Invited Innovation Talk

Page 1: KDD-07 Invited Innovation Talk


August 12, 2007

KDD-07 Invited Innovation Talk

Research

Usama Fayyad, Ph.D.

Chief Data Officer & Executive VP

Yahoo! Inc.

Page 2: KDD-07 Invited Innovation Talk


Thanks and Gratitude

• My family: my wife Kristina and my 4 kids; my parents and my sisters

• My academic roots: The University of Michigan, Ann Arbor – my Ph.D. committee, including Ramasamy Uthurusamy (then at GM Research Labs), grad student colleagues (Jie Cheng), Internships at GM Research and at NASA’s JPL

• My Mentors and Collaborators

– Caltech Astronomy (G. Djorgovski, Nick Weir), Pietro Perona and M.C. Burl

– JPL/NASA Colleagues: Padhraic Smyth, Rich Doyle, Steve Chien, Paul Stolorz, Peter Cheeseman, David Atkinson, many others…

– Microsoft Colleagues: Decision Theory Group, Surajit Chaudhuri, Jim Gray, Paul Bradley, Bassel Ojjeh, Nick Besbeas, Heikki Mannila, Rick Rashid, many others

– Fellows in KDD: Gregory Piatetsky-Shapiro, Daryl Pregibon, Christos Faloutsos, Geoff Webb, Bob Grossman, Jiawei Han, Eric Tsui, Tharam Dillon, Chengqi Zhang, many, many colleagues

• My Business Partners

– Bassel Ojjeh, Nick Besbeas, many VCs, many advisers and strategic clients including Microsoft SQL Server and sales teams

• My Yahoo! Colleagues:

– Zod Nazem, Jerry Yang, David Filo, Yahoo! exec team, Prabhakar Raghavan, Pavel Berkhin, Nick Weir, Hunter Madsen, Nitin Sharma, Raghu Ramakrishnan, Y! Research folks, many at Yahoo SDS and current and previous Yahoo! employees

Page 3: KDD-07 Invited Innovation Talk


Personal Observations of a Data Mining Disciple

A Data Miner’s Story – Getting to Know the Grand Challenges


Usama Fayyad, Ph.D.

Chief Data Officer & Executive VP

Yahoo! Inc.

Page 4: KDD-07 Invited Innovation Talk


Overview

• The setting

• Why is data mining a must?

• Why is data mining not happening?

• A Data Miner’s Story

– Grand Challenges: Pragmatic

– Grand Challenges: Technical

– Some case studies

• Concluding Remarks

Page 5: KDD-07 Invited Innovation Talk


The data gap…

• The Machinery Moves on:

– Moore’s law: processing “capacity” doubles every 18 months: CPU, cache, memory

– Its more aggressive cousin: disk storage “capacity” doubles every 9 months

• The Demand is exploding:

– Every business is an eBusiness

– Scientific Instruments and Moore’s law

– Government

• The Internet – the ubiquity of the Web

• The Talent Shortage
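
The two doubling rates quoted on this slide can be turned into a quick back-of-the-envelope calculation. The sketch below is only an illustration: the 18-month and 9-month periods come from the slide, while the six-year horizon and the function name are assumptions chosen for the example.

```python
# Doubling-time arithmetic for the two rates quoted on the slide.
# The 6-year horizon is a hypothetical choice for illustration.

def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months`, given a doubling period."""
    return 2.0 ** (months / doubling_months)


horizon = 72  # months (assumed horizon, not from the talk)
processing = growth_factor(horizon, 18)  # 4 doublings
storage = growth_factor(horizon, 9)      # 8 doublings
gap = storage / processing               # stored data per unit of processing

print(f"processing: {processing:.0f}x, storage: {storage:.0f}x, gap: {gap:.0f}x")
```

Under these assumptions, storage capacity outgrows processing capacity by a factor that itself doubles every 18 months, which is one way to read “the data gap”.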

Page 6: KDD-07 Invited Innovation Talk


What is Data Mining?

Finding interesting structure in data

• Structure: refers to statistical patterns, predictive models, hidden relationships

• Interesting: ?

• Examples of tasks addressed by Data Mining

– Predictive Modeling (classification, regression)

– Segmentation (Data Clustering)

– Affinity (Summarization)

• relations between fields, associations, visualization

Page 7: KDD-07 Invited Innovation Talk


Beyond Data Analysis

• Scaling analysis to large databases

– How to deal with data without having to move it out?

– Are there abstract primitive accesses to the data, in database systems, that can provide mining algorithms with the information to drive the search for patterns?

– How do we minimize--or sometimes even avoid--having to scan the large database in its entirety?

• Automated search

– Enumerate and create numerous hypotheses

– Fast search

– Useful data reductions

• More emphasis on understandable models

– Finding patterns and models that are “interesting” or “novel” to users.

• Scaling to high-dimensional data and models.

Page 8: KDD-07 Invited Innovation Talk


Data Mining and Databases

Many interesting analysis queries are difficult to state precisely

• Examples:

– Which records represent fraudulent transactions?

– which households are likely to prefer a Ford over a Toyota?

– Who’s a good credit risk in my customer DB?

• Yet the database contains the information:

– good/bad customer, profitability

– did/did not respond to mailout/survey/...

Page 9: KDD-07 Invited Innovation Talk


ACME CORP ULTIMATE DATA MINING BROWSER

Data Mining Grand Vision

What’s New? What’s Interesting?

Predict for me

Page 10: KDD-07 Invited Innovation Talk


The myths…

• Companies have built up some large and impressive data warehouses

• Data mining is pervasive nowadays

• Large corporations know how to do it

• There are tools and applications that discover valuable information in enterprise databases

Page 11: KDD-07 Invited Innovation Talk


The truths…

• Data is a shambles

– most data mining efforts end up not benefiting from existing data infrastructure

• Corporations care a lot about data, and are obsessed with customer behavior and understanding it

• They talk a lot about it…

• An extremely small number of businesses are successfully mining data

• The successful efforts are “one-off”, “lucky strikes”

Page 12: KDD-07 Invited Innovation Talk


• Data navigation, exploration, & exploitation technology is fairly primitive:

– we know how to build massive data stores

– we do not know how to exploit them

– we do the book-keeping really well (OLTP)

– Inadequate basic understanding of navigation/systems

• many large data stores are write-only (= data tomb)

Ancient Egypt

Current state of Databases

Page 13: KDD-07 Invited Innovation Talk


A Data Miner’s Story

• Started out in pure research

– Professional student

– Math and algorithms

Page 14: KDD-07 Invited Innovation Talk


Researcher view

Algorithms and Theory

Database

Systems

Page 15: KDD-07 Invited Innovation Talk


Practitioner view

Systems and integration

Database

Algorithms

Customer

Page 16: KDD-07 Invited Innovation Talk


Business view

Systems

Database

Algorithms

Customer

$$$’s

Page 17: KDD-07 Invited Innovation Talk


A Data Miner’s Story

• Started out in pure research

• At NASA-JPL did basic research and applied techniques to Science Data Analysis problems

– Worked with top scientists in several fields: astronomy, planetary geology, atmospherics, space science, remote sensing imagery

– Great results, strong group, lots of funding, high demand…

• So why move to Microsoft Research?

Page 18: KDD-07 Invited Innovation Talk


Example: Cataloging Sky Objects

Page 19: KDD-07 Invited Innovation Talk


Data Mining Based Solution

• 94% accuracy in recognizing sky objects

• Speed up catalog generation by one to two orders of magnitude (unrealistic to perform manually).

• Classify objects that are at least one magnitude fainter than catalogs to-date.

• Tripled the “data yield”

• Generate sky catalogs with much richer content:

– on the order of billions of objects:

> 2x10^7 galaxies, > 2x10^8 stars, 10^5 quasars

• Discovered new quasars 40 times more efficiently

Page 20: KDD-07 Invited Innovation Talk


Page 21: KDD-07 Invited Innovation Talk


A Data Miner’s Story

• Started out in pure research

• At NASA-JPL

• At Microsoft Research

– Basic research in algorithms and scalability

– Began to worry about building products and integrating with database server

– Two groups established: research and product

• So why move out to a start-up?

Page 22: KDD-07 Invited Innovation Talk


Working with Large Databases

• One scan (or less) of the database

– terminate early if appropriate

• Work within confines of a given limited RAM buffer

– Cluster a Gigabyte or Terabyte in, say, 10 or 100 Megabytes of RAM

• “Anytime” algorithm

– best answer always handy

• Pause/resume enabled, incremental

• Operate on forward-only cursor over a view (essentially a data stream)
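
The requirements on this slide (single pass, bounded memory, an answer always available, pause/resume, forward-only cursor) can be illustrated with a small online k-means sketch. This is not the algorithm from the talk; it is a generic running-mean variant, and the class name `StreamingKMeans`, the helper `synthetic_rows`, and all parameters are hypothetical.

```python
import random
from typing import Iterator, List, Optional, Sequence


class StreamingKMeans:
    """Online k-means: memory holds only k centroids and k counts."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.centroids: List[List[float]] = []  # "anytime" answer lives here
        self.counts: List[int] = []

    def update(self, row: Sequence[float]) -> None:
        # Seed the centroids with the first k rows seen.
        if len(self.centroids) < self.k:
            self.centroids.append(list(row))
            self.counts.append(1)
            return
        # Assign the row to its nearest centroid (squared Euclidean distance).
        nearest = min(
            range(self.k),
            key=lambda i: sum((c - x) ** 2 for c, x in zip(self.centroids[i], row)),
        )
        # Move that centroid toward the row so it stays the running mean.
        self.counts[nearest] += 1
        step = 1.0 / self.counts[nearest]
        self.centroids[nearest] = [
            c + step * (x - c) for c, x in zip(self.centroids[nearest], row)
        ]

    def consume(self, cursor: Iterator[Sequence[float]],
                limit: Optional[int] = None) -> int:
        """Read forward from the cursor; stop after `limit` rows if given."""
        if limit is not None and limit <= 0:
            return 0
        seen = 0
        for row in cursor:
            self.update(row)
            seen += 1
            if limit is not None and seen >= limit:
                break  # pause: the cursor keeps its position for a later resume
        return seen


def synthetic_rows(n: int, seed: int = 0) -> Iterator[List[float]]:
    """Hypothetical forward-only cursor: two blobs near (0, 0) and (10, 10)."""
    rng = random.Random(seed)
    for i in range(n):
        center = 0.0 if i % 2 == 0 else 10.0
        yield [center + rng.uniform(-1, 1), center + rng.uniform(-1, 1)]


model = StreamingKMeans(k=2)
cursor = synthetic_rows(10_000)
first_batch = model.consume(cursor, limit=1_000)  # pause after 1,000 rows
snapshot = [list(c) for c in model.centroids]     # best answer so far
remaining = model.consume(cursor)                 # resume on the same cursor
```

The model never stores rows, so memory use is independent of the size of the stream; the price is that results depend on row order and on how the centroids are seeded.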

Page 23: KDD-07 Invited Innovation Talk


[Figure: the “Business Results Gap” between Business Challenges and Technical Tools]

Business Challenges: Conversion, Retention, Acquisition, Loyalty, Average Order

Technologies: Segmentation, Logistic Regressions, Genetic Algorithms, Decision Trees, CHAID, CART, OLAP, Bayesian Networks, Neural Networks

Business users are unable to apply the power of existing data mining tools to achieve results

Page 24: KDD-07 Invited Innovation Talk


[Figure: the “Business Results Gap” between Business Challenges and Technical Tools, with specialists added]

Business Challenges: Conversion, Retention, Acquisition, Loyalty, Average Order

Specialists: Statisticians, DBAs, Consultants, Data Mining PhDs

Technologies: Segmentation, Logistic Regressions, Genetic Algorithms, Decision Trees, CHAID, CART, OLAP, Bayesian Networks, Neural Networks

Business users are unable to apply the power of existing data mining tools to achieve results

Page 25: KDD-07 Invited Innovation Talk


Evolving Data Mining

• Evolution on the technical front:

– New algorithms

– Embedded applications

– Make the analyst’s life easier

• Evolution on the usability front

– New metaphors

– Vertical applications embedding

– Used by the business user

• In both cases, success means invisibility…

Page 26: KDD-07 Invited Innovation Talk


Grand Challenges

• Pragmatic:

– Achieving integration and invisibility

• Research/Technical:

– Solving some serious unaddressed problems

Page 27: KDD-07 Invited Innovation Talk


Pragmatic Grand Challenge 1

Where is the data?

• There is a glut of stored data

• Very little of that data is ready for mining

• Data warehousing has proven that it will not solve the problem for us

• Solution:

– Integration with operational systems

– Take a serious database approach to solving the storage management problem

Page 28: KDD-07 Invited Innovation Talk


digiMine Background

Started as a venture capital-funded company, digiMine, Inc., in March 2000.

Built, operated and hosted data warehouses with built-in data mining apps

• Headquartered in Bellevue, Washington

• $45 million in funding – Mayfield, Mohr Davidow, American Express, Deutsche Bank

• Grew to over 120 employees

• 50+ patents in technology and processes

• Both technology and services

Page 29: KDD-07 Invited Innovation Talk


Sample Customers

Page 30: KDD-07 Invited Innovation Talk


A Data Miner’s Story

• Started out in pure research

• At NASA-JPL

• At Microsoft Research

• At digiMine

– Lots of VC funding, great team, great press coverage, and fast moving

– great customers

• So why move to DMX Group?

Page 31: KDD-07 Invited Innovation Talk


Why DMX Group?

• At digiMine, we grew a large “Professional Services” organization

• We learned a lot from these engagements

• VC-funded companies cannot do much consulting

• A fork in the road appeared…

– digiMine re-focused on a market vertical: behavioral targeting for media and publishers

– Renamed to Revenue Science, Inc.

• Formed DMX Group… which was eventually acquired by Yahoo!

Page 32: KDD-07 Invited Innovation Talk


DMX Group Mission

• Make enterprise data a working asset in the enterprise:

– Data strategy for the business

– Implementation of Business Intelligence and data mining capabilities

– Business issues around data

• What is possible?

• How to expose it to business users

• How to train people and change processes

– Integration with operational systems

Page 33: KDD-07 Invited Innovation Talk


Data Strategy

• How can your data influence your revenues?

• How do you optimize operations based on data?

• How do you increase customer retention based on data?

• How do you utilize enterprise data assets to spot new opportunities:

– Cross-sell to existing customers

– Grow new markets

– Avoid problems such as fraud, abuse, churn, etc?

Page 34: KDD-07 Invited Innovation Talk


A Data Miner’s Story

• Started out in pure research

• At NASA-JPL

• At Microsoft Research

• At digiMine/Revenue Science Inc.

• At DMX Group…

Page 35: KDD-07 Invited Innovation Talk


Pragmatic Grand Challenge 2

Embedding within Operational Systems

• We all worry about algorithms, they are fascinating

• Most of us know that data mining in practice is mostly data prep work

• Go where the data is when the data does not come to you

• But how much of the problem is “data mining”?

• Facts:

– The effort in embedding an application is huge, and often not discussed

– Without it, all the algorithms are useless

Page 36: KDD-07 Invited Innovation Talk

Churn Modelling and Prediction

Case Study – Wireless Telco

Page 37: KDD-07 Invited Innovation Talk


Modeling Process

[Figure: churn modeling process, boxes numbered 1 to 6]

Customer Interaction Base

Data sources: SMS, WAP, CDR, Billing

Sample Database

Build Churn Model

Score Database: High Risk, Med Risk, Low Risk

Assign Customer Value: High Val, Med Val, Low Val

Risk by Value grid:

High Val High Risk, High Val Med Risk, High Val Low Risk

Med Val High Risk, Med Val Med Risk, Med Val Low Risk

Low Val High Risk, Low Val Med Risk, Low Val Low Risk
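
The last step of this process, crossing a churn-risk band with a customer-value band to get nine segments, can be sketched in a few lines. The cutoffs, the sample customers, and the function names below are all hypothetical; the slide gives only the High/Med/Low labels.

```python
def band(value: float, low_cut: float, high_cut: float) -> str:
    """Map a number to "Low", "Med" or "High" using two cutoffs."""
    if value >= high_cut:
        return "High"
    if value >= low_cut:
        return "Med"
    return "Low"


def segment(churn_score: float, customer_value: float) -> str:
    """Combine a churn score and a customer value into one grid cell."""
    risk = band(churn_score, 0.3, 0.7)            # assumed score cutoffs
    value = band(customer_value, 200.0, 1000.0)   # assumed value cutoffs
    return f"{value} Val {risk} Risk"


# Hypothetical scored customers: id -> (churn score, customer value)
customers = {"a": (0.85, 1500.0), "b": (0.10, 50.0), "c": (0.50, 400.0)}
segments = {cid: segment(score, value) for cid, (score, value) in customers.items()}
```

In practice the cutoffs would be set from the score and value distributions and from business judgment about how many customers each program can handle.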

Page 38: KDD-07 Invited Innovation Talk


LTV and Its Application

• A customer’s lifetime value (LTV) is the net value that a customer brings to a business by the end of their service, i.e. their profit contribution.

• LTV allows decisions for individual customers that optimize the return on investment (ROI). Examples:

– Aggressive retention programs, such as equipment upgrade and contract renewal, for high LTV.

– Differentiated customer care treatment for reactivations by customers with low LTV

Page 39: KDD-07 Invited Innovation Talk


What is Required?

• Detailed data

– Integration of CDR, WIG, SMS, Billing

– Maintained at detailed level

• Integrated data mining

– Algorithms tuned to model thousands of variables and millions of rows

– Accurate Forecasts

• System Robustness

– Massively scalable back end system

– Flexible architecture to create new variables quickly and easily

• Collaborative Service Model

– Service model which guarantees success

– Combined IQ Model to optimize science and business knowledge

– Low cost to create and maintain models

Page 40: KDD-07 Invited Innovation Talk


Map Segments to Actions

[Figure: grid mapping segments to actions. Axes: Churn Probability (Low to High) and Forecasted LTV (Negative, Low to High)]

Segment strategies: Nurture/Maintain, Aggressively Defend, Cautiously Defend, Grow Margin, Change Bad Behavior, Let them go

Actions: Equipment Upgrade, Feature Add, Contract Renewal, Save Program, Elite Program, Loyalty Programs, Feature Use, Plan Migration, Cost Reducing Programs

Page 41: KDD-07 Invited Innovation Talk


Cost Rules Applied…

Cost Rules are introduced to define scoring

For Example:

– Network System Usage Cost

– Mobile to Land Connections Costs

– Technical Operations/Support Costs

– Long Distance Costs

– Inter-Carrier /International subsidy costs

– Roaming Costs

– Bad Debt Allocation

– Many others…
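
One way to read “cost rules” is as a set of small functions that each allocate one kind of cost to a customer, with the customer’s margin being revenue minus the sum of the allocations. The sketch below assumes that reading; the rule names echo the slide, but every rate, field name, and figure is hypothetical.

```python
from typing import Callable, Dict

Usage = Dict[str, float]
CostRule = Callable[[Usage], float]

# Hypothetical cost rules; each maps one month of usage to a cost.
COST_RULES: Dict[str, CostRule] = {
    "network_usage": lambda u: 0.02 * u.get("minutes", 0.0),
    "roaming": lambda u: 0.10 * u.get("roaming_minutes", 0.0),
    "support": lambda u: 4.00 * u.get("support_calls", 0.0),
    "bad_debt": lambda u: 0.01 * u.get("revenue", 0.0),
}


def monthly_cost(usage: Usage) -> float:
    """Total cost allocated to a customer by all rules."""
    return sum(rule(usage) for rule in COST_RULES.values())


def monthly_margin(usage: Usage) -> float:
    """Revenue minus allocated cost: the per-period input to a value score."""
    return usage.get("revenue", 0.0) - monthly_cost(usage)


usage = {"revenue": 60.0, "minutes": 500.0,
         "roaming_minutes": 20.0, "support_calls": 1.0}
cost = monthly_cost(usage)
margin = monthly_margin(usage)
print(f"cost: {cost:.2f}, margin: {margin:.2f}")
```

Keeping each rule separate makes it easy to add or retire a cost category without touching the scoring code.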

Page 42: KDD-07 Invited Innovation Talk


Cost Rules for a Bank?

Cost Rules are introduced to define value

For Example:

– Deposit Value

– Product mix

– Average daily balance

– Monthly service fees

– Technical operations/Support costs

– Branch/teller usage

– Late payment/Overdraft history

– Interest rate

– Contract term

– Credit Score

– Employment history/Income

Page 43: KDD-07 Invited Innovation Talk


Pragmatic Grand Challenge 3

Integrating domain knowledge

• Data mining algorithms are knowledge free

• There is no notion of “common sense reasoning”

• Do we have to solve an AI-hard problem?

• Robust and deep domain knowledge utilization

• Solution:

– Very deep and very narrow integration

– Ability to “model” business strategy

– Reasoning capability just evolves (cf. chess players)

Page 44: KDD-07 Invited Innovation Talk


Cross-Sell / Up-Sell Example

Any Related Products

Complete the Assortment

Help Me Decide

Customer looking for pants

Recommendations

Collaborative Filtering

Impulse Buy

Complement

Add-on

Alternates

Up Sells

Context Sensitive Approach

Page 45: KDD-07 Invited Innovation Talk


Pragmatic Grand Challenge 4

Managing and maintaining models

• When was the last time you thought about the lifetime of a mining model?

• What happens when a model is changed?

• Have you tried to merge the results of two different clustering models over time?

• How many “data droppings” (aka temp files, quick transformations, quick fixes) do you generate in an analysis session?

• A framework for managing, updating, and retiring mining models

• Solution: use techniques that have been invented for this: databases, systems management, software engineering, etc.

Page 46: KDD-07 Invited Innovation Talk


Pragmatic Grand Challenge 5

Effectiveness Measurement

• How do we measure [honestly] the effectiveness of a model in a context?

• Return on Investment (ROI) measurement

• Evaluation in the context of the application

• A framework and methodology for measurement and evaluation

– Build the measurement method as part of the design of the model

– An engineering recipe for measurements, and a set of metrics
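
A common way to measure a model “honestly” in its application context is to hold out a control group that the model-driven campaign does not touch, and to compute ROI only on the difference between treated and control outcomes. The sketch below assumes that setup; it is not a method described in the talk, and the function name and all figures are hypothetical.

```python
def campaign_roi(treated_n: int, treated_retained: int,
                 control_n: int, control_retained: int,
                 value_per_customer: float, cost_per_contact: float) -> float:
    """ROI of a retention campaign, measured against a held-out control group."""
    # Uplift: retention rate with the campaign minus retention rate without it.
    uplift = treated_retained / treated_n - control_retained / control_n
    # Only the extra customers retained because of the campaign count as gain.
    incremental_value = uplift * treated_n * value_per_customer
    cost = treated_n * cost_per_contact
    return (incremental_value - cost) / cost


# Hypothetical numbers: 90% retention when treated vs. 85% in control.
roi = campaign_roi(treated_n=1000, treated_retained=900,
                   control_n=1000, control_retained=850,
                   value_per_customer=200.0, cost_per_contact=5.0)
print(f"ROI: {roi:.2f}")
```

Deciding on the control group before the model is deployed is one concrete form of building the measurement method into the design of the model.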

Page 47: KDD-07 Invited Innovation Talk


Technical Challenges


Page 48: KDD-07 Invited Innovation Talk


Technical Challenges

0. Public benchmark data sets

• As a field we have failed to define a common data collection

• Very difficult to judge research and systems advances

• Not an easy task, but not impossible

• A mix of

– synthetic (but realistic) data sets

– and real datasets

Page 49: KDD-07 Invited Innovation Talk


Technical Challenges

1. How does the data grow?

• A theory for how large data sets get to be large

• Definitely not IID sampling from a static distribution

• Inappropriateness of a “single-population” model

2. Complexity/understandability tradeoff

• Explaining how, when and why a model works

• Explaining when a model fails

• A “Tuning Dial” for reducing the complex into the understandable

Page 50: KDD-07 Invited Innovation Talk


Technical Challenges

3. Interestingness

• What is an “interesting” pattern or summary?

• How do you measure “novelty”?

• What is “unusual”? When is it worthy of attention?

• Is it low probability events? High summarization ability? Outliers? Good fits? Bad fits?
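
To make one of these candidate answers concrete: if “interesting” is taken to mean “low probability”, a simple score is the surprisal of an event, that is, the negative log of its observed frequency. The sketch below shows only that one candidate, not a definition proposed in the talk; the event names and the function name are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def surprisal_bits(events: Sequence[str]) -> Dict[str, float]:
    """Score each distinct event by -log2 of its empirical frequency."""
    counts = Counter(events)
    total = len(events)
    return {event: -math.log2(n / total) for event, n in counts.items()}


# Hypothetical event log: rare events get high scores.
log = ["login"] * 6 + ["purchase"] + ["refund"]
scores = surprisal_bits(log)
most_surprising = max(scores.values())
```

The example also shows the weakness the slide hints at: a rare event is not necessarily worthy of attention, so frequency alone cannot settle what is interesting.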

Page 51: KDD-07 Invited Innovation Talk


Technical Challenges

4. Scalability

Beyond just dealing with a large data set:

• Principled feature reduction: what is the SVD equivalent? Graceful degradation with dimensionality

• Uncovering graphical structure in data

– Communities, relations, link analysis, …

• Dealing with multiple data types:

– Structured, sparse, dense, text, images, video, audio, sequence data, etc.

– I have yet to see an algorithm that deals with more than one type.

• Integration with DBMS

– Appropriate sampling

– Appropriate operator abstractions

• Taking care of “minor details”

– Initialization?

– Determining k

Page 52: KDD-07 Invited Innovation Talk


Technical Challenges

5. A theory for what we do

• What are the fundamental abstractions?

• What are the basic operations? What are the basic components of an algorithm?

• What is it that we are optimizing?

• What is hard? What is doable? Why?

• What is a “data summary”?

• When are two attributes “similar”? Can you measure efficiently?

• How do we extract the right representation?

Page 53: KDD-07 Invited Innovation Talk


A new theory is needed

• What are the fundamental problems?

• What do partial models or summaries of data really mean?

• What are the implications of post hoc data analysis? When is it/is it not reasonable to conclude a task is appropriate?

• A new algebra for dealing with highly-summarized views of the world

• Effect of sparse spaces on dimensionality. What is the true dimensionality of data? What are the limits?

• A theory for adaptive sampling

Page 54: KDD-07 Invited Innovation Talk

Pragmatic and Technical Grand Challenges

Summary

Page 55: KDD-07 Invited Innovation Talk


Challenges

Pragmatic

1. Where’s the Data?

2. In Situ mining

3. Domain knowledge

4. Life-cycle maintenance

5. Metrics

Technical

0. Public and challenging benchmark data sets

1. Understanding “large”

2. Simplicity knob

3. Interestingness

4. Scalability

5. Theory of what we do

A Scorecard for the field: At least 2 advances in the next 10 years!!!

Page 56: KDD-07 Invited Innovation Talk


ACME CORP ULTIMATE DATA MINING BROWSER

Data Mining Grand Vision

What’s New? What’s Interesting?

Predict for me

Page 57: KDD-07 Invited Innovation Talk


In the meantime, there is an understanding gap

• The technical community speaks of tech problems

• The business strategic thinking hit an “understandability wall”

• Traditionally, the thinking of business strategy never included data

• A new generation of business challenges is born

Page 58: KDD-07 Invited Innovation Talk


Data Strategy

• Is the mapping of the capabilities enabled by data in driving the business

• The Integration of data-driven capabilities in revenue-driving activities

• The Integration of data-derived metrics to feedback into the measurement of the success of the business

• Evolving to an operational state where planning includes data, measurability, and data-driven feedback loops

Page 59: KDD-07 Invited Innovation Talk


A Data Miner’s Story

• Started out in pure research

• At NASA-JPL

• At Microsoft Research

• At digiMine/Revenue Science Inc.

• At DMX Group

• So why join Yahoo! ?

Page 60: KDD-07 Invited Innovation Talk

Evolving the Data Strategy as Chief Data Officer

Yahoo! Case Study

Page 61: KDD-07 Invited Innovation Talk


Yahoo! is the #1 Destination on the Web

73% of the U.S. Internet population uses Yahoo!

– About 500 million users per month globally!

• Global network of content, commerce, media, search and access products

• 100+ properties including mail, TV, news, shopping, finance, autos, travel, games, movies, health, etc.

• 25 terabytes of data collected each day… and growing

• Representing thousands of cataloged consumer behaviors

More people visited Yahoo! in the past month than:

• Use coupons

• Vote

• Recycle

• Exercise regularly

• Have children living at home

• Wear sunscreen regularly

Sources: Mediamark Research, Spring 2004 and comScore Media Metrix, February 2005.

Data is used to develop content, consumer, category and campaign insights for our key content partners and large advertisers

Page 62: KDD-07 Invited Innovation Talk


Yahoo! Data – A league of its own…

Terabytes of Warehoused Data

Amazon: 25

Korea Telecom: 49

AT&T: 94

Y! LiveStor: 100

Y! Panama Warehouse: 500

Walmart: 1,000

Y! Main warehouse: 5,000

GRAND CHALLENGE PROBLEMS OF DATA PROCESSING

TRAVEL, CREDIT CARD PROCESSING, STOCK EXCHANGE, RETAIL, INTERNET

Y! PROBLEM EXCEEDS OTHERS BY 2 ORDERS OF MAGNITUDE

Millions of Events Processed Per Day

SABRE: 50

VISA: 120

NYSE: 225

Y! Panama: 2,000

Y! DataHighway: 14,000
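
To give the event volumes a more tangible scale, the daily figures can be converted to average events per second. The sketch below pairs each value with a system in the order both are listed on the chart, which is an assumption about the original figure; the variable names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Millions of events per day, paired with labels in chart order (assumed pairing).
millions_per_day = {
    "SABRE": 50,
    "VISA": 120,
    "NYSE": 225,
    "Y! Panama": 2_000,
    "Y! DataHighway": 14_000,
}

# Average events per second for each system.
per_second = {
    name: millions * 1_000_000 / SECONDS_PER_DAY
    for name, millions in millions_per_day.items()
}

# How many times larger the biggest Yahoo! stream is than each non-Yahoo! one.
ratios = {
    name: millions_per_day["Y! DataHighway"] / millions_per_day[name]
    for name in ("SABRE", "VISA", "NYSE")
}

for name, rate in per_second.items():
    print(f"{name}: about {rate:,.0f} events/second")
```

Under this pairing the ratios range from roughly 60x to 280x, which is consistent with the slide’s “2 orders of magnitude” claim as a round figure.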

Page 63: KDD-07 Invited Innovation Talk


To be continued…

• Will cover the Yahoo! case study on Tuesday’s Invited talk

• Will include

– Strategic Importance of Data

– Evolving the data strategy

– Evolving towards the need to invent the new sciences of the Internet

Hope the Data Miner’s Story continues… Perhaps to a happy ending?

Page 64: KDD-07 Invited Innovation Talk

Thank You! & Questions?