
Data Mining

Lecture 26

Today’s Lecture


What is data mining? Why data mining? What applications? What techniques? What process? What software?

Definition

Data mining may be defined as follows:

Data mining is a collection of techniques for the efficient automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so that they may be used in an enterprise’s decision making.


What is Data Mining?

Efficient automated discovery of previously unknown patterns in large volumes of data.

Patterns must be valid, novel, useful and understandable.

Businesses are mostly interested in discovering past patterns to predict future behaviour.

A data warehouse, as discussed earlier, is an enterprise’s memory. Data mining can provide intelligence using that memory.


Examples

Amazon.com uses associations. Recommendations to customers are based on their past purchases and on what other customers are purchasing.

“Just for Feet”, a retailer in the USA, has about 200 stores, each carrying up to 6000 shoe styles, each style in several sizes. Data mining is used to find the right shoes to stock in the right store.

More examples in case studies to be discussed later.


Data Mining

We assume we are dealing with large data, perhaps gigabytes, perhaps terabytes.

Although data mining is possible with smaller amounts of data, the bigger the data, the higher the confidence in any previously unknown pattern that is discovered.

There is considerable hype about data mining at the present time and Gartner Group has listed data mining as one of the top ten technologies to watch.

Question: How many books could one store in one Terabyte of memory?
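One rough way to approach the question above is a back-of-the-envelope calculation. The sketch below assumes (this figure is not from the lecture) that an average book is about 1 MB of plain text, and that one terabyte is 10^12 bytes. Changing either assumption changes the answer proportionally.

```python
# Back-of-the-envelope estimate; both constants are assumptions, not lecture data.
BYTES_PER_TERABYTE = 10**12   # decimal terabyte
BYTES_PER_BOOK = 10**6        # assumed: roughly 1 MB of plain text per book


def books_per_terabyte(bytes_per_book: int = BYTES_PER_BOOK) -> int:
    """Return how many books of the given size fit in one terabyte."""
    return BYTES_PER_TERABYTE // bytes_per_book


print(books_per_terabyte())  # 1000000 under the 1 MB assumption
```

Under these assumptions one terabyte holds about a million books; with scanned page images rather than plain text the figure would be far lower.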


Why Data Mining Now?


Growth in generation and storage of corporate data – information explosion

Need for sophisticated decision making – current database systems are Online Transaction Processing (OLTP) systems. The OLTP data is difficult to use for such applications. Why?

Evolution of technology – much cheaper storage, easier data collection and better database management, leading on to data analysis and understanding.

Information explosion


Database systems have been in use since the 1960s in Western countries (perhaps since the 1980s in India). These systems have generated mountains of data.

Point of sale terminals and bar codes on many products, railway bookings, educational institutions, a huge number of mobile phones and electronic commerce all generate data.

Government is now collecting a lot of information.



Internet banking via networked computers and ATMs.

Credit and debit cards.

Medical data, doctors, hospitals.

Transportation, Indian railways, automatic toll collection on toll roads, growing air travel.

Passports, NRI visas, other visas, NRI money transfers.

Question: Can you think of other examples of data collection?



Many adults in India generate:

Mobile phone transactions. More than 300 million phones in India, reportedly growing at the rate of 10,000 new ones every hour! Mobile companies must save information about calls.

Growing middle class with growing number of credit and debit card transactions. About 25m credit cards and 70m debit cards in 2007. Annual growth rate about 30% and 40% respectively. Could be 55m credit cards and 200m debit cards in 2010 resulting in perhaps 500m transactions annually.



India has some huge enterprises, for example Indian railways, perhaps the busiest network in the world with 2.5m employees, 10,000 locomotives, 10,000 passenger trains daily, 10,000 freight trains daily and 20m passengers daily.

Growing airline traffic with more than ten airlines. Perhaps 30m passengers annually.

Growing number of motor vehicles – registration, insurance, driver license

Internet surfing records

OLTP


As noted earlier, most enterprise database systems were designed in the 1970s or 1980s, mainly to automate office procedures, e.g. order entry, student enrolment, patient registration and airline reservations. These are well-structured, repetitive operations that are easily automated.

Decision Making


Need for business memory and intelligence.

Need to serve customers better by learning from past interactions.

OLTP data is not a good basis for maintaining an enterprise memory.

The intelligence hidden in data could be the secret weapon in a competitive business world, but given the information explosion not even a small fraction of it could be looked at by the human eye.

Question: Why is OLTP not good for maintaining an enterprise memory?

OLTP vs Decision Making


Clerical view of data focuses on details required for day-to-day running of an enterprise.

Management view of data focuses on summary data to identify trends, challenges and opportunities.

The detailed data view is the operational view, while the management view is the decision-support view. Comparison of the two views:

Operational vs Management View

Operational              Decision-Support
Users: Admin staff       Users: Management
Day-to-day work          Decision support
Application oriented     Subject oriented
Current data             Historical data
Detailed                 Overall view, summaries
Simple queries           Complex queries
Predetermined queries    Ad hoc queries
Update/Select            Only Select
Real-time                Not real-time
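The contrast between the two views can be illustrated with a small sketch using Python's built-in sqlite3 module. The `sales` table, its columns and its rows are made up for illustration and are not from the lecture. The first query is a simple, predetermined update and lookup of one current record; the second is a read-only summary over historical data.

```python
import sqlite3

# Toy in-memory table; the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, year INTEGER, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (year, region, amount) VALUES (?, ?, ?)",
    [(2006, "North", 100.0), (2006, "South", 150.0),
     (2007, "North", 120.0), (2007, "South", 180.0)],
)

# Operational view: update and read back a single, current, detailed record.
conn.execute("UPDATE sales SET amount = 125.0 WHERE id = 3")
one_row = conn.execute("SELECT amount FROM sales WHERE id = 3").fetchone()

# Decision-support view: a select-only summary across historical data.
yearly_totals = conn.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()

print(one_row)        # (125.0,)
print(yearly_totals)  # [(2006, 250.0), (2007, 305.0)]
```

In a real enterprise the summary query would typically run against a data warehouse rather than the live OLTP system, so that long-running analysis does not slow down day-to-day transactions.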


Evolution of Technology

Corporate data growth has been accompanied by a decline in the cost of storage and processing.

PC motherboard performance, measured in MHz/$, is currently doubling every 27 ± 2 months.

The next slide, using a logarithmic scale, shows that disk is now about 10 GB per US dollar, and the following slide shows that sales of disk storage are growing exponentially.

Question: What is the cost of a 100 GB disk? What is the cost of a PC and what is its CPU performance?
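The first part of the question can be worked out from the figures quoted on this slide. The sketch below takes the 10 GB per US dollar figure and the midpoint of the 27 ± 2 month doubling period at face value; the function names are hypothetical and prices have changed since the slide was written.

```python
GB_PER_DOLLAR = 10.0      # figure quoted on the slide
DOUBLING_MONTHS = 27.0    # midpoint of the quoted 27 +/- 2 months


def disk_cost_dollars(capacity_gb: float) -> float:
    """Cost in US dollars of a disk of the given capacity at the quoted rate."""
    return capacity_gb / GB_PER_DOLLAR


def growth_factor(months: float) -> float:
    """Performance-per-dollar multiplier after the given number of months."""
    return 2.0 ** (months / DOUBLING_MONTHS)


print(disk_cost_dollars(100))  # 10.0
print(growth_factor(54))       # 4.0
```

So a 100 GB disk costs about 10 US dollars at the quoted rate, and performance per dollar quadruples over two doubling periods (54 months).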


Decline in Hard Drive cost


Growth in Worldwide Disk Capacity

[Figure: Storage in Petabytes (y-axis, 0 to 18000) plotted against Year (x-axis, 1996 to 2003)]

Evolution of Technology

Question: What do the graphs in the last two slides tell us? What scales are used in them? What was the pink line in the first graph?


Evolution of Technology

Database technology has improved over the years.

Data collection is often much better and cheaper now.

The need for analyzing and synthesizing information is growing in today’s fiercely competitive business environment.


New applications


Sophisticated applications of modern enterprises include:

- sales forecasting and analysis
- marketing and promotion planning
- business modeling

OLTP is not designed for such applications. Also, large enterprises operate a number of database systems and then it is necessary to integrate information for decision making applications.

Question: Why can OLTP not be used for sales forecasting and analysis?

Why Data Mining Now?


As noted earlier, the reasons may be summarized as:

• Accumulation of large amounts of data
• Increased affordable computing power enabling data mining processing
• Statistical and learning algorithms
• Availability of software
• Strong business competition

Large amount of data


Already discussed that many enterprises have large amounts of data accumulated over 30+ years.

Noted earlier that some enterprises collect information for analysis, for example, supermarkets in USA offer loyalty cards in exchange for shopper information. Loyalty cards in Australia also collect information using a reward system.

Growth of cards

A recent survey in the USA found that the percentages of US adults using the following types of cards were:

Credit cards - 88%
ATM cards - 60%
Membership cards - 58%
Debit cards - 35%
Prepaid cards - 35%
Loyalty cards - 29%

Question: What kind of data do these cards generate?


Affordable computing power


Data mining is usually computationally intensive. Dramatic reduction in the price of computer systems, as noted earlier, is making it possible to carry out data mining without investing huge amounts of resources in hardware and software.

In spite of affordable computing power, data mining can still be resource intensive.

Algorithms


A variety of statistical and learning algorithms have been available in fields like statistics and artificial intelligence that have been adapted for data mining.

With new focus on data mining, new algorithms are being developed.

Availability of Software

Large variety of DM software is now available. Some more widely used software is:

IBM - Intelligent Miner and more
SAS - Enterprise Miner
Silicon Graphics - MineSet
Oracle - Thinking Machines - Darwin
Angoss - knowledgeSEEKER

Strong Business Competition


Growth in service economies. Almost every business is a service business. Service economies are information rich and very competitive.

Consider the telecommunications environment in Australia. About 20 years ago, Telstra was a monopoly. The field is now very competitive. Mobile phone market in India is also very competitive.

Applications

In finance, telecom, insurance and retail:

- Loan/credit card approval
- market segmentation
- fraud detection
- better marketing
- trend analysis
- market basket analysis
- customer churn
- Web site design and promotion

Loan/Credit card approvals


In a modern society, a bank does not know its customers personally. The only knowledge a bank has is the customer information stored in its computers.

Credit agencies and banks collect a lot of customers’ behavioural data from many sources. This information is used to predict the chances of a customer paying back a loan.

Market Segmentation

Large amounts of data about customers contain valuable information

The market may be segmented into many subgroups according to variables that are good discriminators

Not always easy to find variables that will help in market segmentation


Fraud Detection

Very challenging since it is difficult to define characteristics of fraud. Often based on detecting changes from the norm.

In statistics, it is common to throw out the outliers but in data mining it may be useful to identify them since they could either be due to errors or perhaps fraud.
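As a minimal sketch of "detecting changes from the norm", the code below flags values that lie far from the median, using the median absolute deviation (a modified z-score). This is one common robust rule chosen for illustration, not a method prescribed by the lecture; the function name, the 3.5 threshold and the transaction amounts are assumptions.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score (median-based) exceeds the threshold."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - centre) / mad > threshold]


# Hypothetical card transaction amounts for one customer.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
print(flag_outliers(amounts))  # [500]
```

A flagged value is only a candidate: as the slide notes, it may be a data error rather than fraud, so it would normally be passed on for further checking.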


Better Marketing

When customers buy new products, other products may be suggested to them when they are ready.

As noted earlier, in mail order marketing for example, one wants to know:

- will the customer respond?

- will the customer buy and how much?

- will the customer return purchase?

- will the customer pay for the purchase?


Better Marketing


It has been reported that more than 1000 variable values on each customer are held by some mail order marketing companies.

The aim is to “lift” the response rate.
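Lift is usually measured as the response rate in the group selected by a model divided by the response rate of the whole mailing list. The sketch below uses made-up campaign numbers, and the function name and parameters are hypothetical.

```python
def lift(target_responses, target_size, base_responses, base_size):
    """Ratio of the targeted group's response rate to the baseline response rate."""
    if target_size == 0 or base_responses == 0:
        raise ValueError("target_size and base_responses must be non-zero")
    # Cross-multiplied form of
    # (target_responses / target_size) / (base_responses / base_size).
    return (target_responses * base_size) / (target_size * base_responses)


# Hypothetical campaign: 200 of 10,000 customers respond overall (2%),
# but 80 of the 1,000 customers picked by a model respond (8%).
print(lift(80, 1000, 200, 10000))  # 4.0
```

A lift of 4 here would mean that mailing only the selected customers yields four times the response rate of mailing everyone.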

Trend analysis


In a large company, not all trends are always visible to the management. It is then useful to use data mining software that will identify trends.

Trends may be long term trends, cyclic trends or seasonal trends.
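One simple way to separate a long-term trend from a seasonal pattern is a moving average whose window covers one full season. The sketch below is an illustration with made-up quarterly sales figures, not the method used by any particular data mining tool.

```python
def moving_average(series, window):
    """Return the simple moving average of `series` over `window` points."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Made-up quarterly sales with a seasonal swing and a slow upward drift.
quarterly_sales = [10, 20, 10, 20, 12, 22, 12, 22]
print(moving_average(quarterly_sales, 4))  # [15.0, 15.5, 16.0, 16.5, 17.0]
```

Averaging over four quarters cancels the seasonal swing, so the steady rise in the smoothed values shows the long-term trend.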

Market Basket Analysis

Aims to find what the customers buy and what they buy together

This may be useful in designing store layouts or in deciding which items to put on sale

Basket analysis can also be used for applications other than just analysing what items customers buy together
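The idea of finding items bought together can be sketched by counting how often pairs of items share a basket (support) and how often one item accompanies another (confidence). The baskets below are toy data and the function names are hypothetical; real tools use more efficient algorithms on far larger data.

```python
from collections import Counter
from itertools import combinations

# Toy transactions, made up for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "tea"},
]


def pair_support(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


counts = pair_support(baskets)
print(counts[("bread", "milk")])               # 2
print(confidence(baskets, "butter", "bread"))  # 1.0
```

In this toy data every basket with butter also contains bread, which is the kind of pattern a store might use when deciding on layout or promotions.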


Customer Churn

In businesses like telecommunications, companies are trying very hard to keep their good customers and perhaps to persuade good customers of their competitors to switch to them.

In such an environment, businesses want to find which customers are good, why customers switch and what makes customers loyal.

Cheaper to develop a retention plan and retain an old customer than to bring in a new customer.


Customer Churn

The aim is to get to know the customers better so you will be able to keep them longer.

Given the competitive nature of businesses, customers will move if not looked after.

Also, some businesses may wish to get rid of customers that cost more than they are worth, e.g. credit card holders that don’t use the card, or bank customers with very small amounts of money in their accounts.


Web site design

A Web site is effective only if the visitors easily find what they are looking for.

Data mining can help discover the affinity of visitors to pages, and the site layout may be modified based on this information.


Data Mining Process

Successful data mining involves carefully determining the aims and selecting appropriate data. The following steps should normally be followed:

1. Requirements analysis
2. Data selection and collection
3. Cleaning and preparing data
4. Data mining exploration and validation
5. Implementing, evaluating and monitoring
6. Results visualisation


Requirements Analysis

The enterprise decision makers need to formulate goals that the data mining process is expected to achieve. The business problem must be clearly defined. One cannot use data mining without a good idea of what kind of outcomes the enterprise is looking for.

If objectives have been clearly defined, it is easier to evaluate the results of the project.


Data Selection and Collection

Find the best source databases for the data that is required. If the enterprise has implemented a data warehouse, then most of the data could be available there. Otherwise source OLTP systems need to be identified and required information extracted and stored in some temporary system.

In some cases, only a sample of the data available may be required.
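Drawing a simple random sample can be sketched as follows. The function name and parameters are hypothetical, and a fixed seed is used only so that the same sample can be reproduced; real projects may need stratified or other sampling schemes.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return a reproducible simple random sample of the given fraction of rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)                  # private generator, fixed seed
    k = max(1, round(len(rows) * fraction))    # at least one row
    return rng.sample(rows, k)


rows = list(range(1000))            # stand-in for 1000 extracted records
subset = sample_rows(rows, 0.1)
print(len(subset))                  # 100
```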


Cleaning and Preparing Data

This may not be an onerous task if a data warehouse containing the required data exists, since most of this work will already have been done when the data was loaded into the warehouse. Otherwise this task can be very resource intensive; perhaps more than 50% of the effort in a data mining project is spent on this step. Essentially, a data store that integrates data from a number of databases may need to be created. When integrating data, one often encounters problems like identifying data, dealing with missing data, data conflicts and ambiguity. An ETL (extraction, transformation and loading) tool may be used to overcome these problems.
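The integration problems mentioned above (missing data and data conflicts) can be illustrated with a small sketch that merges customer records from two sources. The record layout, field names and normalisation rule are all assumptions made for illustration; an actual ETL tool does far more than this.

```python
def merge_records(source_a, source_b):
    """Merge two lists of customer dicts keyed on 'id'.

    Missing values (None) are filled from the other source; genuine
    disagreements are collected in `conflicts` rather than silently resolved.
    """
    merged, conflicts = {}, []
    for record in source_a + source_b:
        target = merged.setdefault(record["id"], {})
        for field, value in record.items():
            if value is None:
                continue                       # missing: leave for the other source
            if isinstance(value, str):
                value = value.strip().lower()  # simple normalisation
            if field in target and target[field] != value:
                conflicts.append((record["id"], field, target[field], value))
            else:
                target[field] = value
    return merged, conflicts


# Hypothetical extracts from two OLTP systems.
source_a = [{"id": 1, "city": " Delhi ", "phone": None},
            {"id": 2, "city": "Mumbai", "phone": "555-0101"}]
source_b = [{"id": 1, "city": "delhi", "phone": "555-0100"},
            {"id": 2, "city": "Pune", "phone": "555-0101"}]

merged, conflicts = merge_records(source_a, source_b)
print(merged[1])   # {'id': 1, 'city': 'delhi', 'phone': '555-0100'}
print(conflicts)   # [(2, 'city', 'mumbai', 'pune')]
```

Here the missing phone number for customer 1 is filled from the second source, while the disagreement about customer 2's city is reported for someone to resolve.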


Exploration and Validation

Assuming that the user has access to one or more data mining tools, a data mining model may be constructed based on the enterprise’s needs. It may be possible to take a sample of data and apply a number of relevant techniques. For each technique the results should be evaluated and their significance interpreted.

This is likely to be an iterative process which should lead to selection of one or more techniques that are suitable for further exploration, testing and validation.


Implementing, Evaluating and Monitoring

Once a model has been selected and validated, the model can be implemented for use by the decision makers. This may involve software development for generating reports or for results visualisation and explanation for managers. If more than one technique is available for the given data mining task, it is necessary to evaluate the results and choose the best. This may involve checking the accuracy and effectiveness of each technique.
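Checking the accuracy of competing techniques can be sketched by scoring each candidate on the same held-out validation data and keeping the best one. The two threshold rules and the validation records below are hypothetical stand-ins for real models and data.

```python
def accuracy(predict, examples):
    """Fraction of (input, label) examples that `predict` classifies correctly."""
    correct = sum(predict(x) == label for x, label in examples)
    return correct / len(examples)


# Hypothetical validation set: (monthly spend, responded to offer?)
validation = [(10, False), (35, False), (60, True),
              (80, True), (45, True), (20, False)]

# Two stand-in "techniques": simple threshold rules on spend.
candidates = {
    "threshold_50": lambda spend: spend >= 50,
    "threshold_40": lambda spend: spend >= 40,
}

scores = {name: accuracy(rule, validation) for name, rule in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # threshold_40
```

Accuracy is only one measure; depending on the business problem, the cost of different kinds of error may matter more than the overall hit rate.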


Implementing, Evaluating and Monitoring

Regular monitoring of the performance of the techniques that have been implemented is required. Every enterprise evolves with time and so must the data mining system. Monitoring may from time to time lead to the refinement of tools and techniques that have been implemented.


Results Visualisation

Explaining the results of data mining to the decision makers is an important step. Most DM software includes data visualisation modules which should be used in communicating data mining results to the managers.

Clever data visualisation tools are being developed to display results that deal with more than two dimensions. The visualisation tools available should be tried and used if found effective for the given problem.


Summary

We have seen today:

What is data mining? Why data mining? What applications? What techniques? What process? What software?
