DATA MINING PROCESS – ANOTHER APPROACH

Another approach that also includes six steps is CRISP–DM (Cross–Industry Standard Process for Data Mining), developed by an industry consortium.

CRISP–DM STEPS

The six CRISP–DM steps are:

1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment

CRISP–DM STEPS

The six steps proposed in CRISP–DM are similar to the six steps proposed earlier. The CRISP–DM steps are shown in the following figure.

CRISP DATA MINING MODEL

DATA MINING TECHNIQUES

Although data mining is a new field, it uses many techniques developed years ago in other fields, such as machine learning, statistics and artificial intelligence.

In some cases these techniques are modified to deal with large amounts of data.

DATA MINING TECHNIQUES

Data mining includes a large number of techniques, including concept/class description, association analysis, classification and prediction, cluster analysis and outlier analysis.

Expression and visualization of data mining results is a challenging task.

Privacy issues also need to be considered.

DATA MINING TASKS

• Association analysis
• Classification and prediction
• Cluster analysis
• Web data mining
• Search engines
• Data warehouse and OLAP
• Others, for example, sequential patterns and time-series analysis

ASSOCIATION ANALYSIS

Association analysis involves discovery of relationships or correlations among a set of items.

The aim is to determine which items are frequently purchased together.

This information may be used for cross-selling. Lift is a term used to measure the power of an association.

Applications: market basket analysis, customer segmentation, medicine, electronic commerce, classification, clustering, web mining, bioinformatics, finance.

ASSOCIATIONS

Association rules are often written as X → Y, meaning that whenever X appears Y also tends to appear. X and Y may be collections of attributes.
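
As a rough illustration of how the strength of a rule X → Y can be measured, the sketch below computes support, confidence and lift over a handful of transactions. The transactions and function names are made up for this example, and the formulas are the commonly used textbook definitions rather than anything specified on these slides.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Estimate of P(Y | X): support(X and Y together) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of X -> Y divided by the support of Y."""
    return confidence(x, y, transactions) / support(y, transactions)

print(confidence({"bread"}, {"milk"}, transactions))
print(lift({"bread"}, {"milk"}, transactions))
```

A lift above 1 suggests that X and Y occur together more often than would be expected if they were independent; a lift below 1 suggests the opposite.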

CLASSIFICATION AND PREDICTION

• A set of training objects, each with a number of attribute values, is given to the classifier. The classifier formulates rules for each class in the training set so that the rules may be used to classify new objects. Some techniques do not require training data.

• Classification may be used for predicting the class label of data objects. A number of techniques are available, including decision trees and neural networks.

• Supervised classification can be used in predicting the class to which an object or individual is likely to belong.
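
To make the idea of learning rules from training objects concrete, here is a minimal sketch of a one-attribute rule learner (often called OneR). It is not one of the techniques named on the slide; it is used here only because it is short. The data, attribute meanings and function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_one_r(rows, labels):
    """Return (attribute_index, {value: class}) for the single attribute whose
    value-to-majority-class rules make the fewest errors on the training set."""
    best = None
    for attr in range(len(rows[0])):
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # One rule per attribute value: predict the most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rules[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    return best[1], best[2]

def classify(model, row, default=None):
    """Apply the learned rules to a new object; unseen values get the default."""
    attr, rules = model
    return rules.get(row[attr], default)

# Hypothetical training objects with attributes (outlook, windy) and class "play".
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes")]
labels = ["yes", "yes", "no", "no"]

model = train_one_r(rows, labels)
print(classify(model, ("sunny", "yes")))
```

In this toy data set the first attribute separates the classes perfectly, so the learner picks it and ignores the second.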

CLUSTER ANALYSIS

• Similar to classification in that the aim is to build clusters such that each of them is similar within itself but is dissimilar to others. Clustering does not rely on class-labeled data objects.

• Cluster analysis is useful when the classes in the data are not already known and the training data is not available.

• The aim is to find groups in the collection of data that are very different from each other.

• Based on the principle of maximizing the intracluster similarity and minimizing the intercluster similarity.
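
One widely used way to form such groups is k-means, sketched below for 2-D points. The slides do not name a specific algorithm, so this is only one possible illustration; the points, the choice of k and the use of the first k points as starting centroids are assumptions made for the example.

```python
def kmeans(points, k, iterations=10):
    """Very small k-means sketch for 2-D points. The first k points are used
    as initial centroids (an arbitrary choice that keeps the run repeatable)."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visibly separate groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

No class labels are supplied anywhere; the grouping comes only from the distances between points.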

WEB DATA MINING

The Web revolution has had a profound impact on the way we search and find information at home and at work. From its beginning in the early 1990s, the web has grown to more than ten billion pages in 2008 (estimates vary), perhaps even more by the time you are looking at this slide.

• Web usage
• Web content
• Web structure

SEARCH ENGINES

Search engines are huge databases of web pages as well as software packages for indexing and retrieving the pages that enable users to find information.

Normally the search engine databases of Web pages are built and updated automatically by Web crawlers.

When one searches the Web using one of the search engines, one is not searching the entire Web. Instead one is only searching the database that has been compiled by the search engine.

A key problem is how to assign a ranking to each Web page that is retrieved in response to a user query.
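
The following sketch shows, in a very simplified form, what "searching the database compiled by the search engine" can look like: an inverted index is built from a few pages, and results are ranked by how often the query words occur. The page addresses, texts and the scoring rule are all invented for illustration; real search engines use far more elaborate ranking.

```python
from collections import defaultdict

# Hypothetical crawled pages: address -> text (illustrative only).
pages = {
    "a.example/1": "data mining finds patterns in data",
    "a.example/2": "web mining applies data analysis to the web",
    "a.example/3": "search engines index web pages",
}

def build_index(pages):
    """Inverted index: word -> {address: number of occurrences}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Rank pages by the total number of query-word occurrences."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by address.
    return sorted(scores, key=lambda url: (-scores[url], url))

index = build_index(pages)
print(search(index, "data mining"))
```

Only pages already present in the index can ever be returned, which is the point made above about not searching the entire Web.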

DATA WAREHOUSING AND OLAP

Data warehousing is a process by which an enterprise collects data from the whole enterprise to build a single version of the truth.

This information is useful for decision makers and may also be used for data mining.

A data warehouse can be of real help in data mining since data cleaning and other problems of collecting data would have already been overcome.

OLAP (Online Analytical Processing) tools are decision support tools that are often built on top of a data warehouse or another database. OLAP goes further than traditional query and report tools in that a decision maker already has a hypothesis which he/she is trying to test.

DATA WAREHOUSING AND OLAP

Data mining is somewhat different from OLAP, since in data mining a hypothesis is not being tested. Instead, data mining is used to uncover novel patterns in the data.

BEFORE DATA MINING

To define a data mining task, one needs to answer the following:

• What data set do I want to mine?
• What kind of knowledge do I want to mine?
• What background knowledge could be useful?
• How do I measure if the results are interesting?
• How do I display what I have discovered?

TASK-RELEVANT DATA

The whole database may not be required, since we may only want to study something specific, e.g. trends in postgraduate students:

- countries they come from
- degree program they are doing
- their age
- time they take to finish the degree
- scholarships they have been awarded

May need to build a database subset before data mining can be done.

TASK-RELEVANT DATA

Data collection is non-trivial.

OLTP data is not useful since it is changing all the time. In some cases, data from more than one database may be needed.

PREPROCESSING

A data mining process would normally involve preprocessing

Often data mining applications use data warehousing

One approach is to pre-mine the data, warehouse it, then carry out data mining

The process is usually iterative and can take years of effort for a large project

DATA PREPROCESSING

Preprocessing is very important although often considered too mundane to be taken seriously

Preprocessing may also be needed after the data warehouse phase

Data reduction may be needed to transform very high-dimensional data into lower-dimensional data

DATA PREPROCESSING

• Feature selection
• Use sampling?
• Normalization
• Smoothing
• Dealing with duplicates, missing data
• Dealing with time-dependent data
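
Three of the preprocessing steps listed here (duplicates, missing data, normalization) can be sketched in a few lines. The records, the mean-fill strategy for missing values and the min-max scaling are assumptions chosen for illustration; other strategies may suit a given data set better.

```python
def remove_duplicates(rows):
    """Drop exact duplicate records while keeping the original order."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def fill_missing(values):
    """Replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Hypothetical (student_id, age) records with one duplicate and one missing age.
records = [(1, 22), (2, None), (1, 22), (3, 30), (4, 26)]

unique = remove_duplicates(records)
ages = fill_missing([age for _, age in unique])
scaled = min_max_normalize(ages)
print(unique, ages, scaled)
```

The order of the steps matters: duplicates are removed first so that they do not distort the mean used to fill the missing value.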

BACKGROUND KNOWLEDGE

Background information may be useful in the discovery process.

For example, concept hierarchies or relationships between data may be useful in data mining. For postgraduate degrees, we may wish to look at all Masters degrees and all doctorate degrees separately.

MEASURING INTEREST

The data mining process may generate many patterns. We cannot look at all of them, so we need some way to separate the interesting results from the uninteresting ones.

This may be based on simplicity of pattern, rule length, or level of confidence.
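
As a small sketch of such filtering, the code below keeps only rules that reach a minimum confidence and do not exceed a maximum length. The rules, the thresholds and the function name are hypothetical; suitable cut-offs depend on the application.

```python
# Hypothetical mined rules: (antecedent, consequent, confidence).
rules = [
    (("bread",), ("butter",), 0.82),
    (("bread", "milk", "eggs", "jam"), ("butter",), 0.91),
    (("milk",), ("bread",), 0.40),
]

def interesting(rules, min_confidence=0.6, max_length=3):
    """Keep rules that are confident enough and short enough, best first.
    Rule length is counted as the number of items on both sides."""
    kept = [
        r for r in rules
        if r[2] >= min_confidence and len(r[0]) + len(r[1]) <= max_length
    ]
    return sorted(kept, key=lambda r: r[2], reverse=True)

selected = interesting(rules)
print(selected)
```

Here the second rule is dropped for being too long despite its high confidence, and the third for low confidence.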

VISUALIZATION

We must be able to display results so that they are easy to understand.

Display may be a graph, pie chart, tables etc. Some displays are better than others for a given kind of knowledge.

GUIDELINES FOR SUCCESSFUL DATA MINING

• The data must be available
• The data must be relevant, adequate and clean
• There must be a well-defined problem
• The problem should not be solvable by means of ordinary query or OLAP tools
• The results must be actionable


1. Use a small team with a strong internal integration and a loose management style.

2. Carry out a small pilot project before a major data mining project.

3. Identify a clear problem owner responsible for the project. This could be someone in sales or marketing. This will benefit the external integration.


4. Try to realise a positive return on investment within 6 to 12 months.

5. The whole data mining project should have the support of the top management of the company.

DATA MINING SOFTWARE

As noted earlier, a large variety of DM software is now available. Some more widely used software is:

• IBM - Intelligent Miner and more
• SAS - Enterprise Miner
• Silicon Graphics - MineSet
• Oracle - Thinking Machines - Darwin
• Angoss - knowledgeSEEKER
