DATA MINING PROCESS – ANOTHER APPROACH
Another approach that also includes six steps is CRISP–DM (Cross–Industry Standard Process for Data Mining), which was developed by an industry consortium.
CRISP–DM STEPS
The six CRISP–DM steps are:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment
The six steps proposed in CRISP–DM are similar to the six steps proposed earlier. The CRISP–DM steps are shown in the following figure.
CRISP DATA MINING MODEL
DATA MINING TECHNIQUES
Although data mining is a new field, it uses many techniques developed years ago in other fields such as machine learning, statistics and artificial intelligence.
These techniques are in some cases modified to deal with large amounts of data.
Data mining covers a large number of techniques, including concept/class description, association analysis, classification and prediction, cluster analysis and outlier analysis.
Expression and visualization of data mining results is a challenging task.
Privacy issues also need to be considered.
DATA MINING TASKS
• Association analysis
• Classification and prediction
• Cluster analysis
• Web data mining
• Search engines
• Data warehouse and OLAP
• Others, for example, sequential patterns and time-series analysis
ASSOCIATION ANALYSIS
Association analysis involves the discovery of relationships or correlations among a set of items.
A typical aim is to determine which items are frequently purchased together.
This information may be used for cross-selling.
Lift is a term used to measure the power of an association.
Applications: market basket analysis, customer segmentation, medicine, electronic commerce, classification, clustering, web mining, bioinformatics, finance.
ASSOCIATIONS
Association rules are often written as X → Y, meaning that whenever X appears Y also tends to appear. X and Y may each be a collection of attributes.
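The strength of a rule X → Y can be made concrete with a small sketch. The transactions below are invented for illustration, and the helper names (`support`, `confidence`, `lift`) are assumptions of this example. The formulas are the commonly used definitions: support is the fraction of transactions containing an itemset, confidence is support(X and Y) / support(X), and lift is confidence divided by support(Y).

```python
# Toy market-basket data (hypothetical, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Estimated P(Y | X): support of X and Y together over support of X."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of X -> Y relative to the baseline frequency of Y."""
    return confidence(x, y, transactions) / support(y, transactions)

rule_conf = confidence({"bread"}, {"milk"}, transactions)
rule_lift = lift({"bread"}, {"milk"}, transactions)
```

A lift above 1 suggests X and Y occur together more often than would be expected if they were independent. In this toy data the lift of bread → milk is below 1, so the rule would not be a good basis for cross-selling.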
CLASSIFICATION AND PREDICTION
• A set of training objects, each with a number of attribute values, is given to the classifier. The classifier formulates rules for each class in the training set so that the rules may be used to classify new objects. Some techniques do not require training data.
• Classification may be used for predicting the class label of data objects. A number of techniques are available, including decision trees and neural networks.
• Supervised classification can be used in predicting the class to which an object or individual is likely to belong.
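As a rough illustration of a classifier that formulates rules from training objects, the sketch below uses a simple one-rule (1R-style) learner rather than a decision tree or neural network. It picks the single attribute whose "value → majority class" rules make the fewest training errors. The attribute names, class labels and function names are all hypothetical.

```python
from collections import Counter, defaultdict

def train_one_rule(rows, labels):
    """Pick the attribute whose value -> majority-class rules
    make the fewest errors on the training set."""
    best = None
    for attr in rows[0]:
        # Count the class labels seen for each value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rules[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    # Fall back to the overall majority class for unseen attribute values.
    default = Counter(labels).most_common(1)[0][0]
    return best[1], best[2], default

def classify(model, row):
    """Apply the learned rules to a new object."""
    attr, rules, default = model
    return rules.get(row[attr], default)

# Hypothetical training objects and their class labels.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

model = train_one_rule(rows, labels)
prediction = classify(model, {"outlook": "rainy", "windy": "no"})
```

Here the learner selects `outlook`, because its rules classify every training object correctly while the rules built on `windy` do not.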
CLUSTER ANALYSIS
• Similar to classification, except that the aim is to build clusters such that the objects in each cluster are similar to one another but dissimilar to the objects in other clusters. Clustering does not rely on class-labeled data objects.
• Cluster analysis is useful when the classes in the data are not already known and the training data is not available.
• The aim is to find groups that are very different from each other in the collection of data.
• Based on the principle of maximizing the intracluster similarity and minimizing the intercluster similarity.
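The slides do not name a particular clustering algorithm. As one possible illustration, the sketch below uses plain k-means on 2-D points, which repeatedly assigns each point to its nearest centroid and then moves each centroid to the mean of its cluster. The points and starting centroids are made up, and fixed starting centroids are used so that the result is reproducible.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points with caller-supplied starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

No class labels are supplied. The two groups emerge only from the distances between the points, which is the sense in which clustering does not need training data.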
WEB DATA MINING
The Web revolution has had a profound impact on the way we search and find information at home and at work. From its beginning in the early 1990s, the web has grown to more than ten billion pages in 2008 (estimates vary), perhaps even more by the time you are looking at this slide.
• Web usage
• Web content
• Web structure
SEARCH ENGINES
Search engines are huge databases of web pages as well as software packages for indexing and retrieving the pages that enable users to find information.
Normally the search engine databases of Web pages are built and updated automatically by Web crawlers.
When one searches the Web using one of the search engines, one is not searching the entire Web. Instead one is only searching the database that has been compiled by the search engine.
A key issue is how to assign a ranking to each Web page that is retrieved in response to a user query.
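The idea of searching a compiled database rather than the Web itself can be sketched with a tiny inverted index. The pages, URLs and function names below are hypothetical. Ranking by raw term counts is an assumption made only to keep the example short; real search engines use far more elaborate ranking schemes.

```python
from collections import Counter, defaultdict

# Hypothetical crawled pages: URL -> page text.
pages = {
    "a.example/mining": "data mining finds patterns in data",
    "b.example/olap": "olap tools test a hypothesis on warehouse data",
    "c.example/web": "web crawlers collect pages",
}

def build_index(pages):
    """Map each word to a Counter of {url: occurrences on that page}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] += 1
    return index

def search(index, query):
    """Return URLs ranked by how often the query words occur on each page."""
    scores = Counter()
    for word in query.lower().split():
        scores.update(index.get(word, {}))
    return [url for url, _ in scores.most_common()]

index = build_index(pages)
results = search(index, "data mining")
```

A query only ever consults `index`, so a page that the crawler never collected cannot appear in the results.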
DATA WAREHOUSING AND OLAP
Data warehousing is a process by which an enterprise collects data from the whole enterprise to build a single version of the truth.
This information is useful for decision makers and may also be used for data mining.
A data warehouse can be of real help in data mining since data cleaning and other problems of collecting data would have already been overcome.
OLAP (Online Analytical Processing) tools are decision support tools that are often built on top of a data warehouse or another database. OLAP goes further than traditional query and report tools in that a decision maker already has a hypothesis which he/she is trying to test.
Data mining is somewhat different from OLAP, since in data mining a hypothesis is not being tested. Instead, data mining is used to uncover novel patterns in the data.
BEFORE DATA MINING
To define a data mining task, one needs to answer the following:
• What data set do I want to mine?
• What kind of knowledge do I want to mine?
• What background knowledge could be useful?
• How do I measure if the results are interesting?
• How do I display what I have discovered?
TASK-RELEVANT DATA
The whole database may not be required, since we may only want to study something specific, e.g. trends in postgraduate students:
- countries they come from
- degree program they are doing
- their age
- time they take to finish the degree
- scholarships they have been awarded
A subset of the database may need to be built before data mining can be done.
Data collection is non-trivial.
OLTP data is not useful since it is changing all the time. In some cases, data from more than one database may be needed.
PREPROCESSING
A data mining process would normally involve preprocessing.
Data mining applications often use data warehousing.
One approach is to pre-mine the data, warehouse it, and then carry out data mining.
The process is usually iterative and can take years of effort for a large project.
DATA PREPROCESSING
Preprocessing is very important, although it is often considered too mundane to be taken seriously.
Preprocessing may also be needed after the data warehouse phase.
Data reduction may be needed to transform very high-dimensional data into lower-dimensional data.
• Feature selection
• Use sampling?
• Normalization
• Smoothing
• Dealing with duplicates, missing data
• Dealing with time-dependent data
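Three of the preprocessing steps listed above (duplicates, missing data, normalization) can be sketched as follows. The student records are invented. Filling missing values with the mean and using min-max scaling are only two of several common choices, assumed here for illustration.

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each identical record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def fill_missing(values):
    """Replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical (student id, age) records with one duplicate and one missing age.
rows = [("s1", 20), ("s2", None), ("s1", 20), ("s3", 40)]
unique_rows = remove_duplicates(rows)
ages = fill_missing([age for _, age in unique_rows])
scaled = min_max_normalize(ages)
```

The order of the steps matters: removing duplicates first stops the repeated record from being counted twice when the mean is computed.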
BACKGROUND KNOWLEDGE
Background information may be useful in the discovery process.
For example, concept hierarchies or relationships between data may be useful in data mining. For postgraduate degrees, we may wish to look at all Masters degrees and all doctorate degrees separately.
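A concept hierarchy can be as simple as a mapping from specific values to broader categories. The sketch below rolls individual degrees up to the Masters/doctorate level mentioned above. The degree codes, the enrolment list and the `roll_up` helper are hypothetical.

```python
from collections import Counter

# Hypothetical concept hierarchy: specific degree -> broader level.
hierarchy = {
    "MSc": "Masters",
    "MBA": "Masters",
    "MA": "Masters",
    "PhD": "Doctorate",
    "DBA": "Doctorate",
}

def roll_up(degrees, hierarchy):
    """Count records at the broader level of the hierarchy."""
    return Counter(hierarchy.get(d, "Other") for d in degrees)

enrolments = ["MSc", "PhD", "MBA", "MSc", "PhD"]
by_level = roll_up(enrolments, hierarchy)
```

Mining at the broader level can reveal patterns that are too sparse to notice when every degree is treated separately.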
MEASURING INTEREST
A data mining process may generate many patterns. We cannot look at all of them, so we need some way to separate the uninteresting results from the interesting ones.
This may be based on simplicity of pattern, rule length, or level of confidence.
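One simple way to apply such measures is to filter the mined patterns with thresholds. In the sketch below the rules, the threshold values and the `interesting` function are all made up for illustration; suitable thresholds depend on the application.

```python
# Hypothetical mined rules: (antecedent items, consequent item, confidence).
rules = [
    (("bread",), "butter", 0.90),
    (("bread", "milk", "eggs", "jam"), "butter", 0.95),
    (("milk",), "eggs", 0.40),
]

def interesting(rules, min_confidence=0.8, max_length=3):
    """Keep rules that are confident enough and short enough to read easily."""
    return [
        r for r in rules
        if r[2] >= min_confidence and len(r[0]) + 1 <= max_length
    ]

kept = interesting(rules)
```

Here the second rule is dropped for being too long despite its high confidence, and the third is dropped for low confidence.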
VISUALIZATION
We must be able to display results so that they are easy to understand.
The display may be a graph, a pie chart, a table, etc. Some displays are better than others for a given kind of knowledge.
GUIDELINES FOR SUCCESSFUL DATA MINING
• The data must be available
• The data must be relevant, adequate and clean
• There must be a well-defined problem
• The problem should not be solvable by means of ordinary query or OLAP tools
• The results must be actionable
1. Use a small team with strong internal integration and a loose management style.
2. Carry out a small pilot project before a major data mining project.
3. Identify a clear problem owner responsible for the project. This could be someone in sales or marketing. This will benefit the external integration.
4. Try to realise a positive return on investment within 6 to 12 months.
5. The whole data mining project should have the support of the top management of the company.
DATA MINING SOFTWARE
As noted earlier, a large variety of DM software is now available. Some of the more widely used packages are:
• IBM - Intelligent Miner and more
• SAS - Enterprise Miner
• Silicon Graphics - MineSet
• Oracle - Thinking Machines - Darwin
• Angoss - knowledgeSEEKER