Chapter 1
Introduction

Data mining is often defined as finding hidden information in a database; it is also described as exploratory data analysis, data-driven discovery, or deductive learning. Data mining access of a database differs from traditional access in several ways:
• Query: The query might not be well formed or precisely stated. The data miner might not even be exactly sure of what he wants to see.
• Data: The data accessed are usually a different version of the original operational database. The data have been cleansed and modified to better support the mining process.
• Output: The output of the data mining query probably is not a subset of the database. Instead, it is the output of some analysis of the contents of the database.
Data Mining Algorithms

Data mining algorithms attempt to fit a model to the data. They examine the data and determine the model that is closest to the characteristics of the data being examined. Such algorithms can be characterized as consisting of three parts:
• Model: The purpose of the algorithm is to fit a model to the data. What attributes should be used to define what class structure?
• Preference: Some criteria must be used to prefer one model over another. Preference is given to the model that fits the data best.
• Search: All algorithms require some technique to search the data. The criteria needed to fit the data to the classes must be properly defined.
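The three parts above can be sketched concretely. The following is a minimal illustration, not any particular algorithm from the literature: the model is a made-up family of lines y = a·x, the preference criterion is the sum of squared errors, and the search is a simple grid over candidate slopes. The data points and the candidate list are invented for illustration.

```python
# A minimal sketch of the three parts of a data mining algorithm,
# using an illustrative linear model family y = a * x:
#   model      - the family of candidate models (lines through the origin),
#   preference - a criterion ranking models (sum of squared errors, lower is better),
#   search     - a technique exploring the candidates (here, a simple grid search).

def sum_squared_error(a, points):
    """Preference criterion: how badly y = a * x fits the points."""
    return sum((y - a * x) ** 2 for x, y in points)

def grid_search(points, candidates):
    """Search: try each candidate slope, keep the preferred (lowest-error) one."""
    return min(candidates, key=lambda a: sum_squared_error(a, points))

# Data generated by y = 2x, so the search should prefer a = 2.
points = [(1, 2), (2, 4), (3, 6)]
best_a = grid_search(points, candidates=[0.5, 1.0, 1.5, 2.0, 2.5])
print(best_a)  # 2.0
```

Real algorithms differ mainly in how rich the model family is, what preference criterion they optimize, and how cleverly they search; the division of labor is the same.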
• A predictive model makes a prediction about values of data using known results found from other (historical) data.
• A descriptive model identifies patterns or relationships in data. It serves as a way to explore the properties of the data examined, not to predict new properties.
1.1 Basic Data Mining Models and Tasks

Predictive Models
• Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before examining the data.
• Regression is used to map a data item to a real-valued prediction variable. Regression assumes that the target data fit some known type of function (e.g., linear, logistic) and determines the best function of this type that models the given data. In actuality, regression involves learning the function that performs this mapping.
• Time series analysis examines the value of an attribute as it varies over time (obtained at evenly spaced points in time). There are three basic functions performed in time series analysis: 1) the similarity between different time series is determined using distance measures; 2) the structure of the line is examined to determine (and perhaps classify) its behavior; 3) future values are predicted using the historical time series plot.
• Prediction predicts future data states based on past and current data. Prediction can also be viewed as a type of classification.
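The classification task above can be sketched with a one-nearest-neighbor classifier: a new item is assigned to the predefined class of the closest training example. The toy height data and the class labels are made up for illustration; this is one simple classifier among many.

```python
# A minimal sketch of classification as supervised learning: a 1-nearest-neighbor
# classifier assigns a new item to the predefined class of its closest training
# example (classes exist before any new data are examined).

def classify_1nn(training, x):
    """Return the class of the training example nearest to x."""
    nearest = min(training, key=lambda item: abs(item[0] - x))
    return nearest[1]

# Training data: (height in cm, predefined class).
training = [(150, "short"), (160, "short"), (175, "tall"), (185, "tall")]
print(classify_1nn(training, 152))  # short
print(classify_1nn(training, 180))  # tall
```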
Descriptive Models
• Clustering is similar to classification except that the groups are not predefined but rather defined by the data alone. Clustering is usually accomplished by determining the similarity among the data on predefined attributes. The most similar data are grouped into clusters.
• Summarization extracts or derives representative information about the database. It maps data into subsets with associated simple descriptions. It is also called characterization or generalization.
• Association rules (link analysis, affinity analysis, or association) refer to uncovering relationships among data. An association rule is a model that identifies specific types of data associations. These are not causal relationships, and there is no guarantee that an association will apply in the future.
• Sequence discovery is used to determine sequential patterns in data. These patterns are based on time (a sequence of actions). Temporal association rules fall into this category.
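The association rule task above can be illustrated by evaluating a candidate rule X → Y with the two standard measures, support and confidence. The basket transactions and the example rule are made up for illustration.

```python
# A minimal sketch of evaluating an association rule X -> Y over
# market-basket transactions, using the two standard measures:
#   support    - fraction of transactions containing X and Y together,
#   confidence - fraction of transactions containing X that also contain Y.

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)

transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
# Rule {bread} -> {butter}: bread and butter appear together in 2 of 4 baskets,
# and bread appears in 3 of 4, so confidence is (2/4) / (3/4) = 2/3.
print(support(transactions, {"bread", "butter"}))       # 0.5
print(confidence(transactions, {"bread"}, {"butter"}))  # 0.666...
```

High confidence only says the items co-occur; as noted above, it establishes no causal relationship.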
Knowledge Discovery Steps
Data Mining Issues
• Human interaction: Experts are needed to formulate the queries, identify the data, and interpret the desired results.
• Overfitting: Occurs when the model does not fit future states. This may be caused by assumptions made about the data or simply by the small size of the training database.
• Outliers.
• Interpretation of results: The output may require an expert to interpret it correctly.
• Large databases: Sampling and parallelization are effective tools for attacking the scalability problem.
• High dimensionality: One solution to this problem is to reduce the number of attributes, which is known as dimensionality reduction.
• Multimedia data, missing data, irrelevant data, noisy data, changing data.
• Integration and application: Business practices may have to be modified to determine how to effectively use the information uncovered.
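The overfitting issue above can be sketched with an extreme case: a model that simply memorizes a small training database fits it perfectly yet fails on future states, while a simpler model with a reasonable assumption generalizes. The toy data and the "true rule" (label = x > 3) are invented for illustration.

```python
# A minimal sketch of overfitting: a lookup table memorizes the small
# training set (perfect fit) but does not fit future, unseen states.

train = [(1, False), (2, False), (5, True)]   # small training database
future = [(3, False), (4, True), (6, True)]   # future states, true rule: x > 3

memorized = dict(train)                       # overfit model: pure memorization

def accuracy(model, data):
    """Fraction of (x, y) pairs the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

overfit = lambda x: memorized.get(x, False)   # unseen inputs get a default guess
simple = lambda x: x > 3                      # simpler model with an assumption

print(accuracy(overfit, train))   # 1.0       (perfect on the training data)
print(accuracy(overfit, future))  # 0.333...  (fails to fit future states)
print(accuracy(simple, future))   # 1.0
```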
Data Mining Metrics
• From an overall business perspective, a measure such as return on investment (ROI) can be used. ROI examines the difference between what the data mining technique costs and what the savings or benefits from its use are. These could be measured as increased sales, reduced advertising expenditure, or both.
• The metrics used also include the traditional metrics of space and time based on complexity analysis. In some cases, such as accuracy in classification, more specific metrics targeted to the data mining task may be used.
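Both kinds of metric above are straightforward to compute. The sketch below shows a task-specific metric (classification accuracy) next to a business metric (ROI); the spam labels and the cost/benefit figures are made up for illustration.

```python
# A minimal sketch of two data mining metrics:
#   classification accuracy - a task-specific quality measure,
#   ROI                     - an overall business measure.

def classification_accuracy(predicted, actual):
    """Fraction of predictions that match the true classes."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def roi(benefit, cost):
    """Return on investment: net gain relative to what the technique cost."""
    return (benefit - cost) / cost

print(classification_accuracy(["spam", "ham", "spam"],
                              ["spam", "ham", "ham"]))  # 0.666...
print(roi(benefit=150_000, cost=100_000))               # 0.5
```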
Cross-Industry Standard Process Model for Data Mining (CRISP-DM)

The process lifecycle consists of:
• business understanding,
• data understanding,
• data preparation,
• modeling,
• evaluation,
• deployment.
ETL, Online Analytical Processing (OLAP), and Business Intelligence (BI)
Examples of Data Mining Applications
• Healthcare: Mining healthcare data can identify best practices that improve care and reduce costs. Mining can be used to predict the volume of patients in each category, to find best practices for diagnosis, and to identify the most effective treatments.
• Market Basket Analysis may allow the retailer to understand the purchase behavior of a buyer.
• Education. Learning patterns of students can be captured and used to develop techniques to teach them.
• Manufacturing Engineering. Discovering patterns in product architecture, product portfolio, and customer needs data. Predicting product development span time, cost, or dependencies among tasks.
• Customer Relationship Management (CRM) and customer segmentation are used to implement customer-focused strategies for acquiring and retaining customers and improving customer loyalty.
• Fraud Detection, image analysis, facial and speech recognition.
• Financial Banking. Finding patterns, causalities, and correlations in business information and market prices.
• Research in bioinformatics, biology, medicine, and neuroscience: gene finding, protein function inference, protein and gene interaction network reconstruction, data cleansing, and protein subcellular location prediction.
• The Human Genome Project. Scientists use microarray data to look at gene expression, and sophisticated data analysis techniques are employed to account for background noise and the normalization of data.
Information Flow Diagram
References:
Dunham, Margaret H. Data Mining: Introductory and Advanced Topics. Pearson Education, Inc., 2003.