Chapter 1
Introduction

Data mining is often defined as finding hidden information in a database; it is also described as exploratory data analysis, data-driven discovery, or deductive learning. Data mining access of a database differs from traditional access in several ways:
• Query: The query might not be well formed or precisely stated. The data miner might not even be exactly sure of what he wants to see.
• Data: The data accessed are usually a different version of the original operational database. The data have been cleansed and modified to better support the mining process.
• Output: The output of the data mining query probably is not a subset of the database. Instead, it is the output of some analysis of the contents of the database.
Data Mining Algorithms

Data mining algorithms attempt to fit a model to the data. They examine the data and determine the model that is closest to the characteristics of the data being examined. Such algorithms can be characterized as consisting of three parts:
• Model: The purpose of the algorithm is to fit a model to the data. What attributes should be used to define what class structure?
• Preference: Some criteria must be used to prefer one model over another. Preference is given to the model that fits the data best.
• Search: All algorithms require some technique to search the data. The criteria needed to fit the data to the classes must be properly defined.
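The three parts above can be sketched concretely. The following is a minimal illustration, not any particular algorithm from the literature: the model is a made-up family of lines y = a·x, the preference criterion is the sum of squared errors, and the search is a simple grid over candidate slopes. The data points and the candidate list are invented for illustration.

```python
# A minimal sketch of the three parts of a data mining algorithm,
# using an illustrative linear model family y = a * x:
#   model      - the family of candidate models (lines through the origin),
#   preference - a criterion ranking models (sum of squared errors, lower is better),
#   search     - a technique exploring the candidates (here, a simple grid search).

def sum_squared_error(a, points):
    """Preference criterion: how badly y = a * x fits the points."""
    return sum((y - a * x) ** 2 for x, y in points)

def grid_search(points, candidates):
    """Search: try each candidate slope, keep the preferred (lowest-error) one."""
    return min(candidates, key=lambda a: sum_squared_error(a, points))

# Data generated by y = 2x, so the search should prefer a = 2.
points = [(1, 2), (2, 4), (3, 6)]
best_a = grid_search(points, candidates=[0.5, 1.0, 1.5, 2.0, 2.5])
print(best_a)  # 2.0
```

Real algorithms differ mainly in how rich the model family is, what preference criterion they optimize, and how cleverly they search; the division of labor is the same.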
• A predictive model makes a prediction about values of data using known results found from other (historical) data.
• A descriptive model identifies patterns or relationships in data. It serves as a way to explore the properties of the data examined, not to predict new properties.
1.1 Basic Data Mining Models and Tasks

Predictive Models
• Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before examining the data.
• Regression is used to map a data item to a real-valued prediction variable. Regression assumes that the target data fit some known type of function (e.g., linear, logistic) and determines the best function of this type that models the given data. In actuality, regression involves learning the function that performs this mapping.
• Time series analysis examines the value of an attribute as it varies over time (obtained at evenly spaced points in time). There are three basic functions performed in time series analysis: 1) the similarity between different time series is determined using distance measures; 2) the structure of the line is examined to determine (and perhaps classify) its behavior; 3) future values are predicted using the historical time series plot.
• Prediction predicts future data states based on past and current data. Prediction can also be viewed as a type of classification.
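The classification task above can be sketched with a one-nearest-neighbor classifier: a new item is assigned to the predefined class of the closest training example. The toy height data and the class labels are made up for illustration; this is one simple classifier among many.

```python
# A minimal sketch of classification as supervised learning: a 1-nearest-neighbor
# classifier assigns a new item to the predefined class of its closest training
# example (classes exist before any new data are examined).

def classify_1nn(training, x):
    """Return the class of the training example nearest to x."""
    nearest = min(training, key=lambda item: abs(item[0] - x))
    return nearest[1]

# Training data: (height in cm, predefined class).
training = [(150, "short"), (160, "short"), (175, "tall"), (185, "tall")]
print(classify_1nn(training, 152))  # short
print(classify_1nn(training, 180))  # tall
```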
Descriptive Models
• Clustering is similar to classification except that the groups are not predefined but rather defined by the data alone. Clustering is usually accomplished by determining the similarity among the data on predefined attributes. The most similar data are grouped into clusters.
• Summarization extracts or derives representative information about the database. It maps data into subsets with associated simple descriptions. It is also called characterization or generalization.
• Association rules (link analysis, affinity analysis, or association) refer to uncovering relationships among data. An association rule is a model that identifies specific types of data associations. These are not causal relationships, and there is no guarantee that an association will apply in the future.
• Sequence discovery is used to determine sequential patterns in data. These patterns are based on time (a sequence of actions). Temporal association rules fall into this category.
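The association rule task above can be illustrated by evaluating a candidate rule X → Y with the two standard measures, support and confidence. The basket transactions and the example rule are made up for illustration.

```python
# A minimal sketch of evaluating an association rule X -> Y over
# market-basket transactions, using the two standard measures:
#   support    - fraction of transactions containing X and Y together,
#   confidence - fraction of transactions containing X that also contain Y.

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)

transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
# Rule {bread} -> {butter}: bread and butter appear together in 2 of 4 baskets,
# and bread appears in 3 of 4, so confidence is (2/4) / (3/4) = 2/3.
print(support(transactions, {"bread", "butter"}))       # 0.5
print(confidence(transactions, {"bread"}, {"butter"}))  # 0.666...
```

High confidence only says the items co-occur; as noted above, it establishes no causal relationship.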
Knowledge Discovery Steps
Data Mining Issues
• Human interaction: Experts are needed to formulate the queries, identify the data, and interpret the desired results.
• Overfitting: Occurs when the model does not fit future states. This may be caused by assumptions made about the data or simply by the small size of the training database.
• Outliers.
• Interpretation of results: The output may require an expert to interpret it correctly.
• Large databases: Sampling and parallelization are effective tools for attacking the scalability problem.
• High dimensionality: One solution to this problem is to reduce the number of attributes, which is known as dimensionality reduction.
• Multimedia data, missing data, irrelevant data, noisy data, changing data.
• Integration and application: Business practices may have to be modified to determine how to effectively use the information uncovered.
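The overfitting issue above can be sketched with an extreme case: a model that simply memorizes a small training database fits it perfectly yet fails on future states, while a simpler model with a reasonable assumption generalizes. The toy data and the "true rule" (label = x > 3) are invented for illustration.

```python
# A minimal sketch of overfitting: a lookup table memorizes the small
# training set (perfect fit) but does not fit future, unseen states.

train = [(1, False), (2, False), (5, True)]   # small training database
future = [(3, False), (4, True), (6, True)]   # future states, true rule: x > 3

memorized = dict(train)                       # overfit model: pure memorization

def accuracy(model, data):
    """Fraction of (x, y) pairs the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

overfit = lambda x: memorized.get(x, False)   # unseen inputs get a default guess
simple = lambda x: x > 3                      # simpler model with an assumption

print(accuracy(overfit, train))   # 1.0       (perfect on the training data)
print(accuracy(overfit, future))  # 0.333...  (fails to fit future states)
print(accuracy(simple, future))   # 1.0
```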
Data Mining Metrics
• From an overall business perspective, a measure such as return on investment (ROI) can be used. ROI examines the difference between what the data mining technique costs and what the savings or benefits from its use are. These could be measured as increased sales, reduced advertising expenditure, or both.
• The metrics used also include the traditional metrics of space and time based on complexity analysis. In some cases, such as accuracy in classification, more specific metrics targeted to the data mining task may be used.
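Both kinds of metric above are straightforward to compute. The sketch below shows a task-specific metric (classification accuracy) next to a business metric (ROI); the spam labels and the cost/benefit figures are made up for illustration.

```python
# A minimal sketch of two data mining metrics:
#   classification accuracy - a task-specific quality measure,
#   ROI                     - an overall business measure.

def classification_accuracy(predicted, actual):
    """Fraction of predictions that match the true classes."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def roi(benefit, cost):
    """Return on investment: net gain relative to what the technique cost."""
    return (benefit - cost) / cost

print(classification_accuracy(["spam", "ham", "spam"],
                              ["spam", "ham", "ham"]))  # 0.666...
print(roi(benefit=150_000, cost=100_000))               # 0.5
```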
Cross-Industry Standard Process Model for Data Mining (CRISP-DM)

The process lifecycle consists of:
• business understanding,
• data understanding,
• data preparation,
• modeling,
• evaluation,
• deployment.
ETL, Online Analytical Processing (OLAP), and Business Intelligence (BI)
Examples of Data Mining Applications
• Healthcare: Mining healthcare data can identify best practices that improve care and reduce costs. Mining can be used to predict the volume of patients in each category, to find best practices for diagnosis, and to identify the most effective treatments.
• Market Basket Analysis may allow the retailer to understand the purchase behavior of a buyer.
• Education. Learning patterns of students can be captured and used to develop techniques to teach them.
• Manufacturing Engineering. Discovering patterns in product architecture, product portfolio, and customer needs data. Predicting product development span time, cost, or dependencies among tasks.
• Customer Relationship Management (CRM) and customer segmentation are used to implement customer-focused strategies for acquiring and retaining customers and improving customer loyalty.
• Fraud Detection, image analysis, facial and speech recognition.
• Financial Banking. Finding patterns, causalities, and correlations in business information and market prices.
• Research in bioinformatics, biology, medicine, and neuroscience: gene finding, protein function inference, protein and gene interaction network reconstruction, data cleansing, and protein subcellular location prediction.
• The Human Genome Project. Scientists use microarray data to look at gene expression, and sophisticated data analysis techniques are employed to account for background noise and the normalization of data.
Information Flow Diagram
References:
Dunham, Margaret H. Data Mining: Introductory and Advanced Topics. Pearson Education, Inc., 2003.