DATA
MINING
UNIT 1: INTRODUCTION, KNOWLEDGE OF DATA, DATA PROCESSING
Why Data Mining?
Needs of Data Mining
Kinds of Patterns and Technologies in DM
KDD vs. Data Mining
Machine learning concepts
OLAP
Knowledge Representation
Data Pre-Processing (cleaning, integration, reduction, transformation and discretization)
Applications with mining-aspect examples, such as weather prediction.
WHY DATA MINING?
The explosive growth of data: from terabytes to petabytes.
Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society.
Major sources of abundant data:
Business: Web, e-commerce, transactions, stocks, ...
Science: remote sensing, bioinformatics, scientific simulation, ...
Society and everyone: news, digital cameras, YouTube, ...
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets.
EVOLUTION OF SCIENCES
Before 1600: empirical science.
1600-1950s: theoretical science. Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
1950s-1990s: computational science. Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, physics, or linguistics). Computational science traditionally meant simulation; it grew out of our inability to find closed-form solutions for complex mathematical models.
1990-now: data science. The flood of data from new scientific instruments and simulations; the ability to economically store and manage petabytes of data online; the Internet and computing Grid that make all these archives universally accessible. Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
EVOLUTION OF DATABASE TECHNOLOGY
1960s: data collection, database creation, IMS and network DBMS.
1970s: relational data model, relational DBMS implementation.
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), application-oriented DBMS (spatial, scientific, engineering, etc.).
1990s: data mining, data warehousing, multimedia databases, and Web databases.
2000s: stream data management and mining, data mining and its applications, Web technology (XML, data integration) and global information systems.
WHAT IS DATA MINING?
Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
Data mining: a misnomer?
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: is everything "data mining"? Simple search and query processing and (deductive) expert systems are not.
DATA MINING
Data mining is the technology which automates the process of pattern and trend discovery in a data warehousing environment.
MOTIVATION FOR DATA MINING
There is gold in the data waiting to be mined. This gold is in the form of useful patterns waiting to be discovered.
The prime motivation for data mining remains optimal usage of data stored and collected over a period of time.
Useful patterns are actionable, i.e. they suggest decisions that could improve some utility.
MOTIVATION FOR DATA MINING
Example: useful patterns mined from a database or data warehouse that a supermarket chain owner can discover and put to use are shown in the figure below.
DATA WAREHOUSING & DATA MINING TECHNOLOGIES
Data is captured using technologies such as bar codes, radio-frequency identification (RFID) tags, scanned texts, QR codes, digital cameras, satellites, and the Internet.
It is humanly impossible to digest and interpret all the collected data without the help of automated tools.
Data warehousing is a technology that allows one to gather, store, and present data in a form suitable for human exploration. It involves the following:
Data cleaning – removing noise and inconsistent data.
Data integration – bringing data from multiple sources to a single location and common format.
OLAP (on-line analytical processing) tools enable us to explore the stored data along multiple dimensions, at any level of granularity, and manually discover patterns.
KDD (knowledge discovery in databases) is the automatic extraction of novel, understandable, and useful patterns from large stores of data. It also includes post-mining steps such as pattern evaluation and knowledge presentation.
HOW DO DISCOVERED PATTERNS HELP?
The above question is best answered by the following figure.
DATA MODELS
A data model is a description of the organization or the structure of data in an information system. The structure of data can be observed at different levels of abstraction which are Conceptual, Logical, & Physical.
Conceptual Data Model – The manner in which users view the overall structure of the data is denoted as a conceptual data model. Entity-relationship models and ontologies are examples of conceptual data models.
Logical Data Model – The way in which a database system views the overall structure of the data is denoted as a logical data model. E.g. relational, object-oriented, and object-relational models.
Physical Data Model – The way the data is actually stored on a disk (in terms of cylinders and tracks) or on other storage media is denoted as a physical data model.
DATA WAREHOUSING & OLAP: USER'S PERSPECTIVE
Data warehousing: a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data to support the decision-making process of an enterprise. The data sources may be volatile, but the data warehouse is non-volatile and time-variant.
Data warehouse users are analysts who explore data to find patterns. They study how measures are related to dimensions.
The data warehouse is structured in terms of subjects, measurements, and dimensions to facilitate exploration.
The data is organized in a multi-dimensional model, and the data repository is said to be subject-oriented.
NEEDS OF DATA MINING
Nowadays, large quantities of data are being accumulated. The amount of data collected is said to almost double every 9 months.
Seeking knowledge from massive data is one of the most desired capabilities of data mining.
Data can be large in two senses: in terms of size (e.g. image data) or in terms of dimensionality (e.g. gene expression data).
Usually there is a huge gap between the stored data and the knowledge that could be construed from the data. This transition won't occur automatically; that's where data mining comes into the picture.
In exploratory data analysis, some initial knowledge about the data is known, but data mining can help gain a more in-depth knowledge about the data.
KINDS OF PATTERNS AND TECHNOLOGIES IN DM
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of data to be mined, there are two categories of functions involved in data mining:
Descriptive
Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
CLASS/CONCEPT DESCRIPTION: CHARACTERIZATION AND DISCRIMINATION
Classes of items for sale include computers and printers; concepts of customers include bigSpenders and budgetSpenders.
The goal is to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
Data characterization: summarizing the data of the class under study (often called the target class).
E.g. the characteristics of software products with sales that increased by 10% in the previous year.
E.g. summarize the characteristics of customers who spend more than $5000 a year at electronics shops. The result found: they are 40 to 50 years old, employed, and have excellent credit ratings.
Data discrimination: a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
E.g. compare customers who shop for computer products regularly (more than twice a month) against those who rarely shop for such products (less than three times a year). The result: 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university education, while 60% of the customers who infrequently buy such products are either seniors or youths.
Mining Frequent Patterns
A frequent itemset typically refers to a set of items that often appear together in a transactional data set, e.g. milk and bread, which are frequently bought together in grocery stores by many customers.
A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
Mining Associations
An example of such a rule, mined from an electronics shop's transactional database, is
buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]
where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.
A 1% support means that 1% of all the transactions under analysis show that computer and software are purchased together.
This association rule involves a single attribute or predicate (i.e., buys) that repeats. It is referred to as a single-dimensional association rule.
Dropping the predicate notation, the rule can be written simply as "computer ⇒ software [1%, 50%]".
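As a quick illustration of how support and confidence are computed, here is a minimal sketch; the transaction list and the rule_stats helper are invented for illustration, not from the source:

```python
def rule_stats(transactions, lhs, rhs):
    """Support and confidence of the association rule lhs => rhs."""
    n = len(transactions)
    both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    support = both / n                                   # fraction of all transactions containing lhs and rhs
    confidence = both / lhs_count if lhs_count else 0.0  # estimate of P(rhs | lhs)
    return support, confidence

# hypothetical transactions
baskets = [{"computer", "software"}, {"computer"},
           {"printer"}, {"computer", "software", "printer"}]
support, confidence = rule_stats(baskets, {"computer"}, {"software"})
```

On this toy data the rule computer ⇒ software has support 2/4 = 50% and confidence 2/3; the slide's 1%/50% figures would come from a much larger transaction set.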
IN A RELATIONAL DATABASE RELATED TO PURCHASES, A DATA MINING SYSTEM MAY FIND ASSOCIATION RULES LIKE
Of the customers under study, 2% are 20 to 29 years old with an income of $40,000 to $49,000 and have purchased a laptop (computer).
There is a 60% probability that a customer in this age and income group will purchase a laptop. This is an association involving more than one attribute or predicate (i.e., age, income, and buys). The above rule can be referred to as a multidimensional association rule.
Mining of Correlations
This is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two itemsets, to analyze whether they have a positive, negative, or no effect on each other.
Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
CLASSIFICATION AND PREDICTION
Classification is the process of finding a model that describes the data classes or concepts.
The purpose is to be able to use this model to predict the class of objects whose class label is unknown.
The derived model is based on the analysis of sets of training data. The derived model can be presented in the following forms:
Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks
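A classification model in the IF-THEN rule form can be as simple as the following sketch; the thresholds echo the characterization example earlier in the unit, but the function and attribute names are made up for illustration:

```python
def classify(customer):
    """Toy IF-THEN classification rule for the bigSpenders/budgetSpenders concepts."""
    # IF age is 40..50 AND credit rating is excellent THEN bigSpender
    if 40 <= customer["age"] <= 50 and customer["credit"] == "excellent":
        return "bigSpender"
    return "budgetSpender"

label = classify({"age": 45, "credit": "excellent"})  # "bigSpender"
```

A real system would induce many such rules (or a decision tree encoding them) from labeled training data rather than hand-writing them.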
DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES
Data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.
MACHINE LEARNING
Machine learning is a fast-growing discipline.
Supervised learning is basically a synonym for classification.
• E.g.: the postal code recognition problem.
• A set of handwritten postal code images and their corresponding machine-readable translations are used as the training examples, which supervise the learning of the classification model.
Unsupervised learning is essentially a synonym for clustering.
• Eg: an unsupervised learning method can take, as input, a set
of images of handwritten digits.
• It finds 10 clusters of data. These clusters may correspond to the
10 distinct digits of 0 to 9, respectively.
Semi-supervised learning is a class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model.
• Labeled examples are used to learn class models
• Unlabeled examples are used to refine the boundaries between
classes.
• Eg: For a two-class problem, we can think of the set of
examples belonging to one class as the positive examples
and those belonging to the other class as the negative examples.
In the figure, if we do not consider the unlabeled examples, the dashed line is the decision boundary that best partitions the positive examples from the negative examples. Using the unlabeled examples, we can refine the decision boundary to the solid line. Moreover, we can detect that the two positive examples at the top right corner, though labeled, are likely noise or outliers.
Active learning is a machine learning approach that lets
users play an active role in the learning process.
• An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program.
• The goal is to optimize the model quality by actively
acquiring knowledge from human users, given a constraint on
how many examples they can be asked to label.
KNOWLEDGE REPRESENTATION
DATA PRE-PROCESSING
Data Cleaning: Missing Values; Noisy Data; Data Cleaning as a Process
DATA INTEGRATION
Entity Identification Problem; Redundancy and Correlation Analysis; Tuple Duplication; Data Value Conflict Detection and Resolution
DATA REDUCTION
Overview of Data Reduction; Wavelet Transform; Principal Component Analysis; Attribute Subset Selection; Regression and Log-Linear Models; Histograms; Clustering
DATA TRANSFORMATION AND DATA DISCRETIZATION
Data Transformation Strategies Overview; Data Transformation by Normalization; Discretization by Binning; Discretization by Histogram Analysis; Discretization by Cluster, Decision Tree, and Correlation Analyses; Concept Hierarchy Generation for Nominal Data
DATA CLEANING
Importance:
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks:
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
MISSING DATA
Data is not always available. E.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
equipment malfunction
data that was inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not being considered important at the time of entry
failure to register history or changes of the data
Missing data may need to be inferred.
HOW TO HANDLE MISSING DATA?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with:
a global constant: e.g., "unknown", a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or decision tree
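The "attribute mean" and the smarter "class-wise mean" strategies can be sketched in a few lines of plain Python, with None marking a missing value; the function names and data are hypothetical:

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def fill_with_class_mean(rows, attr, label):
    """Smarter: fill a missing attribute with the mean of the tuple's own class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    means = {c: sum(vs) / len(vs) for c, vs in by_class.items()}
    return [dict(r, **{attr: means[r[label]]}) if r[attr] is None else r
            for r in rows]
```

The class-wise version gives a better estimate because tuples of the same class tend to have similar attribute values.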
NOISY DATA
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to: faulty data collection instruments; data entry problems; data transmission problems; technology limitations; inconsistency in naming conventions.
Other data problems which require data cleaning: duplicate records; incomplete data; inconsistent data.
HOW TO HANDLE NOISY DATA?
Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
Regression: smooth by fitting the data to regression functions.
Clustering: detect and remove outliers.
Combined computer and human inspection: detect suspicious values and check them by a human (e.g., deal with possible outliers).
SIMPLE DISCRETIZATION METHODS: BINNING
Equal-width (distance) partitioning:
Divides the range into N intervals of equal size: a uniform grid.
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
The most straightforward, but outliers may dominate the presentation, and skewed data is not handled well.
Equal-depth (frequency) partitioning:
Divides the range into N intervals, each containing approximately the same number of samples.
Good data scaling, but managing categorical attributes can be tricky.
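The equal-width formula W = (B - A)/N translates directly to code; a minimal sketch, where the function name and the sample values are made up:

```python
def equal_width_bin_index(values, n):
    """Assign each value the index of its equal-width interval, W = (B - A) / n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    # the maximum value falls on the last interval's open edge, so clamp it in
    return [min(int((v - a) / w), n - 1) for v in values]

equal_width_bin_index([0, 3, 5, 10], 2)  # width 5: bins [0, 0, 1, 1]
```

Note how a single extreme value would stretch (b - a) and push most data into one bin, which is the outlier sensitivity the slide mentions.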
BINNING METHODS FOR DATA SMOOTHING
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
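The worked example above can be reproduced with a short sketch in plain Python; rounding the bin means to whole dollars follows the slide:

```python
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3
size = len(data) // n_bins
bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]

# smoothing by bin means: each value replaced by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smoothing by bin boundaries: each value moved to the nearer of its bin's min/max
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]
```

Running this reproduces the three smoothed bins shown above (9/23/29 for means; 4,4,4,15 etc. for boundaries).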
REGRESSION
[Figure: data points fitted by the regression line y = x + 1; a value Y1 at X1 is smoothed to the fitted value Y1'.]
CLUSTER ANALYSIS
[Figure: data points grouped into clusters; values falling outside every cluster are treated as outliers.]
DATA INTEGRATION
Data integration: combines data from multiple sources into a coherent store.
Schema integration: integrate metadata from different sources, e.g., A.cust-id ≡ B.cust-#.
Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton.
Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons: different representations, different scales (e.g., metric vs. British units).
HANDLING REDUNDANCY IN DATA INTEGRATION
Redundant data occur often when integrating multiple databases.
Object identification: the same attribute or object may have different names in different databases.
Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue.
Redundant attributes may be detected by correlation analysis.
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
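For numeric attributes, correlation analysis typically means the Pearson correlation coefficient; a value near +1 or -1 flags one attribute as potentially derivable from the other. A minimal sketch on hypothetical data:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric attributes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# e.g. annual revenue exactly derivable from monthly revenue: correlation is essentially 1.0
r = pearson([1, 2, 3, 4], [12, 24, 36, 48])
```

An attribute pair with |r| close to 1 is a candidate for dropping one of the two during integration.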
DATA TRANSFORMATION
Smoothing: remove noise from data.
Aggregation: summarization, data cube construction.
Generalization: concept hierarchy climbing.
Normalization: scale to fall within a small, specified range (min-max normalization, z-score normalization, normalization by decimal scaling).
Attribute/feature construction: new attributes constructed from the given ones.
DATA TRANSFORMATION: NORMALIZATION
Min-max normalization: to [new_min_A, new_max_A]:
v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A
Ex.: Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 - 12,000) / (98,000 - 12,000) × (1.0 - 0) + 0 = 0.716.
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v - μ_A) / σ_A
Ex.: Let μ = 54,000 and σ = 16,000. Then v = 73,600 maps to (73,600 - 54,000) / 16,000 = 1.225.
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
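The three normalization formulas, with the slide's income example, as a runnable sketch (function names are my own):

```python
def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    """Min-max normalization of v from [mn, mx] to [new_mn, new_mx]."""
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

def z_score(v, mean, std):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

min_max(73600, 12000, 98000)   # ~0.716, as in the slide example
z_score(73600, 54000, 16000)   # 1.225
```

Min-max preserves the relationships among the original values but needs known bounds; z-score is preferable when the min and max are unknown or outliers dominate.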
DATA REDUCTION STRATEGIES
Why data reduction? A database/data warehouse may store terabytes of data, and complex data analysis/mining may take a very long time to run on the complete data set.
Data reduction: obtain a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results.
Data reduction strategies:
Data cube aggregation
Dimensionality reduction, e.g., remove unimportant attributes
Data compression
Numerosity reduction, e.g., fit data into models
Discretization and concept hierarchy generation
DATA CUBE AGGREGATION
The lowest level of a data cube (the base cuboid) holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse.
Multiple levels of aggregation in data cubes further reduce the size of the data to deal with.
Reference the appropriate levels: use the smallest representation which is enough to solve the task.
Queries regarding aggregated information should be answered using the data cube, when possible.
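Moving up an aggregation level can be sketched as summing out one dimension of a tiny two-dimensional cube; the city/quarter names and sales figures below are invented:

```python
from collections import defaultdict

# base cuboid: (city, quarter) -> sales
cube = {("Delhi", "Q1"): 10, ("Delhi", "Q2"): 15,
        ("Mumbai", "Q1"): 7, ("Mumbai", "Q2"): 8}

def roll_up(cube, keep):
    """Keep only dimension `keep` (0 = city, 1 = quarter), summing over the rest."""
    out = defaultdict(int)
    for key, sales in cube.items():
        out[key[keep]] += sales
    return dict(out)

roll_up(cube, 1)  # totals per quarter: {'Q1': 17, 'Q2': 23}
```

The rolled-up cuboid is smaller than the base cuboid, which is exactly the reduction the slide describes: queries about quarterly totals no longer need the per-city detail.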
ATTRIBUTE SUBSET SELECTION
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features.
It reduces the number of patterns, which are then easier to understand.
Heuristic methods (due to the exponential number of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction
EXAMPLE OF DECISION TREE INDUCTION
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: a decision tree whose internal nodes test A4, A1, and A6, with leaves labeled Class 1 and Class 2.]
> Reduced attribute set: {A1, A4, A6}
HEURISTIC FEATURE SELECTION METHODS
There are 2^d possible sub-features of d features. Several heuristic feature selection methods exist:
Best single features under the feature independence assumption: choose by significance tests.
Best step-wise feature selection: the best single feature is picked first; then the next best feature conditioned on the first; and so on.
Step-wise feature elimination: repeatedly eliminate the worst feature.
Best combined feature selection and elimination.
Optimal branch and bound: use feature elimination and backtracking.
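Step-wise forward selection can be sketched as a greedy loop; the additive scoring function and per-feature weights below are an invented stand-in for a real class-separability measure:

```python
def forward_select(features, score, k):
    """Greedily add, at each step, the feature that maximizes the subset score."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# hypothetical per-feature usefulness, additive only for illustration
weights = {"A1": 3, "A2": 1, "A3": 1, "A4": 5, "A5": 2, "A6": 4}
score = lambda subset: sum(weights[f] for f in subset)
forward_select(list(weights), score, 3)  # ['A4', 'A6', 'A1']
```

Swapping the direction of the loop (start with all features, repeatedly drop the worst) gives step-wise backward elimination.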
DIMENSIONALITY REDUCTION: PRINCIPAL COMPONENT ANALYSIS (PCA)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data.
Steps:
Normalize the input data: each attribute falls within the same range.
Compute k orthonormal (unit) vectors, i.e., the principal components.
Each input data vector is a linear combination of the k principal component vectors.
The principal components are sorted in order of decreasing "significance" or strength.
Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance. (Using the strongest principal components, it is possible to reconstruct a good approximation of the original data.)
[Figure: axes X1, X2 with principal component directions Y1, Y2.]
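The steps above can be sketched with NumPy via the covariance-eigenvector route; this is a minimal sketch rather than a full implementation (it centers the data but skips full range normalization):

```python
import numpy as np

def pca_project(X, k):
    """Project N x n data onto its k strongest principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # n x n covariance matrix
    vals, vecs = np.linalg.eigh(cov)        # eigenpairs, eigenvalues ascending
    order = np.argsort(vals)[::-1]          # sort components by decreasing variance
    components = vecs[:, order[:k]]
    return Xc @ components                  # N x k reduced representation
```

Projecting onto the single strongest component retains the maximum variance any one direction can capture; the discarded weak (low-variance) components are exactly what data reduction drops.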
KDD VS. DATA MINING
KDD is a process of extracting previously unknown, valid, and actionable (understandable) information from large databases.
Data mining is a step in the KDD process: the application of data analysis and discovery algorithms.
OLAP
Basic idea: converting data into the information that decision makers need.
The concept is to analyze data by multiple dimensions in a structure called a data cube.
The goal of OLAP is to support ad-hoc querying for the business analyst.
Business analysts are familiar with spreadsheets; OLAP extends the spreadsheet analysis model to work with warehouse data. A multidimensional view of data is the foundation of OLAP.
Data Mining Issues
Human interaction - Interfaces are required with both domain and technical experts; a variety of databases and a variety of users lead to numerous data mining techniques; what is required is not known in advance, hence the extraction process needs to be interactive.
Interpretation of results - Experts' requirements and interpretability problems arise; background knowledge or domain expertise is essential to guide the discovery process.
Visualization of results - Visualization helps, but multi-dimensional data is problematic; the discovered knowledge should be expressed in the form of trees, tables, graphs, charts, curves, etc.
Data Mining Issues Continued
Large datasets - Scalability is a problem; algorithms do not scale well with massive real-world datasets; sampling and parallelization are effective tools.
High dimensionality - A conventional database may contain many different attributes, not all of which are relevant; this increases complexity and reduces efficiency (the "dimensionality curse"); remedies include data reduction and dimensionality reduction.
Multimedia data - Found in GIS databases; it proves conventional data mining algorithms ineffective.
Missing data - It is not always possible to ignore missing data, but in preprocessing, data mining algorithms can be used to replace missing data with estimates.
Data Mining Issues Continued
Irrelevant data – Data is reduced by removing irrelevant data.
Noisy data and outliers – Invalid, incorrect data will lead to poor-quality data mining; outliers are very different from the rest and do not fit nicely into the derived model.
Changing data – Data warehouses contain non-volatile data; dynamic data is uploaded and then the algorithms are reapplied.
Integration – KDD requests are one-time needs; data mining functions are now integrated into traditional database systems.
Applications – Effective use of the output of a mining algorithm is a greater challenge than the complexity of the mining algorithm itself.
Data Mining Metrics
How do we measure the effectiveness of the data mining process?
- The KDD process is expensive; the return on investment will be the savings due to decision processes using the results.
- This is difficult to measure and quantify.
- It may be measured as an increase in sales or a reduction in advertising cost.
Social Implications of Data mining
Two sides of the coin
Data mining can be used to improve customer service and satisfaction.
Data mining can be used to confront one's right to privacy.
Omnipresent, invisible data mining affects everyone: profiling is used to label typical characteristics.
DATABASE PROCESSING VS. DATA MINING PROCESSING
Database query: well defined; SQL.
Data mining query: poorly defined; no precise query language.
Database output: a subset of the database.
Data mining output: not a subset of the database.
QUERY EXAMPLES
Database:
– Find all credit applicants with the last name Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data mining:
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
OPERATIONAL VS. INFORMATIONAL
Feature        Operational Data     Data Warehouse
Application    OLTP                 OLAP
Use            Precise queries      Ad hoc
Temporal       Snapshot             Historical
Modification   Dynamic              Static
Orientation    Application          Business
Data           Operational values   Integrated
Size           Gigabits             Terabits
Level          Detailed             Summarized
Access         Often                Less often
Response       Few seconds          Minutes
Data schema    Relational           Star/Snowflake
OLAP
On-Line Analytic Processing (OLAP): provides more complex queries than OLTP.
On-Line Transaction Processing (OLTP): traditional database/transaction processing.
Dimensional data; cube view.
Visualization of operations:
Slice: examine a sub-cube. Dice: rotate the cube to look at another dimension. Roll Up / Drill Down.
OLAP OPERATIONS
Single cell; multiple cells; slice; dice; roll up; drill down.
OLTP VS. OLAP
Feature             OLTP                              OLAP
users               clerk, IT professional            knowledge worker
function            day-to-day operations             decision support
DB design           application-oriented              subject-oriented
data                current, up-to-date; detailed,    historical; summarized,
                    flat relational; isolated         multidimensional; integrated, consolidated
usage               repetitive                        ad-hoc
access              read/write; index/hash            lots of scans
                    on primary key
unit of work        short, simple transaction         complex query
# records accessed  tens                              millions
# users             thousands                         hundreds
DB size             100MB-GB                          100GB-TB
metric              transaction throughput            query throughput, response
DATA MINING APPLICATIONS
Business intelligence (BI) technologies provide historical, current, and predictive views of business operations.
BI includes reporting, online analytical processing, business performance management, competitive intelligence, benchmarking, and predictive analytics.
"How important is business intelligence?" It enables effective market analysis, comparing customer feedback on similar products, discovering the strengths and weaknesses of competitors, retaining highly valuable customers, and making smart business decisions.
Classification and prediction techniques are the core of predictive analytics in business intelligence, for which there are many applications in analyzing markets, supplies, and sales.
Clustering plays a central role in customer relationship management, which groups customers based on their similarities.
Using characterization mining techniques, we can better understand the features of each customer group and develop customized customer reward programs.
WEB SEARCH ENGINE
Crawling
Indexing
Target hits
Top related