Talent Management Data Mining - Discovering Gold in LAP 360 ...
o Lap Mining
-
Upload
atanu-misra -
Category
Documents
-
view
218 -
download
0
Transcript of o Lap Mining
-
8/13/2019 o Lap Mining
1/92
-
8/13/2019 o Lap Mining
2/92
2
Chapter 32 - Objectives
Purpose of online analytical processing (OLAP)and how OLAP differs from data warehousing.
Key features of OLAP applications.
Potential benefits associated with successfulOLAP applications.
Rules for OLAP tools and main types of toolsincluding: multi-dimensional OLAP (MOLAP),relational OLAP (ROLAP), and managed queryenvironment (MQE).
-
8/13/2019 o Lap Mining
3/92
-
8/13/2019 o Lap Mining
4/92
4
Data Warehousing and End-User Access Tools
Accompanying growth in data warehouses isincreasing demands for more powerful accesstools providing advanced analytical
capabilities.
Key developments include:
Online analytical processing (OLAP).
SQL extensions for complex data analysis. Data mining tools.
-
8/13/2019 o Lap Mining
5/92
5
Introducing OLAP
The dynamic synthesis, analysis, and
consolidation of large volumes of multi-
dimensional data, Codd (1993).
Describes a technology that uses a multi-
dimensional view of aggregate data to provide
quick access to strategic information for
purposes of advanced analysis.
-
8/13/2019 o Lap Mining
6/92
6
Introducing OLAP
Enables users to gain a deeper understanding
and knowledge about various aspects of their
corporate data through fast, consistent,
interactive access to a wide variety of possibleviews of the data.
Allows users to view corporate data in such a
way that it is a better model of the true
dimensionality of the enterprise.
-
8/13/2019 o Lap Mining
7/92
7
Introducing OLAP
Can easily answer who? and what?
questions, however, ability to answer what if?
and why? type questions distinguishes OLAP
from general-purpose query tools.
Types of analysis ranges from basic navigation
and browsing (slicing and dicing) tocalculations, to more complex analyses such as
time series and complex modeling.
-
8/13/2019 o Lap Mining
8/92
8
OLAP Benchmarks
OLAP Council published an analytical
processing benchmark referred to as the APB-
1 (OLAP Council, 1998).
Aim is to measure a servers overall OLAP
performance rather than the performance of
individual tasks.
-
8/13/2019 o Lap Mining
9/92
9
OLAP Benchmarks
APB-1 assesses most common business operationsincluding:
bulk loading of data from internal/external datasources;
incremental loading of data from operational systems;
aggregation of input/level data along hierarchies;
calculation of new data based on business models;
time series analysis;
queries with a high degree of complexity;
drill-down through hierarchies;
ad hocqueries;
multiple online sessions.
-
8/13/2019 o Lap Mining
10/92
10
OLAP Benchmarks
OLAP applications are judged on their abilityto provide just-in-time (JIT) information, acore requirement of supporting effectivedecision-making.
Assessing a servers ability to satisfy thisrequirement is more than measuring
processing performance but includes itsabilities to model complex businessrelationships and to respond to changingbusiness requirements.
-
8/13/2019 o Lap Mining
11/92
11
OLAP Benchmarks
APB-1 uses a standard benchmark metric
called AQM (Analytical Queries per Minute).
AQM represents number of analytical queries
processed per minute including data loading
and computation time. Thus, AQM
incorporates data loading performance,calculation performance, and query
performance into a singe metric.
-
8/13/2019 o Lap Mining
12/92
12
OLAP Benchmarks
Publication of APB-1 benchmark results must
include both the database schema and all code
required for executing the benchmark.
An essential requirement of all OLAP
applications is ability to provide users with JIT
information, to make effective decisions aboutan organizations strategic directions.
-
8/13/2019 o Lap Mining
13/92
13
OLAP Applications
JIT information is computed data that usually
reflects complex relationships and is often
calculated on the fly.
Also, as data relationships may not be knownin advance, the data model must be flexible.
-
8/13/2019 o Lap Mining
14/92
14
Examples of OLAP Applications in Various
Functional Areas
-
8/13/2019 o Lap Mining
15/92
15
OLAP Applications
Although OLAP applications are found in
widely divergent functional areas, all have
following key features:
multi-dimensional views of data;
support for complex calculations;
time intelligence.
-
8/13/2019 o Lap Mining
16/92
-
8/13/2019 o Lap Mining
17/92
17
OLAP Applications - Support for Complex
Calculations
Must provide a range of powerful
computational methods such as that required
by sales forecasting, which uses trend
algorithms such as moving averages andpercentage growth.
Mechanisms for implementing computationalmethods should be clear and non-procedural.
-
8/13/2019 o Lap Mining
18/92
18
OLAP ApplicationsTime Intelligence
Key feature of almost any analyticalapplication as performance is almost always
judged over time.
Time hierarchy is not always used in samemanner as other hierarchies.
Concepts such as year-to-date and period-over-period comparisons should be easily defined.
-
8/13/2019 o Lap Mining
19/92
-
8/13/2019 o Lap Mining
20/92
20
Representing Multi-Dimensional Data
Example of two-dimensional query.
What is the total revenue generated by property
sales in each city, in each quarter of 1997?
Choice of representation is based on types of
queries end-user may ask.
Compare representation - three-field relationaltable versus two-dimensional matrix.
-
8/13/2019 o Lap Mining
21/92
21
Multi-Dimensional Data as Three-Field Table
versus Two-Dimensional Matrix
-
8/13/2019 o Lap Mining
22/92
22
Representing Multi-Dimensional Data
Example of three-dimensional query.
What is the total revenue generated by property
sales for each type of property (Flat or House) in
each city, in each quarter of 1997?
Compare representation - four-field relational
table versus three-dimensional cube.
-
8/13/2019 o Lap Mining
23/92
23
Multi-Dimensional Data as Four-Field
Table versus Three-Dimensional Cube
-
8/13/2019 o Lap Mining
24/92
24
Representing Multi-Dimensional Data
Cube represents data as cells in an array.
Relational table only represents multi-
dimensional data in two dimensions.
-
8/13/2019 o Lap Mining
25/92
25
Multi-Dimensional OLAP Servers
Use multi-dimensional structures to store data
and relationships between data.
Multi-dimensional structures are bestvisualized as cubes of data, and cubes within
cubes of data. Each side of cube is a dimension.
A cube can be expanded to include otherdimensions.
-
8/13/2019 o Lap Mining
26/92
26
Multi-Dimensional OLAP Servers
A cube supports matrix arithmetic.
Multi-dimensional query response time
depends on how many cells have to be addedon the fly.
As number of dimensions increases, number of
the cubes cells increases exponentially.
-
8/13/2019 o Lap Mining
27/92
27
Multi-Dimensional OLAP Servers
However, majority of multi-dimensionalqueries use summarized, high-level data.
Solution is to pre-aggregate (consolidate) alllogical subtotals and totals along alldimensions.
Pre-aggregation is valuable, as typical
dimensions are hierarchical in nature. (e.g. Time dimension hierarchy - years, quarters,
months, weeks, and days)
-
8/13/2019 o Lap Mining
28/92
28
Multi-Dimensional OLAP Servers
Predefined hierarchy allows logical pre-
aggregation and, conversely, allows for a
logical drill-down.
Supports common analytical operations
Consolidation.
Drill-down.
Slicing and dicing.
-
8/13/2019 o Lap Mining
29/92
29
Multi-Dimensional OLAP Servers
Consolidation - aggregation of data such assimple roll-ups or complex expressionsinvolving inter-related data.
Drill-Down - is reverse of consolidation andinvolves displaying the detailed data thatcomprises the consolidated data.
Slicing and Dicing - (also called pivoting) refersto the ability to look at the data from differentviewpoints.
-
8/13/2019 o Lap Mining
30/92
30
Multi-Dimensional OLAP servers
Can store data in a compressed form by
dynamically selecting physical storage
organizations and compression techniques that
maximize space utilization.
Dense data (i.e., data that exists for high
percentage of cells) can be stored separately
from sparse data (i.e., significant percentage of
cells are empty).
-
8/13/2019 o Lap Mining
31/92
31
Multi-Dimensional OLAP Servers
Ability to omit empty or repetitive cells can
greatly reduce the size of the cube and the
amount of processing.
Allows analysis of exceptionally large amounts
of data.
-
8/13/2019 o Lap Mining
32/92
32
Multi-Dimensional OLAP Servers
In summary, pre-aggregation, dimensional
hierarchy, and sparse data management can
significantly reduce the size of the cube and the
need to calculate values on-the-fly.
Removes need for multi-table joins and
provides quick and direct access to arrays of
data, thus significantly speeding up execution
of multi-dimensional queries.
-
8/13/2019 o Lap Mining
33/92
33
Codds Rules for OLAP Systems
In 1993, E.F. Codd formulated twelve rules as
the basis for selecting OLAP tools.
Multi-dimensional conceptual view
Transparency
Accessibility
Consistent reporting performance
Client-server architectureGeneric dimensionality
-
8/13/2019 o Lap Mining
34/92
34
Codds Rules for OLAP
Dynamic sparse matrix handling
Multi-user support
Unrestricted cross-dimensional operations
Intuitive data manipulation
Flexible reporting
Unlimited dimensions and aggregation levels.
-
8/13/2019 o Lap Mining
35/92
35
Codds Rules for OLAP Systems
There are proposals to re-define or extend the
rules. For example, to also include:
Comprehensive database management tools.
Ability to drill down to detail (source record) level. Incremental database refresh.
SQL interface to the existing enterprise
environment.
-
8/13/2019 o Lap Mining
36/92
36
Categories of OLAP Tools
OLAP tools are categorized according to the
architecture of the underlying database.
Three main categories of OLAP tools include Multi-dimensional OLAP (MOLAP or MD-OLAP)
Relational OLAP (ROLAP), also called multi-
relational OLAP
Managed query environment (MQE)
-
8/13/2019 o Lap Mining
37/92
37
Multi-Dimensional OLAP (MOLAP)
Uses specialized data structures and multi-
dimensional Database Management Systems
(MDDBMSs) to organize, navigate, and
analyze data.
Data is typically aggregated and stored
according to predicted usage to enhance query
performance.
-
8/13/2019 o Lap Mining
38/92
38
Multi-Dimensional OLAP (MOLAP)
Use array technology and efficient storage
techniques that minimize the disk space
requirements through sparse data
management.
Provides excellent performance when data is
used as designed, and the focus is on data for a
specific decision-support application.
-
8/13/2019 o Lap Mining
39/92
39
Multi-Dimensional OLAP (MOLAP)
Traditionally, require a tight coupling with the
application layer and presentation layer.
Recent trends segregate the OLAP from the
data structures through the use of published
application programming interfaces (APIs).
-
8/13/2019 o Lap Mining
40/92
40
Typical Architecture for MOLAP Tools
-
8/13/2019 o Lap Mining
41/92
41
MOLAP Tools - Development Issues
Underlying data structures are limited in their
ability to support multiple subject areas and to
provide access to detailed data.
Navigation and analysis of data is limited
because the data is designed according to
previously determined requirements.
-
8/13/2019 o Lap Mining
42/92
42
MOLAP Tools - Development Issues
MOLAP products require a different set of
skills and tools to build and maintain the
database, thus increasing the cost and
complexity of support.
-
8/13/2019 o Lap Mining
43/92
43
Relational OLAP (ROLAP)
Fastest growing style of OLAP technology.
Supports RDBMS products using a metadata
layer - avoids need to create a static multi-
dimensional data structure - facilitates the
creation of multiple multi-dimensional views of
the two-dimensional relation.
-
8/13/2019 o Lap Mining
44/92
44
Relational OLAP (ROLAP)
To improve performance, some products use
SQL engines to support complexity of multi-
dimensional analysis, while others recommend,
or require, the use of highly denormalizeddatabase designs such as the star schema.
-
8/13/2019 o Lap Mining
45/92
45
Typical Architecture for ROLAP Tools
-
8/13/2019 o Lap Mining
46/92
46
ROLAP Tools - Development Issues
Middleware to facilitate the development of
multi-dimensional applications. (Software that
converts the two-dimensional relation into a
multi-dimensional structure).
Development of an option to create persistent,
multi-dimensional structures with facilities to
assist in the administration of these structures.
-
8/13/2019 o Lap Mining
47/92
47
Hybrid OLAP (HOLAP)
Can use data from either a RDBMS directly or
a multi-dimension server.
-
8/13/2019 o Lap Mining
48/92
48
Managed Query Environment (MQE)
Relatively new development.
Provide limited analysis capability, either
directly against RDBMS products, or by usingan intermediate MOLAP server.
-
8/13/2019 o Lap Mining
49/92
49
Managed Query Environment (MQE)
Deliver selected data directly from DBMS or
via a MOLAP server to desktop (or local
server) in form of a datacube, where it is
stored, analyzed, and maintained locally.
Promoted as being relatively simple to install
and administer with reduced cost and
maintenance.
-
8/13/2019 o Lap Mining
50/92
50
Typical Architecture for MQE Tools
-
8/13/2019 o Lap Mining
51/92
51
MQE Tools - Development Issues
Architecture results in significant dataredundancy and may cause problems fornetworks that support many users.
Ability of each user to build a custom datacubemay cause a lack of data consistency amongusers.
Only a limited amount of data can beefficiently maintained.
-
8/13/2019 o Lap Mining
52/92
52
OLAP Extensions to SQL
SQL promoted as easy to learn, non-procedural, free-format, DBMS-independent,and international standard.
However, major disadvantage has been inabilityto represent many of the questions mostcommonly asked by business analysts.
IBM and Oracle jointly proposed OLAPextensions to SQL early in 1999, adopted as anamendment to SQL.
-
8/13/2019 o Lap Mining
53/92
-
8/13/2019 o Lap Mining
54/92
54
OLAP Extensions to SQL - RISQL
Designed for business analysts.
Set of extensions that augments SQL with a
variety of powerful operations appropriate todata analysis and decision-support applications
such as ranking, moving averages,
comparisons, market share, this year versus
last year.
-
8/13/2019 o Lap Mining
55/92
55
Use of the RISQL CUME Function
Show the quarterly sales for branch office B003,along with the monthly year-to-date figures.
SELECT quarter, quarterlySales, CUME(quarterlySales)
AS Year-to-DateFROM BranchSales
WHERE branchNo = B003;
-
8/13/2019 o Lap Mining
56/92
56
Use of the RISQL MOVINGAVG / MOVINGSUM
Function
Show the first six monthly sales for branch office
B003 without the effect of seasonality.
SELECT month, monthlySales,
MOVINGAVG(monthlySales) AS 3-MonthMovingAvg,MOVINGSUM(monthlySales) AS 3-MonthMovingSum
FROM BranchSales
WHERE branchNo = B003;
-
8/13/2019 o Lap Mining
57/92
57
Data Mining
The process of extracting valid, previously
unknown, comprehensible, and actionable
information from large databases and using it
to make crucial business decisions (Simoudis,1996).
Involves analysis of data and use of software
techniques for finding hidden and unexpected
patterns and relationships in sets of data.
-
8/13/2019 o Lap Mining
58/92
58
Data Mining
Reveals information that is hidden andunexpected, as little value in finding patternsand relationships that are already intuitive.
Patterns and relationships are identified byexamining the underlying rules and features inthe data.
Tends to work from the data up and most
accurate results normally require largevolumes of data to deliver reliable conclusions.
-
8/13/2019 o Lap Mining
59/92
59
Data Mining
Starts by developing an optimal representationof structure of sample data, during which timeknowledge is acquired and extended to largersets of data.
Data mining can provide huge paybacks forcompanies who have made a significantinvestment in data warehousing.
Relatively new technology, however alreadyused in a number of industries.
-
8/13/2019 o Lap Mining
60/92
60
Examples of Applications of Data Mining
Retail / Marketing
Identifying buying patterns of customers.
Finding associations among customer
demographic characteristics.Predicting response to mailing campaigns.
Market basket analysis.
-
8/13/2019 o Lap Mining
61/92
61
Examples of Applications of Data Mining
Banking
Detecting patterns of fraudulent credit card
use.
Identifying loyal customers.
Predicting customers likely to change their
credit card affiliation.
Determining credit card spending bycustomer groups.
-
8/13/2019 o Lap Mining
62/92
62
Examples of Applications of Data Mining
Insurance
Claims analysis.
Predicting which customers will buy new policies.
Medicine
Characterizing patient behavior to predict surgery
visits.
Identifying successful medical therapies fordifferent illnesses.
-
8/13/2019 o Lap Mining
63/92
63
Data Mining Operations
Four main operations include: Predictive modeling.
Database segmentation.
Link analysis.
Deviation detection.
There are recognized associations between theapplications and the corresponding operations.
e.g. Direct marketing strategies use databasesegmentation.
-
8/13/2019 o Lap Mining
64/92
64
Data Mining Techniques
Techniques are specific implementations of the
data mining operations.
Each operation has its own strengths and
weaknesses.
Data mining tools sometimes offer a choice of
operations to implement a technique.
-
8/13/2019 o Lap Mining
65/92
65
Data Mining Techniques
Criteria for selection of tool includes
Suitability for certain input data types.
Transparency of the mining output.
Tolerance of missing variable values. Level of accuracy possible.
Ability to handle large volumes of data.
Data Mining Operations and Associated
-
8/13/2019 o Lap Mining
66/92
66
Data Mining Operations and Associated
Techniques
-
8/13/2019 o Lap Mining
67/92
67
Predictive Modeling
Similar to the human learning experience
uses observations to form a model of the important
characteristics of some phenomenon.
Uses generalizations of real world and abilityto fit new data into a general framework.
Can analyze a database to determine essential
characteristics (model) about the data set.
-
8/13/2019 o Lap Mining
68/92
68
Predictive Modeling
Model is developed using a supervised learning
approach, which has two phases: training and
testing.
Training builds a model using a large sample ofhistorical data called a training set.
Testing involves trying out the model on new,
previously unseen data to determine its accuracy
and physical performance characteristics.
-
8/13/2019 o Lap Mining
69/92
69
Predictive Modeling
Applications of predictive modeling include
customer retention management, credit
approval, cross selling, and direct marketing.
Two techniques associated with predictive
modeling: classification and value prediction,
distinguished by nature of the variable being
predicted.
-
8/13/2019 o Lap Mining
70/92
70
Predictive Modeling - Classification
Used to establish a specific predetermined class
for each record in a database from a finite set
of possible class values.
Two specializations of classification: tree
induction and neural induction.
-
8/13/2019 o Lap Mining
71/92
71
Example of Classification using Tree Induction
Example of Classification using Neural
-
8/13/2019 o Lap Mining
72/92
72
Example of Classification using Neural
Induction
-
8/13/2019 o Lap Mining
73/92
73
Predictive Modeling - Value Prediction
Used to estimate a continuous numeric value
that is associated with a database record.
Uses the traditional statistical techniques oflinear regression and nonlinear regression.
Relatively easy to use and understand.
-
8/13/2019 o Lap Mining
74/92
74
Predictive Modeling - Value Prediction
Linear regression attempts to fit a straight line
through a plot of the data, such that the line is
the best representation of the average of all
observations at that point in the plot.
Problem is that the technique only works well
with linear data and is sensitive to the presence
of outliers (i.e., data values, which do notconform to the expected norm).
-
8/13/2019 o Lap Mining
75/92
75
Predictive Modeling - Value Prediction
Although nonlinear regression avoids the main
problems of linear regression, still not flexible
enough to handle all possible shapes of the data
plot.
Statistical measurements are fine for building
linear models that describe predictable data
points, however, most data is not linear in
nature.
-
8/13/2019 o Lap Mining
76/92
76
Predictive Modeling - Value Prediction
Data mining requires statistical methods that
can accommodate non-linearity, outliers, and
non-numeric data.
Applications of value prediction include credit
card fraud detection or target mailing list
identification.
-
8/13/2019 o Lap Mining
77/92
-
8/13/2019 o Lap Mining
78/92
78
Database Segmentation
Less precise than other operations thus lesssensitive to redundant and irrelevant features.
Sensitivity can be reduced by ignoring a subset
of the attributes that describe each instance orby assigning a weighting factor to eachvariable.
Applications of database segmentation includecustomer profiling, direct marketing, and crossselling.
Example of Database Segmentation using a
-
8/13/2019 o Lap Mining
79/92
79
Example of Database Segmentation using a
Scatterplot
-
8/13/2019 o Lap Mining
80/92
80
Database Segmentation
Associated with demographic or neural
clustering techniques, distinguished by:
Allowable data inputs.
Methods used to calculate the distancebetween records.
Presentation of the resulting segments for
analysis.
-
8/13/2019 o Lap Mining
81/92
81
Link Analysis
Aims to establish links (associations) betweenrecords, or sets of records, in a database.
There are three specializations
Associations discovery.Sequential pattern discovery.
Similar time sequence discovery.
Applications include product affinity analysis,direct marketing, and stock price movement.
-
8/13/2019 o Lap Mining
82/92
82
Link Analysis - Associations Discovery
Finds items that imply the presence of other
items in the same event.
Affinities between items are represented by
association rules.
e.g. When customer rents property for more than 2
years and is more than 25 years old, in 40% of cases,
customer will buy a property. Association happens in
35% of all customers who rent properties.
-
8/13/2019 o Lap Mining
83/92
83
Link Analysis - Sequential Pattern Discovery
Finds patterns between events such that the
presence of one set of items is followed by
another set of items in a database of events
over a period of time. e.g. Used to understand long-term customer buying
behavior.
Link Analysis Similar Time Sequence
-
8/13/2019 o Lap Mining
84/92
84
Link Analysis - Similar Time Sequence
Discovery
Finds links between two sets of data that are
time-dependent, and is based on the degree of
similarity between the patterns that both time
series demonstrate.
e.g. Within three months of buying property, new
home owners will purchase goods such as cookers,
freezers, and washing machines.
-
8/13/2019 o Lap Mining
85/92
85
Deviation Detection
Relatively new operation in terms of
commercially available data mining tools.
Often a source of true discovery because itidentifies outliers, which express deviation
from some previously known expectation and
norm.
-
8/13/2019 o Lap Mining
86/92
86
Deviation Detection
Can be performed using statistics and
visualization techniques or as a by-product of
data mining.
Applications include fraud detection in the use
of credit cards and insurance claims, quality
control, and defects tracing.
Example of Database Segmentation using a
-
8/13/2019 o Lap Mining
87/92
87
p S g g
Visualization
-
8/13/2019 o Lap Mining
88/92
88
Data Mining Tools
There are a growing number of commercialdata mining tools on the marketplace.
Important characteristics of data mining toolsinclude:
Data preparation facilities.
Selection of data mining operations.
Product scalability and performance. Facilities for visualization of results.
-
8/13/2019 o Lap Mining
89/92
89
Data Mining and Data Warehousing
Major challenge to exploit data mining is
identifying suitable data to mine.
Data mining requires single, separate, clean,integrated, and self-consistent source of data.
-
8/13/2019 o Lap Mining
90/92
90
Data Mining and Data Warehousing
A data warehouse is well equipped for
providing data for mining.
Data quality and consistency is a prerequisitefor mining to ensure the accuracy of the
predictive models. Data warehouses are
populated with clean, consistent data.
-
8/13/2019 o Lap Mining
91/92
91
Data Mining and Data Warehousing
Advantageous to mine data from multiplesources to discover as many interrelationships
as possible. Data warehouses contain data from
a number of sources.
Selecting relevant subsets of records and fields
for data mining requires query capabilities of
the data warehouse.
-
8/13/2019 o Lap Mining
92/92
Data Mining and Data Warehousing
Results of a data mining study are useful if
there is some way to further investigate the
uncovered patterns. Data warehouses provide
capability to go back to the data source.