
Copyright © 2008 by Maurizio Pighin

prof. Maurizio Pighin

e-mail: [email protected]

Dipartimento di Matematica e Informatica

Università di Udine - Italy

Data Warehousing and elements of Data Mining

Slide 2

Motivation: “Necessity is the Mother of Invention”

• Data explosion problem
  – Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases and other information repositories
• Data are difficult to analyze
  – Complex queries, long analysis times
• We are drowning in data, but starving for knowledge!
• Solution: data warehousing and data mining
  – Data warehousing and on-line analytical processing
  – Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Slide 3

Evolution of Database Technology

• 1960s: Data collection, database creation, IMS and network DBMS
• 1970s: Relational data model, relational DBMS implementation
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases

Slide 4

Evolution of data analysis

• 1960s: Batch reports
  – Difficult to find and analyze data
  – Expensive: every request needs a new report (many systems today still offer only this kind of analysis)
• 1970s: First procedures to support the decision process
  – Usually very limited and not integrated with office automation tools
• 1980s: Office automation tools
  – Query tools, spreadsheets, GUIs
  – Access to operational data (usually very complex)
• 1990s: Data warehousing and data mining

Slide 5

Data Warehousing and Data Mining

• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• OLAP analysis
• From data warehousing to data mining
• Principles of data mining

Slide 6

What is a Data Warehouse?

• Defined in many different ways, but not rigorously.
  – A decision support database that is maintained separately from the organization’s operational database
  – Supports information processing by providing a solid platform of consolidated, historical data for analysis.

Slide 7

What is a Data Warehouse?

• “A data warehouse is a subject-oriented, integrated, time-variant, and non volatile collection of data in support of management’s decision-making process.” - W. H. Inmon (1985)
• “A single, complete and consistent data warehouse, obtained by different sources, available to final users to be immediately utilized” - IBM Systems Journal (1990)
• Data warehousing:
  – The process of constructing and using data warehouses

Slide 8

Data Warehouse - Subject-Oriented

• Organized around major subjects, such as customer, product, sales.
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Slide 9

Data Warehouse - Integrated

• Constructed by integrating multiple, heterogeneous data sources
  – Relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., hotel price: currency, tax, breakfast covered, etc.
  – When data is moved to the warehouse, it is converted.

Slide 10

Data Warehouse - Time Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems.
  – Operational database: current value data.
  – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  – Contains an element of time, explicitly or implicitly
  – But the key of operational data may or may not contain a “time element”.

Slide 11

Data Warehouse - Non-Volatile

• A physically separate store of data transformed from the operational environment.
• Operational update of data does not occur in the data warehouse environment.
  – Does not require transaction processing, recovery, and concurrency control mechanisms
  – Requires only two operations in data accessing:
    • initial loading of data
    • access of data.

Slide 12

Data Warehouse

• Data analysis system characteristics: FASMI (OLAP Report, 1995)
  – Fast
  – Analytical
  – Shared
  – Multidimensional
  – Informational

Slide 13

Why do we need all that?

• Operational databases are for On Line Transaction Processing (OLTP)
  – automate day-to-day operations (purchasing, banking, etc.)
  – transactions access (and modify!) a few records at a time
  – database design is application (process) oriented
  – metric: transactions/sec

Slide 14

Why do we need all that?

• Data Warehouse is for On Line Analytical Processing (OLAP)
  – complex queries that access millions of records
  – need historical data for trend analysis
  – long scans would interfere with normal operations
  – synchronizing data-intensive queries among physically separated databases would be a nightmare!
  – metric: query response time

Slide 15

Examples of OLAP

• Comparisons (this period vs. last period)
  – Show me the sales per region for this year and compare them with those of the previous year to identify discrepancies
• Multidimensional ratios (percent to total)
  – Show me the contribution to weekly profit made by all items sold in the northeast stores between May 1 and May 7
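The two query patterns above can be sketched in a few lines of Python. This is only an illustration, not the way an OLAP server evaluates such queries: the `SALES` rows, the region names and the function names are hypothetical and were invented for the example.

```python
from collections import defaultdict

# Hypothetical fact rows (region, year, sales); the figures are made up.
SALES = [
    ("northeast", 2007, 120.0),
    ("northeast", 2008, 150.0),
    ("south", 2007, 200.0),
    ("south", 2008, 180.0),
    ("south", 2008, 20.0),
]


def totals_by_region_and_year(rows):
    """Sum the sales measure for every (region, year) pair."""
    totals = defaultdict(float)
    for region, year, amount in rows:
        totals[(region, year)] += amount
    return dict(totals)


def compare_periods(rows, this_year, last_year):
    """Comparison: this period vs. last period, per region."""
    totals = totals_by_region_and_year(rows)
    report = {}
    for region in sorted({region for region, _, _ in rows}):
        current = totals.get((region, this_year), 0.0)
        previous = totals.get((region, last_year), 0.0)
        report[region] = {
            "current": current,
            "previous": previous,
            "change": current - previous,
        }
    return report


def percent_of_total(rows, year):
    """Multidimensional ratio: each region's share of one year's total."""
    per_region = {
        region: value
        for (region, y), value in totals_by_region_and_year(rows).items()
        if y == year
    }
    grand_total = sum(per_region.values())
    if grand_total == 0:
        return {region: 0.0 for region in per_region}
    return {
        region: 100.0 * value / grand_total
        for region, value in per_region.items()
    }
```

Calling `compare_periods(SALES, 2008, 2007)` gives one entry per region with the current total, the previous total and their difference; `percent_of_total(SALES, 2008)` gives each region's percentage of the yearly total.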

Slide 16

Examples of OLAP

• Ranking and statistical profiles (top N/bottom N)
  – Show me sales, profit and average call volume per day for my 10 most profitable salespeople
• Custom consolidation (market segments, ad hoc groups)
  – Show me an abbreviated income statement by quarter for the last four quarters for my northeast region operations

Slide 17

Data Warehouse vs. Heterogeneous DBMS

• Traditional heterogeneous DB integration:
  – Build wrappers/mediators on top of heterogeneous databases
  – Query-driven approach
    • When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
    • Requires complex information filtering, and queries compete for resources at the local sites
• Data warehouse: update-driven, high performance
  – Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Slide 18

Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
  – Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  – Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  – System orientation: process vs. business subject
  – Data contents: current, detailed vs. historical, consolidated
  – Database design: ER + application vs. multidimensional + subject
  – View: current, local vs. evolutionary, integrated
  – Access patterns: update vs. read-only but complex queries

Slide 19

OLTP vs. OLAP

Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day-to-day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response

Slide 20

Why a Separate Data Warehouse?

• High performance for both systems
  – DBMS, tuned for OLTP: access methods, indexing, concurrency control, recovery
  – Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
• Different functions and different data:
  – missing data: decision support requires historical data which operational DBs do not typically maintain
  – data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  – data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Slide 21

Data Warehousing and Data Mining

• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• OLAP analysis
• From data warehousing to data mining
• Principles of data mining

Slide 22

Multidimensional model

• A data warehouse is based on a multidimensional data model which views data in the form of a data cube (hypercube)
• A hypercube is a multidimensional array which represents a particular event
• We call a “fact” a point of this multidimensional array, obtained by crossing existing coordinates
  – Dimension: a coordinate of the fact
  – Measure: a numerical value characterizing the event

Slide 23

Multidimensional model - example

• A data cube, such as sales, allows numerical data (measures) to be modeled and viewed in multiple dimensions
  – Measures, such as transaction value (dollars_sold) or quantity (item_quantity)
  – Dimensions, such as item (item_name, brand, type), time (day, week, month, quarter, year), or customer (customer_name, city, region, state)
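A minimal sketch of this idea in Python follows. Each fact carries its dimension coordinates and its measures, and a view of the cube is obtained by fixing some coordinates and summing over the rest. The attribute names come from the slide; the `SaleFact` type, the `view` function and all the values are assumptions made for illustration.

```python
from typing import NamedTuple


class SaleFact(NamedTuple):
    # Dimension coordinates (one attribute per dimension, for brevity).
    item_name: str
    month: str
    city: str
    # Measures.
    dollars_sold: float
    item_quantity: int


# Hypothetical facts; the values are invented for illustration.
FACTS = [
    SaleFact("TV", "2008-01", "Udine", 500.0, 1),
    SaleFact("TV", "2008-02", "Udine", 1000.0, 2),
    SaleFact("PC", "2008-01", "Udine", 900.0, 1),
    SaleFact("TV", "2008-01", "Trieste", 500.0, 1),
]


def view(facts, **coordinates):
    """Sum the measures of the facts matching the given coordinates.

    Dimensions that are not mentioned in `coordinates` are aggregated over.
    """
    selected = [
        fact
        for fact in facts
        if all(getattr(fact, name) == value for name, value in coordinates.items())
    ]
    return {
        "dollars_sold": sum(fact.dollars_sold for fact in selected),
        "item_quantity": sum(fact.item_quantity for fact in selected),
    }
```

For example, `view(FACTS, item_name="TV")` aggregates over time and city, while `view(FACTS, item_name="TV", month="2008-01", city="Udine")` addresses a single cell.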

Slide 24

Measures

• Every fact can contain more than one measure
• A measure may be
  – Stored in the data warehouse (effective)
  – Evaluated at run time from effective measures
  – Implicit (presence or absence of a fact)

Slide 25

Fact aggregation

• It is possible to aggregate elementary facts to obtain synthetic facts
• The measures of the synthetic facts can be obtained with aggregation operators
  – Sum, mean, max, min, …
• For each measure-dimension pair it is possible to define a different aggregation operator

Slide 26

Fact aggregation

• A measure can be
  – Additive: it can be aggregated by sum along every dimension (for instance, total income)
  – Semi-additive: it can be aggregated by sum along some dimensions but not along others (for instance, quantity can be summed on “item” but not on “store”, where different items are present)
  – Non-additive: it can never be summed and other operators must be used (mean, median, max, min) (for instance, unit price)
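The distinction can be made concrete with a small sketch. The rows, the column layout and the function names below are hypothetical. Note that the semi-additive measure used here is a stock level (summable across stores within one month, but not across months), which is a common textbook case and our assumption, not the example given on the slide.

```python
from statistics import mean

# Hypothetical rows: (store, month, income, stock_level, unit_price).
ROWS = [
    ("A", "Jan", 100.0, 10, 2.0),
    ("A", "Feb", 150.0, 8, 2.5),
    ("B", "Jan", 80.0, 5, 2.0),
    ("B", "Feb", 70.0, 7, 3.0),
]

# Additive: income can be summed along every dimension.
total_income = sum(row[2] for row in ROWS)


def stock_in_month(rows, month):
    """Semi-additive: summing stock across stores is meaningful..."""
    return sum(row[3] for row in rows if row[1] == month)


def average_stock(rows, store):
    """...but along time a different operator (here the mean) is needed."""
    return mean(row[3] for row in rows if row[0] == store)


# Non-additive: unit prices are never summed; use mean, min, max, etc.
average_unit_price = mean(row[4] for row in ROWS)
highest_unit_price = max(row[4] for row in ROWS)
```

This is also why an aggregation operator is chosen per measure-dimension pair: the same measure may use a sum along one dimension and a mean along another.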

Slide 27

Dimension hierarchy

• Hierarchy
  – Set of dimensional attributes hierarchically linked to one dimension
  – Dimensional attributes
    • Are used to aggregate elementary facts
    • Are uniquely determined by a dimension
    • Represent a “classification” of the dimension

Slide 28

Example of dimension hierarchy

[Figure: location hierarchy with levels all > region > country > city > office]

all
  Europe
    Germany
      Frankfurt ...
    Spain ...
  North_America
    Canada
      Toronto ...
      Vancouver
        L. Chan, M. Wind ...
    Mexico ...
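Aggregating facts along such a hierarchy can be sketched as follows. The mappings reuse the city, country and region names of the location example; the sales figures and the function names are hypothetical, and the office level is left out to keep the sketch short.

```python
from collections import defaultdict

# Classification of the dimension members, one mapping per hierarchy step.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}
COUNTRY_TO_REGION = {"Canada": "North_America", "Germany": "Europe"}

LEVELS = ("city", "country", "region", "all")

# Hypothetical elementary facts: sales per city.
SALES_BY_CITY = {"Vancouver": 10.0, "Toronto": 20.0, "Frankfurt": 5.0}


def ancestor(city, level):
    """Return the member of `level` that the given city belongs to."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if level == "city":
        return city
    country = CITY_TO_COUNTRY[city]
    if level == "country":
        return country
    if level == "region":
        return COUNTRY_TO_REGION[country]
    return "all"


def roll_up(sales_by_city, level):
    """Aggregate elementary facts up to the requested hierarchy level."""
    totals = defaultdict(float)
    for city, amount in sales_by_city.items():
        totals[ancestor(city, level)] += amount
    return dict(totals)
```

Because every city maps to exactly one country and every country to exactly one region, each elementary fact contributes to exactly one group at every level.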

Slide 29

View of Warehouses and Hierarchies

Slide 30

Multidimensional Data

• Sales volume as a function of Product, Location, and Time

[Figure: cube with axes Product, Location and Time]

Dimensions: Product, Location, Time

Hierarchical summarization paths:
• Product: Industry > Category > Item
• Location: Region > Country > City > Office
• Time: Year > Quarter > Month or Week > Day

Slide 31

Data Warehousing and Data Mining

• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• OLAP analysis
• From data warehousing to data mining
• Principles of data mining

Slide 32

OLAP Server Architectures

• Relational OLAP (ROLAP)
  – Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support the missing pieces
  – Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  – Greater scalability

Slide 33

OLAP Server Architectures

• Multidimensional OLAP (MOLAP)
  – Array-based multidimensional storage engine (sparse matrix techniques)
  – Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP)
  – User flexibility, e.g., low level: relational, high level: array

Slide 34

Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures on ROLAP systems
  – Star schema: a fact table in the middle connected to a set of dimension tables
  – Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  – Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema

Slide 35

Components of Star Schema

• Fact tables contain factual or quantitative data
• Dimension tables contain descriptions about the subjects of the business
• 1:N relationship between dimension tables and fact tables
• Dimension tables are denormalized to maximize performance
• Excellent for ad-hoc queries, but bad for online transaction processing

Slide 36

Star Schema example

[Figure: star schema whose fact table provides statistics for sales broken down by product, period and store dimensions]

Slide 37

Star Schema with sample data

Slide 38

Another example of Star Schema

Sales Fact Table
• Keys: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales

Dimension tables
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_street, country
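A reduced version of this star schema can be tried out with Python's built-in `sqlite3` module. The sketch keeps only the time and item dimensions and two measures; the table names `time_dim`, `item_dim` and `sales_fact`, the function names and all the sample rows are assumptions made for the example, not part of the original schema.

```python
import sqlite3


def build_star_schema():
    """Create a small in-memory star schema with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE time_dim (
            time_key INTEGER PRIMARY KEY,
            day INTEGER, month INTEGER, quarter TEXT, year INTEGER
        );
        CREATE TABLE item_dim (
            item_key INTEGER PRIMARY KEY,
            item_name TEXT, brand TEXT, type TEXT
        );
        -- The fact table holds one foreign key per dimension plus the measures.
        CREATE TABLE sales_fact (
            time_key INTEGER REFERENCES time_dim(time_key),
            item_key INTEGER REFERENCES item_dim(item_key),
            units_sold INTEGER,
            dollars_sold REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
        [(1, 15, 1, "Q1", 2008), (2, 20, 4, "Q2", 2008)],
    )
    conn.executemany(
        "INSERT INTO item_dim VALUES (?, ?, ?, ?)",
        [(1, "TV-100", "Acme", "TV"), (2, "PC-200", "Acme", "PC")],
    )
    conn.executemany(
        "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
        [(1, 1, 2, 800.0), (1, 2, 1, 1000.0), (2, 1, 3, 1200.0)],
    )
    return conn


def dollars_by_quarter_and_type(conn):
    """Typical star join: the fact table joined to its dimensions, then grouped."""
    return conn.execute(
        """
        SELECT t.quarter, i.type, SUM(f.dollars_sold)
        FROM sales_fact AS f
        JOIN time_dim AS t ON t.time_key = f.time_key
        JOIN item_dim AS i ON i.item_key = f.item_key
        GROUP BY t.quarter, i.type
        ORDER BY t.quarter, i.type
        """
    ).fetchall()
```

The query shape is the same for any combination of dimensions: join the fact table to the dimension tables that are needed, group by dimensional attributes, and aggregate the measures.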

Slide 39

Example of Snowflake Schema

Sales Fact Table
• Keys: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales

Dimension tables
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_key
  – supplier: supplier_key, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city_key
  – city: city_key, city, province_or_street, country

Slide 40

Example of Fact Constellation

Sales Fact Table
• Keys: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
• Keys: time_key, item_key, shipper_key, from_location, to_location
• Measures: dollars_cost, units_shipped

Dimension tables
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_street, country
• shipper: shipper_key, shipper_name, location_key, shipper_type

Slide 41

Main Data Warehouse Architectures

• Architectures
  – Generic Two-Level Architecture
  – Independent Data Mart
  – Dependent Data Mart and Operational Data Store - Three-Level Architecture
• All involve some form of extraction, transformation and loading (ETL)

Slide 42

Generic Two-Level Data Warehousing Architecture

[Figure: source systems feed the warehouse through extract (E), transform (T) and load (L) steps]

• One company-wide warehouse
• Periodic extraction: data in the warehouse is not completely current

Slide 43

Independent Data Mart Data Warehousing Architecture

[Figure: source systems feed several data marts, each through its own extract (E), transform (T) and load (L) steps]

• Data marts: mini-warehouses, limited in scope
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts

Slide 44

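Dependent Data Mart with Operational Data Store: Three-Level Architecture

[Figure: source systems feed the enterprise data warehouse through extract (E), transform (T) and load (L) steps; the data marts are fed from the warehouse]

• Single ETL for the enterprise data warehouse (EDW)
• Dependent data marts loaded from the EDW
• Simpler data access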

Slide 45

General Architecture

[Figure: layered architecture, from left to right]
• Data Sources: operational DBs, other sources
• Extract, Transform, Load, Refresh
• Data Storage: data warehouse, data marts, metadata, monitor & integrator
• OLAP Engine: OLAP server
• Front-End: analysis, query, reports, data mining

Slide 46

General Architecture

• Enterprise warehouse
  – collects all of the information about subjects spanning the entire organization
• Data mart
  – a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  – Independent vs. dependent (fed directly from the warehouse) data mart

Slide 47

ETL function

• Data extraction:
  – get data from multiple, heterogeneous, and external sources
• Data cleaning:
  – detect errors in the data and rectify them when possible
• Data transformation:
  – convert data from legacy or host format to warehouse format
• Load:
  – sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh:
  – propagate the updates from the data sources to the warehouse
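The first four steps above can be sketched as a tiny pipeline. Everything in it is hypothetical: the two sources, their field names, the date formats and the choice to simply drop records with a missing amount are assumptions made to keep the example short, and the refresh step is not covered.

```python
from datetime import datetime

# Two hypothetical sources that encode the same kind of fact differently.
SOURCE_A = [
    {"date": "2008-01-15", "amount_eur": "100.0"},
    {"date": "2008-01-16", "amount_eur": ""},  # missing amount
]
SOURCE_B = [
    {"day": "17/01/2008", "amount_cents": 2500},
]


def extract():
    """Extraction: collect raw records, remembering where each came from."""
    return [("A", rec) for rec in SOURCE_A] + [("B", rec) for rec in SOURCE_B]


def clean(records):
    """Cleaning: drop records whose amount is missing."""
    kept = []
    for source, rec in records:
        amount = rec.get("amount_eur") if source == "A" else rec.get("amount_cents")
        if amount not in (None, ""):
            kept.append((source, rec))
    return kept


def transform(records):
    """Transformation: convert every record to (ISO date, amount in euros)."""
    rows = []
    for source, rec in records:
        if source == "A":
            rows.append((rec["date"], float(rec["amount_eur"])))
        else:
            day = datetime.strptime(rec["day"], "%d/%m/%Y").date().isoformat()
            rows.append((day, rec["amount_cents"] / 100.0))
    return rows


def load(warehouse, rows):
    """Load: add the rows in sorted order and recompute a summary."""
    warehouse["facts"].extend(sorted(rows))
    warehouse["total"] = sum(amount for _, amount in warehouse["facts"])
    return warehouse


warehouse = load({"facts": [], "total": 0.0}, transform(clean(extract())))
```

After the run, `warehouse["facts"]` holds the two valid records in a single format, and `warehouse["total"]` is the consolidated amount.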

Slide 48

Data Warehousing and Data Mining

• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• OLAP analysis
• From data warehousing to data mining
• Principles of data mining

Slide 49

Design of a Data Warehouse: A Business Analysis Framework

• Four views regarding the design of a data warehouse
  – Top-down view
    • allows selection of the relevant information necessary for the data warehouse
  – Data source view
    • exposes the information being captured, stored, and managed by operational systems
  – Data warehouse view
    • consists of fact tables and dimension tables
  – Business query view
    • sees the data in the warehouse from the point of view of the end user

Slide 50

Data Warehouse Design Process

• Top-down, bottom-up approaches or a combination of both
  – Top-down: starts with overall design and planning (mature)
  – Bottom-up: starts with experiments and prototypes (rapid)
• From the software engineering point of view
  – Waterfall: structured and systematic analysis at each step before proceeding to the next (top-down)
  – Spiral: rapid generation of increasingly functional systems, short turnaround time (bottom-up)

Slide 51

Data Warehouse Design Process

• Typical data warehouse design process with the bottom-up approach
  – Choose a business process to model, e.g., orders, invoices, etc.
  – Choose the grain (atomic level of data) of the business process
  – Choose the dimensions that will apply to each fact table record
  – Choose the measures that will populate each fact table record
  – Design the architecture of the DW
  – Design the ETL
  – Install and test
• Advantages
  – Results in a short time
  – Not too expensive
  – Gives the management a clear perspective of the OLAP world

Slide 52

Data Warehousing and Data Mining

• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• OLAP analysis
• From data warehousing to data mining
• Principles of data mining

Slide 53

Exploration of Data Cubes

• OLAP
  – Interactive navigation through data
• Two models
  – Hypothesis-driven: exploration driven by hypotheses formulated by the user
  – Discovery-driven: measures indicating exceptions are pre-computed and guide the user in the data analysis, at all levels of aggregation. Users then continue with hypothesis-driven exploration

Slide 54

A Sample Data Cube

[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]

Slide 55

Typical OLAP Operations

• Roll up (drill-up): summarize data
  – by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  – from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions
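A sketch of roll-up by dimension reduction and of drill-down by introducing a dimension is given below. The cube is a plain dictionary at the finest grain; its contents and the function names are hypothetical. Note that drill-down is computed from the detailed data, since a summary alone cannot be split back into its parts.

```python
from collections import defaultdict

DIMENSIONS = ("product", "country", "quarter")

# Hypothetical base cube at the finest grain: coordinates -> sales.
BASE_CUBE = {
    ("TV", "U.S.A", "1Qtr"): 10.0,
    ("TV", "U.S.A", "2Qtr"): 12.0,
    ("TV", "Canada", "1Qtr"): 4.0,
    ("PC", "U.S.A", "1Qtr"): 7.0,
}


def aggregate(base_cube, keep):
    """Group the detailed cells by the dimensions in `keep`, summing the measure."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for coords, value in base_cube.items():
        totals[tuple(coords[i] for i in positions)] += value
    return dict(totals)


def roll_up(base_cube, current, drop):
    """Roll up: remove one dimension from the current view."""
    return aggregate(base_cube, [d for d in current if d != drop])


def drill_down(base_cube, current, add):
    """Drill down: introduce one more dimension into the current view."""
    return aggregate(base_cube, [d for d in DIMENSIONS if d in current or d == add])
```

Starting from the view on `("product",)`, `drill_down(BASE_CUBE, ["product"], "country")` returns sales per product and country, and `roll_up(BASE_CUBE, ["product", "country"], "country")` goes back to sales per product.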

Slide 56

Roll-up/Drill-down

[Figure: views over the dimensions Date, Product and Country; each Roll-up step replaces one dimension with All, and each Drill-Down step restores it]

Slide 57

OLAP Operations

[Figure: drill-down]

Slide 58

OLAP Operations

[Figure: drill-down]

Slide 59

OLAP Operations

[Figure: drill-down]

Slide 60

OLAP Operations

[Figure: roll-up]

Slide 61

OLAP Operations

[Figure: roll-up]

Slide 62

OLAP Operations

[Figure: roll-up]

Slide 63

OLAP Operations

• Slice and Dice: select and project on one or more dimensions

[Figure: cube with axes product, country and date, selected on customer = “Smith”]
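Both operations can be sketched on a dictionary-based cube. The convention used here, that a slice fixes one dimension to a single value and a dice restricts several dimensions to sets of values, is the usual one but is our assumption; the cube contents and function names are hypothetical.

```python
DIMENSIONS = ("product", "country", "quarter")

# Hypothetical cube: coordinates -> sales.
CUBE = {
    ("TV", "U.S.A", "1Qtr"): 10.0,
    ("TV", "U.S.A", "2Qtr"): 12.0,
    ("TV", "Canada", "1Qtr"): 4.0,
    ("PC", "U.S.A", "1Qtr"): 7.0,
}


def slice_cube(cube, dimension, value):
    """Slice: select one value of a dimension and project that dimension away."""
    i = DIMENSIONS.index(dimension)
    return {
        coords[:i] + coords[i + 1:]: measure
        for coords, measure in cube.items()
        if coords[i] == value
    }


def dice(cube, allowed):
    """Dice: keep the cells whose coordinates fall in the allowed value sets.

    `allowed` maps a dimension name to the set of values to keep; the result
    is a smaller cube with the same dimensions.
    """
    return {
        coords: measure
        for coords, measure in cube.items()
        if all(coords[DIMENSIONS.index(d)] in values for d, values in allowed.items())
    }
```

A slice on `quarter = "1Qtr"` yields a two-dimensional table indexed by product and country, whereas a dice keeps all three dimensions and only shrinks their ranges.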

Slide 64

Slice

[Figure: a cube with dimensions Date (4 quarters), Country and Product is sliced into a cube with dimensions Date (2 quarters), Country and Product]

Slide 65

OLAP Operations

[Figure: slice-and-dice]

Slide 66

OLAP Operations

[Figure: slice-and-dice]

Slide 67

OLAP Operations

[Figure: slice-and-dice]

Slide 68

OLAP Operations

• Pivot (rotate):
  – reorient the cube visualization, e.g. from 3D to a series of 2D planes.
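The two aspects of pivoting mentioned above, cutting a 3D cube into 2D planes and rotating a plane, can be sketched as follows. The cube, its axis order (product, store, time) and the function names are hypothetical.

```python
# Hypothetical cube with coordinates (product, store, time) -> sales.
CUBE = {
    ("TV", "S1", "Q1"): 1.0,
    ("TV", "S2", "Q1"): 2.0,
    ("PC", "S1", "Q2"): 3.0,
}


def planes(cube, axis):
    """View a 3D cube as a series of 2D planes, one per value on `axis`."""
    result = {}
    for coords, value in cube.items():
        remaining = coords[:axis] + coords[axis + 1:]
        result.setdefault(coords[axis], {})[remaining] = value
    return result


def pivot(plane):
    """Rotate a 2D plane: rows become columns and columns become rows."""
    return {(col, row): value for (row, col), value in plane.items()}
```

Pivoting changes only the orientation of the presentation: the set of cells and their values stay the same.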

Slide 69

OLAP Operations

[Figure: the same cube shown with its Product, Store and Time axes in different orientations, each obtained from another by a Pivot]

Slide 70

OLAP Operations

[Figure: pivoting]

Slide 71

OLAP Operations

[Figure: pivoting]

Slide 72

OLAP Operations

[Figure: pivoting]

Slide 73

OLAP Operations

• Drill across: a query that involves (goes across) more than one fact table
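A drill-across can be sketched by aggregating each fact table on the dimensions the tables share and then lining up the results. The two tables below loosely follow the sales and shipping fact tables of the fact constellation example; the rows and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of two fact tables sharing the time and item dimensions.
SALES_FACT = [
    {"time_key": 1, "item_key": 10, "units_sold": 5},
    {"time_key": 1, "item_key": 10, "units_sold": 3},
    {"time_key": 2, "item_key": 10, "units_sold": 4},
]
SHIPPING_FACT = [
    {"time_key": 1, "item_key": 10, "units_shipped": 8},
    {"time_key": 2, "item_key": 11, "units_shipped": 6},
]


def total_by_shared_dimensions(rows, measure):
    """Aggregate one fact table on the shared (time_key, item_key) coordinates."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["time_key"], row["item_key"])] += row[measure]
    return dict(totals)


def drill_across(sales_rows, shipping_rows):
    """Combine measures of both fact tables on their common dimensions."""
    sold = total_by_shared_dimensions(sales_rows, "units_sold")
    shipped = total_by_shared_dimensions(shipping_rows, "units_shipped")
    return {
        key: {"units_sold": sold.get(key, 0), "units_shipped": shipped.get(key, 0)}
        for key in sorted(set(sold) | set(shipped))
    }
```

Aggregating each table separately before combining them avoids double counting, which would occur if the two fact tables were joined row by row.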

Slide 74

OLAP Operations

[Figure: drill-across]

Slide 75

OLAP Operations

[Figure: drill-across]

Slide 76

Exploration of Data Cubes

• Hypothesis-driven
  – exploration by the user, huge search space
• Discovery-driven
  – Pre-computed measures indicating exceptions guide the user in the data analysis, at all levels of aggregation
  – Exception: a value significantly different from the value anticipated, based on a statistical model
  – Visual cues such as background color are used to reflect the degree of exception of each cell
  – Computation of the exception indicators can be overlapped with cube construction
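As a rough sketch of an exception indicator, the code below flags the cells whose value lies far from the mean of their group, measured in standard deviations. This simple z-score is a stand-in chosen for illustration, not the statistical model meant by the slide; the threshold, the cell values and the function name are assumptions.

```python
from statistics import mean, pstdev

# Hypothetical cells of one aggregation level: month -> sales.
CELLS = {"Jan": 10.0, "Feb": 11.0, "Mar": 9.0, "Apr": 10.0, "May": 30.0}


def exception_cells(cells, threshold=1.5):
    """Return {cell: z-score} for the cells that deviate more than `threshold`.

    The anticipated value is taken to be the mean of all cells, and the degree
    of exception is the distance from it in (population) standard deviations.
    """
    values = list(cells.values())
    expected = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return {}
    scores = {cell: (value - expected) / spread for cell, value in cells.items()}
    return {cell: z for cell, z in scores.items() if abs(z) > threshold}
```

In a user interface the returned scores could drive the visual cue, for example a darker background for a larger absolute score.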

Slide 77

Examples: Discovery-Driven Data Cubes

Slide 78

Data Warehousing and Data Mining

• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• OLAP analysis
• From data warehousing to data mining
• Principles of data mining

Slide 79

Data Warehouse Usage

• Three kinds of data warehouse applications
  – Information processing
    • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  – Analytical processing
    • multidimensional analysis of data warehouse data
    • supports basic OLAP operations: slice-dice, drilling, pivoting
  – Data mining
    • knowledge discovery from hidden patterns
    • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
• Differences among the three tasks

Slide 80

From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)

• Why online analytical mining?
  – High quality of data in data warehouses
    • DW contains integrated, consistent, cleaned data
  – Available information processing structure surrounding data warehouses
    • ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
  – OLAP-based exploratory data analysis
    • mining with drilling, dicing, pivoting, etc.
  – On-line selection of data mining functions

Slide 81

Data Warehousing and Data Mining

• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• OLAP analysis
• From data warehousing to data mining
• Principles of data mining

Slide 82

What Is Data Mining?

• Data mining (knowledge discovery in databases):
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
• Alternative names:
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Slide 83

What Is Data Mining?

• Other definitions
  – Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  – Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Slide 84

Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused
  – Web data, e-commerce
  – Purchases at department stores
  – Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  – Provide better, customized services (e.g. in Customer Relationship Management)

Slide 85

Mining Large Data Sets - Motivation

• There is often information “hidden” in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all

Slide 86

Why Data Mining? Potential Applications

• Database analysis and decision support
  – Market analysis and management
    • target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  – Risk analysis and management
    • forecasting, customer retention, quality control, competitive analysis
  – Fraud detection and management
• Other applications
  – Text mining (news groups, email, documents) and Web analysis.

Slide 87

Market Analysis and Management

• Where are the data sources for analysis?
  – Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
  – Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

Slide 88

Market Analysis and Management

• Determine customer purchasing patterns over time
  – Changes in customer habits with age
• Cross-market analysis
  – Associations/correlations between product sales
  – Prediction based on the association information
• Customer profiling
  – Identifying what types of customers buy what products (clustering or classification)
• Identifying customer requirements
  – identifying the best products for different customers
  – using prediction to find what factors will attract new customers

Slide 89

Corporate Analysis and Risk Management

• Finance planning and asset evaluation
  – cash flow analysis and prediction
  – cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
• Resource planning
  – summarize and compare the resources and spending
• Competition
  – monitor competitors and market directions
  – group customers into classes and set up a class-based pricing procedure
  – set pricing strategy in a highly competitive market

Slide 90

Fraud Detection and Management

• Applications
  – widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
  – approach: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
• Examples
  – auto insurance: detect a group of people who stage accidents to collect on insurance
  – money laundering: detect suspicious money transactions


Slide 91

Data Mining Tasks

• Prediction Methods
  – Use some variables to predict unknown or future values of other variables.
• Description Methods
  – Find human-interpretable patterns that describe the data.

Slide 92

Principal Data Mining Tasks

• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]


Slide 93

Classification: Definition

• Given a collection of records (training set)
• Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• Methodology: a test set is used to determine the accuracy of the model. Usually, the given data set is randomly divided into training and test sets, with the training set used to build the model and the test set used to validate it.
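The random division into training and test sets can be sketched as follows. This is only a minimal illustration: the function names, the 30% test fraction, the fixed seed, the toy records and the majority-class "model" are all assumptions made for the example, not part of the lecture.

```python
import random
from collections import Counter

def holdout_split(records, test_fraction=0.3, seed=0):
    """Randomly divide records into (training_set, test_set)."""
    shuffled = records[:]                      # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed only for repeatability
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def majority_class_model(training_set):
    """Placeholder model (assumption): always predicts the most frequent class."""
    most_common = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: most_common

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Hypothetical records: (attributes, class)
records = [((i,), "Yes" if i % 3 == 0 else "No") for i in range(20)]
training_set, test_set = holdout_split(records)
model = majority_class_model(training_set)
print(len(training_set), len(test_set), accuracy(model, test_set))
```

The model is built only from the training set and scored only on the test set, so the accuracy estimates how it behaves on previously unseen records.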

Slide 94

Classification Example

Training Set (Refund: categorical; Marital Status: categorical; Taxable Income: continuous; Cheat: class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the Training Set is used to learn a classifier (Model), which is then applied to the Test Set]
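As a rough sketch of the "learn a classifier, then apply the model" step, the code below grows a small decision tree on the ten training records and applies it to the six test records. The slide does not say which learning algorithm is used; the Gini-based binary tree here is an assumption chosen for illustration, and all function names are hypothetical.

```python
from collections import Counter

# Training records from the slide: (refund, marital_status, income_in_K, cheat)
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
# Test records from the slide (class unknown)
TEST = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    counts = Counter(r[-1] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

def candidate_tests(rows):
    """Yield (attribute index, value, is_numeric) binary tests."""
    for i in range(len(rows[0]) - 1):
        values = sorted({r[i] for r in rows})
        if isinstance(values[0], (int, float)):
            for lo, hi in zip(values, values[1:]):
                yield i, (lo + hi) / 2, True      # threshold between neighbours
        else:
            for v in values:
                yield i, v, False                 # equality test on a category

def passes(row, test):
    i, value, numeric = test
    return row[i] < value if numeric else row[i] == value

def build(rows):
    """Return a class label (leaf) or a (test, left, right) tuple (inner node)."""
    if gini(rows) == 0.0:
        return rows[0][-1]
    best = None
    for test in candidate_tests(rows):
        left = [r for r in rows if passes(r, test)]
        right = [r for r in rows if not passes(r, test)]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, test, left, right)
    if best is None:                              # no useful split: majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    _, test, left, right = best
    return (test, build(left), build(right))

def predict(tree, row):
    while isinstance(tree, tuple):
        test, left, right = tree
        tree = left if passes(row, test) else right
    return tree

tree = build(TRAIN)
for row in TEST:
    print(row, "->", predict(tree, row))
```

A tree grown until every leaf is pure reproduces the training labels exactly, which says nothing about its accuracy on new records; that is why a separate test set is needed.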


Slide 95

Classification: Application

• Direct Marketing
  – Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  – Approach:
    • Use the data for a similar product introduced before.
    • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
    • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      – Type of business, where they stay, how much they earn, etc.
    • Use this information as input attributes to learn a classifier model.

Slide 96

Clustering Definition

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  – Data points in one cluster are more similar to one another.
  – Data points in separate clusters are less similar to one another.
• Similarity Measures
  – Euclidean Distance if attributes are continuous.
  – Other Problem-specific Measures
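To make the similarity criterion concrete, the sketch below uses Euclidean distance to compare the average distance inside a cluster with the average distance between two clusters. The 3-D points and the helper names are invented for the example.

```python
import math
from itertools import combinations, product

def mean_intracluster_distance(cluster):
    """Average Euclidean distance over all pairs of points in one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_intercluster_distance(cluster_a, cluster_b):
    """Average Euclidean distance over all pairs drawn from two clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

# Hypothetical data points with three continuous attributes
cluster_a = [(1.0, 1.0, 1.0), (1.0, 2.0, 1.0), (2.0, 1.0, 1.0)]
cluster_b = [(8.0, 8.0, 8.0), (9.0, 8.0, 8.0), (8.0, 9.0, 9.0)]

print(mean_intracluster_distance(cluster_a))
print(mean_intercluster_distance(cluster_a, cluster_b))
```

For a good clustering the first number should be small compared with the second.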


Slide 97

Illustrating Clustering

Euclidean Distance Based Clustering in 3-D space.

[Figure: intracluster distances are minimized; intercluster distances are maximized]

Slide 98

Clustering: Application

• Market Segmentation:
  – Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  – Approach:
    • Collect different attributes of customers based on their geographical and lifestyle related information.
    • Find clusters of similar customers.
    • Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
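The "find clusters of similar customers" step could be sketched as below. The slide does not name a clustering algorithm; k-means is used here only as one common choice. The customer attributes (age, monthly spending) and the deterministic initialization from the first k points are assumptions made to keep the example repeatable.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [points[i] for i in range(k)]   # assumption: start from the first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids

# Hypothetical customers: (age, monthly spending)
customers = [(25, 20), (60, 80), (27, 22), (62, 85), (24, 18), (58, 78)]
segments, centers = kmeans(customers, k=2)
print(segments)
print(centers)
```

In practice the attributes would usually be rescaled first, so that no single attribute dominates the Euclidean distance.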


Slide 99

Association Rule Discovery: Definition

• Given a set of records, each of which contains some number of items from a given collection;
  – Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
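How strongly the five transactions support the two discovered rules can be checked with the usual support and confidence measures. These measures are not defined on the slide; they are the standard ones for association rules, and the function names are chosen only for this sketch.

```python
# The five transactions from the slide
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in TRANSACTIONS if itemset <= t)
    return hits / len(TRANSACTIONS)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support=%.2f" % support(lhs | rhs),
          "confidence=%.2f" % confidence(lhs, rhs))
```

{Milk} --> {Coke} holds in 3 of the 4 transactions containing Milk (confidence 0.75), and {Diaper, Milk} --> {Beer} in 2 of the 3 transactions containing both Diaper and Milk (confidence about 0.67).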

Slide 100

Association Rule Discovery: Application 1

• Marketing and Sales Promotion:
  – Let the rule discovered be {Bagels, … } --> {Potato Chips}
  – Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
  – Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
  – Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!


Slide 101

Association Rule Discovery: Application 2

• Supermarket shelf management.
  – Goal: To identify items that are bought together by sufficiently many customers.
  – Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
  – A classic rule --
    • If a customer buys diapers and milk, then they are very likely to buy beer.
    • So, don’t be surprised if you find six-packs stacked next to diapers!
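"Bought together by sufficiently many customers" can be read as a minimum count on item pairs. The sketch below counts pairs in a small set of baskets; the baskets reuse the five transactions of the association rule example, and the threshold of 3 is an arbitrary assumption.

```python
from collections import Counter
from itertools import combinations

# Point-of-sale baskets (same five transactions as in the definition slide)
BASKETS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def frequent_pairs(baskets, min_count):
    """Return {(item_a, item_b): count} for pairs appearing in at least min_count baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair one canonical order, e.g. ("Coke", "Milk")
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

print(frequent_pairs(BASKETS, min_count=3))
```

Real point-of-sale data has far too many items for brute-force pair counting; dedicated frequent-itemset algorithms prune the candidates instead.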

Slide 102

Regression

• To identify unknown values in a continuous domain
• Build trend functions by fitting known points (regression)
• Different models
  – Linear regression (two variables)
    • Y = q + m X
  – Multiple linear regression (more variables)
    • Y = q + m1 X1 + m2 X2 + m3 X3
  – Non-linear regression (polynomial, exponential, logarithmic ...)
    • Y = q + m1 X + m2 X^2 + m3 X^3
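For the two-variable linear model Y = q + m X, the coefficients can be estimated from known points with ordinary least squares. The slide does not state the fitting criterion, so least squares is an assumption here, and the sample points are invented.

```python
def fit_line(xs, ys):
    """Least-squares estimates (q, m) for the model Y = q + m X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx                  # slope
    q = mean_y - m * mean_x        # intercept
    return q, m

# Hypothetical known points
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
q, m = fit_line(xs, ys)

# Use the fitted function to estimate an unknown value
x_new = 4.0
print(q, m, q + m * x_new)
```

The multiple linear and polynomial models are fitted in the same spirit, by solving for several coefficients at once.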


Slide 103

Regression

• Example

Slide 104

Deviation Detection

• The search for “outliers”
• Outlier: exception, element out of range
• The search is based on the same principles as clustering
• Concentrates the effort on finding elements “far” from the others
• Search methods
  – Statistical
    • Can be used if a statistical distribution can be estimated
  – Distance based
    • Search for the elements that maximize the distance from the other elements of the set
  – Deviation based
    • Search for the elements that maximize the deviance from the other elements of the set.
• Example: fraud detection
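Two of the search methods can be sketched in a few lines: a statistical test based on z-scores, and a distance-based search for the element farthest from all the others. The threshold of 2 standard deviations, the use of the summed Euclidean distance and the sample data are assumptions for this illustration.

```python
import math
import statistics

def zscore_outliers(values, threshold=2.0):
    """Statistical search: values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

def farthest_point(points):
    """Distance-based search: the point with the largest total distance to all other points."""
    return max(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical transaction amounts; one is far outside the usual range
amounts = [10, 11, 9, 10, 12, 10, 50]
print(zscore_outliers(amounts))

# Hypothetical 2-D observations
observations = [(0, 0), (1, 0), (0, 1), (10, 10)]
print(farthest_point(observations))
```

The z-score test assumes the values roughly follow a known distribution, while the distance-based search needs only a distance measure, as in clustering.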


Slide 105

Challenges of Data Warehousing and Mining

• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
• Data Quality

Slide 106

Data Quality

• The quality of a process measures its adherence to the users' targets
• The following tables list some aspects of “quality” (Wang-Wand (1999): quality dimensions)


Slide 107

Data Quality

Slide 108

Main Competitors in DW Systems

Vendor                          Global Revenue 2006 (Millions USD)
Microsoft Corporation           1,801
Hyperion Solutions Corporation  1,077
Cognos                          735
Business Objects                416
MicroStrategy                   416
SAP AG                          330
Cartesis SA                     210
Applix                          205
Infor                           199
Oracle Corporation              159
Others                          152
Total                           5,700


Slide 109

Bibliography – Data warehousing

• Berson A., Smith S.J., “Data warehousing, data mining and OLAP”, McGraw-Hill, 1997
• Berthold M., Hand D.J., “Intelligent data analysis: an introduction”, Springer-Verlag, 1999
• Inmon W.H., “Building the data warehouse”, John Wiley & Sons, 1996
• Inmon W.H., Zachman J.A., Geiger G., “Data stores, data warehousing and Zachman framework; managing enterprise knowledge”, McGraw-Hill, 1997
• Kimball R., Ross M., “The Data Warehouse Toolkit. Practical techniques for building dimensional Data Warehouses”, 2nd ed., John Wiley, 2002
• Thomsen E., “OLAP solutions: building multidimensional information systems”, John Wiley & Sons, 1997

Slide 110

Bibliography – Data mining

• Bramer M., “Principles of Data Mining”, Springer, 2007
• Han J., Kamber M., “Data Mining – Concepts and techniques”, Academic Press, 2001
• Parr Rud O., “Data mining cookbook – Modeling data for marketing, risk and CRM”, John Wiley & Sons, 2000
• Pyle D., “Data preparation for data mining”, Morgan Kaufmann, 1999
• Weiss S.M., Indurkhya N., “Predictive Data Mining”, Morgan Kaufmann, 1998
• Witten I.H., Frank E., “Data mining, Practical Machine Learning Tools and Techniques”, 2nd Edition, Elsevier, 2005