Data Mining: A KDD Process
Data mining is the core of the knowledge discovery process.

[Figure: the KDD process — Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process
1. Learning the application domain: relevant prior knowledge and the goals of the application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of the effort!)
4. Data reduction and transformation: find useful features; dimensionality/variable reduction; invariant representation
5. Choosing the data mining function: summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: the search for patterns of interest
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of the discovered knowledge
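The steps above can be sketched as a toy pipeline. This is a minimal illustration, not a real tool: the records, field names, and support threshold are all invented for the example.

```python
# A hypothetical KDD pipeline sketch: clean raw records, select the
# task-relevant attribute, mine frequency patterns, then evaluate them
# against a minimum-support threshold. All data here is illustrative.
from collections import Counter

raw = [
    {"item": "milk ", "qty": 2}, {"item": "bread", "qty": 1},
    {"item": None,    "qty": 3}, {"item": "milk",  "qty": 1},
    {"item": "bread", "qty": 2}, {"item": "milk",  "qty": 4},
]

# 1. Data cleaning: drop incomplete records, normalize values.
cleaned = [{"item": r["item"].strip(), "qty": r["qty"]}
           for r in raw if r["item"] is not None]

# 2. Selection: keep only the task-relevant attribute.
task_relevant = [r["item"] for r in cleaned]

# 3. Data mining: search for frequent items.
counts = Counter(task_relevant)

# 4. Pattern evaluation: keep patterns that meet a minimum support.
min_support = 2
patterns = {item: n for item, n in counts.items() if n >= min_support}
print(patterns)  # {'milk': 3, 'bread': 2}
```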
Architecture of a Typical Data Mining System
[Figure: architecture of a typical data mining system — Databases and a Data Warehouse feed, through data cleaning, integration, and filtering, into a database or data warehouse server; above it sit the data mining engine and pattern evaluation modules, both guided by a knowledge base, with a graphical user interface on top]
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of Database Technology, Statistics, Machine Learning, Visualization, Information Science, and other disciplines]
Major Issues in Data Mining (1)
Mining methodology and user interaction:
- Mining different kinds of knowledge in databases
- Interactive mining of knowledge at multiple levels of abstraction
- Incorporation of background knowledge
- Data mining query languages and ad hoc data mining
- Expression and visualization of data mining results
- Handling noise and incomplete data
- Pattern evaluation: the interestingness problem

Performance and scalability:
- Efficiency and scalability of data mining algorithms
- Parallel, distributed, and incremental mining methods
Major Issues in Data Mining (2)
Issues relating to the diversity of data types:
- Handling relational and complex types of data
- Mining information from heterogeneous databases and global information systems (WWW)

Issues related to applications and social impacts:
- Application of discovered knowledge: domain-specific data mining tools, intelligent query answering, process control and decision making
- Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
- Protection of data security, integrity, and privacy
An Overview of Data Warehousing and OLAP Technology
OLAP (Online Analytical Processing)
- The data warehouse enables OLAP to support decision making.
- Organizes and formats data in various forms.

OLTP (Online Transaction Processing)
- Uses operational databases; the data warehouse is kept separate from the operational DB.
- Covers the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, and accounting.
OLTP vs OLAP
- OLTP applications typically automate clerical data processing tasks; data warehouses, in contrast, are targeted at decision support.
- OLTP transactions require detailed, up-to-date data and read or update a few (tens of) records, accessed typically on their primary keys; for OLAP, historical, summarized, and consolidated data is more important than detailed, individual records.
- Operational databases tend to be hundreds of megabytes to gigabytes in size; enterprise data warehouses are projected to be hundreds of gigabytes to terabytes in size.
- For OLTP, consistency and recoverability of the database are critical, and maximizing transaction throughput is the key performance metric; for OLAP, query throughput and response times are more important than transaction throughput.
OLAP Characteristics
- Uses multidimensional data analysis techniques.
- Provides advanced database support.
- Provides easy-to-use end-user interfaces.
- Supports client/server architecture.
Architecture and End-to-End Process
Back End Tools and Utilities
Data cleaning tools help to detect data anomalies and correct them, e.g. inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries, and violations of integrity constraints.
Types:
- Data migration tools, e.g. Warehouse Manager from Prism
- Data scrubbing tools, e.g. Integrity
- Data auditing tools: such tools may be considered variants of data mining tools
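As an illustration of the kinds of checks such tools perform, here is a minimal sketch (not any vendor's product) that audits hypothetical customer records for the anomalies listed above:

```python
# A toy data audit: flag inconsistent field lengths, missing entries,
# and integrity-constraint violations. Records and rules are invented.

records = [
    {"id": 1, "zip": "02139", "age": 34},
    {"id": 2, "zip": "2139",  "age": 41},   # inconsistent field length
    {"id": 3, "zip": "98101", "age": None}, # missing entry
    {"id": 4, "zip": "60601", "age": -5},   # integrity violation
]

def audit(recs):
    """Return (record id, problem description) pairs for each anomaly."""
    problems = []
    for r in recs:
        if len(r["zip"]) != 5:
            problems.append((r["id"], "inconsistent zip length"))
        if r["age"] is None:
            problems.append((r["id"], "missing age"))
        elif not 0 <= r["age"] <= 130:
            problems.append((r["id"], "age violates integrity constraint"))
    return problems

print(audit(records))
# [(2, 'inconsistent zip length'), (3, 'missing age'),
#  (4, 'age violates integrity constraint')]
```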
Back End Tools and Utilities (Contd.)
Load: after extracting, cleaning, and transforming, the data must be loaded into the warehouse, e.g. with the RedBrick Table Management Utility.
Additional preprocessing may still be required for:
- checking integrity constraints;
- sorting;
- summarization, aggregation, and other computation to build the derived tables stored in the warehouse;
- building indices and other access paths; and
- partitioning to multiple target storage areas.
Methods: batch load utilities; pipelined and partitioned parallelism; incremental loading, which inserts only the updated tuples.
Back End Tools and Utilities (Contd.)
Refresh: done only when some OLAP queries need current data.
Most contemporary database systems provide replication servers that support incremental techniques for propagating updates from a primary database to one or more replicas.
Techniques: data shipping and transaction shipping.
Transaction shipping has the advantage that it does not require triggers, which can increase the workload on the operational source databases.
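The idea can be sketched in a few lines. In this toy model (all names invented), transaction shipping means the source records each committed transaction in a log, and the replica is refreshed by replaying that log, with no triggers on the source:

```python
# Illustrative sketch (not any vendor's API) of transaction shipping:
# committed transactions are appended to a log; refresh replays the log
# on the replica and then clears it.

source = {}     # primary (operational) table: key -> value
replica = {}    # warehouse-side copy
txn_log = []    # shipped transactions

def commit(op, key, value=None):
    """Apply an operation on the source and record it in the log."""
    if op == "put":
        source[key] = value
    elif op == "delete":
        source.pop(key, None)
    txn_log.append((op, key, value))

def refresh(replica, log):
    """Replay shipped transactions on the replica, then clear the log."""
    for op, key, value in log:
        if op == "put":
            replica[key] = value
        else:
            replica.pop(key, None)
    log.clear()

commit("put", "cust:1", "Alice")
commit("put", "cust:2", "Bob")
commit("delete", "cust:2")
refresh(replica, txn_log)
print(replica == source)  # True
```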
Conceptual Model and Front End Tools
A popular conceptual model that influences the front-end tools, database design, and the query engines for OLAP is the multidimensional view of data in the warehouse.
Front End Tools
The spreadsheet is still the most compelling front-end application for OLAP.
Popular operations supported by the multidimensional spreadsheet:
- rollup (increasing the level of aggregation)
- drill-down (decreasing the level of aggregation, or increasing detail) along one or more dimension hierarchies
- slice and dice (selection and projection)
- pivot (re-orienting the multidimensional view of data)
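These operations can be illustrated on a small in-memory fact set. The dimensions (region, city, quarter) and figures are invented; drill-down is simply the inverse of rollup, returning to the finer city level:

```python
# Toy sketches of the OLAP operations named above, over a fact list with
# dimensions (region, city, quarter) and a sales measure.
from collections import defaultdict

facts = [
    ("East", "Boston",  "Q1", 10), ("East", "Boston",  "Q2", 15),
    ("East", "NYC",     "Q1", 20), ("West", "Seattle", "Q1",  5),
    ("West", "Seattle", "Q2", 25),
]

def rollup(rows):
    """Roll up city -> region: increase the level of aggregation."""
    out = defaultdict(int)
    for region, _city, quarter, sales in rows:
        out[(region, quarter)] += sales
    return dict(out)

def slice_(rows, quarter):
    """Slice: select one value along the quarter dimension."""
    return [r for r in rows if r[2] == quarter]

def dice(rows, regions, quarters):
    """Dice: select a sub-cube on several dimensions at once."""
    return [r for r in rows if r[0] in regions and r[2] in quarters]

def pivot(rows):
    """Pivot: re-orient the view as region -> {quarter: sales}."""
    out = defaultdict(dict)
    for (region, quarter), sales in rollup(rows).items():
        out[region][quarter] = sales
    return dict(out)

print(rollup(facts))  # {('East', 'Q1'): 30, ('East', 'Q2'): 15, ...}
print(pivot(facts))   # {'East': {'Q1': 30, 'Q2': 15}, 'West': {'Q1': 5, 'Q2': 25}}
```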
Front End Tools
Other applications: traditional analysis by means of a managed query environment.
These applications often use raw data access tools and optimize the access patterns depending on the back-end database server.
E.g., there are query environments (such as Microsoft Access) that help build ad hoc SQL queries by "pointing and clicking".
Database Design Methodology
The database designs recommended by ER diagrams are inappropriate for decision support systems, where efficiency in querying and in loading data (including incremental loads) is important.
Schemas used to represent the multidimensional data model are:
- Star schema
- Snowflake schema
- Fact constellation
Star Schema
Snowflake Schema
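A star schema can be sketched with Python's built-in sqlite3 module: a central fact table whose foreign keys reference dimension tables. Table and column names here are illustrative, not from the text:

```python
# A minimal star schema in an in-memory SQLite database: one fact table
# (sales) referencing two dimension tables (product, store), plus a
# typical star-join aggregation query. All names and data are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'milk'), (2, 'bread');
    INSERT INTO dim_store   VALUES (1, 'Boston'), (2, 'Seattle');
    INSERT INTO fact_sales  VALUES (1, 1, 3.0), (1, 2, 2.0), (2, 1, 4.0);
""")

# A typical star-join: aggregate the fact table grouped by a dimension.
rows = con.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)  # [('bread', 4.0), ('milk', 5.0)]
```

In a snowflake schema, the dimension tables themselves would be further normalized (e.g. dim_store referencing a separate city table).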
Warehouse Servers
Data warehouses may contain large volumes of data, so improving the efficiency of scans is important.
- Index structures and their usage: warehouse servers can use bitmap indices, which support efficient index operations (e.g., union, intersection).
- Materialized views and their usage: a strategy for using a materialized view is to apply selection on the materialized view, or to roll up on the materialized view by grouping and aggregating on additional columns.
- Transformation of complex SQL queries: "unnesting" complex SQL queries that contain nested subqueries.
- Parallel processing.
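Why bitmap indices support efficient union and intersection can be shown in a few lines: each attribute value gets a bit vector over row positions, so the set operations reduce to single bitwise OR/AND operations. A sketch with invented data:

```python
# Bitmap index sketch: bit i of a value's bitmap is set iff row i holds
# that value; union and intersection of row sets are bitwise | and &.

rows = ["red", "blue", "red", "green", "blue", "red"]

def bitmap(values, wanted):
    """Build an integer bitmap: bit i is set iff values[i] == wanted."""
    bits = 0
    for i, v in enumerate(values):
        if v == wanted:
            bits |= 1 << i
    return bits

red  = bitmap(rows, "red")    # 0b100101 -> rows 0, 2, 5
blue = bitmap(rows, "blue")   # 0b010010 -> rows 1, 4

union        = red | blue     # rows that are red OR blue
intersection = red & blue     # rows that are red AND blue (none here)

matching = [i for i in range(len(rows)) if union >> i & 1]
print(matching)       # [0, 1, 2, 4, 5]
print(intersection)   # 0
```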
Warehouse Servers (Contd.)
Server Architectures for Query Processing:
- Specialized SQL servers: the objective is to provide advanced query language and query processing support for SQL queries over star and snowflake schemas in read-only environments, e.g. Redbrick.
- ROLAP servers: intermediate servers that sit between a relational back-end server (where the data in the warehouse is stored) and client front-end tools, e.g. Microstrategy.
- MOLAP servers: servers that directly support the multidimensional view of data through a multidimensional storage engine, e.g. Essbase (Arbor).
Warehouse Servers (Contd.)
SQL Extensions:
- Extended family of aggregate functions: rank, percentile, mean, mode, median
- Reporting features: moving average
- Multiple group-by: Cube and Rollup
- Comparisons
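What the Cube and Rollup group-by extensions compute can be sketched in plain Python: Rollup aggregates over each prefix of the dimension list, while Cube aggregates over every subset of it (data and dimension names invented):

```python
# Grouping-set sketch for ROLLUP(region, quarter) and CUBE(region, quarter)
# over a tiny fact list; '*' marks a dimension aggregated away.
from collections import defaultdict
from itertools import combinations

facts = [("East", "Q1", 10), ("East", "Q2", 15), ("West", "Q1", 5)]
dims = ("region", "quarter")
ALL = "*"

def group_by(rows, keep):
    """Sum sales, aggregating away the dimensions NOT listed in `keep`."""
    out = defaultdict(int)
    for region, quarter, sales in rows:
        key = (region if "region" in keep else ALL,
               quarter if "quarter" in keep else ALL)
        out[key] += sales
    return dict(out)

# ROLLUP(region, quarter): group by each prefix of the dimension list.
rollup = {}
for i in range(len(dims), -1, -1):
    rollup.update(group_by(facts, dims[:i]))

# CUBE(region, quarter): group by every subset of the dimension list.
cube = {}
for r in range(len(dims) + 1):
    for keep in combinations(dims, r):
        cube.update(group_by(facts, keep))

print(rollup[("East", ALL)])  # 25  (region subtotal)
print(rollup[(ALL, ALL)])     # 30  (grand total)
print(cube[(ALL, "Q1")])      # 15  (quarter subtotal: only CUBE has it)
```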
Metadata and Warehouse Management
Administrative metadata includes: descriptions of the source databases, back-end and front-end tools; definitions of the warehouse schema, derived data, dimensions, and hierarchies; predefined queries and reports; data mart locations and contents; physical organization such as data partitions; data extraction, cleaning, and transformation rules; data refresh and purging policies; and user profiles, user authorization, and access control policies.
Business metadata includes: business terms and definitions, ownership of the data, and charging policies.
Operational metadata includes: information collected during the operation of the warehouse, such as the lineage of migrated and transformed data; the currency of data in the warehouse (active, archived, or purged); and monitoring information such as usage statistics, error reports, and audit trails.
Metadata and Warehouse Management (Contd.)
A metadata repository is used to store and manage all the metadata associated with the warehouse. E.g. Platinum Repository and Prism Directory Manager
Warehouse management tools (e.g., HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager) are used for monitoring a warehouse
System and network management tools (e.g., HP OpenView, IBM NetView, Tivoli) are used to measure traffic between clients and servers, and between warehouse servers and operational databases.
Workflow management tools have been considered for managing the extract-scrub-transform-load-refresh process.
Conclusion
There are substantial technical challenges in developing and deploying decision support systems
While many commercial products and services exist, there are still several interesting avenues for research related to the different aspects in designing and maintaining a data warehouse.
References:
- Surajit Chaudhuri (Microsoft Research, Redmond) and Umeshwar Dayal (Hewlett-Packard Labs, Palo Alto), "An Overview of Data Warehousing and OLAP Technology", ACM SIGMOD Record 26(1), 1997.
- Inmon, W.H., Building the Data Warehouse. John Wiley, 1992.
- Athanasios Vavouras, Stella Gatziu, Klaus R. Dittrich, "Modeling and Executing the Data Warehouse Refreshment Process", Technical Report 2000.01, January 2000.