Elsayed Hemayed Data Mining Course

Data Warehouse
Elsayed Hemayed
Data Mining Course

Outline
- Introduction
- Operational System (OLTP) vs. Data Warehouse (OLAP)
- Data Warehouse vs. Data Marts
- Data Warehouse Architecture
- Data Warehouse Structure

Data, Data Everywhere
- I can't find the data I need
  - data is scattered over the network
  - many versions, subtle differences
- I can't get the data I need
  - need an expert to get the data
- I can't understand the data I found
  - available data poorly documented
- I can't use the data I found
  - results are unexpected
  - data needs to be transformed from one form to another

What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources made available to end users in what they can understand and use in a business context. [Barry Devlin]

What are the users saying...
- Data should be integrated across the enterprise
- Summary data has a real value to the organization
- Historical data holds the key to understanding data over time
- What-if capabilities are required

What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference. [Forrester Research, April 1996]
[Figure labels: Data, Information]

Warehouses are Very Large Databases
- Terabytes -- 10^12 bytes: Walmart
- Petabytes -- 10^15 bytes: Geographic Information Systems
- Exabytes -- 10^18 bytes: National Medical Records
- Zettabytes -- 10^21 bytes: Weather images
- Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Data Warehousing -- It is a process
- A technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
- A decision support database maintained separately from the organization's operational database

Why Separate Data Warehouse?
- Performance
  - Operational DBs are designed and tuned for known transactions and workloads.
  - Complex OLAP queries would degrade performance for operational transactions.
  - Special data organization, access, and implementation methods are needed for multidimensional views and queries.
- Function
  - Missing data: decision support requires historical data, which operational DBs do not typically maintain.
  - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational DBs, external sources.
  - Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

Key Definitions
- OLTP (On-Line Transaction Processing): describes processing at operational sites
- OLAP (On-Line Analytical Processing): describes processing at the warehouse
- Business intelligence refers to reporting and analysis of data stored in the warehouse.
- The data warehouse is the foundation for business intelligence.
- Data warehouse/business intelligence (DW/BI) refers to the complete end-to-end system.

Explorers, Farmers and Tourists
- Tourists: browse information harvested by farmers
- Farmers: harvest information from known access paths
- Explorers: seek out the unknown and previously unsuspected rewards hiding in the detailed data

Data Mining works with Warehouse Data
- Data warehousing provides the enterprise with a memory
- Data mining provides the enterprise with intelligence

To summarize ...
- Operational (OLTP) systems are used to run a business
- The data warehouse (OLAP) helps to optimize the business

Data Warehouse vs. Data Marts
What comes first?

Data Mart vs. Data Warehouse
- A data mart is a specific, subject-oriented repository of data that was designed to answer specific questions
- Usually, multiple data marts exist to serve the needs of multiple business units (sales, marketing, operations, collections, accounting, etc.)
- A data warehouse is a single organizational repository of enterprise-wide data across many or all subject areas.
- A data warehouse is an enterprise-wide collection of data marts

From the Data Warehouse to Data Marts
[Figure labels: Information; Individually Structured; Departmentally Structured; Organizationally Structured (Data Warehouse); scale from Less to More: History, Normalized, Detailed]

Data Warehouse and Data Marts
[Figure labels: OLAP data marts (Sales, Mktg., Finance): lightly summarized, departmentally structured; data warehouse: organizationally structured, atomic, detailed data warehouse data]

Characteristics of the Departmental Data Mart
- OLAP
- Small
- Flexible
- Customized by department
- Source is the departmentally structured data warehouse
[Figure labels: Sales, Mktg., Finance]

Data Mart Centric
[Figure labels: Data Sources, Data Marts, Data Warehouse]

Problems with Data Mart Centric Solution
- If you end up creating multiple warehouses, integrating them is a problem

True Warehouse
[Figure labels: Data Sources, Data Warehouse, Data Marts]

Data Warehouse Architecture
[Figure labels: sources (Relational Databases, Legacy Data, Purchased Data, ERP Systems); Extraction, Cleansing; Optimized Loader; Data Warehouse Engine; Metadata Repository; Analyze, Query]

Implementing a Warehouse
- Monitoring: getting the data from the sources
- Data integration
- Cleansing
- Loading
- Processing: query processing, indexing, ...
- Managing: metadata, design, ...

Monitoring
- Source types: relational, flat file, IMS, WWW, news-wire, ...
- Incremental vs. refresh

Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Application-level monitoring

Monitoring Issues
- Frequency
  - periodic: daily, weekly, ...
  - triggered: on big change, lots of changes, ...
- Data transformation
  - convert data to uniform format
  - remove & add fields (e.g., add date to get history)
- Standards (e.g., ODBC)
- Gateways

Refresh
- Propagate updates on source data to the warehouse
- Issues:
  - when to refresh
  - how to refresh -- refresh techniques

When to Refresh?
- periodically (e.g., every night, every week) or after significant events
- on every update: not warranted unless warehouse data require current data (up-to-the-minute stock quotes)
- refresh policy set by administrator based on user needs and traffic
- possibly different policies for different sources

How To Detect Changes
- Create a snapshot log table to record ids of updated rows of source data and timestamp
- Detect changes by:
  - defining after-row triggers to update the snapshot log when the source table changes
  - using the regular transaction log to detect changes to source data

Data Integration Across Sources
[Figure labels: sources Trust, Savings, Loans, Credit card]
- Same data, different name
- Different data, same name
- Data found here, nowhere else
- Different keys, same data

Data Transformation Example
          encoding        unit             field
appl A    m, f            pipeline - cm    balance
appl B    1, 0            pipeline - in    bal
appl C    x, y            pipeline - feet  currbal
appl D    male, female    pipeline - yds   balcurr

Data Integrity Problems
- Same person, different spellings: Ahmed, Ahmad, Ahmaad, etc.
- Multiple ways to denote company name: Persistent Systems, PSPL, Persistent Pvt. LTD.
- Use of different names: Oct 6, 6 Oct
- Different account numbers generated by different applications for the same customer
- Required fields left blank
- Invalid product codes collected at point of sale
  - manual entry leads to mistakes
  - in case of a problem use

Data Extraction and Cleansing
- Extract data from existing operational and legacy data
- Issues:
  - Sources of data for the warehouse
  - Data quality at the sources
  - Merging different data sources
  - Data transformation
  - How to propagate updates (on the sources) to the warehouse
  - Terabytes of data to be loaded

Scrubbing Data
- Sophisticated transformation tools
- Used to clean data and improve its quality
- Clean data is vital for the success of the warehouse
- Example: Ahmed Aly, Ahmad Ali, Ahmaad Aly, Ahmad Aly, etc. are the same person

Scrubbing Tools
- Apertus -- Enterprise/Integrator
- Vality -- IPE
- Postal Soft

Data Loading
- After extracting, cleaning, validating, etc., the data needs to be loaded into the warehouse
- Issues:
  - huge volumes of data to be loaded
  - small time window available when the warehouse can be taken offline (usually nights)
  - when to build index and summary tables
  - allow system administrators to monitor, cancel, resume, change load rates
  - recover gracefully -- restart after failure from where you were and without loss of data integrity

Load Techniques
- Use SQL to append or insert new data
  - record-at-a-time interface
  - will lead to random disk I/Os
- Use batch load utility
- Incremental versus full loads
- Online versus offline loads

Data Warehouse Structure
- Subject orientation -- customer, product, policy, account, etc.
- A subject may be implemented as a set of related tables. E.g., customer may be five tables.

Data Warehouse Structure
- base customer ( ): custid, from date, to date, name, phone, dob
- base customer ( ): custid, from date, to date, name, credit rating, employer
- customer activity ( ) -- monthly summary
- customer activity detail ( ): custid, activity date, amount, clerk id, order no
- customer activity detail ( ): custid, activity date, amount, line item no, order no
Time is part of the key of each table

Data Granularity in Warehouse
- Summarized data stored
  - reduces storage costs
  - reduces CPU usage
  - increases performance since a smaller number of records has to be processed
  - design around traditional high-level reporting needs
  - tradeoff with volume of data to be stored and detailed usage of data

Granularity in Warehouse
- Cannot answer some questions with summarized data
  - Did Ahmed call Aly last month? Not possible to answer if only the total duration of calls by Ahmed over a month is maintained and individual call details are not.
- Detailed data too voluminous

Granularity in Warehouse
- Tradeoff is to have dual level of granularity
  - Store summary data on disks: 95% of DSS processing done against this data
  - Store detail on tapes: 5% of DSS processing against this data

Vertical Partitioning
- Original table: Acct. No, Name, Balance, Date Opened, Interest Rate, Address
- Frequently accessed: Acct. No, Balance
- Rarely accessed: Acct. No, Name, Date Opened, Interest Rate, Address
- Smaller table and so less I/O

Schema Design
- Database organization
  - must look like business
  - must be recognizable by business user
  - approachable by business user
  - must be simple
- Schema types
  - Star schema
  - Fact constellation schema
  - Snowflake schema

Dimensional Modeling
[Figure labels: Fact Table, Dimension Table, Dimension Table]
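
The fact-table and dimension-table arrangement named on the Dimensional Modeling slide can be sketched with Python's built-in sqlite3 module. This is only an illustrative sketch: the table names (dim_customer, dim_product, fact_sales), the columns, and the sample rows are assumptions made for the example and do not come from the course material.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table and two dimension tables (hypothetical names)."""
    conn.executescript(
        """
        CREATE TABLE dim_customer (
            custno INTEGER PRIMARY KEY,
            name   TEXT,
            city   TEXT
        );
        CREATE TABLE dim_product (
            prodno   INTEGER PRIMARY KEY,
            name     TEXT,
            category TEXT
        );
        -- The fact table holds numeric measures plus keys into each dimension.
        CREATE TABLE fact_sales (
            sale_date TEXT,
            custno    INTEGER REFERENCES dim_customer(custno),
            prodno    INTEGER REFERENCES dim_product(prodno),
            amount    REAL
        );
        """
    )


def load_sample_rows(conn):
    """Insert a few made-up rows so the query below has something to read."""
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?, ?)",
        [(1, "Ahmed", "Cairo"), (2, "Aly", "Giza")],
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(10, "Pen", "Stationery"), (11, "Notebook", "Stationery"), (20, "Lamp", "Home")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [("2024-01-05", 1, 10, 2.5), ("2024-01-06", 2, 11, 4.0), ("2024-01-06", 1, 20, 30.0)],
    )


def sales_by_category(conn):
    """Reach the facts through a dimension: group the measure by a product attribute."""
    return conn.execute(
        """
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.prodno = f.prodno
        GROUP BY p.category
        ORDER BY p.category
        """
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    load_sample_rows(connection)
    for category, total in sales_by_category(connection):
        print(category, total)
```

The query never filters the fact table directly; it constrains and groups through a dimension attribute, which is the access pattern the following slides describe.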
Fact Tables
- Contain the metrics resulting from a business process or measurement event, such as the sales ordering process or service call event
- Dimensional models should be structured around business processes and their associated data sources
  - This results in the ability to design identical, consistent views of data for all observers, regardless of which business unit they belong to, which goes a long way toward eliminating misunderstandings at business meetings
- Fact table granularity should be set at the lowest, most atomic level captured by the business process
  - This allows for maximum flexibility and extensibility.
  - Business users will be able to ask constantly changing, free-ranging, and very precise questions.

Fact Table
- Central table
- mostly raw numeric items
- narrow rows, a few columns at most
- large number of rows (millions to a billion)
- Access via dimensions

Dimension Tables
- Contain the descriptive attributes and characteristics associated with specific, tangible measurement events, such as the customer, product, or sales representative associated with an order being placed.
- Dimension attributes are used for constraining, grouping, or labeling in a query.
- Hierarchical many-to-one relationships are denormalized into single dimension tables.

Dimension Table
- Define business in terms already familiar to users
- Wide rows with lots of descriptive text
- Small tables (about a million rows)
- Joined to fact table by a foreign key
- heavily indexed
- typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Star Schema
- A single fact table and multiple dimension tables
[Figure labels: fact (date, custno, prodno, cityname, ...); dimensions: time, prod, cust, city]

Star Schema Example
[Figure]

Snowflake Schema
- The tables which describe the dimensions are normalized.
- Easy to maintain and saves storage
[Figure labels: fact (date, custno, prodno, cityname, ...); dimensions: prod, time, cust, city, reg]

Snowflake Schema Example
[Figure labels: sType, store, city, region]

Fact Constellation
- Multiple fact tables that share many dimension tables
- Booking and Checkout may share many dimension tables in the hotel industry
[Figure labels: fact tables Booking, Checkout; dimensions Hotels, Travel Agents, Promotion, Room Type, Customer]

Hybrid Approach
- If a dimension is very sparse (i.e., most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and snowflaking may be appropriate
- In practice, many data warehouses will normalize some dimensions and not others, and hence use a combination of snowflake and classic star schema.

Partitioning
- Breaking data into several physical units that can be handled separately
- Not a question of whether to do it in data warehouses but how to do it
- Granularity and partitioning are key to effective implementation of a warehouse

Why Partition?
- Flexibility in managing data
- Smaller physical units allow
  - easy restructuring
  - free indexing
  - sequential scans if needed
  - easy reorganization
  - easy recovery
  - easy monitoring

Criterion for Partitioning
- Typically partitioned by
  - date
  - line of business
  - geography
  - organizational unit
  - any combination of the above

Query Processing
- Indexing
- Parallel query processing
- Pre-computed views/aggregates
- SQL extensions
  - Extended family of aggregate functions
    - rank (top 10 customers)
    - percentile (top 30% of customers)
    - median, mode
  - Reporting features
    - running total, cumulative totals

Metadata Repository
- Administrative metadata
  - source databases and their contents
  - gateway descriptions
  - warehouse schema, view & derived data definitions
  - dimensions, hierarchies
  - pre-defined queries and reports
  - data mart locations and contents
  - data partitions
  - data extraction, cleansing, transformation rules, defaults
  - data refresh and purging rules
  - user profiles, user groups
  - security: user authorization, access control

Metadata Repository .. 2
- Business data
  - business terms and definitions
  - ownership of data
  - charging policies
- Operational metadata
  - data lineage: history of migrated data and sequence of transformations applied
  - currency of data: active, archived, purged
  - monitoring information: warehouse usage statistics, error reports, audit trails

Data Warehouse References
- W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and Sons, 1996
- W.H. Inmon, J. D. Welch, Katherine L. Glassey, Managing the Data Warehouse, John Wiley and Sons, 1997
- Barry Devlin, Data Warehouse from Architecture to Implementation, Addison Wesley Longman, Inc., 1997

Summary
- Introduction
- Operational System (OLTP) vs. Data Warehouse (OLAP)
- Data Warehouse vs. Data Marts
- Data Warehouse Architecture
- Data Warehouse Structure
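
Looking back at the "How To Detect Changes" slide, the snapshot-log idea (after-row triggers recording the ids of changed source rows with a timestamp) can be sketched as follows. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an operational source. The table, trigger, and function names are hypothetical, and a real refresh would also have to handle deletes and concurrent updates.

```python
import sqlite3


def setup_source(conn):
    """Create a source table, a snapshot log, and after-row triggers (hypothetical names)."""
    conn.executescript(
        """
        CREATE TABLE customer (
            custid  INTEGER PRIMARY KEY,
            name    TEXT,
            balance REAL
        );
        -- Snapshot log: id of each changed row, the operation, and a timestamp.
        CREATE TABLE snapshot_log (
            custid     INTEGER,
            op         TEXT,
            changed_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TRIGGER customer_after_insert
        AFTER INSERT ON customer FOR EACH ROW
        BEGIN
            INSERT INTO snapshot_log (custid, op) VALUES (NEW.custid, 'I');
        END;
        CREATE TRIGGER customer_after_update
        AFTER UPDATE ON customer FOR EACH ROW
        BEGIN
            INSERT INTO snapshot_log (custid, op) VALUES (NEW.custid, 'U');
        END;
        """
    )


def pending_changes(conn):
    """Return the distinct ids of source rows changed since the last refresh."""
    rows = conn.execute(
        "SELECT DISTINCT custid FROM snapshot_log ORDER BY custid"
    ).fetchall()
    return [row[0] for row in rows]


def mark_refreshed(conn):
    """Clear the log once the logged changes have been propagated to the warehouse."""
    conn.execute("DELETE FROM snapshot_log")


if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    setup_source(source)
    source.executemany(
        "INSERT INTO customer VALUES (?, ?, ?)",
        [(1, "Ahmed", 100.0), (2, "Aly", 50.0)],
    )
    print(pending_changes(source))  # ids logged by the insert trigger
    mark_refreshed(source)
    source.execute("UPDATE customer SET balance = 60.0 WHERE custid = 2")
    print(pending_changes(source))  # only the row updated after the refresh
```

An incremental refresh would read only the logged ids from the source instead of rescanning the whole table, which is the point of keeping the log.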