Datawarehouse Intro Ch1 Ch2

  • 7/30/2019 Datawarehouse Intro Ch1 Ch2

    Which are our lowest/highest margin customers?
    Who are my customers and what products are they buying?
    Which customers are most likely to go to the competition?
    What impact will new products/services have on revenue and margins?
    What product promotions have the biggest impact on revenue?
    What is the most effective distribution channel?

    A producer wants to know.

    Data, data everywhere, yet ...

    I can't find the data I need
        data is scattered over the network
        many versions, subtle differences
    I can't get the data I need
        need an expert to get the data
    I can't understand the data I found
        available data poorly documented
    I can't use the data I found
        results are unexpected
        data needs to be transformed from one form to another

    What is a Data Warehouse?

    A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

    [Barry Devlin]

    What are the users saying...

    Data should be integrated across the enterprise
    Summary data has a real value to the organization
    Historical data holds the key to understanding data over time
    What-if capabilities are required

    What is Data Warehousing?

    A process of transforming data into information and making it available to users in a timely enough manner to make a difference

    [Forrester Research, April 1996]

    [Diagram: Data, Information]

    Evolution

    60s: Batch reports
        hard to find and analyze information
        inflexible and expensive, reprogram every new request
    70s: Terminal-based DSS and EIS (executive information systems)
        still inflexible, not integrated with desktop tools
    80s: Desktop data access and analysis tools
        query tools, spreadsheets, GUIs
        easier to use, but only access operational databases
    90s: Data warehousing with integrated OLAP engines and tools

    Warehouses are Very Large Databases

    [Bar chart: percentage of respondents (0% to 35%) by warehouse size (5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB), initial versus projected 2Q96. Source: META Group, Inc.]

    Very Large Data Bases

    Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
    Petabytes -- 10^15 bytes: Geographic Information Systems
    Exabytes -- 10^18 bytes: National Medical Records
    Zettabytes -- 10^21 bytes: Weather images
    Yottabytes -- 10^24 bytes: Intelligence Agency Videos

    Data Warehousing --

    It is a process
    It is a product
    It is an environment

    Data Warehousing -- It is a process

    Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
    A decision support database maintained separately from the organization's operational database

    Data Warehouse (2nd Chapter)

    A data warehouse is a
        subject-oriented
        integrated
        time-varying
        non-volatile
    collection of data that is used primarily in organizational decision making.

    Data Warehouse

    Subject Oriented

    The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.

    Data Warehouse

    Integrated

    The data warehouse contains data from most or all of an organization's operational systems, and this data is made consistent.

    Data Warehouse (2nd Chapter)

    Non-volatile Data

    Data in the data warehouse is never overwritten or deleted: once committed, the data is static, read-only, and retained for future reporting.

    Data Warehouse

    Time variant

    In a data warehouse environment, the decision makers can view the data across the field of time at whichever level of detail they may wish.

    Data Granularity

    Granularity is the extent to which a system is broken down into small parts, either the system itself or its description or observation. It is the "extent to which a larger entity is subdivided. For example, a yard broken into inches has finer granularity than a yard broken into feet."

    Cont.

    Granularity is usually mentioned in the context of dimensional data structures (i.e., facts and dimensions) and refers to the level of detail in a given fact table. The more detail there is in the fact table, the higher its granularity and vice versa. Another way to look at it is that the higher the granularity of a fact table, the more rows it will have.

    Example:

    Say we have a data mart with a single fact (Sales) and three dimensions (Time, Organization and Product). The fact table contains three metrics (Unit Price, Units Sold and Total Sale Amount). The Time dimension consists of four hierarchical elements (Year, Quarter, Month and Day). The Organization dimension consists of three hierarchical elements (Region, District and Store). The Product dimension consists of two hierarchical elements (Product Family and SKU).

    Cont.

    As always, the metrics in the Sales fact table must be stored at some intersection of the dimensions (i.e., Time, Organization and Product). Hence, in this data mart, the highest granularity at which we can store Sales metrics is by Day/Store/SKU (i.e., the lowest level in each dimensional hierarchy). Conversely, the lowest granularity that we can aggregate Sales metrics to in this data mart is by Year/Region/Product Family (i.e., the highest level in each dimensional hierarchy). We may also (for a variety of performance reasons) choose to store Sales metrics at some intermediate level of granularity (e.g., by Month/District/SKU).
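The roll-up from the finest grain to an intermediate one can be sketched in a few lines of Python. This is only an illustration: the row layout, the sample values and the function name `roll_up_to_month_district_sku` are assumptions, not part of the slides. Unit Price is left out because it is not additive and cannot simply be summed.

```python
from collections import defaultdict

# Hypothetical fact rows at the finest grain: Day / Store / SKU.
# Each row: (date "YYYY-MM-DD", district, store, sku, units_sold, total_sale_amount)
daily_sales = [
    ("2019-07-01", "North", "S1", "SKU-1", 2, 20.0),
    ("2019-07-02", "North", "S1", "SKU-1", 3, 30.0),
    ("2019-07-02", "North", "S2", "SKU-1", 1, 10.0),
    ("2019-08-01", "North", "S1", "SKU-1", 4, 40.0),
]


def roll_up_to_month_district_sku(rows):
    """Aggregate Day/Store/SKU rows to the coarser Month/District/SKU grain."""
    totals = defaultdict(lambda: [0, 0.0])
    for date, district, _store, sku, units, amount in rows:
        month = date[:7]  # keep "YYYY-MM", dropping the day
        key = (month, district, sku)
        totals[key][0] += units
        totals[key][1] += amount
    return {key: tuple(value) for key, value in totals.items()}


monthly = roll_up_to_month_district_sku(daily_sales)
```

Four detail rows collapse into two summary rows here, which is the sense in which a coarser grain means fewer rows.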

    The information flow mechanism

    [Diagram: Extract, Transform, Load, Operational data store]

    Data extraction from source

    Identify the source
    Finalize the filters for each source
    Produce automatic extract file from operational data
    Generate intermediate file
    Render automated job control services for creating extract files
    Reformat and standardize input
    Produce common application code for data extraction
    Resolve inconsistencies for common data that will be extracted from multiple source systems

    Metadata in warehouse

    Metadata is one of the important keys to the success of the data warehousing and business intelligence effort.
    Metadata is your control panel to the data warehouse. It is data that describes the data warehousing and business intelligence system:

    What is Metadata?

    Reports
    Cubes
    Tables (Records, Segments, Entities, etc.)
    Columns (Fields, Attributes, Data Elements, etc.)
    Keys
    Indexes

    Metadata is often used to control the handling of data and describes:

    Rules
    Transformations
    Aggregations
    Mappings

    Data Warehouse Metadata

    Data warehousing has specific metadata requirements. Metadata that describes tables typically includes:

    Physical Name
    Logical Name
    Type: Fact, Dimension, Bridge
    Role: Legacy, OLTP, Stage
    DBMS: DB2, Informix, MS SQL Server, Oracle, Sybase
    Location
    Definition
    Notes
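One way to picture such a table-level metadata record is as a small Python dataclass. The field names below mirror the list on the slide; the class name `TableMetadata` and every sample value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableMetadata:
    """Descriptive metadata for one warehouse table (illustrative only)."""
    physical_name: str
    logical_name: str
    table_type: str  # "Fact", "Dimension" or "Bridge"
    role: str        # e.g. "Legacy", "OLTP" or "Stage"
    dbms: str        # e.g. "DB2", "Oracle"
    location: str
    definition: str
    notes: str = ""


# Hypothetical entry for a sales fact table.
sales_fact = TableMetadata(
    physical_name="SALES_FCT",
    logical_name="Sales",
    table_type="Fact",
    role="Stage",
    dbms="Oracle",
    location="dw_host/sales_schema",
    definition="One row per day, store and SKU",
)
```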

    Data Warehouse for Decision Support & OLAP

    Putting information technology to help the knowledge worker make faster and better decisions

    Which of my customers are most likely to go to the competition?
    What product promotions have the biggest impact on revenue?
    How did the share price of software companies correlate with profits over the last 10 years?

    Decision Support

    Used to manage and control business
    Data is historical or point-in-time
    Optimized for inquiry rather than update
    Use of the system is loosely defined and can be ad-hoc
    Used by managers and end-users to understand the business and make judgements

    Data Mining works with Warehouse Data

    Data Warehousing provides the Enterprise with a memory
    Data Mining provides the Enterprise with intelligence

    We want to know ...

    Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
    Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
    If I raise the price of my product by Rs. 2, what is the effect on my ROI?
    If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
    If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
    Which of my customers are likely to be the most loyal?

    Data Mining helps extract such information

    Application Areas

    Industry                 Application
    Finance                  Credit Card Analysis
    Insurance                Claims, Fraud Analysis
    Telecommunication        Call record analysis
    Transport                Logistics management
    Consumer goods           Promotion analysis
    Data Service providers   Value added data
    Utilities                Power usage analysis

    Data Mining in Use

    The US Government uses Data Mining to track fraud
    A Supermarket becomes an information broker
    Basketball teams use it to track game strategy
    Cross Selling
    Warranty Claims Routing
    Holding on to Good Customers
    Weeding out Bad Customers

    What makes data mining possible?

    Advances in the following areas are making data mining deployable:

    data warehousing
    better and more data (i.e., operational, behavioral, and demographic)
    the emergence of easily deployed data mining tools and
    the advent of new data mining techniques.

    Why Separate Data Warehouse?

    Performance
        Op dbs designed & tuned for known txs & workloads.
        Complex OLAP queries would degrade perf. for op txs.
        Special data organization, access & implementation methods needed for multidimensional views & queries.
    Function
        Missing data: Decision support requires historical data, which op dbs do not typically maintain.
        Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources.
        Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

    What are Operational Systems?

    They are OLTP systems
    Run mission critical applications
    Need to work with stringent performance requirements for routine tasks
    Used to run a business!

    RDBMS used for OLTP

    Database Systems have been used traditionally for OLTP
        clerical data processing tasks
        detailed, up to date data
        structured repetitive tasks
        read/update a few records
        isolation, recovery and integrity are critical

    Operational Systems

    Run the business in real time
    Based on up-to-the-second data
    Optimized to handle large numbers of simple read/write transactions
    Optimized for fast response to predefined transactions
    Used by people who deal with customers, products -- clerks, salespeople etc.
    They are increasingly used by customers

    Examples of Operational Data

    Data                Industry            Usage                         Technology                                              Volumes
    Customer File       All                 Track customer details        Legacy application, flat files, mainframes              Small-medium
    Account Balance     Finance             Control account activities    Legacy applications, hierarchical databases, mainframe  Large
    Point-of-Sale data  Retail              Generate bills, manage stock  ERP, client/server, relational databases                Very large
    Call Record         Telecommunications  Billing                       Legacy application, hierarchical database, mainframe    Very large
    Production Record   Manufacturing       Control production            ERP, relational databases, AS/400                       Medium

    So, what's different?

    Application-Orientation vs. Subject-Orientation

    Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
    Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity

    OLTP vs. Data Warehouse

    OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
    Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
        e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
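The example query above constrains three things at once: place, time of day and month. A minimal sketch using Python's built-in `sqlite3` module follows; the `calls` table, its column names and the sample rows are assumptions made for illustration, and a real warehouse would use fact and dimension tables rather than one flat table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, call_start TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "2018-12-03 10:15:00", 12.0),
        ("Pune", "2018-12-10 16:40:00", 8.0),
        ("Pune", "2018-12-10 20:05:00", 50.0),    # outside 9AM-5PM
        ("Pune", "2018-11-28 11:00:00", 99.0),    # not December
        ("Mumbai", "2018-12-05 09:30:00", 70.0),  # not Pune
    ],
)

# Constrain on city, month and time of day in a single query.
avg_amount = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', call_start) = '12'
      AND time(call_start) >= '09:00:00'
      AND time(call_start) <  '17:00:00'
    """
).fetchone()[0]
```

Only the first two rows satisfy all three constraints, so the average is taken over those two calls.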

    OLTP vs Data Warehouse

    OLTP                      Warehouse (DSS)
    Application oriented      Subject oriented
    Used to run business      Used to analyze business
    Detailed data             Summarized and refined
    Current, up to date       Snapshot data
    Isolated data             Integrated data
    Repetitive access         Ad-hoc access
    Clerical user             Knowledge user (manager)

    OLTP vs Data Warehouse

    OLTP                                     Data Warehouse
    Performance sensitive                    Performance relaxed
    Few records accessed at a time (tens)    Large volumes accessed at a time (millions)
    Read/update access                       Mostly read (batch update)
    No data redundancy                       Redundancy present
    Database size 100 MB - 100 GB            Database size 100 GB - few terabytes

    OLTP vs Data Warehouse

    OLTP                                                Data Warehouse
    Transaction throughput is the performance metric    Query throughput is the performance metric
    Thousands of users                                  Hundreds of users
    Managed in entirety                                 Managed by subsets

    To summarize ...

    OLTP systems are used to run a business
    The Data Warehouse helps to optimize the business

    Why Now?

    Data is being produced
    ERP provides clean data
    The computing power is available
    The computing power is affordable
    The competitive pressures are strong
    Commercial products are available

    Myths surrounding OLAP Servers and Data Marts

    Data marts and OLAP servers are departmental solutions supporting a handful of users
    Million dollar massively parallel hardware is needed to deliver fast time for complex queries
    OLAP servers require massive and unwieldy indices
    Complex OLAP queries clog the network with data
    Data warehouses must be at least 100 GB to be effective

    Wal*Mart Case Study

    Founded by Sam Walton
    One of the largest Super Market Chains in the US
    Wal*Mart: 2000+ Retail Stores
    SAM's Clubs: 100+ Wholesalers Stores

    This case study is from Felipe Carino's (NCR Teradata) presentation made at Stanford Database Seminar

    Old Retail Paradigm

    Wal*Mart
        Inventory Management
        Merchandise Accounts Payable
        Purchasing
        Supplier Promotions: National, Region, Store Level
    Suppliers
        Accept Orders
        Promote Products
        Provide special Incentives
        Monitor and Track the Incentives
        Bill and Collect Receivables
        Estimate Retailer Demands

    New (Just-In-Time) Retail Paradigm

    No more deals
    Shelf-Pass Through (POS Application)
        One Unit Price
        Suppliers paid once a week on ACTUAL items sold
        Wal*Mart Manager
        Daily Inventory Restock
        Suppliers (sometimes Same Day) ship to Wal*Mart
    Warehouse-Pass Through
        Stock some Large Items
        Delivery may come from supplier
        Distribution Center
        Supplier's merchandise unloaded directly onto Wal*Mart Trucks

    Wal*Mart System

    NCR 5100M 96 Nodes:   24 TB Raw Disk; 700-1000 Pentium CPUs
    Number of Rows:       > 5 Billion
    Historical Data:      65 weeks (5 Quarters)
    New Daily Volume:     Current Apps: 75 Million; New Apps: 100 Million+
    Number of Users:      Thousands
    Number of Queries:    60,000 per week

    Course Overview

    0. Introduction
    I. Data Warehousing
    II. Decision Support and OLAP
    III. Data Mining
    IV. Looking Ahead

    Demos and Labs

    I. Data Warehouses: Architecture, Design & Construction

    DW Architecture
    Loading, refreshing
    Structuring/Modeling
    DWs and Data Marts
    Query Processing

    Data Warehouse Architecture

    [Diagram components: Relational Databases, Legacy Data, Purchased Data, ERP Systems; Extraction, Cleansing; Optimized Loader; Data Warehouse Engine; Metadata Repository; Analyze, Query]

    Characteristics of data warehouse architecture

    Different objectives and scope (analytical)
    Data content (read only)
    Complex analysis and quick response
    Flexible and dynamic
    Metadata driven

    Goal

    Architecture of data warehouse becomes the framework for product selection
    It is a collection of documents, plans, models, drawings, and specifications
    Architecture has to be driven by the business

    DW architecture

    It is a way of representing the overall structure of the data, processing and presentation that exists for end-user computing within the organization
    It has a number of interconnected components

    Components

    Operational database layer
    Information access layer
    Data access layer
    Data directory layer
    Process management layer
    Application messaging layer
    Data warehouse (physical) layer
    Data staging layer

    Components of the Warehouse

    Data Extraction and Loading (ETL: extract, transform, load)
    The Warehouse
    Analyze and Query -- OLAP Tools
    Metadata
    Data Mining tools

    Loading the Warehouse

    Cleaning the data before it is loaded

    Source Data

    Typically host based, legacy applications
        Customized applications, COBOL, 3GL, 4GL
    Point of Contact Devices
        POS (point of sale), ATM, call switches (a call switch manages inbound telephone calls)

    [Diagram: Operational/Source Data: Sequential, Legacy, Relational, External]

    External Sources

    Nielsen's (Nielsen monitors and measures more than 90% of global Internet activity and provides insights about the online universe, including audiences and advertising)
    Acxiom (provides a range of information services and products geared towards enterprise data management and retrieval)
    CMIE (Centre for Monitoring Indian Economy)
    Vendors, Partners

    Data Quality - The Reality

    Tempting to think creating a data warehouse is simply extracting operational data and entering it into a data warehouse
    Nothing could be farther from the truth
    Warehouse data comes from disparate questionable sources

    Data Quality - The Reality

    Legacy systems no longer documented
    Outside sources with questionable quality procedures
    Production systems with no built-in integrity checks and no integration
        Operational systems are usually designed to solve a specific business problem and are rarely developed to a corporate plan
        "And get it done quickly, we do not have time to worry about corporate standards..."

    Data Integration Across Sources

    [Diagram: four source systems (Trust, Credit card, Savings, Loans) annotated with the integration problems between them]

    Same data, different name
    Different data, same name
    Data found here, nowhere else
    Different keys, same data

    Data Transformation Example

    Field name
        appl A - balance
        appl B - bal
        appl C - currbal
        appl D - balcurr
    Unit of measure
        appl A - pipeline - cm
        appl B - pipeline - in
        appl C - pipeline - feet
        appl D - pipeline - yds
    Encoding
        appl A - m,f
        appl B - 1,0
        appl C - x,y
        appl D - male, female

    All four feed a single consistent representation in the Data Warehouse
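The three kinds of reconciliation on this slide (field names, units, encodings) can be sketched as lookup tables in Python. This is a hedged illustration only: the choice of centimetres and "M"/"F" as the warehouse standard, the assumption that `1` and `x` mean male, and the function name `to_warehouse` are all hypothetical.

```python
# Per-application gender encodings from the slide, mapped to one warehouse
# encoding. Which raw code means which gender is an assumption of this sketch.
GENDER_MAP = {
    "A": {"m": "M", "f": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"x": "M", "y": "F"},
    "D": {"male": "M", "female": "F"},
}

# Pipeline length units per application, converted to centimetres.
CM_PER_UNIT = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}

# Source column that holds the balance in each application.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def to_warehouse(app, record):
    """Convert one source record into the common warehouse representation."""
    return {
        "balance": record[BALANCE_FIELD[app]],
        "pipeline_cm": round(record["pipeline"] * CM_PER_UNIT[app], 2),
        "gender": GENDER_MAP[app][str(record["gender"]).lower()],
    }


row_b = to_warehouse("B", {"bal": 250.0, "pipeline": 10, "gender": 1})
row_d = to_warehouse("D", {"balcurr": 90.0, "pipeline": 2, "gender": "Female"})
```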

    Data Integrity Problems

    Same person, different spellings
        Agarwal, Agrawal, Aggarwal etc...
    Multiple ways to denote company name
        Persistent Systems, PSPL, Persistent Pvt. LTD.
    Use of different names
        mumbai, bombay
    Different account numbers generated by different applications for the same customer
    Required fields left blank
    Invalid product codes collected at point of sale
        manual entry leads to mistakes
        "in case of a problem use 9999999"

    Data Transformation Terms

    Extracting
    Conditioning
    Scrubbing
    Merging
    Householding
    Enrichment
    Scoring
    Loading
    Validating
    Delta Updating

    Data Transformation Terms

    Extracting
        Capture of data from operational source in "as is" status
        Sources for data generally in legacy mainframes in VSAM (virtual storage access method), IMS (information management system), IDMS (integrated DBMS), DB2; more data today in relational databases on Unix
    Conditioning
        The conversion of data types from the source to the target data store (warehouse) --

    Data Transformation Terms

    Householding
        Identifying all members of a household (living at the same address)
        Ensures only one mail is sent to a household
        Can result in substantial savings: 1 lakh catalogues at Rs. 50 each costs Rs. 50 lakhs. A 2% savings would save Rs. 1 lakh.
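A minimal householding sketch in Python follows, assuming that "same household" can be approximated by "same address after crude normalization". The customer names, addresses and helper names are hypothetical; real householding tools use far more evidence than this. The last lines simply re-check the slide's arithmetic.

```python
from collections import defaultdict


def normalize_address(address):
    """Crude normalization: lowercase, drop punctuation, collapse whitespace."""
    cleaned = "".join(
        ch for ch in address.lower() if ch.isalnum() or ch.isspace()
    )
    return " ".join(cleaned.split())


def households(customers):
    """Group (name, address) pairs by normalized address."""
    groups = defaultdict(list)
    for name, address in customers:
        groups[normalize_address(address)].append(name)
    return dict(groups)


customers = [
    ("A. Kulkarni", "12, MG Road, Pune"),
    ("S. Kulkarni", "12 MG Road Pune"),
    ("R. Joshi", "7 FC Road, Pune"),
]
mailing_list = households(customers)  # one mail per key

# The slide's arithmetic, with 1 lakh = 100,000.
catalogues = 100_000
cost_per_catalogue = 50                       # Rs.
total_cost = catalogues * cost_per_catalogue  # Rs. 50 lakhs
savings = total_cost * 2 // 100               # 2% of the total, Rs. 1 lakh
```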

    Data Transformation Terms

    Enrichment
        Bring data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet, A.C. Nielsen, CMIE, IMRA (provides an extensive digest of media, polls, and significant interviews and events) etc...
    Scoring
        Computation of a probability of an event, e.g., chance that a customer will defect to AT&T from MCI (American telecom company), chance that a customer is likely to buy a new product

    Loads

    After extracting, scrubbing, cleaning, validating etc. need to load the data into the warehouse

    Issues
        huge volumes of data to be loaded
        small time window available when warehouse can be taken off line (usually nights)
        when to build index and summary tables
        allow system administrators to monitor, cancel, resume, change load rates
        recover gracefully -- restart after failure from where you were and without loss of data integrity

    Load Techniques

    Use SQL to append or insert new data
        record-at-a-time interface
        will lead to random disk I/Os
    Use batch load utility
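The contrast between a record-at-a-time interface and a batched load can be sketched with Python's built-in `sqlite3` module. This is an analogy under stated assumptions: `executemany` inside one transaction stands in for a vendor's batch load utility, which it is not, and the table names and generated rows are hypothetical.

```python
import sqlite3

rows = [(i, f"item-{i}", i * 1.5) for i in range(1000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_a (id INTEGER, item TEXT, amount REAL)")
conn.execute("CREATE TABLE sales_b (id INTEGER, item TEXT, amount REAL)")

# Record-at-a-time: one INSERT statement and one commit per row.
for row in rows:
    conn.execute("INSERT INTO sales_a VALUES (?, ?, ?)", row)
    conn.commit()

# Batch: hand the whole set to the driver inside a single transaction.
with conn:
    conn.executemany("INSERT INTO sales_b VALUES (?, ?, ?)", rows)

count_a = conn.execute("SELECT COUNT(*) FROM sales_a").fetchone()[0]
count_b = conn.execute("SELECT COUNT(*) FROM sales_b").fetchone()[0]
```

Both paths load the same rows; the difference lies in how many statements and commits the database has to process along the way.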

    Load Taxonomy

    Incremental versus Full loads
    Online versus Offline loads

    Refresh

    Propagate updates on source data to the warehouse
    Issues:
        when to refresh
        how to refresh -- refresh techniques

    When to Refresh?

    periodically (e.g., every night, every week) or after significant events
    on every update: not warranted unless warehouse data require current data (up to the minute stock quotes)
    refresh policy set by administrator based on user needs and traffic
    possibly different policies for different sources

    Refresh Techniques

    Full Extract from base tables
        read entire source table: too expensive
        maybe the only choice for legacy systems

    How To Detect Changes

    Create a snapshot log table to record ids of updated rows of source data and timestamp
    Detect changes by:
        Defining after row triggers to update snapshot log when source table changes
        Using regular transaction log to detect changes to source data
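The trigger-based option can be sketched with SQLite through Python's `sqlite3` module. The table and trigger names (`customer`, `customer_snapshot_log`, `customer_after_insert`, `customer_after_update`) are hypothetical, and SQLite is used only because it ships with Python; trigger syntax differs between database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (custid INTEGER PRIMARY KEY, name TEXT);

    -- Snapshot log: ids of changed source rows plus a timestamp.
    CREATE TABLE customer_snapshot_log (
        custid     INTEGER,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER customer_after_insert AFTER INSERT ON customer
    FOR EACH ROW BEGIN
        INSERT INTO customer_snapshot_log (custid) VALUES (NEW.custid);
    END;

    CREATE TRIGGER customer_after_update AFTER UPDATE ON customer
    FOR EACH ROW BEGIN
        INSERT INTO customer_snapshot_log (custid) VALUES (NEW.custid);
    END;
    """
)

conn.execute("INSERT INTO customer VALUES (1, 'Agarwal')")
conn.execute("INSERT INTO customer VALUES (2, 'Seshadri')")
conn.execute("UPDATE customer SET name = 'Agrawal' WHERE custid = 1")

# A refresh job can read only the logged ids instead of the whole table.
changed_ids = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT custid FROM customer_snapshot_log ORDER BY custid"
    )
]
log_rows = conn.execute(
    "SELECT COUNT(*) FROM customer_snapshot_log"
).fetchone()[0]
```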

    Data Extraction and Cleansing

    Extract data from existing operational and legacy data
    Issues:
        Sources of data for the warehouse
        Data quality at the sources
        Merging different data sources
        Data Transformation
        How to propagate updates (on the sources) to the warehouse
        Terabytes of data to be loaded

    Scrubbing Data

    Sophisticated transformation tools
    Used for improving the quality of data
    Clean data is vital for the success of the warehouse
    Example
        Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
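A rough idea of how spelling variants might be matched can be given with `difflib` from the Python standard library. This is a toy sketch, not how the commercial scrubbing tools work: the function name `refers_to` and the 0.85 threshold are assumptions, and matching on surname similarity alone would also merge different people who share a surname.

```python
from difflib import SequenceMatcher


def refers_to(name, canonical_surname, threshold=0.85):
    """True if any token of `name` is close to the canonical surname."""
    target = canonical_surname.lower()
    for token in name.lower().replace(".", " ").split():
        if SequenceMatcher(None, token, target).ratio() >= threshold:
            return True
    return False


variants = [
    "Seshadri",
    "Sheshadri",
    "Sesadri",
    "Seshadri S.",
    "Srinivasan Seshadri",
]
matches = [refers_to(v, "Seshadri") for v in variants]
```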

    Scrubbing Tools

    Apertus -- Enterprise/Integrator
    Vality -- IPE
    Postal Soft

    Structuring/Modeling Issues

    Data -- Heart of the Data Warehouse

    Heart of the data warehouse is the data itself!
    Single version of the truth
    Corporate memory
    Data is organized in a way that represents business -- subject orientation

    Data Warehouse Structure

    Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables

    Data Warehouse Structure

    base customer (1985-87)
        custid, from date, to date, name, phone, dob
    base customer (1988-90)
        custid, from date, to date, name, credit rating, employer
    customer activity (1986-89) -- monthly summary
    customer activity detail (1987-89)
        custid, activity date, amount, clerk id, order no
    customer activity detail (1990-91)
        custid, activity date, amount, line item no, order no

    Time is part of the key of each table

    Data Granularity in Warehouse

    Summarized data stored
        reduce storage costs
        reduce CPU usage
        increases performance since smaller number of records to be processed
        design around traditional high level reporting needs
        tradeoff with volume of data to be stored and detailed usage of data

    Granularity in Warehouse

    Cannot answer some questions with summarized data
        Did Anand call Seshadri last month? Not possible to answer if only the total duration of calls by Anand over a month is maintained and individual call details are not.
    Detailed data too voluminous
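The point about summarized data can be made concrete with a tiny Python sketch. The call records, including the third party "Ramesh" and the call durations, are invented for illustration.

```python
# Detail grain: one row per call (caller, callee, minutes).
call_detail = [
    ("Anand", "Seshadri", 5),
    ("Anand", "Ramesh", 12),
    ("Anand", "Seshadri", 3),
]

# Summary grain: total minutes per caller for the month.
monthly_summary = {}
for caller, _callee, minutes in call_detail:
    monthly_summary[caller] = monthly_summary.get(caller, 0) + minutes

# Answerable from the summary: how long did Anand talk this month?
anand_total = monthly_summary["Anand"]

# Answerable only from the detail: did Anand call Seshadri?
anand_called_seshadri = any(
    caller == "Anand" and callee == "Seshadri"
    for caller, callee, _minutes in call_detail
)
```

Once only `monthly_summary` is kept, the callee column is gone and the second question can no longer be answered.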

    Granularity in Warehouse

    Tradeoff is to have dual level of granularity
        Store summary data on disks
            95% of DSS processing done against this data
        Store detail on tapes
            5% of DSS processing against this data

    Vertical Partitioning

    Original table:                 Acct. No | Name | Balance | Date Opened | Interest Rate | Address
    Frequently accessed columns:    Acct. No | Balance    (smaller table and so less I/O)
    Rarely accessed columns:        Acct. No | Name | Date Opened | Interest Rate | Address
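The split shown on this slide can be sketched with SQLite via Python's `sqlite3` module. The table names `account_hot` and `account_cold` and the sample row are hypothetical; the columns follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Frequently accessed columns: a narrow table, so less I/O per scan.
    CREATE TABLE account_hot (acct_no INTEGER PRIMARY KEY, balance REAL);

    -- Rarely accessed columns, keyed by the same account number.
    CREATE TABLE account_cold (
        acct_no       INTEGER PRIMARY KEY,
        name          TEXT,
        date_opened   TEXT,
        interest_rate REAL,
        address       TEXT
    );

    INSERT INTO account_hot  VALUES (101, 2500.0);
    INSERT INTO account_cold VALUES (101, 'A. Kulkarni', '2015-04-01', 3.5, 'Pune');
    """
)

# The common query touches only the narrow table.
balance = conn.execute(
    "SELECT balance FROM account_hot WHERE acct_no = 101"
).fetchone()[0]

# The full row is still available by joining on the shared key.
full_row = conn.execute(
    """
    SELECT h.acct_no, c.name, h.balance, c.date_opened, c.interest_rate, c.address
    FROM account_hot AS h
    JOIN account_cold AS c ON c.acct_no = h.acct_no
    """
).fetchone()
```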

    Derived Data

    Introduction of derived (calculated) data may often help
    Have seen this in the context of dual levels of granularity
    Can keep auxiliary views and indexes to speed up query processing

    Schema Design

    Database organization
        must look like business
        must be recognizable by business user
        approachable by business user
        must be simple
    Schema Types
        Star Schema
        Fact Constellation Schema
        Snowflake schema

    Dimension Tables

    Dimension tables
        Define business in terms already familiar to users
        Wide rows with lots of descriptive text
        Small tables (about a million rows)
        Joined to fact table by a foreign key
        heavily indexed
        typical dimensions
            time periods, geographic region (markets, cities), products, customers, salesperson, etc.

    In data warehousing, a dimension table is one of the set of companion tables to a fact table.
    The fact table contains business facts or measures and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables.
    The dimension tables contain attributes (or fields) used to constrain and group data when performing data warehousing queries.

  • 7/30/2019 Datawarehouse Intro Ch1 Ch2

    95/193

    98

    Fact Table

    Central tablemostly raw numeric itemsnarrow rows, a few columns at mostlarge number of rows (millions to a

    billion)Access via dimensions

    In data warehousing, a fact table consists of the measurements, metrics or facts of a business process.

    Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed.

    Star Schema

    A single fact table and, for each dimension, one dimension table
    Does not capture hierarchies directly

    [Diagram: a central fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust and city]
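As a rough illustration of the star layout, the sketch below builds a tiny star schema in an in-memory SQLite database and runs one query through a dimension. The fact-table key columns follow the slide; the `sales` measure, the descriptive columns, the sample rows, and the omission of the time dimension table are all assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per dimension (the time dimension is omitted for brevity).
    CREATE TABLE cust (custno INTEGER PRIMARY KEY, custname TEXT);
    CREATE TABLE prod (prodno INTEGER PRIMARY KEY, prodname TEXT);
    CREATE TABLE city (cityname TEXT PRIMARY KEY, state TEXT);

    -- Single central fact table: foreign keys plus a numeric measure.
    CREATE TABLE fact (
        date     TEXT,
        custno   INTEGER REFERENCES cust(custno),
        prodno   INTEGER REFERENCES prod(prodno),
        cityname TEXT REFERENCES city(cityname),
        sales    REAL
    );

    INSERT INTO cust VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO prod VALUES (10, 'Juice'), (11, 'Cola');
    INSERT INTO city VALUES ('Pune', 'MH'), ('Mumbai', 'MH');
    INSERT INTO fact VALUES
        ('1997-06-01', 1, 10, 'Pune', 5.0),
        ('1997-06-01', 2, 11, 'Mumbai', 7.5),
        ('1997-06-02', 1, 11, 'Pune', 2.5);
""")

# Access the facts via a dimension: total sales per product name.
rows = conn.execute("""
    SELECT p.prodname, SUM(f.sales)
    FROM fact f JOIN prod p ON p.prodno = f.prodno
    GROUP BY p.prodname
    ORDER BY p.prodname
""").fetchall()
```

Each query joins the fact table to only the dimensions it constrains or groups by, which is what keeps star queries simple to write.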

    Snowflake schema

    Represent dimensional hierarchy directly by normalizing tables
    Easy to maintain and saves storage

    [Diagram: a central fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust and city, with an additional region table normalized out of the dimensions]

    A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape.

    Closely related to the star schema, the snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. In the snowflake schema, however, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized, with each dimension being represented by a single table.

    When the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and where child tables have multiple parent tables ("forks in the road"), a complex snowflake shape starts to emerge. The "snowflaking" effect only affects the dimension tables and not the fact tables.

    Fact Constellation

    Multiple fact tables that share many dimension tables
    Booking and Checkout may share many dimension tables in the hotel industry

    [Diagram: fact tables Booking and Checkout sharing the dimension tables Hotels, Travel Agents, Promotion, Room Type and Customer]

    De-normalization

    Normalization in a data warehouse may lead to lots of small tables
    Can lead to excessive I/Os since many tables have to be accessed
    De-normalization is the answer, especially since updates are rare

    Creating Arrays

    Many times each occurrence of a sequence of data is in a different physical location
    Beneficial to collect all occurrences together and store them as an array in a single row
    Makes sense only if there is a stable number of occurrences which are accessed together
    In a data warehouse, such situations arise naturally due to the time-based orientation
        can create an array by month
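A minimal sketch of the "array by month" idea: scattered per-month rows are collected into one row per account holding a fixed twelve-slot array. The row layout, account keys and the `to_month_arrays` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical (account, month, amount) rows, one row per occurrence.
monthly_rows = [("A1", 1, 10.0), ("A1", 2, 12.5), ("A2", 1, 7.0), ("A1", 12, 3.0)]


def to_month_arrays(rows, months=12):
    """Collect all occurrences for a key into one fixed-length array per key."""
    arrays = defaultdict(lambda: [0.0] * months)
    for key, month, amount in rows:
        arrays[key][month - 1] += amount  # months are numbered 1..12
    return dict(arrays)


by_account = to_month_arrays(monthly_rows)

# One lookup now returns the whole year for an account.
a1_year_total = sum(by_account["A1"])
```

The fixed length is what the slide means by a stable number of occurrences: twelve months per year fits, an open-ended list of transactions would not.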

    Selective Redundancy

    Description of an item can be stored redundantly with the order table, since most often the item description is accessed along with the order table
    Updates have to be handled carefully

    Partitioning

    Breaking data into several physical units that can be handled separately
    Not a question of whether to do it in data warehouses but how to do it
    Granularity and partitioning are key to effective implementation of a warehouse

    Why Partition?

    Flexibility in managing data
    Smaller physical units allow
        easy restructuring
        free indexing
        sequential scans if needed
        easy reorganization
        easy recovery
        easy monitoring

    Criterion for Partitioning

    Typically partitioned by
        date
        line of business
        geography
        organizational unit
        any combination of the above
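The criteria above can be combined into a single partition key. The sketch below partitions hypothetical transaction rows by year and line of business; the field names and the `partition_key` function are illustrative assumptions, not something the slides prescribe.

```python
from collections import defaultdict
from datetime import date


def partition_key(row):
    """Hypothetical criterion: transaction year combined with line of business."""
    return (row["txn_date"].year, row["line_of_business"])


def partition(rows, key_fn=partition_key):
    """Break rows into separate units that can be handled independently."""
    parts = defaultdict(list)
    for row in rows:
        parts[key_fn(row)].append(row)
    return dict(parts)


rows = [
    {"txn_date": date(1996, 5, 1), "line_of_business": "retail", "amount": 10},
    {"txn_date": date(1997, 1, 9), "line_of_business": "retail", "amount": 20},
    {"txn_date": date(1997, 3, 2), "line_of_business": "corporate", "amount": 30},
    {"txn_date": date(1997, 7, 4), "line_of_business": "retail", "amount": 40},
]

partitions = partition(rows)

# A query for 1997 retail data only has to read one partition.
retail_1997 = sum(r["amount"] for r in partitions[(1997, "retail")])
```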

    Where to Partition?

    Application level or DBMS level
    Makes sense to partition at the application level
        Allows a different definition for each year
            Important since the warehouse spans many years and, as the business evolves, the definition changes
        Allows data to be moved between processing complexes easily

    Data Warehouse vs. Data Marts

    What comes first?

    From the Data Warehouse to Data Marts

    [Diagram labels: Data Warehouse (organizationally structured), departmentally structured, individually structured; a Less/More scale for history, normalized, detailed; Data to Information]

    Data Warehouse and Data Marts

    [Diagram labels: OLAP data mart (lightly summarized, departmentally structured); data warehouse (organizationally structured, atomic, detailed data)]

    Characteristics of the Departmental Data Mart

    OLAP
    Small
    Flexible
    Customized by department
    Source is the departmentally structured data warehouse

    Techniques for Creating Departmental Data Mart

    OLAP
    Subset
    Summarized
    Superset
    Indexed
    Arrayed

    [Diagram labels: departmental data marts for Sales, Mktg. and Finance]

    Data Mart Centric

    [Diagram labels: Data Sources, Data Marts, Data Warehouse]

    Problems with Data Mart Centric Solution

    If you end up creating multiple warehouses, integrating them is a problem

    True Warehouse

    [Diagram labels: Data Sources, Data Warehouse, Data Marts]

    Query Processing (end)

    Indexing
    Pre-computed views/aggregates
    SQL extensions

    Indexing Techniques

    Exploiting indexes to reduce scanning of data is of crucial importance
    Bitmap indexes
    Join indexes
    Other issues
        Text indexing
        Parallelizing and sequencing of index builds and incremental updates

    Indexing Techniques

    Bitmap index:
        A collection of bitmaps, one for each distinct value of the column
        Each bitmap has N bits, where N is the number of rows in the table
        The bit for a value v and a row r is set if and only if r has the value v for the indexed attribute

    BitMap Indexes

    An alternative representation of a RID-list
    Especially advantageous for low-cardinality domains
    Represent each row of a table by a bit and the table as a bit vector
    There is a distinct bit vector Bv for each value v of the domain
    Example: the attribute sex has values M and F. A table of 100 million people needs 2 lists of 100 million bits

    Bitmap Index

    Query: select * from customer where gender = 'F' and vote = 'Y'

    Customer table and bitmaps (the result is the AND of the two bitmaps)
        Row  gender  vote  gender = F  vote = Y  AND
        1    M       Y     0           1         0
        2    F       Y     1           1         1
        3    F       Y     1           1         1
        4    F       N     1           0         0
        5    F       N     1           0         0
        6    M       N     0           0         0

    Bit Map Index

    Base Table
        Cust  Region  Rating
        C1    N       H
        C2    S       M
        C3    W       L
        C4    W       H
        C5    S       L
        C6    W       L
        C7    N       H

    Region Index
        Row ID  N  S  E  W
        1       1  0  0  0
        2       0  1  0  0
        3       0  0  0  1
        4       0  0  0  1
        5       0  1  0  0
        6       0  0  0  1
        7       1  0  0  0

    Rating Index
        Row ID  H  M  L
        1       1  0  0
        2       0  1  0
        3       0  0  1
        4       1  0  0
        5       0  0  1
        6       0  0  1
        7       1  0  0

    Query: customers where Region = W AND Rating = M
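The bitmap definition can be tried out directly on the base table from this slide. The sketch below stores each bitmap as a Python integer (bit i stands for row i, counting from 0); the `build_bitmap_index` and `matching_rows` helpers are hypothetical names chosen for the illustration.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Return {distinct value: bitmap}; bit i is set iff row i has that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Decode a bitmap into the list of row positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


customers = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
region = ["N", "S", "W", "W", "S", "W", "N"]
rating = ["H", "M", "L", "H", "L", "L", "H"]

region_index = build_bitmap_index(region)
rating_index = build_bitmap_index(rating)

# A conjunctive predicate becomes one bitwise AND of two bitmaps.
west_low = [customers[i]
            for i in matching_rows(region_index["W"] & rating_index["L"])]

# The query on the slide: Region = W AND Rating = M.
west_medium = [customers[i]
               for i in matching_rows(region_index["W"] & rating_index["M"])]
```

With this base table the slide's query matches no customer, because the only M-rated customer (C2) is in region S. OR and NOT predicates map to the corresponding bit operations in the same way.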

    BitMap Indexes

    Comparison, join and aggregation operations are reduced to bit arithmetic, with dramatic improvement in processing time
    Significant reduction in space and I/O (30:1)
    Adapted for higher-cardinality domains as well
    Compression (e.g., run-length encoding) exploited
    Products that support bitmaps: Model 204, TargetIndex (Redbrick), IQ (Sybase), Oracle 7.3

    Join Indexes

    Pre-computed joins
    A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
        e.g., a join index on the city dimension of the calls fact table correlates, for each city, the calls (in the calls table) from that city

    Join Indexes

    Join indexes can also span multiple dimension tables
        e.g., a join index on the city and time dimensions of the calls fact table
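A join index can be pictured as a precomputed map from dimension values to the fact rows that carry them. The sketch below is a simplified in-memory version; the sample `calls` rows and the `build_join_index` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical calls fact table; the list position serves as the row id.
calls = [
    {"city": "Pune", "month": "Jan", "minutes": 5},
    {"city": "Mumbai", "month": "Jan", "minutes": 3},
    {"city": "Pune", "month": "Feb", "minutes": 8},
    {"city": "Pune", "month": "Jan", "minutes": 2},
]


def build_join_index(fact_rows, *dimension_columns):
    """Map each combination of dimension values to the matching fact row ids."""
    index = defaultdict(list)
    for row_id, row in enumerate(fact_rows):
        index[tuple(row[c] for c in dimension_columns)].append(row_id)
    return dict(index)


city_index = build_join_index(calls, "city")                 # one dimension
city_time_index = build_join_index(calls, "city", "month")   # two dimensions

# The join is already done: look up the row ids and read only those facts.
pune_jan_minutes = sum(calls[i]["minutes"]
                       for i in city_time_index[("Pune", "Jan")])
```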

    Star Join Processing

    Use join indexes to join dimension and fact table

    [Diagram: Calls is joined with Time to give C+T, then with Location to give C+T+L, then with Plan to give C+T+L+P]

    Optimized Star Join Processing

    [Diagram: selections are applied to Time, Location and Plan; a virtual cross product of T, L and P is then joined with Calls]

    Bitmapped Join Processing

    [Diagram: Time, Location and Plan are each joined with Calls to produce a bitmap per dimension; the bitmaps are then combined with AND]

    Intelligent Scan

    Piggyback multiple scans of a relation (Redbrick)
        piggybacking is also done if the second scan starts a little while after the first scan

    Parallel Query Processing

    Three forms of parallelism
        Independent
        Pipelined
        Partitioned, and partition and replicate
    Deterrents to parallelism
        startup
        communication

    Parallel Query Processing

    Partitioned data
        Parallel scans
        Yields I/O parallelism
    Parallel algorithms for relational operators
        Joins, aggregates, sort
    Parallel utilities
        Load, archive, update, parse, checkpoint, recovery
    Parallel query optimization

    Pre-computed Aggregates

    Keep aggregated data for efficiency (pre-computed queries)
    Questions
        Which aggregates to compute?
        How to update aggregates?
        How to use pre-computed aggregates in queries?
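One possible answer to "how to update aggregates" is incremental maintenance: newly loaded rows are folded into the stored totals instead of recomputing from the detail data. The sketch below assumes a simple additive measure (sum of sales per month); the `SalesByMonth` class and its data are hypothetical.

```python
from collections import defaultdict


class SalesByMonth:
    """A pre-computed aggregate: total sales per month, refreshed incrementally."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, new_rows):
        """Fold newly loaded (month, amount) rows into the stored totals."""
        for month, amount in new_rows:
            self.totals[month] += amount

    def total(self, month):
        """Answer a query from the aggregate without scanning detail rows."""
        return self.totals.get(month, 0.0)


agg = SalesByMonth()
agg.apply([("1997-06", 100.0), ("1997-06", 50.0), ("1997-07", 25.0)])  # initial load
agg.apply([("1997-07", 75.0)])                                         # later refresh
```

This works because sums are additive. Aggregates such as median cannot be refreshed this way and generally need the detail rows again.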

    Pre-computed Aggregates

    Aggregated table can be maintained by the
        warehouse server
        middle tier
        client applications
    Pre-computed aggregates are a special case of materialized views: the same questions and issues remain

    SQL Extensions

    Extended family of aggregate functions
        rank (top 10 customers)
        percentile (top 30% of customers)
        median, mode
        Object-relational systems allow addition of new aggregate functions

    SQL Extensions

    Reporting features
        running total, cumulative totals
    Cube operator
        group by on all subsets of a set of attributes (month, city)
        redundant scan and sorting of data can be avoided

    Red Brick has Extended Set of Aggregates

    select month, dollars, cume(dollars) as run_dollars,
           weight, cume(weight) as run_weights
    from sales, market, product, period t
    where year = 1993
      and product like 'Columbian%'
      and city like 'San Fr%'
    order by t.perkey
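For readers without access to Red Brick, the effect of a cumulative aggregate can be mimicked in Python. The sketch assumes that `cume(dollars)` yields a running total in the order given by `order by`; the sample rows are hypothetical.

```python
from itertools import accumulate

# Hypothetical (month, dollars) rows, already sorted by period key.
sales = [("Jan", 100), ("Feb", 150), ("Mar", 50)]

# Running total of dollars in row order, analogous to cume(dollars).
run_dollars = list(accumulate(dollars for _, dollars in sales))

# Report rows: month, dollars, run_dollars.
report = [(month, dollars, running)
          for (month, dollars), running in zip(sales, run_dollars)]
```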

    RISQL (Red Brick Systems) Extensions

    Aggregates
        CUME
        MOVINGAVG
        MOVINGSUM
        RANK
        TERTILE
        RATIOTOREPORT
    Calculating row subtotals
        BREAK BY
    Sophisticated date/time support
        DATEDIFF
    Using subqueries in calculations

    Using SubQueries in Calculations

    select product, dollars as jun97_sales,
        (select sum(si.dollars)
         from market mi, product pi, period ti, sales si
         where pi.product = product.product
           and ti.year = period.year
           and mi.city = market.city) as total97_sales,
        100 * dollars /
        (select sum(si.dollars)
         from market mi, product pi, period ti, sales si
         where pi.product = product.product
           and ti.year = period.year
           and mi.city = market.city) as percent_of_yr
    from market, product, period, sales
    where year = 1997
      and month = 'June'
      and city like 'Ahmed%'
    order by product;

    Course Overview

    The course: what and how
        0. Introduction
        I. Data Warehousing
        II. Decision Support and OLAP
        III. Data Mining
        IV. Looking Ahead
    Demos and Labs

    II. On-Line Analytical Processing (OLAP)

    Making Decision Support Possible

    Limitations of SQL

    "A Freshman in Business needs a Ph.D. in SQL"
    -- Ralph Kimball

    Typical OLAP Queries

    Write a multi-table join to compare sales for each product line YTD this year vs. last year.
    Repeat the above process to find the top 5 product contributors to margin.
    Repeat the above process to find the sales of a product line to new vs. existing customers.
    Repeat the above process to find the customers that have had negative sales growth.

    What Is OLAP?

    Online Analytical Processing: coined by E.F. Codd in a 1994 paper contracted by Arbor Software *
    Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
    OLAP = Multidimensional Database
    MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
    ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)

    * Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

    The OLAP Market

    Rapid growth in the enterprise market
        1995: $700 Million
        1997: $2.1 Billion
    Significant consolidation activity among major DBMS vendors
        10/94: Sybase acquires ExpressWay
        7/95: Oracle acquires Express
        11/95: Informix acquires Metacube
        1/97: Arbor partners up with IBM
        10/96: Microsoft acquires Panorama
    Result: OLAP shifted from small vertical niche to mainstream DBMS category

    Strengths of OLAP

    It is a powerful visualization paradigm
    It provides fast, interactive response times
    It is good for analyzing time series
    It can be useful to find some clusters and outliers
    Many vendors offer OLAP tools

    OLAP Is FASMI

    Fast
    Analysis
    Shared
    Multidimensional
    Information

    Nigel Pendse, Richard Creath - The OLAP Report

    Multi-dimensional Data

    "Hey, I sold $100M worth of goods"

    Dimensions: Product, Region, Time

    [Diagram: a sales cube with axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (W, S, N) and Month (1 to 7)]

    Hierarchical summarization paths
        Product: Industry > Category > Product
        Region: Country > Region > City > Office
        Time: Year > Quarter > Month or Week > Day

    Data Cube Lattice

    Cube lattice
        ABC
        AB  AC  BC
        A   B   C
        none
    Can materialize some groupbys, compute others on demand
    Question: which groupbys to materialize?
    Question: what indices to create?
    Question: how to organize data (chunks, etc.)?
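The lattice and the "compute others on demand" idea can be sketched as follows. The code enumerates every group-by over three dimensions and derives coarser group-bys from a finer one that is assumed to be materialized; the helper names and the sample totals are hypothetical, and the measure is assumed to be an additive sum.

```python
from collections import defaultdict
from itertools import combinations


def lattice(dims):
    """All group-by sets over dims, from the full set down to the empty set ('none')."""
    return [combo
            for k in range(len(dims), -1, -1)
            for combo in combinations(dims, k)]


def group_by(rows, dims, keep):
    """Aggregate totals keyed on `dims` down to the subset of dimensions `keep`."""
    positions = [dims.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in rows.items():
        out[tuple(key[p] for p in positions)] += value
    return dict(out)


dims = ("A", "B", "C")

# Hypothetical materialized ABC group-by: (a, b, c) -> total.
abc = {("a1", "b1", "c1"): 1, ("a1", "b2", "c1"): 2, ("a2", "b1", "c2"): 4}

nodes = lattice(dims)                       # 2**3 = 8 group-bys
ab = group_by(abc, dims, ("A", "B"))        # derived from ABC
a = group_by(ab, ("A", "B"), ("A",))        # derived from AB, not from the raw data
grand_total = group_by(abc, dims, ())       # the 'none' node
```

Any node can be computed from any materialized ancestor in the lattice, which is why the choice of which group-bys to store is a cost trade-off rather than a correctness question.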

    Visualizing Neighbors is simpler

    [Diagram: the same sales data shown two ways: as a flat list of (Month, Store, Sales) rows (Apr 1, Apr 2, ..., Apr 8, May 1, ...), and as a grid with months Apr to Mar on one axis and stores 1 to 8 on the other]

    A Visual Operation: Pivot (Rotate)

    [Diagram: a grid of sales values (10, 47, 30, 12) with Product (Juice, Cola, Milk, Cream) on one axis and Date (3/1, 3/2, 3/3, 3/4) on the other]

    Slicing and Dicing

    [Diagram: a cube with axes Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special) and region (India, Far East, Europe), with the Telecomm slice highlighted]

    Roll-up and Drill Down

    Hierarchy, from higher level of aggregation down to low-level details
        Sales Channel
        Region
        Country
        State
        Location Address
        Sales Representative

    Nature of OLAP Analysis

    Aggregation -- (total sales, percent-to-total)
    Comparison -- Budget vs. Expenses
    Ranking -- Top 10, quartile analysis
    Access to detailed and aggregate data
    Complex criteria specification
    Visualization

    Organizationally Structured Data

    Different departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data.

    [Diagram labels: marketing, manufacturing, sales, finance]

    Multidimensional Spreadsheets

    Analysts need spreadsheets that support
        pivot tables (cross-tabs)
        drill-down and roll-up
        slice and dice
        sort
        selections
        derived attributes
    Popular in retail domain

    OLAP - Data Cube

    Idea: analysts need to group data in many different ways
        e.g., Sales(region, product, prodtype, prodstyle, date, saleamount)
        saleamount is a measure attribute, the rest are dimension attributes
        groupby every subset of the other attributes
        materialize (precompute and store) groupbys to give online response
    Also: hierarchies on attributes: date -> weekday, date -> month -> quarter -> year
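The date hierarchy mentioned on this slide can be illustrated with a small roll-up routine. Each level is a function mapping a date to its ancestor at that level; the function names and the sample sales are hypothetical.

```python
from collections import defaultdict
from datetime import date


def month_of(d):
    return (d.year, d.month)


def quarter_of(d):
    return (d.year, (d.month - 1) // 3 + 1)


def year_of(d):
    return d.year


def roll_up(sales, level):
    """Total the measure at the hierarchy level given by `level`."""
    totals = defaultdict(float)
    for day, amount in sales:
        totals[level(day)] += amount
    return dict(totals)


# Hypothetical (date, saleamount) facts.
sales = [
    (date(1997, 1, 15), 10.0),
    (date(1997, 2, 3), 20.0),
    (date(1997, 4, 9), 5.0),
    (date(1998, 1, 1), 1.0),
]

by_month = roll_up(sales, month_of)
by_quarter = roll_up(sales, quarter_of)
by_year = roll_up(sales, year_of)
```

Moving from `by_month` to `by_year` is a roll-up; going the other way, back to finer levels, is a drill-down and needs the more detailed data.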

    SQL Extensions

    Front-end tools require
        Extended family of aggregate functions
            rank, median, mode
        Reporting features
            running totals, cumulative totals
        Results of multiple group bys
            total sales by month and total sales by product
        Data cube

    Relational OLAP: 3 Tier DSS

    Data Warehouse (Database Layer)
        Store atomic data in industry standard RDBMS.
    ROLAP Engine (Application Logic Layer)
        Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.
    Decision Support Client (Presentation Layer)
        Obtain multi-dimensional reports from the DSS Client.

    MD-OLAP: 2 Tier DSS

    MDDB Engine (Database Layer and Application Logic Layer)
        Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.
    Decision Support Client (Presentation Layer)
        Obtain multi-dimensional reports from the DSS Client.

    Typical OLAP Problems: Data Explosion

    [Chart: "Data Explosion Syndrome", plotting number of aggregations against number of dimensions (4 levels in each dimension). Source: Microsoft TechEd 98]
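The shape of the chart can be reproduced with simple arithmetic. The sketch assumes that an aggregate picks one of the 4 levels independently in each dimension, so d dimensions give 4^d possible aggregations; the chart's actual data points did not survive extraction, so this counting rule is an assumption and not a reading of the original figure.

```python
def aggregate_count(dimensions, levels_per_dimension=4):
    """Possible aggregations when each dimension contributes one of its levels."""
    return levels_per_dimension ** dimensions


# Growth for 2 to 8 dimensions under the assumption above.
counts = {d: aggregate_count(d) for d in range(2, 9)}
```

Under this rule every extra dimension multiplies the number of candidate aggregates by four, which is why pre-calculating every outcome quickly becomes impractical.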

    Metadata Repository

    Administrative metadata
        source databases and their contents
        gateway descriptions
        warehouse schema, view & derived data definitions
        dimensions, hierarchies
        pre-defined queries and reports
        data mart locations and contents
        data partitions
        data extraction, cleansing, transformation rules, defaults
        data refresh and purging rules
        user profiles, user groups
        security: user authorization, access control

    Metadata Repository .. 2

    Business data
        business terms and definitions
        ownership of data
        charging policies
    Operational metadata
        data lineage: history of migrated data and sequence of transformations applied
        currency of data: active, archived, purged
        monitoring information: warehouse usage statistics, error reports, audit trails

    Recipe for a Successful Warehouse

    For a Successful Warehouse

    From day one establish that warehousing is a joint user/builder project
    Establish that maintaining data quality will be an ONGOING joint user/builder responsibility
    Train the users one step at a time
    Consider doing a high level corporate data model in no more than three weeks

    From Larry Greenfield, http://pwp.starnetinc.com/larryg/index.html

    For a Successful Warehouse

    Look closely at the data extracting, cleaning, and loading tools
    Implement a user accessible automated directory to information stored in the warehouse
    Determine a plan to test the integrity of the data in the warehouse
    From the start get warehouse users in the habit of 'testing' complex queries

    For a Successful Warehouse

    Coordinate system roll-out with network administration personnel
    When in a bind, ask others who have done the same thing for advice
    Be on the lookout for small, but strategic, projects
    Market and sell your data warehousing systems

    Data Warehouse Pitfalls

    You are going to spend much time extracting,cleaning, and loading data

    Despite best efforts at project management, datawarehousing project scope will increase

    You are going to find problems with systemsfeeding the data warehouse

    You will find the need to store data not beingcaptured by any existing system

    You will need to validate data not being validatedby transaction processing systems

    Data Warehouse Pitfalls

    Some transaction processing systems feeding thewarehousing system will not contain detail

    Many warehouse end users will be trained andnever or seldom apply their training

    After end users receive query and report tools,requests for IS written reports may increase

    Your warehouse users will develop conflictingbusiness rules

    Large scale data warehousing can become anexercise in data homogenizing

    DW and OLAP Research Issues

    Data cleaning
        focus on data inconsistencies, not schema differences
        data mining techniques
    Physical design
        design of summary tables, partitions, indexes
        tradeoffs in use of different indexes
    Query processing
        selecting appropriate summary tables
        dynamic optimization with feedback
        acid test for query optimization: cost estimation, use of transformations, search strategies
        partitioning query processing between OLAP server and backend server

    DW and OLAP Research Issues .. 2

    Warehouse management
        detecting runaway queries
        resource management
        incremental refresh techniques
        computing summary tables during load
        failure recovery during load and refresh
        process management: scheduling queries, load and refresh
        query processing, caching
        use of workflow technology for process management

    Products, References, Useful Links

    Reporting Tools

    Andyne Computing -- GQL
    Brio -- BrioQuery
    Business Objects -- Business Objects
    Cognos -- Impromptu
    Information Builders Inc. -- Focus for Windows
    Oracle -- Discoverer2000
    Platinum Technology -- SQL*Assist, ProReports
    PowerSoft -- InfoMaker
    SAS Institute -- SAS/Assist
    Software AG -- Esperant
    Sterling Software -- VISION:Data

    OLAP and Executive Information Systems

    Andyne Computing -- Pablo
    Arbor Software -- Essbase
    Cognos -- PowerPlay
    Comshare -- Commander OLAP
    Holistic Systems -- Holos
    Information Advantage -- AXSYS, WebOLAP
    Informix -- Metacube
    Microstrategies -- DSS/Agent
    Microsoft -- Plato
    Oracle -- Express
    Pilot -- LightShip
    Planning Sciences -- Gentium
    Platinum Technology -- ProdeaBeacon, Forest & Trees
    SAS Institute -- SAS/EIS, OLAP++
    Speedware -- Media

    Other Warehouse Related Products

    Data extract, clean, transform, refresh
        CA-Ingres replicator
        Carleton Passport
        Prism Warehouse Manager
        SAS Access
        Sybase Replication Server
        Platinum Inforefiner, Infopump

    Extraction and Transformation Tools

    Carleton Corporation -- Passport

    Evolutionary Technologies Inc. -- Extract

    Informatica -- OpenBridge

    Information Builders Inc. -- EDA Copy Manager

    Platinum Technology -- InfoRefiner

    Prism Solutions -- Prism Warehouse Manager

    Red Brick Systems -- DecisionScape Formation

    Scrubbing Tools

    Apertus -- Enterprise/Integrator
    Vality -- IPE
    Postal Soft

    Warehouse Products

    Computer Associates -- CA-Ingres
    Hewlett-Packard -- Allbase/SQL
    Informix -- Informix, Informix XPS
    Microsoft -- SQL Server
    Oracle -- Oracle7, Oracle Parallel Server
    Red Brick -- Red Brick Warehouse
    SAS Institute -- SAS
    Software AG -- ADABAS
    Sybase -- SQL Server, IQ, MPP

    Warehouse Server Products

    Oracle 8
    Informix
        Online Dynamic Server
        XPS -- Extended Parallel Server
        Universal Server for object relational applications
    Sybase
        Adaptive Server 11.5
        Sybase MPP
        Sybase IQ

    Warehouse Server Products

    Red Brick Warehouse
    Tandem Nonstop
    IBM
        DB2 MVS
        Universal Server
        DB2 400
    Teradata

    Other Warehouse Related Products

    Connectivity to sources
        Apertus
        Information Builders EDA/SQL
        Platinum Infohub
        SAS Connect
        IBM Data Joiner
        Oracle Open Connect
        Informix Express Gateway

    Other Warehouse Related Products

    Query/reporting environments
        Brio/Query
        Cognos Impromptu
        Informix Viewpoint
        CA Visual Express
        Business Objects
        Platinum Forest and Trees

    4GL's, GUI Builders, and PC Databases

    Information Builders -- Focus
    Lotus -- Approach
    Microsoft -- Access, Visual Basic
    MITI -- SQR/Workbench
    PowerSoft -- PowerBuilder
    SAS Institute -- SAS/AF

    Data Mining Products

    DataMind -- neurOagent
    Information Discovery -- IDIS
    SAS Institute -- SAS/Neuronets

    Data Warehouse

    W.H. Inmon, Building the Data Warehouse, Second Edition, John Wiley and Sons, 1996
    W.H. Inmon, J.D. Welch, Katherine L. Glassey, Managing the Data Warehouse, John Wiley and Sons, 1997
    Barry Devlin, Data Warehouse from Architecture to Implementation, Addison Wesley Longman, Inc., 1997

    Data Warehouse

    W.H. Inmon, John A. Zachman, Jonathan G. Geiger, Data Stores, Data Warehousing and the Zachman Framework, McGraw Hill Series on Data Warehousing and Data Management, 1997
    Ralph Kimball, The Data Warehouse Toolkit, John Wiley and Sons, 1996

    OLAP and DSS

    Erik Thomsen, OLAP Solutions, John Wiley and Sons, 1997
    Microsoft TechEd Transparencies from Microsoft TechEd 98
    Essbase Product Literature
    Oracle Express Product Literature
    Microsoft Plato Web Site
    Microstrategy Web Site

    Data Mining

    Michael J.A. Berry and Gordon Linoff, Data Mining Techniques, John Wiley and Sons, 1997
    Peter Adriaans and Dolf Zantinge, Data Mining, Addison Wesley Longman Ltd., 1996
    KDD Conferences

    Other Tutorials

    Donovan Schneider, Data Warehousing Tutorial, Tutorial at International Conference for Management of Data (SIGMOD 1996) and International Conference on Very Large Data Bases 97
    Umeshwar Dayal and Surajit Chaudhuri, Data Warehousing Tutorial at International Conference on Very Large Data Bases 1996
    Anand Deshpande and S. Seshadri, Tutorial on Datawarehousing and Data Mining, CSI-97

    Useful URLs

    Ralph Kimball's home page
        http://www.rkimball.com
    Larry Greenfield's Data Warehouse Information Center
        http://pwp.starnetinc.com/larryg/
    Data Warehousing Institute
        http://www.dw-institute.com/
    OLAP Council
        http://www.olapcouncil.com/

    Data Mining Motivation

    Changes in the Business Environment
      Customers becoming more demanding
      Markets are saturated

    Databases today are huge:
      More than 1,000,000 entities/records/rows
      From 10 to 10,000 fields/attributes/variables
      Gigabytes and terabytes

    Databases are growing at an unprecedented rate

    Decisions must be made rapidly

    Decisions must be made with maximum knowledge

    Data Mining Applications: Retail

    Performing basket analysis

    Which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.

    Sales forecasting

    Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?

    Database marketing

    Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales. This information can be used to focus cost-effective promotions.

    Merchandise planning and allocation

    When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics. Retailers can also use data mining to determine the ideal layout for a specific store.
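
    As a rough illustration of the basket analysis idea above, the sketch below counts how often pairs of items appear in the same basket and computes the share of baskets containing a given pair (its support). The function names `pair_counts` and `support` and the sample baskets are hypothetical, invented for this example; real tools use more elaborate algorithms.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # set() drops repeated items; sorted() gives every pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def support(pair, counts, n_baskets):
    """Fraction of baskets that contain both items of the pair."""
    return counts[tuple(sorted(pair))] / n_baskets


# Hypothetical sample data, for illustration only
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

counts = pair_counts(baskets)
for pair, n in counts.most_common():
    print(pair, n, support(pair, counts, len(baskets)))
```

    Pairs with high support are candidates for joint promotions or adjacent shelf placement, which is the kind of stocking and layout decision the slide refers to.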

    Data Mining Applications: Banking

    Card marketing

    By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.

    Cardholder pricing and profitability

    Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.

    Fraud detection

    Fraud is enormously costly. By analyzing past transactions that were later determined to be