Data Warehousing. Definition Data Warehouse: –A subject-oriented, integrated, time-variant, non-...

Data Warehousing

Definition

• Data Warehouse:

– A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes

– Subject-oriented: e.g. customers, patients, students, products

– Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources

– Time-variant: Contains a time dimension so that it may be used to study trends and changes

– Nonupdatable: Read-only, periodically refreshed

• Data Mart:

– A data warehouse that is limited in scope

Need for Data Warehousing

• Integrated, company-wide view of high-quality information (from disparate databases)

• Separation of operational and informational (decision support) systems and data (for improved performance)

Data Warehouse Architectures

• Generic Two-Level Architecture

• Independent Data Mart

All involve some form of extraction, transformation and loading (ETL)

Figure 11-2: Generic two-level data warehousing architecture

E, T, L (figure labels for the extract, transform, and load stages)

One, company-wide warehouse

Periodic extraction: data is not completely current in the warehouse

Figure 11-3: Independent data mart data warehousing architecture

Data marts: mini-warehouses, limited in scope

E, T, L (figure labels for the extract, transform, and load stages)

Separate ETL for each independent data mart

Data access complexity due to multiple data marts

The ETL Process

• Capture/Extract

• Scrub or data cleansing

• Transform:

– Convert data from the format of the source to the format of the data warehouse.

• Load and Index

ETL = Extract, transform, and load
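
The four steps above can be sketched end to end. This is only an illustration, assuming a small CSV extract as the source and an in-memory SQLite database standing in for the warehouse; the sample rows, the column names, and the customer_dim table are hypothetical and not taken from the slides.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; names and values are invented for illustration.
SOURCE_CSV = """cust_id,name,city,signup
1, Alice ,boston,01/15/2020
2,Bob,DALLAS,03/02/2021
3,,Boston,07/30/2021
"""

def extract(text):
    """Capture/Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def scrub(rows):
    """Scrub: trim stray whitespace and drop rows missing a required field."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if row["name"]:
            cleaned.append(row)
    return cleaned

def transform(rows):
    """Transform: convert source formats to the warehouse format."""
    out = []
    for row in rows:
        month, day, year = row["signup"].split("/")
        out.append((int(row["cust_id"]), row["name"],
                    row["city"].title(), f"{year}-{month}-{day}"))
    return out

def load_and_index(conn, rows):
    """Load and Index: write the rows to the warehouse, then build an index."""
    conn.execute("CREATE TABLE customer_dim ("
                 "cust_id INTEGER PRIMARY KEY, name TEXT, "
                 "city TEXT, signup_date TEXT)")
    conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_customer_city ON customer_dim (city)")
    conn.commit()

conn = sqlite3.connect(":memory:")
load_and_index(conn, transform(scrub(extract(SOURCE_CSV))))
loaded = conn.execute("SELECT * FROM customer_dim ORDER BY cust_id").fetchall()
print(loaded)
```

Keeping each stage as its own function mirrors the slide's bullet list and makes it easy to swap in a different source or cleansing rule.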

Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to data warehouse
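
The difference between the two load modes can be sketched with plain dictionaries standing in for the source and target tables. The function names and sample rows are hypothetical, and the sketch assumes changes are inserts or modifications only (deletions are ignored).

```python
def refresh_load(source):
    """Refresh mode: rebuild the whole target from a full source snapshot."""
    return dict(source)

def update_load(target, changes):
    """Update mode: write only the changed source rows into the target."""
    for key, row in changes.items():
        target[key] = row
    return target

# Initial bulk load of the warehouse table.
source_v1 = {1: "Alice/Boston", 2: "Bob/Dallas"}
warehouse = refresh_load(source_v1)

# Later, only two rows changed in the source: one modified, one new.
changes = {2: "Bob/Austin", 3: "Cara/Miami"}
warehouse = update_load(warehouse, changes)

# A full refresh from the new snapshot would produce the same target.
source_v2 = {1: "Alice/Boston", 2: "Bob/Austin", 3: "Cara/Miami"}
print(warehouse == refresh_load(source_v2))
```

Refresh mode rewrites every row each time, while update mode touches only the rows in the change set, which is why it needs a way to detect changes in the source.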

Figure 11-10: Steps in data reconciliation (cont.)

Index

• Bitmap index

• Join index

Figure 6-8: Bitmap index organization

Bitmap saves on space requirements

Rows - possible values of the attribute

Columns - table rows

Bit indicates whether the attribute of a row has the value
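
A minimal sketch of this organization, assuming each bitmap is stored as a Python integer in which bit i stands for table row i. The column values are made up, and a real DBMS would use compressed bitmaps rather than plain integers.

```python
def build_bitmap_index(values):
    """Map each distinct attribute value to a bitmap over the table rows."""
    index = {}
    for row_number, value in enumerate(values):
        # Set bit `row_number` in the bitmap belonging to this value.
        index[value] = index.get(value, 0) | (1 << row_number)
    return index

def rows_from_bitmap(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical table columns, one entry per table row.
ratings = ["A", "B", "A", "C", "B", "A"]
cities = ["Boston", "Boston", "Dallas", "Dallas", "Boston", "Boston"]

rating_index = build_bitmap_index(ratings)
city_index = build_bitmap_index(cities)

# Rows with RATING = 'A' AND CITY = 'Boston': a single bitwise AND.
matches = rows_from_bitmap(rating_index["A"] & city_index["Boston"])
print(matches)
```

Combining conditions with bitwise AND or OR is what makes bitmap indexes attractive for low-cardinality columns in a warehouse.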

Figure 6-9: Join indexes speed up join operations
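
One way to picture a join index is as a precomputed list of row-number pairs whose join keys match, so the join can be answered later without comparing keys again. The sketch below assumes two small in-memory tables; the table contents and the function name are hypothetical.

```python
# Hypothetical tables stored as lists of tuples.
stores = [(10, "Boston"), (20, "Dallas")]      # (store_id, city)
sales = [(1, 10, 5), (2, 20, 3), (3, 10, 7)]   # (sale_id, store_id, units)

def build_join_index(left, right, left_key, right_key):
    """Precompute (left_row, right_row) pairs whose join keys are equal."""
    positions = {}
    for j, row in enumerate(right):
        positions.setdefault(row[right_key], []).append(j)
    return [(i, j)
            for i, row in enumerate(left)
            for j in positions.get(row[left_key], [])]

# Built once, ahead of time (for example during the load step).
join_index = build_join_index(stores, sales, 0, 1)

# At query time the join is just a lookup through the stored pairs.
joined = [stores[i] + sales[j] for i, j in join_index]
print(joined)
```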

Star Schema for Data Warehouse

• Objectives

– Ease of use for decision support applications

– Fast response to predefined user queries

– Customized data for particular target audiences

Also called “dimensional model”

• Dimension:

– A dimension is a term used to describe any category used in analyzing data, such as time, geography, and product line.

Figure 11-13: Components of a star schema

Fact tables contain factual or quantitative data

Dimension tables contain descriptions about the subjects of the business

1:N relationship between dimension tables and fact tables

Excellent for ad-hoc queries, but bad for online transaction processing

Dimension tables are denormalized to maximize performance

Figure 11-14 Star schema example

Fact table provides statistics for sales broken down by product, period and store dimensions
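
A star schema of this shape can be sketched in SQLite, which ships with Python. The table names, column names, and sample rows below are hypothetical and are not the contents of the figure; the point is the 1:N link from each dimension table to the fact table and the typical join-then-aggregate query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptions of the business subjects.
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);

-- Fact table: one row of measures per product, period, and store.
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    period_key   INTEGER REFERENCES period(period_key),
    store_key    INTEGER REFERENCES store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL
);

INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO period  VALUES (1, 2023, 1), (2, 2023, 2);
INSERT INTO store   VALUES (1, 'Boston'), (2, 'Dallas');
INSERT INTO sales VALUES
    (1, 1, 1, 10, 100.0),
    (1, 1, 2,  5,  50.0),
    (2, 1, 1,  2,  80.0),
    (1, 2, 1,  7,  70.0);
""")

# Ad-hoc query: units sold broken down by product and store city.
rows = conn.execute("""
    SELECT p.description, s.city, SUM(f.units_sold)
    FROM sales f
    JOIN product p ON p.product_key = f.product_key
    JOIN store   s ON s.store_key   = f.store_key
    GROUP BY p.description, s.city
    ORDER BY p.description, s.city
""").fetchall()
print(rows)
```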

Figure 11-15 Star schema with sample data

On-Line Analytical Processing (OLAP) Tools

• The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques

• Relational OLAP (ROLAP)

– Traditional relational representation

• Multidimensional OLAP (MOLAP)

– Cube structure

• OLAP Operations

– Cube slicing: come up with 2-D view of data

– Drill-down: going from summary to more detailed views
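
The two OLAP operations above can be sketched on a tiny cube held in a dictionary keyed by (product, store, month). The data and function names are hypothetical; real OLAP tools perform the same steps against a MOLAP cube or ROLAP tables.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, store, month) -> units sold.
cube = {
    ("Widget", "Boston", "2023-01"): 10,
    ("Widget", "Boston", "2023-02"): 7,
    ("Widget", "Dallas", "2023-01"): 5,
    ("Gadget", "Boston", "2023-01"): 2,
}

def slice_cube(cube, product):
    """Slicing: fix one dimension to get a 2-D (store x month) view."""
    return {(store, month): units
            for (prod, store, month), units in cube.items()
            if prod == product}

def summarize_by_year(cube):
    """Summary view: total units per product and year."""
    totals = defaultdict(int)
    for (prod, _store, month), units in cube.items():
        totals[(prod, month[:4])] += units
    return dict(totals)

def drill_down(cube, product, year):
    """Drill-down: expand one summary cell into its monthly detail."""
    totals = defaultdict(int)
    for (prod, _store, month), units in cube.items():
        if prod == product and month.startswith(year):
            totals[month] += units
    return dict(totals)

widget_slice = slice_cube(cube, "Widget")
summary = summarize_by_year(cube)
detail = drill_down(cube, "Widget", "2023")
print(widget_slice, summary, detail)
```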

Figure 11-23 Slicing a data cube

Figure 11-24 Example of drill-down

Summary report

Drill-down with color added

Starting with summary data, users can obtain details for particular cells

Data Mining and Visualization

• Knowledge discovery using a blend of statistical, AI, and computer graphics techniques

• Goals:

– Explain observed events or conditions

– Confirm hypotheses

– Explore data for new or unexpected relationships

• Techniques

– Statistical regression

– Decision tree induction

– Clustering and signal processing

– Affinity

– Sequence association

– Case-based reasoning

– Rule discovery

– Neural nets

– Fractals

• Data visualization: representing data in graphical/multimedia formats for analysis

Pivot Table

• Excel:

– Drill Down, Roll Up

• Access CrossTab query

SQL GROUPING SETS

• GROUPING SETS

– SELECT CITY,RATING,COUNT(CID) FROM HCUSTOMERS

– GROUP BY GROUPING SETS(CITY,RATING,(CITY,RATING),())

– ORDER BY CITY;

• Note: () indicates that an overall total is desired.
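
To see which rows the four grouping sets produce, the query can be emulated as a UNION ALL of four ordinary GROUP BY queries. The sketch runs in SQLite only because it ships with Python; SQLite does not implement GROUPING SETS, so the emulation is a stand-in rather than the statement from the slide. The HCUSTOMERS rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HCUSTOMERS (CID INTEGER, CITY TEXT, RATING TEXT)")
# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO HCUSTOMERS VALUES (?, ?, ?)",
    [(1, "Boston", "A"), (2, "Boston", "B"), (3, "Dallas", "A"), (4, "Boston", "A")],
)

# One SELECT per grouping set; columns that are not grouped become NULL.
EMULATED = """
SELECT CITY, NULL AS RATING, COUNT(CID) FROM HCUSTOMERS GROUP BY CITY
UNION ALL
SELECT NULL, RATING, COUNT(CID) FROM HCUSTOMERS GROUP BY RATING
UNION ALL
SELECT CITY, RATING, COUNT(CID) FROM HCUSTOMERS GROUP BY CITY, RATING
UNION ALL
SELECT NULL, NULL, COUNT(CID) FROM HCUSTOMERS
"""
rows = set(conn.execute(EMULATED).fetchall())
for row in sorted(rows, key=repr):
    print(row)
```

The last branch, with no GROUP BY at all, corresponds to the empty grouping set () and yields the overall total.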

SQL CUBE

• Perform aggregations for all possible combinations of columns indicated.

– SELECT CITY,RATING,COUNT(CID) FROM HCUSTOMERS

– GROUP BY CUBE(CITY,RATING)

– ORDER BY CITY, RATING;
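
"All possible combinations" means every subset of the listed columns, including the empty set for the grand total. The sketch below computes the same counts in plain Python rather than in a DBMS; the sample rows and helper names are hypothetical, and None plays the role of SQL NULL in the subtotal rows.

```python
from collections import Counter
from itertools import combinations

# Hypothetical HCUSTOMERS rows: (CID, CITY, RATING).
CUSTOMERS = [(1, "Boston", "A"), (2, "Boston", "B"),
             (3, "Dallas", "A"), (4, "Boston", "A")]
COLUMNS = {"CITY": 1, "RATING": 2}  # column name -> position in the tuple

def cube_grouping_sets(columns):
    """CUBE(c1, ..., cn) expands to every subset of the columns."""
    cols = list(columns)
    return [combo
            for size in range(len(cols), -1, -1)
            for combo in combinations(cols, size)]

def count_by_grouping_sets(rows, grouping_sets):
    """Count rows once per grouping set; ungrouped columns become None."""
    counts = Counter()
    for grouping_set in grouping_sets:
        for row in rows:
            key = tuple(row[pos] if name in grouping_set else None
                        for name, pos in COLUMNS.items())
            counts[key] += 1
    return dict(counts)

sets_for_cube = cube_grouping_sets(["CITY", "RATING"])
cube_counts = count_by_grouping_sets(CUSTOMERS, sets_for_cube)
print(sets_for_cube)
print(cube_counts)
```

With n columns CUBE produces 2^n grouping sets, so it grows quickly as columns are added.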

SQL ROLLUP

• The ROLLUP extension causes cumulative subtotals to be calculated for the columns indicated. If multiple columns are indicated, subtotals are produced for each of the columns except the far-right column, followed by a grand total. For ROLLUP(CITY,RATING) this gives a count for each (CITY,RATING) pair, a subtotal for each CITY, and an overall total.

– SELECT CITY,RATING,COUNT(CID) FROM HCUSTOMERS

– GROUP BY ROLLUP(CITY,RATING)

– ORDER BY CITY, RATING;
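
ROLLUP can be read as shorthand for the grouping sets formed by dropping columns from the right, one at a time. The sketch below computes the resulting counts in plain Python instead of a DBMS; the sample rows and helper names are hypothetical, and None stands in for SQL NULL in subtotal rows.

```python
from collections import Counter

# Hypothetical HCUSTOMERS rows: (CID, CITY, RATING).
CUSTOMERS = [(1, "Boston", "A"), (2, "Boston", "B"),
             (3, "Dallas", "A"), (4, "Boston", "A")]
COLUMNS = {"CITY": 1, "RATING": 2}  # column name -> position in the tuple

def rollup_grouping_sets(columns):
    """ROLLUP(c1, ..., cn) expands to the prefixes (c1..cn), ..., (c1), ()."""
    cols = list(columns)
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]

def count_by_grouping_sets(rows, grouping_sets):
    """Count rows once per grouping set; ungrouped columns become None."""
    counts = Counter()
    for grouping_set in grouping_sets:
        for row in rows:
            key = tuple(row[pos] if name in grouping_set else None
                        for name, pos in COLUMNS.items())
            counts[key] += 1
    return dict(counts)

sets_for_rollup = rollup_grouping_sets(["CITY", "RATING"])
rollup_counts = count_by_grouping_sets(CUSTOMERS, sets_for_rollup)
print(sets_for_rollup)
print(rollup_counts)
```

Unlike CUBE, there is no grouping on RATING alone, which is why column order matters in a ROLLUP.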