Aggregation


Page 1: Aggregation

Aggregation, Historical Information, Query Facility, OLAP Functions and Tools, OLAP Servers, ROLAP, MOLAP, HOLAP, Data Mining Interface, Security, Backup and Recovery, Tuning the Data Warehouse, Testing the Data Warehouse.

Aggregation

Aggregations are precalculated summaries derived from the most granular fact table, built so that queries can run against the summarized part rather than the whole set of data. These summaries form a set of separate aggregate fact tables, and each aggregate fact table can be created as a specific summarization across any number of dimensions. The warehouse manager is responsible for creating aggregations.

Most aggregations can be created in a single complex query, and the time saved by not having to run that query every day is significant. An example would be an aggregation that keeps track of all of the customers who have bought athletic shoes in the past year. This would allow you to run queries about which people have bought Nike shoes or which people have bought shoes over $100. Having the aggregation saves you the time of waiting for a query to search through every customer, every time, to see whether they bought athletic shoes and then to check what the price or brand was. Having preaggregated data improves performance and allows users to spot trends that might otherwise have gone unnoticed. [1]
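The shoe example above can be sketched with SQLite. All table and column names here (sales, athletic_buyers, and so on) and the sample rows are hypothetical, invented only to illustrate the idea of building the aggregate once and pointing later queries at it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (customer TEXT, category TEXT, brand TEXT, price REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("alice", "athletic", "Nike",   120.0),
    ("bob",   "athletic", "Adidas",  80.0),
    ("carol", "dress",    "Clarks",  95.0),
    ("alice", "athletic", "Nike",    60.0),
])

# Build the aggregate once (the warehouse manager's job); later queries
# scan this small table instead of the full sales history.
cur.execute("""CREATE TABLE athletic_buyers AS
               SELECT customer, brand, price FROM sales
               WHERE category = 'athletic'""")

# Brand and price questions now run against the aggregate only.
nike_buyers = {r[0] for r in cur.execute(
    "SELECT DISTINCT customer FROM athletic_buyers WHERE brand = 'Nike'")}
over_100 = {r[0] for r in cur.execute(
    "SELECT DISTINCT customer FROM athletic_buyers WHERE price > 100")}
print(nike_buyers)  # {'alice'}
print(over_100)     # {'alice'}
```

On a real warehouse the sales table would hold millions of rows, so the payoff of the one-time CREATE TABLE ... AS SELECT is correspondingly larger.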

Aggregations also aid in the process of creating summary tables, which are used to speed up queries by storing aggregated values in columns. If a department store wants to keep track of weekly sales, there would be an aggregation of total sales for each product at each store location. The summary table might consist of a product ID, a store ID, the total revenue for the product for that week, and the quantity sold. Using the aggregation to quickly obtain the sales figures saves time and makes updating the summary table easier. The summary tables need to carry a date because, as access to a table diminishes, the table will be deleted to save space. Summary tables are always changing along with the needs of the users, so it is important to define the aggregations according to what summary tables might be of use.
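The weekly department-store summary described above can be sketched the same way. The fact-table layout, the hard-coded week label, and the sample rows are assumptions for illustration, not a prescribed schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE sales_fact
               (product_id INT, store_id INT, sale_date TEXT, revenue REAL, qty INT)""")
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 10, "2000-12-01", 50.0, 2),
    (1, 10, "2000-12-02", 25.0, 1),
    (2, 10, "2000-12-01", 40.0, 4),
    (1, 20, "2000-12-03", 75.0, 3),
])

# One summary row per product per store for the week. The week_of date
# is what lets old, rarely accessed summary rows be found and deleted.
cur.execute("""CREATE TABLE weekly_summary AS
               SELECT product_id, store_id, '2000-W49' AS week_of,
                      SUM(revenue) AS total_revenue, SUM(qty) AS total_qty
               FROM sales_fact
               GROUP BY product_id, store_id""")

rows = list(cur.execute(
    """SELECT product_id, store_id, total_revenue, total_qty
       FROM weekly_summary ORDER BY product_id, store_id"""))
print(rows)  # [(1, 10, 75.0, 3), (1, 20, 75.0, 3), (2, 10, 40.0, 4)]
```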

Aggregates

Aggregates are precalculated summaries derived from the most granular fact table. These summaries form a set of separate aggregate fact tables. You may create each aggregate fact table as a specific summarization across any number of dimensions. Let us begin by examining a sample STAR schema. Choose a simple STAR schema with the fact table at the lowest possible level of granularity. Assume there are four dimension tables surrounding this most granular fact table. Figure 11-11 shows the example we want to examine.

When you run a query in an operational system, it produces a result set about a single customer, a single order, a single invoice, a single product, and so on. But, as you know, the queries in a data warehouse environment produce large result sets. These queries retrieve hundreds and thousands of table rows, manipulate the metrics in the fact tables, and then produce the result sets. The manipulation of the fact table metrics may be a simple addition, an addition with some adjustments, a calculation of averages, or even an application of complex arithmetic algorithms.

Let us review a few typical queries against the sample STAR schema shown in Figure 11-11.

Query 1: Total sales for customer number 12345678 during the first week of December 2000 for product Widget-1.

Query 2: Total sales for customer number 12345678 during the first three months of 2000 for product Widget-1.

Query 3: Total sales for all customers in the South-Central territory for the first two quarters of 2000 for product category Bigtools.

Scrutinize these queries and determine how the totals will be calculated in each case. The totals will be calculated by adding the sales quantities and sales dollars from the qualifying rows of the fact table. In each case, let us review the qualifying rows that contribute to the total in the result set.

Query 1: All fact table rows where the customer key relates to customer number 12345678, the product key relates to product Widget-1, and the time key relates to the seven days in the first week of December 2000. Assuming that a customer may make at most one purchase of a single product in a single day, at most 7 fact table rows participate in the summation.

Query 2: All fact table rows where the customer key relates to customer number 12345678, the product key relates to product Widget-1, and the time key relates to the roughly 90 days of the first quarter of 2000. Under the same assumption, only about 90 or fewer fact table rows participate in the summation.

Query 3: All fact table rows where the customer key relates to all customers in the South-Central territory, the product key relates to all products in the product category Bigtools, and the time key relates to about 180 days in the first two quarters of 2000. In this case, clearly a large number of fact table rows participate in the summation.

Obviously, Query 3 will run long because of the large number of fact table rows to be retrieved. What can be done to reduce the query time? This is where aggregate tables can be helpful. Before we discuss aggregate fact tables in detail, let us review the sizes of some typical fact tables in real-world data warehouses.
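The effect on a Query-3-style request can be sketched with SQLite. The data set (20 customers of one territory, 5 products of one category, 180 days) and the quarter arithmetic are invented for illustration; the point is only that the same total comes back from 10 pre-summarized rows instead of 18,000 granular ones:

```python
import sqlite3
from itertools import product

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE sales_fact
               (customer_key INT, product_key INT, day TEXT, sales REAL)""")

# 20 customers x 5 products x 180 days: 18,000 granular rows,
# each carrying 1.0 in sales so the totals are easy to check.
rows = [(c, p, f"day-{d:03d}", 1.0)
        for c, p, d in product(range(20), range(5), range(180))]
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)

# Pre-summarize once: one row per product per quarter (90-day blocks).
cur.execute("""CREATE TABLE quarterly_product_agg AS
               SELECT product_key,
                      CAST(substr(day, 5) AS INT) / 90 AS quarter,
                      SUM(sales) AS sales
               FROM sales_fact
               GROUP BY product_key, quarter""")

# The granular query must touch all 18,000 fact rows ...
(total_fact,) = cur.execute("SELECT SUM(sales) FROM sales_fact").fetchone()
# ... while the same total comes from 5 products x 2 quarters = 10 rows.
(total_agg,) = cur.execute("SELECT SUM(sales) FROM quarterly_product_agg").fetchone()
(agg_rows,) = cur.execute("SELECT COUNT(*) FROM quarterly_product_agg").fetchone()
print(total_fact, total_agg, agg_rows)  # 18000.0 18000.0 10
```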

Summaries and Aggregates

Data warehouse customers always have a common complaint: performance. Data warehouses always have a common problem: performance. Database tuning, SQL tuning, indexing, and optimizer improvements all increase the performance of a data warehouse. Two methods, though, are applied in almost every data warehouse: summaries and aggregates.

A Summary is a table that stores the results of a SQL arithmetic SUM statement that has been applied to a Fact table. The arithmetic portion of a Fact table is summed, while simultaneously one or more hierarchical levels of detail are removed from the data in the Fact table. For example:

Intraday Fact data is summed at the Day level. The resulting data is stored in a Daily Summary table. For that data, the lowest grain is the Day.

Store Fact data is summed at the Region level. The resulting data is stored in a Region Summary table. For that data, the lowest grain is the Region.

The intention of a Summary table is to perform the summation of arithmetic Fact data only once, rather than many times. By incurring the resource consumption necessary to summarize a Fact table, data warehouse customers will receive the previously summarized data they want quickly.

An Aggregate is a table that stores the results of SQL JOIN statements, which have been applied to a set of Dimension tables. The hierarchies and attributes above an entity are prejoined and stored in a table. For example:

The Product entity, its levels of hierarchy, and its management area are prejoined into a single table that stores the result set. The grain of this result set is the Product.

The Facility entity and its levels of geographic and management hierarchy are prejoined into a single table that stores the result set. The grain of this result set is the Facility.

The intention of an Aggregate table is to perform the joins of large sets of Dimension data only once. By incurring the resource consumption necessary to join a series of Dimension tables, data warehouse customers will receive data that uses those levels of hierarchy quickly.

An Aggregate is not a pure Dimension table as it would appear in a Dimensional Data Model. An Aggregate is a physical table that holds the result set of join statements, which are commonly used by data warehouse customers and are high system resource consumers. The point of an Aggregate is to incur the high system resource consumption once, during off-peak hours, to avoid multiple consumptions of system resources during peak hours. That being the case, an Aggregate table can denormalize along multiple hierarchies. The intersection of those multiple hierarchies is the grain of an Aggregate table. The hierarchical intersection and the lowest level of granular detail must be the same, because they are the grain of an Aggregate table.
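The Summary/Aggregate distinction can be sketched side by side. All schemas and sample rows below are hypothetical: the first table sums away the intraday level (a Summary); the second prejoins a product hierarchy while keeping the Product grain (an Aggregate):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# --- Summary: SUM removes a hierarchical level (intraday -> Day). ---
cur.execute("CREATE TABLE intraday_fact (day TEXT, hour INT, amount REAL)")
cur.executemany("INSERT INTO intraday_fact VALUES (?, ?, ?)", [
    ("2000-12-01", 9, 10.0), ("2000-12-01", 14, 5.0), ("2000-12-02", 11, 7.5),
])
cur.execute("""CREATE TABLE daily_summary AS
               SELECT day, SUM(amount) AS amount
               FROM intraday_fact GROUP BY day""")
daily = list(cur.execute("SELECT * FROM daily_summary ORDER BY day"))
print(daily)  # [('2000-12-01', 15.0), ('2000-12-02', 7.5)]

# --- Aggregate: dimension hierarchy prejoined; the grain stays the Product. ---
cur.execute("CREATE TABLE product (product_id INT, name TEXT, category_id INT)")
cur.execute("CREATE TABLE category (category_id INT, category TEXT, dept TEXT)")
cur.execute("INSERT INTO product VALUES (1, 'Widget-1', 100)")
cur.execute("INSERT INTO category VALUES (100, 'Bigtools', 'Hardware')")
cur.execute("""CREATE TABLE product_agg AS
               SELECT p.product_id, p.name, c.category, c.dept
               FROM product p JOIN category c ON p.category_id = c.category_id""")
agg = list(cur.execute("SELECT * FROM product_agg"))
print(agg)  # [(1, 'Widget-1', 'Bigtools', 'Hardware')]
```

In both cases the expensive work (the SUM, the JOIN) is paid once, off-peak, so peak-hour queries read a prepared table.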

On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

OLAP functionality is characterized by dynamic multi-dimensional analysis of consolidated enterprise data supporting end user analytical and navigational activities, including:

- calculations and modeling applied across dimensions, through hierarchies and/or across members
- trend analysis over sequential time periods
- slicing subsets for on-screen viewing
- drill-down to deeper levels of consolidation
- reach-through to underlying detail data
- rotation to new dimensional comparisons in the viewing area
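Two of the activities above, slicing and consolidation (drill-up), can be sketched on a toy in-memory cube. The dimension names, cell values, and helper functions are all invented for illustration, and the roll-up helper assumes dimensions are removed right-to-left:

```python
# A tiny "cube": cells keyed by (quarter, region, product).
cube = {
    ("2000-Q1", "North", "Widget-1"): 100.0,
    ("2000-Q1", "South", "Widget-1"):  80.0,
    ("2000-Q2", "North", "Widget-1"): 120.0,
    ("2000-Q2", "North", "Widget-2"):  60.0,
}

DIMS = ("quarter", "region", "product")

def slice_cube(cube, dim, member):
    """Slicing: fix one dimension member, keeping the remaining cells."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == member}

def rollup(cube, dim):
    """Drill-up: consolidate away one dimension by summing over it."""
    i = DIMS.index(dim)
    out = {}
    for k, v in cube.items():
        key = k[:i] + k[i + 1:]
        out[key] = out.get(key, 0.0) + v
    return out

# Slice: the North region subset, then its total.
north = slice_cube(cube, "region", "North")
print(sum(north.values()))  # 280.0

# Consolidate product, then region, leaving totals per quarter.
by_quarter = rollup(rollup(cube, "product"), "region")
print(by_quarter)  # {('2000-Q1',): 180.0, ('2000-Q2',): 180.0}
```

A real OLAP server performs these same operations, but over indexed multi-dimensional structures rather than a Python dict.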

OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity. OLAP helps the user synthesize enterprise information through comparative, personalized viewing, as well as through analysis of historical and projected data in various "what-if" data model scenarios. This is achieved through use of an OLAP Server.

OLAP allows business users to slice and dice data at will. Normally, data in an organization is distributed across multiple data sources that are incompatible with each other. A retail example: point-of-sale data and sales made via a call center or the Web are stored in different locations and formats. It would be a time-consuming process for an executive to obtain OLAP reports such as: What are the most popular products purchased by customers between the ages of 15 and 30?

Part of the OLAP implementation process involves extracting data from the various data repositories and making it compatible. Making data compatible involves ensuring that the meaning of the data in one repository matches that in all the other repositories. An example of incompatible data: customer ages can be stored as a birth date for purchases made over the Web, but as age categories (e.g., between 15 and 30) for in-store sales.

It is not always necessary to create a data warehouse for OLAP analysis. Data stored by operational systems, such as point-of-sale, resides in a type of database called an OLTP, for Online Transaction Processing, database. From a structural perspective, OLTP databases are no different from any other databases; the main, and only, difference is the way in which data is stored.

Examples of OLTP systems include ERP, CRM, SCM, point-of-sale, and call center applications.

OLTPs are designed for optimal transaction speed. When consumers make a purchase online, they expect the transaction to occur instantaneously. With a database design, called a data model, optimized for transactions, the record 'Consumer Name, Address, Telephone, Order Number, Order Name, Price, Payment Method' is created quickly in the database, and the results can be recalled by managers equally quickly if needed.

OLAP SERVER

An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures. A multi-dimensional structure is arranged so that every data item is located and accessed based on the intersection of the dimension members which define that item. The design of the server and the structure of the data are optimized for rapid ad-hoc information retrieval in any orientation, as well as for fast, flexible calculation and transformation of raw data based on formulaic relationships. The OLAP server may either physically stage the processed multi-dimensional information to deliver consistent and rapid response times to end users, or it may populate its data structures in real time from relational or other databases, or offer a choice of both. Given the current state of technology and the end user requirement for consistent and rapid response times, staging the multi-dimensional data in the OLAP server is often the preferred method.