Introduction to Dimensional Analysis Session 2 5/20/2005 M D Metadata Solutions Dan McCreary...

Introduction to Dimensional Analysis Session 2 5/20/2005 M D Metadata Solutions Dan McCreary President Dan McCreary & Associates [email protected] m (952) 931-9198


Introduction to Dimensional Analysis
Session 2

5/20/2005

Metadata Solutions

Dan McCreary
President
Dan McCreary & Associates
[email protected]
(952) 931-9198

2

Agenda

• General introduction to the Data Dictionaries that drive Business Intelligence (BI) concepts and terminology

• Understand why Data Dictionaries are so critical to accurate BI

• Understand how BI looks at the world in different ways

• Understand how data warehouse concepts and data dictionaries affect analysis and research

3

What is a Data Warehouse?

• Fast Retrieval

• Internally Consistent

• Slice and Dice Capability

• Easy to “Browse”

• Complete and Reliable

• Data Quality Controls
  – GI-GO (Garbage-In, Garbage-Out)

Source: Ralph Kimball

4

Factors Driving Business Intelligence

• Computers process and store roughly twice as much data per dollar every 18 months (Moore’s Law)

• People can make better decisions if they have tools to quickly see only the data they are interested in

• People frequently want to analyze data in new ways that were not anticipated by the people who created the "canned reports"

• Tools can be designed to allow non-technical users (non-SQL programmers) to generate their own reports

• People have an incredible ability to categorize things based on their properties and attributes, but if they don't have consistent definitions of these properties they will not generate consistent results

5

The BI Iterative Process

• The BI process is an ongoing, iterative process in which the structure of the data warehouse changes based on what data is critical to an organization's business objectives.

[Figure: an iterative cycle with the stages Access Data Warehouse; Analysis; Insights, Conclusions and Findings; Publishing, Change, Data Gap Analysis, New Data Gathered. The cycle is labeled BI Project Management.]

6

BI Evolution

• Shorten the time-to-report interval

• Allow users to "browse" data sets interactively

• Remove programmers with "backlogs" of reports

• Users frequently waited days, weeks, or months to get a custom report created

[Figure: an arrow labeled "Increasing Responsiveness" between Monthly Green Bar Reports and a Browseable Graphical Interface]

7

Dimensions of BI

[Figure: chart with the labels below]

• Degree of End User Control

• Technical Sophistication Required: Low (analysts) to High (programmers)

• Few Dimensions (few parameters, few filters) to Many Dimensions (many variables)

• Highly Responsive to "What If" Scenarios

8

Overlapping Terminology

[Figure: overlapping regions labeled with the terms below]

• Business Intelligence
• Data Mining
• Transaction Processing (OLTP)
• Dimensional Analysis
• Indexing
• Aggregates
• Statistical Analysis
• Pattern Discovery
• Data Storage (RDBMS)
• Data Warehousing
• Data Dictionaries
• Data Modeling
• Semantics

9

Key Terms Covered in This Class

• Properties
• Dimension
• Aggregation and Levels
• Enumerations of Categorical Data
• Labeling Categories
• Giving precise definitions to Labels
• Dimension Hierarchies and Levels
• Cubes
• Measures
• Filters
• Data Warehouse Presentation

10

Things Have Many "Properties"

People are very good at recognizing and sorting things by their properties.

11

Sorting by an Object's Property

• Sort objects by their color

12

Sorting by A Property

• Sort objects by their shape

13

Sorting by Color AND Shape

Shape “Dimension”

Color “Dimension”

14

Dimensional Analysis

• The science of figuring out intuitive ways that people want to categorize information using independent variables to graphically filter and browse their data

15

Dimension

• List of categories used to partition the information based on a property of the objects

• Dimension Names: Color, Shape

16

Labels

• A name given to a non-overlapping category within a dimension

"red"

"blue"

"green"

Labels

17

Enumeration

• Whenever we break the continuous observable world into a predefined list of categories, where each category has a label, we call this an "enumerated value domain". These will then become the "dimensions" of our cube.

"red" "green" "blue"

Statisticians call this type of data "categorical data", and it requires the categories to be non-overlapping.

Note: NO OVERLAP!
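
A minimal Python sketch of an enumerated value domain may help here. The `Color` enum, the `classify_hue` function, and the hue cut points are illustrative assumptions, not part of the slides; the only point is that half-open intervals give every continuous observation exactly one label.

```python
from enum import Enum


class Color(Enum):
    """Enumerated value domain: a fixed list of labeled categories."""
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


def classify_hue(hue_degrees: float) -> Color:
    """Map a continuous hue angle onto exactly one Color label.

    The cut points (60, 180, 300) are assumed for illustration only.
    Half-open intervals guarantee that the categories never overlap.
    """
    hue = hue_degrees % 360
    if 60 <= hue < 180:
        return Color.GREEN
    if 180 <= hue < 300:
        return Color.BLUE
    return Color.RED  # 300 up to 360 and 0 up to 60 both count as red


labels = [classify_hue(h).value for h in (0, 59.9, 60, 179.9, 180, 350)]
print(labels)
```

A value that sits exactly on a boundary, such as 60, still lands in one category only, which is what "NO OVERLAP" requires.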

18

The Challenge of Semantic Classification

• People are good at sorting based on a property they see

• People are good at assigning names to a property type

• People usually come up with different names for properties

• Some dimensions people easily agree on

• Some are very difficult to classify, and even more difficult to get people to agree on a non-overlapping classification system

"Polygon" "Square"

"Red Circle"

"Green" "Blue"

"Blue-Green"

What happens when a small percentage of data does not quite fit into a discrete category?

19

Level

• A layer of "aggregation" within a single dimension – categorization of properties

Levels:

All Shapes
    Shapes With Curves: Circle, Heart, Moon
    Shapes Without Curves: Square, Trapezoid, Star, Diamond
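
The idea of levels can be sketched in Python as a mapping from each leaf category to its parent category. The grouping of shapes under the two middle categories is a reading of the slide's hierarchy (placing the star under "Shapes Without Curves" is an assumption), and the `observed` list is made-up data.

```python
from collections import Counter

# Parent category for each leaf shape, as read from the hierarchy on this slide.
SHAPE_LEVEL = {
    "circle": "Shapes With Curves",
    "heart": "Shapes With Curves",
    "moon": "Shapes With Curves",
    "square": "Shapes Without Curves",
    "trapezoid": "Shapes Without Curves",
    "star": "Shapes Without Curves",  # assumption: a star has only straight edges
    "diamond": "Shapes Without Curves",
}


def roll_up(shapes):
    """Count the same objects at the leaf, middle, and top levels."""
    leaf = Counter(shapes)
    middle = Counter(SHAPE_LEVEL[s] for s in shapes)
    top = {"All Shapes": len(shapes)}
    return leaf, middle, top


# Hypothetical observations, invented for this example.
observed = ["circle", "star", "moon", "square", "circle"]
leaf, middle, top = roll_up(observed)
print(dict(leaf))
print(dict(middle))
print(top)
```

Each level answers the same question ("how many?") at a coarser grain, and the counts at every level add up to the same total.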

20

Measures (example weight)

[Figure: the sample shapes, each labeled with its weight (values range from 1.1 to 10)]

A measure is any property that you can perform math on (sums, averages).

21

Measures

• Something that you can do math on.

Operations: +, -, ×, /, %, average, sum

22

Sample Object "Fact Table"

id   Color   Shape       DashStyle    Weight
1    blue    heart       solid        5.7
2    blue    star        small-dash   6.6
3    green   moon        large-dash   1.1
4    red     trapezoid   small-dash   3.8
5    green   square      solid        9.3
6    red     diamond     small-dash   8.2
7    green   circle      small-dash   2.6
8    red     circle      large-dash   3.5
9    blue    trapezoid   solid        5.5
10   blue    square      large-dash   10
11   blue    diamond     large-dash   8.4
12   red     heart       large-dash   9.1
13   red     moon        solid        7.4
14   green   star        solid        6.1

Measures tend to have data types of integers and floating point numbers.

Note that categorical data cannot be added together. But we can count the frequencies of items within a category!
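
As a small illustration of the difference between measures and categorical data, the sketch below copies the rows of the sample fact table into Python. The tuple layout and variable names are choices made for this example. Weight can be summed and averaged; Color can only be counted.

```python
from collections import Counter

# Rows copied from the sample fact table: (id, color, shape, dash_style, weight).
FACTS = [
    (1, "blue", "heart", "solid", 5.7),
    (2, "blue", "star", "small-dash", 6.6),
    (3, "green", "moon", "large-dash", 1.1),
    (4, "red", "trapezoid", "small-dash", 3.8),
    (5, "green", "square", "solid", 9.3),
    (6, "red", "diamond", "small-dash", 8.2),
    (7, "green", "circle", "small-dash", 2.6),
    (8, "red", "circle", "large-dash", 3.5),
    (9, "blue", "trapezoid", "solid", 5.5),
    (10, "blue", "square", "large-dash", 10),
    (11, "blue", "diamond", "large-dash", 8.4),
    (12, "red", "heart", "large-dash", 9.1),
    (13, "red", "moon", "solid", 7.4),
    (14, "green", "star", "solid", 6.1),
]

# Weight is a measure: arithmetic on it is meaningful.
weights = [row[4] for row in FACTS]
total_weight = sum(weights)
average_weight = total_weight / len(weights)

# Color is categorical: we cannot add labels, but we can count them.
color_counts = Counter(row[1] for row in FACTS)

print(round(total_weight, 1), round(average_weight, 2))
print(dict(color_counts))
```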

23

Shape Dimension

Shape Code   Has Curves   Definition
circle       Yes          A round shape with no corners.
diamond      No           A shape with four corners and parallel edges, but not horizontal and vertical edges.
square       No           A shape with four corners with horizontal and vertical edges that are parallel.
heart        Yes          A shape that is round on the top and pointy on the bottom.
moon         Yes          A crescent shape.
trapezoid    No           Four corners, four sides, but the sides are not all parallel.

Note that there is no reference to "Has Curves" in the prior table. "Has Curves" is a property of the shape value domain because it can be "inferred" from the shape of the object.

Some categorical definitions use "exclusionary" language.

Note that "Has Curves" also must have a precise definition in the data dictionary.

24

Facts and Dimension

Shape Facts: Color_FK, Shape_FK, Weight

Shape Dim: Shape Name, Has Curves

Color Dim: Color Name

Note that "Has Curves" does not need to be in the central fact table. It is a property of the shape!
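
One way to picture the split between facts and dimensions is the Python sketch below. The dictionary layout and the function name `weight_by_has_curves` are assumptions made for illustration. The fact rows carry only foreign keys and the measure; "Has Curves" is looked up in the shape dimension at query time. The Shape Dimension table in this deck does not list "star", so its `has_curves` value here is an assumption.

```python
from collections import defaultdict

# Shape dimension: one entry per shape. "Has Curves" lives here, not in the facts.
SHAPE_DIM = {
    "circle": {"has_curves": True},
    "heart": {"has_curves": True},
    "moon": {"has_curves": True},
    "diamond": {"has_curves": False},
    "square": {"has_curves": False},
    "trapezoid": {"has_curves": False},
    "star": {"has_curves": False},  # assumption: not defined in the slide's table
}

# Fact rows: (color_fk, shape_fk, weight), copied from the sample fact table.
SHAPE_FACTS = [
    ("blue", "heart", 5.7), ("blue", "star", 6.6), ("green", "moon", 1.1),
    ("red", "trapezoid", 3.8), ("green", "square", 9.3), ("red", "diamond", 8.2),
    ("green", "circle", 2.6), ("red", "circle", 3.5), ("blue", "trapezoid", 5.5),
    ("blue", "square", 10), ("blue", "diamond", 8.4), ("red", "heart", 9.1),
    ("red", "moon", 7.4), ("green", "star", 6.1),
]


def weight_by_has_curves(facts, shape_dim):
    """Sum the weight measure, grouped by a property stored in the dimension."""
    totals = defaultdict(float)
    for _color_fk, shape_fk, weight in facts:
        has_curves = shape_dim[shape_fk]["has_curves"]  # follow the foreign key
        totals[has_curves] += weight
    return dict(totals)


totals = weight_by_has_curves(SHAPE_FACTS, SHAPE_DIM)
print({key: round(value, 1) for key, value in totals.items()})
```

If the definition of "Has Curves" changes, only the dimension is edited; the fact rows stay untouched.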

25

Adding Dimensions

[Figure: the sample shapes, each labeled with its weight]

We have now added a 3rd dimension – "Dash Style"

26

Each New Property is Another Dimension

Shape Facts: Color_FK, Shape_FK, DashStyle_FK, Weight

Shape Dim: Shape Code, Has Curves

Color Dim: Color Code

DashStyle Dim: Shape

27

Filters

[Figure: the sample shapes, each labeled with its weight]

A filter will exclude all objects with a specified property. For example, we can exclude all shapes with a property of "Circle".
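
A filter can be sketched in Python as a function over the rows of the sample fact table. The helper names `exclude` and `keep_only` are hypothetical; the second one shows the "remove everything EXCEPT" form of filter.

```python
COLUMNS = ("id", "color", "shape", "dash_style", "weight")
ROWS = [
    (1, "blue", "heart", "solid", 5.7), (2, "blue", "star", "small-dash", 6.6),
    (3, "green", "moon", "large-dash", 1.1), (4, "red", "trapezoid", "small-dash", 3.8),
    (5, "green", "square", "solid", 9.3), (6, "red", "diamond", "small-dash", 8.2),
    (7, "green", "circle", "small-dash", 2.6), (8, "red", "circle", "large-dash", 3.5),
    (9, "blue", "trapezoid", "solid", 5.5), (10, "blue", "square", "large-dash", 10),
    (11, "blue", "diamond", "large-dash", 8.4), (12, "red", "heart", "large-dash", 9.1),
    (13, "red", "moon", "solid", 7.4), (14, "green", "star", "solid", 6.1),
]
# One dict per fact row, keyed by column name.
FACTS = [dict(zip(COLUMNS, row)) for row in ROWS]


def exclude(rows, column, value):
    """Drop every row whose `column` equals `value`."""
    return [row for row in rows if row[column] != value]


def keep_only(rows, column, value):
    """Drop every row EXCEPT those whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]


without_circles = exclude(FACTS, "shape", "circle")
only_red = keep_only(FACTS, "color", "red")
print(len(FACTS), len(without_circles), len(only_red))
```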

28

Example: Discarding Invalid Scores

This example filter removes all scores EXCEPT the valid scores.

29

Selecting Only Scale Scores

This filter removes all scores EXCEPT the assessment's Scale Score, using the Test Score Type dimension.

30

The Star Schema

[Diagram: star schema]

Facts (center): Primary Key; five Foreign Keys, one per dimension; measures (Measure1)

Dim1, Dim2, Dim3, Dim4, Dim5 (points of the star): each has a PK and the category attributes Cat1, Cat2, Cat3

31

Adding Measures

Shape Facts: Color_FK, Shape_FK, DashStyle_FK, WeightValue, HeightValue, PriceAmount, DensityValue

Shape Dim: ShapeCode

Color Dim: ColorCode

DashStyle Dim: ShapeCode

Measures can easily be added to the fact table without changing any of the dimensions.

Measures are integers or floats that you can perform math on.

32

Cube

• A Cube is a pre-built structure that has facts and many dimensions (not necessarily just three)

• Designed to have averages and sums for most levels "pre-calculated" to make analysis fast

[Figure: the sample shapes and their weights arranged in a cube whose three axes are the Shape Dimension, the Color Dimension, and the Dash-Style Dimension]
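
The "pre-calculated" idea can be sketched in Python as follows. This is a toy illustration under stated assumptions, not how any particular OLAP product builds its cubes: `build_cube`, `average`, and the `ALL` marker are hypothetical names. Every fact row is added to each cell it belongs to, including the roll-up cells, so a later question is answered by a single lookup.

```python
from collections import defaultdict
from itertools import product

ALL = "All"  # marker for "rolled up over this dimension"

# (color, shape, dash_style, weight) rows from the sample fact table.
ROWS = [
    ("blue", "heart", "solid", 5.7), ("blue", "star", "small-dash", 6.6),
    ("green", "moon", "large-dash", 1.1), ("red", "trapezoid", "small-dash", 3.8),
    ("green", "square", "solid", 9.3), ("red", "diamond", "small-dash", 8.2),
    ("green", "circle", "small-dash", 2.6), ("red", "circle", "large-dash", 3.5),
    ("blue", "trapezoid", "solid", 5.5), ("blue", "square", "large-dash", 10),
    ("blue", "diamond", "large-dash", 8.4), ("red", "heart", "large-dash", 9.1),
    ("red", "moon", "solid", 7.4), ("green", "star", "solid", 6.1),
]


def build_cube(rows):
    """Pre-calculate count and sum of weight for every cell, roll-ups included."""
    cube = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for color, shape, dash, weight in rows:
        # Each fact lands in 2**3 = 8 cells: its own value or ALL per dimension.
        for key in product((color, ALL), (shape, ALL), (dash, ALL)):
            cube[key]["count"] += 1
            cube[key]["sum"] += weight
    return cube


def average(cube, color=ALL, shape=ALL, dash=ALL):
    """Answer an average query with one lookup; None if the cell is empty."""
    cell = cube.get((color, shape, dash))
    return cell["sum"] / cell["count"] if cell else None


cube = build_cube(ROWS)
print(cube[(ALL, ALL, ALL)]["count"])
print(average(cube, color="green"))
```

The trade-off is the usual one: building the cube costs time and storage up front so that browsing is fast afterwards.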

33

Build a Mental Model

[Figure: the sample shapes pass through a Filter Funnel (aka "Page Fields") into a Presentation grid with a Horizontal Dimension (columns), a Vertical Dimension (rows), and Measure = count]
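
The mental model of filter, then rows and columns, then a measure can be sketched in Python. The function `pivot_count` and its parameter names are hypothetical; the data is the sample fact table, and the measure is a count.

```python
from collections import Counter

COLUMNS = ("color", "shape", "dash_style", "weight")
ROWS = [
    ("blue", "heart", "solid", 5.7), ("blue", "star", "small-dash", 6.6),
    ("green", "moon", "large-dash", 1.1), ("red", "trapezoid", "small-dash", 3.8),
    ("green", "square", "solid", 9.3), ("red", "diamond", "small-dash", 8.2),
    ("green", "circle", "small-dash", 2.6), ("red", "circle", "large-dash", 3.5),
    ("blue", "trapezoid", "solid", 5.5), ("blue", "square", "large-dash", 10),
    ("blue", "diamond", "large-dash", 8.4), ("red", "heart", "large-dash", 9.1),
    ("red", "moon", "solid", 7.4), ("green", "star", "solid", 6.1),
]
FACTS = [dict(zip(COLUMNS, row)) for row in ROWS]


def pivot_count(rows, row_dim, col_dim, page_filter=None):
    """Filter the rows, then count them in a row_dim by col_dim grid."""
    page_filter = page_filter or {}
    # Step 1: the filter funnel keeps only rows matching every page field.
    kept = [r for r in rows if all(r[k] == v for k, v in page_filter.items())]
    # Step 2: count the surviving rows per (row label, column label) pair.
    counts = Counter((r[row_dim], r[col_dim]) for r in kept)
    row_labels = sorted({r[row_dim] for r in kept})
    col_labels = sorted({r[col_dim] for r in kept})
    # Step 3: lay out the grid; pairs with no rows show a count of 0.
    return {rl: {cl: counts[(rl, cl)] for cl in col_labels} for rl in row_labels}


table = pivot_count(FACTS, "shape", "color", page_filter={"dash_style": "solid"})
for shape, cells in table.items():
    print(shape, cells)
```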

34

Using Cubes in Excel

[Figure: an Excel pivot table layout with drop areas labeled "Filter Dropped Here", "Measures Dropped Here", and "Row and Column Dropped Here"]

35

Count of Year vs. Assessment Name

The Row Dimension is the "Test Name".

The Column Dimension is the "Fiscal Year".

The measure is the count of records in the cube.

There are around 25 million test results.

36

Conformed Dimensions

• When building many cubes, there is a large benefit to "reusing" dimensions

• Commonly reused dimensions
  – Time (Fiscal Year, Quarter)
  – Organization (School, District)
  – Expense Category
  – Student
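
The benefit of reuse can be sketched in Python: when two fact tables share one Organization dimension, their measures can be compared at the same level. Every name and number below is made up for illustration.

```python
# One shared (conformed) Organization dimension, reused by both fact tables.
ORG_DIM = {
    101: {"school": "School A", "district": "District 1"},
    102: {"school": "School B", "district": "District 1"},
    201: {"school": "School C", "district": "District 2"},
}

# Two fact tables that use the same organization keys. All values are invented.
TEST_FACTS = [(101, 2004, 250), (102, 2004, 310), (201, 2004, 180)]  # tests taken
EXPENSE_FACTS = [(101, 2004, 12000.0), (102, 2004, 15500.0), (201, 2004, 9000.0)]


def total_by_district(facts):
    """Roll a measure up from school to district through the shared dimension."""
    totals = {}
    for org_key, _fiscal_year, value in facts:
        district = ORG_DIM[org_key]["district"]
        totals[district] = totals.get(district, 0) + value
    return totals


tests = total_by_district(TEST_FACTS)
spend = total_by_district(EXPENSE_FACTS)
# Because both totals use the same district labels, they can be combined safely.
spend_per_test = {district: spend[district] / tests[district] for district in tests}
print(tests)
print(spend_per_test)
```

If each cube defined "district" in its own way, the final division would mix categories that do not line up.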

37

Each bar represents the sum of all the expenditures in the category (for example, expenditures on girls' athletics for fiscal year 1991).

38

Sample of National Conformed Dimensions

[Matrix: processes (rows) against conformed dimensions (columns)]

Processes: Student Assessment; Student Attendance; School and District Status; Teacher Licensing; School Food and Nutrition; Student Disciplinary Reporting; Student Safety Reporting; District Technology Planning; District Financial Reporting

Conformed dimensions: Date; Organization; Assessment; Financial; Teacher; Claims; School Incident Data; School Technology; Student

39

Role of Data Architecture

• Facilitate how business users want to identify and categorize data

• Assist in the creation and documentation of categorical value domains and measures

• Create machine-readable data dictionaries for use in building data warehouse structures
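
As a rough idea of what "machine-readable" could mean, the Python sketch below stores a tiny data dictionary as JSON and uses it to check records. The JSON layout, the `kind` field, and the `validate` function are assumptions made for this example, not a standard format.

```python
import json

# A hypothetical data dictionary: allowed labels for dimensions, plus one measure.
DATA_DICTIONARY_JSON = """
{
  "Color":  {"kind": "dimension", "values": ["red", "green", "blue"]},
  "Shape":  {"kind": "dimension",
             "values": ["circle", "diamond", "square", "heart",
                        "moon", "trapezoid", "star"]},
  "Weight": {"kind": "measure"}
}
"""


def validate(record, dictionary):
    """Return a list of problems found when checking one record."""
    problems = []
    for name, spec in dictionary.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        if spec["kind"] == "dimension" and value not in spec["values"]:
            problems.append(f"{name}: {value!r} is not an allowed label")
        elif spec["kind"] == "measure":
            # Measures must be numbers; bool is excluded although it subclasses int.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number:
                problems.append(f"{name}: {value!r} is not a number")
    return problems


dictionary = json.loads(DATA_DICTIONARY_JSON)
good = validate({"Color": "blue", "Shape": "heart", "Weight": 5.7}, dictionary)
bad = validate({"Color": "blue-green", "Shape": "heart", "Weight": "heavy"}, dictionary)
print(good)
print(bad)
```

A record labeled "blue-green" is rejected here, which forces the classification question to be settled in the dictionary rather than in each report.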

40

Summary

• We found a way for non-SQL programmers to analyze complex data by looking at one dimension at a time

• Users don't have to memorize "codes"

• Users do need to understand how continuous data is mapped into categories and what the labels on these categories mean