
Post on 13-Jan-2016


SFDV3007

Chapter 4: Decision Support Systems

Overview of Chapter 4

• What is decision support?
• Decision support systems
• Data warehouse concepts
• Data warehouse analysis & design
• Online analytical processing (OLAP)

What is decision support?

(Kifer §1.4, 17.1; Silberschatz §22.1)

• Data → timely, relevant, well-visualised information.

• Tune information and presentation to specific purposes (often decision-making).


Decision-making occurs at the operational level

(see also Figure 4–1)

• Very short-term.
• Well-defined inputs.
• Produced by existing applications or simple front-end tools.
• Line managers.
• Operational decision making has to do with the day-to-day operations of an organisation, e.g., should we produce hammers tomorrow?
• Data are clearly defined (how many hammers do we need to produce?).

Decision-making occurs at the tactical level

(see also Figure 4–1)

• Short-term.
• Less well-defined inputs.
• Middle managers.
• Tactical decision making has to do with short-term planning (weeks → months), e.g., when should we start gearing up for Christmas production? When is the best time to introduce or discontinue a particular product?
• Things start to get less clear at this level, because external factors are starting to influence the data. Data may be coming not only from internal systems, but also from outside, and will probably need some processing to be useful.

Decision-making occurs at the strategic level

(see also Figure 4–1)

• Long-term.
• Ill-defined inputs.
• Often cannot use pre-existing applications ⇒ Decision Support Systems (DSS) or Executive Information Systems (EIS).
• Senior managers.
• Strategic decision making has to do with long-term planning (months → years), e.g., should we expand our product in China?
• The inputs at this level come from both internal and external sources and often include ill-defined or “fuzzy” data, and even rumour and speculation.

Operational vs. decision support queries

Operational
– How many brass reciprocating hammers do we have in stock?
– How much electrical twine did we sell yesterday?

Decision support
– How many brass reciprocating hammers were sold to customers aged 18–25 in large North Island towns over each of the last six months?
– If we double the advertising budget for electrical twine, how might that affect revenues for the next six months?

The key point here is that operational queries are typically very easy to ask in SQL, whereas decision support queries can be difficult or even impossible to formulate using SQL.
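To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are all assumptions invented for illustration, and the town filter from the slide is left out to keep it short. The operational question is a single-table lookup; the first decision support question already needs a join, a filter on customer attributes and grouping by month. The "what if we double the advertising budget?" question is a forecasting problem and is not attempted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and sample data, for illustration only.
conn.executescript("""
CREATE TABLE stock (product TEXT PRIMARY KEY, qty_on_hand INTEGER);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, age INTEGER);
CREATE TABLE sale (sale_date TEXT, product TEXT, customer_id INTEGER, qty INTEGER);

INSERT INTO stock VALUES ('brass reciprocating hammer', 120);
INSERT INTO customer VALUES (1, 22), (2, 40), (3, 19);
INSERT INTO sale VALUES
  ('2015-01-10', 'brass reciprocating hammer', 1, 2),
  ('2015-01-20', 'brass reciprocating hammer', 2, 5),
  ('2015-02-03', 'brass reciprocating hammer', 3, 1),
  ('2015-02-17', 'brass reciprocating hammer', 1, 4);
""")

product = "brass reciprocating hammer"

# Operational query: one table, one row, current state.
in_stock = conn.execute(
    "SELECT qty_on_hand FROM stock WHERE product = ?", (product,)
).fetchone()[0]

# Decision support query: join, attribute filter, aggregation over time.
monthly_sales = conn.execute(
    """
    SELECT substr(s.sale_date, 1, 7) AS month, SUM(s.qty)
    FROM sale s JOIN customer c ON c.customer_id = s.customer_id
    WHERE s.product = ? AND c.age BETWEEN 18 AND 25
    GROUP BY month
    ORDER BY month
    """,
    (product,),
).fetchall()

print(in_stock)
print(monthly_sales)
```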

There is a strong need for DSS

• Modern business very complex.
• Shrinking time frame for decision-making.
• Data from multiple sources (see Figure 4–1):
  – internal vs. external
  – “formal” vs. “informal”
  ⇒ must be sensibly integrated
• External data includes things like economic indicators, share prices, marketing information, competitors’ data (!); anything that’s relevant to the running of the business.
• Informal data includes hearsay, rumour, speculation, gossip, etc.

Components of a DSS
(adapted from Rob & Coronel, Figure 12.1)


• The data store is a specially-structured database integrating data from multiple sources.

• Data extraction and filtering extracts, validates and cleans data from many sources. Data cleaning is a non-trivial problem; sometimes it can be faster to re-enter the data manually than to clean it automatically. Cleaning includes things like ensuring that all data have the same format (not just the data type, but also things like unit sizes, e.g., thousands vs. millions of dollars), that values are stored in the correct columns, and even things like adjusting monetary values for differences in exchange rates or inflation.

• Business model data are generated by various modelling algorithms, such as regression models or linear programming.

• The end-user presentation tool is important for visualising information in a useful form (e.g., graphs vs. raw numbers). Poor visualisation can make the difference between a good decision and a bad decision.
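As a rough illustration of the cleaning step described above, the following sketch brings revenue figures reported in different unit sizes and currencies onto one common scale. The source names, the figures, the exchange rates and the helper `to_base_currency` are all hypothetical; a real cleaning job would also need validation and error handling.

```python
from decimal import Decimal

# Assumed multipliers for the unit sizes a source might report in.
UNIT_SCALE = {
    "units": Decimal(1),
    "thousands": Decimal(1000),
    "millions": Decimal(1_000_000),
}


def to_base_currency(amount, unit, rate_to_base):
    """Convert a reported amount to single units of the base currency."""
    return Decimal(amount) * UNIT_SCALE[unit] * Decimal(rate_to_base)


# Hypothetical raw figures: same revenue, three different representations.
raw = [
    {"source": "branch_a", "revenue": "1.5", "unit": "millions", "rate": "1"},
    {"source": "branch_b", "revenue": "1500", "unit": "thousands", "rate": "1"},
    {"source": "branch_c", "revenue": "1000000", "unit": "units", "rate": "1.5"},
]

cleaned = {
    r["source"]: to_base_currency(r["revenue"], r["unit"], r["rate"]) for r in raw
}

for source, value in cleaned.items():
    print(source, value)
```

Decimal is used rather than float so that monetary values are not subject to binary rounding.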

There are many types of decision support tool

Basic
– Ad hoc query tools (SQL).
– Graph and report generators.
– Spreadsheets (small data sets only!)

More advanced
– Data warehouses.
– Online analytical processing (OLAP).
  • The term “OLAP” was first used by Codd in 1993.

Operational data vs. decision support data

• Operational data are usually stored in online transaction processing (OLTP) databases.

• Decision support data are needed for tactical and strategic decision making.

• The key differences are:
  – Timespan
  – Granularity
  – Dimensionality


Characteristic        Operational data              Decision support data
Data currency         Current operations;           Historic data; snapshot of company data;
                      real-time data                time component (week/month/year)
Granularity           Atomic, detailed data         Summarised data
Summarisation level   Low; some aggregation         High; heavily aggregated
Data structure        Highly normalised;            Non-normalised; complex structures;
                      mostly RDBMS                  some relational, mostly multidimensional
Transaction type      Mostly updates                Mostly queries
Transaction volumes   High update volume            Periodic loads and summary calculations
Transaction speed     Updates are critical          Retrievals are critical
Query activity        Low to medium                 High
Query scope           Narrow range                  Broad range
Query complexity      Simple to medium              Very complex
Data volumes          Hundreds of MiB → GiB         Hundreds of GiB → TiB

Timespan is a key difference

Operational
– Very short (current transactions).

Decision support
– Long (past and future).
– Data may not be current.

Granularity is a key difference

Operational
– Represent specific transactions (atomic).

Decision support
– Varying levels of aggregation (atomic → highly summarised).
  • For example, sales aggregated by day, week, month, year, city, region, country, …. Typically, all of these levels of aggregation will be pre-computed so as to speed up querying.
– Drilling down vs. rolling up.

Dimensionality is a key difference

(Kifer §17.2; Silberschatz §22.2.1)

Operational
– “Flat” (tables of atomic transactions).

Decision support
– Many dimensions.
  • However, “dimension” means something different in a DSS context. “Variable of interest” is a more accurate term. In that sense, a single relation stores data about a single item of interest, and can therefore be considered to be a single dimension.
  • Orders by region per quarter (2D).
  • Compare sales of products during the last six months by region, city, store & customer (4D).
– Multidimensional data may be stored in multidimensional databases (MDD), which store data as multidimensional arrays rather than as tables.
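The "orders by region per quarter" case can be sketched in a few lines of Python. The order rows below are made up, and the dictionary keyed by (region, quarter) is only a stand-in for how a multidimensional store might address a cell by its coordinates; it is not how any particular MDD product works.

```python
from collections import defaultdict

# Hypothetical flat (operational-style) rows: (region, quarter, amount).
orders = [
    ("North", "Q1", 100),
    ("North", "Q2", 150),
    ("South", "Q1", 80),
    ("North", "Q1", 50),
    ("South", "Q2", 120),
]

# Each cell is addressed by its coordinates on the two dimensions.
cells = defaultdict(int)
for region, quarter, amount in orders:
    cells[(region, quarter)] += amount

regions = sorted({region for region, _, _ in orders})
quarters = sorted({quarter for _, quarter, _ in orders})

# The same data laid out as a 2D array: rows are regions, columns are quarters.
grid = [[cells[(r, q)] for q in quarters] for r in regions]

for region, row in zip(regions, grid):
    print(region, row)
```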


Dimensionality is a key difference

(adapted from Rob & Coronel, Figure 12.13; see also Kifer §17.2 & Silberschatz §22.2.1)


Data warehouses store decision support data
(Mannino §14.1.2, 14.1.3; Rob & Coronel Table 12.5)

• Designed and optimised for decision support data.

• Internal structure quite different from operational databases:
  – aggregated
  – denormalised
  – data from multiple internal/external sources

DW data are integrated
(Rob & Coronel Table 12.5)

Operational database data
– Mostly internal sources.
  • Internal source: sales database.
– Multiple representations.
  • Similar data from different sources can have different representations or meanings. For example, tax numbers may be stored as ##-###-### or as ########, and a Boolean condition may be labelled as T/F, 0/1 or Y/N.

Data warehouse data
– Both internal and external sources.
  • External sources: share prices, government economic indicators, etc.
– Transformed, cleaned and summarised during integration.
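The representation problem mentioned above (tax numbers with or without separators, Booleans as T/F, 0/1 or Y/N) might be handled during integration roughly as follows. The function names are hypothetical, and the eight-digit check simply reflects the two example formats given; real tax numbers would need their own validation rules.

```python
import re

# Accepted spellings of a Boolean flag, taken from the examples above.
TRUE_VALUES = {"T", "1", "Y"}
FALSE_VALUES = {"F", "0", "N"}


def normalise_tax_number(value):
    """Strip separators so that ##-###-### and ######## compare equal."""
    digits = re.sub(r"\D", "", value)
    if len(digits) != 8:
        raise ValueError(f"unexpected tax number: {value!r}")
    return digits


def normalise_flag(value):
    """Map T/F, 0/1 and Y/N (any case) onto a Python bool."""
    token = str(value).strip().upper()
    if token in TRUE_VALUES:
        return True
    if token in FALSE_VALUES:
        return False
    raise ValueError(f"unexpected Boolean label: {value!r}")


print(normalise_tax_number("12-345-678"))
print(normalise_flag("y"), normalise_flag(0))
```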


DW data are subject-oriented

(Rob & Coronel Table 12.5)

Operational database data
– Functional or process-oriented (invoices, payments, products).

Data warehouse data
– Facts or measures organised by major subject areas (sales, marketing, etc.).
– Held according to dimensions or variables of interest: product, customer, region, …
– Aggregated data from many operational tables.
– Queries tuned to specific decision-making needs.

DW data are time-variant

(Rob & Coronel Table 12.5)

Operational database data
– Current transactions with precise time stamps.

Data warehouse data
– Time an important dimension for almost all subject areas.
– Data aggregated by time, e.g., sales by week, month, quarter, year…
– Historical focus (past and future).

DW data are non-volatile

(Rob & Coronel Table 12.5)

Operational database data
– Frequent changes ⇒ dynamic.
– Often archived periodically.
  • The bigger the database, the slower it will be, so archiving periodically will help with database performance.
  • The main side effect of this is that operational databases typically don’t grow that large.

Data warehouse data
– Read only (occasional batch updates) ⇒ static.
– Historical data retained ⇒ always growing (GiB → …).
  • Data are generally at least several TiB in size, and are always growing because nothing is deleted.

Drill-down and roll-up

• The two basic hierarchical operations when displaying data at multiple levels of aggregation are drill-down and roll-up.
• Drill-down refers to the process of viewing data at a level of increased detail.
• Roll-up refers to the process of viewing data with decreasing detail.
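A small sketch of the two operations over a time hierarchy (year → month → day) follows. The sales rows and the `aggregate` helper are assumptions for illustration; a real warehouse would usually read pre-computed aggregates rather than recompute them on each request.

```python
from collections import defaultdict

# Hypothetical atomic sales: (year, month, day, amount).
sales = [
    (2015, 1, 5, 10),
    (2015, 1, 20, 15),
    (2015, 2, 3, 5),
    (2016, 1, 9, 7),
]


def aggregate(rows, depth):
    """Sum amounts grouped by the first `depth` levels of the hierarchy."""
    totals = defaultdict(int)
    for *levels, amount in rows:
        totals[tuple(levels[:depth])] += amount
    return dict(totals)


by_month = aggregate(sales, 2)  # starting view: sales per month
by_year = aggregate(sales, 1)   # roll up: less detail
by_day = aggregate(sales, 3)    # drill down: more detail

print(by_year)
print(by_month)
print(by_day)
```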


Defining a data warehouse in more detail
(Kifer §17.6; Silberschatz §22.4.1; Table 4–2)

• Read-only database optimised for data analysis and query processing.
• Data from:
  – “legacy”/archived databases
  – operational databases
  – other sources
• Optimisation includes:
  – decisions on aggregations
  – important dimensions
  – appropriate indexing and physical design
• Note: because data warehouses are read-only, they can be indexed heavily.

Data marts are small, specialised data warehouses

• Focused subset of data.
• “Clusters” of data marts surrounding central enterprise data warehouse?
  – This is useful if the central data warehouse is particularly large. Extracting a smaller subset relating only to the problem at hand enables faster processing and a tighter focus.

Data warehouse analysis is more demanding

(Mannino §14.3.1)

• Some queries may be impossible if not designed for.

• Not as flexible for ad hoc queries.
• Users must identify intended use.
• Data derived from both internal and external sources (e.g., Internet: Yahoo!, Dow Jones, NASDAQ).

The difficulty of DW design

(The Standish Group (1997), “The Meta Myth”; http://standishgroup.com/)

Interviewer: How many data warehouses have you had?

Data warehouser: We have had eight.

Interviewer: To what do you attribute so many warehouses?

Data warehouser: Seven mistakes…


Facts are a key design aspect

(Mannino §14.3.2)

• Facts are the base values that we are interested in monitoring.

• Examples: revenue, profits, cost, number of sales.

• Also known as measures.


Dimensions are a key design aspect

(Mannino §14.3.2)

• A factor/variable that influences the facts.

• The values of the dimensions affect our view of the facts.

• Examples: time, product, customer, salesrep, location.

• Each has attributes.


Time as a dimension
(see also Mannino §14.2.3)

• Not as simple as it seems!
• Granularity (unit size): year, month, week, day, hour.
• Alternate units (periodicity): season, financial year, quarter.
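To see why time is "not as simple as it seems", the sketch below derives several attributes of a time dimension from a single calendar date. The function name is hypothetical, and the financial year is assumed to start on 1 April and to be labelled by its starting calendar year; both conventions vary between organisations.

```python
from datetime import date


def time_dimension_row(d, fin_year_start_month=4):
    """Derive several time-dimension attributes from one calendar date."""
    quarter = (d.month - 1) // 3 + 1
    # Assumption: the financial year is labelled by the year in which it starts.
    fin_year = d.year if d.month >= fin_year_start_month else d.year - 1
    iso_year, iso_week, _ = d.isocalendar()
    return {
        "date": d.isoformat(),
        "year": d.year,
        "quarter": quarter,
        "month": d.month,
        "iso_year": iso_year,
        "iso_week": iso_week,
        "financial_year": fin_year,
    }


row = time_dimension_row(date(2016, 1, 13))
print(row)
```

Note that the calendar year (2016) and the financial year (2015 under the assumption above) already disagree for this date, so a query must state which one it groups by.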


Star schemas for relational data warehouses

(Kifer pp. 715–717; Silberschatz §22.4.2; Figure 4–3; see also Data Warehousing Guide ch. 2)

• Central fact table.
• Cluster of related dimension tables.
• Needed because of inadequate physical data independence? (denormalised)

• Partial normalisation leads to a “snowflake” or “starflake” structure (also “constellation”).


• Star schemas are the most common structure used for relational data warehouses.

• The fact table is typically a huge composite of numbers relating to various things, and is heavily denormalised with lots of duplicated values.

• Relational data warehouses are typically denormalised, although this has more to do with a lack of physical data independence in RDBMS.
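A minimal star schema can be sketched with Python's built-in sqlite3 module. The table names (`fact_sales`, `dim_time`, `dim_product`, `dim_location`), their columns and the sample rows are assumptions for illustration and are not taken from Figure 4–3; a real fact table would be far larger and would carry more measures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One central fact table whose foreign keys point at the dimension tables.
conn.executescript("""
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (
    time_id INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    revenue REAL,
    units INTEGER
);
INSERT INTO dim_time VALUES (1, 2015, 1), (2, 2015, 2);
INSERT INTO dim_product VALUES (1, 'hammer'), (2, 'twine');
INSERT INTO dim_location VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_sales VALUES
  (1, 1, 1, 100.0, 10),
  (1, 2, 1, 40.0, 8),
  (2, 1, 2, 60.0, 6),
  (2, 1, 1, 30.0, 3);
""")

# A typical star query: join the fact table to the dimensions of interest,
# then aggregate the measure.
revenue_by_region_quarter = conn.execute("""
    SELECT l.region, t.quarter, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_time t ON t.time_id = f.time_id
    JOIN dim_location l ON l.location_id = f.location_id
    GROUP BY l.region, t.quarter
    ORDER BY l.region, t.quarter
""").fetchall()

print(revenue_by_region_quarter)
```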


Three steps to populate a data warehouse

This process is often referred to by the acronym ETL.

Extraction: obtaining data from sources.
Transformation: altering form of data (includes cleaning). Transformation involves ensuring that consistent data formats are used and also fitting imported data into the DW structure. Cleaning involves ensuring that data are correct, accurate and self-consistent.
Loading: adding data to warehouse.

• Possibly intermediate data staging steps.
  – Data staging can happen between each of the three stages. Thus, we might extract the data from the original sources and store it in a temporary staging database before it enters the transformation process.
• Critical for successful data warehouses.
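The three steps can be sketched as three small Python functions, with a plain list standing in for the staging area. Everything here is hypothetical: the source rows are hard-coded rather than read from a real system, the `fact_sales` table is invented, and rows that fail cleaning are simply set aside rather than repaired.

```python
import sqlite3


def extract():
    """Extraction: obtain raw rows from a source (hard-coded sample data here)."""
    return [
        {"sold_on": "2015-01-10", "product": " Hammer ", "revenue": "100.50"},
        {"sold_on": "2015-01-11", "product": "twine", "revenue": "n/a"},
        {"sold_on": "2015-01-12", "product": "TWINE", "revenue": "20"},
    ]


def transform(rows):
    """Transformation: enforce consistent formats and reject unusable rows."""
    clean, rejected = [], []
    for row in rows:
        try:
            revenue = float(row["revenue"])
        except ValueError:
            rejected.append(row)  # cleaning: set aside rows with a bad amount
            continue
        clean.append((row["sold_on"], row["product"].strip().lower(), revenue))
    return clean, rejected


def load(conn, rows):
    """Loading: add the transformed rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(sold_on TEXT, product TEXT, revenue REAL)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
staged = extract()  # staging: raw rows held before transformation
clean_rows, rejected_rows = transform(staged)
load(conn, clean_rows)

loaded = conn.execute(
    "SELECT product, revenue FROM fact_sales ORDER BY sold_on"
).fetchall()
print(loaded, len(rejected_rows))
```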

Performance tuning for data warehouses

• Complex queries ⇒ denormalisation (fewer joins).
• Mostly read-only + complex queries ⇒ index heavily.
• Other techniques:
  – normalise dimension tables
    • Normalising the dimension tables simplifies filtering operations related to the dimensions.
  – multiple fact tables for different aggregation levels
    • Each fact table will then be smaller, and thus faster to access.
  – physical tuning: partitioning, replication, etc.


• B-tree indexes and hashing generally useful.
• Bitmap indexes particularly for “counting by category” queries.
  – Bitmap indexes also useful because there are lots of low selectivity columns (columns with few distinct values) in the fact table.
• Integrated indexes for dimension tables.
• Function-based indexes could be useful?
• Dimension tables are effectively commonly used “lookup tables”, so storing them (in Oracle) as index-organised tables could be beneficial.
  – The main advantage is that the index (and hence the table) will typically be kept in RAM.
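The idea behind a bitmap index, and why it suits "counting by category" queries, can be illustrated with Python integers used as bit strings. This is a conceptual sketch only and says nothing about how Oracle stores or compresses its bitmaps; the column values are made up.

```python
# Hypothetical fact-table columns with few distinct values, one entry per row.
region = ["North", "South", "North", "East", "South", "North"]
promo = [True, False, True, True, False, False]


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = {}
    for i, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << i)
    return index


def count_bits(bitmap):
    """Number of rows represented by a bitmap."""
    return bin(bitmap).count("1")


region_idx = build_bitmap_index(region)
promo_idx = build_bitmap_index(promo)

# Counting by category needs only the bitmap, not the table rows.
north_count = count_bits(region_idx["North"])

# Combining conditions is a bitwise AND of two bitmaps.
north_promo_count = count_bits(region_idx["North"] & promo_idx[True])

print(north_count, north_promo_count)
```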


Oracle10g supports data warehouses

(Data Warehousing Guide)

The simple approach
– Use distribution and replication services.
– Does not scale well.

Oracle data mart suite
– Add-on for constructing Oracle data marts.
– GUI interface.
– Modules for design, extraction, transformation and loading of data.
– Third-party tools also available.

• Bitmap & function-based indexes, index-organised tables.
• Bitmap join indexes.
  – Bitmap join indexes speed up queries that join dimension tables to a fact table, because they effectively index the join.
• Other relevant tools:
  – SQL*Loader (possibly in conjunction with Transparent Gateways)
  – export and import (basic)
    • If your data sources are Oracle databases, then you could in theory export from the sources and import into the DW.

OLAP tools enable complex data processing

(Silberschatz §22.2; Figure 4–4)

• Complex analysis of multidimensional data.

• Spreadsheet-like “simplicity”.
• Data stored in warehouse or tool’s internal proprietary database.

OLAP tools have many capabilities

• Data transformation.
• Business modelling.
• Statistical analysis.
• Powerful GUI query facility.
• Visualisation (graphics).

A simple OLAP example using Excel
(Kifer §17.3; see also Silberschatz §22.2.3–22.2.5)

• Sales subject area dimensions: customer, salesreps, product, region, time, …

• View sales aggregated by dimensions.
• Dynamically alter presentation:
  – drill down/roll up
  – “slice and dice”
  – “pivot” the table (a pivot table is a data summarisation tool found in data visualisation programs such as spreadsheets)
  – highlight exceptions (e.g., high loss products)
  – invent new columns (e.g., % sales revenue)

“Slice and dice” enables dynamic visualisation

A slice is effectively a restriction to a subset of values in a particular dimension. We can slice on just one dimension, or many dimensions at once.
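Slicing can be sketched as a filter over the cells of a small cube. The cube contents, the dimension names and the `slice_cube` helper are assumptions for illustration; OLAP tools do the same kind of restriction interactively and on much larger data.

```python
# Hypothetical cube: (product, region, quarter) -> revenue.
DIMS = ("product", "region", "quarter")
cube = {
    ("hammer", "North", "Q1"): 100,
    ("hammer", "South", "Q1"): 60,
    ("twine", "North", "Q1"): 40,
    ("twine", "North", "Q2"): 30,
    ("hammer", "North", "Q2"): 80,
}


def slice_cube(cube, **allowed):
    """Keep the cells whose value on each named dimension is in the given set."""

    def keep(key):
        return all(key[DIMS.index(dim)] in values for dim, values in allowed.items())

    return {key: value for key, value in cube.items() if keep(key)}


# Slice on one dimension: first quarter only.
q1 = slice_cube(cube, quarter={"Q1"})

# Slice on two dimensions at once: hammers in the North.
north_hammer = slice_cube(cube, product={"hammer"}, region={"North"})

print(sum(q1.values()))
print(north_hammer)
```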

OLAP data may be stored in different ways

(Kifer §17.4; Silberschatz §22.2.2)

• Internal proprietary database (often a multidimensional database, MDD).
• Access external databases (data warehouses):
  – relational (ROLAP)
  – multidimensional (MOLAP)
  – both (hybrid OLAP, HOLAP)

Some OLAP products

• Oracle Business Intelligence and Hyperion BI+.
• IBM Cognos Business Intelligence.
• Business Objects & Crystal Reports.
• JasperSoft Business Intelligence Suite (http://www.jasperforge.org/).

Oracle10g SQL has additional OLAP support

(see also Kifer §17.3.2 and Silberschatz §22.2.3)

• GROUP BY CUBE (<columns>).
• GROUP BY ROLLUP (<columns>).
  – These enable you to build data cubes and aggregated rollups straight out of Oracle.
• GROUPING SETS (different from SQL:1999’s GROUPING function).
• No RANK or PARTITION BY (SQL:1999).
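To show which subtotals CUBE and ROLLUP produce, the sketch below emulates them in plain Python rather than running Oracle SQL. The rows, column names and helper functions are hypothetical. `None` marks a column that has been aggregated away, in the same way that SQL reports NULL in subtotal rows. CUBE groups by every subset of the listed columns, whereas ROLLUP groups only by each leading prefix.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows: (region, product, amount).
COLS = ("region", "product")
rows = [
    ("North", "hammer", 100),
    ("North", "twine", 40),
    ("South", "hammer", 60),
]


def cube_sets(cols):
    """Every subset of the columns, as CUBE would group by."""
    return [set(c) for n in range(len(cols), -1, -1) for c in combinations(cols, n)]


def rollup_sets(cols):
    """Every leading prefix of the columns, as ROLLUP would group by."""
    return [set(cols[:n]) for n in range(len(cols), -1, -1)]


def grouped_totals(rows, grouping_sets):
    """Sum amounts for each grouping set; None marks an aggregated-away column."""
    result = {}
    for kept in grouping_sets:
        totals = defaultdict(int)
        for *dims, amount in rows:
            key = tuple(d if c in kept else None for c, d in zip(COLS, dims))
            totals[key] += amount
        result.update(totals)
    return result


cube = grouped_totals(rows, cube_sets(COLS))
rollup = grouped_totals(rows, rollup_sets(COLS))

print(len(cube), len(rollup))
```

With two columns, the cube contains the per-product totals across all regions, which the rollup on (region, product) omits.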


Data mining may find hidden trends

(Kifer §17.7; Silberschatz §22.3)

• OLAP & data warehousing let us identify trends and relationships.

BUT: some relationships too complex or subtle to easily notice.

• Data mining tools claim to sift through databases and find unrecognised relationships and trends.


There are many data mining techniques

(Kifer §17.8–17.11; Silberschatz §22.3)

• Neural networks.
• Complex visualisation.
• Genetic algorithms (evolve a solution).
• Advanced statistical analysis (traditional).

Data mining examples

• Fraud detection (phone, credit card).
• MCI’s statistical profiles.
• Risk assessment for car insurance (FIG).
• NBA strategy analysis.
• But it’s not foolproof…

Summary of Chapter 4

• Decision support
  – decision support vs. operational
  – decision support systems
• Data warehouses
  – characteristics
  – logical & physical design
• Online analytical processing (OLAP)
• Data mining