Data Warehousing Lifecycle


Page 1: Data Warehousing Lifecycle

Data Warehousing Lifecycle

• Conceptual modeling: System requirements, data sources and warehousing activities.

• Logical design: Data flow from sources to DW, composition and semantics of activities.

• DW construction: Schema implementation, data population and warehouse tuning.

• Application development: DW interfaces, OLAP and data mining tools.

Page 2: Data Warehousing Lifecycle

On-Line Analytical Processing (OLAP)

[Figure: a data cube with axes Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread) and Time (day: M T W Th F S S); sample cell values 10, 15, 18, 5, 24, 32, 16.]

Dimensions: Time, Product, Store

Hierarchies: Day → Week → Quarter; Product → Brand → …; Store → Region → Country

Roll-ups: to week, to brand, to region.

[Figure: the same cube rolled up to Time (week: W1, W2, W3, W4); sample cell value 120.]

Operators: roll-up, drill-down, slice and dice.

Uses: Business data analysis, e.g., market-driven trend analysis.
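As a rough illustration of roll-up and slice, the sketch below stores cube cells in a Python dictionary keyed by (store, product, day) and aggregates days into a week. The seven daily values are the ones on the slide. Assigning them to the NY/Juice cell, the day labels Sa and Su, the week label W1, and the helper names roll_up and slice_cube are all assumptions made for the example.

```python
from collections import defaultdict

DAYS = ["M", "T", "W", "Th", "F", "Sa", "Su"]  # "Sa"/"Su" labels are assumed

# Cube cells keyed by (store, product, day). Placing the slide's seven values
# in the NY/Juice cell is an assumption made for illustration.
cube = {("NY", "Juice", day): units
        for day, units in zip(DAYS, [10, 15, 18, 5, 24, 32, 16])}

# One level of the Time hierarchy: every day maps to its week.
day_to_week = {day: "W1" for day in DAYS}


def roll_up(cells, axis, mapping):
    """Replace the member on `axis` by its parent and sum the measures."""
    totals = defaultdict(int)
    for key, value in cells.items():
        coarse = list(key)
        coarse[axis] = mapping[key[axis]]
        totals[tuple(coarse)] += value
    return dict(totals)


def slice_cube(cells, axis, member):
    """Keep only the cells whose coordinate on `axis` equals `member`."""
    return {key: value for key, value in cells.items() if key[axis] == member}


weekly = roll_up(cube, axis=2, mapping=day_to_week)
friday = slice_cube(cube, axis=2, member="F")
print(weekly)  # {('NY', 'Juice', 'W1'): 120}
print(friday)  # {('NY', 'Juice', 'F'): 24}
```

The weekly total of 120 agrees with the value shown in the week-level cube.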

Page 3: Data Warehousing Lifecycle

CSE601 3

Cube Aggregates Lattice

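Lattice of group-bys, finest to coarsest:

(city, product, date)

(city, product)   (city, date)   (product, date)

(city)   (product)   (date)

(all)

Example cube (blank cells are empty):

day 1     c1   c2   c3
  p1      12        50
  p2      11    8

day 2     c1   c2   c3
  p1      44    4
  p2

Summed over date:

          c1   c2   c3
  p1      56    4   50
  p2      11    8

Summed over date and product:

          c1   c2   c3
          67   12   50

Summed over everything: 129

Use a greedy algorithm to decide what to materialize.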
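The lattice can be made concrete by computing every group-by of the small example cube. The sketch below is only an illustration: the placement of the six base cells is inferred from the aggregates on the slide (56, 4, 50, 11, 8, then 67, 12, 50, then 129), and the names facts, group_by and lattice are hypothetical. It computes the aggregates only and does not implement the greedy materialization choice.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("city", "product", "date")

# Base cells (city, product, date, value); cell positions are inferred from
# the slide's aggregate tables and are an assumption of this sketch.
facts = [
    ("c1", "p1", "day1", 12), ("c3", "p1", "day1", 50),
    ("c1", "p2", "day1", 11), ("c2", "p2", "day1", 8),
    ("c1", "p1", "day2", 44), ("c2", "p1", "day2", 4),
]


def group_by(rows, keep):
    """Sum the measure over all dimensions that are not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


# One entry per lattice node, from (city, product, date) down to () = "all".
lattice = {keep: group_by(facts, keep)
           for size in range(len(DIMS), -1, -1)
           for keep in combinations(DIMS, size)}

print(lattice[("city", "product")][("c1", "p1")])  # 56
print(lattice[("city",)])  # {('c1',): 67, ('c3',): 50, ('c2',): 12}
print(lattice[()])         # {(): 129}
```

A materialization strategy would pick a subset of these eight nodes to store and answer the remaining group-bys from a stored ancestor.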

Page 4: Data Warehousing Lifecycle

Dimension Hierarchies

Hierarchy (coarsest to finest): all, state, city

cities table:

city   state
c1     CA
c2     NY

Page 5: Data Warehousing Lifecycle

Dimension Hierarchies

Nodes of the base lattice: (city, product, date); (city, product), (city, date), (product, date); (city), (product), (date); (all)

Nodes added by the state level: (state, product, date); (state, product), (state, date); (state)

Not all arcs shown...

Page 6: Data Warehousing Lifecycle

Logical Data Modeling: A Star Schema Example

Sales: time_key, branch_key, location_key, product_key, num_units, amount_usd

Time: time_key, day, month, year

Product: product_key, name, brand, type

Supplier: supplier_key, name, type

Location: location_key, city, state, country

Branch: branch_key, name, type

[Diagram: four 1 : n links between dimension tables and the Sales table, plus one link marked "???".]

• One-to-many relationships between the fact and dimensions.

• The fact-dimension relationships are certain.

• Dimensions in star models are often tightly coupled.

• Star schema does not appear to be very extensible.
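To make the star layout concrete, here is a minimal sketch using Python's built-in sqlite3 module. The column names follow the slide. The dim_/fact_ table prefixes, the column types, and every inserted row (brands, amounts, dates) are invented for illustration and are not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE dim_branch   (branch_key INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT, brand TEXT, type TEXT);
-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    num_units    INTEGER,
    amount_usd   REAL
);
""")

# Hypothetical rows, just enough to run one query.
conn.execute("INSERT INTO dim_time VALUES (1, 1, 1, 2000)")
conn.execute("INSERT INTO dim_branch VALUES (1, 'Main', 'retail')")
conn.execute("INSERT INTO dim_location VALUES (1, 'NY', 'NY', 'USA')")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?, ?)",
                 [(1, "Juice", "BrandA", "drink"), (2, "Milk", "BrandB", "dairy")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 10, 25.0), (1, 1, 1, 2, 5, 10.0), (1, 1, 1, 1, 15, 37.5)])

# Roll-up from product to brand: join the fact table to one dimension.
units_by_brand = conn.execute("""
    SELECT p.brand, SUM(s.num_units)
    FROM fact_sales AS s
    JOIN dim_product AS p ON p.product_key = s.product_key
    GROUP BY p.brand
    ORDER BY p.brand
""").fetchall()
print(units_by_brand)  # [('BrandA', 25), ('BrandB', 5)]
```

Every fact row must supply a key for each of the four dimensions, which is the tight coupling the slide refers to.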

Page 7: Data Warehousing Lifecycle

Biomedical Data Resources

• Static data: data on genotypes and on biological entities such as nucleic acids and proteins, and the relationships between these entities.

• Dynamic data: data on phenotypes, the dynamics of biological processes.

• Data on analysis tools: data on biological and computer science methods which can be used to identify the entities and relationships.

• References and annotations: to scientific papers and textual explanations.

Page 8: Data Warehousing Lifecycle

Biomedical Data Modeling

• Flat file collections: Databases were built up as indexed ASCII text files.

• Relational databases: many biology databases were implemented using Oracle, Sybase, or MySQL.

• Object-oriented databases: data are modeled as objects that are organized in classes.

• Multidimensional databases: data are organized in star-like schemas.

Page 9: Data Warehousing Lifecycle

Using Star Schema in Gene Expression Data Management

• “Applying Data Warehouse Concepts to Gene Expression Data Management”, by V. Markowitz and T. Topaloglou

• Three modeling data spaces:

– Sample data space

– Gene Annotation data space

– Gene expression data space

Page 10: Data Warehousing Lifecycle

Gene Expression Data Space

Expression: Gene_id, Experiment_id, Analysis_id, Expression_call

Gene: Gene_id, Gene_name, Gene_symbol

Experiment: Experiment_id, Exp_name, Exp_date, Exp_file

Analysis: Analysis_id, Algorithm, version

Also shown in the diagram: Sample, Clinical Sample

Page 11: Data Warehousing Lifecycle

Sample Data Space

[Diagram. Entities: BiologicalSample, Pathways, Study, Donor, DonorDemographics, DonorClinical.]

Page 12: Data Warehousing Lifecycle

Gene Annotation Data Space

[Diagram. Entities: GeneFragments, Sequence, Pathways, SequenceCluster, Known gene, MicroarrayDesign, Chromosome.]

Page 13: Data Warehousing Lifecycle

OLAP Operations

• Sample selection: extract sets of samples with a certain profile on the sample data space. E.g., a set of male colon samples with adenocarcinoma from donors in the age group 40-60.

• Classification on organ: total number of samples classified by liver, brain, …
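The two operations above can be sketched in a few lines of Python. The sample records, their field names (organ, gender, diagnosis, donor_age) and the helper select_samples are hypothetical; they only mirror the profile used in the example (male, colon, adenocarcinoma, age 40-60).

```python
from collections import Counter

# Hypothetical sample annotations; all values are invented for illustration.
samples = [
    {"id": "s1", "organ": "colon", "gender": "M", "diagnosis": "adenocarcinoma", "donor_age": 52},
    {"id": "s2", "organ": "colon", "gender": "F", "diagnosis": "adenocarcinoma", "donor_age": 47},
    {"id": "s3", "organ": "liver", "gender": "M", "diagnosis": "normal", "donor_age": 61},
    {"id": "s4", "organ": "brain", "gender": "M", "diagnosis": "glioblastoma", "donor_age": 35},
    {"id": "s5", "organ": "colon", "gender": "M", "diagnosis": "adenocarcinoma", "donor_age": 66},
]


def select_samples(records, organ, gender, diagnosis, min_age, max_age):
    """Sample selection: keep the records that match the whole profile."""
    return [r["id"] for r in records
            if r["organ"] == organ
            and r["gender"] == gender
            and r["diagnosis"] == diagnosis
            and min_age <= r["donor_age"] <= max_age]


selected = select_samples(samples, "colon", "M", "adenocarcinoma", 40, 60)

# Classification on organ: number of samples per organ.
by_organ = Counter(r["organ"] for r in samples)

print(selected)        # ['s1']
print(dict(by_organ))  # {'colon': 3, 'liver': 1, 'brain': 1}
```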

Page 14: Data Warehousing Lifecycle

OLAP Operations

• Gene selection: extract sets of genes with certain properties over the gene annotation data space. E.g., a gene set of the genes on chromosome 22 …

• Aggregates: gene summarization on the sample dimension, sample summarization on the gene dimension, etc.

Page 15: Data Warehousing Lifecycle

Clinical Data Space

[ER diagram with 1 and n cardinality labels on its links. Entities: Clinical Sample, Medical Image, Followup, Drug, Demographics, Clinical Test, Physiology, Patient, Disease.]

Page 16: Data Warehousing Lifecycle

Sample Data Space

[ER diagram with 1 and n cardinality labels on its links. Entities: Protein Expression, mRNA Expression, Anatomy Ontology, Biochemical Assay, Genetic Screening, Clinical Sample, Patient.]

Page 17: Data Warehousing Lifecycle

Microarray Data Space

[ER diagram with 1 and n cardinality labels on its links. Entities: mRNA Expression, Experiment, Measurement Unit, Array Probe, Gene Sequence, Clinical Sample.]

Page 18: Data Warehousing Lifecycle

Proteomic Data Space

[ER diagram with 1 and n cardinality labels on its links. Entities: Protein Expression, Experiment, Measurement Unit, Gene Sequence, Clinical Sample.]

Page 19: Data Warehousing Lifecycle

Experiment Data Space

[ER diagram with 1 and n cardinality labels on its links. Entities: Project, Experiment, Publication, Normalization, Protocol, Person, Platform.]

Page 20: Data Warehousing Lifecycle

Gene Data Space

[ER diagram with 1, 2 and n cardinality labels on its links. Entities: Protein Expression, Gene Sequence, Promoter, Gene Ontology, Protein Domain, Protein-Protein Interaction, Gene Cluster, mRNA Expression, Array Probe.]

Page 21: Data Warehousing Lifecycle

Explicit Definition of Concept Hierarchies

[ER diagram with 1 and n cardinality labels on its links. Entities: mRNA Expression, Experiment, Measurement Unit, Array Probe, Gene Sequence, Clinical Sample, Anatomy Ontology, Patient, Disease, Project, Platform, Normalization, Gene Ontology, Gene Cluster.]

Page 22: Data Warehousing Lifecycle

Characteristics of Clinical and Genomic Data

Clinical and Genomic Data | Business Data

Complex data structure with many potential dimensions | Easy-to-understand data structure with few dimensions

Often many-to-many relationships between facts and dimensions | Many-to-one relationships between facts and dimensions

Uncertain relationships between fact and dimension objects | Certain relationships between fact and dimension objects

Some measures require advanced temporal support for time validity | Historical data, no advanced temporal support needed

Incomplete and/or imprecise data very common | Few incomplete and/or imprecise data

Page 23: Data Warehousing Lifecycle

Large Number of Dimensions and Evolution of Dimensions

• If a star schema is used and the number of dimensions is large, the fact table will be huge (every row carries a combination of foreign keys, one per dimension).

• Adding a new dimension to a star schema requires recomputing all data entries in the fact table.

Page 24: Data Warehousing Lifecycle

Many-to-Many relationships

• Many-to-many relationships cannot be easily modeled with a star schema, which was originally designed to handle many-to-one relationships between a business fact and a dimension.

Page 25: Data Warehousing Lifecycle

Incompleteness of Data

• Clinical data may be incomplete. This can leave many null foreign-key values in the fact table, which results in inconsistency.

Page 26: Data Warehousing Lifecycle

Star Schema

Fact: DimKey1, DimKey2, DimKey3, DimKey4, Measure1, Measure2, Measure3, Measure4

Dim1: DimKey1, . . .

Dim2: DimKey2, . . .

Dim3: DimKey3, . . .

Dim4: DimKey4, . . .

BioStar Schema

Fact: FactKey, . . .

MTable1: DimKey1, FactKey, Measure1

MTable2: DimKey2, FactKey, Measure2

MTable3: DimKey3, FactKey, Measure3

MTable4: DimKey4, FactKey, Measure4

Dim1: DimKey1, . . .

Dim2: DimKey2, . . .

Dim3: DimKey3, . . .

Dim4: DimKey4, . . .

Page 27: Data Warehousing Lifecycle

BioStar Schema for Part of the Clinical Data Space

Patient: PatientID, SSN, Name, Gender, DOB

DrugUse: DrugID, PatientID, Dosage, ValidFrom, ValidTo

TestResult: TestID, PatientID, Result, DateTested

ClinicalSample: SampleID, PatientID, Source, Amount, DateTaken

Diagnosis: DiseaseID, PatientID, Symptom, ValidFrom, ValidTo

Drug: DrugID, DrugName, DrugType, Description

Disease: DiseaseID, Name, Type, Description

ClinicalTest: TestID, TestName, TestType, TestSetting

Extensibility and flexibility
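A small sqlite3 sketch may help show how a measure table sits between the central table and a dimension. It uses a subset of the Patient, Drug and DrugUse columns from this slide. The column types, the omitted columns, and all inserted rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Central table and one dimension (subset of the slide's columns).
CREATE TABLE Patient (PatientID INTEGER PRIMARY KEY, Name TEXT, Gender TEXT, DOB TEXT);
CREATE TABLE Drug    (DrugID INTEGER PRIMARY KEY, DrugName TEXT, DrugType TEXT, Description TEXT);
-- Measure table: links a patient to a drug and carries the measure itself.
CREATE TABLE DrugUse (
    DrugID    INTEGER REFERENCES Drug(DrugID),
    PatientID INTEGER REFERENCES Patient(PatientID),
    Dosage    TEXT,
    ValidFrom TEXT,
    ValidTo   TEXT
);
""")

# Hypothetical rows: P1 takes two drugs, P2 takes one, P3 takes none.
conn.executemany("INSERT INTO Patient VALUES (?, ?, ?, ?)",
                 [(1, "P1", "F", "1960-01-01"), (2, "P2", "M", "1955-05-05"),
                  (3, "P3", "F", "1970-07-07")])
conn.executemany("INSERT INTO Drug VALUES (?, ?, ?, ?)",
                 [(10, "DrugX", "typeA", ""), (11, "DrugY", "typeB", "")])
conn.executemany("INSERT INTO DrugUse VALUES (?, ?, ?, ?, ?)",
                 [(10, 1, "5 mg", "2001-01-01", "2001-06-30"),
                  (11, 1, "10 mg", "2001-03-01", "2001-04-30"),
                  (10, 2, "5 mg", "2002-01-01", "2002-02-28")])

# Many-to-many in both directions, with no placeholder row for P3.
drugs_per_patient = conn.execute("""
    SELECT p.Name, COUNT(u.DrugID)
    FROM Patient AS p
    LEFT JOIN DrugUse AS u ON u.PatientID = p.PatientID
    GROUP BY p.PatientID
    ORDER BY p.PatientID
""").fetchall()
patients_on_drug_x = conn.execute(
    "SELECT COUNT(DISTINCT PatientID) FROM DrugUse WHERE DrugID = 10").fetchone()[0]

print(drugs_per_patient)   # [('P1', 2), ('P2', 1), ('P3', 0)]
print(patients_on_drug_x)  # 2
```

A patient without any drug record simply has no row in DrugUse, so the Patient table never needs a null drug key.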

Page 28: Data Warehousing Lifecycle

BioStar Schema for the Sample Data Space

ClinicalSample: SampleID, PatientID, Source, Amount, DateTaken

mRNAExpression: SampleID, ArrayProbeID, ExperimentID, MeasureUnitID, Expression

AssayResult: AssayID, SampleID, Result, Comment, DateTested

AnatomyTerm: TermID, TermType, TermName, Definition

BiochemAssay: AssayID, AssayName, AssayType, AssaySetting, Description

SampleAnatomy: TermID, SampleID, Description

GeneticScreen: MarkerID, SampleID, Result, RawData, Comment, DateTested

GeneticMarker: MarkerID, MarkerName, MarkerType, GeneticLocus, Description

Page 29: Data Warehousing Lifecycle

BioStar Schema for Part of the Gene Data Space

GeneSequence: UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status

GOAnnotation: GOID, UID, Evidence

Promoter: PromoterID, UID, PromoterType, PromoterSeq, Length, Description

ProteinInteract: UID1, UID2, Evidence, Description

GeneCluster: ClusterID, UID

GOTerm: GOID, Accession, TermType, TermName, Definition

Cluster: ClusterID, NumOfGenes, ExprPattern, ClusteringTool, ToolSetting, Description

ArrayProbe: ArrayProbeID, UID, ArrayID, ProbeName, Description, IsQC

GeneDomain: DomainID, UID, Alignment, SeqFrom, SeqTo, DomainFrom, DomainTo, EValue, BitScore

DomainModel: DomainID, ModelType, SourceDB, Accession, Title, Length, Description

Page 30: Data Warehousing Lifecycle

Star Schema for the Microarray Data Space

mRNAExpression: SampleID, ArrayProbeID, ExperimentID, MeasureUnitID, Expression

Experiment: ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID

ArrayProbe: ArrayProbeID, UID, ArrayID, ProbeName, Description, IsQC

MeasurementUnit: MeasureUnitID, MeasureUnitName, MeasureUnitType, Description

GeneSequence: UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status

ClinicalSample: SampleID, PatientID, Source, Amount, DateTaken

Page 31: Data Warehousing Lifecycle

Star Schema for the Proteomic Data Space

ProteinExpression: SampleID, UID, ExperimentID, MeasureUnitID, Expression

Experiment: ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID

MeasurementUnit: MeasureUnitID, MeasureUnitName, MeasureUnitType, Description

GeneSequence: UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status

ClinicalSample: SampleID, PatientID, Source, Amount, DateTaken

Page 32: Data Warehousing Lifecycle

Star Schema for the Experiment Data Space

Experiment: ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID

Project: ProjectID, ProjectName, Investigator, Description

Protocol: ProtocolID, ProtocolName, ProtocolText, CreatedBy

Publication: PublicationID, PubMedID, Title, Authors, Abstract, PubDate, Citation

Platform: PlatformID, Hardware, Software, Settings, Description

Person: PersonID, PersonName, LabName, Contact

Normalization: NormalizationID, NormType, Software, Parameters, Description

Page 33: Data Warehousing Lifecycle

BioStar is not Fact Constellation

• You may view measure tables as small “fact” tables, but fact tables in a constellation usually share multiple dimension tables.

[Figure: a fact constellation in which three fact tables are linked to eight dimension tables.]

Page 34: Data Warehousing Lifecycle

Extensibility of BioStar

• Add a protein structure information dimension to gene data space.

GeneSequence: UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status

ProteinSequence (measure table): UID, PDBID, …..

ProteinStructure (dimension table): PDBID, …..

Populating the two new tables will not affect other tables.
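The claim can be checked on a toy database. In the sketch below, only the key columns UID and PDBID come from the slide. The reduced GeneSequence column list, the column types, and the values 'ACC_1' and 'PDB_A' are placeholders invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Existing table, reduced to a few of its columns for brevity.
conn.execute("CREATE TABLE GeneSequence (UID INTEGER PRIMARY KEY, SeqType TEXT, Accession TEXT)")
conn.execute("INSERT INTO GeneSequence VALUES (1, 'mRNA', 'ACC_1')")


def snapshot():
    """Return the schema and the rows of GeneSequence for comparison."""
    schema = conn.execute("PRAGMA table_info(GeneSequence)").fetchall()
    rows = conn.execute("SELECT * FROM GeneSequence").fetchall()
    return schema, rows


before = snapshot()

# Extension: one new dimension table and one new measure table.
conn.executescript("""
CREATE TABLE ProteinStructure (PDBID TEXT PRIMARY KEY);
CREATE TABLE ProteinSequence (
    UID   INTEGER REFERENCES GeneSequence(UID),
    PDBID TEXT REFERENCES ProteinStructure(PDBID)
);
""")
conn.execute("INSERT INTO ProteinStructure VALUES ('PDB_A')")
conn.execute("INSERT INTO ProteinSequence VALUES (1, 'PDB_A')")

unchanged = snapshot() == before
linked = conn.execute("""
    SELECT g.Accession, s.PDBID
    FROM GeneSequence AS g
    JOIN ProteinSequence AS s ON s.UID = g.UID
""").fetchall()

print(unchanged)  # True
print(linked)     # [('ACC_1', 'PDB_A')]
```

No ALTER TABLE or reload of GeneSequence is involved; the new information is reachable through a join on UID.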

Page 35: Data Warehousing Lifecycle

Flexibility of BioStar

• Separate tables for fact measures solve the many-to-many relationship problem.

• A dimension table and its associated measure table can be populated independently.

• Null values are avoided.

Page 36: Data Warehousing Lifecycle

Sample Classification Hierarchy

[Tree figure, levels from top to bottom:]

All_sample

Normal, Tumor

Brain, Blood, Colon, Breast

CNS_tumor, Leukemia, Colon tumor, Breast tumor, . . .

Glioblastoma, ALL, AML, Adenocarcinoma, . . .

(Patients)

Page 37: Data Warehousing Lifecycle

OLAP for Microarray Data Exploration

[Figure: a data cube with axes Sample (patient: 1 2 3 4 5 6 7), Gene (D13626, D13627, D13628, J04605, L37042, S78653, X60003, Z11518) and Measurement Unit (PAVal); sample cell values 10, 15, 18, 5, 24, 32, 16.]

Roll-ups: to disease types, to GO terms, to expression.

Dimensions: Sample, Gene, Measurement Unit

Operators: roll-up, drill-down, slice, dice, t-test, p-select

Application: Exploration of gene expression data
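As one possible reading of "roll-up to disease types", the sketch below averages a gene's expression values over the samples of each disease type. The seven values are the ones on the slide. Attaching them to gene D13626, the sample-to-disease mapping, the use of the mean as the aggregate, and the helper name roll_up_mean are assumptions of this example.

```python
from collections import defaultdict

# Expression of one gene across samples 1..7. The values come from the slide;
# pairing them with gene D13626 is an assumption.
expression = {1: 10, 2: 15, 3: 18, 4: 5, 5: 24, 6: 32, 7: 16}

# Hypothetical mapping from sample (patient) to disease type.
sample_to_disease = {1: "Leukemia", 2: "Leukemia", 3: "Leukemia",
                     4: "CNS_tumor", 5: "CNS_tumor",
                     6: "Colon tumor", 7: "Colon tumor"}


def roll_up_mean(values_by_sample, parent_of):
    """Group sample-level values by their parent and average each group."""
    groups = defaultdict(list)
    for sample, value in values_by_sample.items():
        groups[parent_of[sample]].append(value)
    return {parent: sum(vals) / len(vals) for parent, vals in groups.items()}


by_disease = roll_up_mean(expression, sample_to_disease)
print(by_disease["CNS_tumor"])    # 14.5
print(by_disease["Colon tumor"])  # 24.0
```

Statistical operators such as the t-test would then compare the per-group value lists rather than only their means.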

Page 38: Data Warehousing Lifecycle

Biomedical Data Warehouse System Architecture

[Architecture figure with three stages: Data Sources, Data Warehouse, Unified Access.]

Data Sources:

• Clinical data and sample annotations

• Gene functional annotations

• Microarray mRNA expression

• Proteomics protein expression

• Promoter sequences and motifs

• Protein domains & interactome

Data Integration:

• Data extraction, transformation, cleaning & loading

• Metadata capturing & integration

• Data quality control

• Refreshment

Data Mining:

• Ad hoc queries

• OLAP

• Cluster analysis

• Mining gene regulatory networks

• Interactome prediction

• Pathway analysis

Unified Access:

• A standard interface for application tools

• Object-oriented

• Defining basic operators for data access