Data Warehousing Lifecycle


Page 1: Data Warehousing Lifecycle

Data Warehousing Lifecycle

Conceptual modeling: System requirements, data sources and warehousing activities.

Logical design: Data flow from sources to DW, composition and semantics of activities.

DW construction: Schema implementation, data population and warehouse tuning.

Application development: DW interfaces, OLAP and data mining tools.

Page 2: Data Warehousing Lifecycle

Biomedical Data Warehouse System Architecture

Data Sources

• Clinical data and sample annotations

• Gene functional annotations

• Microarray mRNA expression

• Proteomics protein expression

• Promoter sequences and motifs

• Protein domains & interactome

Data Warehouse: Data Integration

• Data extraction, transformation, cleaning & loading

• Metadata capturing & integration

• Data quality control

• Refreshment

Unified Access

• A standard interface for application tools

• Object-oriented

• Defining basic operators for data access

Data Mining

• Ad hoc queries

• OLAP

• Cluster analysis

• Mining gene regulatory networks

• Interactome prediction

• Pathway analysis

Page 3: Data Warehousing Lifecycle

On-Line Analytical Processing (OLAP)

[Figure: a data cube with axes Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread) and Time (day: M T W Th F S S). One row of daily cells holds 10, 15, 18, 5, 24, 32, 16. After roll-up to week, the Time axis becomes (week: W1, 2, 3, 4) and the corresponding cell holds 120.]

Dimensions: Time, Product, Store

Hierarchies: Day → Week → Quarter; Product → Brand → …; Store → Region → Country

Roll-ups shown: roll-up to week, roll-up to brand, roll-up to region

Operators: roll-up, drill-down, slice and dice.

Uses: Business data analysis, e.g., market-driven trend analysis.
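A minimal sketch of the roll-up operator, assuming the cube is held as a plain Python dictionary. The seven daily values are the ones on the slide; the store and product labels attached to them, the `Sa`/`Su` day codes and the `day_to_week` mapping are assumptions made for illustration.

```python
# Daily unit counts for one (store, product) cell, Monday through Sunday.
# The numbers come from the slide; "NY" and "Juice" are assumed labels,
# because the transcript does not say which cell the numbers belong to.
DAYS = ["M", "T", "W", "Th", "F", "Sa", "Su"]
daily_sales = {
    ("NY", "Juice", day): units
    for day, units in zip(DAYS, [10, 15, 18, 5, 24, 32, 16])
}

# Hypothetical mapping from the finer level (day) to the coarser level (week).
day_to_week = {day: "W1" for day in DAYS}


def roll_up_time(cube, mapping):
    """Aggregate the time dimension to a coarser level by summing cells."""
    rolled = {}
    for (store, product, day), units in cube.items():
        key = (store, product, mapping[day])
        rolled[key] = rolled.get(key, 0) + units
    return rolled


weekly_sales = roll_up_time(daily_sales, day_to_week)
print(weekly_sales)  # {('NY', 'Juice', 'W1'): 120}
```

The weekly total of 120 matches the value shown in the rolled-up cube. Drill-down is the inverse navigation: going back from the weekly cell to the seven daily cells.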

Page 4: Data Warehousing Lifecycle

Logical Data Modeling: A Star Schema Example

Sales (fact table): time_key, branch_key, location_key, product_key, num_units, amount_usd

Time: time_key, day, month, year

Product: product_key, name, brand, type

Supplier: supplier_key, name, type

Location: location_key, city, state, country

Branch: branch_key, name, type

[Figure: the dimension tables surround the Sales fact table. The links carry 1 : n cardinality labels, and one place in the figure is marked "???".]

• One-to-many relationships between the fact and dimensions.

• The fact-dimension relationships are certain.

• Dimensions in star models are often tightly coupled.

• Star schema does not appear to be very extensible.
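The star schema above can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions, not the presenter's implementation: the `_dim` table-name suffix, the column types and all inserted rows are made up, and Supplier is left out because the slide does not show how it connects to the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE product_dim  (product_key INTEGER PRIMARY KEY, name TEXT, brand TEXT, type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, name TEXT, type TEXT);
-- Fact table: one foreign key per dimension plus the numeric measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time_dim(time_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    num_units    INTEGER,
    amount_usd   REAL
);
""")

# Made-up rows, only to make the query below return something.
conn.execute("INSERT INTO time_dim VALUES (1, 1, 1, 2000)")
conn.execute("INSERT INTO branch_dim VALUES (1, 'Main', 'retail')")
conn.execute("INSERT INTO location_dim VALUES (1, 'New York', 'NY', 'USA')")
conn.executemany(
    "INSERT INTO product_dim VALUES (?, ?, ?, ?)",
    [(1, "Juice", "BrandA", "drink"),
     (2, "Milk", "BrandA", "drink"),
     (3, "Soap", "BrandB", "household")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, 1, 1, 10, 20.0), (1, 1, 1, 2, 15, 30.0), (1, 1, 1, 3, 5, 7.5)],
)

# Roll-up from product to brand: join the fact to one dimension and group.
units_by_brand = conn.execute("""
    SELECT p.brand, SUM(s.num_units)
    FROM sales AS s JOIN product_dim AS p ON p.product_key = s.product_key
    GROUP BY p.brand
    ORDER BY p.brand
""").fetchall()
print(units_by_brand)  # [('BrandA', 25), ('BrandB', 5)]
```

Each fact row references exactly one member of each dimension, which is the one-to-many pattern the slide describes.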

Page 5: Data Warehousing Lifecycle

Biomedical Data Resources

• Static data: data on genotypes, biological entities such as nucleic acids and proteins, and relationships between these entities.

• Dynamic data: data on phenotypes, the dynamics of biological processes.

• Data on analysis tools: data on biological and computer science methods which can be used to identify the entities and relationships.

• References and annotations: to scientific papers and textual explanations.

Page 6: Data Warehousing Lifecycle

Biomedical Data Modeling

• Flat file collections: databases were built up as indexed ASCII text files.

• Relational databases: many biology databases were implemented using Oracle, Sybase, or MySQL.

• Object-oriented databases: data are modeled as objects that are organized in classes.

• Multidimensional databases: data are organized in star-like schemas.

Page 7: Data Warehousing Lifecycle

Using Star Schema in Gene Expression Data Management

• “Applying Data Warehouse Concepts to Gene Expression Data Management”, by V. Markowitz and T. Topaloglou

• Three modeling data spaces:
  – Sample data space
  – Gene annotation data space
  – Gene expression data space

Page 8: Data Warehousing Lifecycle

Gene Expression Data Space

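Expression (fact table): Gene_id, Experiment_id, Analysis_id, Expression_call

Analysis: Analysis_id, Algorithm, version

Gene: Gene_id, Gene_name, Gene_symbol

Experiment: Experiment_id, Exp_name, Exp_date, Exp_file, Sample

[Figure: Gene, Analysis and Experiment surround the Expression table; a Clinical Sample entity is also shown.]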

Page 9: Data Warehousing Lifecycle

Sample Data Space

Entities: Biological Sample, Pathways, Study, Donor, Donor Demographics, Donor Clinical

Page 10: Data Warehousing Lifecycle

Gene Annotation Data Space

Entities: Gene Fragments, Sequence, Pathways, Sequence Cluster, Known gene, Microarray Design, Chromosome

Page 11: Data Warehousing Lifecycle

OLAP Operations

• Sample selection: extract sets of samples with a certain profile on the sample data space, e.g., a sample set of male colon samples with adenocarcinoma from donors in the age group 40-60.

• Classification on organ: total number of samples classified by liver, brain, …
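The two operations above can be sketched over in-memory records. The record layout, field names and all five samples are hypothetical; only the selection criteria (male, colon, adenocarcinoma, donor age 40-60) come from the slide.

```python
from collections import Counter

# Made-up sample records; the field names are assumptions.
samples = [
    {"id": "S1", "organ": "colon", "sex": "M", "diagnosis": "adenocarcinoma", "donor_age": 52},
    {"id": "S2", "organ": "colon", "sex": "F", "diagnosis": "adenocarcinoma", "donor_age": 47},
    {"id": "S3", "organ": "liver", "sex": "M", "diagnosis": "normal", "donor_age": 61},
    {"id": "S4", "organ": "brain", "sex": "M", "diagnosis": "glioblastoma", "donor_age": 58},
    {"id": "S5", "organ": "colon", "sex": "M", "diagnosis": "adenocarcinoma", "donor_age": 73},
]

# Sample selection: keep the samples that match a profile.
selected = [
    s["id"]
    for s in samples
    if s["organ"] == "colon"
    and s["sex"] == "M"
    and s["diagnosis"] == "adenocarcinoma"
    and 40 <= s["donor_age"] <= 60
]
print(selected)  # ['S1']

# Classification on organ: total number of samples per organ.
samples_per_organ = Counter(s["organ"] for s in samples)
print(samples_per_organ["colon"], samples_per_organ["liver"], samples_per_organ["brain"])  # 3 1 1
```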

Page 12: Data Warehousing Lifecycle

OLAP Operations

• Gene selection: extract sets of genes with certain properties over the gene annotation data space, e.g., a gene set of the genes on chromosome 22 …

• Aggregates: gene summarization on the sample dimension, sample summarization on the gene dimension, etc.
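A small sketch of gene selection and the two aggregates. It reads "gene summarization on the sample dimension" as one summary value per gene computed across samples, and the reverse for samples; that reading, the use of the mean as the summary, and all gene names and values are assumptions.

```python
from statistics import mean

# Hypothetical annotation: gene -> chromosome.
gene_chromosome = {"G1": "22", "G2": "7", "G3": "22"}

# Hypothetical expression values: (gene, sample) -> value.
expression = {
    ("G1", "S1"): 2.0, ("G1", "S2"): 4.0,
    ("G2", "S1"): 1.0, ("G2", "S2"): 3.0,
    ("G3", "S1"): 6.0, ("G3", "S2"): 8.0,
}

# Gene selection over the annotation data space.
chr22_genes = sorted(g for g, c in gene_chromosome.items() if c == "22")
print(chr22_genes)  # ['G1', 'G3']


def summarize(values, axis):
    """Average the cells that share the same key at position `axis` (0 = gene, 1 = sample)."""
    groups = {}
    for key, value in values.items():
        groups.setdefault(key[axis], []).append(value)
    return {name: mean(vals) for name, vals in groups.items()}


per_gene = summarize(expression, axis=0)    # one value per gene, across samples
per_sample = summarize(expression, axis=1)  # one value per sample, across genes
print(per_gene)    # {'G1': 3.0, 'G2': 2.0, 'G3': 7.0}
print(per_sample)  # {'S1': 3.0, 'S2': 5.0}
```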

Page 13: Data Warehousing Lifecycle

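Clinical Data Space

Entities: Patient, Clinical Sample, Medical Image, Followup, Drug, Demographics, Clinical Test, Physiology, Disease

[Figure: entity-relationship diagram; the links between these entities carry 1 : n and n : n cardinality labels.]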

Page 14: Data Warehousing Lifecycle

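Sample Data Space

Entities: Clinical Sample, Protein Expression, mRNA Expression, Anatomy Ontology, Biochemical Assay, Genetic Screening, Patient

[Figure: entity-relationship diagram; the links between these entities carry 1 : n and n : n cardinality labels.]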

Page 15: Data Warehousing Lifecycle

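Microarray Data Space

Entities: mRNA Expression, Experiment, Measurement Unit, Array Probe, Gene Sequence, Clinical Sample

[Figure: entity-relationship diagram; the links between these entities carry 1 : n, n : n and 1 : 1 cardinality labels.]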

Page 16: Data Warehousing Lifecycle

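Proteomic Data Space

Entities: Protein Expression, Experiment, Measurement Unit, Gene Sequence, Clinical Sample

[Figure: entity-relationship diagram; the links between these entities carry n : n and 1 : 1 cardinality labels.]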

Page 17: Data Warehousing Lifecycle

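Experiment Data Space

Entities: Experiment, Project, Publication, Normalization, Protocol, Person, Platform

[Figure: entity-relationship diagram; the links between these entities carry 1 : n, n : n and 1 : 1 cardinality labels.]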

Page 18: Data Warehousing Lifecycle

Gene Data Space

Entities: Gene Sequence, Protein Expression, Promoter, Gene Ontology, Protein Domain, Protein-Protein Interaction, Gene Cluster, mRNA Expression, Array Probe

[Figure: entity-relationship diagram; the links between these entities carry cardinality labels 1, 2 and n.]

Page 19: Data Warehousing Lifecycle

Explicit Definition of Concept Hierarchies

Entities: mRNA Expression, Experiment, Measurement Unit, Array Probe, Gene Sequence, Clinical Sample, Anatomy Ontology, Patient, Disease, Project, Platform, Normalization, Gene Ontology, Gene Cluster

[Figure: entity-relationship diagram; the links between these entities carry 1 : n, n : n and 1 : 1 cardinality labels.]

Page 20: Data Warehousing Lifecycle

Characteristics of Clinical and Genomic Data

Clinical and Genomic Data | Business Data

Complex data structure with many potential dimensions | Easy-to-understand data structure with few dimensions

Often many-to-many relationships between facts and dimensions | Many-to-one relationships between facts and dimensions

Uncertain relationships between fact and dimension objects | Certain relationships between fact and dimension objects

Some measures require advanced temporal support for time validity | Historical data, no advanced temporal support needed

Incomplete and/or imprecise data very common | Few incomplete and/or imprecise data

Page 21: Data Warehousing Lifecycle

Large Number of Dimensions and Evolution of Dimensions

• If a star schema is used and the number of dimensions is large, the fact table will be huge (each row carries a combination of foreign keys, one per dimension).

• Adding a new dimension to a star schema requires recomputing all data entries in the fact table.
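A back-of-the-envelope sketch of the first point: if every combination of dimension members can occur, the number of fact rows is bounded by the product of the dimension sizes. The dimension names and sizes below are hypothetical, and real fact tables are usually sparser than this upper bound.

```python
from math import prod

# Hypothetical number of distinct members per dimension.
dimension_sizes = {"sample": 1_000, "gene": 20_000, "experiment": 50}


def max_fact_rows(sizes):
    """Upper bound on fact rows: one row per combination of dimension members."""
    return prod(sizes.values())


rows_before = max_fact_rows(dimension_sizes)
# Adding one more dimension multiplies the bound by that dimension's size.
rows_after = max_fact_rows({**dimension_sizes, "measurement_unit": 3})
print(rows_before, rows_after)  # 1000000000 3000000000
```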

Page 22: Data Warehousing Lifecycle

Many-to-Many relationships

• Many-to-many relationships cannot be easily modeled using a star schema, which was originally designed to handle many-to-one relationships between a business fact and a dimension.

Page 23: Data Warehousing Lifecycle

Incompleteness of Data

• Clinical data may be incomplete. This may cause a lot of null values in the fact table for foreign keys, which will result in inconsistency.

Page 24: Data Warehousing Lifecycle

Star Schema

Fact: DimKey1, DimKey2, DimKey3, DimKey4, Measure1, Measure2, Measure3, Measure4

Dim1: DimKey1, . . .

Dim2: DimKey2, . . .

Dim3: DimKey3, . . .

Dim4: DimKey4, . . .

BioStar Schema

Fact: FactKey, . . .

MTable1: DimKey1, FactKey, Measure1

MTable2: DimKey2, FactKey, Measure2

MTable3: DimKey3, FactKey, Measure3

MTable4: DimKey4, FactKey, Measure4

Dim1: DimKey1, . . .

Dim2: DimKey2, . . .

Dim3: DimKey3, . . .

Dim4: DimKey4, . . .
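A minimal sketch of the BioStar layout with two of the four dimensions, using Python's built-in `sqlite3` module. Table and key names follow the slide; the `Label` columns, the column types and all rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Fact (FactKey INTEGER PRIMARY KEY, Label TEXT);
CREATE TABLE Dim1 (DimKey1 INTEGER PRIMARY KEY, Label TEXT);
CREATE TABLE Dim2 (DimKey2 INTEGER PRIMARY KEY, Label TEXT);
-- One measure table per dimension: each row links one fact to one
-- dimension member and carries the measure for that pair.
CREATE TABLE MTable1 (
    DimKey1  INTEGER REFERENCES Dim1(DimKey1),
    FactKey  INTEGER REFERENCES Fact(FactKey),
    Measure1 REAL,
    PRIMARY KEY (DimKey1, FactKey)
);
CREATE TABLE MTable2 (
    DimKey2  INTEGER REFERENCES Dim2(DimKey2),
    FactKey  INTEGER REFERENCES Fact(FactKey),
    Measure2 REAL,
    PRIMARY KEY (DimKey2, FactKey)
);
""")

conn.executemany("INSERT INTO Fact VALUES (?, ?)", [(1, "fact one"), (2, "fact two")])
conn.executemany("INSERT INTO Dim1 VALUES (?, ?)", [(10, "a"), (11, "b")])
conn.execute("INSERT INTO Dim2 VALUES (20, 'x')")

# Many-to-many: fact 1 relates to two Dim1 members, and member 10 to two facts.
conn.executemany("INSERT INTO MTable1 VALUES (?, ?, ?)",
                 [(10, 1, 0.5), (11, 1, 0.7), (10, 2, 0.9)])
# Incomplete data: fact 2 has no Dim2 information, so it simply has no row here.
conn.execute("INSERT INTO MTable2 VALUES (20, 1, 3.0)")

dim1_links_of_fact1 = conn.execute(
    "SELECT COUNT(*) FROM MTable1 WHERE FactKey = 1").fetchone()[0]
facts_without_dim2 = [key for (key,) in conn.execute(
    "SELECT FactKey FROM Fact WHERE FactKey NOT IN (SELECT FactKey FROM MTable2)")]
print(dim1_links_of_fact1, facts_without_dim2)  # 2 [2]
```

In the plain star layout, fact 2 would need a null `DimKey2` in its row; here the missing relationship is just an absent row in `MTable2`.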

Page 25: Data Warehousing Lifecycle

BioStar Schema for Part of the Clinical Data Space

Patient: PatientID, SSN, Name, Gender, DOB

DrugUse: DrugID, PatientID, Dosage, ValidFrom, ValidTo

TestResult: TestID, PatientID, Result, DateTested

ClinicalSample: SampleID, PatientID, Source, Amount, DateTaken

Diagnosis: DiseaseID, PatientID, Symptom, ValidFrom, ValidTo

Drug: DrugID, DrugName, DrugType, Description

Disease: DiseaseID, Name, Type, Description

ClinicalTest: TestID, TestName, TestType, TestSetting

Extensibility and flexibility
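The ValidFrom/ValidTo columns in DrugUse and Diagnosis suggest time-validity queries. The sketch below, using Python's built-in `sqlite3` module, asks which drugs a patient was on at a given date. Table and column names follow the slide (SSN is omitted); the ISO date strings, the inclusive interval ends, the helper `drugs_on` and all rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patient (PatientID INTEGER PRIMARY KEY, Name TEXT, Gender TEXT, DOB TEXT);
CREATE TABLE Drug    (DrugID INTEGER PRIMARY KEY, DrugName TEXT, DrugType TEXT, Description TEXT);
CREATE TABLE DrugUse (
    DrugID    INTEGER REFERENCES Drug(DrugID),
    PatientID INTEGER REFERENCES Patient(PatientID),
    Dosage    TEXT,
    ValidFrom TEXT,  -- ISO dates (YYYY-MM-DD) compare correctly as strings
    ValidTo   TEXT
);
""")

# Made-up rows for illustration only.
conn.execute("INSERT INTO Patient VALUES (1, 'Patient One', 'F', '1950-01-01')")
conn.executemany("INSERT INTO Drug VALUES (?, ?, ?, ?)",
                 [(1, "DrugA", "typeA", ""), (2, "DrugB", "typeB", "")])
conn.executemany("INSERT INTO DrugUse VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, "10 mg", "2004-01-01", "2004-03-31"),
                  (2, 1, "5 mg", "2004-03-01", "2004-06-30")])


def drugs_on(patient_id, day):
    """Names of the drugs whose validity interval covers `day` (ends inclusive)."""
    rows = conn.execute(
        """
        SELECT d.DrugName
        FROM DrugUse AS u JOIN Drug AS d ON d.DrugID = u.DrugID
        WHERE u.PatientID = ? AND u.ValidFrom <= ? AND ? <= u.ValidTo
        ORDER BY d.DrugName
        """,
        (patient_id, day, day),
    ).fetchall()
    return [name for (name,) in rows]


print(drugs_on(1, "2004-03-15"))  # ['DrugA', 'DrugB']
print(drugs_on(1, "2004-05-01"))  # ['DrugB']
```

Because DrugUse is a separate table, one patient can have any number of overlapping drug records without changing the Patient row.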

Page 26: Data Warehousing Lifecycle

BioStar Schema for the Sample Data Space

ClinicalSample: SampleID, PatientID, Source, Amount, DateTaken

mRNAExpression: SampleID, ArrayProbeID, ExperimentID, MeasureUnitID, Expression

AssayResult: AssayID, SampleID, Result, Comment, DateTested

AnatomyTerm: TermID, TermType, TermName, Definition

BiochemAssay: AssayID, AssayName, AssayType, AssaySetting, Description

SampleAnatomy: TermID, SampleID, Description

GeneticScreen: MarkerID, SampleID, Result, RawData, Comment, DateTested

GeneticMarker: MarkerID, MarkerName, MarkerType, GeneticLocus, Description

Page 27: Data Warehousing Lifecycle

BioStar Schema for Part of the Gene Data Space

GeneSequence: UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status

GOAnnotation: GOID, UID, Evidence

Promoter: PromoterID, UID, PromoterType, PromoterSeq, Length, Description

ProteinInteract: UID1, UID2, Evidence, Description

GeneCluster: ClusterID, UID

GOTerm: GOID, Accession, TermType, TermName, Definition

Cluster: ClusterID, NumOfGenes, ExprPattern, ClusteringTool, ToolSetting, Description

ArrayProbe: ArrayProbeID, UID, ArrayID, ProbeName, Description, IsQC

GeneDomain: DomainID, UID, Alignment, SeqFrom, SeqTo, DomainFrom, DomainTo, EValue, BitScore

DomainModel: DomainID, ModelType, SourceDB, Accession, Title, Length, Description

Page 28: Data Warehousing Lifecycle

Star Schema for the Microarray Data Space

mRNAExpression: SampleID, ArrayProbeID, ExperimentID, MeasureUnitID, Expression

Experiment: ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID

ArrayProbe: ArrayProbeID, UID, ArrayID, ProbeName, Description, IsQC

MeasurementUnit: MeasureUnitID, MeasureUnitName, MeasureUnitType, Description

GeneSequence: UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status

ClinicalSample: SampleID, PatientID, Source, Amount, DateTaken

Page 29: Data Warehousing Lifecycle

Star Schema for the Proteomic Data Space

ProteinExpression: SampleID, UID, ExperimentID, MeasureUnitID, Expression

Experiment: ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID

MeasurementUnit: MeasureUnitID, MeasureUnitName, MeasureUnitType, Description

GeneSequence: UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status

ClinicalSample: SampleID, PatientID, Source, Amount, DateTaken

Page 30: Data Warehousing Lifecycle

Star Schema for the Experiment Data Space

Experiment: ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID

Project: ProjectID, ProjectName, Investigator, Description

Protocol: ProtocolID, ProtocolName, ProtocolText, CreatedBy

Publication: PublicationID, PubMedID, Title, Authors, Abstract, PubDate, Citation

Platform: PlatformID, Hardware, Software, Settings, Description

Person: PersonID, PersonName, LabName, Contact

Normalization: NormalizationID, NormType, Software, Parameters, Description

Page 31: Data Warehousing Lifecycle

BioStar is not a Fact Constellation

• You may view measure tables as small "fact" tables, but fact tables in a constellation usually share multiple dimension tables.

[Figure: a fact constellation with three fact tables and eight dimension tables; dimension tables are shared among the fact tables.]

Page 32: Data Warehousing Lifecycle

Extensibility of BioStar

• Add a protein structure information dimension to gene data space.

GeneSequence: UID, SeqType, Accession, Version, SeqDataset, SpeciesID, Status

ProteinSequence (measure table): UID, PDBID, …..

ProteinStructure (dimension table): PDBID, …..

Populating the two new tables will not affect other tables.
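A sketch of this extension step with Python's built-in `sqlite3` module. Only the key columns named on the slide are created, since the remaining attributes are elided there; the column types, the shortened GeneSequence table and all rows (including the placeholder identifiers) are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Existing table, shortened to two columns and filled with made-up rows.
conn.execute("CREATE TABLE GeneSequence (UID INTEGER PRIMARY KEY, Accession TEXT)")
conn.executemany("INSERT INTO GeneSequence VALUES (?, ?)",
                 [(1, "ACC_1"), (2, "ACC_2")])
conn.commit()
genes_before = conn.execute("SELECT * FROM GeneSequence ORDER BY UID").fetchall()

# Extension: one new dimension table and one new measure table.
conn.executescript("""
CREATE TABLE ProteinStructure (PDBID TEXT PRIMARY KEY);
CREATE TABLE ProteinSequence (
    UID   INTEGER REFERENCES GeneSequence(UID),
    PDBID TEXT REFERENCES ProteinStructure(PDBID),
    PRIMARY KEY (UID, PDBID)
);
INSERT INTO ProteinStructure VALUES ('PDB_A'), ('PDB_B');
INSERT INTO ProteinSequence VALUES (1, 'PDB_A'), (1, 'PDB_B');
""")

genes_after = conn.execute("SELECT * FROM GeneSequence ORDER BY UID").fetchall()
structures_for_gene_1 = [pdb for (pdb,) in conn.execute(
    "SELECT PDBID FROM ProteinSequence WHERE UID = 1 ORDER BY PDBID")]
print(genes_before == genes_after, structures_for_gene_1)  # True ['PDB_A', 'PDB_B']
```

No `ALTER TABLE` or update touches GeneSequence; gene 2 has no structure yet and simply has no row in ProteinSequence.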

Page 33: Data Warehousing Lifecycle

Flexibility of BioStar

• Separate tables for fact measures solve the many-to-many relationship problem.

• A dimension table and its associated measure table can be populated independently.

• Null values are avoided.

Page 34: Data Warehousing Lifecycle

Sample Classification Hierarchy

[Figure: concept hierarchy tree; node labels by level, from the root down.]

All_sample

Normal, Tumor

Brain, Blood, Colon, Breast

CNS_tumor, Leukemia, Colon tumor, Breast tumor, . . .

Glioblastoma, ALL, AML, Adenocarcinoma, . . .

(Patients)
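A sketch of rolling patient counts up such a hierarchy. The node names come from the slide, but the child-to-parent links are an assumption (the transcript does not preserve the tree edges), and the patient counts are made up.

```python
from collections import Counter

# Assumed child -> parent links between the slide's node labels.
parent = {
    "Glioblastoma": "CNS_tumor", "CNS_tumor": "Brain",
    "ALL": "Leukemia", "AML": "Leukemia", "Leukemia": "Blood",
    "Adenocarcinoma": "Colon_tumor", "Colon_tumor": "Colon",
    "Brain": "Tumor", "Blood": "Tumor", "Colon": "Tumor",
    "Tumor": "All_sample",
}


def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def roll_up(counts, level_nodes):
    """Sum leaf-level counts into the node of `level_nodes` found on each leaf's path."""
    rolled = Counter()
    for leaf, n in counts.items():
        for node in ancestors(leaf):
            if node in level_nodes:
                rolled[node] += n
                break
    return dict(rolled)


# Made-up number of patients per leaf class.
patients_per_class = {"Glioblastoma": 4, "ALL": 3, "AML": 5, "Adenocarcinoma": 6}

by_organ = roll_up(patients_per_class, {"Brain", "Blood", "Colon"})
by_status = roll_up(patients_per_class, {"Normal", "Tumor"})
print(by_organ)   # {'Brain': 4, 'Blood': 8, 'Colon': 6}
print(by_status)  # {'Tumor': 18}
```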

Page 35: Data Warehousing Lifecycle

OLAP for Microarray Data Exploration

[Figure: a data cube with axes Sample (patient: 1 2 3 4 5 6 7), Gene (D13626, D13627, D13628, J04605, L37042, S78653, X60003, Z11518) and Measurement Unit (PA, Val). One row of cells holds 10, 15, 18, 5, 24, 32, 16. Arrows: roll-up to disease types, roll-up to GO terms, roll-up to expression.]

Dimensions: Sample, Gene, Measurement Unit

Operators: roll-up, drill-down, slice, dice, t-test, p-select

Application: Exploration of gene expression data
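The slide lists t-test among the operators without defining it. The sketch below assumes a two-group comparison per gene using Welch's t statistic, and it selects genes by a threshold on |t| as a stand-in for selection by p-value. The gene names, group labels, values and the threshold are all made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent groups of values."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical expression values per gene, split into two sample groups.
expression_by_group = {
    "geneA": {"tumor": [10, 12, 14], "normal": [4, 6, 8]},
    "geneB": {"tumor": [5, 6, 7], "normal": [5, 7, 6]},
}

t_by_gene = {
    gene: welch_t(groups["tumor"], groups["normal"])
    for gene, groups in expression_by_group.items()
}

# Keep the genes whose statistic exceeds an arbitrary illustrative threshold.
THRESHOLD = 2.0
selected_genes = sorted(g for g, t in t_by_gene.items() if abs(t) > THRESHOLD)
print(selected_genes)  # ['geneA']
```

Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; a statistics package would be the usual choice for that step.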