
Turun kauppakorkeakoulu • Turku School of Economics

The Federative Approach to Data Governance and Management
in an Open Information Systems Environment

A Case Study on Data Governance and Management of Clinical

Breast Cancer Treatment Data

Master’s Thesis in Information Systems Science

Prepared by

Daniel Meriläinen (BBA)

90648

Supervisor

Tomi Dahlberg (Ph.D.)

September 30, 2016

Turku, Finland


TABLE OF CONTENTS

1 INTRODUCTION ................................................................................................... 9

1.1 Breast Cancer ............................................................................................... 10

1.1.1 Incidence .......................................................................................... 10

1.1.2 Prediction ......................................................................................... 11

1.1.3 Personal Identity Code ..................................................................... 12

1.1.4 Diagnosis Code of Cancer ............................................................... 12

1.1.5 TNM Staging System ....................................................................... 13

1.1.6 Dates of Events ................................................................................ 13

1.1.7 Cancer Survival Analysis ................................................................. 14

1.2 Data Management Systems Related to Cancer Treatment ........................... 14

1.3 Realization .................................................................................................... 15

1.4 Limitations ................................................................................................... 15

1.5 Research Questions ...................................................................................... 15

1.6 Structure of the Master’s Thesis .................................................................. 16

2 THEORETICAL BACKGROUND ..................................................................... 17

2.1 Federative Approach .................................................................................... 17

2.1.1 Benefits ............................................................................................ 20

2.1.2 Limitations ....................................................................................... 21

2.2 Golden Record Approach ............................................................................. 22

2.3 Comparison of Ontological Approaches ...................................................... 26

2.3.1 Advantages ....................................................................................... 27

2.3.2 Disadvantages .................................................................................. 28

2.4 Ontological Approach in the Case Study ..................................................... 29

3 LITERATURE REVIEW ...................................................................................... 30

3.1 Ontology ....................................................................................................... 30

3.2 Semantics ..................................................................................................... 33

3.3 Data Models ................................................................................................. 34

3.3.1 Conceptual Model ............................................................................ 35

3.3.2 Contextual Model ............................................................................ 36

3.4 Data Classification ....................................................................................... 37

3.4.1 Master Data ...................................................................................... 38

3.4.2 Metadata ........................................................................................... 42

3.5 Data Quality ................................................................................................. 46

3.6 Data Consolidation ....................................................................................... 49

3.6.1 Sharing ............................................................................................. 50


3.6.2 Mapping ........................................................................................... 50

3.6.3 Matching .......................................................................................... 50

3.6.4 Data Federation ................................................................................ 52

3.6.5 Data Integration ............................................................................... 54

3.6.6 Data Warehouse, Storage and Repository ....................................... 57

3.7 Data Management Framework ..................................................................... 58

3.7.1 Data and Corporate Governance ...................................................... 59

3.7.2 Data Management ............................................................................ 60

3.8 Discovery of Data from Large Data Sets ..................................................... 61

3.8.1 Data Mining ..................................................................................... 62

3.8.2 Big Data ........................................................................................... 65

3.8.3 Business Intelligence Systems ......................................................... 66

3.9 Cancer Data .................................................................................................. 67

3.10 Healthcare..................................................................................................... 69

4 METHODOLOGY ................................................................................................ 71

4.1 Case Study .................................................................................................... 71

4.1.1 Data Ontology .................................................................................. 74

4.1.2 Epistemology ................................................................................... 74

4.1.3 Paradigm ......................................................................... 74

4.1.4 Methods............................................................................................ 75

4.1.5 Rhetoric ............................................................................................ 75

4.1.6 Triangulation .................................................................................... 75

4.2 Research Participation .................................................................................. 76

4.3 Artifact ......................................................................................................... 76

5 RESULTS .............................................................................................................. 78

5.1 Matrices of the Artifact ................................................................................ 79

5.2 Pattern to Implement the Artifact in the Case Study .................................... 81

6 DISCUSSION ........................................................................................................ 83

6.1 Contribution ................................................................................................. 83

6.2 Limitations ................................................................................................... 83

6.3 Future Research Questions ........................................................................... 84

7 REFERENCES ...................................................................................................... 85


List of Figures

Figure 1 Semantic View of the Federative Approach in Practice .............................. 20

Figure 2 Semantic View of the Golden Record Approach in Practice (1) ................. 24

Figure 3 Semantic View of the Golden Record Approach in Practice (2) ................. 25

Figure 4 Semantic View of the Golden Record Approach in Practice (3) ................. 26

Figure 5 Scope of MDM [25, pp. 46] ........................................................................ 39

Figure 6 MDM Registry Federation [21, pp. 28] ....................................................... 41

Figure 7 Generating the Golden Record [25, pp. 99]................................................. 57

Figure 8 Basic Idea of MBR [45, pp. 333]................................................................. 64


List of Tables

Table 1 Design Artifact of the Case Study................................................................. 77

Table 2 Data Federation Artifact - Identification of Shared Attributes ..................... 79

Table 3 Definition of Contextual Metadata Characteristics ....................................... 80


List of Abbreviations

ANSI American National Standards Institute

API Application Programming Interface

B.C. Before Christ

BI&A Business Intelligence and Analytics

CDI Customer Data Integration

CMM Capability Maturity Model

CRF Case Report Form

CRM Customer Relationship Management

CS Computer Science

CT Clinical Trial

CUH Central University Hospital

DAMA Data Management Association

DBA Database Administrator

DC Diagnosis Code

DC Dublin Core

DFD Data Flow Diagram

DLS Digital Library System

DQ Data Quality

DREPT Design-Relevant Explanatory/Predictive Theory

DSRIS Design Science Research Information Systems

DW Data Warehouse

EAD Encoded Archival Description

EAI Enterprise Application Integration

ED&P Early Detection and Prevention

EMPI Enterprise Master Patient Index

ER Model Entity and Relationship Model

ER Estrogen Receptor

ERD Entity Relationship Diagramming

ERP Enterprise Resource Planning

ETL Extract, Transform, Load

FDBS Federated Database System

HETU Personal Identity Code

HER2 Human Epidermal Growth Factor Receptor 2

ICD-O-3 International Classification of Diseases for Oncology

ID Identifier

IoT Internet of Things

IS Information System


ISAD Information Systems Analysis and Design

ISD Information Systems Development

ISDT Information Systems Design Theory

IT Information Technology

JPL Jet Propulsion Laboratory

KDD Knowledge Discovery in Databases

MB Megabyte

MBR Memory-Based Reasoning

MDM Master Data Management

NASA National Aeronautics and Space Administration

NBER National Bureau of Economic Research

NIH National Institutes of Health

OAI-PMH Open Archives Initiative Protocol for Metadata Harvesting

ODBC Open Database Connectivity

OLAP Online Analytical Processing

PLM Product Lifecycle Management

POP Persistent Organic Pollutant

ROI Return on Investment

RSR Relative Survival Rate

SBCDS Southampton Breast Cancer Data System

SCM Supply Chain Management

SQL Structured Query Language

TAMUS Texas A&M University System

TB Terabyte

TDWI The Data Warehousing Institute

TEKES Finnish Funding Agency for Innovation

TIP TAMUS Information Portal

TNM Tumor Node Metastasis

UML Unified Modeling Language

VSSHP Varsinais-Suomen Sairaanhoitopiiri

XML Extensible Markup Language



1 INTRODUCTION

This Master’s thesis is based on a case study that encompasses data governance and management and a solution for unstructured data in open, distributed IS environments. The goal of the study is to develop and test a framework for the federative approach to data governance and management; the framework and approach have previously been tested with other cases. A further goal is to discover the benefits and limitations of the framework in the case environment. The contextual metadata framework is a prerequisite for data federation that neither changes nor transfers the data or its original source.

The ongoing research started in January 2016 and is carried out in co-operation with the data/information specialists and healthcare professionals of the Central University Hospital (CUH) in Turku, Finland. The Master’s thesis is part of the overall research.

The case study builds on and extends the Master Data Management Best Practices results of a project called Management of IT in Mergers and Acquisitions, funded by Tekes (the Finnish Funding Agency for Innovation). The current research addresses the prediction of malignant breast cancers and the improvement of the survival rate of diagnosed malignant breast cancers by means of data analysis.

The case study is worth contemplating against the background of open systems environments and the ongoing digitalization and explosion of data. With Big Data and the Internet of Things (IoT), the importance of data governance and management is increasing rapidly.

The annual volume growth of digital data is estimated at approximately 60 %. By the end of 2011, the amount of digital data created during that year had grown to 1.8 ZB (10²¹ bytes), and digital data accounted for 99 % of all data created [17]. The estimate for the amount of digital data produced in 2015 is 12 ZB, with digital data accounting for 99.84 % of all data created. As a result, in 2015 mankind produced as much data as it did from the year 10000 B.C. to the end of 2003 [17].
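
As a quick plausibility check of these figures, the 2015 estimate follows from the 2011 volume and the quoted growth rate; the short calculation below is illustrative and not taken from the cited source [17].

    # Rough check of the digital data growth figures quoted above:
    # start from 1.8 ZB created in 2011 and apply the estimated 60 %
    # annual growth rate for four years (2012-2015).
    volume_2011_zb = 1.8
    annual_growth = 0.60

    volume_2015_zb = volume_2011_zb * (1 + annual_growth) ** 4
    print(f"Estimated volume created in 2015: {volume_2015_zb:.1f} ZB")
    # -> about 11.8 ZB, in line with the roughly 12 ZB estimate cited above.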

A great number of organizations neither know nor understand how to handle or govern data. Currently, data governance seems to take place unilaterally at the operative and executive levels, whereas governance should extend from the bottom to the top of the organization’s hierarchy.

The empirical case is based on a CUH research project that aims to improve the survival rate of widely spread breast cancer, the type that accounts for the majority of deaths from this disease.

The ongoing research project has two goals related to data governance, management and analysis. The primary goal is to improve the ability to detect and identify widely spread cancers (by applying the TNM classification) from the data available in the early phase of a cancer. The secondary goal is to enhance the precision of predicting the survival rate of patients with malignant breast cancer after various cures and treatments. The aim is to identify the treatments that are, either in isolation or in combination, reliable predictors of enhanced survival probability.

The dilemma of data governance and management at the hospital stems from fragmented, unstructured and distributed data, which are stored in several distinct locations. Another serious problem is that data governance seems to be nobody’s responsibility.

The Master’s thesis, for its part, is intended to solve the problems caused by the inconsistent, semi-structured, unstructured and fragmented nature of connected data storages within open, distributed IS environments. The solution is based on the framework of the federative approach to data governance and management. The approach can also be called the interpreted interoperable attribute philosophy or attribute-driven MDM.

1.1 Breast Cancer

Breast cancer is a malignant tumor and the most common type of cancer in women worldwide. It is characterized by genetic and histopathological heterogeneity, and the reasons for its emergence are unknown [77].

The most common symptom is a lump in the breast. Forms of treatment, such as surgery, radiotherapy and drug treatments (chemotherapy, hormone therapy and antibody therapy), are widely used [23].

1.1.1 Incidence

Every year over 4,100 new cases of breast cancer are detected in Finland. About one half of the cases are found in patients over 60 years of age. Breastfeeding seems to decrease a woman's risk of contracting breast cancer [23]. Breast cancer has been the most common cancer among Finnish women since the 1960s, and its incidence begins to increase after the age of 40.

Breast cancer screening is based on mammography, i.e. breast X-ray examination, in which the breast is X-rayed from one or more directions. If the mammography finding is abnormal, the woman is called in for complementary tests. These complementary tests, in addition to further mammography images, include ultrasound examinations and/or needle biopsy samples. If the follow-up tests fail to rule out the possibility of cancer, the examinations are continued in hospital and a biopsy is taken to determine the nature of the tumor [23]. According to Rahmati et al., mammography allows for the detection of intangible tumors and increases the survival rate. Digital mammography uses X-rays to project structures in the 3D female breast onto a 2D image [58].

At the beginning of 1987, national screening for breast cancer was initiated in Finland. The screenings are attended by nearly 90 % of the so-called baby-boom generation. Under 3 % of the screened women were called in for further examinations, and breast cancer was found in about 300 cases [23].

1.1.2 Prediction

Over 80 % of breast cancer patients are alive five years after the cancer was detected. The younger the patient, the more likely it is that the cancer recurs [23].

Cancer treatments have made great progress in recent decades. Today, a large proportion of breast cancer patients live a normal life after cancer detection and treatment and eventually die from causes other than cancer [23].

The relative survival rate is used as an indirect measure of recovery from cancer. It indicates how many of the cancer patients are alive a certain period of time after the detection of the cancer, in proportion to a population of the same age alive at that time. If the annual relative survival rate is less than 100 %, the cancer exposes patients to additional mortality. The limit for statistical cure is usually taken to be five years, although in some cancers slightly increased mortality still occurs after that limit, and in some cases statistical cure is reached earlier [32].

The higher overall survival rate of cancer in women is largely due to the fact that the most common type of cancer in women is breast cancer, whose prognosis is significantly better than, for example, that of lung cancer. In 2007-2009, the five-year survival rate of breast cancer patients was 89 % [32].

The survival of cancer patients varies considerably, depending on how far the cancer has spread when it is detected. The prognosis is better when the cancer found is still local, as it is then often possible to remove the entire tumor by surgery [32].

Breast cancer is an example of a disease in which patients with widespread disease live a considerable period of time thanks to effective treatments. In the 2000s, the five-year survival of patients with local breast cancer was 98 %. If the disease had spread only to the axillary lymph nodes, the figure was 88 %, and even when the disease had been confirmed to have spread further, the patients' five-year survival figure was 42 % [32].

In the following subchapters I describe the data attributes that are used to make breast cancer data interoperable. These attributes can be used to establish links between various data storages and to make the data attributes of those storages visible to end users.


1.1.3 Personal Identity Code

The personal identity code (HETU) is used as the key identification number in the patient management system at the CUH in Turku, Finland. The code is mandatory in the system, and it is updated in and transferred to other ISs.

The personal identity code makes identification more specific and secure than a name alone. Many people may have an identical name, but never the same personal identity code. The code follows a person from birth to death.

The code is issued by the Population Register Centre [56] and identifies each individual in Finland. It is used whenever a person deals with the authorities, verifies their identity or handles official documents.

The personal identity code is a key attribute in this case study.
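
To illustrate how the personal identity code can act as the shared linking key, the sketch below validates the publicly documented HETU format and check character and then uses the code to join two registers. This is a minimal sketch in Python: the register contents and field names are hypothetical, the century markers are limited to the classic set (+, -, A), and it is not the hospital's actual implementation.

    import re

    # Check characters of the Finnish personal identity code: the nine digits
    # (ddmmyy + individual number) modulo 31 index this character set.
    CHECK_CHARS = "0123456789ABCDEFHJKLMNPRSTUVWXY"
    HETU_PATTERN = re.compile(r"^(\d{6})([+\-A])(\d{3})([0-9A-Y])$")

    def is_valid_hetu(hetu: str) -> bool:
        """Validate the format and check character of a personal identity code."""
        match = HETU_PATTERN.match(hetu.strip().upper())
        if not match:
            return False
        date_part, _century, individual, check = match.groups()
        return CHECK_CHARS[int(date_part + individual) % 31] == check

    # A valid code can then serve as the shared key when federating registers
    # (both registers below are hypothetical, with a synthetic identity code).
    patients = {"010190-123M": {"name": "Test Patient"}}
    treatments = [("010190-123M", "2015-04-02", "surgery")]

    for hetu, event_date, treatment in treatments:
        if is_valid_hetu(hetu) and hetu in patients:
            print(patients[hetu]["name"], event_date, treatment)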

1.1.4 Diagnosis Code of Cancer

Cancer is a disease characterized by the abnormal reproduction of cells, which invade and destroy adjacent tissues and can even spread to other parts of the body through a process called metastasis [4]. Currently, breast radiography (mammography) is the most frequently used tool for detecting this type of cancer at an early stage [4].

Mammography makes it possible to identify abnormalities in their initial development stage, which is a determining factor for the success of treatment. Mammography allows the detection of intangible tumors and increases the survival rate [67]. Digital mammography uses X-rays to project structures in the 3D female breast onto a 2D image. Information management for digital mammography generates a large number of files and enormous amounts of imaging data, all of which must be stored, transmitted and displayed. Governing and managing such a huge number of files is a challenge for data management. A study of U.S. imaging centers conducted by Fajardo indicates that a typical breast study stored using 4:1 lossless compression requires 8 to 15 MB (10⁶ bytes) of storage for a system rendering 100-micron resolution, 16 to 38 MB for a 70-micron system, and 45 to 60 MB for a 50-micron pixel detector imaging system [19].
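
These per-study figures can be put into perspective with a back-of-the-envelope estimate; the annual number of studies below is an assumed value for illustration only, not a figure reported by Fajardo.

    # Illustrative storage estimate based on the per-study figures cited above.
    studies_per_year = 10_000                      # assumed volume, for illustration
    mb_per_study_low, mb_per_study_high = 8, 15    # 100-micron system, 4:1 lossless

    low_gb = studies_per_year * mb_per_study_low / 1_000
    high_gb = studies_per_year * mb_per_study_high / 1_000
    print(f"Approx. {low_gb:.0f}-{high_gb:.0f} GB of image data per year")
    # -> roughly 80-150 GB per year for the 100-micron case alone.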

The diagnosis code (DC) is one of the attributes used in this case study. Here we are only interested in persons whose diagnosis code indicates breast cancer.


1.1.5 TNM Staging System

According to Edge et al., the extent or stage of cancer at the time of diagnosis is a key factor that defines prognosis [26]. It is a critical element in determining appropriate treatment, based on the experience and outcomes of groups of prior patients with similar cancer stages. Among the several cancer staging systems used worldwide, the tumor node metastasis (TNM) system is the most clinically useful [26].

The TNM system classifies cancers by the size and extent of the primary tumor (T), the involvement of regional lymph nodes (N), and the presence or absence of distant metastases (M) [26]. The system has been supplemented in recent years with carefully selected non-anatomic prognostic factors, and it includes a TNM staging algorithm for cancers of virtually every anatomic site and histology [26].

The cancer stage is determined from information on the tumor (T), the regional nodes (N) and metastases (M), and by grouping cases with a similar prognosis [26]. The criteria for defining the anatomic extent of the disease are specific to tumors at different anatomic sites and of different histologic types; in breast cancer, unlike some other types of cancer, the size of the tumor is a key factor. Thus the criteria for T, N and M are defined separately for each tumor site and histologic type [26].

Although T, N and M are of some value in determining a patient's future outcome, there are multiple factors relating to both prognosis and prediction [26]. Factors such as estrogen and progesterone receptor content or HER2 (Human Epidermal Growth Factor Receptor 2) status are predictive rather than prognostic [26].

TNM is one of the linking attributes in this case study, and it has to be deduced from various data storages. Certain values of TNM indicate malignant breast cancer. Access therefore needs to be provided to the attributes of the federated storages for those persons who have breast cancer and particular values of the TNM code, in order to detect breast cancer and to analyze the outcomes of treatment.
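
As a minimal illustration of TNM as a linking and filtering attribute, the sketch below parses TNM strings and flags records whose M component is M1, i.e. distant metastasis. The record layout, identity codes and diagnosis codes are hypothetical examples, and the parser covers only plain T/N/M values without subcategories.

    import re
    from typing import Optional

    TNM_PATTERN = re.compile(
        r"T(?P<t>[0-4]|is|X)\s*N(?P<n>[0-3]|X)\s*M(?P<m>[01]|X)", re.IGNORECASE
    )

    def parse_tnm(code: str) -> Optional[dict]:
        """Split a TNM string such as 'T2 N1 M0' into its T, N and M components."""
        match = TNM_PATTERN.search(code)
        return match.groupdict() if match else None

    def is_widely_spread(code: str) -> bool:
        """M1 denotes distant metastasis, i.e. widely spread disease."""
        parsed = parse_tnm(code)
        return bool(parsed) and parsed["m"] == "1"

    # Hypothetical records keyed by the shared attributes discussed in this chapter.
    records = [
        {"hetu": "010190-123M", "diagnosis": "C50", "tnm": "T2 N1 M0"},
        {"hetu": "020285-456K", "diagnosis": "C50", "tnm": "T4 N2 M1"},
    ]
    spread = [r["hetu"] for r in records if is_widely_spread(r["tnm"])]
    print("Widely spread breast cancer cases:", spread)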

1.1.6 Dates of Events

The date of an event gives information about the cancer treatments and consultations that have taken place. The dates of treatment events such as cancer surgery, radiation therapy, chemotherapy, hormone therapy and biological treatments (for example antibody therapy and interferon) are saved into the corresponding treatment IS at the CUH in Turku.

The date of event is one of the attributes used in this case study. Due to the nature of malignant breast cancer, the dates of the events need to be close to one another.
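
A minimal sketch of the kind of temporal check implied here: a patient's treatment events are grouped and their dates tested against a time window. The window length and the event data are illustrative assumptions.

    from datetime import date

    # Hypothetical treatment events for one patient, drawn from several ISs.
    events = [
        ("surgery", date(2015, 4, 2)),
        ("radiation therapy", date(2015, 5, 11)),
        ("chemotherapy", date(2015, 6, 1)),
    ]

    def events_within_window(events, max_days: int = 120) -> bool:
        """Return True if all event dates fall within max_days of one another."""
        dates = [d for _, d in events]
        return (max(dates) - min(dates)).days <= max_days

    print(events_within_window(events))  # True for this illustrative data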


1.1.7 Cancer Survival Analysis

Analysis of cancer survival data and related outcomes is needed when assessing cancer treatment programs and monitoring the progress of regional and national cancer control programs [26]. In order to use data retrieved from the databases of cancer registries properly for outcomes analyses, it is necessary to understand the correct application of appropriate quantitative tools. The limitations that the source of the data imposes on the analyses must be taken into account as well [26].

A survival rate is a statistical index that summarizes the probable frequency of specific outcomes for a group of patients at a particular point in time. In contrast, a survival curve is a summary display of the pattern of survival rates over time [26].
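
The survival-rate index can be illustrated with a small worked example. The sketch below computes the observed proportion of a cohort alive five years after diagnosis; the cohort is illustrative, and the calculation deliberately omits censoring and the age-matched adjustment required for a relative survival rate.

    from datetime import date

    # Illustrative cohort: (date of diagnosis, date of death or None if still alive).
    cohort = [
        (date(2007, 3, 1), None),
        (date(2007, 6, 15), date(2010, 1, 20)),
        (date(2008, 2, 10), None),
        (date(2008, 9, 5), date(2015, 4, 2)),
        (date(2009, 1, 8), None),
    ]

    def observed_survival_rate(cohort, years: int = 5) -> float:
        """Proportion of patients alive `years` after diagnosis (no censoring handling)."""
        survived = 0
        for diagnosed, died in cohort:
            cutoff = diagnosed.replace(year=diagnosed.year + years)
            if died is None or died >= cutoff:
                survived += 1
        return survived / len(cohort)

    print(f"Observed 5-year survival rate: {observed_survival_rate(cohort):.0%}")  # 80%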

1.2 Data Management Systems Related to Cancer Treatment

Information systems include master data. This refers to data that are long-lived, slow to change and shared by multiple systems without being transactional data. For example, customer/patient information is often stored for years and changes only occasionally. Typical examples of other master data include product/diagnosis code information, organization information, employee information, and various other lists of codes.

At the CUH, master data consist of the core data on patients, diagnoses and treatments, which remain relatively constant. The integrity and unambiguity of these master data are extremely important. In addition, master data are used in the hospital in several different applications and information systems, so master data attributes are common to different systems. Master data refer to independent information that is connected to event data in separate registers. Currently there are about 16 distinct information systems at the CUH in Turku, Finland (2016).

The detection and treatment of breast cancer generate a huge number of different data elements, such as magnetic resonance and X-ray images, laboratory test and pathology analysis results, medical reports on cytostatic, medication and/or surgical treatments, as well as numerous metrics, referrals, prescriptions, analyses, diagnoses, reports and other data [13]. The relevant data may span several years and may even include genetic data about close relatives and/or data about the patient's lifestyle and social environment. Clerical personnel, nurses and doctors participating in the various cancer detection and treatment tasks create, use, modify and store data about treatment events in dozens of information systems (ISs), IS modules and related data storages [13]. As a result, the IS technical, information handling and socio-contextual characteristics of the federated data differ. The entity and attribute definitions, formats, hierarchies and granularities of the data storages are different. Data may be structured, unstructured or multi-structured, and may be represented in numeric, alphanumeric, audio or video formats. Data creation, use, storing and purging procedures, as well as data volumes and velocity, vary. The sources of data range from ISs to sensors and from internal to external data storages (e.g. code registers). Furthermore, data are used for different purposes in various use contexts at a given time and over time, and may thus have several valid contextual meanings [13].

1.3 Realization

The framework is based on the idea of data attributes that are common to several systems and whose significance in the source systems (ISs) is known. The framework was tentatively tested, in collaboration with the information management experts, pathologists and physicians of the Hospital District of Southwest Finland (VSSHP) and the CUH, on clinical breast cancer management data derived from healthcare data. The long-term goal is to assess, at an early stage, the survival probability of patients suffering from disseminated breast cancer and to identify the factors affecting it.

The research was conducted as a qualitative case study. The research data were collected by means of workshop activities with the corresponding professionals at the CUH in Turku, Finland.

1.4 Limitations

The research is limited to the governance and management of breast cancer data. The case study does not take a stance either on other types of cancer or on treatments. The core subject is the clinical data and how they can be brought into use for research purposes without time-consuming data transfers, ETL (Extract, Transform, Load) batch processing or the golden record.

1.5 Research Questions

The research focuses on the functionality of the framework related to data governance and management of breast cancer data.

The main research question is: How does the theoretical framework of data federation work in practice, when compared with the golden record?

The first sub-question: What are the benefits of the federative approach?


The second sub-question: What are the limitations?

Keywords: attributes, master data, metadata, data federation, breast cancer, golden record, data governance, data management.

1.6 Structure of the Master’s Thesis

The Master’s thesis is organized as follows. Firstly, data ontology, data governance and data management are examined by comparing the contextual and canonical stances on data ontology. Secondly, the literature related to the issue is reviewed. Thirdly, methodological issues are explicated. Fourthly, the design artifact of federation is introduced, and the procedures for using it in the federation of breast cancer data, as well as the other findings of the study, are discussed. Finally, the conclusions of the Master’s thesis are presented.


2 THEORETICAL BACKGROUND

The theoretical framework is based mainly on seven articles by Dahlberg et al. The first article is Master Data Management Best Practices Benchmarking Study [11]; the second article is Framework and Research Agenda for Master Management in Distributed Environments [12]; the third article is Data Federation by Using a Governance of Data Framework Artifact as a Tool [13]; the fourth article is A Framework for the Corporate Governance of Data – Theoretical Background and Empirical Evidence [14]; the fifth article is Managing Datification – Data Federation in Open Systems Environments [15]; the sixth article is The MDM Golden Record is Dead, Rest in Peace – Welcome Interpreted Interoperable Attributes [16]; and finally, the seventh article is Research on: Governance of Data in the Contexts of Corporate Governance and Governance of IT and Data Federation in the Context of Master and Big Data [17].

The theoretical treatment of the case study proceeds by comparing the federative approach with the golden record concept, that is, by comparing the golden record (the canonical approach) with contextual data ontology (the federative approach). The canonical approach is prevalent in computer science (natural sciences) and in entity relationship diagramming (ERD). It advocates a single version of the truth, meaning that a data entity has one true value for each of its attributes.

The contextual approach is widely used in information systems science (social sciences). In contrast, it maintains that data have no single true value; the value depends on the use context and is linked to time.

2.1 Federative Approach

Data federation refers to the activities that facilitate the simultaneous use of data from storages that have different IS technical, informational and socio-contextual data characteristics [13]. Federation makes it possible to solve some traditionally challenging data integration problems, such as inconsistencies in attribute formats, the mandatory nature of attributes, etc. The core idea of the federative approach is to make data storages interoperable through data storage cross-mappings. Federation is carried out on the basis of metadata. When an information system is designed and built by an organization or by a software vendor, its data model represents the canonical approach within a specific social use context (the true world as interpreted in the data model of the IS). Since ISs are designed and built for different use contexts, they have different canonical data models.


The claim of the federative approach is that the different canonical data models should not be replaced, e.g. with a data model of data models. Instead, they should be made interoperable by identifying the attributes that are shared by the data models. The aim is to make the data attributes of the federated data storages visible to one another. Moreover, the federative approach claims that in an organizational environment with dozens of ISs or more, where all or most of the ISs are purchased from software vendors, it is not even possible to create a data model of data models, since the data models are owned by the software vendors, not by the user organization.

Data federation starts by identifying shared attributes and then describing their IS technical, informational and socio-contextual metadata through cross-mappings of those attributes [13]. The contextual stance on data ontology means that data are regarded as truthfully representing the social use context of the data. Apparently similar data may thus have several meanings, one for each use context, to the extent that some of the meanings may be contradictory. The canonical stance on data ontology, in contrast, proposes that it is possible to agree on one single version of the truth for data values and then to use those values in all contexts [13].

The federative approach offers a way to govern and manage data. It suggests that all the master data of a selected activity or process should be addressed, rather than a single domain [16]. A domain can be defined as an area of control or knowledge; in relation to MDM, a domain refers to the type of data to be mastered. The federative approach does not require that there is only one interpretation of customer data. It allows for varied data management and governance arrangements, since the term data is understood as having several meanings. Thus all local interpretations or contexts are considered true, and global master data are the sum of all local metadata interpretations. The federative approach is appropriate for and intended for use in an open IS environment [16].

According to Dahlberg, data interoperability and transferability are far from reality [16]. Regarding the obstacles to electronic data transfer and consolidation, Dahlberg argues that data creation and handling processes vary, which leads to dissimilarities in data coding and content. Additionally, data concepts, formats and structures differ, which results in fragmented and duplicated data.

Dahlberg and Nokkala point out the lack of accepted and widely used international, national and local data model or message standards [14, pp. 27]. The reason for the present situation is that each organization develops, or procures and implements, databases and ISs of its own, without taking data interoperability, transferability and usability into account. Dahlberg and Nokkala allot responsibility to data management, since business professionals should know what the content of data ought to be and what data are necessary to perform specific tasks [14]. Therefore, if data governance is unclear, nobody in the organization is responsible for the content quality or for the availability of data for specific tasks. The conclusion drawn by Dahlberg and Nokkala is that the governance of data framework should be generic and should have a corporate managerial focus [14].

Chen et al. define interoperability as the ability of two systems to understand each other and to use each other's functionality [6, pp. 648]. With regard to the definitions of integration and interoperability, interoperability carries the meaning of coexistence, autonomy and a federated environment, whereas integration refers rather to the concepts of coordination, coherence and uniformization.

Cross-referencing is used to create links between the registers and to keep track of metadata and cross-references [17]. Thus the original data remain untouched in their original location. The objective is to make data available by knowing what the data mean. The solution is built up gradually by adding new registers and refining the cross-referencing metadata whenever a new type of transaction is detected [17].

Dahlberg argues that attribute-driven MDM leads to a narrow ERD and rich metadata descriptions [16]. The ERD is narrow because only the attributes used to implement the federation need to be modeled. Rich metadata descriptions are necessary to capture the IS technical, informational and contextual metadata.

The following procedures show how the federative approach works in practice.

Data federation is implemented on the basis of contextual metadata by interpreting the attributes that make data interoperability and federation possible. The primary procedure is to identify and cross-reference (Figure 1). Firstly, the registers to be federated are identified in order to execute transactions, steer processes and produce management reports. Secondly, all the attributes shared between the registers are identified; these attributes can be used to build links between the federated registers. Thirdly, the basis for the links between the registers is created by describing the metadata of the shared attributes. The technical, informational and semantic metadata are necessary for cross-referencing the attributes in order to federate them.

The secondary procedure is to use the cross-reference metadata to create links between the registers and to keep track of the metadata produced by cross-referencing (Figure 1). The original data remain untouched in their original place. Firstly, the core idea is not to replace data but to make them available by knowing what the data mean. Secondly, the solution is built up gradually by adding new registers and by refining the cross-referencing metadata on demand, i.e. when a new type of transaction is discovered.

This answers the main research question of the Master’s thesis.


Figure 1 Semantic View of the Federative Approach in Practice
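
To make the two procedures concrete, the following sketch complements Figure 1 with a minimal, illustrative implementation of attribute-level federation. The register contents, field names and metadata entries are assumptions made for the example; they are not the case study's actual registers or tooling.

    # Minimal sketch of attribute-level data federation.
    # Step 1: identify the registers to be federated and the attribute they share.
    # Step 2: describe the shared attribute with cross-reference metadata.
    # Step 3: use the metadata to link the registers at query time;
    #         the source registers are never modified or copied.

    patient_register = [          # hypothetical source register 1
        {"hetu": "010190-123M", "name": "Test Patient"},
    ]
    treatment_register = [        # hypothetical source register 2
        {"person_id": "010190-123M", "event_date": "2015-04-02", "treatment": "surgery"},
    ]

    # Cross-reference metadata: per-register field names plus contextual notes
    # about how the shared attribute is created and used in each IS.
    cross_reference = {
        "personal_identity_code": {
            "patient_register": {"field": "hetu", "context": "entered at admission"},
            "treatment_register": {"field": "person_id", "context": "copied from the patient system"},
        }
    }

    def federate(shared_attribute: str):
        """Join the two registers on a shared attribute, using the metadata only."""
        mapping = cross_reference[shared_attribute]
        left_field = mapping["patient_register"]["field"]
        right_field = mapping["treatment_register"]["field"]
        index = {row[left_field]: row for row in patient_register}
        for row in treatment_register:
            patient = index.get(row[right_field])
            if patient is not None:
                yield {**patient, **row}   # federated view; the sources stay untouched

    for federated_row in federate("personal_identity_code"):
        print(federated_row)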

2.1.1 Benefits

The benefits of the federative approach can be categorized as follows [17]. Firstly, an organization can move gradually and start federating data without risking its legacy IT. It is thus able to protect IT investments and increase value creation. Potential investments are focused solely on data attributes and MDM tools in order to make the data interoperable; there is no need to modify legacy ISs for data federation, as long as shared attributes can be identified.

Secondly, an organization may create new perspectives on federated data by using MDM tools and without acquiring new ISs, since the mappings already exist. This is a cost-effective approach to assessing the strength of data federation.

Thirdly, an organization can modify master data in any of the federated ISs, after which all the other federated ISs are capable of loading the data by using MDM tools.

Fourthly, any attribute can become an interoperable connector for both master data and transactional data by creating a data link.

Finally, the federative approach may offer the possibility to start with practical actions and projects, and it allows a tight focus and more understandable data governance and management models. Additionally, the federative approach leads to narrow entity-relationship modeling and rich metadata descriptions.

This answers the first sub-question of the Master’s thesis.

2.1.2 Limitations

The limitations of the federative approach are as follows [16]. In the current IS environment, open systems environments increase the risk of ontological errors. Entirely new data sources, such as sensors and other Internet of Things (IoT) devices, are available. In addition, new formats and dimensions of data prevail, such as audio and video formats or spatial and temporal data, and data can be structured, unstructured or multi-structured.

There is as yet no empirical evidence that data federation can meet all these challenges. Data federation may also face problems in relation to the APIs of various ISs. For example, data federation between asynchronous and synchronous, or more simply put, differently timed, automated processes and applications can be very demanding. Another limitation is that APIs do not contain metadata about the meaning of data in ISs or about changes in the meaning of the data within ISs [15].

This answers the second sub-question of the Master’s thesis.


2.2 Golden Record Approach

The golden record, or a single version of the truth, is identified and developed for each product (treatment), place (location), person (patient) or organization (department) for the use of MDM [18, pp. 173]. The same golden records and values are used in every transaction-processing database [18]. Many organizations typically have transaction systems that contain inconsistent reference and master data. Data warehousing systems must therefore be able to identify both the most veracious system of record and the most accurate, golden reference and master data values [18].

Dreibelbis et al., in turn, define the golden record as a service provided by a trustworthy source for downstream systems; it is dedicated to reporting and analytics, or it serves as a system of reference for other operational applications [21, pp. 26].

The golden record approach requires a closed IS environment in which data are internal and structured. The approach requires knowledge of the data models and entities, with the assumption that it is possible to create a data model that achieves a single version of the truth. Currently, however, IS environments are increasingly open and transparent, and the data are increasingly multi-structured and external. The golden record implies a single standard record that is generated using data from multiple source systems [25, pp. 101].

The golden record approach appeared as a solution to the dilemma of connected but inconsistent, unstructured and fragmented data storages [13, pp. 9]. The golden record philosophy represents the current mainstream in the data management discussion [2, 15, 18]. Although the term golden record is actively used among professionals, a clear and consistent definition is hard to find.

Dreibelbis et al. present the case of a consolidation implementation, in which master data have been brought together from a number of ISs [21, pp. 26]. The data are processed (transformed, cleansed, matched and integrated) in order to capture a single golden record for the master data domains.

The golden record approach belongs to the era of closed information system (IS) environments, whereas nowadays user organizations acquire ISs and storage solutions as packages or as cloud services. The approach is suited to a situation in which each organization develops, or is responsible for developing, its own systems and information architecture, which is particularly necessary for an organization to be able to execute its own business processes. The strength of the golden record approach is that the organization is aware of the scope of each IS in detail, including the data entities, attributes and their dependencies on the IS. On the other hand, the weaknesses of the approach are its dependency on closed systems and the canonical data ontology assumption [16].

Thanks to the closed IS environment, the information architecture can be designed without any useless overlaps [16]. Data interfaces are thus defined in advance, which means that data are interchanged and federated by applying the data model of each IS [16, pp. 2]. According to Dahlberg, data federation requires the presence of a connecting data element, such as a customer (patient) or a product (treatment) [16].

In open, overlapping (redundant) systems that contain the same data, such as customers, vendors, products and services, or business transactions, reports and documents, the question of trustworthy data inevitably arises [11]. Due to the explosion of external data faced by open systems environments, it is necessary to federate data from different sources. In closed systems environments with the structured data of internal ISs, by contrast, data federation was not relevant.

The canonical data ontology assumes that data have the same meaning, based on the single-version-of-the-truth philosophy [14]. The assumption can be divided into two forms. The first form assumes that data entities and attributes have a single meaning; this means that, for the overlapping data of ISs, a single version of the truth can be created for data entity and attribute values. In relation to the golden record, a record thus has the true values of the shared data attributes, e.g. for a customer. In conclusion, the true values should be used by all ISs, and the purpose of the matching and merging process is to delete non-true values [16].

The following procedures show how the golden record works in practice. The primary procedure is based on matching and merging (Figure 2). Firstly, all the patient registers are examined and their attributes are listed in a matrix, in which the rows describe attributes and the columns registers. Secondly, the attributes that are common to all or most registers are listed. These are descriptive attributes referring to the key entities of the patient data, and they are used to build the attributes of the golden record. As a result, the golden record does not necessarily resemble the record in any original data register. Thirdly, the one true value for each attribute of the golden record is defined; this definition is created for each patient. If there are several values for the same attribute in different registers, the latest and most accurate value is chosen and the other values are deleted as duplicates or deviations.

The secondary procedure is based on cloning. The golden record can either be cloned to all data registers or transferred from a master system (Figure 3). Cloning means that the golden values are substituted for the data of the original registers; all ISs should use these true values, and the matching and merging process removes all values other than the true ones. If the golden record is enforced as a whole in the original registers, IS maintenance is required, which may be impossible due to the high costs incurred in open systems environments. Transferring involves a master system being used as the source of the golden data, i.e. maintenance is carried out in only one IS, and the data are made available to the other ISs, typically via a data bus, i.e. Enterprise Application Integration (EAI) (Figure 4).
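
For contrast with the federative sketch in Section 2.1, the following minimal example illustrates the matching-and-merging step described above: records for the same patient are matched across registers and a single set of "true" attribute values is produced. The register contents, attribute names and the "latest non-missing value wins" survivorship rule are illustrative assumptions.

    from datetime import date

    # Hypothetical records for one patient from three source registers, each with a
    # timestamp indicating when the values were last updated in that register.
    source_records = [
        {"hetu": "010190-123M", "address": "Old Street 1", "phone": None, "updated": date(2014, 5, 1)},
        {"hetu": "010190-123M", "address": "New Street 2", "phone": "040-1234567", "updated": date(2016, 2, 10)},
        {"hetu": "010190-123M", "address": None, "phone": "02-7654321", "updated": date(2015, 8, 3)},
    ]

    def build_golden_record(records):
        """Merge matched records: for each attribute keep the latest non-missing value."""
        golden = {}
        for record in sorted(records, key=lambda r: r["updated"]):
            for attribute, value in record.items():
                if attribute != "updated" and value is not None:
                    golden[attribute] = value   # later values overwrite earlier ones
        return golden

    print(build_golden_record(source_records))
    # -> {'hetu': '010190-123M', 'address': 'New Street 2', 'phone': '040-1234567'}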


Figure 2 Semantic View of the Golden Record Approach in Practice (1)


Figure 3 Semantic View of the Golden Record Approach in Practice (2)


Figure 4 Semantic View of the Golden Record Approach in Practice (3)

2.3 Comparison of Ontological Approaches

Both approaches, the contextual and the canonical, are fundamentally related to master data. The canonical approach was originally created in closed IS environments. A golden record is a master data record that aligns all relevant attributes from all available data sources. It is administered in a central repository, where data cleansing and data matching guarantee its quality.


The contextual approach works in open, distributed IS environments and considers both context and time. Owing to the particular characteristics of open systems environments, data federation should be performed at the attribute level, rather than according to an entity or a data model [13]. Data federation needs at least one connecting, i.e. shared, data attribute when there are two or more data sets. The approach does not intend to create a single harmonized record, as the canonical approach does; the goal is to identify shared attributes that form the links between the federated registers.

2.3.1 Advantages

The golden record approach is a simple and straightforward concept. Duplicate attributes are avoided, which raises the level of data quality [52]. An all-encompassing data record contains links to the master data records in the different original data sources. When an update is made to an attribute in a particular data source, the same update is also made to all other relevant sources. Consequently, all the available data remain consistent throughout, and as they do not have to be physically moved, they are not stored redundantly. All these actions save time and money. As a result, all the data silos in the company are synchronized without any fragmented data. The golden record approach can be used in a single homogeneous use context [52]; in the case study, the approach could be applied if breast cancers were always treated in the same way. The strength of the golden record approach is that the organization is aware of the scope of each IS in detail, including the data entities, attributes and their dependencies on the IS [16].

From the perspective of IT, golden records reduce the costs of data availability, exchange, integration and migration [52], and the actual quantity of data can be reduced. From the business stance, golden records make a holistic customer view possible. According to Martin, this leads to increased turnover in the short term, as well as to customer satisfaction and loyalty in the long term [52].

Van der Lans introduces the advantages of deploying data federation as follows [44, pp. 7, 149]:

Increased speed of report development
Easier maintenance of metadata specification
Consistent reporting
Cost reduction due to simplification
Increased flexibility of the business intelligence system
Easier data store migration
Seamless adoption of new technology
Data model-driven development
Transparent archiving of data.

The advantages are considered from the stance of business intelligence systems.

2.3.2 Disadvantages

It is not necessarily possible to implement the golden record approach to a high quality standard, since ISs are becoming more and more distributed and open. The weaknesses of the approach are its dependency on closed systems and the canonical data ontology assumption [16]. Organizations have replaced self-developed ISs, which used to be typical in a closed systems environment, with software packages and services from independent IS service vendors. With the increasing numbers of ISs used in organizations and the growing volumes of digital data, the technological move to an open systems environment is accelerating.

The data model of many commercial software packages represents the model of a generic user organization. This may create a risk of incompatibilities between different instances of the same software package in a single organization. A user organization may not even have access to the data model of a software package; changes and updates to the generic data model of the software are made externally by the software vendor.

Due to the ever-increasing deployment of IT, with larger numbers of ISs in use, there may be data about the same persons (patients), facilities and locations (healthcare facilities), things (cytostatic materials), concepts (disease diagnoses) and other data elements in dozens, hundreds or even thousands of data storages with possibly unknown interconnections. As a result, data definitions tend to be out of control. Additionally, the IS technical, informational and socio-contextual characteristics of the data may differ between data storages.

The golden record approach is unable to deliver organization-wide federation of customer (patient), product (treatment), location (healthcare facility) and other types of data. Despite improvements, MDM solutions remain fragmented, the reason being the canonical ontological assumption.

In open systems environments, data federation requires more IS technical, informational and social metadata. Metadata of this kind are needed to understand how the shared attributes have been created and what the attributes mean in their use contexts.

The potential disadvantages of data federation are as follows [44, pp. 151]:

Extra layer of software
Repeated processing of transformations
Proprietary development language
Management of data servers
Limited experience.

The disadvantages are viewed from the perspective of business intelligence systems.

2.4 Ontological Approach in the Case Study

The federative approach was used in this case study. Several attributes were detected and two basic matrices were built, forming an artifact. The artifact is the keystone of the data federation. The first matrix (Table 1) describes the information systems and four shared, interoperable attributes (HETU, TNM, Diagnosis Code, Dates of Events). The second matrix (Table 3) is more important, because it analyzes each attribute one at a time using metadata. It includes data about who is accountable for inserting the data, which processes and phases are involved, the life cycles of the data, and the level of understanding, i.e. what is understood and meant at the different phases of the data life cycle.
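
The structure of the two matrices can be sketched as simple data structures. The attribute, system and metadata entries below are illustrative placeholders rather than the case study's actual content.

    # Sketch of the two artifact matrices described above.

    # Matrix 1 (cf. Table 1): which information systems carry each shared attribute.
    systems_by_attribute = {
        "HETU":            ["patient_admin_IS", "pathology_IS", "radiotherapy_IS"],
        "TNM":             ["pathology_IS", "oncology_IS"],
        "Diagnosis Code":  ["patient_admin_IS", "oncology_IS"],
        "Dates of Events": ["patient_admin_IS", "oncology_IS", "radiotherapy_IS"],
    }

    # Matrix 2 (cf. Table 3): contextual metadata describing each attribute in turn.
    contextual_metadata = {
        "TNM": {
            "accountable_for_insertion": "pathologist",
            "process_and_phase": "recorded after histopathological analysis",
            "life_cycle": "updated if restaging is performed",
            "meaning_by_phase": "clinical stage at diagnosis vs. pathological stage after surgery",
        },
    }

    for attribute, systems in systems_by_attribute.items():
        print(f"{attribute}: shared by {len(systems)} systems")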


3 LITERATURE REVIEW

The literature review is based on the key literature, i.e. the latest and most prominent articles and books relating to the issue. The review goes through the main content of the scientific articles concerning the context of the research and the case study.

The literature search was based on the following keywords: attributes, master data, metadata, data federation, breast cancer, golden record, data governance, data management. The following search engines and databases were used: Google Scholar, Volter, Melinda, Arto and the Nelli Portal of the University of Turku, Finland.

In each chapter, a comparison is made between the federative approach and the canonical philosophy (the golden record approach).

3.1 Ontology

Ontology can be defined as an explicit formal specification of how to represent objects,

concepts and other entities that are assumed to exist in some area of interest and the rela-

tionships holding among them [59, pp. 5]. Systems that share the same ontology are able

to communicate about a domain of discourse without necessarily operating on a globally shared theory. A system commits to an ontology if its observable actions are consistent with

the definitions in the ontology. Ontology is also defined as a model with concepts and

their relationships within a domain [18]. For example, the canonical data ontology stance

proposes that it is possible to agree on one single version of the truth of data values as the

golden record.

The ontology defined by DAMA describes individuals (instances), classes (concepts),

attributes, and relations [18]. It creates a relationship between a taxonomic hierarchy of classes and definitions via the subsumption relation, for example decomposing intelligent

behavior into many simpler behavior modules and layers [18]. Ontology defines basic

terms and relations comprising the vocabulary of a topic area and comprises rules for

combining terms and relations to define extensions to the vocabulary. In contrast, taxon-

omy denotes classification of information entities in the form of a hierarchy, according to

the presumed relationships of the real-world entities that they represent. The two terms

ontology and taxonomy are widely used to describe the results of modeling efforts [50,

pp. 175].
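The distinction can be illustrated with a toy example: a taxonomy is a pure classification hierarchy, whereas an ontology adds typed relations between the same concepts. The concepts and relations below are invented for illustration.

# An illustrative contrast between the two terms defined above: a taxonomy
# as a pure classification hierarchy, and an ontology that adds typed
# relations between the same concepts. The content is a toy example.

taxonomy = {
    "disease": ["cancer"],
    "cancer": ["breast cancer"],
}

ontology_relations = [
    ("breast cancer", "is_a", "cancer"),
    ("breast cancer", "diagnosed_by", "histopathological examination"),
    ("TNM stage", "describes", "breast cancer"),
]

def subclasses(concept):
    """Walk the taxonomic hierarchy below a concept."""
    for child in taxonomy.get(concept, []):
        yield child
        yield from subclasses(child)

print(list(subclasses("disease")))
print([r for r in ontology_relations if r[0] == "breast cancer"])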

In their article On the Ontological Expressiveness of Information Systems Analysis

and Design Grammar, Wand and Weber introduce the grammars that ISAD (Information

Systems Analysis and Design) methodologies provide to describe various features of the

real world [69, pp. 219]. They argue that an ISAD grammar can be used to describe all


ontological constructs completely and clearly. ISAD intends to represent the real world,

and by tracking changes in the existing or imagined real-world phenomena, it strives to

model and facilitate the design of a structured information system, which is well decom-

posed [69, pp. 218].

According to the article An Ontological Model of an Information System, by Wand

and Weber, theoretical developments in the CS and IS disciplines have been inhibited by

inadequate formalization of basic constructs [68]. The article proposes an ontological

model of an information system that provides precise definitions of fundamental concepts

such as system, subsystem, and coupling. The model is used to analyze some static and dy-

namic properties of an information system and to examine the question of what consti-

tutes a good decomposition of an information system. Tsiknakis et al. underline the im-

portance of extensive use of ontology and metadata in research [66].

The article A Semantic Grid Infrastructure Enabling Integrated Access and Analysis of

Multilevel Biomedical Data in Support of Postgenomic Clinical Trials on Cancer, by

Tsiknakis et al., presents the master ontology on cancer, developed by the project, and

their approach to developing the required metadata registries [66]. The master ontology

defines the ontology of cancer research and management as an objective to enable seman-

tic data integration.

According to Tsiknakis et al., clinical researchers or molecular biologists often en-

counter difficulties in exploiting one another’s expertise, since the prevailing research

environment is not sufficiently cooperative [66]. Cooperation would enable the sharing of

data, resources, or tools for comparing results and experiments, and creating a uniform

platform supporting the seamless integration and analysis of disease-related data at all

levels [66].

Tsiknakis et al. [66] state that the results from the research analyses of breast cancer

reside in separate dedicated databases, e.g. clinical trial (CT) database, histopathological

database, institutional or modality-specific image management systems, microarray data-

bases, proteomic information systems, etc. In addition, since several clinical research or-

ganizations are participating in a given trial, they are also geographically distributed with-

in or across countries [66].

Once data sets are generated, a range of specialized tools is required for the integrated

access, processing, analysis, and visualization of the data sets [66]. In addition, the tools

must provide a dynamically evolving set of validated data exploration, analysis, simula-

tion, and modeling services. The integration of applications and services requires substan-

tial meta-information on algorithms and input/output formats if the tools are to interoper-

ate. Furthermore, the assembly of tools into complex discovery workflows will only be

possible if the data formats are compatible, and the semantic relationships between ob-

jects shared or transferred in workflows are clear. To integrate the highly fragmented and


isolated data sources, semantics is needed in order to answer higher-level questions.

Therefore, it becomes critically important to describe the context, in which the data were

captured. This contextualization of data is described as metadata [66].

In the case study by Tsiknakis et al. [66], the user performs queries against a single

virtual repository. The virtual repository represents the integration of several heterogene-

ous sources of information. The integration process relies on a common interoperability

infrastructure, based at a conceptual level on domain ontology [66].

The article A Paradigmatic Analysis on Information System Development, by Iivari et

al., analyses the fundamental philosophical assumptions of five contrasting information

systems development (ISD) approaches: the interactionist approach, the speech act-based

approach, the soft systems methodology approach, the trade unionist approach, and the

professional work practice approach [40]. According to the article, these five approaches

are selected for analysis because they illustrate alternative philosophical assumptions to

the dominant orthodoxy identified in the research literature.

The article also proposes a distinction between approach and methodology. The analy-

sis of the five approaches is organized around four basic questions. Firstly, what is the

assumed nature of an information system (ontology)? Secondly, what is human

knowledge and how can it be obtained (epistemology)? Thirdly, what are the preferred

research methods for continuing the improvement of each approach (research methodolo-

gy)? And fourthly, what are the implied values of information system research (ethics)?

Each of these questions is explored from the internal perspective of the particular ISD

approach. The questions are addressed through a conceptual structure, which is based on

a paradigmatic framework for analyzing ISD approaches [40].

From the ontological stance the golden record philosophy assumes that it is possible to

define and agree on one version of the truth and on the context in which all the data enti-

ties and data attributes have the same meaning in every data use context. However, this

canonical approach is contradictory to the ontology principles of Wand and Weber [68,

69, 70]. The approach aims to establish canonical data models, where all the context- and

time-specific values between IS-shared attributes are purpose-specific exceptions and are

replaced by the canonical true values.

The federative approach is based on the ontological assumption that data are and

should be contextually defined in order to maintain local completeness. Thus data are

considered to have several true meanings, depending on their use context and the time of

data usage. The federative approach does not remove alternative meanings as the golden

record approach advocates. It uses metadata about the meanings of data to execute data

federation on the basis of interoperable/shared attributes. The approach produces truer

and richer insights.


The first canonical assumption is that there is only one context in effect. If there are

more contexts, they can be described with the aid of attributes and as a result the concep-

tual models are created. The second assumption is that data and the interpretation of data

can be defined and controlled.

In the federative approach, every single operating condition is considered contextual. For example, X-ray images are evaluated from the stance of a pathologist and relat-

ed to a state of breast cancer. Thus the pathological data are significant and true. If anom-

alous values are found, they are registered as additional properties.

When all the single operating conditions are bound together, no single version of the

truth can be found. In this case there are contextual interpretations of the data which can-

not be linked, since the ontological interpretation of these data can be lost. Another per-

spective is that there is no control over the interpretations of data, as no accountability can

be found.

3.2 Semantics

Semantics indicates the meaning of expressions written in a certain language, as opposed

to their syntax, which describes how the symbols may be combined independently of their

meaning [59, pp. 5]. Semantic modeling on the other hand is a type of knowledge model-

ing [18, pp. 249]. Modeling includes a network of concepts, i.e. ideas or topics of concern

and their relationships. A semantic model such as an ontology model contains the con-

cepts and relationships together [18].

Based on the article A Framework for Theory Development in Design Science Re-

search, by Kuechler and Vaishnavi, one point of convergence in the many recent discus-

sions on design science research in Information Systems (DSRIS) has been the desirabil-

ity of a directive design theory (ISDT) as one of the outputs from a DSRIS project [43].

Kuechler and Vaishnavi introduce the framework from a knowledge representation

perspective and then provide typological and epistemological perspectives [43]. The aim

is to motivate the desirability of both directive-prescriptive theory (ISDT) and explanato-

ry-predictive theory (DREPT) for IS design science research and practice. Kuechler and

Vaishnavi position both types of theory in Gregor’s (2006) taxonomy of IS theory within

a typological view of the framework. Gregor claims that an appropriate taxonomy for IS depends on classifying theories with respect to the degree and manner in which they address four central goals of theory: analysis, explanation, prediction and prescription [36, pp. 614].

From the semantic perspective the federative approach needs semantic metadata to

federate data. In particular, semantic metadata are needed in order to cross-reference at-


tributes in registers. Semantic metadata describe what data mean during their lifecycle, or

what their use is. In addition, semantic metadata are used to describe business/data rules

for data federation. For example, data governance builds on understanding the meaning of

the various, i.e. the contextually defined semantic descriptive metadata of those represen-

tations. In federated databases, semantics added to the local schemas facilitates negotiations, specifies views, and supplements queries. In the federative approach, semantics must be known, as the data are made interoperable and the semantic meaning of the data changes as a function of time.

Semantics plays no great role in the canonical approach, since the data model governs semantics and has to remain unchanged. If the semantics changes, the data model itself must be revisited and supplemented.

3.3 Data Models

A data model is an abstract model that organizes elements of data and standardizes how

they relate to one another and to the properties of real-world entities. It explicitly determines the structure of data. A data model is sometimes referred to as a data structure,

especially in the context of programming languages. Data models are often complement-

ed by function models, especially in the context of enterprise models.

A conceptual model is a representation of a system, made of the composition of con-

cepts which are used to help people know, understand, or simulate a subject the model

represents. A contextual model defines how context data are structured and maintained. It

aims to produce a formal or semi-formal description of the context information that is

present in a context-aware system.

The canonical model is a design pattern used to communicate between different data

formats. It is any model that is canonical in nature, i.e. a model in the simplest possible form, based on a standard enterprise application integration (EAI) solution.

The canonical approach builds on entities and data models when designing an infor-

mation system. The canonical approach is a transformation of the data model. If the same

data are available in the environment, they introduce additional attributes into the data

model, but the model itself is not changed and it remains the same model.

In the federative approach, it does not make sense to acquire one large conceptual

model, as the model becomes complicated and difficult to govern. Every single operating

condition requires a quantity of data and is not necessarily needed, as it hampers practical

work. Essential operating conditions include their own data models, which are canonical,

as they must always be understood equally. Every operating condition must have a con-

ceptual model. Data models are extremely valuable, because they help us to understand


metadata. In the federative approach entities in particular are principal elements for the

understanding of metadata.

3.3.1 Conceptual Model

The article Conceptual Model Enhancing Accessibility of Data from Cancer-Related En-

vironmental Risk Assessment Studies, by Dušek et al., proposes a conceptual model,

which can be used to facilitate the discovery, integration and analysis of environmental

data in cancer-related risk studies [24]. According to the article, persistent organic pollu-

tants were chosen as a model due to their persistence, bioaccumulation potential and gen-

otoxicity. The part dealing with cancer risk is primarily focused on population-based ob-

servations encompassing a wide range of epidemiologic studies, from local investigations

to national cancer registries. The proposed model adopted as content-defining classes a

multilayer hierarchy working with characteristics of given entities (POPs, cancer diseases

as nomenclature classes) and couples observation-measurement [24].

The proposal extends the formally used taxonomy by applying a multidimensional set of descriptors, including measurement validity and precision. Dušek et al. argue that it has

the potential to aid multidisciplinary data discovery and knowledge mining. The same

structure of descriptors used for the environmental and cancer parts enables the users to

integrate different data sources [24].

According to the article Information Systems and Conceptual Modeling, by Wand and

Weber, within the information systems field, the task of conceptual modeling involves

building a representation of selected phenomena in some domain [70]. Wand and Weber

argue that high-quality conceptual-modeling work is important because it facilitates early

detection and correction of system development errors. It also plays an increasingly im-

portant role in activities such as business process reengineering and documentation of

best-practice data and process models in enterprise resource planning systems. Yet little

research has been undertaken on many aspects of conceptual modeling [70].

Wand and Weber propose a framework to motivate research that addresses the follow-

ing fundamental question: How can the world be modelled to better facilitate our devel-

oping, implementing, using, and maintaining more valuable information systems? The

framework comprises four elements: conceptual-modeling grammars, conceptual-

modeling methods, conceptual-modeling scripts, and conceptual-modeling contexts [70].

According to the article Leveraging Information Technology for Transforming Organ-

izations, by Henderson and Venkatraman, it is clear that even though IT has evolved from

its traditional orientation of administrative support toward a more strategic role within an


organization, there is still a glaring lack of fundamental frameworks, within which we

could understand the potential of IT for tomorrow’s organizations [38].

Henderson and Venkatraman developed a model for conceptualizing and directing the

emerging area of strategic management of IT [38]. This model, termed the Strategic

Alignment Model, is defined in terms of four fundamental domains of strategic choice:

business strategy, information technology strategy, organizational infrastructure and pro-

cesses, and information technology infrastructure and processes – each with its own un-

derlying dimensions. Henderson and Venkatraman also present the power of the model in

terms of two fundamental characteristics of strategic management: strategic fit (the inter-

relationships between external and internal components) and functional integration (inte-

gration between business and functional domains) [38].

3.3.2 Contextual Model

According to the article A Framework for the Corporate Governance of Data - Theoreti-

cal Background and Empirical Evidence, by Dahlberg and Nokkala, the contextual ap-

proach acknowledges that the universal approach suits situations, in which real-world

representations are closely related, for example similar tasks or chains of tasks [14, pp.

35]. The approach emphasizes the role of metadata and the use of agreed messages in

sharing data between data storages.

The contextual approach proposes that data represent granted interests and dynamic in-

terplay between socially-constructed concepts, especially the representations of human

behavior in context by the IS [14, pp. 34]. Dahlberg and Nokkala argue that digital data

are contextually defined and central to business management. Contextual metadata are

used to describe business and data rules, and they are necessary for federating data.

The framework of this Master’s thesis is based on the ontological assumption that data

are contextually defined [68, 69, 70]. Thus data could have several meanings and inter-

pretations, which depend on their social use context. Dahlberg et al. argue that the mean-

ings could even be contradictory [13]. Additionally data are defined in their various use

contexts over the life cycle of the data.

The federative approach builds on a contextual stance to the data ontology prevailing

in open systems environments [13]. It thus pays more attention to data governance.


3.4 Data Classification

DAMA defines data as a representation of facts as text, numbers, graphics, images,

sound, or video [18, pp. 2]. The difference between data and information is that infor-

mation is data in context. Knowledge, in turn, is defined as understanding, awareness,

cognizance, and the recognition of a situation and familiarity with its complexity [18, pp.

3]. In other words, one gains in knowledge when the significance of the information is

understood.

Data can be classified as follows [21, pp. 33]:

• Metadata

• Reference data

• Master data

• Transaction data

• Historical data.

The term metadata refers to descriptive information. In turn, reference data defines and

distributes the collections of common values. Reference data enables operational and analytical activities to be processed. It also provides the same defined values of information for common abbreviations, for codes, and for validation [16, pp. 34].
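As an illustration of this role of reference data, the following sketch validates the attribute values of a record against agreed collections of common values. The code lists are hypothetical examples, not an official code set.

# Illustrative only: reference data as an agreed collection of common values
# used to validate attribute values in operational records. The code lists
# below are hypothetical examples, not an official code set.

reference_data = {
    "gender": {"F", "M", "U"},
    "diagnosis_code": {"C50.1", "C50.4", "C50.9"},
}

def validate(record):
    """Return the attribute names whose values are not found in the
    corresponding reference data collection."""
    return [attr for attr, allowed in reference_data.items()
            if attr in record and record[attr] not in allowed]

print(validate({"gender": "F", "diagnosis_code": "C50.9"}))   # []
print(validate({"gender": "X", "diagnosis_code": "C50.9"}))   # ['gender']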

Master data represents the common agreed and shared business objects within an en-

terprise [21, pp. 35]. Master data are managed within a dedicated MDM (Master Data

Management) system. An MDM System often uses both reference data and metadata in

processing.

Reference data ensures common and consistent values for the attributes of master data,

such as a patient’s name or gender. Transaction data refer to fine-grained information,

representing the details of any enterprise [21, pp. 35]. From the business point of view,

this kind of information includes sales transactions, inventory information, and invoices

and bills. In non-profit organizations, transaction data represent passport applications, logistics, or court cases. Transaction data also describe what is happening in an organization, while historical data represent the accumulation of transaction and master data over

time. These data are used for both analytical processing and for regulatory compliance

and auditing [21, pp. 35].

The federative approach takes advantage of interoperable/shared attributes based on

metadata, which can have different meanings depending on where and when the data are

used. The approach considers master data contextually, and it is applicable beyond internal master and transaction data, particularly to complex small data and Big Data. The feder-

ative approach accepts the canonical approach in the case of one single context, if it is

possible. If the contexts are different and diverge from one another, then the canonical

approach is not possible. Federation is performed with respect to contexts and entities. In


practice, federation is done by describing the dependencies of master data, i.e. the loca-

tions which can be referred to. The federative approach is interested in the life cycle of

data, i.e. what happens to data and what data are needed in each phase of the life cycle,

when the metadata differ from technical metadata.

Master data and reference data are alike. Both data types are governed with the help of

transactions. Transactions are shared in an information system. Master data are important

in both ontological approaches. The canonical philosophy leads to golden records in order

to harmonize data. From the data storages, the shared attributes are chosen. Then, for every attribute (e.g. customer, patient), the values that are exclusively permitted to be used are agreed upon. The canonical approach does not pay attention to how the data

are processed, but underlines the fact that there is only one version of the truth, which is

based on technical metadata.
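The canonical harmonization described above can be sketched as follows; the shared attribute and its agreed value mapping are invented for illustration.

# An illustrative sketch of the canonical (golden record) harmonisation
# described above: for each shared attribute an exclusive set of permitted
# values is agreed, and deviating values are replaced by the canonical one.
# The value mappings below are invented for illustration.

canonical_values = {
    "gender": {"female": "F", "f": "F", "male": "M", "m": "M"},
}

def harmonize(record):
    """Replace context-specific values of shared attributes with the agreed
    canonical values, discarding the original variants."""
    golden = dict(record)
    for attribute, mapping in canonical_values.items():
        if attribute in golden:
            golden[attribute] = mapping.get(str(golden[attribute]).lower(),
                                            golden[attribute])
    return golden

print(harmonize({"gender": "female", "HETU": "010180-123A"}))  # {'gender': 'F', ...}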

3.4.1 Master Data

MDM with master data is a collection of best data management practices [47, pp. 8]. On

the one hand, it organizes key stakeholders, participants, and business clients by incorpo-

rating the business applications, information management methods, and data management

tools. MDM implements the policies, procedures, services, and infrastructure to support

the capture, integration, and subsequent shared use of accurate, timely, consistent, and

complete master data. On the other hand, MDM is a set of disciplines and methods to

ensure the currency, meaning, and quality of a company’s reference data within and

across various data subject areas [25, pp. 43].

According to Martin, MDM represents a complex task, which includes all strategic,

organizational, methodical and technological activities related to a company’s master

data. MDM ensures consistent, complete, up-to-date, correct, and high quality master data

for supporting the business processes (e.g. ERP, CRM, SCM, PLM) of a company [52,

pp. 1].

Loshin argues that MDM combines core identifying data attributes along with associ-

ated relevant data attributes into a master repository. It links an indexing capability that

acts as a registry for any additional and distributed data within an enterprise [47, pp. 180].

Federated information models of this kind are often serviced via Enterprise Application

Integration (EAI) or Enterprise Information Integration (EII) styles of data integration

tools. Loshin believes that this capability is important in MDM systems built on a registry

framework or using any framework that does not maintain all attributes in the repository

for materializing views on demand. Master data record materialization is based on the

existence of a unique identifier with a master registry. The registry carries both core iden-


tifying information and an index to location across the enterprise, holding the best values

for designated master attributes [47].

According to the work, Customer Data Integration, by Dyché and Levy, MDM usually

refers to the management of reference data and the establishment of standard data values

for that reference data across the company [25, pp. 44]. Most MDM programs also include

transactional and relationship data as they are needed for specific business processes

(Figure 5).

Figure 5 Scope of MDM [25, pp. 46]

According to the article Uncovering Four Strategies to Approach Master Data Man-

agement, by Cleven and Wortmann, much recent Information Systems (IS) research has focused on master data management (MDM), which promises to increase an organization's overall core data quality [10]. Beyond any doubt, however, MDM initiatives con-

front organizations with multi-faceted and complex challenges that call for a more strate-

gic approach to MDM.

According to the article Master Data Management Best Practices, by Dahlberg, opera-

tionally, critical and other master data are fragmented and inconsistent [11]. As a result,

the information content quality is poor. This has a negative impact on business activities, reducing business transparency, causing loss of revenue and operational incidents, and leading to incorrect business management reporting. According to Dahlberg, poor quality also re-

duces the chances of achieving the benefits of improvement of business processes and

other related development initiatives [11]. Older managers, who may not understand the

importance of critical business data, accept the status quo and consider it normal.

Based on the article Framework and Research Agenda for Master Data Management in

Distributed Environments, by Dahlberg et al., master data provide the foundation for re-

lating business transactions with business entities, such as customers, products, locations


etc. [12]. These entities are also referred to as domains in the master data literature. The

integrity, availability and timeliness of master data in single and growingly multi-domain

combinations are crucial in e-Business transactions over the Internet, or in the cloud for

multiple stakeholders. Dahlberg et al. argue that distributed environments set additional

challenges for the management of master data [12]. Master data, management processes,

responsibilities and other contemporary master data management practices are described

as aiming to ensure master data quality in different domains. Even though these practical

means are of help in improving master data quality and managing master data, they are

insufficient to capture the underlying root cause of master data problems [12].

Dahlberg takes his stance on master data management from the IS theoretical view-

point [11]. He suggests that holistic approaches, such as enterprise architecting, stake-

holder analysis, or business modeling, could serve as coherent frameworks in identifying

common and specific master data management research themes for global businesses with

networked IT environments [11].

Figure 6 shows a simple example, where the MDM System holds enough master data

to uniquely identify a customer (in this case, the Name, TaxID, and Primary address in-

formation), and then provides cross-references to additional customer information stored

in Information System 1 and Information System 2 [21]. When a service request for cus-

tomer information is received (getCustInfo()), the MDM System looks up the information

that it stores locally, as well as the cross-references, to return the additional information

from Systems 1 and 2. The MDM System brings together the information desired through

federation, as it is needed. Federation can be performed at the database layer or by dy-

namically invoking services to retrieve the needed data in each of the source systems [21,

pp. 29].

Since most information remains in the source systems and is available when needed,

the information returned is always up-to-date [21]. This is implemented by an MDM Sys-

tem, which is able to meet transactional inquiry needs in an operational environment. This

kind of registry-based implementation can also be applied in complex organizational en-

vironments where one group may not be able to provide all of its data to another. The

registry can be implemented relatively quickly, since responsibility for most of the data

remains within the source systems [21].


Figure 6 MDM Registry Federation [21, pp. 28]
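The registry pattern of Figure 6 can be sketched as follows. The system contents, identifiers and the get_cust_info function are illustrative assumptions; the sketch only shows how locally stored identifying data and cross-references are combined into a federated view on request.

# A simplified sketch of the registry-style federation described above:
# the MDM registry stores only identifying attributes plus cross-references,
# and the remaining data are fetched from the source systems on request.
# System contents and identifiers are invented for illustration.

information_system_1 = {"cust-1": {"phone": "+358 40 1234567"}}
information_system_2 = {"A-77":   {"credit_limit": 5000}}

mdm_registry = {
    "cust-1-global": {
        "identifying": {"name": "Maija Meikäläinen", "tax_id": "1234567-8",
                        "primary_address": "Yliopistonkatu 1, Turku"},
        "cross_references": [(information_system_1, "cust-1"),
                             (information_system_2, "A-77")],
    }
}

def get_cust_info(global_id):
    """Assemble a federated customer view: local identifying data plus
    data retrieved from the source systems via the cross-references."""
    entry = mdm_registry[global_id]
    view = dict(entry["identifying"])
    for source_system, local_id in entry["cross_references"]:
        view.update(source_system.get(local_id, {}))
    return view

print(get_cust_info("cust-1-global"))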

Based on Doan et al., Master Data Management (MDM) uses a central warehouse as a

repository of knowledge about the enterprise’s critical business objects, rules, and pro-

cesses [20, pp. 273-274]. Essential to MDM is a clean, normalized version of the terms

used through the enterprise, whether addresses, names, or concepts and information about

the related metadata. Ideally, whenever business objects are used in systems throughout

the enterprise, the data values used by these systems can be tied back to the master data.

In many ways, a master data repository is merely a data warehouse with a particular role

to play [20].

According to Doan et al., as with any warehouse, MDM gives the various data owners

and stakeholders a bird’s eye view of all of the data entities as well as a common inter-

mediate representation [20]. The master data repository is also intended to be a central

repository where relevant properties about data, especially constraints and assumptions,

can be captured, making it the home of all metadata as well as data. In many cases, the

master repository is made query-friendly to all of the data owners, such that they can di-

rectly incorporate it into their systems and processes. The repository is seen as a way of

improving risk management, decision making, and analysis [20].

Doan et al. believe that MDM provides a process to allow data to be overseen and

managed through data governance [20]. It refers to the process and organization put in

place to oversee the creation and modification of data entities in a systematic way.


3.4.2 Metadata

DAMA defines metadata as information about the physical data, technical and business

processes, data rules and constraints, and logical and physical structures of the data, as

used by an organization [18]. Metadata is to data what data is to reality.

Watson defines metadata (data about data) as a description of each data type, its for-

mat, coding standards, and the meaning of the field [72, pp. 23].

Metadata can be classified into four major types [18, pp. 262]. Business model data in-

cludes the business names and definitions of subject and concept areas, entities, and at-

tributes; attribute data types and other attribute properties, range descriptions; calcula-

tions; algorithms and business rules; and valid domain values and their definitions. Tech-

nical and operational metadata provides developers and technical users with information

about their systems. Technical metadata includes physical database table and column

names, column properties, other database object properties, and data storage. Process

metadata are data that define and describe the characteristics of other system elements

(processes, business rules, programs, jobs, tools, etc.). Data stewardship metadata are

data about data stewards, stewardship processes, and responsibility assignments. Data

stewards ensure that data and metadata are accurate, with high quality across the enter-

prise [18].

In turn, Maier et al. define metadata as data about data [50, pp. 173-174]. The structure

of knowledge is based on knowledge elements and the relations between elements and

metadata. Relations expose further information about the content and associations of ele-

ments. A single knowledge element is called metadata. It can simultaneously be trans-

formed into another knowledge element of data [50].

Metadata can be used to describe any kind of data from structured to unstructured [50].

The structure itself is already a form of metadata and usually provides information about

the name of the data element (e.g. an XML Schema for an XML document). Element

names are often not sufficient to carry all the relevant information. Additional metadata

that either describe the content (e.g. keywords, domain) or the context of the data, are

needed, especially for semi-structured data. The context can further be subdivided into

creation context (e.g. customer, intended use) [50].

Three types of metadata can be identified [50]. Content metadata relates to what the

object contains or is about, and is intrinsic to an information object. Context metadata

indicates the aspects associated with the object’s creation and /or application and is ex-

trinsic to an information object (e.g. who, what, why, where and how aspects). Structure

metadata relate to the formal set of associations within or among individual information

objects and can be intrinsic or extrinsic [50].


Dahlberg argues that rich metadata descriptions are necessary to achieve three types of

metadata related to data federation [17]. Firstly, IS technical metadata describes the tech-

nical properties of the attribute that is used to federate data in the various federated regis-

ters. Secondly, informational metadata describes the data life-cycle properties of the at-

tribute, including the accountable party that created the attribute. This kind of metadata can also

be used to establish a data governance model. Accountability for data is allocated to the

person(s) who know it best. Thirdly, contextual/semantic metadata describes the meaning

of data during their lifecycle or the purpose of their use. Contextual metadata are used to

describe business and data rules in data federation [17].
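The three metadata types can be recorded, for example, as a structure of the following kind for each shared attribute. The field names and contents are illustrative assumptions, not the metadata of the thesis artifact.

# A sketch of recording the three metadata types for one shared attribute.
# The field names and contents are illustrative; they are not taken from
# the thesis artifact itself.

attribute_metadata = {
    "HETU": {
        "is_technical": {"datatype": "string", "format": "DDMMYY-XXXX",
                         "storage": "patient administration system"},
        "informational": {"created_by": "admissions clerk",
                          "life_cycle": ["admission", "treatment", "follow-up"],
                          "accountable": "registrar"},
        "contextual_semantic": {"admission": "identifies the person admitted",
                                "research": "pseudonymised linkage key"},
    }
}

def meaning_in_context(attribute, context):
    """Look up the contextual/semantic meaning of an attribute in a given
    use context, as data federation rules would need to do."""
    return attribute_metadata[attribute]["contextual_semantic"].get(context)

print(meaning_in_context("HETU", "research"))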

Based on Maier et al., the structure is extrinsic in database tables (data and structure are separated) and intrinsic in XML documents (tags and content mixed). Metadata can be

informal (e.g. structured according to a user-invented structure) or formal (e.g. structured

and compliant with a standard) [50]. The Dublin Core Metadata Initiative defines a set of

elements that are mainly based on experience acquired in public libraries.

Metadata ontology provides a vocabulary that is used to describe contents based on the

Dublin Core metadata standard. For example, data integration implements standards that

define character sets, addressing, markup, scopes and schema definitions [50, pp. 77]. In

integration, documents are captured and additional metadata are assigned, e.g. for indexing or attribution of documents. Metadata are usually stored in a database together with a

link to the document, which is transferred to a separate storage medium or system [50, pp.

252].

Dyché and Levy classify data into four main types. Transactional data are

records of individual customer interactions [25, pp. 44]. A transaction represents an activ-

ity at a point in time. Reference data identify a product, customer, or other business enti-

ty. Relationship data are data that further describes an entity in order to relate it to other

entities. Metadata are basically descriptive data about individual data elements [25].

Maier et al., on the other hand, identify three types of metadata, as noted above: content metadata, which relate to what the object contains or is about and are intrinsic to an information object; context metadata, which indicate the aspects associated with the object's creation and/or application (e.g. the who, what, why, where and how aspects) and are extrinsic to an information object; and structure metadata, which relate to the formal set of associations within or among individual information objects and may be intrinsic or extrinsic [50, pp. 172].

In conclusion, the common denominator of all the numerous definitions introduced above is that metadata are core and elementary data of ISs. Metadata are also a vital part of master data and MDM. In order to understand the meaning of data federation, it is indis-

pensable to comprehend metadata with their attributes and entities, and the structure they

form together.


According to the article Metadata Management by Sen, in the past, metadata has al-

ways been a second-class citizen in the world of databases and data warehouses [63]. Its

main purpose has been to define the data. However, the current emphasis on metadata in

the data warehouse and software repository communities has elevated it to a new promi-

nence. The organization now needs metadata for tool integration, data integration and

change management. Sen presents a chronological account of this evolution from both

conceptual and management perspectives. Repository concepts are currently being used

to manage metadata for tool integration and data integration. Alongside the evolution

process, Sen points out the need for a concept called Metadata Warehouse. He proposes

that the metadata warehouse needs to be designed to store the metadata and manage their

changes [63].

According to the article An Interoperable Data Architecture for Data Exchange in a

Biomedical Research Network, by Crichton et al., knowledge discovery and data correla-

tion require a unified approach to basic data management [8]. Nevertheless, achieving

such an approach is nearly impossible with hundreds of disparate data sources, legacy

systems, and data formats. Crichton et al. argue that the problem is pervasive in the bio-

medical research community, where data models, taxonomies, and data management sys-

tems are locally implemented. These local implementations create an environment where

interoperability and collaboration between researchers and research institutions are lim-

ited [8].

Crichton et al. demonstrate how technology developed by NASA’s Jet Propulsion La-

boratory (JPL) for space science can be used to build an interoperable data architecture

for bioinformatics [8]. JPL has taken a novel approach towards solving the problem by

exploiting web technologies usually dedicated to e-commerce, combined with a rich

metadata-based environment. The article discusses the approach to developing a proto-

type based on data architecture for the discovery and validation of disease biomarkers

within a biomedical research network. Biomarkers are measured parameters of normal

biologic processes, pathogenic processes (cancer research), or pharmacologic responses

to a therapeutic intervention. Biomarkers are of growing importance in biomedical re-

search for therapeutic discovery, disease prevention, and detection. A bioinformatics in-

frastructure is crucial for supporting the integration and analysis of large, complex biolog-

ical and epidemiologic data sets [8].

Based on Customer Data Integration by Dyché and Levy, metadata are defined as data

about data [25, pp. 44]. They are basically descriptive data about individual data ele-

ments. Metadata can include system-level metadata, used by applications to navigate and

distinguish certain data types. In addition, the term may also include user-defined metadata that involve persistent definitions of important data fields. Four main types of

data can be found in a business. These are [25]:


• Transactional data (records of individual patient interactions)
• Reference data (identification of a cancer type, patient, or other healthcare entity)
• Relationship data (description of an entity in order to relate it to other entities)
• Metadata (data about data, i.e. descriptive data about individual data elements).

According to the article A Methodology for Sharing Archival Descriptive Metadata in

a Distributed Environment, by Ferro & Silvello, the core question is how to exploit wide-

ly accepted solutions for interoperation, for example the pair Open Archives Initiative

Protocol for Metadata Harvesting (OAI-PMH) and the Dublin Core (DC) metadata format

[31]. The goal is to deal with the peculiar features of archival description metadata and

allow their sharing on the subject. Ferro and Silvello present a methodology for mapping

Encoded Archival Description (EAD) metadata into DC metadata records without losing

information. The methodology exploits Digital Library System (DLS) technologies, en-

hancing archival metadata sharing possibilities and at the same time considering archival

needs. Furthermore, it makes it possible to open valuable information resources held by

archives to the wider context of cross-domain interoperation among different cultural

heritage institutions [31].

Metadata go beyond the data model that lets business users know what types of infor-

mation are stored in the database [45, pp. 623]. Metadata provides an invaluable service.

When not available, this type of information needs to be gleaned, usually from friendly

database administrators and analysts. This is an inefficient and time-consuming way of

gathering information. For a data warehouse, metadata provides discipline, since changes

to the warehouse must be reflected in the metadata to be communicated to users. Related

to the metadata repository, Linoff and Berry argue that metadata should also be consid-

ered a component of the data warehouse [45, pp. 630]. The lowest level of metadata is the

database schema, the physical layout of the data. Metadata answers questions posed by

end users about the availability of data, gives them tools for browsing through the con-

tents of the data warehouse, and gives everyone more confidence in the data [45].

A good metadata system should include the following elements [45, pp. 630]:

• The annotated logical data model. The annotations should explain the entities and

attributes, including valid values.

• Mapping from the logical data model to the source systems.

• The physical schema

• Mapping from the logical model to the physical schema

• Common views and formulas for accessing the data. What is useful to one user

may be useful to others.

• Information about loads and updates

• Security and access information


• Interfaces for end users and developers, so that they share the same description of

the database.

3.5 Data Quality

According to the article Identifying, Investigating and Classifying Data Errors, by Duda,

high quality data are essential to the accuracy and validity of clinical study results [22].

Data quality assurance has been a particular emphasis in clinical trials, where extensive

personnel training and data monitoring programs are built into the study protocol in an

effort to prevent scientific misconduct and ensure compliance with the International Con-

ference on Harmonization’s Guidelines for Good Clinical Practice [22].

Duda claims that clinical trials can be elaborate and expensive, and the cohorts are not

always large or varied enough to answer broad research questions [22]. According to Du-

da, researchers and funding agencies seek to leverage existing clinical care data by pool-

ing data sets from multiple sites. The United States of America’s National Institutes of

Health (NIH) have indicated an interest in promoting and expanding such clinical re-

search networks by featuring them as a cornerstone of the NIH Roadmap for Medical

Research [22].

The U.S. Nationwide Health Information Network, a standards initiative for health in-

formation exchange over the Internet, supports complementary standards for both clinical

care and clinical research data in order to encourage and support the reuse of healthcare

data for observational studies and population monitoring. Duda argues that medical re-

search is experiencing a simultaneous upsurge in international research collaborations

[22].

Membership in multi-national research networks has grown exponentially, and publi-

cations by multi-national research teams receive more citations than similar work from

domestic collaborations [22]. These trends combine in the increased reuse of clinical care

data for international research collaborations. Data collected during routine patient care

are readily available and relatively inexpensive to acquire, so that even clinical sites in

resource-limited settings are contributing data to shared repositories or multi-site data

sets. Unfortunately scientists seldom investigate the quality of such secondary-use data as

thoroughly as data generated in clinical trials or similarly regulated studies [22].

Duda believes that some research groups rely on data cleaning performed at the data

coordinating center in order to detect data discrepancies, or they request that their partici-

pating sites perform regular quality self-assessments [22]. Given time and funding re-

strictions and a dearth of data management personnel in academic centers, it is likely that

many groups simply accept secondary-use data as they are. According to Duda, significant


challenges to high quality data exist within such international, multi-site research net-

works, but these issues can be remedied through well-planned, cost-effective quality

control activities. Duda underlines the necessity of data quality assessments for observa-

tional networks, as well as means of identifying and evaluating data errors and improving

the audit process [22].

Based on the article Quality and Value of the Data Resource in Large Enterprises by

Otto, enterprises are facing problems in managing the quality and value of their key data

objects [55]. The article presents the findings from a case study comprising six large en-

terprises. The study results point to the importance of the situational nature of master data

as a strategic resource, which must be considered when analyzing how the quality of data

affects their value for business [55].

The article Anchoring Data Quality Dimensions in Ontological Foundations, by Wand

and Wang, claims that poor data quality can have a severe impact on the overall effec-

tiveness of an organization [67]. A leading computer industry information service firm

indicated that it expects most business process reengineering initiatives to fail through

lack of attention to data quality. An industry executive report noted that more than 60% of

the surveyed firms (500 medium-size corporations with annual sales of more than $20

million) had problems with data quality [67].

According to the article Data Quality Assessment in Context, by Watts et al., in organ-

izations today, the risk of poor information quality is becoming increasingly high as larg-

er and more complex information resources are being collected and managed [72]. To

mitigate this risk, decision makers assess the quality of the information provided by their

IS systems in order to make effective decisions based on it. They may rely on quality

metadata: objective quality measurements attached by data managers to the information

used by decision makers. Watts et al. claim that decision makers may also gauge infor-

mation quality on their own, subjectively and contextually, assessing the usefulness of the

information for solving the specific task at hand. Although information quality has been

defined as fitness for use, models of information quality assessment have thus far tended

to ignore the impact of contextual quality on information use and decision outcomes.

Contextual assessments can be as important as objective quality indicators since they can

affect which information is used for decision-making tasks [72].

The research by Watts et al. offers a theoretical model for understanding users' contex-

tual information quality assessment processes [72]. The model is grounded in dual pro-

cess theories of human cognition, which enable simultaneous evaluation of both objective

and contextual information quality attributes. The findings of an exploratory laboratory

experiment suggest that the theoretical model provides an avenue for understanding con-

textual aspects of information quality assessment in concert with objective ones. The

model offers guidance for the design of information environments that can improve per-


formance by integrating both objective and subjective aspects of the users' quality assess-

ments [72].

Based on the article Master Data Management and Customer Data Integration for a

Global Enterprise, by Berson and Dubov, data quality is one of the key components of

any successful data strategy and data governance initiative, and is also one of the core

enabling requirements for MDM [2, pp. 135-136, pp. 305]. Conversely, MDM is a power-

ful technique that helps an enterprise to improve the quality of master data [2, pp. 117].

Berson and Dubov claim that a key challenge of data quality is incomplete or unclear

semantic definitions of what the data are supposed to represent, in what form, and with

what kind of timeliness requirements [2]. The metadata repository is the place where

these definitions are stored. The quality of metadata may be low, because there are many

data quality dimensions and contexts, each of which may require a different approach to

the measurement and improvement of the data quality. For example, in order to measure

and improve address information about customers, there are numerous techniques and

reference data sources that can provide an accurate view of a potentially misspelled or

incomplete address. Similarly, in order to validate a social security number or a driver’s

license number, it is possible to use a variety of standard sources of this information to

validate and correct the data [2].

The canonical approach intends to establish canonical data models where all the con-

text- and time-specific values of intermediate IS-shared attributes are considered purpose-

specific exceptions and replaced with canonical true data values [18]. Purpose-specific

requirements lead organizations to create purpose-specific applications each with similar,

yet inconsistent data values in differing formats. These inconsistencies have a dramatical-

ly negative impact on overall data quality [18]. In the canonical approach, data that do not belong to the agreed group of values are considered flawed. The approach does not take into account the differences between information processes. An information process can create incorrect data, which are then not considered.

The canonical approach presupposes an unambiguous meaning; other meanings are interpreted as faults to be deleted. If they are needed, they are formed from data attributes. Further, if the meaning of data changes as a function of time, the correct meaning has to be found, and other attributes are substituted or new attributes are created from the obsolete data.

Since the canonical approach does not pay attention to how data are generated, data inaccuracy may be due to the fact that the information processes used to create the same data may differ. The approach does not recognize that there are two kinds of error sources regarding data. The first error source is that the data have not been created correctly, i.e. a real error has occurred. The second is the


fact that there are processes that produce systematic errors, and that different types of data

can be produced by various processes.

The federative approach advocates that there are different types of contents, and if they

represent different contexts, this is acceptable. If differing contents describe the same context, they are incorrect and must be corrected, but there may also be deliberately different values for a particular result. Consider, for example, temperature measurement in the human body: a patient's body temperature is measured before surgery and shows no fever, while after surgery the patient has a slight fever. Both cases can be inter-

preted as representing normal body temperature, since it is known that the temperature

after surgery is slightly higher. As a result it does not make sense to correct the latter val-

ue as an error.

In the federative approach the data always reside in their original location. The ap-

proach includes different rules for use when a context is flawed. If an information process

produces incorrect data during the data lifecycle, the data must be corrected.
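The body-temperature example can be sketched as follows: the same measured value is interpreted against context-specific rules instead of being corrected toward a single canonical value. The thresholds are illustrative assumptions only, not clinical reference values.

# A sketch of the body-temperature example above: the same measurement is
# interpreted through context-specific rules rather than corrected toward a
# single canonical value. The thresholds used here are illustrative only.

normal_upper_limit = {
    "pre-operative": 37.2,   # no fever expected before surgery
    "post-operative": 38.0,  # a slightly elevated temperature is expected
}

def interpret(temperature_celsius, context):
    """Return the contextual interpretation of a temperature reading."""
    limit = normal_upper_limit[context]
    return "normal in this context" if temperature_celsius <= limit else "fever"

print(interpret(37.0, "pre-operative"))    # normal in this context
print(interpret(37.8, "post-operative"))   # normal in this context
print(interpret(37.8, "pre-operative"))    # fever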

3.6 Data Consolidation

According to Loshin, data consolidation can be defined as data instances from different sources being located and brought together [47, pp. 179]. The integration tools use the parsing,

standardization, harmonization, and matching capabilities of the data quality technologies

to consolidate data into unique records in the master data model. Loshin argues that data

consolidation depends on the number of data sources that feed the master repository and

the expectation of a single view of each master entity [47, pp. 174].

The federative approach consolidates data in their original place without data transfer or

extraction. The approach uses interoperable/shared attributes in order to build a linkage

between different registers. It is based on the processes of sharing and matching, but not

on integration. Integration is possible, and the reason for integrating is to ensure interop-

erability, not to match and cleanse data. At least two data storages with attributes are

needed when integrating, and when cross-reference is performed, the interoperable attrib-

utes are the result.

The canonical philosophy consolidates data into the golden record by applying the

master data model to entities. Relating to the stance on data storages, the canonical ap-

proach advocates reporting databases in data storages, while in the federative approach,

the data reside in their original location. However, in the federative approach, repositories

with the meanings of attributes, registered cross-references, and descriptions of contents

are used in federation.


3.6.1 Sharing

Based on Loshin, the essence of MDM revolves around data sharing and interchange [47,

pp. 146]. Information is shared using data integration tools in three ways, as follows:

• Data transformation
• Data monitoring
• Data consolidation.

Data transformation means that data are transformed into a format that is acceptable to

the target architecture. Data monitoring provides a way of incorporating the types of data

rules both discovered and defined during the data profiling phase. Data consolidation

means that data are consolidated into unique records in the master data model [47].

Heimbigner and McLeod [37] argue that the federated architecture provides mecha-

nisms for sharing data, for sharing transactions (via message types), for combining infor-

mation from several components, and for coordinating activities among autonomous

components (via negotiation). A prototype implementation of the federated database

mechanism is currently operational on an experimental basis [37].

3.6.2 Mapping

According to Loshin, data mapping is the process of creating data element mappings be-

tween two distinct data models [47]. Data mapping is used as a first step in a wide variety

of data integration tasks. The first task is data transformation or data mediation between a

data source and a destination. The second task is identification of data relationships as

part of data lineage analysis. The third task is discovery of hidden sensitive data such as

the last four digits of a social security number hidden in another user ID as part of a data

masking or de-identification project. Finally, the fourth task is consolidation of multiple databases into a single database and identification of redundant columns of data for consolidation or elimination [47].
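As an illustration of the first task, the following minimal Python sketch maps source attributes to a hypothetical mediated schema; all attribute names are assumptions made for the example.

# A minimal sketch of data element mapping between two data models
# (all attribute names are hypothetical).
SOURCE_TO_MEDIATED = {
    "HETU": "personal_identity_code",   # shared identifier
    "DgCode": "diagnosis_code",
    "DgDate": "diagnosis_date",
}

def map_record(source_record: dict) -> dict:
    """Rename source attributes to the mediated schema; unmapped attributes are dropped."""
    return {mediated: source_record[src]
            for src, mediated in SOURCE_TO_MEDIATED.items()
            if src in source_record}

print(map_record({"HETU": "010180-123A", "DgCode": "C50.9", "DgDate": "2014-03-05"}))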

3.6.3 Matching

Based on the article Data Matching Concepts and Techniques for Record Linkage, Entity

Resolution, and Duplicate Detection, by Christen, data matching is the task of identifying, matching, and merging records that correspond to the same entities from several databases [9]. The entities under consideration most commonly refer to people, for example patients, customers, taxpayers, or travelers.


Christen claims that a major challenge in data matching is the lack of common entity

identifiers in the databases to be matched [9]. The matching needs to be conducted using

attributes that contain partially identifying information, such as names, addresses, or dates

of birth. Such identifying information is often of low quality. Personal details in particular

suffer from frequently occurring typographical variations and errors, and such infor-

mation can change over time, or it is only partially available in the databases to be

matched [9].

DAMA believes that one of the greatest ongoing challenges in MDM is the process for

the matching, merging, and linking of data from multiple systems about the same person

(patient), group (gender), place (location), or thing (diagnosis) [18].

The key to building a data integration application is the source description, a kind of

glue that connects the mediated schema and the schemas of the sources [20, pp. 11]. The

descriptions specify the properties of the sources that the system needs to know in order

to use their data. The main component of source descriptions is semantic mapping, which

relates the schemata of the data sources to the mediated schema [20].

Semantic mappings specify how attributes in the sources correspond to attributes in the mediated schema, and how differences in the grouping of attributes into tables are resolved [20].

The semantic mappings specify how to resolve differences in how data values are speci-

fied in different sources. Thus specification between every pair of data sources is unnec-

essary. The semantic mappings are specified declaratively, which enables the data inte-

gration system to reason about the contents of the data sources and their relevance to a

given query and to optimize the query execution [20].

Data matching is the problem of finding structured data items that describe the same real-world entity [20, pp. 173]. In many integration situations, merging multiple databases with identical schemas is not possible without a unique global ID, and it is necessary to decide which rows are duplicates.

Doan et al. name the matching techniques available as follows [20]:

• Rule-based matching: the aim is to match tuples from two tables with the same schema, but generalizing to other contexts is straightforward. The rule computes a similarity score between a pair of tuples x and y, for example as a linearly weighted combination of attribute-level similarities (see the sketch after this list).
• Learning-based matching: supervised learning is used to automatically create matching rules from labeled examples.

• Matching by clustering.

• Probabilistic approaches to data matching.

• Collective matching.

• Scaling up data matching.
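The following minimal Python sketch illustrates rule-based matching under stated assumptions: the attribute names, weights and acceptance threshold are hypothetical, and the string similarity measure is only one possible choice.

# A minimal sketch of rule-based matching (hypothetical attributes, weights and threshold):
# the similarity of two tuples is a linearly weighted combination of attribute similarities.
from difflib import SequenceMatcher

WEIGHTS = {"last_name": 0.4, "first_name": 0.3, "birth_date": 0.3}

def attribute_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def tuple_similarity(x: dict, y: dict) -> float:
    return sum(w * attribute_similarity(x[attr], y[attr]) for attr, w in WEIGHTS.items())

x = {"last_name": "Virtanen", "first_name": "Maija", "birth_date": "1960-05-01"}
y = {"last_name": "Wirtanen", "first_name": "Maija", "birth_date": "1960-05-01"}
print("match" if tuple_similarity(x, y) >= 0.85 else "no match")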


3.6.4 Data Federation

According to van der Lans, data federation refers to the combining of autonomously op-

erating objects [44]. Basically data federation means combining autonomous data stores

to form one large data store. Data federation is a form of data virtualization where the

data stored in a heterogeneous set of autonomous data stores is made accessible to data

consumers as one integrated data store by using on-demand data integration. The defini-

tion is based on data virtualization [44].

The difference between federated and canonical data integration is that the data are not collected into one IS as a harmonized record (i.e. a golden record), but instead the data are left in the original location [44]. Technically data federation is conducted with the help of

a metadata repository, which maps federated data sets to each other by using interopera-

ble attributes. The metadata repository comprises data storage for federation rules, mean-

ings of attributes, descriptions of data formats, and definitions of mappings. Metadata

descriptions are created, modified and used only when data federation is needed. New

federation rules can be added whenever needed, e.g. for new reporting needs. The idea is

to avoid large data banks and to produce practically useful results from the very begin-

ning [44].
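The following minimal Python sketch illustrates this idea under stated assumptions: two hypothetical registers stay in their original form, a small metadata repository records which interoperable attribute links them, and a combined view is produced only on demand.

# A minimal sketch of data federation (hypothetical registers and attribute names):
# the data stay in their original stores, and a metadata repository records which
# interoperable attribute links them; records are combined only on demand.
hospital_register = [
    {"hetu": "010180-123A", "diagnosis": "C50.9"},
]
cancer_register = [
    {"personal_id": "010180-123A", "tnm_stage": "T1N0M0"},
]

metadata_repository = {
    # federation rule: which attribute in each store carries the shared identifier
    "interoperable_attribute": {"hospital_register": "hetu", "cancer_register": "personal_id"},
}

def federate(person_id: str) -> dict:
    """Build an on-demand, combined view for one person without copying the stores."""
    rule = metadata_repository["interoperable_attribute"]
    view = {}
    for rec in hospital_register:
        if rec[rule["hospital_register"]] == person_id:
            view.update(rec)
    for rec in cancer_register:
        if rec[rule["cancer_register"]] == person_id:
            view.update(rec)
    return view

print(federate("010180-123A"))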

Data federation should make it possible to bring data together from data stores using

different storage structures, different access languages, and different application pro-

gramming interfaces (APIs). An application that uses data federation is able to access

different types of database servers and files with various formats. It makes it possible to

integrate data from all these data sources. Data federation also offers features for trans-

forming the data. It allows applications and tools to access the data through various APIs

and languages [44].

Data stores accessed by data federation are able to operate independently. This means

that they can be used outside the scope of data federation. Regardless of how and where

data are stored, they are presented as one integrated data set. This implies that data fed-

eration involves transformation, cleansing, and possibly even enrichment of data [44].

Based on Dahlberg et al., data federation, which uses a data governance framework artifact as a tool, may differ from canonical data integration [13]. Data are made interoperable without changing the original data in federated data storages. Canonical data

integration often means and leads to data transformation, cleansing, harmonization and/or

standardization. In contrast, data federation makes it possible to use simultaneously data

from data storages with different technical, informational and social characteristics. It

requires that users of the federated data understand the meaning of the outcomes. The purpose of data federation is to make data storages linked and interoperable through data mapping, without changing the original data in the federated data storages [13].

Current problems in data federation are related primarily to two issues [13]:

• Data ontology

• Insufficient attention paid to the governance of data.

The article Towards Information Systems as a Science of Meta-Artifacts by Iivari ar-

gues that more emphasis should be given to the nature of Information Systems as an ap-

plied, engineering-like discipline that develops various meta-artifacts to support the de-

velopment of IS artifacts [41]. The article refers to data federation with the use of arti-

facts. Iivari argues that building such meta-artifacts is a complementary approach to the

theory with practical implications type of research. The primacy assigned to theory and

research method has effectively excluded constructive research on building meta-artifacts

from the major IS journals. The article also claims that information systems as a category

of IT artifacts, and especially the focus on IS development, can help to distinguish the IS

discipline from its sister and reference disciplines [41].

According to the article Data Federation Methods and System, by Chen et al., a meth-

od is provided for processing tree-like data structures in a streaming manner [5]. An ini-

tial context of name/value bindings is set up, and a tree of objects is constructed. Each element in the tree of objects is represented as a function object, which accepts a context parameter and a target parameter to which it can send a stream of start, content, and end events to represent the tree output. The parse tree of objects is examined for element names that are

recognized as commands. The commands are converted into special function objects that

implement the command’s semantics. Other elements that are not recognized as commands are mapped to a default function object in data federation [5].

In managing datification and data federation in open systems environments, master data are non-transactional data saved by several ISs, which, from the data federation perspective, provide links to various data sets and sources [2]. The golden record approach

emerged as a solution to the problem of what to do with inconsistent and fragmented data,

saved in storages that had been brought together. The approach can be defined as a single,

well-defined version of all the data entities in an organizational ecosystem [2].

It is necessary to understand the use of the artifact based on nine questions (the princi-

ples of data federation) [15]. Contextual metadata are also needed in order to execute data federation. Data federation starts from an understanding of the contextual metadata. The ability first to federate and combine data, and then to analyze them, makes any data potentially valuable and worth gathering, maintaining and describing [15].

Based on the article The MDM Golden Record Is Dead, Rest in Peace – Welcome In-

terpreted Interoperable Attributes in Open Systems Environments, by Dahlberg, data fed-

erations are characterized by differences in the formats, structure, granularity and other


characteristics of data [16]. Dahlberg claims that data federation can happen only if there

is a connecting data element, such as a patient (a customer) or a type of care (a product)

available [16].

Based on Design of Enterprise Systems by Giachetti, a federated database is defined as

a collection of heterogeneous, component databases, over which a global view is created

[35, pp. 396]. This makes it possible for applications to treat the separate databases as a single database.

The article Federated Database Systems for Managing Distributed, Heterogeneous and

Autonomous Databases, by Sheth and Larson, describes a federated database system

(FDBS) as a collection of cooperating database systems that are autonomous and possibly

heterogeneous [65]. The article defines the reference architecture for distributed database management systems based on system and schema viewpoints and shows how various FDBS architectures can be developed [65].

The article A Federated Architecture for Information Management, by Heimbigner and

McLeod, presents an approach to the coordinated sharing and interchange of computerized information, described with an emphasis on partial, controlled sharing among autonomous databases [37]. Office information systems provide a particularly appropriate

context for this type of information sharing and exchange. The federated database archi-

tecture consists of a collection of independent database systems, which are united into a

loosely coupled federation in order to share and exchange information [37].

According to Heimbigner and McLeod, a federation consists of components (of which

there may be any number) and a single federal dictionary [37]. The components represent

individual users, applications, workstations, or other components of an office information

system. The federal dictionary is a specialized component that maintains the topology of

the federation and oversees the entry of new components. Each component in the federa-

tion controls its interactions with other components by means of an export schema and an

import schema. The export schema specifies the information that a component will share

with other components, while the import schema specifies the non-local information that

a component wishes to manipulate [37].

3.6.5 Data Integration

Data integration is a form of data federation or data virtualization. According to van der

Lans, data integration is the process of combining data from a possibly heterogeneous set

of data stores to create one unified view of all the data [44, pp. 8]. Data integration is in-

volved, among other things, in joining data, in transforming data values, enriching data,


and cleansing data values. The definition itself does not take a stance on how integration

occurs.

Linthicum defines Enterprise Application Integration (EAI) as the unrestricted sharing

of data and business processes among any connected applications and data sources in the

enterprise [46]. Sharing information among different systems is particularly difficult as

many of them are not designed to access anything outside their own proprietary technolo-

gy [46, pp. 3-4].

EAI allows many of the stovepipe applications (e.g. patient management systems,

ERP, CRM etc.) to share both processes and data. EAI does not need to make changes to

the applications or data structures. According to Gable, EAI allows diverse systems to

connect with one another quickly in order to share data, communications, and processes,

alleviating the information silos that plague many businesses [33, pp. 48]. EAI implemen-

tation integrates the Information Systems (ISs) so that a data warehouse can aggregate

account data, providing a single view to the end user.

On-demand integration refers to the issue of when data from a heterogeneous set of data stores are integrated. In data federation, integration takes place on the fly, and not in a batch: only when data consumers ask for data are the data accessed and integrated. Thus the data are not stored in an integrated way, but remain in their original location and format [44].

Maier et al. argue that data integration requires mutual understanding in user organiza-

tions or applications. It relates to exchanging data on how data resources are addressed

over a network and in the most generic sense over the Internet [50]. Organizations must

consider which character set to use. They are expected to know about the internal struc-

ture of documents, i.e. text markup and about the scope or domain, in which the specified

names in the markup are valid. Finally, it is necessary to know how to define a schema,

the structure of the elements in a semi-structured text, and how to translate a document

that is an instance of one schema so that it conforms to another schema [50].

According to Giachetti, data integration technologies either create a single, unified

model of the data by merging databases together, or provide the tools and technologies to

move data between systems [35, pp. 393-397]. To share data across the enterprise, the options include the following: point-to-point integration, which connects two databases together by defining data translators between them; data middleware (e.g. ODBC), which creates interfaces between the database and all the other applications; and a single, centralized database for the entire organization, from which every application writes and reads (e.g. ERP) [35].

The term federated data refers to a collection of cooperating but autonomous databases.

It consists of a collection of heterogeneous, component databases, over which a global

view is created, so that applications can treat the separate databases as a single database.


This is implemented by means of data mediation, which converts data from one format to

another. The original data sources are left untouched [35].

A data warehouse collects data from one or more operational databases, integrates the

data, and makes them available for querying, reporting, and analysis [35]. No transactions are executed in connection with the warehouse; it is only used to obtain information.

There is a relationship between the data warehouse and the operational databases. The

relationship is called a process of ETL, which takes data from the operational databases,

cleans the data, and then loads them into the data warehouse. The process is executed in batches.

Giachetti describes data integration as a process, which is executed by taking the data

structure and data definitions from the legacy systems, redesigning the structure and data

definitions, and recreating them in the new system [35, pp. 409]. The conversion process

is called ETL. The data objects are extracted from the old data systems, cleansed, then

transformed and loaded into the new database. Cleansing of the data is required to ensure

that they are correct, complete, consistent, and that they adhere to business rules (dirti-

ness: missing data attributes, noise words, different languages, misspellings, multiple

terms for the same data).
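The following minimal Python sketch illustrates such an ETL step with hypothetical legacy rows and cleansing rules; it is an illustration of the principle rather than of any particular tool.

# A minimal ETL sketch (hypothetical source rows and cleansing rules):
# extract rows from a legacy structure, cleanse them, transform them to the new
# structure, and load them into the target table.
LEGACY_ROWS = [
    {"name": "breast cancr", "stage": "ii", "dg_date": "05.03.2014"},
    {"name": "Breast cancer", "stage": None, "dg_date": "17.11.2015"},
]
SPELLING_FIXES = {"breast cancr": "breast cancer"}

def cleanse(row: dict) -> dict:
    name = SPELLING_FIXES.get(row["name"].lower(), row["name"].lower())
    return {"name": name, "stage": (row["stage"] or "unknown").upper(), "dg_date": row["dg_date"]}

def transform(row: dict) -> dict:
    day, month, year = row["dg_date"].split(".")
    return {"diagnosis": row["name"], "stage": row["stage"], "diagnosis_date": f"{year}-{month}-{day}"}

target_table = [transform(cleanse(r)) for r in LEGACY_ROWS]   # the "load" step
print(target_table)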

According to Doan et al., most integration systems are based on warehousing or virtual

integration [20, pp. 9]. In warehousing, data from the individual data sources are loaded

and materialized into a physical database (warehouse), where queries about the data can

be answered. In virtual integration, the data remain in the sources and are accessed as

needed at query time [20].

Five mainstay challenges of data integration are as follows [25, pp. 72]:

The need for a different development framework

Difficulty with stakeholder enlistment

Operational data are not available

Nonexistent metadata

Poor data quality.

Customer Data Integration by Dyché and Levy [25, pp. 98] argues that the goal of matching is to identify all the data for a particular customer held by the enterprise. The

golden record implies a single, standard record that has usually been generated with data

from multiple source systems [25, pp. 101]. For example, the CDI hub recognizes, match-

es, and consolidates the data into a master record for that customer. The CDI hub will pull

the most accurate value from each source system so that the golden record may contain

the customer’s first and last name from one system, the phone number from another sys-

tem, and the home address from another system (Figure 7). CDI hubs are distinguished

from other technology solutions by their ability to identify the optimal values that com-

prise the golden record for a customer. The CDI hub has the ability to tie-break between


data sources and individual elements, and it decides on the best combination of elements

to comprise the golden record. Ultimately, the golden record becomes the enterprise’s

view of an individual customer’s information [25, pp. 102].
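The following minimal Python sketch illustrates how such tie-breaking could be expressed as survivorship rules; the source systems, attributes and precedence orders are hypothetical.

# A minimal sketch of golden-record construction (hypothetical survivorship rules):
# for each attribute, the value is pulled from the source system judged most reliable.
SOURCE_RECORDS = {
    "crm":     {"first_name": "Maija", "last_name": "Virtanen", "phone": None, "address": "Old Street 1"},
    "billing": {"first_name": "M.", "last_name": "Virtanen", "phone": "+358401234567", "address": None},
    "webshop": {"first_name": "Maija", "last_name": "Wirtanen", "phone": None, "address": "New Street 2"},
}
# survivorship rule: per attribute, the order in which sources are trusted
PRECEDENCE = {
    "first_name": ["crm", "webshop", "billing"],
    "last_name":  ["crm", "billing", "webshop"],
    "phone":      ["billing", "crm", "webshop"],
    "address":    ["webshop", "crm", "billing"],
}

def golden_record() -> dict:
    record = {}
    for attribute, sources in PRECEDENCE.items():
        for source in sources:
            value = SOURCE_RECORDS[source].get(attribute)
            if value:   # first non-empty value from the most trusted source wins
                record[attribute] = value
                break
    return record

print(golden_record())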

Figure 7 Generating the Golden Record [25, pp. 99]

According to Watson, there is a lack of data integration in most organizations [72, pp. 22]. Due to the limitations of the available technology, early computer systems were not

integrated. Organizations created simple file systems to support a particular function. In-

tegration is a long-term goal. As new systems are developed and old ones rewritten, or-

ganizations can evolve integrated systems. It would be too costly and disruptive to try to

solve the data integration problem in one step.

3.6.6 Data Warehouse, Storage and Repository

DAMA defines a data warehouse as a combination of an integrated decision support da-

tabase and the related software programs [18, pp. 197]. The programs take care of collect-

ing, cleansing, transforming, and storing data from internal and external sources.

Based on Data Mapping Diagrams for Data Warehouse Design with UML, by Luján-Mora et al., in Data Warehouse (DW) scenarios ETL (Extract, Transform, Load) processes are responsible for the extraction of data from heterogeneous operational data


sources, their transformation (conversion, cleaning, normalization, etc.) and their loading

into the DW [49].

The article by Luján-Mora et al. presents a framework for the design of the DW back-

stage and the respective ETL processes, based on the key observation that this task fun-

damentally involves dealing with the specificities of information at very low levels of

granularity, including transformation rules at the attribute level [49]. Specifically, the

article introduces a disciplined framework for the modeling of relationships between

sources and targets at different levels of granularity, which include coarse mappings at the

database and table levels to detailed inter-attribute mappings at the attribute level [49].

A data storage is used for archiving data in electromagnetic or digital form. According

to Dahlberg et al., data storages have different IS technical, informational, and socio-

contextual data characteristics [12]. In addition, the accountabilities of various data stor-

ages seem to be unclear in most organizations [11]. Various data storages provide frag-

mented, overlapping, and even controversial data on the same issue.

A repository is a particular kind of setup within an overall IT structure, such as a group

of databases, where an organization keeps data of various kinds. Based on the article

Adex – A Meta Modeling Framework for Repository-Centric Systems Building, by Red-

dy et al., enterprises use repositories for storing, and subsequently leveraging, the descrip-

tions of diverse information systems and the various complex relationships present in

them [60, pp. 1]. An information system repository should also support modeling of pro-

cesses that coordinate various system-building activities.

Based on Watson, a data registry is equivalent to the reference repository, which in-

cludes metadata. It contains a description of each data type, format, programming stand-

ards (e.g. volume in liters) and the meaning of a field [71, pp. 437]. In the case of a data warehouse, a data registry also includes specifications of the operating system with which the data were created, data transformations, and the frequency of data retrievals. Analysts

need to enter the metadata in order to design their analyses and learn the contents of data

warehouses. If a data registry is not found, it should be implemented and maintained in

order to ensure the integrity of a data warehouse [71].
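The following minimal Python sketch shows what a single, hypothetical registry entry of this kind might look like; the field descriptions are assumptions made for the example.

# A minimal sketch of a data registry entry (all field descriptions are hypothetical):
# the registry stores metadata about each field rather than the data themselves.
from dataclasses import dataclass

@dataclass
class RegistryEntry:
    field_name: str
    data_type: str
    data_format: str
    unit_or_standard: str
    meaning: str

registry = {
    "tumour_size": RegistryEntry(
        field_name="tumour_size",
        data_type="decimal",
        data_format="0.0",
        unit_or_standard="millimetres",
        meaning="Largest diameter of the primary tumour measured at diagnosis",
    ),
}
print(registry["tumour_size"])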

3.7 Data Management Framework

DAMA defines data governance as a core function of the data management framework

[18, pp. 37]. Data governance interacts with and influences each of the surrounding data

management functions. Data management in turn is a high-level business process [18, pp.

17]. The functions are the planning and execution of policies, practices, and projects. They acquire, control, protect, deliver, and enhance the value of data and information assets


[18]. Shleifer and Vishny define corporate governance as referring to the way in which

suppliers of finance assure themselves a return on their investment [62, pp. 737].

Data governance can prevent data deficiency problems. It is possible to agree on accountabilities to prevent ontological or design failures and the operational problems of centralizing the management and interpretation of the data. This makes the governance of the organization match its ontological stance [14].

The federative approach is one of the solutions for governing and managing data effi-

ciently in an open information systems environment. The approach governs all the data in

a certain context, and context-based governance models are built. The models define clear accountabilities and an understanding of the sources (sensor data, social media data, IoT etc.), as

well as the temporal and spatial dimensions. An understanding of ontology is a keystone

in governance. There is no such element in the canonical approach and it is not given any

attention.

Relating to the canonical approach, one domain (e.g. product data, customer data, de-

vice data) is viewed at a time. The aim is to go through all the information systems of the

domain. As a result a hierarchical governance model is built, governing who owns what

in each phase. Since it is unilateral, no account is taken of the different dimensions of the

data and how they should be governed. The canonical approach pays no attention to the

temporal and spatial dimensions, accountabilities, structural data, distinct images, text

data or accurate data. All the data are interpreted indistinguishably without considering

the location where the data are created. The canonical philosophy considers that an organ-

ization takes responsibility for its own IS and thus knows the meaning of the data stored

in each database.

3.7.1 Data and Corporate Governance

Based on the article A Framework for the Corporate Governance of Data – Theoretical

Background and Empirical Evidence Governance, by Dahlberg and Nokkala, in a modern

organization, IT and digital data have transformed from being functional resources to

being integral elements of business strategy [14]. They apply the framework to the gov-

ernance of data relating to aging societies, that is, to answer the question of how best to

manage the provision of services with digital data enablement and support to citizens.

Dahlberg and Nokkala disclose the results of two recent surveys, with 212 and 68 re-

spondents respectively, on the business significance of data governance [14]. The survey

results show that good governance of data is considered critical to organizations. As a result of the continuous increase in ISs and data storage systems, overlapping data on the same citizens, services, and professionals are also increasing. Dahlberg and Nok-


kala argue that this is a managerial issue, because only business professionals know what

the content of data should be and what data are needed to perform specific tasks [14, pp.

28].

The article A Survey of Corporate Governance, by Shleifer and Vishny, is focused on

research on corporate governance [62]. Shleifer and Vishny argue that most world-wide

corporate governance mechanisms with large share holdings, relationship banking, and

takeovers can be viewed as examples of large investors exercising their power [62, pp.

739]. Corporate governance deals with constraints that managers put on themselves, or

that investors put on managers, to reduce the ex post misallocation and thus to induce

investors to provide more funds in advance [62, pp. 743]. Successful corporate govern-

ance systems combine significant legal protection of at least some investors with an im-

portant role for large investors [62, pp. 774].

Based on the article The State of Corporate Governance Research, by Bebchuk and

Weisbach, the special issue on corporate governance, co-sponsored by the Review of Fi-

nancial Studies and the National Bureau of Economic Research (NBER), states that poor

governance can limit capital flows and the integration of capital markets in the global

economy [1, pp. 952]. Bebchuk and Weisbach point out that corporate governance is in part a product of the legal systems set in place and the legal infrastructure accompanying them.

According to the article The Governance of Inter-organizational Coordination Hubs,

by Markus and Bui, business-to-business interactions are increasingly conducted through

inter-organizational coordination hubs, in which standardized information technology–

based platforms provide data and business process interoperability for interactions among

the organizations in particular industrial communities [51]. Because the governance of

inter-organizational arrangements is believed to affect their efficiency and effectiveness,

the article explores how and why inter-organizational coordination hubs are governed.

Analysis of the relevant prior theory and case examples shows that coordination hub

governance is designed to balance the sometimes conflicting needs for capital to invest in

new technology, for participation of industry members, and for the protection of data re-

sources. The findings by Markus and Bui suggest that the governance of inter-

organizational coordination hubs is not the starkly categorical choice between collective

(member-owned) and investor-owned forms, as suggested by the prior theory [51].

3.7.2 Data Management

DAMA defines data management as a business function [18]. It is responsible for plan-

ning, controlling and delivering data and information assets. Management consists of the

disciplines of development, execution, and supervision of plans, policies, programs, pro-


jects, processes, practices and procedures. The disciplines control, protect, deliver, and

enhance the value of data and information assets [18].

Data management includes database administration-database design, implementation,

and production support [18]. It is a responsibility shared between the data management

professionals within IT organizations and the business data stewards, who represent the

collective interests of data producers and information consumers [18, pp. 5]. The profes-

sionals serve as the expert curators and technical custodians of the data, while stewards

serve as the appointed trustees for data assets.

The holistic data management function encompasses the following [18, pp. 6]:

Data governance

Data architecture management

Data development

Data operations management

Data security management

Data quality management

Reference and master data management

Data warehousing and business intelligence management

Document and content management

Metadata management.

3.8 Discovery of Data from Large Data Sets

Data mining refers to a set of methodologies aimed at finding relevant information from

large masses of data, i.e. Big Data [61]. Data mining can be applied very broadly. Typi-

cally, the data used in data mining include, for example, measurements of industrial pro-

cesses, and excerpts from the customer database or web server log files. Definitions of the

purposes of data mining do not limit the methods available. In most cases, the algorithms

used are, for example, various clustering, correlations, neural networks, self-organizing

maps, etc. Generally speaking, the successful utilization of data mining needs the most

relevant data with their various holistic understanding of the variables. Also, a simple

innovative approach, for example, data visualization, can help to see the benefits of the

data warehouse from an entirely new perspective [61].

Most definitions of Big Data focus on the size of data in storage. Size matters, but

there are other important attributes of Big Data, namely data variety and data velocity.

The three Vs of Big Data (volume, variety, and velocity) constitute a comprehensive def-

inition, and they bust the myth that Big Data is only about data volume. In addition, each

of the three Vs has its own ramifications for analytics [61]. Attributes can be divided into


two groups: stable and flexible. Stable attributes are attributes whose values cannot be

changed (e.g. age or maiden name), while the values of flexible attributes can be changed

[59].

Big Data are one driver of the emergence of the federative approach. The approach re-

gards Big Data as a whole which can be processed in its original location. The structure

of the data does not play a crucial role, as the interoperable, shared attributes with cross-

references take care of data harmonization. Based on various different data storages, Big

Data can be defined as complex data, and data federation is necessary in order to solve

data integration of this kind, where attributes are shared between different data storages

and databases.

The canonical philosophy aims to build a canonical data model with separate entities

in order to gain the golden record. Due to the size of Big Data, the golden record approach faces critical problems, because Big Data consist of heterogeneous data formats, e.g. social media, data streaming, IoT etc., which are difficult or impossible to drive into the golden record without sacrificing data quality.

The canonical approach builds a single data model from Big Data, but if the information systems are not governed by the organization itself, no data model can be found and it is not possible to create one, since the data are external and owned by software vendors. In this

case attention is paid to public and well-known data storages, where the data are better

available, but federation of the data is difficult.

3.8.1 Data Mining

The article Framework for Early Detection and Prevention of Oral Cancer Using Data

Mining, by Sharma and Om, proposes an ED&P framework, which is used to develop a

data mining model for early detection and prevention of malignancy of the oral cavity

[64]. The database of 1025 patients has been created and the required information stored

in the form of 36 attributes. According to Sharma and Om, data mining in clinical data

sets is one of the extensively researched areas in computer science and information tech-

nology, owing to the wide influence exhibited by this computational technique in diverse

fields including finance, clinical research, multimedia, education and the like. Adequate

surveys and literature have been devoted to clinical data mining, an active interdisciplinary area of research that is considered to be a consequence of applying artificial intelligence and data mining concepts to the field of medicine and healthcare [64].

The article Data Mining in Clinical Data Sets: A Review, by Jacob and Ramani, aims

to provide a review on the foundation principles of mining clinical data sets, and presents

the findings and results of past research on utilizing data mining techniques for mining


healthcare data and patient records [42]. The scope of the article is to present a brief re-

port on previous investigations made in the sphere of mining clinical data, the techniques

applied, and the conclusions reached [42].

According to Data Mining Techniques by Linoff and Berry, memory-based reasoning

(MBR) results are based on analogous situations in the past [45, pp. 321]. In medical

treatments the most effective treatment for a given patient is probably the treatment that

resulted in the best outcomes for similar patients. MBR can find the treatment that pro-

duces the best outcome. It does not care about the format of the records. MBR only takes

into consideration the existence of two operations as follows [45]:

A distance function capable of calculating a distance between any two records

A combination function capable of combining results from several neighbors to

arrive at an answer.

These functions can be defined for many kinds of records, including records with

complex or unusual data types, such as geographic locations, images, audio files, and free

text. These types of data are usually difficult to handle by other analysis techniques. One

case study presented by Linoff and Berry describes using MBR for medical diagnosis [45,

pp. 323]. They introduce an example that takes advantage of ideas from image processing

to determine whether a mammogram is normal or abnormal (Figure 8).

MBR can be applied to the identification of abnormal mammograms. A radiologist

learns how to read mammograms by studying thousands of them, before she/he ever sees

any patients [45, pp. 332]. The approach essentially takes many pre-classified mammo-

grams and, for a new mammogram, finds the ones that are closest. The idea is that two

identical mammograms require no additional information so that their mutual information

similarity is maximized (the resulting distance between them is zero). If there is no rela-

tionship at all between the pixels in the images, then the images are not similar [45].
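The following minimal Python sketch illustrates the two operations with hypothetical feature vectors and labels: a Euclidean distance function finds the nearest pre-classified cases and a majority-vote combination function turns the neighbours into an answer.

# A minimal sketch of memory-based reasoning (hypothetical feature vectors and labels):
# a distance function finds the nearest pre-classified cases and a combination
# function (here a majority vote) turns the neighbours into an answer.
import math
from collections import Counter

KNOWN_CASES = [
    ([0.1, 0.2], "normal"),
    ([0.2, 0.1], "normal"),
    ([0.8, 0.9], "abnormal"),
    ([0.9, 0.7], "abnormal"),
]

def distance(a, b):
    return math.dist(a, b)          # Euclidean distance between two feature vectors

def combine(neighbours):
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def classify(new_case, k=3):
    neighbours = sorted(KNOWN_CASES, key=lambda case: distance(new_case, case[0]))[:k]
    return combine(neighbours)

print(classify([0.85, 0.8]))        # -> 'abnormal'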


Figure 8 Basic Idea of MBR [45, pp. 333]

In any data warehousing environment, each piece of such metadata is available somewhere. It may exist in scripts written by the DBA, in e-mail messages, in docu-

mentation, in the system tables in the database, and so on. A metadata repository makes

this information available to the users in a format they can readily understand. Data ware-

houses store and retrieve clean, consistent data effectively [45].

According to the article Data Mining Techniques in Health Informatics: A Case Study

from Breast Cancer Research by Lu et al., the healthcare domain covers a vast amount of

complex data generated and developed over the years through electronic patient records,

disease diagnoses, hospital resources and medical devices [48, pp. 56]. The article intro-

duces Knowledge Discovery in Databases (KDD) as a substitute for traditional methods

of data analysis. KDD automatically searches large volumes of data for interesting pat-

terns, useful information and knowledge. Data mining plays a key role in KDD. It brings

a set of techniques and methods that can be applied to the processed data to discover hid-

den patterns. Data mining can provide healthcare professionals with the ability to analyze

patient records and disease treatment over time, which in turn can help to improve the

quality of life for those facing terminal illnesses such as breast cancer [48].

Clinical data mining is an active interdisciplinary area of research that can be consid-

ered as arising from applying artificial intelligence concepts to medicine and healthcare. For example, in personalized medicine, based on a patient’s profile, history, physical examination and diagnosis, and utilizing previous treatment patterns, new treatment plans can effectively be proposed [48].

Data mining requires high-quality data. Mining consists of selecting relevant attributes, generating the corresponding data set, cleaning the data, and replacing missing values [48, pp. 64].
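The following minimal sketch, assuming the pandas library is available and using hypothetical columns, illustrates selecting the relevant attributes and replacing missing values.

# A minimal sketch of preparing a data set for mining (hypothetical columns),
# assuming the pandas library is available: select the relevant attributes,
# then clean the data by replacing missing values.
import pandas as pd

raw = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [54, None, 61],
    "tnm_stage": ["T1N0M0", "T2N1M0", None],
    "free_text_notes": ["...", "...", "..."],   # not relevant for this mining task
})

relevant = raw[["patient_id", "age", "tnm_stage"]].copy()
relevant["age"] = relevant["age"].fillna(relevant["age"].median())   # numeric gap -> median
relevant["tnm_stage"] = relevant["tnm_stage"].fillna("unknown")      # categorical gap -> label
print(relevant)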


In the case study by Lu et al., 16319 breast cancer patient records are extracted from

SBCDS (Southampton Breast Cancer Data System), and there are 22380 records, i.e.

instances, which show the patients’ cancer details [55]. The system does not record the order in which breast cancer treatments occur between presentations. Ultimately, the decisions about which information is sought will come from the clinical researchers [48].

3.8.2 Big Data

The article The World’s Technological Capacity to Store, Communicate and Compute

Information, by Hilbert and López, argues that in 2007 humankind was able to store 2.9 × 10^20 optimally compressed bytes, to communicate almost 2 × 10^21 bytes, and to carry out 6.4 × 10^18 instructions per second on general-purpose computers [39]. General pur-

pose computing capacity grew at an annual rate of 58%. The world’s capacity for bidirec-

tional telecommunication grew at 28% per year, closely followed by the increase in glob-

ally stored information (23%). Humankind’s capacity for unidirectional information dif-

fusion through broadcasting channels has experienced comparatively modest annual

growth (6%). Telecommunication has been dominated by digital technologies since 1990

(99.9% in digital format in 2007), and the majority of our technological memory has been

in digital format since the early 2000s (94% digital in 2007) [39].

The article Metcalfe’s Law after 40 Years of Ethernet, by Metcalfe, claims that critics

have declared Metcalfe’s law a gross overestimation of the network effect, but nobody

has tested the law with real data [53]. The law states that the value of a network grows as the square of the number of its users. Using a generalization of the sigmoid function

called the netoid, the Ethernet’s inventor and the law’s originator models Facebook user

growth over the past decade and fits his law to the associated revenue [53].

Data analytics is a method mainly for analyzing Big Data. Based on the article Business Intelligence and Analytics, by Chen et al., business intelligence and analytics

(BI&A) has emerged as an important area of study for both practitioners and researchers,

reflecting the magnitude and impact of data-related problems to be solved in contempo-

rary business organizations [7].

It’s obvious that data volume is the primary attribute of Big Data. With that in mind,

most people define Big Data in terabytes (10^12 bytes), sometimes in petabytes (10^15 bytes). For example, a number of users interviewed by TDWI are managing 3 to 10 tera-

bytes (TB) of data for analytics [61]. Yet, Big Data can also be quantified by counting

records, transactions, tables, or files. Some organizations find it more useful to quantify

Big Data in terms of time. For example, due to the seven-year statute of limitations in the


U.S., many firms prefer to keep seven years of data available for risk, compliance, and

legal analysis [61].

According to the article Big Data Analytics, by Russom, a new flood of user organiza-

tions is currently commencing or expanding solutions for analytics for Big Data [61]. To

supply the demand, vendors have recently released numerous new products and functions,

specifically for advanced forms of analytics (beyond OLAP and reporting) and analytic

databases that can manage Big Data. While it is good to have options, it is hard to track

them and determine in which situations they are ready for use. The purpose of the article

is to accelerate users’ understanding of the many new tools and techniques that have

emerged for Big Data analytics in recent years. It will also help readers map newly avail-

able options to real-world use cases [61].

The article A Knowledge-Based Platform for Big Data Analytics Based on Publish/Subscribe Services and Stream Processing, by Esposito et al. [29], argues that Big Data analytics

is considered an imperative aspect that needs to be further improved in order to increase

the operating margin of both public and private enterprises, and it represents the next

frontier for their innovation, competition, and productivity. Big Data are typically pro-

duced in different sectors of the above organizations, often geographically distributed

throughout the world, and are characterized by large size and variety.

Esposito et al. argue that there is a strong need for platforms handling larger and larger

amounts of data in contexts characterized by complex event processing systems and mul-

tiple heterogeneous sources, dealing with the various issues related to efficiently dissemi-

nating, collecting and analyzing the systems and sources in a fully distributed way. In

such a scenario, the article proposes a method for solving two fundamental issues: data

heterogeneity and advanced processing capabilities [29].

The article by Esposito et al. presents a knowledge-based solution for Big Data analyt-

ics, which consists in applying automatic schema mapping to deal with data heterogenei-

ty, as well as ontology extraction and semantic inference to support innovative processing

[29]. Such a solution, based on the publish/subscribe paradigm, has been evaluated within

the context of a simple experimental proof-of-concept, in order to determine its perfor-

mance and effectiveness.

3.8.3 Business Intelligence Systems

According to van der Lans, a business intelligence system can be defined as a solution for

supporting and improving the decision-making process of an organization [44, pp. 29].

From the user’s perspective the user interfaces of reporting and analytical tools are the

most practical elements in a business intelligence system. Reporting tools include e.g.


OLAP and data discovery/exploitation tools. Analytical tools consist of data mining and

statistical analysis tools [44].

Due to the increasing number of external data sources, organizations are interested in

combining their own internal data with these new data sources [44]. This enriches report-

ing and analytical capabilities in an organization. Most users of business intelligence sys-

tems are decision makers at strategic and tactical management levels [44].

Data federation may solve the dilemma that faces business intelligence systems based

on a chain of databases [44]. Data are transformed and copied from one database to an-

other until they reach an endpoint, i.e. a database being accessed by a reporting or analyt-

ical tool [44, pp. 3]. This process is called ETL, which is long, complex, and highly inter-

connected.

Based on an agile architecture, data federation involves fewer databases and transfor-

mations. As a result, data federation in a business intelligence system leads to a shorter

chain [44].

3.9 Cancer Data

According to the article Cancer in Finland 2015 by Pukkala et al., the risk of breast cancer

in women has increased continuously. Thus breast cancer is clearly the most common

cancer affecting women. The incidence of breast cancer rose by about a tenth in 1987, when nationwide mammography screening for breast cancer was started, and the incidence of breast cancer in women seems to be increasing further [57]. In 2015, it was estimated that 42 % of all cancer cases in women are breast cancer. More than a thousand breast cancer cases, most of which are symptom-free, are detected annually in screenings.

Pukkala et al. argue that menopausal hormone therapies are involved in the significant

increase in breast cancer [57]. The decline in the use of long-term hormone therapies in the early 2000s in Norway and Sweden has reversed the trend in breast cancer incidence. However, the reduction is smaller in Finland.

Breast cancer appears before retirement age in about two in 15 women, and during

their entire life cycle, in more than one in ten. Breast cancer has been the most common

malignant disease affecting Finnish women since the 1960s. The incidence of cancer

starts to increase after the age of 40 [57].

According to Pukkala et al., the five-year survival rate of patients with breast cancer in

2007-2009 was 89 % [57, pp. 58]. Breast cancer is an example of a disease in which a

patient with the advanced disease lives a considerable period of time due to effective

treatments [66, pp. 59]. In the case of local breast cancer in a sample of 200 patients, the


five-year survival rate is as high as 98%. If the disease has spread to the armpit lymph nodes, the figure is 88%. Even if the disease is found to have spread further, the five-year survival rate of patients is 42% [57].

According to the company Noona Healthcare, the number of breast cancer patients is

projected to grow 50 % by the year 2030 [54]. In 2016, nearly two million women will be

diagnosed with breast cancer worldwide. Most will survive cancer thanks to advanced

treatment methods. The growth in patient volumes poses new challenges for the

healthcare ecosystem. Some patients require treatment for troublesome side effects and

some relapse over the course of many years. In future, the systematic monitoring of large

numbers of patients and their symptoms and recovery will have a huge effect on clinical

resources [54].

Based on the article Long-Term Cancer Patient Survival Achieved by the End of the 20th Century: Most Up-to-Date Estimates from the Nationwide Finnish Cancer Registry,

by Brenner and Hakulinen, a new method of survival analysis, called period analysis, has

recently been developed. The method has been shown to provide more up-to-date esti-

mates of long-term survival rates than traditional methods of survival analysis [3].

Brenner and Hakulinen apply period analysis to data from the nationwide Finnish cancer registry to provide up-to-date estimates of 5-, 10-, 15- and 20-year relative survival rates (RSR). The estimates suggest that for these cancers there has been ongoing major progress in survival rates in recent years, which

has so far remained undisclosed by traditional methods of survival analysis. For example,

period analysis reveals that 10-year RSRs have come close to (or even exceed) 75 % for

breast cancer. Period analysis further reveals that 20-year RSRs have now come close to

(or even exceed) 60 % for breast cancer [3].

Brenner and Hakulinen claim that RSR represents the survival rate in the hypothetical

situation where the cancer in question is the only possible cause of death [3]. It is defined

as the absolute survival rate among cancer patients divided by the expected survival rate

of a comparable group from the general population.
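As a worked example with hypothetical figures, the RSR definition can be written out as follows (a five-year absolute survival rate of 80 % against an expected survival rate of 90 % gives an RSR of about 88.9 %).

# A worked example of the relative survival rate (RSR) definition with hypothetical figures:
# RSR = absolute survival rate of the cancer patients / expected survival rate of a
# comparable group from the general population.
absolute_survival_rate = 0.80   # e.g. 80 % of the patient group alive after five years
expected_survival_rate = 0.90   # e.g. 90 % of a comparable general-population group alive

relative_survival_rate = absolute_survival_rate / expected_survival_rate
print(f"5-year RSR = {relative_survival_rate:.1%}")   # -> 88.9 %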

The most common forms of cancer are breast and lung cancer, with an average of

more than 2000 incident cases per year, followed by cancer of the prostate, stomach and

colon. The final coding of cancer data is carried out by qualified secretaries and super-

vised by the Registry physician (pathologist). The Registry follows the ICD-O-3 nomen-

clature [3].

The nationwide Finnish Cancer Registry is a database covering about 5.5 million peo-

ple [3, pp. 367]. It contains the highest quality data of any population-based cancer regis-

try in the world. Notification of cancer cases to the registry is mandatory by law, and the

information comes from many different sources, including hospitals, physicians working

outside hospitals, dentists, and pathological and cytological laboratories. Copies are also


obtained of all death certificates where cancer is mentioned. Mortality follow-up is ex-

tremely efficient in Finland, due to the existence of personal identification numbers. Us-

ing these numbers as the key, the cancer registry files are matched annually with the an-

nual listed deaths. Matching with the central population register (a register of all people

currently alive and living in Finland) is performed as an additional check on the vital sta-

tus of patients [3].

According to the article Estimates of the Cancer Incidence and Mortality in Europe

published in 2006 by Ferlay et al., breast cancer is by far the most common form of can-

cer diagnosed in European women [30, pp. 586]. Breast cancer is the leading cause of

death from cancer in Europe [30, pp. 590].

In Europe the most common form of cancer was breast cancer (429 900 cases, 13.5% of all cancer cases) in 2006, and the most common cause of cancer death was breast cancer (131 900 deaths) [30]. Evidence-based public health measures exist to reduce mortality from

breast cancer.

The federative approach with the use of artifacts is an appropriate tool for processing cancer data. The data are heterogeneous by nature and include a large number of images, text, films, recordings etc.

3.10 Healthcare

CDI is a dynamic solution in a healthcare environment where patient recognition and

payer-provider collaboration are continuously changing the prevailing situation [25, pp.

228]. It can offer patients access to their private healthcare records. The patient must be

informed of all transactions based on storing and sharing patient data. Hospitals, provid-

ers, health maintenance organizations, and insurers are legislatively required to track in-

dividual patient records across the lifespan of care. The implementation of electronic

medical records might solve the tracking dilemma [25]. It requires that there will be a

careful focus on data quality and accuracy in healthcare, even though it is already strug-

gling with a heavy data load. It is estimated that nearly 99 % of a supplier’s patient data

had error rates that risked and jeopardized the accurate identification of individual pa-

tients. It is thus logical that patient records must be matched and identified with individu-

al patients. This could be a matter of life and death [25].

The Enterprise Master Patient Index (EMPI) capabilities are dedicated to helping

healthcare suppliers to recognize patients as individuals [25]. It combines data across

multiple hospitals, doctors’ offices, clinics, laboratories, pharmacies, and other patient

entry points, as well as across diverse systems [25]. The goal is to have a combined, reli-

able view of every patient for every point of care or every system across the healthcare


delivery network [25, pp. 229]. The core issue is patient safety. There are hundreds of

thousands of instances where missing or incomplete patient records have had tragic con-

sequences for patients [25].

According to Dyché and Levy, the idea of interoperable healthcare is that with in-

teroperable electronic health records, updated medical information could be available

wherever and whenever (ubiquitous) the patient and the attending health professional

need it right across the healthcare ecosystem [25, pp. 62]. For example, in the U.S., 785

million healthcare tests are conducted each year. The lack of interoperable systems to

effectively communicate the results among the various providers who need to review

them consumes 1 billion hours of administrative processing time simply to get the data in

the right place [25, pp. 63].

Noona Healthcare aims to create the world’s largest evolving database of cancer pa-

tients at various stages of the disease [54]. The database has been designed for long-term

analysis and insight gained from the Big Data on millions of cancer patients. The data-

base will provide doctors, researchers and treatment developers with unique opportunities

to find new ways to overcome cancer. The company has a mobile service to provide can-

cer centers with a real-time holistic view of their patients’ wellbeing. The service im-

proves the quality of cancer patient care and makes the patient-clinic relationship more

personal and meaningful. Clinical staff can rapidly respond to severe symptoms and pro-

vide better care to far greater numbers of patients. The system enables patients to follow

their own wellbeing and recovery and stay in close contact with their clinic [54].

Based on Lu et al., a hospital’s health data sets come from various sources (e.g. clini-

cal data, administrative data, financial data), and health information systems are generally

optimized for high speed continuous updating of individual patient data and patient que-

ries in small transactions [48, pp. 57]. Data warehousing is used to integrate data from

multiple operational systems and provide population-based views of health information.

With the help of a clinical evidence-based process, clinical data warehouse can facilitate

strategic decision making for the generation of treatment rules. By using ETL technolo-

gies, data, which originates from different medical sources are extracted, transformed and

loaded into the existing data warehouse structure. Clustering is used to identify the groups

of individuals, who present similar risk profiles and symptoms [48].

No single ontological approach has importance in relation to healthcare, but data quali-

ty may have impact on patient safety.


4 METHODOLOGY

Methodology refers to the range of appropriate means for the observation of reality and

the collection of information. Methodology applies to both qualitative and quantitative

research. Wand and Weber argue that a methodology must possess the features that ena-

ble users to construct a representation of their view of the real world [69, pp. 218]. A us-

er’s view may reflect existing real-world phenomena or imagined real-world phenomena.

4.1 Case Study

This Master’s thesis is based on a qualitative case study. The study methodology in turn

relies on an intensive case study. The intensive case study aims to form a holistic and

contextual description of data governance and management.

According to Ghauri and Grønhaug, a case study is associated with descriptive and ex-

perimental research [34]. However, according to Yin's view, a case study is not limited to

the above two categories [73]. In business research, a case study is especially useful when

the considered phenomenon is difficult to study outside the state of being a natural phe-

nomenon, and furthermore, exploring the concepts and variables is difficult to quantify

[34, pp. 109]. Investigations often collide with the fact that there are too many variables

present and this makes the research methods inappropriate.

A case study takes a qualitative and empirical angle on the analysis of cases [34, pp. 109]. It is based on a process model and a description of how the situation is managed. A case study is often associated with data collection from multiple sources: verbal reports, personal interviews and observations serve as primary sources of information. In addition, a case study draws on secondary data sources, such as financial reports, archives and budgets, as well as market and competition reports [34].

Case studies are not suitable for all types of research; the research problem and objectives determine the suitability of the method [34]. A case study is useful for developing theory and for testing it. Its main feature is the intensive study of the phenomenon, whether an individual, group, organization, culture, event or situation. There must be sufficient information to characterize and explain the unique features of the case, as well as to demonstrate properties that are common to many cases. Case study research relies on integration: the ability to study the phenomenon in many dimensions and then to make an integrative interpretation is needed [34].

Based on Ghauri and Grønhaug, a case study may either be holistic or integrated into a single case study [34, pp. 178]. A holistic case study is the opposite of a built-in case study [73]. A holistic case study includes a comprehensive analysis of several cases without subunits, whereas a built-in case study contains more than one subunit of analysis. A research methodology integrating the qualitative and the quantitative method can also be implemented in a single study [73].

A case study requires an accurate and in-depth understanding of the metadata and attributes involved in integrating the clinical medicine application environment. A qualitative case study with a comprehensive approach involves exploring such items [34, pp. 105], [28], [73].

The case study is most suitable for research that aims to examine a particular phenomenon in its real environment [34]. In this study, the phenomenon comprises cancer research, metadata and attribute-driven federation, the application process and the distribution of information. The research is based on individual case research (a single case study). The subject is the information application process of a clinical cancer department. A case study is appropriate for limiting the investigation at this stage, in order to test and verify the framework for reliability and validity in a precisely defined environment.

The article Building Theories from Case Study Research, by Eisenhardt, describes the process of inducting theory using case studies, from specifying the research questions to reaching closure [27]. Some features of the process, such as problem definition and construct validation, are similar to hypothesis-testing research. Others, such as within-case analysis and replication logic, are unique to the inductive, case-oriented process. Overall, the process described here is highly iterative and tightly linked to data. This research approach is especially appropriate in new topic areas. The resultant theory is often novel, testable and empirically valid. Finally, frame-breaking insights, the tests of good theory (e.g. parsimony, logical coherence) and convincing grounding in the evidence are the key criteria for evaluating this type of research [27].

The theoretical framework is based on the article by Dahlberg and Nokkala [14] and is presented in Chapter 2. The framework has not yet been validated, but this study is part of a wider research project whose purpose is to test and validate it empirically.

Empirical data were collected following Yin [73] and the principles of Eisenhardt [27] for case studies and research constructions. Case studies can combine different types of data collection methods, such as interviews, observation and archival materials.

With the aid of the artifact, the federation of breast cancer data from the data storages available to the CUH aims at the detection of malignant breast cancer cases [13]. The CUH has access to enormous amounts of relevant data, both internal and external, due to its role in the healthcare system of the country in which the present research was conducted. The CUH provides specialized healthcare services to the citizens of its healthcare districts. Numerous professionals and software vendors have participated, and still participate, in the development and operation of the ISs and data storages kept by hospitals and by breast cancer specialists. Yet the detection of malignant breast cancer cases is currently largely manual and based on the expertise of the professionals, because most data characteristics in the relevant data storages differ from one another [13].

The development of the artifact and the data collection have been organized through workshops with the data/information specialists of the hospital since January 2016 [13]. Prior to a workshop, the latest version of the artifact is prepared for presentation at the workshop. The researchers and the data/information specialists implemented the first version of the artifact. In a workshop, researchers and data/information specialists then interview one specific group of breast cancer specialists at a time, for example pathology specialists and the IS support personnel responsible for pathology ISs [13].

After the first workshop, the design artifact was modified. The ability of the artifact to support data federation is evaluated after each workshop and will be evaluated more thoroughly after the last workshop. If the data/information specialists and the medical chief information officer (CIO) of the university hospital consider the artifact and the federative approach useful, both will be made available to the CUH to be used as generic tools in the federation of clinical data.

The governance of data framework with the federative approach discussed by Dahlberg et al. was applied in order to craft the artifact [13, 14, 15]. The steps in the implementation of the artifact are as follows:

• Step 1. Identify the most relevant ISs/modules and data storages for data federation. Identify the groups of specific specialists to be interviewed about how data in those ISs/modules are to be understood and used.
• Step 2. Identify the shared attributes that are needed to make data interoperable between the identified ISs/modules and data storages.
• Step 3. Describe the IS technical, informational and socio-technical metadata for each shared attribute.

The steps are iterative. It is thus possible both to add and to remove ISs/modules and data storages, shared attributes and their metadata characteristics. For example, the first three shared attributes were identified first and a fourth one was added later. Similarly, there were initially 30 to 40 candidates for the metadata characteristics of each shared attribute, but the number of candidates was later reduced.

In the collection of empirical data about the case, the guidelines of Yin [73] and Eisenhardt [27] for case studies and for building research constructs from case studies were applied. Case studies can combine different data collection methods, such as interviews, observation and archival material. All of these methods, with the exception of direct observation, were used. This case study has more features of single-case research than of design science research. The federative approach and artifacts comparable to the one developed in this case have been used earlier to federate master data and social media site data in large commercial projects. Consequently, the artifact in this study is not new, whereas the research context itself is new.

4.1.1 Data Ontology

On the one hand, ontology refers to the nature of reality, of knowing and of existence. On the other hand, an ontology can be defined as a type of model that represents a set of concepts and their relationships within a domain [18, pp. 249]. Both declarative statements and diagrams using data modeling techniques can describe these concepts and relationships.

In this Master’s thesis, the ontological or federative angle of incidence is the attribute

federation of different information systems without data transfer.

4.1.2 Epistemology

Epistemology is the theory of knowledge: it examines the perception of reality, knowledge itself, and the possibilities of observing reality and obtaining information. It answers the question of whether study results can be generalized and what kind of information can be obtained through research.

In this Master’s thesis, the epistemological viewing angle is data federation, recovera-

bility of data harmonization.

4.1.3 Paradigm

A paradigm refers to the basic ways of conducting research and to its established assumptions. Paradigms involve a discipline-specific belief system that contains the philosophical, intellectual, experiential and learned elements of reality and its examination.

In this Master's thesis, both conceptual and contextual paradigms are relevant.


4.1.4 Methods

Research methods refer to practices for observing reality and collecting data. They are rules and procedures, and they serve as tools or ways of proceeding to solve problems. Methods play several roles, such as [34, pp. 37]:

Logic or ways of reasoning to arrive at solutions;
Rules for communication, i.e. explaining how the findings have been achieved;
Rules of intersubjectivity, i.e. enabling outsiders to examine and evaluate research findings.

In this Master’s thesis, interviews, surveys and observations of the forming of a central

data collection channel allow the artifact (research matrix) content to be assembled.

4.1.5 Rhetoric

Rhetoric defines how the study should be reported. It deals with the established terms and concepts, as well as with the hierarchy of concepts.

In this Master's thesis, the results are reported in the thesis as a whole.

4.1.6 Triangulation

According to Ghauri and Grønhaug, triangulation involves a combination of methodologies in the study of the same phenomenon [34, pp. 181]. Triangulation can be used to improve the accuracy of an assessment, and therefore of the results, by collecting data with different methods or even by collecting data from different subject areas of the research. When the validity of a study needs to be improved, data should be collected or analyzed through triangulation. In cases where the accuracy or resolution of the data is significant, it is logical to collect information through a variety of methods and angles [34]. Triangulation thus improves the accuracy of judgements, and hence of results, by collecting data through different methods or by collecting different kinds of data on the subject matter of the study [34, pp. 212].

In this Master's thesis, triangulation is applied in the interviews related to the case study.


4.2 Research Participation

This study covers the clinical cancer operations of the CUH and its information management (IT governance). Professionals in IT management and nursing staff participated in the workshops. The medical expertise consists of professionals in oncology, pathology and radiology. The ongoing research started in January 2016 and is being executed in co-operation with the data/information specialists of the hospital.

The development of the artifact and the data collection were organized through workshops. Prior to a workshop, the latest version of the artifact was prepared for presentation at the workshop. The researchers and the data/information specialists crafted the first version of the artifact. In a workshop, researchers and data/information specialists then interviewed one specific group of (breast cancer) specialists at a time, for example pathology specialists and the IS support personnel responsible for pathology ISs. The design artifact was modified after the workshop on the basis of the feedback. The ability of the artifact to support data federation was evaluated lightly and provisionally after each workshop.

4.3 Artifact

The artifact, a matrix, is used as a design tool in data federation, based on the article by Dahlberg et al. [13]. The cells of the matrix record the interoperable, shared attributes (Table 1). The columns are the ISs, which were chosen on the advice of the hospital professionals and the supervisor. The first phase in data federation is to identify interoperable attributes. Cross-references are taken advantage of when filling in the matrix.

On the vertical level, the interoperable attributes are the following:

HETU
TNM Code (TNM)
Diagnosis Code (DC)
Date of Event (Date).

On the horizontal level, the key factors are the ISs, as follows:

Uranus, Miranda, Oberon (Patient Information System)
Weblab (Laboratory Information System)
StellarQ (Management Tracking System)
Aria (Control System of Radiology and Patient-Specific Radiotherapy)
QpatiWeb (Pathology Information System).


Table 1 Design Artifact of the Case Study

The artifact supports the federation of breast cancer data from data storages available to a central university hospital and the detection of malignant breast cancer cases [13]. The development of the artifact and the data collection are organized with the aid of workshops.
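
As a rough illustration of the design artifact, the following Python sketch lays out the Table 1 matrix with the shared attributes listed above as rows and the ISs as columns. The cell contents are placeholders only, because the actual cross-references of the CUH are confidential; the single filled-in cell is a hypothetical example.

# Sketch of the Table 1 matrix: shared attributes (rows) x information systems
# (columns). All cell contents are placeholders, not actual CUH data.
import pandas as pd

shared_attributes = ["HETU", "TNM", "DC", "Date"]
information_systems = ["Uranus/Miranda/Oberon", "Weblab", "StellarQ", "Aria", "QpatiWeb"]

# Each cell records whether and how the attribute appears in the IS.
artifact = pd.DataFrame("?", index=shared_attributes, columns=information_systems)
artifact.loc["HETU", "Weblab"] = "present"  # hypothetical example entry

print(artifact)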


5 RESULTS

This Master’s thesis aims to show that the theoretical framework is understandable and

workable. The goal was to answer the following research questions.

The main research question (How does the theoretical framework of data federation

work in practice compared with the golden record?), the first sub-question (What are the

benefits of the federative approach?) and the second sub-question (What are the limita-

tions?) are answered in Chapter 2.

The empirical data were collected by interviewing CUH staff in Turku, Finland. The acquisition of the empirical material is based on workshops, each of which took about two hours. The design artifact was applied to acquire the data. Data acquisition focused on five separate information systems, chosen on the basis of the advice and expertise of the CUH staff.

The MDM of the case includes unstructured, fragmented and non-governed data. The MDM of the hospital organizations showed marked differences related to data quality (DQ) and MDM. Data governance seems to belong to nobody in the organization, and there appears to be a lack of understanding as to which data should be governed. The canonical approach was not able to produce any organization-wide solutions. Additionally, it was unclear whether it is legally permissible to use the data and to modify them in a designed service. It is also unclear whether the data are protected, who owns them and who needs to provide permissions.

For the purposes of this case study, an artifact was designed to conceptualize and operationalize the considerations of why data are created and who holds governance accountability for them. The case study shows that the amount of digital data is increasing exponentially in the data environment of the CUH. With the growth in volume, the sources, structures and kinds of data are multiplying. Simultaneously, the user organizations are losing control over the data models and their data.

Currently, data are increasingly often external and provided as a service with unknown schemas through APIs or adapters. In spite of this trend, the ability to manage and federate data is extremely important. Both the private and the public sector need to benefit from data in a large number of contexts. Thus it is vital to know why federated data sets are created, what purpose they serve and why the data are stored.

A comparison between two practical approaches was made. The theoretical discussion was conducted by reflecting the phenomena against the existing literature on ontologies and data federation.


The artifact of data federation in the case study is built of the two matrices shown in Table 2 and Table 3. In the field tests the artifact proved to be practical and useful, and the IS professionals at the hospital received it positively.

The case study revealed that the primary intellectual dilemma lies in understanding the ontological stance of the federative approach and the artifact. On the one hand, it is a generally shared view that data are contextually defined. On the other hand, it is usual to apply the canonical data models of information systems. As a result, there is a potential failure to take the necessary intellectual step of paying attention to the ontology of the data. This step is mandatory in order to federate data from incompatible data storages in open systems environments. Once the ontological stance of the federative approach is understood and recognized, the artifact and its meaning make sense.

5.1 Matrices of the Artifact

The matrix shown in Table 2 was designed for and during the iterative second step of the artifact design, which focused on the identification of shared attributes. Table 2 is shown in a generic and concise format due to the confidential nature of the data in the ISs of the CUH. The matrix was constructed by placing the ISs/modules and data storages as its columns and the shared attributes as its rows. The matrix of Table 2 shows the outcome of step 2 (Chapter 5.2) and can be used to check once more that the shared attributes really exist in all the federated data storages.

Table 2 Data Federation Artifact - Identification of Shared Attributes

The case study shows that the identification of shared interoperable attributes proved to be an easy task for the IS professionals at the CUH, and it also made sense to the cancer specialists. The matrix compiles the shared attributes of all the ISs/modules and data storages in one table. On the basis of the case study, it is evident that the best way to implement the matrix is to add one IS/module and data storage at a time.

The matrix shown in Table 3 was implemented for and during the iterative third step of the artifact design, which focused on the definition of contextual metadata characteristics for each shared attribute. Table 3, too, is shown in a generic and concise format in order to prevent the identification of the CUH's information systems and sensitive data.

Table 3 Definition of Contextual Metadata Characteristics

The content in the cells of the matrix shown in Table 3 was produced by answering the following questions:

What kind of IS technical properties does a shared attribute have (format, length, hierarchy, granularity, mandatory, search key etc.)?

What kind of informational properties does a shared attribute have?
    What is the data type of the shared attribute: a transaction, a report, a document, content, master data, reference data or metadata?
    What is the source of the shared attribute: a business transaction system, a sensor device, a control device, a spatial device, a temporal device, a social media device, or other?
    Is the shared attribute structured, unstructured or multi-structured?
    Is the origin of the shared attribute an internal or an external data source? If the source is external, how is the organization allowed to process and use the shared attribute and the related data storage?
    Who enters and modifies the shared attribute during its life cycle?

What kind of socio-contextual properties does a shared attribute have?
    What does the shared attribute mean in each use context during the life cycle of the attribute?
    For what purpose is the shared attribute created and what does it mean at the time of creation?
    For what purpose is the shared attribute used and what does it mean when being used?
    What is the reason for storing the shared attribute and what does it mean when being stored?
    What other life-cycle stages does the shared attribute have and what is the meaning of the attribute at each stage?

Who is responsible for the shared attribute?
    Who is responsible for each of the IS technical, informational and socio-contextual metadata characteristics of the shared attribute?
    Who is responsible for the data quality of the shared attribute?
    How are the availability of and the access rights to the data of the shared attribute ensured?
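
The questions above can be read as the record structure behind the Table 3 matrix. The following sketch expresses one possible such structure in Python; the grouping into technical, informational and socio-contextual fields follows the list above, while every example value is hypothetical and does not describe the CUH's systems.

# One possible record structure for the contextual metadata of a shared
# attribute (Table 3). All example values below are hypothetical.
from dataclasses import dataclass, field

@dataclass
class SharedAttributeMetadata:
    name: str
    # IS technical properties
    data_format: str
    length: int
    mandatory: bool
    search_key: bool
    # Informational properties
    data_type: str              # e.g. master data, reference data, transaction
    source: str                 # e.g. business transaction system, sensor device
    structure: str              # structured / unstructured / multi-structured
    origin: str                 # internal or external data source
    # Socio-contextual properties
    meaning_by_context: dict = field(default_factory=dict)
    responsible_party: str = ""
    data_quality_owner: str = ""

# Hypothetical example for the personal identity code attribute.
hetu = SharedAttributeMetadata(
    name="HETU", data_format="DDMMYYCZZZQ", length=11, mandatory=True,
    search_key=True, data_type="master data",
    source="business transaction system", structure="structured",
    origin="internal",
    meaning_by_context={"pathology": "links a specimen to a patient"},
    responsible_party="patient administration",
    data_quality_owner="medical CIO office")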

Based on the confidentiality agreement between the CUH and Turku School of Economics, no detailed contents of the matrices can be presented in this Master's thesis.

The case study with the chosen approach is distinguished from the great number of canonical data integration endeavors in that no attempt is made to collect all the data into a harmonized data storage, which is usually used for reporting. The core idea of the study is that the original data are left untouched and reside in their original locations. Technically, data federations are conducted with the aid of metadata repositories. A repository holds the cross-mappings of the federated data storages by using the metadata of the shared attributes. The metadata repository is a data storage for the federation rules, the meanings of the attributes and their metadata, descriptions of data formats, and definitions of cross-mappings. Metadata descriptions are created, modified and used only when a data federation need is recognized. New federation rules can be added whenever needed, for example for a new reporting need. The idea is to avoid big bangs and to proceed at the pace of learning.
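
The following Python sketch illustrates the metadata-repository idea on a very small scale. The system names, field names and formats are invented for the illustration and are not descriptions of the CUH's actual repositories or federation rules.

# Minimal sketch of a metadata repository for data federation. The source data
# stay untouched in their original systems; the repository only stores the
# cross-mappings of a shared attribute. All names and formats are hypothetical.
metadata_repository = {
    "HETU": {                                        # shared attribute
        "weblab":   {"field": "pat_id",    "format": "DDMMYYCZZZQ"},
        "qpatiweb": {"field": "person_no", "format": "DDMMYYCZZZQ"},
    },
}

def cross_mapping(attribute, source_system, target_system):
    """Return the field-level mapping of one shared attribute between systems.

    A real federation rule would also cover the meanings, life-cycle stages
    and access rights recorded for the attribute.
    """
    src = metadata_repository[attribute][source_system]
    dst = metadata_repository[attribute][target_system]
    return (src["field"], src["format"]), (dst["field"], dst["format"])

print(cross_mapping("HETU", "weblab", "qpatiweb"))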

5.2 Pattern to Implement the Artifact in the Case Study

The prerequisite for federating data is the contextual metadata; this is a fundamental result of this case study. Data federation is initiated by understanding the contextual metadata. To avoid data deficiencies, the human perception of the real world, the representations of reality in data and ISs, and their combined implications must be taken into account.

The pattern consists of the following logical steps when implementing the artifact:

• Step 1. Identification of the most relevant ISs/modules and data storages. Identification of the specific specialist groups who need to be interviewed about how data in those ISs/modules are to be understood and used.
• Step 2. Identification of the shared attributes that are needed to make data interoperable between the identified ISs/modules and data storages.
• Step 3. Description of the IS technical, informational and socio-technical metadata for each shared attribute.

All the steps are iterative, and thus the process of each step must be repeated until an appropriate result is achieved. During the cycles of iteration it is possible to add and to remove information systems/modules and data storages, shared attributes and their metadata characteristics. In this case study the first three shared attributes (HETU, TNM, Diagnosis Code) were identified first and a fourth one (Date of Event) was added later. Initially, 30 to 40 candidates for the metadata characteristics of each shared attribute were identified, but the number of candidates was later reduced.
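
As a schematic illustration of this iterative pattern (not of the actual workshop protocol), the following sketch adds one IS/module at a time and prunes the metadata candidates after each round. The system names, the initial candidate counts and the pruning rule are invented placeholders.

# Schematic sketch of the iterative pattern: add one IS/module and data storage
# at a time, then revise the metadata candidates after each workshop round.
# The review function and the pruning rule are invented placeholders.
candidate_systems = ["Uranus", "Weblab", "StellarQ", "Aria", "QpatiWeb"]
shared_attributes = ["HETU", "TNM", "DC", "Date"]
metadata_candidates = {a: set(range(35)) for a in shared_attributes}  # 30-40 initially

def workshop_review(system, candidates):
    """Placeholder for a workshop round: keep roughly half of the candidates."""
    return {a: set(sorted(c)[: max(1, len(c) // 2)]) for a, c in candidates.items()}

federated_systems = []
for system in candidate_systems:          # Step 1: one IS/module at a time
    federated_systems.append(system)
    # Step 2: confirm that the shared attributes exist in the newly added system.
    # Step 3: describe and prune the metadata characteristics for each attribute.
    metadata_candidates = workshop_review(system, metadata_candidates)

print(federated_systems, {a: len(c) for a, c in metadata_candidates.items()})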


6 DISCUSSION

In Chapter 2.1, I argue that user organizations have lost control over the data models they use and partly also over their data. In Chapter 2.2, I state that the golden record philosophy is obsolete, because the data are increasingly external and provided as a service, with unknown data models, APIs and/or adapters. Yet the ability to manage and federate data is becoming ever more important for organizations so that they can benefit from digital data. To manage and federate data effectively, it is necessary to know why the federated data sets have been created, for what purposes they are used, and why the data are stored.

Data federation starts from understanding the contextual metadata. To avoid data deficiencies, the human perception of real-world states, the representations of the real world in data and ISs, and their combined effects must also be considered when data are federated.

6.1 Contribution

The background of the Master’s thesis is based on five years of relevant research. In spite

of the relatively long research period, there is as yet a limited amount of empirical public-

ly available data to support the federative approach. This also applies to the artifact de-

signed in this case study.

The federative approach may increase value creation from overall IT investments. The necessary investments should focus on data attributes and MDM tools with an interoperability stance. Data federation creates a new approach to integration with the aid of MDM tools: it is not necessary to acquire a new IS, as the mappings are available. This is an important finding, since IT expenses in organizations are apt to skyrocket when overlapping systems and appliances are required.

The most important scientific contribution of the overall research is the extension of the research work by Wang et al. [74, 75, 76, 77]. In addition, data governance and management based on the data framework by Dahlberg and Nokkala are introduced at a practical level [14].

6.2 Limitations

This Master’s thesis has certain limitations related to the conceptual sources. Based on the

research cycle of five years, the amount of empirical, publicly available data is limited in

Page 84: Master's Thesis _Daniel Meriläinen 90648

84

terms of supporting the federative approach, including the designed artifact and the judg-

ment on the golden record approach. The federative framework introduced in this Mas-

ter’s thesis has not yet been empirically validated. The artifact as a tool for federation has

neither been validated nor reliability tested.

6.3 Future Research Questions

Data management researchers should focus on the ontological nature of digital data. As a result of the data explosion, the importance of Big Data and data mining is growing rapidly. Another relevant point is to gain an understanding of what kind of data is necessary at the operational level in order to perform diverse tasks. Last but not least, further tests are necessary to verify the applicability of data federation in practice, also in branches other than healthcare, in order to qualify and rely on the federative framework.


7 REFERENCES

[1] Bebchuk, L.A., Weisbach, M.S. (2010). The State of Corporate Governance Research. The Review of Financial Studies, Vol. 23, No. 3, pp. 939-961.

[2] Berson, A., Dubov, L. (2007). Master Data Management and Customer Data Integration for a Global Enterprise. McGraw-Hill, New York, pp. 1-434.

[3] Brenner, H., Hakulinen, T. (2001). Long-Term Cancer Patient Survival Achieved by the End of the 20th Century: Most Up-To-Date Estimates from the Nationwide Finnish Cancer Registry. British Journal of Cancer, Vol. 85, Issue 3, pp. 367-371.

[4] Cheikhouhou, I., Djemal, K., Maaref, H. (2010). Mass Description for Breast Cancer Recognition. Elmoataz et al. (Editors): ICISP 2010, LNCS 6134, pp. 576-584.

[5] Chen, B., Oliver, J., Schwartz, D., Lindsey, W., MacDonald, A. (2005). Data Federation Methods and System. United States Patent Application Publication, US 2005/0021502 A1, Jan. 27, 2005, pp. 1-9.

[6] Chen, D., Doumeingts, G., Vernadat, F. (2008). Architectures for Enterprise Integration and Interoperability: Past, Present and Future. Computers in Industry, Vol. 59, pp. 647-659.

[7] Chen, H., Chiang, R.H.L., Storey, V.C. (2012). Business Intelligence and Analytics: From Big Data to Big Impact. MIS Quarterly, Vol. 36, No. 4, pp. 1165-1188.

[8] Crichton, D., Kincaid, H., Downing, G.J., Srivastava, S., Hughes, J.S. (2001). An Interoperable Data Architecture for Data Exchange in a Biomedical Research Network. Computer-Based Medical Systems, July 26-27, 2001. CBMS 2001, Proceedings of the 14th IEEE Symposium, pp. 1-8.

[9] Christen, P. (2014). Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer-Verlag Berlin and Heidelberg GmbH & Co. KG, pp. 3-265.

[10] Cleven, A., Wortmann, F. (2010). Uncovering Four Strategies to Approach Master Data Management. System Sciences (HICSS), the 43rd Hawaii International Conference on, IEEE, pp. 1-10.

[11] Dahlberg, T. (2010). Master Data Management Best Practices Benchmarking Study 2010. Dataset, March 2010, https://www.researchgate.net/publication/267508624, pp. 1-46.

[12] Dahlberg, T., Heikkilä, J., Heikkilä, M. (2011). Framework and Research Agenda for Master Data Management in Distributed Environments. Proceedings of IRIS 2011. TUCS Lecture Notes No. 15, October 2011, pp. 82-90.

[13] Dahlberg, T., Heikkilä, J., Heikkilä, M., Nokkala, T. (2015). Data Federation by Using a Governance of Data Framework Artifact as a Tool - Case Clinical Breast Cancer Treatment Data. Åbo Akademi, pp. 1-15.

[14] Dahlberg, T., Nokkala, T. (2015). A Framework for the Corporate Governance of Data – Theoretical Background and Empirical Evidence, pp. 25-45.

[15] Dahlberg, T. (2015). Managing Datification - Data Federation in Open Systems Environments. Elsevier Editorial System for the Journal of Strategic Information Systems, Manuscript Draft. Åbo Akademi, pp. 1-22.

[16] Dahlberg, T. (2015). The MDM Golden Record is Dead, Rest in Peace – Welcome Interpreted Interoperable Attributes. Åbo Akademi, pp. 1-13.

[17] Dahlberg, T. (2016). Research on: Governance of Data in the Contexts of Corporate Governance and Governance of IT. Data Federation in the Context of Master and Big Data. Presentation Slides. Åbo Akademi, pp. 1-46.

[18] DAMA (2010). The DAMA Guide to the Data Management Body of Knowledge (DAMA-DMBOK Guide). 1st Edition. Technics Publications, LLC, Bradley Beach, NJ, pp. 1-406.

[19] Diagnostic Imaging (2007). Digital Mammography Produces Large Data Loads. November 1, 2007. http://www.diagnosticimaging.com/articles/digital-mammography-produces-large-data-loads. Retrieved on August 23, 2016.

[20] Doan, A., Halevy, A., Ives, Z. (2012). Principles of Data Integration. Elsevier, Inc., pp. 1-487.

[21] Dreibelbis, A., Hechler, E., Milman, I., Oberhofer, M., van Run, P., Wolfson, D. (2008). Enterprise Master Data Management: An SOA Approach to Managing Core Information. IBM Press/Pearson plc, Upper Saddle River, NJ, pp. 1-656.

[22] Duda, S. (2011). Identifying, Investigating and Classifying Data Errors: An Analysis of Clinical Research Data Quality from an Observational HIV Research Network in Latin America and the Caribbean. Dissertation. The Faculty of the Graduate School of Vanderbilt University, Nashville, Tennessee, USA, pp. 1-100.

[23] Duodecim (2002). Rintasyövän diagnostiikka ja seuranta [Breast cancer diagnostics and follow-up]. June 14, 2002. Duodecim. https://www.duodecim.fi/. Retrieved on September 1, 2016.

[24] Dušek, L., Hřebíček, J., Kubásek, M., Jarkovský, J., Kalina, J., Baroš, R., Bednářová, Z., Klánová, J., Holoubek, I. (2011). Conceptual Model Enhancing Accessibility of Data from Cancer-Related Environmental Risk Assessment Studies. International Federation for Information Processing (IFIP), AICT 359, pp. 461-479.

[25] Dyché, J., Levy, E. (2006). Customer Data Integration. John Wiley and Sons, Inc., pp. 1-324.

[26] Edge, S.B., Byrd, D.R., Compton, C.C., Fritz, A.G., Greene, F.L., Trotti, A. III (2011). Cancer Staging Handbook (AJCC). Springer-Verlag, New York, NY, pp. 1-718.

[27] Eisenhardt, K.M. (1989). Building Theories from Case Study Research. Academy of Management Review, Vol. 14, Issue 4, pp. 532-550.

[28] Eriksson, P., Kovalainen, A. (2011). Qualitative Methods in Business Research. Sage Publications Inc., pp. 115-136.

[29] Esposito, C., Ficco, M., Palmieri, F., Castiglione, A. (2015). A Knowledge-Based Platform for Big Data Analytics Based on Publish/Subscribe Services and Stream Processing. Knowledge-Based Systems, Vol. 79, May 2015, pp. 3-17.

[30] Ferlay, J., Autier, P., Boniol, M., Heanue, M., Colombet, M., Boyle, P. (2007). Estimates of the Cancer Incidence and Mortality in Europe in 2006. Annals of Oncology, Vol. 18, pp. 581-592.

[31] Ferro, N., Silvello, G. (2008). A Methodology for Sharing Archival Descriptive Metadata in a Distributed Environment. ECDL 2008, LNCS 5173, pp. 268-279.

[32] Finnish Cancer Registry. http://www.cancer.fi/syoparekisteri/en/?x56215626=112197488. Retrieved on January 1, 2016.

[33] Gable, J. (2002). Enterprise Application Integration. The Information Management Journal, March/April 2002, pp. 48-52.

[34] Ghauri, P., Grønhaug, K. (2010). Research Methods in Business Studies. A Practical Guide. 4th Edition. Prentice Hall, Pearson Education Limited, pp. 3-265.

[35] Giachetti, R.E. (2010). Design of Enterprise Systems. CRC Press, pp. 1-423.

[36] Gregor, S. (2006). The Nature of Theory in Information Systems. MIS Quarterly, Vol. 30, Issue 3, pp. 611-642.

[37] Heimbigner, D., McLeod, D. (1985). A Federated Architecture for Information Management. ACM Transactions on Office Information Systems, Vol. 3, No. 3, July 1985, pp. 253-278.

[38] Henderson, J.C., Venkatraman, N. (1993, 1999). Strategic Alignment: Leveraging Information Technology for Transforming Organizations. IBM Systems Journal, Vol. 32, No. 1, pp. 472-484.

[39] Hilbert, M., Lopez, P. (2011). The World's Technological Capacity to Store, Communicate and Compute Information. Science, Vol. 332, No. 6025, pp. 60-65.

[40] Iivari, J., Hirschheim, R., Klein, H.K. (1998). A Paradigmatic Analysis Contrasting Information Systems Development Approaches and Methodologies. Information Systems Research, 9(2), pp. 164-193.

[41] Iivari, J. (2003). The IS Core – VII: Towards Information Systems as a Science of Meta-Artifacts. Communications of the Association for Information Systems, Vol. 12, No. 37, pp. 567-582.

[42] Jacob, S.G., Ramani, R.G. (2012). Data Mining in Clinical Data Sets: A Review. International Journal of Applied Information Systems (IJAIS), Vol. 4, No. 6, pp. 15-26.

[43] Kuechler, W., Vaishnavi, V. (2012). A Framework for Theory Development in Design Science Research: Multiple Perspectives. Journal of the Association for Information Systems, Vol. 13, Issue 6, pp. 395-423.

[44] Lans, R.F. van der (2012). Data Virtualization for Business Intelligence Systems. Elsevier, Inc., pp. 1-275.

[45] Linoff, G.S., Berry, M.J.A. (2011). Data Mining Techniques. 3rd Edition. Wiley Publishing, Inc., USA, pp. 1-821.

[46] Linthicum, D.S. (2000). Enterprise Application Integration. Addison-Wesley, USA, pp. 1-379.

[47] Loshin, D. (2010). Master Data Management. Morgan Kaufmann, pp. 1-274.

[48] Lu, J., Hales, A., Rew, D., Keech, M., Fröhlingsdorf, C., Mills-Mullett, A., Wette, C. (2015). Data Mining Techniques in Health Informatics: A Case Study from Breast Cancer Research. Information Technology in Bio- and Medical Informatics, Vol. 9267, pp. 56-70.

[49] Luján-Mora, S., Vassiliadis, P., Trujillo, J. (2004). Data Mapping Diagrams for Data Warehouse Design with UML. Lecture Notes in Computer Science, Vol. 3288, pp. 191-204.

[50] Maier, R., Hädrich, T., Peinl, R. (2005). Enterprise Knowledge Infrastructures. Springer-Verlag Berlin Heidelberg, Germany, pp. 1-379.

[51] Markus, M.L., Bui, Q.N. (2012). Going Concerns: The Governance of Interorganizational Coordination Hubs. Journal of Management Information Systems, Spring 2012, Vol. 28, No. 4, pp. 163-167.

[52] Martin, W. (2014). The Advantages of a Golden Record in Customer Master Data Management. Uniserv GmbH, pp. 1-6. http://www.wolfgang-martin-team.net/paper/SpecialistReport_golden%20record_ENG.PDF. Retrieved on September 16, 2016.

[53] Metcalfe, B. (2013). Metcalfe's Law after 40 Years of Ethernet. The IEEE Computer Society, December 2013, pp. 26-31.

[54] Noona Healthcare. www.noona.com. Retrieved on July 8, 2016.

[55] Otto, B. (2015). Quality and Value of the Data Resource in Large Enterprises. Information Systems Management, Vol. 32, pp. 234-251.

[56] Population Registry Centre (2016). http://vrk.fi/en/personal-identity-code1. Retrieved on June 11, 2016.

[57] Pukkala, E., Dyba, T., Hakulinen, T., Sankila, R. (2015). Syöpä Suomessa 2015 [Cancer in Finland 2015]. Syöpäjärjestöjen julkaisuja 2015. Suomen Syöpäyhdistys, Helsinki, pp. 7-16.

[58] Rahmati, P., Hamarneh, Nussbaum, D., Adler, A. (2010). A New Preprocessing Filter for Digital Mammograms. Elmoataz et al. (Editors): ICISP 2010, LNCS 6134, pp. 585-592.

[59] Ras, Z.W., Tzacheva, A., Tsay, L-S. (2006). Encyclopedia of Data Warehousing and Mining. Action Rules. Edited by Wang, J. Idea Group Reference, USA, pp. 1-5.

[60] Reddy, S.S., Mulani, J., Bahulkar, A. (2000). Adex – A Meta Modeling Framework for Repository Centric Systems Building, in Advances in Data Management. Edited by Ramamritham, K., and Vijayaraman, T.M. Tata McGraw-Hill Publishing Company Ltd., pp. 1-10.

[61] Russom, P. (2011). Big Data Analytics. TDWI (The Data Warehousing Institute) Best Practices Report, 4th Quarter 2011, pp. 3-34.

[62] Shleifer, A., Vishny, R.W. (1997). A Survey of Corporate Governance. The Journal of Finance, Vol. 52, Issue 2, pp. 737-783.

[63] Sen, A. (2004). Metadata Management: Past, Present, Future. Decision Support Systems, Vol. 37, pp. 151-173.

[64] Sharma, N., Om, H. (2012). Framework for Early Detection and Prevention of Oral Cancer Using Data Mining. International Journal of Advances in Engineering & Technology, September 2012, pp. 302-310.

[65] Sheth, A.P., Larson, J.A. (1990). Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, Vol. 22, No. 3, pp. 183-236.

[66] Tsiknakis, M., Brochhausen, M., Nabrzyski, J., Pucacki, J., Sfakianakis, S.G., Potamias, G., Desmedt, C., Kafetzopoulos, D. (2008). A Semantic Grid Infrastructure Enabling Integrated Access and Analysis of Multilevel Biomedical Data in Support of Postgenomic Clinical Trials on Cancer. IEEE Transactions on Information Technology in Biomedicine, Vol. 12, No. 2, pp. 205-279.

[67] Wand, Y., Wang, R.Y. (1996). Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM, Vol. 39, Issue 11, pp. 86-95.

[68] Wand, Y., Weber, R. (1990). An Ontological Model of an Information System. IEEE Transactions on Software Engineering, Vol. 16, No. 11, pp. 1282-1292.

[69] Wand, Y., Weber, R. (1993). On the Ontological Expressiveness of Information Systems Analysis and Design Grammars. Information Systems Journal, Vol. 3, No. 4, pp. 217-237.

[70] Wand, Y., Weber, R. (2002). Research Commentary: Information Systems and Conceptual Modeling – a Research Agenda. Information Systems Research, Vol. 13, No. 4, pp. 363-376.

[71] Watson, R.T. (2006). Data Management: Databases and Organizations. 5th Edition. John Wiley & Sons, Inc., pp. 1-603.

[72] Watts, S., Shankaranarayanan, G., Even, A. (2009). Data Quality Assessment in Context: A Cognitive Perspective. Decision Support Systems, Vol. 48, pp. 202-211.

[73] Yin, R.K. (2014). Case Study Research: Design and Methods. 5th Edition. Sage Publications, USA, pp. 3-282.