CHAPTER 2

REVIEW OF LITERATURE

2.0 INTRODUCTION

This chapter presents a review of the prevailing literature within the scope of this research. The literature review was undertaken to understand the foundations of this work. The literature reviewed includes journals, publications, magazines, books and technical reports from renowned organizations leading the data warehouse industry. The theoretical and empirical research findings of various researchers have been studied and analyzed to assess the feasibility of the defined objectives.

2.1 REVIEW OF LITERATURE:

This chapter categorizes and elucidates the work done so far in the area of data warehouse development. The chapter has been organized into three parts. First, the theoretical view of data warehouse development is presented, covering issues associated with data warehouse design and deployment and the data warehouse architectures proposed by various researchers. Second, literature concerned with generating synthetic test data is reviewed. The third part is concerned with ETL systems and data quality issues; literature regarding contemporary distributed data management has also been reviewed there, with a view to proposing a refined architecture for distributed data management.

The literature review starts from gaining a basic knowledge of data warehouse architectures, proceeding from early data warehouse concepts towards the advanced data warehouse setups prevailing in the industry. Afterwards, the test data generation methodologies proposed so far for generating synthetic records were studied, followed by a review of the testing strategies proposed by eminent researchers. The literature review further examined the details of the Extraction, Transformation and Loading module of the data warehouse. As this research aims to establish the testing procedures required for quality ETL routines, appropriate literature was reviewed to locate the possible loopholes in the ETL development procedure.

2.1.1 On Being on Familiar Terms with the Data Warehouse:

Robert M. Bruckner et al. [86] identified intelligent and comprehensive data

warehouse systems as a powerful instrument for organizations to analyze their

business. The implementation of such decision systems for an enterprise-wide

management and decision support system can be very different from traditional

software implementations. Because data warehouse systems are strongly data-driven entities, their development process is highly dependent on the underlying data. Since data warehouse systems concern many organizational units, the collection

of unambiguous, complete, verifiable, consistent and usable requirements can be a

very difficult task. Use cases are generally used as standard notation for object-

oriented requirement modelling. In this paper the team showed how use cases can

enhance communication between stakeholders, domain experts, data warehouse

designers and other professionals with diverse backgrounds. Three abstraction levels of data warehouse requirements, namely business, user and system requirements, were introduced and discussed.

Beate List et al. [16], in this case study on Process Warehousing, describe the building of a data warehouse as a very challenging task because, compared to software engineering, it is quite a young discipline and does not yet offer well-established strategies and techniques for the development process. Current data warehouse development methods fall into three basic groups: data-driven, goal-driven and

user-driven. During this study all three development approaches have been applied to

the Process Warehouse that is used as the foundation of a process-oriented decision

support system, which aims to analyse and improve business processes continuously.

In this paper the authors evaluated all three development methodologies by various

assessment criteria. The aim is to establish a link between the methodology and the

requirement domain.

R.G. Little and M.L. Gibson [82] in this study have surveyed data warehouse implementation perspectives. Project participants were consulted to record their

perceptions, which may further contribute to the data warehouse implementation

process. The respondents included: functional managers/staff, IS managers/staff, and

consultants. The study identified eight significant factors that according to

participants may impact data warehouse implementation.

Hugh J. Watson and Thilini Ariyachandra [40] conducted this research to analyze the success factors of different data warehouse architectures. The main objectives of this study were:

a) understanding the factors that influence the selection of a data warehouse architecture, and

b) assessing the success of the various architectures.

The academic and data warehousing literature and industry experts were used to identify architecture selection factors and success measures. The information attained was then used to create questions for a Web-based survey that collected data from 454 companies about the respondents, their companies, their data warehouses, the architectures they use, and the success of their architectures. The success metrics identified were information quality, system quality, individual impacts, organizational impacts, development time, and development cost. Perhaps the most interesting finding of this study is the almost equal score of the bus, hub-and-spoke, and centralized architectures on information and system quality as well as on the individual and organizational impact metrics. It helps explain why these competing architectures have survived over time: they are equally successful for their intended purposes. Based on these metrics, no single architecture is dominant, but the hub-and-spoke architecture still seems to be the first choice among organizations implementing a data warehouse.

Joy Mundy et al. [48] have discussed five major aspects of data warehouse

implementation. This book is organized into five parts; a brief synopsis of these parts

is as under:

Part I Requirements, Realities, and Architecture: the first three chapters of the book are included in this part. The first chapter deals with a brief summary of the Business Dimensional Lifecycle. Afterwards the most important step of

gathering the business requirements is discussed. In chapter two the authors present a

brief primer on how to develop a dimensional model. This chapter presents

terminology and concepts used throughout the book, so it’s vital that one should

understand this material. Chapter three deals with the Architecture and Product

Selection tasks, which are straightforward for a Microsoft DW/BI system. In this short

chapter there is a detailed discussion about how and where to use the various

components of SQL Server 2005, other Microsoft products, and even where to use

third-party software in your system.

Part II Developing and Populating the Databases: The second part of the

book presents the steps required to effectively build and populate the data warehouse

databases. Most Microsoft DW/BI systems will implement the dimensional data

warehouse in both the relational database and the Analysis Services database. It

includes chapters 4 to 7.

Chapter 4 focuses on product selection and installation: it describes how to install and configure the various components of SQL Server 2005. Issues such as system sizing and configuration, and how and why one may choose to distribute the DW/BI system across multiple servers, are then discussed. There are also some physical design issues to consider, notably whether or not to partition the fact tables.

Chapter 5 describes the ETL portion of the DW/BI system as an evergreen design challenge and introduces SQL Server's new ETL technology, Integration Services. Chapter 6 covers the basic design of an ETL system in Integration Services, walking through the details of loading dimension tables. The chapter addresses key dimension management issues such as surrogate key assignment and attribute change management. According to chapter

7 the more closely the relational database and ETL process are designed to meet your

business requirements, the easier it is to design the Analysis Services database.

Analysis Services includes many features for building an OLAP database on top of a poorly constructed database, but for better results one should follow the prescribed guidelines and have a clean and conformed relational data warehouse as the starting point for a flawless OLAP database. Keeping in view the scope of this research work, Part II is of prime importance in this book.
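To make the dimension management issues mentioned above more concrete, the following minimal Python sketch illustrates surrogate key assignment together with a simplified Type 2 attribute change (slowly changing dimension) policy. The table layout, column names and sample customer are assumptions introduced purely for illustration and are not taken from the book.

    # Minimal sketch of surrogate key assignment and attribute change management
    # (a simplified Type 2 slowly changing dimension). Table layout and column
    # names are hypothetical, not taken from the book under review.
    from datetime import date

    dimension = {}        # business_key -> current row
    history = []          # all rows ever loaded (current and expired)
    next_surrogate = 1    # surrogate keys are generated by the ETL, never by the source

    def load_customer(business_key, attributes, load_date=None):
        """Insert a new customer or version an existing one when attributes change."""
        global next_surrogate
        load_date = load_date or date.today()
        current = dimension.get(business_key)
        if current is not None and current["attributes"] == attributes:
            return current["surrogate_key"]          # no change: reuse existing key
        if current is not None:
            current["expiry_date"] = load_date       # expire the old version (Type 2)
            current["is_current"] = False
        row = {
            "surrogate_key": next_surrogate,
            "business_key": business_key,
            "attributes": attributes,
            "effective_date": load_date,
            "expiry_date": None,
            "is_current": True,
        }
        next_surrogate += 1
        dimension[business_key] = row
        history.append(row)
        return row["surrogate_key"]

    # Example: the second load versions the customer because the city changed.
    load_customer("C-1001", {"name": "A. Sharma", "city": "Pune"})
    load_customer("C-1001", {"name": "A. Sharma", "city": "Mumbai"})
    print(len(history))   # 2 rows: one expired, one current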

Part III Developing the BI Applications: The third part of the book

demonstrates the steps required to present the data to the business users. It starts with

a chapter that clearly defines what is meant by BI applications. Then the reporting

services are discussed. Data mining can deliver huge value to the business by

looking for hidden relationships and trends in the data. Chapter 10 talks about how to

use Analysis Services to build data mining models.

Part IV Deploying and Managing the DW/BI System: The fourth section of

the book includes information about how to deploy and operate the DW/BI system.

This part highlights the techniques used to deal with an existing data warehouse. Key security issues, followed by metadata management and maintenance strategies for a data warehouse, have also been discussed.

Part V Extending the DW/BI System: the fifth part is associated with growth management of the data warehouse. Emphasis is placed on real-time business intelligence, followed by a discussion on present imperatives and the future outlook.

Mark I. Hwang and Hongjiang Xu [41] consider data warehousing as an

important area of practice and research. According to this research only a few studies

have assessed its practices in general and critical success factors in particular.

Although plenty of guidelines for implementation exist, few have been subjected

to empirical testing. In order to better understand implementation factors and their

effect on data warehousing success, perceptions of data warehousing professionals are

examined in a cross-sectional survey. Best subsets regression is used to identify the

specific factors that are important to each success variable. Since different companies

may have different objectives or emphases in their data warehousing endeavours, the

results are useful in identifying the exact factors that need attention and in providing a

basis for prioritizing those factors. The results also suggest several promising

directions for continued research on data warehousing success.

Stanisław Kozielski and Robert Wrembel [91] in this special issue of the

Annals of Information Systems presented current advances in the data warehouse and

OLAP technologies. The issue is composed of 15 chapters that in general address the

following research and technological areas: advanced technologies for building XML,

spatial, temporal, and real-time data warehouses, novel approaches to DW modelling,

data storage and data access structures, as well as advanced data analysis techniques.

The first two chapters of this publication were found to be of prime importance for

this research. The first chapter overviews open research problems concerning data

warehouse development for integrating and analyzing various complex types of data,

dealing with temporal aspects of data, handling imprecise data, and ensuring privacy

in data warehouses. Chapter two discusses the challenges in designing ETL processes for real-time (or near real-time) data warehouses and further proposes an architecture for a real-time data warehousing system.

2.1.2 Generating Test Data and Data Warehouse Testing:

J. Gray et al. [43] state that evaluating database system performance often requires generating synthetic databases which have certain statistical properties but are filled with dummy information. When evaluating different database designs, it is often necessary to generate several databases and evaluate each design. As database sizes grow to terabytes, generation often takes longer than evaluation. This paper presents several techniques for synthetic database generation: in particular, parallelism to obtain generation speedup and scaleup, congruential generators to obtain dense unique uniform distributions, and special-case discrete logarithms to generate indices concurrently with the base table generation.
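To illustrate the flavour of the congruential-generator idea mentioned above, the following Python sketch produces the keys 1..n exactly once each, in scattered order, by iterating a multiplicative congruential generator modulo a prime just above n and skipping the few out-of-range values. It is only a toy reconstruction under these assumptions; the cited paper derives the exact parameter conditions as well as the parallel and discrete-logarithm variants.

    # Sketch: generate the integers 1..n exactly once each, in pseudo-random order,
    # without sorting or remembering previously generated values.
    def is_prime(m):
        return m > 1 and all(m % d for d in range(2, int(m ** 0.5) + 1))

    def dense_unique_keys(n):
        p = n + 1
        while not is_prime(p):          # smallest prime strictly greater than n
            p += 1
        # find a generator of the multiplicative group mod p by brute force
        for g in range(2, p):
            seen, x = set(), 1
            for _ in range(p - 1):
                x = (x * g) % p
                seen.add(x)
            if len(seen) == p - 1:      # g has full period: it visits 1..p-1
                break
        x, keys = 1, []
        while len(keys) < n:
            x = (x * g) % p             # multiplicative congruential step
            if x <= n:                  # skip the few values that fall outside 1..n
                keys.append(x)
        return keys

    keys = dense_unique_keys(20)
    assert sorted(keys) == list(range(1, 21))   # dense, unique, uniform over 1..20
    print(keys)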

Tom Bersano et al. [98] in this paper consider the non-availability of

representative data as a significant issue in database testing. When real data is hard to

obtain or when its properties are hard to identify, synthetic data becomes an appealing

alternative. The authors perceive that synthetic data is used mainly in testing data-mining

algorithms, and synthetic data generators are thus limited to small toy data-mining

domains and are not very flexible. In this paper the authors have considered issues in

extending the use of synthetic data for performance analysis and benchmarking

different database algorithms and systems. Their approach consists of describing and

implementing a synthetic data specification language and data generation tool which

is more flexible and allows the generation of more realistic data. Then the authors

have discussed the procedure to generate synthetic data from specifications provided

by the user.

Ashraf Aboulnaga et al. [12] opine that synthetically generated data has

always been important for evaluating and understanding new ideas in database

research. In this paper, the authors described a data generator for generating synthetic

complex-structured XML data that allows a high level of control over the

characteristics of the generated data. This data generator is certainly not the ultimate

solution to the problem of generating synthetic XML data, but according to authors it

is very useful in the research on XML data management, and it was believed that it

can also be useful to other researchers. Furthermore, this paper has started a

discussion in the XML community about characterizing and generating XML data,

and moreover it may serve as a first step towards developing a commonly accepted

XML data generator for the community concerned.

N. Bruno and S. Chaudhuri [66] observe that the evaluation and applicability of many database techniques, ranging from access methods, histograms and optimization strategies to data normalization and mining, crucially depend on their

ability to cope with varying data distributions in a robust way. However,

comprehensive real data is often hard to come by, and there is no flexible data

generation framework capable of modeling varying rich data distributions. This has

led individual researchers to develop their own ad-hoc data generators for specific

tasks. As a consequence, the resulting data distributions and query workloads are

often hard to reproduce, analyze, and modify, thus preventing their wider usage. In

this paper the authors presented a flexible, easy to use, and scalable framework for

database generation. Afterwards they discussed how to map several proposed

synthetic distributions to their framework and reported preliminary results.

Xintao Wu et al. [106] in this paper state that the testing of database applications is of great importance. A significant issue in database application testing consists in the availability of representative data. This paper investigates the problem of generating a synthetic database based on a-priori knowledge about a production database. The approach followed was to fit a general location model using various characteristics (e.g., constraints, statistics, rules) extracted from the production database and then generate the synthetic data using the learnt model. The generated data was valid and similar to real data in terms of statistical distribution, hence it can be used for functional and performance testing. As the extracted characteristics may contain information which could be used by an attacker to derive confidential information about individuals, the paper presents a disclosure analysis method which applies the cell suppression technique for identity disclosure analysis and perturbation for value disclosure.
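The following toy Python sketch conveys the general flavour of this approach, namely extracting simple statistics from a production table and sampling synthetic rows from the learnt summary. It is not the general location model of the paper, and the column names and figures are invented.

    # Toy illustration: learn per-segment statistics from production data, then
    # generate synthetic rows that respect those statistics without copying any
    # individual record. Column names and values are hypothetical.
    import random
    from statistics import mean, stdev

    production = [
        {"segment": "retail",    "monthly_spend": 120.0},
        {"segment": "retail",    "monthly_spend": 150.0},
        {"segment": "corporate", "monthly_spend": 900.0},
        {"segment": "corporate", "monthly_spend": 1100.0},
    ]

    # Step 1: learn per-segment statistics (mean, spread, relative frequency).
    stats = {}
    for seg in {row["segment"] for row in production}:
        values = [r["monthly_spend"] for r in production if r["segment"] == seg]
        stats[seg] = (mean(values), stdev(values), len(values) / len(production))

    # Step 2: generate synthetic rows from the learnt summary.
    def synthetic_rows(n):
        segments = list(stats)
        weights = [stats[s][2] for s in segments]
        for _ in range(n):
            seg = random.choices(segments, weights)[0]
            mu, sigma, _ = stats[seg]
            yield {"segment": seg, "monthly_spend": round(random.gauss(mu, sigma), 2)}

    for row in synthetic_rows(5):
        print(row)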

Eric Tang et al. [31] discussed the importance of testing a database

application. AGENDA (A (test) Generator for Database Applications) was designed

to aid in testing database applications. In addition to the tools that AGENDA currently

has, three additional tools were made to enhance testing and feedback. They are the

log analyzer, attribute analyzer, and query coverage. The log analyzer finds relevant

entries in log file produced by DBMS, lexically analyzes them using a grammar

written in JavaCC, and stores some of the data in a database table. When the log entry

represents an executed SQL statement, this statement is recorded. The attribute

analyzer parses SQL statements. A SQL grammar for JavaCC was modified, adding

code to determine which attributes are read and written in each SQL statement. A new

test coverage criterion, query coverage, is defined. Query coverage checks whether

queries that the tester thinks should be executed actually are executed. Similar to the log analyzer, JavaCC was used to implement this. It is implemented by pattern matching

executed queries against patterns representing abstract queries (including host

variables) identified by the tester.
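A simplified illustration of the query coverage criterion is sketched below in Python: abstract queries containing host-variable placeholders are turned into patterns and matched against the queries actually executed. The SQL statements and placeholder syntax are assumptions for the example; the implementation described in the paper is built with JavaCC.

    # Sketch of query coverage: which of the tester's expected (abstract) queries
    # were actually executed, according to the statements recovered from the log?
    import re

    abstract_queries = [
        "SELECT * FROM accounts WHERE acct_id = :id",
        "UPDATE accounts SET balance = :amount WHERE acct_id = :id",
    ]

    executed_queries = [
        "SELECT * FROM accounts WHERE acct_id = 42",
    ]

    def to_pattern(abstract_sql):
        """Turn an abstract query into a regex: each :host_variable matches any literal."""
        parts = re.split(r":\w+", abstract_sql)          # split around host variables
        body = r".+?".join(re.escape(p) for p in parts)  # escape the fixed SQL text
        return re.compile(body + r"\s*$", re.IGNORECASE)

    patterns = {q: to_pattern(q) for q in abstract_queries}
    covered = {q: any(p.match(e) for e in executed_queries) for q, p in patterns.items()}

    for query, hit in covered.items():
        print(("covered " if hit else "NOT covered"), query)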

Pengyue J. Lin, et al. [80] said that data mining research has yielded many

significant and useful results such as discovering consumer-spending habits, detecting

credit card fraud, and identifying anomalous social behaviour. Information Discovery

and Analysis Systems (IDAS) extract information from multiple sources of data and

use data mining methodologies to identify potential significant events and

relationships. This research designed and developed a tool called IDAS Data and

Scenario Generator (IDSG) to facilitate the creation, testing and training of IDAS.

IDSG focuses on building a synthetic data generation engine powerful and flexible

enough to generate synthetic data based on complex semantic graphs.

Harry M. Sneed [38] contributed this experience report on system testing and

in particular on the testing of a data warehouse system. According to this report data

warehouses are large databases used solely for querying and reporting purposes. The

data warehouse in question here was dedicated to fulfilling the reporting requirements

of the BASEL-II agreement on the provision of auditing data by the banks, the

European equivalent of the Sarbanes-Oxley Act. The purpose of the testing project was

to prove that the contents of the data warehouse are correct in accordance with the

rules specified to fill them. In the end, the only way to achieve this was to rewrite the

rule specifications in a machine readable form and to transform them into post

assertions, which could be interpreted by a data verification tool for comparison of the

actual data contents with the expected data contents. The testing project was never

fully completed, since not all of the rules could be properly transformed.

Carsten Binnig et al. [17] consider that OLTP applications usually implement use cases which execute a sequence of actions, where each action usually

reads or updates only a small set of tuples in the database. In order to automatically

test the correctness of the different execution paths of the use cases implemented by

an OLTP application, a set of test cases and test databases needs to be created. In this

paper, it is suggested that a tester should specify a test database individually for each

test case using SQL in a declarative test database specification language. Moreover,

the authors also discussed the design of a database generator which creates a test

database based on such a specification. Consequently, their approach makes it possible to

generate a tailor-made test database for each test case and to bundle them together for

the test case execution phase.

Soumyajit Sarcar [90] in this paper takes a look at the different strategies to

test a data warehouse application. It attempts to suggest various approaches that could

be beneficial while testing the ETL process in a DW. A data warehouse is a critical

business application, and defects in it result in business loss that cannot be accounted

for. Here, some of the basic phases and strategies to minimize defects have been

proposed.

Vincent Rainardi [101] in this book has presented pioneering ideas for data

warehouse development with examples based on SQL Server. Keeping in view the scope of this study, chapters seven and nine of this book are of prime importance as they concentrate on ETL development, functionality and data quality assurance in a data warehouse. The author discusses what data quality is and why it is so important, and further describes the data quality process and the components involved in it. The categories of data quality rules in terms of validation, and the choices of what to do with the data when a data quality rule is violated (reject, allow, or fix), have been clearly explained.
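The reject/allow/fix choices described above can be pictured with the following minimal Python sketch; the rule names, thresholds and the correction applied are invented for illustration and do not come from the book.

    # Minimal sketch of the three choices for a violated data quality rule:
    # reject the row, allow it through (but log it), or fix it. Rules are invented.
    REJECT, ALLOW, FIX = "reject", "allow", "fix"

    rules = [
        # (description, predicate detecting a violation, action, optional fixer)
        ("age must be 0-120", lambda r: not (0 <= r["age"] <= 120), REJECT, None),
        ("country should be a 2-letter code", lambda r: len(r["country"]) != 2, FIX,
         lambda r: {**r, "country": r["country"][:2].upper()}),
        ("email is recommended", lambda r: not r.get("email"), ALLOW, None),
    ]

    def apply_rules(row, audit):
        for description, violated, action, fixer in rules:
            if not violated(row):
                continue
            audit.append((description, action, row))
            if action == REJECT:
                return None                # drop the row from the load
            if action == FIX:
                row = fixer(row)           # correct the value and keep going
            # ALLOW: keep the row unchanged; the violation is only logged
        return row

    audit = []
    incoming = [{"age": 34, "country": "India", "email": ""},
                {"age": 999, "country": "IN", "email": "x@y.com"}]
    loaded = [r for r in (apply_rules(row, audit) for row in incoming) if r]
    print(len(loaded), "rows loaded,", len(audit), "violations logged")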

Manoj Philip Mathen [60] in this white paper opines that exhaustive testing

of a data warehouse during its design is essential and is an ongoing process. Testing the data warehouse implementation is also of utmost significance. Organizational decisions depend entirely on the enterprise data, and the data has to be of supreme quality. Complex business rules and transformation logic implementations mandate diligent and thorough testing. Finally, the paper addresses some challenges for data

warehouse testing like voluminous data, heterogeneous sources, temporal

inconsistency and estimation challenges.

2.1.3 Understanding ETL, Data Quality and Data Management:

Larry P. English [52] revolutionized the data quality assurance framework

through this book. In data warehouse literature this book from Larry English is

considered as the information quality Bible for the information age. This book starts

from the premise that a high proportion of businesses suffer from poor information quality, through inaccurate and missing data. Not only do they lose out substantially

as a result, but such inaccuracies, in turn, corrupt data warehouses which, then, fail.

The book's aim is to be a 'one stop' source for helping businesses to reduce costs and

increase profits by showing them how to measure the quality of their information

resources, as well as how to cleanse their data and keep it clean. This book comes

with a guarantee: that the author will personally refund the cost of the book to any

reader whose organisation does not get a substantial benefit from following its

recommendations.

Man-Yee Chan and Shing-Chi Cheung [61] state that the testing of database applications is crucial for ensuring high software quality, as undetected faults can result

in unrecoverable data corruption. The problem of database application testing can be

broadly partitioned into the problems of test cases generation, test data preparation

and test outcomes verification. Among the three problems, the problem of test cases

generation directly affects the effectiveness of testing. Conventionally, database

application testing is based upon whether or not the application can perform a set of

predefined functions. While it is useful to achieve a basic degree of quality by

considering the application to be a black box in the testing process, white box testing

is required for more thorough testing. However, the semantics of the Structured Query Language (SQL) statements embedded in database applications are rarely considered

in conventional white box testing techniques. In this paper, the authors propose to

complement white box techniques with the inclusion of SQL semantics. Their approach is to transform the embedded SQL statements into procedures in some

general-purpose programming language and thereby generate test cases using

conventional white box testing techniques. Additional test cases that are not covered

in traditional white box testing are generated to improve the effectiveness of database

application testing. The steps of both SQL statements transformation and test cases

generation are explained and illustrated using an example adapted from a course

registration system. The authors successfully identified additional faults involving the

internal states of databases.
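The essence of this transformation can be sketched as follows in Python: the WHERE clause of an embedded query is rewritten as an ordinary predicate so that conventional branch and condition coverage reasoning can be applied to it. The query, schema and test values are hypothetical, and the paper's actual transformation is considerably more systematic.

    # Sketch: treat an embedded SQL statement as a procedure so that ordinary
    # white box (branch/condition) testing can be applied to its semantics.
    #
    # Embedded query under test (conceptually):
    #   SELECT * FROM course WHERE seats_left > 0 AND (level = 'UG' OR gpa >= 3.0)
    def course_selectable(row):
        """Predicate equivalent of the WHERE clause above."""
        return row["seats_left"] > 0 and (row["level"] == "UG" or row["gpa"] >= 3.0)

    # Test cases derived as for an ordinary compound condition:
    # exercise each atomic condition as both true and false.
    test_rows = [
        {"seats_left": 5, "level": "UG", "gpa": 2.0},   # selected via level
        {"seats_left": 5, "level": "PG", "gpa": 3.5},   # selected via gpa
        {"seats_left": 5, "level": "PG", "gpa": 2.0},   # filtered by second conjunct
        {"seats_left": 0, "level": "UG", "gpa": 3.5},   # filtered by seats_left
    ]
    for row in test_rows:
        print(row, "->", "selected" if course_selectable(row) else "filtered out")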

Michael H. Brackett [64] in this book has a view that poor data quality

hampers today's organizations in many ways: it makes data warehousing and

knowledge management applications more expensive and less effective, presents major obstacles to e-Business transformation, may slash day-to-day employee productivity, and translates directly into poor strategic and tactical decisions. Subsequently, ten "bad habits" that may lead to poor data and ten proven

solutions that may enable business managers to transform these bad habits into best

practices have been discussed. Brackett has shown how the "bad habits" evolved,

and exactly how to replace them with best practices for ensuring improvement in data

quality. Brackett further demonstrates exactly how to implement a solid foundation

for quality data to develop organization-wide, integrated, subject-oriented data

architecture and then build a high-quality data resource within that architecture.

David Chays et al. [26] highlighted the important role that databases play in nearly every modern organization, yet, according to this paper, relatively little research effort has focused on how to test them. The paper discusses issues arising in testing

database systems and presents an approach to testing database applications. In testing

such applications, the state of the database before and after the user's operation plays

an important role, along with the user's input and the system output. A tool for

populating the database with meaningful data that satisfy database constraints has

been prototyped. Its design and role in a larger database application testing tool set

has also been discussed.

Panos Vassiliadis [75] in this Ph.D. thesis discusses the issues involved in integrating heterogeneous information sources in organizations and enabling On-Line Analytic

Processing (OLAP). According to him, neither the accumulation, nor the storage

process, seems to be completely credible. The question that arises here is how to

organize the design, administration and evolution choices in such a way that all the

different, and sometimes opposing, quality user requirements can be simultaneously

satisfied.

To tackle this problem, this thesis makes the following contributions:

The first major result presented here is a general framework for the treatment

of data warehouse metadata in a metadata repository. The framework requires the

classification of metadata in at least two instantiation layers and three perspectives.

The metamodel layer constitutes the schema of the metadata repository and at the

metadata layer resides the actual meta-information for a particular data warehouse.

Then, he proposed a quality metamodel, built on the widely accepted Goal-Question-Metric approach for the quality management of information

systems. This metamodel is capable of modeling complex activities, their

interrelationships, the relationship of activities with data sources and execution

details. The ex ante treatment of the metadata repository is enabled by a full set of

steps, i.e., quality question, which constitute the methodology for data warehouse

quality management and the quality-oriented evolution of a data warehouse based on

the architecture, process and quality metamodels. Special attention is paid to a

particular part of the architecture metamodel, i.e., the modeling of OLAP databases.

To this end, he first provided a categorization of the work in the area of OLAP logical

models by surveying some major efforts, including commercial tools, benchmarks and

standards, and academic efforts. He also attempted a comparison of the various

models along several dimensions, including representation and querying aspects.

Finally, this thesis gives an extended review of the existing literature on the field, as

well as a list of related open research issues.

David Sammon and Pat Finnegan [27] consider that a data warehouse can

be of great potential value for business organizations. Nevertheless, implementing a data

warehouse is a complex project that has caused difficulty for organizations. This

paper presents the results of a study of four mature users of data warehousing

technology. Collectively, these organizations have experienced many problems and

have devised many solutions while implementing data warehousing. These

experiences are captured in the form of ten organizational prerequisites for

implementing data warehousing. The authors believe that this model could potentially

be used by organizations to internally assess the likelihood of data warehousing

project success, and to identify the areas that require attention prior to commencing

implementation.

E. Rahm and H. Hai Do [30] classify data quality problems that can be

addressed by data cleaning routines and provide an overview of the main solution

approaches. Data cleaning is especially required when integrating heterogeneous

sources of data. Data from divergent sources should be addressed together with

schema-related data transformations. In data warehouses, data cleaning is a major part

of the so-called ETL process. The article also presents contemporary tool support for the data cleaning process.

Thomas Vetterli et al. [97] identify metadata as a key success factor in data warehouse projects. It captures all kinds of information necessary to

analyze, design, build, use, and interpret the data warehouse contents. In order to

spread the use of metadata, one should enable the interoperability between

repositories, and for tool integration within data warehousing architectures, a standard

for metadata representation and exchange is needed. This paper considers two

standards and compares them according to specific areas of interest within data

warehousing. Despite their incontestable similarities, there are significant differences

between the two standards which would make their unification difficult.

V. Raman and J. Hellerstein [100] consider the cleansing of data errors in

structure and content as an important aspect of data warehouse integration. Current

solutions for data cleaning involve many iterations of data “auditing” to find errors,

and long-running transformations to fix them. Users need to endure long waits, and

often write complex transformation scripts. Authors presented Potter’s Wheel, an

interactive data cleaning system that tightly integrates transformation and discrepancy

detection. Users gradually build transformations to clean the data by adding or

undoing transforms on a spreadsheet-like interface; the effect of a transform is shown

at once on records visible on screen. These transforms are specified either through

simple graphical operations, or by showing the desired effects on example data

values. In the background, Potter’s Wheel automatically infers structures for data

values in terms of user-defined domains, and accordingly checks for constraint

violations. Thus users can gradually build a transformation as discrepancies are found,

and clean the data without writing complex programs or enduring long delays.
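The interaction style described above may be pictured with the small Python sketch below, in which transforms are composed step by step and every record is re-checked against simple value domains after each step. The transforms, domains and sample records are invented examples and are not Potter's Wheel's own operations.

    # Rough sketch: small transforms are added one at a time, and after each step
    # the records are checked against expected value domains so that remaining
    # discrepancies surface immediately.
    import re

    records = ["  Smith,John , 1972", "Doe,Jane, 19x4", "Roe , Richard, 1985"]

    transforms = []                                            # built up interactively
    transforms.append(lambda v: v.strip())                     # trim whitespace
    transforms.append(lambda v: re.sub(r"\s*,\s*", ",", v))    # normalize separators

    def apply_transforms(value):
        for t in transforms:
            value = t(value)
        return value

    def discrepancies(value):
        """Check each field against an expected domain (name, name, 4-digit year)."""
        fields = value.split(",")
        problems = []
        if len(fields) != 3:
            problems.append("wrong field count")
        elif not re.fullmatch(r"\d{4}", fields[2]):
            problems.append("year is not a 4-digit number: " + fields[2])
        return problems

    for raw in records:
        cleaned = apply_transforms(raw)
        print(cleaned, "->", discrepancies(cleaned) or "ok")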

Angela Bonifati et al. [8] opine that data warehouses are specialized

databases devoted to analytical processing. They are used to support decision-making

activities in most modern business settings typically when complex data sets have to

be studied and analyzed. The technology for analytical processing assumes that data

are presented in the form of simple data marts, consisting of a well-identified

collection of facts and data analysis dimensions (star schema). Despite the wide

diffusion of data warehouse technology and concepts, the industry still does not have methods that help and guide the designer in identifying and extracting such data marts

out of an enterprise wide information system, covering the upstream, requirement-

driven stages of the design process. Many existing methods and tools support the

activities related to the efficient implementation of data marts on top of specialized

technology (such as the ROLAP or MOLAP data servers). This paper presents a

method to support the identification and design of data marts. The method is based on

three basic steps. A first top-down step makes it possible to elicit and consolidate user

requirements and expectations. This is accomplished by exploiting a goal-oriented

process based on the Goal/Question/Metric paradigm developed at the University of

Maryland. Ideal data marts are derived from user requirements. The second bottom-up

step extracts candidate data marts.

M. Dorothy Blakeslee and John Rumble, Jr. [18] in this paper discussed the

steps involved in the process of turning an initial concept for a database into a

finished product. The paper further discusses those quality related issues that need to

be addressed to ensure the high quality of the database. The basic requirements for a

successful database quality process have been presented with specific examples drawn

from experience gained by the authors at the Standard Reference Data Program at the

National Institute of Standards and Technology.

Fá Rilston Silva Paim and Jaelson Freire Brelaz de Castro [33] have a

view that in the novel domain of Data Warehouse Systems, software engineers are

required to define a solution that integrates a number of heterogeneous sources to

extract, transform and aggregate data, as well as to offer flexibility to run ad-hoc

queries that retrieve analytic information. Moreover, these activities should be

performed based on a concise dimensional schema. This intricate process with its

particular multidimensionality claims for a requirements engineering approach to aid

the precise definition of data warehouse applications. In this paper, the authors adapt

the traditional requirements engineering process and propose DWARF, a Data

Warehouse Requirements definition method. A case study demonstrates how the

method has been successfully applied in the company-wide development of a large-

scale data warehouse system that stores hundreds of gigabytes of strategic data for the

Brazilian Federal Revenue Service.

P. Vassiliadis et al. [74] state that Extraction-Transformation-Loading (ETL)

tools are pieces of software responsible for the extraction of data from several

sources, their cleansing, customization and insertion into a data warehouse. In this

paper, the logical design of ETL scenarios has been discussed. They described a framework for the declarative specification of ETL scenarios with two main

characteristics: genericity and customization. The authors presented a palette of

several templates, representing frequently used ETL activities along with their

semantics and their interconnection. Finally, they discussed implementation issues

and presented a graphical tool, ARKTOS II, which facilitates the design of ETL

scenarios.
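The template idea can be illustrated with the following minimal Python sketch, in which a few generic, parameterizable activities are instantiated and chained into a linear ETL scenario. The template names and sample records are assumptions made for the example; they do not reproduce the ARKTOS II palette.

    # Minimal sketch: frequently used ETL activities written once as generic,
    # parameterizable templates and instantiated into a concrete scenario.
    def not_null(field):
        """Template: filter out records whose <field> is missing."""
        return lambda rows: (r for r in rows if r.get(field) not in (None, ""))

    def surrogate_key_lookup(field, mapping, target):
        """Template: replace a source value with its surrogate key via <mapping>."""
        return lambda rows: ({**r, target: mapping[r[field]]} for r in rows
                             if r[field] in mapping)

    def projection(*fields):
        """Template: keep only the listed fields."""
        return lambda rows: ({f: r[f] for f in fields} for r in rows)

    def run_scenario(rows, activities):
        for activity in activities:          # activities form a linear workflow here
            rows = activity(rows)
        return list(rows)

    scenario = [
        not_null("customer_code"),
        surrogate_key_lookup("customer_code", {"C01": 1, "C02": 2}, "customer_sk"),
        projection("customer_sk", "amount"),
    ]
    source = [{"customer_code": "C01", "amount": 10.5},
              {"customer_code": "",    "amount": 3.0},
              {"customer_code": "C02", "amount": 7.25}]
    print(run_scenario(source, scenario))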

Chiara Francalanci and Barbara Pernici [22] in this paper suggested that

the quality of data is often defined as "fitness for use", i.e., the ability of a data

collection to meet user requirements. The assessment of data quality dimensions

should consider the degree to which data satisfy users’ needs. User expectations are

clearly related to the selected services and at the same time a service can have

different characteristics depending on the type of user that accesses it. The data

quality assessment process has to consider both aspects and, consequently, select a

suitable evaluation function to obtain a correct interpretation of results. This paper

proposes a model that ties the assessment phase to user requirements. Multichannel

information systems are considered as an example to show the applicability of the

proposed model.

F. Missi et al. [32] report the results of a study conducted on the

implementation of data-driven customer relationship management (CRM) strategies.

Despite its popularity, there is still a significant failure rate of CRM projects. A

combination of survey and interviews/case studies research approach was used. It is

found that CRM implementers are not investing enough effort in improving data quality and data integration processes to support their CRM applications.

Alexandros Karakasidis et al. [6] discussed the traditional practice of offline data refreshment in the data warehouse industry. Active Data Warehousing refers to a

new trend where data warehouses are updated as frequently as possible, to

accommodate the high demands of users for fresh data. In this paper, the authors

proposed a framework for the implementation of active data warehousing, with the

following goals: (a) minimal changes in the software configuration of the source, (b)

minimal overhead for the source due to the active nature of data propagation, (c) the

possibility of smoothly regulating the overall configuration of the environment in a

principled way. In their framework, they have implemented ETL activities over queue

networks and employ queue theory for the prediction of the performance and the

tuning of the operation of the overall refreshment process. Due to the performance

overheads incurred, the authors explored different architectural choices for this task

and discuss the issues that arise for each of them.
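A much simplified sketch of the queue-based propagation idea, assuming a single source, a bounded in-memory queue and micro-batch loading, is given below in Python; the batch size and the transformation applied are placeholders rather than elements of the proposed framework.

    # Sketch: source changes are pushed onto a queue with minimal work at the
    # source, and a separate ETL consumer drains the queue towards the warehouse.
    import queue
    import threading

    change_queue = queue.Queue(maxsize=1000)   # bounds the overhead at the source
    warehouse = []                             # stands in for the target fact table

    def source_emit(change):
        """Called at the source side: just enqueue, keeping source overhead low."""
        change_queue.put(change)

    def etl_consumer(batch_size=10):
        """Drain the queue, transform and 'load' in micro-batches."""
        batch = []
        while True:
            change = change_queue.get()
            if change is None:                 # sentinel: flush and stop
                warehouse.extend(batch)
                return
            batch.append({"order_id": change["id"], "amount": round(change["amt"], 2)})
            if len(batch) >= batch_size:
                warehouse.extend(batch)        # one load per batch, not per change
                batch = []

    worker = threading.Thread(target=etl_consumer)
    worker.start()
    for i in range(25):
        source_emit({"id": i, "amt": i * 1.005})
    source_emit(None)
    worker.join()
    print(len(warehouse), "rows propagated to the warehouse")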

Aristides Triantafillakis et al. [10] highlight the importance of data

warehouses for collaborative and interoperation e-commerce environments. These e-

commerce environments are physically scattered along the value chain. Adopting

system and information quality as success variables, the authors argue that what is

required for data warehouse refreshment in this context is inherently more complex

than the materialized view maintenance problem and the authors offer an approach

that addresses refreshment in a federation of data warehouses. Defining a special kind

of materialized view, an open multi-agent architecture is proposed for their

incremental maintenance while considering referential integrity constraints on source

data.

Paolo Giorgini et al. [79] proposed a goal-oriented approach to requirement

analysis for data warehouses. Two different perspectives are integrated for

requirement analysis: organizational modeling, centered on stakeholders, and

decisional modeling, focused on decision makers. The approach followed can be

employed within a demand-driven and a mixed supply/demand-driven design

framework: in the second case, while the operational sources are still explored to

shape hierarchies, user requirements play a fundamental role in restricting the area of

interest for analysis and in choosing facts, dimensions, and measures. The

methodology proposed, supported by a prototype, is described with reference to a real

case study.

Alkis Simitsis [7] in this paper confirmed Extraction-Transformation-Loading

(ETL) tools as pieces of software responsible for the extraction of data from several

sources, their cleansing, customization and insertion into a data warehouse. In

a previous line of research, the author has presented a conceptual and a logical model

for ETL processes. This paper describes the mapping of the conceptual to the logical

model. First, it identifies how a conceptual entity is mapped to a logical entity.

Next, the execution order in the logical workflow using information adapted from the

conceptual model has been determined. Finally, the article provides a methodology

for the transition from the conceptual to the logical model.

Lutz Schlesinger et al. [55] view data extraction from heterogeneous data

sources and transferring data into the data warehouse system as one of the most cost

intensive tasks in setting up and operating a data warehouse. Special tools may be

used to connect different sources and target systems. In this paper, the authors

propose an architecture which enables the flexible integration of data sources into any

target database system. The approach is based on the idea of splitting the classical

wrapping module into a source specific and a target specific part and establishing the

communication between these components based on Web Service technology. The

paper describes the general architecture, the use of Web Service technology to

describe and dynamically integrate participating data sources and the deployment

within a specific database system.

Sid Adelman et al. [89] stressed the need to test the extract, transform and load (ETL) process during development. One should verify the results each time the ETL process runs and use the ETL tool tie-outs to verify that the database is loaded correctly. In addition, the user who runs the queries and generates reports should

always validate their results. The best test cases come from the detailed requirements

documents created during the planning and analysis phases. Each and every

requirement that is documented must be measurable, achievable and testable. Each

requirement must be assigned to someone who wants it, who will implement it and

who will test and measure it. Test cases will be determined by the nature of the

project. The best test cases are developed jointly between the business users and the

developers on the project.
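The kind of ETL tie-out mentioned above can be illustrated with the short Python sketch below, which compares row counts and a control total between the source extract and the loaded target; the tables, measure and tolerance are invented for the example.

    # Sketch of a tie-out check: after each ETL run, compare counts and control
    # totals computed on the source extract with the same figures on the target.
    source_extract = [{"order_id": 1, "amount": 100.0},
                      {"order_id": 2, "amount": 250.5},
                      {"order_id": 3, "amount": 75.25}]
    target_fact    = [{"order_id": 1, "amount": 100.0},
                      {"order_id": 2, "amount": 250.5}]          # one row lost in load

    def tie_out(source_rows, target_rows, measure, tolerance=0.01):
        checks = {
            "row count": (len(source_rows), len(target_rows)),
            "sum(" + measure + ")": (sum(r[measure] for r in source_rows),
                                     sum(r[measure] for r in target_rows)),
        }
        failures = []
        for name, (expected, actual) in checks.items():
            if abs(expected - actual) > tolerance:
                failures.append(f"{name}: source={expected} target={actual}")
        return failures

    problems = tie_out(source_extract, target_fact, "amount")
    print("tie-out OK" if not problems else "tie-out FAILED: " + "; ".join(problems))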

Majid Abai [59], the president of Seena Technologies, stated that one of the major problems identified in every data warehousing system, including operational data stores and data marts, is the problem of data certification. Data quality in data warehouses has always been a hot topic, but in this whitepaper he focused on holistic

and detailed data certification for a DW project. The potential data certification

problems from the standpoint of extraction, transformation, cleansing, and loading of

such applications in addition to data summarization and cubing of a typical system

have also been discussed. In addition, some methods have also been introduced for

verifying the quality of data at each risk point to enable the ease of isolating the

problem and solving it. This discussion also included performance of the ETL

process.

Adam Wolszczak and Marcin Dabrowski [5] have a view that automated

testing has become a buzz word in the data warehouse and business intelligence

management teams in the past years. Project Managers (PM) and lead teams put great pressure on limiting the schedule and time of project development. They often look for ways to improve the quality assurance cost factors in the testing phase of the software engineering process. Automation is seen by them as an ideal solution to their needs: not only does it speed up the testing process, but it also decreases costs, as everything is tested by itself and one only needs to click and verify. The reality of software testing automation is far from this optimistic management perspective.

Automation is not a panacea for all software testing engineering work and is not

applicable to many quality assurance types of work, especially in the data warehouse

development. In fact, in many cases automation, especially badly managed, may bring

more costs than positive Return on Investment. This paper directly addresses the

analysis of testing automation and testing automation tool usage in the selected fifteen

testing areas of a data warehouse. It underlines the need to analyse where effective

quality assurance test automation should be performed and where it should not be

introduced. It identifies the business intelligence (BI) development regions where the

Return on Investment of automation in a given time scope is positive and describes

where manual testing remains the best testing option. The analysis is based on cost

effectiveness.

Nikolay Iliev [70] highlights the common practice of using many different

software systems from different vendors in order to provide full coverage of the

business functions. The integration between these systems often requires real time or

off-line data migration as part of the operational communication between the systems

or in order to populate analytical decision making systems on regular basis. The

second most common situation is the transition to a new software system, which could be either a new version with a different data structure or a different system from another vendor. Data migration is not an easy task, as it is usually performed using in-house, hand-coded ETL algorithms. Such low-level implementations are difficult to build and maintain. Hence rigorous testing is required to identify and to

rectify hidden errors in ETL routines at logical and technical level.

Panos Vassiliadis et al. [77] presented Extraction–Transform–Load (ETL)

processes as complex data workflows, which are responsible for the maintenance of a

Data Warehouse. Their practical importance is denoted by the fact that a plethora of

ETL tools currently constitutes a multi-million dollar market. However, each one of

them follows a different design and modeling technique and internal language. So far,

the research community has not agreed upon the basic characteristics of ETL tools.

Hence, there is a necessity for a unified way to assess ETL workflows. In this paper,

the authors have investigated the main characteristics and peculiarities of ETL

processes and proposed a principled organization of test suites for the problem of

experimenting with ETL scenarios.

Xuan Thi Dung et al. [107] believe that the globe has witnessed a significant number of integration techniques for data warehouses to support web-integrated data.

However, the existing works focus extensively on the design concept. In this paper,

authors have focused on the performance of a web database application such as an

integrated web data warehousing using a well-defined and uniform structure to deal

with web information sources including semi-structured data such as XML data, and

documents such as HTML in a web data warehouse system. Thus, a prototype was

developed that can not only be operated through the web, but can also handle the

integration of web data sources and structured data sources. The main contribution of

this research is the performance evaluation of an integrated web data warehouse

application which includes two tasks. Task one is to perform a verification of the

correctness of integrated data based on the result set that is retrieved from the web

integrated data warehouse system using complex and OLAP queries. The result set is

checked against the result set that is retrieved from the existing independent data

source systems. Task two is to measure the performance of OLAP or complex query

by investigating source operation functions used by these queries to retrieve the data.

Ricardo Jorge Santos and Jorge Bernardino [84] in this article stated that a

data warehouse provides information for analytical processing, decision making and

data mining tools. As the concept of real-time enterprise evolved, the synchronism

between transactional data and data warehouses has been statically implemented and

redefined. Traditional data warehouse systems have static structures of their schemas

and relationships between data, and therefore are not able to support any dynamics in

their structure and content. Their data is only periodically updated because they are

not prepared for continuous data integration. For real-time enterprises with needs in

decision support purposes, real-time data warehouses seem to be very promising. In

this paper the authors present a methodology on how to adapt data warehouse

schemas and user-end OLAP queries for efficiently supporting real-time data

integration. To accomplish this, the authors used techniques such as table structure

replication and query predicate restrictions for selecting data, to enable continuously

loading of the data warehouse with minimum impact in query execution time. The

authors demonstrate the efficiency of this method by analyzing its impact on query

performance using the TPC-H benchmark. Query workloads were calculated while

simultaneously performing continuous data integration at various insertion time rates.
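A toy illustration of table structure replication combined with a query predicate restriction is sketched below; SQLite and the hypothetical sales schema are used only to make the idea concrete, whereas the paper applies these techniques at the warehouse schema level and evaluates them with TPC-H.

    # Toy sketch: continuous inserts go to a lightweight replica of the fact table,
    # and user queries add a timestamp predicate so they read a stable snapshot
    # while loading continues.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (sale_ts INTEGER, amount REAL)")      # static fact table
    db.execute("CREATE TABLE sales_rt (sale_ts INTEGER, amount REAL)")   # replica for continuous loads
    db.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

    def continuous_insert(ts, amount):
        db.execute("INSERT INTO sales_rt VALUES (?, ?)", (ts, amount))   # cheap append

    def total_sales(snapshot_ts):
        """OLAP-style query restricted to rows loaded up to snapshot_ts."""
        query = """SELECT SUM(amount) FROM (
                       SELECT sale_ts, amount FROM sales
                       UNION ALL
                       SELECT sale_ts, amount FROM sales_rt WHERE sale_ts <= ?
                   )"""
        return db.execute(query, (snapshot_ts,)).fetchone()[0]

    continuous_insert(3, 5.0)
    continuous_insert(4, 7.5)
    print(total_sales(snapshot_ts=3))   # 35.0 : includes rows up to timestamp 3 only
    print(total_sales(snapshot_ts=4))   # 42.5 : the newest row becomes visible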

Lila Rao and Kweku-Muata Osei-Bryson [54] consider data as an important organizational asset because of its assumed value, including its potential

to improve the organizational decision-making processes. Such potential value,

however, comes with various costs, including those of acquiring, storing, securing and

maintaining the given assets at appropriate quality levels. Clearly, if these costs

outweigh the value that results from using the data, it would be counterproductive to

acquire, store, secure and maintain the data. Thus cost benefit assessment is

particularly important in data warehouse development; yet very few techniques are

available for determining the value that the organization will derive from storing a

particular data table and hence determining which data set should be loaded in the

data warehouse. This research seeks to address the issue of identifying the set of data

with the potential for producing the greatest net value for the organization by offering

a model that can be used to perform a cost benefit analysis on the decision support

views that the warehouse can support and by providing techniques for estimating the

parameters necessary for this model.

Angélica Caro et al. [9] have a view that data quality is a critical issue in

today's interconnected society. Advances in technology and the growing use of the Internet have led to the creation of a great variety of Internet-based applications such as web portals. These applications are important data sources and/or means of

accessing information which many people use to make decisions or to carry out tasks.

Quality is a very important factor in any software product and also in data. As quality

is a wide concept, quality models are usually used to assess the quality of a software

product. From the software point of view there is a widely accepted standard proposed

by ISO/IEC (the ISO/IEC 9126) which proposes a quality model for software

products. However, until now a similar proposal for data quality has not existed.

Although authors have found some proposals of data quality models, some of them

working as "de facto" standards, none of them focus specifically on web portal data

quality and the user's perspective. In this paper, a set of 33 attributes have been

proposed which are relevant for portal data quality. These have been obtained from a

revision of literature and a validation process carried out by means of a survey.

Although these attributes do not yet form a usable model, according to the authors

they might be considered as a good starting point for constructing one.

Fadila Bentayeb et al. [34] state that data warehouses store aggregated data

issued from different sources to meet users' analysis needs for decision support. The

nature of the work of users implies that their requirements are often changing and do

not reach a final state. Therefore, a data warehouse cannot be designed in one step;

usually it evolves over the time. In this paper, the authors have proposed a user-driven

approach that enables a data warehouse schema update. It consists in integrating the

users' knowledge in the data warehouse modeling to allow new analysis possibilities.

The approach followed is composed of four phases: (1) users' knowledge acquisition, (2) knowledge integration, (3) data warehouse schema update and (4) on-line analysis.

To support the proposed approach, the authors have defined a Rule-based Data

Warehouse (R-DW) model composed of two parts: one "fixed" part and one

"evolving" part. The fixed part corresponds to the initial data warehouse schema,

whose purpose is to provide an answer to global analysis needs. The evolving part is

defined by means of aggregation rules, which allow personalized analyses.

From the literature survey, the prevailing gaps in the data warehousing process have been identified in order to ascertain the direction of this research work. On the basis of the available literature, a test data generator and an ETL routine have been designed to empirically envisage the possible data quality issues through automated ETL testing.

_____________________________________________________________________

☻☺☻☺☻☺☻☺☻☺