CHAPTER 2
REVIEW OF LITERATURE
2.0 INTRODUCTION
This chapter reviews the prevailing literature within the scope of this
research. The literature review was undertaken to understand the foundations of this
research. Literature reviewed includes journals, publications, magazines, books and
technical reports from renowned organizations leading the data warehouse industry.
The theoretical and empirical research findings of various researchers have been
studied and analyzed to assess the feasibility of the defined objectives.
2.1 REVIEW OF LITERATURE:
This chapter categorizes and elucidates the work done so far in the area of data
warehouse development. The chapter has been organized into three parts. First, the
theoretical vision of data warehouse development is presented, including issues
associated with data warehouse design and deployment and the data warehouse
architectures proposed by various researchers. Second, literature concerned with
generating synthetic test data and with data warehouse testing is reviewed. The final
part reviews literature on ETL systems and data quality issues, together with
contemporary distributed data management issues, in order to propose a refined
architecture for distributed data management.
The literature review begins with foundational knowledge of data warehouse
architectures, proceeding from early data warehouse conjectures towards the
advanced data warehouse setups prevailing in the data warehouse industry. The test
data generation methodologies proposed so far for generating synthetic records were
then studied, followed by a review of the testing strategies proposed by eminent
researchers. The literature review further examined the details of the Extraction,
Transformation and Loading (ETL) module of the data warehouse. As this research
aims to establish the testing procedures required for quality ETL routines,
appropriate literature was reviewed to locate the possible loopholes in the ETL
development process.
2.1.1 Getting Familiar with the Data Warehouse:
Robert M. Bruckner et al. [86] identified intelligent and comprehensive data
warehouse systems as a powerful instrument for organizations to analyze their
business. The implementation of such decision systems for an enterprise-wide
management and decision support system can be very different from traditional
software implementations. Because data warehouse systems are strongly data-driven
entities, their development process is highly dependent on the underlying
data. Since data warehouse systems concern many organizational units, the collection
of unambiguous, complete, verifiable, consistent and usable requirements can be a
very difficult task. Use cases are generally used as a standard notation for
object-oriented requirement modelling. In this paper the team showed how use cases can
enhance communication between stakeholders, domain experts, data warehouse
designers and other professionals with diverse backgrounds. Three different
abstraction levels namely business, user and system requirements of data warehouse
requirements were introduced and discussed.
Beate List et al. [16] stated in this case study on Process Warehousing that
building a data warehouse is a very challenging task because, compared to software
engineering, it is quite a young discipline and does not yet offer well-established
strategies and techniques for the development process. Current data warehouse
development methods can fall within three basic groups: data-driven, goal-driven and
user-driven. During this study all three development approaches have been applied to
the Process Warehouse that is used as the foundation of a process-oriented decision
support system, which aims to analyse and improve business processes continuously.
In this paper the authors evaluated all three development methodologies by various
assessment criteria. The aim is to establish a link between the methodology and the
requirement domain.
R.G. Little and M.L. Gibson [82] in this study have surveyed data warehouse
implementation perspectives. Project participants were consulted to record their
perceptions, which may further contribute to the data warehouse implementation
process. The respondents included: functional managers/staff, IS managers/staff, and
consultants. The study identified eight significant factors that according to
participants may impact data warehouse implementation.
Hugh J. Watson and Thilini Ariyachandra [40] conducted this research to
analyze the success factors of different data warehouse architectures. The main
objectives of this study were:
a) understanding the factors that influence the selection of a data warehouse
architecture, and
b) assessing the success of the various architectures.
The academic and data warehousing literature and industry experts were used to
identify architecture selection factors and success measures. The information
obtained was then used to create questions for a Web-based survey that collected
data from 454 companies about the respondents, their companies, their data
warehouses, the architectures they use, and the success of their architectures. The
success metrics identified were information quality, system quality, individual
impacts, organizational impacts, development time, and development cost.
Perhaps the most interesting finding of this study is the almost equal score of
the bus, hub and spoke, and centralized architectures on the information and
system quality metrics as well as the individual and organizational metrics. It helps
explain why these competing architectures have survived over time: they are
equally successful for their intended purposes. Based on these metrics, no
single architecture is dominant, but the hub and spoke architecture still seems to
be the first choice among organizations implementing a data warehouse.
Joy Mundy et al. [48] have discussed five major aspects of data warehouse
implementation. This book is organized into five parts; a brief synopsis of these parts
is as under:
Part I Requirements, Realities, and Architecture: first three chapters have
been included into this part. The first chapter of this part deals with a brief summary
of the Business Dimensional Lifecycle. Afterwards the most important step of
gathering the business requirements is discussed. In chapter two the authors present a
brief primer on how to develop a dimensional model. This chapter presents
terminology and concepts used throughout the book, so it’s vital that one should
understand this material. Chapter three deals with the Architecture and Product
Selection tasks, which are straightforward for a Microsoft DW/BI system. In this short
chapter there is a detailed discussion about how and where to use the various
components of SQL Server 2005, other Microsoft products, and even where to use
third-party software in your system.
Part II Developing and Populating the Databases: The second part of the
book presents the steps required to effectively build and populate the data warehouse
databases. Most Microsoft DW/BI systems will implement the dimensional data
warehouse in both the relational database and the Analysis Services database. It
includes chapters 4 to 7.
Chapter 4 describes how to install and configure the various components of
SQL Server 2005. Then issues like system sizing and configuration, and how and why
one may choose to distribute DW/BI system across multiple servers are discussed.
There are some physical design issues to consider, notably whether or not to partition
the fact tables. Chapter 4 is focused on the Product Selection and Installation.
Chapter 5 describes the ETL portion of the DW/BI system as an evergreen design
challenge. In this chapter introduction to SQL Server’s new ETL technology,
Integration Services is discussed. Chapter 6 talks about the basic design for ETL
system in Integration Services. One has to walk through the details before loading
dimension tables. The chapter addresses the key dimension management issues such
as surrogate key assignment and attribute change management. According to chapter
7 the more closely the relational database and ETL process are designed to meet your
business requirements, the easier it is to design the Analysis Services database.
Analysis Services includes lots of features for building an OLAP database on top of a
poorly constructed database, but for better results one should follow the proprietor’s
instructions and have a clean and conformed relational data warehouse as the starting
point for a flawless OLAP database. Keeping in view the scope of this research work
part II is of prime importance in this book.
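The dimension management issues highlighted for chapter 6, surrogate key assignment and attribute change management, can be illustrated with a short sketch. The following is not Integration Services logic from the book; it is a hypothetical Python routine with illustrative column names (`customer_id`, `city`, `email`) that treats `city` as a Type-2 attribute (history preserved under a new surrogate key) and `email` as a Type-1 attribute (overwritten in place):

```python
def upsert_dimension(rows, incoming, next_sk):
    """Load one incoming record into an in-memory dimension table.

    rows     : list of dimension rows (dicts); several versions may share
               a natural key, but only one has is_current == True.
    incoming : source record keyed by the natural key `customer_id`.
    next_sk  : next free surrogate key; the updated value is returned.
    """
    current = next((r for r in rows
                    if r["customer_id"] == incoming["customer_id"]
                    and r["is_current"]), None)
    if current is None or incoming["city"] != current["city"]:
        if current is not None:
            current["is_current"] = False          # Type-2: expire old version
        rows.append({**incoming, "sk": next_sk, "is_current": True})
        return next_sk + 1                          # a surrogate key was consumed
    current["email"] = incoming["email"]            # Type-1: overwrite in place
    return next_sk
```

Fact rows would then join to the dimension through `sk`, so historical queries see the version of the customer that was current when each fact was loaded.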
Part III Developing the BI Applications: The third part of the book
demonstrates the steps required to present the data to the business users. It starts with
a chapter that clearly defines what is meant by BI applications. Then the reporting
services are discussed. Data mining can deliver huge value to your business, by
looking for hidden relationships and trends in the data. Chapter 10 talks about how to
use Analysis Services to build data mining models.
Part IV Deploying and Managing the DW/BI System: The fourth section of
the book includes information about how to deploy and operate the DW/BI system.
This part highlights the techniques used to deal with existing data warehouse. Prime
security issues followed by the metadata management and maintenance strategies for
a data warehouse have also been discussed.
Part V Extending the DW/BI System: the fifth part is associated with
growth management of data warehouse. Emphasis is focused on real time business
intelligence followed by a discussion on present imperatives and future outlook.
Mark I. Hwang and Hongjiang Xu [41] consider data warehousing as an
important area of practice and research. According to this research only a few studies
have assessed its practices in general and critical success factors in particular.
Although plenty of guidelines for implementation exist, few have been subjected
to empirical testing. In order to better understand implementation factors and their
effect on data warehousing success, perceptions of data warehousing professionals are
examined in a cross-sectional survey. Best subsets regression is used to identify the
specific factors that are important to each success variable. Since different companies
may have different objectives or emphases in their data warehousing endeavours, the
results are useful in identifying the exact factors that need attention and in providing a
basis for prioritizing those factors. The results also suggest several promising
directions for continued research for data warehousing success.
Stanisław Kozielski and Robert Wrembel [91] in this special issue of the
Annals of Information Systems presented current advances in the data warehouse and
OLAP technologies. The issue is composed of 15 chapters that in general address the
following research and technological areas: advanced technologies for building XML,
spatial, temporal, and real-time data warehouses, novel approaches to DW modelling,
data storage and data access structures, as well as advanced data analysis techniques.
The first two chapters of this publication were found to be of prime importance for
this research. The first chapter overviews open research problems concerning data
warehouse development for integrating and analyzing various complex types of data,
dealing with temporal aspects of data, handling imprecise data, and ensuring privacy
in data warehouses. Chapter two discusses challenges in designing ETL processes for
real-time (or near real-time) data warehouses and further proposes an architecture
for a real-time data warehousing system.
2.1.2 Generating Test Data and Data Warehouse Testing:
J. Gray et al. [43] state that evaluating database system performance often
requires generating synthetic databases that have certain statistical properties but
are filled with dummy information. When evaluating different database designs, it is
often necessary to generate several databases and evaluate each design. As database
sizes grow to terabytes, generation often requires longer time than evaluation. This
paper presents several techniques for synthetic database generation. In particular, it
discusses parallelism to obtain generation speedup and scaleup, congruential
generators to obtain dense unique uniform distributions, and special-case discrete
logarithms to generate indices concurrently with base table generation.
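The congruential-generator idea can be sketched briefly. Assuming a synthetic key column of n rows, the map x → (x + step) mod n is a bijection on [0, n) whenever step is coprime with n, so a single pass yields every key exactly once, in scrambled order, without remembering which keys were already issued (the constants below are illustrative):

```python
from math import gcd

def dense_unique_stream(n, step=1_000_003, seed=7):
    """Yield each integer in [0, n) exactly once, in scrambled order.

    Because gcd(step, n) == 1, x -> (x + step) % n permutes [0, n),
    giving a dense, unique, uniform key stream with no need to store
    the set of already-generated values.
    """
    while gcd(step, n) != 1:        # nudge step until it is coprime with n
        step += 1
    x = seed % n
    for _ in range(n):
        yield x
        x = (x + step) % n
```

Because the stream needs only constant memory per generator, independent generators with different seeds can fill partitions of a table in parallel, which is the speedup and scaleup property the paper emphasizes.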
Tom Bersano et al. [98] in this paper consider the non-availability of
representative data as a significant issue in database testing. When real data is hard to
obtain or when its properties are hard to identify, synthetic data becomes an appealing
alternative. Authors perceive that synthetic data is used mainly in testing data-mining
algorithms, and synthetic data generators are thus limited to small toy data-mining
domains and are not very flexible. In this paper the authors have considered issues in
extending the use of synthetic data for performance analysis and benchmarking
different database algorithms and systems. Their approach consists of describing and
implementing a synthetic data specification language and data generation tool which
is more flexible and allows the generation of more realistic data. Then the authors
have discussed the procedure to generate synthetic data from specifications provided
by the user.
Ashraf Aboulnaga et al. [12] opine that synthetically generated data has
always been important for evaluating and understanding new ideas in database
research. In this paper, the authors described a data generator for generating synthetic
complex-structured XML data that allows a high level of control over the
characteristics of the generated data. This data generator is certainly not the ultimate
solution to the problem of generating synthetic XML data, but according to the authors
it is very useful in research on XML data management, and they believe it
can also be useful to other researchers. Furthermore, this paper has started a
discussion in the XML community about characterizing and generating XML data,
and moreover it may serve as a first step towards developing a commonly accepted
XML data generator for the community concerned.
N. Bruno and S. Chaudhuri [66] consider that the evaluation and applicability
of many database techniques, ranging from access methods, histograms and
optimization strategies to data normalization and mining, crucially depend on their
ability to cope with varying data distributions in a robust way. However,
comprehensive real data is often hard to come by, and there is no flexible data
generation framework capable of modeling varying rich data distributions. This has
led individual researchers to develop their own ad-hoc data generators for specific
tasks. As a consequence, the resulting data distributions and query workloads are
often hard to reproduce, analyze, and modify, thus preventing their wider usage. In
this paper the authors presented a flexible, easy to use, and scalable framework for
database generation. Afterwards they discussed how to map several proposed
synthetic distributions to their framework and reported preliminary results.
Xintao Wu et al. [106] in this paper state that testing of database
applications is of great importance. A significant issue in database application testing
is the availability of representative data. This paper investigates the problem
of generating a synthetic database based on a-priori knowledge about a production
database. The approach followed was to fit a general location model using various
characteristics (e.g., constraints, statistics, rules) extracted from the production
database and then generate the synthetic data using the learnt model. The generated
data was valid and similar to real data in terms of statistical distribution, hence it can
be used for functional and performance testing. As the extracted characteristics may
contain information which could be used by an attacker to derive confidential
information about individuals, the paper also presents a disclosure analysis method
which applies a cell suppression technique for identity disclosure analysis and
perturbation for value disclosure.
Eric Tang et al. [31] discussed the importance of testing a database
application. AGENDA (A (test) Generator for Database Applications) was designed
to aid in testing database applications. In addition to the tools that AGENDA currently
has, three additional tools were made to enhance testing and feedback: the
log analyzer, the attribute analyzer, and query coverage. The log analyzer finds relevant
entries in the log file produced by the DBMS, lexically analyzes them using a grammar
written in JavaCC, and stores some of the data in a database table. When the log entry
represents an executed SQL statement, this statement is recorded. The attribute
analyzer parses SQL statements. A SQL grammar for JavaCC was modified, adding
code to determine which attributes are read and written in each SQL statement. A new
test coverage criterion, query coverage, is defined. Query coverage checks whether
queries that the tester thinks should be executed actually are executed. Similar to the
log analyzer, JavaCC was used to implement this. It is implemented by pattern matching
executed queries against patterns representing abstract queries (including host
variables) identified by the tester.
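The query coverage idea, matching executed queries against abstract queries containing host variables, can be sketched with regular expressions. The AGENDA tools used JavaCC grammars; the following Python version is only an illustrative analogue, and the query text is hypothetical:

```python
import re

def pattern_for(abstract_query):
    """Compile an abstract query such as
    'SELECT name FROM emp WHERE id = :id' into a regex in which
    each :host_variable matches any concrete literal."""
    escaped = re.escape(abstract_query)
    # replace every :identifier with a non-space wildcard, anchor the end
    return re.compile(re.sub(r":\w+", r"\\S+", escaped) + r"$")

def query_coverage(abstract_queries, executed_queries):
    """Return the subset of abstract queries that some executed
    query instantiates -- i.e. the queries actually covered."""
    return {q for q in abstract_queries
            if any(pattern_for(q).match(e) for e in executed_queries)}
```

An abstract query the tester expected but which never appears in the executed set then signals an untested path.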
Pengyue J. Lin, et al. [80] said that data mining research has yielded many
significant and useful results such as discovering consumer-spending habits, detecting
credit card fraud, and identifying anomalous social behaviour. Information Discovery
and Analysis Systems (IDAS) extract information from multiple sources of data and
use data mining methodologies to identify potential significant events and
relationships. This research designed and developed a tool called IDAS Data and
Scenario Generator (IDSG) to facilitate the creation, testing and training of IDAS.
IDSG focuses on building a synthetic data generation engine powerful and flexible
enough to generate synthetic data based on complex semantic graphs.
Harry M. Sneed [38] contributed this experience report on system testing and
in particular on the testing of a data warehouse system. According to this report data
warehouses are large databases used solely for querying and reporting purposes. The
data warehouse in question here was dedicated to fulfilling the reporting requirements
of the BASEL-II agreement on the provision of auditing data by banks, the
European equivalent of the Sarbanes-Oxley Act. The purpose of the testing project was
to prove that the contents of the data warehouse are correct in accordance with the
rules specified to fill them. In the end, the only way to achieve this was to rewrite the
rule specifications in a machine readable form and to transform them into post
assertions, which could be interpreted by a data verification tool for comparison of the
actual data contents with the expected data contents. The testing project was never
fully completed, since not all of the rules could be properly transformed.
Binnig Carsten, et al. [17] consider that OLTP applications usually
implement use cases which execute a sequence of actions, where each action usually
reads or updates only a small set of tuples in the database. In order to automatically
test the correctness of the different execution paths of the use cases implemented by
an OLTP application, a set of test cases and test databases needs to be created. In this
paper, it is suggested that a tester should specify a test database individually for each
test case using SQL in a declarative test database specification language. Moreover,
the authors also discussed the design of a database generator which creates a test
database based on such a specification. Consequently, their approach allows the
generation of a tailor-made test database for each test case and the bundling of these
databases together for the test case execution phase.
Soumyajit Sarcar [90] in this paper takes a look at the different strategies to
test a data warehouse application. It attempts to suggest various approaches that could
be beneficial while testing the ETL process in a DW. A data warehouse is a critical
business application and defects in it result in business losses that cannot be accounted
for. Here, some of the basic phases and strategies to minimize defects have been
proposed.
Vincent Rainardi [101] in this book has presented pioneering ideas for data
warehouse development with examples based on SQL Server. Keeping in view the scope
of this study, chapters seven and nine of this book are of prime importance, as they
concentrate on ETL development, functionality and data quality assurance in a data
warehouse. In these chapters, the author discusses what data quality is and why it is so
important. Further, the data quality process and the components involved in the
process have been discussed. The categories of data quality rules in terms of
validation, and the choices of what to do with the data when a data quality rule is
violated (reject, allow, or fix), have been beautifully explained.
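The reject/allow/fix choice can be expressed as a small dispatch routine. This is a hypothetical sketch, not code from the book; the rule, action names and row shape are illustrative:

```python
def apply_dq_rule(row, check, action, fixer=None):
    """Apply one data quality rule to an incoming row.

    check(row) -> True when the row satisfies the rule.
    action     -> policy on violation: 'reject', 'allow' or 'fix'.
    Returns (row_or_None, outcome) so the caller can route the row
    and write an audit record.
    """
    if check(row):
        return row, "pass"
    if action == "reject":
        return None, "rejected"            # row diverted to an error table
    if action == "fix" and fixer is not None:
        return fixer(row), "fixed"         # row corrected, then loaded
    return row, "allowed"                  # loaded as-is, but logged
```

For example, a negative-age row could be rejected outright, loaded anyway for later review, or fixed by clamping to a default, depending on the action configured for that rule.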
Manoj Philip Mathen [60] in this white paper opines that exhaustive testing
of a data warehouse during its design is essential, and that testing is an ongoing
process. Testing data warehouse implementations is also of utmost significance.
Organizational decisions depend entirely on the enterprise data, so the data has to be
of supreme quality. Complex business rules and transformation logic implementations
mandate diligent and thorough testing. Finally, the paper addresses some challenges for data
warehouse testing like voluminous data, heterogeneous sources, temporal
inconsistency and estimation challenges.
2.1.3 Understanding ETL, Data Quality and Data Management:
Larry P. English [52] revolutionized the data quality assurance framework
through this book. In data warehouse literature this book from Larry English is
considered the information quality Bible for the information age. This book starts
from the premise that a high proportion of businesses are held back by poor
information quality, in the form of inaccurate and missing data. Not only do they lose
out substantially as a result, but such inaccuracies, in turn, corrupt data warehouses,
which then fail.
The book's aim is to be a 'one stop' source for helping businesses to reduce costs and
increase profits by showing them how to measure the quality of their information
resources, as well as how to cleanse their data and keep it clean. This book comes
with a guarantee: that the author will personally refund the cost of the book to any
reader whose organisation does not get a substantial benefit from following its
recommendations.
Man-Yee Chan and Shing-Chi Cheung [61] state that testing of database
applications is crucial for ensuring high software quality, as undetected faults can result
in unrecoverable data corruption. The problem of database application testing can be
broadly partitioned into the problems of test cases generation, test data preparation
and test outcomes verification. Among the three problems, the problem of test cases
generation directly affects the effectiveness of testing. Conventionally, database
application testing is based upon whether or not the application can perform a set of
predefined functions. While it is useful to achieve a basic degree of quality by
considering the application to be a black box in the testing process, white box testing
is required for more thorough testing. However, the semantics of the Structural Query
Language (SQL) statements embedded in database applications are rarely considered
in conventional white box testing techniques. In this paper, the authors propose to
complement white box techniques with the inclusion of the SQL semantics. Their
approach is to transform the embedded SQL statements to procedures in some
general-purpose programming language and thereby generate test cases using
conventional white box testing techniques. Additional test cases that are not covered
in traditional white box testing are generated to improve the effectiveness of database
application testing. The steps of both SQL statements transformation and test cases
generation are explained and illustrated using an example adapted from a course
registration system. The authors successfully identified additional faults involving the
internal states of databases.
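The transformation idea can be illustrated with a toy example. An embedded query such as `SELECT name FROM emp WHERE dept = :dept AND salary > :sal` might be hand-translated into a procedure whose branches mirror the SQL predicates, so that ordinary white box (branch coverage) tooling can drive test case generation. The translation below is a hypothetical Python analogue, not the authors' actual transformation rules:

```python
def select_employee_names(emp_rows, dept, sal):
    """Procedural analogue of:
        SELECT name FROM emp WHERE dept = :dept AND salary > :sal
    Each SQL predicate becomes an explicit branch, exposing the
    query's semantics to conventional white box coverage criteria."""
    names = []
    for row in emp_rows:
        if row["dept"] == dept:          # branch for the dept predicate
            if row["salary"] > sal:      # branch for the salary predicate
                names.append(row["name"])
    return names
```

Covering all branches now forces test data in which each predicate is independently true and false, exactly the additional cases that white box testing of the host program alone would miss.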
Michael H. Brackett [64] in this book has a view that poor data quality
hampers today's organizations in many ways: it makes data warehousing and
knowledge management applications more expensive and less effective. The book
presents major obstacles to e-Business transformation, which may slash day-to-day
employee productivity, and translates directly into poor strategic and tactical
decisions. Subsequently ten "bad habits" that may lead to poor data and ten proven
solutions that may enable business managers to transform these bad habits into best
practices have been discussed. Brackett has shown how the "bad habits" evolved,
and exactly how to replace them with best practices for ensuring improvement in data
quality. Brackett further demonstrates exactly how to implement a solid foundation
for quality data to develop organization-wide, integrated, subject-oriented data
architecture and then build a high-quality data resource within that architecture.
David Chays et al. [26] presented the important role that databases play in
nearly every modern organization. According to this paper yet relatively little research
effort has focused on how to test them. This paper discusses issues arising in testing
database systems and presents an approach to testing database applications. In testing
such applications, the state of the database before and after the user's operation plays
an important role, along with the user's input and the system output. A tool for
populating the database with meaningful data that satisfy database constraints has
been prototyped. Its design and role in a larger database application testing tool set
has also been discussed.
Panos Vassiliadis [75] in this Ph.D. thesis discusses the issues involved in
integrating heterogeneous information sources in organizations and in enabling
On-Line Analytic Processing (OLAP). According to him, neither the accumulation
nor the storage process seems to be completely credible. The question that arises
here is how to
organize the design, administration and evolution choices in such a way that all the
different, and sometimes opposing, quality user requirements can be simultaneously
satisfied.
To tackle this problem, this thesis makes the following contributions:
The first major result presented here is a general framework for the treatment
of data warehouse metadata in a metadata repository. The framework requires the
classification of metadata in at least two instantiation layers and three perspectives.
The metamodel layer constitutes the schema of the metadata repository and at the
metadata layer resides the actual meta-information for a particular data warehouse.
Then, he gave a proposal for a quality metamodel, which was built on the widely
accepted Goal-Question-Metric approach for the quality management of information
systems. This metamodel is capable of modeling complex activities, their
interrelationships, the relationship of activities with data sources and execution
details. The ex ante treatment of the metadata repository is enabled by a full set of
steps, i.e., quality question, which constitute the methodology for data warehouse
quality management and the quality-oriented evolution of a data warehouse based on
the architecture, process and quality metamodels. Special attention is paid to a
particular part of the architecture metamodel, i.e., the modeling of OLAP databases.
To this end, he first provided a categorization of the work in the area of OLAP logical
models by surveying some major efforts, including commercial tools, benchmarks and
standards, and academic efforts. He also attempted a comparison of the various
models along several dimensions, including representation and querying aspects.
Finally, this thesis gives an extended review of the existing literature on the field, as
well as a list of related open research issues.
David Sammon and Pat Finnegan [27] consider that a data warehouse can
be of great potential for business organizations. Nevertheless, implementing a data
warehouse is a complex project that has caused difficulty for organizations. This
paper presents the results of a study of four mature users of data warehousing
technology. Collectively, these organizations have experienced many problems and
have devised many solutions while implementing data warehousing. These
experiences are captured in the form of ten organizational prerequisites for
implementing data warehousing. The authors believe that this model could potentially
be used by organizations to internally assess the likelihood of data warehousing
project success, and to identify the areas that require attention prior to commencing
implementation.
E. Rahm and H. Hai Do [30] classify the data quality problems that can be
addressed by data cleaning routines and provide an overview of the main solution
approaches. Data cleaning is especially required when integrating heterogeneous
sources of data; data from divergent sources should be addressed together with
schema-related data transformations. In data warehouses, data cleaning is a major part
of the so-called ETL process. The article also presents contemporary tool support for
the data cleaning process.
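A single-source cleaning pass of the kind the survey classifies, covering formatting normalization, a missing-value rule, and duplicate elimination, might look like the following sketch. The field names and rules are illustrative, not taken from the article:

```python
def clean_records(records):
    """Minimal single-source cleaning pass: normalize formatting,
    apply a missing-value default, and eliminate duplicates."""
    seen, cleaned = set(), []
    for r in records:
        name = " ".join(r["name"].split()).title()   # fix whitespace and case
        country = r.get("country") or "Unknown"      # missing-value rule
        key = (name.lower(), r["dob"])               # duplicate-detection key
        if key in seen:
            continue                                  # eliminate the duplicate
        seen.add(key)
        cleaned.append({"name": name, "dob": r["dob"], "country": country})
    return cleaned
```

Multi-source cleaning adds the harder schema-level steps (attribute matching, conflicting-value resolution) on top of such instance-level rules.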
Thomas Vetterli et al. [97] identify metadata as a key success
factor in data warehouse projects. It captures all kinds of information necessary to
analyze, design, build, use, and interpret the data warehouse contents. In order to
spread the use of metadata, one should enable the interoperability between
repositories, and for tool integration within data warehousing architectures, a standard
for metadata representation and exchange is needed. This paper considers two
standards and compares them according to specific areas of interest within data
warehousing. Despite their incontestable similarities, there are significant differences
between the two standards which would make their unification difficult.
V. Raman and J. Hellerstein [100] consider the cleansing of data errors in
structure and content as an important aspect for data warehouse integration. Current
solutions for data cleaning involve many iterations of data “auditing” to find errors,
and long-running transformations to fix them. Users need to endure long waits, and
often write complex transformation scripts. Authors presented Potter’s Wheel, an
interactive data cleaning system that tightly integrates transformation and discrepancy
detection. Users gradually build transformations to clean the data by adding or
undoing transforms on a spreadsheet-like interface; the effect of a transform is shown
at once on records visible on screen. These transforms are specified either through
simple graphical operations, or by showing the desired effects on example data
values. In the background, Potter’s Wheel automatically infers structures for data
values in terms of user-defined domains, and accordingly checks for constraint
violations. Thus users can gradually build a transformation as discrepancies are found,
and clean the data without writing complex programs or enduring long delays.
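The interactive transform-and-check loop described above can be sketched in a few lines. This is a minimal illustration only, not code from Potter's Wheel; the class and function names are hypothetical, and the "inferred domain" is reduced to a simple numeric check:

```python
# Illustrative sketch of Potter's Wheel-style interactive cleaning: transforms
# are added (and can be undone) one at a time, the full stack is re-applied to
# the visible sample, and a domain check flags discrepancies after every step.

class CleaningSession:
    def __init__(self, rows):
        self.original = list(rows)
        self.transforms = []          # stack of (name, function) pairs

    def add_transform(self, name, fn):
        self.transforms.append((name, fn))

    def undo(self):
        if self.transforms:
            self.transforms.pop()

    def view(self):
        # Re-apply the whole stack; the effect of a transform shows at once.
        rows = list(self.original)
        for _, fn in self.transforms:
            rows = [fn(r) for r in rows]
        return rows

    def discrepancies(self, domain_check):
        # Rows whose current value violates the inferred domain constraint.
        return [r for r in self.view() if not domain_check(r)]


session = CleaningSession(["12", " 7", "n/a", "30"])
session.add_transform("strip", str.strip)
bad = session.discrepancies(str.isdigit)       # "n/a" fails the numeric domain
session.add_transform("default", lambda v: v if v.isdigit() else "0")
clean = session.view()
```

The point of the design is that discrepancy detection runs after every incremental transform, so the user repairs errors as they surface instead of scripting the whole cleaning pipeline up front.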
Angela Bonifati et al. [8] opine that data warehouses are specialized
databases devoted to analytical processing. They are used to support decision-making
activities in most modern business settings typically when complex data sets have to
be studied and analyzed. The technology for analytical processing assumes that data
are presented in the form of simple data marts, consisting of a well-identified
collection of facts and data analysis dimensions (star schema). Despite the wide
diffusion of data warehouse technology and concepts, industry still lacks
methods that help and guide the designer in identifying and extracting such data marts
out of an enterprise wide information system, covering the upstream, requirement-
driven stages of the design process. Many existing methods and tools support the
activities related to the efficient implementation of data marts on top of specialized
technology (such as the ROLAP or MOLAP data servers). This paper presents a
method to support the identification and design of data marts. The method is based on
three basic steps. A first top-down step makes it possible to elicit and consolidate user
requirements and expectations. This is accomplished by exploiting a goal-oriented
process based on the Goal/Question/Metric paradigm developed at the University of
Maryland. Ideal data marts are derived from user requirements. The second bottom-up
step extracts candidate data marts.
M. Dorothy Blakeslee and John Rumble, Jr. [18] discuss the
steps involved in the process of turning an initial concept for a database into a
finished product. The paper further discusses those quality related issues that need to
be addressed to ensure the high quality of the database. The basic requirements for a
successful database quality process have been presented with specific examples drawn
from experience gained by the authors at the Standard Reference Data Program at the
National Institute of Standards and Technology.
Fábio Rilston Silva Paim and Jaelson Freire Brelaz de Castro [33] hold the
view that in the novel domain of Data Warehouse Systems, software engineers are
required to define a solution that integrates a number of heterogeneous sources to
extract, transform and aggregate data, as well as to offer flexibility to run ad-hoc
queries that retrieve analytic information. Moreover, these activities should be
performed based on a concise dimensional schema. This intricate process with its
particular multidimensionality calls for a requirements engineering approach to aid
the precise definition of data warehouse applications. In this paper, the authors adapt
the traditional requirements engineering process and propose DWARF, a Data
Warehouse Requirements definition method. A case study demonstrates how the
method has been successfully applied in the company-wide development of a large-
scale data warehouse system that stores hundreds of gigabytes of strategic data for the
Brazilian Federal Revenue Service.
P. Vassiliadis et al. [74] state that Extraction-Transformation-Loading (ETL)
tools are pieces of software responsible for the extraction of data from several
sources, their cleansing, customization and insertion into a data warehouse. In this
paper, the logical design of ETL scenarios has been discussed. They described a
framework for the declarative specification of ETL scenarios with two main
characteristics: genericity and customization. The authors presented a palette of
several templates, representing frequently used ETL activities along with their
semantics and their interconnection. Finally, they discussed implementation issues
and presented a graphical tool, ARKTOS II that facilitates the design of ETL
scenarios.
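The template-based, declarative style of ETL specification described above can be sketched as follows. This is a hedged illustration of the general idea, not the ARKTOS II notation; the template names and the scenario-runner are assumptions made for the example:

```python
# Sketch of template-based ETL specification: each template is a parameterized
# activity (a function factory); a scenario is an ordered chain of activities.

def not_null(field):
    """Template: filter out rows where the given field is missing."""
    def activity(rows):
        return [r for r in rows if r.get(field) is not None]
    return activity

def surrogate_key(field, start=1):
    """Template: assign a generated surrogate key to every surviving row."""
    def activity(rows):
        return [{**r, field: i} for i, r in enumerate(rows, start)]
    return activity

def run_scenario(rows, activities):
    for act in activities:          # activities execute in declared order
        rows = act(rows)
    return rows

scenario = [not_null("name"), surrogate_key("id")]
loaded = run_scenario([{"name": "a"}, {"name": None}, {"name": "b"}], scenario)
```

This mirrors the two characteristics the authors emphasize: genericity (each template works for any field) and customization (a concrete scenario instantiates templates with its own parameters and ordering).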
Chiara Francalanci and Barbara Pernici [22] suggest that
the quality of data is often defined as "fitness for use", i.e., the ability of a data
collection to meet user requirements. The assessment of data quality dimensions
should consider the degree to which data satisfy users' needs. User expectations are
clearly related to the selected services and at the same time a service can have
different characteristics depending on the type of user that accesses it. The data
quality assessment process has to consider both aspects and, consequently, select a
suitable evaluation function to obtain a correct interpretation of results. This paper
proposes a model that ties the assessment phase to user requirements. Multichannel
information systems are considered as an example to show the applicability of the
proposed model.
F. Missi et al. [32] report the results of a study conducted on the
implementation of data-driven customer relationship management (CRM) strategies.
Despite its popularity, there is still a significant failure rate of CRM projects. A
combined survey and interview/case-study research approach was used. It was
found that CRM implementers are not investing enough effort in improving data
quality and data integration processes to support their CRM applications.
Alexandros Karakasidis et al. [6] discussed the traditional practice of offline
data refreshment in the data warehouse industry. Active Data Warehousing refers to a
new trend where data warehouses are updated as frequently as possible, to
accommodate the high demands of users for fresh data. In this paper, authors
proposed a framework for the implementation of active data warehousing, with the
following goals: (a) minimal changes in the software configuration of the source, (b)
minimal overhead for the source due to the active nature of data propagation, (c) the
possibility of smoothly regulating the overall configuration of the environment in a
principled way. In their framework, they have implemented ETL activities over queue
networks and employ queue theory for the prediction of the performance and the
tuning of the operation of the overall refreshment process. Due to the performance
overheads incurred, the authors explored different architectural choices for this task
and discuss the issues that arise for each of them.
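The use of queue theory for predicting refreshment performance can be illustrated with the standard M/M/1 formulas. This is an assumption made for the example; the authors' actual model is a network of queues, but the single-queue case conveys the idea:

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state M/M/1 predictions, assuming Poisson record arrivals and
    exponentially distributed service times, with arrival_rate < service_rate."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrivals outpace service")
    rho = arrival_rate / service_rate                 # source-side utilization
    avg_in_system = rho / (1 - rho)                   # mean records queued + in service
    avg_latency = 1 / (service_rate - arrival_rate)   # mean time a record spends in system
    return rho, avg_in_system, avg_latency

# E.g. 80 changed records/sec arriving at an ETL stage that services 100/sec:
rho, n, w = mm1_metrics(arrival_rate=80.0, service_rate=100.0)
# rho == 0.8, n is about 4 records in flight, w == 0.05 seconds per record
```

Such predictions are what allow the refreshment frequency to be tuned in a principled way: if the predicted latency or backlog exceeds the freshness target, either the propagation rate must be throttled or the stage's service capacity increased.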
Aristides Triantafillakis et al. [10] highlight the importance of data
warehouses for collaborative and interoperation e-commerce environments. These e-
commerce environments are physically scattered along the value chain. Adopting
system and information quality as success variables, the authors argue that what is
required for data warehouse refreshment in this context is inherently more complex
than the materialized view maintenance problem and the authors offer an approach
that addresses refreshment in a federation of data warehouses. Defining a special kind
of materialized view, an open multi-agent architecture is proposed for their
incremental maintenance while considering referential integrity constraints on source
data.
Paolo Giorgini et al. [79] proposed a goal-oriented approach to requirement
analysis for data warehouses. Two different perspectives are integrated for
requirement analysis: organizational modeling, centered on stakeholders, and
decisional modeling, focused on decision makers. The approach followed can be
employed within a demand-driven and a mixed supply/demand-driven design
framework: in the second case, while the operational sources are still explored to
shape hierarchies, user requirements play a fundamental role in restricting the area of
interest for analysis and in choosing facts, dimensions, and measures. The
methodology proposed, supported by a prototype, is described with reference to a real
case study.
Alkis Simitsis [7] characterizes Extraction-Transformation-Loading
(ETL) tools as pieces of software responsible for the extraction of data from several
sources, their cleansing, customization and insertion into a data warehouse. In
a previous line of research, the author presented a conceptual and a logical model
for ETL processes. This paper describes the mapping of the conceptual to the logical
model. First, it identifies how a conceptual entity is mapped to a logical entity.
Next, the execution order in the logical workflow is determined using information
adapted from the conceptual model. Finally, the article provides a methodology
for the transition from the conceptual to the logical model.
Lutz Schlesinger et al. [55] view data extraction from heterogeneous data
sources and transferring it into the data warehouse system as one of the most
cost-intensive tasks in setting up and operating a data warehouse. Special tools may be
used to connect different sources and target systems. In this paper, the authors
propose an architecture which enables the flexible integration of data sources into any
target database system. The approach is based on the idea of splitting the classical
wrapping module into a source specific and a target specific part and establishing the
communication between these components based on Web Service technology. The
paper describes the general architecture, the use of Web Service technology to
describe and dynamically integrate participating data sources and the deployment
within a specific database system.
Sid Adelman et al. [89] stress the need to test extract, transform and
load (ETL) processes during development. One should verify the results each time the
ETL process runs and use ETL tool tie-outs to verify that the database is loaded
correctly. In addition, the users who run the queries and generate reports should
always validate their results. The best test cases come from the detailed requirements
documents created during the planning and analysis phases. Each and every
requirement that is documented must be measurable, achievable and testable. Each
requirement must be assigned to someone who wants it, who will implement it and
who will test and measure it. Test cases will be determined by the nature of the
project. The best test cases are developed jointly between the business users and the
developers on the project.
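The tie-out verification recommended above reduces, in its simplest form, to a reconciliation of record counts after each run. The sketch below is a minimal illustration under that assumption; the function name and the three-way split into loaded and rejected records are hypothetical, not taken from the source:

```python
# Minimal ETL tie-out check: a load reconciles only when every extracted
# record is accounted for, either loaded into the target or explicitly rejected.

def tie_out(source_count, loaded_count, rejected_count):
    """Return (ok, discrepancy); a nonzero discrepancy means records were
    silently dropped (positive) or duplicated (negative) by the ETL routine."""
    discrepancy = source_count - (loaded_count + rejected_count)
    return discrepancy == 0, discrepancy

ok, diff = tie_out(source_count=10_000, loaded_count=9_950, rejected_count=50)
assert ok and diff == 0

ok, diff = tie_out(source_count=10_000, loaded_count=9_940, rejected_count=50)
# here ok is False and diff is 10: ten records vanished somewhere in the ETL
```

In practice such a check would run automatically after every load, with the counts sourced from the extraction log, the target table, and the reject file, so that a broken run fails loudly instead of quietly corrupting the warehouse.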
Majid Abai [59], the president of Seena Technologies, states that one of the
major problems identified in every data warehousing system including operational
data stores and data marts is the problem of data certification. Data quality in data
warehouses has always been a hot topic, but in this whitepaper, he focuses on holistic
and detailed data certification for a DW project. The potential data certification
problems from the standpoint of extraction, transformation, cleansing, and loading of
such applications in addition to data summarization and cubing of a typical system
have also been discussed. In addition, some methods have also been introduced for
verifying the quality of data at each risk point to enable the ease of isolating the
problem and solving it. This discussion also included performance of the ETL
process.
Adam Wolszczak and Marcin Dabrowski [5] observe that automated
testing has become a buzzword in data warehouse and business intelligence
management teams in recent years. Project Managers (PMs) and lead teams put great
pressure on limiting the schedule and duration of project development. They often look
for ways to reduce the quality assurance cost factors in the testing
phase of the software engineering process. Automation is seen by them as an ideal
solution to their needs: not only does it speed up the testing process, but it decreases
costs because everything is tested automatically; one only needs to click and verify. The
reality of software testing automation is far from this idealized management perspective.
Automation is not a panacea for all software testing engineering work and is not
applicable to many quality assurance types of work, especially in the data warehouse
development. In fact, in many cases automation, especially when badly managed, may bring
more costs than positive Return on Investment. This paper directly addresses the
analysis of testing automation and testing automation tool usage in the selected fifteen
testing areas of data warehousing. It underlines the need to analyze where effective
quality assurance test automation should be performed and where it should not be
introduced. It identifies the business intelligence (BI) development regions where the
Return on Investment of automation in a given time scope is positive and describes
where manual testing remains the best testing option. The analysis is based on cost
effectiveness.
Nikolay Iliev [70] highlights the common practice of using many different
software systems from different vendors in order to provide full coverage of the
business functions. The integration between these systems often requires real time or
off-line data migration as part of the operational communication between the systems
or in order to populate analytical decision-making systems on a regular basis. The
second most common situation is the transition to a new software system which could
be either a new version, but with different data structure or could be a different system
from another vendor. Data migration is not an easy task as it is usually performed
using in-house, hand-coded ETL algorithms. Such low-level implementations are difficult
to develop and maintain. Hence rigorous testing is required to identify and
rectify hidden errors in ETL routines at the logical and technical levels.
Panos Vassiliadis et al. [77] presented Extraction–Transform–Load (ETL)
processes as complex data workflows, which are responsible for the maintenance of a
Data Warehouse. Their practical importance is denoted by the fact that a plethora of
ETL tools currently constitutes a multi-million-dollar market. However, each one of
them follows a different design and modeling technique and internal language. So far,
the research community has not agreed upon the basic characteristics of ETL tools.
Hence, there is a necessity for a unified way to assess ETL workflows. In this paper,
the authors have investigated the main characteristics and peculiarities of ETL
processes and proposed a principled organization of test suites for the problem of
experimenting with ETL scenarios.
Xuan Thi Dung et al. [107] believe that the field has witnessed a significant
number of integration techniques for data warehouses to support web-integrated data.
However, the existing works focus extensively on the design concept. In this paper,
authors have focused on the performance of a web database application, such as an
integrated web data warehouse, using a well-defined and uniform structure to deal
with web information sources including semi-structured data such as XML data, and
documents such as HTML in a web data warehouse system. Thus, a prototype was
developed that can not only be operated through the web, but can also handle the
integration of web data sources and structured data sources. The main contribution of
this research is the performance evaluation of an integrated web data warehouse
application which includes two tasks. Task one is to perform a verification of the
correctness of integrated data based on the result set that is retrieved from the web
integrated data warehouse system using complex and OLAP queries. The result set is
checked against the result set that is retrieved from the existing independent data
source systems. Task two is to measure the performance of OLAP or complex query
by investigating source operation functions used by these queries to retrieve the data.
Ricardo Jorge Santos and Jorge Bernardino [84] state that a
data warehouse provides information for analytical processing, decision making and
data mining tools. As the concept of the real-time enterprise has evolved, the synchronism
between transactional data and data warehouses has had to be redefined.
redefined. Traditional data warehouse systems have static structures of their schemas
and relationships between data, and therefore are not able to support any dynamics in
their structure and content. Their data is only periodically updated because they are
not prepared for continuous data integration. For real-time enterprises with needs in
decision support purposes, real-time data warehouses seem to be very promising. In
this paper the authors present a methodology on how to adapt data warehouse
schemas and user-end OLAP queries for efficiently supporting real-time data
integration. To accomplish this, the authors used techniques such as table structure
replication and query predicate restrictions for selecting data, to enable continuous
loading of the data warehouse with minimum impact on query execution time. The
authors demonstrate the efficiency of this method by analyzing its impact on query
performance using the TPC-H benchmark. Query workloads were measured while
simultaneously performing continuous data integration at various insertion time rates.
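The combination of table replication and query predicate restriction can be sketched concretely. The example below is an illustration of the general technique only, not the authors' exact schema or rewriting rules; the table names and the timestamp predicate are assumptions:

```python
import sqlite3

# Sketch: the fact table gets a small "live" structural twin that absorbs
# continuous inserts; OLAP queries UNION both tables, restricting the live
# twin with a predicate so that scans on the fresh data stay cheap.

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (ts INTEGER, amount REAL)")        # static fact table
con.execute("CREATE TABLE sales_live (ts INTEGER, amount REAL)")   # replicated twin

con.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
con.executemany("INSERT INTO sales_live VALUES (?, ?)", [(3, 5.0)])  # continuous feed

# User-end query rewritten to include fresh rows via a predicate-restricted UNION ALL.
total = con.execute(
    "SELECT SUM(amount) FROM ("
    "  SELECT amount FROM sales"
    "  UNION ALL"
    "  SELECT amount FROM sales_live WHERE ts > 2)"
).fetchone()[0]
# total == 35.0: the query sees the continuously loaded row without the
# main fact table ever being touched by the ongoing inserts
```

Because inserts only ever hit the small twin, the large fact table can keep serving queries at full speed, and the twin's contents can be merged into it during a quiet window.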
Lila Rao and Kweku-Muata Osei-Bryson [54] consider data an
important organizational asset because of its assumed value, including its potential
to improve organizational decision-making processes. Such potential value,
however, comes with various costs, including those of acquiring, storing, securing and
maintaining the given assets at appropriate quality levels. Clearly, if these costs
outweigh the value that results from using the data, it would be counterproductive to
acquire, store, secure and maintain the data. Thus cost benefit assessment is
particularly important in data warehouse development; yet very few techniques are
available for determining the value that the organization will derive from storing a
particular data table and hence determining which data set should be loaded in the
data warehouse. This research seeks to address the issue of identifying the set of data
with the potential for producing the greatest net value for the organization by offering
a model that can be used to perform a cost benefit analysis on the decision support
views that the warehouse can support and by providing techniques for estimating the
parameters necessary for this model.
Angélica Caro et al. [9] hold the view that data quality is a critical issue in
today's interconnected society. Advances in Internet technology have led to the
creation of a great variety of Internet-based applications such
as web portals. These applications are important data sources and/or means of
accessing information which many people use to make decisions or to carry out tasks.
Quality is a very important factor in any software product and also in data. As quality
is a wide concept, quality models are usually used to assess the quality of a software
product. From the software point of view there is a widely accepted standard proposed
by ISO/IEC (the ISO/IEC 9126) which proposes a quality model for software
products. However, until now a similar proposal for data quality has not existed.
Although authors have found some proposals of data quality models, some of them
working as "de facto" standards, none of them focus specifically on web portal data
quality and the user's perspective. In this paper, a set of 33 attributes have been
proposed which are relevant for portal data quality. These have been obtained from a
revision of literature and a validation process carried out by means of a survey.
Although these attributes do not yet constitute a usable model, the authors consider
them a good starting point for constructing one.
Fadila Bentayeb et al. [34] state that data warehouses store aggregated data
issued from different sources to meet users' analysis needs for decision support. The
nature of the work of users implies that their requirements are often changing and do
not reach a final state. Therefore, a data warehouse cannot be designed in one step;
usually it evolves over time. In this paper, the authors have proposed a user-driven
approach that enables a data warehouse schema update. It consists in integrating the
users' knowledge in the data warehouse modeling to allow new analysis possibilities.
The approach followed is composed of four phases: (1) users' knowledge acquisition,
(2) knowledge integration, (3) data warehouse schema update and (4) on-line analysis.
To support the proposed approach, the authors have defined a Rule-based Data
Warehouse (R-DW) model composed of two parts: one "fixed" part and one
"evolving" part. The fixed part corresponds to the initial data warehouse schema,
whose purpose is to provide an answer to global analysis needs. The evolving part is
defined by means of aggregation rules, which allow personalized analyses.
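The split between a fixed schema and an evolving, rule-defined part can be sketched as follows. This is a hedged illustration of the R-DW idea; the data, rule format, and function name are invented for the example:

```python
# Sketch of the fixed/evolving split: the fixed part is the base fact data
# loaded under the initial schema; the evolving part is a set of user-supplied
# aggregation rules mapping detail values onto new analysis categories.

facts = [                                   # fixed part: initial fact rows
    {"product": "laptop", "amount": 900.0},
    {"product": "mouse",  "amount": 25.0},
    {"product": "desk",   "amount": 300.0},
]

# Evolving part: a rule added later by a user, with no schema redesign needed.
rules = {"laptop": "electronics", "mouse": "electronics", "desk": "furniture"}

def aggregate_by_rule(facts, rules):
    """Personalized analysis: roll facts up along a rule-defined hierarchy."""
    totals = {}
    for f in facts:
        category = rules.get(f["product"], "other")
        totals[category] = totals.get(category, 0.0) + f["amount"]
    return totals

totals = aggregate_by_rule(facts, rules)
# totals == {"electronics": 925.0, "furniture": 300.0}
```

Adding a new analysis axis then amounts to supplying a new rule set rather than rebuilding the warehouse schema, which is precisely the flexibility the evolving part is meant to provide.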
From the literature survey, the prevailing gaps in the data warehousing process
have been identified to ascertain the direction of this research work. Drawing on
the available literature, a test data generator and an ETL routine have been designed to
empirically envisage the possible data quality issues through automated ETL testing.