A Framework for Building Data Warehouse Federations

Marcin Maleszka
Institute of Informatics
Wroclaw University of Technology,
Wroclaw, Poland

Bernadetta Mianowska
Institute of Informatics
Wroclaw University of Technology,
Wroclaw, Poland

Ngoc Thanh Nguyen
Institute of Informatics
Wroclaw University of Technology,
Wroclaw, Poland
Abstract—Data warehouse federations are a next step in modern information storage and processing systems. They are systems of classical data warehouses connected by an integrated schema that allows access to all the data of the source warehouses. We propose a generic approach to creating a data warehouse federation, providing methods for schema integration (to create the federation schema), query decomposition (to translate a query to the federation into queries to the individual warehouses) and result integration (to select the answer that best suits the user's information need). We test this approach with a basic implementation and present a working example.
Index Terms—data warehouse federation, data warehouse, knowledge integration, collective intelligence.
I. INTRODUCTION
Data warehouse federations are becoming an increasingly important research topic. The need for them arises as information storage and processing systems become larger and more distributed. Overall, we can distinguish the following types of information storage and processing systems as a progression of technologies: distributed databases, where many databases share the same schema and the data is duplicated; federated databases, where the schema may differ between instances, but a global system exists to obtain any necessary data; data warehouses, which can be treated as multidimensional databases geared towards data analysis; and finally data warehouse federations, which consist of multiple data warehouses. In a technical sense, even data marts - data warehouses devoted to a particular subject - can be a foundation for data warehouse federations [13].
In this paper we define a federation of data warehouses as a set of independent but consistent data warehouses based on a common enterprise data model (federated architecture) [7]. In this context, federated data warehouses are multiple data warehouses holding semantically similar data. A federation can be understood, for example, as a new approach to integrating data marts in order to support business decisions at the level of the management board.
We assume that a data warehouse federation is a set of data
warehouses that meets the following conditions:
• The federation is treated as a whole by the user who asks queries (transparency).
• The federation consists of a logical schema and mechanisms for processing user queries. The federation itself does not store any data (the main difference between a data warehouse and a federation).
• The federation takes care of data consistency.
The main objective of our work is to propose algorithms for building the logical schema of the federation and for processing user queries: query decomposition and result integration techniques. The idea of our approach is presented in Fig. 1. We assume the user needs information about some data stored in many data warehouses. He sends a query Q to the federation and expects an answer concatenated from the results obtained in the particular warehouses. The tasks of the federation's processing methods are as follows:
• Translating the user's query into proper data warehouse queries: QH1, QH2, . . . , QHn;
• Gathering the results from each data warehouse: rH1, rH2, . . . , rHn;
• Integrating them for presentation.
Fig. 1. Data warehouse federation as a logical layer over multiple data warehouses.
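The three-step flow just listed can be sketched as a small pipeline. This is a minimal illustrative sketch, not the paper's implementation: the warehouse interface, the `translate` function and the summing `integrate` function are assumptions made for the example.

```python
# Minimal sketch of the federation query pipeline: the federation
# translates the user query Q into per-warehouse queries, gathers the
# partial results, and integrates them. All names here are illustrative.

def process_federation_query(q, warehouses, translate, integrate):
    """q: federation-level query; warehouses: dict name -> callable
    that answers a warehouse-level query."""
    partial_results = {}
    for name, ask in warehouses.items():
        q_h = translate(q, name)              # Q -> Q_Hi
        if q_h is not None:
            partial_results[name] = ask(q_h)  # gather r_Hi
    return integrate(partial_results)         # combined answer for the user

# Toy usage: two warehouses holding monthly sales counts.
h1 = {"2011-01": 10, "2011-02": 12}
h2 = {"2011-01": 7}
warehouses = {"H1": lambda q: h1.get(q, 0), "H2": lambda q: h2.get(q, 0)}
translate = lambda q, name: q                  # identical schemas here
integrate = lambda parts: sum(parts.values())  # concatenate/sum results

assert process_federation_query("2011-01", warehouses, translate, integrate) == 17
```

The real federation replaces the three callables with the schema-integration, decomposition and result-integration methods described in Sections III-V.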
The rest of the paper is organized as follows. In Section II we present related work on the architecture of federated data warehouses. Section III contains a description of the proposed algorithm for data warehouse schema integration; the query decomposition algorithm is presented in Section IV, while the result integration procedure for presentation is described in Section V. The implementation of the algorithms and an example of their use are given in Section VI. In the last section, Section VII, we summarize our propositions and outline future work.
2012 IEEE International Conference on Systems, Man, and Cybernetics, October 14-17, 2012, COEX, Seoul, Korea
978-1-4673-1714-6/12/$31.00 ©2012 IEEE
II. RELATED WORKS
The research into federated data warehouses so far may be
grouped into three main areas: data marts, strict federations
and loosely-formed federations.
A data warehouse based on data marts is the best-known structure and is rarely called a federation, even if technically it may be considered one. In general, a data mart has several identical dimensions, and the measures are not important for the federation. Sahama and Croll [11] use data marts both to speed up data access in a vast medical warehouse and to provide the users with appropriate perspectives. The data warehouse on the top level is the federation (this schema is accessible to the administrator only) and the data marts may be treated as sources. The review in [2] describes the LAV approach to creating databases in terms of a data mart, while treating the GAV approach as the single-view data warehouse. This is also discussed as a possible approach in [4].
Strict federations are sometimes enforced in larger corporations - the idea behind them is to create identical data warehouses, so that any of them may be accessed from any level. A perfect example is the regional data warehouse discussed in [5]. A more relaxed example may be found in [10], where strictness is enforced during the process of creating the warehouse, rather than designed in from the start.
Loosely formed federations, on the other hand, are sometimes created as extensions of existing star schemas into snowflake schemas by adding external databases. In [8] this approach is used to gather additional information from remote servers (even the Internet) in order to provide better feedback to the final user. A similar approach may be found in [1], where the data warehouse is not constructed, but the query answers are accessed in what may be considered a snowflake schema view. In a more general way it is also attempted in [9], where the user context is determined and additional data sources become available during user queries. This outside information is not part of the data warehouse, but may be accessed from it.
A few other approaches to creating data warehouse federations are possible, as we propose in this paper and as shown in [6]. This rarely researched area may be used to create federations from warehouses with similar but not identical dimensions and measures. In [6] this is done by comparing the names and types of attributes and measures (and the names of dimensions) to expand the star schema of one of the source warehouses by adding new dimensions from other sources. Similarly, in [12] a data warehouse is incrementally changed, and the final result may be considered a data warehouse federation in the sense proposed in [6].
III. ALGORITHM FOR DETERMINING LOGICAL SCHEMA OF
DATA WAREHOUSE FEDERATION
In order to present a logical schema of the federation to the user, some method must be used to determine the relations between elements of the warehouses. This may be done entirely by an expert, but any automation speeds up the process. Here we present a method with a few automatic and a few manual phases.
We divide the integration process as follows:
1) Gathering user requirements (manual phase)
2) Pre-processing (automated phase)
3) Gathering expert knowledge (manual phase)
4) Rule generation (automated phase)
5) Rule inference (automated phase)
6) Similarity calculation (automated phase)
7) Element mapping (automated phase)
8) Verification (manual phase)
In the first phase, the domain of the federation is broadly defined. The main expert may provide the domain directly or give some clues that will be used in the pre-processing to determine the domain. The expert also provides proposed queries for the federated data warehouse, either as simple SQL queries or as natural language sentences. These are, for example, the queries that the company CEO wants answered. These queries may also be used as clues for determining the domain.
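The domain-guessing step can be sketched with simple keyword-based if-then rules. The paper does not publish its actual rule set, so the rules, the voting scheme and all names below are hypothetical, made up purely for illustration.

```python
# Hypothetical sketch of guessing the federation domain from the expert's
# example queries via keyword-based if-then rules. The rule contents are
# illustrative assumptions, not the paper's rules.

DOMAIN_RULES = {            # keyword found in a query -> inferred domain
    "sold": "sales",
    "products": "sales",
    "patient": "medical",
}

def infer_domain(queries, explicit_domain=None):
    if explicit_domain:                      # expert gave the domain directly
        return explicit_domain
    votes = {}
    for q in queries:
        for keyword, domain in DOMAIN_RULES.items():
            if keyword in q.lower():
                votes[domain] = votes.get(domain, 0) + 1
    return max(votes, key=votes.get) if votes else None

q = ("What is the total number of sold products in each category "
     "in every distinct month of 2011?")
assert infer_domain([q]) == "sales"
```

With more example queries the vote counts sharpen the guess, matching the paper's remark that more information brings the result closer to the real domain.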
The second phase is automated and uses the input data warehouses and the previously acquired information to create two distinct dictionaries for use in phase three. First, if the domain is not yet resolved, an inference process takes place to determine it, using some pre-defined if-then rules. Once the domain is known, the federation queries are analyzed to determine the dimensions that will be required in the federation. For example, if there are queries about time, then a time dimension will most probably be required in the federation. Once again, if-then rules are used. After the analysis of the queries, the first dictionary is completed, consisting of the names of all probably required dimensions. Finally, the attributes and measures are analyzed to create a second dictionary. At this stage some if-then rules are used, but due to the large number of possible attribute names, some of the attribute names are simply copied to the dictionary. A possible extension of this phase is using a thesaurus to limit the dictionary size.
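The two dictionaries of this phase can be sketched as follows. The rule contents and the schemas are illustrative assumptions; `Dic1` is built from the queries via if-then rules and `Dic2` by copying attribute names from the source warehouses, as described above.

```python
# Minimal sketch of the pre-processing phase: build the dimension
# dictionary (Dic1) from the federation queries via if-then rules, and
# the attribute dictionary (Dic2) by copying attribute names from the
# source warehouses. Rules and schemas below are made up for illustration.

DIMENSION_RULES = {          # query keyword -> dimension name for Dic1
    "month": "time", "year": "time", "2011": "time",
    "category": "product", "products": "product",
    "client": "client", "women": "client",
}

def build_dictionaries(queries, warehouse_attributes):
    dic1, dic2 = set(), set()
    for q in queries:
        for keyword, dimension in DIMENSION_RULES.items():
            if keyword in q.lower():
                dic1.add(dimension)
    for attrs in warehouse_attributes:          # one list per warehouse
        dic2.update(a.lower() for a in attrs)   # names copied verbatim
    return sorted(dic1), sorted(dic2)

queries = ["What is the total number of sold products in each category "
           "in every distinct month of 2011?"]
attrs = [["Quantity", "Unit_price", "Year"], ["No_of_products", "Order_year"]]
dic1, dic2 = build_dictionaries(queries, attrs)
assert dic1 == ["product", "time"]
```

A thesaurus, as suggested in the text, would be applied to `dic2` to merge synonymous names and limit the dictionary size.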
The third phase of integration is once again manual. The subsidiary experts (i.e. the administrators of each data warehouse) each receive two prepared tables. One of the tables has the dimension dictionary as column names and the dimensions from this expert's warehouse as rows. The task of the expert is to mark any relations between rows and columns. The second table is similar, but the attribute dictionary and the attribute names from the expert's warehouse are used. Once again the expert marks any relations. Multiple choices are allowed in both tables.
The fourth phase of integration is based on the tables created in the previous stage. For each table row (in all tables) a rule is created. It may be interpreted as an equivalence between an attribute/dimension name and some dictionary terms. With this interpretation, the rules are logical formulas that may be further processed.

Algorithm 1: Pre-processing step.
Input:
• The domain and the set Q of queries to the federation
• If-then rules and a domain thesaurus T containing possible dimension and attribute names
• Logical schemas of each data warehouse DWi ∈ DW, i = 1, 2, . . . , n
Output: Forms for determining the dimensions and attributes required in the federation
foreach query q in Q do
    foreach if-then rule do
        if q is a predicate of the rule then
            Add the consequence of the rule to dictionary Dic1 (if the rule is for dimensions) or Dic2 (if the rule is for attributes);
        end
    end
end
(Optional) Add all attribute names to Dic2.
foreach data warehouse DWi, i = 1, 2, . . . , n do
    Prepare two forms: one for dimensions and one for measures and attributes;
    Send the forms to the appropriate warehouse expert and ask for them to be filled in.
end
In the fifth phase, an inference engine is used on the previously created rules. In this way, a series of new rules may be created; these new rules are in fact direct mappings between dimensions and attributes in different data warehouses. Additionally, some further rules may be created in this phase, based on pre-defined if-then rules for the domain concerned. An example of such a rule is that a city implies a region, and the region implies a country. A possible extension of this phase is the manual input of if-then rules by the main expert or the subsidiary experts.
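One simple reading of this inference step is joining rules that share a dictionary term: if elements of two different warehouses are both equated with the same term, a direct cross-warehouse mapping follows. The sketch below implements only this simplified reading; the rule format is an assumption.

```python
# Sketch of the rule-inference phase: each expert rule ties a warehouse
# element to a set of dictionary terms; two elements from different
# warehouses that share a term yield a direct cross-warehouse mapping.
# This is a simplified reading of the paper's inference engine.

def infer_mappings(rules):
    """rules: dict 'Hk.Element' -> set of dictionary terms.
    Returns pairs of elements from different warehouses sharing a term."""
    mappings = set()
    items = list(rules.items())
    for i, (e1, t1) in enumerate(items):
        for e2, t2 in items[i + 1:]:
            same_wh = e1.split(".")[0] == e2.split(".")[0]
            if not same_wh and t1 & t2:      # shared dictionary term
                mappings.add((e1, e2))
    return mappings

rules = {
    "H1.Sales": {"sales"},
    "H2.Sales": {"sales"},
    "H1.Time":  {"time"},
    "H2.Order": {"time", "order"},
}
assert ("H1.Sales", "H2.Sales") in infer_mappings(rules)
assert ("H1.Time", "H2.Order") in infer_mappings(rules)
```

Note how `H1.Time` and `H2.Order` become mapped through the shared term "time", mirroring the example mappings of Section VI.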
With the information gathered in the previous phases, in phase six it is possible to calculate the similarity between dimensions in different warehouses and then between attributes in similar dimensions. If the elements are equivalent according to a rule generated in the previous phase, their similarity is set to the maximum (1 on the normalized scale). If no rules for the elements exist, additional calculations are required to measure the similarity. For example, we may use name similarity (extended by a thesaurus), type similarity, and drill-down and drill-up similarity (if hierarchies exist). If any inconsistency arises among similarities derived from rules, additional calculations may also help resolve the conflict.
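A hedged sketch of such a measure: rule-equated elements get similarity 1, otherwise a weighted mix of name similarity and type agreement is used. The weights (0.7/0.3) and the use of `difflib` for name similarity are illustrative assumptions, not the paper's formula.

```python
# Hypothetical similarity measure for the sixth phase: elements equated
# by an inferred rule get the maximum similarity 1.0; otherwise a weighted
# mix of name similarity and type agreement is used. The 0.7/0.3 weights
# are illustrative, not taken from the paper.

from difflib import SequenceMatcher

def element_similarity(a, b, rule_pairs, type_a=None, type_b=None):
    """a, b: element names like 'H1.Client.Age'; rule_pairs: set of
    (a, b) pairs produced by the inference phase."""
    if (a, b) in rule_pairs or (b, a) in rule_pairs:
        return 1.0                            # equivalent by rule
    name_a, name_b = a.split(".")[-1].lower(), b.split(".")[-1].lower()
    name_sim = SequenceMatcher(None, name_a, name_b).ratio()
    type_sim = 1.0 if type_a == type_b else 0.0
    return 0.7 * name_sim + 0.3 * type_sim

rules = {("H1.Sales", "H2.Sales")}
assert element_similarity("H1.Sales", "H2.Sales", rules) == 1.0
s = element_similarity("H1.Client.Age", "H2.Client.Company_age", set(),
                       type_a="int", type_b="int")
assert 0.3 < s <= 1.0
```

A thesaurus lookup or drill-down/drill-up comparison would slot in as further weighted terms of the same sum.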
In the seventh phase, the most similar dimensions are mapped to create a dimension in the federation. A threshold value (here: 0.5) is used to omit dimensions with no similarities, as such a dimension would otherwise be connected with some random dimension. A dimension created in this way may be defined as a tuple consisting of the name of the new dimension (usually one of the names of the mapped dimensions) and the set of mapped dimensions used (this will be necessary when querying the federation). Similarly, within mapped dimensions the attributes are also mapped to each other, and a similar structure is used to denote them.
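The mapping step with the 0.5 threshold can be sketched as a greedy best-first matching. The greedy strategy and the tuple representation `(name, set of source dimensions)` follow the description above; the matching procedure itself is an assumption, as the paper does not fix one.

```python
# Sketch of the seventh phase: greedily map the most similar dimension
# pairs above the 0.5 threshold into federation dimensions, each stored
# as (federation name, set of mapped source dimensions). The greedy
# strategy is an assumption; the paper does not fix the procedure.

THRESHOLD = 0.5

def map_dimensions(similarities):
    """similarities: dict (dim1, dim2) -> similarity in [0, 1]."""
    used, federation = set(), []
    for (d1, d2), sim in sorted(similarities.items(),
                                key=lambda kv: -kv[1]):
        if sim >= THRESHOLD and d1 not in used and d2 not in used:
            name = d1.split(".")[-1]         # reuse one source name
            federation.append((name, {d1, d2}))
            used.update({d1, d2})
    return federation

sims = {("H1.Shop", "H2.Shop"): 0.9,
        ("H1.Time", "H2.Order"): 0.7,
        ("H1.Shop", "H2.Order"): 0.4}        # below threshold, ignored
fed = map_dimensions(sims)
assert ("Shop", {"H1.Shop", "H2.Shop"}) in fed
assert len(fed) == 2
```

The same routine, applied within each mapped dimension, would produce the attribute mappings.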
The final phase of the process is the manual verification of
the integrated structure by the main expert. Here, any errors of
the automated process may be corrected. Note that the expert
changes only the structure created in phase seven and does not
interfere with other elements.
IV. QUERY DECOMPOSITION ALGORITHM
A data warehouse federation schema, as created in the previous section, is useless if no queries can be correctly answered. For this, two additional steps are required: transforming the user query from the federation level to the warehouse level, and transforming the warehouse answer to the query back to the federation level. In this section we provide the algorithm for the first of these tasks. The federation schema consists in large part of mappings between dimensions, attributes, etc. This makes the process of decomposing the user query a straightforward one: first, the algorithm determines the data warehouses that have the appropriate dimensions, measures and attributes. Then, if a warehouse has all the elements, the query is simply translated step by step using the mappings from the schema. If the warehouse has only some of the elements, the query is transformed omitting the missing parts. If the data warehouse has none of the elements required for the query, it will not be queried (and automatically responds with a null in the next step).
V. RESULT INTEGRATION AND PRESENTATION
In the previous section the query decomposition was described. With the decomposed queries, the source warehouses may be accessed for results. At this juncture, one question remains: how can the result be guaranteed to answer the user's information need and not some irrelevant query?
The result of the previous section was either some query or an empty query. In the second case we may easily determine that there is no answer to the user query from this warehouse. As for the first case, two situations are possible: the query and the answer reflect the user's need; or the query is incomplete and the answer is only similar to the user's need. We may determine whether the first situation occurs by using the reverse of the decomposition algorithm to translate the warehouse query back into a federation query. If the federation-level query remains unchanged, then its results on the warehouse level will be as the user desires. In other cases, the results may differ.
Due to the above, we determined that it is best to show the user only the results of the queries that completely reflect his need. All other information should be available if the user explicitly demands it.
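This round-trip check can be sketched directly, reusing the simplified query representation from the previous section; the reverse mapping and all names are assumptions made for illustration.

```python
# Sketch of the presentation rule: translate the warehouse query back to
# the federation level with the reverse mapping and display the result
# only when the round trip reproduces the original query; otherwise the
# result is kept back until the user asks for it. The query format is a
# simplifying assumption.

def should_display(q_federation, q_warehouse, mapping):
    """mapping: federation name -> warehouse name; its reverse
    translates the warehouse query back for comparison."""
    if q_warehouse is None:
        return False                         # empty query: null answer
    reverse = {v: k for k, v in mapping.items()}
    q_back = {"tables": {reverse[t] for t in q_warehouse["tables"]},
              "attrs": [reverse[a] for a in q_warehouse["attrs"]]}
    return q_back == q_federation            # unchanged -> full answer

mapping = {"F.Sales": "H1.Sales",
           "F.Sales.ProductAmount": "H1.Sales.Amount",
           "F.Clients.Sex": "H1.Clients.Sex"}
q_f = {"tables": {"F.Sales"}, "attrs": ["F.Sales.ProductAmount"]}
q_h = {"tables": {"H1.Sales"}, "attrs": ["H1.Sales.Amount"]}
assert should_display(q_f, q_h, mapping)

# An incomplete translation (attribute dropped) fails the round trip:
q_h_partial = {"tables": {"H1.Sales"}, "attrs": []}
assert not should_display(q_f, q_h_partial, mapping)
```

A held-back result would still be fetched and cached, matching the prefetching behaviour described in Section VI.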
Algorithm 2: Query Decomposition.
Input: Query qF to federation F, schema mappings F–H1, F–H2, . . .
Output: Queries qH1, qH2, qH3, . . .
foreach warehouse Hi do
    Create empty query qHi;
    foreach table T from section FROM in qF do
        if mapping F.T – Hi.T exists then
            Add T to section FROM in query qHi;
        end
    end
    if section FROM in qHi has fewer tables than section FROM in qF then
        Empty qHi;
    else
        foreach attribute A in qF do
            if mapping F.Tj.A – Hi.Tj.A exists then
                Add A in the appropriate place in qHi;
            end
        end
        if section SELECT in qHi is empty then
            Empty qHi;
        else
            Query qHi is correctly transformed;
            Query the warehouse.
        end
    end
end
Algorithm 3: Result Presentation.
Input: Federation-level query qF, warehouse-level queries qHi
Output: Query results rHi
foreach warehouse Hi do
    if qHi is empty then
        Return a null answer;
    end
    Use the decomposition algorithm to translate qHi back into q′F;
    if q′F = qF then
        Query the warehouse and display the result rHi;
    else
        Query the warehouse and save (but do not display until demanded) the result rHi;
    end
end
VI. IMPLEMENTATION AND EXAMPLE
To better present these algorithms, we prepared an extended example of a sales-related data warehouse federation. Note that all the stages presented below are implemented in a test suite, and the example provides the results of the implemented system. Case studies such as the following example are the only way to evaluate our proposed method, as the individuality of each case prevents any analytical or statistical study.
Let us assume two input data warehouses, H1 concerned with retail (company retail shops) and H2 concerned with wholesale (company wholesale shops). Each data warehouse has some facts, measures, dimensions and attributes that may be similar. The task is to create a federated data warehouse.
Fig. 2. Interface View
In the user-requirements-gathering phase, the expert may provide the domain (sales) directly, or the system may try to guess it from the example queries. Assuming that the only information provided by the expert at this stage is the natural language query "What is the total number of sold products in each category in every distinct month of 2011?", the system will determine the domain of the federation to be "sales". Naturally, the more information is provided, the closer the result will be to the real domain.
After the domain is determined, the system analyzes the input warehouses to prepare the dictionaries for the experts mapping the dimensions and attributes. The dimension dictionary for the sales domain is: sales, client, product, location, time. The facts table is also provided in this dictionary. Note that it is left to expert opinion whether the "location" term relates to the client or the shop location (in this example both have location data). The attribute dictionary is based on attribute names already in the data warehouses, some terms predefined for each domain, and synonyms of some of the attribute names. Even in this simple example this dictionary is quite large; some of the terms are: single product price, no of products, client, first name, surname, day, month, year, sex, age, employee, product name, producer, category, country, etc.
Fig. 3. Form for determining common dimensions and fact table.
For each of the input data warehouses, the expert determines the mapping between dictionary terms and dimensions/attributes. Some of the mappings in this example are:
H1.Sales ⇔ Sales
H1.Client ⇔ Client
H1.Time ⇔ Time
H2.Sales ⇔ Sales
H2.Client ⇔ Client
H2.Order ⇔ Time
H1.Sales.Quantity ⇔ No of products
H1.Sales.Unit price ⇔ Single prod price ∧ Gross cost ∧ Money
H1.Time.Year ⇔ Time ∧ Year
H2.Sales.No of products ⇔ No of products
H2.Sales.Total net price ⇔ Net price ∧ Total value ∧ Money
H2.Sales.Total gross price ⇔ Gross price ∧ Total value ∧ Money
H2.Order.Order year ⇔ Time ∧ Year ∧ Order
In the rule inference phase these rules are then processed to create mappings between the warehouses. Some of those mappings are:
H1.Sales ⇔ H2.Sales
H1.Product ⇔ H2.Product
H1.Sales.Transaction value ⇔ H2.Sales.Total gross price
H1.Client.Age ⇔ H2.Client.Company age
With these rules the first version of the federated data warehouse is created, consisting of only the mapped elements. A dimension in such a federation may look as follows:
Dimension: Shop (H1.Shop; H2.Shop):
• Name (H1.Shop.ShopName; H2.Shop.Name).
Finally, we use additional methods to try to map the remaining elements. The final result may be as follows:
Dimension: Shop (H1.Shop; H2.Shop):
• Name (H1.Shop.ShopName; H2.Shop.Name),
• Address (H1.Shop.Address),
• City (H2.Shop.City),
• Region (H2.Shop.Region).
One may note that City and Region are related to Address and that some inference is possible (infer the City from the Address, then infer the Region from the City). Our algorithm does not allow that possibility in the automated phases. The last manual verification phase may be used to add some additional mappings, but without additional mechanisms our query decomposition and result presentation algorithms will not be able to use such attributes.
For the parts of our method described in Sections IV and V, the following queries were used as examples:
1) What is the total number of sold products in each
category in every distinct month of 2011?
2) What is the total amount of products ordered on each
distinct day of the month?
3) What products are bought by women?
These natural language queries take the following form in SQL:
1) SELECT F.Sales.ProductAmount, F.Products.Category, F.Invoice.Month
   FROM F.Sales, F.Products, F.Invoice
   WHERE F.Sales.IdProduct = F.Products.IdProduct
     AND F.Sales.InvoiceNo = F.Invoice.InvoiceNo
     AND F.Invoice.Year = 2011
   GROUP BY F.Products.Category, F.Invoice.Month
2) SELECT F.Sales.ProductAmount, F.Order.Day
   FROM F.Sales, F.Order
   WHERE F.Sales.OrderId = F.Order.OrderId
   GROUP BY F.Order.Day
3) SELECT F.Sales.ProductAmount, F.ProductName
   FROM F.Sales, F.Products, F.Clients
   WHERE F.Sales.ProductId = F.Products.ProductId
     AND F.Sales.ClientId = F.Clients.ClientId
     AND F.Clients.Sex = 'Female'
   GROUP BY F.ProductName
These queries may be decomposed for each source warehouse as follows. Queries to the first warehouse H1:
1) SELECT H1.Sales.Amount, H1.Products.Category, H1.Time.Month
   FROM H1.Sales, H1.Products, H1.Time
   WHERE H1.Sales.IdProduct = H1.Products.IdProduct
     AND H1.Sales.Time = H1.Time.Time
     AND H1.Time.Year = 2011
   GROUP BY H1.Products.Category, H1.Time.Month
2) EMPTY QUERY
3) SELECT H1.Sales.Amount, H1.ProductName
   FROM H1.Sales, H1.Products, H1.Clients
   WHERE H1.Sales.ProductId = H1.Products.ProductId
     AND H1.Sales.ClientId = H1.Clients.ClientId
     AND H1.Clients.Sex = 'Female'
   GROUP BY H1.ProductName
Queries to the second warehouse H2:
1) SELECT H2.Sales.ProductAmount, H2.Products.Category, H2.Invoice.Month
   FROM H2.Sales, H2.Products, H2.Invoice
   WHERE H2.Sales.IdProduct = H2.Products.IdProduct
     AND H2.Sales.InvoiceNo = H2.Invoice.InvoiceNo
     AND H2.Invoice.Year = 2011
   GROUP BY H2.Products.Category, H2.Invoice.Month
2) SELECT H2.Sales.ProductAmount, H2.Order.Day
   FROM H2.Sales, H2.Order
   WHERE H2.Sales.OrderId = H2.Order.OrderId
   GROUP BY H2.Order.Day
3) SELECT H2.Sales.ProductAmount, H2.ProductName
   FROM H2.Sales, H2.Products, H2.Clients
   WHERE H2.Sales.ProductId = H2.Products.ProductId
     AND H2.Sales.ClientId = H2.Clients.ClientId
   GROUP BY H2.ProductName
With these queries, we may get results from the source warehouses. In the first case, both results reflect the user's information need, so both answers will be displayed. In the second case, the lack of an "Order" table makes it impossible to provide any sensible answer from H1, so only one result is displayed. Finally, for the last query, the clients' sex is stored only in H1, so only the results from one query will be displayed immediately. The user will still be informed that it is possible to access H2 (in fact the data will already have been prefetched), but the answer may not fully reflect his query.
VII. CONCLUSION
In this paper we proposed a generic approach to creating
a data warehouse federation, providing methods for:
• Schema integration, in order to create the federation schema.
• Query decomposition, in order to translate a query to the federation into queries to the individual warehouses.
• Result integration, in order to select the answer that best suits the user's information need.
With the working implementation and theoretical examples
we were able to test our approach and determine its validity
for practical use.
Creating a working data warehouse federation provides a new layer of generalization over the data in an enterprise. If multiple warehouses exist on the same topic, with similar (but not identical) schemas, accessing the data and processing it as a whole may be impossible due to inconsistencies. A data warehouse federation is a logical layer over the physical warehouses. The aim of the federation is to present the user with just one schema to query. The federation processes the query automatically and translates it to the component databases in order to obtain results. Finally, it presents these results to the user as a disambiguated collection.
The main difference between the federation approach and other existing methods, such as distributed databases, is the high level of abstraction present in the federation: no actual data is stored or duplicated, but it remains easily accessible from the source components.
The presented model may still be subject to many improvements, mostly by reducing the necessary human interaction during the initial stages. This would speed up the process considerably and will be our main area of focus in future research.
ACKNOWLEDGMENT
This research was partially supported by the Polish Ministry of Science and Higher Education under grant no. N N519 407437 (2009-2012).
REFERENCES
[1] Barg M., Wong R. K.: Cooperative Query Answering for Semistructured Data. In Proceedings of ADC2003, ACM (2003)
[2] Bennett T. A., Bayrak C.: Bridging The Data Integration Gap: From Theory to Implementation. ACM SIGSOFT Software Engineering Notes, Volume 36, Number 3 (2011)
[3] Berger S., Schrefl M.: From Federated Databases to a Federated Data Warehouse System. In Proceedings of the 41st Hawaii International Conference on System Sciences, IEEE (2008)
[4] Dayal U., Castellanos M., Simitsis A., Wilkinson K.: Data Integration Flows for Business Intelligence. In Proceedings of EDBT'09, ACM (2009)
[5] Jindal R., Acharya A.: Federated Data Warehouse Architecture. Wipro Technologies.
[6] Kern R., Ryk K., Nguyen N. T.: A Framework for Building Logical Schema and Query Decomposition in Data Warehouse Federations. In Proceedings of ICCCI'11, LNAI 6922, pp. 612–622 (2011)
[7] Moody D. L., Kortink M. A. R.: From Enterprise Models to Dimensional Models: A Methodology for Data Warehouse and Data Mart Design. In Proceedings of the International Workshop on Design and Management of Data Warehouses (2000)
[8] Moya L. G., Kudama S., Aramburu M. H.: Integrating Web Feed Opinions into a Corporate Data Warehouse. In Proceedings of BEWEB 2011, ACM (2011)
[9] Priebe T., Pernul G.: Towards Integrative Enterprise Knowledge Portals. In Proceedings of CIKM03, ACM (2003)
[10] Riazati D., Thom J. A., Zhang X.: Enforcing Strictness in Integration of Dimensions: Beyond Instance Matching. In Proceedings of DOLAP11, ACM (2011)
[11] Sahama T. R., Croll P. R.: A Data Warehouse Architecture for Clinical Data Warehousing. Conferences in Research and Practice in Information Technology, Vol. 68, pp. 227–232 (2007)
[12] Schütz C., Schrefl M., Neumayr B.: Incremental Integration of Data Warehouses: The Hetero-Homogeneous Approach. In Proceedings of DOLAP11, ACM (2011)
[13] Todman C.: Designing a data warehouse: supporting customer relationship management. Prentice Hall (2001)