A Framework for Building Data Warehouse Federations

Marcin Maleszka
Institute of Informatics
Wroclaw University of Technology,
Wroclaw, Poland

Bernadetta Mianowska
Institute of Informatics
Wroclaw University of Technology,
Wroclaw, Poland

Ngoc Thanh Nguyen
Institute of Informatics
Wroclaw University of Technology,
Wroclaw, Poland
Abstract—Data warehouse federations are a next step in modern information storage and processing systems. They are systems of classical data warehouses connected by an integrated schema that allows access to all the data of the source warehouses. We propose a generic approach to creating a data warehouse federation, providing methods for schema integration (to create the federation schema), query decomposition (to translate a query to the federation into queries to the individual warehouses) and result integration (to select the answer that best suits the user's information need). We test this approach with a basic implementation and present a working example.
Index Terms—data warehouse federation, data warehouse, knowledge integration, collective intelligence.
I. INTRODUCTION
Data warehouse federations are becoming an increasingly important research topic. The need for them arises as information storage and processing systems become larger and more distributed. Overall, we can distinguish the following types of information storage and processing systems as a progression of technologies: distributed databases, where many databases share the same schema and the data is duplicated; federated databases, where the schema may differ between instances, but a global system exists to obtain any necessary data; data warehouses, which can be treated as multidimensional databases geared towards data analysis; and finally data warehouse federations, which consist of multiple data warehouses. In a technical sense, even data marts - data warehouses devoted to a particular subject - can be a foundation for data warehouse federations [13].
In this paper we define a federation of data warehouses as a set of independent but consistent data warehouses based on a common enterprise data model (federated architecture) [7]. In this context, federated data warehouses are multiple data warehouses holding semantically similar data. A federation can be understood, for example, as a new approach to integrating data marts in order to support business decisions at the level of the management board.
We assume that a data warehouse federation is a set of data
warehouses that meets the following conditions:
• The federation is treated as a whole by the user who asks queries (transparency).
• The federation consists of a logical schema and mechanisms for processing user queries. The federation itself does not store any data (the main difference between a data warehouse and a federation).
• The federation takes care of data consistency.
The main objective of our work is to propose algorithms for building the logical schema of the federation and for processing user queries: query decomposition and result integration techniques. The idea of our approach is presented in Fig. 1. We assume the user needs information about some data stored in many data warehouses. He sends a query Q to the federation and expects an answer concatenated from the results obtained in the particular warehouses. The tasks of the federation's processing methods are as follows:
• Translating the user's query into proper data warehouse queries: QH1, QH2, . . . , QHn;
• Gathering the results from each data warehouse: rH1, rH2, . . . , rHn;
• Integrating them for presentation.
Fig. 1. Data warehouse federation as a logical layer over multiple data warehouses.
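The three-step flow just listed can be sketched as a small pipeline. This is a minimal illustrative sketch, not the paper's implementation: the warehouse interface, the `translate` function and the summing `integrate` function are assumptions made for the example.

```python
# Minimal sketch of the federation query pipeline: the federation
# translates the user query Q into per-warehouse queries, gathers the
# partial results, and integrates them. All names here are illustrative.

def process_federation_query(q, warehouses, translate, integrate):
    """q: federation-level query; warehouses: dict name -> callable
    that answers a warehouse-level query."""
    partial_results = {}
    for name, ask in warehouses.items():
        q_h = translate(q, name)              # Q -> Q_Hi
        if q_h is not None:
            partial_results[name] = ask(q_h)  # gather r_Hi
    return integrate(partial_results)         # combined answer for the user

# Toy usage: two warehouses holding monthly sales counts.
h1 = {"2011-01": 10, "2011-02": 12}
h2 = {"2011-01": 7}
warehouses = {"H1": lambda q: h1.get(q, 0), "H2": lambda q: h2.get(q, 0)}
translate = lambda q, name: q                  # identical schemas here
integrate = lambda parts: sum(parts.values())  # concatenate/sum results

assert process_federation_query("2011-01", warehouses, translate, integrate) == 17
```

The real federation replaces the three callables with the schema-integration, decomposition and result-integration methods described in Sections III-V.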
The rest of the paper is organized as follows. In Section II we present related work on the architecture of federated data warehouses. Section III contains a description of the proposed algorithm for data warehouse schema integration; the query decomposition algorithm is presented in Section IV, while the result integration procedure for presentation is described in Section V. The implementation of the algorithms and an example of their use are given in Section VI. In the last section, Section VII, we summarize our propositions and outline future work.
2012 IEEE International Conference on Systems, Man, and Cybernetics, October 14-17, 2012, COEX, Seoul, Korea
978-1-4673-1714-6/12/$31.00 ©2012 IEEE
II. RELATED WORKS
The research into federated data warehouses so far may be
grouped into three main areas: data marts, strict federations
and loosely-formed federations.
A data warehouse based on data marts is the best-known structure and is rarely called a federation, even if technically it may be considered one. In general, a data mart has several identical dimensions, and the measures are not important for the federation. Sahama and Croll [11] use data marts both to speed up data access in a vast medical warehouse and to provide the users with appropriate perspectives. The data warehouse on the top level is the federation (this schema is accessible to the administrator only) and the data marts may be treated as sources. The review in [2] describes the LAV approach to creating databases in terms of a data mart, while treating the GAV approach as the single-view data warehouse. This is also discussed as a possible approach in [4].
Strict federations are sometimes enforced in larger corporations - the idea behind them is to create identical data warehouses, so that any of them may be accessed from any level. A perfect example is the regional data warehouse discussed in [5]. A more relaxed example may be found in [10], where strictness is enforced during the process of creating the warehouse, rather than designed in from the start.
Loosely formed federations, on the other hand, are sometimes created as extensions of existing star schemas into snowflake schemas by adding external databases. In [8] this approach is used to gather additional information from remote servers (even the Internet) in order to provide better feedback to the final user. A similar approach may be found in [1], where the data warehouse is not constructed, but the query answers are accessed in what may be considered a snowflake schema view. In a more general way it is also attempted in [9], where the user context is determined and additional data sources become available during user queries. This outside information is not part of the data warehouse, but may be accessed from it.
A few other approaches to creating data warehouse federations are possible, as we propose in this paper and as shown in [6]. This rarely researched area may be used to create federations from warehouses with similar but not identical dimensions and measures. In [6] this is done by comparing the names and types of attributes and measures (and the names of dimensions) to expand the star schema of one of the source warehouses by adding new dimensions from other sources. Similarly, in [12] a data warehouse is incrementally changed, and the final result may be considered a data warehouse federation in the sense proposed in [6].
III. ALGORITHM FOR DETERMINING LOGICAL SCHEMA OF
DATA WAREHOUSE FEDERATION
In order to present a logical schema of the federation to the user, some method must be used to determine the relations between elements of the warehouses. This may be done entirely by an expert, but any automation speeds up the process. Here we present a method with a few automatic and a few manual phases.
We divide the integration process as follows:
1) Gathering user requirements (manual phase)
2) Pre-processing (automated phase)
3) Gathering expert knowledge (manual phase)
4) Rule generation (automated phase)
5) Rule inference (automated phase)
6) Similarity calculation (automated phase)
7) Element mapping (automated phase)
8) Verification (manual phase)
In the first phase, the domain of the federation is broadly defined. The main expert may provide the domain directly or give some clues that will be used in the pre-processing to determine the domain. The expert also provides proposed queries for the federated data warehouse, either as simple SQL queries or as natural language sentences. These are, for example, the queries that the company CEO wants answered. These queries may also be used as clues for determining the domain.
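The domain-guessing step can be sketched with simple keyword-based if-then rules. The paper does not publish its actual rule set, so the rules, the voting scheme and all names below are hypothetical, made up purely for illustration.

```python
# Hypothetical sketch of guessing the federation domain from the expert's
# example queries via keyword-based if-then rules. The rule contents are
# illustrative assumptions, not the paper's rules.

DOMAIN_RULES = {            # keyword found in a query -> inferred domain
    "sold": "sales",
    "products": "sales",
    "patient": "medical",
}

def infer_domain(queries, explicit_domain=None):
    if explicit_domain:                      # expert gave the domain directly
        return explicit_domain
    votes = {}
    for q in queries:
        for keyword, domain in DOMAIN_RULES.items():
            if keyword in q.lower():
                votes[domain] = votes.get(domain, 0) + 1
    return max(votes, key=votes.get) if votes else None

q = ("What is the total number of sold products in each category "
     "in every distinct month of 2011?")
assert infer_domain([q]) == "sales"
```

With more example queries the vote counts sharpen the guess, matching the paper's remark that more information brings the result closer to the real domain.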
The second phase is automated and uses the input data warehouses and the previously acquired information to create two distinct dictionaries for use in phase three. First, if the domain is not yet resolved, an inference process takes place to determine it, using some pre-defined if-then rules. Once the domain is known, the federation queries are analyzed to determine the dimensions that will be required in the federation. For example, if there are queries about time, then a time dimension will most probably be required in the federation. Once again, if-then rules are used. After the analysis of the queries, the first dictionary is completed, consisting of the names of all probably required dimensions. Finally, the attributes and measures are analyzed to create a second dictionary. At this stage some if-then rules are used, but due to the large number of possible attribute names, some of the attribute names are simply copied to the dictionary. A possible extension of this phase is using a thesaurus to limit the dictionary size.
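The two dictionaries of this phase can be sketched as follows. The rule contents and the schemas are illustrative assumptions; `Dic1` is built from the queries via if-then rules and `Dic2` by copying attribute names from the source warehouses, as described above.

```python
# Minimal sketch of the pre-processing phase: build the dimension
# dictionary (Dic1) from the federation queries via if-then rules, and
# the attribute dictionary (Dic2) by copying attribute names from the
# source warehouses. Rules and schemas below are made up for illustration.

DIMENSION_RULES = {          # query keyword -> dimension name for Dic1
    "month": "time", "year": "time", "2011": "time",
    "category": "product", "products": "product",
    "client": "client", "women": "client",
}

def build_dictionaries(queries, warehouse_attributes):
    dic1, dic2 = set(), set()
    for q in queries:
        for keyword, dimension in DIMENSION_RULES.items():
            if keyword in q.lower():
                dic1.add(dimension)
    for attrs in warehouse_attributes:          # one list per warehouse
        dic2.update(a.lower() for a in attrs)   # names copied verbatim
    return sorted(dic1), sorted(dic2)

queries = ["What is the total number of sold products in each category "
           "in every distinct month of 2011?"]
attrs = [["Quantity", "Unit_price", "Year"], ["No_of_products", "Order_year"]]
dic1, dic2 = build_dictionaries(queries, attrs)
assert dic1 == ["product", "time"]
```

A thesaurus, as suggested in the text, would be applied to `dic2` to merge synonymous names and limit the dictionary size.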
The third phase of integration is once again manual. The subsidiary experts (i.e. the administrators of each data warehouse) each receive two prepared tables. One of the tables has the dimension dictionary as column names and the dimensions from this expert's warehouse as rows. The task of the expert is to mark any relations between rows and columns. The second table is similar, but the attribute dictionary and the attribute names from the expert's warehouse are used. Once again the expert marks any relations. Multiple choices are allowed in both tables.
The fourth phase of integration is based on the tables created in the previous stage. For each table row (in all tables) a rule is created. It may be interpreted as an equivalence between an attribute/dimension name and some dictionary terms. With this interpretation, the rules are logical formulas that may be further processed.

Algorithm 1: Pre-processing step.
Input:
• The domain and the set Q of queries to the federation
• If-then rules and a domain thesaurus T containing possible dimension and attribute names
• Logical schemas of each data warehouse DWi ∈ DW, i = 1, 2, . . . , n
Output: Forms for determining the dimensions and attributes required in the federation
foreach query q in Q do
    foreach if-then rule do
        if q is a predicate of the rule then
            Add the consequence of the rule to dictionary Dic1 (if the rule is for dimensions) or Dic2 (if the rule is for attributes);
        end
    end
end
(Optional) Add all attribute names to Dic2.
foreach data warehouse DWi, i = 1, 2, . . . , n do
    Prepare two forms: one for dimensions and one for measures and attributes;
    Send the forms to the appropriate warehouse expert and ask for them to be filled in.
end
In the fifth phase, an inference engine is used on the previously created rules. In this way, a series of new rules may be created; these new rules are in fact direct mappings between dimensions and attributes in different data warehouses. Additionally, some further rules may be created in this phase, based on pre-defined if-then rules for the domain concerned. An example of such a rule is that a city implies a region, and the region implies a country. A possible extension of this phase is the manual input of if-then rules by the main expert or the subsidiary experts.
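One simple reading of this inference step is joining rules that share a dictionary term: if elements of two different warehouses are both equated with the same term, a direct cross-warehouse mapping follows. The sketch below implements only this simplified reading; the rule format is an assumption.

```python
# Sketch of the rule-inference phase: each expert rule ties a warehouse
# element to a set of dictionary terms; two elements from different
# warehouses that share a term yield a direct cross-warehouse mapping.
# This is a simplified reading of the paper's inference engine.

def infer_mappings(rules):
    """rules: dict 'Hk.Element' -> set of dictionary terms.
    Returns pairs of elements from different warehouses sharing a term."""
    mappings = set()
    items = list(rules.items())
    for i, (e1, t1) in enumerate(items):
        for e2, t2 in items[i + 1:]:
            same_wh = e1.split(".")[0] == e2.split(".")[0]
            if not same_wh and t1 & t2:      # shared dictionary term
                mappings.add((e1, e2))
    return mappings

rules = {
    "H1.Sales": {"sales"},
    "H2.Sales": {"sales"},
    "H1.Time":  {"time"},
    "H2.Order": {"time", "order"},
}
assert ("H1.Sales", "H2.Sales") in infer_mappings(rules)
assert ("H1.Time", "H2.Order") in infer_mappings(rules)
```

Note how `H1.Time` and `H2.Order` become mapped through the shared term "time", mirroring the example mappings of Section VI.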
With the information gathered in the previous phases, in phase six it is possible to calculate the similarity between dimensions in different warehouses and then between attributes in similar dimensions. If the elements are equivalent according to a rule generated in the previous phase, their similarity is set to the maximum (1 on the normalized scale). If no rules for the elements exist, additional calculations are required to measure the similarity. For example, we may use name similarity (extended by a thesaurus), type similarity, and drill-down and drill-up similarity (if hierarchies exist). If any inconsistency arises among similarities derived from rules, additional calculations may also help resolve the conflict.
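A hedged sketch of such a measure: rule-equated elements get similarity 1, otherwise a weighted mix of name similarity and type agreement is used. The weights (0.7/0.3) and the use of `difflib` for name similarity are illustrative assumptions, not the paper's formula.

```python
# Hypothetical similarity measure for the sixth phase: elements equated
# by an inferred rule get the maximum similarity 1.0; otherwise a weighted
# mix of name similarity and type agreement is used. The 0.7/0.3 weights
# are illustrative, not taken from the paper.

from difflib import SequenceMatcher

def element_similarity(a, b, rule_pairs, type_a=None, type_b=None):
    """a, b: element names like 'H1.Client.Age'; rule_pairs: set of
    (a, b) pairs produced by the inference phase."""
    if (a, b) in rule_pairs or (b, a) in rule_pairs:
        return 1.0                            # equivalent by rule
    name_a, name_b = a.split(".")[-1].lower(), b.split(".")[-1].lower()
    name_sim = SequenceMatcher(None, name_a, name_b).ratio()
    type_sim = 1.0 if type_a == type_b else 0.0
    return 0.7 * name_sim + 0.3 * type_sim

rules = {("H1.Sales", "H2.Sales")}
assert element_similarity("H1.Sales", "H2.Sales", rules) == 1.0
s = element_similarity("H1.Client.Age", "H2.Client.Company_age", set(),
                       type_a="int", type_b="int")
assert 0.3 < s <= 1.0
```

A thesaurus lookup or drill-down/drill-up comparison would slot in as further weighted terms of the same sum.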
In the seventh phase, the most similar dimensions are mapped to create a dimension in the federation. A threshold value (here: 0.5) is used to omit dimensions with no similarities, as such a dimension would otherwise be connected with some random dimension. A dimension created in this way may be defined as a tuple consisting of the name of the new dimension (usually one of the names of the mapped dimensions) and the set of mapped dimensions used (this will be necessary when querying the federation). Similarly, within mapped dimensions the attributes are also mapped to each other, and a similar structure is used to denote them.
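The mapping step with the 0.5 threshold can be sketched as a greedy best-first matching. The greedy strategy and the tuple representation `(name, set of source dimensions)` follow the description above; the matching procedure itself is an assumption, as the paper does not fix one.

```python
# Sketch of the seventh phase: greedily map the most similar dimension
# pairs above the 0.5 threshold into federation dimensions, each stored
# as (federation name, set of mapped source dimensions). The greedy
# strategy is an assumption; the paper does not fix the procedure.

THRESHOLD = 0.5

def map_dimensions(similarities):
    """similarities: dict (dim1, dim2) -> similarity in [0, 1]."""
    used, federation = set(), []
    for (d1, d2), sim in sorted(similarities.items(),
                                key=lambda kv: -kv[1]):
        if sim >= THRESHOLD and d1 not in used and d2 not in used:
            name = d1.split(".")[-1]         # reuse one source name
            federation.append((name, {d1, d2}))
            used.update({d1, d2})
    return federation

sims = {("H1.Shop", "H2.Shop"): 0.9,
        ("H1.Time", "H2.Order"): 0.7,
        ("H1.Shop", "H2.Order"): 0.4}        # below threshold, ignored
fed = map_dimensions(sims)
assert ("Shop", {"H1.Shop", "H2.Shop"}) in fed
assert len(fed) == 2
```

The same routine, applied within each mapped dimension, would produce the attribute mappings.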
The final phase of the process is the manual verification of
the integrated structure by the main expert. Here, any errors of
the automated process may be corrected. Note that the expert
changes only the structure created in phase seven and does not
interfere with other elements.
IV. QUERY DECOMPOSITION ALGORITHM
A data warehouse federation schema, as created in the previous section, is useless if no queries can be correctly answered. For this, two additional steps are required: transforming the user query from the federation level to the warehouse level, and transforming the warehouse answer to the query back to the federation level. In this section we provide the algorithm for the first of these tasks. The federation schema consists in large part of mappings between dimensions, attributes, etc. This makes the process of decomposing the user query a straightforward one: first, the algorithm determines the data warehouses that have the appropriate dimensions, measures and attributes. Then, if a warehouse has all the elements, the query is simply translated step by step using the mappings from the schema. If the warehouse has only some of the elements, the query is transformed omitting the missing parts. If the data warehouse has none of the elements required for the query, it will not be queried (and automatically responds with a null in the next step).
V. RESULT INTEGRATION AND PRESENTATION
In the previous section the query decomposition was described. With the decomposed queries, the source warehouses may be accessed for results. At this juncture, one question remains: how can the result be guaranteed to answer the user's information need and not some irrelevant query?
The result of the previous section was either some query or an empty query. In the second case we may easily determine that there is no answer to the user query from this warehouse. As for the first case, two situations are possible: the query and the answer reflect the user's need; or the query is incomplete and the answer is only similar to the user's need. We may determine whether the first situation occurs by using the reverse of the decomposition algorithm to translate the warehouse query back into a federation query. If the federation-level query remains unchanged, then its results on the warehouse level will be as the user desires. In other cases, the results may differ.
Due to the above, we determined that it is best to show the user only the results of the queries that completely reflect his need. All other information should be available if the user explicitly demands it.
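This round-trip check can be sketched directly, reusing the simplified query representation from the previous section; the reverse mapping and all names are assumptions made for illustration.

```python
# Sketch of the presentation rule: translate the warehouse query back to
# the federation level with the reverse mapping and display the result
# only when the round trip reproduces the original query; otherwise the
# result is kept back until the user asks for it. The query format is a
# simplifying assumption.

def should_display(q_federation, q_warehouse, mapping):
    """mapping: federation name -> warehouse name; its reverse
    translates the warehouse query back for comparison."""
    if q_warehouse is None:
        return False                         # empty query: null answer
    reverse = {v: k for k, v in mapping.items()}
    q_back = {"tables": {reverse[t] for t in q_warehouse["tables"]},
              "attrs": [reverse[a] for a in q_warehouse["attrs"]]}
    return q_back == q_federation            # unchanged -> full answer

mapping = {"F.Sales": "H1.Sales",
           "F.Sales.ProductAmount": "H1.Sales.Amount",
           "F.Clients.Sex": "H1.Clients.Sex"}
q_f = {"tables": {"F.Sales"}, "attrs": ["F.Sales.ProductAmount"]}
q_h = {"tables": {"H1.Sales"}, "attrs": ["H1.Sales.Amount"]}
assert should_display(q_f, q_h, mapping)

# An incomplete translation (attribute dropped) fails the round trip:
q_h_partial = {"tables": {"H1.Sales"}, "attrs": []}
assert not should_display(q_f, q_h_partial, mapping)
```

A held-back result would still be fetched and cached, matching the prefetching behaviour described in Section VI.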
Algorithm 2: Query Decomposition.
Input: Query qF to federation F, schema mappings F–H1, F–H2, . . .
Output: Queries qH1, qH2, qH3, . . .
foreach warehouse Hi do
    Create empty query qHi;
    foreach table T from section FROM in qF do
        if mapping F.T – Hi.T exists then
            Add T to section FROM in query qHi;
        end
    end
    if section FROM in qHi has fewer tables than section FROM in qF then
        Empty qHi;
    else
        foreach attribute A in qF do
            if mapping F.Tj.A – Hi.Tj.A exists then
                Add A in the appropriate place in qHi;
            end
        end
        if section SELECT in qHi is empty then
            Empty qHi;
        else
            Query qHi is correctly transformed;
            Query the warehouse.
        end
    end
end
Algorithm 3: Result Presentation.
Input: Federation-level query qF, warehouse-level queries qHi
Output: Query results rHi
foreach warehouse Hi do
    if qHi is empty then
        Return a null answer;
    end
    Use the decomposition algorithm to translate qHi back into q′F;
    if q′F = qF then
        Query the warehouse and display the result rHi;
    else
        Query the warehouse and save (but do not display until demanded) the result rHi;
    end
end
VI. IMPLEMENTATION AND EXAMPLE
To better present these algorithms, we prepared an extended example of a sales-related data warehouse federation. Note that all the stages presented below are implemented in a test suite, and the example provides the results of the implemented system. Case studies such as the following example are the only way to evaluate our proposed method, as the individuality of each case prevents any analytical or statistical study.
Let us assume two input data warehouses, H1 concerned with retail (company retail shops) and H2 concerned with wholesale (company wholesale shops). Each data warehouse has some facts, measures, dimensions and attributes that may be similar. The task is to create a federated data warehouse.
Fig. 2. Interface View
In the user-requirements-gathering phase, the expert may provide the domain (sales) directly, or the system may try to guess it from the example queries. Assuming that the only information provided by the expert at this stage is the natural language query "What is the total number of sold products in each category in every distinct month of 2011?", the system will determine the domain of the federation to be "sales". Naturally, the more information is provided, the closer the result will be to the real domain.
After the domain is determined, the system analyzes the input warehouses to prepare the dictionaries for the experts mapping the dimensions and attributes. The dimension dictionary for the sales domain is: sales, client, product, location, time. The facts table is also provided in this dictionary. Note that it is left to expert opinion whether the "location" term relates to the client or the shop location (in this example both have location data). The attribute dictionary is based on attribute names already in the data warehouses, some terms predefined for each domain, and synonyms of some of the attribute names. Even in this simple example this dictionary is quite large; some of the terms are: single product price, no of products, client, first name, surname, day, month, year, sex, age, employee, product name, producer, category, country, etc.
Fig. 3. Form for determining common dimensions and fact table.
For each of the input data warehouses, the expert determines the mapping between dictionary terms and dimensions/attributes. Some of the mappings in this example are:
H1.Sales ⇔ Sales
H1.Client ⇔ Client
H1.Time ⇔ Time
H2.Sales ⇔ Sales
H2.Client ⇔ Client
H2.Order ⇔ Time
H1.Sales.Quantity ⇔ No of products
H1.Sales.Unit price ⇔ Single prod price ∧ Gross cost ∧ Money
H1.Time.Year ⇔ Time ∧ Year
H2.Sales.No of products ⇔ No of products
H2.Sales.Total net price ⇔ Net price ∧ Total value ∧ Money
H2.Sales.Total gross price ⇔ Gross price ∧ Total value ∧ Money
H2.Order.Order year ⇔ Time ∧ Year ∧ Order
In the rule inference phase these rules are then processed to create mappings between the warehouses. Some of those mappings are:
H1.Sales ⇔ H2.Sales
H1.Product ⇔ H2.Product
H1.Sales.Transaction value ⇔ H2.Sales.Total gross price
H1.Client.Age ⇔ H2.Client.Company age
With these rules the first version of the federated data warehouse is created, consisting of only the mapped elements. A dimension in such a federation may look as follows:
Dimension: Shop (H1.Shop; H2.Shop):
• Name (H1.Shop.ShopName; H2.Shop.Name).
Finally, we use additional methods to try to map the remaining elements. The final result may be as follows:
Dimension: Shop (H1.Shop; H2.Shop):
• Name (H1.Shop.ShopName; H2.Shop.Name),
• Address (H1.Shop.Address),
• City (H2.Shop.City),
• Region (H2.Shop.Region).
One may note that City and Region are related to Address and that some inference is possible (infer the City from the Address, then infer the Region from the City). Our algorithm does not allow that possibility in the automated phases. The last manual verification phase may be used to add some additional mappings, but without additional mechanisms our query decomposition and result presentation algorithms will not be able to use such attributes.
For the parts of our method described in Sections IV and V, the following queries were used as examples:
1) What is the total number of sold products in each
category in every distinct month of 2011?
2) What is the total amount of products ordered on each
distinct day of the month?
3) What products are bought by women?
These natural language queries take the following form in SQL:
1) SELECT F.Sales.ProductAmount, F.Products.Category, F.Invoice.Month
   FROM F.Sales, F.Products, F.Invoice
   WHERE F.Sales.IdProduct = F.Products.IdProduct
     AND F.Sales.InvoiceNo = F.Invoice.InvoiceNo
     AND F.Invoice.Year = 2011
   GROUP BY F.Products.Category, F.Invoice.Month
2) SELECT F.Sales.ProductAmount, F.Order.Day
   FROM F.Sales, F.Order
   WHERE F.Sales.OrderId = F.Order.OrderId
   GROUP BY F.Order.Day
3) SELECT F.Sales.ProductAmount, F.ProductName
   FROM F.Sales, F.Products, F.Clients
   WHERE F.Sales.ProductId = F.Products.ProductId
     AND F.Sales.ClientId = F.Clients.ClientId
     AND F.Clients.Sex = 'Female'
   GROUP BY F.ProductName
These queries may be decomposed for each source warehouse as follows. Queries to the first warehouse H1:
1) SELECT H1.Sales.Amount, H1.Products.Category, H1.Time.Month
   FROM H1.Sales, H1.Products, H1.Time
   WHERE H1.Sales.IdProduct = H1.Products.IdProduct
     AND H1.Sales.Time = H1.Time.Time
     AND H1.Time.Year = 2011
   GROUP BY H1.Products.Category, H1.Time.Month
2) EMPTY QUERY
3) SELECT H1.Sales.Amount, H1.ProductName
   FROM H1.Sales, H1.Products, H1.Clients
   WHERE H1.Sales.ProductId = H1.Products.ProductId
     AND H1.Sales.ClientId = H1.Clients.ClientId
     AND H1.Clients.Sex = 'Female'
   GROUP BY H1.ProductName
Queries to the second warehouse H2:
1) SELECT H2.Sales.ProductAmount, H2.Products.Category, H2.Invoice.Month
   FROM H2.Sales, H2.Products, H2.Invoice
   WHERE H2.Sales.IdProduct = H2.Products.IdProduct
     AND H2.Sales.InvoiceNo = H2.Invoice.InvoiceNo
     AND H2.Invoice.Year = 2011
   GROUP BY H2.Products.Category, H2.Invoice.Month
2) SELECT H2.Sales.ProductAmount, H2.Order.Day
   FROM H2.Sales, H2.Order
   WHERE H2.Sales.OrderId = H2.Order.OrderId
   GROUP BY H2.Order.Day
3) SELECT H2.Sales.ProductAmount, H2.ProductName
   FROM H2.Sales, H2.Products, H2.Clients
   WHERE H2.Sales.ProductId = H2.Products.ProductId
     AND H2.Sales.ClientId = H2.Clients.ClientId
   GROUP BY H2.ProductName
With these queries, we may get results from the source warehouses. In the first case, both results reflect the user's information need, so both answers will be displayed. In the second case, the lack of an "Order" table makes it impossible to provide any sensible answer from H1, so only one result is displayed. Finally, for the last query, the clients' sex is stored only in H1, so only the results from one query will be displayed immediately. The user will still be informed that it is possible to access H2 (in fact the data will already have been prefetched), but the answer may not fully reflect his query.
VII. CONCLUSION
In this paper we proposed a generic approach to creating
a data warehouse federation, providing methods for:
• Schema integration, in order to create the federation schema.
• Query decomposition, in order to translate a query to the federation into queries to the individual warehouses.
• Result integration, in order to select the answer that best suits the user's information need.
With the working implementation and theoretical examples
we were able to test our approach and determine its validity
for practical use.
Creating a working data warehouse federation provides a new layer of generalization over the data in an enterprise. If multiple warehouses exist on the same topic, with similar (but not identical) schemas, accessing the data and processing it as a whole may be impossible due to inconsistencies. A data warehouse federation is a logical layer over the physical warehouses. The aim of the federation is to present the user with just one schema to query. The federation processes the query automatically and translates it to the component databases in order to obtain results. Finally, it presents these results to the user as a disambiguated collection.
The main difference between the federation approach and other existing methods, such as distributed databases, is the high level of abstraction present in the federation: no actual data is stored or duplicated, but it remains easily accessible from the source components.
The presented model may still be subject to many improvements, mostly by reducing the necessary human interaction during the initial stages. This would speed up the process considerably and will be our main area of focus in future research.
ACKNOWLEDGMENT
This research was partially supported by the Polish Ministry of Science and Higher Education under grant no. N N519 407437 (2009-2012).
REFERENCES
[1] Barg M., Wong R. K.: Cooperative Query Answering for Semistructured Data. In Proceedings of ADC2003, ACM (2003)
[2] Bennett T. A., Bayrak C.: Bridging The Data Integration Gap: From Theory to Implementation. ACM SIGSOFT Software Engineering Notes, Volume 36, Number 3 (2011)
[3] Berger S., Schrefl M.: From Federated Databases to a Federated Data Warehouse System. In Proceedings of the 41st Hawaii International Conference on System Sciences, IEEE (2008)
[4] Dayal U., Castellanos M., Simitsis A., Wilkinson K.: Data Integration Flows for Business Intelligence. In Proceedings of EDBT'09, ACM (2009)
[5] Jindal R., Acharya A.: Federated Data Warehouse Architecture. Wipro Technologies.
[6] Kern R., Ryk K., Nguyen N. T.: A Framework for Building Logical Schema and Query Decomposition in Data Warehouse Federations. In Proceedings of ICCCI'11, LNAI 6922, pp. 612–622 (2011)
[7] Moody D. L., Kortink M. A. R.: From Enterprise Models to Dimensional Models: A Methodology for Data Warehouse and Data Mart Design. In Proceedings of the International Workshop on Design and Management of Data Warehouses (2000)
[8] Moya L. G., Kudama S., Aramburu M. H.: Integrating Web Feed Opinions into a Corporate Data Warehouse. In Proceedings of BEWEB 2011, ACM (2011)
[9] Priebe T., Pernul G.: Towards Integrative Enterprise Knowledge Portals. In Proceedings of CIKM03, ACM (2003)
[10] Riazati D., Thom J. A., Zhang X.: Enforcing Strictness in Integration of Dimensions: Beyond Instance Matching. In Proceedings of DOLAP11, ACM (2011)
[11] Sahama T. R., Croll P. R.: A Data Warehouse Architecture for Clinical Data Warehousing. Conferences in Research and Practice in Information Technology, Vol. 68, pp. 227–232 (2007)
[12] Schütz C., Schrefl M., Neumayr B.: Incremental Integration of Data Warehouses: The Hetero-Homogeneous Approach. In Proceedings of DOLAP11, ACM (2011)
[13] Todman C.: Designing a data warehouse: supporting customer relationship management. Prentice Hall (2001)