ETL Testing (Extract, Transform, And Load)


What is a data warehouse?

A data warehouse is an electronic storage of an organization's historical data for the purpose of reporting, analysis and data mining or knowledge discovery. Other than that, a data warehouse can also be used for the purpose of data integration, master data management etc.

According to Bill Inmon, a data warehouse should be subject-oriented, non-volatile, integrated and time-variant.

Explanatory Note

Note here, non-volatile means that the data, once loaded in the warehouse, will not get deleted later. Time-variant means the data is stored with respect to time, so that changes over time can be analyzed.

The above definition of data warehousing is typically considered the "classical" definition. However, if you are interested, you may want to read the article - What is a data warehouse - A 101 guide to modern data warehousing - which opens up a broader definition of data warehousing.

What are the benefits of a data warehouse?

A data warehouse helps to integrate data (see Data integration) and store it historically so that we can analyze different aspects of the business, including performance analysis, trends, prediction etc. over a given time frame, and use the results of our analysis to improve the efficiency of business processes.

Why is a data warehouse used?

For a long time in the past, and even today, data warehouses are built to facilitate reporting on the different key business processes of an organization, known as KPIs. Data warehouses also help to integrate data from different sources and show a single point of truth for the values of the business measures.

A data warehouse can further be used for data mining, which helps with trend prediction, forecasts, pattern recognition etc. Check this article to know more about data mining.

What is the difference between OLTP and OLAP?

OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system on that data.


OLTP systems are optimized for INSERT, UPDATE operations and therefore highly normalized. On the other hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations.

Explanatory Note:

In a departmental shop, when we pay at the check-out counter, the sales person at the counter keys in all the data into a "Point-Of-Sale" machine. That data is transaction data, and the related system is an OLTP system.

On the other hand, the manager of the store might want to view a report on out-of-stock materials, so that he can place a purchase order for them. Such a report will come out of an OLAP system.

What is a data mart?

Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in the data warehouse, and each department may have separate data marts. These data marts can be built on top of the data warehouse.

What is the ER model?

The ER model, or entity-relationship model, is a particular methodology of data modeling wherein the goal of modeling is to normalize the data by reducing redundancy. This is different from dimensional modeling, where the main goal is to improve the data retrieval mechanism.

What is dimensional modeling?

A dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements and the foreign keys from the dimension tables that qualify the data. The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and faster data retrieval.

Ralph Kimball is one of the strongest proponents of this very popular data modeling technique, which is often used in many enterprise-level data warehouses.

If you want to read a quick and simple guide on dimensional modeling, please check our Guide to dimensional modeling.

What is a dimension?

A dimension is something that qualifies a quantity (measure).

For example, consider this: if I just say "20 kg", it does not mean anything. But if I say, "20 kg of Rice (Product) is sold to Ramesh (Customer) on 5th April (Date)", then that gives a meaningful sense. The product, customer and date are dimensions that qualify the measure - 20 kg.

Dimensions are mutually independent. Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.

What is a fact?

A fact is something that is quantifiable (or measurable). Facts are typically (but not always) numerical values that can be aggregated.

What are additive, semi-additive and non-additive measures?

Non-additive Measures

Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of a non-additive fact is any kind of ratio or percentage, e.g. 5% profit margin, revenue-to-asset ratio etc. Non-numerical data can also be a non-additive measure when that data is stored in fact tables, e.g. some kind of varchar flags in the fact table.

Semi-Additive Measures

Semi-additive measures are those where only a subset of aggregation functions can be applied. Take account balance: a SUM() over balances does not give a useful result, but the MAX() or MIN() balance might be useful. Consider a price rate or currency rate: a sum is meaningless on a rate; however, an average function might be useful.

Additive Measures

Additive measures can be used with any aggregation function like SUM(), AVG() etc. An example is Sales Quantity.
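As a minimal illustration of the difference, assuming hypothetical sales_fact and account_balance_fact tables (names not from the original), the SQL below aggregates an additive measure with SUM() and a semi-additive balance with AVG() and MAX():

-- Additive: sales quantity can safely be summed across any dimension.
SELECT product_key, SUM(sales_quantity) AS total_quantity
FROM sales_fact
GROUP BY product_key;

-- Semi-additive: summing balances over time is meaningless, but the
-- average or maximum balance per account is still useful.
SELECT account_key, AVG(balance) AS avg_balance, MAX(balance) AS max_balance
FROM account_balance_fact
GROUP BY account_key;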

"Classifying data for successful modeling" 

What is data?

Let us begin our discussion by defining what data is. Data are values of qualitative or quantitative variables, belonging to a set of items. Simply put, it's an attribute or property or characteristic of an object. The point to note here is that data can be both qualitative (brown eye color) and quantitative (20 cm long).

A common way of representing or displaying a set of correlated data is through table-type structures comprised of rows and columns. In such structures, the columns of the table generally signify attributes or characteristics or features, and the rows (tuples) signify a set of correlated features belonging to one single item.


While speaking about data, it is important to understand the difference between data and other similar terms like information or knowledge. While a set of data can be used together to directly derive information, knowledge or wisdom is often derived in an indirect manner. In our previous article on learning data mining, we have given examples to illustrate the differences between data, information and knowledge. Using the same example, consider a store manager of a local market who sells hundreds of candles every Sunday to his customers. Which customer buys candles on which date - those are the data that are stored in the database of the store. These data give information, like how many candles are sold from the store per week - this information may be valuable for inventory management. This information can be further used to indirectly infer that people who buy candles every Sunday go to church to offer a prayer. Now that's knowledge - it's a new learning based on available information.

Another way to look at it is by considering the level of abstraction in them. Data is objective and thus has the lowest level of abstraction, whereas information and knowledge are increasingly subjective and involve higher levels of abstraction.

In terms of a scientific definition, one may conclude that data has a higher level of entropy than information or knowledge.

Types of Data

One of the fundamental aspects you must learn before attempting to do any kind of data modeling is that how we model the data depends completely on the nature or type of the data. Data can be both qualitative and quantitative. It's important to understand the distinctions between them.

Qualitative Data

Qualitative data are also called categorical data as they represent distinct categories rather than numbers. In dimensional modeling, they are often termed "dimensions". Mathematical operations such as addition or subtraction do not make any sense on such data.

Examples of qualitative data are eye color, zip code, phone number etc.

Qualitative data can be further classified into the classes below:

NOMINAL:

Nominal data represents data where the order of the data does not carry any meaningful information. Consider your passport number: there is no information as such in whether your passport number is greater or lesser than someone else's passport number. Consider the eye color of people: it does not matter in which order we represent the eye colors.

ID, ZIP code, phone number, eye color etc. are examples of the nominal class of qualitative data.


ORDINAL:

Order of the data is important for ordinal data. Consider the height of people - tall, medium, short. Although they are qualitative, the order of the attributes does matter, in the sense that they represent some comparative information. Similarly, letter grades, a scale of 1-10 etc. are examples of ordinal data.

In the field of dimensional modeling, this kind of data is sometimes referred to as non-additive facts.

Quantitative data

Quantitative data are also called numeric data as they represent numbers. In the dimensional data modeling approach, these data are termed "measures".

Examples of quantitative data are the height of a person, the amount of goods sold, revenue etc.

Quantitative attributes can be further classified as below.

INTERVAL:

The interval classification is used where there is no true zero point in the data and the division operation does not make sense. Bank balance, temperature on the Celsius scale, GRE score etc. are examples of interval-class data. Dividing one GRE score by another GRE score will not make any sense. In dimensional modeling this is synonymous with semi-additive facts.

RATIO:

The ratio class is applied to data that has a true "zero" and where division does make sense. Consider revenue, length of time etc. These measures are generally additive.

The table below illustrates the different actions that are possible on the various data types:

ACTIONS -->  Distinct  Order  Addition  Multiplication
Nominal      Y
Ordinal      Y         Y
Interval     Y         Y      Y
Ratio        Y         Y      Y         Y

It is essential to understand the above differences in the nature of data in order to suggest an appropriate model to store them. Many of our analytical (e.g. MS Excel) and data mining tools (e.g. R) do not automatically understand the nature of the data, so we need to explicitly model the data for those tools. For example, "R" provides two test functions, "is.numeric()" and "is.factor()", to determine if the data is numeric or categorical (dimensional) respectively, and if the default attribution is wrong we can use functions like "as.factor()" or "as.numeric()" to re-attribute the nature of the data.

What is Star-schema?

This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the keys (primary keys) from all the dimension tables flow into the fact table (as foreign keys), where the measures are stored. This entity-relationship diagram looks like a star, hence the name.

Consider a fact table that stores the sales quantity for each product and customer on a certain date. Sales quantity will be the measure here, and keys from the customer, product and time dimension tables will flow into the fact table.

If you are not very familiar with Star Schema design or its use, we strongly recommend you read our excellent article on this subject - different schema in dimensional modeling.
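As a rough sketch of the sales example above (all table and column names here are illustrative assumptions, not from the original), a star schema could be declared and queried as follows:

-- Dimension tables: one row per customer, product and date.
CREATE TABLE customer_dim (customer_key INT PRIMARY KEY, customer_name VARCHAR(100));
CREATE TABLE product_dim  (product_key  INT PRIMARY KEY, product_name  VARCHAR(100));
CREATE TABLE date_dim     (date_key     INT PRIMARY KEY, full_date     DATE);

-- Central fact table: foreign keys from every dimension plus the measure.
CREATE TABLE sales_fact (
    customer_key   INT REFERENCES customer_dim(customer_key),
    product_key    INT REFERENCES product_dim(product_key),
    date_key       INT REFERENCES date_dim(date_key),
    sales_quantity INT
);

-- Typical star-schema query: join the fact to its dimensions and aggregate the measure.
SELECT p.product_name, d.full_date, SUM(f.sales_quantity) AS total_quantity
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
JOIN date_dim    d ON f.date_key    = d.date_key
GROUP BY p.product_name, d.full_date;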


Data warehouse:

Bill Inmon is known as the father of data warehousing. He defined it as follows: "A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision making process."

- Subject oriented: means that the data addresses a specific subject such as sales, inventory etc.
- Integrated: means that the data is obtained from a variety of sources.
- Time variant: implies that the data is stored in such a way that changes made over time can be tracked.
- Non volatile: implies that data is never removed, i.e., historical data is also kept.

2. What is the difference between database and data warehouse?

A database is a collection of related data.

A data warehouse is also a collection of information as well as a supporting system.

3. What are the benefits of data warehousing?

- Historical information for comparative and competitive analysis.
- Enhanced data quality and completeness.
- Supplementing disaster recovery plans with another data backup source.

4. What are the types of data warehouse?

There are mainly three types of data warehouse:

- Enterprise Data Warehouse
- Operational Data Store
- Data Mart

5. What is the difference between data mining and data warehousing?

In data mining, the operational data is analyzed using statistical and clustering techniques to find hidden patterns and trends. So data mining does some kind of summarization of the data, which can be used by data warehouses for faster analytical processing for business intelligence.

A data warehouse may make use of a data mine for analytical processing of the data in a faster way.

6. What are the applications of a data warehouse?

- Data warehouses are used extensively in banking and financial services, and consumer goods.
- A data warehouse is mainly used for generating reports and answering predefined queries.
- A data warehouse is used for strategic purposes, performing multidimensional analysis.
- A data warehouse is used for knowledge discovery and strategic decision making using data mining tools.

7. What are the types of data warehouse applications?

- Info processing
- Analytical processing
- Data mining

8. What is metadata?

Metadata is defined as data about data. Metadata describes the entities and their attributes.

9. What are the benefits of data warehousing?

The implementation of a data warehouse can provide many benefits to an organization. A data warehouse can:

- Facilitate integration in an environment characterized by un-integrated applications.
- Integrate enterprise data across a variety of functions.
- Integrate external as well as internal data.
- Support strategic and long-term business planning.
- Support day-to-day tactical decisions.
- Enable insight into business trends and business opportunities.
- Organize and store historical data needed for analysis.
- Make available historical data, extending over many years, which enables trend analysis.
- Provide more accurate and complete information.
- Improve knowledge about the business.
- Enable cost-effective decision making.
- Enable organizations to understand their customers and their needs, as well as their competitors.
- Enhance customer service and satisfaction.
- Provide competitive advantage.
- Provide easy access for end-users.
- Provide timely access to corporate information.

10. What is the difference between a dimension table and a fact table?

A dimension table consists of tuples of attributes of the dimension. A fact table can be thought of as having tuples, one per recorded fact. Each fact contains some measured or observed variables and identifies them with pointers to dimension tables.

ETL testing (Extract, Transform, and Load)

It has been observed that Independent Verification and Validation is gaining huge market potential, and many companies now see this as a prospective business gain. Customers have been offered a different range of products in terms of service offerings, distributed across many areas based on technology, process and solutions. ETL or data warehouse testing is one of the offerings which is developing rapidly and successfully.

Why do organizations need a Data Warehouse?

Organizations with organized IT practices are looking forward to creating the next level of technology transformation. They are now trying to make themselves much more operational with easy-to-interoperate data. Having said that, data is the most important part of any organization; it may be everyday data or historical data. Data is the backbone of any report, and reports are the baseline on which all the vital management decisions are taken.

Most companies are taking a step forward in constructing their data warehouse to store and monitor real-time data as well as historical data. Crafting an efficient data warehouse is not an easy job. Many organizations have distributed departments with different applications running on distributed technology. An ETL tool is employed in order to make a flawless integration between the different data sources from different departments. The ETL tool works as an integrator, extracting data from different sources, transforming it into the preferred format based on the business transformation rules, and loading it into a cohesive DB known as the Data Warehouse.

A well-planned, well-defined and effective testing scope guarantees smooth conversion of the project to production. A business gains real confidence once the ETL processes are verified and validated by an independent group of experts to make sure that the data warehouse is concrete and robust.

ETL or Data warehouse testing is categorized into four different engagements, irrespective of the technology or ETL tools used:

- New Data Warehouse Testing - A new DW is built and verified from scratch. Data input is taken from customer requirements and different data sources, and the new data warehouse is built and verified with the help of ETL tools.
- Migration Testing - In this type of project the customer has an existing DW and ETL performing the job, but they are looking to adopt a new tool in order to improve efficiency.
- Change Request - In this type of project new data is added from different sources to an existing DW. Also, there might be a condition where the customer needs to change their existing business rules, or they might integrate a new rule.
- Report Testing - Reports are the end result of any data warehouse and the basic purpose for which the DW is built. Reports must be tested by validating the layout, the data in the report and the calculations.

ETL Testing Techniques:


1) Verify that data is transformed correctly according to various business requirements and rules.
2) Make sure that all projected data is loaded into the data warehouse without any data loss or truncation.
3) Make sure that the ETL application appropriately rejects, replaces with default values, and reports invalid data.
4) Make sure that data is loaded in the data warehouse within prescribed and expected time frames to confirm improved performance and scalability.

Apart from these 4 main ETL testing methods, other testing methods like integration testing and user acceptance testing are also carried out to make sure everything is smooth and reliable.

ETL Testing Process:

Similar to any other testing that lies under Independent Verification and Validation, ETL testing also goes through the same phases:

- Business and requirement understanding
- Validating
- Test estimation
- Test planning based on the inputs from test estimation and business requirements
- Designing test cases and test scenarios from all the available inputs
- Once all the test cases are ready and approved, the testing team proceeds to perform pre-execution checks and test data preparation for testing
- Lastly, execution is performed till exit criteria are met
- Upon successful completion, a summary report is prepared and the closure process is done.

It is necessary to define a test strategy, which should be mutually accepted by stakeholders, before starting actual testing. A well-defined test strategy will make sure that the correct approach has been followed to meet the testing expectations. ETL testing might require the testing team to write SQL statements extensively, or to tailor the SQL provided by the development team. In any case, the testing team must be aware of the results they are trying to get using those SQL statements.

Difference between Database and Data Warehouse Testing

There is a popular misunderstanding that database testing and data warehouse testing are similar, while the fact is that both take a different direction in testing.

- Database testing is done using a smaller scale of data, normally with OLTP (Online transaction processing) type databases, while data warehouse testing is done with a large volume of data involving OLAP (online analytical processing) databases.
- In database testing, normally data is consistently injected from uniform sources, while in data warehouse testing most of the data comes from different kinds of data sources which are sequentially inconsistent.
- We generally perform only CRUD (Create, read, update and delete) operations in database testing, while in data warehouse testing we use read-only (SELECT) operations.

- Normalized databases are used in DB testing, while denormalized DBs are used in data warehouse testing.

There are a number of universal verifications that have to be carried out for any kind of data warehouse testing. Below is the list of objects that are treated as essential for validation in ETL testing:

- Verify that data transformation from source to destination works as expected
- Verify that expected data is added in the target system
- Verify that all DB fields and field data are loaded without any truncation
- Verify data checksum for record count match
- Verify that for rejected data proper error logs are generated with all details
- Verify NULL value fields
- Verify that duplicate data is not loaded
- Verify data integrity
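As a minimal sketch of how a few of these checks might be written in SQL (the source_table, target_table and column names are assumptions for illustration only):

-- Record count match between source and target.
SELECT (SELECT COUNT(*) FROM source_table) AS source_count,
       (SELECT COUNT(*) FROM target_table) AS target_count;

-- Duplicate check on the target's business key.
SELECT customer_id, COUNT(*) AS occurrences
FROM target_table
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- NULL check on a mandatory field.
SELECT COUNT(*) AS null_rows
FROM target_table
WHERE customer_name IS NULL;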

ETL Testing Challenges:

ETL testing is quite different from conventional testing. There are many challenges we faced while performing data warehouse testing. Here is a list of a few ETL testing challenges I experienced on my project:

- Incompatible and duplicate data.
- Loss of data during the ETL process.
- Unavailability of an inclusive test bed.
- Testers have no privileges to execute ETL jobs on their own.
- Volume and complexity of data is very huge.
- Faults in business processes and procedures.
- Trouble acquiring and building test data.
- Missing business flow information.

Data is important for businesses to make critical business decisions. ETL testing plays a significant role in validating and ensuring that the business information is exact, consistent and reliable. Also, it minimizes the hazard of data loss in production.

In computing, Extract, Transform and Load (ETL) refers to a process in database usage, and especially in data warehousing, that involves:

- Extracting data from outside sources
- Transforming it to fit operational needs, which can include quality levels
- Loading it into the end target (database, more specifically, operational data store, data mart or data warehouse)

Extract

The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, since extracting data correctly sets the stage for how subsequent processes go further.


 

ETL Architecture Pattern

Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization and/or format. Common data source formats are relational databases and flat files, but they may include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources such as through web spidering or screen-scraping. Streaming the extracted data source and loading on-the-fly into the destination database is another way of performing ETL when no intermediate data storage is required. In general, the goal of the extraction phase is to convert the data into a single format which is appropriate for transformation processing.

An intrinsic part of the extraction involves the parsing of extracted data, resulting in a check of whether the data meets an expected pattern or structure. If not, the data may be rejected entirely or in part.


Transform

The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database:

- Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), for example roll_no, age, and salary, then the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all those records where salary is not present (salary = null).
- Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
- Encoding free-form values (e.g., mapping "Male" to "M")
- Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
- Sorting
- Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
- Aggregation (for example, rollup - summarizing multiple rows of data - total sales for each store, and for each region, etc.)
- Generating surrogate-key values
- Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
- Splitting a column into multiple columns (e.g., converting a comma-separated list, specified as a string in one column, into individual values in different columns)
- Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)
- Looking up and validating the relevant data from tables or referential files for slowly changing dimensions.
- Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all of the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may result in exceptions, for example, when a code translation parses an unknown code in the extracted data.
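A minimal sketch of a few of these transformations combined in one SQL statement (the staging table, its columns and the 1/2 to M/F code mapping are illustrative assumptions):

-- Select only certain columns, drop records with missing salary,
-- translate a coded value and derive a calculated value in one pass.
SELECT
    roll_no,
    salary,
    CASE gender_code WHEN 1 THEN 'M' WHEN 2 THEN 'F' END AS gender,
    qty * unit_price AS sale_amount
FROM staging_source
WHERE salary IS NOT NULL;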

Load

The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; frequently, updating extracted data is done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in a historical form at regular intervals - for example, hourly. To understand this, consider a data warehouse that is required to maintain sales records of the last year. This data warehouse will overwrite any data that is older than a year with newer data. However, the entry of data for any one-year window will be made in a historical manner. The timing and scope to replace or append are strategic design choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the data warehouse.
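As a minimal sketch of an overwrite-style load using a standard SQL MERGE (an upsert; the staging and warehouse table names are assumptions, and the exact MERGE syntax varies slightly by database):

-- Overwrite matching rows, insert new ones.
MERGE INTO dw_customer AS tgt
USING stg_customer AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
    UPDATE SET customer_name  = src.customer_name,
               customer_state = src.customer_state
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, customer_state)
    VALUES (src.customer_id, src.customer_name, src.customer_state);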


As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply (for example, uniqueness, referential integrity, mandatory fields), which also contribute to the overall data quality performance of the ETL process.

- For example, a financial institution might have information on a customer in several departments, and each department might have that customer's information listed in a different way. The membership department might list the customer by name, whereas the accounting department might list the customer by number. ETL can bundle all this data and consolidate it into a uniform presentation, such as for storing in a database or data warehouse.
- Another way that companies use ETL is to move information to another application permanently. For instance, the new application might use another database vendor and most likely a very different database schema. ETL can be used to transform the data into a format suitable for the new application to use.
- An example of this would be an Expense and Cost Recovery System (ECRS) such as used by accountancies, consultancies and lawyers. The data usually ends up in the time and billing system, although some businesses may also utilize the raw data for employee productivity reports to Human Resources (personnel dept.) or equipment usage reports to Facilities Management.

Real-life ETL cycle

The typical real-life ETL cycle consists of the following execution steps:

1.  Cycle initiation

2.  Build reference data 

3.  Extract (from sources)

4.  Validate 

5.  Transform (clean, apply business rules, check for data integrity, create aggregates or

disaggregates)

6.  Stage (load into staging tables, if used)

7.  Audit reports (for example, on compliance with business rules. Also, in case of failure, helps to

diagnose/repair)

8.  Publish (to target tables)

9.  Archive 

10. Clean up

Challenges

ETL processes can involve considerable complexity, and significant operational problems can

occur with improperly designed ETL systems.

The range of data values or data quality in an operational system may exceed the expectations of

designers at the time validation and transformation rules are specified. Data profiling of a source


during data analysis can identify the data conditions that will need to be managed by transform

rules specifications. This will lead to an amendment of validation rules explicitly and implicitly

implemented in the ETL process.

Data warehouses are typically assembled from a variety of data sources with different formats and purposes. As such, ETL is a key process to bring all the data together in a standard, homogeneous environment.

Design analysts should establish the scalability of an ETL system across the lifetime of its usage. This includes understanding the volumes of data that will have to be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to multiple-day micro batch to integration with message queues or real-time change-data capture for continuous transformation and update.

Performance

ETL vendors benchmark their record-systems at multiple TB (terabytes) per hour (or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit-network connections, and lots of memory. The fastest ETL record is currently held by Syncsort,[1] Vertica and HP at 5.4 TB in under an hour, which is more than twice as fast as the earlier record held by Microsoft and Unisys.

In real life, the slowest part of an ETL process usually occurs in the database load phase. Databases may perform slowly because they have to take care of concurrency, integrity maintenance, and indices. Thus, for better performance, it may make sense to employ:

- the Direct Path Extract method or bulk unload whenever possible (instead of querying the database), to reduce the load on the source system while getting a high-speed extract
- most of the transformation processing outside of the database
- bulk load operations whenever possible.

Still, even using bulk operations, database access is usually the bottleneck in the ETL process. Some common methods used to increase performance are:

- Partition tables (and indices). Try to keep partitions similar in size (watch for null values which can skew the partitioning).
- Do all validation in the ETL layer before the load. Disable integrity checking (disable constraint ...) in the target database tables during the load.
- Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a separate step.
- Generate IDs in the ETL layer (not in the database).
- Drop the indices (on a table or partition) before the load, and recreate them after the load (SQL: drop index ...; create index ...).
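A minimal sketch of the drop-and-recreate pattern around a bulk load (the table, index and constraint names are illustrative; the DISABLE/ENABLE CONSTRAINT syntax shown is Oracle-style and differs on other databases):

-- Before the load: drop the index and disable the foreign-key constraint.
DROP INDEX idx_sales_fact_customer;
ALTER TABLE sales_fact DISABLE CONSTRAINT fk_sales_customer;

-- ... bulk load into sales_fact happens here ...

-- After the load: re-enable the constraint and rebuild the index.
ALTER TABLE sales_fact ENABLE CONSTRAINT fk_sales_customer;
CREATE INDEX idx_sales_fact_customer ON sales_fact (customer_key);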

Parallel processing

ETL applications can implement three main types of parallelism:

- Data: splitting a single sequential file into smaller data files to provide parallel access.
- Pipeline: allowing the simultaneous running of several components on the same data stream, for example looking up a value on record 1 at the same time as adding two fields on record 2.
- Component: the simultaneous running of multiple processes on different data streams in the same job, for example sorting one input file while removing duplicates on another file.

All three types of parallelism usually operate combined in a single job.

An additional difficulty comes with making sure that the data being uploaded is relatively consistent. Because multiple source databases may have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents in a source system or with the general ledger, establishing synchronization and reconciliation points becomes necessary.

Rerunnability, recoverability

Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with a "row_id", and tag each piece of the process with a "run_id". In case of a failure, having these IDs will help to roll back and rerun the failed piece.

Best practice also calls for "checkpoints", which are states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, and so on.
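As a minimal sketch of the run_id idea (the audit table layout and the run number 42 are illustrative assumptions):

-- One audit row per execution of each ETL piece.
CREATE TABLE etl_run_audit (
    run_id     INT PRIMARY KEY,
    step_name  VARCHAR(100),
    started_at TIMESTAMP,
    status     VARCHAR(20)   -- e.g. 'RUNNING', 'SUCCEEDED', 'FAILED'
);

-- Target rows carry the run_id that loaded them, so a failed run
-- can be rolled back and re-executed in isolation.
DELETE FROM sales_fact WHERE run_id = 42;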

Virtual ETL

As of 2010 data virtualization had begun to advance ETL processing. The application of data virtualization to ETL allowed solving the most common ETL tasks of data migration and application integration for multiple dispersed data sources. So-called Virtual ETL operates with the abstracted representation of the objects or entities gathered from the variety of relational, semi-structured and unstructured data sources. ETL tools can leverage object-oriented modeling and work with entities' representations persistently stored in a centrally located hub-and-spoke architecture. Such a collection that contains representations of the entities or objects gathered from the data sources for ETL processing is called a metadata repository, and it can reside in memory[2] or be made persistent. By using a persistent metadata repository, ETL tools can transition from one-time projects to persistent middleware, performing data harmonization and data profiling consistently and in near-real time.

Dealing with keys

Keys are some of the most important objects in all relational databases, as they tie everything together. A primary key is a column which is the identifier for a given entity, whereas a foreign key is a column in another table which refers to a primary key. These keys can also be made up of several columns, in which case they are composite keys. In many cases the primary key is an auto-generated integer which has no meaning for the business entity being represented, but solely exists for the purpose of the relational database - commonly referred to as a surrogate key.

As there will usually be more than one data source being loaded into the warehouse, the keys are an important concern to be addressed.

Your customers might be represented in several data sources; in one their SSN (Social Security Number) might be the primary key, their phone number in another, and a surrogate key in the third. All of the customers' information needs to be consolidated into one dimension table.

A recommended way to deal with the concern is to add a warehouse surrogate key, which will be used as a foreign key from the fact table.[3]

Usually updates will occur to a dimension's source data, which obviously must be reflected in the data warehouse.

If the primary key of the source data is required for reporting, the dimension already contains that piece of information for each row. If the source data uses a surrogate key, the warehouse must keep track of it even though it is never used in queries or reports.

That is done by creating a lookup table which contains the warehouse surrogate key and the originating key.[4] This way the dimension is not polluted with surrogates from various source systems, while the ability to update is preserved.

The lookup table is used in different ways depending on the nature of the source data. There are 5 types to consider,[5] where three selected ones are included here:

Type 1: The dimension row is simply updated to match the current state of the source system. The warehouse does not capture history. The lookup table is used to identify which dimension row to update/overwrite.

Type 2: A new dimension row is added with the new state of the source system. A new surrogate key is assigned. The source key is no longer unique in the lookup table.

Fully logged: A new dimension row is added with the new state of the source system, while the previous dimension row is updated to reflect that it is no longer active, and to record the time of deactivation.
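A minimal sketch of such a key-mapping (lookup) table, with illustrative names not taken from the original:

-- Maps each source system's key to the warehouse surrogate key.
CREATE TABLE customer_key_lookup (
    source_system VARCHAR(50),   -- which source the row came from
    source_key    VARCHAR(50),   -- natural or source surrogate key
    warehouse_key INT            -- surrogate key used in the dimension
    -- no unique constraint on (source_system, source_key): under Type 2
    -- the same source key can map to several warehouse keys over time
);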

Tools

Programmers can set up ETL processes using almost any programming language, but building such processes from scratch can become complex. Increasingly, companies are buying ETL tools to help in the creation of ETL processes.[6]

By using an established ETL framework, one may increase one's chances of ending up with better connectivity and scalability. A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation, and loading of data. Many ETL vendors now have data profiling, data quality, and metadata capabilities. A common use case for ETL tools is converting CSV files to formats readable by relational databases. A typical translation of millions of records is facilitated by ETL tools that enable users to input CSV-like data feeds/files and import them into a database with as little code as possible.

ETL tools are typically used by a broad range of professionals - from students in computer science looking to quickly import large data sets to database architects in charge of company account management - and they have become a convenient tool that can be relied on to get maximum performance. ETL tools in most cases contain a GUI that helps users conveniently transform data, as opposed to writing large programs to parse files and modify data types, which ETL tools facilitate as much as possible.

Business intelligence (BI) is a set of theories, methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information for business purposes. BI can handle large amounts of information to help identify and develop new opportunities. Making use of new opportunities and implementing an effective strategy can provide a competitive market advantage and long-term stability.[1]

BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics.

Though the term business intelligence is sometimes a synonym for competitive intelligence (because they both support decision making), BI uses technologies, processes, and applications to analyze mostly internal, structured data and business processes, while competitive intelligence gathers, analyzes and disseminates information with a topical focus on company competitors. If understood broadly, business intelligence can include the subset of competitive intelligence.

Slowly changing dimension

Dimension is a term in data management and data warehousing. It refers to logical groupings of data such as geographical location, customer or product information. With Slowly Changing Dimensions (SCDs), data changes slowly, rather than changing on a time-based, regular schedule.[1]

For example, you may have a dimension in your  database that tracks the sales records of your

company's salespeople. Creating sales reports seems simple enough, until a salesperson is

transferred from one regional office to another. How do you record such a change in your sales

dimension?

You could calculate the sum or average of each salesperson's sales, but if you use that to compare the performance of salespeople, that might give misleading information. If the salesperson was transferred and used to work in a hot market where sales were easy, and now works in a market where sales are infrequent, his/her totals will look much stronger than those of the other salespeople in their new region. Or you could create a second salesperson record and treat the transferred person as a new salesperson, but that creates problems.

Dealing with these issues involves SCD management methodologies referred to as Type 0 through 6. Type 6 SCDs are also sometimes called Hybrid SCDs.

Type 0

The Type 0 method is passive. It manages dimensional changes, and no action is performed. Values remain as they were at the time the dimension record was first inserted. In certain circumstances history is preserved with a Type 0. Higher-order types are employed to guarantee the preservation of history, whereas Type 0 provides the least or no control.

The most common types are I, II, and III.

Type I

This methodology overwrites old with new data, and therefore does not track historical data.

Example of a supplier table:

Supplier_Key Supplier_Code Supplier_Name Supplier_State 

123  ABC  Acme Supply Co  CA 

In the above example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the row will be unique by the natural key (Supplier_Code). However, to optimize performance on joins, use integer rather than character keys.

If the supplier relocates the headquarters to Illinois, the record would be overwritten:

Supplier_Key Supplier_Code Supplier_Name Supplier_State 

123  ABC  Acme Supply Co  IL 

The disadvantage of the Type I method is that there is no history in the data warehouse. It has the advantage, however, that it is easy to maintain. If you have calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.[1]
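A minimal sketch of the Type 1 overwrite for the relocation above, assuming the dimension table is named supplier_dim:

-- Type 1: overwrite in place; the previous state (CA) is lost.
UPDATE supplier_dim
SET Supplier_State = 'IL'
WHERE Supplier_Code = 'ABC';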


Type II

This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Unlimited history is preserved for each insert.

For example, if the supplier relocates to Illinois, the version numbers will be incremented sequentially:

Supplier_Key Supplier_Code Supplier_Name Supplier_State Version. 

123  ABC  Acme Supply Co  CA  0 

124  ABC  Acme Supply Co  IL  1 

Another method is to add 'effective date' columns.

Supplier_Key Supplier_Code Supplier_Name Supplier_State  Start_Date  End_Date 

123  ABC  Acme Supply Co  CA  01-Jan-2000  21-Dec-2004 

124  ABC  Acme Supply Co  IL  22-Dec-2004 

The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying.

Transactions that reference a particular surrogate key (Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed.

If there are retrospective changes made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to change.[1]
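A minimal sketch of the effective-date variant of Type 2 for the same relocation (again assuming a supplier_dim table; date-literal syntax varies by database):

-- Close the currently active version...
UPDATE supplier_dim
SET End_Date = DATE '2004-12-21'
WHERE Supplier_Code = 'ABC'
  AND End_Date IS NULL;

-- ...and insert a new version with a new surrogate key.
INSERT INTO supplier_dim
    (Supplier_Key, Supplier_Code, Supplier_Name, Supplier_State, Start_Date, End_Date)
VALUES
    (124, 'ABC', 'Acme Supply Co', 'IL', DATE '2004-12-22', NULL);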

Type III

This method tracks changes using separate columns and preserves limited history. Type III preserves limited history, as it is limited to the number of columns designated for storing historical data. The original table structure in Type I and Type II is the same, but Type III adds additional columns. In the following example, an additional column has been added to the table to record the supplier's original state - only the previous history is stored.

Supplier_Key  Supplier_Code  Supplier_Name   Original_Supplier_State  Effective_Date  Current_Supplier_State
123           ABC            Acme Supply Co  CA                       22-Dec-2004     IL

This record contains a column for the original state and a column for the current state - it cannot track the changes if the supplier relocates a second time.

One variation of this is to create the field Previous_Supplier_State instead of Original_Supplier_State, which would track only the most recent historical change.[1]
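A minimal sketch of the Previous_Supplier_State variation described above, assuming a supplier_dim table:

-- Type 3: shift the old value into the history column, then overwrite.
UPDATE supplier_dim
SET Previous_Supplier_State = Current_Supplier_State,
    Current_Supplier_State  = 'IL',
    Effective_Date          = DATE '2004-12-22'
WHERE Supplier_Code = 'ABC';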

 

Type IV

The Type 4 method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of some or all changes. Both the surrogate keys are referenced in the Fact table to enhance query performance.

For the above example the original table name is Supplier and the history table is

Supplier_History. 

Supplier 

Supplier_key Supplier_Code Supplier_Name Supplier_State 

123  ABC  Acme Supply Co  IL 

Supplier_History 

Supplier_key Supplier_Code Supplier_Name Supplier_State Create_Date 

123  ABC  Acme Supply Co  CA  22-Dec-2004 

This method resembles how database audit tables and change data capture techniques function.
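A minimal sketch of a Type 4 change using the Supplier and Supplier_History tables above (date-literal syntax varies by database):

-- Archive the outgoing version into the history table...
INSERT INTO Supplier_History (Supplier_Key, Supplier_Code, Supplier_Name, Supplier_State, Create_Date)
SELECT Supplier_Key, Supplier_Code, Supplier_Name, Supplier_State, DATE '2004-12-22'
FROM Supplier
WHERE Supplier_Code = 'ABC';

-- ...then overwrite the current table.
UPDATE Supplier
SET Supplier_State = 'IL'
WHERE Supplier_Code = 'ABC';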

Type 6 / hybrid

The Type 6 method combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation of the origin of the term is that it was coined by Ralph Kimball during a conversation with Stephen Pace from Kalido. Ralph Kimball calls this method "Unpredictable Changes with Single-Version Overlay" in The Data Warehouse Toolkit.[1]

The Supplier table starts out with one record for our example supplier:

Supplier_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  CA             CA                01-Jan-2000  31-Dec-9999  Y

The Current_State and the Historical_State are the same. The Current_Flag attribute indicates

that this is the current or most recent record for this supplier.

When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing:

Supplier_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  IL             CA                01-Jan-2000  21-Dec-2004  N
124           ABC            Acme Supply Co  IL             IL                22-Dec-2004  31-Dec-9999  Y

We overwrite the Current_State information in the first record (Supplier_Key = 123) with the new information, as in Type 1 processing. We create a new record to track the changes, as in Type 2 processing. And we store the history in a second State column (Historical_State), which incorporates Type 3 processing.

For example, if the supplier were to relocate again, we would add another record to the Supplier dimension, and we would overwrite the contents of the Current_State column:

Supplier_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  NY             CA                01-Jan-2000  21-Dec-2004  N
124           ABC            Acme Supply Co  NY             IL                22-Dec-2004  03-Feb-2008  N
125           ABC            Acme Supply Co  NY             NY                04-Feb-2008  31-Dec-9999  Y

Note that, for the current record (Current_Flag = 'Y'), the Current_State and the Historical_State are always the same.[1]

Type 2 / Type 6 fact implementation

Type 2 surrogate key with Type 3 attribute

In many Type 2 and Type 6 SCD implementations, the surrogate key from the dimension is put into the fact table in place of the natural key when the fact data is loaded into the data repository.[1] The surrogate key is selected for a given fact record based on its effective date and the Start_Date and End_Date from the dimension table. This allows the fact data to be easily joined to the correct dimension data for the corresponding effective date.

Here is the Supplier table as we created it above using Type 6 Hybrid methodology:

Supplier_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  NY             CA                01-Jan-2000  21-Dec-2004  N
124           ABC            Acme Supply Co  NY             IL                22-Dec-2004  03-Feb-2008  N
125           ABC            Acme Supply Co  NY             NY                04-Feb-2008  31-Dec-9999  Y

Once the Delivery table contains the correct Supplier_Key, it can easily be joined to the Supplier table using that key. The following SQL retrieves, for each fact record, the current supplier state and the state the supplier was located in at the time of the delivery:

SELECT
    delivery.delivery_cost,
    supplier.supplier_name,
    supplier.historical_state,
    supplier.current_state
FROM delivery
INNER JOIN supplier
    ON delivery.supplier_key = supplier.supplier_key

Pure Type 6 implementation


Having a Type 2 surrogate key for each time slice can cause problems if the dimension is subject

to change.[1]

 

A pure Type 6 implementation does not use this, but uses a Surrogate Key for each master data

item (e.g. each unique supplier has a single surrogate key).

This avoids any changes in the master data having an impact on the existing transaction data.

It also allows more options when querying the transactions.

Here is the Supplier table using the pure Type 6 methodology:

Supplier_Key Supplier_Code Supplier_Name Supplier_State  Start_Date  End_Date 

456  ABC  Acme Supply Co  CA  01-Jan-2000  21-Dec-2004 

456  ABC  Acme Supply Co  IL  22-Dec-2004 03-Feb-2008 

456  ABC  Acme Supply Co  NY  04-Feb-2008 31-Dec-9999 

The following example shows how the query must be extended to ensure a single supplier record is retrieved for each transaction.

SELECT supplier.supplier_code,
       supplier.supplier_state
FROM supplier
INNER JOIN delivery
    ON supplier.supplier_key = delivery.supplier_key
   AND delivery.delivery_date >= supplier.start_date
   AND delivery.delivery_date <= supplier.end_date

A fact record with an effective date (Delivery_Date) of August 9, 2001 will be linked to Supplier_Code of ABC, with a Supplier_State of 'CA'. A fact record with an effective date of

October 11, 2007 will also be linked to the same Supplier_Code ABC, but with a Supplier_State

of 'IL'.

Whilst more complex, there are a number of advantages of this approach, including:

1.  If there is more than one date on the fact (e.g. Order Date, Delivery Date, Invoice Payment Date)

you can choose which date to use for a query.

2.  You can do "as at now", "as at transaction time" or "as at a point in time" queries by changing

the date filter logic.

3.  You don't need to reprocess the Fact table if there is a change in the dimension table (e.g.

adding additional fields retrospectively which change the time slices, or if you make a mistake in

the dates on the dimension table you can correct them easily).

4.  You can introduce bi-temporal dates in the dimension table.


5.  You can join the fact to the multiple versions of the dimension table to allow reporting of the

same information with different effective dates, in the same query.

The following example shows how a specific date such as '2012-01-01 00:00:00' (which could be

the current datetime) can be used.

SELECT supplier.supplier_code,
       supplier.supplier_state
FROM supplier
INNER JOIN delivery
    ON supplier.supplier_key = delivery.supplier_key
   AND '2012-01-01 00:00:00' >= supplier.start_date
   AND '2012-01-01 00:00:00' <= supplier.end_date

Both surrogate and natural key

An alternative implementation is to place both the surrogate key and the natural key into the fact

table.[2] This allows the user to select the appropriate dimension records based on:

  the primary effective date on the fact record (above),

  the most recent or current information,

  any other date associated with the fact record.

This method allows more flexible links to the dimension, even if you have used the Type 2

approach instead of Type 6.

Here is the Supplier table as we might have created it using Type 2 methodology:

Supplier_Key Supplier_Code Supplier_Name Supplier_State  Start_Date  End_Date  Current_Flag 

123  ABC  Acme Supply Co  CA  01-Jan-2000  21-Dec-2004  N 

124  ABC  Acme Supply Co  IL  22-Dec-2004 03-Feb-2008  N 

125  ABC  Acme Supply Co  NY  04-Feb-2008 31-Dec-9999  Y 

The following SQL retrieves the most current Supplier_Name and Supplier_State for each fact

record:

SELECT delivery.delivery_cost,
       supplier.supplier_name,
       supplier.supplier_state
FROM delivery
INNER JOIN supplier
    ON delivery.supplier_code = supplier.supplier_code
WHERE supplier.current_flag = 'Y'


If there are multiple dates on the fact record, the fact can be joined to the dimension using another date instead of the primary effective date. For instance, the Delivery table might have a primary effective date of Delivery_Date, but might also have an Order_Date associated with each record.

The following SQL retrieves the correct Supplier_Name and Supplier_State for each fact record based on the Order_Date:

SELECT delivery.delivery_cost,
       supplier.supplier_name,
       supplier.supplier_state
FROM delivery
INNER JOIN supplier
    ON delivery.supplier_code = supplier.supplier_code
   AND delivery.order_date >= supplier.start_date
   AND delivery.order_date <= supplier.end_date

Some cautions:

•  If the join query is not written correctly, it may return duplicate rows and/or give incorrect answers.
•  The date comparison might not perform well.
•  Some Business Intelligence tools do not handle generating complex joins well.
•  The ETL processes needed to create the dimension table need to be carefully designed to ensure that there are no overlaps in the time periods for each distinct item of reference data (a sketch of such a check follows this list).
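One way to verify that last point is a self-join check on the dimension. The query below is a hedged sketch using the Supplier example from above; it is not taken from the source.

-- Flag pairs of rows for the same natural key whose validity periods overlap.
-- Any row returned indicates an overlap bug in the SCD load.
SELECT a.supplier_code,
       a.supplier_key AS key_a,
       b.supplier_key AS key_b
FROM   supplier a
INNER JOIN supplier b
        ON a.supplier_code = b.supplier_code
       AND a.supplier_key  < b.supplier_key    -- compare each pair of rows only once
       AND a.start_date   <= b.end_date
       AND b.start_date   <= a.end_date        -- classic interval-overlap condition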

Combining types

Different SCD Types can be applied to different columns of a table. For example, we can apply

Type 1 to the Supplier_Name column and Type 2 to the Supplier_State column of the same

table, the Supplier table.
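As an illustration, a name correction and a state change would then be handled differently. The sketch below reuses the Type 2 Supplier table from earlier; the corrected name, the new state 'TX' and the dates are hypothetical values, not from the source.

-- Type 1 on Supplier_Name: correct the name on every row, no history kept.
UPDATE supplier
SET    supplier_name = 'Acme Supply Company'
WHERE  supplier_code = 'ABC';

-- Type 2 on Supplier_State: expire the current row and insert a new one.
UPDATE supplier
SET    end_date     = '2010-06-30',
       current_flag = 'N'
WHERE  supplier_code = 'ABC'
  AND  current_flag  = 'Y';

INSERT INTO supplier
    (supplier_key, supplier_code, supplier_name, supplier_state,
     start_date, end_date, current_flag)
VALUES
    (126, 'ABC', 'Acme Supply Company', 'TX',
     '2010-07-01', '9999-12-31', 'Y');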

A data warehouse is a repository of integrated information; data is extracted from heterogeneous sources. A typical data warehousing architecture contains the different sources (such as Oracle databases, flat files and ERP systems), followed by the staging area and the data warehouse itself, then the different data marts and the reports built on them; it may also include an ODS (Operational Data Store). Together these components make up the data warehousing architecture.

Benefits of data warehousing:
=> Data warehouses are designed to perform well with aggregate queries running on large amounts of data.

=> The structure of data warehouses is easier for end users to navigate, understand and query

against, unlike relational databases which are primarily designed to handle lots of transactions.

=> Data warehouses enable queries that cut across different segments of a company's operation.


E.g. production data could be compared against inventory data even if they were originally

stored in different databases with different structures.

=> Queries that would be complex in very normalized databases could be easier to build and maintain in data warehouses, decreasing the workload on transaction systems.
=> Data warehousing is an efficient way to manage and report on data that is from a variety of sources, non-uniform and scattered throughout a company.
=> Data warehousing is an efficient way to manage demand for lots of information from lots of users.

=> Data warehousing provides the capability to analyze large amounts of historical data for

nuggets of wisdom that can provide an organization with competitive advantage.

Data modeling is the process of designing a database model. In this data model, data is stored in two types of tables: fact tables and dimension tables. The fact table contains the transaction data and the dimension table contains the master data. Data mining is the process of finding hidden trends in the data.

A data cube is a multi-dimensional structure. It is a data abstraction that allows one to view aggregated data from a number of perspectives. Conceptually, the cube consists of a core or base cuboid, surrounded by a collection of sub-cubes/cuboids that represent the aggregation of the base cuboid along one or more dimensions. We refer to the dimension to be aggregated as the measure attribute, while the remaining dimensions are known as the feature attributes.

OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data. E.g. OLAP technology could provide management with

fast answers to complex queries on their operational data or enable them to analyze their

company's historical data for trends and patterns.

OLTP stands for Online Transaction Processing. OLTP uses normalized tables to quickly record large amounts of transactions while making sure

that these updates of data occur in as few places as possible. Consequently OLTP databases are designed for recording the daily operations and transactions of a business. E.g. a timecard system

that supports a large production environment must record successfully a large number of updates

during critical periods like lunch hour, breaks, startup and close of work.

Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a

fact table can be viewed by a Time dimension (profit by month, quarter, year), Region dimension (profit by country, state, city), Product dimension (profit for product1, product2).
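Such a summary can be produced with a simple aggregate query. The table and column names below (sales_fact, time_dim, region_dim and their keys) are assumed purely for illustration.

-- Profit summarized by the Time and Region dimensions.
SELECT t.year,
       t.quarter,
       r.country,
       SUM(f.profit) AS total_profit
FROM   sales_fact f
INNER JOIN time_dim   t ON f.time_key   = t.time_key
INNER JOIN region_dim r ON f.region_key = r.region_key
GROUP BY t.year, t.quarter, r.country;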

MOLAP Cubes: stands for Multidimensional OLAP. In MOLAP cubes the data aggregations and a copy of the fact data are stored in a multidimensional structure on the Analysis Server

computer. It is best when extra storage space is available on the Analysis Server computer and

the best query performance is desired. MOLAP local cubes contain all the necessary data for

calculating aggregates and can be used offline. MOLAP cubes provide the fastest query response time and performance but require additional storage space for the extra copy of data from the fact

table.


ROLAP Cubes: stands for Relational OLAP. In ROLAP cubes a copy of data from the fact

table is not made and the data aggregates are stored in tables in the source relational database. A

ROLAP cube is best when there is limited space on the Analysis Server and query performance is not very important. ROLAP local cubes contain the dimensions and cube definitions but aggregates are calculated when they are needed. ROLAP cubes require less storage space than MOLAP and HOLAP cubes.
HOLAP Cubes: stands for Hybrid OLAP. A HOLAP cube has a combination of the ROLAP and MOLAP cube characteristics. It does not create a copy of the source data; however, data aggregations are stored in a multidimensional structure on the Analysis Server computer.

HOLAP cubes are best when storage space is limited but faster query responses are needed.

You can disconnect the report from the catalog to which it is attached by saving the report with a

snapshot of the data.

An active data warehouse provides information that enables decision-makers within an

organization to manage customer relationships nimbly, efficiently and proactively.

Star schema – A single fact table with N dimensions; all dimensions are linked directly to the fact table. This schema is de-normalized and results in simple joins and less complex queries, as well as faster results.
Snowflake schema – Any dimension with extended (sub-)dimensions is known as a snowflake schema; dimensions may be interlinked or may have one-to-many relationships with other tables. This schema is normalized and results in complex joins and more complex queries, as well as slower results.
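The difference shows up directly in the join pattern. The following sketch uses hypothetical tables (sales_fact, product_dim, product_category) purely to contrast the two shapes; none of these names come from the source.

-- Star schema: the denormalized dimension already carries the category name.
SELECT p.category_name, SUM(f.sales_amount) AS sales
FROM   sales_fact f
INNER JOIN product_dim p ON f.product_key = p.product_key
GROUP BY p.category_name;

-- Snowflake schema: the category is normalized into its own table,
-- so an extra join is needed to answer the same question.
SELECT c.category_name, SUM(f.sales_amount) AS sales
FROM   sales_fact f
INNER JOIN product_dim      p ON f.product_key  = p.product_key
INNER JOIN product_category c ON p.category_key = c.category_key
GROUP BY c.category_name;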

A concept hierarchy that is a total (or) partial order among attributes in a database schema is called a schema hierarchy.

The roll-up operation, also called the drill-up operation, performs aggregation on a data cube either by climbing up a concept hierarchy for a dimension or by dimension reduction.
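Many SQL dialects expose this aggregation directly through GROUP BY ROLLUP. The example below is a generic sketch with assumed table and column names (sales_fact, time_dim); it simply illustrates climbing the Time hierarchy.

-- Rolls profit up the Time hierarchy: (year, quarter), (year), grand total.
SELECT t.year,
       t.quarter,
       SUM(f.profit) AS total_profit
FROM   sales_fact f
INNER JOIN time_dim t ON f.time_key = t.time_key
GROUP BY ROLLUP (t.year, t.quarter);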

Indexing is a technique used for efficient data retrieval, i.e. accessing data in a faster manner. When a table grows in volume, the indexes also increase in size, requiring more storage.
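A typical example, using the delivery table from the earlier SCD queries (the index name itself is an assumption):

-- Index the foreign key used most often in fact-to-dimension joins;
-- this speeds retrieval at the cost of extra storage and slower loads.
CREATE INDEX ix_delivery_supplier_key
    ON delivery (supplier_key);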

Dimensional Modeling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables - the fact table and the dimension table. The fact table contains the facts/measurements of the business and the dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated. Dimensional modeling is a method for designing a data warehouse.

There are three types of data modeling:
1. Conceptual modeling
2. Logical modeling
3. Physical modeling


Data Transformation Services (DTS) is a set of tools available in SQL Server that helps to extract, transform and consolidate data from different sources into a single destination or multiple destinations, depending on DTS connectivity. To perform such operations DTS offers a set of tools. Depending on the business needs, a DTS package is created. This package contains a list of tasks that define the work to be performed and the transformations to be done on the data objects.

Import or export data: DTS can import data from a text file or an OLE DB data source into a SQL Server database, or vice versa.

Transform data: the DTS designer interface also allows you to select data from a data source connection, map the columns of data to a set of transformations, and send the transformed data to a destination connection. For parameterized queries and mapping purposes, the Data Driven Query task can be used from the DTS designer.

Consolidate data: the DTS designer can also be used to transfer indexes, views, logins, triggers and user-defined data. Scripts can also be generated for the same. For performing these tasks, a valid connection (or connections) to the source and destination data, and to any additional data sources such as lookup tables, must be established.

Data Mining Extensions (DMX) is based on the syntax of SQL. It is based on relational concepts and is mainly used to create and manage data mining models. DMX comprises two types of statements: data definition and data manipulation. Data definition statements are used to define or create new models and structures.
Example: CREATE MINING STRUCTURE, CREATE MINING MODEL
Data manipulation statements are used to manage the existing models and structures.
Example: INSERT INTO, SELECT FROM <model>.CONTENT (DMX)

SQL Server data mining offers Data Mining Add-ins for Office 2007 that allow discovering the patterns and relationships in the data. This also helps in enhanced analysis. The add-in called Data Mining Client for Excel is used to first prepare data, then build, evaluate, manage and predict results.

Data mining is used to examine or explore the data using queries. These queries can be fired on the data warehouse. Exploring the data in data mining helps in reporting, planning strategies, finding meaningful patterns etc. It is most commonly used to transform large amounts of data into a meaningful form. Data here can be facts, numbers or any real-time information like sales figures, cost, metadata etc. Information would be the patterns and the relationships amongst the data.

Sequence clustering algorithm collects similar or related paths, sequences of data containing

events. The data represents a series of events or transitions between states in a dataset like a

series of web clicks. The algorithm will examine all probabilities of transitions and measure the differences, or distances, between all the possible sequences in the data set. This helps it to determine which sequence can be the best input for clustering.


E.g. the sequence clustering algorithm may help find the path to store products of a "similar" nature in a retail warehouse.

The association algorithm is used for recommendation engines that are based on market basket analysis. Such an engine suggests products to customers based on what they bought earlier. The model is built on a dataset containing identifiers, both for individual cases and for the items that cases contain. These groups of items in a data set are called an item set. The algorithm traverses a data set to find items that appear together in a case. The MINIMUM_SUPPORT parameter controls which associated items are included in an item set (only item sets that occur in a sufficient number of cases are kept).

The time series algorithm can be used to predict continuous values of data. Once the algorithm is trained on a series of data, it can predict the outcome of other series. The algorithm generates a model that can predict trends based only on the original dataset. New data can also be

added that automatically becomes a part of the trend analysis.

E.g. the performance of one employee can be used to forecast the profit.

The Naïve Bayes algorithm is used to generate mining models. These models help to identify relationships between the input columns and the predictable columns. This algorithm can be used in the initial stage of exploration. The algorithm calculates the probability of every state of each input column given the possible states of the predictable columns. After the model is made, the results can be used for exploration and making predictions.

A decision tree is a tree in which every node is either a leaf node or a decision node. The tree takes an object as input and outputs some decision. All paths from the root node to a leaf node are reached by using AND, OR or both. The tree is constructed using the regularities of the data. The decision tree is not affected by Automatic Data Preparation.

Models in data mining help the different algorithms in decision making or pattern matching. The second stage of data mining involves considering various models and choosing the best one based on their predictive performance.

•  Data mining helps analysts in making faster business decisions which increases revenue with lower costs.

•  Data mining helps to understand, explore and identify patterns of data.

•  Data mining automates process of finding predictive information in large databases.

•  Helps to identify previously hidden patterns.

The process of cleaning junk data is termed as data purging. Purging data would mean getting rid of

unnecessary NULL values of columns. This usually happens when the size of the database gets too large.
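A minimal sketch of such a purge, assuming a hypothetical staging table and treating rows whose key columns are entirely NULL as junk (the table and column names are assumptions, not from the source):

-- Remove junk rows from a (hypothetical) staging table before loading.
DELETE FROM staging_delivery
WHERE  supplier_code IS NULL
  AND  delivery_date IS NULL
  AND  delivery_cost IS NULL;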

Data warehousing is merely extracting data from different sources, cleaning the data and storing

it in the warehouse. Whereas data mining aims to examine or explore the data using queries. These queries can be fired on the data warehouse. Exploring the data in data mining helps in

reporting, planning strategies, finding meaningful patterns etc.

E.g. a data warehouse of a company stores all the relevant information of projects and employees. Using data mining, one can use this data to generate different reports like profits

generated etc.


History

In a 1958 article, IBM researcher  Hans Peter Luhn used the term business intelligence. He

defined intelligence as: "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal."

[3] 

Business intelligence as it is understood today is said to have evolved from the decision support

systems that began in the 1960s and developed throughout the mid-1980s. DSS originated in the

computer-aided models created to assist with decision making and planning. From DSS, data

warehouses, Executive Information Systems, OLAP and business intelligence came into focus beginning in the late 80s.

In 1989, Howard Dresner (later a Gartner Group analyst) proposed "business intelligence" as an umbrella term to describe "concepts and methods to improve business decision making by using

fact-based support systems."[4]

 It was not until the late 1990s that this usage was widespread.[5]

 

Business intelligence and data warehousing

Often BI applications use data gathered from a data warehouse or a data mart. A data warehouse

is a copy of transactional data that facilitates decision support. However, not all data warehouses are used for business intelligence, nor do all business intelligence applications require a data

warehouse.

To distinguish between the concepts of business intelligence and data warehouses,  Forrester

Research often defines business intelligence in one of two ways:

Using a broad definition: "Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information

used to enable more effective strategic, tactical, and operational insights and decision-making."[6]

 When using this definition, business intelligence also includes technologies such as data

integration, data quality, data warehousing, master data management, text and content analytics,

and many others that the market sometimes lumps into the Information Management segment. Therefore, Forrester refers to data preparation and data usage as two separate, but closely linked

segments of the business intelligence architectural stack.

Forrester defines the latter, narrower business intelligence market as, "...referring to just the top

layers of the BI architectural stack such as reporting, analytics and dashboards."[7]

 

Business intelligence and business analytics

Thomas Davenport argues that business intelligence should be divided into querying, reporting, 

OLAP, an "alerts" tool, and  business analytics. In this definition, business analytics is the subsetof BI based on statistics, prediction, and optimization.

[8] 


Applications in an enterprise

Business intelligence can be applied to the following business purposes, in order to drive

 business value.[citation needed ]

 

1.  Measurement – program that creates a hierarchy of performance metrics (see also Metrics Reference Model) and benchmarking that informs business leaders about progress

towards business goals ( business process management).

2.  Analytics  –  program that builds quantitative processes for a business to arrive at optimal

decisions and to perform business knowledge discovery. Frequently involves: data mining, process mining, statistical analysis, predictive analytics, predictive modeling,

 business process modeling, complex event processing and  prescriptive analytics. 

3.  Reporting/enterprise reporting  –  program that builds infrastructure for strategic reporting

to serve the strategic management of a business, not operational reporting. Frequently involves data visualization, executive information systems and OLAP.

4.  Collaboration/collaboration platform  –  program that gets different areas (both inside and

outside the business) to work together through data sharing and electronic data interchange.

5.  Knowledge management  –  program to make the company data driven through strategies

and practices to identify, create, represent, distribute, and enable adoption of insights and experiences that are true business knowledge. Knowledge management leads to learning

management and regulatory compliance. 

In addition to the above, business intelligence can also provide a pro-active approach, such as an ALARM function that alerts the end-user immediately. There are many types of alerts; for example, if some business value exceeds a threshold value, the color of that amount in the report will turn RED and the business analyst is alerted. Sometimes an alert mail will be sent to the user as well. This end-to-end process requires data governance, which should be handled by an expert.[citation needed]

 

Prioritization of business intelligence projects

It is often difficult to provide a positive business case for business intelligence initiatives, and often the projects must be prioritized through strategic initiatives. Here are some hints to increase

the benefits for a BI project.

  As described by Kimball,[9] you must determine the tangible benefits such as the eliminated cost of producing legacy reports.

  Enforce access to data for the entire organization.[10]

 In this way even a small benefit,

such as a few minutes saved, makes a difference when multiplied by the number of employees in the entire organization.

  As described by Ross, Weil & Roberson for Enterprise Architecture,[11]

 consider letting

the BI project be driven by other business initiatives with excellent business cases. To support this approach, the organization must have enterprise architects who can identify

suitable business projects.


  Use a structured and quantitative methodology to create defensible prioritization in line

with the actual needs of the organization, such as a weighted decision matrix.[12]

 

Success factors of implementation

Before implementing a BI solution, it is worth taking different factors into consideration. According to Kimball et al., these are the three critical areas that you need to assess

within your organization before getting ready to do a BI project:[13]

 

1.  The level of commitment and sponsorship of the project from senior management

2.  The level of business need for creating a BI implementation

3.  The amount and quality of business data available.

Business sponsorship

The commitment and sponsorship of senior management is, according to Kimball et al., the most important criteria for assessment.[14] This is because having strong management backing helps overcome shortcomings elsewhere in the project. However, as Kimball et al. state: "even the most elegantly designed DW/BI system cannot overcome a lack of business [management] sponsorship".[15]

 

It is important that personnel who participate in the project have a vision and an idea of the benefits and drawbacks of implementing a BI system. The best business sponsor should have

organizational clout and should be well connected within the organization. It is ideal that the

 business sponsor is demanding but also able to be realistic and supportive if the implementation

runs into delays or drawbacks. The management sponsor also needs to be able to assume accountability and to take responsibility for failures and setbacks on the project. Support from

multiple members of the management ensures the project does not fail if one person leaves the steering group. However, having many managers work together on the project can also mean that there are several different interests that attempt to pull the project in different directions, such as

if different departments want to put more emphasis on their usage. This issue can be countered

 by an early and specific analysis of the business areas that benefit the most from the

implementation. All stakeholders in the project should participate in this analysis in order for them to feel ownership of the project and to find common ground.

Another management problem that should be addressed before the start of implementation is an overly aggressive business sponsor. The management individual may get carried away by the possibilities of using BI and start wanting the DW or BI implementation to include several different sets of data that were not included in the original planning phase. However, since such extra implementations of extra data may add many months to the original plan, it is wise to make sure the person from management is aware of the consequences of these additions.

Business needs

Because of the close relationship with senior management, another critical thing that must be

assessed before the project begins is whether or not there is a business need and whether there is


a clear business benefit by doing the implementation.[16]

 The needs and benefits of the

implementation are sometimes driven by competition and the need to gain an advantage in the

market. Another reason for a business-driven approach to implementation of BI is the acquisition of other organizations that enlarge the original organization; it can sometimes be beneficial to implement DW or BI in order to create more oversight.

Companies that implement BI are often large, multinational organizations with diverse

subsidiaries.[17]

 A well-designed BI solution provides a consolidated view of key business data

not available anywhere else in the organization, giving management visibility and control over measures that otherwise would not exist.

Amount and quality of available data

Without good data, it does not matter how good the management sponsorship or business-driven

motivation is. Without proper data, or with too little quality data, any BI implementation fails.

Before implementation it is a good idea to do data profiling. This analysis identifies the "content, consistency and structure [..]"[16] of the data. This should be done as early as possible in the process and if the analysis shows that data is lacking, put the project on the shelf temporarily

while the IT department figures out how to properly collect data.

When planning for business data and business intelligence requirements, it is always advisable to

consider specific scenarios that apply to a particular organization, and then select the business intelligence features best suited for the scenario.

Often, scenarios revolve around distinct business processes, each built on one or more data sources. These sources are used by features that present that data as information to knowledge

workers, who subsequently act on that information. The business needs of the organization for

each business process adopted correspond to the essential steps of business intelligence. These essential steps of business intelligence include but are not limited to:

1.  Go through business data sources in order to collect needed data

2.  Convert business data to information and present appropriately
3.  Query and analyze data

4.  Act on those data collected

The quality aspect in business intelligence should cover all the process from the source data to

the final reporting. At each step, the quality gates are different:

1.  Source Data:
    o  Data Standardization: make data comparable (same unit, same pattern, etc.)
    o  Master Data Management: unique referential
2.  Operational Data Store (ODS):
    o  Data Cleansing: detect and correct inaccurate data
    o  Data Profiling: check for inappropriate or null/empty values
3.  Data warehouse (see the SQL sketch after this list):
    o  Completeness: check that all expected data are loaded
    o  Referential integrity: unique and existing referentials over all sources
    o  Consistency between sources: check consolidated data vs sources
4.  Reporting:
    o  Uniqueness of indicators: only one shared dictionary of indicators
    o  Formula accurateness: local reporting formulas should be avoided or checked
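As referenced in the list above, the data warehouse gates can often be expressed as simple SQL checks. The sketch below reuses the delivery and supplier tables from the earlier examples; the staging_delivery table and delivery_id column are assumptions introduced for illustration.

-- Completeness: compare row counts between the staging area and the warehouse.
SELECT (SELECT COUNT(*) FROM staging_delivery) AS staged_rows,
       (SELECT COUNT(*) FROM delivery)         AS loaded_rows;

-- Referential integrity: fact rows whose supplier key has no matching dimension row.
SELECT d.delivery_id, d.supplier_key
FROM   delivery d
LEFT JOIN supplier s ON d.supplier_key = s.supplier_key
WHERE  s.supplier_key IS NULL;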

User aspect

Some considerations must be made in order to successfully integrate the usage of business

intelligence systems in a company. Ultimately the BI system must be accepted and utilized by the users in order for it to add value to the organization.

[18][19] If the usability of the system is

 poor, the users may become frustrated and spend a considerable amount of time figuring out how

to use the system or may not be able to really use the system. If the system does not add value to

the users' mission, they simply don't use it.[19]

 

To increase user acceptance of a BI system, it can be advisable to consult business users at an

early stage of the DW/BI lifecycle, for example at the requirements gathering phase.[18] This can provide an insight into the  business process and what the users need from the BI system. There

are several methods for gathering this information, such as questionnaires and interview sessions.

When gathering the requirements from the business users, the local IT department should also be

consulted in order to determine to which degree it is possible to fulfill the business's needs based

on the available data.[18]

 

Taking on a user-centered approach throughout the design and development stage may further

increase the chance of rapid user adoption of the BI system.[19]

 

Besides focusing on the user experience offered by the BI applications, it may also be possible to motivate the users to utilize the system by adding an element of competition. Kimball

[18] 

suggests implementing a function on the Business Intelligence portal website where reports on

system usage can be found. By doing so, managers can see how well their departments are doing

and compare themselves to others, and this may spur them to encourage their staff to utilize the BI system even more.

In a 2007 article, H. J. Watson gives an example of how the competitive element can act as an

incentive.[20]

 Watson describes how a large call centre implemented performance dashboards for

all call agents, with monthly incentive bonuses tied to performance metrics. Also, agents could

compare their performance to other team members. The implementation of this type of

 performance measurement and competition significantly improved agent performance.

BI's chances of success can be improved by involving senior management to help make BI a part of the organizational culture, and by providing the users with the necessary tools, training, and support.

[20] Training encourages more people to use the BI application.

[18] 

Providing user support is necessary to maintain the BI system and resolve user problems.[19]

 User

support can be incorporated in many ways, for example by creating a website. The website


should contain great content and tools for finding the necessary information. Furthermore,

helpdesk support can be used. The help desk can be manned by power users or the DW/BI

 project team.[18]

 

BI Portals

A Business Intelligence portal (BI portal) is the primary access interface for  Data Warehouse 

(DW) and Business Intelligence (BI) applications. The BI portal is the user's first impression of

the DW/BI system. It is typically a browser application, from which the user has access to all the

individual services of the DW/BI system, reports and other analytical functionality. The BI portal must be implemented in such a way that it is easy for the users of the DW/BI application to call

on the functionality of the application.[21]

 

The BI portal's main functionality is to provide a navigation system of the DW/BI application.

This means that the portal has to be implemented in a way that the user has access to all the

functions of the DW/BI application.

The most common way to design the portal is to custom fit it to the business processes of the

organization for which the DW/BI application is designed; in that way the portal can best fit the needs and requirements of its users.

[22] 

The BI portal needs to be easy to use and understand, and if possible have a look and feel similar to other applications or web content of the organization the DW/BI application is designed for

(consistency).

The following is a list of desirable features for  web portals in general and BI portals in particular:

Usable
User should easily find what they need in the BI tool.
Content Rich
The portal is not just a report printing tool, it should contain more functionality such as advice, help, support information and documentation.
Clean
The portal should be designed so it is easily understandable and not so over-complex as to confuse the users.
Current
The portal should be updated regularly.
Interactive
The portal should be implemented in a way that makes it easy for the user to use its functionality and encourages them to use the portal. Scalability and customization give the user the means to fit the portal to each user.
Value Oriented
It is important that the user has the feeling that the DW/BI application is a valuable resource that is worth working on.


Marketplace

There are a number of business intelligence vendors, often categorized into the remaining

independent "pure-play" vendors and consolidated "megavendors" that have entered the marketthrough a recent trend

[23] of acquisitions in the BI industry.

[24] 

Some companies adopting BI software decide to pick and choose from different product

offerings (best-of-breed) rather than purchase one comprehensive integrated solution (full-

service).[25]

 

Industry-specific

Specific considerations for business intelligence systems have to be taken in some sectors such

as governmental banking regulations. The information collected by banking institutions and analyzed with BI software must be protected from some groups or individuals, while being fully

available to other groups or individuals. Therefore BI solutions must be sensitive to those needs

and be flexible enough to adapt to new regulations and changes to existing law.

Semi-structured or unstructured data

Businesses create a huge amount of valuable information in the form of e-mails, memos, notes

from call-centers, news, user groups, chats, reports, web-pages, presentations, image-files, video-

files, and marketing material. According to Merrill Lynch, more than 85% of all

business information exists in these forms. These information types are called either semi-structured or unstructured data. However, organizations often only use these documents once.

[26] 

The management of semi-structured data is recognized as a major unsolved problem in the information technology industry.

[27] According to projections from Gartner (2003), white collar

workers spend anywhere from 30 to 40 percent of their time searching, finding and assessing

unstructured data. BI uses both structured and unstructured data, but the former is easy to search, and the latter contains a large quantity of the information needed for analysis and decision

making.[27][28]

 Because of the difficulty of properly searching, finding and assessing unstructured

or semi-structured data, organizations may not draw upon these vast reservoirs of information, which could influence a particular decision, task or project. This can ultimately lead to poorly

informed decision making.[26]

 

Therefore, when designing a business intelligence/DW-solution, the specific problems associated

with semi-structured and unstructured data must be accommodated as well as those for the structured data.[28]

 

Unstructured data vs. semi-structured data

Unstructured and semi-structured data have different meanings depending on their context. In the context of relational database systems, unstructured data cannot be stored in predictably ordered

columns and rows. One type of unstructured data is typically stored in a BLOB (binary large


object), a catch-all data type available in most relational database management systems.

Unstructured data may also refer to irregularly or randomly repeated column patterns that vary

from row to row within each file or document.

Many of these data types, however, like e-mails, word processing text files, PPTs, image-files,

and video-files conform to a standard that offers the possibility of metadata. Metadata can include information such as author and time of creation, and this can be stored in a relational

database. Therefore it may be more accurate to talk about this as semi-structured documents or

data,[27]

  but no specific consensus seems to have been reached.

Unstructured data can also simply be the knowledge that business users have about future

business trends. Business forecasting naturally aligns with the BI system because business users think of their business in aggregate terms. Capturing the business knowledge that may only exist

in the minds of business users provides some of the most important data points for a complete BI

solution.

Problems with semi-structured or unstructured data

There are several challenges to developing BI with semi-structured data. According to Inmon & Nesavich,

[29] some of those are:

1.  Physically accessing unstructured textual data – unstructured data is stored in a huge variety of formats.

2.  Terminology  –  Among researchers and analysts, there is a need to develop a standardized

terminology.
3.  Volume of data – As stated earlier, up to 85% of all data exists as semi-structured data.

Couple that with the need for word-to-word and semantic analysis.

4.  Searchability of unstructured textual data – A simple search on some data, e.g. apple, results in links where there is a reference to that precise search term. (Inmon & Nesavich, 2008)

[29] gives an example: "a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies."

The use of metadata

To solve problems with searchability and assessment of data, it is necessary to know something

about the content. This can be done by adding context through the use of  metadata.[26]

Many systems already capture some metadata (e.g. filename, author, size, etc.), but more useful would

 be metadata about the actual content –  e.g. summaries, topics, people or companies mentioned.

Two technologies designed for generating metadata about content are automatic categorization 

and information extraction.