Data Warehouse

ASSIGNMENT OF INFORMATION TECHNOLOGY


Page 1: Data Warehouse


SUBMITTED TO: SUBMITTED BY:

Ms. KAMAL VIPAN


DATA WAREHOUSE

A data warehouse is a repository of an organization's electronically stored data. Means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.

According to Bill Inmon, the author of several books on data warehousing, "A data warehouse is a subject oriented, integrated, time variant, non volatile collection of data in support of management's decision making process."

Data warehousing arises from an organization's need for reliable, consolidated, unique and integrated reporting and analysis of its data, at different levels of aggregation.

The data warehousing consultant is charged with making the data appear consistent, integrated and consolidated despite the problems in the underlying source systems. The data warehousing consultant achieves this by employing different data warehousing techniques, creating one or more new data repositories (i.e. the data warehouse) whose data model(s) support the needed reporting and analysis.

Key developments in early years of data warehousing were:

1960s — General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.[3]

1970s — ACNielsen and IRI provide dimensional data marts for retail sales.[3]

1983 — Teradata introduces a database management system specifically designed for decision support.

1988 — Barry Devlin and Paul Murphy publish the article An architecture for a business and information systems in IBM Systems Journal where they introduce the term "business data warehouse".

1990 — Red Brick Systems introduces Red Brick Warehouse, a database management system specifically for data warehousing.

1991 — Prism Solutions introduces Prism Warehouse Manager, software for developing a data warehouse.

1991 — Bill Inmon publishes the book Building the Data Warehouse.

1995 — The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.


1996 — Ralph Kimball publishes the book The Data Warehouse Toolkit.

1997 — Oracle 8, with support for star queries, is released.

1998 — Microsoft releases Microsoft Analysis Services (then OLAP Services) heavily utilizing data warehousing schemas.

Example:

Consider a bank with branches in several countries, millions of customers, and two lines of business: savings and loans. Over the years, the application designers in each branch have made their own decisions about how applications and databases should be built to store data. As a result, the source systems differ in naming conventions, units of measurement, encoding structures, and physical attributes of data. The following example shows how such data is integrated from the source systems into the target system.

Example of Source Data

System Name      Attribute Name             Column Name                Data type     Values
---------------  -------------------------  -------------------------  ------------  ---------
Source System 1  Customer Application Date  CUSTOMER_APPLICATION_DATE  NUMERIC(8,0)  11012005
Source System 2  Customer Application Date  CUST_APPLICATION_DATE      DATE          11012005
Source System 3  Application Date           APPLICATION_DATE           DATE          01NOV2005

In the example above, the attribute name, column name, data type and value format differ from one source system to another. This inconsistency can be avoided by integrating the data into a data warehouse with good standards.

Example of Target Data (Data Warehouse)

Target System  Attribute Name             Column Name                Data type  Values
-------------  -------------------------  -------------------------  ---------  --------
Record #1      Customer Application Date  CUSTOMER_APPLICATION_DATE  DATE       01112005
Record #2      Customer Application Date  CUSTOMER_APPLICATION_DATE  DATE       01112005
Record #3      Customer Application Date  CUSTOMER_APPLICATION_DATE  DATE       01112005

In the above example of target data, attribute names, column names, and data types are consistent throughout the target system. This is how data from various source systems is integrated and accurately stored into the data warehouse.
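The integration step can be sketched in a few lines of Python. This is only an illustration: the format strings are assumptions inferred from the sample values (month-day-year for the first two systems, day-month-year for the target), the function names are hypothetical, and the month abbreviation parsing assumes an English locale.

```python
from datetime import date, datetime

# Assumed source formats, inferred from the sample values in the tables.
SOURCE_FORMATS = {
    "source_system_1": "%m%d%Y",  # 11012005 held as NUMERIC(8,0)
    "source_system_2": "%m%d%Y",  # 11012005 held as DATE
    "source_system_3": "%d%b%Y",  # 01NOV2005 held as DATE
}


def to_warehouse_date(system: str, raw_value) -> date:
    """Parse a source value into one standard date type."""
    # Numeric columns lose leading zeros, so pad back to eight characters.
    text = str(raw_value).zfill(8)
    return datetime.strptime(text, SOURCE_FORMATS[system]).date()


def format_for_target(value: date) -> str:
    """Render the date in the DDMMYYYY form used by the target table."""
    return value.strftime("%d%m%Y")


source_rows = [
    ("source_system_1", 11012005),
    ("source_system_2", "11012005"),
    ("source_system_3", "01NOV2005"),
]

integrated = [format_for_target(to_warehouse_date(s, v)) for s, v in source_rows]
print(integrated)
```

All three source values refer to the same day, so after conversion they share one representation.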

ARCHITECTURE

Data warehouse architecture is based primarily on the business processes of the enterprise. It takes into account data consolidation across the enterprise with adequate security, data modeling and organization, the extent of query requirements, metadata management and application, planning of the warehouse staging area for optimum bandwidth utilization, and full technology implementation.

The Data Warehouse Architecture includes many facets. Some of these are listed as follows:

Process Architecture

Data Model Architecture

Technology Architecture

Information Architecture


Resource Architecture

Process Architecture

Process architecture describes the number of stages and how data is processed to convert raw or transactional data into information for end users. The data staging process includes three main areas of concern, or sub-processes, for planning data warehouse architecture: “Extract”, “Transform” and “Load”. These interrelated sub-processes are sometimes referred to as the “ETL” process.

1) Extract - Data for the data warehouse can come from different sources and may be of different types, so the plan to extract the data, along with appropriate compression and encryption techniques, is an important consideration.

2) Transform - Transforming the data with appropriate conversion, aggregation and cleaning, as well as de-normalization and surrogate key management, must also be planned when building a data warehouse.

3) Load - The steps needed to load data efficiently, taking into account the multiple areas where the data is to be loaded and retrieved, are also an important part of the data warehouse architecture plan.
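The three sub-processes can be sketched as three small functions. This is a minimal illustration rather than a production pipeline: the sample rows, the table name branch_totals and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3


def extract():
    """Extract: pull raw rows from a (hypothetical) branch system."""
    return [
        {"branch": "north", "amount": "100.50"},
        {"branch": "NORTH ", "amount": "20.00"},
        {"branch": "south", "amount": "n/a"},  # dirty value
        {"branch": "south", "amount": "7.25"},
    ]


def transform(raw_rows):
    """Transform: clean, convert, and aggregate the amounts per branch."""
    totals = {}
    for row in raw_rows:
        branch = row["branch"].strip().lower()  # standardize the encoding
        try:
            amount = float(row["amount"])  # convert text to a number
        except ValueError:
            continue  # cleaning: skip values that cannot be converted
        totals[branch] = totals.get(branch, 0.0) + amount
    return sorted(totals.items())


def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS branch_totals "
        "(branch TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO branch_totals VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
result = conn.execute(
    "SELECT branch, total FROM branch_totals ORDER BY branch"
).fetchall()
print(result)
```

Keeping each stage in its own function makes it possible to plan, test and tune the stages separately, which is the point of treating them as distinct sub-processes.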

Data Model Architecture

In Data Model Architecture (also known as Dimensional Data Model), there are three main data modeling styles for enterprise warehouses:

3rd Normal Form - Top Down Architecture, Top Down Implementation

Federated Star Schemas - Bottom Up Architecture, Bottom Up Implementation

Data Vault - Top Down Architecture, Bottom Up Implementation
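As an illustration of the star schema style, the sketch below builds one fact table surrounded by two dimension tables in SQLite. The table names, column names and values are hypothetical, chosen to echo the bank example, and the layout is only one plausible way to arrange such a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the "who/what/where" of each fact.
    CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, line_of_business TEXT);

    -- The fact table holds the measures, keyed by the dimensions.
    CREATE TABLE fact_balance (
        branch_key INTEGER REFERENCES dim_branch(branch_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        balance REAL
    );

    INSERT INTO dim_branch VALUES (1, 'India'), (2, 'Kenya');
    INSERT INTO dim_product VALUES (1, 'savings'), (2, 'loans');
    INSERT INTO fact_balance VALUES (1, 1, 500.0), (1, 2, 200.0), (2, 1, 300.0);
    """
)

# A typical star query: join the fact table to its dimensions and aggregate.
star_rows = conn.execute(
    """
    SELECT b.country, p.line_of_business, SUM(f.balance)
    FROM fact_balance AS f
    JOIN dim_branch AS b ON b.branch_key = f.branch_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY b.country, p.line_of_business
    ORDER BY b.country, p.line_of_business
    """
).fetchall()
print(star_rows)
```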

Technology Architecture

Technology (or technical) architecture evolves primarily from the process architecture, from metadata management requirements based on business rules and security levels, and from the evaluation of specific technology tools. In addition, the technology architecture covers implementation standards for database management, database connectivity protocols (ODBC, JDBC, OLE DB, etc.), middleware (based on ORB, RMI, COM/DCOM, etc.), network protocols (DNS, LDAP, etc.) and other related technologies.


Information Architecture

Information Architecture is the process of translating the information from one form to another in a step by step sequence so as to manage the storage, retrieval, modification and deletion of the data in the data warehouse.

Resource Architecture

Resource architecture is related to software architecture in that many resources come from software resources. Resources are important because they help determine performance. Workload is the other part of the equation. If you have enough resources to complete the workload in the right amount of time, then performance will be high. If there are not enough resources for the workload, then performance will be low.


DATABASE

A database is an application that manages data and allows fast storage and retrieval of that data. The term database was originally written as data base, and it may have been first used in 1963 at a symposium sponsored by the System Development Corporation of Santa Monica, California. The use of the term database (single word) became popular in some European countries in the early 1970s, and it subsequently spread to the U.S.

“A database is a collection of information that is organized so that it can easily be accessed, managed, and updated. In one view, databases can be classified according to types of content: bibliographic, full-text, numeric, and images.”

A database can generally be looked at as being a collection of records, each of which contains one or more fields (i.e., pieces of data) about some entity (i.e., object), such as a person, organization, city, product, work of art, recipe, chemical, or sequence of DNA. For example, the fields for a database that is about people who work for a specific company might include the name, employee identification number, address, telephone number, date employment started, position and salary for each worker.
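The idea of records made up of fields can be sketched with a small Python data class. The class name, field names and sample values are invented for illustration and follow the employee example in the paragraph above.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class EmployeeRecord:
    """One record; each attribute is a field describing the employee."""
    name: str
    employee_id: int
    address: str
    telephone: str
    date_started: date
    position: str
    salary: float


# A tiny "database": a collection of records about the same kind of entity.
employees = [
    EmployeeRecord("A. Example", 1001, "1 Sample Road", "555-0100",
                   date(2008, 4, 1), "Analyst", 42000.0),
    EmployeeRecord("B. Example", 1002, "2 Sample Road", "555-0101",
                   date(2009, 1, 15), "Manager", 58000.0),
]

field_names = [f.name for f in fields(EmployeeRecord)]
managers = [e.name for e in employees if e.position == "Manager"]
print(field_names)
print(managers)
```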

TYPES

There are different types of databases, but the most popular is the relational database, which stores data in tables where each row holds the same sort of information.

Relational Database

As of 2009, relational databases are the standard of business computing and the most commonly used type of database. A relational database uses tables to structure information so that it can be searched readily. It organizes data so that it appears to the user to be stored in a series of interrelated tables. Interest in this model was initially confined to academia, perhaps because the theoretical basis is not easy to understand, and so the first commercial products, Oracle and DB2, did not appear until around 1980. Subsequently, relational databases became the dominant type for high-performance applications because of their efficiency, ease of use, and ability to perform a variety of useful tasks that had not originally been envisioned.
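A short sketch of "interrelated tables", using Python's built-in sqlite3 module. The customer and account tables and their contents are hypothetical. The two tables are related only through the shared customer_id value, and a join brings the related rows together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        kind TEXT NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO account VALUES (10, 1, 'savings'), (11, 1, 'loan'), (12, 2, 'savings');
    """
)

# Every row of a table has the same columns; the join relates rows by value.
accounts_by_customer = conn.execute(
    """
    SELECT c.name, a.kind
    FROM customer AS c
    JOIN account AS a ON a.customer_id = c.customer_id
    ORDER BY c.name, a.kind
    """
).fetchall()
print(accounts_by_customer)
```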

Operational database


These databases store detailed data needed to support the operations of an entire organization. They are also called subject-area databases (SADB), transaction databases, and production databases. For example:

customer database

personal database

inventory database

accounting database

Analytical database

Analytic databases (also known as OLAP, or On-Line Analytical Processing, databases) are primarily static, read-only databases that store archived, historical data used for analysis. For example, a company might store sales records for the last ten years in an analytic database and use that database to analyze marketing strategies in relation to demographics.
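The sales example can be sketched with SQLite. The table, the age groups and the figures are invented. After loading, the connection is switched to read-only with SQLite's query_only pragma, which is just one way to imitate the static nature of an analytic store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, age_group TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (2007, "18-30", 120.0),
        (2007, "31-50", 80.0),
        (2008, "18-30", 150.0),
        (2008, "31-50", 90.0),
        (2008, "18-30", 50.0),
    ],
)
conn.commit()

# From here on the historical data is only read, never changed.
conn.execute("PRAGMA query_only = ON")

# Analysis: total sales per year and demographic group.
sales_by_group = conn.execute(
    """
    SELECT year, age_group, SUM(amount)
    FROM sales
    GROUP BY year, age_group
    ORDER BY year, age_group
    """
).fetchall()
print(sales_by_group)

# A write attempt is rejected while the connection is read-only.
try:
    conn.execute("INSERT INTO sales VALUES (2009, '18-30', 1.0)")
    write_rejected = False
except sqlite3.Error:
    write_rejected = True
print(write_rejected)
```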

On the web, analytic databases often appear in the form of inventory catalogs, such as the one at Amazon.com. An inventory catalog analytical database usually holds descriptive information about all available products in the inventory.

Web pages are generated dynamically by querying the list of available products in the inventory against some search parameters. The dynamically-generated page will display the information about each item (such as title, author, ISBN) which is stored in the database.

Data warehouse

A data warehouse stores data from current and previous years, extracted from the various operational databases of an organization. It becomes the central source of data that has been screened, edited, standardized and integrated so that it can be used by managers and other end-user professionals throughout an organization. Data warehouses are characterized by being slow to insert into but fast to retrieve from. Recent developments in data warehousing have led to the use of a shared-nothing architecture to facilitate extreme scaling.

Distributed database

These are databases of local work-groups and departments at regional offices, branch offices, manufacturing plants and other work sites. These databases can include segments of both common operational and common user databases, as well as data generated and used only at a user’s own site.

End-user database


These databases consist of a variety of data files developed by end-users at their workstations. Examples of these are collections of documents in spreadsheets, word processing and even downloaded files.

External database

These databases provide access to external, privately owned data online, available for a fee to end-users and organizations from commercial online services. A wealth of information is also available, with or without charge, from many sources on the Internet.

Hypermedia databases on the web

These are sets of interconnected multimedia pages at a web site. They consist of a home page and other hyperlinked pages of multimedia or mixed media such as text, graphics, photographic images, video clips and audio.

Navigational database

In navigational databases, queries find objects primarily by following references from other objects. Traditionally, navigational interfaces are procedural, though one could characterize some modern systems like XPath as being simultaneously navigational and declarative.

In-memory databases

In-memory databases primarily rely on main memory for computer data storage. This contrasts with database management systems which employ a disk-based storage mechanism. Main memory databases are faster than disk-optimized databases since the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory provides faster and more predictable performance than disk. In applications where response time is critical, such as telecommunications network equipment that operates emergency systems, main memory databases are often used.
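SQLite can run in either mode, which makes the contrast easy to sketch. The table and helper function below are hypothetical. The special path ":memory:" keeps the whole database in main memory, so its contents vanish when the connection closes, while a file path gives disk-based storage that survives.

```python
import os
import sqlite3
import tempfile


def build(path):
    """Create a small table at the given location and insert one row."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS reading (sensor TEXT, value REAL)")
    conn.execute("INSERT INTO reading VALUES ('s1', 1.5)")
    conn.commit()
    return conn


# In-memory database: same interface, data lives only in main memory.
memory_conn = build(":memory:")
memory_count = memory_conn.execute("SELECT COUNT(*) FROM reading").fetchone()[0]
memory_conn.close()

# Reconnecting to ":memory:" yields a new, empty database.
fresh = sqlite3.connect(":memory:")
tables_after_reopen = fresh.execute("SELECT COUNT(*) FROM sqlite_master").fetchone()[0]
fresh.close()

# Disk-based database: the row is still there after closing and reopening.
with tempfile.TemporaryDirectory() as tmp:
    disk_path = os.path.join(tmp, "readings.db")
    build(disk_path).close()
    reopened = sqlite3.connect(disk_path)
    disk_count = reopened.execute("SELECT COUNT(*) FROM reading").fetchone()[0]
    reopened.close()

print(memory_count, tables_after_reopen, disk_count)
```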

Document-oriented database


Document-oriented databases are computer programs designed for document-oriented applications. These systems may be implemented as a layer above a relational database or an object database. As opposed to relational databases, document-based databases do not store data in tables with uniform sized fields for each record. Instead, they store each record as a document that has certain characteristics. Any number of fields of any length can be added to a document. Fields can also contain multiple pieces of data.
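The difference from fixed table rows can be sketched with plain Python dictionaries serialized as JSON. The collection, the insert and find helpers, and the sample documents are all hypothetical and do not mirror any particular product's API.

```python
import json

# A toy "collection": each record is stored as a self-describing JSON document.
collection = {}


def insert(document):
    """Store a document under its identifier."""
    collection[document["_id"]] = json.dumps(document)


def find(predicate):
    """Return every stored document for which predicate(document) is true."""
    return [d for d in map(json.loads, collection.values()) if predicate(d)]


# Documents need not share the same fields, and a field can hold several values.
insert({"_id": 1, "title": "Invoice", "tags": ["billing", "2009"]})
insert({"_id": 2, "title": "Memo", "body": "Meeting moved to Friday."})
insert({"_id": 3, "title": "Receipt", "tags": ["billing"], "total": 19.99})

billing_ids = [d["_id"] for d in find(lambda d: "billing" in d.get("tags", []))]
print(billing_ids)
```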

Real-time databases

A real-time database is a processing system designed to handle workloads whose state may change constantly. This differs from traditional databases containing persistent data, mostly unaffected by time. For example, a stock market changes rapidly and dynamically. Real-time processing means that a transaction is processed fast enough for the result to come back and be acted on right away. Real-time databases are useful for accounting, banking, law, medical records, multi-media, process control, reservation systems, and scientific data analysis. As computers increase in power and can store more data, real-time databases become integrated into society and are employed in many applications.

ARCHITECTURE

A number of database architectures exist. Many databases use a combination of strategies.

Databases consist of software-based "containers" that are structured to collect and store information so users can retrieve, add, update or remove such information in an automatic fashion. Database programs are designed for users so that they can add or delete any information needed. The structure of a database is tabular, consisting of rows and columns of information.

Online Transaction Processing (OLTP) systems often use a "row-oriented" or an "object-oriented" data store architecture, whereas data warehouses and other retrieval-focused applications, such as Google's BigTable or bibliographic database (library catalog) systems, may use a column-oriented DBMS architecture.
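The difference between the two layouts can be sketched with ordinary Python containers. The data is invented, and real systems add compression, indexing and on-disk formats that this toy version ignores.

```python
# Row-oriented layout: each record is kept together, which suits OLTP-style
# work that reads or writes one whole record at a time.
row_store = [
    {"id": 1, "region": "north", "amount": 10.0},
    {"id": 2, "region": "south", "amount": 20.0},
    {"id": 3, "region": "north", "amount": 5.0},
]

# Column-oriented layout: all values of one column are kept together, which
# suits analytic scans that touch only a few columns of many rows.
column_store = {name: [row[name] for row in row_store] for name in row_store[0]}

# Transactional access: fetch one complete record.
record = row_store[1]

# Analytic access: aggregate a single column without touching the others.
total_amount = sum(column_store["amount"])

print(record)
print(total_amount)
```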

Document-oriented, XML, and knowledge-base databases, as well as frame databases and RDF stores (also known as triple stores), may also use a combination of these architectures in their implementation.

Not all databases have or need a database schema ("schema-less databases").


Over many years general-purpose database systems have dominated the database industry. These offer a wide range of functions, applicable to many, if not most circumstances in modern data processing. These have been enhanced with extensible data types (pioneered in the PostgreSQL project) to allow development of a very wide range of applications.


There are also other types of databases which cannot be classified as relational databases. Most notable is the object database management system, which stores language objects natively without using a separate data definition language and without translating into a separate storage schema. Unlike relational systems, these object databases store the relationship between complex data types as part of their storage model in a way that does not require runtime calculation of related data using relational algebra execution algorithms.

Database management systems

A database management system (DBMS) consists of software that organizes the storage of data. A DBMS controls the creation, maintenance, and use of the database storage structures of social organizations and of their users. It allows organizations to place control of organization-wide database development in the hands of Database Administrators (DBAs) and other specialists. In large systems, a DBMS allows users and other software to store and retrieve data in a structured way.

Database management systems are usually categorized according to the database model that they support, such as the network, relational or object model. The model tends to determine the query languages that are available to access the database. One commonly used query language for relational databases is SQL, although SQL syntax and function can vary from one DBMS to another. A common query language for object databases is OQL; not all vendors of object databases implement it, but the majority do. A great deal of the internal engineering of a DBMS is independent of the data model, and is concerned with managing factors such as performance, concurrency, integrity, and recovery from hardware failures. In these areas there are large differences between the products.

A relational database management system (RDBMS) implements features of the relational model. In this context, Date's "Information Principle" states: "the entire information content of the database is represented in one and only one way. Namely as explicit values in column positions (attributes) and rows in relations (tuples). Therefore, there are no explicit pointers between related tables." This contrasts with the object database management system (ODBMS), which does store explicit pointers between related types.

Components of DBMS

According to the wikibooks open-content textbooks, "Design of Main Memory Database System/Overview of DBMS", most DBMS as of 2009 implement a relational model. Other less-used DBMS systems, such as the object DBMS, generally operate in areas of application-specific data management where performance and scalability take higher priority than the flexibility of ad hoc query capabilities provided via the relational-algebra execution algorithms of a relational DBMS.

RDBMS components

Interface drivers - A user or application program initiates either schema modification or content modification. These drivers are built on top of SQL. They provide methods to prepare statements, execute statements, fetch results, etc. Examples include DDL, DCL, DML, ODBC, and JDBC. Some vendors provide language-specific proprietary interfaces. For example, MySQL and Firebird provide drivers for PHP, Python, etc.

SQL engine - This component interprets and executes the SQL query. It comprises three major components (compiler, optimizer, and execution engine).


Transaction engine - Transactions are sequences of operations that read or write database elements, which are grouped together.

Relational engine - Relational objects such as Table, Index, and Referential integrity constraints are implemented in this component.

Storage engine - This component stores and retrieves data records. It also provides a mechanism to store metadata and control information such as undo logs, redo logs, lock tables, etc.
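The role of the transaction engine can be sketched with SQLite through Python's sqlite3 module. The account table, the balances and the transfer function are hypothetical. The two updates in a transfer are grouped so that either both take effect or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, balance REAL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        # The connection context manager commits on success and rolls back
        # if any statement inside the block raises an exception.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


succeeded = transfer(conn, 1, 2, 30.0)
overdraft = transfer(conn, 1, 2, 500.0)  # would make account 1 negative
balances = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
print(succeeded, overdraft, balances)
```

In the failed transfer the credit to the second account has already been executed when the debit is rejected, and the rollback undoes it, so the balances stay consistent.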

ODBMS components

Language drivers - A user or application program initiates either schema modification or content modification via the chosen programming language. The drivers then provide the mechanism to manage object lifecycle coupling of the application memory space with the underlying persistent storage. Examples include C++, Java, .NET, and Ruby.

Query engine - This component interprets and executes language-specific query commands in the form of OQL, LINQ, JDOQL, JPAQL, and others. The query engine returns language-specific collections of objects which satisfy a query predicate expressed with operators such as >, <, >=, <=, AND, OR, NOT, GroupBY, etc.

Transaction engine - Transactions are sequences of operations that read or write database elements, which are grouped together. The transaction engine is concerned with such things as data isolation and consistency in the driver cache and data volumes by coordinating with the storage engine.

Storage engine - This component stores and retrieves objects in an arbitrarily complex model. It also provides a mechanism to manage and store metadata and control information such as undo logs, redo logs, lock graphs,