Data Warehouse and Data Mining - Mahoto, Naeem Ahmed
Naeem Ahmed
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro
Email: [email protected]
Data Warehouse and Data Mining Lecture No. 04-06
Data Warehouse Architecture
[Figure: Data Warehouse. Labelled components: Operational Data, Data Warehouse, Access Tools, End Users]
Data Warehouse
• A data warehouse is a central, enterprise-wide database that contains information extracted from the operational data stores
• Operational system: a system used to process the day-to-day transactions of an organization
Operational Source Systems
• These are the operational systems of record that capture the transactions of the business
• These systems sit outside the data warehouse, which has no control over the content and format of their data
• The source systems maintain little historical data
• These systems generate operational data that is detailed, current and subject to change
Data Staging Area
• The work of the data staging area can be divided into three phases:
– Extraction (E)
– Transformation (T)
– Loading (L)
• Extraction: reading and understanding the source data and copying the data needed for the data warehouse into the staging area for further manipulation (i.e. transformation)
Data Staging Area
• Loading: populating the data warehouse with data that has been extracted from the operational systems
• Two types of load generally take place in a data warehouse environment:
– Initial load
– Incremental load
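The difference between the two load types can be sketched as follows. This is only a minimal illustration, not a description of any particular loader: the row layout, the `updated_at` change-tracking field and the function names `initial_load` and `incremental_load` are assumptions made for the example, and overwriting a changed row is a simplification.

```python
# Hypothetical source rows; `updated_at` is an assumed change-tracking field.
source_rows = [
    {"id": 1, "amount": 100, "updated_at": 1},
    {"id": 2, "amount": 250, "updated_at": 2},
]

def initial_load(source):
    """Populate an empty warehouse table with every source row."""
    return {row["id"]: dict(row) for row in source}

def incremental_load(warehouse, source, last_loaded_at):
    """Apply only rows changed since the previous load; return the new high-water mark."""
    changed = [row for row in source if row["updated_at"] > last_loaded_at]
    for row in changed:
        warehouse[row["id"]] = dict(row)  # insert new rows, overwrite changed ones
    return max([last_loaded_at] + [row["updated_at"] for row in changed])

# Initial load: the warehouse starts empty and receives everything.
warehouse = initial_load(source_rows)
watermark = max(row["updated_at"] for row in source_rows)

# Later, the source gains one updated row and one new row.
source_rows[0] = {"id": 1, "amount": 120, "updated_at": 3}
source_rows.append({"id": 3, "amount": 75, "updated_at": 4})

# Incremental load: only the two changed rows are moved.
watermark = incremental_load(warehouse, source_rows, watermark)
```

The initial load runs once, while the incremental load is repeated and touches only what changed since the last run.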
Data Staging Area
• Transformation: the transformation phase applies a series of rules or functions to the extracted/loaded data
• This may include some or all of the following:
– Select only certain columns to load (or, if you prefer, select null columns not to load)
– Translate coded values
– Derive a new calculated value (e.g. sale_amount = qty * unit_price)
– Denormalize the data to fit the data warehouse schema
– Summarize multiple rows of data (e.g. total sales for each region)
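Several of the transformation rules listed above can be sketched in a few lines. The rows, the region codes and the names `REGION_NAMES`, `KEEP` and `transform` are hypothetical and chosen only for illustration; the `sale_amount = qty * unit_price` rule is the one given on the slide.

```python
from collections import defaultdict

# Hypothetical extracted rows; all field names and codes are assumptions.
extracted = [
    {"region": "N", "qty": 2, "unit_price": 10.0, "clerk_note": "x"},
    {"region": "S", "qty": 1, "unit_price": 40.0, "clerk_note": "y"},
    {"region": "N", "qty": 3, "unit_price": 5.0, "clerk_note": "z"},
]
REGION_NAMES = {"N": "North", "S": "South"}  # lookup table for coded values
KEEP = ("region", "qty", "unit_price")       # columns selected for loading

def transform(row):
    out = {col: row[col] for col in KEEP}                # select only certain columns
    out["region"] = REGION_NAMES[out["region"]]          # translate coded values
    out["sale_amount"] = out["qty"] * out["unit_price"]  # derive a calculated value
    return out

transformed = [transform(row) for row in extracted]

# Summarize multiple rows: total sales for each region.
sales_by_region = defaultdict(float)
for row in transformed:
    sales_by_region[row["region"]] += row["sale_amount"]
```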
ETL versus ELT
• ETL (the traditional approach): ETL (Extract, Transform and Load) is a process in data warehousing that involves:
– Extracting data from outside sources
– Transforming it to fit business needs, and ultimately
– Loading it into the data warehouse
• ELT (the Teradata approach): the ELT (Extract, Load and Transform) strategy extracts and loads the data into a Teradata Database first, then uses the power and performance of the Teradata Warehouse to perform the transformation
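The difference in ordering between the two strategies can be sketched as below. SQLite is used here purely as a stand-in for the warehouse database so that the example runs anywhere; it is not Teradata, and the table names `raw_sales` and `sales` and the sample rows are assumptions.

```python
import sqlite3

raw = [("N", 2, 10.0), ("S", 1, 40.0)]  # hypothetical extracted rows

# ETL: transform in the staging program, then load the finished rows.
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE sales (region TEXT, sale_amount REAL)")
transformed = [(region, qty * price) for region, qty, price in raw]
etl_db.executemany("INSERT INTO sales VALUES (?, ?)", transformed)

# ELT: load the raw rows first, then let the database engine transform them.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_sales (region TEXT, qty INTEGER, unit_price REAL)")
elt_db.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw)
elt_db.execute(
    "CREATE TABLE sales AS "
    "SELECT region, qty * unit_price AS sale_amount FROM raw_sales"
)

# Both routes end with the same warehouse table contents.
query = "SELECT region, sale_amount FROM sales ORDER BY region"
etl_rows = etl_db.execute(query).fetchall()
elt_rows = elt_db.execute(query).fetchall()
```

In the ETL branch the multiplication runs in the staging program; in the ELT branch the same rule is expressed as SQL and executed inside the database after loading.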
Data Presentation Area
• Extended relational DBMS (ROLAP servers)
– Data stored in a relational database
– Star-join schemas
– Support for SQL extensions (e.g. CUBE)
– Index structures (bitmap, join)
• Multidimensional DBMS (MOLAP servers)
– Data stored in arrays (n-dimensional array)
– Direct access to the array data structure
– Poor storage utilization, especially when the data is sparse
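The storage trade-off of the array representation can be illustrated with a small sketch. The dimension members and the two sales facts are made up for the example, and a nested Python list stands in for the n-dimensional array of a real MOLAP server.

```python
# Hypothetical cube dimensions; the member names are assumptions.
products = ["P1", "P2", "P3", "P4"]
regions = ["North", "South", "East"]
months = ["Jan", "Feb"]

# MOLAP-style storage: one cell per combination, addressed by position.
cube = [[[0.0 for _ in months] for _ in regions] for _ in products]

def cell(product, region, month):
    """Direct access: translate member names to array indexes."""
    return cube[products.index(product)][regions.index(region)][months.index(month)]

# Only two combinations actually have sales, so the data is sparse.
facts = {("P1", "North", "Jan"): 20.0, ("P3", "East", "Feb"): 15.0}
for (p, r, m), amount in facts.items():
    cube[products.index(p)][regions.index(r)][months.index(m)] = amount

total_cells = len(products) * len(regions) * len(months)
utilization = len(facts) / total_cells  # fraction of cells holding real data
```

Here the array reserves 24 cells for 2 facts, whereas a relational fact table in a ROLAP server would hold only the 2 rows that exist.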
Data Access Tools
• Analysis / OLAP / DSS Tools
• Querying / Reporting Tools
• Data Mining
Data warehouse bus architecture
Warehouse components
Component: Operational Data
• The data for the data warehouse is supplied from:
– Mainframe systems holding data in the traditional network and hierarchical formats
– Relational DBMSs such as Oracle and Informix
– External sources: in addition to this internal data, operational data also includes data obtained from commercial databases and from databases associated with suppliers and customers
Component: Load Manager
• The load manager (also called the front-end component) performs all the operations associated with extracting and loading data into the data warehouse
• These operations include simple transformations of the data to prepare it for entry into the warehouse
• The size and complexity of this component vary between data warehouses; it may be constructed using a combination of vendor data loading tools and custom-built programs
Component: Warehouse Manager
• The warehouse manager performs all the operations associated with the management of data in the warehouse
• This component is built using vendor data management tools and custom-built programs
• The operations performed by the warehouse manager include:
– Analysis of data to ensure consistency
– Transformation and merging of the source data from temporary storage into data warehouse tables
– Creation of indexes and views on the base tables
– Generation of denormalizations
– Generation of aggregations
– Backing up and archiving of data
Warehouse Manager: Detailed Data
• This area of the warehouse stores all the detailed data in the database schema
• In most cases detailed data is not stored online but is aggregated to the next level of detail
• However the detailed data is added regularly to the warehouse to supplement the aggregated data
Warehouse Manager: Lightly and Highly summarized data
• This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager
• This area of the warehouse is transient as it will be subject to change on an ongoing basis in order to respond to the changing query profiles
• The purpose of the summarized information is to speed up the query performance
• The summarized data is updated continuously as new data is loaded into the warehouse
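A rough sketch of how summaries might be kept current as detail arrives is given below. The two summary levels (per region and month, per region) and the function name `load` are assumptions chosen for illustration, not the design of any specific warehouse manager.

```python
from collections import defaultdict

detail = []                         # detailed data
light_summary = defaultdict(float)  # assumed level: sales per (region, month)
high_summary = defaultdict(float)   # assumed level: sales per region

def load(rows):
    """Add new detail rows and bring both summary levels up to date."""
    for region, month, amount in rows:
        detail.append((region, month, amount))
        light_summary[(region, month)] += amount
        high_summary[region] += amount

load([("North", "Jan", 20.0), ("North", "Feb", 15.0), ("South", "Jan", 40.0)])
load([("North", "Jan", 5.0)])  # a later load updates the existing summaries

# A query on regional totals reads one summary entry instead of scanning detail.
north_total = high_summary["North"]
```

Answering from `high_summary` avoids re-reading every detail row, which is the performance benefit the slide refers to.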
Warehouse Manager: Archive and Back-up Data
• This area of the warehouse stores detailed and summarized data for the purpose of archiving and back-up
• The data is transferred to storage archives such as magnetic tapes or optical disks
Warehouse Manager: Meta Data
• The data warehouse also stores all the Meta data (data about data) definitions used by all processes in the warehouse
• It is used for a variety of purposes, including:
– The extraction and loading process: Meta data is used to map data sources to a common view of information within the warehouse
– The warehouse management process: Meta data is used to automate the production of summary tables
– The query management process: Meta data is used to direct a query to the most appropriate data source
• The structure of the Meta data differs for each process, because the purpose is different
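Two of these uses can be sketched with plain dictionaries. Everything named below (`FIELD_MAP`, `TABLE_COLUMNS`, the source and table names, and the rule of preferring the most summarized table) is a hypothetical illustration rather than the format of any real Meta data repository.

```python
# Hypothetical Meta data; every name below is an assumption for illustration.

# Extraction and loading: map source-specific fields to a common view.
FIELD_MAP = {
    "orders_db": {"cust_no": "customer_id", "amt": "sale_amount"},
    "legacy_file": {"CUSTOMER": "customer_id", "VALUE": "sale_amount"},
}

def to_common_view(source, record):
    """Rename a source record's fields to the warehouse's common names."""
    mapping = FIELD_MAP[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}

# Query management: describe which columns each table can answer,
# listed from most summarized to most detailed.
TABLE_COLUMNS = [
    ("sales_by_region", {"region", "sale_amount"}),
    ("sales_by_region_month", {"region", "month", "sale_amount"}),
    ("sales_detail", {"region", "month", "customer_id", "sale_amount"}),
]

def route_query(needed_columns):
    """Return the first (most summarized) table that covers the query."""
    for table, columns in TABLE_COLUMNS:
        if set(needed_columns) <= columns:
            return table
    raise LookupError("no table covers the requested columns")
```

For example, `route_query({"region", "sale_amount"})` selects the regional summary, while a query that needs `customer_id` falls through to the detail table. The two structures look different because they serve different processes, as the last bullet notes.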
Component: Query Manager
• The query manager (also called the back-end component) performs all the operations associated with the management of user queries
• This component is usually constructed using vendor end-user access tools, data warehouse monitoring tools, database facilities and custom-built programs
• The complexity of a query manager is determined by the facilities provided by the end-user access tools and the database
Component: End-user Access Tools
• The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making
• These users interact with the warehouse using end-user access tools
• Examples of end-user access tools include:
– Reporting and query tools
– Application development tools
– Executive information systems tools
– Online analytical processing tools
– Data mining tools
Warehouse Models and Operators
• Data Models
– Relations
– Stars & snowflakes
– Cubes
• Operators
– Slice and dice
– Roll-up, drill-down
– Pivoting
– Other
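The operators named above can be sketched on a tiny cube held in a dictionary. The cube contents, the dimension names and the helper functions `slice_`, `dice`, `roll_up` and `pivot` are assumptions for illustration; OLAP servers expose these operations through their own query languages and tools.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (product, region, month); values are sales.
cube = {
    ("P1", "North", "Jan"): 20.0,
    ("P1", "South", "Jan"): 40.0,
    ("P2", "North", "Jan"): 15.0,
    ("P2", "North", "Feb"): 5.0,
}
DIMS = ("product", "region", "month")

def slice_(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}

def dice(cells, **allowed):
    """Dice: keep cells whose members fall in the allowed sets."""
    return {
        k: v for k, v in cells.items()
        if all(k[DIMS.index(d)] in members for d, members in allowed.items())
    }

def roll_up(cells, keep):
    """Roll-up: aggregate away every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(float)
    for k, v in cells.items():
        totals[tuple(k[i] for i in idx)] += v
    return dict(totals)

def pivot(cells, row_dim, col_dim):
    """Pivot: lay two dimensions out as rows and columns of a table."""
    table = defaultdict(dict)
    for (r, c), v in roll_up(cells, (row_dim, col_dim)).items():
        table[r][c] = v
    return dict(table)

jan = slice_(cube, "month", "Jan")
north_p2 = dice(cube, region={"North"}, product={"P2"})
by_region = roll_up(cube, ("region",))                # coarser view
by_region_month = roll_up(cube, ("region", "month"))  # drill-down: month added back
region_by_month = pivot(cube, "region", "month")
```

Drill-down is shown here simply as re-aggregating at a finer level (`by_region_month` versus `by_region`), which is the reverse direction of roll-up.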