Benefits of Data Archiving in Data Warehouses


IBM Software
White Paper

Benefits of data archiving in data warehouses



Contents

Executive summary
Typical reasons for rapid data growth
Challenges associated with data warehouse growth
Traditional data growth solutions that do not work
Understanding data archiving
Benefits of data archiving
Guiding principles and technology requirements
Managing data growth responsibly with data warehouse archiving

Executive summary

Data warehouses are the pillars of business intelligence and analytics systems, often integrating data from multiple data sources in an organization to provide historical, current or even predictive analysis of the business. Information from multiple internal or external transactional systems is extracted, transformed and loaded into data warehouses as atomic data. This cumulative data and the analytics systems that leverage it provide the technology and methodology that help organizations discover and develop meaningful insights.

Due to the consolidated nature of data warehouses, these data stores often suffer from rapid growth. Typical reasons for this phenomenon include expansion of data warehouses with new subject areas or data marts, compounded data growth from organic or inorganic business growth, or a "let's keep it all, someone might need it" attitude toward historical data.

This unchecked data growth often results in ever-increasing infrastructure and operational costs, poor data warehouse performance, and an inability to support complex data retention and legal hold requirements.

A data archiving solution helps organizations address these challenges by allowing IT staff to intelligently move (and purge) historical and inactive data from production databases into a more cost-effective location while still providing the capabilities to query, search or even restore data if needed.

A tiered archiving strategy provides additional benefits in terms of managing performance and cost-effectiveness. Data archiving can also alleviate data growth issues by:

Removing or relocating inactive and dormant data out of the database to improve data warehouse performance

Reducing the infrastructure and operational costs typically associated with data growth

Leveraging proven policies and processes to cost-effectively manage multi-temperature data

Improving disaster recovery and backup/restore plans to consistently meet service-level agreements (SLAs)

Supporting compliance with data retention, purge or hold policies

This paper describes a data lifecycle management strategy for data warehouses that is designed to manage high-volume data growth cost-effectively and avoid performance degradation.



Typical reasons for rapid data growth

The data warehouse is commonly an organization's largest database. This is due to several factors:

Big data and the explosion in data volume: With the advent of big data technologies that help organizations generate insight from large information assets, companies are keeping unstructured and structured data that might have been thrown away in the past. Apache Hadoop and similar technologies continue to gain momentum and adoption, and will provide new ways of processing large amounts of such data, extracting intelligence from multi-structured data sources, and integrating the results into existing data warehouses for further analysis and reporting.

The "data tomb" effect: Data warehouses may become the dumping ground for historical data from various transactional systems, with little regard to the true value of the business intelligence within this "dead" data. This "data tomb" effect may be caused by the lack of an optimal archiving and data retention strategy in the originating transactional system itself.

Expansion into new subject areas: Companies frequently expand data warehouses with new subject areas and new data sources, making them part of a central repository for the enterprise or interconnected data marts. While this expansion can provide insights for crucial business activities, it can also lead to significant data expansion.




Business growth: Larger organizations are often subject to compounded data growth from mergers and acquisitions, as well as organic business growth. Consolidation of multiple implementations into one results in a larger system.

Lack of retention and disposal policies: Unfortunately, the business side of an organization may not provide IT teams with enough clarity on data retention and disposal policies. Most organizations have a "let's keep it all, someone might need it later" mentality for historical data, which prevents them from exploring cost-effective data retention, hold or purge processes.

Each of these factors provides an impetus for IT organizations to adopt data lifecycle management strategies and efficiently manage categories of data according to their value in a data warehousing architecture.

Challenges associated with data warehouse growth

High-volume data growth and large warehouse implementations present multiple IT challenges and business risks. While many data warehouse solutions and architecture choices exist in the market, every approach poses several common challenges (see Figure 1).

Cost of ownership

The impact of exponential data growth on infrastructure and operational costs can be huge, often taking up most of an organization's data warehousing budget. Larger amounts of data require larger capacity, resulting in more hardware and storage requirements, as well as higher costs to maintain, monitor and administer this infrastructure. Large data warehouses generally require bigger servers and appliances, which may also increase software licensing costs for the database, database tooling, integration or business intelligence (BI) tools.

[Figure 1. Performance and capacity challenges associated with data warehouse growth. Labels: performance, hardware capacity, database size.]



In addition, IT departments must factor in the costs of a mirrored disaster recovery system, the data backup infrastructure, processes to copy large data sets within the SLA window and replicas of the database across test environments.

Performance and availability

Large volumes of data and varying workloads can put a lot of stress on data warehouse systems. With a majority of production data typically in an inactive state, the performance and system availability of data warehouses suffer greatly as a result of unchecked data growth.

When the response time of critical queries and reporting processes starts to degrade, extract/transform/load (ETL) loads take longer and may extend past the SLA windows. Database backups run endlessly and the IT staff must operate in reactive mode to contain these issues. These situations pose a significant risk to business continuity and system availability, because downtime can result in a lengthy system recovery period.

Cost-effective compliance

Many data warehouses also feed data back into the transactional systems, acting as systems of record in these cases. These systems may be subject to audits, retention, legal hold or e-discovery requests. Simply purging historical data is not acceptable as a method for keeping up with data growth because compliance regulations may require data to be retained for a certain number of years, put on legal hold to satisfy discovery requests, or audited. Keeping all of the data in production databases is not a cost-effective way to retain data for compliance reasons. Also, if a data warehouse was used to make business decisions, it may be targeted for legal disclosure under e-discovery rules.

Traditional data growth solutions that do not work

IT organizations may try to use conventional methods for managing data growth, but these methods are often ineffective or fail to generate a cost-effective solution. Common techniques include:

Hardware upgrades: Trying to keep up with data growth through frequent hardware upgrades has a huge impact on capital expenditure. The traditional solution is to add more server nodes, or perform forklift upgrades to replace the data warehouse infrastructure. While hardware upgrades are inevitable, there are other ways to defer these costs and reap better performance from existing infrastructure, which may amount to huge savings.

Traditional backups: Large, monolithic backups are highly redundant, with historical and inactive data taking up most of the space. Backups are not substitutes for archives; archives are online or near-line and queryable. Backups cannot solve data growth problems because they require creating a replica of the production data, and need to be taken frequently (on a weekly or monthly basis), which adds more overhead to the growth problem. If IT teams use backups to archive data, it can be difficult to retrieve the data within a short period of time. Information retrieval also poses a challenge when the data schema in the original system has evolved.




Database partitioning: IT departments sometimes try to manage data growth by implementing a partitioning scheme in the traditional database management system (DBMS) to separate active data from historical data. However, partitioning in this way still may not reduce the overhead on the database because the indexes remain the same size. Partitioning does not help reduce overall storage costs and maintenance windows; it also makes it difficult to restore or re-create selective data records located in a dropped partition from the time when the database was on an older version. Certain analytical DBMSs don't even support database partitioning.

Homegrown solutions: Building a mature data archiving and purging solution in-house can be a very expensive and time-consuming effort. The scripts and code require proper handling of database referential integrity, error recoverability, high-performance execution and consistent application of business rules and policies across a potentially large number of systems. Despite the huge investment, these solutions are hard to maintain and do not provide much longevity in typical organizations where people and technology change regularly.

Purging data: In many industries, companies must keep large amounts of historical information (especially financial information) for compliance reasons. Data is subject to the same SLAs, including those for data retention, as the transactional system itself, and for that reason must be covered by information lifecycle policies for standard corporate data.

Understanding data archiving

Data lifecycle management is a policy- and process-oriented approach to efficiently control the flow of an information system's data throughout its lifecycle, from requirement to retirement. Data lifecycle management policies include ensuring optimal application performance and archiving historical data to manage data growth while ensuring access to both production and archived data. Before archiving data, it is important to classify everything based on usage activity.

Data assessment and classification

It is not uncommon for organizations to have millions or even billions of records across different fact tables that hold many years of accumulated information. However, it is quite common for users and DBAs to find that the most active data is typically located within the last six months to two years of transactions. Anything earlier is queried infrequently.

Data in the warehouse can be classified according to its "temperature": the access frequency, volatility and query performance of the data. "Hot" data is frequently accessed and updated, and users expect optimal performance when accessing this data. As data ages, it tends to "cool off," meaning that the probability of users accessing this data significantly decreases.



Archiving typically targets cold data and relocates it to a more cost-effective storage medium (see Figure 2). However, the data must still be available for regulatory requests, audits and long-term analysis, so the archived data should be queryable and restorable (in the original location or a staged location). Data assessment and classification based on business usage is an important factor in an effective archiving strategy.
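
As a rough illustration of temperature-based classification, the following Python sketch assigns a tier to a record from its age alone. The function name and the one-year and three-year thresholds are assumptions made for this example and do not come from the paper. Age is only a proxy: a real classification would also weigh observed access frequency and volatility.

```python
from datetime import date

# Hypothetical age thresholds in days; real values depend on observed usage.
HOT_MAX_DAYS = 365        # roughly the most recent year
WARM_MAX_DAYS = 3 * 365   # roughly one to three years old


def classify_temperature(record_date: date, today: date) -> str:
    """Return 'hot', 'warm' or 'cold' based only on the record's age."""
    age_days = (today - record_date).days
    if age_days <= HOT_MAX_DAYS:
        return "hot"
    if age_days <= WARM_MAX_DAYS:
        return "warm"
    return "cold"
```

Under these assumed thresholds, records classified as cold would be the first candidates for relocation to a lower-cost tier.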

Data archiving

Archiving in its simplest form involves the migration of information or data (typically historical) from an online application to a secondary (online, near-line or offline) system, making it accessible as a long-term storage repository. As a recognized information lifecycle management best practice, archiving segregates inactive application data from current activity and safely moves it to a different tier based on its value to the business.

Consequently, smaller databases tend to deliver higher service levels with lower maintenance and operational overhead.

[Figure 2. Multi-temperature data classification based on access requirement. Temperature tiers: hot, warm, cold, colder, coldest. Time axis: current, year 1, year 3, year 5, year 7. Access types: update access, reporting access, ad hoc access.]




Archiving in dimensional data warehousing or data marts

Data warehousing uses different methods of data modeling. One popular approach, dimensional data warehousing, involves fact and dimension tables, whereas others use a more normalized data model. There are two types of history tracking in dimensional data warehousing:

1. Fact data changes: Granular fact records about a business event (such as a sale or transaction) are linked to a certain point in time. They are history-tracked and grow in large numbers over time. These high-volume, historical and detailed records are good candidates for archiving.

2. Dimension data changes: Data in dimension tables may also change over time and is known as slowly changing dimension (SCD) data. In this case, attribute changes in a dimension, such as a customer's phone number or address, may be tracked and often result in a sizeable amount of historical data. The larger the dimension tables in volume and number of attributes, the larger the data grows in SCD records. However, fact records typically grow faster than SCD records.

Tiered storage archiving strategies

Database archiving involves extracting a predefined set of historical data (often time-based) from a set of tables while maintaining its referential integrity; moving this data set into either a secondary archive data warehouse or a file-based data archive; and purging the historical transactional data from the source database. For higher query performance and access to larger volumes of data, warm data may be stored in another data warehouse instance, ideally on a lower-cost infrastructure. For rarely accessed data, storing this cold data in compressed and queryable data archive files may provide a more cost-effective solution compared to higher-tier storage.
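
The extract, move and purge sequence can be sketched in a few lines of Python. This is a minimal illustration, not how any particular archiving product works: it uses the standard-library sqlite3 module with two in-memory databases standing in for the production and archive stores, and the table name `sales`, its columns and the function `archive_before` are all hypothetical.

```python
import sqlite3


def archive_before(conn: sqlite3.Connection, cutoff: str) -> int:
    """Copy sales older than `cutoff` (ISO date) to archive.sales, then purge them.

    Returns the number of rows moved. The copy and the purge run in one
    transaction on this connection, so an error rolls both back.
    """
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO archive.sales SELECT * FROM main.sales WHERE sale_date < ?",
            (cutoff,),
        )
        cur = conn.execute("DELETE FROM main.sales WHERE sale_date < ?", (cutoff,))
        return cur.rowcount


conn = sqlite3.connect(":memory:")                     # stand-in for production
conn.execute("ATTACH DATABASE ':memory:' AS archive")  # stand-in for the archive tier
conn.execute("CREATE TABLE main.sales (id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)")
conn.execute("CREATE TABLE archive.sales (id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO main.sales VALUES (?, ?, ?)",
    [(1, "2009-03-01", 10.0), (2, "2011-07-15", 20.0), (3, "2012-11-30", 30.0)],
)
conn.commit()

moved = archive_before(conn, "2012-01-01")
```

The time-based cutoff mirrors the "often time-based" selection described above. This sketch handles a single table; related tables would need to move together to preserve referential integrity.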

Organizations may leverage a combination of these archive stores to balance access performance requirements and cost-effectiveness (see Figure 3). The archived systems would leverage lower-cost storage devices such as Serial ATA (SATA), network-attached storage (NAS), content-addressable storage (CAS), optical disks, tapes or cloud storage.

[Figure 3. A three-tier archiving strategy designed to optimize cost-effectiveness and performance of specific data sets on different tiers of storage. Components: production data warehouse (hot data, tier 1; current and historical data), archive data warehouse (warm data, tier 2; historical data), data archive files (cold data, tier 3; contextual data). Complete data sets move between tiers through archive and restore operations.]



Access to archived data

While archiving strategy and architecture may look different for each implementation, there may be infrequent requirements to access the archived data. Archiving removes data from the production system, but this data is not lost; it is simply relocated based on its business value. In cases where a separate instance of an archive data warehouse is used, the queries could be directed to the archive instance directly. For scenarios where combined reporting with production data is needed, data federation technologies could be leveraged as well.
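
To make combined reporting concrete, the sketch below runs one query across a production store and an archive store. It is only a stand-in for a data federation layer: SQLite's ATTACH DATABASE, reached through Python's standard sqlite3 module, plays that role here, and the `sales` tables and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                     # stand-in for production
conn.execute("ATTACH DATABASE ':memory:' AS archive")  # stand-in for the archive instance
conn.execute("CREATE TABLE main.sales (sale_year INTEGER, amount REAL)")
conn.execute("CREATE TABLE archive.sales (sale_year INTEGER, amount REAL)")
conn.executemany("INSERT INTO main.sales VALUES (?, ?)", [(2012, 30.0), (2012, 5.0)])
conn.executemany("INSERT INTO archive.sales VALUES (?, ?)", [(2009, 10.0), (2011, 20.0)])

# One report spans both stores; UNION ALL keeps every row from each side.
rows = conn.execute(
    """
    SELECT sale_year, SUM(amount) AS total
    FROM (SELECT sale_year, amount FROM main.sales
          UNION ALL
          SELECT sale_year, amount FROM archive.sales)
    GROUP BY sale_year
    ORDER BY sale_year
    """
).fetchall()
```

The report sees current and archived years side by side, although the archived rows no longer occupy space in the production tables.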

The data archive files created by the archiving solution should allow access using industry-standard interfaces such as ODBC/JDBC, XML or SQL, via any standard reporting tool. Users can then browse or search the archives using browser-based or other standard reporting mechanisms for auditing or compliance reasons. For heavier analytical requirements on larger sets of historical data, the archiving solution should allow users to restore archived data sets back to the original location or a staged location. In general, because archived data is infrequently accessed, restoring data is rarely required.

Benefits of data archiving

Lower total cost of ownership

Data archiving can have a great impact on reducing total cost of ownership for the data warehouse and help with IT cost-savings initiatives. By deferring hardware upgrades in production and disaster recovery environments, archiving enables companies to make the most efficient use of the existing infrastructure in a controlled data growth environment. Archiving strategies help reduce the amount of data DBAs must actively manage and the amount of time they spend tuning or adjusting storage requirements, freeing them up to focus on more strategic projects. Archiving also holds the potential to reduce software costs (such as warehouse and database licensing costs) associated with larger data warehouses.

By leveraging lower-cost storage tiers or lower-cost data warehouse appliances with a tiered archiving strategy, organizations can purge inactive historical data once it has been archived, reclaiming space in production data warehouse servers. In addition, archiving helps control the cost of capital and operational expenses related to database backup processes because the redundant and static historical data will be reduced in periodic backups.

Improved performance and availability

Archiving and purging inactive data helps significantly improve query performance by reducing the amount of data and the number of indexes and table scans that must be processed. Smaller data warehouses also perform better with batch processing, long-running reports and ETL jobs, avoiding overruns into other production usage requests. Archiving makes performing periodic maintenance tasks easier and faster, and it streamlines restoration from backups in the event of a failure for better system uptime and user productivity.

"Organizations which fail to deploy strategies to address data complexity and volume issues for their analytics by 2012 will experience more than doubling costs of ownership for their data warehouse and mart environments in disorganized attempts to meet this new demand."

"Does the 21st-Century Big Data Warehouse Mean the End of the Enterprise Data Warehouse?" 25 August 2011, Gartner




Streamlined risk and compliance management

Data archiving helps organizations comply with data retention and purge policies while providing queryable archives for audit or e-discovery requests into historical data. The technology and processes also support data legal-hold requirements. Plus, archiving enables organizations to apply business policies to govern data retention and disposal and provides long-term solutions for storing historical data.
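
How retention, purge and legal-hold policies interact can be shown with a small Python sketch. The `ArchivedRecord` class, the `disposition` function and the retention period passed to it are hypothetical names and values chosen for illustration; actual retention periods and hold rules come from an organization's legal and compliance teams.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ArchivedRecord:
    record_id: int
    created: date
    legal_hold: bool = False


def disposition(record: ArchivedRecord, today: date, retention_years: int) -> str:
    """Return 'hold', 'retain' or 'purge' for an archived record."""
    if record.legal_hold:
        return "hold"  # a legal hold overrides the retention schedule
    target_year = record.created.year + retention_years
    try:
        expires = record.created.replace(year=target_year)
    except ValueError:
        # created on 29 February and the target year is not a leap year
        expires = record.created.replace(year=target_year, day=28)
    return "purge" if today >= expires else "retain"
```

In this sketch a record becomes eligible for purging only once its retention period has elapsed and no hold applies, which is the ordering the policies above imply.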

Guiding principles and technology requirements

An enterprise-grade data archiving solution should meet four key technology requirements:

1. Enterprise architecture

Most enterprises rely on heterogeneous information assets, solutions and platforms from multiple vendors. A single, scalable data lifecycle management solution must support all of these major technologies, providing a common and reusable interface and processes. The solution should also be optimized for high-performance connectivity to multiple data warehousing solutions (such as IBM PureData System for Analytics, which leverages IBM Netezza technology; IBM DB2 and IBM InfoSphere Warehouse; IBM Informix; Teradata; Oracle; Microsoft SQL Server; and Sybase) with support for major operating systems including IBM z/OS, IBM i, Linux, UNIX and Microsoft Windows.

Such an enterprise solution should also support a tiered storage architecture for optimal balance between storage cost, performance and access requirements. Pre-built integration with hierarchical storage management (HSM) systems like IBM Tivoli or EMC Centera also helps ease implementation of a tiered archive strategy.

IBM InfoSphere Optim: A single, enterprise-scale data lifecycle management solution

IBM InfoSphere Optim software provides a central data management solution designed to scale to meet enterprise needs. Whether addressing a single application, a data warehouse environment or a global data center, organizations can use InfoSphere Optim solutions to streamline data management with a consistent strategy.

The unique relationship engine in InfoSphere Optim provides a single point of control to guide data processing activities such as archiving, subsetting, migrating and retrieving data. Reusable data management templates enable consistency and scalability, while advanced security features provide support for role-based access and activity permissions.

InfoSphere Optim supports major data warehouse environments, including IBM PureData System for Analytics, IBM InfoSphere Warehouse, Teradata and Oracle. It also supports enterprise databases and operating systems, including IBM DB2, IBM Informix, IBM IMS, IBM Virtual Storage Access Method (VSAM), IBM z/OS, Oracle Database, Sybase, Microsoft SQL Server, Microsoft Windows, UNIX and Linux. In addition, InfoSphere Optim supports key ERP and CRM packaged applications such as Oracle E-Business Suite, PeopleSoft Enterprise, JD Edwards EnterpriseOne, Siebel CRM, Amdocs CRM and SAP applications, as well as many custom applications.



2. Complete business objects

From a database perspective, a business object represents a group of related rows from related tables across one or more applications, together with its related metadata (information about the structure of the database and about the data itself). Capturing the complete business object offers a complete view of the business activity surrounding a particular transaction.

Data warehouses are required to represent these relationships accurately, whether in a star schema, snowflake or hybrid data model. When the high-level entity, such as an order, is archived, the corresponding line items should be archived as well. If this does not happen, then data integrity is lost. Such connections form a complete business object, so enterprises should look for archiving software that represents and preserves such complex entities in a simple, easy-to-manage and high-performing way.
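
The order and line-item example can be sketched as follows. This is a minimal illustration using Python's standard sqlite3 module; the tables `orders`, `items` and their archive counterparts, along with the helper functions, are hypothetical and chosen only to show a parent and its children moving as one unit.

```python
import sqlite3


def archive_order(conn: sqlite3.Connection, order_id: int) -> None:
    """Move one order and all of its line items to the archive tables together."""
    with conn:  # one transaction: the parent and its children move as a unit
        conn.execute("INSERT INTO orders_archive SELECT * FROM orders WHERE id = ?", (order_id,))
        conn.execute("INSERT INTO items_archive SELECT * FROM items WHERE order_id = ?", (order_id,))
        # Delete children before the parent so no item is left without its order.
        conn.execute("DELETE FROM items WHERE order_id = ?", (order_id,))
        conn.execute("DELETE FROM orders WHERE id = ?", (order_id,))


def orphaned_items(conn: sqlite3.Connection) -> int:
    """Count production line items whose parent order is no longer in production."""
    sql = "SELECT COUNT(*) FROM items WHERE order_id NOT IN (SELECT id FROM orders)"
    return conn.execute(sql).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, placed TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, placed TEXT);
    CREATE TABLE items_archive (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, '2009-03-01'), (2, '2012-11-30');
    INSERT INTO items VALUES (10, 1, 'A'), (11, 1, 'B'), (12, 2, 'C');
    """
)
archive_order(conn, 1)
```

A check such as `orphaned_items` is one simple way to confirm after an archive run that no line item was left behind without its order.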

3. Discovery and understanding data structures

To archive complete business objects, enterprises need archiving solutions with robust data discovery and metadata mining capabilities. The solution should be able to discover, analyze and document data models with accurate schema and data relationships from data warehousing systems in multiple ways. It should allow IT staff to reverse-engineer a model from an existing source database by mining the database catalog. Without this, the data model representation would have to be built manually. If there is no physical or documented data model representation, the solution should have automated capabilities for analyzing data values and data patterns to identify relationships, offering greater accuracy and reliability than manual analysis.

Archiving solutions must also provide the ability to import an existing logical model and make changes to it for database archiving. They should also provide an easy way for IT staff to incorporate logical data relationships manually for any custom relationships not represented at the physical layer.
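
Mining the database catalog for relationships can be illustrated with SQLite, whose catalog is reachable from Python's standard sqlite3 module through the `sqlite_master` table and the `PRAGMA foreign_key_list` statement. The function name `discover_relationships` and the example tables are hypothetical, and other database systems expose their catalogs through different views. The sketch finds only declared foreign keys; relationships that exist only in application logic would still need to be added manually, as noted above.

```python
import sqlite3


def discover_relationships(conn: sqlite3.Connection) -> list:
    """Return (child_table, child_column, parent_table, parent_column) per declared foreign key."""
    found = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Result columns: id, seq, table, from, to, on_update, on_delete, match
        for fk in conn.execute(f"PRAGMA foreign_key_list('{table}')"):
            found.append((table, fk[3], fk[2], fk[4]))
    return found


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER REFERENCES orders(id));
    """
)
relationships = discover_relationships(conn)
```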

4. Universal access to archives

The archiving solution should offer universal access to archived data using industry-standard interfaces such as ODBC/JDBC, XML or SQL, and reporting tools using these interfaces, such as IBM Cognos, SAP Crystal Reports, Microsoft Excel and others.

Managing data growth responsibly with data warehouse archiving

Data warehouses should not be allowed to grow into large, expensive historical data repositories. Managing data growth with data warehouse archiving helps reduce costs, improve performance and increase availability for business-critical analytics and BI solutions while maintaining compliance with data retention requirements. Together with IBM, organizations can make a case for archiving in their data warehouse implementations and evaluate the business value of managing data growth.

For more information

To learn more about IBM data archiving solutions and best practices, contact your IBM representative or visit: ibm.com/software/data/optim



Copyright IBM Corporation 2013

IBM Corporation
Software Group
Route 100
Somers, NY 10589

Produced in the United States of America
February 2013

IBM, the IBM logo, ibm.com, Cognos, DB2, IMS, Informix, InfoSphere, Optim, PureData, Tivoli and z/OS are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml

Netezza is a trademark or registered trademark of IBM International Group B.V., an IBM Company.

Linux is a registered trademark of Linus Torvalds in the United States, other countries or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

The client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions. THE INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS" WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

The client is responsible for ensuring compliance with laws and regulations applicable to it. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the client is in compliance with any law or regulation.

IMW14686-USEN-00