
Second Quarter 2014

tdwi.org

Evolving Data Warehouse Architectures In the Age of Big Data

By Philip Russom

BEST PRACTICES REPORT

TDWI RESEARCH

Co-sponsored by:


© 2014 by TDWI (The Data Warehousing InstituteTM), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to [email protected]. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

By Philip Russom

Table of Contents

Research Methodology and Demographics 3

Executive Summary 4

Introduction to Data Warehouse Architectures 5

Various Forms and Components of Data Warehouse Architectures 5

Business and Technology Drivers behind Architectural Evolution 8

Why Is DW Architecture Important? 10

Problems and Opportunities for Data Warehouse Architectures 12

Benefits of Effective Data Warehouse Architectures 13

Barriers to Success with Data Warehouse Architectures 14

Multi-Platform Data Warehouse Environments 16

Workload-Centric DW Architecture 16

Distributed DW Architecture 17

A Distributed DW Architecture Is Both Good and Bad 17

From the Single-Platform EDW to the Multi-Platform DWE 18

Architectural Strategies for EDWs versus DWEs 18

Hadoop’s Roles in Data Warehouse Architectures 20

Promising Uses of Hadoop That Impact DW Architectures 20

Integrating HDFS with an RDBMS Alleviates the Limitations of Both 22

Analytics and Reporting Have Different Requirements for DW Architecture 24

DW Architectures for Reporting 24

DW Architectures for Analytics 24

Trends in Data Warehouse Architectural Components 25

Components Used Most Commonly Today 27

Components to Be Adopted Within Three Years 28

Components Excluded From Users’ Plans 29

Vendor Platforms and Tools as DW Architectural Components 31

Top Ten Priorities for Data Warehouse Architectures 34


About the Author

PHILIP RUSSOM is a well-known figure in data warehousing and business intelligence, having published over 500 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Today, he’s the TDWI Research Director for Data Management at The Data Warehousing Institute (TDWI), where he oversees many of the company’s research-oriented publications, services, and events. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at [email protected], @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.

About TDWI

TDWI, a division of 1105 Media, Inc., is the premier provider of in-depth, high-quality education and research in the business intelligence and data warehousing industry. TDWI is dedicated to educating business and information technology professionals about the best practices, strategies, techniques, and tools required to successfully design, build, maintain, and enhance business intelligence and data warehousing solutions. TDWI also fosters the advancement of business intelligence and data warehousing research and contributes to knowledge transfer and the professional development of its members. TDWI offers a worldwide membership program, five major educational conferences, topical educational seminars, role-based training, on-site courses, certification, solution provider partnerships, an awards program for best practices, live Webinars, resourceful publications, an in-depth research program, and a comprehensive website, tdwi.org.

About the TDWI Best Practices Reports Series

This series is designed to educate technical and business professionals about new business intelligence technologies, concepts, or approaches that address a significant problem or issue. Research for the reports is conducted via interviews with industry experts and leading-edge user companies and is supplemented by surveys of business intelligence professionals.

To support the program, TDWI seeks vendors that collectively wish to evangelize a new approach to solving business intelligence problems or an emerging technology discipline. By banding together, sponsors can validate a new market niche and educate organizations about alternative solutions to critical business intelligence issues. Please contact TDWI Research Director Philip Russom ([email protected]) to suggest a topic that meets these requirements.

Acknowledgments

TDWI would like to thank the many people who contributed to this report. First, we appreciate the many users who responded to our survey, especially those who agreed to our requests for phone interviews. Second, we thank our report sponsors, who diligently reviewed outlines, survey questions, and report drafts. Finally, we would like to recognize TDWI’s production team: Jennifer Agee, Michael Boyda, and Denelle Hanlon.

Sponsors

Actian, Cloudera, Datawatch Corporation, Dell Software, HP Vertica, and MapR Technologies sponsored the research and writing of this report.



Research Methodology and Demographics

Report Purpose This report educates users about the many directions in which data warehouse architectures are evolving. Big data is a major driver of change, with its burgeoning size, sources, frequency of delivery, and diversity of structures. In addition, the adoption of advanced analytics and real-time operation is equally influential on DW architectures. To assist users, many new products and technologies have recently arrived from software vendors and the open source community. This report describes all of the above and more.

Terminology For the purposes of this report, big data is characterized as very large data sets (multi-terabyte or larger) that usually consist of a wide range of data types (relational, multi-structured data, text, and so on) from numerous sources, including relatively new ones (Web applications, machines, sensors, and social media).

Survey Methodology In November 2013, TDWI sent an invitation via e-mail to the data management professionals in its database, asking them to complete an Internet-based survey. The invitation was also posted on Web pages and in newsletters and publications from TDWI and other firms. The survey drew 720 responses. From these, we excluded incomplete responses and respondents who identified themselves as academics or vendor employees. The complete responses of the remaining 538 respondents form the core data sample for this report.

Research Methods In addition to the survey, TDWI Research conducted many telephone interviews with technical users, business sponsors, and recognized data management experts. TDWI also received briefings from vendors that offer products and services related to data warehouse architectures and big data.

Survey Demographics The majority of survey respondents are IT professionals (76%), whereas the others are consultants (18%) and business sponsors or users (6%). We asked consultants to fill out the survey with a recent client in mind.

The financial services industry (15%) dominates the respondent population, followed by consulting (14%), software/Internet (10%), healthcare (8%), insurance (8%), and other industries. Most survey respondents reside in the U.S. (47%) or Europe (21%). Respondents are evenly distributed across all sizes of organizations.

Position

Corporate IT professionals 76%

Consultants 18%

Business sponsors/users 6%

Industry

Financial services 15%

Consulting/professional services 14%

Software/Internet 10%

Healthcare 8%

Insurance 8%

Telecommunications 6%

Manufacturing (non-computers) 6%

Retail/wholesale/distribution 6%

Education 5%

Government: state/local 3%

Utilities 3%

Other 16%

(“Other” consists of multiple industries, each represented by 2% or less of respondents.)

Geography

United States 47%

Europe 21%

Asia 9%

Canada 9%

Australia/New Zealand 6%

Mexico, Central, or South America 5%

Middle East 2%

Africa 1%

Company Size by Revenue

Less than $100 million 19%

$100–500 million 15%

$500 million–$1 billion 12%

$1–5 billion 15%

$5–10 billion 8%

More than $10 billion 15%

Don’t know 16%

Based on 538 survey respondents.


Executive Summary In the early days of data warehousing, most data warehouses (DWs) were centered around a single-instance database, plus a few “edge systems” for data marts, operational data stores (ODSs), and data staging. Over the years, TDWI has seen a strong trend toward more and more edge systems, many of them designed by vendors or users for workloads for which the average DW is not optimized, especially workloads for new forms of big data, processing for advanced analytics, managing tens of terabytes of source data, and data for real-time operations.

The modern data warehouse environment (DWE) still includes an enterprise data warehouse (EDW), but the EDW is complemented by several other types of data platforms. A DWE includes the usual marts, ODSs, and staging areas, as well as newer standalone platform types for DW appliances, columnar databases, NoSQL databases, Hadoop, real-time technologies, and various analytic tools. Of course, some organizations have multiple warehouses, as well. Given the rising complexity, data warehouse architecture is more critical than ever in order to make sense of, govern, and optimize the complicated multi-platform DWEs that many user organizations are building.

When asked what components a DW architecture should include, users answering this report’s survey identified data standards (71%), a logical design (66%), and a physical plan (56%). The user consensus is that DW architecture includes multiple layers and components, and success usually entails balancing a logical design (mostly data models and standards) with a physical plan (mostly a portfolio of servers and how they integrate).

The leading drivers for change in DW architectures are advanced analytics (57%), big data management and leverage (56%), and real-time operations (41%). Eighty-four percent of respondents feel that DW architecture is an opportunity when it addresses these drivers. The top barrier to success is the lack of big data and advanced analytics skills on most DW teams.

Completely pure DW architectures are rare. For example, only 15% of respondents have a single EDW with no additional data platforms. Even when a DW has a well-defined architecture, users will ignore the plan for some data domains, platforms, or workloads. That’s why 31% have a core EDW, but with additional platforms in the DW environment, typically for analytic, big data, and real-time workloads. An increasing number of DW architectures are hybrids; for example, it’s common that Inmon and Kimball approaches are applied within the same DW environment.

A small but prominent community of DW professionals is integrating the Hadoop Distributed File System (HDFS) and other Hadoop tools into their DW environments. There are many areas within a DW architecture where HDFS shows promise, namely data staging areas, data archives, and repositories of big data for analytics. Hadoop is promising because it handles data types that the average DW cannot, namely unstructured data (text, audio, video) and semi-structured documents (XML, JSON). Yet, relational technologies are still required for most reporting, OLAP, and performance management functions supported by the average DW, so relational databases, SQL, and other relational technologies are not imperiled by Hadoop. Instead, TDWI expects them to be commonly integrated with Hadoop technologies in DW environments as early as 2016.

To help users prepare for new DW architectures, this report quantifies trends in data warehouse architectures and catalogs newly available, relevant technologies. The report also documents how successful organizations are evolving their architectures to leverage new business opportunities for big data. The goal is to provide data warehouse professionals and their business counterparts with the information they need before planning the next generation of their logical data warehouse architecture and its physical deployment.

For years, data warehouses have been trending toward multi-platform environments.

New platforms in the DW environment include Hadoop, NoSQL databases, columnar databases, and DW appliances.

Most architectures are built as two primary layers: logical and physical.

Multi-platform DW environments still have an EDW, but it’s complemented by other, newer platform types.

TDWI expects Hadoop and similar platforms to be in common use in DW environments within three years.


Introduction to Data Warehouse Architectures

Various Forms and Components of Data Warehouse Architectures

“Architecture” means many things to many people, especially when the term is applied to IT systems. The multiple definitions and forms of architecture are a natural consequence of the fact that IT systems almost always have an explicit overall architecture, and many system layers and components have their own architectures, as well. For example, many data warehouses are designed as multiple architectures that layer atop each other and work together:

• Logical architecture—mostly data models and their relationships, with a focus on how these represent organizational entities and processes

• Data standards—including standards for data modeling, data quality metrics, interfaces for data integration, programming style, format standards (e.g., 5-digit versus 9-digit ZIP codes), etc.

• Physical architecture—mostly a plan for deploying data and data structures on the servers of the system architecture

• System architecture—a topology of hardware servers and software servers, plus the interfaces and networks that tie them together

In addition, DW architectures usually integrate with or overlap with related architectures for data integration (ETL, data sync, replication), business intelligence (reporting and analytics), and enterprise data (typically source data that exists within the application architectures of operational systems).

Note that all these forms of architecture are complementary and never mutually exclusive. This means that it’s normal for multiple architectures to apply simultaneously, even in a single IT system, such as a data warehouse. It’s linguistically convenient for data warehouse professionals to call the combination of logical and physical layers “the data warehouse architecture.” Whether you consider it one architecture—or two or four—is a matter of semantic hairsplitting. The important point is that all four layers and components serve different purposes, have unique requirements, and must work together for a successful warehouse.

To get a sense of how data warehouse professionals and related personnel define architecture nowadays—and with which priorities—this report’s survey asked: “What do you think DW architecture is?” (See Figure 1, page 7.)

Standards and rules This is the most common component of DW architecture identified by survey respondents (71%). As one respondent put it, DW architecture must include: “standards and guidelines for the interaction of different platforms and data structures.” The standards typically guide how data is modeled, the formation of data quality metrics, metadata requirements, development methods, programming style, and which interfaces are to be used in which data integration situations. Some standards are more like business rules that ensure data integrity, such as “inventory quantity cannot be negative” or “a patient cannot be both male and pregnant.”

Not all architects agree that data standards and rules are truly architectural. To some, they are an adjunct to architecture, whereas others consider them integral to the logical layer of architecture.
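Business rules like those above need not live only in standards documents; they can also be enforced declaratively in the database itself. The following is a minimal sketch (using Python’s built-in sqlite3 module, with hypothetical table and column names) of the report’s two example rules expressed as CHECK constraints:

```python
import sqlite3

# Hypothetical illustration: the report's example business rules
# ("inventory quantity cannot be negative"; "a patient cannot be both
# male and pregnant") expressed as declarative CHECK constraints.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory (
    item_id  INTEGER PRIMARY KEY,
    quantity INTEGER NOT NULL CHECK (quantity >= 0)
);
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    gender     TEXT NOT NULL,
    pregnant   INTEGER NOT NULL DEFAULT 0,
    CHECK (NOT (gender = 'M' AND pregnant = 1))
);
""")

conn.execute("INSERT INTO inventory VALUES (1, 40)")      # passes the rule
try:
    conn.execute("INSERT INTO inventory VALUES (2, -5)")  # violates the rule
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

When a rule is declared this way, the database engine rejects any load that violates it, so the standard is enforced uniformly across ETL jobs and ad hoc inserts alike.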

DWs are often designed as multiple architectural layers and components that work together

Data standards and business rules are the leading priorities for data architecture


Logical design Every well-designed data warehouse begins with a carefully planned logical design that focuses on the DW’s data structures and the patterns or relationships among them (66%). Likewise, all updates and versions of a DW should start with a revision of the logical design, for two reasons: The logical design is where the architecture is aligned to business requirements, and the physical and system layers are loosely based on the logical design.

According to a respondent, DW architecture “is a structure aimed at giving the users multi-sectional views of the information through both relational and dimensional models.” That comment reminds us that most DW professionals consider data modeling to be a primary skill for architecting a data warehouse—perhaps the only skill. Although data modeling is very important—and it’s especially prominent in the logical layer—it’s not all there is to designing a data warehouse architecture.

Physical plan Focused on deploying data, data structures, and data platforms (56%), this is where the architect decides where the components of the logical design will go in the physical world. It’s not necessarily a one-to-one mapping; many tables or marts may be specified in the logical plan, but they need not all be in one database or on one server in the physical deployment. Likewise, a single data structure on the logical layer may be normalized into multiple components on the physical level for the sake of performance or to avoid atomic redundancy.1

As a different example, some data is updated so frequently that any aggregates or calculations based on it should be done at run time, as needed, in a federated or virtual manner, instead of ahead of time. This is sometimes called “virtual data warehousing” or “the logical data warehouse,” because it seems like the logical layer exists without the physical one. In a sense, physicality still exists, though not instantiated until run time.
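The run-time approach described above can be illustrated with an ordinary database view. Below is a minimal sketch (table and column names are hypothetical), where the aggregate is never stored; it is computed from the freshest detail rows each time it is queried:

```python
import sqlite3

# Minimal sketch of "virtual" aggregation: no summary table is
# materialized, so every query of the view reflects current detail data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
-- The virtual layer: the rollup exists only as a query definition.
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region;
""")
conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)",
                 [("East", 100.0), ("East", 50.0), ("West", 75.0)])
print(dict(conn.execute("SELECT region, total FROM sales_by_region")))
```

Data federation and virtualization tools generalize this idea across multiple physical platforms, which is why the result is sometimes called a logical data warehouse.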

Some DW architects prefer to divide the physical plan into two layers: a plan for how data structures will be deployed and a separate but closely related plan or topology for hardware and software servers (which also includes how the servers integrate and interoperate). Isolating the data plan from the server plan makes sense when both are very complicated, when the two layers are owned by separate teams, or when you need to avoid “server vendor lock-in.”

Much attention has been focused on server topology in recent years—and that’s where this report focuses for the most part—because this is where most of the action is happening relative to DW architecture. For example, consider the many new types of platforms that have arrived recently from vendors and the open source community, plus new practices on the part of user organizations. These include DW appliances, columnar DBMSs, NoSQL DBMSs, Hadoop, and various analytic tools. They supply users’ growing demands for multi-platform system architectures under the assumption that having a software portfolio of diverse data platform types enables users to move data and workloads to the platforms that are best suited to them. The bulk of this report drills into multi-platform DW environments as manifestations of the physical layer of DW architecture.

Collection of data and platforms without a plan Some data warehouses work despite being a disorganized bucket of inconsistent data sets on disconnected data platforms. Certainly, this is not a recommended practice. These are rare cases (12%) because the likelihood of success is low—a disorganized bucket provides limited value for an organization. For that reason alone, most DWs have a functioning architecture (as we’ll see later, in the survey data for Figure 2).

In a related issue, note that some areas within a data warehouse architecture demand more rigorous application of data standards than others. For example, compare the squeaky clean and carefully modeled data that goes into financial reports to the barely scrubbed data that’s appropriate to exploratory analytics. By comparison, data for analytics sometimes seems like a collection without a plan. But the plan is still there; it’s expressed in the analytic layer at run time, instead of in a priori processing in the data layer, as is typical of data for reports.

The logical layer is the foundation that other architectural layers are built atop.

The physical layer is often portrayed as just a collection of servers, but it’s really about how a logical data design is deployed on servers.

The fact that architecture-free DWs are rare indicates architecture is a critical success factor.

1 In a different survey question (not charted in this report), respondents reported that DW architecture is the job of a data warehouse architect (38%), everyone on the BI and DW team (23%), or a BI or DW director (19%).


Inmon versus Kimball In the early days of data warehousing, influential architectural concepts were developed by two important pioneers in the field: Bill Inmon² and Ralph Kimball.³ Throughout the 1990s, data warehouse professionals hotly debated which was better, Inmon’s top-down approach focused on operational data in third normal form (3NF) or Kimball’s bottom-up approach focused on dimensional models expressed in integrated data marts.

The Inmon versus Kimball dichotomy is still with us (12% in Figure 1), but the argument has cooled considerably in recent years because most DW architects nowadays employ multiple approaches, selecting the one that best fits a certain data structure, data domain, data platform server, or development team. Some DWs incorporate approaches by both Inmon (e.g., with 3NF data for exploratory analytics) and Kimball (e.g., with multidimensional data marts for business performance management). Furthermore, most architectures, standards, and approaches are a matter of degree. Just because you have a plan doesn’t mean you should follow it all the time, as explained in this report’s next User Story (below).
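The hybrid described above can be made concrete with schema sketches. The following is a hypothetical illustration (all table and column names are invented) that places an Inmon-style, third-normal-form design alongside a Kimball-style star schema in the same database:

```python
import sqlite3

# Hypothetical sketch of a hybrid environment: 3NF (Inmon-style)
# structures and a dimensional star schema (Kimball-style) coexisting.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Inmon-style 3NF: each fact stored once, relationships via keys.
CREATE TABLE customer   (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders     (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer,
                         order_date TEXT);
CREATE TABLE order_line (order_id INTEGER REFERENCES orders,
                         product_id INTEGER, amount REAL);

-- Kimball-style star: one fact table surrounded by denormalized,
-- surrogate-keyed dimension tables for multidimensional analysis.
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, cal_date TEXT, quarter TEXT);
CREATE TABLE dim_customer (cust_key INTEGER PRIMARY KEY, name TEXT, segment TEXT);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date,
    cust_key INTEGER REFERENCES dim_customer,
    amount   REAL
);
""")
```

Queries against the star join the fact table to its dimensions, while the 3NF side serves detailed, integrated enterprise data; neither style precludes the other within one environment.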

Other One respondent who selected the answer “Other” typed in his formula for a DW architecture, which stresses the combination of logical and physical plans: “Business process architecture (metadata) + logical plan + data acquisition and distribution infrastructure topology [physical plan].” Another respondent expressed a similar sentiment: “A coherent set of principles, models, and frameworks that guide the design and implementation of the (extended) enterprise data warehouse, its components, and their interrelationships at the conceptual, logical, and physical level.” Yet another respondent expressed the opinion that DW architecture “includes data integration and BI/reporting architectures, as well.”

What do you think DW architecture is? Select all that apply

Standards for data models, quality, metadata, interfaces, development methods, etc. 71%

A logical design for data structures and the patterns or relationships among them 66%

A plan for physically deploying data, data structures, and data platforms 56%

A collection of data structures and data platforms, with little or no overall plan 12%

The old Inmon versus Kimball argument 12%

Other 5%

Figure 1. Based on 1,197 responses from 538 respondents; 2.2 responses per respondent, on average.

USER STORY MOST ARCHITECTURES END UP BEING HYBRIDS, DESPITE STANDARDS, PLANS, AND PREFERENCES According to a user interviewed for this report: “I’ve worked on a lot of data warehouses, and I’ve never seen one implemented 100% Inmon or 100% Kimball, although there’s usually a tendency toward one or the other. Centralized and distributed architectures are rarely 100% either. And most ETL hub architectures are polluted with a few point-to-point interfaces that skirt the hub.

“I think all IT systems benefit from a guiding architecture for the sake of consistent standards, data integrity, ease of maintenance, depth of integration, business alignment, and the usual dogma. Even so, pedantic adherence to architecture isn’t desirable, because a few unique situations inevitably arise that are best served by a minority of non-standard solutions.”

Most architectures and approaches end up being hybrids after evolving

Key to most users’ conception is the combination of logical and physical architectures

2 Bill Inmon’s approach is well documented in his books Building the Operational Data Store, Managing the Data Warehouse, and Data Warehouse Performance.
3 Ralph Kimball’s approach is well documented in his books The Data Warehouse Toolkit, The Data Warehouse Lifecycle Toolkit, and The Data Warehouse ETL Toolkit.


Business and Technology Drivers behind Architectural Evolution

First, let’s establish that data warehouse architectures exist and are in use by the majority of organizations that have a data warehouse. See Figure 2, where a whopping 79% of survey respondents report having a data warehouse with an architecture.

Second, let’s corroborate that data warehouse architectures are, indeed, evolving. (See Figure 3.) Roughly half of users surveyed (54%) report experiencing a moderate level of evolution, whereas roughly a quarter (22%) are seeing dramatic evolution. Another quarter are experiencing minimal change.

Does your primary enterprise data warehouse have an architectural design?

Yes 79%

No 18%

Don’t know 4%

Figure 2. Based on 538 respondents.

Is the architecture of your data warehouse environment evolving?

Yes—moderately 54%

Yes—dramatically 22%

No—except when DW updates adjust architecture slightly 22%

Don’t know 2%

Figure 3. Based on 538 respondents.

What’s driving change in data warehouse architectures? To get at the heart of the matter, this report’s survey asked: “What technical issues or practices are driving change in your DW architecture?” (See Figure 4.)

Advanced analytics Most changes being made in DW architectures are to make a warehouse (or its extended environment) more conducive to advanced forms of analytics (57%), especially those enabled by technologies for data mining, statistics, or complex SQL. Note that online analytic processing (OLAP) is still the leading method and technology for analytics, but it’s being complemented by other forms that are more focused on information discovery.

big data As a close second, almost as many changes are being made to accommodate the increasing data volumes (56%) and non-relational data (25%) of big data. In a related data management issue, data virtualization is on the rise (23%).

Real time To keep pace with accelerating business practices, BI and DW infrastructure is being retrofitted with technology for real-time operation (41%), and sometimes for streaming data (15%). This is especially apparent in a popular practice called operational BI, which gathers very fresh data from operational applications and presents it in management dashboards and similar reports—all within seconds or milliseconds.

Most technical changes to data warehouse architectures are to provide better platforms for analytics, big data, and real-time operation.


What technical issues or practices are driving change in your DW architecture? Select all that apply

Advanced analytics (mining, statistics, complex SQL; not OLAP) 57%

Increasing data volumes 56%

Real-time operation based on fresh data 41%

Business performance management 38%

Online analytic processing (OLAP) 30%

Non-relational data 25%

Virtualization of data 23%

Cloud adoption 21%

Streaming data 15%

Other 9%

Figure 4. Based on 1,688 responses from 538 respondents; 3.1 responses per respondent, on average.

Technical drivers aside, a number of business issues and evolving practices are also driving change in DW architectures. (See Figure 5.)

Market pressures Enhancements to DW architectures can contribute to competitiveness (45%) and fast-paced business processes (43%), as more organizations compete on analytics or simply run the business based on facts developed in BI/DW infrastructure.

Compliance (29%) Many organizations have beefed up data security and role-based user access functions to comply with privacy and security regulations laid out in federal legislation such as HIPAA, Sarbanes-Oxley, and the USA PATRIOT Act. Others are scrambling to support the data standard known as the electronic medical record (EMR) mandated in the federal Affordable Care Act. Yet other organizations are altering their DW architectures to meet other definitions of compliance, such as service-level agreements (for performance and high availability) and internal data governance guidelines.

Departmental BI In some organizations, there is a shift toward funding (29%) and sponsorship (26%) at the departmental level. This is because departmental budgets are sometimes more fluid than capital ones, and most analytic applications are departmentally focused. One way to respond is to restructure a central DW into a componentized architecture, so that departmental assets are distinct from shared enterprise assets (perhaps on standalone servers, although more and more on clouds). An equally common response, however, is for a department to build its own redundant infrastructure for BI, DW, and analytics. In a similar vein, TDWI regularly sees departmental BI deployed atop a third-party cloud funded by the department, not IT.

Organizational issues A variety of changes to an organization can drive changes to IT systems and data warehouses, including reorganizations (25%), centralizing business control (30%), departmental power struggles (19%), and mergers and acquisitions (18%).

Most business-driven changes to DW architectures are in response to markets, compliance, and departmental BI.


What business issues or practices are driving change in your DW architecture? Select all that apply

Competitiveness 45%

Fast-paced business processes 43%

Centralizing business control 30%

Compliance 29%

Funding 29%

Sponsorship 26%

Reorganizations 25%

Departmental power struggles 19%

Mergers and acquisitions 18%

Other 6%

Figure 5. Based on 1,453 responses from 538 respondents; 2.7 responses per respondent, on average.

USER STORY SUCCESS COMES FROM A GOOD LOGICAL DESIGN BASED MOSTLY ON BUSINESS STRUCTURES Gary DeBoer has been a data warehouse professional and data architect at Saks Fifth Avenue, California Power Exchange, Boeing, Perot Systems, EDS, and Clearwire. “To me, data warehouse architecture borrows structures from the business, applications, IT infrastructure, analytics, enterprise data, and so on. Pieces of these combine into an information architecture for the warehouse, which is sometimes called a conceptual data model or a logical design. In my experience, the best success comes when the logical design is based primarily on the business—how it’s organized, its key business processes, and how the business defines prominent entities like customers, products, and financials. Start with these and similar entities as the core business data; the core doesn’t change, because the essence of business doesn’t change.

“Once a logical data architecture is designed, you can start considering how it will be deployed physically and virtually. How ‘enterprise’ is the data warehouse? With what enterprise applications will the warehouse integrate? How does it fit with IT infrastructure and enterprise data standards? In short, recognize that logical design and physical deployment are two different but related architectures; always give priority to the logical over the physical.”

Why Is DW Architecture Important?

Anecdotally, TDWI has seen users’ concern for data warehouse architecture increase noticeably since about 2003. Many users are actively focused on architecture, as seen in the increased use of job titles such as data warehouse architect and enterprise data architect. To gauge the urgency of this activity, this report’s survey asked: “How important is architecture for the success of your data warehouse and related platforms today?” (See Figure 6.)

A mere 2% of respondents feel that data warehouse architecture is not a pressing issue. To the contrary, the vast majority consider DW architecture to be extremely important for success (79%), while many others consider it moderately important (19%).

The vote is in: the vast majority of users consider DW architecture important


How important is architecture for the success of your data warehouse and related platforms today?

Extremely important 79%
Moderately important 19%
Not a pressing issue currently 2%

Figure 6. Based on 538 respondents.

Clearly, the data warehouse professionals and others who responded to the survey are nearly unanimous in their zeal for DW architectures. But why? To get their unvarnished opinions, the survey asked the open-ended question: “In your own words, why is the architecture of a data warehouse important?” The respondents’ comments are loaded with both wisdom and wit, as seen in the representative insights reproduced below:

In your own words, why is the architecture of a data warehouse important?

• “We need the structure; otherwise it’s chaos.”—VP of IT, education, Asia

• “Like anything to be done correctly, having a blueprint is necessary, and this is what architecture implies. Without it would be similar to building a home without a blueprint, which informs each of the subcontractors what they are required to do, from putting in the foundation to laying a roof. In other words, the potential for failure is high without a sound architecture that includes not just the blueprint but also the standards, processes, applications, and infrastructure.”—DW architect, telecommunications, U.S.

• “Architecture helps to save money in development and maintenance and to leverage knowledge yield for users.”—CEO, consulting firm, Europe

• “Every DW solution is tailored to suit the business needs of the organization in question. Architecture is imperative to ensure that the objectives are being met and that all enhancements and changes to the DW are inclusive to the general business objective.”—Environmental scientist, federal government, U.S.

• “Users are increasingly expecting the ability to access large volumes of data across a variety of data sources, and they expect to be able to do it fast. A robust data warehouse architecture is needed to support structured, unstructured, and multi-source data, plus to support real-time, operational, and predictive analytics.”—BI manager, manufacturing, U.S.

• “It is a model or representation that enables ‘testing’ the DW design against business needs.”—IT professional, software/Internet, Canada

• “As the Cheshire Cat noted in Alice in Wonderland, if you don’t know where you’re going, any road will get you there. We do not want to end up with an architecture like that of the Winchester Mystery House.”—Sales consultant, software/Internet, U.S.

DW architecture is a critical foundation for structure, performance, future expansion, business alignment, and user value


• “Without an architectural plan there is little or no understanding of the content, structure, and relationships of the data within the warehouse. Without that understanding, the data is of limited value to the enterprise.”—IT specialist, utilities, U.S.

• “It provides a plan and framework for how you are going to organize data in your environment so that it is understandable by the business, achieves the goals of data integration, stores the information efficiently and securely, and makes sense as a cohesive whole, rather than just a store of operational data.”—VP and technology manager, financial services, U.S.

• “Data without architecture is just a mess.”—Director, hospitality and travel, Europe

Problems and Opportunities for Data Warehouse Architectures

Establishing and sustaining data warehouse architecture takes time, specialized skills, team collaboration, and organizational support. With these requirements in mind, this report’s survey asked: “Is designing and maintaining data warehouse architecture mostly a problem or mostly an opportunity?” (See Figure 7.)

The vast majority consider DW architecture an opportunity (84%). Conventional wisdom today says that architecture gives a data warehouse and dependent programs for reporting and analytics greater data quality, usability, maintainability, and alignment with business goals.

A small minority consider DW architecture a problem (16%). Data warehouse architecture faces a number of technical and organizational challenges, as noted in the next section of this report, but few organizations find that the challenges outweigh the benefits of architecture.

Is designing and maintaining data warehouse architecture mostly a problem or mostly an opportunity? (Select only one.)

Opportunity 84%

Problem 16%

Figure 7. Based on 538 respondents.

User perception says that DW architecture is mostly an opportunity, not a problem.


Benefits of Effective Data Warehouse Architectures

The strongest trend in DW architecture right now is the movement toward even more data platforms and related tools within the physical layer of the extended data warehouse environment. To test users’ perceptions of data warehouse platform diversification, this report’s survey asked respondents: “If your organization were to further diversify the platform types in your data warehouse architecture, which business and technology tasks would improve?” (See Figure 8.)

Most forms of analytics benefit from having more types of data platforms. This includes all data analytics, in general (61%), which bubbled up to the top of the list in Figure 8. In fact, many of the new platforms available are for analytics (although sometimes described as DW platforms), including data warehouse appliances, columnar databases, and Hadoop.

Information exploration and discovery (43%) is also at the top of the list. This makes sense because the kinds of analytics on the rise right now (based on technologies for SQL, NoSQL, mining, statistics, and natural language processing) are all about discovering facts about the business that were previously unknown. Likewise, common types of analytic applications benefit from a multi-platform DW environment, including customer-base segmentation (19%), definitions of churn (18%), sentiment analytics (18%), and fraud detection (16%).

A more diverse platform portfolio can aid a business. Additional platforms are key to addressing new business requirements (36%), especially data-oriented ones such as analytics (61%), more numerous and accurate business insights (34%), business optimization (30%), and understanding business change (20%).

A diverse platform portfolio can handle a diverse range of data types. This is key to embracing the unstructured and schema-free data types found in most big data. Once these new data types are addressed, an organization gets greater business value from big data (34%), broader data sourcing for analytics (33%), and more data for data warehousing (22%).

Handling data in real time usually requires an additional purpose-built system. Traditional relational databases and batch-oriented Hadoop systems were not built for real-time operations (33%), although many organizations need faster business processes (26%). Along with retrofitting real-time functions onto platforms that lack them (e.g., Apache Storm on Hadoop or in-memory analytics on RDBMSs), many users add additional standalone tools for event processing to their warehouse environments.
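As a rough illustration of what such a standalone event-processing tool does (this sketch is not from the report; the class name and window size are invented), events are typically evaluated against a time window as they arrive, rather than queried later in batch:

```python
from collections import deque

class SlidingWindowCounter:
    """Count events that arrived within the last `window_seconds`.

    A toy stand-in for the purpose-built event-processing engines the
    report mentions; real deployments would use a dedicated product.
    """
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps, oldest first

    def record(self, timestamp):
        self.events.append(timestamp)

    def count(self, now):
        # Evict events that fell out of the window before counting.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)

counter = SlidingWindowCounter(window_seconds=60)
for t in (0, 10, 50, 70, 110):
    counter.record(t)
print(counter.count(now=115))  # → 2 (only the events at 70 and 110 remain)
```

The point of the sketch is the design choice: the window logic runs as data flows in, which is exactly what batch-oriented DW and Hadoop platforms were not built to do.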

Adding low-cost platforms to a DW environment makes big data more affordable. Hadoop and NoSQL platforms give more favorable economics to big data management practices, such as data staging for data warehousing (20%) and data archiving (16%). In a similar vein, many TDWI members have extended their DW environments with new brands of analytic databases (typically appliances and columnar RDBMSs). These are optimized for analytics to a degree that the average report-oriented DW isn’t, and the analytic databases are more affordable for most configurations.

The top beneficiaries of multi-platform DWs are analytics, business value, data breadth, and real-time operations


If your organization were to further diversify the platform types in your data warehouse architecture, which business and technology tasks would improve? (Select seven or fewer.)

Data analytics 61%
Information exploration and discovery 43%
Addressing new business requirements 36%
Greater business value from big data 34%
More numerous and accurate business insights 34%
Broader data sourcing for analytics 33%
Real-time operations 33%
Business optimization 30%
Faster business processes 26%
Recognition of sales and market opportunities 24%
More data for data warehousing 22%
Data staging for data warehousing 20%
Understanding business change 20%
Segmentation of customer base 19%
Definitions of churn and other customer behaviors 18%
Sentiment analytics and trending 18%
Data archiving 16%
Fraud detection 16%
Cost containment 15%
Identification of root causes 15%
Understanding website visitor behavior 13%
Effective use of machine data 12%
Quantification of risks 10%
Other 4%

Figure 8. Based on 3,083 responses from 538 respondents; 5.7 responses per respondent, on average.

Barriers to Success with Data Warehouse Architectures

As we just saw, diversifying the types of data platforms in an extended data warehouse architecture has its benefits, at least in users’ perceptions. Diversifying also has a few barriers. To get a sense of the challenges of such multi-platform DW environments, this report’s survey asked respondents: “In your organization, what are the top potential barriers to further diversifying the platform types in your data warehouse architecture?” (See Figure 9.)

A “skills gap” is the largest barrier to success with new DW platforms and data types. Back in the day, building a multi-platform DW environment simply involved deploying a few data marts and operational data stores on their own standalone database instances. An equivalent environment nowadays is far more challenging, because it entails working with “the new stuff”—data management tools and data types that are new to an organization. Due to this newness, many organizations that attempt a modern, diversified DW environment find that they have inadequate staffing or skills (47%). Their immaturity with new data types and sources (23%)—plus new technologies for Hadoop, event processing, and so on—makes them unprepared for the complexity of multi-platform designs (25%).4

Success requires solid skills, sponsorship, data management, and funding

4 In a different survey question (not charted in this report), 67% of respondents report that their current data warehouse team does not have the skills for big data, new data types, new methods, and emerging technologies.


As usual, organizational and business issues should be settled first. These include data ownership and other politics (43%), a lack of business sponsorship (38%), and a lack of a compelling business case (25%).

A number of data management issues can get in the way. Technical users must prepare to address data integration complexity (36%), poor quality of data (34%), lack of enterprise data architecture (29%), and data security, privacy, and governance issues (25%).

As with any new IT initiative, you won’t get far without proper funding. In particular, account for the cost of acquiring multiple platforms (25%) and the cost of administering multiple platforms (27%).

In your organization, what are the top potential barriers to further diversifying the platform types in your data warehouse architecture? (Select seven or fewer.)

Inadequate staffing or skills 47%
Data ownership and other politics 43%
Lack of business sponsorship 38%
Data integration complexity 36%
Poor quality of data 34%
Lack of enterprise data architecture 29%
Cost of administering multiple platforms 27%
Poor integration among multiple platforms 26%
Complexity of multi-platform designs 25%
Cost of acquiring multiple platforms 25%
Data security, privacy, and governance issues 25%
Lack of compelling business case 25%
Existing data warehouse architecture 24%
Immaturity with new data types and sources 23%
Lack of metadata and schema in some big data 17%
Real-time data handling 16%
Getting started with the right project 15%
Loading large data sets quickly and frequently 14%
Scalability problems with big data 10%
Processing queries fast enough 9%
Other 3%

Figure 9. Based on 2,758 responses from 538 respondents; 5.1 responses per respondent, on average.


USER STORY THE PROLIFERATION OF DEPARTMENTAL BI MAY WORK AGAINST A CENTRAL DW AND ITS ARCHITECTURAL STANDARDS “Our central data warehouse is still fairly new and a bit under-populated with data, which explains why many departments have set up their own tools and data platforms for BI and analytics,” said the BI director at a metals recycling firm. “In particular, a number of departments have bought their own tools for reporting, analysis, and data exploration. The tools work okay without a warehouse, because most departmental users are happy looking at data from one application at a time. Of course, departmental BI is problematic, because it sees incomplete views of data, outside the context of the rest of the enterprise, and without standard definitions of business entities like customers and products.

“We have approval and funding for a major upgrade to our data warehouse and reporting tools, and that will be in place within a year. After that, we’ll begin consolidating departmental systems into the new, central BI and data warehouse infrastructure. That way, everyone will have a much broader view of the business, based on consensus-driven definitions of business entities.”

Multi-Platform Data Warehouse Environments

Many enterprise data warehouses (EDWs) are evolving into multi-platform data warehouse environments (DWEs) as users continue to add standalone data platforms to their warehouse tool and platform portfolios. The new platforms don’t replace the core warehouse, because it is still the best platform for the data that goes into standard reports, dashboards, performance management, and OLAP. Instead, the new platforms complement the warehouse because they are optimized for workloads that manage, process, and analyze new forms of big data, non-structured data, and real-time data. Users need additional platforms to get new value from new data. The catch is to build a multi-platform data environment without being overwhelmed by its complexity, which a good architectural design can avoid. Hence, for many user organizations, a multi-platform physical plan coupled with a cross-platform logical design is a DW architecture that’s conducive to the new age of big data.5

Workload-centric DW Architecture

One way to characterize a data warehouse’s architecture is to count the number and types of workloads it supports. According to a 2012 TDWI Research survey on high-performance data warehousing, a little over half of user organizations surveyed (55%) support only the most common workloads, namely those for standard reports, performance management, and online analytic processing (OLAP). The other half (45%) also support workloads for advanced analytics, detailed source data, various forms of big data, and real-time data feeds.

The trend is toward the latter. In other words, the number and diversity of DW workloads is increasing due to organizations embracing big data, multi-structured data, real-time or streaming data, and data processing for advanced analytics. The catch is that some data warehouses (whether defined as a vendor’s product or a user’s design) can handle multiple, concurrent workloads of various types, whereas others cannot. Hence, the mix of workloads and data types that you manage on a central warehouse (or offload to other platforms) often depends on how capable the central warehouse is with multiple, concurrent workloads. Other determining factors include the ownership of data sets and platforms, plus their relative economic costs.

Logical DW architecture must integrate multiple physical platforms

5 Some of the material in this section is drawn from a TDWI blog post by Philip Russom, “Evolving Data Warehouse Architectures: From EDW to DWE,” online at http://tdwi.org/Blogs/Philip-Russom/2013/07/From-EDW-to-DWE.aspx.


Distributed DW Architecture

The issue in a multi-workload environment is whether a single-platform data warehouse can be designed and optimized such that all workloads run optimally, even when concurrent. More and more DW teams are concluding that a single-platform DW is no longer desirable. Instead, they maintain a core DW platform for traditional workloads (reports, performance management, and OLAP), but offload other workloads to other platforms. For these organizations, the DW is not going away; it’s just being complemented by additional data platforms tuned to workloads that can and should be offloaded from the core warehouse.

For example, data and processing for SQL-based analytics are regularly offloaded to DW appliances and columnar DBMSs. A few teams offload workloads for big data and advanced analytics to HDFS, MapReduce, and other NoSQL platforms. It’s not just DWs; some users are offloading ETL jobs from an expensive data integration server to less expensive Hadoop. The result is a strong trend toward distributed DW architectures, where many areas of the logical DW architecture are physically deployed on standalone platforms instead of the core DW platform.
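One way to picture the offload decision is as a routing table from workload type to target platform. The sketch below is purely illustrative (the workload names and platform labels are invented for this example, not taken from the report), but it captures the idea that traditional workloads stay on the core warehouse while others are dispatched elsewhere:

```python
# Hypothetical routing table for a distributed DW environment.
# Platform labels and workload names are invented for illustration.
WORKLOAD_ROUTES = {
    "standard_reports":       "core_edw",
    "performance_management": "core_edw",
    "olap":                   "core_edw",
    "sql_analytics":          "dw_appliance",    # or a columnar DBMS
    "advanced_analytics":     "hadoop_cluster",  # HDFS, MapReduce, NoSQL
    "etl_staging":            "hadoop_cluster",  # offloaded from the DI server
}

def route_workload(workload_type):
    """Return the platform for a workload, defaulting to the core DW."""
    return WORKLOAD_ROUTES.get(workload_type, "core_edw")

print(route_workload("sql_analytics"))  # → dw_appliance
print(route_workload("olap"))           # → core_edw
```

In a real organization this "table" is a set of architectural decisions rather than code, and, as the report notes, the right mix also depends on data ownership and the relative economics of each platform.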

A Distributed DW Architecture Is Both Good and Bad

A distributed architecture is good if your fidelity to business requirements and DW performance leads you to deploy another data platform in your DW environment and the new platform integrates well with others in the distributed architecture, on both logical and physical levels. But it’s bad when disconnected systems proliferate uncontrolled, like the errant data marts we all fear. So far, the newest generation of analytic databases and data management platforms is controlled by users far better than the marts of yore. Still, it’s essential to be diligent with marts to avoid abuses.

Note that the architectural distinctions made here have always been a matter of degree, and will continue to be so. In other words, no architecture is 100% monolithic or 100% distributed. Many are hybrids, and the percentage mix that’s right for you depends on the details of your business and technology. DW architectures have always been distributed to some degree. It’s just that the degree is more pronounced today.

Furthermore, the trend toward distributed DW architectures is not new—not by a long shot. For decades, warehouses have wended their way through a variety of “edge systems” that are deployed on standalone servers off to the side of the warehouse, but integrated with it. This has been true from the dawn of warehousing (as with data marts and operational data stores [ODSs]), although it has recently expanded (with DW appliances and columnar DBMSs), and is now continuing with new types of data platforms (namely NoSQL and Hadoop). Hence, even the new platforms fit comfortably into the well-established tradition of DW edge systems.

USER STORY PURE MONOLITHS ARE RARE; MOST HAVE AT LEAST A FEW SATELLITE SYSTEMS According to a user interviewed for this report: “We trend toward a monolithic warehouse architecture in that almost all our data for BI and analytics resides in a single RDBMS instance. The RDBMS we just migrated to is known to excel with monoliths, so it works for us from a performance and scalability viewpoint. As with any architecture, there’s always a hybrid touch, and so we have a number of secondary platforms, too, as satellites that integrate with the central, monolithic warehouse. For example, all data staging is done on satellite platforms. It’s not that our monolith can’t handle staging; it’s that capacity on the monolith’s RDBMS is way too expensive for storing terabytes of detailed source data and pushing down ELT processing.”

The diversification of DW workloads leads to distributed architectures for DWs


From the Single-Platform EDW to the Multi-Platform DWE

A consequence of the workload-centric approach is a trend away from the single-platform monolith of the enterprise data warehouse (EDW) toward a physically distributed data warehouse environment (DWE). A modern DWE consists of multiple platform types, ranging from the traditional warehouse (and the usual satellite systems for marts and ODSs) to new platforms such as DW appliances, columnar DBMSs, NoSQL databases, MapReduce tools, and HDFS. In other words, users’ portfolios of tools for BI/DW and related disciplines are diversifying aggressively.

The multi-platform approach adds more complexity to the DW environment, but BI/DW professionals have always managed complex technology stacks successfully. The upside is that users love the high performance and solid information outcomes they get from workload-tuned platforms.

Note that a DWE can also be a disconnected bucket of standalone silos, and that’s where many organizations are today. Ideally, the physically distinct systems of the DWE should be integrated with others, so they connect via an overall logical design. Integration within the DWE can take many forms, including shared dimensions, data sync, federation, ETL jobs, data flows across DWE platforms, real-time interfaces, and so on. Unless the platforms of a DWE are integrated at appropriate levels, the DWE is just a bucket of silos. A unifying architectural design is more efficient technically and more effective for business users.
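Of the integration forms just listed, shared (conformed) dimensions are perhaps the easiest to picture: fact rows extracted from different DWE platforms are described with one common set of business attributes, so every platform reports against the same definitions. A toy sketch, with invented names and fields:

```python
# A shared (conformed) customer dimension applied to fact rows pulled
# from two different DWE platforms. All data is invented for illustration.
customer_dim = {
    101: {"name": "Acme Corp", "segment": "Enterprise"},
    102: {"name": "Beta LLC",  "segment": "SMB"},
}

edw_facts    = [{"customer_id": 101, "revenue": 5000}]  # from the core EDW
hadoop_facts = [{"customer_id": 102, "revenue": 750}]   # from a satellite platform

def conform(facts):
    """Attach the shared dimension attributes to each fact row."""
    return [{**f, **customer_dim[f["customer_id"]]} for f in facts]

combined = conform(edw_facts) + conform(hadoop_facts)
print(sorted(row["segment"] for row in combined))  # → ['Enterprise', 'SMB']
```

The design point is that the dimension is defined once and applied everywhere; without it, each platform in the DWE would carry its own notion of "customer," which is exactly the bucket-of-silos problem the report warns about.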

Architectural Strategies for EDWs versus DWEs

Despite the unmistakable trend toward multi-platform DWEs, they are still in the minority compared to the traditional EDW. Even so, it’s difficult to quantify architecture, because there are so many hybrids and variations. (See Figure 10.) Users determine which architectural mix is best for them and then act accordingly; the mix evolves as requirements shift.

100% pure data warehouse architectures are rare. A mere 15% of survey respondents claim to have a central monolithic EDW with no other data platforms. At the other extreme, relatively few report having no true EDW, but many workload-specific data platforms instead (15% in Figure 10).

Hybrid EDWs are the most common data warehouse architectures today. More than any other architecture tested by the survey, 37% of respondents report using a central EDW with a few additional data platforms. A similar hybrid comes in second place, namely a central EDW with many additional data platforms (16%). Clearly, hybrids work for users, whether planned for the sake of diverse workloads or resulting from evolutionary forces (reorganizations, mergers, departmental systems, and so on).

Other hybrids are possible, too. This includes DWEs that have many workload-specific standalone platforms, where an EDW is present but is not the center of the environment (15% in Figure 10). TDWI has seen this form of DW architecture in firms that are heavily into analytics but don’t have much need for traditional reports, as is common among Internet firms.

Architectural strategies for future evolution most often build out an EDW. Most survey respondents plan to extend an existing DW to accommodate big data and other new requirements (41% in Figure 11), whereas fewer will deploy new data platforms for big data, analytics, real time, and so on (25% in Figure 11). Note that a third of respondents have no strategy.

Rearrange the acronym from EDW to DWE to get “DW environment,” which means a multi-platform DW

Hybrid architectures outnumber pure EDWs or pure DWEs


Which of the following best describes your extended DW environment today?

Central EDW with a few additional data platforms 37%
Central EDW with many additional data platforms 16%
Central monolithic EDW with no other data platforms 15%
Many workload-specific data platforms; EDW is present but not the center 15%
No true EDW, but many workload-specific data platforms instead 15%
Other 2%

Figure 10. Based on 538 respondents.

Which of the following best describes your organization’s strategy for evolving your data warehouse environment and its architecture relative to big data?

Extend existing core DW to accommodate big data and other new requirements 41%
Deploy new data management systems specifically for big data, analytics, real time, etc. 25%
No strategy for DW architecture, although we do need one 23%
No strategy for DW architecture, because we don’t need one 6%
Other 5%

Figure 11. Based on 538 respondents.

USER STORY SOME DATA WAREHOUSES BEGIN LIFE AS A COLLECTION OF DATA MARTS AND SPREADSHEETS “We’ve only recently begun building our first data warehouse,” said a BI professional at a state college in the U.S. “For decades, the university ran on a number of disconnected software applications. We ran reports on a per-application basis and consolidated data across these by entering numbers into spreadsheets and paper forms. Obviously, this approach suffered many problems and limitations. But the problem that finally convinced our administration to fund a data warehouse was the lack of historic data. Without that history, no one could answer basic questions, such as: how many math majors do we have today, and how does that compare to three years ago? Are grades going up or down? Such questions impact the university’s funding, accreditation, and quality of education.

“We’re building the data warehouse one data mart at a time, basing our designs on older reports, spreadsheets, and paper forms. As we go, we’re careful to define standard ways of describing students, sophomores, majors, professors, test scores, and so on. Eventually, we’ll double back and make a pass through our collection of marts to standardize and consolidate them into a history-oriented, multi-dimensional data warehouse architecture.”


Hadoop’s Roles in Data Warehouse Architectures

In a TDWI Research survey on integrating Hadoop into BI and DW that ran in 2012, 88% of the users surveyed reported that the Hadoop ecosystem of products is a business opportunity (not a technology problem) because it enables new types of applications. When asked which types of applications benefit most from Hadoop, survey respondents chose (in priority order) big data analytics, advanced analytics (i.e., data mining, statistical analysis, complex SQL, and natural language processing), and discovery analytics. After these three analytic application types, respondents chose two data management use cases for Hadoop, namely information exploration and complementing a data warehouse. Other data management uses seen in the survey include data archiving, transforming big data for analytics, and data staging.6

All the things just mentioned form a long list of use cases in data warehousing for the Hadoop Distributed File System (HDFS) and other Hadoop tools (MapReduce, Hive, HBase, Impala, Drill, and so on). And many of these—if implemented in a multi-platform data warehouse environment (DWE)—would have a strong influence on the architecture of the data environment.7

Promising Uses of Hadoop that Impact DW Architectures

HDFS and other Hadoop tools promise to extend and improve some areas within data warehouse architectures:

Data staging A lot of data processing occurs in a DW’s staging area to prepare source data for specific uses (reporting, analytics, OLAP) and for loading into specific databases (DWs, marts, appliances, cubes, and dimensions). Much of this processing is done by homegrown or tool-based solutions for extract, transform, and load (ETL). Data is landed, staged, and persisted on a hodgepodge of RDBMSs and file folders on networked drives. It’s common for a staging area to be many times the size of its DW in terms of raw storage. Some users have multiple data stages: one per data type, source, target, velocity, or processing requirement.

TDWI has encountered several DW teams that have consolidated and migrated their staging area(s) onto HDFS to take advantage of its low cost, linear scalability, facility with file-based data, and ability to manage unstructured data. Users who prefer to hand-code most of their ETL solutions will most likely feel at home in code-intense environments such as Apache MapReduce, Pig, and Hive. They may even be able to refactor existing code to run there. For users who prefer to build their ETL solutions atop a vendor tool, the community of vendors for ETL and other data management tools is rolling out new interfaces and functions for the entire Hadoop product family.

This discussion assumes that (whether you use Hadoop or not) you should physically locate your data staging area(s) on standalone systems outside the core data warehouse, if you haven’t already. That way, you preserve the core DW’s capacity for what it does best: storing squeaky clean, well-modeled data (with an audit trail via metadata and master data) for standard reports, dashboards, performance management, and OLAP. In this scenario, the standalone data staging area(s) offload most of the management of big data, archiving of source data, and much of the data processing for ETL, data quality, and so on. All these functions are known to scale well on HDFS.
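To make the staging work concrete, the sketch below shows, at laptop scale and with invented sample data, the kind of cleansing an ETL job performs on staged records: trimming whitespace, normalizing case, and rejecting incomplete rows. In practice the same logic would run at scale on HDFS via MapReduce, Pig, Hive, or an ETL tool's Hadoop connectors:

```python
import csv
import io

# Invented sample of raw, staged source records (messy on purpose).
RAW = """\
order_id,amount,region
1001, 250.00 ,WEST
1002,,EAST
1003,99.5,west
"""

def stage(raw_text):
    """Cleanse staged records before loading them into the warehouse:
    trim whitespace, normalize case, type the fields, and drop rows
    missing required values."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        amount = rec["amount"].strip()
        if not amount:
            continue  # reject records with no amount
        rows.append({
            "order_id": int(rec["order_id"]),
            "amount": float(amount),
            "region": rec["region"].strip().upper(),
        })
    return rows

for row in stage(RAW):
    print(row)  # two clean rows survive; record 1002 is rejected
```

Logic like this is embarrassingly parallel per record, which is why, as the report notes, ETL and data quality processing are known to scale well on HDFS.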

Data archiving When organizations embrace forms of advanced analytics that require detailed source data, they amass large volumes and retain most of the data over time, which taxes areas of the DW architecture where source data is stored. Storing terabytes of source data in the core EDW’s RDBMS can be prohibitively expensive, which is why many organizations have moved such data to less expensive satellite systems within their extended DW environments.

DW architectures influenced by Hadoop include data staging, archiving, analytics, and multi-structured data

6 This survey and related issues are discussed in detail in the TDWI Best Practices Report Integrating Hadoop into Business Intelligence and Data Warehousing, available online at tdwi.org/bpreports.

7 Some of the material in this section is drawn from the TDWI blog post by Philip Russom, “Evolving Data Warehouse Architectures: The Roles of Hadoop,” online at http://tdwi.org/Blogs/Philip-Russom/2013/08/Evolving-Data-Warehouse-Architectures-The-Roles-Of-Hadoop.aspx.


Similar to migrating staging areas to HDFS, some organizations are migrating their stores of source data and other archives to HDFS. This lowers the cost of archives and analytics while providing greater capacity.

Multi-structured data. Relatively few organizations are currently getting BI value from semi- and unstructured data, despite years of wishing for it. HDFS can be a special place within your DW environment for managing and processing semi-structured and unstructured data. Hadoop users are finding this approach more successful than stretching an RDBMS-based DW platform to handle data types it was not designed for.

One of Hadoop’s strongest complements to a DW is its handling of semi- and unstructured data, but don’t go thinking that Hadoop is only for unstructured data: HDFS handles the full range of data, including structured forms. In fact, Hadoop can manage and process just about any data you can store in a file and copy into HDFS.

Processing flexibility. Given its ability to manage diverse multi-structured data, as just described, Hadoop’s NoSQL approach is a natural framework for manipulating nontraditional data types. Note that these data types are often free of schema or metadata, which makes them challenging for most vendor brands of SQL-based RDBMSs, although a few have functions for deducing, creating, and applying schema as needed. Hadoop supports a variety of programming languages (Java, R, C), thus providing more capabilities than SQL alone can offer. Again, a few RDBMSs support these same languages as a complement to SQL.

In addition, Hadoop enables the growing practice of “late binding.” With ETL for data warehousing, data is processed, standardized, aggregated, and remodeled before entering the data warehouse environment; this imposes an a priori structure on the data, which is appropriate for known reports, but limits the scope of analytic repurposing later. Data entering HDFS is typically processed lightly or not at all to avoid limiting its future applications. Instead, Hadoop data is processed and restructured at run time, so it can flexibly enable the open-ended data exploration and discovery analytics that many users are looking for today.8
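A small schema-on-read sketch (with invented log data) shows the late-binding idea: the raw lines are stored untouched, and each analysis binds its own structure to the same bytes at run time.

```python
from collections import Counter

# Raw clickstream lines, retained exactly as they arrived (invented data).
raw_log = [
    "2014-03-01T12:00:00 alice /home 200",
    "2014-03-01T12:00:05 bob /pricing 404",
    "2014-03-01T12:00:09 alice /pricing 200",
]

# Analysis 1 binds a (timestamp, user, page, status) schema at run time
# to count page views per user.
views = Counter(line.split()[1] for line in raw_log)

# Analysis 2 binds structure to the same raw lines for a different
# question: which pages produced HTTP errors?
errors = Counter(
    line.split()[2] for line in raw_log if line.split()[3] != "200"
)

print(views)   # page views per user
print(errors)  # error counts per page
```

Because no a priori structure was imposed when the data landed, a third or fourth analysis with a completely different schema remains possible later, which is the open-ended repurposing the text describes.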

USER STORY: ECONOMIC REASONS CAN LEAD TO A DATA WAREHOUSE ENVIRONMENT THAT INCLUDES HADOOP

A seasoned data warehouse professional at a major telco had the following to say about reasons for integrating Hadoop into a mature data warehouse environment: “In recent years, we began to retain more and more source data in our enterprise data warehouse, as compared to the past when we deleted as much of it as possible. Building up a massive store of source data is good for analytics because the data is ready for use without rerunning extracts. But our problem was that we were managing and processing source data in the core data warehouse platform. Although quite scalable and fast, our EDW platform is very expensive in terms of the additional capacity we would need to purchase for managing our growing volumes of source data.

“For purely financial reasons, we began to evaluate additional platforms that could provide inexpensive, large-scale storage and processing, yet integrate with and complement our mature EDW. This led us to Hadoop. We felt Hadoop was the right solution, but we weren’t comfortable with a purely open source version. So, we acquired and deployed a distribution of Hadoop from a software vendor that also provides additional tools, maintenance, and services to make Hadoop far more enterprise ready than an open source solution would be.

“Now that we’ve succeeded with offloading certain DW data and workloads, we are likewise offloading some forms of ETL and ELT processing onto Hadoop.”

8 For more information about Hadoop and DW architecture, read the TDWI Checklist Report Where Hadoop Fits in Your Data Warehouse Architecture, available online at tdwi.org/checklists.

22 TDWI RESEARCh

Evolving Data Warehouse Architectures

Integrating HDFS with an RDBMS Alleviates the Limitations of Both

Hadoop’s limitations relative to RDBMSs used for data warehousing

Despite all the goodness of Hadoop described so far in this report, there are areas within data warehouse architectures where HDFS and other Hadoop tools are not a good fit:9

RDBMS functionality. HDFS is a distributed file system and therefore lacks capabilities we expect from relational database management systems (RDBMSs), such as indexing, random access to data, support for standard SQL, and query optimization. But that’s okay, because HDFS does things RDBMSs do not do as well, such as managing and processing massive volumes of file-based, unstructured data. For lightweight DBMS functionality (although not fully relational), users can layer HBase over HDFS, as well as the query framework called Hive.

Low-latency data access and queries. HDFS’s batch-oriented, serial-execution engine means that it’s not the best platform for real-time or speedy queries. Furthermore, Hadoop lacks mature query optimization. Hence, the selective random access to data and iterative ad hoc queries that we take for granted with RDBMSs are alien to Hadoop.

An RDBMS integrated with Hadoop can provide needed query support. HBase is a possible solution if all you need is a record store or nominal operational database, not a full-blown analytic RDBMS. Upcoming improvements to Apache Hive, the new Impala query engine, and Apache Drill will address some of the latency issues.

Streaming data. HDFS and other Hadoop products can capture data from streaming sources (Web servers, sensors, machinery) and append it to files. However, being inherently batch, they are ill-equipped to process that data in real time unless extended by additional tools. Such extremes of real-time analytics are best done today with specialized tools for complex event processing (CEP) and/or operational intelligence (OI) from third-party vendors. Even so, the incubator project Apache Storm could be a contender in the future for stream processing atop Hadoop.

Granular security. Hadoop today includes a few security features such as file-permission checks, access control for job queues, and service-level authorization. Add-on products that provide encryption and LDAP (Lightweight Directory Access Protocol) integration are available for Hadoop from a few third-party vendors. Since HDFS is not a DBMS (and Hadoop data doesn’t necessarily come in relational structures), don’t expect granular security at the row or field level, as in an RDBMS.
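For contrast, here is what row-level security looks like on the RDBMS side, sketched with sqlite3 and an invented sales table: a per-audience view filters rows before a user’s queries ever see them, a granularity that file-level permissions cannot express. This is an illustrative sketch only, not a description of any particular product’s security model.

```python
import sqlite3

# An invented fact table; sqlite3 stands in for a warehouse RDBMS.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 200.0), ("east", 50.0)],
)

# Row-level security via a view: the 'east' analyst is granted access
# only to this view, never to the underlying table.
db.execute(
    "CREATE VIEW sales_east AS SELECT * FROM sales WHERE region = 'east'"
)
print(db.execute("SELECT SUM(amount) FROM sales_east").fetchone()[0])
```

File-permission checks, by comparison, can only grant or deny whole files, which is why the report notes that row- or field-level control should not be expected from HDFS alone.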

Administrative tools. According to a TDWI survey, better security is Hadoop users’ most pressing need for improvement, followed by a need for better administrative tools, especially for cluster deployment and maintenance. The good news is that a few vendors offer tools for Hadoop administration, and an upgrade of the open source administration tool Ambari is coming.

SQL-based analytics. With the above latency limitations in mind, HDFS is a problematic choice for workloads that are iterative and query based, as with SQL-based analytics. Furthermore, Hadoop products today have limited support for standard SQL. A number of vendor products (from RDBMSs to data integration and reporting tools) can provide SQL support today, and open source Hadoop has new incubator projects that will eventually provide adequate support for SQL. These are critical if Hadoop is to become a productive part of a SQL-driven data warehouse architecture.10

Hadoop is strong in areas where RDBMSs are weak, and vice versa, resulting in a useful synergy

9 See the TDWI blog post by Philip Russom, “Evolving Data Warehouse Architectures: Integrating HDFS with an RDBMS Alleviates the Limitations of Both,” online at http://tdwi.org/Blogs/Philip-Russom/2013/09/Integrating-HDFS-with-an-RDBMS.aspx.

10 For more information about Hadoop’s limitations, see the 2013 TDWI Best Practices Report Integrating Hadoop into BI and Data Warehousing, available online at tdwi.org/bpreports.


Hadoop and RDBMSs are complementary and should be used together

Hadoop’s help for data warehouse environments is limited to a few areas. Luckily, most of Hadoop’s strengths are in areas where most warehouses and BI technology stacks are weak, such as unstructured data, very large data sets, non-SQL algorithmic analytics, and the flood of files that is drowning many DW environments. Conversely, Hadoop’s limitations (as discussed above) are mostly met by mature functionality available today from a wide range of RDBMS types (OLTP databases, columnar databases, DW appliances, etc.), plus administrative tools. In that context, Hadoop and the average RDBMS-based data warehouse are complementary (despite some overlap), which results in a fortuitous synergy when the two are integrated.

The trick, of course, is making HDFS and an RDBMS work together optimally. To that end, one of the critical success factors for assimilating Hadoop into evolving data warehouse architectures is the improvement of interfaces and interoperability between HDFS and RDBMSs. Luckily, this is well under way due to efforts from software vendors and the open source community. Technical users are starting to leverage HDFS/RDBMS integration.

For example, an emerging best practice among DW professionals with Hadoop experience is to manage diverse big data in HDFS, but process it and move the results (via ETL or other data integration methods) to RDBMSs (elsewhere in the DW architecture), which are more conducive to SQL-based analytics. Hence, HDFS serves as a massive data staging area and archive.

A similar best practice is to use an RDBMS as a front end to HDFS data; this way, data is moved via distributed queries (whether ad hoc or standardized), not via ETL jobs. HDFS serves as a large, diverse operational data store, whereas the RDBMS serves as a user-friendly semantic layer that makes HDFS data look relational.11
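The semantic-layer pattern can be sketched in miniature: raw key–value records (standing in for HDFS data) are exposed through a relational view, so analysts issue ordinary SQL against clean-looking columns. This is an illustrative sketch, with sqlite3 standing in for the front-end RDBMS and all record formats and names invented for the example.

```python
import sqlite3

# Raw HDFS-style records: one loosely structured text blob per event.
raw_records = [
    "ts=1393632000|user=alice|action=login",
    "ts=1393632060|user=bob|action=purchase",
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hdfs_raw (line TEXT)")
db.executemany("INSERT INTO hdfs_raw VALUES (?)", [(r,) for r in raw_records])

# The semantic layer: a view that parses the key=value blobs into columns
# at query time, so the raw data "looks relational" to BI users.
db.execute("""
CREATE VIEW events AS
SELECT
  substr(line, instr(line, 'user=') + 5,
         instr(substr(line, instr(line, 'user=') + 5), '|') - 1) AS user,
  substr(line, instr(line, 'action=') + 7) AS action
FROM hdfs_raw
""")
print(db.execute("SELECT user, action FROM events").fetchall())
```

Note that the underlying store keeps the raw blobs untouched; only the view imposes structure, which is what lets the same raw data serve other, later uses.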

USER STORY: HDFS INTEGRATED WITH AN RDBMS IS AN EMERGING SYSTEM ARCHITECTURE FOR ANALYTICS AND WAREHOUSING

Alex Peake is a data architect at software vendor Intuit: “We have a lot of different platform types in our extended data warehouse environment. But the preferred system architecture we’re moving toward is simply a relational database management system (RDBMS) integrated with a cluster of the Hadoop Distributed File System (HDFS), augmented with other Hadoop technologies. This combination makes sense, because we have very large data volumes, which Hadoop was designed for. But the RDBMS is still relevant, because it provides good query response times and efficiencies for smaller sets of data that have high enough value to warrant moving such data into an RDBMS.

“There’s a natural flow of data across these two platforms, and we use colors to describe how data evolves through the flow. Black data is the original source data, as it first arrives in HDFS. Gray data has been lightly standardized, remodeled, or joined—just enough to improve performance for the query-based work of data analysts. Gray data is persisted in a data warehouse on HDFS. White data is maximally cleansed and standardized. White means mostly RDBMS, but some will be in Hadoop.

“Note that the black-to-white data flow aligns with economic considerations. For example, we consider white data to be of the highest business value, which justifies our sizable investment in processing it, then persisting it on the most expensive data platform—namely, warehouses and marts on RDBMSs. Likewise, the low cost of Hadoop means we can afford to retain massive volumes of black data, so we can tap it quickly and repeatedly as new analytic opportunities arise.”12

11 For more information about RDBMS/HDFS architectures for data warehousing, see the TDWI Checklist Report The Modern Data Warehouse: What Enterprises Must Have Today and What They’ll Need in the Future, available online at tdwi.org/checklists.

12 Note that other users interviewed for this report relayed similar analogies to color. For example, “dark data” is a common euphemism for Hadoop data, big data, unstructured data, or any data asset that is retained but not leveraged much.


Analytics and Reporting Have Different Requirements for DW Architecture

There are a few off-base questions that TDWI regularly hears from users who are in the thick of implementing or growing their analytic programs, and therefore get a bit carried away. Such questions include: “Our analytic applications generate so many insights that I should decommission my enterprise reporting platform, right?” and “Should we implement Hadoop to replace our data warehouse and/or reporting platform?”13

The misconception behind these questions (which makes them “off-base”) is that people too often forget that analytics and reporting are two different practices. Analytics and reporting serve different user constituencies, produce different deliverables, prepare data differently, and support organizational goals differently. Despite a fair amount of overlap, analytics and reporting are complementary, which means most organizations need both and neither will replace the other. Furthermore, due to their differences, each has unique tool, data, data platform, and architecture requirements that you need to satisfy if you’re to get the most out of each.

DW Architectures for Reporting

Reporting is mostly about entities and facts you know well, represented by highly polished data that you know well. And that data usually takes the form of carefully modeled and cleansed data with rich metadata and master data that’s managed in a data warehouse. In fact, it’s difficult to separate reporting and data warehouses, because most users designed their DWs first and foremost as a repository for reporting and similar practices such as OLAP, performance management, dashboards, and operational BI.

The IT press regularly claims that Hadoop can replace a true DW. TDWI finds this unlikely because the current state of Hadoop cannot satisfy the data requirements of enterprise reporting nearly as well as the average RDBMS-based DW can. Ultimately, it’s not about a warehouse and its architecture per se; it’s about mission-critical practices a DW supports well, such as reporting. This may change in the future, because Hadoop gets more sophisticated almost daily. Even if change comes, the point remains: enterprise reporting, performance management, and OLAP depend on an RDBMS-based DW for success. That’s 80–90% of the deliverables of a BI program, so it behooves savvy users to keep and protect their DWs.

DW Architectures for Analytics

Advanced analytics enables the discovery of new facts you didn’t know based on the exploration and analysis of data that’s probably new to you. New data sources generally tell you new things, which is one reason organizations are exploring, analyzing, and visualizing big data more than ever before. Unlike the pristine data that reports operate on, advanced analytics works best with detailed source data in its original (even messy) form, using discovery-oriented technologies such as ad hoc queries, search, mining, statistics, predictive algorithms, and natural language processing. Sure, DWs can be expanded to support some forms of big data and advanced analytics, but the extreme volumes, diversity, and hefty processing of big data are driving more and more users to locate big data on standalone platforms within an extended DW environment—platforms such as Hadoop, NoSQL databases, DW appliances, and columnar databases.

Providing separate data platforms for reporting and analytics is a win-win data strategy, and it’s a common manifestation of the trend toward multi-platform DW architectures. It frees up capacity on the DW so it can continue growing and supporting enterprise reporting and related practices. And it provides data platforms for advanced analytics that are more affordable and adept for exploration, discovery, and non-OLAP analysis with large volumes of multi-structured data.

Treat analytics and reporting differently if you want to get the most out of each

Traditional DWs are killer platforms for reporting, and reporting is mission critical. So protect your DW for the sake of reporting and related practices

An emerging strategy is to preserve a DW for reporting and related practices, but deploy complementary platforms for analytics

13 Some of the material in this section is drawn from the TDWI blog post by Philip Russom, “Analytics and Reporting are Two Different Practices,” online at http://tdwi.org/Blogs/Philip-Russom/2013/09/Analytics-and-Reporting-are-Two-Different-Practices.aspx.

USER STORY: MANY TIMES, THE ROAD TO AN EFFECTIVE DATA WAREHOUSE IS PAVED WITH CONSOLIDATIONS

Bridget Quinlan is the director of BI for Brunswick Corporation: “Our company consists of four semi-autonomous divisions. Until several years ago, each had its own strategy for BI; as a result, we had many disparate BI systems across the enterprise that essentially performed the same functions in slightly different ways. As a consequence, Brunswick as a whole was inefficient and inconsistent in this area.

“To correct the situation, we spent a few years collocating and consolidating data warehousing and reporting systems from the divisions and departments, plus doing a more thorough job of integrating application data into the BI/DW infrastructure. Today, we’re the opposite of what we used to be, in that our BI is more centralized and standardized than in most organizations, with very few departmental systems, all managed by a single team organized in a shared services model.

“Despite where we came from, we are today remarkably close to having a single, monolithic data warehouse architecture. We mostly do traditional reporting, for which we can easily source data from our central data warehouse without need for additional platforms for, say, advanced analytics. The warehouses we started with were built with the Ralph Kimball approach, and we’ve stuck with Kimball for the most part, because it applied well to the consolidations we were doing.”

Trends in Data Warehouse Architectural Components

By this point in the report, we’ve seen a number of data structures, tools, features, and platform types that could serve as components in some form of data warehouse architecture. These range widely, and they include marts, cubes, ODSs, analytic functions, numerous DBMS types, in-memory functions, and various Hadoop products. Most of these can be deployed physically as a standalone system or co-located as a function within a data warehouse DBMS or other system. Regardless of what project stage you’re in relative to designing and maintaining a DW architecture, knowing the available components and their uses is fundamental to making good decisions about approaches to take and software products to evaluate.

To quantify the amount of use that components are getting, as well as to draw the big picture of available components, TDWI presented survey respondents with a list of 30 components that could be used in a DW architecture. The survey asked: “In the architecture of your organization’s primary data warehouse, which are in use today? Which will be in use within three years? Which do you know you won’t use?” (See Figure 12.)

In the survey, each DW architectural component on the list had three possible answers:

• No plans to use

• Using today; will keep using

• Will use within three years

There are many tool and platform types you could integrate into your warehouse architecture. Choose carefully


In the architecture of your organization’s primary data warehouse, which are in use today? Which will be in use within three years? Which do you know you won’t use?

(Each row lists three percentages, in this order: no plans to use | using today and will keep using | will use within three years.)

In-memory analytics 37% 27% 36%
Shared metadata repository 30% 35% 35%
Hadoop Distributed File System (HDFS) 53% 13% 34%
In-memory DBMS or data cache 43% 23% 34%
MapReduce—Hadoop implementation 56% 12% 32%
In-database analytics 36% 35% 29%
Analytic sandboxes 36% 35% 29%
Cloud as platform for some DW components 58% 14% 28%
Columnar DBMS 47% 26% 27%
Solid-state drives for "hot" data 56% 19% 25%
MapReduce—vendor implementation 66% 9% 25%
Metrics for performance management 20% 56% 24%
ODS for landing real-time data 41% 35% 24%
NoSQL DBMS 65% 11% 24%
Data warehouse appliance 44% 33% 23%
Federated or virtual data views 48% 29% 23%
Redundancy for high availability 38% 41% 21%
Event processing 44% 35% 21%
Multi-dimensional cubes for OLAP 22% 61% 17%
Enterprise data warehouse (EDW) 11% 75% 14%
Tabular data for reports 22% 65% 13%
ODSs for subject areas 38% 49% 13%
Relational DBMS on MPP 49% 38% 13%
ODS for persisted source data 40% 48% 12%
Data marts 10% 79% 11%
ODS for temporary source data 47% 42% 11%
Multi-dimensional snowflake schema 37% 53% 10%
Multi-dimensional star schema 16% 75% 9%
Data staging area 9% 83% 8%
Relational DBMS on SMP 53% 41% 6%

Figure 12. Based on 538 respondents. The chart is sorted by the “will use within three years” column. This data set is also the basis for subsets charted in Figures 13, 14, and 15.

The following sections of this report sort the survey data from this question into three different orders (one for each column above) to reveal which architectural components are most used today, which will be adopted most in the near future, and which are most often omitted from users’ plans.

As mentioned, the survey question presented a list of 30 architectural components (as charted in Figure 12). Figures 13, 14, and 15 include only the 15 components that bubbled to the top of each sort order, so that we can focus on the components most affected by each sort.


Responses to this question indicate which warehouse components successful users are employing today, as well as which they anticipate employing in a few years. From this information, we can quantify trends and project future directions for data warehouse architectures. We can also deduce priorities that can guide users in planning their future efforts in data warehousing. It’s useful to know in what order successful users adopt the options, because that enables the effective planning of project lifecycle stages.

Components Used Most Commonly Today

Not surprisingly, the “meat and potatoes” components are the most common. The list of top 15 components used today (see Figure 13) includes well-established hallmarks of the traditional data warehouse such as data stages, data marts, multi-dimensional schema, OLAP cubes, ODSs, tabular data, relational DBMSs, and metadata repositories. Conspicuously absent are newer platforms and features such as in-memory functions, Hadoop, clouds, columnar DBMSs, and appliances—but we’ll see these soon in the list of top components to be aggressively adopted in the next few years.

The data stage is the most common component of a data warehouse architecture (83%). Whether you have a true data warehouse or simply an ODS, row store, or a body of marts, you have to land, stage, and process (to some degree) the data coming into those environments for the purposes of reporting, analytics, data sync, archiving, and so on. At one extreme, some “warehouse” environments are almost exclusively data staging, because most data sets for reports and analyses are instantiated at run time, not beforehand. Comprehensive data warehouses typically have multiple staging platforms, one per data type or source. This explains why data stages can be slightly more common than core warehouses.

The data mart is also a leading component of DW architectures (79%). A single-subject mart is a manageable chunk that avoids big bang projects, provides a low-risk starting point, and satisfies well-targeted user needs. It’s no wonder—as seen in most of the User Stories in this report—that many users structure their warehouse architecture as a series of data marts. Furthermore, many marts exist successfully outside the context of a warehouse, as seen in the growing number of solutions for departmental BI and single-subject analytic applications.

Multidimensionality continues to be an important hallmark of data warehousing. This takes multiple forms, such as the star schema (75%), snowflake schema (53%), and OLAP cubes (61%). All of these assume hefty data processing, which persists multidimensional data to disk for the sake of high performance for queries. With new practices in advanced analytics, however, multidimensionality is expressed at run time in the analytic layer, as seen in the complex SQL-based analytics run on columnar DBMSs and DW appliances. Likewise, older methods of modeling data and persisting it for performance are now being complemented by new analytic practices that deduce data models from raw source data at run time, as is typical of analytics run atop schema-neutral Hadoop or NoSQL platforms, as well as appliances and columnar RDBMSs that enable users to develop and apply schema on the fly.

MPP is closing the gap on SMP. Around the turn of the century, user organizations began replacing hardware servers and DBMSs built on the application-oriented symmetrical multi-processing (SMP) computing architecture in favor of the data-optimized massively parallel processing (MPP) architecture. The percentage split between SMP/MPP in a TDWI survey in 2009 was 61% / 33%, respectively.14 The survey for this report puts them neck-and-neck at 41% / 38%. TDWI expects MPP to soon pull ahead of SMP and become the most common computing architecture for DWs and analytics thanks to its distinct advantages with queries against very large data sets, as is typical of big data analytics.
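MPP’s advantage for large scans can be sketched in miniature: the data is partitioned, each “node” scans and aggregates its own slice independently, and only small partial results are merged at the end. This is a conceptual sketch only, with threads on one machine standing in for shared-nothing MPP nodes and an invented predicate and data set.

```python
from concurrent.futures import ThreadPoolExecutor

# An invented "table" of rows, hash-partitioned across 4 nodes.
rows = list(range(1_000))
partitions = [rows[i::4] for i in range(4)]

def node_scan(part):
    # Each node applies the predicate and aggregates locally,
    # touching only its own slice of the data.
    return sum(x for x in part if x % 2 == 0)

# Nodes scan in parallel; the coordinator merges tiny partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(node_scan, partitions))
print(sum(partials))  # equals a serial scan of the full table
```

The merged result is identical to a serial (SMP-style) scan; the difference is that each slice is scanned concurrently, which is what makes the approach attractive as tables grow to big data scale.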

Established components are still the most common, although they’re being joined by new ones

14 See Figure 11 in the 2009 TDWI Best Practices Report Next Generation Data Warehouse Platforms, available online at tdwi.org/bpreports.


Top 15 Data Warehouse Architectural Components Used Today

Data staging area 83%
Data marts 79%
Multi-dimensional star schema 75%
Enterprise data warehouse (EDW) 75%
Tabular data for reports 65%
Multi-dimensional cubes for OLAP 61%
Metrics for performance management 56%
Multi-dimensional snowflake schema 53%
ODSs for subject areas 49%
ODS for persisted source data 48%
ODS for temporary source data 42%
Redundancy for high availability 41%
Relational DBMS on SMP 41%
Relational DBMS on MPP 38%
Shared metadata repository 35%

Figure 13. Based on 538 respondents. The data subset charted here was created by sorting the “using today” column in Figure 12 in descending order and taking the top 15 rows.

Components to be Adopted Within Three Years

Analytics is clearly driving most users’ adoption of new platforms and features. Most of the DW architecture components listed in Figure 14 are built for analytics (with the possible exceptions of metadata, clouds, and solid-state drives). As discussed early in this report, the business community increasingly uses analytics to compete, determine direction, acquire and retain customers, operate faster, discover new problems and opportunities, and run the business by the numbers. Hence, any technical platform or tool feature that enables analytics (or the management of data for analytics) is a potential gain for a data-driven business.

Managing non-relational big data is also a pressing need for many organizations. All enterprise data should be considered an asset, and big data is no exception. Yet, most organizations are just now figuring out how to manage the large data sets and diverse formats of big data. Despite today’s hype focused on big data, the primary path to business value and ROI from big data is through analytics. Hence, organizations are planning to complement existing relational DBMSs with data platforms designed for advanced analytics with big data. This naturally leads users to consider HDFS (34%), open source MapReduce (32%), vendor-built MapReduce (25%), and NoSQL DBMSs (24%). These platforms fill critical holes in DW architectures based around RDBMSs, so TDWI expects Hadoop and other non-relational data platforms to be far more common in DW architectures within three years. But don’t expect the new to replace the old: they complement each other, and TDWI anticipates new architectures will integrate relational and non-relational platforms.

Real-time operation is just as important as analytics and big data. Many organizations are under pressure to operate closer to real time, as described early in this report. On the technology level, real-time business operation demands data managed and analyzed faster and more frequently. Hence, many of the DW components poised for aggressive user adoption contribute to speed and frequency, namely in-memory analytics (36%), in-memory DBMSs (34%), solid-state drives (25%), and ODSs for landing real-time data (24%).

DW components poised for growth are new, and they’re focused on analytics, big data, and real-time operation


Relational technology is more relevant than ever, but in updated forms. In recent years, TDWI has seen its members aggressively adopt various types of “analytic databases.” These are new relational database management systems (RDBMSs) that were built specifically for the most popular form of advanced analytics today, namely SQL-based analytics. Analytic databases usually take the form of columnar DBMSs (27%) or data warehouse appliances (23%), which the survey shows to be poised for aggressive adoption.

Analytic databases are sometimes the core RDBMS under an EDW, but the vast majority of the time, they manage terabyte-scale data marts or ODSs specifically for SQL-based analytics, not reporting or general DW purposes. Analytic databases are optimized for high performance with extremely complex SQL routines executed against raw source data, mostly in its original operational schema. Analytic databases remind us that relational technology is still highly relevant—especially when updated appropriately, as in analytic databases—not just for the core EDW and reporting, but also for advanced forms of analytics with big data.

Top 15 Data Warehouse Architectural Components to be Adopted Within Three Years

In-memory analytics 36%
Shared metadata repository 35%
Hadoop Distributed File System (HDFS) 34%
In-memory DBMS or data cache 34%
MapReduce—Hadoop implementation 32%
In-database analytics 29%
Analytic sandboxes 29%
Cloud as platform for some DW components 28%
Columnar DBMS 27%
Solid-state drives for "hot" data 25%
MapReduce—vendor implementation 25%
Metrics for performance management 24%
ODS for landing real-time data 24%
NoSQL DBMS 24%
Data warehouse appliance 23%

Figure 14. Based on 538 respondents. The data subset charted here was created by sorting the “will use within three years” column in Figure 12 in descending order and taking the top 15 rows.

Components Excluded From Users’ Plans

Substantial numbers of user organizations have no plans to adopt new platforms or features. Ironically, 10 of the 15 architectural components listed in Figure 15 (as commonly excluded from user plans) also appear in Figure 14 (as components to be aggressively adopted). This is often the case with new platforms and features: despite the passion and commitment of early adopters, many organizations are not open to new options or simply don’t need them because older options still fulfill their requirements. Extrapolating from survey data, roughly one-third of respondents are planning to adopt new platforms, whereas two-thirds are not. Given the newness and focus on analytics (versus reporting or meat-and-potatoes warehousing) of Hadoop, MapReduce, NoSQL, and other new platforms or features, one-third is quite good.

One-third of users surveyed plan to adopt new DW component types, whereas two-thirds don’t


Top 15 Data Warehouse Architectural Components Excluded from Users’ Plans

MapReduce—vendor implementation 66%

NoSQL DBMS 65%

Cloud as platform for some DW components 58%

MapReduce—Hadoop implementation 56%

Solid-state drives for "hot" data 56%

Hadoop Distributed File System (HDFS) 53%

Relational DBMS on SMP 53%

Relational DBMS on MPP 49%

Federated or virtual data views 48%

Columnar DBMS 47%

ODS for temporary source data 47%

Data warehouse appliance 44%

Event processing 44%

In-memory DBMS or data cache 43%

ODS for landing real-time data 41%

Figure 15. Based on 538 respondents. The data subset charted here was created by sorting the “no plans to use” column in Figure 12 in descending order and taking the top 15 rows.

USER STORY A LATE-BINDING, AGILE WAREHOUSE IS ONE WAY TO KEEP PACE WITH RAMPANT BUSINESS EVOLUTION "In reaction to the Affordable Care Act, the state I work for decided to build its own cooperative health care exchange, instead of using the federal one," said the director of analytics at a healthcare provider and insurer. "As you can imagine, it's been a mad dash to create a new exchange and populate it with partners and members. Since all exchanges of this type are heavily digital, it's been equally hectic to build and deploy IT systems, including a data warehouse with reporting and analytics.

“Even stranger is the fact that my team was asked to build a data warehouse that satisfies the requirements of a business that didn’t exist when we started a year ago. As the organization was defined and assembled, we made our best guesses about what data and BI deliverables the exchange would eventually need. Plus, the work had to be done with the small staff that’s typical of a startup, as well as with the limited budget and immutable launch date that were set in stone by state and federal legislation.

“To cope with the rapid evolution, we set a number of stop points where we halt work and review technical requirements to see if we’re aligned with recent changes in direction and other progress. We’re employing an agile sprint method, which works well with the high rate of change we’re dealing with. One of the best decisions we made was to go with a late-binding architecture, where most models, metrics, and data are assembled at run time, at the analytic level. Otherwise, we’d never have kept pace with the constantly changing data, sources, targets, reports, business requirements, staff, and portfolio of vendor tools.”
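The late-binding idea from this story can be sketched in a few lines, assuming invented record and metric names: raw records are stored untouched, and metric definitions are applied at analysis time, so a redefined metric requires no reload of the underlying data.

```python
# Late-binding sketch: raw records are stored as-is; metric definitions
# (hypothetical names and fields) are bound at analysis time, not at load time.
raw_claims = [
    {"claim_id": 1, "billed": 500.0, "paid": 400.0},
    {"claim_id": 2, "billed": 300.0, "paid": 300.0},
]

# Redefining a metric here requires no change to the stored records.
metrics = {
    "paid_ratio": lambda rec: rec["paid"] / rec["billed"],
}

def analyze(records, metric_name):
    """Compute a named metric at run time, at the analytic level."""
    metric = metrics[metric_name]
    return {rec["claim_id"]: round(metric(rec), 2) for rec in records}

print(analyze(raw_claims, "paid_ratio"))  # {1: 0.8, 2: 1.0}
```

The trade-off is that more work happens at query time, but definitions can track a fast-changing business without reloading or remodeling stored data.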


Vendor Platforms and Tools as DW Architectural Components

Since the firms that sponsored this report are all good examples of vendors that offer new types of data platforms, tools, hardware, and professional services for new DW architectures, let's take a brief look at the product portfolio of each. The sponsors form a representative sample of the vendor community, but their offerings illustrate different approaches.15

Actian Corporation has accumulated a fairly comprehensive portfolio of platforms and tools for managing analytics, big data, and all other enterprise data, encompassing the full range of structured, semi-structured, and unstructured data and content types. The new Actian Analytics Platform includes connectivity to more than 200 sources, a visual framework that simplifies ETL and data science, high-performance analytic engines, and libraries of analytic functions.

The Actian Analytics Platform centers on Matrix (a massively parallel columnar RDBMS formerly called ParAccel) and Vector (a single-node RDBMS optimized for BI). Actian DataFlow accelerates ETL natively on Hadoop. Actian Analytics includes more than 500 analytic functions ready to run in-database or on Hadoop. Actian DataConnect connects and enriches data from over 200 sources on-premises or in the cloud. The Actian platform is integrated by a modular framework that enables users to quickly connect to all data assets for open-ended analytics with linear scalability.

Strategic partnerships include Hortonworks (for HDFS and YARN), Attivio (for big content), and a number of contributors to the Actian Analytics library.

Cloudera is a leading provider of Apache Hadoop–based software, services, and training, enabling data-driven organizations to derive business value from all their data while simultaneously reducing the costs of data management. CDH (Cloudera’s distribution including Apache Hadoop) is a comprehensive, tested, and stable distribution of Hadoop that is widely deployed in commercial and non-commercial environments. Organizations can subscribe to Cloudera Enterprise—comprising CDH, Cloudera Support, and the Cloudera Manager—to simplify and reduce the cost of Hadoop configuration, rollout, upgrades, and administration. Cloudera also provides Cloudera Enterprise Real-Time Query (RTQ), powered by Cloudera Impala, the first low-latency SQL query engine that runs directly over data in HDFS and HBase. Cloudera Search increases data ROI by offering non-technical resources a common and everyday method for accessing and querying large, disparate big data stores of mixed format and structure managed in Hadoop. As a major contributor to the Apache open source community, with customers in every industry, and a massive partner program, Cloudera’s big data expertise is profound.

Datawatch Corporation provides a visual data discovery and analytics solution that optimizes any data—regardless of its variety, volume, or velocity—to reveal valuable insights for improving business decisions. Datawatch has a unique ability to integrate structured, unstructured, and semi-structured sources—such as reports, PDF files, print spools, and EDI streams—with real-time data streams from CEP engines, tick feeds, or machinery and sensors into visually rich analytic applications, which enable users to dynamically discover key factors about any operational aspect of their business. Datawatch steps users through data access, exploration, discovery, analysis, and delivery, all in a unified and easy-to-use tool called Visual Data Discovery, which integrates with existing BI and big data platforms. IT’s involvement is minimal in that IT sets up data connectivity; most users can create their own reports and analyses, then publish them for colleagues to share in a self-service fashion. The solution is suitable for a single analyst, a department, or an enterprise. Regardless of user type, whether business or technical or both, all benefit from the high ease of use, productivity, and speed to insight that Datawatch’s real-time data visualization delivers.


15 The vendors and products mentioned here are representative, and the list is not intended to be comprehensive.


For years, Dell Software has been acquiring and building software tools (plus partnering with leading vendors for more tools) with the goal of assembling a comprehensive portfolio of IT administration tools for securing and managing networks, applications, systems, endpoints, devices, and data. Within that portfolio, Dell Software now offers a range of tools specifically for data management, with a focus on big data and analytics. For example, Toad Data Point provides interfaces and administrative functions for most traditional databases and packaged applications, plus new big data platforms such as Hadoop, MongoDB, Cassandra, SimpleDB, and Azure. Spotlight is a DBA tool for monitoring DBMS health and benchmarking. Shareplex supports Oracle-to-Oracle replication today, and will soon support Hadoop. Kitenga Big Data Analytics enables rapid transformation of diverse unstructured data into actionable insights. Boomi MDM launched in 2013. The new Toad BI Suite pulls these tools together to span the entire information life cycle of big data and analytics. After all, the goal of Dell Software is: one vendor, one tool chain, all data.

HP Vertica provides solutions to big data challenges. The HP Vertica Analytics Platform was purpose-built for advanced analytics against big data. It consists of a massively parallel database with columnar support, plus an extensible analytics framework optimized for the real-time analysis of data. It is known for high performance with very complex analytic queries against multi-terabyte data sets.

Vertica offers advantages over SQL-on-Hadoop analytics, shortening some queries from days to minutes. Although SQL is the primary query language, Vertica also supports Java, R, and C. Furthermore, the HP Vertica Flex Zone feature enables users to define and apply schema during query and analysis, thereby avoiding the need to preprocess data or deploy Hadoop or NoSQL platforms for schema-free data.

HP Vertica is part of HP’s new HAVEn platform, which integrates multiple products and services into a comprehensive big data platform that provides end-to-end information management for a wide range of structured and unstructured data domains. To simplify and accelerate the deployment of an analytic solution, HP offers the HP ConvergedSystem 300 for Vertica—a pre-built and pre-tested turn-key appliance.

MapR provides a complete distribution for Apache Hadoop, which is deployed at thousands of organizations globally for production, data-driven applications. MapR focuses on extending and advancing Hadoop, MapReduce, and NoSQL products and technologies to make them more feature rich, user friendly, dependable, and conducive to production IT environments. For example, MapR is spearheading the development of Apache Drill, which will bring ANSI SQL capabilities to Hadoop in the form of low-latency, interactive query capabilities for both structured and schema-free, nested data. As other examples, MapR is the first Hadoop distribution to integrate enterprise-grade search; MapR enables flexible security via support for Kerberos and native authentication; and MapR provides a plug-and-play architecture for integrating real-time stream computational engines such as Storm with Hadoop. For greater high availability, MapR provides snapshots for point-in-time data rollback and a No NameNode architecture that avoids single points of failure within the system and ensures there are no bottlenecks to cluster scalability. In addition, it’s fast; MapR set the Terasort, MinuteSort, and YCSB world records.


USER STORY MIGRATIONS TYPICALLY FORCE OR EMPOWER CHANGES IN DATA WAREHOUSE ARCHITECTURE “We recently migrated a data warehouse from an older relational database to a more modern one—built specifically for data warehousing and analytics—and we learned a number of lessons from the experience,” said a solutions architect and data warehouse professional at an insurance firm. “We started with a ‘forklift’ that simply copied data—pretty much unaltered—from the old platform to the new one. Luckily, most queries and other processing ran as well or better after the forklift. But there were a number of areas where new development was needed to get the most out of the new platform.

“For example, spooling and partitioning are radically different on the two platforms, so most of the new development focused there. Obviously, stored procedures, user-defined functions, and other proprietary functionality had to be recreated on the new platform. But the reward for these efforts was dramatic. In some cases, query times were reduced by 90% (for example, fact table load and report generation), with the most extreme example being a purge job that went from three days down to 10 minutes—all without having to make any changes to the reporting layer.

“Our next step is to implement in-database analytics. In fact, one of the reasons for the migration is that the analytic tool we use (for statistical modeling with actuarials and financial instruments) will also run as a user-defined function inside the new database platform we just migrated to. Once in-database analytics is set up, we’ll be able to refresh analytic models while the data’s left in place in the RDBMS, instead of moving large data sets to the analytic tool. This both simplifies and speeds up analytics so we can refresh the models faster and more frequently, and that’s something the business needs.”
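The in-database analytics pattern this story describes can be sketched with SQLite's user-defined functions, with a toy linear formula standing in for the actuarial model (all names and coefficients below are invented). The model executes inside the database, so the data never leaves the RDBMS:

```python
import sqlite3

def risk_score(exposure, claims):
    # Hypothetical stand-in for a statistical model deployed as a UDF.
    return round(0.7 * exposure + 0.3 * claims, 2)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE policies (policy_id INTEGER, exposure REAL, claims REAL)")
con.executemany("INSERT INTO policies VALUES (?, ?, ?)",
                [(1, 10.0, 2.0), (2, 4.0, 8.0)])

# Register the model as an in-database function and score rows in SQL;
# the data stays in place instead of being exported to an analytic tool.
con.create_function("risk_score", 2, risk_score)
rows = con.execute(
    "SELECT policy_id, risk_score(exposure, claims) FROM policies ORDER BY policy_id"
).fetchall()
print(rows)  # [(1, 7.6), (2, 5.2)]
```

In a production analytic RDBMS, the registered function would be the vendor's compiled analytic routine rather than a Python callback, but the payoff is the same: only results move, not large data sets.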


Top Ten Priorities for Data Warehouse Architectures

In closing, let's summarize the report by listing the top 10 priorities for data warehouse architectures, with a few comments about why each is important. Think of the priorities as requirements, rules, or recommendations that can guide organizations toward successful strategies for evolving data warehouse architectures.

1 Recognize that successful data warehouse architectures have integrated logical and physical layers, plus other components. This report has focused on the server topology within the physical layer. However, equally important are the logical layer and its data standards component. Successful DW architectures fully work out both logical and physical layers to get the most out of each. If you ignore the logical layer, you won't have meaningful data models that align with business requirements. Nor will you have consistent, auditable, quality data based on standards. Likewise, if you ignore the physical layer, you won't have the server diversity you need for diverse workloads, and you may lack performance, scalability, and availability. For these reasons, 98% of survey respondents feel DW architecture is extremely or moderately important. Don't be the 2%.

2 Identify the business and technical drivers in your organization, and let those determine the evolution of your data warehouse architecture. If your organization is like those surveyed, your priorities will be advanced analytics (not reporting), big data (as fuel for analytics), and real-time operation (which includes real-time analyses). Analytics is clearly the leading driver (and beneficiary) for evolving DW architectures, whereas big data and real-time operation have significant overlaps with analytics.

3 Be aware that the leading barrier to successful DW architecture is inadequate staffing and skills. Hadoop and NoSQL DBMSs are so new that it's understandable that few DW professionals have experience with them, but TDWI regularly finds DW teams that likewise have little experience with analytic methods or architecting warehouses. Expect to hire and train team members before tackling your next upgrade of DW architecture. Consultants are probably in order, too. Tools that simplify data access and analysis can help close the skills gap, as well.

4 Before expanding data warehouse architecture into new platform types, secure sponsorship, funding, and improvements to data management infrastructure. These are prominent barriers to success, as well.

5 Don't forget that certain platform features are worthy of adoption. It's not always about the whole platform. For example, TDWI is seeing more and more users turning on features within their RDBMSs that they haven't used before, particularly features that enable analytics, such as in-memory databases, in-database analytics, and columnar storage engines.

6 Be open to hybrids and alternate standards. Survey data says that many users mix multiple architectural features or approaches. For example, mixing Inmon approaches with Kimball ones is fairly common. As another example, a common server plan has an EDW at its center, but with several standalone data platforms complementing it. As long as all the pieces integrate appropriately, and non-standard components don't outnumber the standard ones, the architecture will serve its purposes well.


7 Don't be pedantic about DW architectures or standards you establish. Although it's good to have guidelines, you needn't adhere to them slavishly. A few non-standard data models or interfaces won't bring down an otherwise pristine architecture. The most common reason to allow these is to get higher performance with unusual data structures or systems. Just keep these aberrations to a minimum.

8 You may need to integrate Hadoop into your data warehouse environment. The likelihood of this happening in your organization increases along with increases in data structure diversity, schema-free data, unstructured data, multi-structured data, advanced analytics, data volumes, file-based data, log data, Web data, social data, and the need for really cheap storage and archiving. If three or more of these are hitting your organization, then Hadoop and NoSQL technologies are probably in your future.

9 Never forget that analytics and reporting are different practices with different DW architectural requirements. Most of the new platforms and features highlighted in this report are exclusively for analytics, not reporting. That's fine, since most of the changes TDWI sees users making to their architectures are for analytics.

10 Don't expect the new stuff to replace the old stuff. DW architectures and software portfolios are like the rest of IT: the portfolio is almost always accretive, in that we add more new tools and platforms while decommissioning relatively few old ones. We need more platform types because business and technology requirements just keep expanding. Hence, HDFS won't replace DWs, NoSQL DBMSs won't replace RDBMSs, and analytics won't replace reporting.


Research Co-sponsor

It's not just about big data, it's about fast data

How to harness evolving data warehouse architectures and visual data discovery to deliver faster business insights and better results

The world of business intelligence (BI) and analytics is undergoing significant changes, and it's about time. Real-time data warehouses and streaming data hold the potential to dramatically reduce the time to deliver business-critical information, provided that data can be processed, understood, and analyzed in real time. Datawatch visual data discovery solutions enable line-of-business users to independently and easily create and share analyses without delay.

Visual Data Discovery Displacing Traditional BI

Traditional BI has been a relatively slow process as it required a top-down, IT-led approach in which the data needed to be modeled to create semantic layers, and reports were created and pushed to users. Visual data discovery delivers much faster results than traditional BI as it takes a bottom-up approach in which business users are in the driver's seat and can easily access and combine data from multiple sources. Instead of viewing static, tabular data from traditional BI reports, the interface is based on interactive visualization for intuitive exploration.

Traditional BI is good at answering common questions such as, “What were our sales for the quarter in Asia?” However, visual data discovery is about finding things that are not yet known and answering questions never considered, such as, “I didn’t know that we closed some very large deals in Singapore, but the rest of Asia doesn’t seem to be doing well. Let me see what’s causing this anomaly.” Even advanced data discovery solutions do not support the requirements brought about by big data—specifically the use of streaming data sources.

Visualizing Streaming Data—The New Normal

The amount of streaming data being generated is exploding rapidly from the growing variety of interconnected machines, devices, sensors, and consumer content. And the amount of data generated from any one source can be staggering.

Datawatch is a next-generation visual data discovery solution that can consume data streams directly from streaming sources, which push information in real time, on a tick-by-tick basis. Line-of-business users can see exactly what is happening as it occurs and know precisely how the business is performing. When an outlier or anomaly is spotted, Datawatch can combine the streaming data with historical data for a comprehensive understanding that enables better, faster decisions and timely action. Alerts can also be set to notify business users when certain conditions or thresholds are reached.

See for Yourself

Visual data discovery and streaming analytics represent the future of big data analytics initiatives. Organizations can begin to prepare for the changes today by downloading the free Datawatch Desktop trial edition.

An example Datawatch dashboard based on data streamed directly from sources in real time

datawatch.com

TDWI RESEARCH

TDWI Research provides research and advice for business intelligence and data warehousing professionals worldwide. TDWI Research focuses exclusively on BI/DW issues and teams up with industry thought leaders and practitioners to deliver both broad and deep understanding of the business and technical challenges surrounding the deployment and use of business intelligence and data warehousing solutions. TDWI Research offers in-depth research reports, commentary, inquiry services, and topical conferences as well as strategic planning services to user and vendor organizations.

555 S Renton Village Place, Ste. 700

Renton, WA 98057-3295

T 425.277.9126

F 425.687.2842

E [email protected]

tdwi.org