Lenovo Big Data Validated Design for Cazena SaaS using ...lenovopress.com/lp1179.pdf · Lenovo Big...


Lenovo Big Data Validated Design for Cazena SaaS using Cloudera on ThinkAgile HX

Dan Kangas (Lenovo)

Venkat Chandra (Cazena)

John Piekos (Cazena)

Keith Adams (Lenovo)

Ajay Dholokia (Lenovo)

Gary Cudak (Lenovo)

Last update: 27 June 2019 Version 1.0

Configuration Reference Number: BGDCL03XX91

Deployment considerations for scalable racks including detailed validated bills of material

Solution based on the ThinkSystem HX platform running VMware vSphere virtualization

Reference architecture for Cloudera Enterprise with Cazena SaaS Data Lakes

Solution based on Cazena’s Fully Managed SaaS Data Lake running on the ThinkAgile HX platform


Table of Contents

1 Introduction
2 Business problem and business value
    Business Problem
    Business Value
        2.2.1 Time to Production
        2.2.2 Deploy ML & Analytics Quickly with the AppCloud
        2.2.3 Plug & Play Enterprise Deployment
        2.2.4 Enabling Multi-tenancy
        2.2.5 DevOps 24x7 Production Operations
        2.2.6 Secure data lake with a highly-differentiated security model
        2.2.7 Best of Breed Data Lake - Optimized for ThinkAgile HX
3 Requirements
    Functional Requirements
    Non-functional Requirements
4 Architectural Overview
    Cazena
    Cloudera Enterprise
    Lenovo ThinkAgile HX Appliance
5 Component Model
    Cazena Components
        5.1.1 Cazena AppCloud
        5.1.2 Data Ingestion
        5.1.3 Cazena Security
        5.1.4 The Cloudera Data Lake
        5.1.5 Cazena Operations (DevOps)
    ThinkAgile HX - Nutanix
        5.2.1 Nutanix Prism
        5.2.2 Controller VM (CVM)
6 Operational Model
    Hardware Description
        6.1.1 ThinkAgile HX Key features
        6.1.2 Lenovo ThinkAgile HX5520 Appliance
        6.1.3 Lenovo ThinkAgile HX3320 Appliance
        6.1.4 Lenovo ThinkSystem NE0152T
        6.1.5 Lenovo RackSwitch G8272
    Cluster HX Node Configurations
        6.2.1 Node T-shirt Sizes
        6.2.2 Node types and VM organization
        6.2.3 System Management node
    Cluster Software Stack
    Cazena Orchestration
    Cazena AppCloud
    Cloudera Service Role Layouts
    System Management
    Networking
        6.8.1 Data Network
        6.8.2 Hardware Management Network
        6.8.1 10Gb and 25Gb Data Network Configurations
    Predefined Lenovo HX Cluster Configurations
        6.9.1 Cluster Storage Capacity
        6.9.2 Storage Tiering with NVMe and SSD Drives
7 Deployment Considerations
    Increasing Cluster Performance
    Processor Selection
    Memory Size and Performance
    Designing for Storage Capacity and Performance
        7.4.1 Node Capacity
        7.4.2 Estimating Disk Space
    Scaling Considerations
8 Cluster Hardware Bill of Materials
    HX5520 Bill of Materials
    HX3320 Bill of Materials
    Systems Management Node Bill of Materials
    Network
    Rack
9 Acknowledgements
10 Resources
11 Document History


1 Introduction

This document describes the reference architecture for Cazena’s big data Software as a Service (SaaS) offering with Cloudera running on the Lenovo ThinkAgile HX platform. This solution delivers Cloudera Enterprise as a fully managed SaaS private cloud data lake.

The reference architecture is a predefined and optimized hyperconverged hardware infrastructure and includes the planning, design considerations, and best practices for delivering a Cazena SaaS data lake. Lenovo and Cazena have worked together on this document, and the reference architecture that is described herein was validated by Lenovo and Cazena.

With the ever-increasing five Vs of big data (volume, variety, velocity, veracity, and value), you need an infrastructure flexible enough to switch workloads quickly as needed. The ability to quickly provision multiple clusters on a single platform is key to this flexibility, and the Lenovo HX hyperconverged appliance is the cornerstone.

The Lenovo ThinkAgile HX series brings a best-in-class hyperconverged system with Nutanix’s industry-leading software preloaded on Lenovo platforms. It dramatically simplifies data center management, frees up your IT staff, and accelerates your deployment, ultimately reducing your total cost of ownership. Nutanix offers one-click management through the Prism UI. The experience that comes from building its own distributed file system uniquely positions Nutanix to maintain, remediate, and give insights into Cloudera’s Hadoop infrastructure. In addition, Nutanix uses built-in big data modules to integrate directly with big data frameworks such as Cloudera for analytics and database applications.

Cloudera brings the power of Hadoop to the customer's enterprise. Hadoop is an open source software framework that is used to reliably manage large volumes of structured and unstructured data. Cloudera expands and enhances this technology to withstand the demands of your enterprise. It enables diverse analytic processes to operate against a shared data catalog that preserves business context, such as security and governance policies, and makes it easier for IT to set and enforce policies while providing business access to self-service analytics. The result is a more enterprise-ready solution for complex, large-scale analytics.

Capping all of this off is Cazena's Software as a Service data lake, which eliminates the need for both ongoing DevOps resources and administration of the big data platform and infrastructure. Cazena quickly provides a production-ready data lake environment, delivered ready for data ingestion, storage, and analytics. Cazena's integrated turnkey service combines data ingestion, storage and compute, analytic tool integration, and fully managed DevOps and customer support into one solution. No setup or development is required for production or ongoing operations. Cazena brings a SaaS-level experience to on-premises data lakes deployed on Lenovo hyperconverged appliances.

The intended audience for this reference architecture is IT professionals, cloud and big data architects and engineers, technical architects, sales engineers, and consultants who plan, design, and implement the big data solution with Lenovo hardware. It is assumed that you are familiar with Hadoop components and capabilities. For more information about all of the above components, see the “Resources” section.


2 Business problem and business value

Enterprises are incorporating large data lakes into their IT architecture to store big data. The expectation is that ready access to all the available data can lead to higher-quality insights obtained through analytics, which in turn drive better business decisions.

Business Problem

There are two challenges facing enterprises wanting to deliver enterprise data lakes. The first challenge is establishing an easy-to-deploy data storage and processing infrastructure that can begin to deliver value very quickly. Spending months hiring dozens of skilled engineers to piece together a data management environment is costly and often leads to frustration from unrealized goals. The ability to quickly provision a big data processing infrastructure is just the first step, however.

The second challenge is the ongoing administration, management, monitoring and DevOps of the data lake. Data lake clusters must support running different types of analytics workloads. The ongoing DevOps is costly both in terms of time and manpower, requiring the attention of key technical employees whose time is better spent on delivering business outcomes for the enterprise.

The ThinkAgile HX virtualized platform, coupled with Cloudera Enterprise running on Cazena’s SaaS data lake, provides the simplicity and agility to make this happen in your datacenter. The result is a private cloud, SaaS-like data lake experience, not only for provisioning the data processing infrastructure but also for the ongoing DevOps of the infrastructure, monitoring, and business analytic tooling. This private cloud capability also readies the infrastructure for access to the public cloud, when appropriate, to create hybrid and multi-cloud environments.

Business Value

The combined solution of running Cloudera with Cazena on the Lenovo ThinkAgile HX platform delivers a private cloud SaaS experience for on-premises data lakes in a fraction of the time and cost of do-it-yourself (DIY) solutions.

2.2.1 Time to Production

Customers can be in production with a Cazena private cloud data lake within just a few weeks. By comparison, a typical DIY deployment can take 9 to 12 months at no less than twice the cost.


Figure 1. Cazena Time to Production

2.2.2 Deploy ML & Analytics Quickly with the AppCloud

The Cazena AppCloud helps teams deploy a wide range of machine learning (ML) and analytics applications in the private cloud with instant access to enterprise data on the Cazena platform. Using the AppCloud delivers many performance, operational and security benefits and accelerates time to results. Teams no longer have to wait for infrastructure before deploying new capabilities.

2.2.3 Plug & Play Enterprise Deployment

Cazena's Gateway software securely connects Cazena to existing enterprise data sources, analytics tools and management processes. Cazena offers plug & play deployment, reducing time to value and increasing agility and productivity.

2.2.4 Enabling Multi-tenancy

There are many reasons to virtualize infrastructure in the data center, a trend that has been ongoing for the past ten years or more. Most enterprise applications, including big data, have specific and intermittent utilization peaks, which often leave the hardware idle for extended periods. Multi-tenancy brings the high hardware utilization of the cloud to your datacenter, along with the agility to better serve internal lines of business.
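The utilization argument above can be illustrated with a small calculation. The sketch below is not taken from the reference architecture: the two tenants, their hourly demand profiles, and the node counts are hypothetical values chosen only to show why workloads with non-overlapping peaks use shared hardware more efficiently than dedicated clusters sized for each peak.

```python
# Hypothetical hourly demand (in nodes) for two tenants over one day.
# These figures are illustrative assumptions, not measurements from this document.
batch_etl = [8 if hour < 6 else 1 for hour in range(24)]             # peaks overnight
interactive_bi = [6 if 9 <= hour < 17 else 1 for hour in range(24)]  # peaks in business hours

combined = [a + b for a, b in zip(batch_etl, interactive_bi)]


def utilization(demand, capacity):
    """Average fraction of the provisioned capacity that is actually in use."""
    return sum(demand) / (capacity * len(demand))


# Dedicated clusters: each cluster is sized for its own tenant's peak.
dedicated_capacity = max(batch_etl) + max(interactive_bi)
dedicated_util = utilization(combined, dedicated_capacity)

# Shared multi-tenant cluster: sized for the peak of the combined demand.
shared_capacity = max(combined)
shared_util = utilization(combined, shared_capacity)

print(f"dedicated: {dedicated_capacity} nodes, {dedicated_util:.0%} utilized")
print(f"shared:    {shared_capacity} nodes, {shared_util:.0%} utilized")
# prints:
# dedicated: 14 nodes, 39% utilized
# shared:    9 nodes, 60% utilized
```

With these assumed profiles the shared cluster needs fewer nodes because the two peaks never coincide; if the peaks overlapped fully, the two approaches would require the same capacity.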

2.2.5 DevOps 24x7 Production Operations

Cazena’s SaaS solutions are automated and fully managed, so you can use the private cloud data lake with confidence. Cazena is monitored and supported 24x7 by DevOps and security gurus. You receive the best price vs. performance for data and analytics workloads. You won’t need to assemble, integrate, configure, or patch anything.


2.2.6 Secure data lake with a highly-differentiated security model

Security is fundamental to Cazena. Each Cazena customer receives their own private data cloud. Cazena's data lake environments are secure (Kerberized, etc.) and encrypt all data in motion and at rest by default. Everything within the data lake is managed, monitored and logged to conform to strict compliance standards.

Cazena includes Gateway Software, which manages access and facilitates data moves, adding security, encryption and compression automatically. Load data into Cazena from any connected enterprise or data source – and keep it in one secure environment for team access. The Gateway is powerful enough to run in enterprise datacenters – and regularly passes muster with tough network and security guidelines.

2.2.7 Best of Breed Data Lake - Optimized for ThinkAgile HX

Cazena 'as a service' solutions include databases and components such as Spark, Hadoop and MPP SQL from Cloudera. Cloudera deployed on Lenovo ThinkAgile HX with Lenovo networking components provides superior performance, reliability, and scalability. The reference architecture supports entry through high-end configurations and the ability to easily scale as the use of big data grows.

Leverage Cazena's platform automation and performance; don't waste time trying to "do it yourself". Cazena costs 50% - 80% less than DIY solutions. It's fully managed, so you won't need any new cloud DevOps skills to set up or operate the platform. And your big data cluster is up and running in just a few weeks.


3 Requirements

The functional and non-functional requirements for this reference architecture are described in this section.

Functional Requirements

A big data solution supports the following key functional requirements:

● Ability to handle various workloads, including batch and real-time analytics
● Industry-standard interfaces so that applications can work with Cloudera
● Ability to handle large volumes of data of various data types
● Various client interfaces
● Ability to quickly repurpose private clusters in a cloud-like way

Non-functional Requirements

Customers require their big data solution to be easy, dependable, fast, and secure. The following non-functional requirements are key:

● Easy:
    o Ease of development
    o Easy management at scale with automated DevOps
    o Advanced job management
    o Multi-tenancy
    o Easy to access data by various user types
● Dependable:
    o Data protection with snapshot and mirroring
    o Automated self-healing
    o Insight into software/hardware health and issues
    o High availability (HA) and business continuity
● Fast:
    o Superior performance
    o Scalability
● Secure and governed:
    o Strong authentication and authorization
    o Kerberos support
    o Data confidentiality and integrity


4 Architectural Overview

Cazena's solution brings the SaaS data lake experience to the on-premises Lenovo ThinkAgile HX hyperconverged appliance. Cloudera Enterprise is delivered on top of the Cazena SaaS solution. Cloudera delivers an integrated suite of analytic engines ranging from stream and batch data processing to data warehousing, operational database, and machine learning. This combined solution empowers enterprises to collect, store and analyze any data in the private cloud, without the need for customer DevOps or administration resources. The architecture is composed of the following three main components:

1. Cazena

2. Cloudera Enterprise

3. Lenovo ThinkAgile HX - Nutanix Appliance

Figure 2. Cazena Private Data Lake Architecture


Cazena

Cazena Software as a Service (SaaS) provides Cloudera as a Service on the Lenovo ThinkAgile HX Appliance. This SaaS helps enterprises radically simplify data science, data lakes, ML/AI, analytics, BI and data engineering. Cazena's private cloud solutions include the complete Cloudera stack, security, analytics and other best-of-breed software. Everything is integrated, fully managed and monitored, with automation and optimization to ensure the best performance at the lowest price.

Cazena SaaS contains these functions:

• Cazena AppCloud: A curated single-tenant environment where new apps or tools can be landed, pre-configured and certified to securely access data or compute.
• Cazena DevOps: Includes all production operations such as white-glove support and resolution, 24x7 health-monitoring, alerting, upgrades, and patching with validation for Cloudera Enterprise, plus the ability to add and certify new tools, libraries and analytic functions.
• Cazena Security: Includes capabilities such as end-to-end authentication, authorization, encryption and auditing, along with ongoing security operations.
• Cazena Ingestion: Uses a gateway concept as a single point of hybrid integration where multiple data source connections, tool integrations, and networking & VPN complexities are handled under the covers by Cazena.

Cloudera Enterprise

The Cloudera Enterprise solution contains the following components:

• Analytic SQL: Apache Impala. Impala is the industry’s leading massively parallel processing (MPP) SQL query engine that runs natively in Hadoop.
• Search Engine: Cloudera Search. Cloudera Search is Apache Solr that is integrated with Cloudera Enterprise.
• NoSQL: HBase. A scalable, distributed column-oriented datastore. HBase provides real-time read/write random access to very large datasets hosted on HDFS.
• Stream Processing: Apache Spark. Apache Spark is an open source, parallel data processing framework that complements Hadoop to make it easy to develop fast, unified big data applications that combine batch, streaming, and interactive analytics on all your data.
• Machine Learning: Spark MLlib. MLlib is the API that implements common machine learning algorithms. MLlib is usable in Java, Scala, Python and R.
• Cloudera Manager: Cloudera Manager is the industry’s first and most sophisticated management application for Hadoop and the enterprise data hub. Cloudera Manager gives you a cluster-wide, real-time view of nodes and services running; provides a single, central console to enact configuration changes across your cluster; and incorporates a full range of reporting and diagnostic tools to help you optimize performance and utilization.
• Cloudera Kafka: Kafka delivers a publish/subscribe messaging system with high throughput, built-in partitioning, replication, and fault tolerance.

For more information, see the Cloudera website: cloudera.com/content/cloudera/en/products-and-services/product-comparison.html
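To make the publish/subscribe and partitioning ideas in the Kafka entry above more concrete, here is a toy in-memory model. It is only a sketch of the concept and does not use or imitate the Kafka client API: the `PartitionedTopic` class, its method names, and the CRC32-based key hashing are all assumptions made for illustration, and replication and fault tolerance are not modeled.

```python
import zlib
from collections import defaultdict


class PartitionedTopic:
    """Toy model of a partitioned publish/subscribe log (hypothetical, not the Kafka API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # Each consumer group keeps its own read offset for every partition.
        self.offsets = defaultdict(lambda: [0] * num_partitions)

    def publish(self, key, value):
        # Records with the same key always map to the same partition,
        # which preserves the order of records for that key.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition

    def poll(self, group):
        # Return every record this group has not yet read, then advance its offsets.
        records = []
        for index, log in enumerate(self.partitions):
            start = self.offsets[group][index]
            records.extend(log[start:])
            self.offsets[group][index] = len(log)
        return records


topic = PartitionedTopic(num_partitions=3)
for reading in range(4):
    topic.publish("sensor-a", reading)
topic.publish("sensor-b", 99)

# Two independent subscriber groups each receive the full stream.
analytics_view = topic.poll("analytics")
archive_view = topic.poll("archive")
```

Because each group tracks its own offsets, publishing is decoupled from consumption: a second `topic.poll("analytics")` returns an empty list until new records arrive, while other groups are unaffected.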

Lenovo ThinkAgile HX Appliance

The big data cluster solution described in this document is deployed on the ThinkAgile HX series of Nutanix pre-loaded appliance nodes. ThinkAgile HX Series consolidates compute, storage, and virtualization software into a resource pool, easily managed in scale-out clusters through a single interface. The series ships from Lenovo as fully integrated and validated building blocks, and delivers extreme reliability, security, scalability, and simplified management. As a result, you can deploy applications faster, without the hassle, while dramatically reducing your total cost of ownership.

Figure 3. Nutanix Architecture

Use Nutanix Prism for one-click planning, provisioning, insights, and firmware updates to achieve faster, simpler IT operations. A Lenovo-Nutanix innovation, the ThinkAgile XClarity Integrator for Nutanix, available from the Nutanix Calm marketplace, can be used in concert with Prism for comprehensive management and reduced manual entry. Leverage the time and cost-savings of ThinkAgile HX to enable your IT to focus on innovative solutions for your business.

Lenovo ThinkAgile Advantage provides end-to-end life cycle management for a seamless, simple customer experience for planning, deployment, and maintenance, including a 24/7 Single Point of Support. ThinkAgile HX Series also includes the option of shipping from Lenovo’s factory in a fully assembled rack, along with any combination of ThinkSystem servers, storage, and networking, which facilitates quick deployment of solutions based on validated HX reference architectures.


5 Component Model

Cazena provides features and capabilities that meet the functional and non-functional requirements of customers. It supports mission-critical and real-time big data analytics across different industries, such as financial services, retail, media, healthcare, manufacturing, telecommunications, travel and government organizations.

Cazena Components

The components that make up Cazena Software as a Service are described below.

Figure 4. Cazena Key Components

5.1.1 Cazena AppCloud

To plug in new third-party tools, the Cazena platform provides a curated single-tenant environment called AppCloud, where new apps or tools can be landed, pre-configured and certified to securely access data or compute. Ultimately the private cloud platform is a living ecosystem that grows continually: new users drive demand for new tools and platforms, whether a variety of analytic/ML/BI tools, notebooks, or environments. Supporting this growth with SLA and compliance is critical. The AppCloud combined with the Cazena stack provides an optimized end-to-end experience.


5.1.2 Data Ingestion

For easy deployment, Cazena uses the concept of “gateways” as a single point of hybrid integration; think of a gateway as a socket where you plug in tools and data. All the complexity of networking/VPN, data source connections and tool integrations is hidden from the end user, regardless of whether data and analytic tools are on-premises or in the cloud. This makes it easy for analysts to start their work immediately without changing their existing process.

5.1.3 Cazena Security

Security is an ongoing process and a lot of hard work. Is the private cloud truly private where all cluster access is authenticated? Is all access identity and role driven? Is all data encrypted end-to-end including ingest and tool access? Are encryption keys managed appropriately? Are you monitoring for any intrusions at the data or configuration level? Is all logging centralized and actionable? Cazena’s Big Data as a Service includes capabilities like end-to-end authentication, authorization, encryption and auditing, along with ongoing security operations.

5.1.4 The Cloudera Data Lake

Cazena offers Cloudera as a Service on the Lenovo ThinkAgile HX Appliance. Cloudera Enterprise helps enterprises radically simplify data science, data lakes, ML/AI, analytics, BI and data engineering. Cazena's private cloud solutions include the complete Cloudera stack, security, analytics and other best-of-breed software. Everything is integrated, fully-managed and monitored, with automation and optimization to ensure the best performance at the lowest price.

5.1.5 Cazena Operations (DevOps)

Cazena’s Big Data as a Service solutions include all production operations, such as white-glove support and resolution, 24x7 health-monitoring, alerting, upgrades, patching with validation, the ability to add new tools and libraries, and certification with newer analytic tools. In other words, Cazena automation and expertise remove most of the production operating burden, allowing your DevOps to scale easily.

ThinkAgile HX - Nutanix

Cazena leverages the Nutanix system to operate and scale Cloudera Enterprise in conjunction with other hosted services. Within the Cazena offering, Nutanix is used as the single scalable platform for all deployments. Existing sources and platforms can send data to the Cloudera cluster on Nutanix over the network. The figure below shows a high-level view of the Cazena solution running Cloudera on Nutanix.

Figure 5. Cazena’s Fully Managed Cloudera CDH on Nutanix

Cazena uses Nutanix’s modular scale-out approach to select an appropriate initial deployment size and grow in granular data and desktop increments. This strategy removes the hurdle of a large up-front infrastructure purchase that a customer may need many months or years to grow into, ensuring a faster time-to-value for a Cloudera implementation.

As a virtual private data cloud, the Cazena solution leverages the Nutanix Enterprise Cloud Platform to enable many new abilities for data processing and management:

• Scale-out with workload demands: The Nutanix cluster can scale out seamlessly to suit project-specific Cloudera workload requirements. Adding and removing nodes from a Nutanix cluster are one-click operations, with the cluster automatically rebalancing compute and storage across available nodes.

• Built-in DevOps: Big data scientists demand performance, reliability, and a flexible scale model, but they don’t have the time or skills to provide ongoing DevOps for the data cloud. Cazena delivers this by automating common DevOps operations as part of the appliance so that data scientists can focus on their big data jobs.

• Batch scheduling and stacked workloads: Allow all workloads and applications, such as Hadoop, virtual desktops, and servers, to coexist. Schedule jobs to run during off-peak hours to take advantage of idle night and weekend hours that would otherwise go to waste. Nutanix also allows you to bypass the flash tier for sequential workloads, eliminating the time it takes to rewarm the cache for mixed workloads.

• New Hadoop economics: Virtualizing Hadoop reduces complexity and ensures success for sophisticated projects with a scale-out, grow-as-you-go model, a perfect fit for Big Data projects.

• Unified data platform: Run multiple data processing platforms along with Hadoop YARN on a single unified data platform, the Acropolis DSF.

• Analytic high-density engine: With the Nutanix solution, you can start small and scale, letting you accurately match supply with demand and minimize the up-front capital expenditure.

• Automatic leveling: Nutanix spreads data evenly across the cluster, ensuring that local drives don't fill up and cause an outage when space is available elsewhere on the network.

5.2.1 Nutanix Prism

Nutanix Prism allows administrators to manage the virtual environment. Prism is a part of the Nutanix software preloaded on the appliances and offers the following features:

• Single point of control:
o Accelerates enterprise-wide deployment
o Manages capacity centrally
o Adds nodes in minutes
o Supports non-disruptive software upgrades with zero downtime
o Integrates with REST APIs and PowerShell

• Monitoring and alerting:
o Tracks infrastructure utilization (storage, processor, memory)
o Centrally monitors multiple clusters across multiple sites
o Monitors per virtual machine (VM) performance and resource usage
o Checks system health
o Generates alerts and notifications

5.2.2 Controller VM (CVM)

The Nutanix Controller VM (CVM) is the key to hyper-converged capability, and each node in a cluster has its own instance. Table 1 shows the main components of the CVM.

Table 1. CVM Components

The CVM works as the interface between storage and hypervisor to manage all I/O operations for the hypervisor and user VMs running on the nodes as shown below.

Table 2. CVM Interaction with hypervisor and user VMs

The CVM virtualizes all the local storage attached to each node in a cluster and presents it as a centralized storage array using the Acropolis Distributed File System (ADFS), formerly called the Nutanix Distributed File System. All I/O operations are handled locally to provide the highest performance. See the Lenovo ThinkAgile HX links in the Reference section on page 43.

6 Operational Model

This section describes the operational model for the Cazena reference architecture. To show the operational model for customer environments of different sizes, three models, or cluster designs, are provided to support different amounts of data. Throughout this document, these models are referred to as starter rack, half rack, and full rack.

A Cazena deployment consists of cluster nodes, networking equipment, power distribution units, and racks. The predefined configurations can be implemented as-is or modified based on specific customer requirements, such as lower cost, improved performance, and increased reliability. Key workload requirements, such as the data growth rate, sizes of datasets, and data ingest patterns help in determining the proper configuration for a specific deployment.

Hardware Description

This reference architecture uses Lenovo ThinkAgile HX3320 and HX5520 appliances, pre-loaded with Nutanix software, along with Lenovo ThinkSystem NE0152T and RackSwitch G8272 top-of-rack switches.

6.1.1 ThinkAgile HX Key features

The ThinkAgile HX Series appliances offer the following key features:

• Factory-integrated, pre-configured ready-to-go appliances built on proven and reliable Lenovo ThinkSystem servers that provide compute power for a variety of workloads and applications and powered by industry’s most feature-rich hyperconverged infrastructure software from Nutanix.

• Provide quick and convenient path to implement a hyperconverged solution powered by Nutanix with "one stop shop" and a single point of contact provided by Lenovo for purchasing, deploying, and supporting the solution.

• Meet various workload demands with cost-efficient hybrid or performance-optimized all-flash storage configurations.

• Deliver fully validated and integrated hardware and firmware that is certified with Nutanix software.

• Include Lenovo ThinkAgile Advantage Single Point of Support for quick 24/7 problem reporting and resolution.

• Offer Lenovo deployment services to get customers up and running quickly.

The Nutanix software running on the HX Series appliances delivers the following key features:

• A natively integrated solution for data protection and continuous availability at VM granularity that gives administrators an affordable range of options to meet the recovery point objectives (RPO) and recovery time objectives (RTO) for different applications.

• A fault-resistant platform with no single point of failure and no bottlenecks, built on a shared-nothing architecture in which all data, metadata, and services are distributed to all nodes within the cluster, and designed to detect, isolate, and recover from failures anywhere in the system.

• An intuitive user-centric management experience to simplify every aspect of the IT infrastructure lifecycle and provide a single pane of glass to monitor and control Nutanix clusters, with simplified workflows and rich automation for common administrative tasks.

• Powerful security features, such as two-factor authentication and data-at-rest encryption, with a security development lifecycle that is integrated into product development to help customers meet the most stringent security requirements.

6.1.2 Lenovo ThinkAgile HX5520 Appliance

Lenovo ThinkAgile HX Series appliances are designed to help you simplify IT infrastructure, reduce costs, and accelerate time to value. These hyperconverged appliances from Lenovo combine industry-leading hyperconvergence software from Nutanix with Lenovo enterprise platforms that feature the Intel Xeon Processor Scalable family.

The ThinkAgile HX5520 is a 2U rack-mount appliance that supports two processors, up to 3 TB of 2933 MHz TruDDR4 memory, 12x or 14x SAS/SATA SFF hot-swap drive bays with an extensive choice of SATA SSDs and HDDs, and flexible network connectivity options with 1/10 GbE RJ-45, 10 GbE SFP+, and 10/25 GbE SFP28 ports.

Several common uses for the ThinkAgile HX Series appliances that are optimized for storage-heavy workloads include file servers, on-cluster backups, and big data.

Figure 6. ThinkAgile HX5520 Appliance

For more information, see the Product Guide at this link: https://lenovopress.com/lp1124-thinkagile-hx5520-appliance-gen2

6.1.3 Lenovo ThinkAgile HX3320 Appliance

The ThinkAgile HX3320 is a 1U rack-mount appliance that supports two processors, up to 3 TB of 2933 MHz TruDDR4 memory, 10x or 12x SAS/SATA SFF hot-swap drive bays with an extensive choice of SAS/SATA SSDs and HDDs, and flexible network connectivity options with 1/10 GbE RJ-45, 10 GbE SFP+, and 10/25 GbE SFP28 ports.

Several common uses for the ThinkAgile HX Series appliances for compute-heavy workloads include virtual desktop infrastructure (VDI), server virtualization, private/hybrid clouds, enterprise applications, light databases, and remote office and branch office workloads.

Figure 7. Lenovo ThinkAgile HX3320

For more information, see the Product Guide at this link: https://lenovopress.com/lp1121-thinkagile-hx3320-appliance-gen2

6.1.4 Lenovo ThinkSystem NE0152T

The Lenovo ThinkSystem NE0152T RackSwitch is a 1U rack-mount Gigabit Ethernet switch with 10 GbE uplinks. It delivers line-rate performance with a feature-rich design that supports virtualization, high availability, and enterprise-class Layer 2 and Layer 3 functionality in a cloud management environment.

The NE0152T RackSwitch has 48x RJ-45 Gigabit Ethernet fixed ports and 4x SFP+ ports that support 1 GbE and 10 GbE optical transceivers, active optical cables (AOCs), and direct attach copper (DAC) cables. Featuring a small, 1U footprint, this switch is designed for management network and access-layer deployments in data center infrastructures.

Figure 8. ThinkSystem NE0152T

For more information, see the Product Guide at this link: https://lenovopress.com/lp0965-lenovo-thinksystem-ne0152t-gigabit-ethernet-switch

6.1.5 Lenovo RackSwitch G8272

Designed with top performance in mind, Lenovo RackSwitch G8272 is ideal for today’s big data, cloud and optimized workloads. The G8272 switch offers up to 72 10Gb SFP+ ports in a 1U form factor and is expandable with six 40Gb QSFP+ ports. It is an enterprise-class and full-featured data center switch that delivers line-rate, high-bandwidth switching, filtering and traffic queuing without delaying data. Large data center grade buffers keep traffic moving. Redundant power and fans and numerous HA features equip the switches for business-sensitive traffic.

The G8272 switch (as shown in Figure 9) is ideal for latency-sensitive applications. It supports Lenovo Virtual Fabric to help clients reduce the number of I/O adapters to a single dual-port 10Gb adapter, which helps reduce cost and complexity. The G8272 switch supports the newest protocols, including Data Center Bridging/Converged Enhanced Ethernet (DCB/CEE) for support of FCoE, iSCSI, and NAS.

Figure 9. Lenovo RackSwitch G8272

The enterprise-level Lenovo RackSwitch G8272 has the following characteristics:

• 48x SFP+ 10GbE ports plus 6x QSFP+ 40GbE ports
• Support for up to 72x 10Gb connections using break-out cables
• 1.44 Tbps non-blocking throughput with low latency (~ 600 ns)
• OpenFlow enabled, which allows user-controlled virtual networks to be created easily
• Virtual LAG and LACP for dual switch redundancy

For more information, see the Lenovo RackSwitch G8272 Product Guide: https://lenovopress.com/tips1267-lenovo-rackswitch-g8272

Cluster HX Node Configurations

The Cazena reference architecture is implemented on a set of virtual machines (VMs) that make up a cluster, which includes Master nodes, Data nodes, Cazena orchestration and identity services, and the Cazena AppCloud nodes. This provides flexibility to customize and reconfigure a Cazena cluster as workload requirements change. Multiple VMs are created on the Lenovo HX3320 and HX5520 appliance nodes according to the tables in this section.

The number of HX3320 and HX5520 appliance nodes can be scaled out as needed to grow the cluster as customer data capacity and analytics capacity increase. Pre-defined rack sizes are shown in section 6.9 on page 29.

Various HX3320 and HX5520 node configurations for processor core count, memory, and storage are provided as T-shirt sizes Small, Medium, and Large.

6.2.1 Node T-shirt Sizes

A wide range of HX node configurations is possible for processor type and core count, memory size, and storage components. This reference architecture presents three sizes recommended for certain big data Hadoop workloads. These three sizes may be customized as required for a particular workload type.

The HX5520 T-shirt sizes are shown below in Table 3.

Table 3. HX5520 Node Configuration T-shirt Sizes

T-Shirt Size | Workload type | Node configuration
Small | Minimum supported configuration | 2x Intel® Xeon® processors: 6252 Gold, 24-core, 2.1GHz; 192 GB: 12x 32GB 2933MHz RDIMM; 10Gb Data Network; 128GB M.2 SSD RAID1 OS storage
Medium | Balanced workloads including MapReduce | 2x Intel® Xeon® processors: 6252 Gold, 24-core, 2.1GHz; 384 GB: 12x 32GB 2933MHz RDIMM; 10Gb or 20Gb Data Network; 128GB M.2 SSD RAID1 OS storage
Large | In-memory analytics such as Apache Spark | 2x Intel® Xeon® processors: 8268 Platinum, 24-core, 2.9GHz; 768 GB: 12x 32GB 2933MHz RDIMM; 25Gb Data Network; 128GB M.2 SSD RAID1 OS storage

Table 4. HX3320 Node T-shirt Sizes

T-Shirt Size | Workload type | Node configuration
Small | Minimum supported configuration | 2x Intel® Xeon® processors: 6252 Gold, 24-core, 2.1GHz; 384 GB: 12x 32GB 2933MHz RDIMM; 10Gb Data Network; 128GB M.2 SSD RAID1 OS storage
Medium | Balanced workloads including MapReduce | 2x Intel® Xeon® processors: 6252 Gold, 24-core, 2.1GHz; 384 GB: 12x 32GB 2933MHz RDIMM; 10Gb or 25Gb Data Network; 128GB M.2 SSD RAID1 OS storage
Large | In-memory analytics such as Apache Spark | 2x Intel® Xeon® processors: 6252 Gold, 24-core, 2.1GHz; 384 GB: 12x 32GB 2933MHz RDIMM; 25Gb Data Network; 128GB M.2 SSD RAID1 OS storage
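As an illustration only, the HX5520 T-shirt sizes in Table 3 can be encoded as a small lookup structure for early sizing discussions. The sketch below copies the processor model and total memory from Table 3; the names `HX5520_SIZES` and `size_for_memory` are hypothetical and are not part of any Lenovo or Cazena tooling, and real sizing would also weigh storage and network requirements.

```python
# Hypothetical encoding of Table 3 (HX5520 T-shirt sizes).
# Processor model and total memory are copied from the table; nothing else is implied.
HX5520_SIZES = {
    "small": {
        "workload": "Minimum supported configuration",
        "processor": "Xeon Gold 6252",
        "memory_gb": 192,
    },
    "medium": {
        "workload": "Balanced workloads including MapReduce",
        "processor": "Xeon Gold 6252",
        "memory_gb": 384,
    },
    "large": {
        "workload": "In-memory analytics such as Apache Spark",
        "processor": "Xeon Platinum 8268",
        "memory_gb": 768,
    },
}


def size_for_memory(required_gb: int) -> str:
    """Return the smallest T-shirt size whose node memory covers required_gb."""
    for name in ("small", "medium", "large"):
        if HX5520_SIZES[name]["memory_gb"] >= required_gb:
            return name
    raise ValueError(f"No predefined size offers {required_gb} GB per node")


# Example: a workload needing about 300 GB per node lands on the medium size.
print(size_for_memory(300), "-", HX5520_SIZES[size_for_memory(300)]["workload"])
```

Because the three sizes may be customized, a lookup like this is only a starting point for a sizing conversation.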

6.2.2 Node types and VM organization

Below are the HX node types and VM organization used in this reference architecture, which reflect the Starter rack pre-defined production configuration shown in Section 6.9 on page 29.

Table 5. Cloudera Services, HX3320 nodes

Component | Node configuration
ThinkAgile HX3320 | 2x Intel® Xeon® processors: 6252 Gold, 24-core, 2.1GHz; 384GB: 12x 32GB 2933MHz RDIMM; 128GB M.2 SSD RAID1 OS storage
Cloudera Master or Utilities * | VM: 8 CPU/16 vCores, 144GB memory, 2TB storage
Cloudera Services | VM: 8 CPU/16 vCores, 144GB memory, 2TB storage
Cazena Orchestration | VM: 2 CPU/4 vCores, 64GB memory, 2TB storage
Nutanix Control VM | VM: 6 CPU/12 vCores, 32GB memory, 2TB storage

* Cloudera NameNode or Cloudera Manager services

Table 6. Cazena AppCloud, HX3320 nodes

Component | Node configuration
ThinkAgile HX3320 | 2x Intel® Xeon® processors: 6252 Gold, 24-core, 2.1GHz; 384GB: 12x 32GB 2933MHz RDIMM; 128GB M.2 SSD RAID1 OS storage
AppCloud Node | VM: 6 CPU/12 vCores, 117GB memory, 2TB storage
AppCloud Node | VM: 6 CPU/12 vCores, 117GB memory, 2TB storage
AppCloud Node | VM: 6 CPU/12 vCores, 117GB memory, 2TB storage
Nutanix Control VM | VM: 6 CPU/12 vCores, 32GB memory, 2TB storage

* This HX3320 count may vary from 2 to 6 nodes as the cluster scales out

Table 7. Cloudera Data Node

Component | Node configuration
ThinkAgile HX5520 | 2x Intel® Xeon® processors: 6252 Gold, 24-core, 2.1GHz; 384GB: 12x 32GB 2933MHz RDIMM; 128GB M.2 SSD RAID1 OS storage
Cloudera Data Node | VM: 8 CPU/16 vCores, 176GB memory, 27.84TB raw storage
Cloudera Data Node | VM: 8 CPU/16 vCores, 176GB memory, 27.84TB storage
Nutanix Control VM | VM: 6 CPU/12 vCores, 32GB memory, 32TB raw storage

* Scales from 3 to 16 or more data nodes per rack

Table 8. System Management Node

Component | Node configuration
ThinkSystem SR630 | 1x Intel® Xeon® processor: 3204 Bronze, 1.9GHz; 16 GB: 1x 16GB 2933MHz RDIMM; 128GB M.2 SSD RAID1 OS storage

* May combine additional functions for Gateway and Edge nodes

The memory allocated to the CVM may change based on the total capacity of the Acropolis Distributed File System on an individual node. For example, 36GB is allocated for >=80TB node storage capacity. The latest ThinkAgile HX requirements should be checked when sizing a new Cazena/ThinkAgile HX cluster design.
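The VM layouts in Tables 5 through 7 can be sanity-checked against the physical node (2x 24-core processors, 384GB memory) with a few lines of arithmetic. The sketch below is a minimal illustration, assuming that one vCore is budgeted against one physical core with no oversubscription; that assumption, the class `VM`, and the helper names are hypothetical and not taken from Cazena or Nutanix documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    vcores: int
    memory_gb: int


# Physical capacity of each HX node in Tables 5-7: 2x 24-core processors, 384 GB memory.
NODE_CORES = 2 * 24
NODE_MEMORY_GB = 384

# Nutanix Control VM as listed in Tables 5-7 (32 GB; see the CVM memory note above).
CVM = VM("Nutanix Control VM", 12, 32)

NODE_LAYOUTS = {
    "Cloudera services (Table 5)": [
        VM("Cloudera Master or Utilities", 16, 144),
        VM("Cloudera Services", 16, 144),
        VM("Cazena Orchestration", 4, 64),
        CVM,
    ],
    "Cazena AppCloud (Table 6)": [VM("AppCloud Node", 12, 117)] * 3 + [CVM],
    "Cloudera data node (Table 7)": [VM("Cloudera Data Node", 16, 176)] * 2 + [CVM],
}


def totals(vms):
    """Return (total vCores, total memory in GB) for a list of VMs."""
    return sum(vm.vcores for vm in vms), sum(vm.memory_gb for vm in vms)


def fits(vms):
    """True if the VMs fit the node without oversubscribing cores or memory."""
    vcores, memory_gb = totals(vms)
    return vcores <= NODE_CORES and memory_gb <= NODE_MEMORY_GB


for label, vms in NODE_LAYOUTS.items():
    vcores, memory_gb = totals(vms)
    print(f"{label}: {vcores}/{NODE_CORES} vCores, "
          f"{memory_gb}/{NODE_MEMORY_GB} GB, fits={fits(vms)}")
```

Under these assumptions the layouts use nearly all of each node's memory, which is why a larger CVM memory allocation on high-capacity nodes would need to be offset elsewhere.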

6.2.3 System Management node

Known as Edge, System Management or Gateway nodes, these are installed on the cluster data network but do not run Cazena software directly. Their purpose is to connect the Cazena cluster to an outside network for remote administration access, for ingesting data from an outside source, or for a dedicated node running end user application software which can access the cluster.

A single system management/gateway node is configured in this reference architecture as a minimal node configured for remote administration of the Linux OS and for hardware maintenance. Based on the particular requirements of the cluster for high speed ingesting of data and edge node applications, the CPU, memory, storage, and network capability of this server can be increased.

Cluster Software Stack

The following software components were installed for this Cazena reference architecture:

Table 9. Cluster Software Stack

Component | Version
Cazena SaaS | 19.4
HX3320 and HX5520 appliances | ThinkAgile Best recipe 4.0; AcropolisOS (AOS) v5.10.3.2; Foundations v4.3.4
VMware ESXi | 6.7.0 update 02-8935087-LNV-20180707
Appliance firmware levels | XCC v2.50; UEFI v2.11
Cloudera Enterprise | 5.15.1
Cloudera Manager | 5.15.1
Cloudera Navigator | 2.14.1

Cazena Orchestration

Users of the Cazena ThinkAgile HX solution do not need to orchestrate the provisioning of the private data lake (Cloudera, AppCloud, etc.). Cazena properly sizes and provisions the private data cloud as part of the Cazena service.

However, it is often important to understand the underlying configuration. Within the joint offering, a virtual node is identified as the CzServices node. This node runs the Cazena Orchestrator (responsible for Cloudera CDH cluster create, delete, stop, start and health monitoring services), the Cazena Portal (the user interface), Consul (service registration and monitoring), the Data Mover, and the Dataset Manager.

Additionally, there is an identity node called the “IPA Node” that runs a “Salt Master”, configuration management software that manages the virtual machines and the software installed on them, as well as FreeIPA, open source software for identity management and DNS resolution.

Cazena AppCloud

A portion of the private data cloud is reserved for Cazena’s AppCloud. The AppCloud is a set of one or more nodes dedicated to running any user-specific application software. Typically, these nodes host business intelligence (BI) applications, customer-specific data lake/Hadoop tools, and data ingestion and workflow tooling applications.

Typical products might include Tableau, RStudio Connect, Streamsets, DataRobot, Anaconda, and many others.

In general, customers do not directly access nodes in the Cazena cluster (such as via SSH, remote desktop, etc.) since Cazena provides this service. However, AppCloud nodes are customer-specific and access is possible based on customer needs, such as for customer management of certain software installed on those nodes.

Cloudera Service Role Layouts

The placement of Cloudera software services is an important part of the software installation. The Cloudera installation instructions provide a recommended distribution of services across the Master and Data nodes. Below is the recommended service layout with the nodes used in this reference architecture. The services may be relocated as required to meet a specific cluster node configuration.

Table 10. Service Layout Matrix for High Availability

Node | Master VM 1 | Master VM 2 | Master VM 3 | Master VM 4 | DataNode VMs
Service/Roles | NameNode, JournalNode | NameNode, JournalNode | JournalNode | RStudio Server | DataNode

ZooKeeper | ZooKeeper | ZooKeeper

ResourceManager | ResourceManager

Hive MetaStore, WebHCat, HiveServer2

Hive Gateway | NodeManager

JobHistory Server

Cloudera Manager and CM management services

Sentry Gateway

impalad

SparkHistory Server

Hue | Spark Gateway

Oozie | Sqoop 1 Gateway

Impala Statestore

Yarn Gateway

Impala Catalog Server

System Management

Systems management of a cluster, including the operating system, security, Hadoop and Spark applications, and resource management, is provided by Cazena. This deployment offers a fully-managed private SaaS solution along with IT HelpDesk support for the private cloud.

Hardware management uses the Lenovo XClarity™ Administrator, which is a centralized resource management solution that reduces complexity, speeds up response and enhances the availability of Lenovo server systems and solutions. XClarity™ is used to install the OS onto new worker nodes, update firmware across the cluster nodes, record hardware alerts, and report when repair actions are needed.

Figure 10 shows the Lenovo XClarity™ Administrator interface in which servers, storage, switches and other rack components are managed and status is shown on the dashboard. Lenovo XClarity™ Administrator is a virtual appliance that is quickly imported into a server virtualized environment.

Figure 10. XClarity™ Administrator interface

Networking

The reference architecture specifies two networks: a high-speed data network and a management network, which together support the Lenovo HX Nutanix cluster. Virtual networks are created on top of the physical network and allow quick provisioning of new clusters and Cazena deployments. The management network is implemented with a 1Gb switch for out-of-band management, and the data network is implemented with a pair of high-speed data switches that carry in-band traffic and provide high availability. See Figure 11 below.

Figure 11. Data and Management Networks

6.8.1 Data Network

The data network creates a private cluster among multiple nodes and is used for high-speed data transfer across worker and master nodes, and also for importing data into the Cazena cluster. The Cazena cluster typically connects to the customer’s corporate data network. This reference architecture demonstrates the Lenovo 10Gb Ethernet System Networking RackSwitch™ G8272 which provides 48 10Gb Ethernet ports with 40Gb uplink ports. Other available data network speeds are 25Gb with RackSwitch NE2532 or 100Gb with RackSwitch NE10032 using either copper or fiber links.

The two Ethernet NIC ports of each node are link aggregated into a single bonded network connection, giving up to double the bandwidth of an individual link. The bond also provides link redundancy if one link fails. The two data switches are connected together as a Virtual Link Aggregation Group (vLAG) pair using LACP to provide switch redundancy. Either high-speed data switch can drop out of the network and the other switch continues transferring traffic. The switch pair is connected with dual 10Gb links that form an inter-switch link (ISL), which maintains consistency between the two peer switches.
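The "up to double" bandwidth of the bonded connection, and the bandwidth that remains after a link failure, can be worked through with simple arithmetic. The sketch below is illustrative only; the function name `bonded_bandwidth_gbps` is hypothetical, and the result is an upper bound rather than a measured throughput.

```python
def bonded_bandwidth_gbps(link_gbps: float, links_up: int) -> float:
    """Upper-bound aggregate bandwidth of a bond with links_up healthy member links."""
    if links_up < 0:
        raise ValueError("links_up must be zero or greater")
    return link_gbps * links_up


# Two-port bonds at the 10Gb and 25Gb speeds discussed in this reference architecture.
for speed in (10, 25):
    healthy = bonded_bandwidth_gbps(speed, 2)
    degraded = bonded_bandwidth_gbps(speed, 1)
    print(f"{speed}Gb links: up to {healthy:g} Gb/s bonded, "
          f"{degraded:g} Gb/s after a single link failure")
```

The degraded figure is what a node retains when one link or one switch of the vLAG pair is unavailable, which is the case the redundancy design is meant to cover.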

6.8.2 Hardware Management Network

The hardware management network is a 1GbE network for out-of-band hardware management. The recommended 1GbE switch is the Lenovo RackSwitch NE0152T with 10Gb SFP+ uplink ports. Through the XClarity™ Controller management module (XCC) within the ThinkAgile HX3320 and HX5520 servers, the out-of-band network enables hardware-level management of cluster nodes, such as UEFI firmware configuration, hardware failure status and remote power control of the nodes.

Cazena has no dependency on the XCC management function. The Cloudera cluster and hardware management networks are then typically connected directly to the customer’s existing administrative network to facilitate remote maintenance of the cluster.

6.8.3 10Gb and 25Gb Data Network Configurations

Both 10Gb and 25Gb high-speed data network configurations are possible with a Cazena cluster. The 10Gb network speed is cost effective for most big data workloads. The higher 25Gb network speed can be selected to increase overall cluster performance. The table below shows the 10Gb and 25Gb network adapters available.

Table 11. HX5520 and HX3320 Network Connectivity

Description | Quantity Min/Max
1/10 GbE RJ-45 base ports
ThinkSystem 10Gb 2-port Base-T LOM (RJ-45) | 0 / 1
1/10 GbE RJ-45 expansion ports
Intel X550-T2 Dual Port 10GBase-T Adapter (RJ-45) | 0 / 1
Intel X550-T2 Dual Port 10GBase-T Adapter (RJ-45) | 0 / 2
10 GbE SFP+ base ports
ThinkSystem 10Gb 2-port SFP+ LOM | 0 / 1
ThinkSystem 10Gb 4-port SFP+ LOM | 0 / 1
10 GbE SFP+ expansion ports
Intel X710-DA2 PCIe 10Gb 2-Port SFP+ Ethernet Adapter | 0 / 2
10/25 GbE SFP28 expansion ports
Mellanox ConnectX-4 Lx 10/25GbE SFP28 2-Port PCIe Ethernet Adapter | 0 / 2

* One of the 1/10 GbE RJ-45 or 10 GbE SFP+ LOM cards is required for selection, and it provides base network connectivity. Optional expansion ports can be selected, if needed.

Predefined Lenovo HX Cluster Configurations

The intent of the predefined configurations is to aid initial sizing for customers and to show example starting points for three production cluster sizes: starter rack, half rack, and full rack. A non-production Proof of Concept (POC) configuration is provided for evaluation purposes only. These configurations consist of data nodes, master/utility nodes, Cazena service nodes, Cazena AppCloud nodes, and system management and edge nodes, as well as network switches, storage enclosures, and rack hardware. Figure 12 shows rack diagrams with a description of each component. Table 13 shows storage capacity of the predefined configurations.

Table 12. Pre-Defined Configuration Node Counts

Node Type | POC non-Prod | Starter Rack | Half Rack | Full Rack
SR630, Sys Mngt/Gateway | 0 | 1 | 1 | 1
HX3320, Master/Orchestration | 1 | 4 | 4 | 4
HX3320, Cazena AppCloud | 0 | 2 | 2 | 2
HX5520, Cloudera Data Node | 1 (Data+AppCloud) | 3 | 8 | 16
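To make the node counts in Table 12 concrete, the sketch below totals the nodes in each predefined configuration and compares the data-port demand with the 48 SFP+ ports of a G8272 switch. It assumes, for illustration only, that every node (including the system management node) connects one bonded port to each of the two data switches; the dictionary and function names are hypothetical.

```python
# Node counts copied from Table 12, keyed by node type and configuration.
NODE_COUNTS = {
    "SR630, Sys Mngt/Gateway":      {"POC": 0, "Starter": 1, "Half": 1, "Full": 1},
    "HX3320, Master/Orchestration": {"POC": 1, "Starter": 4, "Half": 4, "Full": 4},
    "HX3320, Cazena AppCloud":      {"POC": 0, "Starter": 2, "Half": 2, "Full": 2},
    "HX5520, Cloudera Data Node":   {"POC": 1, "Starter": 3, "Half": 8, "Full": 16},
}

G8272_SFP_PORTS = 48  # 10GbE SFP+ ports per G8272, as listed in section 6.1.5


def total_nodes(config: str) -> int:
    """Total number of physical nodes in a predefined configuration."""
    return sum(row[config] for row in NODE_COUNTS.values())


def data_ports_per_switch(config: str) -> int:
    """Ports used on each data switch, assuming one port per node per switch."""
    return total_nodes(config)


for config in ("POC", "Starter", "Half", "Full"):
    used = data_ports_per_switch(config)
    print(f"{config}: {total_nodes(config)} nodes, "
          f"{used}/{G8272_SFP_PORTS} SFP+ ports per data switch")
```

Under this assumption even the full rack leaves spare ports on each switch for uplinks and growth.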

Figure 12. Cazena Starter and Half Rack pre-defined configurations

6.9.1 Cluster Storage Capacity

Table 13. Cazena Cluster Storage Capacities

* To maintain a 100TB Cloudera data node limit, 8TB drives can be partitioned to 7TB each.
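The partitioning note can be checked with a short calculation. The sketch below assumes a data node with 14 data drives, a hypothetical count taken from the upper drive-bay figure quoted for the HX5520 in section 6.1.2; the actual number of data drives per node is not stated here, and the function names are illustrative.

```python
DATA_NODE_LIMIT_TB = 100  # Cloudera data node limit cited in the note above


def raw_capacity_tb(drive_count: int, tb_per_drive: int) -> int:
    """Total capacity presented by drive_count drives of tb_per_drive each."""
    return drive_count * tb_per_drive


def max_partition_tb(drive_count: int, drive_size_tb: int,
                     limit_tb: int = DATA_NODE_LIMIT_TB) -> int:
    """Largest whole-TB partition per drive that keeps the node at or under limit_tb."""
    return min(drive_size_tb, limit_tb // drive_count)


drives = 14  # hypothetical data drive count (assumption, see lead-in)
print("Unpartitioned 8TB drives:", raw_capacity_tb(drives, 8), "TB")
print("Partition size to stay within the limit:", max_partition_tb(drives, 8), "TB")
print("Capacity after partitioning:",
      raw_capacity_tb(drives, max_partition_tb(drives, 8)), "TB")
```

With 14 drives, full 8TB drives would exceed the limit, while 7TB partitions stay under it, which is consistent with the note.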

6.9.2 Storage Tiering with NVMe and SSD Drives

ThinkAgile HX appliances are configured with SSD drives as a standard configuration, and an all-SSD configuration is also available. Using a hybrid configuration of SSDs and HDDs provides a hot/cold storage tier for increased performance and reduced cost. SSDs have a write limit specification expressed in Drive Writes Per Day (DWPD). Lenovo Entry drives are the lowest cost and provide an average limit of 0.6 to 1.1 DWPD (equivalent to continuous writes over a 5-year period). Mainstream drives have a 3 to 5 DWPD limit. Performance drives provide over 10 DWPD. Each drive includes wear-leveling algorithms as a standard feature to spread write operations evenly across the drive's storage, which allows the drive to achieve its DWPD specification. The drive firmware handles wear leveling transparently, so application software does not need to manage it.
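A DWPD rating converts to a total write budget by multiplying the rating by the drive capacity and the number of days in the rating period. The sketch below applies that arithmetic to the DWPD ranges quoted above over the 5-year period; the 4TB drive capacity is a hypothetical value chosen for illustration, and the names are not from any Lenovo tool.

```python
RATING_YEARS = 5  # the 5-year period mentioned in the text

# DWPD ranges quoted in the text; Performance drives are "over 10", so only a floor is known.
DWPD_RANGES = {
    "Entry": (0.6, 1.1),
    "Mainstream": (3.0, 5.0),
    "Performance": (10.0, None),
}


def lifetime_writes_tb(dwpd: float, capacity_tb: float, years: int = RATING_YEARS) -> float:
    """Total data (TB) that can be written at a given DWPD over the rating period."""
    return dwpd * capacity_tb * 365 * years


capacity_tb = 4.0  # hypothetical drive capacity (assumption)
for drive_class, (low, high) in DWPD_RANGES.items():
    floor = lifetime_writes_tb(low, capacity_tb)
    if high is None:
        print(f"{drive_class}: more than {floor:,.0f} TB written over {RATING_YEARS} years")
    else:
        ceiling = lifetime_writes_tb(high, capacity_tb)
        print(f"{drive_class}: {floor:,.0f} to {ceiling:,.0f} TB written over {RATING_YEARS} years")
```

Comparing these budgets with the expected daily ingest per drive gives a rough indication of which drive class suits the hot tier.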


7 Deployment Considerations

This section describes other considerations for deploying the Cazena solution.

Increasing Cluster Performance

This reference architecture and its predefined configurations provide balanced cluster performance and a starting point for customizing the cluster for specific workload types. Hardware components such as processors, system memory, storage, and network configuration can be enhanced as needed. The sections below describe methods for creating a custom cluster configuration with respect to:

• CPU selection and core counts
• Memory selection and the 384GB/768GB performance advantage for higher bandwidth
• Storage performance with maximum HDDs
• SSD drives for high-speed storage and hot/warm/cold storage tiering
• Network performance

Processor Selection

The minimum Hadoop recommendation is one processor core per data disk, plus additional cores dedicated to specific Cazena software services and data analytics functions. The HX5520 data node configuration recommended in this reference architecture uses Intel Xeon Gold processors with a core count that provides at least this ratio, plus additional cores for data analytics.
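The core-per-disk guideline can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the number of additional cores reserved for services and analytics (16) is an assumed placeholder rather than a Cazena requirement, and the example treats the 12 HDDs of the HX5520 as the data disks.

```python
def minimum_cores(data_disks: int, extra_cores: int) -> int:
    """One core per data disk plus cores reserved for services and analytics."""
    if data_disks < 0 or extra_cores < 0:
        raise ValueError("data_disks and extra_cores must be non-negative")
    return data_disks + extra_cores


def has_enough_cores(sockets: int, cores_per_socket: int,
                     data_disks: int, extra_cores: int) -> bool:
    """Return True if the node's physical cores meet the guideline."""
    return sockets * cores_per_socket >= minimum_cores(data_disks, extra_cores)


if __name__ == "__main__":
    # 2-socket node with 24-core processors and 12 data disks;
    # the 16 extra cores are an assumption made for this example.
    needed = minimum_cores(data_disks=12, extra_cores=16)
    ok = has_enough_cores(sockets=2, cores_per_socket=24, data_disks=12, extra_cores=16)
    print(f"cores needed: {needed}, guideline met: {ok}")
```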

Workloads on the Cazena cluster may be skewed toward I/O-bound workloads, which create heavy network traffic, or CPU-bound workloads, which stress the CPU cores themselves. Intel Xeon Platinum processors provide higher core counts for the most demanding CPU-bound workloads.

Below are several examples of I/O-bound workloads:

• Sorting
• Indexing
• Grouping
• Data importing and exporting
• Data movement and transformation

Below are several examples of CPU-bound workloads:

• Clustering/Classification
• Complex text mining
• Natural-language processing
• Feature extraction

Memory Size and Performance

Low node or VM memory capacity can reduce cluster performance by causing workloads to thrash and spill to slower storage devices. In-memory workloads such as Apache Spark also benefit from larger memory capacity; for this reason, higher memory capacities are recommended for Spark workloads than for Hadoop MapReduce.

In addition to capacity, the number of populated memory DIMMs can negatively impact performance because of the memory interleaving techniques that Intel uses to maximize memory controller performance.

This reference architecture specifies node memory sizes appropriate to the Lenovo server types and to moderate Cazena workloads, but many memory choices are available. Table 14 shows memory capacity recommendations for this reference architecture based on cluster node type. Section 6.2.2, Node types and VM organization, shows how the VMs use this memory and other resources.

Table 14. Node/VM memory capacity recommendations

Memory Capacity | Node Type
384 GB | Master/Utility + Cazena Orchestration node
288 GB | Cazena AppCloud
288 - 384 GB | Data node minimum
578 - 3,000 GB | Data node in-memory Spark and high performance workloads

Table 15 provides specific memory configurations that maximize node memory and overall workload performance. All memory capacities in the table are valid and considered balanced (or near-balanced) sizes. The green color coding shows memory bandwidth relative to the highest performance possible, which is achieved with the maximum of 24 DIMMs available in the current generation of the Intel 2-socket architecture. If the node memory capacity determined during cluster planning is close to a 'best' (dark green) row, use that higher capacity to gain the bandwidth advantage as well. Certain populated DIMM quantities have memory interleaving advantages that give faster access times than other DIMM quantities.

When repurposing an existing cluster and reusing its memory, the 'better' (lighter green) memory configurations show capacities that may fit the DIMMs on hand. The resulting performance is lower than the best possible and is shown in the table.

The relative performance column is aligned with the whitepaper Intel Xeon Scalable Family Balanced Memory Configurations; a link is provided in the Resources section. The column is relative to the maximum memory bandwidth possible. Use the Lenovo memory configurator to verify the latest recommended memory capacities for the Balanced or Near-Balanced configurations. Other DIMM configurations appear in the configurator as unbalanced and should be avoided. See this link:

http://lesc.lenovo.com/ss/#/memory_configuration

Table 15. HX3320 and HX5520 - Recommended memory configurations for 2-socket nodes

Capacity | DIMM Description | Relative Performance | Quantity
128 GB | 16GB TruDDR4 Memory (1Rx4, 1.2V) 2666MHz RDIMM | 67% | 8
192 GB | 16GB TruDDR4 Memory (1Rx4, 1.2V) 2666MHz RDIMM | 97% | 12
256 GB | 32GB TruDDR4 Memory (2Rx4, 1.2V) 2666MHz RDIMM | 67% | 8
288 GB | 6x 8GB plus 6x 16GB TruDDR4 Memory (1Rx4, 1.2V) 2666MHz RDIMM | 94% | 12
384 GB | 32GB TruDDR4 Memory (2Rx4, 1.2V) 2666MHz RDIMM | 97% | 12
512 GB | 32GB TruDDR4 Memory (2Rx4, 1.2V) 2666MHz RDIMM | 68% | 16
578 GB | 6x 16GB (2Rx8) plus 6x 32GB TruDDR4 Memory (2Rx4, 1.2V) 2666MHz RDIMM | 94% | 12
768 GB | 64GB TruDDR4 Memory (4Rx4, 1.2V) 2666MHz LRDIMM | 97% | 12
768 GB | 32GB TruDDR4 Memory (2Rx4, 1.2V) 2666MHz RDIMM | 100% | 24
1,536 GB | 64GB TruDDR4 Memory (4Rx4, 1.2V) 2666MHz LRDIMM | 100% | 24
2,048 GB | 128GB TruDDR4 Memory (8Rx4 1.2V) 2666MHz 3DS RDIMM | 68% | 16
3,072 GB * | 128GB TruDDR4 Memory (8Rx4 1.2V) 2666MHz 3DS RDIMM | 100% | 24

DIMM counts to avoid: 2, 6, 10, 14, 18, 20, 22

* Requires CPU part numbers that support 1.5TB of memory per CPU.

Color key: Best, Better, Avoid

Notes:

1. DIMM quantity is of the same part number (speed, size, rank, etc.), unless noted.

2. Physical location of memory DIMMs in the numbered DIMM slots is important - follow the install guides for each particular Lenovo server for the correct location for each DIMM.
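As a planning aid, the "DIMM counts to avoid" list and the 24-DIMM maximum from Table 15 can be encoded in a small check. This is a sketch under stated assumptions: the function name is hypothetical, it covers only configurations where every DIMM has the same size, and it does not replace the Lenovo memory configurator, which remains the reference for balanced configurations.

```python
from typing import Tuple

# Values taken from Table 15 and its notes.
AVOID_DIMM_COUNTS = {2, 6, 10, 14, 18, 20, 22}
MAX_DIMMS = 24  # maximum for the 2-socket nodes described above


def check_uniform_dimm_config(dimm_gb: int, quantity: int) -> Tuple[int, bool]:
    """Return (total capacity in GB, True if the DIMM count is not on the avoid list).

    Assumes all DIMMs are the same size; mixed-size rows of Table 15 are out of scope.
    """
    if dimm_gb <= 0:
        raise ValueError("dimm_gb must be positive")
    if not 1 <= quantity <= MAX_DIMMS:
        raise ValueError(f"quantity must be between 1 and {MAX_DIMMS}")
    return dimm_gb * quantity, quantity not in AVOID_DIMM_COUNTS


if __name__ == "__main__":
    for size, qty in [(32, 12), (32, 10), (64, 24)]:
        capacity, acceptable = check_uniform_dimm_config(size, qty)
        status = "ok" if acceptable else "avoid"
        print(f"{qty} x {size}GB = {capacity} GB ({status})")
```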

Designing for Storage Capacity and Performance

The Lenovo HX5520 2U data node is configured with high-capacity 3.5" form factor Solid State Drives (SSDs) and rotating Hard Disk Drives (HDDs) for maximum capacity. Populating the maximum number of drives is the first choice for obtaining maximum parallelism and the resulting I/O bandwidth. The HX5520 in this reference architecture is configured with 2 SSDs and 12 HDDs. Performance can be increased further with 4 SSDs and up to 12 HDDs. The HX5520 also supports an all-SSD configuration for maximum storage performance.

7.4.1 Node Capacity

The 3.5" HDD form factor gives the maximum local storage capacity for a node. HDDs of 10TB and larger are available and can replace the 4TB HDDs used in this reference architecture, for a total of up to 100 TB per node (the maximum recommended by Cloudera). The 4TB HDD size provides the best balance of HDD capacity and performance per node. When data disk capacity is increased, some workloads may experience a decrease in disk parallelism, creating a bottleneck at that node that reduces performance. To increase total rack capacity while keeping the 4TB HDD size recommended in this reference architecture, increase the number of HX5520 nodes in the cluster to maintain good disk I/O performance.
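A quick way to test a drive layout against the 100 TB per-node recommendation is to sum the raw drive capacities. The sketch below uses hypothetical function names and assumes that every populated drive counts toward the limit, which the text does not state explicitly. Under that assumption, 14 drives at 8 TB exceed the limit while 14 drives at 7 TB do not, which is consistent with the note under Table 13.

```python
from typing import Iterable, Tuple

CLOUDERA_NODE_LIMIT_TB = 100.0  # per-node maximum quoted in the text


def node_raw_tb(drive_groups: Iterable[Tuple[int, float]]) -> float:
    """Sum raw capacity for groups given as (drive count, drive size in TB)."""
    return sum(count * size_tb for count, size_tb in drive_groups)


def within_node_limit(drive_groups: Iterable[Tuple[int, float]],
                      limit_tb: float = CLOUDERA_NODE_LIMIT_TB) -> bool:
    """Return True if the node's raw capacity does not exceed the limit."""
    return node_raw_tb(drive_groups) <= limit_tb


if __name__ == "__main__":
    layouts = {
        "reference (2x 3.84TB SSD + 12x 4TB HDD)": [(2, 3.84), (12, 4.0)],
        "14 drives at 8TB": [(14, 8.0)],
        "14 drives at 7TB": [(14, 7.0)],
    }
    for name, groups in layouts.items():
        print(f"{name}: {node_raw_tb(groups):.2f} TB, within limit: {within_node_limit(groups)}")
```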


7.4.2 Estimating Disk Space

When you are estimating disk space within a Cazena cluster on the Nutanix Acropolis file system, consider the following:

For improved fault tolerance and performance, Cazena replicates data blocks across multiple cluster worker nodes. In this reference architecture, the Acropolis file system maintains 2 replicas, and the Hadoop Distributed File System (HDFS) is also set to a replication factor of 2.

To ensure efficient file system operation and to allow time to add more storage capacity to the cluster if necessary, reserve 25% of the total capacity of the cluster.

Assuming 2 replicas maintained by HDFS and 2 replicas maintained by the Acropolis Distributed File System (ADFS), as configured in this reference architecture, the raw data disk space and the required number of nodes can be estimated by using the following equations:

Total Raw Storage Capacity = 3x HX5520 data nodes * (2x 3.84TB SSDs + 12x 4TB HDDs)
= 3 * (7.68 TB + 48 TB)
= 167.04 TB

ADFS Storage Capacity = Raw Storage Capacity / ADFS replication factor
= 167.04 TB / 2
= 83.52 TB

HDFS Storage Capacity = (20% * ADFS Capacity) + ((80% * ADFS Capacity) / HDFS replication factor)
= (0.20 * 83.52 TB) + (0.80 * 83.52 TB / 2)
= 16.7 TB + 33.4 TB
= 50.1 TB
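The equations above can be sketched as a small function so that the same estimate can be repeated for other node counts, such as the 3, 8, and 16 data nodes of the predefined configurations in Table 12. The function and parameter names are hypothetical. The 20%/80% split is taken directly from the HDFS equation; the parameter `undivided_fraction` simply names the share that is not divided by the HDFS replication factor, and the source does not explain it further.

```python
from typing import NamedTuple


class Capacity(NamedTuple):
    raw_tb: float
    adfs_tb: float
    hdfs_tb: float


def estimate_capacity(data_nodes: int,
                      ssd_count: int = 2, ssd_tb: float = 3.84,
                      hdd_count: int = 12, hdd_tb: float = 4.0,
                      adfs_rf: int = 2, hdfs_rf: int = 2,
                      undivided_fraction: float = 0.20) -> Capacity:
    """Reproduce the raw, ADFS, and HDFS capacity equations from the text."""
    if data_nodes < 1:
        raise ValueError("data_nodes must be at least 1")
    raw = data_nodes * (ssd_count * ssd_tb + hdd_count * hdd_tb)
    adfs = raw / adfs_rf
    hdfs = undivided_fraction * adfs + (1 - undivided_fraction) * adfs / hdfs_rf
    return Capacity(raw, adfs, hdfs)


if __name__ == "__main__":
    # Data node counts from Table 12.
    for name, nodes in [("Starter rack", 3), ("Half rack", 8), ("Full rack", 16)]:
        c = estimate_capacity(nodes)
        print(f"{name}: raw {c.raw_tb:.2f} TB, ADFS {c.adfs_tb:.2f} TB, HDFS {c.hdfs_tb:.2f} TB")
```

With the defaults and 3 data nodes, the function reproduces the worked figures: 167.04 TB raw, 83.52 TB ADFS, and about 50.1 TB HDFS.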

Scaling Considerations

The Cazena architecture scales linearly by adding individual nodes as needed, but it is important to plan ahead for sufficient lab space, open rack space, and network switch capacity to support the new nodes.

A Cazena cluster is scaled by adding HX3320 1U nodes and HX5520 2U storage-optimized nodes to accommodate additional data storage capacity and the associated Cazena management software services. Typically, identically configured HX5520 nodes and VMs are best for the data nodes, to maintain the same ratio of storage and compute capabilities. When the capacity of a rack is reached, new racks can be added to the Cazena cluster.
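When planning growth, the capacity estimate from section 7.4.2 can be inverted to ask how many identically configured HX5520 data nodes a target HDFS capacity requires. The sketch below is an illustration, not a sizing tool: the names are hypothetical, the per-node figures are those of this reference architecture (2x 3.84TB SSDs and 12x 4TB HDDs, replication factor 2 for both ADFS and HDFS), and applying the 25% reserve to the HDFS capacity is an assumption about how the reserve guideline is meant to be used.

```python
import math

# Per-node raw capacity for the reference HX5520 configuration (TB).
PER_NODE_RAW_TB = 2 * 3.84 + 12 * 4.0


def hdfs_tb_per_node(raw_tb: float = PER_NODE_RAW_TB, adfs_rf: int = 2,
                     hdfs_rf: int = 2, undivided_fraction: float = 0.20) -> float:
    """HDFS capacity contributed by one data node, following the 7.4.2 equations."""
    adfs = raw_tb / adfs_rf
    return undivided_fraction * adfs + (1 - undivided_fraction) * adfs / hdfs_rf


def nodes_required(target_hdfs_tb: float, reserve_fraction: float = 0.25) -> int:
    """Smallest node count whose HDFS capacity, less the reserve, meets the target."""
    if target_hdfs_tb <= 0:
        raise ValueError("target_hdfs_tb must be positive")
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    usable_per_node = hdfs_tb_per_node() * (1 - reserve_fraction)
    return math.ceil(target_hdfs_tb / usable_per_node)


if __name__ == "__main__":
    for target in (50, 100, 200):
        print(f"{target} TB target: {nodes_required(target)} data nodes (25% reserve)")
```

Because nodes are added in whole units, the result is rounded up; rack space and switch ports for the additional nodes still need to be checked separately.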


8 Cluster Hardware Bill of Materials

For reference, the complete hardware bill of materials is shown in the sections below. The Lenovo Data Center Solutions Configurator (DCSC) should be used to obtain the latest information and when placing purchase orders.

HX5520 Bill of Materials

Table 16. HX5520 Bill of Materials

Code | Description | Qty
7X84CTO4WW | Server : ThinkAgile HX5520 Appliance | 3
B0T8 | ThinkAgile HX552x Base | 3
B0W5 | XClarity Pro and Prism Pro | 3
B0VV | Nutanix Pro Edition | 3
B63T | Nutanix SW Stack on VMware ESXi 6.7 | 3
B0W1 | 3 Years | 3
B4HC | Intel Xeon Gold 6252 24C 150W 2.1GHz Processor | 3
B4H3 | ThinkSystem 32GB TruDDR4 2933MHz (2Rx4 1.2V) RDIMM | 6
B0SV | Nutanix Hybrid Node Config | 36
B6JZ | ThinkSystem 3.5" PM883 3.84TB Entry SATA 6Gb Hot Swap SSD | 3
AUU8 | ThinkSystem 3.5" 4TB 7.2K SATA 6Gb Hot Swap 512n HDD | 6
AUKJ | ThinkSystem 10Gb 2-port SFP+ LOM | 36
AVWF | ThinkSystem 1100W (230V/115V) Platinum Hot-Swap Power Supply | 3
6400 | 2.8m, 13A/100-250V, C13 to C14 Jumper Cord | 6
AXCH | ThinkSystem Toolless Slide Rail Kit with 2U CMA | 6
B0MK | Enable TPM 2.0 | 3
B4NL | ThinkSystem SR650 Refresh MB | 3
AUR9 | ThinkSystem SR650/SR550/SR590 3.5" SATA/SAS 12-Bay Backplane | 3
AURZ | ThinkSystem SR590/SR650 Rear HDD/SSD Kit | 3
A484 | Populate Rear Drives | 3
5977 | Select Storage devices - no configured RAID required | 3
AUNM | ThinkSystem 430-16i SAS/SATA 12Gb HBA | 3
8072 | General Racking Solution | 3
ATSB | Nutanix Solution Code MFG Instruction | 3
AUMV | ThinkSystem M.2 with Mirroring Enablement Kit | 3
AUUV | ThinkSystem M.2 CV3 128GB SATA 6Gbps Non-Hot Swap SSD | 3
AURC | ThinkSystem SR550/SR590/SR650 (x16/x8)/(x16/x16) PCIe FH Riser 2 Kit | 6
AUPW | ThinkSystem XClarity Controller Standard to Enterprise Upgrade | 3
AUS8 | ThinkSystem SR550/SR590/SR650 EIA Latch w/ VGA Upgrade Kit | 3
AUSS | MS 12x3.5" HDD BP Cable Kit | 3
AUTJ | ThinkSystem common Intel Label | 3
8971 | Integrate in manufacturing | 3
B13M | ThinkAgile EIA Plate | 3
B0SP | HX Series 2U Agency label 2U | 3
AUTA | XCC Network Access Label | 3
AUSG | ThinkSystem SR650 6038 Fan module | 3
B13Q | ThinkAgile 2U Service Label LI | 3
AUT8 | ThinkSystem 1100W RDN PSU Caution Label | 3
AUSF | Lenovo ThinkSystem 2U MS CPU Performance Heatsink | 3
AVJ2 | ThinkSystem 4R CPU HS Clip | 6
AURR | ThinkSystem M3.5 Screw for Riser 2x2pcs and SR530/550/558/570/590 Planar 5pcs | 6
AUTQ | ThinkSystem small Lenovo Label for 24x2.5"/12x3.5"/10x2.5" | 3
AUSA | Lenovo ThinkSystem M3.5" Screw for EIA | 24
AUTS | ThinkSystem 2U 12 3.5"HDD Conf HDD sequence Label | 3
AURP | Lenovo ThinkSystem 2U 2FH Riser Bracket | 3
B173 | Companion Part for XClarity Controller Standard to Enterprise Upgrade in Factory | 3
AURM | ThinkSystem SR550/SR650/SR590 Right EIA Latch with FIO | 3
9220 | Preload by Hardware Feature Specify | 3
AWFF | ThinkSystem SR650 WW Lenovo LPK | 1
A102 | Advanced Grouping | 3
2306 | Integration >1U Component | 3
B0ML | Feature Enable TPM on MB | 3

HX3320 Bill of Materials

Table 17. HX3320 Bill of Materials

Code | Description | Qty
7X83CTO3WW | HX3320-2019 : ThinkAgile HX3320 Appliance | 6
B0T2 | ThinkAgile HX332x Base | 6
B0VV | Nutanix Pro Edition | 6
B63T | Nutanix SW Stack on VMware ESXi 6.7 | 6
B0W1 | 3 Years | 6
B4Z6 | None (If ThinkAgile Deployment services are not selected, Lenovo strongly recommends that approved business partners perform the deployment) | 6
B4HC | Intel Xeon Gold 6252 24C 150W 2.1GHz Processor | 12
B4H3 | ThinkSystem 32GB TruDDR4 2933MHz (2Rx4 1.2V) RDIMM | 72
B0SV | Nutanix Hybrid Node Config | 6
B34M | ThinkSystem 2.5" PM883 3.84TB Entry SATA 6Gb Hot Swap SSD | 12
AUUJ | ThinkSystem 2.5" 2TB 7.2K SATA 6Gb Hot Swap 512e HDD | 48
AUKJ | ThinkSystem 10Gb 2-port SFP+ LOM | 6
AVWB | ThinkSystem 1100W (230V/115V) Platinum Hot-Swap Power Supply | 12
6400 | 2.8m, 13A/100-250V, C13 to C14 Jumper Cord | 12
A51P | 2m Passive DAC SFP+ Cable | 12
AXCB | ThinkSystem Toolless Slide Rail Kit with 1U CMA | 6
B0MK | Enable TPM 2.0 | 6
AUW9 | ThinkSystem SR630/SR570 2.5" AnyBay 10-Bay Backplane | 6
5977 | Select Storage devices - no configured RAID required | 6
AUNM | ThinkSystem 430-16i SAS/SATA 12Gb HBA | 6
ATSB | Nutanix Solution Code MFG Instruction | 6
AUMV | ThinkSystem M.2 with Mirroring Enablement Kit | 6
AUUV | ThinkSystem M.2 CV3 128GB SATA 6Gbps Non-Hot Swap SSD | 12
AUWC | ThinkSystem SR530/SR570/SR630 x8/x16 PCIe LP+LP Riser 1 Kit | 6
AUWA | ThinkSystem SR530/SR570/SR630 x16 PCIe LP Riser 2 Kit | 6
AUWQ | Lenovo ThinkSystem 1U LP+LP BF Riser Bracket | 6
AUPW | ThinkSystem XClarity Controller Standard to Enterprise Upgrade | 6
AUWW | Front VGA Connector Upgrade Kit for 1U 2.5" | 6
AUWV | 10x2.5"Cable Kit (1U) | 6
AUTJ | ThinkSystem common Intel Label | 6
B0SQ | HX Badge 1 | 6
B13M | ThinkAgile EIA Plate | 6
B4NK | ThinkSystem SR630 Refresh MB | 6
AWF9 | ThinkSystem Response time Service Label LI | 6
AUTA | XCC Network Access Label | 6
B13S | HX 1U Service label LI | 6
AUW7 | ThinkSystem SR630 4056 Fan Module | 12
ASFE | Notice for Advanced Format 512e Hard Disk Drives | 6
AUWF | Lenovo ThinkSystem Super Cap Holder Dummy | 6
B0SN | HX Series 1U Agency label | 6
AURS | Lenovo ThinkSystem Memory Dummy | 72
AVJ2 | ThinkSystem 4R CPU HS Clip | 12
AUT8 | ThinkSystem 1100W RDN PSU Caution Label | 6
AULQ | ThinkSystem 1U CPU Performance Heatsink | 12
AUTQ | ThinkSystem small Lenovo Label for 24x2.5"/12x3.5"/10x2.5" | 6
AURR | ThinkSystem M3.5 Screw for Riser 2x2pcs and SR530/550/558/570/590 Planar 5pcs | 24
B173 | Companion Part for XClarity Controller Standard to Enterprise Upgrade in Factory | 6
9220 | Preload by Hardware Feature Specify | 6
AWGE | ThinkSystem SR630 WW Lenovo LPK | 1
A102 | Advanced Grouping | 6
8072 | General Racking Solution | 6
8971 | Integrate in manufacturing | 6
2305 | Integration 1U Component | 6
B6C2 | Node Tebibytes | 42
B0ML | Feature Enable TPM on MB | 6
B6C1 | Node Cores | 288
8086 | No Publications Selected | 5
AUWN | Lenovo ThinkSystem 1U LP Riser Bracket | 6

Systems Management Node Bill of Materials

Table 18 lists the BOM for the Systems Management Node.

Table 18. Systems Management Node

Code | Description | Qty
7X02CTO1WW | SR630 System Management : ThinkSystem SR630 - 3yr Warranty | 1
AUW2 | ThinkSystem SR630 3.5" Chassis with 4 Bays | 1
B4HU | Intel Xeon Bronze 3204 6C 85W 1.9GHz Processor | 1
AUNB | ThinkSystem 16GB TruDDR4 2666 MHz (1Rx4 1.2V) RDIMM | 1
AUW8 | ThinkSystem SR530/SR630/SR570 3.5" SATA/SAS 4-Bay Backplane | 1
5977 | Select Storage devices - no configured RAID required | 1
AUNL | ThinkSystem 430-8i SAS/SATA 12Gb HBA | 1
AUMV | ThinkSystem M.2 with Mirroring Enablement Kit | 1
AUUV | ThinkSystem M.2 CV3 128GB SATA 6Gbps Non-Hot Swap SSD | 2
AVW8 | ThinkSystem 550W (230V/115V) Platinum Hot-Swap Power Supply | 2
6400 | 2.8m, 13A/100-250V, C13 to C14 Jumper Cord | 2
AUPW | ThinkSystem XClarity Controller Standard to Enterprise Upgrade | 1
B0MJ | Feature Enable TPM 1.2 | 1
AXCA | ThinkSystem Toolless Slide Rail | 1
AUWM | Lenovo ThinkSystem 1U LP+LP BF Riser Dummy | 1
AWF9 | ThinkSystem Response time Service Label LI | 1
AUWG | Lenovo ThinkSystem 1U VGA Filler | 1
AUWL | Lenovo ThinkSystem 1U LP Riser Dummy | 1
AUX3 | ThinkSystem SR630 Model Number Label | 1
AUS6 | Lenovo ThinkSystem 1U height CPU HS Dummy | 1
AUTV | ThinkSystem large Label for non-24x2.5"/12x3.5"/10x2.5" | 1
AURY | Lenovo ThinkSystem PHY Module Dummy | 1
B6ZL | ThinkSystem SR630 Agency Label Lenovo, No ES Mark | 1
B173 | Companion Part for XClarity Controller Standard to Enterprise Upgrade in Factory | 1
AVWK | ThinkSystem EIA Plate with Lenovo Logo | 1
AVWH | ThinkSystem 550W RDN PSU Caution Label | 1
B4NK | ThinkSystem SR630 Refresh MB | 1
AUWK | Lenovo ThinkSystem 4056 Fan Dummy | 1
AUTA | XCC Network Access Label | 1
AUTJ | ThinkSystem common Intel Label | 1
AVJ3 | ThinkSystem 1x1 3.5" HDD Filler | 4
AVJ2 | ThinkSystem 4R CPU HS Clip | 1
AULP | ThinkSystem 1U CPU Heatsink | 1
AUX4 | MS 1U Service Label LI | 1
8072 | General Racking Solution | 1
8971 | Integrate in manufacturing | 1
2305 | Integration 1U Component | 1
B0ML | Feature Enable TPM on MB | 1
AWGE | ThinkSystem SR630 WW Lenovo LPK | 1
AUWY | ThinkSystem SR650 12x3.5" SATA/SAS/NVME BP Cable Kit | 1
AUWL | Lenovo ThinkSystem 1U LP Riser Dummy |

Network

Table 19. High Speed Data Switch

Code | Description | Qty
7159HCW | Switch : Lenovo RackSwitch G8272 (Rear to Front) | 2
ASRD | Lenovo RackSwitch G8272 (Rear to Front) | 2
ASTN | Air Inlet Duct for 487 mm RackSwitch | 2
A3KP | Adjustable 19" 4 Post Rail Kit | 2
6201 | 1.5m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 4
7159HCW | Switch : Lenovo RackSwitch G8272 (Rear to Front) | 2
ASRD | Lenovo RackSwitch G8272 (Rear to Front) | 2
ASTN | Air Inlet Duct for 487 mm RackSwitch | 2

Table 20. Management Switch

Code | Description | Qty
7Y81CTO1WW | Switch : Lenovo ThinkSystem NE0152T RackSwitch (Rear to Front) | 1
B45U | Lenovo ThinkSystem NE0152T RackSwitch (Rear to Front) | 1
6201 | 1.5m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 2
00D6061 | Air Inlet Duct for 442 mm RackSwitch | 1
00D6185 | Adjustable 19" 4 Post Rail Kit | 1

Rack

Table 21. 42U Rack

Code | Description | Qty
9363RC4 | Rack : 42U 1100mm Deep Dynamic Rack | 1
A1RC | 42U 1100mm Enterprise V2 Dynamic Rack | 1
4275 | 5U black plastic filler panel | 5
4271 | 1U black plastic filler panel | 1
8072 | General Racking Solution | 1
8971 | Integrate in manufacturing | 1
9134 | use 200V (high voltage) | 1
9123 | 2-bay arrangement | 1
3101 | Rack 01 | 1
9363RC4 | Rack : 42U 1100mm Deep Dynamic Rack | 1


9 Acknowledgements

This reference architecture document has benefited greatly from the detailed and careful review comments provided by colleagues at Lenovo and Cazena.

Lenovo business review

• Prasad Venkatachar – Sr. Solutions Product Manager

Cazena technical contributions and review

• Monas Bhar, Cazena Software Engineer

• Sujatha Mizar, Cazena Software Engineer


10 Resources

For more information, see the following resources:

Lenovo ThinkAgile HX Series Product Guide: https://lenovopress.com/ds0019.pdf

Lenovo ThinkAgile HX5520 appliance: https://lenovopress.com/lp1124-thinkagile-hx5520-appliance-gen2

Lenovo ThinkAgile HX3320 appliance: https://lenovopress.com/lp1121-thinkagile-hx3320-appliance-gen2

Lenovo RackSwitch NE0152T (1GbE Switch):

• Product guide: https://lenovopress.com/lp0965-lenovo-thinksystem-ne0152t-gigabit-ethernet-switch

Lenovo RackSwitch G8272 (10GbE Switch):

• Lenovo Press Product page: https://lenovopress.com/tips1267-lenovo-rackswitch-g8272

Intel Xeon Scalable Family Balanced Memory Configurations:

• Whitepaper: https://lenovopress.com/lp0742-intel-xeon-scalable-family-balanced-memory-configurations

Lenovo XClarity Administrator:

• Product page: https://lenovopress.com/tips1200-lenovo-xclarity-administrator

Cazena:

• Cazena Overview: https://www.cazena.com/what-is-cazena

• Cazena Documentation: https://docs.cazena.com/docs/

• Additional Cazena resources: https://www.cazena.com/resources

Nutanix:

• Cloudera with Nutanix white paper: https://www.nutanix.com/go/cloudera-with-nutanix

• vSphere admin guide: http://download.nutanix.com/documentation/v510/vSphere-Admin6-FLEX-AOS-v510.pdf

Cloudera:

• Cloudera Distribution for Hadoop (CDH): http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html

• Cloudera Installation Guide: https://www.cloudera.com/documentation/enterprise/5-15-x/topics/installation.html

• Cloudera products and services: https://www.cloudera.com/products.html

• Cloudera solutions: http://www.cloudera.com/content/cloudera/en/solutions.html

VMware:

• VMware Publications: https://www.vmware.com/support/pubs/

Red Hat:

• Red Hat Linux operating system: https://www.redhat.com/en


11 Document History

Version 1.0, 27 June 2019: First version


Trademarks and special notices

© Copyright Lenovo 2019.

References in this document to Lenovo products or services do not imply that Lenovo intends to make them available in every country.

Lenovo, the Lenovo logo, ThinkCenter, ThinkVision, ThinkVantage, ThinkPlus and Rescue and Recovery are trademarks of Lenovo.

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel Inside (logos), MMX, and Pentium are trademarks of Intel Corporation in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Information is provided "AS IS" without warranty of any kind.

All customer examples described are presented as illustrations of how those customers have used Lenovo products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Information concerning non-Lenovo products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by Lenovo. Sources for non-Lenovo list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. Lenovo has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-Lenovo products. Questions on the capability of non-Lenovo products should be addressed to the supplier of those products.

All statements regarding Lenovo future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local Lenovo office or Lenovo authorized reseller for the full text of the specific Statement of Direction.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in Lenovo product announcements. The information is presented here to communicate Lenovo’s current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard Lenovo benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.

Photographs shown are of engineering prototypes. Changes may be incorporated in production models.

Any references in this information to non-Lenovo websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this Lenovo product and use of those websites is at your own risk.