Real World Architecture and Deployment Best Practices
dataworks-summithadoop-summit -
Architecture Best Practices For Big Data Deployment
© Copyright 2017 Dell Inc.
J. Cory Minton, Principal SE and Data Analytics Leader
• 6+ Years at Dell EMC
• Lead GTM for Data Analytics Blueprint
• I ♥ Hardware!
• Startup Advisor
• Oracle and SAP Background
• BS Engineering and MBA
• www.BigDataBeard.com
• www.GoWithDaddy.com
Problem…
Provide the fundamentals for sizing a Hadoop deployment and share best practices learned along the way.
Assumption #1: General Understanding of the Hadoop Ecosystem
Assumption #2: General Understanding of Hadoop Infrastructure Components
Machine Usage | Description | Machine Hardware Class
Management Node | Runs Ambari Server, supporting databases, Ambari Metrics Service, and other optional services | Master Server
Edge Node | Runs edge services, such as Knox, Hue, and other front-end client services | Master Server
Master Node | Runs master services, such as NameNode, Resource Manager, Oozie, and HBase Master | Master Server
Data Node | Runs HDFS DataNode, YARN NodeManager, and optionally HBase Region Server. These nodes make up the majority of the cluster and provide workload execution and data storage. | Slave Server
Kafka Node | When Kafka workload volume exceeds what the existing capacity on edge nodes can reasonably handle, dedicated Kafka nodes can be used. | Slave Server
Generalizations
90% Empirical + 10% Experience 100% Perfect Every Time
• Virtualization Realities
– It works.
• Cloud vs On-Prem
– Questions/Problems/Considerations are the same, just not in your DC.
– It’s definitely virtual, unless…
• Sizing Approaches
– Assuming new, tuning later
– Start with cluster sizes
– Then get machine specs
• 3X Replicas vs Erasure Coding
– Failure happens, how you prep depends on your goals.
– Overhead is better…
– Focus on today…and assume 4.5X for sizing
• Compression Impacts
– More space savings, but at a cost.
• Throughput per Core
– 100-150 MB/s for all activities in cluster.
– 25-50 MB/s for actual processing tasks (not inclusive of HDFS replication or non-local I/O)
Sizing Fundamentals: Capacity Based
DAS Sizing – Capacity Based
Usable Capacity Needed × 4.5 ÷ Raw TB per Node = Worker Node Count
Example:
How many worker nodes for 100 TB usable?
Assuming 24 TB/server…
100 TB × 4.5 ÷ 24 TB = 18.75
Worker Node Count = 19 (round up)
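The capacity formula above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function name and parameter names are hypothetical, and the 4.5 multiplier is simply the sizing assumption stated on the slides.

```python
import math

# Sizing multiplier taken from the slides ("assume 4.5X for sizing").
RAW_TO_USABLE_FACTOR = 4.5


def worker_nodes_for_capacity(usable_tb: float,
                              raw_tb_per_node: float,
                              factor: float = RAW_TO_USABLE_FACTOR) -> int:
    """Return the worker node count for a usable-capacity target, rounded up.

    Hypothetical helper: usable_tb is the usable capacity needed and
    raw_tb_per_node is the raw disk capacity of one worker server.
    """
    if usable_tb < 0 or raw_tb_per_node <= 0:
        raise ValueError("capacity must be non-negative and node size positive")
    return math.ceil(usable_tb * factor / raw_tb_per_node)


# Slide example: 100 TB usable on 24 TB servers.
print(worker_nodes_for_capacity(100, 24))  # 100 * 4.5 / 24 = 18.75, rounded up to 19
```

Rounding up rather than to the nearest integer follows the slide's "(round up)" note: a fractional node still has to be bought as a whole server.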
Determine Cluster Class
Cluster Size Type | # of Nodes | Approximate # of Racks* | # of Master Servers
Test/Dev | < 8 | 1 | 1 - 2
Mini | 8 - 16 | 1 - 2 | 3
Small | 17 - 40 | 2 - 4 | 4 - 6
Medium | 41 - 120 | 3 - 8 | 7 - 9
Large | 121 - 512 | 8 - 32 | 10 - 12
Jumbo | > 512 | > 32 | Let's Talk
*Note – Racks can vary by rack size, chassis used, etc.
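As a rough sketch, the cluster class table can be expressed as a lookup keyed on node count. The class and function names below are hypothetical, the rack and master-server ranges are kept as strings exactly as the table gives them, and the slide does not say whether "# of Nodes" counts only workers, so the caller decides what to pass in.

```python
from typing import NamedTuple


class ClusterClass(NamedTuple):
    name: str
    max_nodes: float       # inclusive upper bound on node count
    racks: str             # approximate rack count, as given in the table
    master_servers: str    # master server count, as given in the table


# Rows transcribed from the cluster class table, in ascending size order.
CLUSTER_CLASSES = [
    ClusterClass("Test/Dev", 7, "1", "1-2"),
    ClusterClass("Mini", 16, "1-2", "3"),
    ClusterClass("Small", 40, "2-4", "4-6"),
    ClusterClass("Medium", 120, "3-8", "7-9"),
    ClusterClass("Large", 512, "8-32", "10-12"),
    ClusterClass("Jumbo", float("inf"), "> 32", "Let's Talk"),
]


def cluster_class_for(node_count: int) -> ClusterClass:
    """Return the first class whose upper bound covers node_count."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    for cluster_class in CLUSTER_CLASSES:
        if node_count <= cluster_class.max_nodes:
            return cluster_class
    raise AssertionError("unreachable: Jumbo has no upper bound")


result = cluster_class_for(19)
print(result.name, result.master_servers)  # Small 4-6
```

With the 19 worker nodes from the capacity example, the lookup lands in the Small class, which calls for 4 to 6 master servers.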
Example Output
Worker Node: The worker node is responsible for data storage along with batch and real time data processing
Management Node: The management node is responsible for all processes related to the management and operation of a Hadoop cluster
Edge Node: The gateway node is responsible for receiving all job requests from outside the cluster and submitting them for processing
Master Node: The master node is responsible for management oversight of YARN, HDFS and HBase
Top of Rack Switch: The top of rack switch is responsible for managing physical network traffic on the Hadoop cluster
Number of Machines: 19 worker nodes, 4-6 master servers
Sizing Fundamentals: Performance Based
SLA Driven Hadoop Sizing
1. If the following data processing SLAs are known:
– Number of GB of data to process
– Amount of time to process it
2. … then size based on MB/sec throughput for the minimum.
– Assume 50 MB/s per core for the calculation (remember much I/O is data movement)
3. Capacity may dictate more nodes; go with the higher number.
4. Follow other best practices for master/slave ratios.
Example:
How many worker nodes to process 1 PB/day?
Convert to MB/s: 1 TB/day = 11.57 MB/s, so 1 PB/day ≈ 11,574 MB/s
At 50 MB/s per core: 11,574 ÷ 50 ≈ 232 cores
Assuming 24 cores/node: 232 ÷ 24 ≈ 9.7
Worker Node Count = 10
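The SLA-driven steps above can be sketched as follows. This is an illustrative calculation under the slide's stated assumptions (50 MB/s per core, 24 cores per node, decimal units so that 1 TB/day works out to 11.57 MB/s); the function and constant names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # decimal units, consistent with 1 TB/day = 11.57 MB/s


def worker_nodes_for_throughput(tb_per_day: float,
                                mb_per_s_per_core: float = 50.0,
                                cores_per_node: int = 24) -> int:
    """Return the minimum worker node count needed to meet a daily volume SLA."""
    if tb_per_day < 0 or mb_per_s_per_core <= 0 or cores_per_node <= 0:
        raise ValueError("inputs must be positive")
    required_mb_per_s = tb_per_day * MB_PER_TB / SECONDS_PER_DAY
    cores_needed = required_mb_per_s / mb_per_s_per_core
    return math.ceil(cores_needed / cores_per_node)


def required_worker_nodes(capacity_nodes: int, throughput_nodes: int) -> int:
    """Step 3 of the list: capacity may dictate more nodes, so take the higher number."""
    return max(capacity_nodes, throughput_nodes)


throughput_nodes = worker_nodes_for_throughput(1000)  # 1 PB/day = 1000 TB/day
print(throughput_nodes)  # 10
```

If a capacity-based calculation for the same cluster produced a larger count, `required_worker_nodes` would return that larger count instead.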
Things that don’t go in this cluster…
• HAWQ – give it its own nodes
• Impala – same
• SpringXD/HDF – same; CPU and memory hogs
• Spark – can run here, but likes memory…most run it separately on dedicated HW
• Kafka – can run here, but most run it separately
Sizing Fundamentals: Sizing Nodes for Workload
What is your workload?
• Expecting Complexity? Increase CPU and Memory Ratio.
– Machine Learning?
– Image Processing?
– Natural Language Processing?
• Low Latency Applications? Increase Memory.
– Storm, low latency Hive, Tez, HBase?
– Spark?
• Traditional ETL and Archiving? Increase Disk.
– Pig, Hive, MapReduce?
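One way to keep this workload guidance at hand during planning is a small lookup table. The sketch below only restates the slide's three categories; the dictionary keys, the function name, and the case-insensitive exact-match behavior are illustrative choices, not part of the original deck.

```python
from typing import Dict, List, Tuple

# Each entry: category -> (guidance from the slide, example workloads from the slide).
WORKLOAD_GUIDANCE: Dict[str, Tuple[str, List[str]]] = {
    "complexity": (
        "Increase CPU and memory ratio",
        ["Machine Learning", "Image Processing", "Natural Language Processing"],
    ),
    "low_latency": (
        "Increase memory",
        ["Storm", "low latency Hive", "Tez", "HBase", "Spark"],
    ),
    "etl_archiving": (
        "Increase disk",
        ["Pig", "Hive", "MapReduce"],
    ),
}


def guidance_for(workload: str) -> List[str]:
    """Return the guidance for every category that lists this workload by name."""
    wanted = workload.strip().lower()
    return [
        guidance
        for guidance, examples in WORKLOAD_GUIDANCE.values()
        if wanted in (example.lower() for example in examples)
    ]


print(guidance_for("Spark"))      # ['Increase memory']
print(guidance_for("MapReduce"))  # ['Increase disk']
```

A workload that is not named on the slide returns an empty list, which is a prompt to classify it by hand rather than a recommendation.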
Cluster Type Summary
Cluster Type: Machine Recommendations
Storage Oriented
• 2U Server, Single Socket
• 64GB RAM
• 12-36 3.5” NL-SAS / SATA 7200 RPM
Balanced
• 2U Server, Dual Socket
• 128GB RAM
• 3.5” 2-4TB NL-SAS / SATA 7200 RPM
Performance
• 2U Server, Dual Socket
• 256GB RAM
• 24 – 10K SAS Drives
High Performance
• 1U Server, Dual Socket or 2U, 4 Socket
• MAX RAM
• SSD Drives
Dell EMC Blueprints: Any workload. Any environment. Any experience.
Converged Continuum
Build: Maximum Flexibility
Buy: Turnkey Outcomes, Maximum Agility
• Ready Systems
• Ready Bundles
• Ready Nodes
• Native Hybrid Cloud & Analytic Insights Module
• Enterprise Hybrid Cloud
• Blocks
• Racks
• Appliances
Proven outcomes, Global Services, Custom Financing
Dell EMC Blueprints Program: Accelerating IT. Simplifying Build to Buy for Customers.
• SOFTWARE DEFINED
• HPC
• DATA ANALYTICS
• BUSINESS APPLICATIONS
Portfolio: Data Analytics Blueprint Solutions
Benefits
BUY
• Fastest time to value
• Optimized and tuned for use case
• Greatest risk reduction
• Solution lifecycle automation
BUILD
• Greater flexibility
• Validated for use case
• Heterogeneity with lower risk
• Component lifecycle automation options
Consumption models
Solutions
• Dell EMC Ready Bundle for Cloudera Hadoop with Isilon Shared Storage
• Dell EMC Ready Bundle for Cloudera Hadoop (ETL Offload, R730XD, FX2)
• Dell EMC Splunk solution on VxRail All Flash
• Dell EMC Analytic Insights Module
• Dell EMC Ready Bundle for Hortonworks Hadoop (R730XD)
Dell EMC Splunk solution on Vblock 540
Dell EMC Hadoop Blueprint
Primary use case: Scale-out solution to optimize data management, processing and analytics
Solution benefits
• Enables organizations to gain business insights to build unique competitive advantages
• Simplifies the design, architecture, deployment and configuration of a Hadoop environment
Differentiation
• Tested and validated architecture
• Integrates with current systems
• Leverages existing tools and resources
• Flexible and scalable to process multi-structured data volumes
Scales from 5 to 252 nodes, 3.8 PB
Pod Network
• 2x Dell Networking S4048 10GbE Pod Switches
• 1x S3038 iDRAC Switch
Data Nodes
• 10x PowerEdge R730xd with 3.5” Drives – 48 TB, or
• 10x PowerEdge R730xd with 2.5” Drives – 24 TB
Infrastructure Nodes
• 1x Dell PowerEdge™ R630 Admin Node
• 3x PowerEdge R730XD Name Nodes
• 1x PowerEdge R730XD Edge Node
Cluster Network
• 2x Dell Networking S6000 40GbE Cluster Switches
Hortonworks HDP or Cloudera CDH
Dell OpenManage™ / iDRAC with Lifecycle Controller
Realized value with Dell EMC Data Analytics
• $15-25 million in customer savings with 360-degree supply chain view – Siemens
• 30% reduction of data warehouse costs – Danske Bank
• 4X faster predictive analytics – Dell
• 58% reduction in post-operative infections – University of Iowa Hospitals and Clinics
Global Solution Centers
Validate. Evaluate. Collaborate. Innovate.
Solution centers: Staffed with engineers and Blueprint solution experts
Engagements begin with your challenges
• Briefings with a team of experts
• Architectural design sessions
• Proofs of concept
Contact Blueprints specialist: [email protected]
Accelerate your journey
Visit: Dell.com/Blueprints
Questions?
Related Sessions
• IT Leadership Track - Modern Architecture Concepts for Big Data
• Hands-On Labs