Hadoop Robot from eBay at China Hadoop Summit 2015
-
Upload
polo-li -
Category
Technology
-
view
321 -
download
4
Transcript of Hadoop Robot from eBay at China Hadoop Summit 2015
![Page 1: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/1.jpg)
HADOOP ROBOT Action and Remediation Center for eBay's Hadoop Clusters
Polo Li | 李健
eBay Cloud Services
![Page 2: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/2.jpg)
About me
2
![Page 3: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/3.jpg)
Agenda
• Hadoop @eBay
• Problem Statement
• What is Hadoop Robot
• Q&A
3
![Page 4: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/4.jpg)
4
Hadoop @eBay
1-10 nodes
2007
100+ nodes
1000 + core
1 PB
2010
2011
1000+ node
10,000+ core
10+ PB
4000+ node
40,000+ core
50+ PB
2013
2015
10,000+ nodes
150,000+ cores
150+ PB
2009
10+ nodes
![Page 5: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/5.jpg)
Hadoop @eBay
•10+ large Hadoop Clusters
•10,000+ nodes
•50,000+ jobs per day
•50,000,000+ tasks per day
5
![Page 6: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/6.jpg)
Shared vs Dedicated Clusters
• Shared clusters
• Used primarily for analytics of user behavior and inventory
• Mix of batch and ad-hoc jobs
• Mix of MR, YARN, Hive, Pig, Cascading, etc.
• Hadoop and HBase security enabled
• Dedicated clusters
• Very specific use case like index building
• Tight SLAs for jobs (in order of minutes)
• Immediate revenue impact
• Usually smaller than our shared clusters, but still large (600+ nodes)
6
![Page 7: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/7.jpg)
7
Hadoop Platform Stack
![Page 8: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/8.jpg)
Agenda
• Hadoop @eBay
• Problem Statement
• What is Hadoop Robot
• Q&A
8
![Page 9: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/9.jpg)
Problem Statements
• Long trouble shooting time
• Bad cluster performance
• Too many different skus, operating systems and metadatas
• Human resource Cost
• Cluster Availability
9
![Page 10: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/10.jpg)
Traditional Trouble Shooting Pipeline
Step 1 • Check failed application task logs to find out the suspicious hadoop nodes
Step 2 • Check the suspicious hadoop node hardware && system status
Step 3 • Check hadoop metrics and hadoop daemon logs
Step 4 • Check hadoop source code
10
![Page 11: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/11.jpg)
Victim or Perpetrator ?
Sometimes you think you've found the perpetrators, however it may turn out to be
the victim.
11
![Page 12: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/12.jpg)
What may impact cluster performance?
• Hardware
• System
• Hadoop
• JVM
• …
12
![Page 13: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/13.jpg)
Hardware issues
84%
3%
4% 4%
1%
1%
1%
2%
Hardware Issues distribution
Disk
Nic
Memory
Cpu
Chassis
Fan
Mobo
PSU
13
![Page 14: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/14.jpg)
System issues
•High load
•Node reboot
•Disk full
•Network saturated fully
•OOM
•Kernel bug
•Orphan processes
•…
14
![Page 15: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/15.jpg)
Hadoop Issues
•Hot spot
•Low hdfs locality
•High RPC call queue length
•Hadoop configuration inconsistent issues
•Big resource consuming applications
•Big log/hdfs output applications
•Bad Application scheduling
•…
15
![Page 16: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/16.jpg)
About hadoop node
16
NODE
SKU
OS METADATA
![Page 17: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/17.jpg)
Different SKU
•Cpu core
•Memory size
•Disk number
•Disk size
•Nic Speed
•…
17
![Page 18: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/18.jpg)
Different OS
• Image profile
• Image version
•Patches
•…
18
![Page 19: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/19.jpg)
Different Metadata
• Hadoop Metadata
•Daemons
•Packages
•Configurations
• OS Metadata
•Services
•Scripts/Tools
•Configurations
19
![Page 20: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/20.jpg)
Human Resource Cost
• Detect node (varies based on admin experiences)
• Node decommission (1-2 Hours)
• Vendor offline remediation (3-5 Days)
• Node reimage(45 Mins)
• Hadoop Installation and OS configuration (15 Mins)
• Node restart (5 Mins)
• Disk burning test (6-8 Hours)
• Node health verification (15 Mins)
• Node recommission (15 Mins)
20
![Page 21: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/21.jpg)
Cluster Availability
21
• Improve Availability
• Increase Live data node percentage
• Reduce job failures due to bad node
![Page 22: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/22.jpg)
Agenda
• Hadoop @eBay
• Problem Statement
• What is Hadoop Robot
• Q&A
22
![Page 23: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/23.jpg)
What is Hadoop Robot Hadoop Robot is action and remediation center for eBay hadoop clusters:
• End-to-end automated remediation center
• API center for hadoop action and remediation
• Unified Hadoop Admin Console
• Real time maintenance view of Hadoop clusters
• Analytical insights into hardware maintenance data
23
![Page 24: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/24.jpg)
End-to-end Automatic remediation center •Hardware Maintenance
•Alert Detection
•Node Decommission
•Remediation
•Node Recommission
•Remove Failed Disk Volume
•Bad Disk Hot Swap
•Hadoop Daemon Restart
•Hadoop Abnormal Job Termination
•Hadoop Cluster Expansion
•…
24
![Page 25: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/25.jpg)
Robot Architecture
25
![Page 26: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/26.jpg)
Hardware Maintenance Workflow
• Alert detection
• Node decommission
• Vendor remediation (while node is offline)
• Node OS reprovisioning
• Hadoop Installation && OS configuration
• Node restart
• Burn-in test
• Node health verification
• Node recommission
26
![Page 27: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/27.jpg)
Data Flow
27
![Page 28: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/28.jpg)
Alerts Drive Workflow
28
![Page 29: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/29.jpg)
Hardware and System Alerts
•Hardware
29
DISK
SMARTD_CHECK
SECTOR_CHECK
RW_CHECK
MOUNT_CHCEK
SYSLOG_CHECK
NIC
SPEED_CHCEK
DUPLEX_CHECK
FLAPPING_CHECK
DOWN_CHECK
NOT_FOUND_CHECK
MCELOG
CPU_CHECK
MEMORY_CHCEK
CPU_HIGH_TEMPERATURE_CHECK
MEMORY
MEMORY_SIZE_CHECK
![Page 30: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/30.jpg)
Hardware and System Alerts
•System
30
DISK
DISK_SPACE_CHECK
REBOOT
NODE_REBOOT_CHCEK
PROCESS
ORPHAN_PROCESSES_CHECK
![Page 31: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/31.jpg)
Metadata Setup - ansible
31
![Page 32: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/32.jpg)
One button hadoop installation and system configuration
We use ansible playbook to install and
configure various OS and software
packages including Hadoop.
32
Copy the Hadoop Packages, Configuration and Codes from
Source to the Destination Node
Update Symbolic Links to point to the Latest Hadoop Code.
Restart Hadoop Daemon
![Page 33: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/33.jpg)
Bonnie++
•We use bonnie++ to carry out a stress-test of the repaired hardware. This not only
puts a load on the I/O and disk subsystem, but it also can flush out CPU, RAM
and fan/cooling issues.
33
![Page 34: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/34.jpg)
Cluster level status overview
34
![Page 35: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/35.jpg)
Unified Hadoop Admin Console
35
![Page 36: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/36.jpg)
Summit Job
36
![Page 37: Hadoop Robot from eBay at China Hadoop Summit 2015](https://reader034.fdocuments.net/reader034/viewer/2022042510/589d084a1a28ab255c8b615b/html5/thumbnails/37.jpg)
Track Robot Job Activities
37