Installing Hadoop / Spark from scratch


Transcript of Installing Hadoop / Spark from scratch

Page 1: Installing Hadoop / Spark from scratch


© 2016 IBM Corporation

Big Data Developer meetup

Installing Apache Hadoop and Spark from scratch

Ljubljana, June 2016

Page 2: Installing Hadoop / Spark from scratch


Agenda

Why do you need Hadoop

What do you need before you install Apache Hadoop

Hadoop distributions

Hadoop components you need to know about

About Spark

Installation process walk-through

Adding cluster nodes

Ways to automate

Zero-install options

Page 3: Installing Hadoop / Spark from scratch


Why do you need Apache Hadoop

Free and open source (Apache License 2.0)

Scalable

General purpose MPP engine

Distributed storage

Packed with tools

Backend for your Big Data project

Page 4: Installing Hadoop / Spark from scratch


What do you need before you install Hadoop and Spark

A server (or servers)

An installed OS (for IBM IOP: RHEL 6.5-7 or SUSE 11 SP3)

A Hadoop distribution (more later)

Or avoid all that trouble by using a VM / Docker image if you are just playing around (more later)
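Before installing anything, it is worth a quick sanity check of each server; a minimal sketch using standard Linux commands (exact minimums depend on your distribution and workload):

cat /etc/redhat-release   # OS level; for IBM IOP this should be RHEL 6.5-7
uname -r                  # kernel version
free -g                   # memory in GB
df -h                     # disk space on the intended HDFS data mounts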

Page 5: Installing Hadoop / Spark from scratch


Apache Hadoop Distributions

Hortonworks HDP

Cloudera CDH

IBM IOP (today’s focus)

A number of others

Distributions are similar yet different, much like Linux distributions

Some are part of the ODP (Open Data Platform) initiative, some are not

Page 6: Installing Hadoop / Spark from scratch


Hadoop components you need to know about

YARN – resource manager

HDFS – distributed file system

MapReduce – distributed batch processing framework

Ambari – cluster provisioning, management and monitoring

ZooKeeper – distributed coordination service

Hive – SQL on Hadoop

Pig – high-level dataflow scripting

Sqoop – bulk transfer between Hadoop and relational databases
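Most of these components ship a command-line entry point, which is handy for poking at a cluster once it is up. A few real examples (assuming the respective clients are installed and the services are running):

hdfs dfs -ls /                # browse the HDFS root directory
yarn node -list               # list the YARN NodeManagers
hive -e 'show databases;'     # run a HiveQL statement non-interactively
sqoop version                 # confirm Sqoop is on the PATH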

Page 7: Installing Hadoop / Spark from scratch


Apache Spark is a fast, general purpose, easy-to-use cluster computing system for large-scale data processing

– Fast

• Leverages aggressively cached in-memory distributed computing and dedicated App Executor processes even when no jobs are running

• Faster than MapReduce

– General purpose

• Covers a wide range of workloads

• Provides SQL, streaming and complex analytics

– Flexible and easier to use than MapReduce

• Spark is written in Scala, an object-oriented, functional programming language

• Scala, Python and Java APIs

• Scala and Python interactive shells

• Runs on Hadoop, Mesos, standalone or in the cloud

[Chart: logistic regression running time in Hadoop vs. Spark, from http://spark.apache.org]
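The quickest way to try Spark hands-on is from its install directory, using the interactive shells and the bundled examples (paths relative to the Spark home, which varies by distribution):

./bin/spark-shell               # Scala interactive shell
./bin/pyspark                   # Python interactive shell
./bin/run-example SparkPi 10    # run the bundled SparkPi example job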

Page 9: Installing Hadoop / Spark from scratch


Prereqs

Install the OS

Set up a yum repository

Install prerequisites

• yum install nc

(see your distribution's documentation for the full list of preparation steps)

Make sure your hostname is in /etc/hosts

Tweak some settings (disable Transparent Huge Pages)

• echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

Generate an ssh key and set up passwordless ssh (full sketch below)

• ssh-keygen

• chmod 700 ~/.ssh

• Check with ssh localhost
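A minimal end-to-end sketch of the passwordless ssh setup, assuming it runs as the user Ambari will connect as (root here):

ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa         # key pair with no passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  # authorize the key locally
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
ssh localhost hostname                           # should succeed without a password prompt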

Page 10: Installing Hadoop / Spark from scratch


Prereqs (cont.)

Disable IPv6

Configure ulimit

• /etc/security/limits.conf

Disable SELinux

Set up NTP on all servers
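One way these steps might look scripted on RHEL 6; a hedged sketch, with illustrative limits values rather than mandated ones:

# Disable IPv6
echo 'net.ipv6.conf.all.disable_ipv6 = 1' >> /etc/sysctl.conf
sysctl -p

# Raise file and process limits for the Hadoop service users (values are examples)
cat >> /etc/security/limits.conf <<'EOF'
*  soft  nofile  65536
*  hard  nofile  65536
*  soft  nproc   65536
*  hard  nproc   65536
EOF

# Disable SELinux now and across reboots
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Keep all cluster clocks in sync
yum -y install ntp
chkconfig ntpd on    # RHEL 6; on RHEL 7 use systemctl enable ntpd
service ntpd start   # RHEL 6; on RHEL 7 use systemctl start ntpd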

Page 11: Installing Hadoop / Spark from scratch


First step – install Ambari

Install the repository

• yum install iop-4.1.0.0-1.<version>.<platform>.rpm

Install Ambari

• yum install ambari-server

Set up the Ambari server

• sudo ambari-server setup

Start the Ambari server

• ambari-server start

Go to the Ambari interface at http://<your-ip>:8080

• Default user/pass = admin/admin

Launch the installation wizard
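Put together, the whole install fits in a few commands; ambari-server status is a useful check if the web UI does not come up (the rpm name keeps the placeholders from the slide):

yum install iop-4.1.0.0-1.<version>.<platform>.rpm   # registers the IOP repository
yum install ambari-server
sudo ambari-server setup    # interactive: JDK, database, run-as user
ambari-server start
ambari-server status        # confirm the server is running
# then open http://<your-ip>:8080 and log in with admin/admin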

Page 12: Installing Hadoop / Spark from scratch


Ambari installation

Next-next-next

Provide cluster name

Provide private ssh key

Page 13: Installing Hadoop / Spark from scratch


Choose services

Page 14: Installing Hadoop / Spark from scratch


Assign masters

Page 15: Installing Hadoop / Spark from scratch


Assign slaves and clients

Page 16: Installing Hadoop / Spark from scratch


Customize services

Here you would have to set up proper DB server connections (e.g. for the Hive metastore) in your production environment

Page 17: Installing Hadoop / Spark from scratch


Review and deploy

Page 18: Installing Hadoop / Spark from scratch


Validate
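Beyond the wizard's built-in service checks, a few shell commands make a quick independent smoke test (run on a node with the HDFS and YARN clients installed):

sudo -u hdfs hdfs dfsadmin -report   # DataNodes registered, capacity visible
hdfs dfs -mkdir -p /tmp/smoke
hdfs dfs -put /etc/hosts /tmp/smoke/ # write a small file into HDFS
hdfs dfs -cat /tmp/smoke/hosts       # read it back
yarn application -list               # ResourceManager responds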

Page 19: Installing Hadoop / Spark from scratch


Adding a new cluster node

Create a new server, with the same prereqs

Make sure that passwordless ssh works from the Ambari server to the node

• ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostname01

Then add the node through the Ambari UI, and done (verification sketch below)
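The ssh part, spelled out as run on the Ambari server (hostname01 stands in for the new node, as on the slide):

ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostname01   # push the public key to the node
ssh root@hostname01 hostname                       # must succeed with no password prompt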

Page 20: Installing Hadoop / Spark from scratch


Extra steps

Install Anaconda / Jupyter for data analysis

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'" ./bin/pyspark
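If Anaconda is installed under a path like /opt/anaconda (an assumed location, adjust to your install), the executors can be pointed at the same interpreter so the driver and workers agree on the Python version:

export PYSPARK_PYTHON=/opt/anaconda/bin/python   # Python used by the executors (path is an assumption)
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'"
./bin/pyspark    # then browse to http://<server-ip>:12000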

Page 21: Installing Hadoop / Spark from scratch


Ways to automate - Ansible

Simple automation tool

Infrastructure as code

Agent-less

Easy to learn

Check for examples online: search for “ansible hadoop playbook”
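As a taste of the approach, a minimal hypothetical playbook that applies two of the prereqs from earlier slides to every node; the inventory group and file names are made up for this sketch:

cat > prereqs.yml <<'EOF'
# Example playbook: install nc and disable SELinux on all cluster nodes
- hosts: hadoop_nodes        # inventory group name invented for this sketch
  become: yes
  tasks:
    - name: Install prerequisite packages
      yum:
        name: nc
        state: present
    - name: Disable SELinux
      selinux:
        state: disabled
EOF

ansible-playbook -i inventory prereqs.yml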

Page 22: Installing Hadoop / Spark from scratch


Zero-installation options

• BigInsights Quick Start Edition (QSE)

• BigInsights on Cloud (paid)

Page 23: Installing Hadoop / Spark from scratch


WRAP-UP