Hadoop 2.x HDFS Cluster Installation (VirtualBox)


This is a straightforward tutorial for those who are going to use HDFS in an academic environment on their notebooks or PCs.

Transcript of Hadoop 2.x HDFS Cluster Installation (VirtualBox)

Distributed Data Processing Workshop

Pardis Campus, Shahid Beheshti University

Faculty of Computer Science and Engineering

Course: Distributed Databases

Instructor: Dr. Hadi Tabatabaei

Presented by: Abolfazl Sedighi, Azar 1393 (December 2014)

2

Apache Hadoop 2.x Cluster Installation

Amir Sedighi (@amirsedighi)

http://hexican.com

Dec 2014

3

References

● http://hadoop.apache.org/docs/r2.2.0/

● http://www.vasanthivuppuluri.com/hadoop/installing-hadoop-2-5-1-on-64-bit-ubuntu-14-01/

● https://sites.google.com/site/hadoopandhive/home

4

Topics

● Assumptions

● First Node

– Installing Java

– Downloading and Extracting Hadoop

– Hadoop and Java Env Variables

– Disabling IPv6

– Configuring Hadoop

● Cloning

● HDFS

– Starting HDFS

● HDFS Health

● FS Commands

● Reclaiming Space

● Reducing Replication Factor

5

Assumptions

● You already know about Linux.

– http://www.slideshare.net/AmirSedighi/distrinuted-data-processing-workshop-sbu

6

Installing Java

● $ sudo apt-get install default-jdk
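To confirm the JDK is installed and on the PATH:

● $ java -version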

7

Downloading and Extracting

● http://hadoop.apache.org/releases.html

● $ tar -zxvf hadoop-2.2.0.tar.gz
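If you prefer the command line, the 2.2.0 archive from the releases page can be fetched directly before extracting; old releases are kept on the Apache archive (the exact mirror URL may vary):

● $ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz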

8

Hadoop and Java Env Variables

● Append the following definitions to /etc/profile or ~/.bashrc

export HADOOP_PREFIX="/home/amir/hadoop-2.2.0"

export HADOOP_HOME=$HADOOP_PREFIX

export HADOOP_COMMON_HOME=$HADOOP_PREFIX

export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

export HADOOP_HDFS_HOME=$HADOOP_PREFIX

export HADOOP_MAPRED_HOME=$HADOOP_PREFIX

export HADOOP_YARN_HOME=$HADOOP_PREFIX

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

export JAVA_HOME=/usr/java/jdk1.7.0_55

export PATH=$PATH:$JAVA_HOME/bin:/home/amir/hadoop-2.2.0/bin:/home/amir/hadoop-2.2.0/sbin
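Note: the HADOOP_PREFIX and JAVA_HOME paths above are examples for this particular machine; if you installed default-jdk via apt-get as on the earlier slide, JAVA_HOME is typically /usr/lib/jvm/default-java instead. Reload the profile and verify the setup before continuing:

● $ source ~/.bashrc

● $ hadoop version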

9

Disabling IPv6

● $ sudo nano /etc/sysctl.conf

# Disable IPv6

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1
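Hadoop is not tested against IPv6, and disabling it keeps the daemons from binding to IPv6 addresses. To apply the settings without a reboot and verify (1 means disabled):

● $ sudo sysctl -p

● $ cat /proc/sys/net/ipv6/conf/all/disable_ipv6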

10

Hadoop Configuration

● You will need to create or modify the following files inside hadoop/etc/hadoop:

– slaves

– core-site.xml

– yarn-site.xml

– hdfs-site.xml

– hadoop-env.sh

11

slaves

● List all DataNodes in the slaves file, for example:

slave1

slave2

slave3

12

slaves

Create the slaves file in the hadoop/etc/hadoop folder:

u01

u02

u03

u04

u05

u06

...

13

etc/hosts and hadoop/etc/hadoop/slaves
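The screenshot for this slide is not reproduced in the transcript. As a sketch, assuming a VirtualBox host-only network on 192.168.56.0/24 (your addresses will differ), /etc/hosts on every node maps each hostname used in slaves:

192.168.56.101 u01

192.168.56.102 u02

192.168.56.103 u03

...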

14

core-site.xml

● Edit core-site.xml and apply the following:

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://u01/</value>

<description>NameNode URI</description>

</property>

</configuration>

15

core-site.xml

16

yarn-site.xml

<configuration>

<property>

<name>yarn.resourcemanager.hostname</name>

<value>u01</value>

<description>The hostname of the RM.</description>

</property>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

<property>

<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

</configuration>
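Note: mapreduce.framework.name is conventionally set in mapred-site.xml rather than yarn-site.xml. A minimal mapred-site.xml carrying the same setting:

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>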

17

yarn-site.xml

18

hdfs-site.xml

<configuration>

<property>

<name>dfs.datanode.data.dir</name>

<value>file:///home/amir/hadoop-2.2.0/hdfs/datanode</value>

</property>

<property>

<name>dfs.namenode.name.dir</name>

<value>file:///home/amir/hadoop-2.2.0/hdfs/namenode</value>

</property>

</configuration>
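The NameNode and DataNode directories referenced above are not created by Hadoop in every case, so it is safest to create them on the first node before formatting (clones will then carry them along):

● $ mkdir -p /home/amir/hadoop-2.2.0/hdfs/namenode

● $ mkdir -p /home/amir/hadoop-2.2.0/hdfs/datanode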

19

hdfs-site.xml

20

hadoop-env.sh

● Add the following:

– export JAVA_HOME=/usr/java/jdk1.7.0_55

21

Reboot

● $ sudo reboot

22

Cloning

● Extend the cluster by cloning.

– NOTE: Find the instructions here:

● http://www.slideshare.net/AmirSedighi/distrinuted-data-processing-workshop-sbu
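One requirement to set up alongside cloning: start-dfs.sh launches the DataNodes over SSH, so the first node needs passwordless SSH to every host in slaves (including itself). Assuming the same user account (here amir) exists on all nodes:

● $ ssh-keygen -t rsa -P ""

● $ ssh-copy-id amir@u01

● $ ssh-copy-id amir@u02

● ... (repeat for each remaining slave)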

23

HDFS

● The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

● It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.

● HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

● HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

● HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

● HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project.

24

HDFS Architecture

25

DataNodes

26

start-dfs.sh
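The screenshot is omitted in the transcript. On a fresh cluster the NameNode must be formatted once before the first start; after that, start-dfs.sh (already on the PATH from the earlier exports) brings up the NameNode and every DataNode listed in slaves:

● $ hdfs namenode -format

● $ start-dfs.sh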

27

HDFS Health

● $ jps

– NameNode

– DataNode

● Check log files

● Web UI

– http://u01:50070

28

HDFS Health

29

30

HDFS Health, Live Nodes

31

Hadoop FS Commands

● cat

● chmod

● chown

● copyFromLocal

● copyToLocal

● cp

● du

● expunge

● get

● ls

● mkdir

● put

● rm

● tail
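These follow the familiar Unix semantics and are invoked as hadoop fs -<command>. A short round trip as an illustration (the paths are arbitrary):

● $ hadoop fs -mkdir /data

● $ hadoop fs -put myfile.txt /data/

● $ hadoop fs -ls /data

● $ hadoop fs -cat /data/myfile.txt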

32

HDFS Commands

33

Space Reclamation

● Delete Files (see the trash note below)

– $ hadoop fs -rm /filename

– $ hadoop fs -expunge

● Decrease Replication Factor
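A note on -rm and -expunge: out of the box HDFS deletes files immediately, and -expunge has nothing to empty; files only pass through the trash when it is enabled in core-site.xml with a retention interval in minutes, e.g.:

<property>

<name>fs.trash.interval</name>

<value>1440</value>

</property>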

34

How to change the replication factor of existing files in HDFS

● To set the replication factor of an individual file to 4:

– $ hdfs dfs -setrep -w 4 /path/to/file

● You can also do this recursively. To change the replication factor of the entire HDFS namespace to 1:

– $ hdfs dfs -setrep -R -w 1 /
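To confirm the new replication factor took effect, fsck reports per-block replication (the path is arbitrary):

● $ hdfs fsck /path/to/file -files -blocks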

35

Questions?