Big Data with Hadoop Setup on Ubuntu 12.04


Big Data: set up and configure Hadoop on Ubuntu 12.04

Transcript of Big Data with Hadoop Setup on Ubuntu 12.04

Page 1: Big data with hadoop Setup on Ubuntu 12.04

Mandakini Kumari

Big Data With Hadoop Setup

Page 2: Big data with hadoop Setup on Ubuntu 12.04

Agenda
1. Big Data?
2. Limitations of the Existing System
3. Advantages of Hadoop
4. Disadvantages of Hadoop
5. Hadoop Ecosystem & Components
6. Prerequisites for Hadoop 1.x
7. Install Hadoop 1.x

Page 3: Big data with hadoop Setup on Ubuntu 12.04

1.1 Characteristics of Big Data

Page 4: Big data with hadoop Setup on Ubuntu 12.04

1.2 Every 60 seconds on the Internet

Page 5: Big data with hadoop Setup on Ubuntu 12.04

2.1 Limitations of the Existing Data Analytics Architecture

Page 6: Big data with hadoop Setup on Ubuntu 12.04

3.1 Advantages of Hadoop
• Hadoop provides storage and computational capabilities together, whereas in an RDBMS computation happens in the CPU and data must travel over the bus from the hard disk to the CPU.
• Fault-tolerant hardware is expensive, whereas Hadoop is designed to run on cheap commodity hardware.
• Instead of complicated data replication and failure handling, Hadoop automatically handles data replication and node failure.
• HDFS (storage) is optimized for high throughput.
• The large block sizes of HDFS help with large files (GBs to PBs).
• HDFS provides high scalability and availability by means of data replication and fault tolerance.
• Extremely scalable.
• The MapReduce (MR) framework allows parallel work over huge data sets.
• Jobs are scheduled for remote execution on the slave/datanodes, allowing parallel and fast job execution.
• MR deals with the business logic and HDFS with storage, independently of each other.

Page 7: Big data with hadoop Setup on Ubuntu 12.04

3.2 Advantages of Hadoop

Page 8: Big data with hadoop Setup on Ubuntu 12.04

3.3 Advantages of Hadoop

Page 9: Big data with hadoop Setup on Ubuntu 12.04

4.1 Disadvantages of Hadoop
• HDFS is inefficient at handling many small files (a mitigation sketch follows below).
• Hadoop 1.x has a single point of failure at the NN (NameNode).
• Problems arise when a cluster grows beyond about 4,000 nodes, because all metadata is stored in the RAM of the single NN.
• Hadoop 2.x does not have this single point of failure.
• Security is a major concern: Hadoop 1.x does offer a security model, but it is disabled by default because of its high complexity.
• Hadoop 1.x does not offer storage- or network-level encryption, which is a big concern for government-sector application data.
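
The small-file weakness mentioned above is commonly worked around with Hadoop Archives (HAR), which pack many small files into one archive so the NN has far fewer objects to track; a minimal sketch, where /user/hduser/logs and /user/hduser/archived are placeholder paths:

# pack the small files under /user/hduser/logs into a single HAR archive
bin/hadoop archive -archiveName logs.har -p /user/hduser logs /user/hduser/archived
# the packed files remain readable through the har:// scheme
bin/hadoop fs -ls har:///user/hduser/archived/logs.har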

Page 10: Big data with hadoop Setup on Ubuntu 12.04

5.1 HADOOP ECOSYSTEM

Page 11: Big data with hadoop Setup on Ubuntu 12.04

5.2 ADVANTAGES OF HDFS

Page 12: Big data with hadoop Setup on Ubuntu 12.04

5.3 NAMENODE: HADOOP COMPONENT

• It is the master, running on high-end hardware.
• It stores all metadata in main memory, i.e. RAM.
• Types of metadata: the list of files, the blocks for each file, and the DN (DataNode) for each block (the fsck sketch below shows how to list these).
• File attributes: access time, replication factor.
• The JobTracker reports to the NN after a job completes.
• Receives a heartbeat from each DN.
• Transaction log: records file creations, deletions, etc.
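
The metadata described above can be inspected from the command line once the cluster set up in the later slides is running; a minimal sketch, assuming the Hadoop install directory is the working directory:

# walks the namespace and prints every file, its blocks, and the DNs holding each block
bin/hadoop fsck / -files -blocks -locations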

Page 13: Big data with hadoop Setup on Ubuntu 12.04

5.4 DATANODE: HADOOP COMPONENT
• A slave running on commodity hardware.
• File writes to a DN are preferred as a sequential process; writing in parallel would cause issues in data replication.
• File reads from DNs happen in parallel.
• Provides the actual storage.
• Responsible for serving read/write requests from clients.
• Heartbeat: the NN receives a heartbeat from each DN every few seconds; if heartbeats stop, the DN's data is re-replicated to another DataNode (see the dfsadmin sketch below).
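
The heartbeat-driven liveness described above can be checked with the HDFS admin report; a small sketch, assuming a running cluster and the Hadoop install directory as the working directory:

# prints configured/used capacity and the list of live and dead DataNodes
bin/hadoop dfsadmin -report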

Page 14: Big data with hadoop Setup on Ubuntu 12.04

5.5 SECONDARY NAMENODE: HADOOP COMPONENT

• Not a hot standby for the NameNode (NN).
• If the NN fails, only read operations can be performed; no blocks are replicated or deleted.
• If the NN fails, the system goes into safe mode.
• The Secondary NameNode connects to the NN every hour and takes a backup of the NN metadata.
• The saved metadata can be used to rebuild a failed NameNode (a recovery sketch follows below).
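
Rebuilding a failed NN from the Secondary NameNode's saved metadata is, roughly, the sequence below; a hedged sketch for Hadoop 1.x, in which fs.checkpoint.dir is assumed to point at a copy of the SNN's checkpoint directory:

# copy the SNN checkpoint directory onto the new NN host, point fs.checkpoint.dir at it,
# start with an empty dfs.name.dir, then import the checkpoint into the new namespace
bin/hadoop namenode -importCheckpoint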

Page 15: Big data with hadoop Setup on Ubuntu 12.04

5.6 MAPREDUCE (BUSINESS LOGIC) ENGINE
• The TaskTracker (TT) is the slave.
• A TT acts like a worker that executes tasks.
• The JobTracker (master) acts like a manager that splits a job into tasks (see the job-client sketch below).
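
The manager/worker split between the JobTracker and the TaskTrackers can be watched from the job client; a minimal sketch (the job ID shown is a placeholder):

bin/hadoop job -list                              # jobs currently tracked by the JobTracker
bin/hadoop job -status job_201401010000_0001      # map/reduce completion of that job's tasks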

Page 16: Big data with hadoop Setup on Ubuntu 12.04

5.7 HDFS: HADOOP COMPONENT

Page 17: Big data with hadoop Setup on Ubuntu 12.04

5.8 FAULT TOLERANCE: REPLICATION AND RACK AWARENESS

Page 18: Big data with hadoop Setup on Ubuntu 12.04

6. Hadoop Installation: Prerequisites
1. Ubuntu Linux 12.04.3 LTS

2. Installing Java v1.5+

3. Adding dedicated Hadoop system user.

4. Configuring SSH access.

5. Disabling IPv6.

For PuTTY users: sudo apt-get install openssh-server
Run command: sudo apt-get update

Page 19: Big data with hadoop Setup on Ubuntu 12.04

6.1 Install Java v1.5+

6.1.1) Download the latest Oracle Java Linux version:
wget https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz
OR, to avoid passing a username and password, use:
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz

6.1.2) Copy the Java archive into the /usr/local/java directory:
sudo cp -r jdk-7u45-linux-x64.tar.gz /usr/local/java
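
On a fresh Ubuntu 12.04 install, /usr/local/java usually does not exist yet, so the copy above would fail; creating it first is a small extra step not shown on the slide:

sudo mkdir -p /usr/local/java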

6.1.3) Change the directory to /usr/local/java: cd /usr/local/java

6.1.4) Unpack the Java binaries in /usr/local/java:
sudo tar xvzf jdk-7u45-linux-x64.tar.gz

6.1.5) Edit the system PATH file /etc/profile:
sudo nano /etc/profile or sudo gedit /etc/profile

Page 20: Big data with hadoop Setup on Ubuntu 12.04

6.1 Install Java v1.5+

6.1.6) At the end of the /etc/profile file, add the following system variables to your system path:
JAVA_HOME=/usr/local/java/jdk1.7.0_45
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH

6.1.7) Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located:
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/java/jdk1.7.0_45/bin/javac" 1
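
The slide registers only javac; the java binary can be registered and selected the same way (paths assume the JDK unpacked in step 6.1.4):

sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java/jdk1.7.0_45/bin/java" 1
sudo update-alternatives --set java /usr/local/java/jdk1.7.0_45/bin/java
sudo update-alternatives --set javac /usr/local/java/jdk1.7.0_45/bin/javac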

6.1.8) Reload system wide PATH /etc/profile: . /etc/profile

6.1.9) Test Java: java -version

Page 21: Big data with hadoop Setup on Ubuntu 12.04

6.2 Add dedicated Hadoop system user

6.2.1) Adding group: sudo addgroup Hadoop

6.2.2) Creating a user and adding the user to the group:
sudo adduser --ingroup Hadoop hduser
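
Several later steps run sudo commands while logged in as hduser; if that is intended, the new user can be added to Ubuntu's sudo group (an optional step not on the slide):

sudo adduser hduser sudo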

Page 22: Big data with hadoop Setup on Ubuntu 12.04

6.3 Generate an SSH key for the hduser user

6.3.1) Log in as hduser (switch users with sudo).

6.3.2) Run this key generation command: ssh-keygen -t rsa -P ""

6.3.3) It will ask for the file name in which to save the key; just press Enter so that it generates the key under '/home/hduser/.ssh'.

6.3.4) Enable SSH access to your local machine with this newly created key:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

6.3.5) Test the SSH setup by connecting to your local machine as the hduser user:
ssh hduser@localhost
This will add localhost permanently to the list of known hosts.
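
If the login still prompts for a password, overly open permissions on the key files are the usual cause; tightening them follows standard OpenSSH requirements (not part of the original slides):

chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys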

Page 23: Big data with hadoop Setup on Ubuntu 12.04

6.4 Disabling IPv6

6.4.1) We need to disable IPv6 because Ubuntu uses the 0.0.0.0 address for various Hadoop configurations. Run the command: sudo gedit /etc/sysctl.conf

Add the following lines to the end of the file and reboot the machine to apply the configuration correctly.

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
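
The change can also be applied and checked without waiting for the reboot; a value of 1 from the second command means IPv6 is disabled:

sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6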

Page 24: Big data with hadoop Setup on Ubuntu 12.04

Install Hadoop 1.2
Ubuntu Linux 12.04.3 LTS
Hadoop 1.2.1, released August 2013

Download and extract Hadoop:

Command: wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz

Command: tar -xvf hadoop-1.2.0.tar.gz
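
Later slides refer to paths such as hadoop/conf/...; assuming the tarball extracted into a hadoop-1.2.0 directory, a symlink (or rename) keeps those paths valid, and handing the tree to hduser avoids permission problems:

ln -s hadoop-1.2.0 hadoop                   # or: mv hadoop-1.2.0 hadoop
sudo chown -R hduser:Hadoop hadoop-1.2.0    # user and group as created in step 6.2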

Page 25: Big data with hadoop Setup on Ubuntu 12.04

Edit core-site.xml

Command: sudo gedit hadoop/conf/core-site.xml

Add the following property inside the <configuration> element:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:8020</value>
</property>

Page 26: Big data with hadoop Setup on Ubuntu 12.04

Edit hdfs-site.xml

Command: sudo gedit hadoop/conf/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

Page 27: Big data with hadoop Setup on Ubuntu 12.04

Edit mapred-site.xml

Command: sudo gedit hadoop/conf/mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:8021</value>
</property>

Page 28: Big data with hadoop Setup on Ubuntu 12.04

Get your IP address

Command: ifconfig

Command: sudo gedit /etc/hosts
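
The slide does not show what to add to /etc/hosts; for a single-node setup an entry like the following is typical, where 192.168.1.100 and hadoop-box are placeholders for the address reported by ifconfig and the machine's hostname:

127.0.0.1       localhost
192.168.1.100   hadoop-box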

Page 29: Big data with hadoop Setup on Ubuntu 12.04

CREATE AN SSH KEY
• Command: ssh-keygen -t rsa -P ""
• Move the key to the authorized keys:
• Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Page 30: Big data with hadoop Setup on Ubuntu 12.04

Configuration

•Reboot the system

• Add JAVA_HOME to the hadoop-env.sh file:

Command: sudo gedit hadoop/conf/hadoop-env.sh

Type: export JAVA_HOME=/usr/local/java/jdk1.7.0_45 (the JDK installed in step 6.1)

Page 31: Big data with hadoop Setup on Ubuntu 12.04

JAVA_HOME

Page 32: Big data with hadoop Setup on Ubuntu 12.04

Hadoop Commands
Format the NameNode
Command: bin/hadoop namenode -format
Start the NameNode and DataNode
Command: bin/start-dfs.sh
Start the TaskTracker and JobTracker
Command: bin/start-mapred.sh
To check if Hadoop started correctly
Command: jps
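
On a healthy single-node cluster, jps should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (plus Jps itself). A quick smoke test with the example jar shipped in the tarball is sketched below; the jar name assumes the hadoop-1.2.0 download from page 24, and the HDFS paths are placeholders:

bin/hadoop fs -mkdir /input
bin/hadoop fs -put conf/*.xml /input
bin/hadoop jar hadoop-examples-1.2.0.jar wordcount /input /output
bin/hadoop fs -cat /output/part-* | head
# stop the daemons when finished
bin/stop-mapred.sh
bin/stop-dfs.sh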

Page 33: Big data with hadoop Setup on Ubuntu 12.04

Thank you

CONTACT ME @
http://in.linkedin.com/pub/mandakini-kumari/18/93/935
http://www.slideshare.net/mandakinikumari

References:
http://bigdatahandler.com/2013/10/24/what-is-apache-hadoop/
edureka.in