Hadoop Cluster Installation Guide


HADOOP INSTALLATION GUIDE

Contents
1. Preparation
2. Installing Hadoop
3. Configuring the Hadoop cluster
4. Formatting HDFS
5. Starting the system
6. Verifying the installation

1. Preparation

1.1. Create the hadoop user
Run the following commands on every server (master and slaves):

useradd hadoop
passwd hadoop

Then add the following line to /etc/sudoers so the hadoop user can run any command with root privileges:

hadoop ALL = NOPASSWD: ALL

1.2. Install Java 1.7 (handled by the System Administration team)
If Java is not yet installed, see:
http://timarcher.com/node/59
http://www.roseindia.net/linux/tutorial/installingjdk5onlinux.shtml

1.3. Configure SSH
Check whether SSH is already installed on the machine:
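One way to perform this check is a short read-only script like the following (a minimal sketch; it modifies nothing, and the sshd path is an assumption that varies by distribution):

```shell
#!/bin/sh
# Read-only check: does this machine have an OpenSSH client and server?
# Nothing is installed or modified here.
ssh_client=missing
command -v ssh >/dev/null 2>&1 && ssh_client=found

sshd_server=missing
# /usr/sbin/sshd is a common but not universal location (assumption).
if command -v sshd >/dev/null 2>&1 || [ -x /usr/sbin/sshd ]; then
    sshd_server=found
fi

echo "ssh client: $ssh_client"
echo "sshd server: $sshd_server"
```

If either component is reported missing, install OpenSSH before continuing.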

If SSH is not yet installed, install the OpenSSH package following the instructions at: http://www.topology.org/linux/openssh.html

2. Installing Hadoop
The entire installation is stored under /u01/hadoop. All installed files live in the hadoop_installation directory, which sits in the home directory of /u01/hadoop. (Comment by Nguyen Van Chung: do not install into the home directory.) The layout of /u01/hadoop/hadoop_installation is as follows:

On the master:
/u01/hadoop/hadoop_installation
/u01/hadoop/hadoop_installation/installation
/u01/hadoop/hadoop_installation/data
/u01/hadoop/data/hadoop_installation/name_dir
/u01/hadoop/data/hadoop_installation/mapred

On each slave:
/u01/hadoop/hadoop_installation
/u01/hadoop/hadoop_installation/installation
/u02/hadoop/hadoop_installation/data
/u03/hadoop/hadoop_installation/data
/u04/hadoop/hadoop_installation/data
/u01/hadoop/data/hadoop_installation/name_dir
/u02/hadoop/data/hadoop_installation/mapred
/u03/hadoop/data/hadoop_installation/mapred
/u04/hadoop/data/hadoop_installation/mapred

The installation directory holds the Hadoop installation itself.
The data directory holds data while Hadoop is running.
The name_dir directory is the storage directory for HDFS. If the machine acts as a DataNode, this is where the DataNode stores its data blocks; if the machine acts as the NameNode or Secondary NameNode, this is where the metadata is stored. (Comment by Nguyen Van Chung: do not use the same path for data_dir and name_dir; move the data directory onto the volumes that hold data.)
The mapred directory holds data produced while running MapReduce, for example the intermediate results of map tasks.

Download and copy hadoop-1.2.1.tar.gz into the home directory of the hadoop user. (Comment by Nguyen Van Chung: use version 1.2.1, the same as the other build.) Extract hadoop-1.2.1.tar.gz into hadoop_installation/installation.

Export the HADOOP_HOME and PATH environment variables (or edit ~/.bashrc directly, as in section 3). (Comment by Nguyen Van Chung: edit the .bashrc file in the user's home directory; see below.)

The HADOOP_HOME variable keeps track of the path to the Hadoop installation directory and helps Hadoop determine its CLASSPATH; adding $HADOOP_HOME/bin to PATH lets us run the commands and control scripts in $HADOOP_HOME/bin, for example the hadoop command or the start-all.sh script, without typing their absolute paths. From here on, $HADOOP_HOME refers to the path of the Hadoop installation directory. (CLASSPATH is a parameter, set either on the command line or as an environment variable, that the JVM uses to locate user-defined classes and packages.)

Verify that Hadoop was installed successfully:

Hadoop is installed successfully if it prints the version information of the Hadoop build in use.

3. Configuring the Hadoop cluster

Edit the configuration files as follows. Add the following to ~/.bashrc:

# Set Hadoop-related environment variables
export HADOOP_PREFIX=/u01/hadoop/hadoop_installation/installation/hadoop-1.2.1
export HADOOP_PID_DIR=/u01/hadoop/hadoop_installation/installation/hadoop-1.2.1/pid
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$JAVA_HOME/bin:$PATH:$HADOOP_PREFIX/bin
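Putting the pieces together, the ~/.bashrc additions above can be sketched as one runnable fragment (the paths are the ones this guide uses; adjust JAVA_HOME to your machine):

```shell
#!/bin/sh
# Hadoop-related environment variables, as added to ~/.bashrc in this guide.
export HADOOP_PREFIX=/u01/hadoop/hadoop_installation/installation/hadoop-1.2.1
export HADOOP_HOME="$HADOOP_PREFIX"
export HADOOP_PID_DIR="$HADOOP_PREFIX/pid"
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64

# Put the JDK and the Hadoop control scripts on PATH so that
# `hadoop`, `start-all.sh`, etc. work without absolute paths.
export PATH="$JAVA_HOME/bin:$PATH:$HADOOP_PREFIX/bin"

echo "HADOOP_HOME=$HADOOP_HOME"
```

Note that Hadoop 1.x may print a deprecation warning when HADOOP_HOME is set; the guide uses $HADOOP_HOME as shorthand for the installation directory regardless.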

File $HADOOP_HOME/conf/hdfs-site.xml: set the following properties.

dfs.replication = 1
  Default block replication. The actual number of replicas can be specified when the file is created; the default is used if replication is not specified at create time. (Comment by Nguyen Van Chung: replication = 1 means no backup copies; discuss this parameter further with Hai.)

dfs.name.dir = /u01/hadoop/data/hadoop_installation/name_dir
  Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories, the name table is replicated in all of the directories, for redundancy.

dfs.data.dir = /u01/hadoop/data/hadoop_installation/data

dfs.namenode.handler.count = 3

dfs.http.address = 0.0.0.0:9070

dfs.secondary.http.address = 0.0.0.0:9090

dfs.datanode.address = 0.0.0.0:9010

dfs.datanode.http.address = 0.0.0.0:9075

dfs.datanode.https.address = 0.0.0.0:9475

dfs.datanode.ipc.address = 0.0.0.0:9020

dfs.https.address = 0.0.0.0:9470
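As a sketch, a few of the properties above can be written into an hdfs-site.xml with a shell here-document. The /tmp output path is illustrative (the real file belongs in $HADOOP_HOME/conf), and only a subset of the properties is shown:

```shell
#!/bin/sh
# Write a minimal hdfs-site.xml matching the values in this guide.
# $HADOOP_HOME/conf is the real target; /tmp is used here for illustration.
conf_dir=${1:-/tmp}
cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/u01/hadoop/data/hadoop_installation/name_dir</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/u01/hadoop/data/hadoop_installation/data</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>0.0.0.0:9070</value>
  </property>
</configuration>
EOF
echo "wrote $conf_dir/hdfs-site.xml"
```

The remaining properties from the list above go in as additional <property> elements in the same way.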

File $HADOOP_HOME/conf/mapred-site.xml: set the following properties.

mapred.job.tracker = master:9311
  The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.

mapred.local.dir = /u01/hadoop-0.20.203.0/tempdir

mapred.map.child.java.opts = -Xmx512M -Djava.io.tmpdir=/u01/hadoop-0.20.203.0/tempdir
  Larger heap size for the child JVMs of maps.

mapred.reduce.child.java.opts = -Xmx512M -Djava.io.tmpdir=/u01/hadoop-0.20.203.0/tempdir
  Larger heap size for the child JVMs of reduces.

mapred.job.tracker.http.address = 0.0.0.0:9030
  The address and port of the job tracker's HTTP (web UI) server.

mapred.task.tracker.http.address = 0.0.0.0:9060
  The address and port of the task tracker's HTTP server.

File $HADOOP_HOME/conf/core-site.xml: set the following properties.

fs.default.name = hdfs://master:54310
  The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class; the URI's authority is used to determine the host, port, etc. for a filesystem.

fs.inmemory.size.mb = 100

io.sort.factor = 50

io.sort.mb = 100

Edit /etc/hosts on every server, adding the following lines:

10.30.136.4 slave4
10.30.136.5 master
10.30.136.6 slave6
10.30.136.7 slave7
10.30.136.8 slave8
10.30.136.9 slave9
10.30.136.10 slave10
10.30.136.11 slave11

Edit $HADOOP_PREFIX/conf/slaves on every server (Comment by Nguyen Van Chung: edit the masters file as well), adding the following lines:

slave4
slave6
slave7
slave8
slave9
slave10
slave11

Configure passwordless SSH for the hadoop user. Generate a public/private key pair:

ssh-keygen -t rsa -f ~/.ssh/id_rsa
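The host mappings and the slaves list above can be generated as below (written under /tmp here for illustration; on the real cluster the contents go into /etc/hosts and $HADOOP_PREFIX/conf/slaves on every server):

```shell
#!/bin/sh
# Generate the host mappings and the slaves list used by this guide.
# Files are written under /tmp for illustration only.
out=/tmp/hadoop-conf-demo
mkdir -p "$out"

cat > "$out/hosts.fragment" <<'EOF'
10.30.136.4 slave4
10.30.136.5 master
10.30.136.6 slave6
10.30.136.7 slave7
10.30.136.8 slave8
10.30.136.9 slave9
10.30.136.10 slave10
10.30.136.11 slave11
EOF

# Every line in conf/slaves names one DataNode/TaskTracker host.
for s in 4 6 7 8 9 10 11; do
    echo "slave$s"
done > "$out/slaves"

wc -l "$out/slaves"
```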

Note that ssh-keygen may prompt for a passphrase; in that case, enter an empty passphrase. Then append the public key to ~/.ssh/authorized_keys. Make sure the hadoop user (the owner) has read/write permission on the ~/.ssh directory and the ~/.ssh/authorized_keys file.

Note: since every slave has the same hadoop user, we only need to generate the RSA key once and then synchronize the /home/hadoop/.ssh directory to all of the slaves. Push the passwordless SSH configuration to the slave machines with scp:

(Alternatively, use: ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave239)

Verify the passwordless login configuration: from the master, log in to each slave.

If the configuration succeeded, we can log in immediately without entering a password, as above.

(Alternatively, run the following commands:
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh master   (type yes and press Enter; whether or not a password is requested here is not essential, what matters is that master can reach each slave without a password)
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
ssh hduser@slave)

List of configuration files
Hadoop provides a set of configuration files for setting the parameters of a Hadoop cluster; they can be found in $HADOOP_HOME/conf:
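The ownership and permission requirements above can be sketched as follows; a scratch directory stands in for the real /home/hadoop/.ssh, and the key material is a placeholder, not a real key:

```shell
#!/bin/sh
# Sketch of the permission setup for passwordless SSH.
# /tmp/hadoop-ssh-demo stands in for /home/hadoop; the "key" is fake.
demo_home=/tmp/hadoop-ssh-demo
mkdir -p "$demo_home/.ssh"

# Append the (placeholder) public key to authorized_keys,
# as done after running ssh-keygen -t rsa -P "".
echo "ssh-rsa AAAA...placeholder hadoop@master" >> "$demo_home/.ssh/authorized_keys"

# sshd normally rejects keys whose directory or file is
# group/world writable, so tighten the modes:
chmod 700 "$demo_home/.ssh"
chmod 600 "$demo_home/.ssh/authorized_keys"

ls -ld "$demo_home/.ssh" "$demo_home/.ssh/authorized_keys"
```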

File name (format): description

hadoop-env.sh (Bash script): Holds the environment variables used to run the daemons on the Hadoop cluster. (Comment by Nguyen Van Chung: fix the Java path in this file.)

core-site.xml (XML configuration file): Configures the parameters for Hadoop core.

hdfs-site.xml (XML configuration file): Configures the parameters for the HDFS daemons: namenode, datanode, secondary namenode.

mapred-site.xml (XML configuration file): Configures the parameters for the MapReduce daemons: jobtracker, tasktracker.

masters (plain text): Lists the IP addresses (or hostnames, if DNS is set up) of the machines running the secondary namenode.

slaves (plain text): Lists the IP addresses (or hostnames, if DNS is set up) of the machines running a datanode and tasktracker.

hadoop-metrics.properties (Java properties): Configures the metrics, i.e. how Hadoop reports information about the activity of the cluster.

log4j.properties (Java properties): Configures the logging properties for the daemons: namenode, datanode, jobtracker, tasktracker.

Note: these configuration files can be placed in any directory outside $HADOOP_HOME/conf. In that case, when running the control scripts to start the daemons, we must pass the --config option to point at the directory holding the configuration files, for example:

% start-dfs.sh --config <path-to-config-dir>

Details of the main configuration files

hadoop-env.sh
Contains the environment variables used to run the Hadoop daemons. The Hadoop daemons are the Namenode/Datanode, the Jobtracker/TaskTracker, and the Secondary Namenode. Some important parameters:

Name / default value / meaning

JAVA_HOME (no default): Environment variable holding the Java home directory. This is a very important variable.

HADOOP_LOG_DIR (default $HADOOP_HOME/logs): Directory where the log files are written while the daemons are running.

HADOOP_HEAPSIZE (default 1000 MB): Maximum amount of memory allocated to run each daemon.

core-site.xml
Configures the parameters for the Hadoop daemons:

fs.default.name (default file:///): The default filesystem name. This parameter lets us use relative paths; a relative path is combined with the default filesystem name to determine the absolute path. When using HDFS, this parameter should be set to an hdfs:// URI.

hadoop.tmp.dir (default /tmp): By default, the storage directory for all data on the Hadoop cluster.

hdfs-site.xml
Configures the HDFS daemons. Some important parameters:

dfs.replication (default 3): Sets the default replication level for a file created on HDFS. The replication level of a file is the number of copies of each of its blocks on HDFS: if the replication level of file F is n, each block of F is stored as n copies on n different datanodes in the cluster.

dfs.name.dir (default ${hadoop.tmp.dir}/dfs/name): List of directories on the local filesystem where the namenode daemon stores its data. This is where the metadata of the distributed filesystem HDFS is kept.

dfs.data.dir (default ${hadoop.tmp.dir}/dfs/data): List of directories on the local filesystem where the datanode daemons store their data. This is where the blocks of HDFS files are actually stored.

fs.checkpoint.dir (default ${hadoop.tmp.dir}/dfs/namesecondary): List of directories on the local filesystem where the secondary namenode daemons store their data.

mapred-site.xml
Configures the MapReduce daemons. Important parameters:

mapred.job.tracker (default localhost:8021): Hostname (or IP) and port of the JobTracker daemon. On a Hadoop cluster there is exactly one JobTracker daemon, running on some node; its default port is 8021.

mapred.local.dir (default ${hadoop.tmp.dir}/mapred): Local filesystem storage location for the MapReduce processes, namely the JobTracker and the TaskTrackers.

Distribute the installation and configuration to every node in the cluster (Comment by Nguyen Van Chung: install only under the u01 directory, not under /home). Use scp to copy the entire /u01/hadoop/hadoop_installation directory to the corresponding directories on slave01, slave02, and so on.

4. Formatting HDFS

Note: the following command must be run on the NameNode (Comment by Nguyen Van Chung: run it from the $HADOOP_HOME/bin directory). Format the namenode:

hadoop namenode -format

5. Starting the system (Comment by Nguyen Van Chung: use start-all.sh and stop-all.sh)

Starting Hadoop. Note: the following commands must be run from the namenode. Before starting, make sure the firewall is disabled on every node.

Start HDFS (this starts the NameNode, the SecondaryNameNode, and the DataNodes):

start-dfs.sh

Start MapReduce:

start-mapred.sh

(Alternatively, use start-all.sh and stop-all.sh.)

6. Verifying the installation

We can verify that Hadoop is running by checking that the daemons on the cluster are running correctly, for example with the jps command.

Check that the Namenode and the JobTracker are running on the namenode.

Check that a Datanode is running on each datanode.

Check that a TaskTracker is running on each datanode.

Check the state of the whole HDFS with:

hadoop dfsadmin -report

This command lists the DataNodes and their status. Alternatively, access the namenode over HTTP:

http://<namenode>:50070 : web interface of HDFS

http://<jobtracker>:50030 : web interface of the Map/Reduce service

Appendix I: Hadoop configuration parameter tables

Configuration settings are one of the main ingredients of running a job on a Hadoop system. For convenience, Hadoop ships with default configuration files holding the initial settings, each file covering a specific group of settings. For each installation, and for each problem to be solved on the system, users must adjust these settings appropriately. The mechanism for applying configuration to the system is as follows: Hadoop first reads the default settings (from files of the form name-default.xml), then reads the user's settings (from files of the form name-site.xml); any setting present there overrides the default.

Below is the list of the Hadoop configuration files. By default they live in the conf directory. A configuration entry has the following form:

<property>
  <name>...</name>
  <value>...</value>
  <description>...</description>
</property>

Common configuration for the Hadoop system

File core-site.xml
The core-site.xml configuration file (defaults in core-default.xml) holds the settings common to the whole Hadoop system (reference: http://hadoop.apache.org/common/docs/current/core-default.html).

core-site.xml
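The default-then-override mechanism described above can be sketched in shell. The get_prop helper and the flat /tmp files below are illustrative only (the real Hadoop files are XML and are parsed by Hadoop itself, not by grep):

```shell
#!/bin/sh
# Sketch of Hadoop's configuration precedence: a *-default.xml value
# is used unless the same property also appears in *-site.xml.
# get_prop and the /tmp files are illustrative helpers, not Hadoop APIs.
dir=/tmp/hadoop-precedence-demo
mkdir -p "$dir"
printf '%s\n' 'dfs.replication=3' 'dfs.block.size=67108864' > "$dir/hdfs-default.conf"
printf '%s\n' 'dfs.replication=1' > "$dir/hdfs-site.conf"

get_prop() {  # $1 = property name; the site file wins over the default file
    val=$(grep "^$1=" "$dir/hdfs-site.conf" | cut -d= -f2)
    [ -n "$val" ] || val=$(grep "^$1=" "$dir/hdfs-default.conf" | cut -d= -f2)
    echo "$val"
}

get_prop dfs.replication   # overridden by the site file: prints 1
get_prop dfs.block.size    # falls back to the default: prints 67108864
```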

Name: hadoop.tmp.dir
Default value: /tmp/hadoop-${user.name}
Description: The temporary directories on the nodes of the cluster.

Name: fs.default.name
Default value: file:///
Description: The name of the default filesystem, given as a URI whose fields include a scheme and an authority; the authority consists of host and port. The default is the local filesystem; for HDFS it is an hdfs:// URI.

Name: fs.checkpoint.size
Default value: 67108864
Description: The size of the edit log (in bytes) that triggers a new checkpoint.

Name: local.cache.size
Default value: 10737418240
Description: The maximum size of the cache to keep (default 10 GB).

HDFS filesystem configuration

File hdfs-site.xml
The hdfs-site.xml configuration file is used to configure the HDFS filesystem. See also http://hadoop.apache.org/common/docs/current/hdfs-default.html

hdfs-site.xml

Name: dfs.namenode.logging.level
Default value: info
Description: The logging level for the namenode. The value "dir" logs changes to the namespace, "block" logs information about replicas and block creation/deletion, and "all" logs everything.

Name: dfs.secondary.http.address
Default value: 0.0.0.0:50090
Description: The HTTP address of the Secondary Namenode server. If the port is 0, the server starts on an arbitrary free port.

Name: dfs.datanode.address
Default value: 0.0.0.0:50010
Description: The address the datanode server listens on for connections. If the port is 0, the server starts on an arbitrary free port.

Name: dfs.datanode.http.address
Default value: 0.0.0.0:50075
Description: The HTTP address of the datanode server. If the port is 0, the server starts on an arbitrary free port.

Name: dfs.datanode.handler.count
Default value: 3
Description: The number of server threads for the datanode.

Name: dfs.http.address
Default value: 0.0.0.0:50070
Description: The address and port of the dfs namenode web interface, which listens for connections. If the port is 0, the server starts on an arbitrary free port.

Name: dfs.name.dir
Default value: ${hadoop.tmp.dir}/dfs/name
Description: The directory on the local filesystem where the DFS Namenode stores the fsimage file. If there are multiple directories, the fsimage file is replicated in all of them.

Name: dfs.name.edits.dir
Default value: ${dfs.name.dir}
Description: The directory on the local filesystem where the DFS Namenode stores the transaction (edits) file. If there are multiple directories, the file is replicated in all of them.

Name: dfs.permissions
Default value: true
Description: Enables permission checking on HDFS.

Name: dfs.data.dir
Default value: ${hadoop.tmp.dir}/dfs/data
Description: The directory on the local filesystem where a DFS Datanode stores its block files. If there are multiple directories, the blocks are stored across all of them, typically on different devices. Nonexistent directories are ignored.

Name: dfs.replication
Default value: 3
Description: The default number of replicas of a block.

Name: dfs.replication.max
Default value: 512
Description: The maximum number of replicas of a block.

Name: dfs.replication.min
Default value: 1
Description: The minimum number of replicas of a block.

Name: dfs.block.size
Default value: 67108864
Description: The default size of a block (64 MB).

Name: dfs.heartbeat.interval
Default value: 3
Description: The interval, in seconds, at which a datanode sends heartbeats to the Namenode.

Name: dfs.namenode.handler.count
Default value: 10
Description: The number of server threads on the Namenode.

Name: dfs.replication.interval
Default value: 3
Description: The period, in seconds, at which the namenode recomputes the replica counts for the datanodes.

File masters
This file defines the host that runs the Secondary Namenode; each line is the IP address or hostname of such a host.

File slaves
This file defines the hosts that run a DataNode and a TaskTracker; each line is the IP address or hostname of such a host.

Configuration for the Hadoop MapReduce framework

File mapred-site.xml
The mapred-site.xml configuration file is used to configure the MapReduce framework. See also http://hadoop.apache.org/common/docs/current/mapred-default.html

mapred-site.xml

Name: mapred.job.tracker
Default value: local
Description: The host and port that the MapReduce job tracker runs at. If "local", jobs are run in-process as a single map and reduce task.

Name: mapred.job.tracker.http.address
Default value: 0.0.0.0:50030
Description: The address and port of the jobtracker HTTP server, which listens for connections. If the port is 0, the server starts on an arbitrary free port.

Name: mapred.local.dir
Default value: ${hadoop.tmp.dir}/mapred/local
Description: The local directory where MapReduce stores its intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk I/O. The directories must exist.

Name: mapred.system.dir
Default value: ${hadoop.tmp.dir}/mapred/system
Description: The shared directory where MapReduce stores its control files.

Name: mapred.temp.dir
Default value: ${hadoop.tmp.dir}/mapred/temp
Description: The shared directory for temporary files.

Name: mapred.map.tasks
Default value: 2
Description: The number of map tasks per job. Has no effect when mapred.job.tracker is "local".

Name: mapred.reduce.tasks
Default value: 1
Description: The number of reduce tasks per job. Has no effect when mapred.job.tracker is "local".

Name: mapred.child.java.opts
Default value: -Xmx200m
Description: The Java options for the child processes of the TaskTracker; this sets the heap size for each task.

Name: mapred.job.reuse.jvm.num.tasks
Default value: 1
Description: The number of tasks to run per JVM. A value of -1 means no limit on the number of tasks.

Name: mapred.task.tracker.http.address
Default value: 0.0.0.0:50060
Description: The address and port of the tasktracker HTTP server. If the port is 0, the server starts on an arbitrary port.
