Post on 18-Jan-2018
Cloud Computing project
NSYSU Sec. 1 Demo
NSYSU EE IT_LAB
Outline
– Our system's architecture
– Flow chart of the Hadoop job (web crawler) working on the Hadoop cluster
  – Basic setup
  – Flow chart
– Compare the crawler's efficiency on different types of Hadoop clusters
Architecture
Hardware
– 2 ASUS servers, Intel Xeon CPU X3330 2.66 GHz, 1 TB HD & 3 GB RAM (master, slave1)
– 1 PC, Intel Core 2 Quad CPU Q6600 2.40 GHz, 500 GB HD, 4 GB RAM (slave2)
Software
– CentOS 5.03
– Hadoop 0.20.1
Architecture
Machine 01 – master (x.x.x.1): Namenode, JobTracker, Datanode, TaskTracker
Machine 02 – slave1 (x.x.x.2): Datanode, TaskTracker
Machine 03 – slave2 (x.x.x.3): Datanode, TaskTracker

The administrator monitors the cluster through the web UIs at http://x.x.x.1:50070 (HDFS) and http://x.x.x.1:50030 (jobs); the user submits jobs to the master.
HDFS web UI – http://x.x.x.1:50070 (screenshots)

Job admin web UI – http://x.x.x.1:50030 (screenshots)
Basic setup (Hadoop)
1. Set up password-less communication between the nodes over the SSH protocol
2. Install Java
3. Set the Java path (and the paths of any other files needed) in {hadoop dir}/conf/hadoop-env.sh
4. Set the Namenode and JobTracker host names in {hadoop dir}/conf/hadoop-site.xml
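On one node, steps 1–4 might look like the sketch below. The key file name, the JAVA_HOME path, and the ports 9000/9001 are assumptions (common defaults), not values taken from the slides; only the x.x.x.1 address and the property names come from the Hadoop 0.20 configuration format.

```shell
# Step 1: generate an SSH key with no passphrase and authorize it
# (run on the master, then append the public key to each slave's
#  authorized_keys so the master can log in without a password).
ssh-keygen -t rsa -N "" -f id_rsa_demo
cat id_rsa_demo.pub >> authorized_keys

# Step 3: point Hadoop at the Java install in conf/hadoop-env.sh
# (the JAVA_HOME path below is an assumption; use your own).
echo 'export JAVA_HOME=/usr/lib/jvm/java' >> hadoop-env.sh

# Step 4: name the Namenode and JobTracker hosts in conf/hadoop-site.xml
cat > hadoop-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://x.x.x.1:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>x.x.x.1:9001</value>
  </property>
</configuration>
EOF
```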
Basic setup (Hadoop)
5. Set up the masters file and the slaves file
6. Format HDFS (the Hadoop Distributed File System)
7. Start Hadoop
8. Check Hadoop:
   HDFS – http://<namenode's ip>:50070
   Job admin – http://<JobTracker's ip>:50030
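A sketch of steps 5–8, using the example addresses from the architecture slide (the master also runs a Datanode/TaskTracker there, so it appears in slaves too). The format/start commands need a real cluster, so they are shown as comments only:

```shell
# Step 5: list the master host and the worker hosts in conf/
printf 'x.x.x.1\n' > masters
printf 'x.x.x.1\nx.x.x.2\nx.x.x.3\n' > slaves

# Steps 6-8: format HDFS, start the daemons, then check the web UIs:
#   bin/hadoop namenode -format
#   bin/start-all.sh
#   HDFS UI:      http://x.x.x.1:50070
#   Job admin UI: http://x.x.x.1:50030
```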
Basic setup (crawler)
1. Check your web robot agent file
2. Set up the URL filter file
3. Set your seed URLs file, either by manual input or from a web URL package
(Some detailed settings are omitted here.)
Flow chart
1. The user supplies seed URLs and runs the crawl command as a Hadoop job.
2. The job's fragments (map & reduce tasks) are assigned to each TaskTracker, which goes and fetches the web data.
3. The fetched content is stored in the output directory on HDFS: the link log, the new fetch list, the document data, and the fetch log.
Hadoop cluster – 1 node
Machine 01 – master (x.x.x.1): Namenode, JobTracker, Datanode, TaskTracker
Hadoop cluster – 2 nodes
Machine 01 – master (x.x.x.1): Namenode, JobTracker, Datanode, TaskTracker
Machine 02 – slave1 (x.x.x.2): Datanode, TaskTracker
Hadoop cluster – 3 nodes
Machine 01 – master (x.x.x.1): Namenode, JobTracker, Datanode, TaskTracker
Machine 02 – slave1 (x.x.x.2): Datanode, TaskTracker
Machine 03 – slave2 (x.x.x.3): Datanode, TaskTracker
URL set
– Get a URL package from http://dmoz.org/
– Select one URL out of every 500, which leaves around 10,000 URLs
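Keeping every 500th line of the URL dump is a one-liner with awk. The block below builds a stand-in input file (the real dmoz package is much larger), so the file names and URLs are placeholders:

```shell
# Stand-in for the dmoz URL dump: 10000 fake URLs, one per line.
seq 1 10000 | sed 's#^#http://example.org/page#' > urls.txt

# Keep every 500th line: 10000 input lines -> 20 sampled lines.
awk 'NR % 500 == 0' urls.txt > seeds.txt
wc -l < seeds.txt
```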
Crawler input (seeds.txt)
Crawler output
Output to HDFS
Speed comparison
Time taken by the Hadoop job (9199 URLs):
1 worker node  – 1888 seconds
2 worker nodes – 1679 seconds
3 worker nodes – 1628 seconds
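The gains in the table are modest, plausibly because fetching is bound by the network rather than by CPU. Relative to the single-node run, the speedups work out as:

```shell
# Speedup relative to the 1-node run (times from the table above).
awk 'BEGIN {
  t1 = 1888; t2 = 1679; t3 = 1628
  printf "2 nodes: %.2fx\n", t1 / t2   # prints "2 nodes: 1.12x"
  printf "3 nodes: %.2fx\n", t1 / t3   # prints "3 nodes: 1.16x"
}'
```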
Thanks for your attention!!