Hadoop 20111215
Uploaded by exsuns
![Page 2: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/2.jpg)
• Hadoop Core, our flagship sub-project, provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing metaphor.
• Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
![Page 3: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/3.jpg)
ZooKeeper
• ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.
![Page 4: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/4.jpg)
JobTracker
• JobTracker: The JobTracker provides command and control for job management. It supplies the primary user interface to a MapReduce cluster. It also handles the distribution and management of tasks. There is one instance of this server running on a cluster. The machine running the JobTracker server is the MapReduce master.
![Page 5: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/5.jpg)
TaskTracker
• TaskTracker: The TaskTracker provides execution services for the submitted jobs. Each TaskTracker manages the execution of tasks on an individual compute node in the MapReduce cluster. The JobTracker manages all of the TaskTracker processes. There is one instance of this server per compute node.
![Page 6: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/6.jpg)
NameNode
• NameNode: The NameNode provides metadata storage for the shared file system. The NameNode supplies the primary user interface to the HDFS. It also manages all of the metadata for the HDFS. There is one instance of this server running on a cluster. The metadata includes such critical information as the file directory structure and which DataNodes have copies of the data blocks that contain each file’s data. The machine running the NameNode server process is the HDFS master.
![Page 7: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/7.jpg)
Secondary NameNode
• Secondary NameNode: The secondary NameNode provides both file system metadata backup and metadata compaction. It keeps a periodically refreshed checkpoint of the NameNode's metadata, so its copy lags the live state and it is not a hot standby. There is at least one instance of this server running on a cluster, ideally on a different physical machine from the one running the NameNode. The secondary NameNode also merges the metadata change history, the edit log, into the NameNode's file system image.
![Page 8: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/8.jpg)
Design of HDFS
• Design of HDFS
– Very large files
– Streaming data access
– Commodity hardware
• Not a good fit
– Low-latency data access
– Lots of small files
– Multiple writers, arbitrary file modifications
![Page 9: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/9.jpg)
blocks
• Disk blocks: normally 512 bytes
• HDFS blocks: 64 MB by default
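A minimal sketch of what the 64 MB default means for block counts. The class and method names (`BlockCount`, `blocksFor`) are hypothetical and only illustrate the arithmetic; this is not Hadoop code.

```java
// Sketch: how many HDFS blocks a file of a given size is split into.
class BlockCount {
    // 64 MB, the default HDFS block size quoted on the slide.
    static final long DEFAULT_BLOCK_SIZE = 64L * 1024 * 1024;

    // Ceiling division: a partly filled last block still counts as one block.
    static long blocksFor(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blocksFor(oneGb, DEFAULT_BLOCK_SIZE));     // 16
        System.out.println(blocksFor(oneGb + 1, DEFAULT_BLOCK_SIZE)); // 17
    }
}
```

Every block is tracked in NameNode metadata, which is one reason many small files are a poor fit for HDFS.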
![Page 10: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/10.jpg)
HDFS file read
• Memory
![Page 11: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/11.jpg)
HDFS file write
![Page 12: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/12.jpg)
Usage
• The NameNode must be formatted before HDFS is set up for the first time
– hadoop namenode -format
![Page 13: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/13.jpg)
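HDFS file write
• OutputStream.write()
• OutputStream.flush(): flushes the stream; readers see the data only once more than one block has been written
• OutputStream.sync(): forces synchronization
• OutputStream.close(): includes a sync()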
![Page 14: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/14.jpg)
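DistCp: distributed copy
• hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar
• hadoop distcp -update ……
– copies only the files that have changed
• hadoop distcp -overwrite ……
– overwrites existing files
• hadoop distcp -m 100 ……
– the copy is split across N map tasks (here N = 100)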
![Page 15: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/15.jpg)
Hadoop file archives
• HAR files
• hadoop archive -archiveName file.har /myfiles /outpath
• hadoop fs -ls /outpath/file.har
• hadoop fs -lsr har:///outpath/file.har
![Page 16: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/16.jpg)
File operations
• hadoop fs -rm hdfs://192.168.126.133:9000/xxx
•cat •cp •lsr •rmr
•chgrp •du •mkdir •setrep
•chmod •dus •moveFromLocal •stat
•chown •expunge •moveToLocal •tail
•copyFromLocal •get •mv •test
•copyToLocal •getmerge •put •text
•count •ls •rm •touchz
![Page 17: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/17.jpg)
Distributed deployment
• Master & slave: 192.168.0.10
• Slave: 192.168.0.20
• Edit conf/masters
– 192.168.0.10
• Edit conf/slaves
– 192.168.0.10
– 192.168.0.20
![Page 18: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/18.jpg)
Installing Hadoop
• ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
• cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
• Disable the firewall: sudo ufw disable
![Page 19: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/19.jpg)
Distributed deployment: core-site.xml (identical on master and slaves)
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/tony/tmp/tmp</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://192.168.0.10:9000</value>
      </property>
    </configuration>
![Page 20: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/20.jpg)
Distributed deployment: hdfs-site.xml (master and slaves)
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.name.dir</name>
        <value>/home/tony/tmp/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/home/tony/tmp/data</value>
      </property>
    </configuration>
• Make sure these directories exist on the current machine
![Page 21: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/21.jpg)
Distributed deployment: mapred-site.xml
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>192.168.0.10:9001</value>
      </property>
    </configuration>
• Every machine is configured with the master's address
![Page 22: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/22.jpg)
Run
• hadoop namenode -format
– Before every format, run stop-all.sh first and empty all directories under tmp
• start-all.sh, or (start-dfs.sh and start-mapred.sh)
• Check the cluster status:
– http://192.168.0.20:50070/dfshealth.jsp
– or hadoop dfsadmin -report
![Page 23: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/23.jpg)
![Page 24: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/24.jpg)
![Page 25: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/25.jpg)
could only be replicated
• java.io.IOException: could only be replicated to 0 nodes, instead of 1.
• Fix:
– The XML configuration is wrong; make sure the addresses in each slave's mapred-site.xml and core-site.xml match the master's
![Page 26: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/26.jpg)
Incompatible namespaceIDs
• java.io.IOException: Incompatible namespaceIDs in /home/hadoop/data: namenode namespaceID = 1214734841; datanode namespaceID = 1600742075
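• Cause:
– tmp was not emptied before formatting, so the namenode and datanode IDs no longer match
• Fix:
– Edit /home/hadoop/name/current/VERSION on the namenode so that the namespaceID values match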
![Page 27: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/27.jpg)
UnknownHostException
• # hostname
• vi /etc/hostname: change the hostname
• vi /etc/hosts: add the IP address that the hostname maps to
![Page 28: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/28.jpg)
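Name node is in safe mode
• hadoop dfsadmin -safemode leave
• Safe mode: at startup the NameNode first enters safe mode. If the fraction of blocks missing from the DataNodes reaches a certain ratio (1 - dfs.safemode.threshold.pct), the system stays in safe mode, that is, read-only. dfs.safemode.threshold.pct (default 0.999f) means that HDFS can leave safe mode at startup only once the number of blocks reported by the DataNodes reaches 0.999 times the number of blocks recorded in the metadata; until then it stays read-only. A value greater than 1 keeps HDFS in safe mode permanently.
• The following line is taken from a NameNode startup log (the reported block ratio 1 has reached the threshold 0.9990):
The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 18 seconds.
• There are two ways to leave safe mode:
1. Set dfs.safemode.threshold.pct to a smaller value (the default is 0.999).
2. Force it with hadoop dfsadmin -safemode leave.
• Users can control safe mode with dfsadmin -safemode value, where value is one of:
– enter: enter safe mode
– leave: force the NameNode to leave safe mode
– get: report whether safe mode is on
– wait: wait until safe mode ends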
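The threshold rule can be sketched as a simple ratio check. This is a simplified model under stated assumptions, not the NameNode's actual implementation; `SafeModeCheck` and `canLeaveSafeMode` are hypothetical names.

```java
// Simplified model of the dfs.safemode.threshold.pct rule (not NameNode code).
class SafeModeCheck {
    // Default threshold quoted on the slide.
    static final double DEFAULT_THRESHOLD = 0.999;

    // True when the reported-block ratio has reached the threshold.
    static boolean canLeaveSafeMode(long reportedBlocks, long totalBlocks, double threshold) {
        if (totalBlocks == 0) {
            return true; // assumption of this sketch: an empty namespace has nothing to wait for
        }
        double ratio = (double) reportedBlocks / totalBlocks;
        return ratio >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(canLeaveSafeMode(998, 1000, DEFAULT_THRESHOLD));  // false
        System.out.println(canLeaveSafeMode(1000, 1000, DEFAULT_THRESHOLD)); // true
        System.out.println(canLeaveSafeMode(1000, 1000, 1.5));               // false: ratio can never exceed 1
    }
}
```

The last call shows why a threshold above 1 can never be satisfied in this model: the ratio tops out at 1.0.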
![Page 29: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/29.jpg)
error in shuffle in fetcher
• org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher
• Fix:
– The problem lies in the hosts file configuration; on every node, add the hostname-to-IP mappings of the other nodes to /etc/hosts
![Page 30: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/30.jpg)
![Page 31: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/31.jpg)
Auto sync
![Page 32: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/32.jpg)
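Adding a datanode dynamically
• On the master, add the new node's address to conf/slaves
• Start the daemons on the newly added node
– bin/hadoop-daemon.sh start datanode
– bin/hadoop-daemon.sh start tasktracker
• Once started, Hadoop recognizes the node automatically.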
![Page 33: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/33.jpg)
screenshot
![Page 34: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/34.jpg)
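Fault tolerance
• If a node does not respond for a long time, it is removed from the cluster, and the other nodes re-replicate its blocks to restore the replication factor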
![Page 35: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/35.jpg)
![Page 36: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/36.jpg)
Running MapReduce
• hadoop jar a.jar com.Map1 hdfs://192.168.126.133:9000/hadoopconf/ hdfs://192.168.126.133:9000/output2/
• Status:
– http://localhost:50030/
![Page 37: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/37.jpg)
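Read From Hadoop URL
    // execute: hadoop ReadFromHDFS
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFromHDFS {
        static {
            // Teach java.net.URL the hdfs:// scheme; this can be set only once per JVM.
            URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
        }
        public static void main(String[] args) {
            InputStream in = null;
            try {
                in = new URL("hdfs://192.168.126.133:9000/t1/a1.txt").openStream();
                IOUtils.copyBytes(in, System.out, 4096, false);
            } catch (IOException e) { // also covers FileNotFoundException
                e.printStackTrace();
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }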
![Page 38: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/38.jpg)
Read By FileSystem API
    // execute: hadoop ReadByFileSystemAPI
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadByFileSystemAPI {
        public static void main(String[] args) throws Exception {
            String uri = "hdfs://192.168.126.133:9000/t1/a2.txt";
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(new Path(uri));
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }
![Page 39: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/39.jpg)
FileSystem API
    // Delete the directory if it exists, otherwise create it
    Path path = new Path(URI.create("hdfs://192.168.126.133:9000/t1/tt/"));
    if (fs.exists(path)) {
        fs.delete(path, true);
        System.out.println("deleted-----------");
    } else {
        fs.mkdirs(path);
        System.out.println("created=====");
    }

    /**
     * List files
     */
    FileStatus[] fileStatuses = fs.listStatus(new Path(URI.create("hdfs://192.168.126.133:9000/")));
    for (FileStatus fileStatus : fileStatuses) {
        System.out.println("" + fileStatus.getPath().toUri().toString() + " dir:" + fileStatus.isDirectory());
    }

    PathFilter pathFilter = new PathFilter() {
        @Override
        public boolean accept(Path path) {
            return true;
        }
    };
![Page 40: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/40.jpg)
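File write policy
• Once a file has been created, it is visible in the filesystem namespace, as shown here:
    Path p = new Path("p");
    fs.create(p);
    assertThat(fs.exists(p), is(true));
• However, content written to the file is not guaranteed to be visible, even after the stream has been flushed, so the file length shows as 0:
    Path p = new Path("p");
    OutputStream out = fs.create(p);
    out.write("content".getBytes("UTF-8"));
    out.flush();
    assertThat(fs.getFileStatus(p).getLen(), is(0L));
• Once more than a block's worth of data has been written, new readers can see the first block. The same holds for later blocks. In short, it is always the block currently being written that other readers cannot see.
• out.sync() forces synchronization; close() calls sync() automatically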
![Page 41: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/41.jpg)
Cluster copy and archiving
• hadoop distcp -update hdfs://n1/foo hdfs://n2/bar/foo
• Archive
– hadoop archive -archiveName files.har /my/files /my
• Use the archive
– hadoop fs -lsr har:///my/files.har
– hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/di
• Drawback of archives: modifying, adding, or deleting files requires re-archiving
![Page 42: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/42.jpg)
SequenceFile Reader & Writer
    Configuration conf = new Configuration();
    SequenceFile.Writer writer = null;
    SequenceFile.Reader reader = null;
    try {
        System.out.println("start....................");
        FileSystem fileSystem = FileSystem.newInstance(conf);
        IntWritable key = new IntWritable(1);
        Text value = new Text("");
        Path path = new Path("hdfs://192.168.126.133:9000/t1/seq");
        if (!fileSystem.exists(path)) {
            // createWriter creates the file itself, so no separate fileSystem.create(path) is needed
            writer = SequenceFile.createWriter(fileSystem, conf, path, key.getClass(), value.getClass());
            for (int i = 1; i < 10; i++) {
                writer.append(new IntWritable(i), new Text("value" + i));
            }
            writer.close();
        } else {
            reader = new SequenceFile.Reader(fileSystem, path, conf);
            System.out.println("now while segment");
            while (reader.next(key, value)) {
                System.out.println("key:" + key.get() + " value:" + value + " position" + reader.getPosition());
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeStream(writer);
        IOUtils.closeStream(reader);
    }
![Page 43: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/43.jpg)
SequenceFile
• 1 value1
• 2 value2
• 3 value3
• 4 value4
• 5 value5
• 6 value6
• 7 value7
• 8 value8
• 9 value9
• Each record consists of a key and a value
• The file can be displayed with hadoop fs -text hdfs://………
![Page 44: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/44.jpg)
SequenceMap
• Rebuild the index: MapFile.fix(fileSystem, path, key.getClass(), value.getClass(), true, conf);
• MapFile.Writer writer = new MapFile.Writer(conf, fileSystem, path.toString(), key.getClass(), value.getClass());
• MapFile.Reader reader = new MapFile.Reader(fileSystem,path.toString(),conf);
![Page 45: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/45.jpg)
Mapper Test Case
    @Test
    public void testMapper1() throws IOException {
        MyMapper myMapper = new MyMapper();
        Text text = new Text("xxxxxx<<HelloWorld>>xxxxxxxxxxxxxxxxxx");
        OutputCollector<Text, IntWritable> outputCollector = new OutputCollector<Text, IntWritable>() {
            public void collect(Text resultKey, IntWritable resultValue) throws IOException {
                System.out.println("resultKey:" + resultKey + " resultValue:" + resultValue);
                Assert.assertTrue("HelloWorld".equals(resultKey.toString()));
            }
        };
        myMapper.map(null, text, outputCollector, null);
    }

    public class MyMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable longWritable, Text text, OutputCollector<Text, IntWritable> textIntWritableOutputCollector, Reporter reporter) throws IOException {
            // Emit the text between "<<" and ">>" together with its length
            Text result = new Text(text.toString().split("<<")[1].split(">>")[0]);
            textIntWritableOutputCollector.collect(result, new IntWritable(result.getLength()));
        }
    }
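The mapper's `split("<<")[1]` throws ArrayIndexOutOfBoundsException on a line that has no `<<` marker. A minimal sketch of a more defensive extraction, written as plain Java so it can be tried without Hadoop; `TitleExtractor` and `extract` are hypothetical names, not part of the original slides.

```java
// Sketch: pull out the text between "<<" and ">>" without assuming both markers exist.
class TitleExtractor {
    // Returns the text between the first "<<" and the next ">>", or null if either is missing.
    static String extract(String line) {
        int start = line.indexOf("<<");
        if (start < 0) {
            return null;
        }
        int end = line.indexOf(">>", start + 2);
        if (end < 0) {
            return null;
        }
        return line.substring(start + 2, end);
    }

    public static void main(String[] args) {
        System.out.println(extract("xxxxxx<<HelloWorld>>xxxxxxxxxxxxxxxxxx")); // HelloWorld
        System.out.println(extract("no markers here"));                        // null
    }
}
```

A mapper built on such a helper could skip lines for which it returns null instead of failing the task.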
![Page 46: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/46.jpg)
Reducer Test Case
    @Test
    public void testReducer1() throws IOException {
        MyReducer myReducer = new MyReducer();
        ArrayList<Text> arrayList = new ArrayList<Text>();
        arrayList.add(new Text("a1"));
        arrayList.add(new Text("a222"));
        arrayList.add(new Text("a33"));
        Iterator<Text> it = arrayList.iterator();
        OutputCollector<Text, Text> outputCollector = new OutputCollector<Text, Text>() {
            public void collect(Text resultKey, Text resultValue) throws IOException {
                System.out.println("resultKey:" + resultKey + " resultValue:" + resultValue);
                Assert.assertTrue(resultKey.toString().equals("a222"));
            }
        };
        myReducer.reduce(null, it, outputCollector, null);
    }

    public class MyReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            // Keep the longest value seen so far
            Text t = new Text();
            while (values.hasNext()) {
                Text tmp = values.next();
                if (tmp.getLength() > t.getLength()) {
                    t.set(tmp); // copy: Hadoop reuses the same Text object across iterations
                }
            }
            output.collect(key, t);
        }
    }
![Page 47: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/47.jpg)
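How MapReduce executes a job
• JobClient.submitJob()
1. Asks the JobTracker for a job ID
2. Checks whether the job's input directory exists and whether its output directory already exists
3. Computes the job's input splits; if the input directory does not exist, the error is returned to the MapReduce program
4. Copies the resources needed to run the job to a directory on the JobTracker server
5. Tells the JobTracker to run the job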
![Page 48: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/48.jpg)
Mapper input: using multiple InputFormats
    // match inputFormat by input path
    MultipleInputs.addInputPath(conf, new Path(args[0]), KeyValueTextInputFormat.class, KVTempMapper.class);
    MultipleInputs.addInputPath(conf, new Path("hdfs://192.168.126.133:9000/*.txt"), TextInputFormat.class, KVTempMapper.class);
![Page 49: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/49.jpg)
MapReduce output types
• Energy saving
• Multiple outputs:
– implement Partitioner
– conf.setPartitionerClass(MyPartitioner.class);
– the code is in the speaker notes
![Page 50: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/50.jpg)
Custom output directory
    public class MyOutputFormat extends MultipleTextOutputFormat {
        protected String generateFileNameForKeyValue(Object key, Object value, String name) {
            return "abc.txt";
        }
    }
• At run time: conf.setOutputFormat(MyOutputFormat.class);
• The final output is written to the file abc.txt in the output directory
![Page 51: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/51.jpg)
Setting multiple output formats
    MultipleOutputs.addNamedOutput(conf, "outputA", TextOutputFormat.class, LongWritable.class, Text.class);
    MultipleOutputs.addNamedOutput(conf, "outputB", MyOutputFormat.class, LongWritable.class, Text.class);
• MultipleOutputs must be closed once you are done with it
• Override configure to obtain the JobConf
– the code is in the speaker notes
![Page 52: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/52.jpg)
Counters
• In a mapper or reducer
– reporter.incrCounter(CounterType.Success, 1);
– reporter.incrCounter("myGroup", "name", 2);
• The counter totals are printed when the job completes
• Reading a counter from a program:
    RunningJob runningJob = JobClient.runJob(conf);
    JobClient jobClient = new JobClient(conf);
    Counters.Counter counter = runningJob.getCounters().findCounter("myGroup", "counterA");
    if (counter != null) {
        System.out.println(counter.getCounter());
    }
![Page 53: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/53.jpg)
Sorting & Join
• conf.setOutputFormat(MapFileOutputFormat.class);
![Page 54: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/54.jpg)
Pig
• Avoids having to write, compile, package, and run MapReduce programs by hand
• Running:
– local mode: pig -x local
• export PIG_CLASSPATH=hadoop/conf
• Comments
– /* xxx */
– -- xxxxxxxxxxxxxx (two hyphens)
![Page 55: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/55.jpg)
Pig syntax
• raw = LOAD 'excite.log' (reads a file)
– USING PigStorage('\t') (field delimiter)
– AS (user:int, time:int, query:int); (field names and types)
• register XXX.jar (uses a JAR)
• dump raw
• describe raw (prints the schema)
• explain raw
• store raw into 'aaa.txt' (saves the relation)
![Page 56: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/56.jpg)
![Page 57: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/57.jpg)
Pig syntax
• Filter
– ccc = filter aaa by name is null and age>10
• Group
– bbb = group aaa by myColumn
• Foreach & Generate
– ddd = foreach bbb generate group, MAX(aaa.temp)
• Illustrate
– ILLUSTRATE aaa (prints the steps)
![Page 58: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/58.jpg)
Pig built-in functions
• split XXX into a1 if temp is not null, a2 if temp is null
• Pig built-in functions:
– AVG, CONCAT, COUNT, DIFF, MAX, SIZE, SUM, TOKENIZE
– IsEmpty
– PigStorage
![Page 59: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/59.jpg)
Foreach
• data:
– a, 1, hello
– b, 2, hey
• execute:
– foreach XXX generate $2, $1+10, $0
• result:
– hello, 11, a
– hey, 12, b
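To make the positional references concrete, here is a plain-Java sketch that mimics what `foreach XXX generate $2, $1+10, $0` does to each row. It is only an illustration of the projection, not how Pig executes it; `ForeachSketch` and `generate` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: per-row projection equivalent to "generate $2, $1+10, $0".
class ForeachSketch {
    // Each input row is (String, Integer, String); $0, $1, $2 are positions 0, 1, 2.
    static List<Object[]> generate(List<Object[]> rows) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] row : rows) {
            int second = (Integer) row[1];
            out.add(new Object[] { row[2], second + 10, row[0] });
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[] { "a", 1, "hello" },
            new Object[] { "b", 2, "hey" });
        for (Object[] r : generate(rows)) {
            System.out.println(Arrays.toString(r)); // [hello, 11, a] then [hey, 12, b]
        }
    }
}
```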
![Page 60: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/60.jpg)
User-defined function: filter UDF
• filter XXX by isGood(year)
    public class GoodPig extends FilterFunc {
        public Boolean exec(Tuple tuple);
    }
• Usage:
• define isGood pig.GoodPig
![Page 61: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/61.jpg)
Custom Pig function: changing types
    public class MyEvalFunc extends EvalFunc {
        public List<FuncSpec> getArgToFuncMapping()
    }
• Usage:
– define myEvalFunc com.MyEvalFunc
– foreach XXX generate myEvalFunc(aaa)
![Page 62: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/62.jpg)
MyLoadFunc and storage handling
• store XXX into 'out.txt' using PigStorage('==')
– Input: Hello==1==a
• Custom LoadFunc
– a1 = load 'xxx.txt' using com.MyLoadFunc() as (year:int, temp:int)
– the code is in the speaker notes
![Page 63: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/63.jpg)
Join in Pig
• aaa:
– 1,hi
– 2,hello
– 3,nihao
• bbb:
– a,2
– b,3
– c,1
• xxx = join aaa by $0, bbb by $1
1 hi c 1
2 hello a 2
3 nihao b 3
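The join result above can be reproduced with a small hash join in plain Java. This is a sketch of the general inner equi-join technique, not Pig's implementation; `JoinSketch` and `join` are hypothetical names, keys are compared as strings, and the output order here simply follows the left input (Pig does not guarantee an order).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: inner equi-join of two small relations, like "join aaa by $0, bbb by $1".
class JoinSketch {
    static List<String> join(String[][] left, int leftCol, String[][] right, int rightCol) {
        // Index the right relation by its join column.
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] r : right) {
            index.computeIfAbsent(r[rightCol], k -> new ArrayList<>()).add(r);
        }
        // Probe the index with every left row and emit the concatenated fields.
        List<String> out = new ArrayList<>();
        for (String[] l : left) {
            for (String[] r : index.getOrDefault(l[leftCol], new ArrayList<>())) {
                out.add(String.join(" ", l) + " " + String.join(" ", r));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] aaa = { { "1", "hi" }, { "2", "hello" }, { "3", "nihao" } };
        String[][] bbb = { { "a", "2" }, { "b", "3" }, { "c", "1" } };
        for (String row : join(aaa, 0, bbb, 1)) {
            System.out.println(row); // 1 hi c 1 / 2 hello a 2 / 3 nihao b 3
        }
    }
}
```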
![Page 64: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/64.jpg)
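Introduction to Hive
• A data warehouse
• Translates SQL-like syntax into MapReduce programs
• No support for indexes or transactions; latency is on the order of minutes
• No support for SQL HAVING
• Supported data types
– primitive types: string, int, double, boolean, and so on
– complex types: Array, Map, Struct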
![Page 65: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/65.jpg)
Hive data warehouse
• % export HIVE_HOME=/home/my/hive
• Run: bin/hive
• hive> SHOW TABLES;
• hive -f script.q
• hive -e 'SELECT * FROM dummy'
![Page 66: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/66.jpg)
Hive: creating tables and loading data
• Create a table
    CREATE TABLE records (year STRING, temperature INT, quality INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t';
• Load from a file:
    LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
    OVERWRITE INTO TABLE records
![Page 67: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/67.jpg)
Error while making MR scratch directory
• In Hadoop's core-site.xml configuration file:
• change the value of fs.default.name to the host name listed in hosts
• then restart Hadoop and Hive
• If you are told "name node is in safe mode"
– hadoop dfsadmin -safemode leave
– or create the required directories on HDFS and grant write permission
– % hadoop fs -mkdir /tmp
– % hadoop fs -chmod a+w /tmp
– % hadoop fs -mkdir /user/hive/warehouse
– % hadoop fs -chmod a+w /user/hive/warehouse
![Page 68: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/68.jpg)
Hive startup modes
• hive --service hiveserver
![Page 69: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/69.jpg)
Metastore
• The metastore consists of two parts
– a service
– the data store
![Page 70: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/70.jpg)
Complex types
    CREATE TABLE complex (
      col1 ARRAY<INT>,
      col2 MAP<STRING, INT>,
      col3 STRUCT<a:STRING, b:INT, c:DOUBLE>
    );
• Query:
    SELECT col1[0], col2['b'], col3.c FROM complex
![Page 71: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/71.jpg)
Managed tables and external tables
• A managed table moves the data into Hive's warehouse directory
– CREATE TABLE managed_table (dummy STRING);
– LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
• External table:
– CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/tom/external_table';
– LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
– Dropping an external table does not delete the data, only the metadata
![Page 72: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/72.jpg)
Hive partitions
• Data is stored in one directory per partition
– /user/hive/warehouse/tab4/level=2/city=beijing/h2.txt (the part shown in red on the slide is the partition)
• Create a table
– CREATE TABLE logs (ts BIGINT, line STRING)
– PARTITIONED BY (dt STRING, country STRING);
• Usage:
– LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
– INTO TABLE logs
– PARTITION (dt='2001-01-01', country='GB');
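A minimal sketch of how a partition spec maps to a directory, following the key=value layout shown in the path above. `PartitionPath` and `pathFor` are hypothetical names, and the sketch ignores the escaping Hive applies to special characters in partition values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: build the warehouse directory for one partition of a table.
class PartitionPath {
    // partitionSpec must iterate in PARTITIONED BY order, hence the LinkedHashMap in main.
    static String pathFor(String warehouseDir, String table, Map<String, String> partitionSpec) {
        StringBuilder sb = new StringBuilder(warehouseDir).append('/').append(table);
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            sb.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("dt", "2001-01-01");
        spec.put("country", "GB");
        // prints /user/hive/warehouse/logs/dt=2001-01-01/country=GB
        System.out.println(pathFor("/user/hive/warehouse", "logs", spec));
    }
}
```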
![Page 73: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/73.jpg)
Hive Buckets
• CREATE TABLE bucketed_users (id INT, name STRING)
• CLUSTERED BY (id) INTO 4 BUCKETS;
• Splits the table into 4 buckets, so that work can be divided across multiple MapReduce tasks
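A sketch of how a row could be assigned to one of the 4 buckets: hash the clustering column and take it modulo the bucket count. The assumption that an INT column hashes to its own value, and the names `BucketSketch` and `bucketFor`, are illustrative only; check the Hive version in use before relying on the exact hash.

```java
// Sketch: bucket assignment as hash(id) mod numBuckets.
class BucketSketch {
    static int bucketFor(int id, int numBuckets) {
        int hash = id; // assumption: an INT column hashes to its own value
        // Clear the sign bit so negative ids still map to a valid bucket.
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        for (int id = 0; id < 6; id++) {
            System.out.println("id " + id + " -> bucket " + bucketFor(id, 4)); // 0, 1, 2, 3, 0, 1
        }
    }
}
```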
![Page 74: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/74.jpg)
Delimiters
    CREATE TABLE ...
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;
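To show how the three delimiter levels nest, here is a plain-Java sketch that splits one text row on the control characters \001, \002, and \003. It only illustrates the layout and is not Hive's parser; `DelimitedRow` and its methods are hypothetical names, and the sample row is made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: split a row using field, collection-item, and map-key delimiters.
class DelimitedRow {
    static final String FIELD = String.valueOf((char) 1);   // '\001'
    static final String ITEM = String.valueOf((char) 2);    // '\002'
    static final String MAP_KEY = String.valueOf((char) 3); // '\003'

    static List<String> fields(String line) {
        return Arrays.asList(line.split(FIELD, -1));
    }

    static List<String> items(String field) {
        return Arrays.asList(field.split(ITEM, -1));
    }

    static Map<String, String> mapEntries(String field) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String entry : items(field)) {
            String[] kv = entry.split(MAP_KEY, 2);
            out.put(kv[0], kv.length > 1 ? kv[1] : null);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical row: an int, an array of two strings, a map with two entries.
        String line = "1" + FIELD + "a" + ITEM + "b" + FIELD + "x" + MAP_KEY + "10" + ITEM + "y" + MAP_KEY + "20";
        List<String> f = fields(line);
        System.out.println(f.get(0));             // 1
        System.out.println(items(f.get(1)));      // [a, b]
        System.out.println(mapEntries(f.get(2))); // {x=10, y=20}
    }
}
```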
![Page 75: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/75.jpg)
Specifying a SerDe (serializer/deserializer)
    CREATE TABLE stations (usaf STRING, wban STRING, name STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "(\\d{6}) (\\d{5}) (.{29}) .*"
    );

    hive> SELECT * FROM stations LIMIT 4;
    010000 99999 BOGUS NORWAY
    010003 99999 BOGUS NORWAY
    010010 99999 JAN MAYEN
    010013 99999 ROST
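The regular expression itself can be tried outside Hive with java.util.regex. In this sketch each capture group corresponds to one table column (usaf, wban, name); `StationRegex` and `parse` are hypothetical names, and the sample line, including its trailing "NO", is made up to fit the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: apply the input.regex from the table definition to one fixed-width line.
class StationRegex {
    static final Pattern LINE = Pattern.compile("(\\d{6}) (\\d{5}) (.{29}) .*");

    // Returns the three captured columns, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical input: the name is padded to 29 characters, then more columns follow.
        String line = "010000 99999 " + String.format("%-29s", "BOGUS NORWAY") + " NO";
        String[] cols = parse(line);
        System.out.println(cols[0]);        // 010000
        System.out.println(cols[1]);        // 99999
        System.out.println(cols[2].trim()); // BOGUS NORWAY
    }
}
```

The third group keeps its padding, which is why the name column in the query output above is followed by trailing spaces.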
![Page 76: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/76.jpg)
Table commands
• create table xxx as select name,age from tab2
• ALTER TABLE source RENAME TO target;
• ALTER TABLE target ADD COLUMNS (col3 STRING);
• create table XXX as select c1,c2 from Tab2
![Page 77: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/77.jpg)
User-defined function (UDF)
• select myFun(age) from tab3;
    public class MyFun extends UDF {
    }
• Register it once it is written:
– create temporary function myFun as 'com.MyFun'
![Page 78: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/78.jpg)
User-defined aggregate function (UDAF)
• extends UDAF
![Page 94: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/94.jpg)
HBase
• start-hbase.sh
• hbase shell
• create 'tab1','col'
• list (shows the tables)
• put 'tab1','row1', 'col:name', 'XiaoMing'
• put 'tab1', 'row1', 'col:age', '10'
• put 'tab2', 'row2', 'col:name', 'DaMing'
• Drop a table
– disable 'tab1'
– drop 'tab1'
![Page 95: Hadoop 20111215](https://reader035.fdocuments.net/reader035/viewer/2022062300/5556b03ad8b42a9c798b532f/html5/thumbnails/95.jpg)
HBase API Get
    @Test
    public void testGet() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // conf.set("hbase.master.port", "localhost:PORT");
        // conf.set("hbase.zookeeper.quorum", "IP");
        HTable table = new HTable(conf, "tab1");
        Get get = new Get(Bytes.toBytes("r1"));
        get.addColumn(Bytes.toBytes("col"), Bytes.toBytes("name"));
        Result result = table.get(get);
        byte[] value = result.value();
        System.out.println("v:" + Bytes.toString(value));
        byte[] val = result.getValue(Bytes.toBytes("col"), Bytes.toBytes("name"));
        System.out.println("Value: " + Bytes.toString(val));
    }