Scala on Hadoop

Hadoop Conference

Shinji Tanaka, Hatena
stanaka @ hatena.ne.jp

http://d.hatena.ne.jp/stanaka/
http://twitter.com/stanaka/

Agenda
  Self-introduction
  Hadoop at Hatena
  The road to Scala
  Scala on Hadoop
  Applications of Scala on Hadoop

Self-introduction
  Executive officer, Hatena Co., Ltd.
  Areas of responsibility
    System architecture
    Scalability
    Servers and networks
    Support

Hadoop at Hatena #1

10 self-built servers
  CPU: Core2 Quad x 1
  Mem: 8GB
  HDD: 3TB

Kept in the office rather than in a data center, to save on power costs

Hadoop at Hatena #2

Accumulated data (mainly logs)
  Diary: 7G/day
  Bookmark: 5G/day
  Ugomemo: 3G/day

Jobs
  300 jobs/day

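The Hadoop system at Hatena (current state)

[Architecture diagram. Components: ReverseProxy, HDFS, Hadoop MapReduce, HatenaFotolife, HatenaGraph. Labels: job submission; logs accumulated hourly.]

Logs are accumulated hourly under:

/logs/$service/$year/$month/$date/$host_access-$hour.log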
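The hourly layout above can be built mechanically from its parts. A minimal sketch in Scala, assuming zero-padded month, day, and hour (the slide does not show the padding); `LogPath` and `hourly` are hypothetical names used only for illustration:

```scala
object LogPath {
  // Builds /logs/$service/$year/$month/$date/$host_access-$hour.log.
  // Assumption: month, day and hour are zero-padded to two digits.
  def hourly(service: String, year: Int, month: Int, day: Int,
             host: String, hour: Int): String =
    f"/logs/$service/$year%04d/$month%02d/$day%02d/${host}_access-$hour%02d.log"
}
```

A layout like this lets a job select one service and one day simply by choosing an input directory.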

Hadoop

Around 2008/5: investigation
  Hadoop Streaming
Around 2008/8: went into operation
  Mapper and Reducer written in Perl
  Jobs defined in YAML
2009/4: built a Web UI
2009/11: moved to Scala ← we are here

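Hadoop Streaming

Makes MapReduce possible in languages other than Java!
  The input and output of Map and Reduce are handled as standard input / standard output
  → you only need to provide map.pl and reduce.pl

Input and output are normally both placed on HDFS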

map.pl

#!/usr/bin/env perl
use strict;
use warnings;

while (<>) {
    chomp;
    my @segments = split /\s+/;
    printf "%s\t%s\n", $segments[8], 1;
}

reduce.pl

#!/usr/bin/env perl
use strict;
use warnings;

my %count;
while (<>) {
    chomp;
    my ($key, $value) = split /\t/;
    $count{$key}++;
}

while (my ($key, $value) = each %count) {
    printf "%s\t%s\n", $key, $value;
}

Running

% hadoop jar $HADOOP_DIR/contrib/hadoop-*-streaming.jar \
    -input httpd_logs \
    -output analog_out \
    -mapper /home/user/work/analog/map.pl \
    -reducer /home/user/work/analog/reduce.pl
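The contract between the two scripts (the mapper prints `key<TAB>1` lines, the framework sorts them, the reducer counts per key) can be tried out locally. A minimal sketch in plain Scala, without Hadoop; `StreamingSketch` and its method names are hypothetical, and skipping lines with fewer than nine fields is an assumption rather than what map.pl does:

```scala
object StreamingSketch {
  // Mapper: emit "<9th whitespace-separated field>\t1" for each input line.
  // Assumption: lines with fewer than 9 fields are skipped.
  def mapLine(line: String): Option[String] = {
    val segments = line.split("\\s+")
    if (segments.length > 8) Some(segments(8) + "\t1") else None
  }

  // Reducer: count how many lines arrived for each key.
  def reduceLines(lines: Seq[String]): Map[String, Int] =
    lines.map(_.split("\t")(0)).groupBy(identity).map { case (k, v) => (k, v.size) }

  // Map, sort (standing in for the shuffle), then reduce.
  def run(input: Seq[String]): Map[String, Int] =
    reduceLines(input.flatMap(l => mapLine(l).toList).sorted)
}
```

Checking the logic this way before submitting a job avoids a full cluster round trip for simple mistakes.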

Job definition
  Defined in YAML

- name: latency
  mapper:
    class: LogAnalyzer::Mapper
    options:
      filters:
        isbot: 0
      conditions:
        - key: Top
          filters:
            uri: '^\/$'
          value: $response
  reducer:
    class: Reducer::Distribution
  input:
    class: LogAnalyzer::Input
    options:
      service: ugomemo
      period: 1
  output:
    class: Output::Gnuplot
    options:
      title: "Ugomemo Latency $date"
      xlabel: "Response time (msec)"
      ylabel: "Rates of requests (%)"
      fotolife_folder: ugomemo

Limits of Hadoop Streaming

Slow ← partly a Perl problem, too ...
Even when a job is KILLed, processes sometimes remain
HDFS operations are slow
A Combiner cannot be defined
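The missing Combiner matters because a combiner pre-aggregates each mapper's output before it is sent to the reducers. A minimal sketch in plain Scala, without Hadoop and with hypothetical names, showing that summing per-mapper partial counts yields the same totals while shuffling fewer records:

```scala
object CombinerSketch {
  type Pair = (String, Int)

  // Sum values per key. Usable both as combiner and as reducer,
  // because addition is associative and commutative.
  def sumByKey(pairs: Seq[Pair]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // Without a combiner: every (key, 1) pair reaches the reducer.
  // Returns (number of shuffled records, final counts).
  def withoutCombiner(mapOutputs: Seq[Seq[Pair]]): (Int, Map[String, Int]) = {
    val shuffled = mapOutputs.flatten
    (shuffled.size, sumByKey(shuffled))
  }

  // With a combiner: each mapper's output is summed locally first.
  def withCombiner(mapOutputs: Seq[Seq[Pair]]): (Int, Map[String, Int]) = {
    val shuffled = mapOutputs.flatMap(out => sumByKey(out).toSeq)
    (shuffled.size, sumByKey(shuffled))
  }
}
```

This only works when the reduce function can be applied to partial results, which is the case for counting and summing.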

Scala

Appeared in 2003
A language with functional features
Can also be written in an ordinary object-oriented style
Runs on the Java VM

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}

Quick sort in Scala

def qsort[T <% Ordered[T]](list: List[T]): List[T] = list match {
  case Nil => Nil
  case pivot::tail =>
    qsort(tail.filter(_ < pivot)) ::: pivot :: qsort(tail.filter(_ >= pivot))
}

scala> qsort(List(2,1,3))
res1: List[Int] = List(1, 2, 3)
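View bounds (`<%`) fell out of favor in later Scala releases. A sketch of the same algorithm with an implicit `Ordering` parameter instead; this is an adaptation for newer compilers, not the version shown on the slide, and the wrapping object `QuickSort` is a hypothetical name:

```scala
object QuickSort {
  // Same algorithm as above; comparisons go through the implicit Ordering.
  def qsort[T](list: List[T])(implicit ord: Ordering[T]): List[T] = list match {
    case Nil => Nil
    case pivot :: tail =>
      qsort(tail.filter(ord.lt(_, pivot))) ::: pivot :: qsort(tail.filter(ord.gteq(_, pivot)))
  }
}
```

Any type with an `Ordering` in scope (numbers, strings, tuples of those) can be sorted this way.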

WordCount by Java

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
…

WordCount by Scala

object WordCount {

  class MyMap extends Mapper[LongWritable, Text, Text, IntWritable] {

    val one = 1

    override def map(ky: LongWritable, value: Text,
                     output: Mapper[LongWritable, Text, Text, IntWritable]#Context) = {
      (value split " ") foreach (output write (_, one))
    }
  }

  class MyReduce extends Reducer[Text, IntWritable, Text, IntWritable] {
    override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                        output: Reducer[Text, IntWritable, Text, IntWritable]#Context) = {
      val iter: Iterator[IntWritable] = values.iterator()
      val sum = iter reduceLeft ((a: Int, b: Int) => a + b)
      output write (key, sum)
    }
  }

  def main(args: Array[String]) = {…

Java vs Scala

Java

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    output.collect(word, one);
  }
}

Scala

override def map(ky: LongWritable, value: Text,
                 output: Mapper[LongWritable, Text, Text, IntWritable]#Context) = {
  (value split " ") foreach (output write (_, one))
}

Scala on Hadoop

A library that connects Java and Scala is needed

SHadoop
  http://code.google.com/p/jweslley/source/browse/#svn/trunk/scala/shadoop
  A simple library that performs type conversion
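Type conversion of this kind is what lets the WordCount code pass a plain `Int` where Hadoop expects an `IntWritable`, presumably through Scala implicit conversions. A minimal sketch of the idea using stand-in classes so that it runs without Hadoop: `IntBox` and `TextBox` are hypothetical substitutes for `IntWritable` and `Text`, and none of this is SHadoop's actual source.

```scala
import scala.language.implicitConversions

object ConversionSketch {
  // Stand-ins for Hadoop's IntWritable and Text (assumption: simplified wrappers).
  final class IntBox(val get: Int)
  final class TextBox(val value: String) { override def toString: String = value }

  // Implicit views in both directions, in the spirit of a type-conversion library.
  implicit def intToBox(i: Int): IntBox = new IntBox(i)
  implicit def boxToInt(b: IntBox): Int = b.get
  implicit def stringToText(s: String): TextBox = new TextBox(s)
  implicit def textToString(t: TextBox): String = t.toString

  // An API that only accepts the wrapper types, like a Hadoop Context.
  def write(key: TextBox, value: IntBox): String = key.toString + "\t" + value.get

  // Callers pass plain Scala values; the compiler inserts the conversions.
  def demo(): String = write("hadoop", 1)

  // A wrapped text can be split like a String.
  def words(t: TextBox): Array[String] = t.split(" ")

  // Wrapped values can be summed as plain Ints.
  def total(values: Seq[IntBox]): Int = values.map(b => (b: Int)).sum
}
```

The conversions keep the mapper and reducer bodies short, at the cost of hiding where wrapper objects are allocated.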

mapper

class MyMap extends Mapper[LongWritable, Text, Text, IntWritable] {

  val one = 1

  override def map(ky: LongWritable, value: Text,
                   output: Mapper[LongWritable, Text, Text, IntWritable]#Context) = {
    (value split " ") foreach (output write (_, one))
  }
}

reducer

class MyReduce extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      output: Reducer[Text, IntWritable, Text, IntWritable]#Context) = {
    val iter: Iterator[IntWritable] = values.iterator()
    val sum = iter reduceLeft ((a: Int, b: Int) => a + b)
    output write (key, sum)
  }
}

main

def main(args: Array[String]) = {
  val conf = new Configuration()
  val otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs()

  val job = new Job(conf, "word count")
  job setJarByClass(WordCount getClass())
  job setMapperClass(classOf[WordCount.MyMap])
  job setCombinerClass(classOf[WordCount.MyReduce])
  job setReducerClass(classOf[WordCount.MyReduce])

  job setMapOutputKeyClass(classOf[Text])
  job setMapOutputValueClass(classOf[IntWritable])
  job setOutputKeyClass(classOf[Text])
  job setOutputValueClass(classOf[IntWritable])

  FileInputFormat addInputPath(job, new Path(otherArgs(0)))
  FileOutputFormat setOutputPath(job, new Path(otherArgs(1)))
  System exit(job waitForCompletion(true) match {
    case true => 0
    case false => 1
  })
}

HDFS operations

import java.net.URI
import org.apache.hadoop.fs._
import org.apache.hadoop.hdfs._
import org.apache.hadoop.conf.Configuration

object Hdfs {
  def main(args: Array[String]) = {
    val conf = new Configuration()

    val uri = new URI("hdfs://hadoop01:9000/")
    val fs = new DistributedFileSystem
    fs.initialize(uri, conf)

    // Print the modification time of the path given as the first argument
    val status = fs.getFileStatus(new Path(args(0)))
    println(status.getModificationTime)
  }
}

Build method: Maven

http://maven.apache.org/
A project management tool for Java-family projects

Creating the project

mvn org.apache.maven.plugins:maven-archetype-plugin:2.0-alpha-4:create \
  -DarchetypeGroupId=org.scala-tools.archetypes \
  -DarchetypeArtifactId=scala-archetype-simple \
  -DarchetypeVersion=1.2 \
  -DremoteRepositories=http://scala-tools.org/repo-releases \
  -DgroupId=com.hatena.hadoop \
  -DartifactId=hadoop

Describing dependencies and registering the Hadoop jars

Describing dependencies

<dependency>
  <groupId>commons-logging</groupId>
  <artifactId>commons-logging</artifactId>
  <version>1.0.4</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>commons-cli</groupId>
  <artifactId>commons-cli</artifactId>
  <version>1.0</version>
  <scope>provided</scope>
</dependency>

Registering the Hadoop jars

mvn install:install-file \
  -DgroupId=org.apache.hadoop \
  -DartifactId=hadoop-core \
  -Dversion=0.20.1 \
  -Dpackaging=jar \
  -Dfile=/opt/hadoop/hadoop-0.20.1-core.jar
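Once the jar is installed under those coordinates, pom.xml can reference it like the other dependencies. A sketch of such an entry; the slides do not show it, and the `provided` scope is an assumption that mirrors the entries above:

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.1</version>
  <!-- assumption: provided, because the hadoop command supplies this jar at run time -->
  <scope>provided</scope>
</dependency>
```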

Building, packaging, and running

Building and packaging

mvn scala:compile
mvn package
mvn clean

Running

$HADOOP_HOME/bin/hadoop jar ../maven/hadoop/target/hadoop-1.0-SNAPSHOT.jar \
  com.hatena.hadoop.Hadoop \
  -D mapred.job.tracker=local \
  -D fs.default.name=file:/// \
  input output

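Measuring response time #1

Measurement methods
  Hit a specific URL and measure the time it takes
  Collect from the raw access logs

Analyzing the raw access logs
  Hadoop cluster
    10 Core2Quad servers
    Hatena Diary logs, 7GB → processed in about 10 minutes

Graph the distribution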

Measuring response time #2

Mapper
  Filter by conditions such as URL
  Record the response time

Reducer
  Compute the distribution of response times

Post-processing
  Graphing (gnuplot)
  Upload to Fotolife (AtomAPI)
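The Mapper and Reducer steps listed here can be sketched in plain Scala without Hadoop. Everything in this sketch is an assumption for illustration: the record shape `Access`, the bot flag, and the 100 msec bucket width used in the usage note. The actual LogAnalyzer::Mapper and Reducer::Distribution classes are not shown in the slides.

```scala
object LatencySketch {
  // Hypothetical parsed log record.
  final case class Access(uri: String, isBot: Boolean, responseMsec: Int)

  // Mapper: keep non-bot requests whose URI matches the pattern
  // and emit their response times.
  def mapper(records: Seq[Access], uriPattern: String): Seq[Int] =
    records.filter(r => !r.isBot && r.uri.matches(uriPattern)).map(_.responseMsec)

  // Reducer: percentage of requests falling into each bucket,
  // keyed by the bucket's lower bound in msec.
  def distribution(times: Seq[Int], bucketMsec: Int): Map[Int, Double] =
    if (times.isEmpty) Map.empty
    else times.groupBy(t => t / bucketMsec * bucketMsec)
      .map { case (bucket, ts) => (bucket, 100.0 * ts.size / times.size) }
}
```

Calling `LatencySketch.distribution(LatencySketch.mapper(records, "^/$"), 100)` gives bucket and percentage pairs of the kind a plotting step would consume.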

Response time distribution graphs

An example of good response times

The effect of caching

Summary
  Hadoop at Hatena
  Scala on Hadoop

Try all sorts of things and have fun!

Q&A
stanaka @ hatena.ne.jp
