Design of a DSL by Ruby for heavy computations

Design of a DSL by Ruby for heavy computations over map-reduce clusters. The 37th Grace seminar, 16th June, 2010. Koichi Fujikawa, Cirius Technologies, Inc.

description

Presentation at the 37th GRACE seminar at NII, 16th June, 2010. http://www.grace-center.jp/event/node/59

Transcript of "Design of a DSL by Ruby for heavy computations"

Page 1

Design of a DSL by Ruby for heavy computations

over map-reduce clusters

the 37th Grace seminar, 16th June, 2010

Koichi Fujikawa, Cirius Technologies, Inc.

Page 2

Today's Agenda

- Background
- Problem
- Approach
- My Project
- Conclusion

Page 3

Background: Where are we in the world?

Page 4

We Live in the "Big Data" era

World-wide web page data (text only) was estimated at roughly 400 TB at one point.

Some web service companies (like Google, Yahoo, etc.) have to process these data for their business, but...

A typical HDD reads data at about 50 MB/sec. At that rate, it would take roughly 2,200 hours (approx. 90 days) for one machine to read the total web data (400 TB). We need parallel processing and a parallel file system.
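The back-of-the-envelope arithmetic above can be checked in a few lines of Ruby (using the slide's figures of 400 TB and 50 MB/sec):

```ruby
# Time for one machine to scan 400 TB at 50 MB/s
total_bytes = 400 * 10**12     # 400 TB
read_rate   = 50  * 10**6      # 50 MB/s

seconds = total_bytes / read_rate   # => 8_000_000 seconds
hours   = seconds / 3600.0
days    = hours / 24

puts "#{hours.round} hours (~#{days.round} days)"
# => prints "2222 hours (~93 days)"
```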

Page 5

MapReduce

MapReduce is one of the parallel skeletons. It became popular through Google's paper (2004). MapReduce has two phases:

- Map phase: transform each key and value into another (key and) value
- Reduce phase: aggregate and compute over the values grouped under one key

Each record is processed by the map phase first and then by the reduce phase.
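The two phases can be sketched in plain Ruby (no cluster) with the classic word-count example; the map, shuffle, and reduce steps here are an illustration of the model, not Hadoop's API:

```ruby
records = ["apple banana", "banana cherry", "apple"]

# Map phase: each input record emits (key, value) pairs
mapped = records.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle: group the emitted pairs by key (done by the framework on a cluster)
grouped = mapped.group_by { |key, _value| key }

# Reduce phase: aggregate all values that share one key
counts = grouped.map { |key, pairs| [key, pairs.sum { |_, v| v }] }.to_h
# => {"apple"=>2, "banana"=>2, "cherry"=>1}
```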

Page 6
Page 7

Hadoop

Hadoop is an open-source clone of Google's MapReduce, hosted by the Apache Foundation. Big web service providers (Yahoo, Facebook, etc.) contribute to this project actively. It has a large development and user community all over the world (including Japan):

- Hadoop Conference Japan 2009
- Hadoop source code reading events

Page 8

Problem: What issues do we face?

Page 9

Programming Model

General programmers and engineers are not familiar with this "MapReduce" model, so it is too difficult to pick up and use.

It is especially hard to separate a computation into Map and Reduce. There is no established set of "MapReduce programming patterns" yet, because the technology is not mature for engineers; we have to discover these patterns individually, which is very difficult and time-consuming.

Page 10

Programming Language

Hadoop is written in Java, so programmers need to write their Map and Reduce procedures in Java.

Java is a strongly typed, compiled language, and some web service engineers don't like such languages.

That is no problem if the code is fixed and complete, but I wonder whether it is suitable for ad-hoc prototyping and easy querying.

MapReduce jobs depend on what users want to get, so flexibility is important, I think.

Page 11

Approach: How do we resolve it?

Page 12

Hide complexity of MapReduce

I found that the description of a MapReduce job could be simpler in some specific cases (e.g. log analysis). In such cases (and almost all Hadoop usage today is log analysis), it would be nice if programmers could write the description without taking care of MapReduce!

Page 13

DSL approach by Ruby

For this kind of description, I created a DSL for each specific usage.

A log analysis DSL is the reference implementation I prepared.

As the DSL runtime environment for Hadoop, I chose Ruby and JRuby, a Ruby runtime that works on the JVM.

Ruby is a very flexible, reusable, object-oriented language, so it is very easy to create a DSL processor.
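As a minimal sketch of why Ruby makes internal DSLs easy: a block can be evaluated in the context of a builder object via `instance_eval`, so bare words become DSL keywords. The class and method names below (`LogQuery`, `from`, `count_by`) are hypothetical, chosen for illustration, and are not Hadoop Papyrus's actual API:

```ruby
class LogQuery
  attr_reader :steps

  def initialize
    @steps = []
  end

  # Each DSL keyword just records a step
  def from(path)
    @steps << [:from, path]
  end

  def count_by(column)
    @steps << [:count_by, column]
  end

  # Evaluate the user's block in the context of a fresh query object,
  # so `from` and `count_by` inside the block dispatch to it
  def self.define(&block)
    q = new
    q.instance_eval(&block)
    q
  end
end

query = LogQuery.define do
  from "access.log"
  count_by :user
end
query.steps  # => [[:from, "access.log"], [:count_by, :user]]
```

A real framework would then translate the recorded steps into Map and Reduce jobs, but the recording trick itself is this simple.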

Page 14

My Project: What do I do?

Page 15

Hadoop Papyrus

A DSL framework for Hadoop built on JRuby. We can write log analysis code in only a few lines.

Open source (Apache License), the same as Hadoop.

- Hosted on GitHub
- Distributed via RubyGems.org, the common Ruby archive site
- Supported by IPA Mitoh 2009

Page 16
Page 17
Page 18

DEMO

Page 19

Conclusion: What is achieved now?

Page 20

On the way to big challenge

We need a parallel processing method to handle massive web-scale data. MapReduce and Hadoop are good tools, but...

- It is difficult to describe a job as Map and Reduce
- Writing Java is irritating for some :-)

Hadoop Papyrus provides the key!
- A Ruby-based DSL framework for Hadoop
- You can write Map and Reduce at once

Page 21

Questions? Thank you very much!
Twitter ID: @fujibee