Design of a DSL by Ruby for heavy computations
over map-reduce clusters
The 37th Grace Seminar, 16th June, 2010
Koichi Fujikawa, Cirius Technologies, Inc.
Today's Agenda
Background
Problem
Approach
My Project
Conclusion
Background
Where are we in the world?
We Live in the "Big Data" era
World-wide web page data (text only) is estimated at 400 TB (at one point in time).
Web service companies (like Google, Yahoo, etc.) have to process this data for their business, but..
A typical HDD reads data at about 50 MB/sec, so a single machine would need roughly 2,200 hours (about 90 days) just to read the total web data (400 TB). We need parallel processing and a parallel file system.
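As a quick back-of-envelope check of that read-time estimate, in plain Ruby:

```ruby
# Back-of-envelope: time for one machine to read 400 TB at 50 MB/s.
total_bytes = 400 * 10**12   # 400 TB
read_rate   = 50 * 10**6     # 50 MB/s

seconds = total_bytes / read_rate   # 8,000,000 seconds
hours   = seconds / 3600.0
days    = hours / 24

puts "#{hours.round} hours (~#{days.round} days)"  # on the order of 2,000+ hours, ~90 days
```

A single disk simply cannot keep up, which is what motivates a parallel file system.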
MapReduce
MapReduce is one of the parallel skeletons. It became popular through Google's paper (2004). MapReduce has two phases:
Map phase: transform each key and value into another (key and) value.
Reduce phase: aggregate and calculate the values collected under one key.
Each record is processed by the map phase first and then by the reduce phase.
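To make the two phases concrete, here is the classic word-count example written in plain Ruby (no cluster involved), with the map, shuffle, and reduce steps spelled out:

```ruby
# Word count as a minimal local MapReduce, using plain Enumerable.
lines = ["to be or not to be", "to see or not to see"]

# Map phase: transform each record into (key, value) pairs.
mapped = lines.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle: group the emitted pairs by key.
grouped = mapped.group_by { |word, _| word }

# Reduce phase: aggregate the values under each key.
counts = grouped.transform_values { |pairs| pairs.sum { |_, n| n } }

puts counts.inspect  # {"to"=>4, "be"=>2, "or"=>2, "not"=>2, "see"=>2}
```

On a real cluster, the map and reduce blocks run on many machines in parallel and the shuffle happens over the network, but the programming model is exactly this.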
Hadoop
Hadoop is an open source clone of Google's MapReduce, hosted by the Apache Foundation. Big web service providers (Yahoo, Facebook, etc.) contribute to the project actively. It has a large developer and user community all over the world (including Japan):
Hadoop Conference Japan 2009
Hadoop source code reading events
Problem
What issues do we face?
Programming Model
General programmers and engineers are not familiar with this "MapReduce" model, so it is too difficult to try and use,
especially separating the Map and Reduce steps. There is no established set of "MapReduce programming patterns" because the technology is not yet mature among engineers; each of us has to discover them individually, which is very difficult and time-consuming.
Programming Language
Hadoop is written in Java, so programmers need to write the Map and Reduce procedures in Java.
Java is a strongly typed, compiled language, and some web service engineers don't like such languages.
That is no problem if the code is fixed and complete, but I doubt it is suitable for ad-hoc prototyping and easy querying.
MapReduce jobs depend on what users want to get, so flexibility is important, I think.
Approach
How do we resolve it?
Hide complexity of MapReduce
I found that the MapReduce description could be simpler in some specific cases (e.g. log analysis). In such cases (and almost all Hadoop usage today is log analysis), it would be nice if programmers could write the description without taking care of MapReduce at all!
DSL approach by Ruby
For this kind of description, I created a DSL for each specific usage.
A log analysis DSL is the reference implementation I prepared.
As the DSL runtime environment for Hadoop, I chose Ruby and JRuby, a Ruby runtime that works on the JVM.
Ruby is a very flexible, reusable object-oriented language, so it is very easy to create a DSL processor.
My Project
What do I do?
Hadoop Papyrus
A DSL framework for Hadoop built on JRuby. We can write log analysis code in only several lines.
Open source (Apache License), same as Hadoop.
Hosted on GitHub, distributed via the common Ruby archive site RubyGems.org.
Supported by IPA Mitoh 2009.
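To illustrate the approach, here is a toy sketch of such a DSL in plain Ruby. All names here (LogQuery, count_by, the local runner) are illustrative assumptions, not the actual Hadoop Papyrus API: the user only declares what to count, and the map and reduce steps are derived behind the scenes.

```ruby
# Toy DSL sketch: the user writes a declarative query; the framework
# derives the map and reduce phases. Hypothetical names throughout.
class LogQuery
  def initialize(&spec)
    instance_eval(&spec)   # evaluate the DSL block against this object
  end

  # DSL verb: count records grouped by a field.
  def count_by(field)
    @field = field
  end

  # Derived map phase: emit (field_value, 1) for each record.
  def map(record)
    [record[@field], 1]
  end

  # Derived reduce phase: sum the values collected under each key.
  def reduce(key, values)
    [key, values.sum]
  end

  # Local runner standing in for a Hadoop cluster.
  def run(records)
    records.map { |r| map(r) }
           .group_by(&:first)
           .map { |key, pairs| reduce(key, pairs.map(&:last)) }
           .to_h
  end
end

logs = [
  { path: "/index", status: 200 },
  { path: "/index", status: 404 },
  { path: "/about", status: 200 },
]

query = LogQuery.new { count_by :path }
puts query.run(logs).inspect  # {"/index"=>2, "/about"=>1}
```

The point is that the query block never mentions map or reduce; Ruby's blocks and instance_eval make it easy to build this kind of processor, which is the same idea Hadoop Papyrus applies on top of real Hadoop jobs.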
DEMO
Conclusion
What are we achieving now?
On the way to big challenge
We need a parallel processing method to handle massive web-scale data. MapReduce and Hadoop are good tools, but:
it is difficult to describe Map and Reduce, and writing Java is irritating for some people :-)
Hadoop Papyrus provides the key! A Ruby-based DSL framework for Hadoop: you can write Map and Reduce at once.
Questions?
Thank you very much!
Twitter ID: @fujibee