Design of a DSL by Ruby for heavy computations
over map-reduce clusters
The 37th Grace Seminar, 16th June, 2010
Koichi Fujikawa, Cirius Technologies, Inc.
Today's Agenda
Background
Problem
Approach
My Project
Conclusion
Background
Where are we in the world?
We Live in the "Big Data" era
World-wide web page data (text only) is estimated at 400 TB (at one point in time).
Web service companies (like Google, Yahoo, etc.) have to process this data for their business, but..
A typical HDD reads data at about 50 MB/sec, so a single machine would need roughly 2,200 hours (about 90 days) just to read the total web data (400 TB). We need parallel processing and a parallel file system.
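As a quick back-of-envelope check of that read-time estimate, in plain Ruby:

```ruby
# Back-of-envelope: time for one machine to read 400 TB at 50 MB/s.
total_bytes = 400 * 10**12   # 400 TB
read_rate   = 50 * 10**6     # 50 MB/s

seconds = total_bytes / read_rate   # 8,000,000 seconds
hours   = seconds / 3600.0
days    = hours / 24

puts "#{hours.round} hours (~#{days.round} days)"  # on the order of 2,000+ hours, ~90 days
```

A single disk simply cannot keep up, which is what motivates a parallel file system.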
MapReduce
MapReduce is one of the parallel skeletons. It became popular through Google's paper (2004). MapReduce has two phases:
Map phase: transform each key and value into another (key and) value.
Reduce phase: aggregate and calculate the values collected under one key.
Each record is processed by the map phase first and then by the reduce phase.
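To make the two phases concrete, here is the classic word-count example written in plain Ruby (no cluster involved), with the map, shuffle, and reduce steps spelled out:

```ruby
# Word count as a minimal local MapReduce, using plain Enumerable.
lines = ["to be or not to be", "to see or not to see"]

# Map phase: transform each record into (key, value) pairs.
mapped = lines.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle: group the emitted pairs by key.
grouped = mapped.group_by { |word, _| word }

# Reduce phase: aggregate the values under each key.
counts = grouped.transform_values { |pairs| pairs.sum { |_, n| n } }

puts counts.inspect  # {"to"=>4, "be"=>2, "or"=>2, "not"=>2, "see"=>2}
```

On a real cluster, the map and reduce blocks run on many machines in parallel and the shuffle happens over the network, but the programming model is exactly this.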
Hadoop
Hadoop is an open source clone of Google's MapReduce, hosted by the Apache Foundation. Big web service providers (Yahoo, Facebook, etc.) contribute to the project actively. It has a large developer and user community all over the world (including Japan):
Hadoop Conference Japan 2009
Hadoop source code reading events
Problem
What issues do we face?
Programming Model
General programmers and engineers are not familiar with this "MapReduce" model, so it is too difficult to try and use,
especially separating the Map and Reduce steps. There is no established set of "MapReduce programming patterns" because the technology is not yet mature among engineers; each of us has to discover them individually, which is very difficult and time-consuming.
Programming Language
Hadoop is written in Java, so programmers need to write the Map and Reduce procedures in Java.
Java is a strongly typed, compiled language, and some web service engineers don't like such languages.
That is no problem if the code is fixed and complete, but I doubt it is suitable for ad-hoc prototyping and easy querying.
MapReduce jobs depend on what users want to get, so flexibility is important, I think.
Approach
How do we resolve it?
Hide complexity of MapReduce
I found that the MapReduce description could be simpler in some specific cases (e.g. log analysis). In such cases (and almost all Hadoop usage today is log analysis), it would be nice if programmers could write the description without taking care of MapReduce at all!
DSL approach by Ruby
For this kind of description, I created a DSL for each specific usage.
A log analysis DSL is the reference implementation I prepared.
As the DSL runtime environment for Hadoop, I chose Ruby and JRuby, a Ruby runtime that works on the JVM.
Ruby is a very flexible, reusable object-oriented language, so it is very easy to create a DSL processor.
My Project
What do I do?
Hadoop Papyrus
A DSL framework for Hadoop built on JRuby. We can write log analysis code in only several lines.
Open source (Apache License), same as Hadoop.
Hosted on GitHub, distributed via the common Ruby archive site RubyGems.org.
Supported by IPA Mitoh 2009.
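To illustrate the approach, here is a toy sketch of such a DSL in plain Ruby. All names here (LogQuery, count_by, the local runner) are illustrative assumptions, not the actual Hadoop Papyrus API: the user only declares what to count, and the map and reduce steps are derived behind the scenes.

```ruby
# Toy DSL sketch: the user writes a declarative query; the framework
# derives the map and reduce phases. Hypothetical names throughout.
class LogQuery
  def initialize(&spec)
    instance_eval(&spec)   # evaluate the DSL block against this object
  end

  # DSL verb: count records grouped by a field.
  def count_by(field)
    @field = field
  end

  # Derived map phase: emit (field_value, 1) for each record.
  def map(record)
    [record[@field], 1]
  end

  # Derived reduce phase: sum the values collected under each key.
  def reduce(key, values)
    [key, values.sum]
  end

  # Local runner standing in for a Hadoop cluster.
  def run(records)
    records.map { |r| map(r) }
           .group_by(&:first)
           .map { |key, pairs| reduce(key, pairs.map(&:last)) }
           .to_h
  end
end

logs = [
  { path: "/index", status: 200 },
  { path: "/index", status: 404 },
  { path: "/about", status: 200 },
]

query = LogQuery.new { count_by :path }
puts query.run(logs).inspect  # {"/index"=>2, "/about"=>1}
```

The point is that the query block never mentions map or reduce; Ruby's blocks and instance_eval make it easy to build this kind of processor, which is the same idea Hadoop Papyrus applies on top of real Hadoop jobs.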
DEMO
Conclusion
What are we achieving now?
On the way to big challenge
We need a parallel processing method to handle massive web-scale data. MapReduce and Hadoop are good tools, but:
it is difficult to describe Map and Reduce, and writing Java is irritating for some people :-)
Hadoop Papyrus provides the key! A Ruby-based DSL framework for Hadoop: you can write Map and Reduce at once.
Questions?
Thank you very much!
Twitter ID: @fujibee