Key: HADOOP-2781
Type: New Feature
Status: Open
Priority: Major
Assignee: Unassigned
Reporter: Ted Dunning
Votes: 3
Watchers: 11
Hadoop Common

Hadoop/Groovy integration

Created: 05/Feb/08 12:08 AM   Updated: 23/Apr/08 05:13 PM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
trunk.tgz | 101 kB | 2008-04-23 04:31 PM | Ted Dunning | Licensed for inclusion in ASF works
Environment: Any


Description
This is a placeholder issue to hold the initial release of the Groovy integration for Hadoop.

The goal is to be able to write very simple map-reduce programs in just a few lines of code in a functional style. Word count should be fewer than 5 lines of code!



Ted Dunning added a comment - 28/Feb/08 02:45 AM
Here is the README for grool, a standalone package I will upload shortly.

OVERVIEW

Grool is a simple extension to the Groovy scripting language. It is intended to make
it easier to use Hadoop from a scripted environment and to make it possible to write
simple map-reduce programs in a functional style with much less boilerplate code than
is typically required. Essentially, the goal is to make simple programs simple to write.

For instance, the venerable word-count example boils down to the following using grool:

count = Hadoop.mr({key, text, out, reporter ->
    text.split(" ").each { out.collect(it, 1) }
}, {word, counts, out, reporter ->
    int sum = 0
    counts.each { sum += it }
    out.collect(word, sum)
})

When we call the function count, it will determine whether its input is local or already
in the Hadoop file system and will invoke a map-reduce program. The location of the output
is returned packaged in an object that can be passed to other map-reduce functions like count
or read directly using familiar Groovy code constructs. For instance, to count some literal
text and print the results, we can do this:

count(["this is some test", "data for a simple", "test of a program"]).eachLine { println(it) }

As an example of composition of functions, we can write a simple variant of the word
counting program that tallies how many words have a count in each decade (1-9, 10-99
and so on). This can be written this way:

decade = Hadoop.mr({key, text, out, reporter ->
    out.collect(Math.floor(Math.log10(Integer.parseInt(text.split("\t")[1]))), 1)
}, {decade, counts, out, reporter ->
    int sum = 0
    counts.each { sum += it }
    out.collect(decade, sum)
})

These two programs can be composed and the result printed using familiar functional style:

decade(count(text)).eachLine{ println it }

When we are done, we can clean up any temporary files in the Hadoop file system using the
following command:

Hadoop.cleanup()

If we write code that needs line-by-line debugging, we can change any invocation of
Hadoop.mr to Local.mr and the code will execute locally rather than using Hadoop. Local
and Hadoop-based map-reduce functions can be intermixed arbitrarily and transparently.
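
To make the calling convention above concrete, here is a minimal in-memory sketch of
what a local runner could look like. It is not grool's Local.mr: the names localMr and
Collector are hypothetical, the reporter argument is simply passed as null, and results
are returned as a list of tab-separated lines rather than as a file-like object.

```groovy
// Hypothetical stand-in for an output collector; not part of grool.
class Collector {
    List pairs = []
    void collect(key, value) { pairs << [key, value] }
}

// Hypothetical in-memory runner: takes a map closure and a reduce closure
// and returns a function from input lines to output lines.
def localMr = { Closure mapFn, Closure reduceFn ->
    return { Iterable lines ->
        // Map phase: the key is the line index, the value is the line text.
        def mapped = new Collector()
        lines.eachWithIndex { line, i -> mapFn(i, line, mapped, null) }

        // Shuffle phase: group values by key, keeping keys in sorted order.
        def groups = new TreeMap()
        mapped.pairs.each { pair ->
            if (!groups.containsKey(pair[0])) { groups[pair[0]] = [] }
            groups[pair[0]] << pair[1]
        }

        // Reduce phase: one call per distinct key.
        def reduced = new Collector()
        groups.each { k, vs -> reduceFn(k, vs, reduced, null) }

        // Emit tab-separated lines so that stages can be chained.
        reduced.pairs.collect { pair -> "${pair[0]}\t${pair[1]}".toString() }
    }
}

def count = localMr(
    { key, text, out, reporter -> text.split(" ").each { out.collect(it, 1) } },
    { word, counts, out, reporter -> out.collect(word, counts.sum()) })

def decade = localMr(
    { key, text, out, reporter ->
        int n = Integer.parseInt(text.split("\t")[1])
        out.collect((int) Math.floor(Math.log10(n)), 1)
    },
    { d, counts, out, reporter -> out.collect(d, counts.sum()) })

def input = ["this is some test", "data for a simple", "test of a program"]
count(input).each { println it }
decade(count(input)).each { println it }
```

Because each stage consumes and produces plain text lines, decade(count(input)) composes
in the same way as the Hadoop-backed version is described as doing.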

KNOWN ISSUES and LIMITATIONS

As it stands, grool is useful for some simple tasks, but it badly needs extending. Some
of the limitations include:

  • all input is read using TextInputFormat (easily changed)
  • combiners, partition functions and key sorting aren't supported yet (easily changed)
  • there is essentially no documentation and very little in the way of test code. The
    current code is more of a proof of concept than a serious tool.
  • the current API doesn't allow multiple input files beyond what a single glob expression
    can specify (the current argument parsing code is fugly and should be replaced)
  • there is some speed penalty due to the type conversions performed at the Java/Groovy
    interface and because Groovy code can be slower than pure Java. Currently the penalty
    appears to be at most about 2-3x for applications like log parsing. In my experience, this
    is less than the cost of using, say, Pig. (improving this may be very difficult to do
    without sacrificing the simplicity of writing grool code, but you can provide a Java
    mapper or reducer if you like to avoid essentially all of this overhead)
  • there are bound to be several horrid infelicities in the API as it stands that make grool
    really hard to use in important cases. This is bound to lead to significant and incompatible
    changes in the API. Chief among the suspect decisions is the fact that Hadoop.mr returns
    a closure instead of an object with interesting methods.

HOW IT WORKS

The fundamental difficulty in writing something like grool is that it is hard to write a
script that executes code remotely without expressing the remote code as a string. If you
do that, then you lose all of the syntax checking that the language provides.

I wanted to use a language like Python or Groovy, but I wasn't willing to give up being
able to call the map function easily for testing before composing it into a map-reduce
function. Unfortunately, languages like Groovy and Python don't provide access to the
source of functions or closures and, at least in Groovy's case, these functions may
refer to variables outside of the scope of the function which would mean that the text
of the function wouldn't make sense on the remote node anyway.
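
A small plain-Groovy illustration of the scoping problem (the variable and closure
names here are made up for the example): the closure body mentions threshold, but the
value lives in the enclosing script, so the closure's text alone would not be enough
to run it elsewhere.

```groovy
def threshold = 3                               // lives outside the closure
def isFrequent = { word, n -> n >= threshold }  // refers to the outer variable

assert isFrequent("test", 5)
threshold = 10            // changing the outer variable changes the closure
assert !isFrequent("test", 5)
```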

The solution used by grool to get around this difficulty is to execute the ENTIRE script
multiple times on different machines. Depending on where the script is executing, the
semantics of the functions being executed differ. When the script is executed initially
on the local machine, it primarily copies data to and from the Hadoop file system,
decides what the names of temporary files should be, configures Hadoop jobs and invokes
them. When executed by a mapper's configure method, the script executes all initialization
code in the script and then saves references to all of the mapper functions defined in the
script so that the map method can invoke the correct function. Similar actions occur
in the configure method of each reducer.

This can be a bit confusing because references to external variables don't really refer
to the same variables. In addition, some code only makes sense to execute in the correct
environment. Mostly, this involves code such as writing results which is handled by a
small hack where all map-reduce functions return a reference to a null file when executed
by mappers or reducers. That makes any output loops such as in the simple examples above
execute only a single time. Some other code may require more care. To help with those
cases, there is a function Hadoop.local that accepts a closure to be executed only on
the original machine.
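
The role-dependent behaviour described in this section can be modelled in a few lines
of plain Groovy. This is only a sketch under stated assumptions, not grool's code: Role,
FakeHadoop and the events list are hypothetical, the "null file" is modelled as an empty
list, and no Hadoop job is actually submitted.

```groovy
// Hypothetical model of the roles a script can run in; not part of grool.
enum Role { DRIVER, MAPPER, REDUCER }

class FakeHadoop {
    Role role
    List<Closure> mappers = []    // map closures, in definition order
    List<Closure> reducers = []   // reduce closures, in definition order
    List<String> events = []      // record of side effects, for inspection

    // Every role registers the closures, so job N refers to the same
    // pair of functions on every machine that runs the script.
    Closure mr(Closure mapFn, Closure reduceFn) {
        mappers << mapFn
        reducers << reduceFn
        int jobId = mappers.size() - 1
        return { input ->
            if (role == Role.DRIVER) {
                events << "submit job ${jobId}".toString()
                return ["output of job ${jobId}".toString()]
            }
            // On a task node, behave like an empty file so that output
            // loops in the script have nothing to iterate over.
            return []
        }
    }

    // Run the closure only on the machine that started the script.
    void local(Closure body) {
        if (role == Role.DRIVER) { body() }
    }
}

// The same script body is run once per role.
def script = { FakeHadoop hadoop ->
    def count = hadoop.mr({ k, v, out, r -> }, { k, vs, out, r -> })
    count(["some text"]).each { hadoop.events << "print ${it}".toString() }
    hadoop.local { hadoop.events << "driver-only step" }
}

def driver = new FakeHadoop(role: Role.DRIVER)
def mapper = new FakeHadoop(role: Role.MAPPER)
script(driver)
script(mapper)
```

In this model the driver records a job submission, the output loop and the guarded
step, while the mapper run only registers the closures and produces no side effects.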


Ted Dunning added a comment - 23/Apr/08 04:31 PM

Here is a first cut. There are still many rough edges.

Doug Cutting added a comment - 23/Apr/08 05:13 PM
To be committed as a contrib module, this should:
  • change package to org.apache.hadoop.grool
  • put java sources in src/java/org/apache/hadoop/grool
  • put Groovy jar in lib/groovy-X.X.X.jar
  • put Groovy license in lib/groovy-X.X.X-LICENSE.txt
  • reformat sources and README.txt to fit in 80 columns
  • remove compiled code & test output
  • add any test sources to src/test/org/apache/hadoop/grool
  • add build.xml like those in subdirectories of http://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/. Each of these imports build-contrib.xml and then overrides a few things as needed.
  • run 'ant test' while in src/contrib/grool (after running 'ant compile-core' in Hadoop's root)