Owen O'Malley made changes - 08/Feb/08 06:36 AM
Robert Chansler made changes - 25/Mar/08 03:03 AM
Here is a first cut. There are still many rough edges.
Ted Dunning made changes - 23/Apr/08 04:31 PM
To be committed as a contrib module, this should:
OVERVIEW
Grool is a simple extension to the Groovy scripting language. It is intended to make
it easier to use Hadoop from a scripted environment and to make it possible to write
simple map-reduce programs in functional style with much less boilerplate code than
is typically required. Essentially, the goal is to make simple programs simple to write.
For instance, the venerable word-count example boils down to the following using grool:
count = Hadoop.mr({key, text, out, reporter ->
    text.split(" ").each { out.collect(it, 1) }
}, {word, counts, out, reporter ->
    int sum = 0
    counts.each { sum += it }
    out.collect(word, sum)
})
When we call the function count, it will determine whether its input is local or already
in the Hadoop file system and will invoke a map-reduce program. The location of the output
is returned packaged in an object that can be passed to other map-reduce functions like count
or read directly using familiar Groovy code constructs. For instance, to count some literal
text and print the results, we can do this:
count(["this is some test", "data for a simple", "test of a program"]).eachLine { println(it) }
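To see what the two closures compute without a cluster, the sketch below runs them in plain Groovy against an in-memory collector. PairCollector, the explicit grouping step and the variable names are stand-ins invented for this illustration. They are not part of grool or Hadoop, where grouping and sorting happen inside the framework.

```groovy
// Hypothetical in-memory stand-in for the output collector (not grool's API).
class PairCollector {
    List pairs = []
    void collect(key, value) { pairs << [key, value] }
}

// The same mapper and reducer closures as in the word-count example.
def mapper = { key, text, out, reporter ->
    text.split(" ").each { out.collect(it, 1) }
}
def reducer = { word, counts, out, reporter ->
    int sum = 0
    counts.each { sum += it }
    out.collect(word, sum)
}

def lines = ["this is some test", "data for a simple", "test of a program"]

// Map phase: feed every line to the mapper.
def mapped = new PairCollector()
lines.eachWithIndex { line, i -> mapper(i, line, mapped, null) }

// Grouping phase: collect all values emitted for the same key.
def grouped = new TreeMap()
mapped.pairs.each { pair ->
    if (!grouped.containsKey(pair[0])) { grouped[pair[0]] = [] }
    grouped[pair[0]] << pair[1]
}

// Reduce phase: hand each key and its values to the reducer.
def reduced = new PairCollector()
grouped.each { word, ones -> reducer(word, ones, reduced, null) }

// Turn the emitted pairs into a map for easy inspection.
def totals = reduced.pairs.collectEntries { [(it[0]): it[1]] }
```

Here totals maps each word to its count, so the words "test" and "a" both map to 2.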
As an example of composition of functions, we can write a simple variant of the word
counting program that counts the prevalence of counts in each decade (1-10, 10-100
and so on). This can be written this way:
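decade = Hadoop.mr({key, text, out, reporter ->
    // Assumes each input line is word<TAB>count, as emitted by count.
    // Math.log10 gives the power of ten; Math.log would be the natural logarithm.
    out.collect(Math.floor(Math.log10(new Integer(text.split("\t")[1]))), 1)
}, {decade, counts, out, reporter ->
    int sum = 0
    counts.each { sum += it }
    out.collect(decade, sum)
})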
These two programs can be composed and the result printed using familiar functional style:
decade(count(text)).eachLine{ println it }
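The bucketing arithmetic can be checked on its own. The sketch below assumes, as the description of decades suggests, that a count n belongs to bucket floor(log10(n)). Note that Math.log in Java and Groovy is the natural logarithm, so the sketch uses Math.log10. The names bucket, sampleCounts and histogram are invented for the illustration.

```groovy
// Bucket 0 holds counts 1-9, bucket 1 holds 10-99, bucket 2 holds 100-999, ...
def bucket = { int n -> (int) Math.floor(Math.log10(n)) }

// Made-up word counts, used only to exercise the bucketing.
def sampleCounts = [1, 2, 9, 10, 42, 100, 250]

// Number of counts that fall into each decade.
def histogram = sampleCounts.countBy { bucket(it) }
```

With these sample values, three counts land in bucket 0 and two each in buckets 1 and 2.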
When we are done, we can clean up any temporary files in the Hadoop file system using the
following command:
Hadoop.cleanup()
If our code needs line-by-line debugging, we can change any invocation of
Hadoop.mr to Local.mr and the code will execute locally rather than on Hadoop. Local
and Hadoop-based map-reduce functions can be intermixed arbitrarily and transparently.
KNOWN ISSUES and LIMITATIONS
As it stands, grool is useful for some simple tasks, but it badly needs extending. Some
of the limitations include:
- current code is more of a proof of concept than a serious tool.
- can specify (the current argument parsing code is fugly and should be replaced)
- interface and because Groovy code can be slower than pure Java. Currently the penalty
  appears to be at most about 2-3x for applications like log parsing. In my experience, this
  is less than the cost of using, say, Pig. (Improving this may be very difficult to do
  without sacrificing the simplicity of writing grool code, but you can provide a Java
  mapper or reducer if you like to avoid essentially all of this overhead.)
- really hard to use in important cases. This is bound to lead to significant and incompatible
  changes in the API. Chief among the suspect decisions is the fact that Hadoop.mr returns
  a closure instead of an object with interesting methods.
HOW IT WORKS
The fundamental difficulty in writing something like grool is that it is hard to write a
script that executes code remotely without expressing the remote code as a string. If you
do that, then you lose all of the syntax checking that the language provides.
I wanted to use a language like Python or Groovy, but I wasn't willing to give up being
able to call the map function easily for testing before composing it into a map-reduce
function. Unfortunately, languages like Groovy and Python don't provide access to the
source of functions or closures and, at least in Groovy's case, these functions may
refer to variables outside of the scope of the function which would mean that the text
of the function wouldn't make sense on the remote node anyway.
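A small plain-Groovy illustration of the scoping problem: the closure below depends on a variable defined outside it, so its text alone would be meaningless on a machine where that variable does not exist. The names threshold and isLong are invented for the example.

```groovy
// A variable defined in the enclosing script, outside the closure.
def threshold = 3

// The closure body mentions threshold, which is not one of its parameters.
def isLong = { String word -> word.length() > threshold }

def before = isLong("grool")   // evaluated while threshold is 3

// Changing the outer variable changes what the closure computes.
threshold = 10
def after = isLong("grool")    // evaluated while threshold is 10
```

The same closure gives different answers before and after the assignment, which shows that its behaviour is tied to the enclosing script and not just to its own source text.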
The solution used by grool to get around this difficulty is to execute the ENTIRE script
multiple times on different machines. Depending on where the script is executing, the
semantics of the functions being executed differ. When the script is executed initially
on the local machine, it primarily copies data to and from the Hadoop file system,
decides what the names of temporary files should be, configures Hadoop jobs and invokes
them. When executed by a mapper's configure method, the script executes all initialization
code in the script and then saves references to all of the mapper functions defined in the
script so that the map method can invoke the correct function. Similar actions occur
in the configure method of each reducer.
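The idea of running one script under several roles can be modelled with a toy in plain Groovy. This is only a sketch under loud assumptions: ToyRuntime, role, launched and the list-based output are all invented here and do not reflect grool's actual classes. The point is that the same script body either launches a job or merely registers its mapper, depending on where it runs.

```groovy
// Hypothetical toy model; none of these names are grool's API.
class ToyRuntime {
    String role            // "driver" or "mapper"
    List mappers = []      // mapper closures, in the order the script defines them
    List launched = []     // ids of jobs the driver would submit

    Closure mr(Closure mapper, Closure reducer) {
        mappers << mapper
        int jobId = mappers.size() - 1
        // The returned function only launches work when running as the driver.
        return { input ->
            if (role == "driver") { launched << jobId }
            return jobId
        }
    }
}

// The "script": identical code is executed under every role.
def script = { runtime ->
    def upper = runtime.mr({ k, v, out, r -> out << v.toUpperCase() },
                           { k, vs, out, r -> })
    upper(["some input"])
}

// On the original machine the script launches job 0.
def driver = new ToyRuntime(role: "driver")
script(driver)

// In a mapper the script launches nothing but leaves the mapper registered,
// so a map call can look it up by position and invoke it.
def worker = new ToyRuntime(role: "mapper")
def jobId = script(worker)
def emitted = []
worker.mappers[jobId].call("key", "hello", emitted, null)
```

Because the script runs from the top in each role, functions are defined in the same order everywhere, which is what lets the toy identify a mapper by its position.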
This can be a bit confusing because references to external variables don't really refer
to the same variables. In addition, some code only makes sense to execute in the correct
environment. Mostly, this involves code such as writing results, which is handled by a
small hack: all map-reduce functions return a reference to a null file when executed
by mappers or reducers. That makes any output loops such as in the simple examples above
execute only a single time. Some other code may require more care. To help with those
cases, there is a function Hadoop.local that accepts a closure to be executed only on
the original machine.
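The guard behaviour described for Hadoop.local can be sketched with a stand-in class. Only the idea of a function that takes a closure and runs it on the original machine comes from the text above. ToyEnv, onOriginalMachine and sideEffects are hypothetical names, and how grool really detects its environment is not shown here.

```groovy
// Hypothetical stand-in for the environment check (not grool's implementation).
class ToyEnv {
    boolean onOriginalMachine

    // Runs the closure only on the original machine; otherwise does nothing.
    def local(Closure body) {
        onOriginalMachine ? body() : null
    }
}

def sideEffects = []

// On the original machine the guarded block runs.
new ToyEnv(onOriginalMachine: true).local { sideEffects << "wrote report" }

// Inside a mapper or reducer the same script line is silently skipped.
new ToyEnv(onOriginalMachine: false).local { sideEffects << "wrote report" }
```

After both calls sideEffects holds a single entry, since only the first environment executed the block.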