Description
One of the great potential advantages of using Scala for writing MapReduce pipelines is the ability to send side data as part of function closures, rather than through Hadoop Configurations or the Distributed Cache. As an absurdly simple example, consider the following Scala PipelineApp that divides all elements of a numeric PCollection by an arbitrary argument:
object DivideApp extends PipelineApp {
  val divisor = Integer.valueOf(args(0))
  val nums = read(From.textFile("numbers.txt"))
  val dividedNums = nums.map(_.toInt / divisor)
  dividedNums.write(To.textFile("dividedNums"))
  run()
}
Executing this PipelineApp fails: the MapReduce tasks see a value of null for divisor (or 0 if divisor is forced to a primitive numeric type). This indicates an error in the serialization of Scala function closures that causes unbound variables in the closure to take on their default JVM values.
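The behavior the pipeline depends on can be checked outside MapReduce by round-tripping a closure through plain Java serialization. The sketch below is illustrative only (ClosureDemo and roundTrip are not part of Scrunch): a closure that captures a local val carries that value across serialization, whereas the bug above suggests that a val belonging to the enclosing PipelineApp object does not survive the trip. Assuming that is the cause, copying the object field into a local val before calling map is a common defensive pattern.

```scala
import java.io._

object ClosureDemo {
  // Serialize an object to bytes and read it back, as a stand-in for
  // shipping a closure to a remote MapReduce task.
  def roundTrip[T](obj: T): T = {
    val bos = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bos)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    in.readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    // A local val is copied into the closure itself, so it survives
    // serialization intact.
    val divisor = 2
    val divide: Int => Int = n => n / divisor
    val revived = roundTrip(divide)
    println(revived(10)) // prints 5
  }
}
```

If the same closure instead reads a field of the enclosing object (as DivideApp's map function reads divisor), the serialized closure holds a reference to that object, and any failure to serialize or restore the object's fields would explain the null/0 values observed in the tasks.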