CRUNCH-73

Scrunch applications using PipelineApp do not properly serialize closures to MapReduce tasks.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.4.0
    • Component/s: Scrunch
    • Labels:
      None

      Description

      One of the great potential advantages of using Scala for writing MapReduce pipelines is the ability to send side data as part of function closures, rather than through Hadoop Configurations or the Distributed Cache. As an absurdly simple example, consider the following Scala PipelineApp that divides all elements of a numeric PCollection by an arbitrary argument:

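      object DivideApp extends PipelineApp {
        // Side data: captured by the closure passed to map below.
        val divisor = Integer.valueOf(args(0))
        val nums = read(From.textFile("numbers.txt"))
        // Lines of a text file are read as strings, so parse each one before dividing.
        val dividedNums = nums.map { n => n.toInt / divisor }
        dividedNums.write(To.textFile("dividedNums"))
        run()
      }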

      Executing this PipelineApp fails: MapReduce tasks see a value of "null" for divisor (or 0 if divisor is forced to be a primitive numeric type). This suggests that the serialization of Scala function closures loses the values of the free variables the closure captures, so those variables fall back to their default JVM values on the task side.
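
For comparison, here is a minimal sketch of the behavior the report expects: a closure that keeps its captured value after a serialization round trip. It uses plain Java serialization in a single JVM, not Crunch's actual mechanism for shipping functions to tasks, and it does not reproduce the bug itself. The names ClosureRoundTrip, toBytes, fromBytes and makeDivider are hypothetical and exist only for this illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ClosureRoundTrip {
  // Serialize an object to a byte array with plain Java serialization.
  def toBytes(obj: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.toByteArray
  }

  // Read an object back from a byte array.
  def fromBytes[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // The free variable `divisor` is a method parameter, so its value is
  // stored inside the function object and travels with it when serialized.
  def makeDivider(divisor: Int): Int => Int = n => n / divisor

  def main(args: Array[String]): Unit = {
    val divide = makeDivider(4)
    val restored = fromBytes[Int => Int](toBytes(divide))
    println(restored(20)) // prints 5
  }
}
```

In this sketch the divisor is a local value, so it is copied into the function object. Whether the failure reported here comes from divisor being a member of the PipelineApp object rather than a local value is an assumption this issue does not confirm.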

        Attachments

        1. CRUNCH-73-v1.patch
          7 kB
          Kiyan Ahmadizadeh
        2. CRUNCH-73-v2.patch
          10 kB
          Kiyan Ahmadizadeh

            People

            • Assignee: Kiyan Ahmadizadeh
            • Reporter: Kiyan Ahmadizadeh
            • Votes: 0
            • Watchers: 3
