Crunch (Retired) / CRUNCH-73

Scrunch applications using PipelineApp do not properly serialize closures to MapReduce tasks.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.4.0
    • Component/s: Scrunch
    • Labels: None

    Description

      One of the great potential advantages of using Scala for writing MapReduce pipelines is the ability to send side data as part of function closures, rather than through Hadoop Configurations or the Distributed Cache. As an absurdly simple example, consider the following Scala PipelineApp that divides all elements of a numeric PCollection by an arbitrary argument:

      object DivideApp extends PipelineApp {
        val divisor = Integer.valueOf(args(0))
        val nums = read(From.textFile("numbers.txt"))
        val dividedNums = nums.map { n => n / divisor }
        dividedNums.write(To.textFile("dividedNums"))
        run()
      }

      Executing this PipelineApp fails: MapReduce tasks see a value of "null" for divisor (or 0 if divisor is forced to be a primitive numeric type). This indicates an error in the serialization of Scala function closures that causes variables captured by the closure to take on their default JVM values.
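      For illustration only (this is not the approach taken in the attached patches), the usual Scala idiom for avoiding an accidental capture of the enclosing singleton is to close over a local value instead of an object field. In the sketch below, buildPipeline is a hypothetical helper added for this example; read, From, and To are the same Scrunch calls used above. Assuming the closure serializer handles captured locals correctly, the function then carries the divisor's value itself, and nothing depends on DivideApp being re-initialized with args on the task side:

      object DivideApp extends PipelineApp {
        // Hypothetical sketch: build the pipeline inside a method so the closure
        // captures the method parameter (a plain Int copied into the function
        // object) rather than a field of the non-serialized DivideApp singleton.
        def buildPipeline(divisor: Int): Unit = {
          val nums = read(From.textFile("numbers.txt"))
          // `n / divisor` references only the local parameter, so the compiled
          // closure holds the divisor's value and no reference back to DivideApp.
          val dividedNums = nums.map { n => n / divisor }
          dividedNums.write(To.textFile("dividedNums"))
        }

        // args(0).toInt yields a primitive Int, per the note above about defaults.
        buildPipeline(args(0).toInt)
        run()
      }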

      Attachments

        1. CRUNCH-73-v1.patch
          7 kB
          Kiyan Ahmadizadeh
        2. CRUNCH-73-v2.patch
          10 kB
          Kiyan Ahmadizadeh


          People

            Assignee: Kiyan Ahmadizadeh
            Reporter: Kiyan Ahmadizadeh
            Votes: 0
            Watchers: 3
