  1. Crunch
  2. CRUNCH-485

groupByKey on Spark incorrect if key is Avro record with defined sort order

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.12.0
    • Component/s: Core
    • Labels:
      None

      Description

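      groupByKey on Spark produces incorrect results when the key type is an Avro record with a defined sort order (http://avro.apache.org/docs/1.7.7/spec.html#order).

      Rather than honoring the sort order, it serializes the entire Avro record to a binary blob (byte array) and groups identical blobs. As a result, two keys that differ only in a field marked "order": "ignore" end up in separate groups, even though they compare as equal under Avro's sort order. By contrast, groupByKey on MapReduce takes Avro's sort order into account and works as expected.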

      The culprit is probably the following code from org.apache.crunch.impl.spark.collect.PGroupedTableImpl#getJavaRDDLikeInternal:

      groupedRDD = parentRDD.map(new PairMapFunction(ptype.getOutputMapFn(), runtime.getRuntimeContext()))
          .mapToPair(new MapOutputFunction(keySerde, valueSerde))
          .groupByKey(numPartitions);

      where MapOutputFunction simply converts the entire key object to a binary blob, without taking sort order into account.
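
The difference between the two grouping strategies can be sketched in plain Java. This is only a model of the behavior described above, not Crunch, Spark, or Avro code: the `Key` class, its `note` field (standing in for a field with order "ignore"), and the helper names are all hypothetical, and the serialization is a naive stand-in for the real key serde.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortOrderGroupingSketch {

    /** Hypothetical key: 'id' is ordered, 'note' stands in for a field with order "ignore". */
    static final class Key {
        final int id;
        final String note;

        Key(int id, String note) {
            this.id = id;
            this.note = note;
        }
    }

    /** Naive serialization of every field, analogous to turning the whole key into a blob. */
    static ByteBuffer toBlob(Key k) {
        byte[] note = k.note.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + note.length);
        buf.putInt(k.id).put(note);
        buf.flip(); // make the written bytes the buffer's readable content
        return buf;
    }

    /** Groups keys whose blobs are byte-for-byte identical (ByteBuffer compares by content). */
    static int countGroupsByBlob(List<Key> keys) {
        Map<ByteBuffer, List<Key>> groups = new HashMap<>();
        for (Key k : keys) {
            groups.computeIfAbsent(toBlob(k), b -> new ArrayList<>()).add(k);
        }
        return groups.size();
    }

    /** Groups keys that compare as equal under an ordering that skips the ignored field. */
    static int countGroupsBySortOrder(List<Key> keys) {
        Comparator<Key> order = Comparator.comparingInt(k -> k.id);
        Map<Key, List<Key>> groups = new TreeMap<>(order);
        for (Key k : keys) {
            groups.computeIfAbsent(k, x -> new ArrayList<>()).add(k);
        }
        return groups.size();
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(new Key(1, "a"), new Key(1, "b"), new Key(2, "a"));
        System.out.println("groups by blob: " + countGroupsByBlob(keys));
        System.out.println("groups by sort order: " + countGroupsBySortOrder(keys));
    }
}
```

With the three sample keys, blob grouping yields three groups while order-aware grouping yields two, because the keys with `id` 1 differ only in the ignored field.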

        Attachments

        1. CRUNCH-485.patch
          11 kB
          Josh Wills
        2. CRUNCH-485b.patch
          14 kB
          Josh Wills

            People

            • Assignee:
              jwills Josh Wills
              Reporter:
              tychol Tycho Lamerigts