Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27296

Efficient User Defined Aggregators

    XMLWordPrintableJSON

    Details

    • Target Version/s:
    • Flags:
      Important

      Description

      Spark's UDAFs appear to be serializing and de-serializing to/from the MutableAggregationBuffer for each row.  This gist shows a small reproducing UDAF and a spark shell session:

      https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f

      The UDAF and its compantion UDT are designed to count the number of times that ser/de is invoked for the aggregator.  The spark shell session demonstrates that it is executing ser/de on every row of the data frame.

      Note, Spark's pre-defined aggregators do not have this problem, as they are based on an internal aggregating trait that does the correct thing and only calls ser/de at points such as partition boundaries, presenting final results, etc.

      This is a major problem for UDAFs, as it means that every UDAF is doing a massive amount of unnecessary work per row, including but not limited to Row object allocations. For a more realistic UDAF having its own non trivial internal structure it is obviously that much worse.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                eje Erik Erlandson
                Reporter:
                eje Erik Erlandson
              • Votes:
                0 Vote for this issue
                Watchers:
                10 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: