Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27296

Efficient User Defined Aggregators

    XMLWordPrintableJSON

Details

    • Important

    Description

      Spark's UDAFs appear to be serializing and de-serializing to/from the MutableAggregationBuffer for each row.  This gist shows a small reproducing UDAF and a spark shell session:

      https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f

      The UDAF and its compantion UDT are designed to count the number of times that ser/de is invoked for the aggregator.  The spark shell session demonstrates that it is executing ser/de on every row of the data frame.

      Note, Spark's pre-defined aggregators do not have this problem, as they are based on an internal aggregating trait that does the correct thing and only calls ser/de at points such as partition boundaries, presenting final results, etc.

      This is a major problem for UDAFs, as it means that every UDAF is doing a massive amount of unnecessary work per row, including but not limited to Row object allocations. For a more realistic UDAF having its own non trivial internal structure it is obviously that much worse.

      Attachments

        Issue Links

          Activity

            People

              eje Erik Erlandson
              eje Erik Erlandson
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: