Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27296

Efficient User Defined Aggregators

Log workAgile BoardRank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete CommentsDelete
    XMLWordPrintableJSON

Details

    • Important

    Description

      Spark's UDAFs appear to be serializing and de-serializing to/from the MutableAggregationBuffer for each row.  This gist shows a small reproducing UDAF and a spark shell session:

      https://gist.github.com/erikerlandson/3c4d8c6345d1521d89e0d894a423046f

      The UDAF and its compantion UDT are designed to count the number of times that ser/de is invoked for the aggregator.  The spark shell session demonstrates that it is executing ser/de on every row of the data frame.

      Note, Spark's pre-defined aggregators do not have this problem, as they are based on an internal aggregating trait that does the correct thing and only calls ser/de at points such as partition boundaries, presenting final results, etc.

      This is a major problem for UDAFs, as it means that every UDAF is doing a massive amount of unnecessary work per row, including but not limited to Row object allocations. For a more realistic UDAF having its own non trivial internal structure it is obviously that much worse.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            eje Erik Erlandson Assign to me
            eje Erik Erlandson
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment