Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-21039

Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.3.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      Currently, DataFrame.stat.bloomFilter uses RDD.aggregate, which means that the bloom filters received for each partition of data are merged in the driver. The cost of this operation can be very high if the bloom filters are large. It would be nice if it used RDD.treeAggregate instead, in order to parallelize the operation of merging the bloom filters.

        Attachments

          Activity

            People

            • Assignee:
              lovasoa Lovasoa
              Reporter:
              lovasoa Lovasoa
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: