Apache Sedona / SEDONA-64

Broadcast dedupParams in join judgements to reduce task deserialization time


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0

    Description

      Join judgements require large variables (dedupParams) that Spark has to deserialize for every task.

      For very large partition counts this adds significant overhead. For jobs with sedona.join.numpartition=20000 (34k actual tasks), task deserialization takes 200 ms on average.

      By broadcasting dedupParams we've been able to reduce task deserialization time to 11 ms. Total job execution time is reduced by 20%.
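
      To put the per-task figures in perspective, the following back-of-the-envelope sketch sums the deserialization time over all tasks. It assumes "34k" means exactly 34,000 tasks and that the quoted averages apply to every task. The totals are cumulative across tasks, not wall-clock time, since tasks run in parallel.

```python
# Figures quoted in the issue description.
TASKS = 34_000      # "34k actual tasks" (assumed to be exactly 34,000)
BEFORE_MS = 200     # average task deserialization time without broadcasting
AFTER_MS = 11       # average task deserialization time with dedupParams broadcast


def cumulative_seconds(tasks: int, per_task_ms: float) -> float:
    """Deserialization time summed over all tasks, in seconds."""
    return tasks * per_task_ms / 1000.0


before_s = cumulative_seconds(TASKS, BEFORE_MS)
after_s = cumulative_seconds(TASKS, AFTER_MS)
saved_s = before_s - after_s

print(f"before: {before_s:.0f} s, after: {after_s:.0f} s, saved: {saved_s:.0f} s")
# prints "before: 6800 s, after: 374 s, saved: 6426 s"
```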

      https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables


            People

              Assignee: Unassigned
              Reporter: Martin Andersson (umartin)
              Votes: 0
              Watchers: 1


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m