Spark / SPARK-36967

Report accurate shuffle block size if it is skewed


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.3.0
    • Component/s: Spark Core
    • Labels: None

    Description

      Currently, a map task reports the accurate shuffle block size only when the block size is greater than spark.shuffle.accurateBlockThreshold (100 MB by default). But if there are a large number of map tasks whose shuffle blocks are all smaller than spark.shuffle.accurateBlockThreshold, the resulting data skew may go unrecognized.
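      For reference, the threshold is a regular Spark configuration. A minimal sketch of lowering it as a workaround (the 32 MB value is illustrative, not a recommendation); the trade-off is that more exact sizes are kept per MapStatus, which increases driver memory pressure:

        import org.apache.spark.SparkConf
        import org.apache.spark.sql.SparkSession

        // Lower the accurate-reporting threshold so that medium-sized blocks
        // (e.g. 50 MB) are recorded exactly instead of being averaged away.
        val conf = new SparkConf()
          .set("spark.shuffle.accurateBlockThreshold", (32L * 1024 * 1024).toString)

        val spark = SparkSession.builder()
          .appName("accurate-block-threshold-demo")
          .config(conf)
          .getOrCreate()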

      For example, suppose there are 10000 map tasks and 10000 reduce tasks, and each map task creates a 50 MB shuffle block for reduce 0 and 10 KB shuffle blocks for the remaining reduce tasks. Reduce 0 is skewed, but the statistics of this plan do not carry that information.
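      To make the gap concrete, here is a back-of-the-envelope calculation based on the numbers above (my own illustration): because every individual block stays under the 100 MB threshold, each map status keeps only an average block size, so reduce 0 is estimated at roughly 150 MB even though it actually receives about 488 GB.

        // Plain Scala arithmetic, not Spark code.
        val numMaps     = 10000
        val skewedBlock = 50L * 1024 * 1024   // 50 MB block each map writes for reduce 0
        val normalBlock = 10L * 1024          // 10 KB block each map writes for every other reduce

        // All blocks are below the threshold, so only the average size survives.
        val avgBlock = (skewedBlock + 9999L * normalBlock) / 10000   // ~15 KB

        val estimatedReduce0 = numMaps.toLong * avgBlock     // ~150 MB as seen by the planner
        val actualReduce0    = numMaps.toLong * skewedBlock  // ~488 GB actually shuffled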


      I think we need to judge at runtime whether a shuffle block is huge and therefore needs to be reported accurately.
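      One possible direction, sketched below with a hypothetical helper (my illustration, not the code from the eventual patch): compare each block against its siblings in the same map output and report the exact size for blocks that are far larger than the median, even when they fall below spark.shuffle.accurateBlockThreshold.

        // Hypothetical relative-skew check; the parameter names are made up for illustration.
        def shouldReportAccurately(
            blockSizes: Array[Long],
            accurateBlockThreshold: Long = 100L * 1024 * 1024,
            skewFactor: Double = 5.0): Array[Boolean] = {
          val nonEmpty = blockSizes.filter(_ > 0)
          val median =
            if (nonEmpty.isEmpty) 0L else nonEmpty.sorted.apply(nonEmpty.length / 2)
          blockSizes.map { size =>
            // Keep the absolute rule, and additionally flag blocks that dwarf the median.
            size >= accurateBlockThreshold || (median > 0 && size > median * skewFactor)
          }
        }

      In the example above, the 50 MB block is thousands of times the 10 KB median, so it would be flagged and reported exactly, letting the reduce side see reduce 0's true size.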

      Attachments

        1. map_status2.png (33 kB, Wan Kun)
        2. map_status.png (29 kB, Wan Kun)


          People

            Assignee: Wan Kun (wankun)
            Reporter: Wan Kun (wankun)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: