Spark / SPARK-35332

Shuffle partitions are not coalesced when caching a table

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.1, 3.1.0, 3.1.1
    • Fix Version/s: 3.2.0
    • Component/s: Shuffle
    • Labels: None
    • Environment: latest Spark version

    Description

      How to reproduce the problem

      Linux shell command to prepare the data:
      for i in $(seq 200000);do echo "$(($i+100000)),name$i,$(($i*10))";done > data.text
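
As a quick sanity check on the generated file, the sketch below regenerates the data and confirms the row count and the first and last rows. Writing to a scratch directory created with mktemp is an assumption made here so that nothing is overwritten; the report itself writes data.text to the current directory.

```shell
#!/bin/sh
# Regenerate the test data in a scratch directory (assumed location,
# not the path used in the report).
workdir=$(mktemp -d)
data_file="$workdir/data.text"

# Same loop as in the report: one "id,str,num" row per value of i.
for i in $(seq 200000); do
  echo "$((i + 100000)),name$i,$((i * 10))"
done > "$data_file"

# Collect a few facts about the file for inspection.
row_count=$(wc -l < "$data_file" | tr -d ' ')
first_row=$(head -n 1 "$data_file")
last_row=$(tail -n 1 "$data_file")

echo "rows:  $row_count"
echo "first: $first_row"
echo "last:  $last_row"
```

Every str value is distinct, so the GROUP BY in the reproduction returns one row per input row.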

      SQL to reproduce the problem:

      • create table data_table(id int, str string, num int) row format delimited fields terminated by ',';
      • load data local inpath '/path/to/data.text' into table data_table;
      • CACHE TABLE test_cache_table AS
        SELECT str
        FROM (SELECT id, str FROM data_table)
        GROUP BY str;

      The resulting job contains a stage with 200 tasks, matching the default spark.sql.shuffle.partitions, which shows that the shuffle partitions were not coalesced. This wastes resources when the data size is small.
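
Shuffle partition coalescing is part of adaptive query execution (AQE), which is disabled by default in the affected versions listed above, so the reproduction presumably ran with it switched on. The sketch below writes the statements to a script file with the two AQE settings made explicit, and adds an uncached aggregation for comparison in the Spark UI. The file name repro.sql, the explicit SET statements, and the comparison query are assumptions added here and are not part of the original report.

```shell
#!/bin/sh
# Write the reproduction to a SQL script in a scratch directory.
# repro.sql is a hypothetical file name chosen for this sketch.
sql_file="$(mktemp -d)/repro.sql"

# Statement 1-2: assumed settings, turning on AQE and partition coalescing.
# Statement 3:   comparison query (an addition), same aggregation, not cached.
# Statement 4:   the CACHE TABLE statement from the report.
cat > "$sql_file" <<'EOF'
SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.coalescePartitions.enabled=true;
SELECT count(*) FROM (SELECT str FROM data_table GROUP BY str);
CACHE TABLE test_cache_table AS
SELECT str FROM (SELECT id, str FROM data_table) GROUP BY str;
EOF

# Run the script only if the spark-sql CLI is on PATH; data_table must
# already exist and be loaded as described in the report.
if command -v spark-sql >/dev/null 2>&1; then
  spark-sql -f "$sql_file"
else
  echo "spark-sql not found; statements written to $sql_file"
fi
```

Comparing the task counts of the two aggregation stages in the Spark UI should make it visible whether the cached plan kept all 200 shuffle partitions.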

      Attachments

        1. cacheTable.png
          360 kB
          Xianghao Lu

          People

            Assignee: XiDuo You (ulysses)
            Reporter: Xianghao Lu (luxianghao)
            Votes: 0
            Watchers: 7
