
[SPARK-8968] Dynamic partitioning in Spark SQL has poor performance due to high GC overhead


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, dynamic partitioning shows poor performance on large data sets because of GC/memory overhead. This is because each task opens a separate writer for every partition value it encounters, which produces many small files and causes heavy GC pressure. We can shuffle the data by the partition columns so that each partition ends up with only one output file, which also reduces the GC overhead. A user-level sketch of the same idea is shown below.
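      A minimal user-level sketch of that idea, assuming a hypothetical input data set under /data/events with a partition column "dt" (the path and column name are placeholders, not taken from this issue). Repartitioning by the partition column before a partitioned write routes all rows for a given "dt" value to a single task, so each partition directory ends up with one file instead of many small ones. The sketch uses the Spark 2.0 DataFrame API; the issue itself proposes doing the equivalent shuffle inside the write path rather than requiring it from the user.

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.col

        object DynamicPartitionWrite {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("dynamic-partition-write")
              .getOrCreate()

            // Hypothetical input data set with a partition column "dt".
            val events = spark.read.parquet("/data/events")

            // Without a prior shuffle, each write task opens one writer per
            // partition value it sees, producing many small files and high GC
            // pressure. Repartitioning by the partition column first sends all
            // rows for a given "dt" to one task, so each partition directory
            // receives a single output file.
            events
              .repartition(col("dt"))
              .write
              .partitionBy("dt")
              .parquet("/data/events_by_dt")

            spark.stop()
          }
        }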


            People

              Assignee: Fei Wang (scwf)
              Reporter: Fei Wang (scwf)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved: