
SPARK-32966: PartitionBy is taking a long time to process


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: PySpark
    • Environment:

      EMR - 5.30.0; Hadoop - 2.8.5; Spark - 2.4.5

      Description

      1. When I do a write without any partitioning, it takes 8 minutes:

      df2_merge.write.mode('overwrite').parquet(dest_path)

       

      2. I added the conf spark.sql.sources.partitionOverwriteMode=dynamic; this took much longer (more than 50 minutes before I force-terminated the EMR cluster). I observed that the partitions had been created and the data files were present, but the EMR cluster still showed the process as running, whereas the Spark history server showed no running or pending jobs. (A sketch of how this conf and the one in step 3 are typically set follows step 4.)

      df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)

       

      3. I then set a new conf, spark.sql.shuffle.partitions=3; this run took 24 minutes (also covered in the sketch after step 4).

      df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)

       

      4. I then disabled the conf and ran the plain write with partitionBy again; it took 30 minutes.

      df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)
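
      For reference, a minimal sketch of how the confs from steps 2 and 3 are typically set at runtime (spark is the active SparkSession; df2_merge, dest_path_latest, and posted_on are the variables used above; the same settings can also be passed with --conf on spark-submit):

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Step 2: dynamic mode overwrites only the partitions present in df2_merge,
      # instead of deleting and rewriting the whole destination directory.
      spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
      df2_merge.write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)

      # Step 3: spark.sql.shuffle.partitions controls the number of partitions
      # produced by shuffles (joins/aggregations), while coalesce(3) caps the
      # number of write tasks, so each posted_on directory gets at most 3 files.
      spark.conf.set("spark.sql.shuffle.partitions", "3")
      df2_merge.coalesce(3).write.mode('overwrite').partitionBy("posted_on").parquet(dest_path_latest)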

       

      The only conf common to all of the above scenarios is spark.sql.adaptive.coalescePartitions.initialPartitionNum=100.
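
      A sketch of how a session-level conf like this is typically supplied at session creation (equivalent to --conf on spark-submit); note that this particular setting is an adaptive-execution conf and only takes effect when adaptive partition coalescing is enabled:

      from pyspark.sql import SparkSession

      # Hypothetical session setup showing where the conf is normally declared
      spark = (SparkSession.builder
               .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "100")
               .getOrCreate())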

      My goal is to reduce the time of the write with partitionBy. Is there anything I am missing?
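
      One approach that often helps here (a hypothetical sketch, not a confirmed fix): repartition by the partition column before writing, so that all rows for a given posted_on value land in the same task and each partition directory typically gets a single output file, at the cost of one extra shuffle:

      # Shuffle by the partition column before the partitioned write, so a write
      # task does not have to open an output file in every posted_on directory.
      (df2_merge
          .repartition("posted_on")
          .write.mode('overwrite')
          .partitionBy("posted_on")
          .parquet(dest_path_latest))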

       

         


            People

            • Assignee:
              Unassigned
            • Reporter:
              Sujit Das
            • Votes:
              0
            • Watchers:
              2
