SPARK-18917

DataFrame - Timeout issues / taking a long time in append mode on object stores


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.2, 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: EC2, Spark Core, SQL, YARN
    • Labels: None
    • Flags: Patch, Important

Description

When using DataFrame write in append mode on object stores (S3 / Google Storage), writes take a long time or fail with read timeouts. This is because dataframe.write lists all leaf folders in the target directory, and when partitioning has produced a large number of subfolders, that listing takes forever.
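For illustration, a minimal sketch of the write pattern that triggers the listing (the bucket, paths, and partition columns here are hypothetical):

  import org.apache.spark.sql.{SaveMode, SparkSession}

  val spark = SparkSession.builder().appName("append-demo").getOrCreate()

  // Hypothetical incoming batch to append.
  val batch = spark.read.parquet("s3a://my-bucket/incoming/batch-001")

  // In Append mode, DataSource.write() first resolves the existing relation,
  // which lists every leaf directory under the target to infer the partition
  // spec. With thousands of partition subfolders on S3, that means thousands
  // of RPC calls before a single byte is written.
  batch.write
    .mode(SaveMode.Append)
    .partitionBy("year", "month", "day")
    .parquet("s3a://my-bucket/events")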

In org.apache.spark.sql.execution.datasources.DataSource.write(), the following code causes a huge number of RPC calls when the file system is an object store (S3, GS):
  if (mode == SaveMode.Append) {
    val existingPartitionColumns = Try {
      resolveRelation()
        .asInstanceOf[HadoopFsRelation]
        .location
        .partitionSpec()
        .partitionColumns
        .fieldNames
        .toSeq
    }.getOrElse(Seq.empty[String])
    // ... (rest of the append branch)
  }
There should be a flag to skip the partition match check in append mode; a sketch of what that could look like follows. I can work on the patch.
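For illustration only, a hedged sketch of such a skip inside DataSource.write(), assuming the surrounding class exposes sparkSession as DataSource does; the conf key name is hypothetical, not an actual Spark setting:

  // Hypothetical flag (name invented for this sketch): when enabled, skip
  // resolving the existing partition columns and avoid the full leaf listing.
  val skipCheck = sparkSession.conf
    .get("spark.sql.sources.append.skipPartitionMatchCheck", "false").toBoolean

  val existingPartitionColumns =
    if (skipCheck) {
      Seq.empty[String] // trust the partitionBy columns supplied by the caller
    } else {
      Try {
        resolveRelation()
          .asInstanceOf[HadoopFsRelation]
          .location
          .partitionSpec()
          .partitionColumns
          .fieldNames
          .toSeq
      }.getOrElse(Seq.empty[String])
    }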

Attachments

Issue Links

Activity

People

    Assignee: rxin Reynold Xin
    Reporter: alunarbeach Anbu Cheeralan
    Votes: 0
    Watchers: 7

Dates

    Created:
    Updated:
    Resolved:

Time Tracking

    Original Estimate: 72h
    Remaining Estimate: 72h
    Time Spent: Not Specified