Spark / SPARK-29938

Add batching in alter table add partition flow

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.4, 2.4.4
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels:
      None

      Description

      When an INSERT query adds a large number of new partitions to a partitioned datasource table, the query sometimes fails with:

      An error was encountered: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out;
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
        at org.apache.spark.sql.hive.HiveExternalCatalog.createPartitions(HiveExternalCatalog.scala:928)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createPartitions(SessionCatalog.scala:798)
        at org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.run(ddl.scala:448)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.refreshUpdatedPartitions$1(InsertIntoHadoopFsRelationCommand.scala:137)
      

      This happens because adding thousands of partitions in a single metastore call takes a long time, and the client eventually times out.

      Adding that many partitions in one call can also cause an OOM in the Hive Metastore (a similar issue in the recover partitions flow has already been fixed).
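
      The batching idea in the issue title can be sketched as follows. This is a minimal, Spark-free illustration and not the actual patch: `PartitionSpec`, `addPartitionsInBatches`, `createBatch`, and the batch size of 100 are all hypothetical names and values chosen for the example. `createBatch` stands in for whatever call sends one group of partitions to the metastore.

```scala
// Hypothetical stand-in for a partition spec such as Map("partition_key" -> "1").
type PartitionSpec = Map[String, String]

// Splits `specs` into groups of at most `batchSize` and invokes `createBatch`
// once per group, so that no single metastore call carries every partition.
// Returns the number of calls made.
def addPartitionsInBatches(
    specs: Seq[PartitionSpec],
    batchSize: Int)(createBatch: Seq[PartitionSpec] => Unit): Int = {
  require(batchSize > 0, "batchSize must be positive")
  var calls = 0
  specs.grouped(batchSize).foreach { batch =>
    createBatch(batch)
    calls += 1
  }
  calls
}

// Usage: 15000 specs, matching the reproduction, with an assumed batch size of 100.
val specs: Seq[PartitionSpec] =
  (1 to 15000).map(i => Map("partition_key" -> i.toString))

// Record the size of every batch instead of talking to a real metastore.
val batchSizes = scala.collection.mutable.ArrayBuffer.empty[Int]
val calls = addPartitionsInBatches(specs, 100) { batch =>
  batchSizes += batch.size
  ()
}
```

      With these assumed values the 15000 partitions are sent in 150 calls of 100 partitions each, which bounds both the duration of each call and the size of each request. A smaller batch size means more round trips but shorter individual calls.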

      Steps to reproduce:

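      // Run in spark-shell, where `spark`, `sc`, and `spark.implicits._` are already in scope.
      case class Partition(data: Int, partition_key: Int)
      // 15000 rows with 15000 distinct partition_key values, so the insert creates 15000 partitions.
      val df = sc.parallelize(1 to 15000, 15000).map(x => Partition(x, x)).toDF
      df.createOrReplaceTempView("temp_table") // replaces the deprecated registerTempTable

      spark.sql("""CREATE TABLE `test_table` (`data` INT, `partition_key` INT) USING parquet PARTITIONED BY (partition_key)""")
      // The insert writes the data and then adds all new partitions to the metastore.
      spark.sql("INSERT OVERWRITE TABLE test_table SELECT * FROM temp_table").collect()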
      

            People

            • Assignee: Prakhar Jain (prakharjain09)
            • Reporter: Prakhar Jain (prakharjain09)