Description
When a large number of new partitions is added by an INSERT query on a partitioned datasource table, the query sometimes fails with:
An error was encountered:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out;
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
  at org.apache.spark.sql.hive.HiveExternalCatalog.createPartitions(HiveExternalCatalog.scala:928)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createPartitions(SessionCatalog.scala:798)
  at org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand.run(ddl.scala:448)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.refreshUpdatedPartitions$1(InsertIntoHadoopFsRelationCommand.scala:137)
This happens because adding thousands of partitions in a single metastore call takes a long time, and the client eventually times out.
Adding a very large number of partitions in one call can also lead to an OOM in the Hive Metastore (a similar issue in the recover-partitions flow has already been fixed).
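As a stopgap, the metastore client's read timeout can be raised so a single large createPartitions call has more time to finish; this only postpones the timeout and does not help the metastore-side memory pressure. A minimal sketch, assuming the Hive property hive.metastore.client.socket.timeout is honored by the metastore client in your deployment and that 1800s is an acceptable value (both are assumptions, not part of this issue):

// Sketch only: raise the Hive metastore client socket timeout for this session.
// The property name comes from Hive; the value and the spark.hadoop.* passthrough
// are assumptions about the deployment, not a recommendation from this issue.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.hadoop.hive.metastore.client.socket.timeout", "1800s")
  .getOrCreate()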
Steps to reproduce:
case class Partition(data: Int, partition_key: Int)

val df = sc.parallelize(1 to 15000, 15000).map(x => Partition(x, x)).toDF
df.registerTempTable("temp_table")

spark.sql("""CREATE TABLE `test_table` (`data` INT, `partition_key` INT)
  USING parquet
  PARTITIONED BY (partition_key)""")

spark.sql("INSERT OVERWRITE TABLE test_table select * from temp_table").collect()
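For comparison, registering the same partitions in smaller chunks keeps each metastore call short. The sketch below only illustrates the batching idea against the reproduction table; the batch size of 100 and the use of ALTER TABLE ... ADD IF NOT EXISTS PARTITION from SQL are assumptions for illustration, not the change proposed here:

// Sketch: register partition_key values 1..15000 in batches of 100 so that
// no single metastore RPC is large enough to hit the client read timeout.
val batchSize = 100  // assumed value; tune for your metastore

(1 to 15000).grouped(batchSize).foreach { batch =>
  val partitionSpecs = batch
    .map(v => s"PARTITION (partition_key=$v)")
    .mkString(" ")
  // One ALTER TABLE statement per batch of partition specs.
  spark.sql(s"ALTER TABLE test_table ADD IF NOT EXISTS $partitionSpecs")
}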