Spark / SPARK-20920

ForkJoinPool pools are leaked when writing hive tables with many partitions

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.1.2, 2.2.0
    • Component/s: SQL
    • Labels: None

    Description

      This bug is loosely related to SPARK-17396.

      In this case it happens when writing to a Hive table with a very large number of partitions (my table is partitioned by hour and stores data received from Kafka in a Spark Streaming application):

      df.repartition()
      .write
      .format("orc")
      .option("path", s"$tablesStoragePath/$tableName")
      .mode(SaveMode.Append)
      .partitionBy("dt", "hh")
      .saveAsTable(tableName)

      As this table grows beyond a certain size, ForkJoinPool instances start leaking. Examining the application with a debugger, I found that the caller is AlterTableRecoverPartitionsCommand and that the problem occurs where `evalTaskSupport` is used (line 555). When I set a very large threshold via `spark.rdd.parallelListingThreshold`, the problem went away.
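      For reference, the workaround described above could be applied when the session is built. This is only a sketch: the application name and the threshold value are arbitrary placeholders, not values taken from this report, and the threshold simply needs to exceed the number of directories being listed.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the workaround: raise the listing threshold so the parallel
// (ForkJoinPool-backed) listing path is not taken.
// "partitioned-writer" and "1000000" are hypothetical placeholder values.
val spark = SparkSession.builder()
  .appName("partitioned-writer")
  .config("spark.rdd.parallelListingThreshold", "1000000")
  .enableHiveSupport()
  .getOrCreate()
```

      This only sidesteps the leak by avoiding the parallel code path; it does not release pools that were already created.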

      My assumption is that the problem occurs here, and not in the SPARK-17396 case, because AlterTableRecoverPartitionsCommand is a case class, so every instance can end up with its own pool, while UnionRDD is an object, so multiple instances are not possible and therefore nothing leaks.

      Regards,
      Rares
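
      The assumption in the description can be illustrated with a small standalone sketch. It does not use Spark's code: `PerInstanceCommand`, `SharedPool`, `SharedCommand` and the pool size of 8 are hypothetical stand-ins, chosen only to contrast a pool created per case class instance with a pool held once in a singleton object.

```scala
import java.util.concurrent.ForkJoinPool

// Hypothetical stand-in for a command that creates its own pool per instance.
case class PerInstanceCommand(table: String) {
  // Each instance gets a separate pool; nothing here ever shuts it down.
  lazy val pool: ForkJoinPool = new ForkJoinPool(8)
}

// Hypothetical stand-in for a pool held once in a singleton object.
object SharedPool {
  lazy val pool: ForkJoinPool = new ForkJoinPool(8)
}

// Hypothetical command that reuses the single shared pool.
case class SharedCommand(table: String) {
  def pool: ForkJoinPool = SharedPool.pool
}

object PoolLeakSketch {
  def main(args: Array[String]): Unit = {
    // Simulate five invocations of each kind of command.
    val perInstance = (1 to 5).map(i => PerInstanceCommand(s"t$i").pool)
    val shared = (1 to 5).map(i => SharedCommand(s"t$i").pool)

    // ForkJoinPool uses reference equality, so distinct counts pool objects.
    println(s"per-instance pools: ${perInstance.distinct.size}") // prints 5
    println(s"shared pools: ${shared.distinct.size}")            // prints 1

    // Clean up so the sketch itself does not leave pools behind.
    (perInstance ++ shared).distinct.foreach(_.shutdown())
  }
}
```

      The sketch only counts pool objects. In a long-running streaming job each such pool would also keep worker threads alive once tasks had been submitted to it, which is how the growth would become visible.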

          People

            Assignee: Sean R. Owen (srowen)
            Reporter: Rares Mirica (mrares)
            Votes: 0
            Watchers: 3
