SPARK-20920

ForkJoinPool pools are leaked when writing hive tables with many partitions


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.1.2, 2.2.0
    • Component/s: SQL
    • Labels: None

      Description

      This bug is loosely related to SPARK-17396.

      In this case it happens when writing to a Hive table with many partitions (my table is partitioned by hour and stores data it receives from Kafka in a Spark Streaming application):

      df.repartition()
        .write
        .format("orc")
        .option("path", s"$tablesStoragePath/$tableName")
        .mode(SaveMode.Append)
        .partitionBy("dt", "hh")
        .saveAsTable(tableName)

      As this table grows beyond a certain size, ForkJoinPool instances start leaking. Upon examination with a debugger I found that the caller is AlterTableRecoverPartitionsCommand and that the problem occurs when `evalTaskSupport` is used (line 555). Setting a very large threshold via `spark.rdd.parallelListingThreshold` made the problem go away.
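
      For anyone hitting this on an affected version, here is a minimal sketch of that workaround, assuming the threshold is read from the Spark configuration when the command chooses between serial and parallel partition listing (the exact read site may differ between versions, and the app name below is just a placeholder):

      import org.apache.spark.sql.SparkSession

      // Workaround sketch: raise spark.rdd.parallelListingThreshold well above the
      // expected partition count so the listing stays on the serial code path and
      // no per-command ForkJoinPool is created.
      val spark = SparkSession.builder()
        .appName("partitioned-hive-writer")   // hypothetical app name
        .config("spark.rdd.parallelListingThreshold", Int.MaxValue.toString)
        .enableHiveSupport()
        .getOrCreate()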

      My assumption is that the problem happens in this case and not in the one described in SPARK-17396 because AlterTableRecoverPartitionsCommand is a case class, so a new ForkJoinPool is created for every command instance, whereas in UnionRDD the pool lives in an object, so multiple instances are not possible and there is no leak there.
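
      The following is not the actual Spark code, only a minimal sketch of that difference using hypothetical names:

      import java.util.concurrent.ForkJoinPool

      // Leaky pattern: each command instance builds its own pool and never shuts it
      // down, so every invocation leaves a live ForkJoinPool (and its threads) behind.
      case class LeakyCommand(partitions: Seq[String]) {
        private val pool = new ForkJoinPool(8)   // created once per instance
        def run(): Unit = {
          // ... parallel partition listing would use `pool` here ...
          // missing: pool.shutdown()
        }
      }

      // Non-leaky pattern: the pool lives in an object, so there is exactly one
      // instance for the lifetime of the JVM no matter how many commands run.
      object SharedListingPool {
        val pool = new ForkJoinPool(8)
      }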

      Regards,
      Rares


            People

            • Assignee: srowen Sean R. Owen
            • Reporter: mrares Rares Mirica
            • Votes: 0
            • Watchers: 3
