[SPARK-20920] ForkJoinPool pools are leaked when writing hive tables with many partitions - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1.1
Fix Version/s: 2.1.2, 2.2.0
Component/s: SQL
Labels: None

Description

This bug is loosely related to SPARK-17396.

In this case, it occurs when writing to a Hive table with a very large number of partitions (my table is partitioned by hour and stores data that a Spark Streaming application reads from Kafka):

    df.repartition()
      .write
      .format("orc")
      .option("path", s"$tablesStoragePath/$tableName")
      .mode(SaveMode.Append)
      .partitionBy("dt", "hh")
      .saveAsTable(tableName)

As the table grows beyond a certain size, ForkJoinPool instances start to leak. Using a debugger, I found that the caller is AlterTableRecoverPartitionsCommand and that the problem occurs when `evalTaskSupport` is used (line 555). Setting a very large threshold via `spark.rdd.parallelListingThreshold` made the problem go away.
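For anyone who needs the same workaround, a minimal sketch of setting that threshold when the session is built might look like the following. The application name and the value `100000` are placeholders chosen for illustration, not values from this report; the fragment assumes Spark 2.x with Hive support on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Workaround sketch: raise the listing threshold mentioned above.
// "kafka-to-hive" and "100000" are illustrative placeholders.
val spark = SparkSession.builder()
  .appName("kafka-to-hive")
  .config("spark.rdd.parallelListingThreshold", "100000")
  .enableHiveSupport()
  .getOrCreate()
```

This only avoids the symptom described here; it does not release pools that have already been created.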

My assumption is that the problem occurs here and not in the case covered by SPARK-17396 because AlterTableRecoverPartitionsCommand is a case class, so every instance can create its own pool, whereas UnionRDD is an object, so multiple instances are not possible and therefore nothing leaks.
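The difference between the two patterns can be illustrated with a small standalone sketch. It does not use Spark's code: `LeakyCommand`, `SharedCommand`, `SharedPool`, and the pool size of 8 are hypothetical names and values, and `java.util.concurrent.ForkJoinPool` stands in for whatever pool the real command builds.

```scala
import java.util.concurrent.{Callable, ForkJoinPool}
import java.util.concurrent.atomic.AtomicInteger

object PoolLeakSketch {
  // Counts every pool created so the two patterns can be compared.
  val poolsCreated = new AtomicInteger(0)

  def newPool(): ForkJoinPool = {
    poolsCreated.incrementAndGet()
    new ForkJoinPool(8) // pool size is an arbitrary illustrative value
  }

  private def lengthTask(s: String): Callable[Int] =
    new Callable[Int] { def call(): Int = s.length }

  // Per-instance pattern (hypothetical): each command owns a pool
  // that is never shut down, so its worker threads outlive the command.
  case class LeakyCommand(table: String) {
    private val pool: ForkJoinPool = newPool()
    def run(): Int = pool.submit(lengthTask(table)).get()
  }

  // Singleton pattern (hypothetical): one lazily created pool shared
  // by every command instance.
  object SharedPool {
    lazy val pool: ForkJoinPool = newPool()
  }

  case class SharedCommand(table: String) {
    def run(): Int = SharedPool.pool.submit(lengthTask(table)).get()
  }

  def main(args: Array[String]): Unit = {
    (1 to 5).foreach(i => LeakyCommand(s"t$i").run())
    val perInstance = poolsCreated.get()

    (1 to 5).foreach(i => SharedCommand(s"t$i").run())
    val shared = poolsCreated.get() - perInstance

    // prints "per-instance pools: 5, shared pools: 1"
    println(s"per-instance pools: $perInstance, shared pools: $shared")
    SharedPool.pool.shutdown()
  }
}
```

In a long-running streaming job the per-instance pattern adds one idle pool per write, which matches the growth described above. Another way to avoid the leak, also only a sketch, would be to create the pool inside `run()` and call `shutdown()` in a `finally` block.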

Regards,
Rares

Issue Links

links to

[Github] Pull Request #18216 (srowen)

People

Assignee: Sean R. Owen
Reporter: Rares Mirica
Votes: 0
Watchers: 2

Dates

Created: 30/May/17 08:17
Updated: 10/Jul/17 23:40
Resolved: 13/Jun/17 09:48