Description
We found an issue in a customer environment where HiveServer2 (HS2) crashed after a few days, and the Java core dump contained several thousand truststore reloader threads:
"Truststore reloader thread" #126 daemon prio=5 os_prio=0 tid=0x00007f680d2e3000 nid=0x98fd waiting on condition [0x00007f67e482c000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.security.ssl.ReloadingX509TrustManager.run(ReloadingX509TrustManager.java:225)
        at java.lang.Thread.run(Thread.java:745)
The root cause is a Hadoop bug: TimelineClientImpl does not destroy its SSLFactory when SSL is enabled in Hadoop and the timeline server is running. I opened YARN-5309, which has more details on the problem, and a patch was submitted a few days back.
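To illustrate the mechanism, here is a minimal, self-contained sketch of how an object that starts a background thread on init leaks one thread per instance when it is never destroyed. FakeSslFactory, its init()/destroy() methods, and the thread name are hypothetical stand-ins invented for this example; this is not the Hadoop SSLFactory or ReloadingX509TrustManager code.

```java
public class ReloaderLeakDemo {

    static final String THREAD_NAME = "Demo truststore reloader";

    /** Hypothetical stand-in for a factory that owns a background reloader thread. */
    static final class FakeSslFactory {
        private Thread reloader;

        void init() {
            reloader = new Thread(() -> {
                try {
                    while (true) {
                        Thread.sleep(10_000); // pretend to poll the truststore file
                    }
                } catch (InterruptedException e) {
                    // interrupted by destroy(): fall through and let the thread exit
                }
            }, THREAD_NAME);
            reloader.setDaemon(true);
            reloader.start();
        }

        void destroy() {
            reloader.interrupt();
            try {
                reloader.join(); // wait until the reloader thread has actually exited
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Counts live threads carrying the demo reloader name. */
    static long liveReloaders() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && THREAD_NAME.equals(t.getName()))
                .count();
    }

    /** Creates the given number of factories and returns how many reloader threads the run left behind. */
    static long simulate(int clients, boolean destroy) {
        long before = liveReloaders();
        for (int i = 0; i < clients; i++) {
            FakeSslFactory factory = new FakeSslFactory();
            factory.init();
            if (destroy) {
                factory.destroy();
            }
        }
        return liveReloaders() - before;
    }

    public static void main(String[] args) {
        System.out.println("left behind without destroy: " + simulate(5, false));
        System.out.println("left behind with destroy:    " + simulate(5, true));
    }
}
```

In a long-running process such as HS2, each leaked instance keeps its sleeping thread alive for the life of the JVM, which matches the pattern of thousands of identical TIMED_WAITING threads in the dump.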
In addition to the changes in Hadoop, there are a couple of Hive changes required:
- ExecDriver needs to call jobclient.close() to trigger the clean-up of resources once the submitted job completes or fails
- Hive needs to move to a newer Hadoop release that includes MAPREDUCE-6618 and MAPREDUCE-6621, which fixed issues with calling jobclient.close(). Both fixes are included in Hadoop 2.6.4.
However, since we also need to pick up YARN-5309, we need to wait for a new release of Hadoop.
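The ExecDriver change amounts to closing the client on every exit path. A minimal sketch of that pattern follows, assuming a try/finally around job submission. FakeJobClient, submitAndWait(), and execute() are hypothetical names used only for illustration; they do not reproduce the Hadoop JobClient API or the actual ExecDriver code.

```java
public class CloseJobClientDemo {

    /** Hypothetical stand-in for a job client that holds resources until closed. */
    static final class FakeJobClient implements AutoCloseable {
        boolean closed = false;

        void submitAndWait(boolean shouldFail) {
            if (shouldFail) {
                throw new IllegalStateException("job failed");
            }
        }

        @Override
        public void close() {
            closed = true; // a real client would release its threads and connections here
        }
    }

    /** Runs a job and returns 0 on success, 1 on failure; the client is closed either way. */
    static int execute(FakeJobClient client, boolean shouldFail) {
        try {
            client.submitAndWait(shouldFail);
            return 0;
        } catch (IllegalStateException e) {
            return 1;
        } finally {
            client.close(); // runs for both the success and the failure path
        }
    }

    public static void main(String[] args) {
        FakeJobClient ok = new FakeJobClient();
        FakeJobClient failing = new FakeJobClient();
        System.out.println("rc=" + execute(ok, false) + " closed=" + ok.closed);
        System.out.println("rc=" + execute(failing, true) + " closed=" + failing.closed);
    }
}
```

Placing the close() call in finally rather than after the submit call is what covers the failed-job case mentioned above.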
Attachments
Issue Links
- is blocked by
  - YARN-5309 Fix SSLFactory truststore reloader thread leak in TimelineClientImpl (Closed)
- is related to
  - HIVE-12766 TezTask does not close DagClient after execution (Closed)