Fix SSLFactory truststore reloader thread leak in TimelineClientImpl

      We found an issue similar to HADOOP-11368 in TimelineClientImpl. The class creates an SSLFactory instance in newSslConnConfigurator, which in turn creates a ReloadingX509TrustManager instance that starts a trust store reloader thread.
      However, the SSLFactory is never destroyed, so the trust store reloader threads are never stopped.
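
      The leak pattern can be modelled without Hadoop on the classpath. The following is a minimal sketch, not the actual Hadoop code: ReloaderModel and LeakDemo are hypothetical names standing in for an object that starts a daemon reloader thread on construction and only stops it when destroy() is called.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an object that owns a background reloader thread.
class ReloaderModel {
    private final Thread reloader;
    private volatile boolean running = true;

    ReloaderModel(long intervalMs) {
        reloader = new Thread(() -> {
            while (running) {
                try {
                    Thread.sleep(intervalMs); // a real reloader would re-check the trust store here
                } catch (InterruptedException e) {
                    // destroy() interrupts the sleep; the loop then observes running == false
                }
            }
        }, "Model truststore reloader");
        reloader.setDaemon(true);
        reloader.start();
    }

    // Stops the reloader thread and waits for it to exit.
    void destroy() {
        running = false;
        reloader.interrupt();
        try {
            reloader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    boolean isAlive() {
        return reloader.isAlive();
    }
}

class LeakDemo {
    // Mimics "one new client per job" where nobody ever calls destroy().
    static List<ReloaderModel> createWithoutDestroy(int jobs) {
        List<ReloaderModel> created = new ArrayList<>();
        for (int i = 0; i < jobs; i++) {
            created.add(new ReloaderModel(10_000L));
        }
        return created;
    }

    // Counts how many reloader threads are still running.
    static long countLive(List<ReloaderModel> models) {
        return models.stream().filter(ReloaderModel::isAlive).count();
    }
}
```

      In this model every object that is created and then abandoned leaves one sleeping daemon thread behind, and the count only drops once destroy() is called on each of them.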

      This problem was observed by a customer who had SSL enabled in Hadoop and submitted many queries against HiveServer2 (HS2). After a few days, the HS2 instance crashed, and the Java thread dump showed many (over 13000) threads like this:
      "Truststore reloader thread" #126 daemon prio=5 os_prio=0 tid=0x00007f680d2e3000 nid=0x98fd waiting on condition [0x00007f67e482c000]
      java.lang.Thread.State: TIMED_WAITING (sleeping)
      at java.lang.Thread.sleep(Native Method)
      at org.apache.hadoop.security.ssl.ReloadingX509TrustManager.run
      at java.lang.Thread.run(Thread.java:745)

      HiveServer2 uses the JobClient to submit a job:
      Thread [HiveServer2-Background-Pool: Thread-188] (Suspended (breakpoint at line 89 in

      owns: Object (id=464)
      owns: Object (id=465)
      owns: Object (id=466)
      owns: ServiceLoader<S> (id=210)
      ReloadingX509TrustManager.<init>(String, String, String, long) line: 89
      FileBasedKeyStoresFactory.init(SSLFactory$Mode) line: 209
      SSLFactory.init() line: 131
      TimelineClientImpl.newSslConnConfigurator(int, Configuration) line: 532
      TimelineClientImpl.newConnConfigurator(Configuration) line: 507
      TimelineClientImpl.serviceInit(Configuration) line: 269
      TimelineClientImpl(AbstractService).init(Configuration) line: 163
      YarnClientImpl.serviceInit(Configuration) line: 169
      YarnClientImpl(AbstractService).init(Configuration) line: 163
      ResourceMgrDelegate.serviceInit(Configuration) line: 102
      ResourceMgrDelegate(AbstractService).init(Configuration) line: 163
      ResourceMgrDelegate.<init>(YarnConfiguration) line: 96
      YARNRunner.<init>(Configuration) line: 112
      YarnClientProtocolProvider.create(Configuration) line: 34
      Cluster.initialize(InetSocketAddress, Configuration) line: 95
      Cluster.<init>(InetSocketAddress, Configuration) line: 82
      Cluster.<init>(Configuration) line: 75
      JobClient.init(JobConf) line: 475
      JobClient.<init>(JobConf) line: 454
      MapRedTask(ExecDriver).execute(DriverContext) line: 401
      MapRedTask.execute(DriverContext) line: 137
      MapRedTask(Task<T>).executeTask() line: 160
      TaskRunner.runSequential() line: 88
      Driver.launchTask(Task<Serializable>, String, boolean, String, int, DriverContext) line: 1653
      Driver.execute() line: 1412

      For every job, a new instance of JobClient/YarnClientImpl/TimelineClientImpl is created. Because the HS2 process stays up for days, the trust store reloader threads from earlier jobs accumulate in the process and eventually exhaust the available resources.

      It seems that a fix similar to HADOOP-11368 is needed in TimelineClientImpl, but the class does not have a destroy method to begin with.
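
      One possible shape for such a fix is sketched below. It is a self-contained model under stated assumptions, not a patch: FactoryModel and ClientModel are hypothetical stand-ins for SSLFactory and TimelineClientImpl, and close() stands in for whatever stop hook the real service lifecycle would provide. The idea is only that the client keeps the factory in a field and destroys it when the client itself is stopped.

```java
// Hypothetical stand-in for SSLFactory: init() starts a reloader thread, destroy() stops it.
class FactoryModel {
    private Thread reloader;

    void init() {
        reloader = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(10_000L); // placeholder for the periodic reload check
                }
            } catch (InterruptedException e) {
                // interrupted by destroy(): fall through and let the thread exit
            }
        }, "Model truststore reloader");
        reloader.setDaemon(true);
        reloader.start();
    }

    void destroy() {
        if (reloader == null) {
            return;
        }
        reloader.interrupt();
        try {
            reloader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    boolean reloaderAlive() {
        return reloader != null && reloader.isAlive();
    }
}

// Hypothetical stand-in for the client that owns the factory.
class ClientModel implements AutoCloseable {
    // Kept as a field (assumption) so the stop hook can reach it later.
    private FactoryModel sslFactory;

    void init() {
        sslFactory = new FactoryModel();
        sslFactory.init();
    }

    // Accessor used only to observe the factory in this sketch.
    FactoryModel factory() {
        return sslFactory;
    }

    // Stop hook: destroys the factory so its reloader thread does not outlive the client.
    @Override
    public void close() {
        if (sslFactory != null) {
            sslFactory.destroy();
            sslFactory = null; // makes a second close() a no-op
        }
    }
}
```

      With this arrangement a caller that creates one client per job can wrap it in try-with-resources, so the reloader thread is stopped as soon as the job has been submitted.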

      One way to avoid this problem is to disable the YARN timeline service (yarn.timeline-service.enabled=false).


