Details
Bug
Status: Resolved
Critical
Resolution: Not A Problem
1.0.0
None
Description
ManagedChannels leaked on ratis pipeline when there are many connection retries
Observed that too many ManagedChannels were opened while running the Synthetic Hadoop load generator.
Ran the benchmark with only one pipeline in the cluster, and again with only two pipelines in the cluster.
Both runs failed with "too many open files". Many TCP connections stayed open for a long time, which points to a channel leak.
More details below:
1) Execute NNloadGenerator
[rakeshr@ve1320 loadOutput]$ ps -ef | grep load
hdfs 362822 1 19 05:24 pts/0 00:03:16 /usr/java/jdk1.8.0_232-cloudera/bin/java -Dproc_jar -Xmx825955249 -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dyarn.log.dir=/var/log/hadoop-yarn -Dyarn.log.file=hadoop.log -Dyarn.home.dir=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop/libexec/../../hadoop-yarn -Dyarn.root.logger=INFO,console -Djava.library.path=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop/lib/native -Dhadoop.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/jars/hadoop-mapreduce-client-jobclient-3.1.1.7.2.0.0-141-tests.jar NNloadGenerator -root o3fs://bucket2.vol2/
rakeshr 368739 354174 0 05:41 pts/0 00:00:00 grep --color=auto load
2) Active TCP connections to port 9858, the default ratis pipeline port, during the run:
[rakeshr@ve1320 loadOutput]$ sudo lsof -a -p 362822 | grep "9858" | wc
3229 32290 494080
[rakeshr@ve1320 loadOutput]$ vi tcp_log
............
java 440633 hdfs 4090u IPv4 271141987 0t0 TCP ve1320.halxg.cloudera.com:35190->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
java 440633 hdfs 4091u IPv4 271127918 0t0 TCP ve1320.halxg.cloudera.com:35192->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
java 440633 hdfs 4092u IPv4 271038583 0t0 TCP ve1320.halxg.cloudera.com:59116->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
java 440633 hdfs 4093u IPv4 271038584 0t0 TCP ve1320.halxg.cloudera.com:59118->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
java 440633 hdfs 4095u IPv4 271127920 0t0 TCP ve1320.halxg.cloudera.com:35196->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
[rakeshr@ve1320 loadOutput]$ ^C
3) The heap dump shows 9571 ManagedChannel objects. The heap dump is quite large, so a snapshot of it is attached to this jira.
4) Attached the output and thread dump of the SyntheticLoadGenerator benchmark client process to show the exceptions printed to the console. FYI, this file was quite large, so a few repeated exception traces have been trimmed.
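
To make the connection count in step 2 easier to reproduce, here is a minimal sketch that groups lsof rows by remote host for a given port and TCP state. It assumes lsof rows shaped like the IPv4 lines pasted above. The function name count_connections and the regular expression are illustrative only and are not part of Ozone or Ratis.

```python
import re
from collections import Counter

# Matches the NAME column of an lsof TCP row, for example:
#   TCP local-host:35190->remote-host:9858 (ESTABLISHED)
# Assumption: IPv4 / hostname endpoints, as in the rows shown in this report.
TCP_ROW = re.compile(r"TCP\s+(\S+):(\d+)->(\S+):(\d+)\s+\((\w+)\)")


def count_connections(lines, port=9858, state="ESTABLISHED"):
    """Count connections per remote host for the given remote port and state."""
    counts = Counter()
    for line in lines:
        match = TCP_ROW.search(line)
        if match is None:
            continue  # not a TCP row in the expected shape
        _lhost, _lport, remote_host, remote_port, conn_state = match.groups()
        if int(remote_port) == port and conn_state == state:
            counts[remote_host] += 1
    return counts


# Two rows copied from the lsof output in step 2.
sample = [
    "java 440633 hdfs 4090u IPv4 271141987 0t0 TCP "
    "ve1320.halxg.cloudera.com:35190->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)",
    "java 440633 hdfs 4092u IPv4 271038583 0t0 TCP "
    "ve1320.halxg.cloudera.com:59116->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)",
]

print(count_connections(sample))
```

Feeding it the saved lsof output at intervals during a run (for example via `count_connections(open(path))`, where path is wherever the output was saved) would show whether the per-datanode count keeps growing, which is the pattern a channel leak would be expected to produce.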