Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.6.0
Fix Version/s: None
Component/s: None
Description
Our YARN clusters are configured with 10 GB of temporary local storage.
When investigating an unhealthy YARN NodeManager, we found it had become unhealthy because its local-dirs had exceeded the 90% disk utilization threshold (the "local-dirs usable space" health check). Investigation showed that this was mainly due to more than 100 different entries in the usercache, all containing the exact same libjars:
yarn@yarn-nodemanager-0:/tmp/yarn/nm-local-dir$ du -s ./usercache/lily/filecache/*
41272   ./usercache/lily/filecache/10
41272   ./usercache/lily/filecache/100
41272   ./usercache/lily/filecache/101
41272   ./usercache/lily/filecache/102
41272   ./usercache/lily/filecache/103
41272   ./usercache/lily/filecache/104
...
yarn@yarn-nodemanager-0:/tmp/yarn/nm-local-dir$ du -s ./usercache/lily/filecache/99/libjars/*
576     ./usercache/lily/filecache/99/libjars/commons-lang3-3.12.0.jar
4496    ./usercache/lily/filecache/99/libjars/hadoop-common-3.3.6-2-lily.jar
1800    ./usercache/lily/filecache/99/libjars/hadoop-mapreduce-client-core-3.3.6-2-lily.jar
100     ./usercache/lily/filecache/99/libjars/hbase-asyncfs-2.6.0-prc-1-lily.jar
2076    ./usercache/lily/filecache/99/libjars/hbase-client-2.6.0-prc-1-lily.jar
876     ./usercache/lily/filecache/99/libjars/hbase-common-2.6.0-prc-1-lily.jar
76      ./usercache/lily/filecache/99/libjars/hbase-hadoop-compat-2.6.0-prc-1-lily.jar
164     ./usercache/lily/filecache/99/libjars/hbase-hadoop2-compat-2.6.0-prc-1-lily.jar
124     ./usercache/lily/filecache/99/libjars/hbase-http-2.6.0-prc-1-lily.jar
436     ./usercache/lily/filecache/99/libjars/hbase-mapreduce-2.6.0-prc-1-lily.jar
32      ./usercache/lily/filecache/99/libjars/hbase-metrics-2.6.0-prc-1-lily.jar
24      ./usercache/lily/filecache/99/libjars/hbase-metrics-api-2.6.0-prc-1-lily.jar
208     ./usercache/lily/filecache/99/libjars/hbase-procedure-2.6.0-prc-1-lily.jar
3208    ./usercache/lily/filecache/99/libjars/hbase-protocol-2.6.0-prc-1-lily.jar
7356    ./usercache/lily/filecache/99/libjars/hbase-protocol-shaded-2.6.0-prc-1-lily.jar
52      ./usercache/lily/filecache/99/libjars/hbase-replication-2.6.0-prc-1-lily.jar
5932    ./usercache/lily/filecache/99/libjars/hbase-server-2.6.0-prc-1-lily.jar
304     ./usercache/lily/filecache/99/libjars/hbase-shaded-gson-4.1.5.jar
4060    ./usercache/lily/filecache/99/libjars/hbase-shaded-miscellaneous-4.1.5.jar
4864    ./usercache/lily/filecache/99/libjars/hbase-shaded-netty-4.1.5.jar
1832    ./usercache/lily/filecache/99/libjars/hbase-shaded-protobuf-4.1.5.jar
20      ./usercache/lily/filecache/99/libjars/hbase-unsafe-4.1.5.jar
108     ./usercache/lily/filecache/99/libjars/hbase-zookeeper-2.6.0-prc-1-lily.jar
120     ./usercache/lily/filecache/99/libjars/metrics-core-3.1.5.jar
128     ./usercache/lily/filecache/99/libjars/opentelemetry-api-1.15.0.jar
48      ./usercache/lily/filecache/99/libjars/opentelemetry-context-1.15.0.jar
32      ./usercache/lily/filecache/99/libjars/opentelemetry-semconv-1.15.0-alpha.jar
524     ./usercache/lily/filecache/99/libjars/protobuf-java-2.5.0.jar
1292    ./usercache/lily/filecache/99/libjars/zookeeper-3.8.3.jar
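For reference, the health check that tripped here is governed by the NodeManager's disk health-checker settings. A minimal sketch, assuming the standard YARN configuration keys (illustrative only, not part of the cluster setup described above):

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DiskHealthCheckerSettings {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // A local-dir is considered full once its disk utilization exceeds this percentage
    // (default 90.0); when the local-dirs are full, the NodeManager reports itself unhealthy.
    float maxUtilizationPercent = conf.getFloat(
        "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage", 90.0f);
    // Optionally, an absolute free-space floor (in MB) can be enforced as well (default 0).
    long minFreeSpaceMb = conf.getLong(
        "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb", 0L);
    System.out.println("max utilization % = " + maxUtilizationPercent
        + ", min free space MB = " + minFreeSpaceMb);
  }
}

Raising that threshold would only postpone the problem, though; the underlying issue is the duplicated libjars described below.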
The YARN logs showed that a separate YARN application is started for every HBase table included in a full backup, and that each of these applications uploads these job jars again.
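For context on where the duplicate copies come from, here is a sketch of the general MapReduce mechanism (an illustration under assumptions, not a quote from the backup code; the class name is made up): HBase MapReduce jobs typically ship their dependencies via TableMapReduceUtil.addDependencyJars, which records the jars in tmpjars; at submit time they are copied into that job's own staging directory and localized under libjars/. Since every per-table job has its own staging directory, the NodeManager ends up with a separate filecache entry per application, as seen above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public final class PerTableBackupJobExample {
  private PerTableBackupJobExample() {}

  /** Sketch of a per-table job submission; each such job becomes its own YARN application. */
  public static Job createJobForTable(String tableName) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "backup-" + tableName);
    // Resolves hbase-*, hadoop-*, zookeeper, protobuf, ... from the local classpath and adds
    // them to "tmpjars"; on submission they are uploaded to this job's staging directory,
    // so each application localizes its own copy of the same libjars.
    TableMapReduceUtil.addDependencyJars(job);
    return job;
  }
}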
We encountered this on an HBase installation with only a limited number of tables, while running backup-and-restore related tests (so this was regular use). I can imagine it becoming a real nuisance for HBase installations with hundreds to thousands of tables.
I wonder if it's possible to use shared job jars instead of the current approach?
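One possible direction, sketched under assumptions (the HDFS path and helper below are hypothetical, not existing HBase code): stage the dependency jars once at a stable HDFS location and add them to each per-table job's classpath from there. Because the remote path and timestamp then stay identical across jobs, the NodeManager can reuse a single localized copy in the per-user filecache instead of localizing a fresh libjars directory for every application.

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public final class SharedLibJars {
  private SharedLibJars() {}

  /** Adds every jar under a pre-staged HDFS directory (e.g. /hbase/backup-libjars) to the job classpath. */
  public static void addSharedLibJars(Job job, Path stagedLibJarsDir) throws IOException {
    FileSystem fs = stagedLibJarsDir.getFileSystem(job.getConfiguration());
    for (FileStatus status : fs.listStatus(stagedLibJarsDir)) {
      if (status.isFile() && status.getPath().getName().endsWith(".jar")) {
        // Referenced in place: nothing gets copied into the per-job staging directory,
        // and the localized copy can be shared across applications of the same user.
        job.addFileToClassPath(status.getPath());
      }
    }
  }
}

If it is deployed in the cluster, the YARN shared cache might be another option for cross-application jar reuse, but a pre-staged HDFS location alone would already avoid both the repeated uploads and the duplicated filecache entries.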
(Strangely enough, the mechanisms to clean up this cache weren't triggering as expected, but that's probably something that requires its own investigation.)