Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.6.0
Fix Version/s: None
Component/s: None
Description
Our YARN clusters are configured with 10 GB of temporary local storage.
When investigating an unhealthy YARN NodeManager, we found it had become unhealthy because its local-dirs had exceeded the 90% disk utilization threshold (the "local-dirs usable space" health check). Investigation showed that this was mainly due to more than 100 different entries in the usercache, all containing the exact same libjars:
yarn@yarn-nodemanager-0:/tmp/yarn/nm-local-dir$ du -s ./usercache/lily/filecache/*
41272   ./usercache/lily/filecache/10
41272   ./usercache/lily/filecache/100
41272   ./usercache/lily/filecache/101
41272   ./usercache/lily/filecache/102
41272   ./usercache/lily/filecache/103
41272   ./usercache/lily/filecache/104
...
yarn@yarn-nodemanager-0:/tmp/yarn/nm-local-dir$ du -s ./usercache/lily/filecache/99/libjars/*
576     ./usercache/lily/filecache/99/libjars/commons-lang3-3.12.0.jar
4496    ./usercache/lily/filecache/99/libjars/hadoop-common-3.3.6-2-lily.jar
1800    ./usercache/lily/filecache/99/libjars/hadoop-mapreduce-client-core-3.3.6-2-lily.jar
100     ./usercache/lily/filecache/99/libjars/hbase-asyncfs-2.6.0-prc-1-lily.jar
2076    ./usercache/lily/filecache/99/libjars/hbase-client-2.6.0-prc-1-lily.jar
876     ./usercache/lily/filecache/99/libjars/hbase-common-2.6.0-prc-1-lily.jar
76      ./usercache/lily/filecache/99/libjars/hbase-hadoop-compat-2.6.0-prc-1-lily.jar
164     ./usercache/lily/filecache/99/libjars/hbase-hadoop2-compat-2.6.0-prc-1-lily.jar
124     ./usercache/lily/filecache/99/libjars/hbase-http-2.6.0-prc-1-lily.jar
436     ./usercache/lily/filecache/99/libjars/hbase-mapreduce-2.6.0-prc-1-lily.jar
32      ./usercache/lily/filecache/99/libjars/hbase-metrics-2.6.0-prc-1-lily.jar
24      ./usercache/lily/filecache/99/libjars/hbase-metrics-api-2.6.0-prc-1-lily.jar
208     ./usercache/lily/filecache/99/libjars/hbase-procedure-2.6.0-prc-1-lily.jar
3208    ./usercache/lily/filecache/99/libjars/hbase-protocol-2.6.0-prc-1-lily.jar
7356    ./usercache/lily/filecache/99/libjars/hbase-protocol-shaded-2.6.0-prc-1-lily.jar
52      ./usercache/lily/filecache/99/libjars/hbase-replication-2.6.0-prc-1-lily.jar
5932    ./usercache/lily/filecache/99/libjars/hbase-server-2.6.0-prc-1-lily.jar
304     ./usercache/lily/filecache/99/libjars/hbase-shaded-gson-4.1.5.jar
4060    ./usercache/lily/filecache/99/libjars/hbase-shaded-miscellaneous-4.1.5.jar
4864    ./usercache/lily/filecache/99/libjars/hbase-shaded-netty-4.1.5.jar
1832    ./usercache/lily/filecache/99/libjars/hbase-shaded-protobuf-4.1.5.jar
20      ./usercache/lily/filecache/99/libjars/hbase-unsafe-4.1.5.jar
108     ./usercache/lily/filecache/99/libjars/hbase-zookeeper-2.6.0-prc-1-lily.jar
120     ./usercache/lily/filecache/99/libjars/metrics-core-3.1.5.jar
128     ./usercache/lily/filecache/99/libjars/opentelemetry-api-1.15.0.jar
48      ./usercache/lily/filecache/99/libjars/opentelemetry-context-1.15.0.jar
32      ./usercache/lily/filecache/99/libjars/opentelemetry-semconv-1.15.0-alpha.jar
524     ./usercache/lily/filecache/99/libjars/protobuf-java-2.5.0.jar
1292    ./usercache/lily/filecache/99/libjars/zookeeper-3.8.3.jar
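For reference, the health check that tripped here is governed by the NodeManager's disk health-checker settings. A minimal sketch, assuming the standard YARN configuration keys (illustrative only, not part of the cluster setup described above):

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DiskHealthCheckerSettings {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // A local-dir is considered full once its disk utilization exceeds this percentage
    // (default 90.0); when the local-dirs are full, the NodeManager reports itself unhealthy.
    float maxUtilizationPercent = conf.getFloat(
        "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage", 90.0f);
    // Optionally, an absolute free-space floor (in MB) can be enforced as well (default 0).
    long minFreeSpaceMb = conf.getLong(
        "yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb", 0L);
    System.out.println("max utilization % = " + maxUtilizationPercent
        + ", min free space MB = " + minFreeSpaceMb);
  }
}

Raising that threshold would only postpone the problem, though; the underlying issue is the duplicated libjars described below.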
The YARN logs showed that a separate YARN application is started for every HBase table included in a full backup, and that each of these applications uploads these job jars again.
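For context on where the duplicate copies come from, here is a sketch of the general MapReduce mechanism (an illustration under assumptions, not a quote from the backup code; the class name is made up): HBase MapReduce jobs typically ship their dependencies via TableMapReduceUtil.addDependencyJars, which records the jars in tmpjars; at submit time they are copied into that job's own staging directory and localized under libjars/. Since every per-table job has its own staging directory, the NodeManager ends up with a separate filecache entry per application, as seen above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public final class PerTableBackupJobExample {
  private PerTableBackupJobExample() {}

  /** Sketch of a per-table job submission; each such job becomes its own YARN application. */
  public static Job createJobForTable(String tableName) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "backup-" + tableName);
    // Resolves hbase-*, hadoop-*, zookeeper, protobuf, ... from the local classpath and adds
    // them to "tmpjars"; on submission they are uploaded to this job's staging directory,
    // so each application localizes its own copy of the same libjars.
    TableMapReduceUtil.addDependencyJars(job);
    return job;
  }
}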
We encountered this on an HBase installation with only a limited number of tables, while running backup-and-restore related tests (so this was regular use). I can imagine it becoming a real nuisance for HBase installations with hundreds to thousands of tables.
I wonder if it's possible to use shared job jars instead of the current approach?
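One possible direction, sketched under assumptions (the HDFS path and helper below are hypothetical, not existing HBase code): stage the dependency jars once at a stable HDFS location and add them to each per-table job's classpath from there. Because the remote path and timestamp then stay identical across jobs, the NodeManager can reuse a single localized copy in the per-user filecache instead of localizing a fresh libjars directory for every application.

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public final class SharedLibJars {
  private SharedLibJars() {}

  /** Adds every jar under a pre-staged HDFS directory (e.g. /hbase/backup-libjars) to the job classpath. */
  public static void addSharedLibJars(Job job, Path stagedLibJarsDir) throws IOException {
    FileSystem fs = stagedLibJarsDir.getFileSystem(job.getConfiguration());
    for (FileStatus status : fs.listStatus(stagedLibJarsDir)) {
      if (status.isFile() && status.getPath().getName().endsWith(".jar")) {
        // Referenced in place: nothing gets copied into the per-job staging directory,
        // and the localized copy can be shared across applications of the same user.
        job.addFileToClassPath(status.getPath());
      }
    }
  }
}

If it is deployed in the cluster, the YARN shared cache might be another option for cross-application jar reuse, but a pre-staged HDFS location alone would already avoid both the repeated uploads and the duplicated filecache entries.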
(Strangely enough, the mechanisms to clean up this cache weren't triggering as expected, but that's probably something that requires its own investigation.)