HBASE-28445: Shared job jars for full backups


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: backup&restore
    • Labels: None

    Description

      Our YARN clusters are configured with 10 GB of temporary local storage.

      While investigating an unhealthy YARN nodemanager, we found that it had been marked unhealthy because its "local-dirs usable space" had dropped below the configured 90% threshold. The cause turned out to be more than 100 entries in the usercache, all containing the exact same libjars:

      yarn@yarn-nodemanager-0:/tmp/yarn/nm-local-dir$ du -s ./usercache/lily/filecache/*
      41272   ./usercache/lily/filecache/10
      41272   ./usercache/lily/filecache/100
      41272   ./usercache/lily/filecache/101
      41272   ./usercache/lily/filecache/102
      41272   ./usercache/lily/filecache/103
      41272   ./usercache/lily/filecache/104 
      ...
      yarn@yarn-nodemanager-0:/tmp/yarn/nm-local-dir$ du -s ./usercache/lily/filecache/99/libjars/*
      576     ./usercache/lily/filecache/99/libjars/commons-lang3-3.12.0.jar
      4496    ./usercache/lily/filecache/99/libjars/hadoop-common-3.3.6-2-lily.jar
      1800    ./usercache/lily/filecache/99/libjars/hadoop-mapreduce-client-core-3.3.6-2-lily.jar
      100     ./usercache/lily/filecache/99/libjars/hbase-asyncfs-2.6.0-prc-1-lily.jar
      2076    ./usercache/lily/filecache/99/libjars/hbase-client-2.6.0-prc-1-lily.jar
      876     ./usercache/lily/filecache/99/libjars/hbase-common-2.6.0-prc-1-lily.jar
      76      ./usercache/lily/filecache/99/libjars/hbase-hadoop-compat-2.6.0-prc-1-lily.jar
      164     ./usercache/lily/filecache/99/libjars/hbase-hadoop2-compat-2.6.0-prc-1-lily.jar
      124     ./usercache/lily/filecache/99/libjars/hbase-http-2.6.0-prc-1-lily.jar
      436     ./usercache/lily/filecache/99/libjars/hbase-mapreduce-2.6.0-prc-1-lily.jar
      32      ./usercache/lily/filecache/99/libjars/hbase-metrics-2.6.0-prc-1-lily.jar
      24      ./usercache/lily/filecache/99/libjars/hbase-metrics-api-2.6.0-prc-1-lily.jar
      208     ./usercache/lily/filecache/99/libjars/hbase-procedure-2.6.0-prc-1-lily.jar
      3208    ./usercache/lily/filecache/99/libjars/hbase-protocol-2.6.0-prc-1-lily.jar
      7356    ./usercache/lily/filecache/99/libjars/hbase-protocol-shaded-2.6.0-prc-1-lily.jar
      52      ./usercache/lily/filecache/99/libjars/hbase-replication-2.6.0-prc-1-lily.jar
      5932    ./usercache/lily/filecache/99/libjars/hbase-server-2.6.0-prc-1-lily.jar
      304     ./usercache/lily/filecache/99/libjars/hbase-shaded-gson-4.1.5.jar
      4060    ./usercache/lily/filecache/99/libjars/hbase-shaded-miscellaneous-4.1.5.jar
      4864    ./usercache/lily/filecache/99/libjars/hbase-shaded-netty-4.1.5.jar
      1832    ./usercache/lily/filecache/99/libjars/hbase-shaded-protobuf-4.1.5.jar
      20      ./usercache/lily/filecache/99/libjars/hbase-unsafe-4.1.5.jar
      108     ./usercache/lily/filecache/99/libjars/hbase-zookeeper-2.6.0-prc-1-lily.jar
      120     ./usercache/lily/filecache/99/libjars/metrics-core-3.1.5.jar
      128     ./usercache/lily/filecache/99/libjars/opentelemetry-api-1.15.0.jar
      48      ./usercache/lily/filecache/99/libjars/opentelemetry-context-1.15.0.jar
      32      ./usercache/lily/filecache/99/libjars/opentelemetry-semconv-1.15.0-alpha.jar
      524     ./usercache/lily/filecache/99/libjars/protobuf-java-2.5.0.jar
      1292    ./usercache/lily/filecache/99/libjars/zookeeper-3.8.3.jar 

      The YARN logs showed that a separate YARN application is started for every HBase table included in a full backup, and that each of these applications uploads its own copy of these job jars.
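
      To make the duplication concrete, the submission pattern presumably looks something like the sketch below: one job per table, with TableMapReduceUtil.addDependencyJars() putting the same set of jars on every job's tmpjars. This is an illustrative simplification, not the actual backup client code.

      import java.util.List;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.TableName;
      import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
      import org.apache.hadoop.mapreduce.Job;

      // Illustrative only: one MapReduce job is submitted per table, and every
      // submission ships its own copy of the dependency jars to the cluster.
      public class PerTableBackupSketch {
        public static void runFullBackup(Configuration conf, List<TableName> tables) throws Exception {
          for (TableName table : tables) {
            Job job = Job.getInstance(conf, "full-backup-" + table.getNameAsString());
            // Puts the HBase/Hadoop dependency jars on the job's tmpjars; at submission
            // time these local jars are uploaded again for every job, which is what
            // fills the nodemanager usercache with identical libjars directories.
            TableMapReduceUtil.addDependencyJars(job);
            // ... configure the per-table copy and submit ...
            job.waitForCompletion(true);
          }
        }
      }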

      We encountered this on an HBase installation with only a limited number of tables, where we were running backup & restore related tests (so this was regular usage). I can imagine this becoming a real nuisance for HBase installations with hundreds to thousands of tables.

      I wonder if it's possible to use shared job jars instead of the current approach?
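
      One possible direction, sketched below under the assumption that the dependency jars can be pre-staged in a well-known, world-readable HDFS directory: add them to the job classpath from HDFS instead of shipping local jars. Jars that already live on HDFS are not re-uploaded at submission time, and world-readable files can be localized with PUBLIC visibility, so the nodemanager can keep a single cached copy shared across applications. The directory path and helper name below are hypothetical.

      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;

      // Hypothetical helper: reference pre-staged jars on HDFS rather than adding
      // local jars to tmpjars for every job.
      public class SharedJobJarsSketch {
        public static void addSharedJars(Job job, Path sharedJarDir) throws Exception {
          FileSystem fs = sharedJarDir.getFileSystem(job.getConfiguration());
          for (FileStatus status : fs.listStatus(sharedJarDir)) {
            if (status.getPath().getName().endsWith(".jar")) {
              // Adds the HDFS jar to the job classpath via the distributed cache;
              // no per-job upload to the staging directory is needed.
              job.addFileToClassPath(status.getPath());
            }
          }
        }
      }

      At job-setup time this would be called with something like addSharedJars(job, new Path("/hbase/backup-libjars")) (path hypothetical), replacing the per-job addDependencyJars() call.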

      (Strangely enough, the mechanisms that should clean up this cache weren't triggering as expected, but that probably warrants its own investigation.)
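
      For reference when that follow-up happens: assuming default settings, the nodemanager's localized-file cache retention is governed by yarn.nodemanager.localizer.cache.target-size-mb and yarn.nodemanager.localizer.cache.cleanup.interval-ms. A quick way to print the effective values (names taken from YarnConfiguration; defaults shown as fallbacks):

      import org.apache.hadoop.yarn.conf.YarnConfiguration;

      // Prints the nodemanager cache-cleanup settings relevant to the follow-up
      // investigation of why the usercache was not being trimmed.
      public class CacheCleanupSettings {
        public static void main(String[] args) {
          YarnConfiguration conf = new YarnConfiguration();
          // Target size of the localized file cache in MB (default 10240, i.e. 10 GB).
          System.out.println(conf.getLong(YarnConfiguration.NM_LOCALIZER_CACHE_TARGET_SIZE_MB, 10240));
          // Interval between cache cleanup runs in ms (default 600000, i.e. 10 minutes).
          System.out.println(conf.getLong(YarnConfiguration.NM_LOCALIZER_CACHE_CLEANUP_INTERVAL_MS, 600000));
        }
      }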

          People

            Assignee: Unassigned
            Reporter: Dieter De Paepe (dieterdp_ng)
