Hadoop HDFS / HDFS-16013

DirectoryScan operation holds dataset lock for long time


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Fix Version/s: None
    • Affects Version/s: 3.2.2, 3.3.1, 3.4.0
    • Component/s: None
    • Labels: None

    Description

      Environment: 3-node cluster with around 2M files and the same number of blocks.

      All file operations are normal; only during the directory scan does the DataNode use more memory and hit long GC pauses. The directory scan runs every 6 hours (the default), which causes slow responses to any file operation. The delay is around 5-8 seconds (in production this delay increased to 30+ seconds with 8M blocks).
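      For reference, the scan interval is controlled by dfs.datanode.directoryscan.interval (in seconds; 21600 = 6 hours is the default), and dfs.datanode.directoryscan.throttle.limit.ms.per.sec can limit how much time per second the scanner spends doing work. The values below are illustrative, not recommendations:

```xml
<!-- hdfs-site.xml: illustrative values only -->
<property>
  <name>dfs.datanode.directoryscan.interval</name>
  <!-- Seconds between directory scans; 21600 (6 hours) is the default. -->
  <value>21600</value>
</property>
<property>
  <name>dfs.datanode.directoryscan.throttle.limit.ms.per.sec</name>
  <!-- Milliseconds of scanner work allowed per second; 1000 disables throttling. -->
  <value>500</value>
</property>
```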

      GC Configuration:
      -Xms6144M
      -Xmx12288M /8G
      -XX:NewSize=614M
      -XX:MaxNewSize=1228M
      -XX:MetaspaceSize=128M
      -XX:MaxMetaspaceSize=128M
      -XX:CMSFullGCsBeforeCompaction=1
      -XX:MaxDirectMemorySize=1G
      -XX:+UseConcMarkSweepGC
      -XX:+CMSParallelRemarkEnabled
      -XX:+UseCMSCompactAtFullCollection
      -XX:CMSInitiatingOccupancyFraction=80

      We also tried G1 GC, but couldn't find much difference in the result.
      -XX:+UseG1GC
      -XX:MaxGCPauseMillis=200
      -XX:InitiatingHeapOccupancyPercent=45
      -XX:G1ReservePercent=10

      2021-05-07 16:32:23,508 INFO org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool BP-345634799-<IP>-1619695417333 Total blocks: 2767211, missing metadata files: 22, missing block files: 22, missing blocks in memory: 0, mismatched blocks: 0
      2021-05-07 16:32:23,508 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Lock held time above threshold: lock identifier: FsDatasetRWLock lockHeldTimeMs=7061 ms. Suppressed 0 lock warnings. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559)
      org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
      org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148)
      org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
      org.apache.hadoop.util.InstrumentedReadLock.unlock(InstrumentedReadLock.java:78)
      org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
      org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
      org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:539)
      org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:416)
      org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:359)
      java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
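      The warning above comes from Hadoop's InstrumentedLock, which measures how long the dataset lock was held and logs a warning with the stack trace when the hold time crosses a threshold. A minimal sketch of that pattern (this is not the Hadoop implementation; TimedLock and its names are hypothetical, for illustration only):

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the lock-held-time instrumentation pattern: wrap a lock so
// that releasing it reports how long it was held, and warn when the hold
// time exceeds a configured threshold (as FsDatasetRWLock does above).
public class TimedLock implements AutoCloseable {
    private final ReentrantLock lock = new ReentrantLock();
    private final long warnThresholdMs;
    private long acquiredAtNanos;
    private volatile long lastHeldMs;

    public TimedLock(long warnThresholdMs) {
        this.warnThresholdMs = warnThresholdMs;
    }

    // Acquire the lock and record the acquisition time.
    public TimedLock acquire() {
        lock.lock();
        acquiredAtNanos = System.nanoTime();
        return this;
    }

    // Releasing via try-with-resources measures the hold time and warns.
    @Override
    public void close() {
        lastHeldMs = (System.nanoTime() - acquiredAtNanos) / 1_000_000;
        lock.unlock();
        if (lastHeldMs >= warnThresholdMs) {
            System.out.println("Lock held time above threshold: "
                + lastHeldMs + " ms");
        }
    }

    public long lastHeldMs() {
        return lastHeldMs;
    }

    public static void main(String[] args) throws InterruptedException {
        TimedLock dataset = new TimedLock(5);
        try (TimedLock held = dataset.acquire()) {
            Thread.sleep(20); // simulate a long scan while holding the lock
        }
    }
}
```

      A long directory scan under such a lock blocks every other dataset operation until the scan's critical section ends, which is exactly the 5-8 second stall reported here.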
      

      We already have the following Jiras in our code, but are still facing long lock hold times: https://issues.apache.org/jira/browse/HDFS-15621, https://issues.apache.org/jira/browse/HDFS-15150, https://issues.apache.org/jira/browse/HDFS-15160, https://issues.apache.org/jira/browse/HDFS-13947

      cc: brahma belugabehr sodonnell ayushsaxena weichiu

People

    Assignee: Unassigned
    Reporter: Renukaprasad C (prasad-acit)
    Votes: 0
    Watchers: 9

Dates

    Created:
    Updated:
    Resolved: