Apache Ozone / HDDS-10488

Datanode OOM due to running out of mmap handles


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 2.0.0
    • Component/s: None

    Description

      When I run the command "yarn jar /**/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -Dfs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem -Dfs.AbstractFileSystem.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzFs -Dfs.defaultFS=ofs://ozone1708515436 -Dozone.client.bytes.per.checksum=1KB -Dtest.build.data=ofs://ozone1708515436/s3v/testdfsio -write -nrFiles 64 -fileSize 1024MB" on an installed Ozone cluster, several DNs crash due to OOM. The following is the exception stack:

      6:52:03.601 AM  WARN  KeyValueHandler Operation: ReadChunk , Trace ID:  , Message: java.io.IOException: Map failed , Result: IO_EXCEPTION , StorageContainerException Occurred.
      org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: java.io.IOException: Map failed
        at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.wrapInStorageContainerException(ChunkUtils.java:471)
        at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:226)
        at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:260)
        at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:194)
        at org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy.readChunk(FilePerBlockStrategy.java:197)
        at org.apache.hadoop.ozone.container.keyvalue.impl.ChunkManagerDispatcher.readChunk(ChunkManagerDispatcher.java:112)
        at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleReadChunk(KeyValueHandler.java:773)
        at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:262)
        at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:225)
        at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:335)
        at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:183)
        at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
        at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:182)
        at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:112)
        at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:105)
        at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
        at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
        at org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49)
        at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:329)
        at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:314)
        at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:833)
        at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
      Caused by: java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:938)
        at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$5(ChunkUtils.java:264)
        at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$4(ChunkUtils.java:218)
        at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.processFileExclusively(ChunkUtils.java:411)
        at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:215)
        ... 24 more
      Caused by: java.lang.OutOfMemoryError: Map failed
        at sun.nio.ch.FileChannelImpl.map0(Native Method)
        at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:935)
        ... 28 more
      

      In the "Dynamic libraries" section of file hs_err_pid1560151.log, there are 261425 mapped regions for different block files, for example "/hadoop-ozone/datanode/data/hdds/CID-303529f3-9f2b-4427-b389-6909971e960a/current/containerDir3/2005/chunks/113750153625618247.block"

      At the OS level, the maximum number of memory-mapped regions per process is stored in /proc/sys/vm/max_map_count, which has the value 262144.
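      For reference, here is a minimal Java sketch (not Ozone code; the class and variable names are illustrative) that reads this limit from procfs on Linux:

      import java.nio.file.Files;
      import java.nio.file.Paths;

      public class MaxMapCount {
        public static void main(String[] args) throws Exception {
          // vm.max_map_count caps the number of memory map areas a single process may have.
          String value = Files.readAllLines(Paths.get("/proc/sys/vm/max_map_count")).get(0).trim();
          System.out.println("vm.max_map_count = " + value);
        }
      }

      The 261425 mapped block files observed in hs_err_pid1560151.log are just below this 262144 limit.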

      In HDDS-7117, MappedByteBuffer was introduced to improve chunk read performance.
      The property "ozone.chunk.read.mapped.buffer.threshold", with a value of 32KB, is defined as the bar that decides whether a MappedByteBuffer or a normal ByteBuffer is used to read data.
      If the read data length is less than "ozone.chunk.read.mapped.buffer.threshold", a MappedByteBuffer should not be used, but this is not enforced in the current implementation.
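      For illustration, a minimal sketch of the intended guard, assuming a plain FileChannel-based read path (the method and names below are illustrative, not the actual ChunkUtils implementation):

      import java.io.IOException;
      import java.nio.ByteBuffer;
      import java.nio.channels.FileChannel;

      final class MappedReadSketch {
        /**
         * Read len bytes at offset, memory-mapping the file region only when the read
         * is at least mappedBufferThreshold bytes (ozone.chunk.read.mapped.buffer.threshold).
         * Smaller reads fall back to a regular ByteBuffer and consume no mmap region.
         */
        static ByteBuffer read(FileChannel channel, long offset, int len,
            int mappedBufferThreshold) throws IOException {
          if (len >= mappedBufferThreshold) {
            // Large read: map the file region (counts against vm.max_map_count).
            return channel.map(FileChannel.MapMode.READ_ONLY, offset, len);
          }
          // Small read: plain buffer, no mapping is created.
          ByteBuffer buf = ByteBuffer.allocate(len);
          channel.read(buf, offset);
          buf.flip();
          return buf;
        }
      }

      The debug logs below show that even 8192-byte reads, far below the 32KB threshold, end up as DirectByteBufferR mappings, so each small read consumes one of the limited mmap regions.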
      Here are the logs when the debug log level is enabled on the DN:

      2024-02-27 15:19:28,676 DEBUG [f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils: mapped: offset=213565440, readLen=0, n=8192, class java.nio.DirectByteBufferR
      2024-02-27 15:19:28,676 DEBUG [f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils: mapped: offset=213565440, readLen=8192, n=8192, class java.nio.DirectByteBufferR
      ...
      2024-02-27 15:19:28,676 DEBUG [f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils: mapped: offset=213565440, readLen=327680, n=8192, class java.nio.DirectByteBufferR
      2024-02-27 15:19:28,676 DEBUG [f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils: mapped: offset=213565440, readLen=335872, n=8192, class java.nio.DirectByteBufferR
      2024-02-27 15:19:28,676 DEBUG [f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils: Read 344064 bytes starting at offset 213565440 from /hadoop-ozone/datanode/data/hdds/CID-303529f3-9f2b-4427-b389-6909971e960a/current/containerDir9/5007/chunks/113750153625622210.block
      
      

      Due to this, the DN runs out of memory once the available mmap regions are exhausted.

            People

              Assignee: Sammi Chen
              Reporter: Sammi Chen
