Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.4.0
-
None
Description
When I run command "yarn jar /**/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -Dfs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem -Dfs.AbstractFileSystem.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzFs -Dfs.defaultFS=ofs://ozone1708515436 -Dozone.client.bytes.per.checksum=1KB -Dtest.build.data=ofs://ozone1708515436/s3v/testdfsio -write -nrFiles 64 -fileSize 1024MB" on an installed Ozone cluster, several DN crashed due to OOM. Following is the exception stack,
6:52:03.601 AM WARN KeyValueHandler Operation: ReadChunk , Trace ID: , Message: java.io.IOException: Map failed , Result: IO_EXCEPTION , StorageContainerException Occurred. org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: java.io.IOException: Map failed at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.wrapInStorageContainerException(ChunkUtils.java:471) at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:226) at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:260) at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:194) at org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy.readChunk(FilePerBlockStrategy.java:197) at org.apache.hadoop.ozone.container.keyvalue.impl.ChunkManagerDispatcher.readChunk(ChunkManagerDispatcher.java:112) at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleReadChunk(KeyValueHandler.java:773) at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:262) at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:225) at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:335) at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:183) at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89) at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:182) at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:112) at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:105) at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262) at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33) at org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49) at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:329) at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:314) at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:833) at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Map failed at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:938) at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$5(ChunkUtils.java:264) at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$4(ChunkUtils.java:218) at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.processFileExclusively(ChunkUtils.java:411) at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:215) ... 24 more Caused by: java.lang.OutOfMemoryError: Map failed at sun.nio.ch.FileChannelImpl.map0(Native Method) at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:935) ... 28 more
In the "Dynamic libraries" section of file hs_err_pid1560151.log, there are 261425 mapped regions for different block files, for example "/hadoop-ozone/datanode/data/hdds/CID-303529f3-9f2b-4427-b389-6909971e960a/current/containerDir3/2005/chunks/113750153625618247.block"
On OS level, the max mmap handler count is saved in file /proc/sys/vm/max_map_count, which has value 262144.
In HDDS-7117, MappedByteBuffer is introduced to improve the chunk read performance.
Property "ozone.chunk.read.mapped.buffer.threshold" with value 32KB is defined as a bar to indicate whether use MappedByteBuffer or normal ByteBuffer to read data.
If read data length is less than "ozone.chunk.read.mapped.buffer.threshold", MappedByteBuffer should not be used, which is not enforce in current implementation.
Here is the logs when debug log level is enabled in DN,
2024-02-27 15:19:28,676 DEBUG [f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils: mapped: offset=213565440, readLen=0, n=8192, class java.nio.DirectByteBufferR 2024-02-27 15:19:28,676 DEBUG [f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils: mapped: offset=213565440, readLen=8192, n=8192, class java.nio.DirectByteBufferR ... 2024-02-27 15:19:28,676 DEBUG [f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils: mapped: offset=213565440, readLen=327680, n=8192, class java.nio.DirectByteBufferR 2024-02-27 15:19:28,676 DEBUG [f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils: mapped: offset=213565440, readLen=335872, n=8192, class java.nio.DirectByteBufferR 2024-02-27 15:19:28,676 DEBUG [f22679a0-7e8c-4006-a6ab-874736e9c75a-ChunkReader-3]-org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils: Read 344064 bytes starting at offset 213565440 from /hadoop-ozone/datanode/data/hdds/CID-303529f3-9f2b-4427-b389-6909971e960a/current/containerDir9/5007/chunks/113750153625622210.block
Due to this, DN is out of memory when the mmap handlers run out.
Attachments
Issue Links
- causes
-
HDDS-10816 [Hbase-Ozone] Hbase Bulkloading Crashing Datanodes
- Open
- relates to
-
HDDS-7188 Read chunk files using netty ChunkedNioFile.
- Open
- links to