Hadoop HDFS / HDFS-5957

Provide support for different mmap cache retention policies in ShortCircuitCache.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: hdfs-client
    • Labels: None

      Description

      Currently, the ShortCircuitCache retains mmap regions for reuse by multiple reads of the same block or by multiple threads. The eventual munmap executes on a background thread after an expiration period. Some client usage patterns would prefer strict bounds on this cache and deterministic cleanup by calling munmap. This issue proposes additional support for different caching policies that better fit these usage patterns.

        Issue Links

          Activity

          Chris Nauroth added a comment -

          Here are some additional details on the scenario that prompted filing this issue. Thanks to Gopal V for sharing the details.

          Gopal has a YARN application that performs strictly sequential reads of HDFS files. The application may rapidly iterate through a large number of blocks. The reason for this is that each block contains a small metadata header, and based on the contents of this metadata, the application often can decide that there is nothing relevant in the rest of the block. If that happens, then the application seeks all the way past that block. Gopal estimates that it's feasible this code would scan through ~100 HDFS blocks in ~10 seconds.

          This usage pattern in combination with zero-copy read causes retention of a large number of memory-mapped regions in the ShortCircuitCache. Eventually, YARN's resource check kills the container process for exceeding the enforced physical memory bounds. The asynchronous nature of our munmap calls was surprising for Gopal, who had carefully calculated his memory usage to stay under YARN's resource checks.

          As a workaround, I advised Gopal to downtune dfs.client.mmap.cache.timeout.ms to make the munmap happen more quickly. A better solution would be to provide support in the HDFS client for a caching policy that fits this usage pattern. Two possibilities are:

          1. LRU bounded by a client-specified maximum memory size. (Note this is maximum memory size and not number of files or number of blocks, because of the possibility of differing block counts and block sizes.)
          2. Do not cache at all. Effectively, there is only one memory-mapped region alive at a time. The sequential read usage pattern described above always results in a cache miss anyway, so a cache adds no value.

          I don't propose removing the current time-triggered threshold, because I think that's valid for other use cases. I only propose adding support for new policies.

          In addition to the caching policy itself, I want to propose a way to move the munmap calls to run synchronously with the caller instead of in a background thread. This would be a better fit for clients who want deterministic resource cleanup. Right now, we have no way to guarantee that the OS will schedule the CacheCleaner thread ahead of YARN's resource check thread. This isn't a proposal to remove support for the background thread, only to add support for synchronous munmap.

          I think you could also make an argument that YARN shouldn't count these memory-mapped regions towards the container process's RSS. It's really the DataNode process that owns that memory, and clients who mmap the same region shouldn't get penalized. Let's address that part separately though.

          Colin Patrick McCabe added a comment -

          This usage pattern in combination with zero-copy read causes retention of a large number of memory-mapped regions in the ShortCircuitCache. Eventually, YARN's resource check kills the container process for exceeding the enforced physical memory bounds.

          mmap regions don't consume physical memory. They do consume virtual memory.

          I don't think limiting virtual memory usage is a particularly helpful policy, and YARN should stop doing that if that is in fact what it is doing.

          As a workaround, I advised Gopal to downtune dfs.client.mmap.cache.timeout.ms to make the munmap happen more quickly. A better solution would be to provide support in the HDFS client for a caching policy that fits this usage pattern.

          In our tests, mmap provided no performance advantage unless it was reused. If Gopal needs to purge mmaps immediately after using them, the correct thing is simply not to use zero-copy reads.

          Colin Patrick McCabe added a comment -

          This usage pattern in combination with zero-copy read causes retention of a large number of memory-mapped regions in the ShortCircuitCache. Eventually, YARN's resource check kills the container process for exceeding the enforced physical memory bounds.

          mmap regions don't consume physical memory. They do consume virtual memory.

          I don't think YARN should limit the consumption of virtual memory. Virtual memory imposes almost no cost on the system, and limiting it leads to problems like this one.

          It should be possible to limit the consumption of actual memory (not virtual address space) and solve this problem that way. What do you think?

          As a workaround, I advised Gopal to downtune dfs.client.mmap.cache.timeout.ms to make the munmap happen more quickly. A better solution would be to provide support in the HDFS client for a caching policy that fits this usage pattern.

          In our tests, mmap provided no performance advantage unless it was reused. If Gopal needs to purge mmaps immediately after using them, the correct thing is simply not to use zero-copy reads.

          Chris Nauroth added a comment -

          mmap regions don't consume physical memory. They do consume virtual memory.

          YARN has checks on both physical and virtual memory. I reviewed the logs from the application, and it is in fact the physical memory threshold that was exceeded. YARN calculates this by checking /proc/pid/stat for the RSS and multiplying by page size. The process was well within the virtual memory threshold, so virtual address space was not a problem.

          containerID=container_1392067467498_0193_01_000282] is running beyond physical memory limits. Current usage: 4.5 GB of 4 GB physical memory used; 9.4 GB of 40 GB virtual memory used. Killing container.
          
          Dump of the process-tree for container_1392067467498_0193_01_000282 :
          
                  |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
          
                  |- 27095 27015 27015 27015 (java) 8640 1190 9959014400 1189585 /grid/0/jdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -server -Xmx3584m -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+UseParallelGC -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/grid/4/cluster/yarn/logs/application_1392067467498_0193/container_1392067467498_0193_01_000282 -Dtez.root.logger=INFO,CLA -Djava.io.tmpdir=/grid/4/cluster/yarn/local/usercache/gopal/appcache/application_1392067467498_0193/container_1392067467498_0193_01_000282/tmp org.apache.hadoop.mapred.YarnTezDagChild 172.19.0.45 38627 container_1392067467498_0193_01_000282 application_1392067467498_0193 1 
          

          I don't think YARN should limit the consumption of virtual memory. virtual memory imposes almost no cost on the system and limiting it leads to problems like this one.

          I don't know the full history behind the virtual memory threshold. I've always assumed that it was in place to guard against virtual address space exhaustion and possible intervention by the OOM killer. So far, the virtual memory threshold doesn't appear to be a factor in this case.

          It should be possible to limit the consumption of actual memory (not virtual address space) and solve this problem that way. What do you think?

          Yes, I agree that the issue here is physical memory based on the logs. What we know at this point is that short-circuit reads were counted against the process's RSS, eventually triggering YARN's physical memory check. Then, downtuning dfs.client.mmap.cache.timeout.ms made the problem go away. I think we can come up with a minimal repro that demonstrates it. Gopal might even already have this.

          In our tests, mmap provided no performance advantage unless it was reused. If Gopal needs to purge mmaps immediately after using them, the correct thing is simply not to use zero-copy reads.

          Yes, something doesn't quite jibe here. Gopal V, can you comment on whether or not you're seeing a performance benefit with zero-copy read after down-tuning dfs.client.mmap.cache.timeout.ms like I advised? If so, then did I miss something in the description of your application's access pattern?

          Gopal V added a comment -

          Chris Nauroth: mmap() does take up physical memory, assuming those pages are mapped into RAM and are not disk-resident.

          As long as we're on Linux, it will show up in RSS as well as being marked in the Shared_Clean/Referenced fields in /proc/<pid>/smaps.

          YARN could do a better job of calculating "How much memory will be freed up if this process is killed" vs. "How much memory does this process use". But that is a completely different issue.

          When I set the mmap timeout to 1000ms, some of my queries succeeded - mostly the queries which were taking > 50 seconds.

          But the really fast ORC queries which take ~10 seconds to run still managed to hit around ~50x task failures out of ~3000 map tasks.

          The perf dip happens because of some of the failures.

          For small 200Gb data-sets (~1.4x tasks per container), ZCR does give a perf boost because we get to use HADOOP-10047 instead of shuffling it between byte[] buffers for decompression.

          Chris Nauroth added a comment -

          Chris Nauroth: mmap() does take up physical memory, assuming those pages are mapped into RAM and are not disk-resident.

          Yes, most definitely. I think Colin was trying to clarify that the initial mmap call dings virtual memory: call mmap for a 1 MB file and you'll immediately see virtual memory increase by 1 MB, but not physical memory. Certainly as the pages get accessed and mapped in, we'll start to consume physical memory.

          For small 200Gb data-sets (~1.4x tasks per container), ZCR does give a perf boost because we get to use HADOOP-10047 instead of shuffling it between byte[] buffers for decompression.

          Thanks, that clarifies why zero-copy read was still useful.

          It sounds like you really do need a deterministic way to trigger the munmap calls, i.e. LRU caching or no caching at all described above.

          Colin Patrick McCabe added a comment -

          I talked to Karthik Kambatla about this. It seems that YARN is monitoring the process's RSS (resident set size), which does seem to include the physical memory taken up by memory-mapped files. I think this is unfortunate. The physical memory taken up by mmapped files is basically part of the page cache. If there is any memory pressure at all, it's easy to purge this memory (the pages are "clean"). Charging an application for this memory is similar to charging it for the page cache consumed by calls to read(2)-- it doesn't really make sense for this application. I think this is a problem within YARN, which has to be fixed inside YARN.

          It sounds like you really do need a deterministic way to trigger the munmap calls, i.e. LRU caching or no caching at all described above.

          The munmap calls are deterministic now. You can control the number of unused mmaps that we'll store by changing dfs.client.mmap.cache.size.

          It's very important to keep in mind that dfs.client.mmap.cache.size controls the size of the cache, not the total number of mmaps. So if my application has 10 threads that each use an mmap at a time, and the maximum cache size is 10, I may have 20 mmaps in existence at any given time. The maximum size of any mmap is going to be the size of a block, so you should be able to use this to calculate how much RSS you will need.

          For small 200Gb data-sets (~1.4x tasks per container), ZCR does give a perf boost because we get to use HADOOP-10047 instead of shuffling it between byte[] buffers for decompression.

          As a workaround, have you considered reading into a direct ByteBuffer that you allocated yourself? DFSInputStream implements the ByteBufferReadable interface, which lets you read into any ByteBuffer. This would avoid the array copy that you're talking about.

          I hope we can fix this within YARN soon, since otherwise the perf benefit of zero-copy reads will be substantially reduced or eliminated (as well as people's ability to use ZCR in the first place).

          Chris Nauroth added a comment -

          Thank you Karthik Kambatla for also taking a look.

          I think this is a problem within YARN, which has to be fixed inside YARN.

          Did you have a specific implementation in mind? Something like trying to scan /proc/pid/smaps and subtract the clean pages from RSS? I'm curious if we'd increase the risk of thrashing. It's probably worthwhile at this point to spin off a separate YARN issue to carry on discussion, and focus on short-circuit read here.

          The munmap calls are deterministic now. You can control the number of unused mmaps that we'll store by changing dfs.client.mmap.cache.size.

          I may have misread this part of the code earlier. I thought munmap could only ever get triggered from the background cleaner thread, but now I see that it can also get triggered on unreferencing a replica, which would be synchronous to the caller.

          Gopal V, I think it would be worthwhile to try reverting your setting for dfs.client.mmap.cache.timeout.ms and instead downtune dfs.client.mmap.cache.size to a small value. Here is the full documentation for this property. (Note the large-ish default.)

          <property>
            <name>dfs.client.mmap.cache.size</name>
            <value>1024</value>
            <description>
              When zero-copy reads are used, the DFSClient keeps a cache of recently used
              memory mapped regions.  This parameter controls the maximum number of
              entries that we will keep in that cache.
          
              If this is set to 0, we will not allow mmap.
          
              The larger this number is, the more file descriptors we will potentially
              use for memory-mapped files.  mmaped files also use virtual address space.
              You may need to increase your ulimit virtual address space limits before
              increasing the client mmap cache size.
            </description>
          </property>
          

          As a workaround, have you considered reading into a direct ByteBuffer that you allocated yourself? DFSInputStream implements the ByteBufferReadable interface, which lets you read into any ByteBuffer. This would avoid the array copy that you're talking about.

          Gopal, is this also worth trying?

          Karthik Kambatla added a comment -

          Created YARN-1747.

          Gopal V added a comment -

          As a workaround, have you considered reading into a direct ByteBuffer that you allocated yourself?

          That was attempted & we do follow that codepath for the remote reads. That is slower than the zero copy read because to produce a direct byte buffer, the JVM has to defragment memory & produce a contiguous large memory region - this triggered a full GC pass, which caused stragglers with container reuse.

          On top of that the direct readable ignores all the mlocked memory in the DataNode, which means we end up spending twice as much physical memory for a cached block than with zero copy reads - plus there's the overhead of the copying from DN's mmap section into a JVM HeapByteBuffer and a checksum check because this is following the Short-Circuit-Read pathway.

          The whole performance push into zero-copy reads is to actually use off-heap memory here for performance & leave that space aside for the sort buffers, map-join memory and group-by top-n hashes.

          I don't think using a slower codepath which takes up twice as much memory with more GC overhead is a good idea if this is to be a performance improvement at the end of it all.

          Colin Patrick McCabe added a comment -

          Thanks for this data point, Gopal. I agree that allocating your own direct ByteBuffer objects will probably be slower than mmap. It certainly will cause issues with the Java heap. I was suggesting it as a temporary workaround until YARN-1747 is fixed, not as a permanent solution.

          It seems like a better interim solution would be simply to tweak dfs.client.mmap.cache.size (and perhaps some job-specific parameters that you're using) to ensure that you don't exceed the limits that YARN is placing.

          As I mentioned earlier, HDFS's behavior here is completely deterministic and synchronous... we will immediately munmap old segments in releaseBuffer (according to the LRU policy) when there is no more space for them. It is not "an eventual munmap on a background thread." If you set the cache size too big, that's a problem, but not one that HDFS can solve. Only the administrator can solve it.

          Why don't we close this JIRA and move the discussion to YARN-1747?

          Chris Nauroth added a comment -

          Gopal V, did you see my earlier comment about trying to downtune dfs.client.mmap.cache.size to see if that helps? I misread some of this code first time around and thought there was no way to get a synchronous munmap, but that's not true.

          Colin Patrick McCabe, I'll close this out once we have confirmation from Gopal that this works for him.

          Todd Lipcon added a comment -

          That was attempted & we do follow that codepath for the remote reads. That is slower than the zero copy read because to produce a direct byte buffer, the JVM has to defragment memory & produce a contiguous large memory region - this triggered a full GC pass, which caused stragglers with container reuse.

          This doesn't make sense to me – direct buffers are off-heap and allocating one is really just a wrapper around Unsafe.allocateMemory(), which itself is just a regular old malloc.

          Maybe something else was going on there?

          Gopal V added a comment -

          I thought java.nio.Bits had an explicit GC() call that I was hitting.

          I didn't investigate it further, because the mmap zero-copy is the right approach for performance.

          Gopal V added a comment -

          Chris Nauroth: I will cycle back to this once I've got more Hive-specific stuff worked out. I think setting the size to 1 will probably work itself out for me; setting it to zero doesn't seem to work at all (i.e. only cache it within the open InputStream, not within a cross-thread cache).

          Chris Nauroth added a comment -

          Thanks, Gopal V.

          setting it to zero doesn't seem to work at all

          That's correct. Zero is interpreted as a special value meaning "don't do any mmaps".

          Colin Patrick McCabe added a comment -

          This doesn't make sense to me – direct buffers are off-heap and allocating one is really just a wrapper around Unsafe.allocateMemory(), which itself is just a regular old malloc.

          ByteBuffer#allocateDirect is known to be a lot slower than ByteBuffer#allocate. One of the reasons is that the constructor of DirectByteBuffer calls Bits.reserveMemory(cap), which sometimes does a System.gc() and a sleep(100ms).

          The code is here:
          http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/DirectByteBuffer.java

          It does come down to a malloc at the end of the day, but it seems like the JDK (well, openJDK 6 at least) makes some effort to share a single pool of space between the direct and normal java memory, even going so far as to trigger a GC to keep things tidy.

          That's correct. Zero is interpreted as a special value meaning "don't do any mmaps".

          Yeah. I wonder if we should revisit that and create a separate 'enable' flag for the client, before the configuration here gets set in stone.

          Chris Nauroth added a comment -

          Yeah. I wonder if we should revisit that and create a separate 'enable' flag for the client, before the configuration here gets set in stone.

          I suppose the advantage of a separate flag is that you can then legitimately set the size to 0 if you know you want mmap but no caching. It seems more flexible, so I'm +1 for that proposal.

          Colin Patrick McCabe added a comment -

          I will cycle back to this once I've got more Hive-specific stuff worked out. I think setting the size to 1 will probably work itself out for me; setting it to zero doesn't seem to work at all (i.e. only cache it within the open InputStream, not within a cross-thread cache).

          There seems to be a bit of confusion here. In Hadoop 2.4, the short-circuit cache is cross-thread-- it doesn't exist in individual InputStreams. This allows us to make the best use of our limited file descriptor resources.

          In Hadoop 2.4, setting dfs.client.mmap.cache.size to 0 should be essentially the same as setting it to 1 in your case. In both cases, the chance of reusing an mmap from the cache should be nil. In older versions, we used to treat dfs.client.mmap.cache.size = 0 as a special value meaning "don't ever mmap." However, we don't do this any more.

          Chris Nauroth added a comment -

          However, we don't do this any more.

          Thanks for clarifying. Does that mean that we need to update the following documentation in hdfs-default.xml? Trunk still says that 0 means don't mmap.

          <property>
            <name>dfs.client.mmap.cache.size</name>
            <value>1024</value>
            <description>
              When zero-copy reads are used, the DFSClient keeps a cache of recently used
              memory mapped regions.  This parameter controls the maximum number of
              entries that we will keep in that cache.
          
              If this is set to 0, we will not allow mmap.
          
              The larger this number is, the more file descriptors we will potentially
              use for memory-mapped files.  mmaped files also use virtual address space.
              You may need to increase your ulimit virtual address space limits before
              increasing the client mmap cache size.
            </description>
          </property>
          
          Colin Patrick McCabe added a comment -

          Does that mean that we need to update the following documentation in hdfs-default.xml?

          Good find. I added that to HDFS-6046.

          Chris Nauroth added a comment -

          I'm going to go ahead and resolve this. I think we have everything we need. Gopal V, I can work with you offline if there are still any tuning requirements.

          Thanks for the discussion, everyone. I think this helped clarify the implementation for everyone, and we also have a useful follow-up item in YARN-1747 to improve container memory monitoring.


            People

            • Assignee: Unassigned
            • Reporter: Chris Nauroth
            • Votes: 0
            • Watchers: 12
