IMPALA-2800: impalad process blocked while releasing memory after a big query

    Details

      Description

      After running a large join, the impalad process might have allocated tens of GB of memory. When the query finishes, all of this memory is deallocated, and the background MaintenanceThread sees that the heap size is much larger than the actual in-use memory, and calls MallocExtension::instance()->ReleaseFreeMemory(). Inside tcmalloc, this holds a global heap lock while unmapping all of the requested memory.

      In one case on my cluster, releasing ~100G of memory caused the tcmalloc page lock to be held for more than 15 seconds. This caused another query to fail with Kudu scan timeouts. Outside the context of Kudu, it's likely still problematic for latency-sensitive SLAs.
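
      As a rough illustration of the mechanism described above, here is a minimal sketch against gperftools' MallocExtension API; the thread structure, threshold, and sleep interval are illustrative rather than Impala's actual code:

        #include <gperftools/malloc_extension.h>

        #include <chrono>
        #include <thread>

        // Illustrative maintenance loop: when the heap is much larger than what
        // is actually allocated, hand the free pages back to the OS. The single
        // ReleaseFreeMemory() call is where tcmalloc holds its global page-heap
        // lock while it unmaps everything at once.
        void MaintenanceThread() {
          while (true) {
            MallocExtension* ext = MallocExtension::instance();
            size_t allocated = 0, heap_size = 0;
            ext->GetNumericProperty("generic.current_allocated_bytes", &allocated);
            ext->GetNumericProperty("generic.heap_size", &heap_size);
            if (allocated < heap_size / 2) {
              ext->ReleaseFreeMemory();  // Can block allocations in other threads for seconds.
            }
            std::this_thread::sleep_for(std::chrono::seconds(10));
          }
        }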


          Activity

          tlipcon Todd Lipcon added a comment -

          A simple fix is probably to have the maintenance thread run more often, but use the MallocExtension::ReleaseToSystem(size_t) API instead, limiting to something like 100MB per call, so the process-wide lock isn't held so long.
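
          For illustration, a minimal sketch of this incremental approach (the chunk size, function name, and caller-supplied total are placeholders, not the actual patch):

            #include <gperftools/malloc_extension.h>

            #include <algorithm>

            // Illustrative incremental release: hand memory back in ~100MB chunks
            // so tcmalloc's process-wide lock is only held briefly per call.
            void ReleaseMemoryIncrementally(size_t bytes_to_release) {
              static const size_t kChunkBytes = 100 * 1024 * 1024;  // 100MB per call.
              while (bytes_to_release > 0) {
                const size_t chunk = std::min(bytes_to_release, kChunkBytes);
                MallocExtension::instance()->ReleaseToSystem(chunk);
                bytes_to_release -= chunk;
                // A short sleep or yield here would let other threads acquire the lock.
              }
            }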

          tlipcon Todd Lipcon added a comment -

          Added some timing/logging around the Release call that logs if it takes longer than 500ms... I canceled a large join query after it had built up about 45G of memory usage on each node, and then saw:

          [1] 15:37:41 [SUCCESS] vd0234
          W1223 15:37:37.886044 24929 init.cc:135] Releasing 45352MB of memory back to the OS blocked the process for 777048us
          [2] 15:37:41 [SUCCESS] vd0236
          W1223 15:37:39.170768  4532 init.cc:135] Releasing 45357MB of memory back to the OS blocked the process for 2270607us
          [3] 15:37:41 [SUCCESS] vd0238
          W1223 15:37:40.071014 18901 init.cc:135] Releasing 45368MB of memory back to the OS blocked the process for 3092046us
          [4] 15:37:41 [SUCCESS] vd0240
          W1223 15:37:37.368393 31203 init.cc:135] Releasing 45356MB of memory back to the OS blocked the process for 945634us
          [5] 15:37:41 [SUCCESS] vd0242
          W1223 15:37:38.134336 24302 init.cc:135] Releasing 45367MB of memory back to the OS blocked the process for 1295757us
          [6] 15:37:41 [SUCCESS] vd0336
          W1223 15:37:40.175263   816 init.cc:135] Releasing 45372MB of memory back to the OS blocked the process for 3024812us
          [7] 15:37:41 [SUCCESS] vd0338
          W1223 15:37:40.199990 23570 init.cc:135] Releasing 45349MB of memory back to the OS blocked the process for 3216939us
          [8] 15:37:41 [SUCCESS] vd0334
          W1223 15:37:39.525789 20564 init.cc:135] Releasing 45348MB of memory back to the OS blocked the process for 2722608us
          [9] 15:37:41 [SUCCESS] vd0340
          W1223 15:37:38.582772 30009 init.cc:135] Releasing 45348MB of memory back to the OS blocked the process for 1246556us
          

          Interesting that different nodes seem to be able to release the memory at fairly different rates, but in any case, it definitely seems like it's worth breaking up this call.
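
          For reference, a minimal sketch of the kind of timing/logging wrapper described above (the actual change in init.cc may differ; the function name and the way the MB count is obtained are assumptions):

            #include <gperftools/malloc_extension.h>
            #include <glog/logging.h>

            #include <chrono>

            // Illustrative timing wrapper: warn if handing memory back to the OS
            // blocks for longer than 500ms, producing output like the lines above.
            void ReleaseFreeMemoryWithTiming(size_t bytes_to_release_mb) {
              using Clock = std::chrono::steady_clock;
              const Clock::time_point start = Clock::now();
              MallocExtension::instance()->ReleaseFreeMemory();
              const int64_t elapsed_us =
                  std::chrono::duration_cast<std::chrono::microseconds>(
                      Clock::now() - start).count();
              if (elapsed_us > 500 * 1000) {
                LOG(WARNING) << "Releasing " << bytes_to_release_mb
                             << "MB of memory back to the OS blocked the process for "
                             << elapsed_us << "us";
              }
            }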

          tlipcon Todd Lipcon added a comment -

          I put up a WIP fix here: http://gerrit.cloudera.org:8080/#/c/1684/ and am testing it on a workload that used to hit some 15-second timeouts due to process pauses.

          tlipcon Todd Lipcon added a comment -

          I'm also exploring fixing this within tcmalloc itself: https://github.com/gperftools/gperftools/issues/754#issuecomment-167447351

          tlipcon Todd Lipcon added a comment -

          Another related issue here is that the logic for determining when to call ReleaseFreeMemory() seems to be wrong. Currently it checks whether generic.current_allocated_bytes < 0.5 * generic.heap_size. However, generic.heap_size includes unmapped memory, and thus never goes down. So if, for example, you expand to a 200G heap and then deallocate everything, this check will remain true until you allocate more than 100G again. Instead, it should subtract the unmapped bytes from generic.heap_size before doing the comparison. I'll fold that into this patch as well.
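
          A minimal sketch of the corrected check (the property names are tcmalloc's; the function name and the 0.5 threshold mirror the existing check rather than the final patch):

            #include <gperftools/malloc_extension.h>

            // Illustrative corrected heuristic: exclude already-unmapped bytes from
            // the heap size before comparing, so a heap that grew large and was then
            // released does not keep triggering ReleaseFreeMemory() forever.
            bool ShouldReleaseFreeMemory() {
              MallocExtension* ext = MallocExtension::instance();
              size_t allocated = 0, heap_size = 0, unmapped = 0;
              ext->GetNumericProperty("generic.current_allocated_bytes", &allocated);
              ext->GetNumericProperty("generic.heap_size", &heap_size);
              ext->GetNumericProperty("tcmalloc.pageheap_unmapped_bytes", &unmapped);
              const size_t mapped_heap_size = heap_size - unmapped;
              return allocated < mapped_heap_size / 2;
            }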

          dhecht Dan Hecht added a comment -

          Todd Lipcon, in the WIP gerrit patch you mention that when running with your patch you hit some mem limit errors. Do you have more info on that?

          tlipcon Todd Lipcon added a comment -

          The issue was while running TPC-H Q21 at a 3TB scale factor on a 9-node cluster. I'm also testing with an older cut of trunk (the Kudu branch), which may be missing some spill-to-disk bug fixes. The rest of the queries completed OK. I also added logging in the GcTcmalloc() code path, which never triggered, so I suspect the issue wasn't related to my WIP patch at all. With my patch, I managed to run the rest of the TPC-H suite 5 times in a row without errors on this cluster (it used to fail all the time with RPC timeouts due to this bug). I'll post another revision of the patch this afternoon if you think the approach is reasonable.

          dhecht Dan Hecht added a comment -

          See http://gerrit.cloudera.org:8080/#/c/2105/1 for the latest on this. The current patch leads to OOMs during exhaustive builds, which will need some investigation.

          dhecht Dan Hecht added a comment -

          Please sync up with Tim Armstrong on where this stands – the next step is probably to look at the exhaustive build issues when running with this patch.

          mjacobs Matthew Jacobs added a comment -

          Mostafa Mokhtar, has this blocking come up in your perf testing at all? If we think this is less of an issue now, I'll downgrade this.

          tarmstrong Tim Armstrong added a comment -

          Once we switch away from using TCMalloc as the sole memory allocator, this should be fixed as a side effect.

          tarmstrong Tim Armstrong added a comment -

          IMPALA-3162 may have helped here: TCMalloc in gperftools 2.4 switched to decommitting memory more aggressively by default, so there may be less memory lying around to unmap.

          tarmstrong Tim Armstrong added a comment -

          Yeah, it looks like TCMalloc's behaviour changed. I tried running a couple of large queries locally and then cancelling them, and "Bytes in page heap freelist" never went above 0. So the work is essentially being done by the query as it tears things down, instead of by the maintenance thread. The buffer pool will change this by adding an additional layer of caching, so I'm going to put together a simplified patch that limits the rate at which memory is released.
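
          A minimal sketch of what such a rate-limited release could look like (the property corresponds to tcmalloc's "Bytes in page heap freelist" counter; the per-pass cap and function name are placeholders, not the eventual patch):

            #include <gperftools/malloc_extension.h>

            #include <algorithm>

            // Illustrative rate-limited release: each maintenance pass looks at how
            // much is sitting in tcmalloc's page heap freelist and releases at most
            // a bounded amount of it back to the OS.
            void ReleaseAtMostPerPass(size_t max_bytes_per_pass) {
              size_t free_bytes = 0;
              MallocExtension::instance()->GetNumericProperty(
                  "tcmalloc.pageheap_free_bytes", &free_bytes);
              if (free_bytes > 0) {
                MallocExtension::instance()->ReleaseToSystem(
                    std::min(free_bytes, max_bytes_per_pass));
              }
            }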


            People

            • Assignee:
              tarmstrong Tim Armstrong
              Reporter:
              tlipcon Todd Lipcon
            • Votes:
              0
              Watchers:
              13
