HBASE-1062

Compactions at (re)start on a large table can overwhelm DFS

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: regionserver
    • Labels: None

      Description

      Given a large table, > 1000 regions for example, if a cluster restart is necessary, the compactions undertaken by the regionservers when the master makes initial region assignments can overwhelm DFS, leading to file errors and data loss. This condition is exacerbated if the write load was heavy before the restart, so that many regions want to split as soon as they are opened.

      Attachments

      1. 1062-1.patch
        7 kB
        Andrew Purtell
      2. 1062-2.patch
        7 kB
        Andrew Purtell
      3. 1062-3.patch
        7 kB
        Andrew Purtell
      4. 1062-4.patch
        7 kB
        Andrew Purtell

        Activity

        Andrew Purtell added a comment -

        One way to handle this is to extend the concept of safe mode to the regionservers. They should hold off on compactions and splits for some configurable interval, and then slowly ramp up compactions/splits with randomized hold intervals to avoid thundering herd behavior.
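
        For illustration, a minimal sketch of this idea in Java; the class and names below are hypothetical and not taken from any HBase patch:

        import java.util.Random;
        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.atomic.AtomicBoolean;

        // Hypothetical sketch: hold off compactions/splits for a configured
        // interval, plus a per-server randomized delay, so the regionservers
        // do not all hit DFS at the same moment after a restart.
        class SafeModeGate {
          private final AtomicBoolean safeMode = new AtomicBoolean(true);
          private final long holdMillis;      // configurable hold interval
          private final long maxJitterMillis; // randomized extra hold per server
          private final Random random = new Random();

          SafeModeGate(long holdMillis, long maxJitterMillis) {
            this.holdMillis = holdMillis;
            this.maxJitterMillis = maxJitterMillis;
          }

          // Block until the randomized hold expires, then allow compactions/splits.
          void awaitExit() throws InterruptedException {
            long jitter = (long) (random.nextDouble() * maxJitterMillis);
            TimeUnit.MILLISECONDS.sleep(holdMillis + jitter);
            safeMode.set(false);
          }

          // Compaction/split schedulers check this before touching DFS.
          boolean inSafeMode() {
            return safeMode.get();
          }
        }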

        Andrew Purtell added a comment -

        Edited description. A 1000-region table is only large, not very large.

        Andrew Purtell added a comment -

        Made this a critical issue for 0.20.0 and took ownership. I think I am done with this issue for now, until the patch anyhow.

        Of course the proposed solution will lengthen cluster (re)start time, perhaps substantially, but I think that is better than data loss.

        A better solution might be to interrogate DFS for its current load before attempting any action that might load it, but the Hadoop filesystem abstraction layer does not support this, and neither do the DFS protocols. What do you think about filing an issue about this upstream?

        stack added a comment -

        +1 on it being critical. Powerset has a cluster of ~5000 regions. Start-up is a big bang but steady state happens eventually. I haven't done much study of it. I can imagine that indeed, if the cluster went down bad with a few major compactions in the mix, startup could be messy. How many regionservers, Andrew? And an HRS beside each datanode? (Our nodes are relatively lightly loaded – 50 or so regions in 2G heaps).

        stack added a comment -

        Just saw the 'safe mode' for HRSs suggestion. I like that. In general, we need to do some work to make it so the cluster boots as fast as possible and is immediately usable – not sort-of online but soaked running major compactions and splits, etc.

        Andrew Purtell added a comment -

        Our 25 node cluster layout is:

        1: namenode, datanode
        2: datanode, hmaster, jobtracker
        3-25: datanode, regionserver, tasktracker

        We run datanodes everywhere because each node has 2.5TB of storage that we'd clearly like to include in the DFS volume.

        Tasktrackers do not run on the semi-dedicated namenode host or the semi-dedicated hmaster host. There is an HRS running alongside every TT. Each TT is configured to allow only four concurrent tasks – 2 mappers and/or 2 reducers. Some of our tasks can be heavy, running with 1G heap, etc. Especially the document parser really loads CPU, RAM, and DFS while the mappers crunch away.

        Right now our average load is also around 50.

        Andrew Purtell added a comment -

        The attached patch is in testing on my cluster. It seems to work pretty well. It may not pass all of the test suite yet.

        Andrew Purtell added a comment -

        Replaced patch with one that has the closing </description> for the hbase-defaults.xml hunk.

        stack added a comment -

        A few comments on the patch Andrew:

        + Is it wise to postpone memcache flushes, even if it's only for the 2 minutes of HRS safe mode? We can take on updates during this time? If so, could we OOME if rabid uploading is afoot?
        + We schedule compactions on open and on flush. This would put off the open scheduling for an interval of 2 minutes. If the cluster went down ugly, and some regions had References outstanding, then these regions would not be splittable until a memcache flush ran; i.e. until it took on a bunch of uploads. Maybe that's OK?
        + Do we ever break out of this loop:

        +        if ((limit > 0) && (++count > limit)) {
        +          try {
        +            Thread.sleep(this.frequency);
        +          } catch (InterruptedException ex) {
        +            continue;
        +          }
        +          count = 0;
        +        }
        

        Looks like we increment count then set it to zero after sleep. It never progresses?

        Andrew Purtell added a comment -

        > Is it wise postponing memcache flushes?

        I thought safe mode should be essentially "don't touch DFS".

        > We schedule compactions on open and on flush. This would put off the open scheduling
        > for interval of 2 minutes. If cluster went down ugly, and some regions had References
        > outstanding, then these regions would not be splittable

        Wouldn't the references be cleared when the deferred compactions finally are allowed to run? Then the split would happen. This is what I observe while testing.

        > Do we ever break out of this loop [...] Looks like we increment count then set it to zero
        > after sleep. It never progresses?

        The code in question just sleeps once per pass through the CompactSplitThread main loop when count becomes greater than limit; then count is reset.
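
        For clarity, a simplified sketch of how that throttle sits inside a worker loop; everything around the quoted hunk is hypothetical, not the actual patch:

        import java.util.concurrent.BlockingQueue;

        // Hypothetical sketch of the loop shape: the throttle sleeps inside the
        // outer while loop, so after 'limit' compactions the thread pauses once,
        // resets count, and keeps draining the queue.
        class CompactSplitLoopSketch implements Runnable {
          private final BlockingQueue<Runnable> compactionQueue; // hypothetical work queue
          private volatile int limit;        // max compactions per cycle while ramping up
          private final long frequency;      // how long to back off when throttled
          private volatile boolean stopRequested = false;

          CompactSplitLoopSketch(BlockingQueue<Runnable> queue, int limit, long frequency) {
            this.compactionQueue = queue;
            this.limit = limit;
            this.frequency = frequency;
          }

          public void run() {
            int count = 0;
            while (!stopRequested) {
              Runnable compaction = compactionQueue.poll();
              if (compaction != null) {
                compaction.run();
              }
              if ((limit > 0) && (++count > limit)) {
                try {
                  Thread.sleep(frequency);   // pause once to spare DFS
                } catch (InterruptedException ex) {
                  continue;
                }
                count = 0;                   // reset and keep processing
              }
            }
          }
        }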

        It looks like I still need to be more aggressive with making the compact/split ramp-up a longer slope, at least given our cluster and circumstances. The current patch helps but we can still overwhelm DFS sometimes after a restart.

        stack added a comment -

        Linked to HADOOP-4801

        Andrew Purtell added a comment -

        Updated patch that ramps up compactions/splits more slowly.

        Andrew Purtell added a comment -

        The last patch has stabilized restarts on my cluster. Consider for 0.19?

        stack added a comment -

        Bringing in to 0.19.0 so it gets reviewed and considered for commit in 0.19.0.

        Andrew Purtell added a comment -

        Patch -3 fixes a potential problem where an abnormal exit of the safe mode thread would prevent memcache flushes and compactions.

        stack added a comment -

        > Wouldn't the references be cleared when the deferred compactions finally are allowed to run? Then the split would happen. This is what I observe while testing.

        Yes. We only put off the compact-on-open at startup; thereafter compaction-on-open runs on splits, redeploys, etc. It'll be fine.

        On this code where you make a thread....

        +      // start thread for turning off safemode
        +      if (conf.getInt("hbase.regionserver.safemode.period", 0) < 1) {
        +        safeMode.set(false);
        +        compactSplitThread.setLimit(-1);
        +        LOG.debug("skipping safe mode");
        +      } else {
        +        new SafemodeThread().start();
        +      }
        

        FYI, we have a bit of a convention regarding thread naming and where we start them. Can you start it in startServiceThreads and name it like the others (if it makes sense), with the HRS name as prefix? That makes it cleaner, when reading thread dumps, to figure out which threads are ours and which are the system's.

        Maybe limit should be volatile so changes are seen promptly.

        Won't the log line below happen a lot when in DEBUG?

        +        LOG.debug("in safe mode, deferring memcache flushes");
        +        Thread.sleep(threadWakeFrequency);
        

        If safe mode is two minutes and threadWakeFrequency is 10 seconds...
        Perhaps just log entry and exit, including how long the sleep is for. Same for compactions.
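
        A minimal sketch of what that could look like; the names below are hypothetical, not from the patch:

        import org.apache.commons.logging.Log;
        import org.apache.commons.logging.LogFactory;

        // Hypothetical sketch: start a named safe-mode thread alongside the other
        // service threads, log entry/exit once rather than on every wake, and keep
        // the compaction limit volatile so the worker thread sees changes promptly.
        class SafeModeSketch {
          private static final Log LOG = LogFactory.getLog(SafeModeSketch.class);

          private volatile boolean safeMode = true;
          private volatile int compactionLimit = 1; // -1 means unthrottled

          void startServiceThreads(String serverName, final long safeModePeriodMillis) {
            Thread t = new Thread(new Runnable() {
              public void run() {
                LOG.info("entering safe mode for " + safeModePeriodMillis + "ms");
                try {
                  Thread.sleep(safeModePeriodMillis);
                } catch (InterruptedException e) {
                  // fall through and leave safe mode anyway
                }
                safeMode = false;
                compactionLimit = -1;
                LOG.info("exiting safe mode");
              }
            });
            t.setName(serverName + ".safeMode"); // HRS name as prefix, per convention
            t.setDaemon(true);
            t.start();
          }
        }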

        Otherwise the patch looks good. I can try it here if you make a new version to address the above.

        Andrew Purtell added a comment -

        Patch -4 addresses stack's comments.

        stack added a comment -

        Reviewed patch. Looks good. Tried it locally. I see it leaving safe mode. +1 on patch.

        Andrew Purtell added a comment -

        Committed. Thanks for the review stack.

        stack added a comment -

        @Andrew In IRC discussions, did we say we could do away with this bandaid now that we don't compact on every open, only if the region has references (which is usually well after startup)? I've cut this patch from TRUNK. Let me know if removing it was a misstack [sic]. I did it as part of my hbase-1816 hackup job.

        Andrew Purtell added a comment -

        We're good. +1 No misstack.


          People

          • Assignee: Andrew Purtell
          • Reporter: Andrew Purtell
          • Votes: 0
          • Watchers: 0
