HBase

Upper bound of outstanding WALs can be overrun; take 2 (take 1 was hbase-2053)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None

      Description

      So hbase-2053 is not aggressive enough. WALs can still overwhelm the upper limit on log count. While the code added by HBASE-2053, when done, will ensure we let go of the oldest WAL, doing so might require flushing many regions. E.g.:

      2010-02-15 14:20:29,351 INFO org.apache.hadoop.hbase.regionserver.HLog: Too many hlogs: logs=45, maxlogs=32; forcing flush of 5 regions(s): test1,193717,1266095474624, test1,194375,1266108228663, test1,195690,1266095539377, test1,196348,1266095539377, test1,197939,1266069173999
      

      This takes time. If we are taking on edits at a furious rate, we might have rolled the log again in the meantime, maybe more than once.

      Also, log rolls happen inline with a put/delete as soon as the WAL hits the 64MB (default) boundary, whereas the necessary flushing is done in the background by a single thread, and the memstore can overrun the (default) 64MB size. Flushes needed to release logs will be mixed in with "natural" flushes as memstores fill. Flushes may take longer than the writing of an HLog because they can be larger.
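      Below is a minimal sketch of that asymmetry; the class and method names are illustrative only, not the actual HLog or flusher code. The roll happens synchronously on the write path, while the flush merely gets handed to a single background thread.

      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.atomic.AtomicLong;

      // Illustrative sketch only (not the real HLog/MemStoreFlusher): appends roll the
      // WAL inline as soon as it passes 64MB, while flushes go to a single background
      // thread, so under sustained writes WALs can pile up faster than flushes retire them.
      public class WalRollVsFlushSketch {
        private static final long ROLL_THRESHOLD = 64L * 1024 * 1024; // 64MB default

        private final AtomicLong currentWalSize = new AtomicLong();
        private final AtomicLong walCount = new AtomicLong(1);
        private final ExecutorService flusher = Executors.newSingleThreadExecutor();

        // Called inline from every put/delete.
        public void append(byte[] edit) {
          if (currentWalSize.addAndGet(edit.length) >= ROLL_THRESHOLD) {
            rollWal();                                  // synchronous on the write path
            flusher.submit(this::flushOldestRegions);   // asynchronous, may lag behind
          }
        }

        private void rollWal() {
          currentWalSize.set(0);
          walCount.incrementAndGet();
        }

        private void flushOldestRegions() {
          // A flush can take longer than filling one 64MB WAL, so the queue behind
          // this single thread grows while the writer keeps rolling new logs.
        }
      }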

      So, on an RS that is struggling, the tendency would seem to be a slight rise in WALs. Only if the RS gets a breather will the flusher catch up.

      If HBASE-2087 happens, then the count of WALs gets a boost.

      Ideas to fix this for good would be:

      + Priority queue for queuing up flushes, with those that are queued to free up WALs having priority (see the sketch after this list)
      + Improve the HBASE-2053 code so that it will free more than just the last WAL, maybe even queuing flushes so we clear all WALs such that we are back under the maximum WALs threshold again.
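
      A rough sketch of the priority-queue idea; all class and enum names here are hypothetical, not existing HBase types.

      import java.util.concurrent.PriorityBlockingQueue;

      // Hypothetical sketch: flush requests made to free up WALs (or relieve heap
      // pressure) jump ahead of ordinary "memstore is full" flushes.
      public class PrioritizedFlushQueue {

        public enum Reason { FREE_WAL, HEAP_PRESSURE, MEMSTORE_FULL } // high to low priority

        public static final class FlushRequest implements Comparable<FlushRequest> {
          final String regionName;
          final Reason reason;
          final long enqueueTime = System.nanoTime();

          FlushRequest(String regionName, Reason reason) {
            this.regionName = regionName;
            this.reason = reason;
          }

          @Override
          public int compareTo(FlushRequest other) {
            // Lower ordinal = higher priority; break ties by arrival order.
            int byReason = Integer.compare(reason.ordinal(), other.reason.ordinal());
            return byReason != 0 ? byReason : Long.compare(enqueueTime, other.enqueueTime);
          }
        }

        private final PriorityBlockingQueue<FlushRequest> queue = new PriorityBlockingQueue<>();

        public void request(String regionName, Reason reason) {
          queue.offer(new FlushRequest(regionName, reason));
        }

        // The flusher thread pulls WAL-freeing flushes first.
        public FlushRequest takeNext() throws InterruptedException {
          return queue.take();
        }
      }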

        Activity

        stack created issue -
        stack made changes -
        Field Original Value New Value
        Fix Version/s 0.20.5 [ 12314800 ]
        Fix Version/s 0.20.4 [ 12314496 ]
        ryan rawson added a comment -

        Seems like we should flush until we are under the max log count by
        some percent amount, like 20% perhaps. After all, flushing logs while
        under load means we are just potentially playing perpetual catch-up
        while more edits come in.
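
        As a rough illustration of that hysteresis (the method names and the 20% margin are placeholders, not existing config):

        // Hypothetical sketch: keep flushing until the WAL count drops some margin
        // below maxlogs instead of stopping right at the limit.
        public final class WalFlushTarget {

          // The log count to flush down to, e.g. 20% below maxLogs.
          public static int targetLogCount(int maxLogs, double marginFraction) {
            return Math.max(1, (int) Math.floor(maxLogs * (1.0 - marginFraction)));
          }

          public static boolean shouldKeepFlushing(int currentLogCount, int maxLogs) {
            return currentLogCount > targetLogCount(maxLogs, 0.20);
          }

          public static void main(String[] args) {
            // With maxlogs=32 and a 20% margin, flush until we are at or below 25 WALs.
            System.out.println(targetLogCount(32, 0.20));
          }
        }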

        stack made changes -
        Priority Major [ 3 ] Critical [ 2 ]
        Jonathan Gray added a comment -

        I like the idea of prioritizing flushes (similar to what we discussed @ hackathon w/ prioritized compactions).

        Flushes being done because we are under global heap pressure or to clear out hlogs are high priority.

        Flushes being done because a region memstore reached max size are low priority.

        stack added a comment -

        Bulk move of 0.20.5 issues into 0.21.0 after vote that we merge branch into TRUNK up on list.

        stack made changes -
        Fix Version/s 0.20.5 [ 12314800 ]
        Labels moved_from_0_20_5
        stack added a comment -

        This bug is already marked critical. I think it is. Was at a user site today where hdfs was yanked from under hbase. Each server had 15-20 logs to process... some even more. We need to be more aggressive about cleaning up old logs.

        stack added a comment -

        Moving out of 0.90.

        Chatting with Ryan and J-D, this is a difficult issue and the count of logs is not the right metric; rather we should be looking at the size of all edits out in WALs as it relates to the size of content up in memstores.

        Also, splitting needs to be made to run faster so we can afford to carry fatter WALs.
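
        A minimal sketch of such a size-based check, assuming we kept running totals of WAL edit bytes and memstore bytes (the names and the ratio are placeholders, not existing code):

        // Hypothetical sketch: trigger flushes when the bytes of edits still sitting in
        // WALs grow too large relative to the bytes held in memstores, rather than when
        // a fixed count of log files is exceeded.
        public final class WalPressureCheck {

          public static boolean walsTooFat(long totalWalEditBytes,
                                           long totalMemstoreBytes,
                                           double maxRatio) {
            if (totalMemstoreBytes <= 0) {
              return totalWalEditBytes > 0; // edits in WALs but nothing left in memstores
            }
            return (double) totalWalEditBytes / totalMemstoreBytes > maxRatio;
          }
        }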

        stack made changes -
        Fix Version/s 0.90.0 [ 12313607 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            stack
           • Votes:
             0
           • Watchers:
             5
