HBase
  1. HBase
  2. HBASE-5174

Coalesce aborted tasks in the TaskMonitor

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Duplicate
    • Affects Version/s: 0.92.0
    • Fix Version/s: 0.92.3
    • Component/s: None
    • Labels:
      None

      Description

      Some tasks can get repeatedly canceled like flushing when splitting is going on, in the logs it looks like this:

      2012-01-10 19:28:29,164 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap pressure
      2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, writesEnabled=false
      2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=1.6g
      2012-01-10 19:28:29,164 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap pressure
      2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, writesEnabled=false
      2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=1.6g
      2012-01-10 19:28:29,164 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap pressure
      2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, writesEnabled=false
      

      But in the TaskMonitor UI you'll get MAX_TASKS (1000) displayed on top of the regions. Basically 1000x:

      Tue Jan 10 19:28:29 UTC 2012	Flushing test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c.	ABORTED (since 31sec ago)	Not flushing since writes not enabled (since 31sec ago)
      

      It's ugly and I'm sure some users will freak out seeing this, plus you have to scroll down all the way to see your regions. Coalescing consecutive aborted tasks seems like a good solution.

        Activity

        Hide
        Jean-Daniel Cryans added a comment -

        So I guess someone fixed it elsewhere.

        Show
        Jean-Daniel Cryans added a comment - So I guess someone fixed it elsewhere.
        Hide
        Jean-Daniel Cryans added a comment -

        Is it just me or was this somehow fixed in later releases? I just did a PE run and I saw the flushes were aborted only once.

        Show
        Jean-Daniel Cryans added a comment - Is it just me or was this somehow fixed in later releases? I just did a PE run and I saw the flushes were aborted only once.
        Hide
        Ted Yu added a comment -

        A slight variation of my previous proposal:
        MonitoredTaskImpl can maintain Map<String, MonitoredTask> where String key is the description passed to
        TaskMonitor.createStatus(), prepended with MonitoredTask.State and separator string (such as '||').

        A task may have two entries in the map, one starting with 'ABORTED', the other starting with 'COMPLETE'. This corresponds to task retries.
        Special handling would be added to MonitoredTaskImpl.setState().

        Show
        Ted Yu added a comment - A slight variation of my previous proposal: MonitoredTaskImpl can maintain Map<String, MonitoredTask> where String key is the description passed to TaskMonitor.createStatus(), prepended with MonitoredTask.State and separator string (such as '||'). A task may have two entries in the map, one starting with 'ABORTED', the other starting with 'COMPLETE'. This corresponds to task retries. Special handling would be added to MonitoredTaskImpl.setState().
        Hide
        Jimmy Xiang added a comment -

        I meant we can not just show the failed or aborted tasks longer. We should also show the succeeded one or the retrying one as well, if it failed before and the failed tasks is still showing.

        Show
        Jimmy Xiang added a comment - I meant we can not just show the failed or aborted tasks longer. We should also show the succeeded one or the retrying one as well, if it failed before and the failed tasks is still showing.
        Hide
        Jean-Daniel Cryans added a comment -

        Failed or aborted tasks should not be displayed after the retry is succeeded. Otherwise, will it cause confusion?

        I'd rather want to know that something went wrong, and since it's ordered by time you can see that it eventually succeeds.

        Show
        Jean-Daniel Cryans added a comment - Failed or aborted tasks should not be displayed after the retry is succeeded. Otherwise, will it cause confusion? I'd rather want to know that something went wrong, and since it's ordered by time you can see that it eventually succeeds.
        Hide
        Jimmy Xiang added a comment -

        Failed or aborted tasks should not be displayed after the retry is succeeded. Otherwise, will it cause confusion?

        Show
        Jimmy Xiang added a comment - Failed or aborted tasks should not be displayed after the retry is succeeded. Otherwise, will it cause confusion?
        Hide
        Jean-Daniel Cryans added a comment -

        so perhaps failed or aborted tasks could remain displayed for a longer period of time.

        Agreed, and if for each time you coalesce tasks together you reset the timer then it could stick around for a while.

        Show
        Jean-Daniel Cryans added a comment - so perhaps failed or aborted tasks could remain displayed for a longer period of time. Agreed, and if for each time you coalesce tasks together you reset the timer then it could stick around for a while.
        Hide
        Andrew Purtell added a comment -

        Considering ABORTED task would be cleaned up in 1 minute, I wonder if the complexity introduced is worth it.

        On the other hand the display would be cleaner with coalescing, so perhaps failed or aborted tasks could remain displayed for a longer period of time.

        Show
        Andrew Purtell added a comment - Considering ABORTED task would be cleaned up in 1 minute, I wonder if the complexity introduced is worth it. On the other hand the display would be cleaner with coalescing, so perhaps failed or aborted tasks could remain displayed for a longer period of time.
        Hide
        Ted Yu added a comment -

        Considering ABORTED task would be cleaned up in 1 minute, I wonder if the complexity introduced is worth it.

        Show
        Ted Yu added a comment - Considering ABORTED task would be cleaned up in 1 minute, I wonder if the complexity introduced is worth it.
        Hide
        Andrew Purtell added a comment -

        Render the monitored tasks as a treeview, with something like http://jquery.bassistance.de/treeview/ ? While building the tree, put entries with identical text one level down, as soon as you see something different, move back up to toplevel? Render fully collapsed?

        Show
        Andrew Purtell added a comment - Render the monitored tasks as a treeview, with something like http://jquery.bassistance.de/treeview/ ? While building the tree, put entries with identical text one level down, as soon as you see something different, move back up to toplevel? Render fully collapsed?
        Hide
        Ted Yu added a comment -

        I think the MonitoredTask display should be placed under region server section.

        Show
        Ted Yu added a comment - I think the MonitoredTask display should be placed under region server section.
        Hide
        Ted Yu added a comment -

        Looks like I didn't take State of MonitoredTask into account.
        Personally I think seeing the latest status for a MonitoredTask is fine. To dig deeper, log is always the place to check.

        Map<Class, Map<String, Map<MonitoredTask.State, MonitoredTask>>> is easy to confuse a few people reading the code

        Show
        Ted Yu added a comment - Looks like I didn't take State of MonitoredTask into account. Personally I think seeing the latest status for a MonitoredTask is fine. To dig deeper, log is always the place to check. Map<Class, Map<String, Map<MonitoredTask.State, MonitoredTask>>> is easy to confuse a few people reading the code
        Hide
        Todd Lipcon added a comment -

        There's no guarantee that Object.hashCode() is unique - just that it's usually unique. Would rather coalesce by actual identity (WeakIdentityHashMap?) or by some string (eg region id) than use hashcode.

        Show
        Todd Lipcon added a comment - There's no guarantee that Object.hashCode() is unique - just that it's usually unique. Would rather coalesce by actual identity (WeakIdentityHashMap?) or by some string (eg region id) than use hashcode.
        Hide
        Jean-Daniel Cryans added a comment -

        Same as in HBASE-5136, I think we need to know something was aborted. Overwriting it will make it seem that nothing wrong's happening. Then add coalescing to make sure you only have 1 aborted and not a flood.

        Show
        Jean-Daniel Cryans added a comment - Same as in HBASE-5136 , I think we need to know something was aborted. Overwriting it will make it seem that nothing wrong's happening. Then add coalescing to make sure you only have 1 aborted and not a flood.
        Hide
        Ted Yu added a comment -

        Maybe we can add the following method to TaskMonitor:

          public MonitoredTask createStatus(String description, Object obj) {
        

        TaskMonitor can maintain Map<Class, Map<Integer, MonitoredTask>> where Class key is the class of Object and Integer key is obj.hashCode().
        This way we keep the current usage pattern and reduce redundancy in the mean time.

        Show
        Ted Yu added a comment - Maybe we can add the following method to TaskMonitor: public MonitoredTask createStatus( String description, Object obj) { TaskMonitor can maintain Map<Class, Map<Integer, MonitoredTask>> where Class key is the class of Object and Integer key is obj.hashCode(). This way we keep the current usage pattern and reduce redundancy in the mean time.
        Hide
        Ted Yu added a comment -

        I think this issue is similar to HBASE-5136 in that TaskMonitor.get().createStatus() is called imprudently.
        We can store MonitoredTask for flushcache() as a field in HRegion and reuse it.

        Show
        Ted Yu added a comment - I think this issue is similar to HBASE-5136 in that TaskMonitor.get().createStatus() is called imprudently. We can store MonitoredTask for flushcache() as a field in HRegion and reuse it.

          People

          • Assignee:
            Unassigned
            Reporter:
            Jean-Daniel Cryans
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development