Accumulo / ACCUMULO-2645

tablet stuck unloading, and problem is hard to diagnose

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.4
    • Fix Version/s: 1.6.1, 1.7.0
    • Component/s: tserver
    • Labels:
    • Environment:

      very large production cluster, CDH3u5

      Description

      • master failed to balance
      • custom balancer refused to balance while migrations were in place
      • tablet server was not unloading the tablet
      • tablet server was otherwise serving tablets, providing status
      • memory dump determined that there were 21K UnloadTabletHandler objects
      • jstack showed UnloadTabletHandler in Tablet.completeClose, line 2674
      • the last print of the debug message "completeClose(safeState=true, completeClose=true)" occurred 9 days ago
      • there was a query that had been running for 9 days

        Issue Links

          Activity

          Transition            Time In Source Status  Execution Times  Last Executer      Last Execution Date
          Open -> Resolved      156d 23h 47m           1                Eric Newton        11/Sep/14 20:32
          Resolved -> Reopened  22h 19m                1                Christopher Tubbs  12/Sep/14 18:52
          Reopened -> Resolved  25s                    1                Christopher Tubbs  12/Sep/14 18:52
          ASF subversion and git services made changes -
          Time Spent 40m [ 2400 ] 50m [ 3000 ]
          Worklog Id 18076 [ 18076 ]
          ASF subversion and git services logged work - 12/Sep/14 23:37
          ASF subversion and git services made changes -
          Time Spent 0.5h [ 1800 ] 40m [ 2400 ]
          Worklog Id 18072 [ 18072 ]
          ASF subversion and git services logged work - 12/Sep/14 23:37
          Christopher Tubbs made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Christopher Tubbs added a comment -

          Resolved as "Fixed" under new scope of making it easier to diagnose.

          Christopher Tubbs made changes -
          Resolution Not a Problem [ 8 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Christopher Tubbs made changes -
          Summary "tablet stuck unloading" -> "tablet stuck unloading, and problem is hard to diagnose"
          ASF subversion and git services made changes -
          Time Spent 20m [ 1200 ] 0.5h [ 1800 ]
          Worklog Id 18065 [ 18065 ]
          ASF subversion and git services logged work - 12/Sep/14 15:55
          Eric Newton made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Not a Problem [ 8 ]
          Eric Newton added a comment -

          You can now get information about scans (count, oldest) in the monitor which can help identify this problem.

          Eric Newton made changes -
          Fix Version/s 1.6.1 [ 12325441 ]
          ASF subversion and git services made changes -
          Time Spent 10m [ 600 ] 20m [ 1200 ]
          Worklog Id 18053 [ 18053 ]
          ASF subversion and git services logged work - 11/Sep/14 20:21
          ASF subversion and git services made changes -
          Remaining Estimate 0h [ 0 ]
          Time Spent 10m [ 600 ]
          Worklog Id 18052 [ 18052 ]
          ASF subversion and git services logged work - 11/Sep/14 20:18
          Eric Newton made changes -
          Link This issue relates to ACCUMULO-2673 [ ACCUMULO-2673 ]
          Keith Turner added a comment -

          When I added scan interruption I first tried using Thread.interrupt(), but that did not work so well because HDFS client code ate the interrupts. So then I switched to the strategy of using an atomic boolean and checking it in certain places. HDFS eating interrupts was in an older version; maybe it does not anymore? We could possibly try Thread.interrupt() in addition to checking an atomic boolean.
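
The flag-based strategy described in this comment can be sketched roughly as follows. This is only an illustration of the idea, not Accumulo's implementation: the class `InterruptibleScan`, its exception type, and its method names are hypothetical, and the "scan" is reduced to walking an iterator.

```java
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a scan that is stopped cooperatively through a shared
// flag instead of Thread.interrupt(). Names are invented for illustration.
class InterruptibleScan {
    // Set by the thread that wants the scan to stop (for example, a tablet close).
    private final AtomicBoolean interruptFlag = new AtomicBoolean(false);

    /** Thrown when the scan notices that it has been asked to stop. */
    static class ScanInterruptedException extends RuntimeException {
        ScanInterruptedException(String message) {
            super(message);
        }
    }

    /** Asks the scan to stop at its next checkpoint. */
    void interrupt() {
        interruptFlag.set(true);
    }

    /** Reads every entry, checking the flag before each one; returns the count read. */
    int scan(Iterator<String> source) {
        int read = 0;
        while (source.hasNext()) {
            if (interruptFlag.get()) {
                throw new ScanInterruptedException("scan interrupted after " + read + " entries");
            }
            source.next();
            read++;
        }
        return read;
    }
}
```

A limitation of this approach is that the flag is only observed at the checkpoints. If a single call to the underlying source blocks indefinitely, the scan never reaches the next check, which is consistent with the kind of hang discussed in this issue.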

          Josh Elser added a comment -

          Keith Turner, yes, thanks for mentioning. I re-read what you had mentioned there, but I will look at the code. My comments above were mostly from the standpoint of "if reading/writing data via the hdfs api can prevent a scan from being interrupted", maybe there's something more we need to do. Not yet substantiated with what the implementation does.

          Keith Turner added a comment -

          Josh Elser I made some comments on ACCUMULO-2542 about scan interruption that may be helpful. Tablet.completeClose(...) calls ScanDataSource.interrupt() which sets the atomic boolean mentioned in ACCUMULO-2542 to true.

          Josh Elser added a comment -

          "I am writing a test for this theory."

          Neat!

          "Current theory is that this interrupt is being caught by the HDFS library, which indirectly causes the request to the NN to hang forever."

          Yeah, this is what I was getting at. I wonder if there is something we could design into SKVI or the interruption call to ensure the interruption actually propagates, so that the scan actually sees it and takes action. Just a thought.

          Eric Newton added a comment -

          The unloader attempts to interrupt the scans. Current theory is that this interrupt is being caught by the HDFS library, which indirectly causes the request to the NN to hang forever. I am writing a test for this theory.
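
The theory above is not confirmed in this thread. As a general illustration of how library code can catch an interrupt so that the caller's request to stop has no effect, here is a minimal sketch. The class `SwallowingWorker` is hypothetical and does not reproduce HDFS client behavior; `Thread.sleep` merely stands in for a blocking call.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for library code that catches interrupts. This is
// NOT the HDFS client, only an illustration of the general failure mode.
class SwallowingWorker implements Runnable {
    final AtomicBoolean stop = new AtomicBoolean(false);
    final AtomicInteger swallowed = new AtomicInteger(0);

    @Override
    public void run() {
        while (!stop.get()) {
            try {
                Thread.sleep(10); // stands in for a blocking I/O call
            } catch (InterruptedException e) {
                // The exception is caught and the interrupt status is not
                // restored, so the loop carries on as if nothing happened.
                swallowed.incrementAndGet();
            }
        }
    }

    /** Runs the demonstration; returns true if the worker outlived the interrupt. */
    static boolean survivesInterrupt() {
        SwallowingWorker worker = new SwallowingWorker();
        Thread t = new Thread(worker);
        t.start();
        try {
            Thread.sleep(50);      // let the worker reach its blocking call
            t.interrupt();         // request interruption
            t.join(200);           // wait briefly to see whether the thread exits
            boolean alive = t.isAlive();
            worker.stop.set(true); // the explicit flag is what ends the loop
            t.join();
            return alive && worker.swallowed.get() > 0;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            worker.stop.set(true);
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("worker survived Thread.interrupt(): " + survivesInterrupt());
    }
}
```

In this sketch the worker only terminates once the separate `stop` flag is set, which is why a flag check is often paired with, or preferred over, thread interruption.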

          Josh Elser added a comment -

          "the monitor could display the number of unload requests outstanding in the tserver"

          That would be cool. I could see the general premise being otherwise useful too.

          Perhaps related: does tablet unload interrupt running scans? Or does a scan have the ability to block unloads indefinitely? Perhaps the tserver should try to unload for some amount of time and, if the tablet still hasn't unloaded because a scan is running, forcefully abort the scan? That also raises the question, in the case of custom iterators, of whether we can make something that will gracefully abort a scan using such iterators, or whether we are reliant on users implementing exception handling properly to avoid the "9 day query".
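
The "try for some amount of time, then escalate" idea raised in this comment could look something like the sketch below. It is an assumption-laden illustration rather than anything in the tserver: `BoundedUnload`, `unloadWithTimeout`, and the `Result` enum are hypothetical names, and the unload itself is represented by an arbitrary `Runnable`.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: run an unload attempt with a deadline and report
// whether it finished, timed out, or failed.
class BoundedUnload {
    enum Result { UNLOADED, TIMED_OUT, FAILED }

    static Result unloadWithTimeout(Runnable unloadTask, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "unload-attempt");
            t.setDaemon(true); // a stuck attempt must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(unloadTask);
        try {
            future.get(timeout, unit); // wait at most this long for the unload
            return Result.UNLOADED;
        } catch (TimeoutException e) {
            // Best effort only: cancel(true) interrupts the worker thread,
            // which has no effect if the task swallows interrupts.
            future.cancel(true);
            return Result.TIMED_OUT;
        } catch (ExecutionException e) {
            return Result.FAILED; // the unload task itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Result.FAILED;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Even when the abort cannot be enforced, a `TIMED_OUT` result at least gives the caller a point at which to log a warning, so the condition is noticed long before nine days have passed.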

          Eric Newton added a comment -

          The aforementioned complex iterator stack was also doing HDFS IO.

          Eric Newton made changes -
          Labels newbie
          Eric Newton made changes -
          Fix Version/s 1.7.0 [ 12324607 ]
          Eric Newton added a comment -

          Possible ways of detecting this problem in the future:

          • UnloadTabletHandler could issue a warning if a tablet does not unload
          • master could generate warnings about unload requests that are old
          • the monitor could display the number of unload requests outstanding in the tserver
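
The detection ideas listed above all depend on knowing which unload requests are outstanding and how old they are. A minimal sketch of that bookkeeping follows. Everything in it is hypothetical (the class `UnloadRequestTracker`, its methods, and the use of plain strings as tablet identifiers); it is not how the tserver or the monitor tracks unloads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: remember when each unload request started so that a
// monitor or a periodic check can report the count, the oldest age, and any
// requests that have been outstanding longer than a threshold.
class UnloadRequestTracker {
    private final Map<String, Long> startTimes = new ConcurrentHashMap<>();

    /** Records the start of an unload; repeated requests keep the first timestamp. */
    void started(String tabletId, long nowMillis) {
        startTimes.putIfAbsent(tabletId, nowMillis);
    }

    /** Records that the unload completed. */
    void finished(String tabletId) {
        startTimes.remove(tabletId);
    }

    /** Number of unload requests that have not completed. */
    int outstanding() {
        return startTimes.size();
    }

    /** Age of the oldest outstanding request, or 0 if there are none. */
    long oldestAgeMillis(long nowMillis) {
        long oldest = 0;
        for (long start : startTimes.values()) {
            oldest = Math.max(oldest, nowMillis - start);
        }
        return oldest;
    }

    /** Tablets whose unload has been outstanding for at least the threshold. */
    List<String> stuck(long nowMillis, long thresholdMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : startTimes.entrySet()) {
            if (nowMillis - entry.getValue() >= thresholdMillis) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Because `started` keeps the first timestamp, thousands of repeated unload requests for the same tablet would collapse into one entry whose age keeps growing, which is the signal a warning could be based on.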
          Eric Newton added a comment -

          I should mention that the query runs a very complex iterator stack.

          Eric Newton made changes -
          Field: Description

          Original Value:
           * master failed to balance
           * custom balancer refused to balance while migrations were in place
           * tablet server was not unloading the tablet
           * tablet server was otherwise serving tablets, providing status
           * memory dump determined that there were 21K UnloadTabletHandler objects
           * jstack showed UnloadTabletHandler in Tablet.completeClose, line 2674
           * the last print of the debug "completeClose(safeState=true, completeClose=true) occured 9 days ago
           * there was a query that had been for 9 days

          New Value:
           * master failed to balance
           * custom balancer refused to balance while migrations were in place
           * tablet server was not unloading the tablet
           * tablet server was otherwise serving tablets, providing status
           * memory dump determined that there were 21K UnloadTabletHandler objects
           * jstack showed UnloadTabletHandler in Tablet.completeClose, line 2674
           * the last print of the debug "completeClose(safeState=true, completeClose=true) occured 9 days ago
           * there was a query that had been running for 9 days

          Eric Newton created issue -

            People

            • Assignee:
              Eric Newton
              Reporter:
              Eric Newton
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Not Specified
                Remaining:
                0h
                Logged:
                50m
