HBase / HBASE-9787

HCM should not stop retrying after retry timeout if the retry count is not exhausted

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: 0.96.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      See HBASE-9775:

      A comment on the retry time limit, which we may need to fix.
      It was introduced for server-specific retry fallback, which I hope is not broken by recent changes to HCM. That is the logic where we go to one server, retry, wait, retry, wait more, retry, wait more, and then learn that the region went to a different server. At that point we don't need to wait, because we can assume by default that the different server is healthy; the old code, however, would carry on with the wait sequence.
      However, if the region moves around (which is common in aggressive ChaosMonkey (CM) integration tests), the retry count can quickly be exhausted, because we go to each new server only a few times and never reach the higher multipliers. This was especially pronounced with 10 retries, where a request could fail in just a few seconds in the case of a double server failure in which the region is recovered twice; with 31-35 retries it is probably less pronounced now, but still possible.
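      The fast-failure effect described above can be illustrated with a small sketch. The multiplier table, the 100 ms base pause, and all class and method names here are assumptions made for illustration; they are not taken from the issue or from the HBase client code. The sketch compares the total wait when every retry hits the same server with the total wait when the region moves every few attempts and the backoff sequence restarts.

```java
public class FastFailureSketch {
    // Illustrative backoff multipliers (an assumption, not quoted from the issue).
    static final int[] MULTIPLIERS = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100};
    // Assumed base pause between retries, in milliseconds.
    static final long PAUSE_MS = 100;

    /** Total wait if every retry targets the same server, so the multiplier keeps climbing. */
    static long waitSameServer(int maxRetries) {
        long total = 0;
        for (int i = 0; i < maxRetries; i++) {
            total += PAUSE_MS * MULTIPLIERS[Math.min(i, MULTIPLIERS.length - 1)];
        }
        return total;
    }

    /** Total wait if the region moves every attemptsPerServer tries, restarting the sequence each time. */
    static long waitWithMoves(int maxRetries, int attemptsPerServer) {
        long total = 0;
        for (int i = 0; i < maxRetries; i++) {
            int perServerAttempt = i % attemptsPerServer; // backoff index restarts on each new server
            total += PAUSE_MS * MULTIPLIERS[Math.min(perServerAttempt, MULTIPLIERS.length - 1)];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("same server, 10 retries: " + waitSameServer(10) + " ms");
        System.out.println("region moves every 3 tries, 10 retries: " + waitWithMoves(10, 3) + " ms");
    }
}
```

      With these assumed values, ten retries against one server wait 38100 ms in total, while ten retries spread over servers that change every three attempts wait only 1900 ms, which matches the "fail in just a few seconds" behavior.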
      So, the time limit based on the original retry count is supposed to prevent these fast failures by allowing retries to go on for as long as we would have retried "as if" we were just using the multiplier sequence to its "full potential".
      The time limit should not serve as a lower limit; we might want to change the code to check that both time AND count are exhausted in this case.
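      A minimal sketch of the proposed check follows. It assumes a hypothetical helper class, an illustrative multiplier table, and a caller that tracks attempts and elapsed time; none of these names come from the HBase code base. The time budget is the wait the full multiplier sequence would accumulate, and retrying stops only when both the count and that budget are used up.

```java
public class RetryLimitSketch {
    // Illustrative backoff multipliers (an assumption, not quoted from the issue).
    static final int[] MULTIPLIERS = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100};

    /** Time the full multiplier sequence would take for maxRetries attempts. */
    static long timeBudgetMs(long pauseMs, int maxRetries) {
        long total = 0;
        for (int i = 0; i < maxRetries; i++) {
            total += pauseMs * MULTIPLIERS[Math.min(i, MULTIPLIERS.length - 1)];
        }
        return total;
    }

    /** Proposed rule: give up only when the retry count AND the time budget are both exhausted. */
    static boolean shouldStop(int attemptsMade, int maxRetries, long elapsedMs, long budgetMs) {
        return attemptsMade >= maxRetries && elapsedMs >= budgetMs;
    }

    /** For contrast only: give up as soon as either limit is hit, so the time limit can cut retries short. */
    static boolean shouldStopEither(int attemptsMade, int maxRetries, long elapsedMs, long budgetMs) {
        return attemptsMade >= maxRetries || elapsedMs >= budgetMs;
    }

    public static void main(String[] args) {
        long budget = timeBudgetMs(100, 10);
        // Count exhausted quickly because the region kept moving: keep retrying under the proposed rule.
        System.out.println(shouldStop(10, 10, 1900, budget));
        // Time budget exceeded but retries remain: the proposed rule keeps going, the contrast rule stops.
        System.out.println(shouldStop(3, 10, 40000, budget));
        System.out.println(shouldStopEither(3, 10, 40000, budget));
    }
}
```

      Under this rule the time limit extends retrying when the count runs out early, and it never ends retrying while attempts remain.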

        Issue Links

          Activity

          Sergey Shelukhin created issue -
          Sergey Shelukhin made changes -
          Field Original Value New Value
          Description: "See HBASE-9775" -> the full description text shown above
          Sergey Shelukhin made changes -
          Link This issue is related to HBASE-9775 [ HBASE-9775 ]
          Sergey Shelukhin added a comment -

          I see this is already done

          Sergey Shelukhin made changes -
          Link This issue is contained by HBASE-9843 [ HBASE-9843 ]
          Sergey Shelukhin made changes -
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Fix Version/s: (none) -> 0.96.1 [ 12324961 ]
          Resolution: (none) -> Invalid [ 6 ]
          Transition: Open -> Resolved
          Time In Source Status: 54d 21h 17m
          Execution Times: 1
          Last Executer: Sergey Shelukhin
          Last Execution Date: 10/Dec/13 21:49

            People

            • Assignee: Unassigned
            • Reporter: Sergey Shelukhin
            • Votes: 0
            • Watchers: 2
