  1. HBase
  2. HBASE-18143

[AMv2] Backoff on failed report of region transition quickly goes to astronomical time scale

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Region Assignment
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Testing on a cluster w/ aggressive killing: if the Master is killed serially a few times such that it is offline a good while, regionservers that want to report a region transition pause too long between retries.

      Here is the regionserver reporting failures:

        2017-05-31 20:50:53,840 INFO  [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#0) after 1008ms delay (Master is coming online...).
        2017-05-31 20:50:54,853 INFO  [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#1) after 2026ms delay (Master is coming online...).
        2017-05-31 20:50:56,886 INFO  [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#2) after 6084ms delay (Master is coming online...).
        2017-05-31 20:51:02,976 INFO  [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#3) after 30588ms delay (Master is coming online...).
        2017-05-31 20:51:33,570 INFO  [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#4) after 308422ms delay (Master is coming online...).
        2017-05-31 20:56:41,997 INFO  [RS_CLOSE_REGION-ve0542:16020-2] regionserver.HRegionServer: Failed report of region transition server { host_name: "ve0542.halxg.cloudera.com" port: 16020 start_code: 1496279470954 } transition { transition_code: CLOSED region_info { region_id: 1496284931226 table_name { namespace: "default" qualifier: "IntegrationTestBigLinkedList" } start_key: "\337\377\377\377\377\377\377\362" end_key: "\352\252\252\252\252\252\252\234" offline: false split: false replica_id: 0 } }; retry (#5) after 6171203ms delay (Master is coming online...).
      

      See how by the time we get to retry #5, the delay is 6171203ms: we are waiting about 100 minutes before we'll retry. That is too long. Make retry happen more frequently, because the region's data is offline until the close is successfully reported.
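
The logged delays (1008, 2026, 6084, 30588, 308422, 6171203 ms) grow by factors of roughly 2, 3, 5, 10 and 20, plus a little jitter. That pattern would be consistent with each computed delay being fed back in as the base for the next one, though the report itself does not say so. The sketch below illustrates that arithmetic and contrasts it with a capped delay that always scales the original base. The class name, the `MULTIPLIERS` table, the 1000 ms base and the 10-second cap are all assumptions made for illustration; this is not taken from the HBase source or from the attached patches.

```java
import java.util.Arrays;

public class BackoffSketch {
  // Hypothetical multiplier table, chosen only to match the ratios seen in the log.
  static final int[] MULTIPLIERS = {1, 2, 3, 5, 10, 20};

  // Suspected runaway pattern: each computed delay becomes the base for the next one,
  // so the multipliers compound (1 * 2 * 3 * 5 * 10 * 20 = 6000 times the base).
  static long[] compoundingDelays(long basePauseMs, int retries) {
    long[] delays = new long[retries];
    long pause = basePauseMs;
    for (int i = 0; i < retries; i++) {
      pause = pause * MULTIPLIERS[Math.min(i, MULTIPLIERS.length - 1)];
      delays[i] = pause;
    }
    return delays;
  }

  // Bounded alternative: always scale the original base, then cap the result.
  static long[] cappedDelays(long basePauseMs, int retries, long maxPauseMs) {
    long[] delays = new long[retries];
    for (int i = 0; i < retries; i++) {
      long scaled = basePauseMs * MULTIPLIERS[Math.min(i, MULTIPLIERS.length - 1)];
      delays[i] = Math.min(maxPauseMs, scaled);
    }
    return delays;
  }

  public static void main(String[] args) {
    // Jitter is left out so the numbers are deterministic.
    System.out.println(Arrays.toString(compoundingDelays(1000, 6)));
    System.out.println(Arrays.toString(cappedDelays(1000, 6, 10_000)));
  }
}
```

With a 1000 ms base, the compounding variant reaches 6,000,000 ms on the sixth attempt, which lines up with the 6171203 ms in the log once jitter is added, while the capped variant never waits longer than the chosen maximum.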

        Attachments

        1. HBASE-18143.master.001.patch
          3 kB
          Michael Stack
        2. HBASE-18143.master.002.patch
          3 kB
          Michael Stack
        3. HBASE-18143.master.002.patch
          3 kB
          Michael Stack

            People

            • Assignee:
              stack Michael Stack
              Reporter:
              stack Michael Stack