HBase / HBASE-3362

If .META. offline between OPENING and OPENED, then wrong server location in .META. is possible


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.90.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This is a good one. It happened to me while testing OOME in split logging.

      • The balancer moves a region to a new location, regionserver X.
      • Regionserver X successfully opens the region and then goes to update .META.
      • At this point, the server carrying .META. crashes.
      • Regionserver X is stuck waiting for .META. to come back online. It waits so long that the master times out the region-in-transition.
      • The master assigns the region elsewhere, to regionserver Y.
      • The region opens successfully on regionserver Y, which then also parks waiting for .META. to come online.
      • .META. comes online.
      • The two servers X and Y race to update .META.
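      The race above can be replayed deterministically in a few lines. This is a toy model, not HBase code: .META. is reduced to a plain dict with last-writer-wins puts, and the hypothetical `simulate` function takes the order in which the parked servers wake up once .META. returns. Whenever X's stale edit lands last, .META. disagrees with the master's assignment.

```python
def simulate(apply_order):
    """Replay the scenario; return (location in .META., server the master assigned)."""
    meta = {}     # toy .META.: region name -> server location, last writer wins
    parked = {}   # server -> the .META. edit it is blocked on

    # Balancer moves the region to X; X opens it but .META. is down, so X parks.
    assigned_to = "X"
    parked["X"] = ("region-1", "X")

    # The region-in-transition times out; master reassigns to Y, which also parks.
    assigned_to = "Y"
    parked["Y"] = ("region-1", "Y")

    # .META. comes back; parked edits land in whatever order the servers wake up.
    for server in apply_order:
        region, location = parked[server]
        meta[region] = location
    return meta["region-1"], assigned_to


print(simulate(["X", "Y"]))  # Y's edit lands last: .META. matches the assignment
print(simulate(["Y", "X"]))  # X's edit lands last: .META. points at the wrong server
```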

      I saw a case where server X's edit went in after server Y's edit, which means that lookups in .META. return the wrong server. HBCK can detect this situation.
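      Detecting this amounts to cross-checking .META. against what the regionservers report as online. The sketch below illustrates that idea only; it is not how HBCK is actually implemented, and the inputs (`meta` as region-to-server, `deployed` as server-to-regions) plus the function name are assumptions for illustration. It checks only regions that have a .META. row.

```python
def find_inconsistent_regions(meta, deployed):
    """Return regions whose .META. location disagrees with where they are deployed.

    meta:     region name -> server recorded in .META.
    deployed: server name -> set of region names that server reports as online
    Result:   region name -> (server in .META., sorted list of servers hosting it)
    """
    # Invert the per-server view into region -> set of hosting servers.
    actual = {}
    for server, regions in deployed.items():
        for region in regions:
            actual.setdefault(region, set()).add(server)

    problems = {}
    for region, recorded in meta.items():
        hosts = actual.get(region, set())
        # Healthy means exactly one host, and it is the one .META. names.
        if hosts != {recorded}:
            problems[region] = (recorded, sorted(hosts))
    return problems


# X's stale edit won the race, but X has since closed the region and Y serves it.
meta = {"region-1": "X", "region-2": "Y"}
deployed = {"X": set(), "Y": {"region-1", "region-2"}}
print(find_inconsistent_regions(meta, deployed))
```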

      When it wakes up, regionserver X correctly notices that it has lost control of the region, but the damage is already done: its .META. edit is in place.
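      One way a server can "notice" it lost the region is a conditional write on a versioned znode, in the style of ZooKeeper's `setData(path, data, expectedVersion)`, which fails when the version has moved on. The toy below assumes that mechanism; `ToyZnode`, `BadVersion` and `finish_open` are hypothetical names, and the master's reassignment is collapsed into a single write. Note the check only runs when X tries OPENING to OPENED, which is after its .META. edit, so it detects the loss but cannot undo the edit.

```python
class BadVersion(Exception):
    """Raised when the expected version does not match (like ZooKeeper's BadVersion)."""


class ToyZnode:
    """In-memory stand-in for a region-in-transition znode with a version counter."""

    def __init__(self, state, owner):
        self.state, self.owner, self.version = state, owner, 0

    def set(self, state, owner, expected_version):
        # Conditional write: fails if anyone changed the node since the caller read it.
        if expected_version != self.version:
            raise BadVersion(f"expected {expected_version}, found {self.version}")
        self.state, self.owner = state, owner
        self.version += 1
        return self.version


def finish_open(znode, server, my_version):
    """Try OPENING -> OPENED; return False if this server has lost the region."""
    try:
        znode.set("OPENED", server, my_version)
        return True
    except BadVersion:
        return False


znode = ToyZnode("OPENING", "X")
x_version = znode.version                        # X remembers the version it owns
y_version = znode.set("OPENING", "Y", x_version)  # master times out, hands region to Y
print(finish_open(znode, "X", x_version))        # X wakes up: False, region lost
print(finish_open(znode, "Y", y_version))        # Y completes the open: True
```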

      While we were chatting, Jon suggested that regionserver X should roll back the .META. edit, i.e. do an explicit delete of what it added. I think this would work, but after chatting more, I'll make a fix that keeps updating the ZooKeeper OPENING state while the edit goes on in a separate thread. Continuously re-setting OPENING means the region-in-transition will not time out.
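      The planned fix is essentially a heartbeat: run the possibly blocking .META. edit in a worker thread and, while it is alive, keep re-asserting OPENING so the master's region-in-transition timeout never fires. A minimal sketch, assuming the edit and the ZooKeeper refresh are passed in as plain callables (`do_meta_edit` and `tickle_opening` are hypothetical names, not HBase's API) and that `tickle_opening` returns False once the region is no longer ours.

```python
import threading
import time


def update_meta_with_heartbeat(do_meta_edit, tickle_opening, interval=0.05):
    """Run do_meta_edit() in a worker thread while periodically re-asserting OPENING.

    Returns True if the edit completed while we still held the region, False if a
    tickle reported that the region was lost. Re-raises any error from the edit.
    """
    outcome = {}

    def run():
        try:
            do_meta_edit()
        except Exception as exc:  # hand the failure back to the calling thread
            outcome["error"] = exc

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    while worker.is_alive():
        if not tickle_opening():
            return False          # lost the region: do not go on to report OPENED
        worker.join(timeout=interval)
    if "error" in outcome:
        raise outcome["error"]
    return True


tickles = []

def slow_edit():
    time.sleep(0.2)               # stands in for waiting on .META. to come back

def tickle():
    tickles.append(time.monotonic())
    return True

ok = update_meta_with_heartbeat(slow_edit, tickle)
print(ok, len(tickles))           # True, plus several tickles made during the wait
```

      The heartbeat only prevents the timeout; if a tickle ever fails, the worker may still complete its .META. edit afterwards, which is the case the rollback idea would cover.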


            People

            • Assignee:
              Michael Stack
              Reporter:
              Michael Stack
            • Votes:
              0
              Watchers:
              1
