  1. HBase
  2. HBASE-20018

Safe online META repair


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: hbck
    • Labels: None

    Description

      HBCK is a tank, or a giant shotgun, or choose the battlefield metaphor you feel is most appropriate. It rolls onto the field and leaves problems crushed in its wake, but if you point it in the wrong direction, it will crush your production data too. As such, it is a means of last resort to fix an ailing cluster. It is also imperative that user request traffic, writes in particular, be stopped before attempting a number of the fixes. The default "-repair" option is unlikely to be what you want: it turns on too many fixes to risk at one time. There are a large number of command line switches for individual checks and fixes, which are very useful but also error prone when cobbling together a command line for a cluster fix under pressure. An operations team might hesitate to employ hbck to fix some accumulating bad state, because of the disruption its use requires and the risk of compounding the problem if it is not done carefully. That, of course, would be bad, because the accumulating bad state will eventually have an availability impact.

      It should be safer to use hbck, but changing hbck also carries risk. We can leave it be as the useful (but dangerous) tool it is and focus on making a subset of its functionality safer.

      There is a class of META corruptions of mild to moderate severity which could, in theory, be handled more safely in an online manner, without requiring a suspension of user traffic. Some things hbck does are safe enough to use directly for this. Others need tweaks to do more preflight checks first (like checking region states). Develop these as a separate tool, maybe even a new HMaster or Admin component.
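
      As a rough illustration of the preflight idea, here is a minimal sketch, not an actual hbck or HMaster API. The State enum, the choice of which states count as stable, and the MetaRepairPreflight class are all hypothetical; a real tool would read region states from the master rather than from a map.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical preflight gate for an online META repair; not part of HBase. */
final class MetaRepairPreflight {

    /** Simplified, illustrative region states (an assumption, not HBase's own enum). */
    enum State { OPEN, CLOSED, OPENING, CLOSING, SPLITTING, MERGING }

    /** Assumption: only regions at rest are considered safe to repair online. */
    private static final Set<State> STABLE = EnumSet.of(State.OPEN, State.CLOSED);

    /** Returns the names of regions whose current state should block a repair. */
    static List<String> regionsBlockingRepair(Map<String, State> regionStates) {
        List<String> blocking = new ArrayList<>();
        for (Map.Entry<String, State> entry : regionStates.entrySet()) {
            if (!STABLE.contains(entry.getValue())) {
                blocking.add(entry.getKey());
            }
        }
        return blocking;
    }

    public static void main(String[] args) {
        // Made-up region names and states, for demonstration only.
        Map<String, State> states = new LinkedHashMap<>();
        states.put("region-a", State.OPEN);
        states.put("region-b", State.SPLITTING);

        List<String> blocking = regionsBlockingRepair(states);
        if (blocking.isEmpty()) {
            System.out.println("Preflight passed; repair may proceed.");
        } else {
            System.out.println("Preflight failed; regions in transition: " + blocking);
        }
    }
}
```

      The point of the design is that the fix refuses to run, rather than proceeding, whenever any affected region is mid-transition.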

      Look for opportunities to share code with the existing hbck by refactoring it into a shared library.

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: Andrew Kyle Purtell (apurtell)

            Dates

              Created:
              Updated:
