HBase / HBASE-10856 Prep for 1.0 / HBASE-11094

Distributed log replay is incompatible for rolling restarts


Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • None
    • 0.99.0, 0.98.4
    • None
    • None
    • Reviewed
    • Release Note:
      The configuration setting "hbase.master.distributed.log.replay" is read only by the Master, which is the source of truth. Region servers that participate in the recovery process recover failed region servers in either log splitting or log replay mode, depending on the mode the Master tells them to use.

      When the "hbase.master.distributed.log.replay" setting changes, the Master waits for all existing log recovery work items to drain before it applies the new setting. This avoids mixing recovery modes and spares the administrator from manually waiting for all recovery work to finish before restarting the Master.
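The draining behavior described in the release note can be sketched roughly as follows. This is a minimal illustration and not HBase's implementation: the class `RecoveryModeTracker`, its methods, and the in-memory task counter are hypothetical names introduced here, and the fallback default of `false` is chosen for the sketch only. The only identifier taken from the issue is the property name `hbase.master.distributed.log.replay`.

```java
import java.util.Properties;

/**
 * Hypothetical sketch: the master applies a changed recovery mode only
 * after all outstanding recovery work items have drained.
 */
final class RecoveryModeTracker {
    enum RecoveryMode { LOG_SPLITTING, LOG_REPLAY }

    // Property name from the release note; everything else is illustrative.
    static final String REPLAY_KEY = "hbase.master.distributed.log.replay";

    private RecoveryMode activeMode;   // mode currently handed to region servers
    private int outstandingTasks;      // recovery work items not yet finished

    RecoveryModeTracker(Properties masterConf) {
        this.activeMode = modeFrom(masterConf);
    }

    static RecoveryMode modeFrom(Properties conf) {
        // The "false" fallback is an assumption of this sketch, not HBase's default.
        boolean replay = Boolean.parseBoolean(conf.getProperty(REPLAY_KEY, "false"));
        return replay ? RecoveryMode.LOG_REPLAY : RecoveryMode.LOG_SPLITTING;
    }

    void taskSubmitted() {
        outstandingTasks++;
    }

    void taskFinished() {
        if (outstandingTasks == 0) {
            throw new IllegalStateException("no outstanding recovery task");
        }
        outstandingTasks--;
    }

    /** Returns the mode to use; switches to the configured mode only when no work is pending. */
    RecoveryMode resolveMode(Properties currentMasterConf) {
        RecoveryMode configured = modeFrom(currentMasterConf);
        if (configured != activeMode && outstandingTasks == 0) {
            activeMode = configured;
        }
        return activeMode;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(REPLAY_KEY, "false");
        RecoveryModeTracker tracker = new RecoveryModeTracker(conf);

        tracker.taskSubmitted();                        // a recovery work item is in flight
        conf.setProperty(REPLAY_KEY, "true");           // the setting is changed
        System.out.println(tracker.resolveMode(conf));  // old mode is kept while work is pending

        tracker.taskFinished();                         // work has drained
        System.out.println(tracker.resolveMode(conf));  // the new mode now applies
    }
}
```

The point of the sketch is the single guard in `resolveMode`: the configured value is never applied while work items are pending, so two recovery modes are never active at once.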

    Description

      0.99.0 ships with distributed log replay enabled by default (HBASE-10888). However, after reading the code and discussing it with Jeffrey, we realized that the distributed log replay code is not compatible with rolling upgrades from 0.98.0 and 1.0.0.

      The issue is that the region server looks at its own configuration to decide whether a region should be opened in replay mode. The open region RPC does not carry that information. So if distributed log replay is enabled on the master, the master assigns the region and schedules replay tasks. If the region is then opened on a region server that does not have this setting enabled, that server will happily open the region in normal mode (not replay mode), causing possible (transient) data loss.
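The mismatch described above can be illustrated with a small model, along with the direction of the fix stated in the release note (the region server follows what the Master tells it). The types `OpenRegionRequest` and `RegionServerModel` are hypothetical and do not correspond to HBase's actual RPC messages or classes. This is a sketch under those assumptions, not the patch itself.

```java
/** Hypothetical model of the replay-mode mismatch between master and region server. */
final class ReplayModeMismatch {

    /** Stand-in for an open-region request; not HBase's real RPC message. */
    static final class OpenRegionRequest {
        final String regionName;
        // null models the original RPC, which carried no recovery-mode information.
        final Boolean masterSaysReplay;

        OpenRegionRequest(String regionName, Boolean masterSaysReplay) {
            this.regionName = regionName;
            this.masterSaysReplay = masterSaysReplay;
        }
    }

    /** Stand-in for a region server with its own local configuration value. */
    static final class RegionServerModel {
        private final boolean localReplayEnabled;

        RegionServerModel(boolean localReplayEnabled) {
            this.localReplayEnabled = localReplayEnabled;
        }

        /** Behavior described in the issue: only the local configuration is consulted. */
        boolean opensInReplayModeLocalOnly(OpenRegionRequest request) {
            return localReplayEnabled;
        }

        /** Sketch of the fix direction: the master's decision wins when it is present. */
        boolean opensInReplayModeMasterDriven(OpenRegionRequest request) {
            return request.masterSaysReplay != null ? request.masterSaysReplay : localReplayEnabled;
        }
    }

    public static void main(String[] args) {
        // Master has replay enabled; the region server's local setting is disabled.
        RegionServerModel server = new RegionServerModel(false);
        OpenRegionRequest request = new OpenRegionRequest("region-a", Boolean.TRUE);

        // Local-only decision disagrees with the master, which has scheduled replay tasks.
        System.out.println("local only:    " + server.opensInReplayModeLocalOnly(request));
        // Master-driven decision agrees with the master.
        System.out.println("master driven: " + server.opensInReplayModeMasterDriven(request));
    }
}
```

In the local-only case the region opens in normal mode even though the master has scheduled replay tasks for it, which is the window for the transient data loss the description warns about.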

      Attachments

        1. hbase-11094.patch
          88 kB
          Jeffrey Zhong
        2. hbase-11094-v2.patch
          97 kB
          Jeffrey Zhong
        3. hbase-11094-v3.patch
          68 kB
          Jeffrey Zhong
        4. hbase-11094-v4.patch
          115 kB
          Jeffrey Zhong
        5. hbase-11094-v5.1.patch
          118 kB
          Jeffrey Zhong
        6. hbase-11094-v5.patch
          116 kB
          Jeffrey Zhong


            People

              Assignee: Jeffrey Zhong (jeffreyz)
              Reporter: Enis Soztutar (enis)
              Votes: 0
              Watchers: 14
