  1. HBase
  2. HBASE-26722

Snapshot is corrupted due to interaction between move, warmupRegion, compaction, and HFileArchiver

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 1.3.5
    • Fix Version/s: 2.2.0, 2.3.0
    • Component/s: Compaction, mover, snapshots
    • Labels: None

    Description

      A specific sequence of events leads to split-brain, double-assignment-style behavior in the management of store files: two RegionServers each act as the owner of the same region's files, and snapshot data is lost as a result.

      The scenario is this:

      1. Take snapshot
      2. RegionX of snapshotted table is hosted on RegionServer1.
      3. Stop RegionServer1 gracefully, using region_mover to move all of its regions to other RegionServers via move RPCs.
      4. RegionX is now opened on RegionServer2.
      5. RegionServer2 compacts RegionX after opening.
      6. RegionServer1 starts and uses region_mover to move all previously owned regions back to itself.
      7. The HMaster RPC to move calls warmupRegion on RegionServer1.
      8. As part of warmupRegion, RegionServer1 opens all store files of RegionX. CompactedHFilesDischarger chore has not yet archived the pre-compacted store file. RegionServer1 finds both the pre-compacted store file and post-compacted store file. It logs a warning and archives the pre-compacted file.
      9. RegionServer1 has warmed up the region, so now HMaster resumes the move and sends close RegionX to RegionServer2.
      10. RegionServer2 closes its store files. As part of this, it archives any compacted files which have not yet been archived by the CompactedHFilesDischarger chore.
      11. Since RegionServer1 already archived the file, RegionServer2's HFileArchiver finds the destination archive file already exists. (code link)
      12. RegionServer2 renames the existing archived file to a timestamped "backup" name to free up the desired destination filename.
        With the archived file renamed, RegionServer2 attempts to archive the file as planned. But the source file no longer exists, because RegionServer1 already moved it to the very location RegionServer2 expected to use.
      13. RegionServer2 silently ignores this archival failure. (code link)
      14. HMaster HFileCleaner chore later deletes the renamed archive file, because there is no active reference to it. (The snapshot reference is to the original named file, not the "backup" timestamped version.) The snapshot data is irretrievably lost.
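
      The archival race in steps 8 through 14 can be reproduced in a toy model. The sketch below is not HBase code: ToyFs, archive, clean, and the file paths are hypothetical stand-ins, and it assumes only what the scenario states (an existing archive destination is renamed aside with a timestamp, a failed move is ignored, and the cleaner deletes archived files that have no reference).

```python
class ToyFs:
    """In-memory stand-in for the shared filesystem: path -> contents."""

    def __init__(self):
        self.files = {}

    def rename(self, src, dst):
        # Fails (returns False) when the source is missing or the target exists.
        if src not in self.files or dst in self.files:
            return False
        self.files[dst] = self.files.pop(src)
        return True


def archive(fs, store_path, archive_path, now):
    """Hypothetical archiver that follows steps 11-13 of the scenario."""
    if archive_path in fs.files:
        # Destination already exists: move it aside under a timestamped name.
        fs.rename(archive_path, f"{archive_path}.{now}")
    # Attempt the planned move; the caller ignores a False result.
    return fs.rename(store_path, archive_path)


def clean(fs, referenced):
    """Hypothetical cleaner: delete archived files that nothing references."""
    stale = [p for p in fs.files
             if p.startswith("archive/") and p not in referenced]
    for path in stale:
        del fs.files[path]


def run_scenario():
    fs = ToyFs()
    fs.files["data/regionX/old_hfile"] = "snapshot data"   # pre-compaction file
    fs.files["data/regionX/new_hfile"] = "compacted data"  # post-compaction file
    # The snapshot refers to the file by its original name.
    snapshot_refs = {"archive/regionX/old_hfile"}

    # Step 8: RegionServer1 (warmup) archives the pre-compaction file.
    rs1_ok = archive(fs, "data/regionX/old_hfile",
                     "archive/regionX/old_hfile", now=1000)
    # Steps 10-13: RegionServer2 closes the region and archives the same file.
    rs2_ok = archive(fs, "data/regionX/old_hfile",
                     "archive/regionX/old_hfile", now=2000)
    # Step 14: the cleaner removes the unreferenced, renamed copy.
    clean(fs, snapshot_refs)
    return rs1_ok, rs2_ok, fs.files
```

      In this model the second archive call returns False after having already renamed the only good copy, so the cleaner then removes the last file that held the snapshot's data.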

      HBASE-26718 tracks a potential, specific change to the archival process to avoid this specific issue.

      However, there is a more fundamental problem here: a region opened by warmupRegion can operate on that region's store files while the region is open elsewhere, which must not be allowed.
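
      The invariant stated above can be expressed as an ownership check before any store file is moved. This is a minimal sketch under assumed names (RegionHandle, NotRegionOwnerError, and the assignments map are hypothetical); it illustrates the rule only and does not describe how HBase implements or fixes it.

```python
class NotRegionOwnerError(Exception):
    """Raised when a non-owning opener tries to modify store files."""


class RegionHandle:
    def __init__(self, region, server, warmup):
        self.region = region
        self.server = server
        self.warmup = warmup  # True when opened only to warm up the region

    def archive_compacted_file(self, path, assignments, archived):
        # Only the assigned server, holding a real (non-warmup) open,
        # is allowed to move store files.
        if self.warmup or assignments.get(self.region) != self.server:
            raise NotRegionOwnerError(
                f"{self.server} does not own {self.region}")
        archived.append(path)


def demo():
    assignments = {"RegionX": "RegionServer2"}  # current owner
    archived = []

    warm = RegionHandle("RegionX", "RegionServer1", warmup=True)
    try:
        warm.archive_compacted_file("old_hfile", assignments, archived)
        refused = False
    except NotRegionOwnerError:
        refused = True

    owner = RegionHandle("RegionX", "RegionServer2", warmup=False)
    owner.archive_compacted_file("old_hfile", assignments, archived)
    return refused, archived
```

      With such a check, the warmup open on RegionServer1 would be refused in step 8, and only RegionServer2 would archive the pre-compacted file when it closes the region.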

      This was seen on branch-1, and results from the combination of HBASE-22330 and the absence of the fix for HBASE-22163.

            People

              Assignee: Unassigned
              Reporter: David Manning (dmanning)