Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Duplicate
- Affects Version/s: 1.3.5
- Fix Version/s: None
Description
An interesting sequence of events leads to split-brain, double-assignment behavior in the management of store files, and ultimately to the loss of snapshot data.
The scenario is this:
- Take a snapshot of a table.
- RegionX of snapshotted table is hosted on RegionServer1.
- Stop RegionServer1, using region_mover, gracefully moving all regions to other regionservers using move RPCs.
- RegionX is now opened on RegionServer2.
- RegionServer2 compacts RegionX after opening.
- RegionServer1 starts and uses region_mover to move all previously owned regions back to itself.
- The HMaster RPC to move calls warmupRegion on RegionServer1.
- As part of warmupRegion, RegionServer1 opens all store files of RegionX. CompactedHFilesDischarger chore has not yet archived the pre-compacted store file. RegionServer1 finds both the pre-compacted store file and post-compacted store file. It logs a warning and archives the pre-compacted file.
- RegionServer1 has warmed up the region, so now HMaster resumes the move and sends close RegionX to RegionServer2.
- RegionServer2 closes its store files. As part of this, it archives any compacted files which have not yet been archived by the CompactedHFilesDischarger chore.
- Since RegionServer1 already archived the file, RegionServer2's HFileArchiver finds that the destination archive file already exists. (code link)
- RegionServer2 renames the existing archive file to a timestamped "backup" name, to free up the desired destination filename.
- With the archived file renamed, RegionServer2 attempts to archive the file as planned. But the source file doesn't exist, because RegionServer1 already moved it... to the location RegionServer2 expected to use!
- RegionServer2 silently ignores this archival failure. (code link)
- HMaster HFileCleaner chore later deletes the renamed archive file, because there is no active reference to it. (The snapshot reference is to the original named file, not the "backup" timestamped version.) The snapshot data is irretrievably lost.
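The sequence above can be replayed in a small simulation. This is only a sketch under stated assumptions, not HBase code: `FakeFs`, `archive`, `clean_archive`, and the paths are hypothetical stand-ins for the filesystem, HFileArchiver, and HFileCleaner behavior described in the steps, and snapshot references are modelled simply as a set of file names.

```python
class FakeFs:
    """Tiny in-memory stand-in for the filesystem: maps path -> contents."""

    def __init__(self):
        self.files = {}

    def exists(self, path):
        return path in self.files

    def rename(self, src, dst):
        # Returns False rather than raising when the rename cannot happen.
        if src not in self.files or dst in self.files:
            return False
        self.files[dst] = self.files.pop(src)
        return True

    def delete(self, path):
        self.files.pop(path, None)


def archive(fs, src, dst, timestamp):
    """Hypothetical archiver following the behavior described above."""
    if fs.exists(dst):
        # Destination already taken: move it aside under a timestamped name.
        fs.rename(dst, f"{dst}.{timestamp}")
    # The source may already be gone; the caller ignores a False result.
    return fs.rename(src, dst)


def clean_archive(fs, referenced_names):
    """Hypothetical cleaner: delete archived files whose name is unreferenced."""
    for path in [p for p in fs.files if p.startswith("archive/")]:
        if path.rsplit("/", 1)[-1] not in referenced_names:
            fs.delete(path)


def simulate():
    fs = FakeFs()
    fs.files["data/regionX/old_hfile"] = "snapshotted rows"
    fs.files["data/regionX/compacted_hfile"] = "snapshotted rows"
    snapshot_refs = {"old_hfile"}  # the snapshot refers to the original name

    src, dst = "data/regionX/old_hfile", "archive/regionX/old_hfile"
    rs1_ok = archive(fs, src, dst, timestamp=1)  # RegionServer1, during warmup
    rs2_ok = archive(fs, src, dst, timestamp=2)  # RegionServer2, during close
    after_archive = sorted(fs.files)

    clean_archive(fs, snapshot_refs)  # cleaner chore runs later
    return rs1_ok, rs2_ok, after_archive, sorted(fs.files)
```

In this model, the second `archive` call moves the only good copy to a timestamped name and then fails to replace it, so the cleaner removes the last copy of the data the snapshot refers to.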
HBASE-26718 tracks a potential, specific change to the archival process to avoid this specific issue.
However, there is a more fundamental problem here: a region opened by warmupRegion can operate on that region's store files while the region is still open elsewhere, which must not be allowed.
This was seen on branch-1, and is caused by the combination of having HBASE-22330 and not having the fix for HBASE-22163.
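As an illustration of the invariant, a warmup open could be treated as strictly read-only with respect to store files. The sketch below is a hypothetical model of that idea, in the spirit of the HBASE-22163 title ("Should not archive the compacted store files when region warmup"); it is not the actual patch, and `open_store_files` and its parameters are invented for illustration.

```python
def open_store_files(present, compacted_away, warmup):
    """Return (files_to_serve, files_to_archive) for a region open.

    present: file names found in the store directory.
    compacted_away: names known to be inputs of a finished compaction.
    warmup: True when this open is only a warmup, so the region may
            still be open (and owned) on another server.
    """
    to_serve = [f for f in present if f not in compacted_away]
    stale = [f for f in present if f in compacted_away]
    # A warmup open must not mutate shared state: leave stale files
    # for the server that currently owns the region.
    to_archive = [] if warmup else stale
    return to_serve, to_archive
```

With `warmup=True` the stale pre-compaction file is skipped but left in place, so only the owning RegionServer ever archives it.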
Issue Links
- is fixed by
  - HBASE-22163 Should not archive the compacted store files when region warmup (Resolved)
- is related to
  - HBASE-27974 CompactionServer cause the loss of HFile references in snapshot (Open)
  - HBASE-26718 HFileArchiver can remove referenced StoreFiles from the archive (Resolved)
- relates to
  - HBASE-26726 Allow disable of region warmup before graceful move (Resolved)
- requires
  - HBASE-22330 Backport HBASE-20724 (Sometimes some compacted storefiles are still opened after region failover) to branch-1 (Resolved)
  - HBASE-20724 Sometimes some compacted storefiles are still opened after region failover (Closed)