Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
We had an incremental backup fail with a FileNotFoundException for a file in the WALs directory. Upon investigation, the log had been archived a few minutes earlier. WALInputFormat's record reader supports falling back to the archived path:
    } catch (IOException e) {
      Path archivedLog = AbstractFSWALProvider.findArchivedLog(logFile, conf);
      // archivedLog can be null if unable to locate in archiveDir.
      if (archivedLog != null) {
        openReader(archivedLog);
        // Try call again in recursion
        return nextKeyValue();
      } else {
        throw e;
      }
    }
But the getSplits method has different handling:
    try {
      List<FileStatus> files = getFiles(fs, inputPath, startTime, endTime);
      allFiles.addAll(files);
    } catch (FileNotFoundException e) {
      if (ignoreMissing) {
        LOG.warn("File " + inputPath + " is missing. Skipping it.");
        continue;
      }
      throw e;
    }
This ignoreMissing variable was added in HBASE-14141 and is enabled via wal.input.ignore.missing.files, which defaults to false and is never set. Looking at the comment and Review Board history of HBASE-14141, I think there might have been some confusion about where to handle these missing files, and this got lost in the shuffle.
I would prefer not to ignore missing WALs. I think that could result in some weird behavior:
- A RegionServer has 10 archived and 30 not-yet-archived WALs that need to be backed up
- The process starts, and while it is running, 1 of those 30 WALs gets archived. That WAL would be skipped because of the FileNotFoundException
- But the remaining 29 would be backed up
This scenario could cause data consistency issues if this incremental backup is restored: the restore would be missing some edits in the middle of the edits applied from the other WALs.
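The hole left by a skipped WAL can be illustrated with a small, self-contained model. This is only a sketch: the `Wal` class, the contiguous edit-id ranges, and the `findGaps` helper are hypothetical simplifications for illustration, not HBase classes or the way HBase actually tracks sequence ids.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class WalGapDemo {
    /** Hypothetical stand-in for a WAL: a name and the inclusive range of edit ids it holds. */
    static final class Wal {
        final String name;
        final long firstSeq;
        final long lastSeq;

        Wal(String name, long firstSeq, long lastSeq) {
            this.name = name;
            this.firstSeq = firstSeq;
            this.lastSeq = lastSeq;
        }
    }

    /** Builds `count` WALs whose edit-id ranges follow each other without holes. */
    static List<Wal> contiguousWals(int count, long editsPerWal) {
        List<Wal> wals = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            long first = i * editsPerWal + 1;
            wals.add(new Wal("wal-" + i, first, first + editsPerWal - 1));
        }
        return wals;
    }

    /** Returns the [from, to] edit-id ranges not covered between consecutive backed-up WALs. */
    static List<long[]> findGaps(List<Wal> backedUp) {
        List<Wal> sorted = new ArrayList<>(backedUp);
        sorted.sort(Comparator.comparingLong(w -> w.firstSeq));
        List<long[]> gaps = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            long expected = sorted.get(i - 1).lastSeq + 1;
            if (sorted.get(i).firstSeq > expected) {
                gaps.add(new long[] {expected, sorted.get(i).firstSeq - 1});
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<Wal> live = contiguousWals(30, 100);
        List<Wal> backedUp = new ArrayList<>(live);
        backedUp.remove(12); // this WAL was archived mid-run and silently skipped
        for (long[] gap : findGaps(backedUp)) {
            System.out.println("missing edits " + gap[0] + ".." + gap[1]);
        }
    }
}
```

In this model the 29 remaining WALs restore cleanly, yet the edits of the skipped WAL are absent from the middle of the sequence, which is the inconsistency described above.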
So I do think failing as we do today is necessary for consistency, but it is unrealistic in a live cluster, where WALs are archived while backups run. The solution is to try finding the missing file in the archive directory. Backup has a coprocessor which will not allow the archived file to be cleaned up until it has been backed up, so I think it's safe to say that a WAL is definitely in either WALs or oldWALs.
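A minimal sketch of that lookup order, using `java.nio.file` on a local directory as a stand-in for the Hadoop FileSystem. The `WalLocator` class and its `locate` method are hypothetical names for illustration, and the flat archive layout (same file name directly under the archive directory) is an assumption; an actual fix would more likely reuse AbstractFSWALProvider.findArchivedLog, as the record reader does.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WalLocator {
    /**
     * Returns where the WAL currently lives: its original path if it is still there,
     * otherwise a file with the same name under archiveDir.
     * Throws FileNotFoundException only if it is in neither place.
     */
    static Path locate(Path walPath, Path archiveDir) throws FileNotFoundException {
        if (Files.exists(walPath)) {
            return walPath;
        }
        Path archived = archiveDir.resolve(walPath.getFileName());
        if (Files.exists(archived)) {
            return archived;
        }
        throw new FileNotFoundException(
            walPath + " not found in its original location or in " + archiveDir);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("wal-demo");
        Path wals = Files.createDirectories(root.resolve("WALs"));
        Path oldWals = Files.createDirectories(root.resolve("oldWALs"));
        Path wal = Files.createFile(wals.resolve("wal-1"));

        System.out.println(locate(wal, oldWals)); // still under WALs

        // Simulate the WAL being archived while the backup is running.
        Files.move(wal, oldWals.resolve(wal.getFileName()));
        System.out.println(locate(wal, oldWals)); // now resolved under oldWALs
    }
}
```

Failing only when the file is in neither directory keeps the strict behavior that protects consistency while tolerating the normal archiving race. The existence check and the later open are still not atomic, so a caller would want to retry the lookup if the open itself fails.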
Issue Links
- relates to
  - HBASE-19681 Online snapshot creation failing with missing store file (Open)
  - HBASE-28602 Incremental backup fails when WALs move (Open)
  - HBASE-28461 Timing issue in Incremental Backup or TestIncrementalBackup (Open)