Details
Bug
Status: Resolved
Minor
Resolution: Fixed
2.5.8
Reviewed
Description
In our production clusters we have observed that when a WAL close fails, the old WAL files are not marked as closed and therefore cannot be cleaned up. When a WAL close fails in closeWriter, it increments the error count:
Span span = Span.current();
try {
  span.addEvent("closing writer");
  writer.close();
  span.addEvent("writer closed");
} catch (IOException ioe) {
  int errors = closeErrorCount.incrementAndGet();
  boolean hasUnflushedEntries = isUnflushedEntries();
  if (syncCloseCall && (hasUnflushedEntries || (errors > this.closeErrorsTolerated))) {
    LOG.error("Close of WAL " + path + " failed. Cause=\"" + ioe.getMessage() + "\", errors="
      + errors + ", hasUnflushedEntries=" + hasUnflushedEntries);
    throw ioe;
  }
  LOG.warn("Riding over failed WAL close of " + path
    + "; THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK", ioe);
}
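To make the rethrow condition in that catch block easier to reason about, here is a minimal sketch that restates it as a pure function. The class name CloseFailurePolicySketch, the method shouldRethrow, and the tolerance value of 2 are assumptions made for illustration; they are not part of HBase.

```java
/** Hypothetical helper restating the catch-block condition quoted above; not HBase code. */
class CloseFailurePolicySketch {

  /** Returns true when a failed close would be rethrown rather than ridden over. */
  static boolean shouldRethrow(boolean syncCloseCall, boolean hasUnflushedEntries,
      int errors, int closeErrorsTolerated) {
    return syncCloseCall && (hasUnflushedEntries || errors > closeErrorsTolerated);
  }

  public static void main(String[] args) {
    int tolerated = 2; // assumed tolerance, for illustration only

    // Asynchronous close: never rethrown, whatever the other inputs are.
    System.out.println(shouldRethrow(false, true, 5, tolerated));
    // Synchronous close, no unflushed entries, error count still within tolerance.
    System.out.println(shouldRethrow(true, false, 2, tolerated));
    // Synchronous close, error count above tolerance.
    System.out.println(shouldRethrow(true, false, 3, tolerated));
    // Synchronous close with unflushed entries: rethrown even on the first error.
    System.out.println(shouldRethrow(true, true, 1, tolerated));
  }
}
```

Under these assumptions, a failure on the asynchronous path is always logged and ridden over, while the error count keeps growing.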
After only two WAL close errors, doReplaceWALWriter enters this code block:
if (isUnflushedEntries() || closeErrorCount.get() >= this.closeErrorsTolerated) {
  try {
    closeWriter(this.writer, oldPath, true);
  } finally {
    inflightWALClosures.remove(oldPath.getName());
  }
}
In this block the old WAL is never marked as closed, unlike in the asynchronous close path:
Writer localWriter = this.writer;
closeExecutor.execute(() -> {
  try {
    closeWriter(localWriter, oldPath, false);
  } catch (IOException e) {
    LOG.warn("close old writer failed", e);
  } finally {
    // call this even if the above close fails, as there is no other chance we can set
    // closed to true, it will not cause big problems.
    {color:red}markClosedAndClean(oldPath);{color}
    inflightWALClosures.remove(oldPath.getName());
  }
});
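One possible direction, sketched below, is to mirror the asynchronous path and call markClosedAndClean in the finally block of the synchronous path as well. This is a self-contained simulation, not the committed patch: the class SyncCloseFixSketch, the Writer interface, the closedWals set, and the use of plain strings for WAL names are all hypothetical stand-ins for the real HBase types.

```java
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical simulation of the synchronous close path; not HBase code. */
class SyncCloseFixSketch {

  /** Minimal stand-in for a WAL writer whose close may fail. */
  interface Writer {
    void close() throws IOException;
  }

  static final Set<String> inflightWALClosures = ConcurrentHashMap.newKeySet();
  // Stand-in for the "closed" flag that makes an old WAL eligible for cleanup.
  static final Set<String> closedWals = ConcurrentHashMap.newKeySet();

  static void markClosedAndClean(String walName) {
    closedWals.add(walName);
  }

  /** Synchronous close that marks the WAL closed even when close() throws. */
  static void closeOldWriterSync(Writer writer, String walName) throws IOException {
    inflightWALClosures.add(walName);
    try {
      writer.close();
    } finally {
      // Same reasoning as the asynchronous path: there is no later chance to mark it closed.
      markClosedAndClean(walName);
      inflightWALClosures.remove(walName);
    }
  }

  public static void main(String[] args) {
    Writer failing = () -> {
      throw new IOException("simulated close failure");
    };
    try {
      closeOldWriterSync(failing, "wal.1");
    } catch (IOException e) {
      System.out.println("close failed: " + e.getMessage());
    }
    System.out.println("marked closed: " + closedWals.contains("wal.1"));
    System.out.println("still in flight: " + inflightWALClosures.contains("wal.1"));
  }
}
```

In this sketch the exception still propagates to the caller, so the existing error handling is unaffected; only the bookkeeping that allows the old WAL to be cleaned runs unconditionally.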