while (remaining > 0) {
  int toRead = (int) Math.min(DEFAULT_BLOCK_SIZE, remaining);
  byte[] data = new byte[toRead];
  long startPos = corruptFileLen - remaining;
  fdis.readFully(startPos, data, 0, toRead);

  // find all MAGIC string and see if the file is readable from there
  int index = 0;
  long nextFooterOffset;
  byte[] magicBytes = OrcFile.MAGIC.getBytes(StandardCharsets.UTF_8);
  while (index != -1) {
    index = indexOf(data, magicBytes, index + 1);
    if (index != -1) {
      nextFooterOffset = startPos + index + magicBytes.length + 1;
      if (isReadable(corruptPath, conf, nextFooterOffset)) {
        footerOffsets.add(nextFooterOffset);
      }
    }
  }

  System.err.println("Scanning for valid footers - startPos: " + startPos +
      " toRead: " + toRead + " remaining: " + remaining);
  remaining = remaining - toRead;
}
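OrcFile.MAGIC may straddle the boundary between two adjacent reads, with its first bytes at the end of one block and the rest at the start of the next. Because the current implementation only matches within a single read, such an occurrence is never found, and the corresponding recovery offset in the file is missed.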
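One possible way to handle the boundary case is to let each search window start pattern.length - 1 bytes before its block, so that a match ending inside the block is always seen in full. The following is only a minimal sketch of that idea, not the fix adopted by the project: the class OverlapScan and the methods findAll and matchesAt are hypothetical names, and an in-memory byte array stands in for the fdis.readFully calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of ORC.
final class OverlapScan {
    private OverlapScan() {}

    /**
     * Scans `file` in blocks of `blockSize` bytes and returns the absolute
     * offsets of every occurrence of `pattern`, including occurrences that
     * straddle a block boundary.
     */
    static List<Long> findAll(byte[] file, byte[] pattern, int blockSize) {
        List<Long> hits = new ArrayList<>();
        if (pattern.length == 0 || blockSize <= 0) {
            return hits;
        }
        int overlap = pattern.length - 1;
        for (long pos = 0; pos < file.length; pos += blockSize) {
            // Start the window `overlap` bytes before the block so that a
            // match beginning in the previous block is still visible.
            long windowStart = Math.max(0, pos - overlap);
            int windowEnd = (int) Math.min(file.length, pos + blockSize);
            byte[] window = Arrays.copyOfRange(file, (int) windowStart, windowEnd);
            for (int i = 0; i + pattern.length <= window.length; i++) {
                if (matchesAt(window, pattern, i)) {
                    hits.add(windowStart + i);
                }
            }
        }
        return hits;
    }

    // True if `pattern` occurs in `data` starting exactly at `start`.
    private static boolean matchesAt(byte[] data, byte[] pattern, int start) {
        for (int j = 0; j < pattern.length; j++) {
            if (data[start + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }
}
```

Because the overlap is one byte shorter than the pattern, a match can never lie entirely inside the overlapped region, so each occurrence is reported exactly once.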
private static int indexOf(final byte[] data, final byte[] pattern, final int index) {
  if (data == null || data.length == 0 || pattern == null || pattern.length == 0 ||
      index > data.length || index < 0) {
    return -1;
  }

  int j = 0;
  for (int i = index; i < data.length; i++) {
    if (pattern[j] == data[i]) {
      j++;
    } else {
      j = 0;
    }

    if (j == pattern.length) {
      return i - pattern.length + 1;
    }
  }

  return -1;
}
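This matching algorithm is incorrect because i does not backtrack after a partial match fails: the byte that caused the mismatch is never re-checked against the start of the pattern. As a simple example, with data = OOORC, pattern = ORC, and index = 1, the algorithm returns -1 even though the pattern occurs at position 2.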
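A minimal sketch of a search that does backtrack is shown here, assuming a plain brute-force comparison is acceptable for a pattern as short as the magic string. The class name MagicSearch is hypothetical, and this is not necessarily the fix that was applied upstream.

```java
// Hypothetical helper, for illustration only; not part of ORC.
final class MagicSearch {
    private MagicSearch() {}

    /**
     * Returns the first position >= from at which `pattern` occurs in `data`,
     * or -1 if there is no such position.
     */
    static int indexOf(final byte[] data, final byte[] pattern, final int from) {
        if (data == null || pattern == null || pattern.length == 0
                || from < 0 || from > data.length - pattern.length) {
            return -1;
        }
        // Try every candidate start position; comparing from the start of the
        // pattern at each one is what provides the missing backtracking.
        for (int i = from; i <= data.length - pattern.length; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }
}
```

With data = OOORC, pattern = ORC, and from = 1, this version returns 2: the candidate at position 1 fails on its second byte, and the search then restarts the comparison at position 2 instead of skipping it.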