Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
None
-
None
-
Reviewed
Description
After several attempts, I can now reproduce a critical problem when reading a compressed WAL file in replication.
The problem is in how we construct the LRUDictionary when resetting the WALEntryStream. In the current design, we do not reconstruct the LRUDictionary on reset, but when reading again we call addEntry directly to add the 'new' word to the dict, which messes up the dict and causes data corruption.
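To illustrate how a repeated addEntry can desynchronize the reader's dict from the writer's, here is a minimal sketch. ToyDictionary and DictDriftDemo are hypothetical names, and the toy dictionary is a plain append-only list with no LRU eviction, so it is not the HBase LRUDictionary; it only shows the index drift.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a compression dictionary (assumption: append-only, no eviction). */
final class ToyDictionary {
  private final List<String> words = new ArrayList<>();

  /** Appends unconditionally and returns the new index, like a blind addEntry. */
  int addEntry(String word) {
    words.add(word);
    return words.size() - 1;
  }

  /** Returns the index of the word, or -1 if it is absent. */
  int findEntry(String word) {
    return words.indexOf(word);
  }

  String getEntry(int index) {
    return words.get(index);
  }
}

final class DictDriftDemo {
  /** Returns the word the reader resolves when the writer refers to "cf2" by index. */
  static String demo() {
    ToyDictionary writer = new ToyDictionary();
    ToyDictionary reader = new ToyDictionary();

    // The writer sees "cf1" and then "cf2" for the first time: indexes 0 and 1.
    writer.addEntry("cf1");
    writer.addEntry("cf2");

    // The reader decodes "cf1", then fails partway through the entry and resets
    // its stream position, but keeps its dictionary.
    reader.addEntry("cf1");

    // After the reset it re-reads the same bytes, which still say "new word: cf1",
    // so "cf1" is added a second time and every later index shifts by one.
    reader.addEntry("cf1");
    reader.addEntry("cf2");

    // The writer later refers to "cf2" by its index.
    int writerIndex = writer.findEntry("cf2");
    return reader.getEntry(writerIndex);
  }

  public static void main(String[] args) {
    System.out.println("writer meant cf2, reader decoded " + demo());
  }
}
```

In this sketch the reader resolves the writer's index for "cf2" to "cf1", which is the kind of silent mismatch that would show up as corrupted cells.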
I've implemented a UT that simulates reading a partial WAL entry in replication. With the current code base, after resetting and reading again, we get stuck there forever.
The first proposed fix was to always use findEntry when constructing the dict while reading, so that we do not mess things up.
It turns out that the above solution does not work.
Another possible fix is to always reconstruct the dict after resetting: we clear the dict and build it again. This is less efficient, because we need to read from the beginning of the file up to the position we want to seek to instead of seeking to it directly, especially when tailing a WAL file that is currently being written.
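The clear-and-replay approach can be sketched as follows. This is only an illustration under simplifying assumptions: ReplayingReader is a hypothetical class, the "file" is a list of tokens, and a token is either a literal word (first occurrence) or a "#<index>" back-reference into the dict. It is not the WALEntryStream implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader that rebuilds its dictionary by replaying from the start. */
final class ReplayingReader {
  private final List<String> tokens;
  private final List<String> dict = new ArrayList<>();
  private int position = 0;

  ReplayingReader(List<String> tokens) {
    this.tokens = tokens;
  }

  /** Decodes the next token, adding literal words to the dictionary. */
  String next() {
    String token = tokens.get(position++);
    if (token.startsWith("#")) {
      // Back-reference: look the word up by index.
      return dict.get(Integer.parseInt(token.substring(1)));
    }
    dict.add(token);
    return token;
  }

  /**
   * Clears the dictionary and replays every token before target, so the
   * dictionary matches what the writer had at that point. The cost grows
   * with target, which is the inefficiency noted above.
   */
  void resetTo(int target) {
    dict.clear();
    position = 0;
    while (position < target) {
      next();
    }
  }

  int dictSize() {
    return dict.size();
  }
}
```

Because the dictionary is rebuilt from scratch, a reset can never leave duplicate words behind, at the price of re-decoding everything before the target position.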
Note that the UT can only reproduce the problem on the local file system. On HDFS, the available method is implemented, so if there is not enough data we throw EOFException before parsing cells with the compression decoder, and we never add a duplicated word to the dict. In the real world, however, even when there is enough data to read, we could hit an IOException while reading, which leads to the same problem described above.
While fixing this, I also found another problem: in TagCompressionContext and CompressionContext we use the result of InputStream.read incorrectly. We cast it to byte and test whether it is -1 to determine whether the field is in the dict. The return value of InputStream.read is an int, and it is -1 when the stream reaches EOF, but here we treat that as 'not in dict'. We should throw EOFException instead.
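The difference between the two checks can be shown with a small sketch. StatusByteReader and its NOT_IN_DICTIONARY constant are hypothetical names used for illustration, not the actual code in TagCompressionContext or CompressionContext; only InputStream.read and EOFException are real JDK APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class StatusByteReader {
  /** Hypothetical sentinel byte meaning "the field is not in the dictionary". */
  static final byte NOT_IN_DICTIONARY = -1;

  /** Buggy pattern: after the cast, EOF (-1) looks the same as the byte 0xFF. */
  static boolean isNotInDictBuggy(InputStream in) throws IOException {
    byte status = (byte) in.read();
    return status == NOT_IN_DICTIONARY;
  }

  /** Safer pattern: test the int for EOF first, then narrow it to a byte. */
  static byte readStatus(InputStream in) throws IOException {
    int b = in.read();
    if (b == -1) {
      throw new EOFException("stream ended while reading the dictionary status byte");
    }
    return (byte) b;
  }

  public static void main(String[] args) throws IOException {
    InputStream empty = new ByteArrayInputStream(new byte[0]);
    // The buggy check reports "not in dict" for an empty stream.
    System.out.println("buggy check on empty stream: " + isNotInDictBuggy(empty));
  }
}
```

With readStatus, a truncated stream surfaces as EOFException, so the caller can reset and retry instead of treating the missing byte as a new dict entry.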
I'm not sure whether fixing this will also fix HBASE-27073, but let's give it a try first.
Issue Links
- is related to: HBASE-27073 TestReplicationValueCompressedWAL.testMultiplePuts is flaky (Resolved)
- relates to: HBASE-27632 Refactor WAL.Reader implementation so we can better support WAL splitting and replication (Resolved)
- links to