Details
- Type: Improvement
- Priority: Major
- Status: Closed
- Resolution: Fixed
Description
The original DFSOutputStream is very powerful and aims to serve all purposes, but in fact we do not need most of its features if we only want to write a WAL. For example, we do not need pipeline recovery, since we can simply close the old logger and open a new one. Likewise, we do not need to write multiple blocks, since we can also open a new logger when the old file grows too large.
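The two simplifications above can be sketched as a single rolling writer. This is a hedged illustration only: `RollingWalWriter`, its method names, and the use of a local `FileOutputStream` as a stand-in for the real DFSOutputStream are all hypothetical, not actual HBase or HDFS APIs.

```java
import java.io.*;
import java.nio.file.*;

// Hypothetical sketch: a WAL writer that replaces pipeline recovery and
// multi-block support with "close the old file, open a new one".
public class RollingWalWriter implements Closeable {
    private final Path dir;
    private final long maxFileSize;
    private OutputStream out;
    private long written;
    private int fileIndex;

    public RollingWalWriter(Path dir, long maxFileSize) throws IOException {
        this.dir = dir;
        this.maxFileSize = maxFileSize;
        roll(); // open the first log file
    }

    // Instead of writing multiple blocks, roll to a fresh file when the
    // current one grows too large.
    public void append(byte[] record) throws IOException {
        if (written + record.length > maxFileSize) {
            roll();
        }
        out.write(record);
        written += record.length;
    }

    // Instead of pipeline recovery, abandon the current file and open a
    // new one; the caller replays any unacknowledged edits.
    public void recover() throws IOException {
        roll();
    }

    private void roll() throws IOException {
        if (out != null) {
            out.close();
        }
        out = Files.newOutputStream(dir.resolve("wal-" + (fileIndex++) + ".log"));
        written = 0;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }

    public int fileCount() {
        return fileIndex;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wal");
        try (RollingWalWriter w = new RollingWalWriter(dir, 25)) {
            for (int i = 0; i < 5; i++) {
                w.append("0123456789".getBytes()); // 10 bytes per record
            }
            System.out.println("files=" + w.fileCount()); // files=3
        }
    }
}
```

The point of the sketch is that every failure path collapses into `roll()`, which is why the corner-case surface is so much smaller than in the general-purpose stream.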
Most importantly, it is hard to handle all the corner cases needed to avoid data loss or data inconsistency (such as HBASE-14004) when using the original DFSOutputStream, due to its complicated logic. That complicated logic also forces us to use magical tricks to increase performance. For example, we need to use multiple threads to call hflush when logging, and we currently use 5 threads. But why 5, not 10 or 100?
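The multi-threaded hflush workaround mentioned above can be sketched roughly as follows. This is an assumption-laden illustration: `ParallelSyncer`, the `Syncable` interface, and the thread count are stand-ins chosen to mirror the text, not the actual HBase WAL code.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch of the trick the description criticizes: a fixed pool of
// sync threads (5 here, matching the text) issues hflush calls so that
// appenders do not each block on a slow flush.
public class ParallelSyncer {
    interface Syncable {
        void hflush() throws Exception;
    }

    private final ExecutorService pool;
    private final Syncable stream;

    ParallelSyncer(Syncable stream, int syncThreads) {
        this.stream = stream;
        this.pool = Executors.newFixedThreadPool(syncThreads);
    }

    // Each caller hands its flush to the pool; why 5 threads rather than
    // 10 or 100 is exactly the tuning question the text raises.
    Future<?> requestSync() {
        return pool.submit(() -> {
            stream.hflush();
            return null;
        });
    }

    void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger flushes = new AtomicInteger();
        ParallelSyncer syncer = new ParallelSyncer(flushes::incrementAndGet, 5);
        Future<?>[] pending = new Future<?>[20];
        for (int i = 0; i < pending.length; i++) {
            pending[i] = syncer.requestSync();
        }
        for (Future<?> f : pending) {
            f.get(); // wait for each flush to complete
        }
        System.out.println("flushes=" + flushes.get()); // flushes=20
        syncer.shutdown();
    }
}
```

A purpose-built output stream would make this pool, and its arbitrary size, unnecessary.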
So here I propose that we implement our own DFSOutputStream for writing the WAL, for both correctness and performance.
Issue Links
- is related to
  - HBASE-14004 [Replication] Inconsistency between Memstore and WAL may result in data in remote cluster that is not in the origin (Closed)
  - HDFS-916 Rewrite DFSOutputStream to use a single thread with NIO (Open)
- relates to
  - HDFS-223 Asynchronous IO Handling in Hadoop and HDFS (Open)