I think the proposed approach does not solve the problem.
It improves in a sense current state but does not eliminate the problem completely.
The idea of promoting blocks from "tmp" to the real storage is necessary to support sync(). We are trying to cover the following sequence of events:
- a client starts writing to a block and says sync();
- the data-node does the write and the sync() and then fails;
- the data-node restarts and the sync-ed block should appear as a valid block on this node, because the semantic of sync demands the sync-ed data to survive failures of clients, data-nodes and the name-node.
It seams natural to promote blocks from tmp to the real storage during startup, but this caused problems because together with the sync-ed blocks the data-node also promotes all other potentially incomplete blocks from the tmp.
One source of incomplete blocks in tmp is the internal block replication initiated by the name-node but not completed on the data-node due to a failure.
(I) The proposal is to divide tmp into 2 directories
"blocks_being_replicated" and "blocks_being_written". This excludes partially replicated
blocks from being promoted to the real storage during data-node restarts.
But this does not cover another source of incomplete blocks, which is the regular block writing by a client. It also can fail, remain incomplete, and will be promoted to the main storage during data-node restart.
The question is why do we not care about these blocks?
Why they cannot cause the same problems as the ones that are being replicated?
Suppose I do not use sync-s. The incomplete blocks will still be promoted to the real storage, will be reported to the name-node and the name-node will have to process them and finally remove most of them. Isn't it a degradation in performance.
(II) I'd rather consider dividing into two directories one having "transient" and another having "persistent" blocks. The transient blocks should be removed during startup, and persistent should be promoted into the real storage. Once sync-ed a block should be moved into the persistent directory.
(III) Another variation is to promote sync-ed (persistent) blocks directly to the main storage. The problem here is to deal with blockReceived and blockReports, which I feel can be solved.
(IV) Yet another approach is to keep the storage structure unchanged and write each sync-ed (finalized) portions of the block into main storage by appending that portion to the real block file. This will require an extra disk io for merging files (instead of renaming) but will let us discard everything that is in tmp as we do now.
What we are trying to do with the directories is to assign properties to the block replicas and make them survive crashes. Previously there were just 2 properties final and transient. So we had 2 directories: tmp and main. Now we need a third property, which says the block is persistent but not finalized yet. So we tend to introduce yet another directory. I think this is too straightforward, because there is plenty of other approaches to implement boolean properties on entities.
Why we are not considering them?
Also worth noting this issue directly affects appends, because a replica being appended is first copied to the tmp directory and then treated as described above.