I've been thinking about this over the weekend and this morning. My current thinking is that the safest bet is the following approach:
When a DN reports an RBW replica for a block the NameNode already considers finalized:
- Case 1) if the reported replica has a stale (too-low) generation stamp, mark it corrupt.
- Case 2) if the reported replica has the correct generation stamp, ignore the report (don't add it to the block's locations and don't mark it corrupt)
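The two cases above boil down to a single genstamp comparison. Here is a minimal sketch of that decision; the class and method names are illustrative, not the actual HDFS code path:

```java
// Illustrative sketch of the proposed NameNode-side decision for an RBW
// report received for an already-finalized block. All names here are
// hypothetical; real block-report processing involves much more state.
public class RbwReportPolicy {
    public enum Action { MARK_CORRUPT, IGNORE }

    public static Action decide(long reportedGenStamp, long storedGenStamp) {
        if (reportedGenStamp < storedGenStamp) {
            // Case 1: stale generation stamp -> mark the replica corrupt.
            return Action.MARK_CORRUPT;
        }
        // Case 2: genstamp matches -> assume a delayed block report; do not
        // add the replica to block locations, and do not mark it corrupt.
        return Action.IGNORE;
    }
}
```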
Here's the reasoning:
Case 1: one of the DNs is reporting a stale generation stamp.
This means that the client must have either appended to the block or undergone pipeline recovery. There are two possible reasons the DN is reporting an old genstamp:
- 1a) it is a "delayed block report" as described in this JIRA. We will later see a correct/up-to-date BR for the same block.
Here it is OK to mark the block as corrupt: when we send the "invalidate" message to the DN, we target the old genstamp specifically. So when the DN receives the invalidation, it will not delete the new (correct) replica, but rather just ignore the request.
- 1b) the client lost its connection to this DN and did a pipeline recovery before closing the file. In this case we will never see a correct/up-to-date BR.
Here it's also OK to mark it as corrupt, because the replica really is corrupt (i.e., it didn't participate in the pipeline recovery).
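The reason case 1a is safe can be reduced to a genstamp-targeted invalidation check. This is a hedged illustration: `shouldDelete` is a hypothetical name, and a real DataNode does more than compare genstamps before deleting anything:

```java
// Hedged illustration of why invalidating by (block, genstamp) is safe:
// the DN only deletes a replica when the invalidation's genstamp matches
// the replica it actually holds, so an invalidation aimed at the stale
// genstamp cannot remove the newer, correct replica.
public class InvalidateCheck {
    public static boolean shouldDelete(long invalidateGenStamp, long replicaGenStamp) {
        return invalidateGenStamp == replicaGenStamp;
    }
}
```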
Case 2: correct generation stamp, but an RBW report for a FINALIZED block
As far as I can tell, the only way we can get here is via the "delayed report" scenario described in this JIRA. The reasoning is as follows:
- in order for the client to call completeBlock(), it must have gotten a successful pipeline close from all of the DNs in the current pipeline
- if the pipeline nodes had changed, the block would have gotten a different generation stamp. So all of the nodes holding a replica with the correct genstamp were in the pipeline when it was closed
- thus all of the nodes with the correct genstamp have the correct length and state, and any report saying otherwise is due to a message delay.
The only other possibility is something like a machine crash that doesn't replay the ext3 journal, causing some blocks to roll back to a prior state. In that case, upon restart, the DN would convert the replica to RWR (ReplicaWaitingToBeRecovered), and we could use the original logic of marking it corrupt.
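The restart behavior above can be sketched as a simple state mapping. This is only an illustration of the described rule (`stateAfterRestart` is a made-up name, not the actual DN recovery code):

```java
// Illustrative sketch: a replica that was RBW when the DataNode went down
// is reloaded as RWR (ReplicaWaitingToBeRecovered) after restart, so the
// original corrupt-marking logic applies to it rather than the new rule.
public class RestartStateMap {
    public enum ReplicaState { FINALIZED, RBW, RWR }

    public static ReplicaState stateAfterRestart(ReplicaState onDiskState) {
        // An interrupted write cannot continue after a restart.
        return onDiskState == ReplicaState.RBW ? ReplicaState.RWR : onDiskState;
    }
}
```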
I think the above solution is safer and simpler than any other solutions I could come up with.