Currently, we allocate block IDs using a random number generator, checking that the generated ID is not already in use by a live block. Of course that doesn't preclude reusing the ID of a block which was previously used and then deleted.
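A minimal sketch of this allocation scheme (names and types here are illustrative, not the actual block manager API): draw random IDs until one is found that no *live* block holds. Since the check only covers live blocks, the ID of a previously deleted block can be handed out again.

```cpp
#include <cstdint>
#include <random>
#include <unordered_set>

// Hypothetical sketch: allocate a block ID by drawing random 64-bit
// values until one not currently in use is found. 'live_blocks' only
// contains blocks that exist *right now*, so an ID belonging to a
// block that was created and later deleted can be returned again.
uint64_t AllocateBlockId(std::mt19937_64* rng,
                         const std::unordered_set<uint64_t>& live_blocks) {
  uint64_t id;
  do {
    id = (*rng)();
  } while (live_blocks.count(id) > 0);
  return id;
}
```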
This interacts quite poorly with the "orphaned block" processing we have in tablet metadata. As a refresher, orphaned-block handling works as follows:
- during a compaction, we have the output blocks (newly written data) and the input blocks (data which has been compacted and is no longer relevant)
- when the compaction finishes, we write a new TabletMetadata which swaps in the new blocks and removes the old blocks
- after that, we delete the old (input) blocks. Of course we can't delete them until after we've flushed the metadata: if we crashed before the flush, the metadata would have no record of the new block IDs, and with the input blocks already deleted, the data would be lost.
- so, we defer the deletion of the input blocks until after the metadata has been flushed
- this leaves open the opposite hole: since the deletion of the old blocks is deferred, a crash just after flushing the metadata would leak those blocks and their disk space, which is no good either.
- so, when we flush metadata, we include the 'old blocks' in an 'orphan_blocks' array. When loading metadata, we try to 'roll forward' the deletion so that the above-mentioned leak isn't permanent.
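The flush and roll-forward steps above can be sketched as follows. This is a hypothetical model, not the actual metadata code: `TabletMetadata`, `CommitCompaction`, `RollForwardOrphans`, and the `block_store` set are all illustrative names, and durability is reduced to a comment.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical metadata sketch: the blocks a tablet owns, plus the
// inputs of the last compaction, recorded so their deletion can be
// rolled forward after a crash.
struct TabletMetadata {
  std::vector<uint64_t> block_ids;      // blocks the tablet currently owns
  std::vector<uint64_t> orphan_blocks;  // compaction inputs awaiting deletion
};

// On compaction completion: swap the outputs in, remove the inputs,
// and record the inputs as orphans. Actually deleting the inputs is
// deferred until after this metadata has been flushed durably.
void CommitCompaction(TabletMetadata* meta,
                      const std::vector<uint64_t>& inputs,
                      const std::vector<uint64_t>& outputs) {
  for (uint64_t in : inputs) {
    meta->block_ids.erase(
        std::remove(meta->block_ids.begin(), meta->block_ids.end(), in),
        meta->block_ids.end());
  }
  meta->block_ids.insert(meta->block_ids.end(), outputs.begin(), outputs.end());
  meta->orphan_blocks = inputs;
  // ... flush 'meta' durably here, then delete the input blocks ...
}

// On metadata load: roll forward any deletions that a crash may have
// interrupted, so leaked orphans don't consume disk space forever.
void RollForwardOrphans(TabletMetadata* meta,
                        std::unordered_set<uint64_t>* block_store) {
  for (uint64_t orphan : meta->orphan_blocks) {
    block_store->erase(orphan);  // delete the block if it still exists
  }
  meta->orphan_blocks.clear();
}
```

Note that in this sketch the orphan list survives in the flushed metadata until a later flush clears it, which is exactly what makes the roll-forward replayable after a restart.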
The "roll forward" behavior mentioned above is what seems to be eating blocks. We can now have the following bad interleaving:
- a compaction in tablet A succeeds and lists block ID "X" as orphaned
- a different tablet B re-uses block ID "X"
- we restart the TS, or trigger a remote bootstrap (which also "cleans up" orphan blocks)
- the orphan cleanup deletes block "X" out from under tablet B
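The interleaving above can be reproduced in a few lines. This is a toy model under stated assumptions: a single shared set stands in for the block store, block "X" is an arbitrary ID, and `BlockSurvivesCleanup` is an illustrative name, not real code.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Toy reproduction of the bad interleaving: tablet A orphans block X,
// its deferred deletion runs, tablet B re-uses ID X, and then a restart
// replays A's still-persisted orphan list. Returns whether tablet B's
// block still exists afterwards.
bool BlockSurvivesCleanup() {
  const uint64_t kBlockX = 100;  // arbitrary stand-in for block ID "X"
  std::unordered_set<uint64_t> block_store = {kBlockX};

  // 1. Tablet A's compaction succeeds; X was an input, so A's flushed
  //    metadata lists it as orphaned, and the deferred deletion runs.
  std::vector<uint64_t> tablet_a_orphans = {kBlockX};
  block_store.erase(kBlockX);

  // 2. Tablet B's random allocator sees ID 100 as unused and re-uses it.
  block_store.insert(kBlockX);

  // 3. TS restart (or remote bootstrap) replays A's orphan list...
  for (uint64_t orphan : tablet_a_orphans) {
    block_store.erase(orphan);
  }

  // 4. ...deleting block X out from under tablet B.
  return block_store.count(kBlockX) > 0;
}
```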