Aaron, could you please describe the solution along with the patch.
Sure. In the original patch, the fix was simply to check whether only a single rack was configured, and to not count a block as being on too few racks if there was in fact only one rack.
However, here's an updated patch which should address the test failures.
There were two reasons for the test failures:
1. There were two tests which asserted that blocks were under-replicated when only one rack was configured, which is no longer the case.
2. There was one test which had blocks for a single file spread across two racks, and asserted that after all the nodes on one of the racks died, the blocks were then considered mis-replicated.
Fixing #1 above just consisted of changing the tests to not assert that behavior.
Fixing #2 was a little more involved. I added a boolean to DatanodeManager to track whether or not the cluster had ever consisted of multiple racks. This is updated whenever a node is added to the cluster. When a cluster goes from being single-rack to multi-rack, we call BlockManager#processMisReplicatedBlocks to make sure that we re-examine blocks which were not considered under-replicated when there was only a single rack, but which should now be spread across separate racks.
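To illustrate the idea, here's a minimal sketch of that tracking logic. This is not the actual HDFS code; `RackTracker`, `nodeAdded`, and `onBecameMultiRack` are illustrative names, with `onBecameMultiRack` standing in for the call into BlockManager#processMisReplicatedBlocks:

```java
import java.util.HashSet;
import java.util.Set;

// Simplified sketch of the flag added to DatanodeManager: remember whether
// the cluster has ever spanned more than one rack.
class RackTracker {
    private final Set<String> racks = new HashSet<>();
    private boolean hasEverBeenMultiRack = false;

    // Called whenever a datanode joins the cluster.
    void nodeAdded(String rack) {
        racks.add(rack);
        if (!hasEverBeenMultiRack && racks.size() > 1) {
            hasEverBeenMultiRack = true;
            // In the real patch, this is the point where
            // BlockManager#processMisReplicatedBlocks would be invoked, so
            // blocks written while the cluster was single-rack get re-checked.
            onBecameMultiRack();
        }
    }

    void onBecameMultiRack() {
        System.out.println("re-scanning blocks for rack placement");
    }

    boolean hasEverBeenMultiRack() {
        return hasEverBeenMultiRack;
    }
}

public class Main {
    public static void main(String[] args) {
        RackTracker t = new RackTracker();
        t.nodeAdded("/rack1");
        System.out.println(t.hasEverBeenMultiRack()); // still single-rack
        t.nodeAdded("/rack1");
        System.out.println(t.hasEverBeenMultiRack()); // same rack again
        t.nodeAdded("/rack2");
        System.out.println(t.hasEverBeenMultiRack()); // transition detected
    }
}
```

Note the flag is one-way: once the cluster has been multi-rack, it stays multi-rack for the purposes of the check, even if nodes later die.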
Then, the check for whether or not a block is on enough racks becomes "is the desired replication only 1, OR has there only ever been one rack configured in this cluster."