Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.6.0, 3.0.0
Our 2.6.0 environment is a 3-node cluster running cdh5.15.0.
Our 3.0.0 environment is a 4-node cluster running cdh6.3.0.
Description
We have been seeing large numbers of scheduled blocks that never get cleared out. This prevents blocks from being placed even when there is plenty of space on the system.
[Image: block growth over 24 hours on one of our systems running 2.6.0]
[Image: block growth over 24 hours on one of our systems running 3.0.0]
https://issues.apache.org/jira/browse/HDFS-1172 appears to be the main issue we were hitting on 2.6.0, so the growth has slowed since upgrading to 3.0.0. However, scheduled blocks still grow steadily over time, and we still need to restart the namenode occasionally to reset the count. I have not determined what is causing the leaked blocks in 3.0.0.
Looking into the issue, I discovered that the intention is for scheduled blocks to slowly go back down to 0 after errors cause blocks to be leaked.
    /** Increment the number of blocks scheduled. */
    void incrementBlocksScheduled(StorageType t) {
      currApproxBlocksScheduled.add(t, 1);
    }

    /** Decrement the number of blocks scheduled. */
    void decrementBlocksScheduled(StorageType t) {
      if (prevApproxBlocksScheduled.get(t) > 0) {
        prevApproxBlocksScheduled.subtract(t, 1);
      } else if (currApproxBlocksScheduled.get(t) > 0) {
        currApproxBlocksScheduled.subtract(t, 1);
      }
      // its ok if both counters are zero.
    }

    /** Adjusts curr and prev number of blocks scheduled every few minutes. */
    private void rollBlocksScheduled(long now) {
      if (now - lastBlocksScheduledRollTime > BLOCKS_SCHEDULED_ROLL_INTERVAL) {
        prevApproxBlocksScheduled.set(currApproxBlocksScheduled);
        currApproxBlocksScheduled.reset();
        lastBlocksScheduledRollTime = now;
      }
    }
However, this code does not do what is intended if the system has a constant flow of written blocks. Once leaked blocks make it into prevApproxBlocksScheduled, the next scheduled block increments currApproxBlocksScheduled, and when it completes it decrements prevApproxBlocksScheduled instead. The leaked count is effectively moved into currApproxBlocksScheduled, so it survives the next roll rather than being dropped from the approximate count. For the error to be corrected, we would have to write no data for the roll period of 10 minutes. The number of blocks we write per 10 minutes is quite high, which allows the error in the approximate counts to grow to very large numbers.
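To illustrate the behavior described above, here is a minimal sketch of the two counters for a single storage type. It is not the HDFS code: the class ScheduledBlocksSim and its simulate method are hypothetical, the time-based roll is replaced by an explicit roll() call, and every write is assumed to complete right after it is scheduled, which is a simplification of real, overlapping writes.

```java
/**
 * Simplified, single-storage-type model of the prev/curr counters quoted above.
 * Hypothetical class for illustration only; the roll is triggered explicitly
 * instead of by elapsed time.
 */
final class ScheduledBlocksSim {
    private int prev; // stands in for prevApproxBlocksScheduled
    private int curr; // stands in for currApproxBlocksScheduled

    void incrementBlocksScheduled() {
        curr++;
    }

    void decrementBlocksScheduled() {
        if (prev > 0) {
            prev--;
        } else if (curr > 0) {
            curr--;
        }
        // ok if both counters are zero
    }

    /** Mirrors the body of rollBlocksScheduled once the interval has elapsed. */
    void roll() {
        prev = curr;
        curr = 0;
    }

    int total() {
        return prev + curr;
    }

    /**
     * Leaks `leaked` blocks (scheduled, never completed), then runs `periods`
     * roll periods with `writesPerPeriod` schedule-then-complete writes each.
     * Returns the scheduled count that remains after the final roll.
     */
    static int simulate(int leaked, int writesPerPeriod, int periods) {
        ScheduledBlocksSim s = new ScheduledBlocksSim();
        for (int i = 0; i < leaked; i++) {
            s.incrementBlocksScheduled();
        }
        for (int p = 0; p < periods; p++) {
            for (int w = 0; w < writesPerPeriod; w++) {
                s.incrementBlocksScheduled(); // new block scheduled
                s.decrementBlocksScheduled(); // same block completes
            }
            s.roll();
        }
        return s.total();
    }

    public static void main(String[] args) {
        System.out.println(simulate(5, 100, 10)); // steady writes: prints 5
        System.out.println(simulate(5, 0, 2));    // two idle periods: prints 0
    }
}
```

In this model the 5 leaked blocks are still counted after 10 roll periods of steady writes, while two idle roll periods clear them. That matches what we observe: the count only recovers when nothing is written for a full roll interval.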
The comments in the ticket for the original implementation, https://issues.apache.org/jira/browse/HADOOP-3707, suggest this issue was known. However, it's not clear to me whether its severity was understood at the time.
> So if there are some blocks that are not reported back by the datanode, they will eventually get adjusted (usually 10 min; bit longer if datanode is continuously receiving blocks).
The comments suggest it will eventually get cleared out, but in our case, it never gets cleared out.