Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
None
None
Reviewed
Description
The unit test TestDataNodeMetrics fails intermittently. The failure output looks like this:
Results :
Failed tests:
  TestDataNodeVolumeFailureToleration.testVolumeAndTolerableConfiguration:195->testVolumeConfig:232 expected:<false> but was:<true>
Tests in error:
  TestOpenFilesWithSnapshot.testWithCheckpoint:94 » IO Timed out waiting for Min...
  TestDataNodeMetrics.testDataNodeTimeSpend:279 » Timeout Timed out waiting for ...
  TestHFlush.testHFlushInterrupted » IO The stream is closed
The timeout occurs at line 279 of TestDataNodeMetrics. Looking into the code, I found the real cause: the TotalWriteTime metric frequently stays at 0 after each iteration of creating a file, so the test keeps retrying until it times out.
I debugged the test locally. The most likely reason the TotalWriteTime metric always reads 0 is that the test uses SimulatedFSDataset to measure time spent. In SimulatedFSDataset, writes go through the inner class method SimulatedOutputStream#write, and that method only updates the length and throws the data away.
@Override
public void write(byte[] b, int off, int len) throws IOException {
  length += len;
}
So the write operation takes hardly any time. The test should therefore create the file through a real dataset instead of the simulated one. In my local runs, the test passed on the first attempt once I removed the simulated dataset, whereas with the old approach it retried many times waiting for a non-zero write time.
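To illustrate why a discarding write can leave a time-based metric at 0, here is a minimal standalone sketch. It is not the Hadoop code: DiscardingWriteDemo, DiscardingOutputStream, and timeWritesMillis are hypothetical names made up for this example, and it assumes the metric is accumulated in whole milliseconds, so any write that finishes in under 1 ms contributes nothing.

```java
import java.io.OutputStream;

class DiscardingWriteDemo {

    /**
     * Hypothetical stand-in for a simulated stream: it only tracks how many
     * bytes were "written" and discards the data itself.
     */
    static class DiscardingOutputStream extends OutputStream {
        long length = 0;

        @Override
        public void write(int b) {
            length++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            length += len;
        }
    }

    /**
     * Writes `chunks` buffers of `chunkSize` bytes and returns the elapsed
     * time truncated to whole milliseconds (the assumed metric granularity).
     */
    static long timeWritesMillis(DiscardingOutputStream out, int chunks, int chunkSize) {
        byte[] buf = new byte[chunkSize];
        long start = System.nanoTime();
        for (int i = 0; i < chunks; i++) {
            out.write(buf, 0, chunkSize);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        DiscardingOutputStream out = new DiscardingOutputStream();
        long elapsedMs = timeWritesMillis(out, 4, 1024);
        System.out.println("bytes counted: " + out.length);
        System.out.println("elapsed ms:    " + elapsedMs);
    }
}
```

On typical hardware the elapsed value printed here is often 0, because incrementing a counter takes far less than a millisecond. A test that waits for the accumulated write time to become positive can then spin until its timeout, which matches the behaviour described above.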