  1. Hadoop HDFS
  2. HDFS-10275

TestDataNodeMetrics failing intermittently due to TotalWriteTime counted incorrectly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.3, 3.0.0-alpha1
    • Component/s: test
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The unit test TestDataNodeMetrics fails intermittently. The failure output looks like this:

      Results :
      
      Failed tests: 
        TestDataNodeVolumeFailureToleration.testVolumeAndTolerableConfiguration:195->testVolumeConfig:232 expected:<false> but was:<true>
      
      Tests in error: 
        TestOpenFilesWithSnapshot.testWithCheckpoint:94 ? IO Timed out waiting for Min...
        TestDataNodeMetrics.testDataNodeTimeSpend:279 ? Timeout Timed out waiting for ...
        TestHFlush.testHFlushInterrupted ? IO The stream is closed
      

      The timeout occurs at line 279 of TestDataNodeMetrics. Looking into the code, I found that the real cause is that the TotalWriteTime metric frequently stays at 0 in each iteration of creating a file, which leads to repeated retries until the timeout.
      I debugged the test locally. The most likely reason the TotalWriteTime metric is always 0 is that we use SimulatedFSDataset for the time-spend test. SimulatedFSDataset uses its inner class's method SimulatedOutputStream#write to count the write time, and that method just updates the length and throws the data away.

          @Override
          public void write(byte[] b,
                    int off,
                    int len) throws IOException  {
            // Only the byte count is recorded; the data itself is discarded.
            length += len;
          }
      

      So the write operation takes almost no time. We should therefore create the file through the real dataset instead of the simulated one. In my local testing, the test passes on the first attempt once the simulated dataset is removed, whereas with the old approach it retries many times waiting for a non-zero write time.
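
      To illustrate the effect, here is a minimal standalone sketch. It is not Hadoop code. DiscardingOutputStream is a hypothetical stand-in for SimulatedOutputStream, and timeWritesMillis assumes the write time is accumulated from millisecond timestamps taken around each write call. Under that assumption, a write that only bumps a counter usually finishes within the same millisecond, so the accumulated total tends to stay at 0.

```java
import java.io.OutputStream;

public class DiscardingWriteTiming {

  /** Hypothetical stand-in for SimulatedOutputStream: counts bytes, discards data. */
  static class DiscardingOutputStream extends OutputStream {
    long length = 0;

    @Override
    public void write(int b) {
      length++;
    }

    @Override
    public void write(byte[] b, int off, int len) {
      length += len;
    }
  }

  /** Millisecond clock derived from System.nanoTime() (assumed timing granularity). */
  static long nowMillis() {
    return System.nanoTime() / 1_000_000L;
  }

  /** Sums the per-call elapsed milliseconds over the given number of writes. */
  static long timeWritesMillis(DiscardingOutputStream out, byte[] buf, int calls) {
    long total = 0;
    for (int i = 0; i < calls; i++) {
      long begin = nowMillis();
      out.write(buf, 0, buf.length);
      total += nowMillis() - begin;
    }
    return total;
  }

  public static void main(String[] args) {
    DiscardingOutputStream out = new DiscardingOutputStream();
    long total = timeWritesMillis(out, new byte[64 * 1024], 100);
    System.out.println("bytes=" + out.length + " totalWriteMillis=" + total);
  }
}
```

      The byte count grows as expected, while the summed write time depends only on how often a millisecond boundary happens to fall inside a call. A test that waits for the write-time metric to become non-zero can therefore keep retrying for a long time.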

            People

            • Assignee:
              linyiqun Yiqun Lin
              Reporter:
              linyiqun Yiqun Lin
            • Votes:
              0
              Watchers:
              4