  1. Hadoop HDFS
  2. HDFS-15646

Track failing tests in HDFS

Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: hdfs
    • Labels: None

    Description

      Several unit tests have been failing consistently on Yetus for a long period of time.
      The list keeps growing, and it is driving the repository into an unstable state. Qbt reports more than 40 failing unit tests on average.

      Personally, over the last week, with every patch I submitted, I had to spend considerable time looking at the same stack traces to double-check whether the patch contributed to those failures.

      I found that the majority of those tests had been failing for quite some time, but no Jiras were filed.

      The main problem with these consistent failures is that they have side effects on the runtime of the other JUnit tests, because they consume resources such as memory and ports.

      StripedFile and EC tests in particular show up in the list of bad tests 100% of the time.
      I looked at those tests, and they certainly need some improvements (e.g., HDFS-15459). Is anyone interested in those test cases? Can we just turn them off?

      I would like to give a heads-up that we need more collaboration to enforce the stability of the code base.

      • For all developers: please file a Jira whenever you see a failing test, whether or not it is related to your patch. This gives other developers a heads-up about potential failures. Please do not stop at commenting on your patch that "this is unrelated to my work".
      • Volunteer to dedicate more time to fixing flaky tests.
      • Periodically make sure that the number of failing tests does not exceed a certain threshold. We have Qbt reports to monitor that, but there is no follow-up on their status.
      • We should consider aggressive strategies, such as blocking all merges until the code is brought back to stability.
      • We need a clear and well-defined process for addressing Yetus issues: configuration, investigating out-of-memory failures, slowness, etc.
      • Turn off the JUnit tests in modules that are not actively used by the community (e.g., EC and StripedFile).
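
      The periodic check on the number of failing tests could be automated. The following is only a minimal sketch: the function name, the input (a plain list of fully qualified test names taken from a Qbt report), and the limit of 40 are all assumptions made for illustration, not part of any existing Yetus or Qbt tooling.

```python
from typing import Iterable

# Hypothetical budget: this issue does not propose a specific number.
MAX_FAILING_TESTS = 40


def within_failure_budget(failing_tests: Iterable[str],
                          limit: int = MAX_FAILING_TESTS) -> bool:
    """Return True when the number of distinct failing tests is <= limit.

    Blank entries are ignored and duplicate test names are counted once.
    """
    distinct = {name.strip() for name in failing_tests if name.strip()}
    return len(distinct) <= limit


if __name__ == "__main__":
    # Illustrative input only; a real run would read names from a report.
    report = [
        "org.apache.hadoop.fs.azure.TestWasbFsck.testDelete",
        "org.apache.hadoop.fs.azure.TestWasbFsck.testDelete",
        "org.apache.hadoop.crypto.key.kms.server.TestKMS.testKMSProviderCaching",
    ]
    print("within budget" if within_failure_budget(report) else "over budget")
```

      A check like this could gate merges or simply trigger a reminder; which of the two is appropriate is a process decision that the sketch does not settle.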

       

      CC: aajisaka, elgoiri, kihwal, daryn, weichiu

      Do you guys have any thoughts on the current status of HDFS?

       

      The following is a quick list of failing JUnit tests from Qbt reports:

       

       Test Name | Duration | Age
       org.apache.hadoop.crypto.key.kms.server.TestKMS.testKMSProviderCaching | 1.5 sec | 1
       org.apache.hadoop.fs.azure.TestBlobMetadata.testFolderMetadata | 42 ms | 3
       org.apache.hadoop.fs.azure.TestBlobMetadata.testFirstContainerVersionMetadata | 46 ms | 3
       org.apache.hadoop.fs.azure.TestBlobMetadata.testPermissionMetadata | 27 ms | 3
       org.apache.hadoop.fs.azure.TestBlobMetadata.testOldPermissionMetadata | 19 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemConcurrency.testNoTempBlobsVisible | 0.95 sec | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemConcurrency.testLinkBlobs | 33 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testListStatusRootDir | 31 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testRenameDirectoryMoveToExistingDirectory | 0.25 sec | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testListStatus | 29 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testRenameDirectoryAsExistingDirectory | 36 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testRenameToDirWithSamePrefixAllowed | 23 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testLSRootDir | 19 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testDeleteRecursively | 31 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemFileNameCheck.testWasbFsck | 1 sec | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testChineseCharactersFolderRename | 1 sec | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRedoRenameFolderInFolderListingWithZeroByteRenameMetadata | 41 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRedoRenameFolderInFolderListing | 37 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testUriEncoding | 38 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testDeepFileCreation | 37 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testListDirectory | 29 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRedoRenameFolderRenameInProgress | 37 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRenameFolder | 34 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRenameImplicitFolder | 27 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRedoRenameFolder | 66 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testStoreDeleteFolder | 27 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRename | 40 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testListStatus | 36 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testRenameDirectoryAsEmptyDirectory | 0.26 sec | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testListStatusFilterWithSomeMatches | 23 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testRenameDirectoryAsNonExistentDirectory | 28 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testGlobStatusSomeMatchesInDirectories | 26 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testGlobStatusWithMultipleWildCardMatches | 27 ms | 3
       org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testDeleteRecursively | 22 ms | 3
       org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testImplicitFolderDeleted | 0.99 sec | 3
       org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testFileAndImplicitFolderSameName | 31 ms | 3
       org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testSetOwnerOnImplicitFolder | 26 ms | 3
       org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testFileInImplicitFolderDeleted | 30 ms | 3
       org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testImplicitFolderListed | 22 ms | 3
       org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testCreatingDeepFileCreatesExplicitFolder | 53 ms | 3
       org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testSetPermissionOnImplicitFolder | 22 ms | 3
       org.apache.hadoop.fs.azure.TestWasbFsck.testDelete | 1 sec | 3
       org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers | 1 min 30 sec | 17
       org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType
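
      To see which test classes dominate a list like the one above, the entries can be grouped by class. This is a rough sketch under stated assumptions: the helper name is hypothetical, and each input is assumed to be a bare `package.Class.method` identifier with the duration and age figures already stripped.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def failures_per_class(test_ids: Iterable[str]) -> List[Tuple[str, int]]:
    """Count failing test methods per fully qualified class name.

    Each id is split at its last dot into (class, method). Entries without
    both parts are skipped. The result is sorted by count, highest first.
    """
    counts: Counter = Counter()
    for test_id in test_ids:
        class_name, _, method = test_id.strip().rpartition(".")
        if class_name and method:
            counts[class_name] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        "org.apache.hadoop.fs.azure.TestBlobMetadata.testFolderMetadata",
        "org.apache.hadoop.fs.azure.TestBlobMetadata.testPermissionMetadata",
        "org.apache.hadoop.crypto.key.kms.server.TestKMS.testKMSProviderCaching",
    ]
    for class_name, count in failures_per_class(sample):
        print(f"{count:3d}  {class_name}")
```

      Grouping by class rather than by method makes it easier to spot whole suites, such as the Azure mocked tests, that fail together and may share one root cause.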

       

       

        Issue Links

          1. TestPersistBlocks#TestRestartDfsWithFlush flaky failure | Sub-task | Resolved | Unassigned
          2. Occasional failure in TestDFSClientRetries#testGetFileChecksum because the number of available xcievers is set too low | Sub-task | Resolved | Unassigned
          3. TestEditLogTailer is flaky | Sub-task | Resolved | Unassigned
          4. TestBlockTokenWithDFSStriped fails intermittently | Sub-task | Resolved | Ahmed Hussein
          5. TestDFSClientRetries#testGetFileChecksum fails intermittently | Sub-task | Resolved | Ahmed Hussein | Time Spent: 1.5h
          6. TestHAAppend#testMultipleAppendsDuringCatchupTailing is flaky | Sub-task | Resolved | Ahmed Hussein | Time Spent: 1h 20m
          7. TestReconstructStripedFile.testNNSendsErasureCodingTasks randomly cannot finish in 60s | Sub-task | Resolved | Sammi Chen
          8. TestFileCreation#testServerDefaultsWithMinimalCaching fails intermittently | Sub-task | Resolved | Ahmed Hussein | Time Spent: 2h 10m
          9. TestFsDatasetImpl fails intermittently | Sub-task | Resolved | Ahmed Hussein | Time Spent: 50m
          10. TestBPOfferService#testMissBlocksWhenReregister fails intermittently | Sub-task | Resolved | Ahmed Hussein | Time Spent: 1h 40m
          11. RBF: TestRouter#testNamenodeHeartBeatEnableDefault fails by BindException | Sub-task | Resolved | Akira Ajisaka | Time Spent: 2h
          12. EC: Fix checksum computation in case of native encoders | Sub-task | Resolved | Ayush Saxena | Time Spent: 4h 40m
          13. Flaky test TestSnapshotFileLength.testSnapshotfileLength | Sub-task | Resolved | Ahmed Hussein
          14. TestBPOfferService#testMissBlocksWhenReregister is flaky | Sub-task | Resolved | Unassigned
          15. Disable Broken Azure Junits | Sub-task | Resolved | Ahmed Hussein | Time Spent: 20m
          16. Set dfs.namenode.redundancy.considerLoad to false in MiniDFSCluster | Sub-task | Resolved | Ahmed Hussein | Time Spent: 3h 40m
          17. TestBPOfferService#testMissBlocksWhenReregister fails on trunk | Sub-task | Resolved | Masatake Iwasaki | Time Spent: 1h
          18. TestRouterRpcMultiDestination#testGetCachedDatanodeReport fails on trunk | Sub-task | Resolved | Masatake Iwasaki | Time Spent: 40m
          19. TestRouterRpcMultiDestination#testNamenodeMetrics fails on trunk | Sub-task | Resolved | Masatake Iwasaki
          20. Testcase TestBalancer#testBalancerWithPinnedBlocks always fails | Sub-task | Resolved | Unassigned
          21. TestReconstructStripedFile#testNNSendsErasureCodingTasks fails intermittently | Sub-task | Resolved | Hemanth Boyina
          22. TestDistributedFileSystem#testGetFileBlockStorageLocationsBatching fails intermittently | Sub-task | Resolved | Unassigned
          23. TestDFSOutputStream#testCloseTwice implementation is broken | Sub-task | Resolved | Ahmed Hussein
          24. TestURLConnectionFactory fails by NoClassDefFoundError in branch-3.3 and branch-3.2 | Sub-task | Resolved | Chao Sun | Time Spent: 50m
          25. TestUpgradeDomainBlockPlacementPolicy flaky | Sub-task | Resolved | Ahmed Hussein | Time Spent: 1h 50m
          26. TestFileChecksum should be parameterized | Sub-task | Resolved | Masatake Iwasaki | Time Spent: 1h 40m
          27. Fix intermittent falilure of TestDecommission#testAllocAndIBRWhileDecommission | Sub-task | Resolved | Masatake Iwasaki | Time Spent: 40m
          28. TestMultipleNNPortQOP#testMultipleNNPortOverwriteDownStream fails intermittently | Sub-task | Resolved | Toshihiko Uchida | Time Spent: 3h 40m
          29. TestBalancerWithMultipleNameNodes#testBalancingBlockpoolsWithBlockPoolPolicy fails on trunk | Sub-task | Resolved | Masatake Iwasaki | Time Spent: 1h
          30. Fix TestFsDatasetImpl.testReadLockCanBeDisabledByConfig | Sub-task | Resolved | Leon Gao | Time Spent: 1h 20m
          31. TestBalancer#testMaxIterationTime fails sporadically | Sub-task | Resolved | Toshihiko Uchida | Time Spent: 40m
          32. TestRouterRpcMultiDestination#testProxyGetTransactionID and testProxyVersionRequest are flaky | Sub-task | Resolved | Akira Ajisaka | Time Spent: 1h
          33. RBF: TestRouterFederationRename is flaky | Sub-task | Resolved | Unassigned
          34. TestBalancerRPCDelay. testBalancerRPCDelayQpsDefault fails intermittently | Sub-task | Resolved | Ahmed Hussein
          35. TestBalancerRPCDelay#testBalancerRPCDelayQpsDefault fails on Trunk | Sub-task | Resolved | Ahmed Hussein | Time Spent: 0.5h
          36. Some tests in TestBlockRecovery are consistently failing | Sub-task | Resolved | Viraj Jasani | Time Spent: 6h 20m
          37. TestBlockRecovery fails consistently on Branch-2.10 | Sub-task | Resolved | Ahmed Hussein | Time Spent: 40m
          38. TestDecommissioningStatus#testDecommissionStatus fails intermittently | Sub-task | Resolved | Ajay Kumar | Time Spent: 40m
          39. TestBootstrapAliasmap fails by BindException | Sub-task | Resolved | Akira Ajisaka | Time Spent: 50m
          40. TestEditLogTailer#testStandbyTriggersLogRollsWhenTailInProgressEdits is flaky | Sub-task | Resolved | Viraj Jasani | Time Spent: 10h 50m
          41. De-flake testDecommissionStatus | Sub-task | Resolved | Viraj Jasani | Time Spent: 2h 40m
          42. De-flake TestBlockScanner#testSkipRecentAccessFile | Sub-task | Resolved | Viraj Jasani | Time Spent: 2h 20m
          43. Flaky test TestFsDatasetImpl#testDnRestartWithHardLink | Sub-task | Resolved | Viraj Jasani | Time Spent: 9.5h
          44. testMoverWithStripedFile fails intermittently | Sub-task | Resolved | Viraj Jasani | Time Spent: 3h 10m
          45. RBF: Flaky test TestRouterWebHDFSContractCreate>AbstractContractCreateTest#testSyncable in Trunk | Sub-task | Reopened | Fengnan Li
          46. TestWebHDFS#testLargeFile fails intermittently | Sub-task | Open | Yongjun Zhang
          47. Use JUnit Parameterized test suite in TestWriteReadStripedFile | Sub-task | Patch Available | Huafeng Wang
          48. TestErasureCodeBenchmarkThroughput#testECReadWrite fails intermittently | Sub-task | Open | Unassigned
          49. TestStartup#testStorageBlockContentsStaleAfterNNRestart fails intermittently | Sub-task | Open | Ajith S
          50. TestDirectoryScanner#testThrottling fails: Throttle is too permissive | Sub-task | Patch Available | Daniel Templeton | Time Spent: 50m
          51. TestDecommission.testIncludeByRegistrationName fails intermittently | Sub-task | Patch Available | Binglin Chang
          52. TestRetryCacheWithHA#testUpdatePipeline fails intermittently | Sub-task | Patch Available | Ranith Sardar
          53. TestRBWBlockInvalidation#testBlockInvalidationWhenRBWReplicaMissedInDN is flaky and sometimes gets stuck in infinite loops | Sub-task | Patch Available | Ratandeep Ratti
          54. TestTransferFsImage#testClientSideException fails intermittently | Sub-task | Open | Unassigned
          55. TestReconstructStripedFile.testNNSendsErasureCodingTasks fails occasionally | Sub-task | Open | Unassigned
          56. TestBalancer#testUnknownDatanode occasionally fails in trunk | Sub-task | Reopened | Unassigned
          57. Refactor TestBalancer for faster execution. | Sub-task | Open | Unassigned
          58. TestRouterRpcMultiDestination#testErasureCoding fails on trunk | Sub-task | Open | Fengnan Li
          59. TestStripedFileAppend#testAppendToNewBlock fails on trunk | Sub-task | Open | Takanobu Asanuma
          60. TestBlockTokenWithDFSStriped errors port binding | Sub-task | Open | Unassigned
          61. TestUnderReplicatedBlocks#testSetrepIncWithUnderReplicatedBlocks test timeout | Sub-task | Open | Hrishikesh Gadre
          62. TestDecommissionWithStripedBackoffMonitor#testDecommissionWithMissingBlock fails on trunk intermittently | Sub-task | Open | Unassigned
          63. TestFsDatasetImpl#testDnRestartWithHardLink fails intermittently | Sub-task | Resolved | Unassigned
          64. TestNamenodeCapacityReport#testXceiverCount is flaky | Sub-task | Open | Unassigned
          65. TestStandbyCheckpoints#testCheckpointBeforeNameNodeInitializationIsComplete fails intermittently | Sub-task | Open | Unassigned
          66. TestJournalNodeRespectsBindHostKeys fails consistently on branch-2.10 | Sub-task | Open | Unassigned
          67. TestObservernode#testMkdirsRaceWithObserverRead is flaky | Sub-task | Open | Unassigned
          68. TestDecommissionWithBackoffMonitor#testDecommissionWithCloseFileAndListOpenFiles fails | Sub-task | Open | Unassigned
          69. TestDFSInotifyEventInputStreamKerberized#testWithKerberizedCluster fails | Sub-task | Open | Unassigned
          70. TestHDFSFileSystemContract#testAppend fails | Sub-task | Open | Unassigned
          71. TestBlockTokenWithDFSStriped#testEnd2End fails | Sub-task | Open | Unassigned
          72. TestFileTruncate#testTruncateWithDataNodesShutdownImmediately fails | Sub-task | Open | Unassigned
          73. Fix TestDataNodeMetrics#testReceivePacketSlowMetrics | Sub-task | Resolved | Haiyang Hu | Time Spent: 2h 40m
          74. De-flake TestRollingUpgrade#testRollback | Sub-task | Resolved | Viraj Jasani | Time Spent: 2h

          People

              Assignee: Unassigned
              Reporter: Ahmed Hussein (ahussein)
              Votes: 0
              Watchers: 20

          Time Tracking

              Original Estimate: Not Specified
              Remaining Estimate: 0h
              Time Spent: 75.5h