  Hadoop HDFS / HDFS-15646

Track failing tests in HDFS


    Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: hdfs
    • Labels:
      None

      Description

      Several unit tests have been failing consistently on Yetus for a long period of time.
      The list keeps growing, and it is driving the repository into an unstable state. Qbt reports more than 40 failing unit tests on average.

      Personally, over the last week, with every patch I submitted, I have had to spend considerable time looking at the same stack traces to double-check whether or not the patch contributes to those failures.

      I found that most of those tests had been failing for quite some time, but no Jiras had been filed.

      The main problem with these consistent failures is that they have side effects on the runtime of the other JUnit tests by consuming resources such as memory and ports.
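      The port contention can be illustrated with plain java.net sockets. The sketch below is hypothetical (PortLeakDemo and portIsTaken are illustrative names, not Hadoop APIs) and assumes a test that leaves a listening socket open, for example because it failed before its cleanup ran. While that socket stays bound, a later test that reuses the same hard-coded port gets a BindException, whereas asking the OS for port 0 returns a different free ephemeral port. It only sketches the mechanism; it is not how Hadoop's own test utilities allocate ports.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical demo class: shows how a leaked listener blocks a hard-coded port.
final class PortLeakDemo {

    /** Returns true if a new listener cannot bind to the given loopback port. */
    static boolean portIsTaken(int port) throws IOException {
        try (ServerSocket probe = new ServerSocket()) {
            probe.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
            return false; // bind succeeded, so the port was free
        } catch (BindException e) {
            return true; // "Address already in use"
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();

        // Simulated leak: a listener a failed test never closed.
        // Port 0 asks the OS to pick any free ephemeral port.
        ServerSocket leaked = new ServerSocket(0, 50, loopback);
        int leakedPort = leaked.getLocalPort();

        // A later test hard-coding the same port cannot bind it.
        System.out.println("hard-coded port taken: " + portIsTaken(leakedPort));

        // Requesting port 0 again yields a different, currently free port.
        try (ServerSocket fresh = new ServerSocket(0, 50, loopback)) {
            System.out.println("ephemeral port differs: " + (fresh.getLocalPort() != leakedPort));
        }

        leaked.close(); // releasing the socket frees the port again
    }
}
```

Running main should report that the hard-coded port is taken and that the ephemeral port differs. Binding to port 0 avoids collisions between tests, but it does not reclaim memory or sockets that a failing test has already leaked.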

      StripedFile and EC tests in particular show up in the list of bad tests 100% of the time.
      I looked at those tests, and they certainly need some improvements (e.g., HDFS-15459). Is anyone interested in those test cases? Can we just turn them off?

      I would like to give a heads-up that we need more collaboration to enforce the stability of the codebase.

      • For all developers: please file a Jira whenever you see a failing test, whether or not it is related to your patch. This gives other developers a heads-up about potential failures. Please do not stop at commenting on your patch that "this is unrelated to my work".
      • Volunteer to dedicate more time to fixing flaky tests.
      • Periodically make sure that the list of failing tests does not exceed a certain number of tests. We have Qbt reports to monitor that, but there is no follow-up on their status.
      • We should consider aggressive strategies, such as blocking all merges until the code is brought back to stability.
      • We need a clear and well-defined process to address Yetus issues: configuration, running out of memory, slowness, etc.
      • Turn off the JUnit tests in modules that are not actively used in the community (e.g., EC, stripedFiles, etc.).

      CC: Akira Ajisaka, Íñigo Goiri, Kihwal Lee, Daryn Sharp, Wei-Chiu Chuang

      Do you guys have any thoughts on the current status of HDFS?

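      The following is a quick list of failing JUnit tests from the Qbt reports. Each entry gives the test name, its duration, and its age (the number of builds for which it has been failing):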

      • org.apache.hadoop.crypto.key.kms.server.TestKMS.testKMSProviderCaching (1.5 sec, age 1)
      • org.apache.hadoop.fs.azure.TestBlobMetadata.testFolderMetadata (42 ms, age 3)
      • org.apache.hadoop.fs.azure.TestBlobMetadata.testFirstContainerVersionMetadata (46 ms, age 3)
      • org.apache.hadoop.fs.azure.TestBlobMetadata.testPermissionMetadata (27 ms, age 3)
      • org.apache.hadoop.fs.azure.TestBlobMetadata.testOldPermissionMetadata (19 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemConcurrency.testNoTempBlobsVisible (0.95 sec, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemConcurrency.testLinkBlobs (33 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testListStatusRootDir (31 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testRenameDirectoryMoveToExistingDirectory (0.25 sec, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testListStatus (29 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testRenameDirectoryAsExistingDirectory (36 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testRenameToDirWithSamePrefixAllowed (23 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testLSRootDir (19 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemContractMocked.testDeleteRecursively (31 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemFileNameCheck.testWasbFsck (1 sec, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testChineseCharactersFolderRename (1 sec, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRedoRenameFolderInFolderListingWithZeroByteRenameMetadata (41 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRedoRenameFolderInFolderListing (37 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testUriEncoding (38 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testDeepFileCreation (37 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testListDirectory (29 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRedoRenameFolderRenameInProgress (37 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRenameFolder (34 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRenameImplicitFolder (27 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRedoRenameFolder (66 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testStoreDeleteFolder (27 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemMocked.testRename (40 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testListStatus (36 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testRenameDirectoryAsEmptyDirectory (0.26 sec, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testListStatusFilterWithSomeMatches (23 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testRenameDirectoryAsNonExistentDirectory (28 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testGlobStatusSomeMatchesInDirectories (26 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testGlobStatusWithMultipleWildCardMatches (27 ms, age 3)
      • org.apache.hadoop.fs.azure.TestNativeAzureFileSystemOperationsMocked.testDeleteRecursively (22 ms, age 3)
      • org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testImplicitFolderDeleted (0.99 sec, age 3)
      • org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testFileAndImplicitFolderSameName (31 ms, age 3)
      • org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testSetOwnerOnImplicitFolder (26 ms, age 3)
      • org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testFileInImplicitFolderDeleted (30 ms, age 3)
      • org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testImplicitFolderListed (22 ms, age 3)
      • org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testCreatingDeepFileCreatesExplicitFolder (53 ms, age 3)
      • org.apache.hadoop.fs.azure.TestOutOfBandAzureBlobOperations.testSetPermissionOnImplicitFolder (22 ms, age 3)
      • org.apache.hadoop.fs.azure.TestWasbFsck.testDelete (1 sec, age 3)
      • org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers (1 min 30 sec, age 17)
      • org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType

        Sub-Tasks

        1. TestPersistBlocks#TestRestartDfsWithFlush flaky failure (Sub-task; Resolved; Unassigned)
        2. Occasional failure in TestDFSClientRetries#testGetFileChecksum because the number of available xcievers is set too low (Sub-task; Resolved; Unassigned)
        3. TestEditLogTailer is flaky (Sub-task; Resolved; Unassigned)
        4. TestBlockTokenWithDFSStriped fails intermittently (Sub-task; Resolved; Ahmed Hussein)
        5. TestDFSClientRetries#testGetFileChecksum fails intermittently (Sub-task; Resolved; Ahmed Hussein; time spent 1.5h)
        6. TestHAAppend#testMultipleAppendsDuringCatchupTailing is flaky (Sub-task; Resolved; Ahmed Hussein; time spent 1h 20m)
        7. TestReconstructStripedFile.testNNSendsErasureCodingTasks randomly cannot finish in 60s (Sub-task; Resolved; Sammi Chen)
        8. TestFileCreation#testServerDefaultsWithMinimalCaching fails intermittently (Sub-task; Resolved; Ahmed Hussein; time spent 2h 10m)
        9. TestFsDatasetImpl fails intermittently (Sub-task; Resolved; Ahmed Hussein; time spent 50m)
        10. TestBPOfferService#testMissBlocksWhenReregister fails intermittently (Sub-task; Resolved; Ahmed Hussein; time spent 1h 40m)
        11. RBF: TestRouter#testNamenodeHeartBeatEnableDefault fails by BindException (Sub-task; Resolved; Akira Ajisaka; time spent 2h)
        12. EC: Fix checksum computation in case of native encoders (Sub-task; Resolved; Ayush Saxena; time spent 4h 40m)
        13. Flaky test TestSnapshotFileLength.testSnapshotfileLength (Sub-task; Resolved; Ahmed Hussein)
        14. TestBPOfferService#testMissBlocksWhenReregister is flaky (Sub-task; Resolved; Unassigned)
        15. Disable Broken Azure Junits (Sub-task; Resolved; Ahmed Hussein; time spent 20m)
        16. Set dfs.namenode.redundancy.considerLoad to false in MiniDFSCluster (Sub-task; Resolved; Ahmed Hussein; time spent 3h 40m)
        17. TestBPOfferService#testMissBlocksWhenReregister fails on trunk (Sub-task; Resolved; Masatake Iwasaki; time spent 1h)
        18. TestRouterRpcMultiDestination#testGetCachedDatanodeReport fails on trunk (Sub-task; Resolved; Masatake Iwasaki; time spent 40m)
        19. TestRouterRpcMultiDestination#testNamenodeMetrics fails on trunk (Sub-task; Resolved; Masatake Iwasaki)
        20. Testcase TestBalancer#testBalancerWithPinnedBlocks always fails (Sub-task; Resolved; Unassigned)
        21. TestReconstructStripedFile#testNNSendsErasureCodingTasks fails intermittently (Sub-task; Resolved; Hemanth Boyina)
        22. TestDistributedFileSystem#testGetFileBlockStorageLocationsBatching fails intermittently (Sub-task; Resolved; Unassigned)
        23. TestDFSOutputStream#testCloseTwice implementation is broken (Sub-task; Resolved; Ahmed Hussein)
        24. TestURLConnectionFactory fails by NoClassDefFoundError in branch-3.3 and branch-3.2 (Sub-task; Resolved; Chao Sun; time spent 50m)
        25. TestUpgradeDomainBlockPlacementPolicy flaky (Sub-task; Resolved; Ahmed Hussein; time spent 1h 50m)
        26. TestFileChecksum should be parameterized (Sub-task; Resolved; Masatake Iwasaki; time spent 1h 40m)
        27. Fix intermittent failure of TestDecommission#testAllocAndIBRWhileDecommission (Sub-task; Resolved; Masatake Iwasaki; time spent 40m)
        28. TestMultipleNNPortQOP#testMultipleNNPortOverwriteDownStream fails intermittently (Sub-task; Resolved; Toshihiko Uchida; time spent 3h 40m)
        29. TestBalancerWithMultipleNameNodes#testBalancingBlockpoolsWithBlockPoolPolicy fails on trunk (Sub-task; Resolved; Masatake Iwasaki; time spent 1h)
        30. Fix TestFsDatasetImpl.testReadLockCanBeDisabledByConfig (Sub-task; Resolved; Leon Gao; time spent 1h 20m)
        31. TestWebHDFS#testLargeFile fails intermittently (Sub-task; Open; Yongjun Zhang)
        32. Use JUnit Parameterized test suite in TestWriteReadStripedFile (Sub-task; Patch Available; Huafeng Wang)
        33. TestErasureCodeBenchmarkThroughput#testECReadWrite fails intermittently (Sub-task; Open; Unassigned)
        34. TestStartup#testStorageBlockContentsStaleAfterNNRestart fails intermittently (Sub-task; Open; Ajith S)
        35. TestDirectoryScanner#testThrottling fails: Throttle is too permissive (Sub-task; Patch Available; Daniel Templeton)
        36. TestDecommission.testIncludeByRegistrationName fails intermittently (Sub-task; Patch Available; Binglin Chang)
        37. TestRetryCacheWithHA#testUpdatePipeline fails intermittently (Sub-task; Patch Available; Ranith Sardar)
        38. TestRBWBlockInvalidation#testBlockInvalidationWhenRBWReplicaMissedInDN is flaky and sometimes gets stuck in infinite loops (Sub-task; Patch Available; Ratandeep Ratti)
        39. TestTransferFsImage#testClientSideException fails intermittently (Sub-task; Open; Unassigned)
        40. TestReconstructStripedFile.testNNSendsErasureCodingTasks fails occasionally (Sub-task; Open; Unassigned)
        41. TestBalancer#testMaxIterationTime fails sporadically (Sub-task; Patch Available; Toshihiko Uchida; time spent 0.5h)
        42. TestBalancer#testUnknownDatanode occasionally fails in trunk (Sub-task; Reopened; Unassigned)
        43. Refactor TestBalancer for faster execution (Sub-task; Open; Unassigned)
        44. TestBalancerRPCDelay#testBalancerRPCDelayQpsDefault fails on Trunk (Sub-task; Open; Unassigned)
        45. TestRouterRpcMultiDestination#testErasureCoding fails on trunk (Sub-task; Open; Unassigned)
        46. TestStripedFileAppend#testAppendToNewBlock fails on trunk (Sub-task; Open; Takanobu Asanuma)
        47. TestBlockTokenWithDFSStriped errors port binding (Sub-task; Open; Unassigned)
        48. TestUnderReplicatedBlocks#testSetrepIncWithUnderReplicatedBlocks test timeout (Sub-task; Open; Hrishikesh Gadre)
        49. TestDecommissioningStatus#testDecommissionStatus fails intermittently (Sub-task; Open; Ajay Kumar)
        50. TestDecommissionWithStripedBackoffMonitor#testDecommissionWithMissingBlock fails on trunk intermittently (Sub-task; Open; Unassigned)
        51. TestFsDatasetImpl#testDnRestartWithHardLink fails intermittently (Sub-task; Open; Unassigned)
        52. Flaky test TestRouterWebHDFSContractCreate>AbstractContractCreateTest#testSyncable in Trunk (Sub-task; Open; Unassigned)


            People

            • Assignee: Unassigned
            • Reporter: Ahmed Hussein (ahussein)


                Time Tracking

                • Estimated: Not Specified
                • Remaining: 0h
                • Logged: 31h 20m
