HBase / HBASE-14420

Zombie Stomping Session

Details

    • Type: Umbrella
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0, 1.3.0, 2.0.0
    • Component/s: test
    • Labels: None

    Description

      Patch builds are now failing most of the time because we are dropping zombies. I can confirm this happens on non-Apache build boxes too.

      Left-over zombies consume resources on build boxes (OOME cannot create native threads). Having to do multiple test runs in the hope that we can get a non-zombie-making build or making (arbitrary) rulings that the zombies are 'not related' is a productivity sink. And so on...

      This is an umbrella issue for a zombie stomping session that started earlier this week. Will hang sub-issues off this one. Am running builds back-to-back on a little cluster to turn out the monsters.
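
      For anyone wanting to spot candidates in a build log, here is a minimal sketch of one way to do it. It is not the project's findHangingTests.py; the function name is hypothetical, and it assumes surefire 2.x style console output, where each test class prints a "Running <class>" line when it starts and a "Tests run: ... - in <class>" line when it finishes. A class that started but never reported back is a zombie suspect.

```python
import re

# Assumed surefire 2.x console format (an assumption, not taken from this issue):
#   Running org.example.TestFoo
#   Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.2 sec - in org.example.TestFoo
RUNNING = re.compile(r"^Running (\S+)$")
FINISHED = re.compile(r"^Tests run: .* - in (\S+)$")


def find_unfinished_tests(console_lines):
    """Return test classes that started but never reported a result.

    Hypothetical helper: order of first appearance is preserved so the
    report lines up with the order of the build log.
    """
    started = []
    finished = set()
    for raw in console_lines:
        line = raw.strip()
        match = RUNNING.match(line)
        if match:
            if match.group(1) not in started:
                started.append(match.group(1))
            continue
        match = FINISHED.match(line)
        if match:
            finished.add(match.group(1))
    return [name for name in started if name not in finished]
```

      A class showing up here is only a suspect: the build may simply have been killed before it finished, so the list still needs a human look.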

      Attachments

        1. hangers.txt
          33 kB
          Michael Stack
        2-34. none_fix.txt (33 uploads; item 4 is named "none_fix (1).txt")
          0.9 kB each
          Michael Stack

        Issue Links

          1. TestFastFail* are flakey | Sub-task | Closed | Michael Stack
          2. TestStochasticBalancerJmxMetrics.testJmxMetrics_PerTableMode:183 NullPointer | Sub-task | Closed | Michael Stack
          3. Upgrade our surefire-plugin from 2.18 to 2.18.1 | Sub-task | Closed | Michael Stack
          4. TestHttpServerLifecycle#testStartedServerIsAlive times out | Sub-task | Closed | Michael Stack
          5. Set down the client executor core thread count from 256 in tests | Sub-task | Closed | Michael Stack
          6. thrift tests don't have test-specific hbase-site.xml so 'BindException: Address already in use' because info port is not turned off | Sub-task | Closed | Michael Stack
          7. Spark tests failing: bind exception when putting up info server | Sub-task | Closed | Michael Stack
          8. TestFailedAppendAndSync fail | Sub-task | Closed | Unassigned
          9. TestHCM and TestRegionServerNoMaster fixes | Sub-task | Closed | Michael Stack
          10. Follow-on from HBASE-14421, just disable TestFastFail* until someone digs in and fixes it | Sub-task | Closed | Michael Stack
          11. TestHRegion#testFlushCacheWhileScanning goes zombie | Sub-task | Closed | Michael Stack
          12. Disable TestDistributedLogSplitting#testWorkerAbort Its flakey with tenuous chance of success | Sub-task | Closed | Michael Stack
          13. TestBucketCache runs obnoxious 1k threads in a unit test | Sub-task | Closed | Michael Stack
          14. Purge TestFavoredNodeAssignmentHelper, a test for an abandoned feature that can hang | Sub-task | Closed | Michael Stack
          15. Remove TestVisibilityLabelsWithDistributedLogReplay, a test for an unsupported feature | Sub-task | Closed | Michael Stack
          16. Have findHangingTests.py dump more info | Sub-task | Closed | Michael Stack
          17. branch-1 test tweeks; disable assert explicit region lands post-restart and up a few handlers | Sub-task | Closed | Michael Stack
          18. Disable zombie TestReplicationShell | Sub-task | Closed | Michael Stack
          19. Disable zombie TestHFileOutputFormat2 | Sub-task | Closed | Michael Stack
          20. Tuneup hanging test TestMobCompactor and TestMobSweeper | Sub-task | Closed | Jingcheng Du
          21. Disable hanging test TestStochasticLoadBalancer2 | Sub-task | Closed | Unassigned
          22. Disable hanging test TestNamespaceAuditor | Sub-task | Closed | Michael Stack
          23. Split TestHBaseFsck in order to help with hanging tests | Sub-task | Closed | Elliott Neil Clark
          24. Purge TestProcessBasedCluster; it does nothing and then fails | Sub-task | Closed | Michael Stack
          25. TestImportExport#testImport94Table can't find its src data file | Sub-task | Closed | Michael Stack
          26. Clean up TestSnapshotCloneIndependence | Sub-task | Closed | Elliott Neil Clark
          27. Looking for the surefire-killer; builds being killed...ExecutionException: java.lang.RuntimeException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called? | Sub-task | Closed | Michael Stack
          28. Add more info on zombies to test-patch.sh | Sub-task | Closed | Michael Stack
          29. TestCellACLs failing... on1.2 builds | Sub-task | Closed | Michael Stack
          30. Purge TestZkLess* tests from branch-1 | Sub-task | Closed | Michael Stack
          31. Loosen TestChoreService assert AND have TestDataBlockEncoders do less work (and add timeouts) | Sub-task | Closed | Michael Stack
          32. Disable flakey TestMultiParallel#testActiveThreadsCount | Sub-task | Closed | Unassigned
          33. Move TestCellACLs from medium to large category | Sub-task | Closed | Michael Stack
          34. Move TestAssignmentManager from medium to large category | Sub-task | Closed | Michael Stack
          35. Set category timeouts on TestScanner and TestNamespaceAuditor | Sub-task | Closed | Michael Stack
          36. TestZKProcedureControllers.testZKCoordinatorControllerWithSingleMemberCohort is a flakey | Sub-task | Closed | Michael Stack
          37. Add category-based timeouts to MR tests | Sub-task | Closed | Michael Stack
          38. Make TestHCM and TestMetaWithReplicas large tests rather than mediums | Sub-task | Closed | Michael Stack
          39. Vet categorization of tests so they for sure go into the right small/medium/large buckets | Sub-task | Closed | Michael Stack
          40. Run less client threads in tests | Sub-task | Closed | Unassigned
          41. Improve zombie detector; be more discerning | Sub-task | Closed | Michael Stack
          42. TestProcedureAdmin hangs | Sub-task | Closed | Matteo Bertozzi
          43. Cleanup TestAtomicOperation, TestImportExport, and TestMetaWithReplicas | Sub-task | Closed | Michael Stack
          44. NPE reporting server load causes regionserver abort; causes TestAcidGuarantee to fail | Sub-task | Closed | Michael Stack
          45. hbase-it tests failing with OOME; permgen | Sub-task | Closed | Michael Stack
          46. TestSplitTransactionOnCluster#testFailedSplit flakey | Sub-task | Closed | Michael Stack
          47. TestSplitTransactionOnCluster.testSSHCleanupDaugtherRegionsOfAbortedSplit is flakey | Sub-task | Closed | Heng Chen
          48. TestRowCounter flakey especially on branch-1 | Sub-task | Closed | Michael Stack
          49. NPE testing for RIT | Sub-task | Closed | Michael Stack
          50. Hanging test : org.apache.hadoop.hbase.mapreduce.TestImportExport | Sub-task | Closed | Heng Chen
          51. OOME: cannot create native thread is back | Sub-task | Closed | Unassigned

            People

              Assignee: Michael Stack (stack)
              Reporter: Michael Stack (stack)
              Votes: 0
              Watchers: 8
