  1. Slider
  2. SLIDER-629

Slider's failure count against the threshold may be inaccurate, or it could be a logging issue

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Slider 0.50
    • Fix Version/s: Slider 0.80
    • Component/s: appmaster
    • Labels: None
    • Sprint: Slider December #1, Slider Jan #2, Slider Feb #1, Slider April #1

    Description

      One of the long-running HBase tests failed with the following error:

      2014-11-08 01:07:26,407 [AmExecutor-008] ERROR appmaster.SliderAppMaster - Cluster teardown triggered org.apache.slider.core.exceptions.TriggerClusterTeardownException: Unstable Application Instance : - failed with component HBASE_REGIONSERVER failing 8 times (0 in startup); threshold is 5 - last failure: Failure container_1415341585168_0005_01_000008 on host onprem-slider23: http://onprem-slider21:19888/jobhistory/logs/onprem-slider23:45454/container_1415341585168_0005_01_000008/ctx/hadoop


      However, only 9 HBASE_REGIONSERVER containers were created in total:

      2014-11-07 16:00:35,346 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000002, on onprem-slider25:45454,
      2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000005, on onprem-slider24:45454,
      2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000007, on onprem-slider22:45454,
      2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000008, on onprem-slider23:45454,
      2014-11-07 23:51:20,040 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000009, on onprem-slider22:45454,
      2014-11-07 23:58:44,810 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000013, on onprem-slider24:45454,
      2014-11-08 00:12:17,804 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000015, on onprem-slider22:45454,
      2014-11-08 00:15:57,373 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000018, on onprem-slider25:45454,
      2014-11-08 01:06:36,771 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000020, on onprem-slider25:45454,
      

      Since 4 region servers were requested and 9 were created, there can have been only 5 failures, not the 8 reported in the error.
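The arithmetic above can be checked mechanically. The following is a minimal sketch, not part of Slider: it parses the "Assigning role" lines quoted in this report and derives the implied failure count as assignments minus the desired instance count. The function names and the regular expression are illustrative assumptions; the desired count of 4 is taken from the description.

```python
import re

# The nine "Assigning role" lines quoted in the description, verbatim.
LOG = """
2014-11-07 16:00:35,346 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000002, on onprem-slider25:45454,
2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000005, on onprem-slider24:45454,
2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000007, on onprem-slider22:45454,
2014-11-07 16:00:35,347 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000008, on onprem-slider23:45454,
2014-11-07 23:51:20,040 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000009, on onprem-slider22:45454,
2014-11-07 23:58:44,810 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000013, on onprem-slider24:45454,
2014-11-08 00:12:17,804 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000015, on onprem-slider22:45454,
2014-11-08 00:15:57,373 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000018, on onprem-slider25:45454,
2014-11-08 01:06:36,771 [AMRM Callback Handler Thread] INFO  state.AppState - Assigning role HBASE_REGIONSERVER to container container_1415341585168_0005_01_000020, on onprem-slider25:45454,
"""

# Illustrative pattern (an assumption based on the lines above):
# captures role, container ID, and host:port.
ASSIGN_RE = re.compile(r"Assigning role (\S+) to container ([^\s,]+), on ([^\s,]+)")


def assigned_containers(log_text, role):
    """Return the container IDs assigned to `role`, in log order."""
    containers = []
    for line in log_text.splitlines():
        match = ASSIGN_RE.search(line)
        if match and match.group(1) == role:
            containers.append(match.group(2))
    return containers


def implied_failures(assigned, desired):
    """Every assignment beyond the desired count replaces a failed container."""
    return max(0, len(assigned) - desired)


assigned = assigned_containers(LOG, "HBASE_REGIONSERVER")
print(len(assigned), implied_failures(assigned, desired=4))  # prints "9 5"
```

This counts only container assignments; failures that never led to a new assignment would not show up here, so the result is a lower bound inferred from the log, not a replacement for the AM's own counter.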

      Perhaps it's a logging issue. Can we also print the window, e.g. "5 failures in X minutes or hours"?
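To illustrate the request, here is a hedged sketch of a failure counter that is scoped to a time window and names that window in its message. It is not Slider's implementation: the class `FailureWindow`, its methods, the "strictly greater than" threshold comparison, and the timestamps in the usage lines are all hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class FailureWindow:
    """Hypothetical sliding-window failure counter (not Slider's code)."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self._failures = deque()  # failure timestamps, oldest first

    def record(self, when):
        """Record a failure at `when`, drop expired ones, return the count."""
        self._failures.append(when)
        cutoff = when - self.window
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures)

    def exceeded(self):
        # Assumption: the limit is breached when the count is above the threshold.
        return len(self._failures) > self.threshold

    def describe(self, role):
        """Build a message that states the window alongside the count."""
        return (f"component {role} failed {len(self._failures)} times "
                f"in the last {self.window}; threshold is {self.threshold}")


# Hypothetical timestamps, for illustration only.
window = FailureWindow(threshold=5, window=timedelta(hours=1))
start = datetime(2014, 11, 8, 0, 0)
window.record(start)
window.record(start + timedelta(minutes=10))
count = window.record(start + timedelta(minutes=90))  # the first two have expired
print(count, window.describe("HBASE_REGIONSERVER"))
```

With a message of this shape, a reader can tell whether "8 failures" refers to the lifetime of the application or only to the most recent window.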

            People

              Assignee: Steve Loughran (stevel@apache.org)
              Reporter: Sumit Mohanty (sumitmohanty)
              Votes: 0
              Watchers: 6