Chukwa / CHUKWA-534

Improve fault-tolerance of DemuxManager, PostProcessManager and ChukwaArchiveManager.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      If any of these processes receives more than N consecutive errors, it dies with the message "Too many errors, Bail out!".

      Let's change this to introduce a configurable number of consecutive exceptions that may be encountered before dying. If the value is set to -1, the expected behavior is to keep retrying ad infinitum.

      Another part of this issue is to improve logging of how many consecutive errors have occurred, as well as the time they started. A possible future enhancement could be to support an error time threshold as well as an absolute count.

      Suggesting the following new config settings. They're a bit verbose, but they're clear.

      demux.max.error.count.before.shutdown
      post.process.max.error.count.before.shutdown
      archive.max.error.count.before.shutdown
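
To make the proposed semantics concrete, here is a minimal sketch of how such a setting might be read and checked. It is not the code from the attached patches: `ErrorThreshold` and its methods are hypothetical names, `java.util.Properties` stands in for whatever configuration object the managers actually use, and the default of 5 follows the default mentioned for patch 2 in the activity log.

```java
import java.util.Properties;

/** Hypothetical helper for illustration; not a class from the Chukwa code base. */
final class ErrorThreshold {
    /** Setting name proposed in this issue for the demux process. */
    static final String DEMUX_KEY = "demux.max.error.count.before.shutdown";
    /** Assumed default, matching the default configs mentioned for patch 2. */
    static final int DEFAULT_MAX_ERRORS = 5;

    private final int maxErrors;

    ErrorThreshold(int maxErrors) {
        this.maxErrors = maxErrors;
    }

    /** Reads the limit for the given key, falling back to the default when the key is absent. */
    static ErrorThreshold fromProperties(Properties conf, String key) {
        String raw = conf.getProperty(key, String.valueOf(DEFAULT_MAX_ERRORS));
        return new ErrorThreshold(Integer.parseInt(raw.trim()));
    }

    /** True once more than maxErrors consecutive errors were seen; never true for -1. */
    boolean shouldShutdown(int consecutiveErrors) {
        if (maxErrors == -1) {
            return false; // -1 means keep retrying indefinitely
        }
        return consecutiveErrors > maxErrors;
    }
}
```

With the assumed default of 5, `shouldShutdown(5)` is false and `shouldShutdown(6)` is true, which mirrors the "more than N consecutive errors" wording of the description. The same helper would work for the post-process and archive keys by passing a different key name.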
      
      1. CHUKWA-534_1.patch
        4 kB
        Bill Graham
      2. CHUKWA-534_2.patch
        5 kB
        Bill Graham
      3. CHUKWA-534_3.patch
        11 kB
        Bill Graham

        Activity

        Bill Graham made changes -
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Bill Graham added a comment -

        Thanks Ari, committed.

        Ari Rabkin added a comment -

        Looks good. +1 to commit it.

        Bill Graham made changes -
        Attachment CHUKWA-534_3.patch [ 12457025 ]
        Bill Graham added a comment -

        Attaching patch 3, which expands the scope to include all 3 processes.

        Bill Graham made changes -
        Summary
        Original Value: Improve fault-tolerance of DemuxManager.
        New Value: Improve fault-tolerance of DemuxManager, PostProcessManager and ChukwaArchiveManager.
        Description
        Original Value:
        If the DemuxManager receives more than 5 consecutive errors, it dies with the message "Too many errors, Bail out!".

        Let's change this to introduce a configurable number of consecutive exceptions that may be encountered before dying. If the value is set to -1, the expected behavior is to keep retrying ad infinitum.

        Another part of this issue is to improve logging of how many consecutive errors have occurred, as well as the time they started. A possible future enhancement could be to support an error time threshold as well as an absolute count.

        Suggesting the following new config setting. It's a bit verbose, but it's clear.

        {noformat}
        chukwa.demux.max.error.count.before.shutdown
        {noformat}
        New Value:
        If any of these processes receives more than N consecutive errors, it dies with the message "Too many errors, Bail out!".

        Let's change this to introduce a configurable number of consecutive exceptions that may be encountered before dying. If the value is set to -1, the expected behavior is to keep retrying ad infinitum.

        Another part of this issue is to improve logging of how many consecutive errors have occurred, as well as the time they started. A possible future enhancement could be to support an error time threshold as well as an absolute count.

        Suggesting the following new config settings. They're a bit verbose, but they're clear.

        {noformat}
        demux.max.error.count.before.shutdown
        post.process.max.error.count.before.shutdown
        archive.max.error.count.before.shutdown
        {noformat}
        Bill Graham added a comment -

        Expanding the scope of this JIRA, since all three of these processes could be more fault tolerant. Most have code comments saying they should shut down after 4 errors because the watchdog will restart them, but the watchdog has been deprecated, afaik.

        Bill Graham made changes -
        Attachment CHUKWA-534_2.patch [ 12457010 ]
        Bill Graham added a comment -

        Attaching patch 2, which also contains default configs set to 5.

        Bill Graham made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Bill Graham made changes -
        Field Original Value New Value
        Attachment CHUKWA-534_1.patch [ 12457008 ]
        Bill Graham added a comment -

        Attaching patch 1. Let me know if you have comments or suggestions.

        Bill Graham added a comment -

        Looking more closely at DemuxManager, it seems globalErrorcounter is never reset to 0, so more than 5 non-consecutive errors over the life of the daemon would kill the process. I propose we reset that counter after each successful demux run.

        Also, for consistency with the other demux params, we should drop 'chukwa.' from the name I showed above:

        demux.max.error.count.before.shutdown
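
The reset-on-success behavior proposed here can be sketched as follows. This is a hedged illustration, not the patch itself: `ConsecutiveErrorCounter` and its methods are hypothetical names, and tracking the start time of the streak is an assumption that follows the logging goal in the description.

```java
/** Hypothetical counter for illustration; not a class from the Chukwa code base. */
final class ConsecutiveErrorCounter {
    private int count = 0;
    private long firstErrorMillis = -1L;

    /** Records a failed run and returns the length of the current error streak. */
    int recordError(long nowMillis) {
        if (count == 0) {
            firstErrorMillis = nowMillis; // remember when this streak began
        }
        count++;
        return count;
    }

    /** A successful run ends the streak, so isolated errors never accumulate. */
    void recordSuccess() {
        count = 0;
        firstErrorMillis = -1L;
    }

    /** Number of errors since the last success. */
    int count() {
        return count;
    }

    /** Start time of the current streak in milliseconds, or -1 when there is none. */
    long firstErrorMillis() {
        return firstErrorMillis;
    }
}
```

With this shape, six failures that each alternate with a success leave the count at 1 or 0 throughout, whereas a counter that is never reset would reach 6 and trigger the shutdown.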
        
        Bill Graham created issue -

          People

          • Assignee: Bill Graham
          • Reporter: Bill Graham
          • Votes: 0
          • Watchers: 0
