  Samza / SAMZA-408

Expose a metric for tracking AM availability

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: container
    • Labels:
      None

      Description

      We have a metric for tracking the number of running containers, but nothing that indicates whether the job is healthy. A healthy job is one where the AM and all of its containers are running.

      Expose a "healthy" metric: it should be 1 if the AM and all containers are running, and 0 otherwise.
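
      The rule above can be sketched as a single predicate. This is only an illustration: the class name `JobHealth`, the method `healthyValue`, and its parameters are hypothetical and are not taken from the Samza code base or from the attached patches.

```java
// Hypothetical helper for illustration only; not part of Samza.
final class JobHealth {
    private JobHealth() {}

    /**
     * Returns 1 when the AM is running and the number of running containers
     * equals the number the job expects; returns 0 in every other case.
     */
    static int healthyValue(boolean amRunning, int runningContainers, int expectedContainers) {
        return (amRunning && runningContainers == expectedContainers) ? 1 : 0;
    }
}
```

      Under these assumptions, a job that expects three containers but has only two running would report 0, even if the AM itself is up.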

      1. SAMZA-408.0.patch
        3 kB
        David Chen
      2. SAMZA-408.1.patch
        6 kB
        David Chen
      3. SAMZA-408.2.patch
        7 kB
        David Chen

        Activity

        criccomini Chris Riccomini added a comment -

        +1 Merged and committed.

        davidzchen David Chen added a comment -

        Agreed. I have updated the patch to handle the node-failure case and added test coverage for it.

        criccomini Chris Riccomini added a comment -

        Another question: for the case where a container is killed by YARN due to a node failure (exit code -100), should jobHealthy still be 0 until the container is re-allocated?

        Personally, I think so. As a developer, I care if my containers aren't running, even if it's just because YARN has killed them due to a node failure. It seems most intuitive to report the job as unhealthy any time the containers aren't running.
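
        One way to picture this behaviour is a tracker that treats a node-failure exit like any other lost container: the job reports 0 until a replacement is running. The sketch below is an assumption-laden illustration, not the code in the patch. `ContainerHealthTracker` and its methods are hypothetical names, and the -100 constant is simply the exit code mentioned in this thread.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical tracker for illustration only; not Samza's actual AM code.
final class ContainerHealthTracker {
    // Exit code cited in this discussion for containers killed after a node failure.
    static final int NODE_FAILURE_EXIT_CODE = -100;

    private final int expectedContainers;
    private final Set<String> running = new HashSet<>();
    private int nodeFailures = 0;

    ContainerHealthTracker(int expectedContainers) {
        this.expectedContainers = expectedContainers;
    }

    void onContainerStarted(String containerId) {
        running.add(containerId);
    }

    void onContainerCompleted(String containerId, int exitCode) {
        // Node failures are counted separately, but they are not exempt:
        // any completed container leaves the running set.
        if (exitCode == NODE_FAILURE_EXIT_CODE) {
            nodeFailures++;
        }
        running.remove(containerId);
    }

    int nodeFailures() {
        return nodeFailures;
    }

    /** 1 only while every expected container is running, otherwise 0. */
    int jobHealthy() {
        return running.size() == expectedContainers ? 1 : 0;
    }
}
```

        With this shape, the value returns to 1 only once the re-allocated container has actually started, which matches the position argued for above.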

        davidzchen David Chen added a comment -

        Another question: for the case where a container is killed by YARN due to a node failure (exit code -100), should jobHealthy still be 0 until the container is re-allocated?

        davidzchen David Chen added a comment -

        Attaching a new patch handling the case where a failed container is restarted and adding test coverage.

        davidzchen David Chen added a comment -

        Attaching an initial patch. The changes are as follows:

        • Add a new metric job-healthy that emits 1 if the job is healthy or 0 otherwise
        • Add a new jobHealthy field to SamzaAppMasterState that is set to true by default
        • Set jobHealthy to false whenever the count of failed containers is incremented and wherever state.status is set to FinalApplicationStatus.FAILED

        RB: https://reviews.apache.org/r/25522/
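
        As a rough illustration of the three changes listed in this comment, the flag-based approach might look like the sketch below. It is not the patch itself: `AppMasterStateSketch` is a hypothetical stand-in for SamzaAppMasterState, and the method names are assumptions made for this example.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the state object described above;
// this is not the real SamzaAppMasterState.
final class AppMasterStateSketch {
    // The health flag starts out true, as described in the comment.
    final AtomicBoolean jobHealthy = new AtomicBoolean(true);
    final AtomicInteger failedContainers = new AtomicInteger(0);

    /** Called when a container failure is recorded. */
    void onContainerFailed() {
        failedContainers.incrementAndGet();
        jobHealthy.set(false);
    }

    /** Called where the job's final status would be set to FAILED. */
    void onJobFailed() {
        jobHealthy.set(false);
    }

    /** Value a job-healthy gauge would emit: 1 if healthy, 0 otherwise. */
    int jobHealthyMetric() {
        return jobHealthy.get() ? 1 : 0;
    }
}
```

        This sketch covers only the initial patch as described here; the later patches in this issue additionally handle restarted containers and node failures.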

        cpsoman Chinmay Soman added a comment -

        Agreed. I'll change the description.

        criccomini Chris Riccomini added a comment -

        I think the most useful thing is an "is job healthy" metric. If the AM and all containers are running, it should be 1; otherwise, 0.


          People

          • Assignee:
            davidzchen David Chen
            Reporter:
            cpsoman Chinmay Soman
          • Votes:
            0
            Watchers:
            4
