Samza / SAMZA-843

Slow start of Samza jobs with large number of containers

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.10.1
    • Component/s: None
    • Labels:
      None

      Description

We have noticed that when a job has a large number of containers and is deployed in Yarn, all the containers query the coordinator URL at the same time, causing a thundering-herd effect. It takes a long time for the job to reach a steady state in which all containers are processing messages and none of them are seeing SocketTimeout exceptions from the Job Coordinator. The effect is amplified further if the AM machine is already heavily loaded.

We could fix this in several ways:
1. We could have containers wait a random period of time before querying the Job Coordinator.
2. We could add a cache in the JobServlet so that the JobModel is not regenerated on each request (a rough sketch of this option follows the list).
3. We could make the JobModel an atomic reference that is updated only when the AM needs to restart a failed container. It is ok for the containers to get a slightly stale JobModel as long as the partition assignment doesn't change.
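A rough, untested sketch of option 2, assuming a servlet that serves the serialized JobModel; the class and method names here (CachedJobModelServlet, buildJobModelJson, the 30s TTL) are made up for illustration and are not actual Samza code:

```java
// Option 2 sketch: cache the serialized JobModel in the servlet so concurrent
// container requests do not each trigger a rebuild.
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CachedJobModelServlet extends HttpServlet {
  private static final long TTL_MS = 30_000L;       // refresh at most every 30s (arbitrary)
  private volatile String cachedJson = null;        // last serialized JobModel
  private volatile long cachedAtMs = 0L;            // when it was built

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    long now = System.currentTimeMillis();
    String json = cachedJson;
    if (json == null || now - cachedAtMs > TTL_MS) {
      synchronized (this) {                         // only one thread rebuilds
        if (cachedJson == null || now - cachedAtMs > TTL_MS) {
          cachedJson = buildJobModelJson();         // the expensive per-request work the ticket wants to avoid
          cachedAtMs = now;
        }
        json = cachedJson;
      }
    }
    resp.setContentType("application/json");
    resp.getWriter().write(json);
  }

  // Placeholder for the expensive JobModel generation.
  private String buildJobModelJson() {
    return "{}";
  }
}
```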

While the above options are good ways to solve this problem, they do raise the question of why the containers should query the coordinator for the JobModel at all (which creates a SPOF for retrieving the JobModel) when it can be inferred by consuming the Coordinator Stream directly.
We should consider an architecture where each container has an embedded job coordinator module that only reads partition-assignment messages. The embedded job coordinator would act as a "slave" JC to the job coordinator running in the AM (a rough sketch of the idea follows). This would be a major architectural change that requires more thought; I just wanted to put the idea down here.
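A very rough sketch of what such an embedded, read-only coordinator might look like inside a container. CoordinatorStreamReader and AssignmentMessage are hypothetical interfaces, not existing Samza classes, and the real design would need far more thought:

```java
import java.util.concurrent.atomic.AtomicReference;

public class EmbeddedCoordinator implements Runnable {
  // Hypothetical reader over the coordinator stream; not a Samza API.
  public interface CoordinatorStreamReader {
    AssignmentMessage poll();                   // blocks for the next message, or returns null on timeout
  }

  // Hypothetical partition-assignment message.
  public interface AssignmentMessage {
    boolean isAssignmentFor(String containerId);
    String toAssignmentJson();
  }

  private final String containerId;
  private final CoordinatorStreamReader reader;
  private final AtomicReference<String> latestAssignment = new AtomicReference<>();
  private volatile boolean running = true;

  public EmbeddedCoordinator(String containerId, CoordinatorStreamReader reader) {
    this.containerId = containerId;
    this.reader = reader;
  }

  @Override
  public void run() {
    while (running) {
      AssignmentMessage msg = reader.poll();
      // Apply only assignment messages addressed to this container; everything
      // else on the coordinator stream is ignored by the "slave" coordinator.
      if (msg != null && msg.isAssignmentFor(containerId)) {
        latestAssignment.set(msg.toAssignmentJson());
      }
    }
  }

  public void stop() {
    running = false;
  }

  public String currentAssignment() {
    return latestAssignment.get();
  }
}
```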

      1. SAMZA-843-0.patch
        13 kB
        Navina Ramesh
      2. SAMZA-843-1.patch
        20 kB
        Navina Ramesh

        Activity

Navina Ramesh added a comment -

        Attaching the patch with changes based on feedback from the RB

Navina Ramesh added a comment -

        Review Board - https://reviews.apache.org/r/41663/

I have modified the JobServlet to simply read from an AtomicReference of the JobModel that is generated by the JobCoordinator, instead of regenerating it on each HTTP request. The JobServlet is started only after the JC initializes the JobModel. This change reduces the time taken to serve the HTTP request.
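A minimal sketch of that hand-off, assuming the JobCoordinator publishes the serialized JobModel into a shared AtomicReference before starting the servlet; the class and constructor here are illustrative, not the actual Samza JobServlet:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class JobModelServlet extends HttpServlet {
  // Written by the JobCoordinator whenever it (re)generates the JobModel,
  // e.g. at startup and again when a failed container is restarted.
  private final AtomicReference<String> jobModelJson;

  public JobModelServlet(AtomicReference<String> jobModelJson) {
    this.jobModelJson = jobModelJson;
  }

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    // Serving is now just a read of the shared reference; no per-request regeneration.
    resp.setContentType("application/json");
    resp.getWriter().write(jobModelJson.get());
  }
}
```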

I have also modified the container to query the JobCoordinator with a random initial delay. It will no longer fail on an IOException; instead, it will continue to retry. I am not sure whether we should also factor in a max-retry limit. Any thoughts on this? Yi Pan
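A sketch of that container-side behavior, assuming a helper that fetches the JobModel over HTTP; the class, method, and delay values are illustrative, not the actual SamzaContainer code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Random;

public final class JobModelFetcher {
  private static final long MAX_INITIAL_DELAY_MS = 10_000L;  // spread container start-ups (arbitrary)
  private static final long RETRY_BACKOFF_MS = 2_000L;       // wait between failed attempts (arbitrary)

  public static String fetchJobModel(URL coordinatorUrl) throws InterruptedException {
    // Random initial delay so containers do not all hit the AM at the same instant.
    Thread.sleep(new Random().nextInt((int) MAX_INITIAL_DELAY_MS));

    while (true) {                                            // no max-retry yet, per the open question above
      try (InputStream in = coordinatorUrl.openStream()) {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
      } catch (IOException e) {
        // Includes SocketTimeoutException from an overloaded AM; back off and retry.
        Thread.sleep(RETRY_BACKOFF_MS);
      }
    }
  }
}
```

If a max-retry limit were added, the while (true) loop would simply become a bounded loop that rethrows the last IOException after N attempts.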


          People

• Assignee:
  Navina Ramesh
• Reporter:
  Navina Ramesh
• Votes:
  0
• Watchers:
  2

            Dates

            • Created:
              Updated:
              Resolved:
