Samza / SAMZA-1992

Hot standby containers for improving recovery-time of Stateful jobs


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Recovery time for jobs with large state increases significantly in host-failure scenarios. This is problematic for two reasons:
      a) it causes unavailability, and
      b) for jobs with high input QPS, it causes backlog buildup, which either requires a scale-up (followed by a scale-down) or increases catch-up time.

       

      The solution comprises two parts:

      1. Increasing restore parallelism: Samza restores stores sequentially for each task at SamzaContainer startup (using TaskStorageManager and TaskSideInputStorageManager). Task restores can be parallelized either in SamzaContainer (using a bounded, configurable thread pool), in the TaskInstance, or in the TaskStorageManager.

      2. Hot-standby containers:
      A Samza container that consumes input, reads or updates state, invokes external services, or produces outputs is called an “active container.”
      A hot-standby container is one that carries up-to-date ("hot") KV state and guarantees that, when it is used for a failover, its KV state corresponds to at least the last checkpoint of the corresponding active container (for each task). There are multiple ways of building such hot-standby container implementations; see this section for pros and cons.
      Restore time: With hot-standby containers, restore times should be similar to host-affinity restore times: tens of seconds, regardless of state size.
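
To make the restore-parallelism idea from part 1 concrete, here is a minimal sketch of per-task restores running on a bounded thread pool. It is not Samza code: the class `ParallelRestoreSketch`, the method `restoreAll`, and the `maxRestoreThreads` parameter are hypothetical names chosen for illustration, and each task's restore is modeled as a plain `Runnable` rather than a real TaskStorageManager call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names are Samza APIs.
final class ParallelRestoreSketch {

  /**
   * Runs one restore action per task on a bounded thread pool and blocks
   * until every restore has finished. Fails fast if any restore throws.
   */
  static void restoreAll(Map<String, Runnable> restoreByTask, int maxRestoreThreads) {
    if (restoreByTask.isEmpty()) {
      return;
    }
    // Never create more threads than there are tasks, and always at least one.
    int poolSize = Math.max(1, Math.min(maxRestoreThreads, restoreByTask.size()));
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    try {
      // Submit every task's restore up front so they can run concurrently.
      Map<String, Future<?>> pending = new LinkedHashMap<>();
      for (Map.Entry<String, Runnable> entry : restoreByTask.entrySet()) {
        pending.put(entry.getKey(), pool.submit(entry.getValue()));
      }
      // Wait for each restore; the container should not start processing before this returns.
      for (Map.Entry<String, Future<?>> entry : pending.entrySet()) {
        try {
          entry.getValue().get();
        } catch (ExecutionException e) {
          throw new IllegalStateException("Restore failed for task " + entry.getKey(), e.getCause());
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("Interrupted while restoring task " + entry.getKey(), e);
        }
      }
    } finally {
      // Release the threads; on failure this also interrupts restores still in flight.
      pool.shutdownNow();
    }
  }
}
```

Bounding the pool keeps a container with many tasks from saturating disk or network during startup. Where the pool lives (SamzaContainer, TaskInstance, or TaskStorageManager) is the open design question noted in part 1.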

            People

            • Assignee:
              Unassigned
              Reporter:
              rayman7718 Rayman
