Uploaded image for project: 'Samza'
  1. Samza
  2. SAMZA-1992

Hot standby containers for improving recovery-time of Stateful jobs

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      Recovery time for jobs with large state increases significantly in host-failure scenarios. This is problematic because of two reasons:
      a) causes unavailability ,

      and  
      b) causes backlog buildup in case of jobs with high input QPS, requiring scaleup (then scale-down) or causes increased catch-up time.

       

      The solution comprises of two parts 

      1. Increasing restore parallelism: Samza restores stores at SamzaContainer startup sequentially for each task (using TaskStorageManager and TaskSideInputStorage Manager). Parallelizing task restores. We can parallelize store restores either in SamzaContainer (using a bounded/configurable threadpool) or in the TaskInstance or in the TaskStorageManager.

      2. Hot-Standby Containers: 
      A  Samza container which consumes input, reads or updates state, or invokes external services or produces outputs is called an “active container.” 
      A hot-standby container is one which carries updated or hot KV state, and guarantees that, when it is used for a failover, its KV state corresponds to atleast the last checkpoint of the corresponding active container (for each task). There are multiple ways of building such hot-standby container implementations, see this section for pros and cons. Restore time With hot-standby container, restore times should be similar to host-affinity restore-times – 10s of seconds regardless of state size.

      Attachments

        Activity

          People

            Unassigned Unassigned
            rayman7718 Rayman
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: