Uploaded image for project: 'Samza'
  1. Samza
  2. SAMZA-798

Performance and stability issue after combining checkpoint and coordinator stream

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.10.0
    • 0.10.0
    • None
    • None

    Description

      When testing 0.10 release candidate w/ large number of topic partitions and containers, we have observed that there is a serious stability issue when combining checkpoint and coordinator streams together.

      The reasons being the following:
      1) The current JobCoordinator's use case of coordinator includes the following:

      • job configuration
      • task changelog partition map
      • container locality info
      • misc (like migration marker, etc.)
      • checkpoint
        Out of all the above, checkpoint creates the biggest problem:
        1) It is much more dynamic than others. Trying to keep up-to-state w/ the coordinator stream's tail becomes impossible due to checkpointing from all containers. Mixing checkpoint w/ other message together in one stream makes it impossible to differentiate the cases whether there is more important configuration/status information that has to be read immediately versus there are just checkpoint updates in the coordinator stream.
        2) In case of checkpoint, it is not necessary for JobCoordinator to be in the middle of the path. Our previous checkpoint model actually works properly: the checkpoint is written by the containers and read by the containers, and it is very clear that read only happens when recover/restart the container while write happens during the container runtime. Making all containers rely on the JobCoordinator to read the latest checkpoint actually makes JobCoordinator the single point bottleneck.
        3) With the current change, removing CheckpointManagerFactory also disable the possibility for users to plugin their own checkpoint system.

      Bottom line is: the write rate to coordinator stream should just be configuration and other low volume information. High-volume traffic should not be sent to coordinator stream, which is often used by JobCoordinator to build an up-to-date job status.

      In addition, it would be preferred to move the checkpoint back before this release to avoid the unnecessary migration of checkpoint to coordinator stream.

      Attachments

        1. SAMZA-798-2.patch
          195 kB
          Navina Ramesh
        2. SAMZA-798-1.patch
          195 kB
          Navina Ramesh
        3. SAMZA-798-0.patch
          184 kB
          Navina Ramesh

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            navina Navina Ramesh
            nickpan47 Yi Pan
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment