Samza / SAMZA-2787

Add GetDeleted API to Blob Store backup and restore managers and recover from DeletedException

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Problem Statement:

      • YARN can sometimes create orphaned containers. In our production systems, we observed overlapping Samza containers running and committing at the same time.
      • If the stores are backed up to a blob store, the orphaned, overlapping container may delete a blob (deletions are common during delta state calculation in the commit lifecycle with the blob store backend) that the non-orphaned container still expects to be present.
      • The non-orphaned container then fails with DeletedException, which is the blob store's response indicating that the blob was present but is now gone. This causes the container, and subsequently the job, to fail.

      Fix:

      • During commit, if a container fails with DeletedException, let the container fail and restart.
      • During the recovery phase of the restart, fetch the deleted blob using a get() call with the getDeleted flag set. The flag tells the blob store to return a blob that is marked for deletion but not yet compacted.
      • Recreate the blob by uploading it to the blob store afresh, and use the new blob ID received to create a new checkpoint.
      • Write this new checkpoint to the checkpoint topic.
      • After this, and for as long as the orphaned container has not been cleaned up by YARN, the container should be able to commit regularly.
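
The recovery steps above can be sketched roughly as follows. This is a minimal illustration only, not Samza's actual implementation: InMemoryBlobStore, its get(id, getDeleted) signature, the local DeletedException class, and the store-name-to-blob-ID map used as a "checkpoint" are all hypothetical stand-ins. Writing the new checkpoint to the checkpoint topic is left out; the sketch simply returns the rewritten mapping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in; NOT Samza's real DeletedException class.
class DeletedException extends RuntimeException {
  DeletedException(String blobId) {
    super("Blob was deleted: " + blobId);
  }
}

// Hypothetical in-memory blob store with soft deletes. A deleted blob is
// only marked; it remains readable when getDeleted is true (this sketch
// never compacts, so marked blobs are always still recoverable).
class InMemoryBlobStore {
  private final Map<String, byte[]> blobs = new HashMap<>();
  private final Map<String, Boolean> markedDeleted = new HashMap<>();
  private int nextId = 0;

  String put(byte[] data) {
    String id = "blob-" + (nextId++);
    blobs.put(id, data.clone());
    markedDeleted.put(id, false);
    return id;
  }

  byte[] get(String id, boolean getDeleted) {
    byte[] data = blobs.get(id);
    if (data == null) {
      throw new IllegalStateException("Unknown blob: " + id);
    }
    if (markedDeleted.get(id) && !getDeleted) {
      throw new DeletedException(id);
    }
    return data.clone();
  }

  void delete(String id) {
    if (blobs.containsKey(id)) {
      markedDeleted.put(id, true);
    }
  }
}

class BlobRecoverySketch {
  // For each store in the (simplified) checkpoint, try a normal get. On
  // DeletedException, re-read the blob with getDeleted=true, upload it
  // again, and record the new blob ID in the returned checkpoint.
  static Map<String, String> recover(InMemoryBlobStore store, Map<String, String> checkpoint) {
    Map<String, String> newCheckpoint = new HashMap<>();
    for (Map.Entry<String, String> entry : checkpoint.entrySet()) {
      String blobId = entry.getValue();
      try {
        store.get(blobId, false);
        newCheckpoint.put(entry.getKey(), blobId);
      } catch (DeletedException e) {
        byte[] data = store.get(blobId, true);
        newCheckpoint.put(entry.getKey(), store.put(data));
      }
    }
    return newCheckpoint;
  }

  public static void main(String[] args) {
    InMemoryBlobStore store = new InMemoryBlobStore();
    String blobId = store.put("store state".getBytes());
    Map<String, String> checkpoint = new HashMap<>();
    checkpoint.put("my-store", blobId);

    // Simulate the orphaned container deleting the blob.
    store.delete(blobId);

    Map<String, String> recovered = recover(store, checkpoint);
    System.out.println("old blob id: " + blobId + ", new blob id: " + recovered.get("my-store"));
  }
}
```

In this sketch the old blob ID is dropped from the checkpoint once the content is re-uploaded, so a later delete of the old ID by the orphaned container no longer affects the restarted container.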

          People

            Assignee: shekhars-li Shekhar Sharma
            Reporter: shekhars-li Shekhar Sharma
              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 7h 50m