Uploaded image for project: 'Kudu'
  1. Kudu
  2. KUDU-1968

Aborted tablet copies delete live blocks

Agile BoardAttach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 1.3.0
    • 1.3.1, 1.4.0
    • tserver
    • None

    Description

      72541b47eb55b2df4eab5d6050f517476ed6d370 (KUDU-1853) caused a serious regression in the case of a failed tablet copy. As of that patch, the following sequence happens:

      • we fetch the remote tablet's metadata, and set our local metadata to match it (including the remote block IDs)
      • as we download blocks, we replace remote block ids with local block IDs
      • if we fail in the middle, we call DeleteTablet
        • this means that, since we still have some remote block IDs in the metadata, the DeleteTablet call deletes local blocks based on remote block IDs. These block ids are likely to belong to other live tablets locally!

      This can cause pretty serious dataloss, and has the tendency to cascade around a cluster, since later attempts to copy a tablet with missing blocks will get aborted as well.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            tlipcon Todd Lipcon
            tlipcon Todd Lipcon
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Issue deployment