  Kudu / KUDU-1968

Aborted tablet copies delete live blocks

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.1, 1.4.0
    • Component/s: tserver
    • Labels:
      None

      Description

      Commit 72541b47eb55b2df4eab5d6050f517476ed6d370 (KUDU-1853) introduced a serious regression when a tablet copy fails. Since that patch, the following sequence occurs:

      • we fetch the remote tablet's metadata, and set our local metadata to match it (including the remote block IDs)
      • as we download blocks, we replace the remote block IDs with the corresponding local block IDs
      • if the copy fails partway through, we call DeleteTablet
        • because the metadata still contains some remote block IDs, DeleteTablet deletes local blocks using those remote IDs. Those IDs are likely to belong to other live tablets on this server!
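The sequence above can be modeled in a few lines. This is a deliberately simplified sketch, not Kudu code: `BlockManager`, `copy_tablet_buggy`, `CopyAborted`, and the small integer block IDs are hypothetical stand-ins, and the mid-copy failure is simulated. It shows how deleting every ID in a half-rewritten metadata list removes blocks that belong to an unrelated live tablet.

```python
# Hypothetical, simplified model of the KUDU-1968 failure; not Kudu's API.

class CopyAborted(Exception):
    """Simulated mid-copy failure (e.g. a network error)."""


class BlockManager:
    """Local block store mapping block ID -> owning tablet."""

    def __init__(self):
        self.blocks = {}
        self.next_id = 1

    def create_block(self, owner):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = owner
        return block_id

    def delete_block(self, block_id):
        # Deletes whatever local block has this ID, regardless of its owner.
        self.blocks.pop(block_id, None)


def copy_tablet_buggy(bm, tablet, remote_block_ids, fail_after):
    # 1. Local metadata starts as a copy of the remote metadata,
    #    so it initially holds *remote* block IDs.
    metadata = list(remote_block_ids)
    try:
        for i in range(len(metadata)):
            if i == fail_after:
                raise CopyAborted("simulated failure during download")
            # 2. "Download" the block and swap in the new local ID.
            metadata[i] = bm.create_block(tablet)
    except CopyAborted:
        # 3. Abort: delete every ID in the metadata, including the
        #    remote IDs that were never replaced.
        for block_id in metadata:
            bm.delete_block(block_id)
        return False
    return True


bm = BlockManager()
for _ in range(3):
    bm.create_block("tablet-A")  # live local blocks 1, 2, 3 owned by tablet-A

# The remote tablet's block IDs (7, 2, 3) happen to overlap local IDs 2 and 3.
ok = copy_tablet_buggy(bm, "tablet-B", remote_block_ids=[7, 2, 3], fail_after=1)
print(ok, bm.blocks)  # False {1: 'tablet-A'}
```

In this model only block 4 was created by the copy, yet the abort also deletes blocks 2 and 3 of tablet-A, because the unreplaced remote IDs are interpreted as local ones.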

      This can cause serious data loss, and it tends to cascade across a cluster: later attempts to copy a tablet that is now missing blocks will also abort, triggering the same faulty cleanup.
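One way to keep an aborted copy from touching live data, sketched under the same hypothetical model (an illustration only, not necessarily how the Kudu fix was implemented): record the IDs of the blocks the copy actually created, and on abort delete only those instead of everything listed in the metadata. `copy_tablet_tracked` and the other names are hypothetical.

```python
# Hypothetical mitigation sketch under a simplified model; not Kudu's API
# and not necessarily the change that resolved KUDU-1968.

class CopyAborted(Exception):
    """Simulated mid-copy failure."""


class BlockManager:
    """Local block store mapping block ID -> owning tablet."""

    def __init__(self):
        self.blocks = {}
        self.next_id = 1

    def create_block(self, owner):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = owner
        return block_id

    def delete_block(self, block_id):
        self.blocks.pop(block_id, None)


def copy_tablet_tracked(bm, tablet, remote_block_ids, fail_after):
    metadata = list(remote_block_ids)  # still starts out with remote IDs
    created = []  # IDs this copy actually allocated locally
    try:
        for i in range(len(metadata)):
            if i == fail_after:
                raise CopyAborted("simulated failure during download")
            new_id = bm.create_block(tablet)
            created.append(new_id)
            metadata[i] = new_id
    except CopyAborted:
        # Roll back only what this copy created; remote IDs are never
        # treated as local block IDs.
        for block_id in created:
            bm.delete_block(block_id)
        return False
    return True


bm = BlockManager()
for _ in range(3):
    bm.create_block("tablet-A")  # live local blocks 1, 2, 3 owned by tablet-A

ok = copy_tablet_tracked(bm, "tablet-B", remote_block_ids=[7, 2, 3], fail_after=1)
print(ok, bm.blocks)  # False {1: 'tablet-A', 2: 'tablet-A', 3: 'tablet-A'}
```

Keeping the list of created blocks separate from the metadata makes the rollback independent of how far the metadata rewrite had progressed when the copy failed.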

              People

              • Assignee:
                Todd Lipcon (tlipcon)
                Reporter:
                Todd Lipcon (tlipcon)
              • Votes: 0
                Watchers: 3
