KUDU-1968

Aborted tablet copies delete live blocks

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.1, 1.4.0
    • Component/s: tserver
    • Labels: None

    Description

      Commit 72541b47eb55b2df4eab5d6050f517476ed6d370 (KUDU-1853) caused a serious regression in the handling of a failed tablet copy. As of that patch, the following sequence happens:

      • we fetch the remote tablet's metadata and set our local metadata to match it (including the remote block IDs)
      • as we download blocks, we replace the remote block IDs with local block IDs
      • if the copy fails partway through, we call DeleteTablet
        • since the metadata still contains some remote block IDs at that point, the DeleteTablet call deletes local blocks based on remote block IDs. These block IDs are likely to belong to other live tablets locally!

      This can cause serious data loss, and it tends to cascade around a cluster, since later attempts to copy a tablet with missing blocks will be aborted as well.
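
The sequence described above can be illustrated with a small toy model. This is only a sketch of the failure mode, not Kudu's actual code: `BlockManager`, `copy_tablet`, the small sequential block IDs, and the choice of colliding IDs are all hypothetical and chosen to make the collision easy to see.

```python
# Toy model of the failure sequence. All names here are hypothetical
# and do not correspond to Kudu's real classes or functions.

class BlockManager:
    """Stands in for one tablet server's local block store."""

    def __init__(self):
        self.blocks = {}   # block ID -> name of the owning tablet
        self.next_id = 1

    def create_block(self, owner):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = owner
        return block_id

    def delete_block(self, block_id):
        # Deletes whatever local block has this ID, regardless of owner.
        self.blocks.pop(block_id, None)


def copy_tablet(local, remote_block_ids, fail_after):
    """Simulate a tablet copy that aborts once `fail_after` blocks are copied."""
    # Step 1: local metadata starts as a copy of the remote metadata,
    # so it initially lists *remote* block IDs.
    metadata = list(remote_block_ids)
    try:
        for i in range(len(metadata)):
            if i == fail_after:
                raise IOError("simulated mid-copy failure")
            # Step 2: each downloaded block gets a fresh local ID, which
            # replaces the remote ID in the metadata.
            metadata[i] = local.create_block(owner="copied-tablet")
    except IOError:
        # Step 3: cleanup deletes every block ID left in the metadata,
        # including remote IDs that were never rewritten.
        for block_id in metadata:
            local.delete_block(block_id)


local = BlockManager()
# A live local tablet owns block IDs 1, 2 and 3.
live_ids = [local.create_block(owner="live-tablet") for _ in range(3)]

# The remote tablet's block IDs 2 and 3 happen to collide with local IDs.
copy_tablet(local, remote_block_ids=[10, 2, 3], fail_after=1)

surviving = [b for b in live_ids if b in local.blocks]
print(surviving)
```

In this model the aborted copy removes two of the live tablet's three blocks, even though the copy never touched that tablet. The live tablet is now missing blocks, which is the state the description says causes later copies of it to be aborted as well.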

            People

              Assignee: Todd Lipcon (tlipcon)
              Reporter: Todd Lipcon (tlipcon)