Kudu / KUDU-1131

Crash in compaction due to overlapping flush/undo snapshots

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Private Beta
    • Fix Version/s: 0.9.0
    • Component/s: tablet

    Description

      Binglin is triggering a crash reasonably regularly under load:

      • A tablet is flushed with a snapshot that has at least one txn in flight while a txn with a later timestamp has already committed, e.g.:
        • txns 1 and 3 are committed and txn 2 is in flight. This gives a flush snapshot of txn <= 1 or txn == 3.
      • As of KUDU-987, we don't wait for all in-flight transactions to commit during flush (necessary, since a txn might be in flight for a while).
      • Because txn 3 was committed, the UNDO delta has a timestamp range of [1, 3].
      • We then select the newly flushed rowset for compaction while txn 2 is still not committed.
        • At this point we hit a CHECK failure because we see an UNDO file that can't be fully ignored by the compaction: its time range overlaps with uncommitted ranges in the current snapshot.
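
      The sequence above can be reproduced with a toy model. This is only a sketch under stated assumptions, not Kudu's actual code: `Snapshot`, `take_snapshot`, and `undo_overlaps_uncommitted` are hypothetical names, timestamps are small dense integers as in the example, and the UNDO range is assumed to run from the lowest to the highest committed timestamp at flush time.

      ```python
      from dataclasses import dataclass


      @dataclass(frozen=True)
      class Snapshot:
          """Toy MVCC snapshot (hypothetical, not Kudu's class).

          Every ts below all_committed_before is committed, plus an
          explicit set of committed timestamps at or above that watermark.
          """
          all_committed_before: int
          committed_later: frozenset = frozenset()

          def is_committed(self, ts: int) -> bool:
              return ts < self.all_committed_before or ts in self.committed_later


      def take_snapshot(committed: set, in_flight: set) -> Snapshot:
          # Assumption: timestamps are dense, so everything below the
          # earliest in-flight txn has already committed.
          if in_flight:
              watermark = min(in_flight)
          else:
              watermark = max(committed, default=0) + 1
          later = frozenset(ts for ts in committed if ts >= watermark)
          return Snapshot(watermark, later)


      def undo_overlaps_uncommitted(snap: Snapshot, first_ts: int, last_ts: int) -> bool:
          # True if any timestamp in [first_ts, last_ts] is uncommitted in
          # snap, i.e. the UNDO file cannot be fully ignored.
          return any(not snap.is_committed(ts) for ts in range(first_ts, last_ts + 1))


      # txns 1 and 3 committed, txn 2 still in flight at flush time.
      committed, in_flight = {1, 3}, {2}
      flush_snap = take_snapshot(committed, in_flight)  # txn <= 1 or txn == 3

      # Assumed for illustration: the UNDO delta spans the committed range.
      undo_range = (min(committed), max(committed))  # (1, 3)

      # Compaction starts while txn 2 is still in flight.
      compaction_snap = take_snapshot(committed, in_flight)
      crash = undo_overlaps_uncommitted(compaction_snap, *undo_range)
      ```

      In this model `crash` is true because timestamp 2 falls inside [1, 3] but is not committed in the compaction snapshot. Once txn 2 commits, the same check passes.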

          People

            Assignee: Todd Lipcon (tlipcon)
            Reporter: Todd Lipcon (tlipcon)