Uploaded image for project: 'Kudu'
  1. Kudu
  2. KUDU-2807

Possible crash when flush or compaction overlaps with another compaction

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.9.0
    • Fix Version/s: 1.10.0
    • Component/s: tablet
    • Labels:
      None

      Description

      Manuel Sopena reported a crash like this in Slack:

      Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
      F0429 07:26:56.918041 34043 tablet.cc:2268] Check failed: lock.owns_lock() RowSet(24130) unable to lock compact_flush_lock
      

      It's hard to say exactly what's going on without more logging, but after looking at the code in more detail, I think the culprit is this commit, new in Kudu 1.9.0. To understand why it's problematic, we first need to understand the locking invariant in play:

      1. A thread must acquire the tablet's compact_select_lock_ in order to select rowsets to compact.
      2. Because of #1, it's safe to assume that, if a thread successfully acquired a rowset's compact_flush_lock_ in the act of selecting it for compaction, it can release and reacquire the lock without contention. More precisely, it can release the compact_flush_lock_, then try-lock it, and the try-lock is guaranteed to succeed. All compacting MM ops use a CHECK to enforce this invariant.

      With that in mind, here's the problem: at the time that the call to RowSetInfo::ComputeCdfAndCollectOrdered is made from Tablet::AtomicSwapRowSetsUnlocked, the tablet's compact_select_lock_ is not held. ComputeCdfAndCollectOrdered calls RowSet::IsAvailableForCompaction, which try-locks the per-rowset compact_flush_lock_. As a result, it's possible for a racing MM operation to also call IsAvailableForCompaction, successfully try-lock the compact_flush_lock_, release it, try-lock it again (as per the invariant above), fail, and crash in the aforementioned CHECK.

      I don't think this can result in corruption as we crash rather than allowing the MM op to proceed. But it's a bad race and a bad crash, so we should fix it. Possibly producing a 1.9.1 release in the process.

        Attachments

        1. kudu-tserver.INFO.gz
          5.25 MB
          Adar Dembo

          Activity

            People

            • Assignee:
              wdberkeley William Berkeley
              Reporter:
              adar Adar Dembo
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: