Cassandra / CASSANDRA-16807

Weak visibility guarantees of Accumulator lead to failed assertions during digest comparison


Details

    • Bug Category: Correctness - Consistency
    • Severity: Critical
    • Priority: Normal
    • Discovered By: Fuzz Test
    • Platform: All
    • Impacts: None
    • Test and Documentation Plan: a new time-bound multi-threaded unit test that "fuzzes" the problematic behavior described in this issue

    Description

      This problem can manifest on all versions since at least 3.0, but I’ll focus on the way it manifests in 4.0 here.

      In what now seems like a wise move, CASSANDRA-16097 added an assertion to DigestResolver#responsesMatch() that ensures the responses snapshot has at least one visible element to compare (although of course a single element trivially cannot produce a mismatch, so the comparison short-circuits immediately). However, at the point ReadCallback#onResponse() signals the waiting resolver, there is no guarantee that the visible size of the generated snapshot of the responses Accumulator is non-zero or, perhaps more worryingly, at least equal to the number of blocked-for responses. This seems to be a consequence of the documented weak visibility guarantees of Accumulator#add(). In short, when there are concurrent invocations of add(), it is not guaranteed that any size change is visible after any one of them returns, only after all of them complete.
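
      For illustration, here is a deliberately simplified, hypothetical model of one way such weak visibility can arise. It is not Cassandra’s actual Accumulator, just a sketch of a slot-then-publish design in which a reader woken as soon as one concurrent add() has returned can still observe a visible size of zero:

      import java.util.concurrent.atomic.AtomicInteger;
      import java.util.concurrent.atomic.AtomicReferenceArray;

      // Simplified, hypothetical model of the weak visibility described above.
      public class WeakAccumulator<E>
      {
          private final AtomicReferenceArray<E> values;
          private final AtomicInteger nextIndex = new AtomicInteger();    // slots claimed by writers
          private final AtomicInteger presentCount = new AtomicInteger(); // slots visible to readers

          public WeakAccumulator(int capacity)
          {
              values = new AtomicReferenceArray<>(capacity);
          }

          public void add(E item)
          {
              int index = nextIndex.getAndIncrement(); // claim a slot
              values.set(index, item);                 // write the element

              // Advance the visible size only while every lower slot is filled. If a
              // concurrent add() claimed a lower slot but hasn't written it yet, this
              // loop exits early, and neither element becomes visible until that
              // slower add() completes and publishes both.
              int visible = presentCount.get();
              while (visible < nextIndex.get() && values.get(visible) != null)
              {
                  presentCount.compareAndSet(visible, visible + 1);
                  visible = presentCount.get();
              }
          }

          // Readers (e.g. the digest resolver) only ever see presentCount elements.
          public int size()
          {
              return presentCount.get();
          }
      }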

      The particular exception looks something like this:

      java.lang.AssertionError: Attempted response match comparison while no responses have been received.
      	at org.apache.cassandra.service.reads.DigestResolver.responsesMatch(DigestResolver.java:110)
      	at org.apache.cassandra.service.reads.AbstractReadExecutor.awaitResponses(AbstractReadExecutor.java:393)
      	at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:2150)
      	at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1979)
      	at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1882)
      	at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:1121)
      	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:296)
      	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:248)
      	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:90)
      

      It’s possible to reproduce this on simple single-partition reads without any short-read protection or replica filtering protection. I’ve also been able to reproduce this synthetically with a unit test on ReadCallback.
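
      The actual ReadCallback test isn’t reproduced here, but a minimal, self-contained fuzz loop against the simplified WeakAccumulator sketched above exhibits the same shape of race (class names and timings are illustrative, not the test attached to this issue):

      import java.util.concurrent.CountDownLatch;

      // Hypothetical, time-bound fuzz loop: each round starts two concurrent add()
      // calls and wakes a "waiting" thread as soon as the first add() returns, the
      // way ReadCallback#onResponse() signals the waiting resolver.
      public class WeakAccumulatorFuzz
      {
          public static void main(String[] args) throws Exception
          {
              long deadline = System.currentTimeMillis() + 10_000; // time-bound, as in the test plan
              long zeroSizedSnapshots = 0;

              while (System.currentTimeMillis() < deadline)
              {
                  WeakAccumulator<String> responses = new WeakAccumulator<>(2);
                  CountDownLatch firstAddReturned = new CountDownLatch(1);

                  Runnable add = () ->
                  {
                      responses.add("digest");
                      firstAddReturned.countDown(); // simulate signaling the waiting resolver
                  };

                  Thread t1 = new Thread(add);
                  Thread t2 = new Thread(add);
                  t1.start();
                  t2.start();

                  // The "resolver" wakes up after the first add() returns...
                  firstAddReturned.await();

                  // ...but the visible size may still be zero, which is exactly what
                  // trips the assertion in DigestResolver#responsesMatch().
                  if (responses.size() == 0)
                      zeroSizedSnapshots++;

                  t1.join();
                  t2.join();
              }

              System.out.println("rounds with a zero-sized snapshot: " + zeroSizedSnapshots);
          }
      }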

      It seems like the most straightforward way to fix this would be to avoid signaling in ReadCallback#onResponse() until the visible size of the accumulator is at least the number of received responses. In most cases, this is trivially true, and our signaling behavior won’t change at all. In the very rare case that there are two (or more) concurrent calls to onResponse(), the second (or last) will signal, and having one more response than we strictly need should have no negative side effects. (We don’t seem to make any strict assertions about having exactly the number of required responses, only that we have enough.)
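
      A hypothetical sketch of that guard, again written against the simplified WeakAccumulator above rather than the real ReadCallback (field and method names are illustrative only):

      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.atomic.AtomicInteger;

      // Illustrative callback: track how many responses have been received and only
      // signal the waiting resolver once the accumulator's visible size has caught up.
      public class GuardedCallback
      {
          private final WeakAccumulator<String> responses;
          private final AtomicInteger received = new AtomicInteger();
          private final CountDownLatch signal = new CountDownLatch(1);
          private final int blockFor;

          public GuardedCallback(int maxResponses, int blockFor)
          {
              this.responses = new WeakAccumulator<>(maxResponses);
              this.blockFor = blockFor;
          }

          public void onResponse(String digest)
          {
              responses.add(digest);
              int count = received.incrementAndGet();

              // Proposed guard: wait until the visible size is at least the number of
              // responses we know we've received. In the common, uncontended case this
              // loop never spins; under a rare concurrent onResponse(), the last add()
              // to become visible performs the signal.
              while (responses.size() < count)
                  Thread.onSpinWait();

              if (count >= blockFor)
                  signal.countDown();
          }

          public void awaitResponses() throws InterruptedException
          {
              signal.await();
              // The waiter is now guaranteed to see at least blockFor visible responses,
              // so the assertion in responsesMatch() can no longer fire.
              assert responses.size() >= blockFor;
          }
      }

      In the common single-threaded case the spin loop never iterates, so signaling latency is unchanged; only the rare concurrent onResponse() calls pay a short, bounded wait, which matches the "no negative side effects" expectation above.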

      Attachments

        Issue Links

        Activity


          People

            Assignee: Caleb Rackliffe (maedhroz)
            Reporter: Caleb Rackliffe (maedhroz)
            Authors: Caleb Rackliffe
            Reviewers: Andres de la Peña, Benedict Elliott Smith
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 2.5h
