Cassandra / CASSANDRA-19365

invalid EstimatedHistogramReservoirSnapshot::getValue values due to race condition in DecayingEstimatedHistogramReservoir

Details

    Description

      `DecayingEstimatedHistogramReservoir` has a race condition between `update` and `rescaleIfNeeded`.
      A sample added by `update` can land in a decaying bucket that `rescaleIfNeeded` has already rescaled, yet still carry a weight computed against the old landmark, because `decayLandmark` had not been updated yet when `update` read it.
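
To make the interleaving concrete, here is a minimal single-threaded sketch that replays one `update` racing with one `rescaleIfNeeded`. It is not the Cassandra source: the class name, the `bucketAfterRace` helper, the 60-second mean lifetime and the 30-minute gap are assumptions chosen for illustration, and the exponential forward-decay weight is a simplified model of how the reservoir weights samples.

```java
// Simplified, single-threaded model of the interleaving; not the Cassandra source.
final class StaleLandmarkDemo {
    // Assumed mean lifetime of a sample in seconds (hypothetical constant for this sketch).
    static final double MEAN_LIFETIME_S = 60.0;

    // Forward-decay weight of a sample recorded at nowMs, relative to landmarkMs.
    static double forwardDecayWeight(long nowMs, long landmarkMs) {
        return Math.exp(((nowMs - landmarkMs) / 1000.0) / MEAN_LIFETIME_S);
    }

    // Replays one update against one rescale and returns the final bucket value.
    // staleRead == true models an update that computed its weight from the old
    // landmark but added it to a bucket that had already been rescaled.
    // staleRead == false models an update that observed the new landmark.
    static double bucketAfterRace(boolean staleRead) {
        long oldLandmarkMs = 0L;
        long nowMs = 30L * 60L * 1000L; // assumed: 30 minutes after the old landmark
        double rescaleFactor = forwardDecayWeight(nowMs, oldLandmarkMs);

        // The bucket holds the equivalent of 1000 fresh samples on the old scale.
        double bucket = 1000.0 * rescaleFactor;

        // update(): picks a landmark and derives the sample weight from it.
        long landmarkSeenByUpdate = staleRead ? oldLandmarkMs : nowMs;
        double weight = forwardDecayWeight(nowMs, landmarkSeenByUpdate);

        // rescaleIfNeeded(): scales the bucket down; the landmark moves to nowMs.
        bucket /= rescaleFactor;

        // update(): adds its weight to the already rescaled bucket.
        bucket += weight;
        return bucket;
    }

    public static void main(String[] args) {
        System.out.printf("consistent landmark: %.1f%n", bucketAfterRace(false));
        System.out.printf("stale landmark:      %.3e%n", bucketAfterRace(true));
    }
}
```

In this model a single sample weighted against the stale landmark outweighs the roughly 1000 samples already in the bucket by many orders of magnitude, which is the kind of overweight sample the description refers to.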
       
      The observed consequence was flooding of the cluster with speculative retries (we happened to hit low-percentile buckets with overweight samples, which drove p99 below true p50 for a long time).
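
The effect on percentiles can be illustrated with a small weighted-histogram sketch. Everything here is hypothetical: the bucket bounds, the weights, the size of the overweight sample and the `percentile` helper are made up for the example, and the percentile walk is a simplified stand-in for what `EstimatedHistogramReservoirSnapshot::getValue` does, not its actual code.

```java
// Simplified weighted-percentile model; bucket bounds and weights are made up.
final class OverweightPercentileDemo {
    // Returns the upper bound of the first bucket whose cumulative weight
    // reaches the requested quantile of the total weight.
    static long percentile(long[] bucketUpperBounds, double[] weights, double quantile) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double threshold = quantile * total;
        double cumulative = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (cumulative >= threshold) {
                return bucketUpperBounds[i];
            }
        }
        return bucketUpperBounds[bucketUpperBounds.length - 1];
    }

    public static void main(String[] args) {
        long[] boundsMs = {1, 5, 10, 50, 100};          // hypothetical latency buckets
        double[] healthy = {100, 300, 400, 150, 50};    // hypothetical decayed weights

        long trueP50 = percentile(boundsMs, healthy, 0.50);
        long trueP99 = percentile(boundsMs, healthy, 0.99);

        // One overweight sample lands in the lowest bucket.
        double[] skewed = healthy.clone();
        skewed[0] += 1_000_000.0;
        long skewedP99 = percentile(boundsMs, skewed, 0.99);

        System.out.println("true p50 = " + trueP50 + " ms, true p99 = " + trueP99 + " ms");
        System.out.println("p99 with overweight sample = " + skewedP99 + " ms");
    }
}
```

With these made-up numbers the skewed p99 falls into the lowest bucket, below the true p50, so any retry threshold derived from that p99 would be exceeded by most requests.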

      Please note that despite the manifestation being similar to CASSANDRA-19330, these are two distinct bugs in their own right.

      This bug affects versions 4.0+.
      On 3.11 there's locking in DEHR (`DecayingEstimatedHistogramReservoir`). I did not check earlier versions.

      Attachments

        1. result_details.tar.gz
          2.76 MB
          Caleb Rackliffe
        2. ci_summary.html
          38 kB
          Caleb Rackliffe

          People

            mmuzaf Maxim Muzafarov
            jakubzytka Jakub Zytka
            Jakub Zytka, Maxim Muzafarov
            Caleb Rackliffe
            Votes: 0
            Watchers: 7

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 4h 20m
