Uploaded image for project: 'Kafka'
  1. Kafka
  2. KAFKA-17954

Error getting oldest-iterator-open-since-ms from JMX

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 3.8.1
    • 4.0.0, 3.9.1, 3.8.2
    • streams
    • None

    Description

      In KIP-989 we introduced a new metric, oldest-iterator-open-since-ms, which reports the timestamp that the oldest currently open KeyValueIterator was opened at.

      On-scrape, we sometimes see this WARN log message:

      Error getting JMX attribute 'oldest-iterator-open-since-ms'
      java.util.NoSuchElementException
      at java.base/java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:1859)
      at java.base/java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
      at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.lambda$registerMetrics$5(MeteredKeyValueStore.java:179)

      However, if no iterators are currently open, this Gauge returns null.

      When using the Prometheus JmxScraper to scrape this metric, its value is added to a ConcurrentHashMap, which does not permit null values.

      We should find some other way to report the absence of this metric that does not cause problems with ConcurrentHashMap.

      My initial analysis was incorrect. The problem appears to be caused by the openIterators Set in MeteredKeyValueStore:

          protected NavigableSet<MeteredIterator> openIterators = new ConcurrentSkipListSet<>(Comparator.comparingLong(MeteredIterator::startTimestamp)); 

      This is used by the Gauge to report the metric:

      openIterators.isEmpty() ? null : openIterators.first().startTimestamp() 

      The source of the exception is the right-hand side of this ternary expression, specifically openIterators.first().

      The condition of this expression should ensure that there is at least one element to retrieve by the right-hand side. However, if the last Iterator is removed from this Set concurrently to the Gauge being reported, after the emptiness check, but before retrieving the element, we can throw the above exception here.

      This can happen because interactive queries and stream threads operate concurrently from the thread that reads the Gauge to report metrics.

      Attachments

        Issue Links

          Activity

            People

              nicktelford Nicholas Telford
              nicktelford Nicholas Telford
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: