  1. Kafka
  2. KAFKA-9261

NPE when updating client metadata

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.3.1
    • Fix Version/s: 2.4.0, 2.3.2
    • Component/s: None
    • Labels: None

      Description

      We have seen the following exception recently:

      java.lang.NullPointerException
      	at java.base/java.util.Objects.requireNonNull(Objects.java:221)
      	at org.apache.kafka.common.Cluster.<init>(Cluster.java:134)
      	at org.apache.kafka.common.Cluster.<init>(Cluster.java:89)
      	at org.apache.kafka.clients.MetadataCache.computeClusterView(MetadataCache.java:120)
      	at org.apache.kafka.clients.MetadataCache.<init>(MetadataCache.java:82)
      	at org.apache.kafka.clients.MetadataCache.<init>(MetadataCache.java:58)
      	at org.apache.kafka.clients.Metadata.handleMetadataResponse(Metadata.java:325)
      	at org.apache.kafka.clients.Metadata.update(Metadata.java:252)
      	at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:1059)
      	at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:845)
      	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:548)
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
      	at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1281)
      	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225)
      	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201)
      

      The client assumes that if a leader is included in the response, then node information for that leader must also be available. There are at least two ways this assumption can fail:

      1. The client can detect stale partition metadata using the available leader epoch information. If it detects stale partition metadata, it ignores the update and keeps the last known metadata. However, it cannot detect stale broker information and always accepts the latest update. The resulting metadata may therefore be a mix of multiple metadata responses, so the invariant does not hold in general.
      2. No single lock covers both the partition metadata lookup and the live broker lookup when handling a Metadata request. An UpdateMetadata request can therefore arrive concurrently and break the intended invariant.
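
      Case 1 can be illustrated with a minimal, self-contained sketch. It is not Kafka's client code: the types `PartitionInfo`, `Response`, and `View`, the topic-partition name `events-0`, and the broker ids are all hypothetical stand-ins. The sketch only models the merge rule described above (partition entries are filtered by leader epoch, the broker list is always replaced) and shows how the merged view can name a leader that is missing from the live brokers.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class MixedMetadataSketch {
    /** Hypothetical stand-in for per-partition metadata. */
    static final class PartitionInfo {
        final int leaderId;
        final int leaderEpoch;

        PartitionInfo(int leaderId, int leaderEpoch) {
            this.leaderId = leaderId;
            this.leaderEpoch = leaderEpoch;
        }
    }

    /** Hypothetical stand-in for one metadata response. */
    static final class Response {
        final Set<Integer> liveBrokers = new HashSet<>();
        final Map<String, PartitionInfo> partitions = new HashMap<>();
    }

    /** Hypothetical stand-in for the client's merged view of the cluster. */
    static final class View {
        Set<Integer> liveBrokers = new HashSet<>();
        final Map<String, PartitionInfo> partitions = new HashMap<>();

        void apply(Response response) {
            // Broker information carries no epoch, so the latest update always wins.
            liveBrokers = new HashSet<>(response.liveBrokers);
            for (Map.Entry<String, PartitionInfo> e : response.partitions.entrySet()) {
                PartitionInfo known = partitions.get(e.getKey());
                // Partition information is accepted only if its epoch is not older.
                if (known == null || e.getValue().leaderEpoch >= known.leaderEpoch) {
                    partitions.put(e.getKey(), e.getValue());
                }
            }
        }

        boolean leaderIsLive(String topicPartition) {
            PartitionInfo p = partitions.get(topicPartition);
            return p != null && liveBrokers.contains(p.leaderId);
        }
    }

    /** Applies a fresh response followed by a stale one and returns the merged view. */
    static View demo() {
        Response fresh = new Response();
        fresh.liveBrokers.add(1);
        fresh.liveBrokers.add(2);
        fresh.partitions.put("events-0", new PartitionInfo(2, 5));

        // A later response from a broker with older state: lower epoch, no broker 2.
        Response stale = new Response();
        stale.liveBrokers.add(1);
        stale.liveBrokers.add(3);
        stale.partitions.put("events-0", new PartitionInfo(1, 4));

        View view = new View();
        view.apply(fresh);
        view.apply(stale);
        return view;
    }

    public static void main(String[] args) {
        View view = demo();
        System.out.println("leader id: " + view.partitions.get("events-0").leaderId);
        System.out.println("live brokers: " + view.liveBrokers);
        System.out.println("leader is live: " + view.leaderIsLive("events-0"));
    }
}
```

      In this sketch the stale partition entry is rejected, so the view keeps leader 2 at epoch 5, while the broker set is overwritten with one that does not contain broker 2.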

      It seems case 2 has been possible for a long time, but it should be extremely rare. Case 1 was only made possible with KIP-320, which added the leader epoch tracking. It should also be rare, but the window for inconsistent metadata is probably a bit bigger than the window for a concurrent update.

      To fix this, we should make the client more defensive about metadata updates and not assume that the leader is among the live endpoints.
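
      As a rough sketch of what "more defensive" could look like, the following contrasts a strict lookup with a tolerant one. This is an illustration under assumptions, not the actual patch: `Node`, `strictLeader`, and `defensiveLeader` are hypothetical names, and the real client classes are not used. The strict variant mirrors the `Objects.requireNonNull` failure in the stack trace above; the defensive variant reports the leader as unknown when its node is not among the live brokers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

final class DefensiveLeaderLookupSketch {
    /** Hypothetical stand-in for a broker endpoint. */
    static final class Node {
        final int id;
        final String host;

        Node(int id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    /** Assumes the leader is always a live node; throws NullPointerException otherwise. */
    static Node strictLeader(Map<Integer, Node> liveNodes, int leaderId) {
        return Objects.requireNonNull(liveNodes.get(leaderId), "leader node missing");
    }

    /** Treats a leader without node information as currently unknown. */
    static Optional<Node> defensiveLeader(Map<Integer, Node> liveNodes, int leaderId) {
        return Optional.ofNullable(liveNodes.get(leaderId));
    }

    public static void main(String[] args) {
        Map<Integer, Node> liveNodes = new HashMap<>();
        liveNodes.put(1, new Node(1, "broker-1.example"));

        // Leader id 2 is referenced by partition metadata but is not a live node.
        Optional<Node> leader = defensiveLeader(liveNodes, 2);
        System.out.println("leader known: " + leader.isPresent());

        try {
            strictLeader(liveNodes, 2);
        } catch (NullPointerException e) {
            System.out.println("strict lookup failed: " + e.getMessage());
        }
    }
}
```

      With the tolerant lookup, a caller can handle the partition as temporarily leaderless and wait for the next metadata update instead of failing inside `poll`.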

            People

            • Assignee: Jason Gustafson (hachikuji)
            • Reporter: Jason Gustafson (hachikuji)