KAFKA-972

MetadataRequest returns stale list of brokers

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.3
    • Component/s: core
    • Labels:
      None

      Description

      When we issue a MetadataRequest to the cluster, the list of brokers in the response is stale: even when a broker is down, it is still returned to the client. Below are two invocations, the first with both brokers online and the second with one broker down:

      {
        "brokers": [
          { "nodeId": 0, "host": "10.139.245.106", "port": 9092, "byteLength": 24 },
          { "nodeId": 1, "host": "localhost", "port": 9093, "byteLength": 19 }
        ],
        "topicMetadata": [
          {
            "topicErrorCode": 0,
            "topicName": "foozbar",
            "partitions": [
              { "replicas": [ 0 ], "isr": [ 0 ], "partitionErrorCode": 0, "partitionId": 0, "leader": 0, "byteLength": 26 },
              { "replicas": [ 1 ], "isr": [ 1 ], "partitionErrorCode": 0, "partitionId": 1, "leader": 1, "byteLength": 26 },
              { "replicas": [ 0 ], "isr": [ 0 ], "partitionErrorCode": 0, "partitionId": 2, "leader": 0, "byteLength": 26 },
              { "replicas": [ 1 ], "isr": [ 1 ], "partitionErrorCode": 0, "partitionId": 3, "leader": 1, "byteLength": 26 },
              { "replicas": [ 0 ], "isr": [ 0 ], "partitionErrorCode": 0, "partitionId": 4, "leader": 0, "byteLength": 26 }
            ],
            "byteLength": 145
          }
        ],
        "responseSize": 200,
        "correlationId": -1000
      }

      {
        "brokers": [
          { "nodeId": 0, "host": "10.139.245.106", "port": 9092, "byteLength": 24 },
          { "nodeId": 1, "host": "localhost", "port": 9093, "byteLength": 19 }
        ],
        "topicMetadata": [
          {
            "topicErrorCode": 0,
            "topicName": "foozbar",
            "partitions": [
              { "replicas": [ 0 ], "isr": [], "partitionErrorCode": 5, "partitionId": 0, "leader": -1, "byteLength": 22 },
              { "replicas": [ 1 ], "isr": [ 1 ], "partitionErrorCode": 0, "partitionId": 1, "leader": 1, "byteLength": 26 },
              { "replicas": [ 0 ], "isr": [], "partitionErrorCode": 5, "partitionId": 2, "leader": -1, "byteLength": 22 },
              { "replicas": [ 1 ], "isr": [ 1 ], "partitionErrorCode": 0, "partitionId": 3, "leader": 1, "byteLength": 26 },
              { "replicas": [ 0 ], "isr": [], "partitionErrorCode": 5, "partitionId": 4, "leader": -1, "byteLength": 22 }
            ],
            "byteLength": 133
          }
        ],
        "responseSize": 188,
        "correlationId": -1000
      }
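
The mismatch in the second response can be checked mechanically. The following is a minimal sketch, assuming the response has already been decoded into a Python dict with the field names shown above. `suspect_brokers` is a hypothetical helper, not part of any Kafka client, and it is only a heuristic: it flags brokers that are still listed even though they lead no partition and at least one partition replicated on them reports no leader (`"leader": -1`, here with error code 5, LeaderNotAvailable).

```python
def suspect_brokers(metadata):
    """Return node ids that are listed as brokers but lead no partition,
    while at least one partition replicated on them has no leader."""
    listed = {b["nodeId"] for b in metadata["brokers"]}
    leaders = set()
    leaderless_replicas = set()
    for topic in metadata["topicMetadata"]:
        for p in topic["partitions"]:
            if p["leader"] == -1:
                leaderless_replicas.update(p["replicas"])
            else:
                leaders.add(p["leader"])
    return sorted((listed & leaderless_replicas) - leaders)


# Trimmed copy of the second response above (byteLength fields omitted).
response = {
    "brokers": [
        {"nodeId": 0, "host": "10.139.245.106", "port": 9092},
        {"nodeId": 1, "host": "localhost", "port": 9093},
    ],
    "topicMetadata": [
        {
            "topicName": "foozbar",
            "partitions": [
                {"replicas": [0], "isr": [], "partitionErrorCode": 5,
                 "partitionId": 0, "leader": -1},
                {"replicas": [1], "isr": [1], "partitionErrorCode": 0,
                 "partitionId": 1, "leader": 1},
            ],
        }
    ],
}

print(suspect_brokers(response))  # prints [0]
```

In the trimmed response, broker 0 is flagged: it is advertised in `brokers`, yet every partition it hosts is leaderless, which is consistent with it being the broker that is down.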

      1. BrokerMetadataTest.scala
        4 kB
        Grant Henke
      2. KAFKA-972_2015-06-30_18:42:13.patch
        9 kB
        Ashish K Singh
      3. KAFKA-972_2015-07-01_01:36:56.patch
        9 kB
        Ashish K Singh
      4. KAFKA-972_2015-07-01_01:42:42.patch
        9 kB
        Ashish K Singh
      5. KAFKA-972_2015-07-01_08:06:03.patch
        9 kB
        Ashish K Singh
      6. KAFKA-972_2015-07-06_23:07:34.patch
        8 kB
        Ashish K Singh
      7. KAFKA-972_2015-07-07_10:42:41.patch
        8 kB
        Ashish K Singh
      8. KAFKA-972_2015-07-07_23:24:13.patch
        8 kB
        Ashish K Singh
      9. KAFKA-972.patch
        1 kB
        Ashish K Singh

          Activity

          Ashish K Singh added a comment - edited

          Thanks Jun Rao!

          Jun Rao added a comment -

          Thanks for the latest patch. +1 and committed to trunk.

          Ashish K Singh added a comment -

          Jun Rao could you take a look, thanks.

          Ashish K Singh added a comment -

          Updated reviewboard https://reviews.apache.org/r/36030/
          against branch trunk

          Ashish K Singh added a comment -

          Updated reviewboard https://reviews.apache.org/r/36030/
          against branch trunk

          Ashish K Singh added a comment -

          Updated reviewboard https://reviews.apache.org/r/36030/
          against branch trunk

          Ashish K Singh added a comment -

          Updated reviewboard https://reviews.apache.org/r/36030/
          against branch trunk

          Ashish K Singh added a comment -

          Updated reviewboard https://reviews.apache.org/r/36030/
          against branch trunk

          Ashish K Singh added a comment -

          Updated reviewboard https://reviews.apache.org/r/36030/
          against branch trunk

          Ashish K Singh added a comment -

          Updated reviewboard https://reviews.apache.org/r/36030/
          against branch trunk

          Ashish K Singh added a comment -

          Created reviewboard https://reviews.apache.org/r/36030/
          against branch trunk

          Grant Henke added a comment -

          This solution sounds reasonable to me.

          Ashish K Singh added a comment - edited

          Hey Guys,

          I spent some time reproducing the issue and finding the root cause. Turns out KAFKA-1367 is not the issue here. Below is the problem and my suggested solution.

          Problem:
          The list of alive brokers is not propagated to all brokers by the controller. When a broker starts, it writes to the ZK brokers path. The controller watches that path and notices the new broker. On noticing a new broker, the controller sends the UpdateMetadataRequest only to the broker that just started up. The other brokers in the cluster never learn that there are new brokers in the cluster.

          Effect of KAFKA-1367: Once KAFKA-1367 goes in, correct alive-broker information will be propagated to all live brokers after an ISR change at any broker. However, if there are no topics or partitions, KAFKA-1367 will not help and this issue will remain.

          Solution:
          Instead of sending the UpdateMetadataRequest only to the new broker, send it to all live brokers in the cluster.

          Jun Rao, Neha Narkhede, Grant Henke, Gwen Shapira, Joe Stein, Joel Koshy, please share your thoughts. I have a patch ready, which I will post if you think this is indeed the correct approach. I have verified that the above approach fixes the issue.
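
The problem and the proposed fix can be illustrated with a toy model. This is only a sketch under stated assumptions: `Broker`, `Controller`, and the `notify_all` flag are hypothetical names made up for illustration and do not correspond to Kafka's actual classes. Each broker keeps a cached set of alive broker ids that is refreshed only when it receives an update from the controller.

```python
class Broker:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.alive_brokers = set()  # this broker's cached view of the cluster

    def handle_update_metadata(self, alive_ids):
        self.alive_brokers = set(alive_ids)


class Controller:
    def __init__(self, notify_all):
        # notify_all=False models the reported behaviour,
        # notify_all=True models the proposed fix.
        self.notify_all = notify_all
        self.brokers = {}

    def on_broker_startup(self, broker):
        self.brokers[broker.broker_id] = broker
        alive_ids = set(self.brokers)
        targets = list(self.brokers.values()) if self.notify_all else [broker]
        for target in targets:
            target.handle_update_metadata(alive_ids)


def cached_views(notify_all):
    """Start brokers 0 and 1 in order and return each broker's cached view."""
    controller = Controller(notify_all)
    brokers = [Broker(0), Broker(1)]
    for b in brokers:
        controller.on_broker_startup(b)
    return {b.broker_id: sorted(b.alive_brokers) for b in brokers}


print(cached_views(notify_all=False))  # prints {0: [0], 1: [0, 1]}
print(cached_views(notify_all=True))   # prints {0: [0, 1], 1: [0, 1]}
```

With `notify_all=False`, broker 0 never hears about broker 1, so a metadata request served by broker 0 would return an incomplete broker list. Sending the update to every live broker keeps all cached views in agreement.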

          Grant Henke added a comment -

          Sample Failing Tests:

          • testBrokerMetadataOnClusterWithNoTopics
          • testBrokerMetadataOnBrokerShutdown
          • testBrokerMetadataOnBrokerAddition
          Neha Narkhede added a comment -

          Is this repeatable, or does the metadata start returning consistent data after some time? Since the metadata is communicated to the brokers by the controller, there may be a time window after an event has happened and before all the brokers have learned of it.
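
One way to tell a transient propagation window from persistent staleness is to poll the metadata until it matches the expected broker set or a timeout expires. The sketch below assumes a caller-supplied `fetch_broker_ids` function that returns the broker ids from a metadata response. That function, `wait_for_brokers`, and the fake fetcher used to exercise it are all hypothetical.

```python
import time


def wait_for_brokers(fetch_broker_ids, expected, timeout_s=5.0, interval_s=0.1):
    """Poll until fetch_broker_ids() returns `expected` or the timeout expires.

    Returns the elapsed seconds on success, or None if the view never converged.
    """
    expected = set(expected)
    start = time.monotonic()
    while True:
        if set(fetch_broker_ids()) == expected:
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(interval_s)


def make_fake_fetch(stale, fresh, stale_polls):
    """Stand-in for a real metadata call: stale for the first N polls."""
    calls = {"n": 0}

    def fetch():
        calls["n"] += 1
        return stale if calls["n"] <= stale_polls else fresh

    return fetch


# Transient case: the view corrects itself after two stale polls.
transient = wait_for_brokers(make_fake_fetch([0, 1], [1], 2),
                             expected=[1], timeout_s=1.0, interval_s=0.01)
# Persistent case: the view never corrects itself within the timeout.
persistent = wait_for_brokers(make_fake_fetch([0, 1], [1], 10**9),
                              expected=[1], timeout_s=0.05, interval_s=0.01)
print(transient is not None, persistent is None)
```

A `None` result after a generous timeout suggests the staleness is not just the propagation delay described above.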

          Jun Rao added a comment -

          Could you describe how to reproduce this?


            People

            • Assignee: Ashish K Singh
            • Reporter: Vinicius Carvalho
            • Votes: 1
            • Watchers: 6
