Uploaded image for project: 'Samza'
  1. Samza
  2. SAMZA-591

Slow down reconnects in KafkaSystemConsumer

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: kafka
    • Labels:

      Description

      During a preferred leadership election in Kafka, I see a ton of these messages:

      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      2015-03-11 17:25:40 GetOffset [INFO] Validating offset 1625488 for topic and partition [topic1,7]
      2015-03-11 17:25:40 KafkaSystemConsumer [WARN] While refreshing brokers for [topic1,7]: kafka.common.NotLeaderForPartitionException. Retrying.
      

      It looks like we are not pausing between retries in KafkaSystemConsumer.refreshBrokers.

      1. SAMZA-591-0.patch
        1 kB
        Chris Riccomini

        Activity

        Hide
        criccomini Chris Riccomini added a comment -

        I know why. It's because all BrokerProxies are calling refreshBroekrs simultaneously due to line 140 of BrokerProxy.scala. When a partition is dropped, they all trigger a refresh. The refresh is thread safe, but it results in the same dropped partition getting checked a bunch of times, aggressively. They all use exponential backoff, but if 20 BP threads all start with 100ms delays, we can expected to see this.

        Closing as not a problem.

        Show
        criccomini Chris Riccomini added a comment - I know why. It's because all BrokerProxies are calling refreshBroekrs simultaneously due to line 140 of BrokerProxy.scala. When a partition is dropped, they all trigger a refresh. The refresh is thread safe, but it results in the same dropped partition getting checked a bunch of times, aggressively. They all use exponential backoff, but if 20 BP threads all start with 100ms delays, we can expected to see this. Closing as not a problem.
        Hide
        criccomini Chris Riccomini added a comment -

        Hrm, good point. Wonder why I see so many log lines...

        Show
        criccomini Chris Riccomini added a comment - Hrm, good point. Wonder why I see so many log lines...
        Hide
        closeuris Yan Fang added a comment -

        Shouldn't the "retryBackOff" takes care of the sleeping time and retrying ?

        Show
        closeuris Yan Fang added a comment - Shouldn't the "retryBackOff" takes care of the sleeping time and retrying ?
        Hide
        criccomini Chris Riccomini added a comment -

        Attaching patch. No RB since it's a one liner. Using the standard 10s pause that we use everywhere else.

        Show
        criccomini Chris Riccomini added a comment - Attaching patch. No RB since it's a one liner. Using the standard 10s pause that we use everywhere else.

          People

          • Assignee:
            criccomini Chris Riccomini
            Reporter:
            criccomini Chris Riccomini
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development