  1. Ambari
  2. AMBARI-24719

Kafka Rolling Restart causes outage(s) due to not checking for under replicated partitions

Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 2.6.2
    • Fix Version/s: None
    • Component/s: ambari-server
    • Labels: None

    Description

      Ambari causes Kafka topic partition outages during rolling restarts because it only does a simplistic 2-minute wait between brokers and doesn't check the state of partition replicas before taking the next broker down.

      On busy Kafka clusters with lots of topics / partitions / data, it might take a while before in-sync replicas recover.

      Ambari should therefore check for any under-replicated partitions and wait as long as it takes for them to recover before proceeding to the next broker. There is, however, an issue in doing so: if there is a topic partition with a replica that no longer exists (e.g. ambari_kafka_service_check), it will never recover, so there needs to be some thoughtful handling around that.
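
The check described above could be sketched roughly as follows. This is only an illustration, not Ambari's code: the function names, the `known_brokers` parameter, and the timeout are all hypothetical. It assumes partition state is obtained as text in the style printed by `kafka-topics.sh --describe --under-replicated-partitions` (lines containing `Topic:`, `Partition:`, `Leader:`, `Replicas:`, `Isr:`), and it interprets "a replica that no longer exists" as a replica assigned to a broker ID that is no longer a member of the cluster.

```python
import re
import time
from typing import Callable, List, NamedTuple, Set


class PartitionState(NamedTuple):
    topic: str
    partition: int
    replicas: Set[int]  # broker IDs assigned to the partition
    isr: Set[int]       # broker IDs currently in sync


# Assumed line format: "Topic: t  Partition: 0  Leader: 1  Replicas: 1,2  Isr: 1"
LINE_RE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*\S+\s+"
    r"Replicas:\s*([\d,]+)\s+Isr:[ \t]*([\d,]*)"
)


def _ids(csv: str) -> Set[int]:
    return {int(x) for x in csv.split(",") if x}


def parse_describe_output(text: str) -> List[PartitionState]:
    """Parse describe-style text into partition states."""
    return [
        PartitionState(m.group(1), int(m.group(2)), _ids(m.group(3)), _ids(m.group(4)))
        for m in LINE_RE.finditer(text)
    ]


def blocking_partitions(states: List[PartitionState],
                        known_brokers: Set[int]) -> List[PartitionState]:
    """Under-replicated partitions that can still be expected to recover."""
    blocking = []
    for s in states:
        if s.isr >= s.replicas:
            continue  # fully replicated
        if not s.replicas <= known_brokers:
            # Assumption: a replica on a broker that is no longer part of the
            # cluster will never catch up, so it must not block the restart.
            continue
        blocking.append(s)
    return blocking


def wait_until_replicated(fetch_states: Callable[[], List[PartitionState]],
                          known_brokers: Set[int],
                          timeout_s: float = 1800.0,  # hypothetical safety limit
                          poll_s: float = 10.0,
                          sleep: Callable[[float], None] = time.sleep,
                          clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until no recoverable under-replicated partitions remain.

    Returns True when it is safe to restart the next broker, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if not blocking_partitions(fetch_states(), known_brokers):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A rolling restart would call `wait_until_replicated(...)` after each broker comes back and only stop the next broker when it returns `True`. The timeout is an added assumption; the request above is to wait as long as recovery takes, so a caller could pass a very large value or surface the timeout to the operator instead of proceeding.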

      This might be solved by AMBARI-24203, but I'm not sure whether it is tied in properly to the rolling restarts, what its timeout policy or time interval is, or whether it takes the above paragraph into account.

      This could also have been easily offset if Ambari had proper extensible checking as raised in AMBARI-24381.

            People

              Assignee: Unassigned
              Reporter: Hari Sekhon (harisekhon)
              Votes: 0
              Watchers: 1
