KAFKA-14016

Revoke more partitions than expected in Cooperative rebalance


Details

    Description

      In https://issues.apache.org/jira/browse/KAFKA-13419 we found that some consumers did not reset their generation and state after sync group failed with a REBALANCE_IN_PROGRESS error.

      We fixed that by resetting the generationId (but not the memberId) when sync group fails with a REBALANCE_IN_PROGRESS error.

      However, that change missed the reset part, so a follow-up change in https://issues.apache.org/jira/browse/KAFKA-13891 was made to get it working.

      After applying this change, we found that consumers sometimes revoke almost 2/3 of the partitions even with cooperative rebalancing enabled. This happens because, if one consumer re-joins very quickly, the other consumers get REBALANCE_IN_PROGRESS in syncGroup and revoke their partitions before re-joining. Example:

      1. Consumers A1-A10 (ten consumers) join and sync the group successfully with generation 1.
      2. A new consumer B1 joins and starts a rebalance.
      3. All consumers join successfully, and A1 needs to revoke a partition so that it can be transferred to B1.
      4. A1 does a very quick syncGroup and re-joins, because it revoked a partition.
      5. A2-A10 did not send syncGroup before A1 re-joined, so when they do send syncGroup they get REBALANCE_IN_PROGRESS.
      6. A2-A10 revoke their partitions and re-join.

      So in this rebalance almost every partition is revoked, which greatly reduces the benefit of cooperative rebalancing.
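
      To make the scale of the problem concrete, here is a toy model of the example above. It is not Kafka client code: the function name, the 30-partition layout (three per consumer), and the choice of which partition moves to B1 are all assumptions made for illustration. It only counts how many partitions are revoked when members that see REBALANCE_IN_PROGRESS reset their state, compared with members that revoke only what the new assignment takes away.

```python
def revoked_per_consumer(owned, target, synced_in_time, reset_on_rebalance_in_progress):
    """Toy model: partitions each consumer revokes in one rebalance round.

    owned          -- consumer -> set of partitions held before the rebalance
    target         -- consumer -> set of partitions in the new assignment
    synced_in_time -- consumers whose syncGroup completed before the next rebalance began
    """
    revoked = {}
    for consumer, before in owned.items():
        if consumer not in synced_in_time and reset_on_rebalance_in_progress:
            # syncGroup returned REBALANCE_IN_PROGRESS and the member reset its
            # state, so it gives up everything it owned (the behavior reported here).
            revoked[consumer] = set(before)
        else:
            # Cooperative behavior: revoke only what the new assignment takes away.
            revoked[consumer] = before - target.get(consumer, set())
    return revoked


def total(revoked):
    return sum(len(partitions) for partitions in revoked.values())


# Assumed layout: A1..A10 own three partitions each (30 in total).
owned = {f"A{i}": {3 * (i - 1), 3 * (i - 1) + 1, 3 * (i - 1) + 2} for i in range(1, 11)}

# Assumed new assignment: only partition 0 moves, from A1 to the new consumer B1.
target = {name: set(partitions) for name, partitions in owned.items()}
target["A1"] = {1, 2}
target["B1"] = {0}

# Only A1 completes syncGroup before the follow-up rebalance starts.
with_reset = total(revoked_per_consumer(owned, target, {"A1"}, True))
without_reset = total(revoked_per_consumer(owned, target, {"A1"}, False))
print(with_reset, without_reset)
```

      Under these assumptions the reset path revokes 28 of the 30 partitions, while the purely cooperative path revokes only the single partition that actually moves.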

      I think that instead of "resetStateAndRejoin when a RebalanceInProgressException occurs in sync group", we need another way to fix it.

      Here is my proposal:

      1. Revert the change in https://issues.apache.org/jira/browse/KAFKA-13891
      2. In the server-side coordinator's handleSyncGroup, once the generationId has been checked and the group state is PreparingRebalance, send the assignment along with the error code REBALANCE_IN_PROGRESS. (I think this is safe, since the generation is verified first.)
      3. When the client gets the REBALANCE_IN_PROGRESS error, it should try to apply the assignment first and then set rejoinNeeded = true so that it re-joins immediately.
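
      A minimal sketch of how steps 2 and 3 could fit together, written as a toy model rather than actual Kafka code. The names SyncGroupResponse, Member, handle_sync_group and on_sync_group_response are hypothetical, and the sketch assumes the coordinator still holds the assignment for the verified generation while the group is in PreparingRebalance.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional

NONE = "NONE"
ILLEGAL_GENERATION = "ILLEGAL_GENERATION"
REBALANCE_IN_PROGRESS = "REBALANCE_IN_PROGRESS"


@dataclass
class SyncGroupResponse:
    error: str
    assignment: Optional[FrozenSet[int]] = None


def handle_sync_group(group_generation, group_state, member_generation, assignment):
    """Hypothetical coordinator side (step 2): verify the generation first."""
    if member_generation != group_generation:
        return SyncGroupResponse(ILLEGAL_GENERATION)
    if group_state == "PreparingRebalance":
        # Proposed change: return the assignment together with the error code.
        return SyncGroupResponse(REBALANCE_IN_PROGRESS, frozenset(assignment))
    return SyncGroupResponse(NONE, frozenset(assignment))


@dataclass
class Member:
    owned: set = field(default_factory=set)
    rejoin_needed: bool = False

    def on_sync_group_response(self, response):
        """Hypothetical client side (step 3): apply the assignment, then re-join."""
        revoked = set()
        if response.assignment is not None:
            # Cooperative semantics: give up only what the assignment removes.
            revoked = self.owned - response.assignment
            self.owned = set(response.assignment)
        if response.error == REBALANCE_IN_PROGRESS:
            self.rejoin_needed = True  # re-join immediately, without resetting state
        return revoked


# A2 from the example: its assignment is unchanged, but a new rebalance has started.
a2 = Member(owned={3, 4, 5})
response = handle_sync_group(2, "PreparingRebalance", 2, {3, 4, 5})
revoked = a2.on_sync_group_response(response)

# A member with a stale generation gets no assignment back.
stale = handle_sync_group(2, "PreparingRebalance", 1, {3, 4, 5})
```

      In this model A2 keeps all of its partitions and is merely flagged to re-join, which is the outcome the proposal is aiming for. A member with a stale generation still gets an error with no assignment, so the generation check stays the safety gate.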

          People

            pnee Philip Nee
            aiquestion Shawn Wang