KAFKA-13368

Support smart topic polling for consumer with multiple topic subscriptions


Details

    • Type: Wish
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: consumer
    • Labels: None

    Description

      Currently there is no way to control how a Kafka consumer polls messages from the list of topics it has subscribed to. If I understand correctly, the current approach is effectively round-robin polling across all subscribed topics.
      This works reasonably well when the consumer's offsets are close to the latest offsets of the topics. However, if the consumer is configured to start from the earliest offset and the topics contain very different numbers of messages, there is no guarantee of, or control over, how messages are read from each topic.

      Depending on the use case, it may be useful for the developer of a Kafka consumer to override this polling mechanism with a custom strategy that makes sense for downstream applications.

      Suppose you have 2 or more topics that you want to merge into a single topic, but, because of large differences between the topics' message rates, you want to control which topics are polled at a given time.

      As an example consider 2 topics with the following schemas:

      Topic1 Schema: {
         timestamp: Long,
         key: String,
         col1: String,
         col2: String
      }
      
      Topic2 Schema: { 
         timestamp: Long,
         key: String,
         col3: String,
         col4: String 
      }
      

      Topic1 has 1,000,000 events from timestamp 0 to 1,000 (1,000 ev/s) and Topic2 has 50,000 events from timestamp 0 to 50,000 (1 ev/s).

      Next we define a Kafka consumer that subscribes to Topic1 and Topic2. With the current round-robin behaviour, assuming a polling batch of 100 messages, by the time Topic2 is exhausted we will have read roughly 50,000 messages from each topic, which corresponds to 50 seconds' worth of messages on Topic1 and 50,000 seconds' worth of messages on Topic2.
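
      For reference, a minimal sketch of that consumer with today's API (the broker address, group id and String serdes are assumed for illustration). Note that once subscribe() is called, poll() alone gives no way to express which topic the next batch should come from:

      import java.time.Duration;
      import java.util.Arrays;
      import java.util.Properties;
      import org.apache.kafka.clients.consumer.ConsumerConfig;
      import org.apache.kafka.clients.consumer.ConsumerRecord;
      import org.apache.kafka.clients.consumer.ConsumerRecords;
      import org.apache.kafka.clients.consumer.KafkaConsumer;
      import org.apache.kafka.common.serialization.StringDeserializer;

      public class MergingConsumerExample {
         public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "topic-merge-example");     // assumed group id
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // consume from the earliest offset
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");             // the 100-message polling batch above
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
               consumer.subscribe(Arrays.asList("Topic1", "Topic2"));
               while (true) {
                  // poll() decides internally which of the assigned partitions the records
                  // come from; there is no public knob here to prefer Topic2 over Topic1.
                  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                  for (ConsumerRecord<String, String> record : records) {
                     System.out.printf("%s t=%d key=%s%n", record.topic(), record.timestamp(), record.key());
                  }
               }
            }
         }
      }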

      If we then try to sort the merged messages by timestamp, the result is incorrect: we are missing the roughly 950,000 Topic1 messages (timestamps 50 to 1,000) that should be interleaved between messages 50 and 1,000 of Topic2.

      The current workaround is either to buffer the messages from Topic2 (as sketched below) or to run one Kafka consumer per topic, which carries significant overhead: periodic heartbeats, consumer registration in consumer groups, re-balancing, etc.
      For a couple of topics this may be acceptable, but it does not scale to tens, hundreds or more topics in a subscription.
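
      To make the buffering workaround concrete, here is one possible sketch (class and method names are illustrative, not part of Kafka, and it assumes record timestamps are roughly non-decreasing within each topic). It buffers records in timestamp order, only emits a record once every topic has delivered something newer, and uses the existing pause()/resume() calls to cap the buffers of topics that have run ahead:

      import java.time.Duration;
      import java.util.Collection;
      import java.util.Collections;
      import java.util.Comparator;
      import java.util.HashMap;
      import java.util.Map;
      import java.util.PriorityQueue;
      import java.util.Queue;
      import org.apache.kafka.clients.consumer.Consumer;
      import org.apache.kafka.clients.consumer.ConsumerRecord;
      import org.apache.kafka.common.TopicPartition;

      // Illustrative helper, not part of Kafka: merges the subscribed topics by record timestamp.
      public class TimestampMergeLoop {

         public static void run(Consumer<String, String> consumer,
                                Collection<String> topics,
                                int maxBufferedPerTopic) {
            Comparator<ConsumerRecord<String, String>> byTimestamp =
                  Comparator.comparingLong(ConsumerRecord::timestamp);
            Queue<ConsumerRecord<String, String>> buffer = new PriorityQueue<>(byTimestamp);
            Map<String, Long> highestSeen = new HashMap<>();        // highest timestamp polled per topic
            Map<String, Integer> bufferedPerTopic = new HashMap<>();
            for (String t : topics) {
               highestSeen.put(t, Long.MIN_VALUE);
               bufferedPerTopic.put(t, 0);
            }

            while (true) {
               for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(200))) {
                  buffer.add(rec);
                  highestSeen.merge(rec.topic(), rec.timestamp(), Long::max);
                  bufferedPerTopic.merge(rec.topic(), 1, Integer::sum);
               }

               // Pause partitions of topics whose buffers are full so the topics that are
               // behind in time can catch up; pause()/resume() are existing consumer methods.
               for (TopicPartition tp : consumer.assignment()) {
                  if (bufferedPerTopic.getOrDefault(tp.topic(), 0) >= maxBufferedPerTopic) {
                     consumer.pause(Collections.singleton(tp));
                  } else {
                     consumer.resume(Collections.singleton(tp));
                  }
               }

               // A buffered record is safe to emit once every topic has delivered something
               // newer, so no earlier record can still arrive from another topic.
               long watermark = highestSeen.values().stream()
                     .mapToLong(Long::longValue).min().orElse(Long.MIN_VALUE);
               while (!buffer.isEmpty() && buffer.peek().timestamp() <= watermark) {
                  ConsumerRecord<String, String> rec = buffer.poll();
                  bufferedPerTopic.merge(rec.topic(), -1, Integer::sum);
                  System.out.printf("%s t=%d key=%s%n", rec.topic(), rec.timestamp(), rec.key());
               }
            }
         }
      }

      With the numbers above, Topic2 runs ahead in time almost immediately, so its partition spends most of the replay paused while Topic1 catches up. The price is the buffer memory and a stalled watermark if one topic stops producing, which is exactly the kind of per-use-case logic that is awkward to bolt on from the outside.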

      The ideal solution would be to extend the Kafka consumer API to allow a user to define how to selectively poll messages from a subscription.
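
      For concreteness, a purely hypothetical shape such an extension could take; neither the interface nor the configuration key below exists in the Kafka consumer API today:

      import java.util.Map;
      import java.util.Set;
      import org.apache.kafka.common.TopicPartition;

      // Hypothetical extension point, shown only to make the request concrete.
      public interface PollSelectionPolicy {

         /**
          * Called before each fetch: given the current assignment and, for example, the
          * timestamp of the last record returned per partition, return the partitions
          * that the next poll() is allowed to return records from.
          */
         Set<TopicPartition> selectPartitions(Set<TopicPartition> assignment,
                                              Map<TopicPartition, Long> lastReturnedTimestamps);
      }

      // A consumer could then be configured with something like (hypothetical key and class):
      //    props.put("poll.selection.policy.class", OldestTimestampFirstPolicy.class.getName());
      // where OldestTimestampFirstPolicy always selects the partitions whose last returned
      // record has the smallest timestamp, keeping the merged stream roughly time-aligned.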


          People

            Assignee: Unassigned
            Reporter: Pedro Cardoso Silva (pcless)
            Votes: 0
            Watchers: 2
