SPARK-38715: Would be nice to be able to configure a client ID pattern in Kafka integration


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None

    Description

By default, the Kafka client automatically generates a unique client ID.
      The client ID is used by many data lineage tools to track consumers and producers (for consumers the consumer group is also used, but for producers only the client ID is available).

Setting [client.id](https://kafka.apache.org/documentation/#producerconfigs_client.id) in the options passed to a Spark Kafka read or write is not possible, as it would force the same client.id on at least both the driver and the executors.
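
      For illustration, the only handle today is the pass-through Kafka option, which pins one fixed client.id on the driver and every executor client alike (the `my-workflow` value below is just a placeholder):

      ```scala
      // Current workaround: pass client.id through as a kafka.-prefixed
      // option. Every Kafka client Spark creates then shares this one ID.
      val df = spark
        .read
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("subscribe", "topic1")
        .option("kafka.client.id", "my-workflow") // same ID everywhere
        .load()
      ```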

What could be done is to support a Spark-specific option, maybe named `clientIdPrefix`.

      e.g.

      ```scala
      val df = spark
        .read
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("subscribePattern", "topic.*")
        .option("startingOffsets", "earliest")
        .option("endingOffsets", "latest")
        .option("clientIdPrefix", "my-workflow-") // the proposed new option
        .load()
      ```
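
      The same (proposed) `clientIdPrefix` option would presumably apply on the write path as well, e.g.:

      ```scala
      // Hypothetical write-side usage of the proposed option; the topic and
      // bootstrap servers are placeholders.
      df.write
        .format("kafka")
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
        .option("topic", "topic1")
        .option("clientIdPrefix", "my-workflow-")
        .save()
      ```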

A possible implementation would be to update [InternalKafkaProducerPool](https://github.com/apache/spark/blob/master/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/producer/InternalKafkaProducerPool.scala#L75), or maybe Spark's `KafkaConfigUpdater`?
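
      A minimal sketch of the prefixing logic, independent of where it would hook in (names such as `ClientIdConfigurer` and `withClientId` are made up for illustration; the real change would live in the pooling/config code linked above):

      ```scala
      import java.{util => ju}
      import java.util.UUID

      // Hypothetical helper: applies the proposed `clientIdPrefix` option by
      // generating a distinct client.id per Kafka client instance, so the
      // driver and each executor get unique IDs sharing a common prefix.
      object ClientIdConfigurer {
        // Spark treats source option keys case-insensitively (lower-cased).
        val ClientIdPrefixOption = "clientidprefix"

        def withClientId(
            sourceOptions: Map[String, String],
            kafkaParams: ju.Map[String, Object]): ju.Map[String, Object] = {
          val updated = new ju.HashMap[String, Object](kafkaParams)
          sourceOptions.get(ClientIdPrefixOption).foreach { prefix =>
            // Respect an explicit kafka.client.id if the user forced one;
            // the random suffix keeps every generated client ID unique.
            if (!updated.containsKey("client.id")) {
              updated.put("client.id", s"$prefix${UUID.randomUUID()}")
            }
          }
          updated
        }
      }
      ```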

People

            Assignee: Unassigned
            Reporter: cchantepie (Cédric Chantepie)
