Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-6714

additionally overload KafkaUtils.createDirectStream for using a messageHandler without having to specify the offsets



    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: DStreams
    • Labels:


      Currently, in the Scala API, KafkaUtils.createDirectStream has two overload methods, one for an "easy mode" where only the Kafka parameters and topics are specified, and other "hard mode" where we can specify the offsets and a messageHandler for manipulating the MessageAndMetadata values obtained from Kafka. I think an intermediate method that automatically handles the offsets, but that allows you to specify the messageHandler would be very useful. For example the triple (topic, partition, offset) uniquely identifies each message, so that could be useful as primary key for idempotent insertions in an external database. Also, in an implementation of a lambda architecture, offsets could be used to trace which part of a kafka topic has been covered by the batch view, and which part by the real-time / live view. For both cases I think that having access to the meta information through a messageHandler, while maintaining managed offsets, would be interesting

      A proposal for the implementation is available at https://github.com/juanrh/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala. A new overload of KafkaUtils.createDirectStream for the Scala API, and another for the Java API, have been added




            • Assignee:
              juanrh Juan Rodríguez Hortalá
            • Votes:
              0 Vote for this issue
              2 Start watching this issue


              • Created: