Spark / SPARK-15272

DirectKafkaInputDStream doesn't work with window operation


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.5.2
    • Fix Version/s: None
    • Component/s: DStreams

    Description

      Using a Kafka direct DStream with a simple window operation like:

      kafkaDStream.window(Durations.milliseconds(10000),
                          Durations.milliseconds(1000))
                  .print();
      

      with a 1 s batch duration either freezes after several seconds or lags terribly (depending on the cluster mode).

      This happens when the Kafka brokers are not part of the Spark cluster (they run on different nodes). The KafkaRDD still reports them as preferred locations. This doesn't seem to be a problem in non-window scenarios, but with a window it conflicts with the delay scheduling algorithm implemented in TaskSetManager. That algorithm either significantly delays (Yarn mode) or completely drains (Spark mode) the resource offers with TaskLocality.ANY, which are needed to process tasks whose preferred locations are Kafka brokers. When the delay scheduling algorithm is switched off (spark.locality.wait=0), the example works correctly.
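
      The workaround mentioned above can be applied when the streaming context is built. The following is only a minimal sketch, assuming the Java API and a job launched through spark-submit (so the master is supplied externally); the class name and application name are hypothetical and not taken from the report:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Hypothetical class name, used only for illustration.
public class LocalityWaitWorkaround {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
                .setAppName("direct-kafka-window") // hypothetical application name
                // A wait of 0 switches delay scheduling off: tasks are launched
                // on the first available offer instead of waiting for a better
                // locality level.
                .set("spark.locality.wait", "0");

        // 1 s batch duration, as in the report.
        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(1));

        // Create the direct Kafka stream on jssc and apply
        // window(...).print() as shown in the description, then call
        // jssc.start() and jssc.awaitTermination().
    }
}
```

      The same setting can presumably be passed at launch time with --conf spark.locality.wait=0. Note that it applies to the whole application, so stages that would benefit from locality-aware scheduling lose it as well.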

      I think that the KafkaRDD either shouldn't report preferred locations when the brokers don't correspond to worker nodes, or should allow the reporting of preferred locations to be switched off. It would also be good if the delay scheduling algorithm didn't drain or delay offers when tasks have unmatched preferred locations.

          People

            Assignee: Unassigned
            Reporter: Lubomir Nerad
            Votes: 0
            Watchers: 3
