[SPARK-26167] No output created for aggregation query in append mode - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 2.3.2
Fix Version/s: None
Component/s: Structured Streaming
Labels: None

Description

In append mode, an aggregation query does not produce all expected output for inputs whose watermark has expired. I have data in Kafka that needs to be reprocessed, with the results stored in S3. S3 works only with append mode. The problem is that only part of the data is written to S3. The code below illustrates my approach.

String windowDuration = "24 hours";
String slideDuration = "15 minutes";

Dataset<Row> sliding24h = rowData
    .withWatermark(eventTimeCol, slideDuration)
    .groupBy(functions.window(col(eventTimeCol), windowDuration, slideDuration), col(nameCol))
    .count();

sliding24h
    .writeStream()
    .format("console")
    .option("truncate", false)
    .option("numRows", 1000)
    .outputMode(OutputMode.Append())
    .start()
    .awaitTermination();
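To make the window assignment in the snippet above concrete, here is a minimal plain-Java sketch (no Spark dependency) of which sliding windows a single event falls into. It assumes window starts are aligned to the Unix epoch in steps of the slide duration, which is how `functions.window` behaves with its default start offset. The class and method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: models sliding-window assignment only.
public class SlidingWindowSketch {

    // Returns the start (epoch seconds) of every window [start, start + windowSec)
    // that contains eventSec, newest first.
    // Assumption: window starts are multiples of slideSec since the epoch.
    static List<Long> windowStarts(long eventSec, long windowSec, long slideSec) {
        List<Long> starts = new ArrayList<>();
        long latestStart = eventSec - Math.floorMod(eventSec, slideSec);
        for (long start = latestStart; start + windowSec > eventSec; start -= slideSec) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long firstEvent = 1512164314L;  // first timestamp used in the reproduction data
        long windowSec = 24 * 60 * 60;  // "24 hours"
        long slideSec = 15 * 60;        // "15 minutes"
        List<Long> starts = windowStarts(firstEvent, windowSec, slideSec);
        System.out.println("windows per event: " + starts.size());
        System.out.println("latest window start: " + starts.get(0));
    }
}
```

With a 24-hour window and a 15-minute slide, each event belongs to 96 overlapping windows, so a single input row contributes to 96 aggregation groups.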

Below is an example that shows the behavior. The code produces only an empty Batch 0 in append mode. Data is aggregated in 24-hour windows with a 15-minute slide. The input data covers 84 hours. I think the code should produce all aggregated results except for the last 15-minute interval.

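import java.util.ArrayList;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.execution.streaming.MemoryStream;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQueryException;
import scala.collection.JavaConversions;

import static org.apache.spark.sql.functions.col;

// Class name is illustrative; the original report shows only the main method.
public class AppendModeRepro {
    public static void main(String[] args) throws StreamingQueryException {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

        // 1000 records spaced 5 minutes apart, formatted as "epochSeconds,name"
        ArrayList<String> rl = new ArrayList<>();
        for (int i = 0; i < 1000; ++i) {
            long t = 1512164314L + i * 5 * 60;
            rl.add(t + ",qwer");
        }

        String nameCol = "name";
        String eventTimeCol = "eventTime";
        String eventTimestampCol = "eventTimestamp";

        // All records are added to the in-memory source in a single call
        MemoryStream<String> input = new MemoryStream<>(42, spark.sqlContext(), Encoders.STRING());
        input.addData(JavaConversions.asScalaBuffer(rl).toSeq());
        Dataset<Row> stream = input.toDF().selectExpr(
            "cast(split(value,'[,]')[0] as long) as " + eventTimestampCol,
            "cast(split(value,'[,]')[1] as String) as " + nameCol);

        System.out.println("isStreaming: " + stream.isStreaming());

        // Convert epoch seconds to a timestamp column used as event time
        Column eventTime = functions.to_timestamp(col(eventTimestampCol));
        Dataset<Row> rowData = stream.withColumn(eventTimeCol, eventTime);

        String windowDuration = "24 hours";
        String slideDuration = "15 minutes";
        Dataset<Row> sliding24h = rowData
            .withWatermark(eventTimeCol, slideDuration)
            .groupBy(functions.window(col(eventTimeCol), windowDuration, slideDuration),
                col(nameCol))
            .count();

        sliding24h
            .writeStream()
            .format("console")
            .option("truncate", false)
            .option("numRows", 1000)
            .outputMode(OutputMode.Append())
            //.outputMode(OutputMode.Complete())
            .start()
            .awaitTermination();
    }
}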
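The figures quoted in the description can be checked with plain arithmetic. The sketch below is plain Java with hypothetical class and method names and no Spark involved. It only restates the data-generation loop from the reproduction (1000 records, 5 minutes apart, starting at 1512164314) and does not model how Spark advances the watermark.

```java
// Hypothetical helper for checking the numbers in the report; not part of Spark.
public class InputSpanSketch {

    // Minutes between the first and last record when records are evenly spaced.
    static long spanMinutes(int records, long spacingMinutes) {
        return (records - 1) * spacingMinutes;
    }

    // Epoch-second timestamp of the record at index i, mirroring the loop in the report.
    static long timestampAt(long firstSec, int i, long spacingMinutes) {
        return firstSec + i * spacingMinutes * 60;
    }

    public static void main(String[] args) {
        long span = spanMinutes(1000, 5);
        System.out.println("span: " + span / 60 + " h " + span % 60 + " min");
        System.out.println("last timestamp: " + timestampAt(1512164314L, 999, 5));
    }
}
```

Under these assumptions the generated records span 83 hours 15 minutes, which is close to the 84 hours stated in the description.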

People

Assignee: Unassigned

Reporter: dejan miljkovic

Votes: 0

Watchers: 2

Dates

Created: 25/Nov/18 20:07

Updated: 12/Dec/22 18:10

Resolved: 26/Nov/18 16:25