[SPARK-48992] applyInPandas does not respect streaming watermark - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.5.0
Fix Version/s: None
Component/s: Spark Core
Labels: None
Environment: Azure Databricks runtime 14.3 LTS

Description

When I use GroupedData.applyInPandas to implement aggregation in a streaming query, it fails to respect a watermark specified using DataFrame.withWatermark.

This query reproduces the behaviour I'm seeing:

from pyspark.sql.functions import window
from typing import Tuple
import pandas as pd

df_source_stream = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 3)
    .load()
    .withColumn("bucket", window("timestamp", "10 seconds").end)
)

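def my_function(
    key: Tuple[str], df: pd.DataFrame
) -> pd.DataFrame:
    # key holds the grouping values (here, the bucket); df holds the rows passed in for that group.
    # Return a single row: the bucket and the number of rows received.
    return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]})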

df = (
    df_source_stream
    .withWatermark("bucket", "10 seconds")
    .groupBy("bucket")
    .applyInPandas(my_function, "bucket TIMESTAMP, count INT")
)
display(df)

I expect the output of the query to contain one row per bucket value, but a new row is emitted for each incoming microbatch.

In contrast, a built-in aggregate behaves as expected. For example:

df = (
    df_source_stream
    .withWatermark("bucket", "10 seconds")
    .groupBy("bucket")
    .count()  # standard aggregate in place of applyInPandas
)

The output of this query contains one row per bucket value.
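
The difference between the two outputs can be illustrated outside Spark. The following is a minimal pandas-only sketch, not a statement about Spark internals: it assumes, as a possible explanation that this report does not confirm, that the function is called once per micro-batch on only that batch's rows. The bucket timestamp, the batch contents, and the names batch_1, batch_2, per_batch and whole are hypothetical.

```python
from typing import Tuple

import pandas as pd


def my_function(key: Tuple[pd.Timestamp], df: pd.DataFrame) -> pd.DataFrame:
    # Same logic as the function in the report: one output row per call.
    return pd.DataFrame({"bucket": [key[0]], "count": [df.shape[0]]})


# Hypothetical bucket end time shared by every row below.
bucket = pd.Timestamp("2024-07-24 15:02:10")

# Two hypothetical micro-batches of three rows each, all in the same bucket.
batch_1 = pd.DataFrame({"value": [0, 1, 2], "bucket": [bucket] * 3})
batch_2 = pd.DataFrame({"value": [3, 4, 5], "bucket": [bucket] * 3})

# Assumed behaviour: the function is called separately for each micro-batch.
per_batch = pd.concat(
    [my_function((bucket,), batch) for batch in (batch_1, batch_2)],
    ignore_index=True,
)

# Expected behaviour: the function is called once with every row in the bucket.
whole = my_function((bucket,), pd.concat([batch_1, batch_2], ignore_index=True))

print(per_batch)  # two rows for the same bucket, each with a partial count
print(whole)      # one row for the bucket with the full count
```

Under that assumption, per_batch has one row per micro-batch for the same bucket, which matches the output described above, while whole has the single row per bucket that the built-in count() produces.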

People

Assignee: Unassigned

Reporter: Richard Swinbank

Votes: 0

Watchers: 2

Dates

Created: 24/Jul/24 15:02

Updated: 25/Jul/24 00:57