[SPARK-25756] pyspark pandas_udf does not respect append outputMode in structured streaming - ASF JIRA

When using the following setup:

I would expect the following:

The udf is called for each group --> OK
When new data arrives, the udf is called again --> OK
When new data arrives for the same group, the udf is called with the complete pandas dataframe of all data received for that group (up to the watermark) --> NOK: within the same group, the size of the pandas dataframe can decrease between invocations
The results are written to the sink only once the processing time has passed the watermark --> NOK: every time the udf is called, new results are sent to the output
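
To make the expected append-mode behaviour concrete, here is a minimal plain-Python sketch. It is not Spark code and not the setup from this report: the function name `run_append_mode`, the parameters `window_size` and `delay`, the tumbling-window grouping, and the exact watermark boundary are all assumptions made for illustration. It buffers rows per group and calls the udf exactly once per group, with every buffered row, after the watermark has moved past the end of that group's window.

```python
from collections import defaultdict


def run_append_mode(batches, window_size, delay, udf):
    """Simulate the append-mode semantics expected in this report.

    batches: list of micro-batches, each a list of (event_time, value) rows.
    Hypothetical model only; real Spark internals differ.
    """
    buffers = defaultdict(list)      # window start -> all rows seen so far
    watermark = float("-inf")        # nothing is late before the first batch
    max_event_time = float("-inf")
    emitted = []                     # one (window_start, result) per closed window

    for batch in batches:
        for event_time, value in batch:
            window_start = event_time - event_time % window_size
            if window_start + window_size <= watermark:
                continue             # too late: the window was already finalized
            buffers[window_start].append(value)
            max_event_time = max(max_event_time, event_time)

        # Assumed rule: watermark trails the largest event time by `delay`.
        watermark = max_event_time - delay

        # Emit each window once, with its complete set of rows,
        # as soon as the watermark has passed the window end.
        for window_start in sorted(buffers):
            if window_start + window_size <= watermark:
                rows = buffers.pop(window_start)
                emitted.append((window_start, udf(rows)))

    return emitted


batches = [[(1, 10), (2, 20)], [(3, 30), (12, 5)], [(26, 1)]]
result = run_append_mode(batches, window_size=10, delay=5, udf=sum)
print(result)
```

In this model the group sizes passed to the udf never shrink, and nothing reaches the output until the third batch pushes the watermark past the first two windows. That is the behaviour the NOK items above say is not observed.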

As it stands, pandas udf appears to be unusable for structured streaming in append output mode.