Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
2.3.2
-
None
Description
When using the following setup:
- structured streaming
- a watermark and groupBy followed by an apply using a pandas grouped map udf
- a sink using an append outputMode
I would expect the following:
- udf to be called for each group --> OK
- when new data arrives, the udf will be called again –> OK
- when new data arrives for the same group, the udf will be called with the complete pandas dataframe of all received data for that group (up till the watermark) --> NOK: within the same group, the size of the pandas dataframe can decrease between invocations
- the results are only written to the sink once the processing time is passed the watermark --> NOK: every time the udf is called, new results are being sent to the output
It looks like pandas udf is unusable for structured streaming this way.