Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25756

pyspark pandas_udf does not respect append outputMode in structured streaming

    XMLWordPrintableJSON

Details

    Description

      When using the following setup:

      • structured streaming
      • a watermark and groupBy followed by an apply using a pandas grouped map udf
      • a sink using an append outputMode

      I would expect the following:

      • udf to be called for each group --> OK
      • when new data arrives, the udf will be called again –> OK
      • when new data arrives for the same group, the udf will be called with the complete pandas dataframe of all received data for that group (up till the watermark) --> NOK: within the same group, the size of the pandas dataframe can decrease between invocations
      • the results are only written to the sink once the processing time is passed the watermark --> NOK: every time the udf is called, new results are being sent to the output

      It looks like pandas udf is unusable for structured streaming this way.

      Attachments

        Activity

          People

            Unassigned Unassigned
            janbols Jan Bols
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: