Spark / SPARK-27519

Pandas udf corrupting data

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark
    • Labels: None

    Description

      While trying to use a pandas UDF, I passed it two columns: a string and a list of lists of strings. An example of the second argument's structure: [['1'],['2'],['3']]

      But when I read this same value inside the UDF, I receive something like this: [['1','2'],['3'],[]]

      I checked, and the same row in the table has the list with the correct structure; it changed only inside the UDF.

       

      I don't know why this happens, but I do know it has something to do with the fact that the row was the 10,001st row and the last row in its partition. The pandas batch size is 10,000, so that row was sent alone as a second batch, and that is the only thing that seems to cause it: having 1 or 2 rows in a second batch of the partition. I was also able to reproduce this with a second batch of 2 rows; in that case the list wasn't changed, except that an empty list was added to the end.
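
      The batch arithmetic described above can be sketched in plain Python. This is a simplified model, assuming a partition is cut into consecutive batches of at most spark.sql.execution.arrow.maxRecordsPerBatch rows (default 10,000). The helper name batch_sizes is hypothetical and is not a Spark API.

```python
def batch_sizes(partition_rows, max_records_per_batch=10000):
    """Sizes of the consecutive batches a partition would be split into.

    Simplified model (an assumption): rows are chunked in order, and every
    batch except possibly the last holds exactly max_records_per_batch rows.
    """
    if partition_rows < 0 or max_records_per_batch <= 0:
        raise ValueError("row count must be >= 0 and batch size must be > 0")
    full_batches, remainder = divmod(partition_rows, max_records_per_batch)
    sizes = [max_records_per_batch] * full_batches
    if remainder:
        sizes.append(remainder)  # the small trailing batch
    return sizes


print(batch_sizes(10001))  # [10000, 1]
print(batch_sizes(10002))  # [10000, 2]
```

      Under this model, a partition of 10,001 or 10,002 rows ends with a second batch of 1 or 2 rows, which matches the two cases reported above.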

      Hope you can help me understand what is going on, thanks!
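
      For readers who want to try this, here is a minimal reproduction sketch. It is not the attached script, whose contents are not shown here. It assumes a single partition of 10,001 rows, a Spark 2.3-style scalar pandas UDF, and an existing SparkSession passed in by the caller. The names make_rows, run_repro, and inner_lengths are hypothetical.

```python
def make_rows(n_rows, nested=(("1",), ("2",), ("3",))):
    """Build n_rows tuples of (string key, list of lists of strings)."""
    return [(str(i), [list(inner) for inner in nested]) for i in range(n_rows)]


def run_repro(spark, n_rows=10001):
    """Return the rows whose nested list looks different inside the UDF.

    `spark` is an existing SparkSession. PySpark is imported lazily so that
    make_rows stays usable on a machine without Spark installed.
    """
    from pyspark.sql.functions import col, pandas_udf, PandasUDFType
    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    schema = StructType([
        StructField("key", StringType()),
        StructField("values", ArrayType(ArrayType(StringType()))),
    ])
    # coalesce(1) puts every row in one partition, so the default batch size
    # of 10,000 should leave a single row for the second batch.
    df = spark.createDataFrame(make_rows(n_rows), schema).coalesce(1)

    @pandas_udf(StringType(), PandasUDFType.SCALAR)
    def inner_lengths(key, values):
        # Each element of `values` is one row's nested list; report the
        # length of every inner list, e.g. "1,1,1" for [['1'],['2'],['3']].
        return values.apply(lambda v: ",".join(str(len(x)) for x in v))

    seen = df.withColumn("inner_lengths", inner_lengths(col("key"), col("values")))
    # Every row was built as [['1'],['2'],['3']], so anything other than
    # "1,1,1" means the UDF saw a different structure than the table holds.
    return seen.where(col("inner_lengths") != "1,1,1").collect()
```

      Calling run_repro(spark) should return an empty list when the UDF sees the same structure as the table. If the behaviour described above occurs, the affected row would be expected to show up with lengths such as "2,1,0".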

      Attachments

        1. Pandas UDF Bug.py
          0.7 kB
          Jeff gold

          People

            Assignee: Unassigned
            Reporter: Jeff gold (f7faf8ba36)