[SPARK-27519] Pandas udf corrupting data - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 3.0.0
Component/s: PySpark
Labels: None

Description

While trying to use a pandas UDF, I passed it two columns: a string and a list of lists of strings. An example of the second argument's structure: [['1'],['2'],['3']]

But when reading this same value inside the UDF, I receive something like this: [['1','2'],['3'],[]]

I checked, and the same row in the table has the list with the correct structure; it changed only inside the UDF.
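To make the difference between the two values concrete, here is a small plain-Python sketch (no Spark involved). It uses only the two example values quoted above; the helper name `offsets` is hypothetical and introduced for illustration. It shows that both values contain the same strings in the same order and differ only in where the inner lists begin and end. It does not claim to identify the cause of the bug.

```python
from itertools import chain


def offsets(nested):
    """Return the cumulative boundaries of the inner lists, starting at 0.

    Hypothetical helper for illustration only.
    """
    bounds = [0]
    for inner in nested:
        bounds.append(bounds[-1] + len(inner))
    return bounds


# The two values quoted in the report.
expected = [['1'], ['2'], ['3']]       # value stored in the table
observed = [['1', '2'], ['3'], []]     # value seen inside the UDF

# Flatten both values to compare the underlying strings.
flat_expected = list(chain.from_iterable(expected))
flat_observed = list(chain.from_iterable(observed))

print(flat_expected == flat_observed)  # same strings, same order
print(offsets(expected))               # boundaries of the stored value
print(offsets(observed))               # boundaries of the received value
```

In other words, no string is lost or altered in this example; only the grouping into inner lists shifts.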

I don't know why this happens, but I do know it has something to do with the fact that the row was the 10,001st row and the last row in its partition. The pandas batch size is 10,000, so that row was sent alone as a second batch, and that is the only thing that seems to cause it: having 1 or 2 rows in a second batch of the partition. I was also able to reproduce this with a second batch of 2 rows; in that case the list wasn't changed, except that an empty list was added to the end.
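The batch arithmetic described above can be sketched in plain Python. This assumes the batch limit comes from `spark.sql.execution.arrow.maxRecordsPerBatch`, whose default is 10000; the function `batch_sizes` is a hypothetical helper that only models a positive limit and is not part of Spark.

```python
def batch_sizes(num_rows, max_records_per_batch=10_000):
    """Return the sizes of the batches a partition would be split into.

    Hypothetical helper: models a simple split of `num_rows` into
    consecutive batches of at most `max_records_per_batch` rows.
    """
    if num_rows < 0:
        raise ValueError("num_rows must be non-negative")
    if max_records_per_batch <= 0:
        raise ValueError("this sketch only models a positive batch limit")
    full, rest = divmod(num_rows, max_records_per_batch)
    return [max_records_per_batch] * full + ([rest] if rest else [])


# The two partition sizes mentioned in the report.
print(batch_sizes(10_001))  # a full batch followed by a batch of 1 row
print(batch_sizes(10_002))  # a full batch followed by a batch of 2 rows
```

Under this assumption, a partition needs just over 10,000 rows to produce the small trailing batch described in the report, which may be why the problem is easy to miss on small test data.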

Hope you can help me understand what is going on, thanks!

Attachments

Pandas UDF Bug.py
28/Apr/19 22:09
0.7 kB
Jeff gold

People

Assignee: Unassigned

Reporter: Jeff gold

Votes: 0

Watchers: 3

Dates

Created: 19/Apr/19 07:24

Updated: 12/Dec/22 18:11

Resolved: 30/Apr/19 22:49