Description
`spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData: the whole group is passed to the Pandas UDF at once, which can hit Arrow's 2 GB per-batch limitations (and, in current versions of Arrow, a 2 GB limitation on the Netty allocator side as well) - https://issues.apache.org/jira/browse/ARROW-4890
It would be great to consider feeding GroupedData into the Pandas UDF in batches to solve this issue.
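For context, a minimal sketch of the behavior (assuming Spark 3.x's `applyInPandas` API and a local session; column names and the group size are illustrative):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Caps Arrow batch size for scalar Pandas UDFs and toPandas(),
    # but is ignored for grouped-map UDFs.
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")
    .getOrCreate()
)

# One large group: all rows share the same key.
df = spark.range(10_000_000).selectExpr("0L AS key", "id AS value")

def rows_seen(pdf: pd.DataFrame) -> pd.DataFrame:
    # Despite maxRecordsPerBatch=10000, pdf holds the entire group,
    # so a large enough group can exceed Arrow's 2 GB limits.
    return pd.DataFrame({"key": [int(pdf["key"].iloc[0])],
                         "rows": [len(pdf)]})

df.groupBy("key").applyInPandas(rows_seen, "key long, rows long").show()
# Expected: a single output row with rows = 10000000, showing that the
# whole group arrived in one pandas.DataFrame.
```

Until batching is supported, one possible mitigation (an assumption, not part of this proposal) is to salt oversized keys so that each sub-group stays under the Arrow limit, then recombine the results downstream.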
cc hyukjin.kwon
Issue Links
- duplicates
  - SPARK-33576 PythonException: An exception was thrown from a UDF: 'OSError: Invalid IPC message: negative bodyLength'. (Resolved)
- is related to
  - ARROW-4890 [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1 (Resolved)