Spark / SPARK-32294

GroupedData Pandas UDF 2Gb limit


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 3.0.0, 3.1.0
    • Fix Version/s: 3.2.0
    • Component/s: PySpark
    • Labels: None

    Description

      `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData: the whole group is passed to the Pandas UDF at once, which can hit the 2 GB limitations on the Arrow side (and, in current versions of Arrow, also the 2 GB limitation in the Netty allocator) - https://issues.apache.org/jira/browse/ARROW-4890
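
      A minimal sketch of the behaviour described above, assuming Spark 3.x with PyArrow and the `applyInPandas` grouped-map API (column names, sizes, and the `group_size` UDF are illustrative only): even with `spark.sql.execution.arrow.maxRecordsPerBatch` set low, each group still arrives in the UDF as a single pandas DataFrame.

```python
# Illustrative sketch (assumes PySpark 3.x with PyArrow installed).
import pandas as pd
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("grouped-map-2gb-limit")
    # Limits Arrow batch sizes for ordinary conversions, but is not
    # respected for grouped data handed to a Pandas UDF.
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")
    .getOrCreate()
)

# Two keys, ~500,000 rows each; with wide rows a single group can exceed
# the ~2 GB Arrow / Netty allocator limits referenced above.
df = spark.range(0, 1_000_000).selectExpr("id % 2 AS key", "id AS value")

def group_size(pdf: pd.DataFrame) -> pd.DataFrame:
    # The entire group arrives here as one pandas DataFrame, regardless of
    # spark.sql.execution.arrow.maxRecordsPerBatch.
    return pd.DataFrame({"key": [pdf["key"].iloc[0]], "rows": [len(pdf)]})

result = df.groupBy("key").applyInPandas(group_size, schema="key long, rows long")
result.show()  # each group reports ~500,000 rows, not batches of 10,000
```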

      It would be great to consider feeding GroupedData into the Pandas UDF in batches to solve this issue.
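
      Until such Spark-side batching exists, one possible user-side workaround is sketched below. It applies only when the per-group computation can be run on sub-groups independently and merged afterwards: salt the grouping key so no single group handed to the UDF grows too large. The salting scheme, `num_salts`, and the sum aggregation are illustrative assumptions, not part of this issue.

```python
# Sketch of a user-side workaround: salt the grouping key so each pandas
# DataFrame passed to the UDF stays well below the Arrow batch limits.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1_000_000).selectExpr("id % 2 AS key", "id AS value")

num_salts = 10  # split each key into at most 10 sub-groups (illustrative)

salted = df.withColumn("salt", F.monotonically_increasing_id() % num_salts)

def partial_sum(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs per (key, salt) sub-group, so each pandas DataFrame stays small.
    return pd.DataFrame({"key": [pdf["key"].iloc[0]],
                         "partial": [pdf["value"].sum()]})

partials = salted.groupBy("key", "salt").applyInPandas(
    partial_sum, schema="key long, partial long")

# Merge the per-sub-group results back to one row per key.
result = partials.groupBy("key").agg(F.sum("partial").alias("total"))
result.show()
```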

      cc hyukjin.kwon 

       

    Attachments

    Issue Links

    Activity

    People

      Assignee: Unassigned
      Reporter: Ruslan Dautkhanov (Tagar)
      Votes: 2
      Watchers: 5

    Dates

      Created:
      Updated:
      Resolved: