Spark / SPARK-23258

Should not split Arrow record batches based on row count


Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Currently, when executing a scalar pandas_udf or calling toPandas(), the Arrow record batches are split once the record count reaches the maximum configured by "spark.sql.execution.arrow.maxRecordsPerBatch". This is not ideal because the number of columns is not taken into account: with many columns, a batch can be large in bytes even at a modest row count, and OOMs can occur. An alternative approach could be to look at the size of the Arrow buffers being used and cap each batch at a certain byte size.
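
      As a rough illustration (not Spark's implementation), a size-based cap could be sketched with pyarrow by slicing on a batch's total buffer size instead of its row count. The helper name split_by_size and the max_bytes parameter below are hypothetical:

      import pyarrow as pa

      def split_by_size(batch: pa.RecordBatch, max_bytes: int):
          # Small batches pass through untouched.
          if batch.num_rows == 0 or batch.nbytes <= max_bytes:
              yield batch
              return
          # Estimate a per-row size from the batch's total buffer size;
          # real logic would need care with variable-width columns.
          bytes_per_row = max(1, batch.nbytes // batch.num_rows)
          rows_per_chunk = max(1, max_bytes // bytes_per_row)
          for start in range(0, batch.num_rows, rows_per_chunk):
              # slice() is zero-copy, and the chunk size now scales with
              # the number of columns as well as the number of rows.
              yield batch.slice(start, rows_per_chunk)

      # Example: a 1M-row batch capped at roughly 1 MiB per chunk.
      batch = pa.RecordBatch.from_pydict({"x": list(range(1_000_000))})
      chunks = list(split_by_size(batch, max_bytes=1 << 20))

      Unlike a fixed row-count limit, this bound holds whether the batch has one column or a thousand, which is the property the proposal above is after.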

      Attachments

        Activity

          People

            Assignee: bryanc (Bryan Cutler)
            Reporter: bryanc (Bryan Cutler)
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated: