Spark / SPARK-23258

Should not split Arrow record batches based on row count


Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Currently, when executing a scalar pandas_udf or calling toPandas(), the Arrow record batches are split once the record count reaches the maximum configured by "spark.sql.execution.arrow.maxRecordsPerBatch". This is not ideal because the number of columns is not taken into account: with many columns, a batch can be large in bytes even at a modest row count, and OOMs can occur. An alternative approach could be to look at the size of the Arrow buffers being used and cap each batch at a certain byte size.
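
      As a rough illustration (not Spark's implementation), a size-based cap could be sketched with pyarrow by slicing on a batch's total buffer size instead of its row count. The helper name split_by_size and the max_bytes parameter below are hypothetical:

      import pyarrow as pa

      def split_by_size(batch: pa.RecordBatch, max_bytes: int):
          # Small batches pass through untouched.
          if batch.num_rows == 0 or batch.nbytes <= max_bytes:
              yield batch
              return
          # Estimate a per-row size from the batch's total buffer size;
          # real logic would need care with variable-width columns.
          bytes_per_row = max(1, batch.nbytes // batch.num_rows)
          rows_per_chunk = max(1, max_bytes // bytes_per_row)
          for start in range(0, batch.num_rows, rows_per_chunk):
              # slice() is zero-copy, and the chunk size now scales with
              # the number of columns as well as the number of rows.
              yield batch.slice(start, rows_per_chunk)

      # Example: a 1M-row batch capped at roughly 1 MiB per chunk.
      batch = pa.RecordBatch.from_pydict({"x": list(range(1_000_000))})
      chunks = list(split_by_size(batch, max_bytes=1 << 20))

      Unlike a fixed row-count limit, this bound holds whether the batch has one column or a thousand, which is the property the proposal above is after.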

      Attachments

        Activity

          People

            Assignee: bryanc (Bryan Cutler)
            Reporter: bryanc (Bryan Cutler)
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated: