Spark / SPARK-32294

GroupedData Pandas UDF 2Gb limit


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 3.0.0, 3.1.0
    • Fix Version/s: 3.2.0
    • Component/s: PySpark
    • Labels: None

    Description

      `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData: the whole group is passed to the Pandas UDF at once, which can hit the 2 GB limitations on the Arrow side (and, in current versions of Arrow, also the 2 GB limitation in the Netty allocator) - https://issues.apache.org/jira/browse/ARROW-4890
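
      A minimal sketch of the behaviour described above, assuming Spark 3.x with PyArrow and the `applyInPandas` grouped-map API (column names, sizes, and the `group_size` UDF are illustrative only): even with `spark.sql.execution.arrow.maxRecordsPerBatch` set low, each group still arrives in the UDF as a single pandas DataFrame.

```python
# Illustrative sketch (assumes PySpark 3.x with PyArrow installed).
import pandas as pd
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("grouped-map-2gb-limit")
    # Limits Arrow batch sizes for ordinary conversions, but is not
    # respected for grouped data handed to a Pandas UDF.
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")
    .getOrCreate()
)

# Two keys, ~500,000 rows each; with wide rows a single group can exceed
# the ~2 GB Arrow / Netty allocator limits referenced above.
df = spark.range(0, 1_000_000).selectExpr("id % 2 AS key", "id AS value")

def group_size(pdf: pd.DataFrame) -> pd.DataFrame:
    # The entire group arrives here as one pandas DataFrame, regardless of
    # spark.sql.execution.arrow.maxRecordsPerBatch.
    return pd.DataFrame({"key": [pdf["key"].iloc[0]], "rows": [len(pdf)]})

result = df.groupBy("key").applyInPandas(group_size, schema="key long, rows long")
result.show()  # each group reports ~500,000 rows, not batches of 10,000
```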

      It would be great to consider feeding GroupedData into the Pandas UDF in batches to solve this issue.
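
      Until such Spark-side batching exists, one possible user-side workaround is sketched below. It applies only when the per-group computation can be run on sub-groups independently and merged afterwards: salt the grouping key so no single group handed to the UDF grows too large. The salting scheme, `num_salts`, and the sum aggregation are illustrative assumptions, not part of this issue.

```python
# Sketch of a user-side workaround: salt the grouping key so each pandas
# DataFrame passed to the UDF stays well below the Arrow batch limits.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1_000_000).selectExpr("id % 2 AS key", "id AS value")

num_salts = 10  # split each key into at most 10 sub-groups (illustrative)

salted = df.withColumn("salt", F.monotonically_increasing_id() % num_salts)

def partial_sum(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs per (key, salt) sub-group, so each pandas DataFrame stays small.
    return pd.DataFrame({"key": [pdf["key"].iloc[0]],
                         "partial": [pdf["value"].sum()]})

partials = salted.groupBy("key", "salt").applyInPandas(
    partial_sum, schema="key long, partial long")

# Merge the per-sub-group results back to one row per key.
result = partials.groupBy("key").agg(F.sum("partial").alias("total"))
result.show()
```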

      cc hyukjin.kwon 

       

    Attachments

    Issue Links

    Activity

    People

      Assignee: Unassigned
      Reporter: Ruslan Dautkhanov (Tagar)
      Votes: 2
      Watchers: 5

    Dates

      Created:
      Updated:
      Resolved: