Description
I'm grouping on a relatively small number of groups, where each individual group is large.
I'm working with pyarrow version 2.0.0, and the machines have 64 GiB of memory.
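For reference, the job goes through Spark's Arrow-based grouped pandas UDF path (ArrowPythonRunner in the trace below). The following is only a minimal sketch of that shape, assuming Spark 3.x's applyInPandas and hypothetical column names (group_id, payload); it is not the actual job:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: few distinct group_id values, each group carrying
# a large amount of string data in `payload`.
df = spark.createDataFrame(
    [(i % 10, "x" * 100) for i in range(1000)],
    schema="group_id int, payload string",
)

def transform(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the real per-group pandas logic.
    return pdf

# Each group's rows are serialized to the Python worker through Arrow
# (ArrowPythonRunner.writeIteratorToStream in the trace below); with a
# handful of very large groups, a single group's string column can hit
# the 2 GiB ArrowBuf limit while being written.
result = df.groupBy("group_id").applyInPandas(transform, schema=df.schema)
result.show()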
I'm getting the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 387 in stage 162.0 failed 4 times, most recent failure: Lost task 387.3 in stage 162.0 (TID 29957) (ip-172-21-129-187.eu-west-1.compute.internal executor 71): java.lang.IndexOutOfBoundsException: index: 2147483628, length: 36 (expected: range(0, 2147483648))
    at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
    at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:890)
    at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1087)
    at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:251)
    at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:130)
    at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:95)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:92)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:103)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:435)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2031)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)
Why do I hit this 2 GB limit? The exception shows a write at index 2147483628 against a buffer bounded at 2147483648 bytes, i.e. 2^31 bytes (2 GiB). According to SPARK-34588 this should be supported; perhaps it is related to SPARK-34020.
Please assist.
Note:
Is this related to the use of BaseVariableWidthVector rather than BaseLargeVariableWidthVector?
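For illustration, the same distinction exists on the Python side in pyarrow: a plain string array uses 32-bit offsets (the analogue of BaseVariableWidthVector), while large_string uses 64-bit offsets (the analogue of BaseLargeVariableWidthVector). A minimal sketch, only to make the offset-width difference concrete; it shows pyarrow's types, not Spark's writer path:

import pyarrow as pa

# A plain Arrow string array addresses one contiguous data buffer with
# 32-bit offsets, so a single array is limited to roughly 2 GiB of
# string bytes -- the same bound BaseVariableWidthVector enforces on the
# JVM side (the setSafe call in the stack trace above).
regular = pa.array(["foo", "bar"], type=pa.string())
print(regular.type)   # string       (int32 offsets)

# The "large" variant uses 64-bit offsets and can exceed 2 GiB per
# array; it corresponds to BaseLargeVariableWidthVector on the JVM side.
large = pa.array(["foo", "bar"], type=pa.large_string())
print(large.type)     # large_string (int64 offsets)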