Description
Environment:
apache/spark master
pandas version > 1.0.5
Reproduce code:
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
spark.conf.set('spark.sql.execution.arrow.pyspark.selfDestruct.enabled', True)
spark.createDataFrame(sc.parallelize([(i,) for i in range(13)], 1), 'id long') \
    .selectExpr('IF(id % 3 == 0, id + 1, NULL) AS f1', '(id + 1) % 2 AS label') \
    .toPandas()['label'].value_counts()
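The failure comes from the self_destruct path handing pandas a zero-copy, read-only buffer. This can be checked without Spark; the sketch below simulates the read-only backing array (the `setflags` call stands in for what Arrow's self_destruct conversion does, it is not Spark code):

```python
import numpy as np

# Simulate the zero-copy result of toPandas() with self_destruct enabled:
# Arrow hands back a buffer that numpy marks as non-writeable.
arr = np.arange(13, dtype=np.int64)
arr.setflags(write=False)

# pandas hashtable routines such as value_counts require a writable
# memoryview on affected versions, so this flag being False triggers
# "ValueError: buffer source array is read-only".
print(arr.flags.writeable)  # False
```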
Running this produces an error like:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/base.py", line 1033, in value_counts
    dropna=dropna,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/algorithms.py", line 820, in value_counts
    keys, counts = value_counts_arraylike(values, dropna)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/algorithms.py", line 865, in value_counts_arraylike
    keys, counts = f(values, dropna)
  File "pandas/_libs/hashtable_func_helper.pxi", line 1098, in pandas._libs.hashtable.value_count_int64
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
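Until this is fixed, one workaround sketch is to copy the affected column into a writable array before calling value_counts. The example below simulates the read-only Series produced by toPandas() with numpy (the data values mirror the `(id + 1) % 2` labels from the reproduce code; the simulation itself is an assumption, not Spark output):

```python
import numpy as np
import pandas as pd

# Simulated read-only 'label' column: (id + 1) % 2 for id in range(13)
values = np.array([(i + 1) % 2 for i in range(13)], dtype=np.int64)
values.setflags(write=False)
s = pd.Series(values)

# Workaround: materialize a writable copy before value_counts.
counts = pd.Series(s.to_numpy().copy()).value_counts()
print(counts.to_dict())  # {1: 7, 0: 6}
```

The copy costs memory, which partly defeats the point of self_destruct, but it unblocks pandas operations that need writable buffers.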
Issue Links
- relates to SPARK-32953 Lower memory usage in toPandas with Arrow self_destruct (Resolved)