Spark / SPARK-33073

Improve error handling on Pandas to Arrow conversion failures


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.1
    • Fix Version/s: 3.0.2, 3.1.0
    • Component/s: PySpark
    • Labels: None

    Description

      Currently, when converting from Pandas to Arrow for Pandas UDF return values or from createDataFrame(), PySpark will catch all ArrowExceptions and display info on how to disable the safe conversion config. This is displayed with the original error as a tuple:

      ('Exception thrown when converting pandas.Series (object) to Arrow Array (int32). It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could not convert a with type str: tried to convert to int'))
      

      The problem is that this message is meant mainly for things like float truncation or overflow, but it will also appear if the user has an invalid schema with incompatible types. In that case the extra information is confusing and the real error is buried. A reproduction sketch follows below.
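
      As a rough illustration of the confusing case, the following sketch (assuming a local SparkSession with Arrow enabled and Arrow fallback disabled; the column name and schema are made up for the example) declares a string column as an integer, which is a schema mismatch rather than an unsafe numeric conversion, yet the resulting error still suggests toggling the safe-conversion config:

      # Illustrative reproduction sketch, not from the ticket itself.
      import pandas as pd
      from pyspark.sql import SparkSession

      spark = (
          SparkSession.builder
          .config("spark.sql.execution.arrow.pyspark.enabled", "true")
          # Disable fallback so the Arrow conversion error surfaces directly.
          .config("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
          .getOrCreate()
      )

      pdf = pd.DataFrame({"x": ["a", "b", "c"]})

      # Declaring the string column as int makes pyarrow raise ArrowInvalid,
      # but the wrapped message also points at
      # spark.sql.execution.pandas.convertToArrowArraySafely, which is
      # unrelated to a schema mismatch like this one.
      df = spark.createDataFrame(pdf, schema="x int")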

      This could be improved by only printing the extra info on how to disable safe checking when the config is actually enabled, and by using exception chaining to better surface the original error. Also, any safe-conversion failure raises a ValueError, of which ArrowInvalid is a subclass, so the catch could be narrowed.
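
      A minimal sketch of what that handling could look like (illustrative only, not the actual patch; the function name and the `safecheck` parameter are hypothetical stand-ins for the Arrow serializer internals, while the config key is the real one):

      import pyarrow as pa

      def _to_arrow_array(series, arrow_type, safecheck):
          try:
              # pyarrow raises ArrowInvalid (a ValueError subclass) on unsafe
              # or impossible conversions, so the catch can be narrowed.
              return pa.Array.from_pandas(series, type=arrow_type, safe=safecheck)
          except ValueError as e:
              msg = (
                  "Exception thrown when converting pandas.Series (%s) "
                  "to Arrow Array (%s)." % (series.dtype, arrow_type)
              )
              # Only mention the safe-conversion config when it is enabled,
              # since disabling it cannot help with an incompatible schema.
              if safecheck:
                  msg += (
                      " It can be caused by overflows or other unsafe conversions "
                      "warned by Arrow. Arrow safe type check can be disabled by "
                      "using SQL config "
                      "`spark.sql.execution.pandas.convertToArrowArraySafely`."
                  )
              # Exception chaining keeps the original ArrowInvalid visible
              # instead of burying it inside a tuple.
              raise ValueError(msg) from e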

            People

              Assignee: Bryan Cutler
              Reporter: Bryan Cutler