Spark / SPARK-33073

Improve error handling on Pandas to Arrow conversion failures


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.1
    • Fix Version/s: 3.0.2, 3.1.0
    • Component/s: PySpark
    • Labels: None

    Description

      Currently, when converting from Pandas to Arrow for Pandas UDF return values or from createDataFrame(), PySpark will catch all ArrowExceptions and display info on how to disable the safe conversion config. This is displayed with the original error as a tuple:

      ('Exception thrown when converting pandas.Series (object) to Arrow Array (int32). It can be caused by overflows or other unsafe conversions warned by Arrow. Arrow safe type check can be disabled by using SQL config `spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could not convert a with type str: tried to convert to int'))
      

      The problem is that this message is meant mainly for things like float truncation or overflow, but it will also appear if the user has an invalid schema with incompatible types. In that case the extra information is confusing and the real error is buried. A reproduction sketch follows below.
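
      As a rough illustration of the confusing case, the following sketch (assuming a local SparkSession with Arrow enabled and Arrow fallback disabled; the column name and schema are made up for the example) declares a string column as an integer, which is a schema mismatch rather than an unsafe numeric conversion, yet the resulting error still suggests toggling the safe-conversion config:

      # Illustrative reproduction sketch, not from the ticket itself.
      import pandas as pd
      from pyspark.sql import SparkSession

      spark = (
          SparkSession.builder
          .config("spark.sql.execution.arrow.pyspark.enabled", "true")
          # Disable fallback so the Arrow conversion error surfaces directly.
          .config("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
          .getOrCreate()
      )

      pdf = pd.DataFrame({"x": ["a", "b", "c"]})

      # Declaring the string column as int makes pyarrow raise ArrowInvalid,
      # but the wrapped message also points at
      # spark.sql.execution.pandas.convertToArrowArraySafely, which is
      # unrelated to a schema mismatch like this one.
      df = spark.createDataFrame(pdf, schema="x int")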

      This could be improved by only printing the extra info on how to disable safe checking when the config is actually enabled, and by using exception chaining to better surface the original error. Also, any safe-conversion failure raises a ValueError, of which ArrowInvalid is a subclass, so the catch could be narrowed.
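
      A minimal sketch of what that handling could look like (illustrative only, not the actual patch; the function name and the `safecheck` parameter are hypothetical stand-ins for the Arrow serializer internals, while the config key is the real one):

      import pyarrow as pa

      def _to_arrow_array(series, arrow_type, safecheck):
          try:
              # pyarrow raises ArrowInvalid (a ValueError subclass) on unsafe
              # or impossible conversions, so the catch can be narrowed.
              return pa.Array.from_pandas(series, type=arrow_type, safe=safecheck)
          except ValueError as e:
              msg = (
                  "Exception thrown when converting pandas.Series (%s) "
                  "to Arrow Array (%s)." % (series.dtype, arrow_type)
              )
              # Only mention the safe-conversion config when it is enabled,
              # since disabling it cannot help with an incompatible schema.
              if safecheck:
                  msg += (
                      " It can be caused by overflows or other unsafe conversions "
                      "warned by Arrow. Arrow safe type check can be disabled by "
                      "using SQL config "
                      "`spark.sql.execution.pandas.convertToArrowArraySafely`."
                  )
              # Exception chaining keeps the original ArrowInvalid visible
              # instead of burying it inside a tuple.
              raise ValueError(msg) from e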

            People

              Assignee: Bryan Cutler
              Reporter: Bryan Cutler