Spark / SPARK-40770

Improved error messages for applyInPandas for schema mismatch

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.5.0
    • Component/s: PySpark
    • Labels: None

    Description

      Error messages raised by `applyInPandas` and `mapInPandas` are too generic to be useful with complex schemata, because they do not name the offending column:

      KeyError: 'val'
      
      RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema. Expected: 2 Actual: 3
      
      java.lang.IllegalArgumentException: not all nodes and buffers were consumed. nodes: [ArrowFieldNode [length=3, nullCount=0]] buffers: [ArrowBuf[304], address:139860828549160, length:0, ArrowBuf[305], address:139860828549160, length:24]
      
      pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
      
      pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
      

      These should be improved by adding column names or descriptive messages (in the same order as above):

      RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.  Missing: val  Unexpected: v  Schema: id, val
      
      RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.  Missing: val  Unexpected: foo, v  Schema: id, val
      
      RuntimeError: Column names of the returned pandas.DataFrame do not match specified schema.  Unexpected: v  Schema: id, id
      
      pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
      The above exception was the direct cause of the following exception:
      TypeError: Exception thrown when converting pandas.Series (int64) with name 'val' to Arrow Array (string).
      
      pyarrow.lib.ArrowInvalid: Could not convert '0' with type str: tried to convert to double
      The above exception was the direct cause of the following exception:
      ValueError: Exception thrown when converting pandas.Series (object) with name 'val' to Arrow Array (double).
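
      The two kinds of improvement proposed above can be sketched in plain Python. This is only an illustration of the message format and of exception chaining, not the actual PySpark change: the helper names `column_name_mismatch_message` and `convert_with_context` are hypothetical, sorting the missing and unexpected names is an assumption, and the `convert` callable stands in for the real pandas-to-Arrow conversion.

```python
def column_name_mismatch_message(returned, schema):
    """Build the proposed name-mismatch message, or None if the names agree.

    `returned` and `schema` are lists of column names (hypothetical inputs).
    """
    # Assumption: names are reported in sorted order.
    missing = sorted(set(schema) - set(returned))
    unexpected = sorted(set(returned) - set(schema))
    if not missing and not unexpected:
        return None
    parts = [
        "Column names of the returned pandas.DataFrame "
        "do not match specified schema."
    ]
    if missing:
        parts.append("Missing: " + ", ".join(missing))
    if unexpected:
        parts.append("Unexpected: " + ", ".join(unexpected))
    parts.append("Schema: " + ", ".join(schema))
    # The messages in this ticket separate the sections with two spaces.
    return "  ".join(parts)


def convert_with_context(name, from_type, to_type, convert, values):
    """Run `convert(values)` and re-raise failures with the column name attached.

    `convert` is a stand-in for the low-level conversion; the original
    error is kept as the direct cause via `raise ... from`.
    """
    message = (
        f"Exception thrown when converting pandas.Series ({from_type}) "
        f"with name '{name}' to Arrow Array ({to_type})."
    )
    try:
        return convert(values)
    except TypeError as e:
        raise TypeError(message) from e
    except ValueError as e:
        raise ValueError(message) from e
```

      Calling `column_name_mismatch_message(["id", "v"], ["id", "val"])` yields the first improved message listed above. Chaining with `raise ... from` keeps the original low-level error visible in the traceback while the outer exception names the column and the types involved.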
      

      When the returned pandas.DataFrame has no column names, the following error is currently raised:

      RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema. Expected: 2 Actual: 3
      

      It should also include the output schema:

      RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema.  Expected: 2  Actual: 3  Schema: id, val
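
      A minimal sketch of how the column-count message could carry the schema, again in plain Python. The function name `column_count_mismatch_message` is hypothetical and the code only mirrors the message format shown in this ticket; it is not the actual PySpark implementation.

```python
def column_count_mismatch_message(actual_count, schema):
    """Build the proposed count-mismatch message, or None if the counts agree.

    `schema` is a list of the output schema's field names (hypothetical input).
    """
    expected_count = len(schema)
    if actual_count == expected_count:
        return None
    # Sections are separated by two spaces, as in the ticket's example.
    return (
        "Number of columns of the returned pandas.DataFrame "
        "doesn't match specified schema.  "
        f"Expected: {expected_count}  Actual: {actual_count}  "
        f"Schema: {', '.join(schema)}"
    )
```

      With a three-column result and the schema `id, val`, `column_count_mismatch_message(3, ["id", "val"])` reproduces the message above.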
      

          People

            Assignee: Enrico Minack (enricomi)
            Reporter: Enrico Minack (enricomi)