SPARK-32478

Error message to show the schema mismatch in gapply with Arrow vectorization

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.1, 3.1.0
    • Component/s: SparkR
    • Labels:
      None

      Description

      Currently, when the output schema passed to gapply does not match the column types of the R data.frame that the function actually returns, the error message is confusing:

      ./bin/sparkR --conf spark.sql.execution.arrow.sparkr.enabled=true
      
      df <- createDataFrame(list(list(a=1L, b="2")))
      count(gapply(df, "a", function(key, group) { group }, structType("a int, b int")))
      
        org.apache.spark.SparkException: Job aborted due to stage failure: Task 43 in stage 2.0 failed 1 times, most recent failure: Lost task 43.0 in stage 2.0 (TID 2, 192.168.35.193, executor driver): java.lang.UnsupportedOperationException
      	at org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getInt(ArrowColumnVector.java:212)
      	...
      

      We should probably also document that the types declared in the output schema must always match the types of the returned data.
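
For illustration, here is a minimal sketch of two ways the reproduction above could be made type-consistent. It assumes a local SparkR session with Arrow enabled through `sparkConfig`; the variable names `n_matching` and `n_cast` are hypothetical and not part of the original report. The mismatch arises because column `b` is a string in the input while the schema declares it as `int`.

```r
library(SparkR)

# Assumption: start a local session with Arrow vectorization enabled for SparkR.
sparkR.session(sparkConfig = list(spark.sql.execution.arrow.sparkr.enabled = "true"))

df <- createDataFrame(list(list(a = 1L, b = "2")))

# Option 1: declare a schema that matches the returned data.frame
# (column b is a string, so declare it as string).
n_matching <- count(gapply(df, "a",
                           function(key, group) { group },
                           structType("a int, b string")))

# Option 2: keep "b int" in the schema and cast the column inside the
# function so that the returned data matches the declared type.
n_cast <- count(gapply(df, "a",
                       function(key, group) {
                         group$b <- as.integer(group$b)
                         group
                       },
                       structType("a int, b int")))
```

Either approach keeps the declared schema and the returned data in agreement, which is the invariant the description suggests documenting.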

            People

            • Assignee: Hyukjin Kwon (hyukjin.kwon)
            • Reporter: Hyukjin Kwon (hyukjin.kwon)
            • Votes: 0
            • Watchers: 2