Description
The following code with Spark 3.2.1 raises an exception:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

t = StructType([
    StructField('o', ArrayType(
        StructType([
            StructField('s', StringType(), False),
            StructField('b', ArrayType(
                StructType([
                    StructField('e', StringType(), False)
                ]), True), False)
        ]), True), False)])

value = {
    "o": [
        {
            "s": "string1",
            "b": [
                {"e": "string2"},
                {"e": "string3"}
            ]
        },
        {
            "s": "string4",
            "b": [
                {"e": "string5"},
                {"e": "string6"},
                {"e": "string7"}
            ]
        }
    ]
}

df = (
    spark.createDataFrame([value], schema=t)
    .select(F.explode("o").alias("eo"))
    .select("eo.b.e")
)
df.show()
The exception message is:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
	at org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:93)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
I am using Spark 3.2.1; I don't know whether Spark 3.3.0 is also affected.
Please note that the issue seems related to SPARK-37577: I am using the same DataFrame schema, but this time it is populated with a non-empty value.
I think this is a bug because, with the following configuration, the query works as expected:
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
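For reference, a minimal sketch of the workaround, assuming the same spark session, schema t and value dictionary defined in the snippet above, is:

# Workaround sketch: disable nested pruning before running the failing select.
# Assumes the spark session, schema t and value from the repro code above.
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)

df = (
    spark.createDataFrame([value], schema=t)
    .select(F.explode("o").alias("eo"))
    .select("eo.b.e")
)
df.show(truncate=False)  # with pruning disabled, this returns the rows instead of raising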
Update: the provided code works without problems on Spark 3.1.2, so the error appears to be caused by expression pruning.
The expected result is:
+---------------------------+
|e                          |
+---------------------------+
|[string2, string3]         |
|[string5, string6, string7]|
+---------------------------+
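As an additional data point, the same projection can be written with the transform higher-order function instead of the nested field access "eo.b.e". Whether this formulation avoids the pruning bug is my assumption and has not been verified for this report; it is shown only as an alternative way to express the same result:

# Untested sketch: build the array of 'e' values explicitly with transform()
# instead of relying on nested-field resolution of "eo.b.e".
df_alt = (
    spark.createDataFrame([value], schema=t)
    .select(F.explode("o").alias("eo"))
    .select(F.expr("transform(eo.b, x -> x.e)").alias("e"))
)
df_alt.show(truncate=False)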
Issue Links
- relates to SPARK-37577: ClassCastException: ArrayType cannot be cast to StructType (Resolved)