Description
The following code with Spark 3.2.1 raises an exception:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

t = StructType([
    StructField('o', ArrayType(
        StructType([
            StructField('s', StringType(), False),
            StructField('b', ArrayType(
                StructType([
                    StructField('e', StringType(), False)
                ]), True), False)
        ]), True), False)])

value = {
    "o": [
        {
            "s": "string1",
            "b": [
                {"e": "string2"},
                {"e": "string3"}
            ]
        },
        {
            "s": "string4",
            "b": [
                {"e": "string5"},
                {"e": "string6"},
                {"e": "string7"}
            ]
        }
    ]
}

df = (
    spark.createDataFrame([value], schema=t)
    .select(F.explode("o").alias("eo"))
    .select("eo.b.e")
)
df.show()
The exception message is:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
	at org.apache.spark.sql.catalyst.util.GenericArrayData.getStruct(GenericArrayData.scala:76)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
	at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:93)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1641)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:683)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
I am using Spark 3.2.1; I don't know whether Spark 3.3.0 is also affected.
Please note that the issue seems related to SPARK-37577: I am using the same DataFrame schema, but this time it is populated with a non-empty value.
I think this is a bug because, with the following configuration, the query works as expected:
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
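For reference, a minimal sketch of the workaround, assuming the same spark session, schema t and value dictionary defined in the snippet above, is:

# Workaround sketch: disable nested pruning before running the failing select.
# Assumes the spark session, schema t and value from the repro code above.
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)

df = (
    spark.createDataFrame([value], schema=t)
    .select(F.explode("o").alias("eo"))
    .select("eo.b.e")
)
df.show(truncate=False)  # with pruning disabled, this returns the rows instead of raising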
Update: the provided code works without problems on Spark 3.1.2, so the error appears to be caused by expression pruning.
The expected result is:
+---------------------------+
|e                          |
+---------------------------+
|[string2, string3]         |
|[string5, string6, string7]|
+---------------------------+
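As an additional data point, the same projection can be written with the transform higher-order function instead of the nested field access "eo.b.e". Whether this formulation avoids the pruning bug is my assumption and has not been verified for this report; it is shown only as an alternative way to express the same result:

# Untested sketch: build the array of 'e' values explicitly with transform()
# instead of relying on nested-field resolution of "eo.b.e".
df_alt = (
    spark.createDataFrame([value], schema=t)
    .select(F.explode("o").alias("eo"))
    .select(F.expr("transform(eo.b, x -> x.e)").alias("e"))
)
df_alt.show(truncate=False)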
Issue Links
- relates to SPARK-37577: ClassCastException: ArrayType cannot be cast to StructType (Resolved)