Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.0, 3.4.0
None
None
Description
Observe the following sample code, where I'm using when() based on typeof():
from pyspark.sql import functions as F
from pyspark.sql.types import (
    ArrayType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)

# `spark` is assumed to be an existing SparkSession (e.g. from the pyspark shell).
schema = StructType(
    [
        StructField("ID", IntegerType(), nullable=False),
        StructField("name", StringType(), nullable=False),
        StructField("colors", ArrayType(StringType()), nullable=False),
    ]
)
data = [
    (1, "John", ["red", "blue", "green"]),
    (2, "Jane", ["yellow", "orange", "purple"]),
    (3, "Bob", ["black", "white"]),
    (4, "Alice", ["pink"]),
    (5, "Tom", ["brown", "gray"]),
]
df = spark.createDataFrame(data, schema)

col = "name"
df = df.withColumn(
    col,
    F.when(F.expr(f"typeof({col}) == 'string'"), F.trim(col))
    .when(
        F.expr(f"typeof({col}) LIKE 'array%'"),
        F.array_join(col, ","),
    )
    .otherwise(F.lit(None)),
)
Here's the exception I'm seeing:
pyspark.sql.utils.AnalysisException: cannot resolve 'array_join(name, ',')' due to data type mismatch: argument 1 requires array<string> type, however, 'name' is of string type.;
'Project [ID#0, CASE WHEN (typeof(name#1) = string) THEN trim(name#1, None) WHEN typeof(name#1) LIKE array% THEN array_join(name#1, ,, None) ELSE null END AS name#6, colors#2]
+- LogicalRDD [ID#0, name#1, colors#2], false
If I change col to "colors", I get this similar exception:
pyspark.sql.utils.AnalysisException: cannot resolve 'trim(colors)' due to data type mismatch: argument 1 requires string type, however, 'colors' is of array<string> type.;
'Project [ID#0, name#1, CASE WHEN (typeof(colors#2) = string) THEN trim(colors#2, None) WHEN typeof(colors#2) LIKE array% THEN array_join(colors#2, ,, None) ELSE null END AS colors#6]
+- LogicalRDD [ID#0, name#1, colors#2], false
Spark seems to type-check every branch of the when() chain during analysis, even branches that can never be taken for the column's actual type, so the query fails before it runs. I was able to reproduce this on 3.3.0 and 3.4.0.
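One possible workaround, not taken from the report itself, is to decide on the driver which expression to build, using the type string that df.dtypes reports for the column. That way only a branch that type-checks reaches the analyzer. This is a minimal sketch: the helper names branch_for and normalize_column are hypothetical, and it assumes the column's type is known from the schema before the query is built.

```python
def branch_for(dtype: str) -> str:
    """Map a Spark type string (as reported by df.dtypes) to a branch name."""
    if dtype == "string":
        return "trim"
    if dtype.startswith("array"):
        return "array_join"
    return "null"


def normalize_column(df, col: str):
    """Replace `col` with a single expression chosen from its schema type."""
    # Imported here so branch_for() stays usable without PySpark installed.
    from pyspark.sql import functions as F

    dtype = dict(df.dtypes)[col]  # e.g. "string" or "array<string>"
    branch = branch_for(dtype)
    if branch == "trim":
        expr = F.trim(F.col(col))
    elif branch == "array_join":
        # Like the LIKE 'array%' test in the report, this assumes an
        # array of strings; array_join rejects other element types.
        expr = F.array_join(F.col(col), ",")
    else:
        expr = F.lit(None)
    return df.withColumn(col, expr)
```

With the DataFrame from the sample code, normalize_column(df, "name") would build only the trim() expression and normalize_column(df, "colors") only the array_join() expression, so neither call should hit the mismatched branch.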