  1. Spark
  2. SPARK-48868

Incorrect AnalysisException thrown using when() and mixed data types


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0, 3.4.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      Observe the following sample code, where I'm using when() based on typeof():

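      # Assumes an active SparkSession named `spark`, as in the pyspark shell.
      from pyspark.sql import functions as F
      from pyspark.sql.types import (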
          ArrayType,
          IntegerType,
          StringType,
          StructField,
          StructType,
      )
      
      schema = StructType(
          [
              StructField("ID", IntegerType(), nullable=False),
              StructField("name", StringType(), nullable=False),
              StructField("colors", ArrayType(StringType()), nullable=False),
          ]
      )
      data = [
          (1, "John", ["red", "blue", "green"]),
          (2, "Jane", ["yellow", "orange", "purple"]),
          (3, "Bob", ["black", "white"]),
          (4, "Alice", ["pink"]),
          (5, "Tom", ["brown", "gray"]),
      ]
      df = spark.createDataFrame(data, schema)
      col = "name"
      df = df.withColumn(
          col,
          F.when(F.expr(f"typeof({col}) == 'string'"), F.trim(col))
          .when(
              F.expr(f"typeof({col}) LIKE 'array%'"),
              F.array_join(col, ","),
          )
          .otherwise(F.lit(None)),
      )
      

       
      Here's the exception I'm seeing:

      pyspark.sql.utils.AnalysisException: cannot resolve 'array_join(name, ',')' due to data type mismatch: argument 1 requires array<string> type, however, 'name' is of string type.;
      'Project [ID#0, CASE WHEN (typeof(name#1) = string) THEN trim(name#1, None) WHEN typeof(name#1) LIKE array% THEN array_join(name#1, ,, None) ELSE null END AS name#6, colors#2]
      +- LogicalRDD [ID#0, name#1, colors#2], false
      

       

      If I change col to "colors", I get this similar exception:

      pyspark.sql.utils.AnalysisException: cannot resolve 'trim(colors)' due to data type mismatch: argument 1 requires string type, however, 'colors' is of array<string> type.;
      'Project [ID#0, name#1, CASE WHEN (typeof(colors#2) = string) THEN trim(colors#2, None) WHEN typeof(colors#2) LIKE array% THEN array_join(colors#2, ,, None) ELSE null END AS colors#6]
      +- LogicalRDD [ID#0, name#1, colors#2], false
      

       

      Spark appears to type-check every branch of the when() chain during analysis, including branches that can never be taken for the column's actual type. I was able to reproduce this on 3.3.0 and 3.4.0.
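
      A possible workaround, offered as a minimal sketch rather than a fix for the behavior reported here: a column's type is fixed by the schema, so typeof({col}) gives the same answer for every row, and the branch can be chosen in Python before any expression is built. The helper name build_normalize_expr is hypothetical. It assumes the column name is a simple identifier that needs no quoting, and it mirrors the 'string' and 'array%' conditions from the sample code above.

```python
def build_normalize_expr(col: str, type_name: str) -> str:
    """Return a Spark SQL expression string suited to the column's static type.

    `type_name` is expected to look like the result of
    `df.schema[col].dataType.simpleString()`, e.g. "string" or "array<string>".
    """
    if type_name == "string":
        # Same action as the first when() branch in the sample code.
        return f"trim({col})"
    if type_name.startswith("array"):
        # Same action as the second when() branch in the sample code.
        return f"array_join({col}, ',')"
    # Same result as the otherwise() branch.
    return "NULL"


# Intended use with the DataFrame from the sample code (not executed here):
#   type_name = df.schema[col].dataType.simpleString()
#   df = df.withColumn(col, F.expr(build_normalize_expr(col, type_name)))
print(build_normalize_expr("name", "string"))
print(build_normalize_expr("colors", "array<string>"))
```

      Only the expression that matches the column's type is handed to Spark, so the analyzer never sees trim() applied to an array or array_join() applied to a string.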

          People

            Assignee: Unassigned
            Reporter: Tabrez Mohammed (abuqutaita)