Spark / SPARK-13410

unionAll AnalysisException with DataFrames containing UDT columns.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0, 1.6.0
    • Fix Version/s: 1.6.1, 2.0.0
    • Component/s: SQL
    • Labels: Patch

    Description

      Unioning two DataFrames that contain UDTs fails with

      AnalysisException: u"unresolved operator 'Union;"

      I tracked this down to this line: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202

      That line compares the data types of the output attributes of both logical plans. For UDTs, however, each side holds a new instance of UserDefinedType or PythonUserDefinedType: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158

      The equality check therefore tests whether the two instances are the same object, and since they are not references to a singleton, the check fails.

      Note: this works fine if you union a DataFrame with itself.

      I have a proposed patch for this, which overrides the equality operator on the two classes: https://github.com/apache/spark/pull/11279
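
      The failure mode and the idea behind the fix can be illustrated outside Spark. The following is a minimal plain-Python sketch, not Spark's actual code: IdentityUDT, ValueUDT and same_output_types are hypothetical names invented for this example. It shows a type class that falls back to identity-based equality, a variant that compares by class, and a union-style check that compares column types pairwise.

```python
class IdentityUDT(object):
    """Hypothetical type class relying on the default, identity-based equality."""

    def __init__(self, sql_type):
        self.sql_type = sql_type


class ValueUDT(object):
    """Hypothetical type class whose instances are equal when their classes match."""

    def __init__(self, sql_type):
        self.sql_type = sql_type

    def __eq__(self, other):
        # Any two instances of the same class describe the same type.
        return type(self) == type(other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Keep the hash consistent with __eq__.
        return hash(type(self).__name__)


def same_output_types(left, right):
    """Mimic a union check: compare the column types of both sides pairwise."""
    return len(left) == len(right) and all(l == r for l, r in zip(left, right))


# Same instance on both sides, as when a DataFrame is unioned with itself.
shared = [IdentityUDT("double")]
identity_self = same_output_types(shared, shared)

# Separate instances on each side, as with two independently built DataFrames.
identity_two = same_output_types([IdentityUDT("double")], [IdentityUDT("double")])
value_two = same_output_types([ValueUDT("double")], [ValueUDT("double")])

print(identity_self, identity_two, value_two)
```

      With identity-based equality the check passes only when both sides share the same instance, which matches the self-union note above. Comparing by class lets separately constructed instances match.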

      Reproduction steps

      from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
      from pyspark.sql import types

      # sqlCtx is assumed to be an existing SQLContext (e.g. from the pyspark shell)
      schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])

      # Note: a and b must be two separate DataFrames
      a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
      b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)

      # Raises AnalysisException: u"unresolved operator 'Union;"
      c = a.unionAll(b)
      

          People

            Assignee: Franklyn Dsouza (franklynDsouza)
            Reporter: Franklyn Dsouza (franklynDsouza)
            Votes: 0
            Watchers: 2


              Time Tracking

                Original Estimate: 3h
                Remaining Estimate: 3h
                Time Spent: Not Specified