Spark / SPARK-13410

unionAll AnalysisException with DataFrames containing UDT columns.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0, 1.6.0
    • Fix Version/s: 1.6.1, 2.0.0
    • Component/s: SQL
    • Flags: Patch

      Description

      Unioning two DataFrames that contain UDTs fails with:

      AnalysisException: u"unresolved operator 'Union;"

      I tracked this down to this line: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202

      That line compares the data types of the output attributes of both logical plans. For UDTs, however, each data type is a new instance of UserDefinedType or PythonUserDefinedType: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158

      The equality check therefore tests whether the two instances are the same object, and since they are not references to a singleton, the check fails.

      Note: unioning a DataFrame with itself works fine.
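
The failure mode described above can be illustrated outside Spark with a toy class. This is only a sketch: `ToyUDT` is a hypothetical stand-in, not a Spark class, and it assumes the type in question does not define its own equality, so comparison falls back to object identity.

```python
class ToyUDT(object):
    """Hypothetical stand-in for a type class that does not define __eq__."""

    def __init__(self, sql_type):
        self.sql_type = sql_type


# Two separately constructed instances that describe the same type.
left = ToyUDT("array<double>")
right = ToyUDT("array<double>")

# Default equality compares identity, so distinct instances never match,
# which mirrors the union of two separate DataFrames.
different_instances_equal = (left == right)

# Comparing an instance with itself succeeds, which mirrors the
# "union the DataFrame with itself" case.
same_instance_equal = (left == left)
```

Under these assumptions `different_instances_equal` is false and `same_instance_equal` is true, matching the behaviour reported in this issue.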

      I have proposed a patch that overrides the equality operator on both classes: https://github.com/apache/spark/pull/11279
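
The general idea of overriding equality can be sketched with toy classes. This is not the code from the pull request: `ToyUDT` and `OtherToyUDT` are hypothetical names, and treating two instances as equal whenever they share a class is an assumption made for illustration.

```python
class ToyUDT(object):
    """Hypothetical type class whose instances compare equal by class."""

    def __eq__(self, other):
        # Two instances describe the same type if they share a class.
        return type(self) is type(other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Keep hashing consistent with the equality defined above.
        return hash(type(self))


class OtherToyUDT(ToyUDT):
    """A different hypothetical type; must not compare equal to ToyUDT."""


same_type_equal = (ToyUDT() == ToyUDT())
other_type_equal = (ToyUDT() == OtherToyUDT())
```

With equality defined structurally, two separately created instances of the same type compare equal, while instances of different types still do not.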

      Reproduction steps

      from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
      from pyspark.sql import types
      
      schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
      
      # Note: a and b must be two separate DataFrames
      a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
      b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
      
      # Fails with AnalysisException: unresolved operator 'Union
      c = a.unionAll(b)
      

            People

            • Assignee: Franklyn Dsouza (franklynDsouza)
            • Reporter: Franklyn Dsouza (franklynDsouza)

                Time Tracking

                • Original Estimate: 3h
                • Remaining Estimate: 3h
                • Time Spent: Not Specified