Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-21110

Structs should be usable in inequality filters

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:

      Description

      It seems like a missing feature that you can't compare structs in a filter on a DataFrame.

      Here's a simple demonstration of a) where this would be useful and b) how it's different from simply comparing each of the components of the structs.

      import pyspark
      from pyspark.sql.functions import col, struct, concat
      
      spark = pyspark.sql.SparkSession.builder.getOrCreate()
      
      df = spark.createDataFrame(
          [
              ('Boston', 'Bob'),
              ('Boston', 'Nick'),
              ('San Francisco', 'Bob'),
              ('San Francisco', 'Nick'),
          ],
          ['city', 'person']
      )
      pairs = (
          df.select(
              struct('city', 'person').alias('p1')
          )
          .crossJoin(
              df.select(
                  struct('city', 'person').alias('p2')
              )
          )
      )
      
      print("Everything")
      pairs.show()
      
      print("Comparing parts separately (doesn't give me what I want)")
      (pairs
          .where(col('p1.city') < col('p2.city'))
          .where(col('p1.person') < col('p2.person'))
          .show())
      
      print("Comparing parts together with concat (gives me what I want but is hacky)")
      (pairs
          .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
          .show())
      
      print("Comparing parts together with struct (my desired solution but currently yields an error)")
      (pairs
          .where(col('p1') < col('p2'))
          .show())
      

      The last query yields the following error in Spark 2.1.1:

      org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint or int or bigint or float or double or decimal or timestamp or date or string or binary) type, not struct<city:string,person:string>;;
      'Filter (p1#5 < p2#8)
      +- Join Cross
         :- Project [named_struct(city, city#0, person, person#1) AS p1#5]
         :  +- LogicalRDD [city#0, person#1]
         +- Project [named_struct(city, city#0, person, person#1) AS p2#8]
            +- LogicalRDD [city#0, person#1]
      

        Attachments

          Activity

            People

            • Assignee:
              a1ray Andrew Ray
              Reporter:
              nchammas Nicholas Chammas
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: