Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-10960

SQL with windowing function cannot reference column in inner select block

Log workAgile BoardRank to TopRank to BottomAttach filesAttach ScreenshotVotersStop watchingWatchersCreate sub-taskConvert to sub-taskLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.4.0, 1.5.0
    • 1.4.2, 1.5.2, 1.6.0
    • SQL
    • None

    Description

      There seems to be a bug in the Spark SQL parser when I use windowing functions. Specifically, when the SELECT refers to a column from an inner select block, the parser throws an error.

      Here is an example:
      --------------------------
      When I use a windowing function and add a '1' constant to the result,

         select Rank() OVER ( ORDER BY D1.c3 ) + 1 as c1
      

      The Spark SQL parser works. The whole SQL is:

      select Rank() OVER ( ORDER BY D1.c3 ) + 1 as c1,
                               D1.c3 as c3,
                               D1.c4 as c4,
                               D1.c5 as c5
                          from 
                               (select T3671.ROW_WID as c3,
                                         T3671.CAL_MONTH as c4,
                                         T3671.CAL_YEAR as c5,
                                         1 as c6
                                    from 
                                         W_DAY_D T3671
                               ) D1
      

      However, if I change the projection so that it refers to a column in an inner select block, D1.C6, whose value is itself a '1' literal, so it is functionally equivalent to the SQL above, Spark SQL will throw an error:

      select Rank() OVER ( ORDER BY D1.c3 ) + D1.C6 as c1,
                               D1.c3 as c3,
                               D1.c4 as c4,
                               D1.c5 as c5
                          from 
                               (select T3671.ROW_WID as c3,
                                         T3671.CAL_MONTH as c4,
                                         T3671.CAL_YEAR as c5,
                                         1 as c6
                                    from 
                                         W_DAY_D T3671
                               ) D1
      

      The error message is:

      . . . . . . . . . . . . . . . .> java.lang.NullPointerException
      Error: org.apache.spark.sql.AnalysisException: resolved attribute(s) c6#3386 missing from c5#3390
      ,c3#3383,c4#3389,_we0#3461,c3#3388 in operator !Project [c3#3388,c4#3389,c5#3390,c3#3383,_we0#346
      1,(_we0#3461 + c6#3386) AS c1#3387]; (state=,code=0)
      

      The above example is a simplified version of the SQL I was testing. The full SQL I was using, which fails with a similar error, is as follows:

      select Case when case D1.c6 when 1 then D1.c3 else NULL end  is not null then Rank() OVER ( ORDER BY case when ( case D1.c6 when 1 then D1.c3 else NULL end  ) is null then 1 else 0 end, case D1.c6 when 1 then D1.c3 else NULL end ) end as c1,
                               Case when case D1.c7 when 1 then D1.c3 else NULL end  is not null then Rank() OVER ( PARTITION BY D1.c4, D1.c5 ORDER BY case when ( case D1.c7 when 1 then D1.c3 else NULL end  ) is null then 1 else 0 end, case D1.c7 when 1 then D1.c3 else NULL end ) end as c2,
                               D1.c3 as c3,
                               D1.c4 as c4,
                               D1.c5 as c5
                          from 
                               (select T3671.ROW_WID as c3,
                                         T3671.CAL_MONTH as c4,
                                         T3671.CAL_YEAR as c5,
                                         ROW_NUMBER() OVER (PARTITION BY T3671.CAL_MONTH, T3671.CAL_YEAR ORDER BY T3671.CAL_MONTH DESC, T3671.CAL_YEAR DESC) as c6,
                                         ROW_NUMBER() OVER (PARTITION BY T3671.CAL_MONTH, T3671.CAL_YEAR, T3671.ROW_WID ORDER BY T3671.CAL_MONTH DESC, T3671.CAL_YEAR DESC, T3671.ROW_WID DESC) as c7
                                    from 
                                         W_DAY_D T3671
                               ) D1
      

      Hopefully when fixed, both these sample SQLs should work!

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            viirya L. C. Hsieh Assign to me
            jeanlucw David Wong
            Votes:
            0 Vote for this issue
            Watchers:
            4 Stop watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Issue deployment