SPARK-49679

validateSchemaOutput should refer to case sensitivity flag


Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.0
    • Fix Version/s: None
    • Component/s: Optimizer

    Description

      When `spark.sql.caseSensitive` is set to false, queries like the following should be accepted:

                SELECT * FROM (
                    Select a.ppmonth,
                    a.ppweek,
                    case when a.retsubcategoryderived <= 1 then 'XXXXXXXXXXXXX'
                    else
                    'XXXXXX'
                    end as mappedflag,
                    b.name as subcategory_name,
                    sum(a.totalvalue) as RDOLLARS
                    from a, b
                    where a.retsubcategoryderived = b.retsubcategoryderived
                    group by a.Ppmonth,a.ppweek,a.retsubcategoryderived,b.name, mappedflag)
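
      For reference, a minimal illustration of why the mixed-case reference is accepted (a spark-shell sketch; the table a and its ppmonth column are taken from the query above and are otherwise assumed to exist):

                // With spark.sql.caseSensitive=false (the default), Ppmonth and ppmonth
                // resolve to the same attribute, so the mixed-case GROUP BY above is accepted.
                spark.conf.set("spark.sql.caseSensitive", "false")
                spark.sql("SELECT ppmonth FROM a GROUP BY Ppmonth").explain(true)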
      However, validateSchemaOutput, the optimizer check that verifies a rule does not change a plan's output schema, does not consult this flag. As a result, some queries fail the check even though the optimization itself is correct. Take the query above as an example.

      After AggregatePushdownThroughJoins, the plan changes from:
      Aggregate [Ppmonth#3L, ppweek#4L, retsubcategoryderived#7L, name#13, _groupingexpression#29], [ppmonth#3L, ppweek#4L, _groupingexpression#29 AS mappedflag#0, name#13 AS subcategory_name#1, sum(totalvalue#9L) AS RDOLLARS#2L]
      +- Project [ppmonth#3L, ppweek#4L, retsubcategoryderived#7L, totalvalue#9L, name#13, CASE WHEN (retsubcategoryderived#7L <= 1) THEN XXXXXXXXXXXXX ELSE XXXXXX END AS _groupingexpression#29]
         +- Join Inner, (retsubcategoryderived#7L = retsubcategoryderived#10L)
            :- Project [ppmonth#3L, ppweek#4L, retsubcategoryderived#7L, totalvalue#9L]
            :  +- Filter isnotnull(retsubcategoryderived#7L)
            :     +- Relation spark_catalog.default.a[ppmonth#3L,ppweek#4L,retcategorygroupderived#5L,rethidsubcategoryderived#6L,retsubcategoryderived#7L,retsupercategoryderived#8L,totalvalue#9L] parquet
            +- Project [retsubcategoryderived#10L, name#13]
               +- Filter isnotnull(retsubcategoryderived#10L)
                  +- Relation spark_catalog.default.b[retsubcategoryderived#10L,description#11,displayorder#12L,name#13,shortname#14,startrange#15,endrange#16,retcategoryderived#17L,retcategorygroupderived#18L,retsupercategoryderived#19L,altbusiness#20L] parquet

      To:

      Project [Ppmonth#3L, ppweek#4L, _groupingexpression#29 AS mappedflag#0, name#13 AS subcategory_name#1, sum(totalvalue#9L)#23L AS RDOLLARS#2L]
      +- AggregatePart Ppmonth#3L, ppweek#4L, retsubcategoryderived#7L, name#13, _groupingexpression#29, finalmerge_sum(merge sum#31L) AS sum(totalvalue#9L)#23L, true
         +- AggregatePart Ppmonth#3L, ppweek#4L, retsubcategoryderived#7L, name#13, _groupingexpression#29, merge_sum(merge sum#31L) AS sum#31L, false
            +- Project [Ppmonth#3L, ppweek#4L, retsubcategoryderived#7L, name#13, _groupingexpression#29, sum#31L]
               +- Join Inner, (retsubcategoryderived#7L = retsubcategoryderived#10L)
                  :- AggregatePart Ppmonth#3L, ppweek#4L, retsubcategoryderived#7L, CASE WHEN (retsubcategoryderived#7L <= 1) THEN XXXXXXXXXXXXX ELSE XXXXXX END AS _groupingexpression#29, partial_sum(totalvalue#9L) AS sum#31L, false
                  :  +- Project [ppmonth#3L, ppweek#4L, retsubcategoryderived#7L, totalvalue#9L]
                  :     +- Filter isnotnull(retsubcategoryderived#7L)
                  :        +- Relation spark_catalog.default.a[ppmonth#3L,ppweek#4L,retcategorygroupderived#5L,rethidsubcategoryderived#6L,retsubcategoryderived#7L,retsupercategoryderived#8L,totalvalue#9L] parquet
                  +- Project [retsubcategoryderived#10L, name#13]
                     +- Filter isnotnull(retsubcategoryderived#10L)
                        +- Relation spark_catalog.default.b[retsubcategoryderived#10L,description#11,displayorder#12L,name#13,shortname#14,startrange#15,endrange#16,retcategoryderived#17L,retcategorygroupderived#18L,retsupercategoryderived#19L,altbusiness#20L] parquet

      where the output attribute Ppmonth no longer matches ppmonth because the comparison is case sensitive.

      We need to make validateSchemaOutput respect this flag when comparing the before and after schemas.
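
      A minimal sketch of the intended comparison (not the actual Spark implementation; SQLConf.get.caseSensitiveAnalysis is the existing accessor for spark.sql.caseSensitive, and the helper name schemasMatch is made up for illustration):

                import org.apache.spark.sql.internal.SQLConf
                import org.apache.spark.sql.types.StructType

                // Compare two plan output schemas, folding field-name case when
                // spark.sql.caseSensitive is false, so that Ppmonth#3L vs ppmonth#3L
                // in the plans above is not reported as a schema change.
                def schemasMatch(before: StructType, after: StructType): Boolean = {
                  val caseSensitive = SQLConf.get.caseSensitiveAnalysis
                  before.length == after.length && before.zip(after).forall { case (b, a) =>
                    val sameName =
                      if (caseSensitive) b.name == a.name else b.name.equalsIgnoreCase(a.name)
                    sameName && b.dataType == a.dataType
                  }
                }

      Nullability handling is omitted in this sketch; whatever treatment the existing check applies to nullability would stay as-is.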

            People

              Assignee: Unassigned
              Reporter: Avery Qi (avery_qi)
