  1. Spark
  2. SPARK-46779

Grouping by subquery with a cached relation can fail


Details

    Description

      Example:

      create or replace temp view data(c1, c2) as values
      (1, 2),
      (1, 3),
      (3, 7),
      (4, 5);
      
      cache table data;
      
      select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all;
      

      It fails with the following error:

      [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
      org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000
      

      If you don't cache the view, the query succeeds.
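
      As a minimal sketch of that comparison, assuming the same session and the `data` view defined above, dropping the cached relation before rerunning the query should avoid the error:

      ```sql
      -- Remove the cached relation for the view, then rerun the failing query.
      uncache table data;

      select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all;
      ```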

      Note: in 3.4.2 and 3.5.0, the issue occurs only with cached tables, not cached views. I think that's because cached views were not being properly deduplicated in those versions.

            People

              bersprockets Bruce Robbins