Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-47527

Cache miss for queries using With expressions

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 4.0.0
    • None
    • SQL

    Description

      For example:

      create or replace temp view v1 as
      select id from range(10);
      
      create or replace temp view q1 as
      select * from v1
      where id between 2 and 4;
      
      cache table q1;
      
      explain select * from q1;
      
      == Physical Plan ==
      *(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
      +- *(1) Range (0, 10, step=1, splits=8)
      

      Similarly:

      create or replace temp view q2 as
      select count_if(id > 3) as cnt
      from v1;
      
      cache table q2;
      
      explain select * from q2;
      
      == Physical Plan ==
      AdaptiveSparkPlan isFinalPlan=false
      +- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null else _common_expr_0#88)])
         +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
            +- HashAggregate(keys=[], functions=[partial_count(if (NOT _common_expr_0#88) null else _common_expr_0#88)])
               +- Project [(id#86L > 3) AS _common_expr_0#88]
                  +- Range (0, 10, step=1, splits=8)
      
      

      In the output of the above explain commands, neither include an InMemoryRelation node.

      The culprit seems to be the common expression ids in the With expressions used in runtime replacements for between and count_if, e.g. this code.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              bersprockets Bruce Robbins
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: