Spark / SPARK-18455 General support for correlated subquery processing / SPARK-19993

Caching logical plans containing subquery expressions does not work.


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.1.0
    • Fix Version: 2.2.0
    • Component: SQL
    • Labels: None

    Description

Here is a simple repro that demonstrates the problem. The second invocation of the SQL statement should be answered from the cache, but the cache lookup currently fails.

      scala> val ds = spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)")
      ds: org.apache.spark.sql.DataFrame = [c1: int]
      
      scala> ds.cache
      res13: ds.type = [c1: int]
      
      scala> spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)").explain(true)
      == Analyzed Logical Plan ==
      c1: int
      Project [c1#86]
      +- Filter c1#86 IN (list#78 [c1#86])
         :  +- Project [c1#87]
         :     +- Filter (outer(c1#86) = c1#87)
         :        +- SubqueryAlias s2
         :           +- Relation[c1#87] parquet
         +- SubqueryAlias s1
            +- Relation[c1#86] parquet
      
      == Optimized Logical Plan ==
      Join LeftSemi, ((c1#86 = c1#87) && (c1#86 = c1#87))
      :- Relation[c1#86] parquet
      +- Relation[c1#87] parquet
      
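The failure is easier to reason about once the mechanism is spelled out: the cache lookup compares plans structurally, but each analysis of the same SQL assigns fresh expression IDs (the `c1#86` / `c1#87` suffixes above), so two plans containing subquery expressions never compare equal unless those IDs are normalized first. Below is a minimal plain-Scala sketch of that idea; the classes and the `canonicalize` helper are hypothetical illustrations, not Spark's actual `CacheManager` or `sameResult` implementation.

```scala
// Hypothetical toy model of a plan node: an attribute carries a name plus
// an analysis-assigned expression ID, as in Spark's c1#86 notation.
case class Attr(name: String, exprId: Long)
case class Filter(cond: (Attr, Attr), child: String)

// Two analyses of the same SQL text produce structurally identical plans
// that differ only in their expression IDs.
val plan1 = Filter((Attr("c1", 86), Attr("c1", 87)), "s1 join s2")
val plan2 = Filter((Attr("c1", 101), Attr("c1", 102)), "s1 join s2")

// Naive structural equality fails, which corresponds to a cache miss.
assert(plan1 != plan2)

// Canonicalize by renumbering expression IDs in first-occurrence order,
// so ID assignment no longer depends on the analysis that produced them.
def canonicalize(p: Filter): Filter = {
  val ids = Seq(p.cond._1.exprId, p.cond._2.exprId).distinct.zipWithIndex.toMap
  Filter((p.cond._1.copy(exprId = ids(p.cond._1.exprId)),
          p.cond._2.copy(exprId = ids(p.cond._2.exprId))), p.child)
}

// After canonicalization the two plans compare equal: a cache hit.
assert(canonicalize(plan1) == canonicalize(plan2))
```

The fix for this issue follows the same general principle: make the comparison used for cache lookup insensitive to expression IDs inside subquery expressions.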

People

    Assignee: dkbiswal (Dilip Biswal)
    Reporter: dkbiswal (Dilip Biswal)
    Votes: 0
    Watchers: 3
