  1. Spark
  2. SPARK-19093

Cached tables are not used in SubqueryExpression

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2, 2.1.0
    • Fix Version/s: 2.1.1, 2.2.0
    • Component/s: SQL
    • Labels:
      None

      Description

      See reproduction at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1903098128019500/2699761537338853/1395282846718893/latest.html

      Consider the following:

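      // Needed for .toDS outside spark-shell (spark-shell imports it automatically)
      import spark.implicits._
      
      // Write a small two-column dataset to Parquet
      Seq(("a", "b"), ("c", "d"))
        .toDS
        .write
        .parquet("/tmp/rows")
      
      // Read it back, cache it, and materialize the cache
      val df = spark.read.parquet("/tmp/rows")
      df.cache()
      df.count()
      df.createOrReplaceTempView("rows")
      
      // Cross join: both sides are expected to read from the cache
      spark.sql("""
        select * from rows cross join rows
      """).explain(true)
      
      // Anti join via NOT EXISTS: the subquery side is expected to read from the cache too
      spark.sql("""
        select * from rows where not exists (select * from rows)
      """).explain(true)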
      

      I'd expect both sides of the join to read from the cached table in both the cross join and the anti join. However, the left anti join produces the following plan, which reads only the left side from the cache and reads the right side via a regular non-cached scan:

      == Parsed Logical Plan ==
      'Project [*]
      +- 'Filter NOT exists#3994
         :  +- 'Project [*]
         :     +- 'UnresolvedRelation `rows`
         +- 'UnresolvedRelation `rows`
      
      == Analyzed Logical Plan ==
      _1: string, _2: string
      Project [_1#3775, _2#3776]
      +- Filter NOT predicate-subquery#3994 []
         :  +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
         :     +- Project [_1#3775, _2#3776]
         :        +- SubqueryAlias rows
         :           +- Relation[_1#3775,_2#3776] parquet
         +- SubqueryAlias rows
            +- Relation[_1#3775,_2#3776] parquet
      
      == Optimized Logical Plan ==
      Join LeftAnti
      :- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
      :     +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
      +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
         +- Relation[_1#3775,_2#3776] parquet
      
      == Physical Plan ==
      BroadcastNestedLoopJoin BuildRight, LeftAnti
      :- InMemoryTableScan [_1#3775, _2#3776]
      :     +- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
      :           +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
      +- BroadcastExchange IdentityBroadcastMode
         +- *Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
            +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
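
      In the analyzed plan above, the subquery's plan hangs off the Filter's predicate (predicate-subquery#3994) rather than being a child of the Filter. One plausible reading, which this report does not confirm, is that cache substitution walks only child plans and therefore never visits the plan nested inside the predicate. The following is a minimal toy sketch of that idea. Every type and function name in it is hypothetical; none of it is Spark's actual code.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's classes.
object CacheSubstitutionSketch {
  sealed trait Expr
  case class Not(child: Expr) extends Expr
  // A predicate subquery holds a whole plan inside an expression.
  case class Exists(plan: Plan) extends Expr

  sealed trait Plan
  case class Relation(name: String) extends Plan // regular scan
  case class Cached(name: String) extends Plan   // stands in for a cached relation
  case class Project(child: Plan) extends Plan
  case class Filter(condition: Expr, child: Plan) extends Plan

  // Walks child plans only; plans nested inside expressions are left untouched.
  def useCacheChildrenOnly(plan: Plan, cached: Set[String]): Plan = plan match {
    case Relation(n) if cached(n) => Cached(n)
    case Project(c)               => Project(useCacheChildrenOnly(c, cached))
    case Filter(cond, c)          => Filter(cond, useCacheChildrenOnly(c, cached))
    case other                    => other
  }

  // Also descends into subquery plans held by expressions.
  def useCacheEverywhere(plan: Plan, cached: Set[String]): Plan = {
    def inExpr(e: Expr): Expr = e match {
      case Not(c)    => Not(inExpr(c))
      case Exists(p) => Exists(useCacheEverywhere(p, cached))
    }
    plan match {
      case Relation(n) if cached(n) => Cached(n)
      case Project(c)               => Project(useCacheEverywhere(c, cached))
      case Filter(cond, c)          => Filter(inExpr(cond), useCacheEverywhere(c, cached))
      case other                    => other
    }
  }

  // Counts regular (non-cached) scans, including those inside subqueries.
  def uncachedScans(plan: Plan): Int = {
    def inExpr(e: Expr): Int = e match {
      case Not(c)    => inExpr(c)
      case Exists(p) => uncachedScans(p)
    }
    plan match {
      case Relation(_)     => 1
      case Cached(_)       => 0
      case Project(c)      => uncachedScans(c)
      case Filter(cond, c) => inExpr(cond) + uncachedScans(c)
    }
  }

  // Models: select * from rows where not exists (select * from rows)
  val query: Plan =
    Project(Filter(Not(Exists(Project(Relation("rows")))), Relation("rows")))
}
```

      With "rows" marked as cached, uncachedScans(useCacheChildrenOnly(query, Set("rows"))) is 1, matching the single regular scan in the plan above, while uncachedScans(useCacheEverywhere(query, Set("rows"))) is 0.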
      

            People

            • Assignee: Dilip Biswal (dkbiswal)
            • Reporter: Josh Rosen (joshrosen)