Spark / SPARK-10334

Partitioned table scan's query plan does not show Filter and Project on top of the table scan


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels: None

    Description

      Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned")
      val df1 = sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned")
      df1.selectExpr("hash(i)", "hash(j)").show
      df1.filter("hash(j) = 1").explain
      == Physical Plan ==
      Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21]
      

      Looks like the reason is that we correctly apply the project and filter, then create an RDD for the result and manually wrap it in a PhysicalRDD. So, the Project and Filter on top of the original table scan disappear from the physical plan.

      See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175

      We will not generate wrong results, but the query plan is confusing.
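      The effect described above can be illustrated with a small, hypothetical model (plain Scala, not actual Spark code; `planScan`, `PhysicalScan`, and `PhysicalFilter` are made-up names for illustration): when the strategy evaluates the filter and projection while building the scan's RDD and then wraps the result in a single leaf node, only the scan survives in the explained plan.

      ```scala
      // Hypothetical simplified model of the planner behavior (not Spark internals).
      sealed trait PhysicalPlan { def describe: String }

      case class PhysicalScan(relation: String, output: Seq[String]) extends PhysicalPlan {
        def describe: String = s"Scan $relation[${output.mkString(",")}]"
      }

      case class PhysicalFilter(condition: String, child: PhysicalPlan) extends PhysicalPlan {
        def describe: String = s"Filter $condition\n ${child.describe}"
      }

      // What the strategy effectively does in the linked code: the filter and
      // projection are pushed into the RDD computation, so only a scan node is
      // returned and no Filter/Project appears in the plan tree.
      def planScan(relation: String, projected: Seq[String], filter: Option[String]): PhysicalPlan =
        PhysicalScan(relation, projected)

      val plan = planScan("ParquetRelation[file:/tmp/testFilter_partitioned]", Seq("j", "i"), Some("hash(j) = 1"))
      // plan.describe == "Scan ParquetRelation[file:/tmp/testFilter_partitioned][j,i]"
      ```

      A less confusing plan would instead return something like `PhysicalFilter("hash(j) = 1", PhysicalScan(...))`, so that explain reflects the operators actually applied.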


          People

            Assignee: Yin Huai (yhuai)
            Reporter: Yin Huai (yhuai)
            Votes: 0
            Watchers: 4
