Spark / SPARK-16026 Cost-based Optimizer Framework / SPARK-25185

CBO rowcount statistics doesn't work for partitioned parquet external table


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.1, 2.3.0
    • Fix Version/s: None
    • Component/s: Spark Core, SQL
    • Labels: None
    • Environment: Tried on Ubuntu, FreeBSD, and Windows, running spark-shell in local mode reading data from the local file system

    Description

      Created dummy partitioned data with a string-typed partition column (col1=a and col1=b).

      Added CSV data, read it through Spark, created a partitioned external table, and ran MSCK REPAIR TABLE to load the partitions. Then ran ANALYZE TABLE on all columns, including the partition column.
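      The setup steps above can be sketched as follows in spark-shell (the storage path and the non-partition column names c1/c2 are illustrative; the table name test_p and partition column e come from the snippet in this report):

      // Sketch of the reproduction setup; requires an active SparkSession with Hive support.
      spark.sql(
        "CREATE EXTERNAL TABLE test_p (c1 INT, c2 STRING) PARTITIONED BY (e STRING) " +
        "STORED AS PARQUET LOCATION 'file:///tmp/test_p'")
      // Register the col-name=value partition directories already on disk
      spark.sql("MSCK REPAIR TABLE test_p")
      // Table-level statistics (row count, size)
      spark.sql("ANALYZE TABLE test_p COMPUTE STATISTICS")
      // Column-level statistics, including the partition column
      spark.sql("ANALYZE TABLE test_p COMPUTE STATISTICS FOR COLUMNS c1, c2, e")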

      // e is the partition column of table test_p
      println(spark.sql("select * from test_p where e='1a'").queryExecution.toStringWithStats)
      val op = spark.sql("select * from test_p where e='1a'").queryExecution.optimizedPlan

      // Spark 2.2 API: LogicalPlan.stats takes the SQLConf (in 2.3 it is op.stats with no argument)
      val stat = op.stats(spark.sessionState.conf)
      print(stat.rowCount)
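      When reproducing, it may be worth confirming that CBO is enabled and that the statistics were actually persisted to the metastore before concluding the row count is lost (a hedged check, assuming the test_p table from above):

      // Row-count estimation via CBO is gated on this flag (default false in Spark 2.x)
      println(spark.conf.get("spark.sql.cbo.enabled"))
      // The "Statistics" row in the extended description shows what ANALYZE persisted
      spark.sql("DESCRIBE EXTENDED test_p").show(100, false)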

       

      Created the same table with Parquet data: the rowCount comes up correctly for the CSV-backed table, but for the Parquet-backed table it shows as None.

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: imamitsehgal (Amit)
            Votes: 5
            Watchers: 9