Spark / SPARK-21969

CommandUtils.updateTableStats should call refreshTable

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      The table is cached, so even after its statistics are removed, existing sessions continue to use the stale cached values.

      spark.range(100).write.saveAsTable("tab1")
      sql("analyze table tab1 compute statistics")
      sql("explain cost select distinct * from tab1").show(false)
      

      Produces:

      Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
      
      spark.range(100).write.mode("append").saveAsTable("tab1")
      sql("explain cost select distinct * from tab1").show(false)
      

      After appending data, the same statistics are still reported:

      Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
      

      Manually refreshing the table removes the stale statistics:

      import org.apache.spark.sql.catalyst.TableIdentifier
      spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
      sql("explain cost select distinct * from tab1").show(false)
      
      Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
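
      The manual refresh above goes through the internal session catalog. As a hedged sketch of a possible workaround, the same invalidation can likely be triggered through the public catalog API. This assumes that spark.catalog.refreshTable clears the same cached relation, which this report does not confirm:

      ```scala
      // Run in spark-shell, where `spark` is the active SparkSession.
      // Append more rows; the cached relation may still carry the old statistics.
      spark.range(100).write.mode("append").saveAsTable("tab1")

      // Public-API counterpart of spark.sessionState.catalog.refreshTable(...).
      // Assumption: this invalidates the cached metadata for tab1 in this session.
      spark.catalog.refreshTable("tab1")

      // Re-check which statistics the optimizer now reports for the relation.
      spark.sql("explain cost select distinct * from tab1").show(false)
      ```

      If the assumption holds, the plan should no longer show the pre-append rowCount.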
      

            People

              Assignee: Anton Okolnychyi (aokolnychyi)
              Reporter: Bogdan Raducanu (bograd)
              Votes: 0
              Watchers: 3
