Description
The table relation is cached, so even after its statistics are removed, existing sessions keep using the stale statistics.
spark.range(100).write.saveAsTable("tab1")
sql("analyze table tab1 compute statistics")
sql("explain cost select distinct * from tab1").show(false)
Produces:
Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
spark.range(100).write.mode("append").saveAsTable("tab1")
sql("explain cost select distinct * from tab1").show(false)
After appending data, the same stale stats are still used:
Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, hints=none)
Manually refreshing the table removes the stale stats:
import org.apache.spark.sql.catalyst.TableIdentifier
spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
sql("explain cost select distinct * from tab1").show(false)
Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
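As a possible workaround until the stats are invalidated automatically, the same steps can be sketched with the public catalog API instead of the internal session catalog. This is a minimal sketch, not a confirmed fix: it assumes that spark.catalog.refreshTable drops the same cached relation as the internal call above, and that re-running ANALYZE TABLE afterwards recomputes the row count. The local session settings and app name are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("stale-stats-workaround")
  .getOrCreate()

// Recreate the scenario from the description: 100 rows, then analyze.
spark.range(100).write.mode("overwrite").saveAsTable("tab1")
spark.sql("analyze table tab1 compute statistics")

// Append another 100 rows; the cached relation may still carry the old stats.
spark.range(100).write.mode("append").saveAsTable("tab1")

// Public counterpart of sessionState.catalog.refreshTable (assumed equivalent here).
spark.catalog.refreshTable("tab1")

// Recompute statistics so the optimizer sees an up-to-date row count.
spark.sql("analyze table tab1 compute statistics")
spark.sql("explain cost select distinct * from tab1").show(false)
```

Whether the refreshed plan reports the new size and row count depends on the Spark version, so the output of the last statement is worth checking before relying on this.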
Issue Links
- is part of SPARK-21237 Invalidate stats once table data is changed (Resolved)