Description
SPARK-21237 fixed the statistics update when data is appended with write.mode("append"). But when the same type of parquet table is created using SQL and data is added with INSERT INTO, the statistics shown are incorrect (not updated).
Correct Stats Example (SPARK-21237)
scala> spark.range(100).write.saveAsTable("tab1")
scala> spark.sql("explain cost select * from tab1").show(false)
+------------------------------------------------------------------------+
|plan                                                                    |
+------------------------------------------------------------------------+
|== Optimized Logical Plan ==
Relation[id#10L] parquet, Statistics(sizeInBytes=784.0 B, hints=none)

== Physical Plan ==
FileScan parquet default.tab1[id#10L] Batched: false, Format: Parquet,
scala> spark.range(100).write.mode("append").saveAsTable("tab1")
scala> spark.sql("explain cost select * from tab1").show(false)
+------------------------------------------------------------------------+
|plan                                                                    |
+------------------------------------------------------------------------+
|== Optimized Logical Plan ==
Relation[id#23L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)

== Physical Plan ==
FileScan parquet default.tab1[id#23L] Batched: false, Format: Parquet,
Incorrect Stats Example
scala> spark.sql("create table tab2(id bigint) using parquet")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.sql("explain cost select * from tab2").show(false)
+------------------------------------------------------------------------+
|plan                                                                    |
+------------------------------------------------------------------------+
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(sizeInBytes=374.0 B, hints=none)

== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
scala> spark.sql("insert into tab2 select 1")
res9: org.apache.spark.sql.DataFrame = []
scala> spark.sql("explain cost select * from tab2").show(false)
+------------------------------------------------------------------------+
|plan                                                                    |
+------------------------------------------------------------------------+
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(sizeInBytes=374.0 B, hints=none)

== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
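A workaround sketch (assuming a running Spark session with tab2 created as above): explicitly recomputing the statistics after the SQL INSERT should let the optimizer pick up the new size. This is not the fix itself, just a way to refresh the stats manually.

```scala
// Workaround sketch, assuming a live SparkSession `spark` and the tab2
// table from the repro above: recompute table statistics explicitly
// after the SQL INSERT so the optimizer sees the new data size.
spark.sql("ANALYZE TABLE tab2 COMPUTE STATISTICS")

// The optimized logical plan should now report an updated sizeInBytes.
spark.sql("explain cost select * from tab2").show(false)
```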
Both tables are the same type of managed parquet table; note that tab2 below has no Statistics entry at all:
scala> spark.sql("desc formatted tab1").show(2000,false)
+----------------+---------------------------------------------------------------+
|col_name        |data_type                                                      |
+----------------+---------------------------------------------------------------+
|id              |bigint                                                         |
|                |                                                               |
|Database        |default                                                        |
|Table           |tab1                                                           |
|Owner           |Administrator                                                  |
|Created Time    |Tue Jul 16 21:08:35 IST 2019                                   |
|Last Access     |Thu Jan 01 05:30:00 IST 1970                                   |
|Created By      |Spark 2.3.2                                                    |
|Type            |MANAGED                                                        |
|Provider        |parquet                                                        |
|Table Properties|[transient_lastDdlTime=1563291579]                             |
|Statistics      |1568 bytes                                                     |
|Location        |file:/x/2                                                      |
|Serde Library   |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    |
|InputFormat     |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat  |
|OutputFormat    |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |
+----------------+---------------------------------------------------------------+
scala> spark.sql("desc formatted tab2").show(2000,false)
+----------------+---------------------------------------------------------------+
|col_name        |data_type                                                      |
+----------------+---------------------------------------------------------------+
|id              |bigint                                                         |
|                |                                                               |
|Database        |default                                                        |
|Table           |tab2                                                           |
|Owner           |Administrator                                                  |
|Created Time    |Tue Jul 16 21:10:24 IST 2019                                   |
|Last Access     |Thu Jan 01 05:30:00 IST 1970                                   |
|Created By      |Spark 2.3.2                                                    |
|Type            |MANAGED                                                        |
|Provider        |parquet                                                        |
|Table Properties|[transient_lastDdlTime=1563291624]                             |
|Location        |file:/x/1                                                      |
|Serde Library   |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    |
|InputFormat     |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat  |
|OutputFormat    |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |
+----------------+---------------------------------------------------------------+
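The size the optimizer actually uses can also be checked programmatically. A sketch, assuming a live SparkSession `spark`, Spark 2.3's plan-statistics API, and the two tables from the repro above:

```scala
// Sketch: read the sizeInBytes the optimizer will use for each table,
// assuming a live SparkSession `spark` and tab1/tab2 created as above.
val tab1Size = spark.table("tab1").queryExecution.optimizedPlan.stats.sizeInBytes
val tab2Size = spark.table("tab2").queryExecution.optimizedPlan.stats.sizeInBytes

// With the bug, tab2 keeps reporting the pre-insert estimate (374.0 B)
// while tab1 reflects the appended data (1568.0 B).
println(s"tab1: $tab1Size bytes, tab2: $tab2Size bytes")
```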
Attachments
Issue Links
- duplicates SPARK-19784 (refresh datasource table after alter the location) - Resolved