SPARK-28413

sizeInBytes is not updated for a Parquet datasource table on the next insert


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.2, 2.4.1
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      SPARK-21237 fixed the statistics update for data appended with write.mode("append"). However, when the same type of Parquet table is created with SQL and data is then added with INSERT INTO, the reported statistics are incorrect: sizeInBytes is not updated.

      Correct Stats Example (SPARK-21237)

      scala> spark.range(100).write.saveAsTable("tab1")

      scala> spark.sql("explain cost select * from tab1").show(false)
      +------------------------------------------------------------------------
      plan
      +------------------------------------------------------------------------
      == Optimized Logical Plan ==
      Relation[id#10L] parquet, Statistics(sizeInBytes=784.0 B, hints=none)

      == Physical Plan ==
      FileScan parquet default.tab1[id#10L] Batched: false, Format: Parquet,

      scala> spark.range(100).write.mode("append").saveAsTable("tab1")

      scala> spark.sql("explain cost select * from tab1").show(false)
      +----------------------------------------------------------------------
      plan
      +----------------------------------------------------------------------
      == Optimized Logical Plan ==
      Relation[id#23L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)

      == Physical Plan ==
      FileScan parquet default.tab1[id#23L] Batched: false, Format: Parquet,

       

       

      Incorrect Stats Example

      scala> spark.sql("create table tab2(id bigint) using parquet")
      res6: org.apache.spark.sql.DataFrame = []

      scala> spark.sql("explain cost select * from tab2").show(false)
      +----------------------------------------------------------------------
      plan
      +----------------------------------------------------------------------
      == Optimized Logical Plan ==
      Relation[id#30L] parquet, Statistics(sizeInBytes=374.0 B, hints=none)

      == Physical Plan ==
      FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,

       

      scala> spark.sql("insert into tab2 select 1")
      res9: org.apache.spark.sql.DataFrame = []

      scala> spark.sql("explain cost select * from tab2").show(false)
      +----------------------------------------------------------------------
      plan
      +----------------------------------------------------------------------
      == Optimized Logical Plan ==
      Relation[id#30L] parquet, Statistics(sizeInBytes=374.0 B, hints=none)

      == Physical Plan ==
      FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
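
      The same check can be scripted instead of read from the explain output. The following is only a sketch, assuming Spark 2.3 or later (where LogicalPlan.stats takes no arguments) and a local session; the object name SizeInBytesCheck and its helpers plannedSize and statsUpdated are hypothetical names introduced here. The ANALYZE TABLE step is a possible workaround that this report does not confirm.

```scala
import org.apache.spark.sql.SparkSession

object SizeInBytesCheck {
  // Size estimate the optimizer currently attaches to a scan of `table`.
  def plannedSize(spark: SparkSession, table: String): BigInt =
    spark.table(table).queryExecution.optimizedPlan.stats.sizeInBytes

  // True when the estimate grew between two observations.
  def statsUpdated(before: BigInt, after: BigInt): Boolean = after > before

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-28413-check")
      .getOrCreate()

    // Same steps as the "Incorrect Stats Example" in this report.
    spark.sql("drop table if exists tab2")
    spark.sql("create table tab2(id bigint) using parquet")
    val before = plannedSize(spark, "tab2")

    spark.sql("insert into tab2 select 1")
    val afterInsert = plannedSize(spark, "tab2")

    // Assumption: explicitly recomputing table statistics refreshes the estimate.
    spark.sql("analyze table tab2 compute statistics")
    val afterAnalyze = plannedSize(spark, "tab2")

    println(s"before=$before afterInsert=$afterInsert afterAnalyze=$afterAnalyze")
    println(s"updated by insert: ${statsUpdated(before, afterInsert)}")
    spark.stop()
  }
}
```

      On an affected version, the expectation from this report is that before and afterInsert are equal; whether afterAnalyze differs depends on the Spark version and catalog in use.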

       

       

      Both tables are of the same type:

      scala> spark.sql("desc formatted tab1").show(2000,false)
      ----------------------------------------------------------------------------------------
      col_name                      data_type
      ----------------------------------------------------------------------------------------
      id                            bigint

      # Detailed Table Information
      Database                      default
      Table                         tab1
      Owner                         Administrator
      Created Time                  Tue Jul 16 21:08:35 IST 2019
      Last Access                   Thu Jan 01 05:30:00 IST 1970
      Created By                    Spark 2.3.2
      Type                          MANAGED
      Provider                      parquet
      Table Properties              [transient_lastDdlTime=1563291579]
      Statistics                    1568 bytes
      Location                      file:/x/2
      Serde Library                 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
      InputFormat                   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
      OutputFormat                  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

       

      scala> spark.sql("desc formatted tab2").show(2000,false)
      ----------------------------------------------------------------------------------------
      col_name                      data_type
      ----------------------------------------------------------------------------------------
      id                            bigint

      # Detailed Table Information
      Database                      default
      Table                         tab2
      Owner                         Administrator
      Created Time                  Tue Jul 16 21:10:24 IST 2019
      Last Access                   Thu Jan 01 05:30:00 IST 1970
      Created By                    Spark 2.3.2
      Type                          MANAGED
      Provider                      parquet
      Table Properties              [transient_lastDdlTime=1563291624]
      Location                      file:/x/1
      Serde Library                 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
      InputFormat                   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
      OutputFormat                  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
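
      In the output above, tab1 has a Statistics row and tab2 does not. The catalog-level statistics can also be read programmatically. This is a rough sketch, assuming Spark 2.3 or later and that the session sees the same catalog as the shell used in this report; CatalogStatsCheck, render, and catalogSize are hypothetical names, and spark.sessionState is an unstable, internal-facing API that may change between releases.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

object CatalogStatsCheck {
  // Format an optional size for printing.
  def render(size: Option[BigInt]): String =
    size.map(b => s"$b bytes").getOrElse("<no statistics>")

  // Table-level size recorded in the session catalog, if any.
  def catalogSize(spark: SparkSession, table: String): Option[BigInt] =
    spark.sessionState.catalog
      .getTableMetadata(TableIdentifier(table))
      .stats
      .map(_.sizeInBytes)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("catalog-stats-check")
      .getOrCreate()

    // Skip tables that this session's catalog does not know about.
    Seq("tab1", "tab2").filter(t => spark.catalog.tableExists(t)).foreach { t =>
      println(s"$t: ${render(catalogSize(spark, t))}")
    }
    spark.stop()
  }
}
```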

          People

            Assignee: Unassigned
            Reporter: Bjangir Babulal