Apache Hudi / HUDI-7162

RDDs don't cache in some situations with new filegroup reader + new parquet file format


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: spark, spark-sql
    • Labels: None

    Description

      "Test Call rollback_to_instant Procedure with refreshTable" 

      This test fails if a projection is added to the query plan. It currently passes only because we don't add the projection for non-partitioned tables. Adding the projection prevents the RDD from being cached.

      Query plans:

      Without projection, caching works:

      == Parsed Logical Plan ==
      'Project ['id]
      +- SubqueryAlias spark_catalog.default.h0
         +- Relation default.h0[_hoodie_commit_time#547,_hoodie_commit_seqno#548,_hoodie_record_key#549,_hoodie_partition_path#550,_hoodie_file_name#551,id#552,name#553,price#554,ts#555L] parquet

      == Analyzed Logical Plan ==
      id: int
      Project [id#552]
      +- SubqueryAlias spark_catalog.default.h0
         +- Relation default.h0[_hoodie_commit_time#547,_hoodie_commit_seqno#548,_hoodie_record_key#549,_hoodie_partition_path#550,_hoodie_file_name#551,id#552,name#553,price#554,ts#555L] parquet

      == Optimized Logical Plan ==
      InMemoryRelation [id#552], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) ColumnarToRow
            +- FileScan parquet default.h0[id#552] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv00000gn/T/spark-87b3..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>

      == Physical Plan ==
      InMemoryTableScan [id#552]
         +- InMemoryRelation [id#552], StorageLevel(disk, memory, deserialized, 1 replicas)
               +- *(1) ColumnarToRow
                  +- FileScan parquet default.h0[id#552] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv00000gn/T/spark-87b3..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>

      With projection, no caching:

      == Parsed Logical Plan ==
      'Project ['id]
      +- SubqueryAlias spark_catalog.default.h0
         +- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

      == Analyzed Logical Plan ==
      id: int
      Project [id#544]
      +- SubqueryAlias spark_catalog.default.h0
         +- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

      == Optimized Logical Plan ==
      Project [id#544]
      +- Relation default.h0[_hoodie_commit_time#539,_hoodie_commit_seqno#540,_hoodie_record_key#541,_hoodie_partition_path#542,_hoodie_file_name#543,id#544,name#545,price#546,ts#547L] parquet

      == Physical Plan ==
      *(1) ColumnarToRow
      +- FileScan parquet default.h0[id#544] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[file:/private/var/folders/d0/l7mfhzl1661byhh3mbyg5fv00000gn/T/spark-8c60..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
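
      The two plans above differ in whether the physical plan reads through an InMemoryTableScan. As a rough aid for comparing such dumps, here is a minimal sketch that splits an extended explain string into its sections and checks the physical plan for that operator. The helper names (split_plan_sections, is_served_from_cache) are hypothetical and are not part of Hudi or Spark; the sample strings are abbreviated from the plans in this report.

```python
import re

# Matches section headers such as "== Physical Plan ==" in an extended explain dump.
_SECTION = re.compile(r"== ([A-Za-z ]+) ==")


def split_plan_sections(explain_text):
    """Return {section name: section body} for an extended explain string."""
    parts = _SECTION.split(explain_text)
    # re.split with one capturing group yields [preamble, name1, body1, name2, body2, ...]
    return {name.strip(): body.strip() for name, body in zip(parts[1::2], parts[2::2])}


def is_served_from_cache(explain_text):
    """True if the physical plan section mentions an InMemoryTableScan."""
    physical = split_plan_sections(explain_text).get("Physical Plan", "")
    return "InMemoryTableScan" in physical


# Abbreviated physical plans from this report (hypothetical variable names).
cached = (
    "== Physical Plan ==InMemoryTableScan [id#552]"
    "   +- InMemoryRelation [id#552], StorageLevel(disk, memory, deserialized, 1 replicas)"
)
uncached = (
    "== Physical Plan ==*(1) ColumnarToRow"
    "+- FileScan parquet default.h0[id#544] Batched: true"
)

print(is_served_from_cache(cached))    # True
print(is_served_from_cache(uncached))  # False
```

      Because the check is a plain substring match on the dump, it works whether or not the plan text has line breaks, but it is only a heuristic for reading dumps and not a substitute for inspecting the executed plan in a test.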
      
      


          People

            Assignee: Unassigned
            Reporter: Jonathan Vexler (jonvex)