Apache Hudi / HUDI-5697

Spark SQL re-lists Hudi table after every SQL operation



    Description

      Currently, after most DML operations in Spark SQL, Hudi invokes `Catalog.refreshTable`.

      Prior to Spark 3.2, this essentially did the following:

      1. Invalidated the relation cache, forcing the relation to be re-resolved on next access (creating a new FileIndex, listing files, etc.)
      2. Triggered cascading invalidation (re-caching) of the cached data (in CacheManager)

      As of Spark 3.2, it additionally calls `LogicalRelation.refresh` for ALL tables (previously this was done only for temporary views). That call triggers `FileIndex.refresh`, so the whole table is re-listed, which can be a costly operation.
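
To make the cost difference concrete, here is a minimal toy model in Python. It is not Spark or Hudi code: `ToyStorage`, `ToyFileIndex`, `ToyCatalog`, and both `refresh_table_*` methods are hypothetical names invented for illustration, and the model assumes the file index lists files when it is created and ignores the CacheManager re-caching step. It only counts how many times the table is listed across "query, DML plus refresh, query".

```python
class ToyStorage:
    """Hypothetical stand-in for the table's storage; counts listings."""

    def __init__(self):
        self.listings = 0

    def list_files(self):
        self.listings += 1  # each call models one full table listing
        return ["part-0.parquet"]


class ToyFileIndex:
    """Hypothetical stand-in for a FileIndex (assumed to list on creation)."""

    def __init__(self, storage):
        self.storage = storage
        self.files = storage.list_files()

    def refresh(self):
        # Models FileIndex.refresh: re-list the whole table.
        self.files = self.storage.list_files()


class ToyCatalog:
    """Hypothetical catalog holding a single cached relation."""

    def __init__(self, storage):
        self.storage = storage
        self.relation_cache = None

    def resolve(self):
        # Re-resolving the relation creates a new file index if none is cached.
        if self.relation_cache is None:
            self.relation_cache = ToyFileIndex(self.storage)
        return self.relation_cache

    def refresh_table_pre_32(self):
        # Model of the pre-3.2 behavior: only invalidate the relation cache.
        self.relation_cache = None

    def refresh_table_32(self):
        # Model of the 3.2 behavior: refresh the relation, then invalidate it.
        self.resolve().refresh()
        self.relation_cache = None


def listings_for_query_dml_query(refresh):
    """Count listings for: initial query, refresh after DML, next query."""
    storage = ToyStorage()
    catalog = ToyCatalog(storage)
    catalog.resolve()   # initial query
    refresh(catalog)    # refresh issued after a DML operation
    catalog.resolve()   # next query re-resolves the relation
    return storage.listings


if __name__ == "__main__":
    print("pre-3.2 model:", listings_for_query_dml_query(ToyCatalog.refresh_table_pre_32))
    print("3.2 model:    ", listings_for_query_dml_query(ToyCatalog.refresh_table_32))
```

In this model the 3.2-style refresh performs one more listing per DML operation than the pre-3.2 style, and the file index that was refreshed is discarded immediately afterwards. Real listing counts depend on the Spark and Hudi versions and on how the file index is implemented.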

       

      We should revert to the Spark 3.1 behavior.

            People

              alexey.kudinkin Alexey Kudinkin