SPARK-30616: Introduce TTL config option for SQL Metadata Cache


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: SQL
    • Labels: None

    Description

      From the documentation:

      Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables is also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.

      Currently Spark caches the file listing for each table and requires issuing "REFRESH TABLE" any time the file listing has changed outside of Spark. Unfortunately, submitting "REFRESH TABLE" commands manually can be very cumbersome: with frequently added files, hundreds of tables, and dozens of users querying the data (and expecting up-to-date results), manually refreshing the metadata for each table is not a workable solution (see the sketch below).
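
      As an illustration, here is a minimal sketch of the manual workaround as it stands today; the table name db.events is hypothetical:

        import org.apache.spark.sql.SparkSession

        // A sketch of the current manual-refresh workflow, assuming a
        // Hive-backed Parquet table named db.events.
        val spark = SparkSession.builder()
          .appName("manual-refresh-example")
          .enableHiveSupport()
          .getOrCreate()

        // The first query populates the cached file listing for the table.
        spark.sql("SELECT count(*) FROM db.events").show()

        // ... files are then added to the table location by an external
        // tool (e.g. Kafka Connect); Spark keeps serving the stale listing.

        // Today the only remedy is to refresh every affected table by hand.
        spark.sql("REFRESH TABLE db.events")
        spark.sql("SELECT count(*) FROM db.events").show() // sees the new files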

      This is a common use case for streaming ingestion of data, which can happen outside of Spark (with tools like Kafka Connect, etc.).

      A similar feature exists in Presto: the Hive connector's hive.file-status-cache-expire-time configuration property controls how long cached file listings remain valid.

      I propose to introduce a new option in Spark (something like "spark.sql.hive.filesourcePartitionFileCacheTTL") that controls the TTL of this metadata cache. It could be disabled by default (-1), so it would not change the existing behaviour.
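
      For illustration, a sketch of how the proposed option might be set; the option name below is the one proposed in this issue (the name ultimately merged may differ), and the TTL value is arbitrary:

        import org.apache.spark.sql.SparkSession

        // Sketch only: "spark.sql.hive.filesourcePartitionFileCacheTTL" is the
        // name proposed above, not necessarily the final one. A value of -1
        // (the proposed default) would keep the current cache-forever
        // behaviour; here cached listings expire after 300 seconds.
        val spark = SparkSession.builder()
          .appName("metadata-cache-ttl-example")
          .enableHiveSupport()
          .config("spark.sql.hive.filesourcePartitionFileCacheTTL", "300")
          .getOrCreate()

        // With a TTL in effect, cached file listings older than the TTL would
        // be re-fetched automatically, so externally added files become
        // visible without an explicit REFRESH TABLE.
        spark.sql("SELECT count(*) FROM db.events").show()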


          People

            Assignee: Yaroslav Tkachenko (sap1ens)
            Reporter: Yaroslav Tkachenko (sap1ens)
            Votes: 2
            Watchers: 3
