Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.0.0
- Fix Version/s: None
Description
Metadata cache is a key-value cache built on Google Guava Cache to speed up building logical plan nodes (`LogicalRelation`) for data source tables. The cache key is a unique identifier of a table: the fully qualified table name, including the database in which the table resides. (In the future, it could be extended to multi-part names when introducing a federated Catalog.) The value is the corresponding `LogicalRelation` that represents a specific data source table.
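As a minimal sketch of this layout, assuming Guava's `CacheBuilder`/`CacheLoader` API (`QualifiedTableName` and `buildLogicalRelation` are illustrative names here, not necessarily the exact Spark internals):

import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object MetadataCacheSketch {
  // Hypothetical key type: the fully qualified table name (database + table).
  case class QualifiedTableName(database: String, name: String)

  // Assumed helper: decodes the metadata fetched from the external Catalog
  // into a LogicalRelation for the given table.
  def buildLogicalRelation(table: QualifiedTableName): LogicalPlan = ???

  // A LoadingCache auto-loads on a miss by calling CacheLoader.load.
  val cachedDataSourceTables: LoadingCache[QualifiedTableName, LogicalPlan] =
    CacheBuilder.newBuilder()
      .maximumSize(1000) // assumed bound; the real size is a tuning choice
      .build(new CacheLoader[QualifiedTableName, LogicalPlan]() {
        override def load(table: QualifiedTableName): LogicalPlan =
          buildLogicalRelation(table)
      })
}

With this layout, `cachedDataSourceTables.get(key)` transparently builds and caches the `LogicalRelation` on a miss, which is the auto-loading behavior described below.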
The cache is session-based. Within each session, the cache is managed in two different ways at the same time:
1. *Auto loading*: when Spark queries the cache for a user-defined data source table, the cache either returns the cached LogicalRelation or automatically builds a new one by decoding the metadata fetched from the external Catalog.
2. *Manual caching*: Hive tables are represented as logical plan nodes of type MetastoreRelation. For better performance, we convert Hive serde tables to data source tables when they are convertible. The conversion is not performed at metadata-loading time; instead, it is conducted during semantic analysis. If a Hive serde table is convertible, we first try to get the value (by the fully qualified table name) from the metadata cache. If it exists, we use it directly; otherwise, we build a new one and also push it into the cache for future reuse (see the sketch after this list).
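A hedged sketch of the manual-caching path in item 2, reusing the `QualifiedTableName` key and `buildLogicalRelation` helper assumed above (`convertToLogicalRelation` is likewise an illustrative name):

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object ManualCachingSketch {
  import MetadataCacheSketch._

  def convertToLogicalRelation(db: String, table: String): LogicalPlan = {
    val key = QualifiedTableName(db, table)
    // Probe without auto-loading: getIfPresent returns null on a miss.
    Option(cachedDataSourceTables.getIfPresent(key)) match {
      case Some(plan) => plan                 // cache hit: reuse the converted plan directly
      case None =>
        val plan = buildLogicalRelation(key)  // build a new data source relation
        cachedDataSourceTables.put(key, plan) // push into the cache for future reuse
        plan
    }
  }
}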
Currently, the file `HiveMetastoreCatalog.scala` contains a variety of unrelated entities/functions only because all of them need to interact with the cache, called `cachedDataSourceTables`. This JIRA is to clean up `HiveMetastoreCatalog.scala`.
*Proposal*: To avoid mixing everything related to the cache in the same file, we abstract and define the following API for cache operations. After the code changes, `HiveMetastoreCatalog.scala` will only contain the cache API implementation, and the file can be renamed to `MetadataCache.scala`:
// cacheTable is a wrapper of cache.put(key, value). It associates value with key in this cache.
// If the cache previously contained a value associated with key, the old value is replaced by value.
def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit

// getTableIfPresent is a wrapper of cache.getIfPresent(key); it never causes values to be automatically loaded.
def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan]

// getTable is a wrapper of cache.get(key). On a cache miss, a cache built with a CacheLoader
// calls CacheLoader.load(key) to load the new value into the cache.
def getTable(tableIdent: TableIdentifier): LogicalPlan

// refreshTable is a wrapper of cache.invalidate(key). It does not eagerly reload the cache;
// it just invalidates the entry. The next time the table is used, it will be repopulated in the cache.
def refreshTable(tableIdent: TableIdentifier): Unit

// invalidateAll is a wrapper of cache.invalidateAll. It discards all entries in the cache.
def invalidateAll(): Unit
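A minimal sketch of how these five operations could wrap the Guava cache assumed earlier (`toQualified` and the `currentDatabase` fallback are hypothetical helpers; this is illustrative, not the final implementation):

import com.google.common.cache.LoadingCache
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import MetadataCacheSketch.QualifiedTableName

class MetadataCache(
    cache: LoadingCache[QualifiedTableName, LogicalPlan],
    currentDatabase: () => String) {

  // Assumed mapping: qualify the identifier with the current database if unset.
  private def toQualified(ident: TableIdentifier): QualifiedTableName =
    QualifiedTableName(ident.database.getOrElse(currentDatabase()), ident.table)

  def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit =
    cache.put(toQualified(tableIdent), plan)

  def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan] =
    Option(cache.getIfPresent(toQualified(tableIdent)))

  def getTable(tableIdent: TableIdentifier): LogicalPlan =
    cache.get(toQualified(tableIdent)) // triggers CacheLoader.load on a miss

  def refreshTable(tableIdent: TableIdentifier): Unit =
    cache.invalidate(toQualified(tableIdent)) // lazy: repopulated on next use

  def invalidateAll(): Unit =
    cache.invalidateAll()
}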
We should also move the three Hive-specific Analyzer rules `CreateTables`, `OrcConversions`, and `ParquetConversions` from `HiveMetastoreCatalog.scala` to `HiveStrategies.scala`.