Details
- Type: Sub-task
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
We need to cache file metadata (e.g. ORC file footers) for split generation in the HBase metastore. On file systems that support fileId, a cached entry stays valid permanently and only needs to be removed lazily when the ORC file is deleted or compacted. We could potentially also cache some information about splits (e.g. grouping based on location, which would stay valid for a short time).
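As a rough illustration of the fileId-keyed cache with lazy removal described above, here is a minimal in-memory sketch. The class `FooterCacheSketch` and its methods are hypothetical names invented for this example; they are not the metastore API, and a plain map stands in for HBase.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only: an in-memory stand-in for a footer cache keyed by fileId. */
final class FooterCacheSketch {
    // Assumption: a given fileId always refers to the same file contents,
    // so an entry never goes stale; it can only become garbage.
    private final Map<Long, ByteBuffer> footersByFileId = new ConcurrentHashMap<>();

    void put(long fileId, ByteBuffer footer) {
        footersByFileId.put(fileId, footer.asReadOnlyBuffer());
    }

    /** Returns an independent view of the cached footer, or null on a miss. */
    ByteBuffer get(long fileId) {
        ByteBuffer cached = footersByFileId.get(fileId);
        return cached == null ? null : cached.duplicate();
    }

    /**
     * Lazy cleanup: drops entries whose files no longer exist (deleted or compacted).
     * The returned count is exact only when no other thread writes concurrently.
     */
    int evictMissing(Set<Long> liveFileIds) {
        int before = footersByFileId.size();
        footersByFileId.keySet().retainAll(liveFileIds);
        return before - footersByFileId.size();
    }

    public static void main(String[] args) {
        FooterCacheSketch cache = new FooterCacheSketch();
        cache.put(101L, ByteBuffer.wrap(new byte[] {1, 2, 3}));
        cache.put(102L, ByteBuffer.wrap(new byte[] {4, 5}));
        Set<Long> live = new HashSet<>();
        live.add(101L); // file 102 was compacted away
        System.out.println("evicted = " + cache.evictMissing(live));
    }
}
```

Because entries are never invalidated on the read path, cleanup can run whenever it is convenient, for example when a directory listing shows that a fileId has disappeared.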
The cache should be queryable by table, and partition predicate pushdown should be supported; if bucket pruning is added, that should be supported too. Given that we cannot cache file lists (we have to check the FS for new or changed files anyway), and that passing data about partitions etc. to split generation is harder than passing paths, we will probably just filter by paths and fileIds. It might be different for splits.
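The lookup flow implied above (always list the file system, then consult the cache by fileId and read footers only for misses) might look roughly like the following sketch. All names here (`SplitFooterLookupSketch`, `ListedFile`, `resolve`) are hypothetical, and the footer reader is passed in as a function rather than tied to any real ORC reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch only: resolving footers for a fresh file listing. */
final class SplitFooterLookupSketch {
    /** One entry of a directory listing obtained from the file system. */
    static final class ListedFile {
        final String path;
        final long fileId;
        ListedFile(String path, long fileId) { this.path = path; this.fileId = fileId; }
    }

    static final class Lookup {
        final Map<String, byte[]> footerByPath = new LinkedHashMap<>();
        final List<String> cacheMisses = new ArrayList<>();
    }

    /**
     * The listing always comes from the file system, never from the cache.
     * Only the footer bytes are cached, keyed by fileId; misses are read and stored.
     */
    static Lookup resolve(List<ListedFile> freshListing,
                          Map<Long, byte[]> cache,
                          Function<String, byte[]> readFooter) {
        Lookup out = new Lookup();
        for (ListedFile f : freshListing) {
            byte[] footer = cache.get(f.fileId);
            if (footer == null) {
                footer = readFooter.apply(f.path); // expensive: touches the file
                cache.put(f.fileId, footer);
                out.cacheMisses.add(f.path);
            }
            out.footerByPath.put(f.path, footer);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Long, byte[]> cache = new HashMap<>();
        cache.put(10L, new byte[] {1});
        List<ListedFile> listing = Arrays.asList(
            new ListedFile("/warehouse/t/p=1/a.orc", 10L),
            new ListedFile("/warehouse/t/p=1/b.orc", 11L));
        Lookup r = resolve(listing, cache, path -> new byte[] {2});
        System.out.println("misses = " + r.cacheMisses);
    }
}
```

Keying by fileId rather than by partition keeps the split-generation side simple: it only needs the paths and fileIds that it already obtains from the listing.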
In later phases, it would be nice to save the results of expensive work done by jobs (the first category above), e.g. data size after decompression/decoding per column, to avoid surprises when ORC encoding is very good or very bad. Perhaps it can even be lazily generated. Here's a pony: 🐴
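One way to read "lazily generated" is plain memoization: compute the decoded size of a column the first time it is requested and reuse it afterwards. The sketch below assumes that reading; `ColumnSizeCacheSketch` and the sizer function are hypothetical and only stand in for whatever expensive measurement a job would actually perform.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongBiFunction;

/** Hypothetical sketch only: memoized per-column decoded sizes, keyed by fileId and column. */
final class ColumnSizeCacheSketch {
    private final Map<String, Long> sizes = new ConcurrentHashMap<>();
    // Assumption: the caller supplies the expensive measurement (fileId, column) -> bytes.
    private final ToLongBiFunction<Long, Integer> expensiveSizer;

    ColumnSizeCacheSketch(ToLongBiFunction<Long, Integer> expensiveSizer) {
        this.expensiveSizer = expensiveSizer;
    }

    /** Computes the size on first request and returns the stored value afterwards. */
    long decodedSize(long fileId, int column) {
        String key = fileId + ":" + column;
        return sizes.computeIfAbsent(key, k -> expensiveSizer.applyAsLong(fileId, column));
    }

    public static void main(String[] args) {
        ColumnSizeCacheSketch cache =
            new ColumnSizeCacheSketch((fileId, column) -> fileId * 100 + column);
        System.out.println(cache.decodedSize(7L, 3));
        System.out.println(cache.decodedSize(7L, 3)); // served from the cache
    }
}
```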
Attachments
Issue Links
- incorporates
  - HIVE-14941 Renable stats_filemetadata.q test case (Open)
  - HIVE-12051 push file metadata PPD into HBase (Open)
  - HIVE-12052 automatically populate file metadata to HBase metastore based on config or table properties (Open)
  - HIVE-12801 document HIVE-11500 (metadata cache in HBase metastore) (Open)
  - HIVE-12925 make sure metastore footer cache doesn't get all functions (Open)
  - HIVE-11542 port fileId support on shims and splits from llap branch (Closed)
  - HIVE-11552 implement basic methods for getting/putting file metadata (Closed)
  - HIVE-11553 use basic file metadata cache in ETLSplitStrategy-related paths (Closed)
  - HIVE-11595 refactor ORC footer reading to make it usable from outside (Closed)
  - HIVE-11644 make sure Hive config is propagated to AM-side split generation (Closed)
  - HIVE-11675 make use of file footer PPD API in ETL strategy or separate strategy (Closed)
  - HIVE-11676 implement metastore API to do file footer PPD (Closed)
  - HIVE-11689 minor flow changes to ORC split generation (Closed)
  - HIVE-11705 refactor SARG stripe filtering for ORC into a separate method (Closed)
  - HIVE-11777 implement an option to have single ETL strategy for multiple directories (Closed)
  - HIVE-11823 create a self-contained translation for SARG to be used by metastore (Closed)
  - HIVE-11856 allow split strategies to run on threadpool (Closed)
  - HIVE-12048 metastore file metadata cache should not be used when deltas are present (Closed)
  - HIVE-12061 add file type support to file metadata by expr call (Closed)
  - HIVE-12062 enable HBase metastore file metadata cache for tez tests (Closed)
  - HIVE-12075 add analyze command to explictly cache file metadata in HBase metastore (Closed)