Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Description
Pruning Design:
- step1: Fetch the latest file slices for the pruned partitions (from MDT).
- step2.a: Fetch stats from the col stats index, which outputs in the format {{File1, col1 ➝ stat1}, {File2, col1 ➝ stat2}, ...}, i.e. one entry per file,column combo. Here we read using HoodieTableMetadata.getRecordsByKeyPrefixes(), passing in only the columns.
- step2.b: Apply a filter function to prune the entries from step2.a based on the list from step1. Each col stats value contains the file name, and we filter on that. The output of this step is the latest files looked up from the col stats partition in MDT.
- step2.c: Construct a matrix of the format File1 ➝ {col1_valuecount, col1_minvalue, col1_maxvalue, col2_valuecount, ...}, i.e. one entry per file.
- step2.d: Get the list of files indexed by col stats.
- step2.e: Apply the query predicate over the matrix from step2.c and get the list of pruned file names.
- step3: If any files are missing from the col stats index (step1 output - step2.d output), add them back to the step2.e output to get the final pruned file list. In other words, pruned files + missingToIndexFiles are the final set of candidate files we return from this step.
- Let's name the output of step3 the candidate files.
- step4: For every file slice from step1: if every file in the file slice is missing from the candidate files, we can ignore the file slice (in other words, no file in this file slice matched the predicate from col stats, so it is safe to ignore the entire file slice). If even one file is present in the candidate files, we need to include the file slice in its entirety.
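
To make the flow above concrete, here is a minimal sketch of the design from the col stats filtering through step4. It is written in Python for brevity and is not the actual Hudi code: ColStat, candidate_files, prune_file_slices, the file names, and the sample predicate are all hypothetical, and file slices are modelled as a plain mapping from a slice id to its file names.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, NamedTuple, Set


class ColStat(NamedTuple):
    """One col stats entry per (file, column) combo. Hypothetical shape."""
    file_name: str
    column: str
    min_value: int
    max_value: int
    value_count: int


FileStats = Dict[str, ColStat]  # column name -> stats, for a single file


def candidate_files(
    latest_files: Set[str],
    col_stats: Iterable[ColStat],
    predicate: Callable[[FileStats], bool],
) -> Set[str]:
    # Drop entries for files that are not in the latest file slices, then
    # transpose the remaining entries into one row per file (the "matrix").
    per_file: Dict[str, FileStats] = defaultdict(dict)
    for stat in col_stats:
        if stat.file_name in latest_files:
            per_file[stat.file_name][stat.column] = stat

    indexed = set(per_file)                                       # files covered by col stats
    pruned = {f for f, row in per_file.items() if predicate(row)}  # files that may match
    missing = latest_files - indexed                              # not indexed, must be kept
    return pruned | missing                                       # step3: candidate files


def prune_file_slices(file_slices: Dict[str, List[str]], candidates: Set[str]) -> List[str]:
    # step4: keep a file slice if at least one of its files is a candidate.
    return [sid for sid, files in file_slices.items()
            if any(f in candidates for f in files)]


# Hypothetical sample data: fs1 has a base file and a log file.
slices = {"fs1": ["f1.parquet", "f1.log"], "fs2": ["f2.parquet"], "fs3": ["f3.parquet"]}
latest = {f for files in slices.values() for f in files}
stats = [
    ColStat("f1.parquet", "col1", 0, 10, 100),
    ColStat("f2.parquet", "col1", 50, 90, 100),
    ColStat("old.parquet", "col1", 0, 5, 10),  # stale file, filtered out
]
# "col1 == 7" can only match a file whose [min, max] range contains 7.
cands = candidate_files(latest, stats, lambda row: row["col1"].min_value <= 7 <= row["col1"].max_value)
kept = prune_file_slices(slices, cands)
```

With this sample data, fs2 is dropped because its only file is indexed and fails the predicate, while fs3 is kept only because its file is not indexed at all.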
This is regarding step 4. As per the current logic, we go through every file in the candidate files and check whether it matches any file in the current file slice. If it matches, we include the file slice and move on to the next file slice for processing. Shouldn't we reverse the lookup? That is: for every file slice ➝ for every file in it ➝ check whether it is part of the candidate list; if any file matches, include the file slice, otherwise ignore it.
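
The two lookup directions can be sketched side by side as follows. This is an illustration only, in Python, with hypothetical function names. It assumes the candidate files are held in a hash set and a file slice is a short list of file names, which the description above does not state. Both functions return the same answer; the difference is that the first walks the whole candidate list for each file slice, while the second does one set lookup per file in the slice.

```python
from typing import Iterable, List, Set


def include_slice_candidate_driven(slice_files: List[str], candidates: Iterable[str]) -> bool:
    # Direction described as the current logic: iterate over the candidate
    # files and check each one against the files of the current file slice.
    for candidate in candidates:
        if candidate in slice_files:
            return True
    return False


def include_slice_file_driven(slice_files: List[str], candidates: Set[str]) -> bool:
    # Proposed direction: iterate over the files of the file slice and check
    # each one against the candidate set (a hash lookup under the assumption above).
    return any(f in candidates for f in slice_files)
```

Since a file slice typically holds far fewer files than the candidate list, the second form does less work per file slice under these assumptions.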