Details
- Type: Improvement
- Status: Patch Available
- Priority: Major
- Resolution: Unresolved
Description
Currently, RecordReaders such as ORC support filtering only at coarse-grained levels: file, stripe (64 to 256 MB), and row group (10k rows). They skip a set of rows only when they can guarantee that none of the rows can pass the filter (usually given as a searchable argument).
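To make the coarse-grained guarantee concrete, here is a minimal sketch of min/max pruning for an equality predicate at row-group level. The class and method names (`RowGroupPruningSketch`, `RowGroupStats`, `canSkip`, `groupsToRead`) are hypothetical and are not ORC or Hive APIs; the sketch only assumes that each row group carries min/max statistics for the filtered column.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names below are hypothetical, not ORC/Hive classes.
final class RowGroupPruningSketch {

    /** Assumed per-row-group statistics for a single long column. */
    static final class RowGroupStats {
        final long min;
        final long max;

        RowGroupStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** True only when no row in the group can satisfy "column == value". */
    static boolean canSkip(RowGroupStats stats, long value) {
        return value < stats.min || value > stats.max;
    }

    /** Returns the indices of row groups that still have to be read and decoded. */
    static List<Integer> groupsToRead(List<RowGroupStats> groups, long value) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < groups.size(); i++) {
            if (!canSkip(groups.get(i), value)) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<RowGroupStats> groups = List.of(
                new RowGroupStats(0, 99),
                new RowGroupStats(100, 199),
                new RowGroupStats(150, 300));
        System.out.println(groupsToRead(groups, 160));
    }
}
```

A group whose range contains the value is read in full even if only a handful of its rows actually match, which is the gap a row-level filter is meant to close.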
However, a significant amount of time can be spent decoding rows with multiple columns that are not even used in the final result. See the figure, where Original shows what happens today and LazyDecode skips decoding rows that do not match the key.
To enable finer-grained filtering in the particular case of a MapJoin, we could use the key HashTable built from the smaller table to skip deserializing the columns of rows in the larger table that do not match any key, and thus save CPU time.
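The idea can be sketched as a two-pass read over one batch of the larger table: look at the key column first, probe the small table's key set, and decode the remaining columns only for rows that matched. This is a simplified illustration under stated assumptions, not the Hive or ORC implementation: `ProbeDecodeSketch`, `selectMatchingRows`, `decodeSelected`, and `decodePayload` are hypothetical names, a `Set<Long>` stands in for the MapJoin key HashTable, and a UTF-8 string decode stands in for an expensive column deserialization.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Illustrative only: names below are hypothetical, not Hive/ORC classes.
final class ProbeDecodeSketch {

    /** Counts payload decodes so the saved work is observable. */
    static int payloadDecodes = 0;

    /** Stand-in for an expensive per-row column decode. */
    static String decodePayload(byte[] encoded) {
        payloadDecodes++;
        return new String(encoded, StandardCharsets.UTF_8);
    }

    /** Pass 1: read only the key column and keep indices of rows whose key is in the small table. */
    static int[] selectMatchingRows(long[] keyColumn, Set<Long> smallTableKeys) {
        int[] selected = new int[keyColumn.length];
        int count = 0;
        for (int row = 0; row < keyColumn.length; row++) {
            if (smallTableKeys.contains(keyColumn[row])) {
                selected[count++] = row;
            }
        }
        return Arrays.copyOf(selected, count);
    }

    /** Pass 2: decode the remaining column only for the selected rows. */
    static List<String> decodeSelected(byte[][] payloadColumn, int[] selected) {
        List<String> decoded = new ArrayList<>(selected.length);
        for (int row : selected) {
            decoded.add(decodePayload(payloadColumn[row]));
        }
        return decoded;
    }

    public static void main(String[] args) {
        Set<Long> smallTableKeys = Set.of(2L, 5L);
        long[] keyColumn = {1L, 2L, 3L, 5L, 2L};
        byte[][] payloadColumn = new byte[keyColumn.length][];
        String[] raw = {"a", "b", "c", "d", "e"};
        for (int i = 0; i < raw.length; i++) {
            payloadColumn[i] = raw[i].getBytes(StandardCharsets.UTF_8);
        }

        int[] selected = selectMatchingRows(keyColumn, smallTableKeys);
        List<String> rows = decodeSelected(payloadColumn, selected);
        System.out.println(rows + " decoded with " + payloadDecodes + " payload decodes");
    }
}
```

In this sketch, rows with no matching key never reach `decodePayload`, so the CPU saving grows with the selectivity of the join and the number of non-key columns.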
This Jira investigates this direction.
Attachments
Issue Links
- depends upon
  - HIVE-23215 Make FilterContext and MutableFilterContext interfaces (Closed)
  - ORC-577 Allow row-level filtering (Closed)
- is blocked by
  - HIVE-23553 Upgrade ORC version to 1.6.7 (Closed)
- relates to
  - HIVE-23167 Expression probe decode with row-level filtering (In Progress)
- links to