Description
Currently, in the ORCRecordReader, the filter logic that performs LazyIO receives the following inputs:
- SearchArgument as passed by the client using `Reader.Options.getSearchArgument`
- Input filter as passed by the client using `Reader.Options.getFilterCallback`
The SearchArgument is particularly convenient because it integrates easily with existing engines such as Spark without requiring any code changes in the engine. However, this pushdown is limited to what a SearchArgument can represent. For example, any predicate that uses a function cannot be pushed down:
SELECT * FROM table WHERE lower(f1) IN ... OR f2 IN ... OR f3 IN ...
For the above query, none of the filters are pushed down to ORC from the engine: we have no means of representing functions, and because the predicates are combined with OR, the one unrepresentable predicate prevents pushdown of the others as well.
An additional input mechanism is requested for supplying filters, one that is pluggable and does not require direct changes in the clients. We propose using Java *ServiceLoader* to dynamically determine the desired filters for a given fully qualified file path.
This filter, if one is determined, is applied in conjunction (AND) with the other available filters. It is understood that the plugin filter cannot differentiate between multiple aliases of the same table, since it is selected by file path alone.
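To make the proposal concrete, here is a minimal sketch of the lookup and the AND combination. Every name in it (`PluginFilterProvider`, `PluginFilters`, `LowerF1Provider`, the `/warehouse/t1/` path) is hypothetical and not part of ORC, and rows are modeled as simple column-name to value maps rather than ORC row batches. Only `java.util.ServiceLoader` and the other JDK classes are real APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.ServiceLoader;
import java.util.Set;
import java.util.function.Predicate;

/** Demo entry point; rows are column-name to value maps for brevity. */
class PluginFilterDemo {
    public static void main(String[] args) {
        // Providers registered under META-INF/services are discovered here.
        // One is also added by hand so the demo has a filter to apply.
        List<PluginFilterProvider> providers = PluginFilters.discover();
        providers.add(new LowerF1Provider("/warehouse/t1/", Set.of("a", "b")));

        // Stand-in for the filter the client already supplies.
        Predicate<Map<String, String>> clientFilter = r -> r.get("f2") != null;
        Predicate<Map<String, String>> filter =
            PluginFilters.combine(clientFilter, providers, "/warehouse/t1/part-0.orc");

        Map<String, String> row = new HashMap<>();
        row.put("f1", "A");
        row.put("f2", "x");
        System.out.println(filter.test(row));
    }
}

/** Hypothetical SPI (an assumption, not an ORC interface). */
interface PluginFilterProvider {
    /** Returns a row filter for the file, or empty if this provider does not apply. */
    Optional<Predicate<Map<String, String>>> filterFor(String filePath);
}

/**
 * Sample provider expressing lower(f1) IN (...) for files under one table directory.
 * A provider loaded by ServiceLoader would need a public no-arg constructor instead.
 */
class LowerF1Provider implements PluginFilterProvider {
    private final String tablePrefix;
    private final Set<String> allowed;

    LowerF1Provider(String tablePrefix, Set<String> allowed) {
        this.tablePrefix = tablePrefix;
        this.allowed = allowed;
    }

    @Override
    public Optional<Predicate<Map<String, String>>> filterFor(String filePath) {
        if (!filePath.startsWith(tablePrefix)) {
            return Optional.empty();
        }
        Predicate<Map<String, String>> p = r -> {
            String f1 = r.get("f1");
            return f1 != null && allowed.contains(f1.toLowerCase(Locale.ROOT));
        };
        return Optional.of(p);
    }
}

final class PluginFilters {
    private PluginFilters() {}

    /** Collects every provider that ServiceLoader can find on the classpath. */
    static List<PluginFilterProvider> discover() {
        List<PluginFilterProvider> found = new ArrayList<>();
        for (PluginFilterProvider p : ServiceLoader.load(PluginFilterProvider.class)) {
            found.add(p);
        }
        return found;
    }

    /** ANDs each applicable plugin filter onto the client-supplied filter. */
    static Predicate<Map<String, String>> combine(
            Predicate<Map<String, String>> clientFilter,
            List<PluginFilterProvider> providers,
            String filePath) {
        Predicate<Map<String, String>> result = clientFilter;
        for (PluginFilterProvider p : providers) {
            Optional<Predicate<Map<String, String>>> f = p.filterFor(filePath);
            if (f.isPresent()) {
                result = result.and(f.get());
            }
        }
        return result;
    }
}
```

Because the provider sees only the file path, two aliases of the same table resolve to the same filter, which matches the limitation noted above. When no provider applies to a path, `combine` returns the client filter unchanged.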
This generic capability will allow us to represent complex filters that the existing engines currently cannot push down to the storage layer, letting us reap the benefits of LazyIO in many more cases.