Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.6.0
- Fix Version/s: None
Description
The current RecordReader implementation reads all the columns of a row before applying the filter predicate and deciding whether to keep or discard that row.
Instead, we can have a RecordReader which first assembles only the columns on which filters are applied (usually just a few), then applies the filter, and then either assembles the remaining columns or skips them, depending on whether the row is kept.
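The two-phase flow described above can be sketched as follows. This is a hypothetical illustration, not the actual Parquet API: `FilterFirstReader`, `ColumnReader`, `readValue`, and `skipValue` are all invented names standing in for the real column-reading machinery.

```java
import java.util.List;
import java.util.function.Predicate;

/**
 * Illustrative sketch of a filter-first record reader. Columns the
 * predicate touches are assembled first; the remaining columns are
 * assembled only for rows that survive the filter, and skipped
 * (advanced without assembly) for rows that are discarded.
 */
public class FilterFirstReader {
    /** Minimal stand-in for a column reader positioned at the current row. */
    interface ColumnReader {
        Object readValue(); // assemble the value for the current row, advance
        void skipValue();   // advance without assembling
    }

    private final List<ColumnReader> filterColumns;
    private final List<ColumnReader> otherColumns;
    private final Predicate<Object[]> rowPredicate;

    FilterFirstReader(List<ColumnReader> filterColumns,
                      List<ColumnReader> otherColumns,
                      Predicate<Object[]> rowPredicate) {
        this.filterColumns = filterColumns;
        this.otherColumns = otherColumns;
        this.rowPredicate = rowPredicate;
    }

    /** Returns the assembled row, or null if the filter discards it. */
    Object[] readRow() {
        // Phase 1: assemble only the columns the predicate needs.
        Object[] filterValues = new Object[filterColumns.size()];
        for (int i = 0; i < filterValues.length; i++) {
            filterValues[i] = filterColumns.get(i).readValue();
        }
        if (!rowPredicate.test(filterValues)) {
            // Row discarded: skip the remaining columns cheaply.
            for (ColumnReader c : otherColumns) {
                c.skipValue();
            }
            return null;
        }
        // Phase 2: assemble the remaining columns for the surviving row.
        Object[] row = new Object[filterValues.length + otherColumns.size()];
        System.arraycopy(filterValues, 0, row, 0, filterValues.length);
        for (int i = 0; i < otherColumns.size(); i++) {
            row[filterValues.length + i] = otherColumns.get(i).readValue();
        }
        return row;
    }
}
```

The key design point is that `skipValue` must still advance the column's position so all readers stay aligned on the same row, but it avoids the cost of materializing the value.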
Also, for applications like Spark SQL, the schema applied is usually flat, with no repeated or nested columns. In such cases, it is better to have a lightweight, faster RecordReader.
The performance improvement from this change is significant, and is greatest when filtering returns a small fraction of the rows (which is usually the case) and the schema has many columns.
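A back-of-envelope cost model makes the claim above concrete. Assuming (purely for illustration) that per-column assembly cost is uniform and skipping is free: with C total columns, F filter columns, and selectivity s (the fraction of rows kept), the eager reader assembles C values per row, while the filter-first reader assembles F + s * (C - F) values per row on average.

```java
public class CostModel {
    /** Eager reader: every column is assembled for every row. */
    static double eagerCostPerRow(int totalColumns) {
        return totalColumns;
    }

    /**
     * Filter-first reader: the F filter columns are always assembled;
     * the remaining (C - F) columns only for the fraction s of rows
     * that pass the filter.
     */
    static double filterFirstCostPerRow(int totalColumns, int filterColumns,
                                        double selectivity) {
        return filterColumns + selectivity * (totalColumns - filterColumns);
    }

    public static void main(String[] args) {
        // Example: 100 columns, predicate on 2 of them, 1% of rows kept.
        double eager = eagerCostPerRow(100);                    // 100.0
        double lazy = filterFirstCostPerRow(100, 2, 0.01);      // 2.98
        System.out.printf("per-row speedup ~= %.1fx%n", eager / lazy);
    }
}
```

This matches the observation in the description: the fewer rows the filter keeps and the more columns the schema has, the larger the win.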