This feature request arose from a large search job with the following characteristics:
- The search fields are not partition, bucket, or sort fields.
- The table is very large.
- The predicates select very few rows relative to the scan size.
- The search columns are a significant subset of the selected columns in the query.
Initial analysis showed a significant potential benefit from lazily reading the non-search columns only when a match is found. We explore the design and some benchmarks in subsequent sections.
This builds further on ORC-577, which currently only restricts deserialization for some selected data types but does not improve IO.
On a high level, the design includes the following components:
- SArg to Filter: converts the Search Arguments pushed down into filters for efficient application during scans.
- Read: performs the lazy read using the filters:
  - Read Filter Columns: read the filter columns from the file.
  - Apply Filter: apply the filter to the read filter columns.
  - Read Select Columns: if the filter selects at least one row, read the remaining columns.
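The read flow above can be sketched in a few lines. This is a minimal illustration with hypothetical names (`lazyRead`, the batch arrays), not the actual ORC reader API; it only shows the control flow where the second IO for select columns is skipped whenever a batch has no matching rows:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Sketch of the lazy two-pass read: filter columns are read and the
// filter applied first; the remaining (select) columns are read only
// for batches where at least one row survives the filter.
public class LazyReadSketch {
    // Returns how many batches actually paid the IO for select columns.
    static int lazyRead(int[][] filterColumnBatches, IntPredicate filter) {
        int selectColumnReads = 0;
        for (int[] batch : filterColumnBatches) {
            // 1. Read filter columns (always required).
            // 2. Apply the filter to build the selected-row list.
            List<Integer> selected = new ArrayList<>();
            for (int row = 0; row < batch.length; row++) {
                if (filter.test(batch[row])) {
                    selected.add(row);
                }
            }
            // 3. Read select columns only if the filter kept any row.
            if (!selected.isEmpty()) {
                selectColumnReads++; // stand-in for the second IO
            }
        }
        return selectColumnReads;
    }

    public static void main(String[] args) {
        int[][] batches = { {1, 2, 3}, {10, 11, 12}, {4, 20, 6} };
        // Filter: value >= 10; only two of the three batches contain matches.
        System.out.println(lazyRead(batches, v -> v >= 10)); // 2
    }
}
```

When the filter is highly selective, most batches fall into the skipped branch, which is where the IO reduction comes from.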
This issue has the following tasks that provide further details on the design of the respective components:
- ORC-741: Bug fix related to schema evolution of missing columns in the presence of filters
- ORC-742: LazyIO of non-filter columns
- ORC-743: Conversion of SArg to Filter
We evaluated this approach against a search job with the following stats:
- Table:
  - Size: ~420 TB
  - Data fields: ~120
  - Partition fields: 3
- Search fields: 3 data fields with large (~1000-value) IN clauses combined with OR.
- Select fields: 16 data fields (including the 3 search fields) and 1 partition field.
- Scan:
  - Size: ~180 TB
  - Records: 3.99 T
- Result:
  - Size: ~100 MB
  - Records: 1 M
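Predicates of the shape above (large IN lists combined with OR) are what the SArg-to-Filter conversion (ORC-743) turns into row-level checks. A minimal sketch under stated assumptions: the class and method names are hypothetical, and the point is only that an IN clause with ~1000 literals can be evaluated as a hash-set lookup per row rather than a linear scan of the literal list:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: an IN clause with many literals becomes a hash-set lookup,
// so evaluating the filter per row is O(1) instead of O(#literals).
public class InFilterSketch {
    static boolean[] applyInFilter(long[] column, Set<Long> inValues) {
        boolean[] selected = new boolean[column.length];
        for (int row = 0; row < column.length; row++) {
            selected[row] = inValues.contains(column[row]);
        }
        return selected;
    }

    public static void main(String[] args) {
        // ~500 literals stand in for a large IN clause (even values only).
        Set<Long> inValues = new HashSet<>();
        for (long v = 0; v < 1000; v += 2) {
            inValues.add(v);
        }
        boolean[] selected = applyInFilter(new long[]{2, 3, 998, 1001}, inValues);
        System.out.println(selected[0] && !selected[1] && selected[2] && !selected[3]); // true
    }
}
```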
We observed the following reductions compared with runs without the patch:
|Test|IO Reduction %|CPU Reduction %|
|---|---|---|
|Select 16 columns|45|47|
- The savings grow as the number of select columns increases relative to the search columns.
- When the filter selects most of the data, no significant penalty was observed from performing two IOs instead of one.
- There is a CPU penalty from applying the filter to the selected records.
- is depended upon by:
  - ORC-980: Filter processing ignores the schema case-sensitivity flag
  - ORC-983: Revise filter processing log level/location/message
- is related to:
  - ORC-577: Allow row-level filtering
  - SPARK-41798: Upgrade hive-storage-api to 2.8.1