[IMPALA-2017] Lazy materialization of Parquet columns during query - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: Impala 1.4, Impala 2.0, Impala 2.1, Impala 2.2
Fix Version/s: None
Component/s: Backend
Labels:
- parquet
- performance

Target Version: Product Backlog

Description

When I run a query that returns a single row over a 4 billion row table, it takes ~30 seconds if I do 'select * ...' but only ~3 seconds if I do 'select field1, field2 ...'. This is repeatable.

Given these times, it would seem that the 'select *' query is materializing all the fields of every row, whether or not the row matches the predicate.

Lazy materialization, i.e. materializing the remaining columns only for rows that pass the filter, could improve performance.
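To make the request concrete, here is a minimal Python sketch of the difference between eager and lazy column materialization. It is not Impala's scanner code: `Column`, `eager_scan`, `lazy_scan`, `make_table`, and the `payload` column are hypothetical names invented for illustration, and "decoding" is reduced to a counter so the amount of work can be compared.

```python
from typing import Callable, Dict, Iterable, List


class Column:
    """Toy stand-in for a Parquet column chunk; counts decoded values."""

    def __init__(self, encoded: Iterable[int]) -> None:
        self._encoded = list(encoded)
        self.decoded_count = 0

    def __len__(self) -> int:
        return len(self._encoded)

    def decode(self, row: int) -> int:
        # Real decoding (decompression, dictionary lookup) is far costlier;
        # this sketch only counts how often it would happen.
        self.decoded_count += 1
        return self._encoded[row]


Table = Dict[str, Column]
Row = Dict[str, int]


def eager_scan(table: Table, filter_col: str,
               predicate: Callable[[int], bool]) -> List[Row]:
    """Decode every column of every row, then apply the predicate."""
    matches = []
    for row in range(len(table[filter_col])):
        record = {name: col.decode(row) for name, col in table.items()}
        if predicate(record[filter_col]):
            matches.append(record)
    return matches


def lazy_scan(table: Table, filter_col: str,
              predicate: Callable[[int], bool]) -> List[Row]:
    """Decode the filter column first; decode the rest only on a match."""
    filter_column = table[filter_col]
    matches = []
    for row in range(len(filter_column)):
        value = filter_column.decode(row)
        if not predicate(value):
            continue  # non-matching row: other columns are never decoded
        record = {filter_col: value}
        for name, col in table.items():
            if name != filter_col:
                record[name] = col.decode(row)
        matches.append(record)
    return matches


def make_table(num_rows: int) -> Table:
    """Build a small hypothetical 'events' table with three columns."""
    return {
        "event_id": Column(range(num_rows)),
        "client_id": Column(r * 7 for r in range(num_rows)),
        "payload": Column(r * 13 for r in range(num_rows)),
    }


def total_decoded(table: Table) -> int:
    """Total number of values decoded across all columns."""
    return sum(col.decoded_count for col in table.values())


if __name__ == "__main__":
    eager_table, lazy_table = make_table(1000), make_table(1000)
    eager_scan(eager_table, "event_id", lambda v: v == 500)
    lazy_scan(lazy_table, "event_id", lambda v: v == 500)
    print("eager decoded:", total_decoded(eager_table))
    print("lazy decoded:", total_decoded(lazy_table))
```

With a single matching row, the lazy variant decodes the whole filter column plus one value from each remaining column, while the eager variant decodes every value of every column. The gap grows with the number of columns, which is consistent with the 35-field 'select *' case reported here, though the sketch ignores I/O and says nothing about where the real scanner spends its time.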

These four queries were run back to back. The actual returned data is elided (sorry). The table has 35 fields.

0: jdbc:hive2://atl1c1r2data09.vldb-bo.secure> select * from events where event_id=1416403791; 
<elided>
1 row selected (33.777 seconds)
0: jdbc:hive2://atl1c1r2data09.vldb-bo.secure> select event_id, client_id from events where event_id=1416403791;
+-------------+------------+--+
| event_id | client_id |
+-------------+------------+--+
| 1416403791 | <elided> |
+-------------+------------+--+
1 row selected (3.363 seconds)
0: jdbc:hive2://atl1c1r2data09.vldb-bo.secure> select * from events where event_id=1416403791; 
<elided>
1 row selected (33.138 seconds)
0: jdbc:hive2://atl1c1r2data09.vldb-bo.secure> select event_id, client_id from events where event_id=1416403791;
+-------------+------------+--+
| event_id | client_id |
+-------------+------------+--+
| 1416403791 | <elided> |
+-------------+------------+--+
1 row selected (3.074 seconds)
0: jdbc:hive2://atl1c1r2data09.vldb-bo.secure>

Issue Links

is blocked by:
- IMPALA-2736 Column-wise value materialisation in Parquet scanner (Resolved)

is duplicated by:
- IMPALA-3052 Reorder Parquet Column readers such that slots with probe filters are read first (Resolved)

is related to:
- IMPALA-3841 Avoid materializing nested collections if top-level predicates already disqualify the row. (Open)
- IMPALA-8077 Avoid converting timestamps in dropped rows during Parquet scanning (Resolved)

relates to:
- IMPALA-9810 Support Kudu's columnar scan format (Apache Arrow) (Open)

Sub-Tasks

1. Skip decoding of non-materialised columns in Parquet | Resolved | Amogh Margoor
2. Reduce or avoid I/O for pruned columns | Open | Abhishek Rawat

People

Assignee: Abhishek Rawat

Reporter: Lou Bershad

Votes: 3

Watchers: 23

Dates

Created: 19/May/15 12:29

Updated: 18/Nov/20 19:53