[IMPALA-6431] Speed up Kudu queries with PK predicates - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Later
Affects Version/s: None
Fix Version/s: None
Component/s: Frontend
Labels:
- kudu

Description

In near-real-time use cases, many Kudu queries filter on primary key columns, so they only need to process a small amount of data. In such scenarios, codegen can account for a large share of a query's total CPU usage. When running at high concurrency, this can easily saturate the CPU and limit throughput. Disabling codegen not only improves throughput but also reduces CPU usage.

Here are some test results:

Create table trip_data_kudu(
...
PRIMARY KEY (pickup_datetime, medallion, hack_license, vendor_id)
)
PARTITION BY HASH (medallion, vendor_id) PARTITIONS 10
stored as KUDU;

Q1 - select count(*) from trip_data_kudu where pickup_datetime='2013-01-09 20:33:00';

Q2 - select passenger_count, avg(trip_time_in_secs), count(1) from trip_data_kudu where pickup_datetime='2013-01-09 20:33:00' group by 1 order by 2 desc;

Q3 - select * from trip_data_kudu where pickup_datetime='2013-01-09 20:33:00' and vendor_id='CMT';

Query (concurrency = 16)	QPS	CPU usage
Q1 (codegen enabled)	42	76%
Q1 (codegen disabled)	250	58%

Q2 (codegen enabled)	7.7	85%
Q2 (codegen disabled)	78	65%

Q3 (codegen enabled)	52	75%
Q3 (codegen disabled)	185	60%

CPU usage here is Impala and Kudu combined. Comparing Impala alone, CPU usage with codegen enabled is ~2x that with codegen disabled.

Note that Impala doesn't have per-partition cardinality stats for Kudu tables, so these queries cannot benefit from the DISABLE_CODEGEN_ROWS_THRESHOLD optimization.
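
Until such an optimization applies to Kudu scans, codegen can be turned off explicitly for a session with the DISABLE_CODEGEN query option. The following is a minimal sketch, assuming an impala-shell session and the trip_data_kudu table above; whether the "codegen disabled" runs in the results were produced exactly this way is an assumption, not something stated in this issue.

```sql
-- Assumption: executed in an impala-shell session connected to a cluster
-- that has the trip_data_kudu table defined above.

-- Skip codegen for all subsequent queries in this session.
SET DISABLE_CODEGEN=true;

-- Q1 from the description: a short query filtered on a primary key column.
select count(*) from trip_data_kudu where pickup_datetime='2013-01-09 20:33:00';

-- Restore the default so longer-running queries still benefit from codegen.
SET DISABLE_CODEGEN=false;
```

Because the option is set per session (or per query), it can be limited to the short PK-filtered lookups while leaving codegen on for larger scans and aggregations.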

People

Assignee: Unassigned

Reporter: Juan Yu

Votes: 0

Watchers: 5

Dates

Created: 20/Jan/18 02:30

Updated: 02/Jun/20 23:23

Resolved: 02/Jun/20 23:23