Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
1.9.0
-
None
-
None
Description
I'm looking at a CPU profile of Impala inserting into Kudu. KuduTableSink::Send has code that schematically does the following:
for each row in the batch {
  for each column {
    if (schema.Column(col_idx).is_nullable()) {
      write->mutable_row()->SetNull(col_idx);
    }
  }
}
See kudu-table-sink.cc. However, KuduSchema::Column copies the column schema and returns it by value, so the if statement constructs and destroys a column schema object just to check if the column is nullable.
This is by far the largest consumer of CPU in the Impala process (roughly 35%), although the workload might be I/O bound on writes to Kudu anyway. Nevertheless, we should provide a way to avoid this copy in the API, either by adding a method like
class KuduSchema {
  const KuduColumnSchema& get_column(int idx);
};
or a method like
class KuduSchema {
  bool is_column_nullable(int idx);
};
The former is more flexible, while the latter frees the client from worrying about holding the reference longer than the KuduColumnSchema object lives. We might need to add several methods similar to the latter to cover other potentially useful checks, such as encoding and type.
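To illustrate the difference between the by-value accessor and the two proposed alternatives, here is a minimal sketch. It uses hypothetical stand-in classes (ColumnSchema and Schema), not the real Kudu client classes, and a copy counter added only for demonstration. The names get_column and is_column_nullable follow the proposal above; everything else is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column schema; NOT the real Kudu client class.
class ColumnSchema {
 public:
  ColumnSchema(std::string name, bool nullable)
      : name_(std::move(name)), nullable_(nullable) {}
  // The copy constructor bumps a counter so copies can be observed.
  ColumnSchema(const ColumnSchema& other)
      : name_(other.name_), nullable_(other.nullable_) { ++copies; }
  ColumnSchema(ColumnSchema&&) = default;

  bool is_nullable() const { return nullable_; }

  static int copies;  // demonstration only

 private:
  std::string name_;
  bool nullable_;
};
int ColumnSchema::copies = 0;

// Hypothetical stand-in for a table schema.
class Schema {
 public:
  void AddColumn(std::string name, bool nullable) {
    cols_.emplace_back(std::move(name), nullable);
  }
  std::size_t num_columns() const { return cols_.size(); }

  // Current behavior described in this issue: returns a copy on every call.
  ColumnSchema Column(std::size_t idx) const {
    assert(idx < cols_.size());
    return cols_[idx];
  }
  // Proposal 1: return a reference; the caller must not outlive the schema.
  const ColumnSchema& get_column(std::size_t idx) const {
    assert(idx < cols_.size());
    return cols_[idx];
  }
  // Proposal 2: answer the question directly, with no lifetime concerns.
  bool is_column_nullable(std::size_t idx) const {
    assert(idx < cols_.size());
    return cols_[idx].is_nullable();
  }

 private:
  std::vector<ColumnSchema> cols_;
};

// Each function mimics the per-row, per-column nullability check.
int CountNullableByValue(const Schema& s, int rows) {
  int n = 0;
  for (int r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < s.num_columns(); ++c)
      if (s.Column(c).is_nullable()) ++n;  // one copy per check
  return n;
}

int CountNullableByRef(const Schema& s, int rows) {
  int n = 0;
  for (int r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < s.num_columns(); ++c)
      if (s.get_column(c).is_nullable()) ++n;  // no copy
  return n;
}

int CountNullableByFlag(const Schema& s, int rows) {
  int n = 0;
  for (int r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < s.num_columns(); ++c)
      if (s.is_column_nullable(c)) ++n;  // no copy
  return n;
}
```

In this sketch the by-value loop performs one column schema copy per row per column, while both proposed accessors perform none. A client that cannot change the API could also get a similar effect by caching the nullability flags once per batch, outside the row loop.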
Issue Links
- is related to IMPALA-8284 KuduTableSink spends a lot of CPU copying KuduColumnSchemas (Resolved)