[KUDU-2731] Getting column schema information from KuduSchema requires copying a KuduColumnSchema object - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.9.0
Fix Version/s: None
Component/s: perf
Labels:
None

Description

I'm looking at a CPU profile of Impala inserting into Kudu. KuduTableSink::Send has code that schematically does the following:

for each row in the batch {
  for each column {
    if (schema.Column(col_idx).is_nullable()) {
      write->mutable_row()->SetNull(col_idx);
    }
  }
}

See kudu-table-sink.cc. However, KuduSchema::Column copies the column schema and returns it by value, so the if statement constructs and destroys a column schema object just to check if the column is nullable.
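
To make the cost concrete, here is a minimal, self-contained sketch. It does not use the Kudu client: ColumnSchemaSketch and SchemaSketch are hypothetical stand-ins for KuduColumnSchema and KuduSchema, and the copy counter exists only for illustration. It contrasts the per-cell lookup with a client-side workaround that reads each column's nullability once and reuses it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for KuduColumnSchema. It counts copy constructions
// so the cost of a by-value accessor becomes visible.
struct ColumnSchemaSketch {
  ColumnSchemaSketch(std::string n, bool is_null)
      : name(std::move(n)), nullable(is_null) {}
  ColumnSchemaSketch(const ColumnSchemaSketch& other)
      : name(other.name), nullable(other.nullable) {
    ++copies();
  }
  bool is_nullable() const { return nullable; }
  static int& copies() { static int count = 0; return count; }

  std::string name;
  bool nullable;
};

// Hypothetical stand-in for KuduSchema with a by-value Column() accessor,
// as described in this issue.
class SchemaSketch {
 public:
  explicit SchemaSketch(const std::vector<std::pair<std::string, bool>>& cols) {
    cols_.reserve(cols.size());  // avoid reallocation copies while building
    for (const auto& c : cols) cols_.emplace_back(c.first, c.second);
  }
  ColumnSchemaSketch Column(std::size_t idx) const {
    assert(idx < cols_.size());
    return cols_[idx];  // one copy per call
  }
  std::size_t num_columns() const { return cols_.size(); }

 private:
  std::vector<ColumnSchemaSketch> cols_;
};

// Per-cell lookup: copies a column schema for every (row, column) pair.
std::size_t CountNullableCellsNaive(const SchemaSketch& schema,
                                    std::size_t num_rows) {
  std::size_t n = 0;
  for (std::size_t row = 0; row < num_rows; ++row) {
    for (std::size_t col = 0; col < schema.num_columns(); ++col) {
      if (schema.Column(col).is_nullable()) ++n;
    }
  }
  return n;
}

// Workaround: read nullability once per column, then reuse the flags.
std::size_t CountNullableCellsCached(const SchemaSketch& schema,
                                     std::size_t num_rows) {
  std::vector<bool> nullable(schema.num_columns());
  for (std::size_t col = 0; col < schema.num_columns(); ++col) {
    nullable[col] = schema.Column(col).is_nullable();
  }
  std::size_t n = 0;
  for (std::size_t row = 0; row < num_rows; ++row) {
    for (std::size_t col = 0; col < nullable.size(); ++col) {
      if (nullable[col]) ++n;
    }
  }
  return n;
}
```

For a 3-column schema and a 100-row batch, the first loop performs 300 copies (each including a string copy in this sketch), while the cached version performs 3. A real KuduColumnSchema carries more state than this stand-in, so the per-copy cost there is likely higher.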

This is by far the biggest user of CPU in the Impala process (35% or so). The workload might be I/O bound writing to Kudu anyway, though. Nevertheless, we should provide a way to avoid this copying in the API, either by adding a method like

class KuduSchema {
  const KuduColumnSchema& get_column(int idx) const;
};

or a method like

class KuduSchema {
  bool is_column_nullable(int idx) const;
};

The former is the more flexible, while the latter frees the client from worrying about holding the reference past the lifetime of the KuduColumnSchema object. We would probably need to add several methods like the latter to cover other potentially useful checks, such as encoding and type.
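
The two proposed shapes can be sketched side by side. This is an illustration only, not the Kudu API: ColumnInfo and SchemaApiSketch are hypothetical stand-ins, and the use of std::size_t for the index is an assumption (the proposal above writes int).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for KuduColumnSchema.
struct ColumnInfo {
  std::string name;
  bool nullable;
};

// Hypothetical stand-in for KuduSchema offering both proposed accessors.
class SchemaApiSketch {
 public:
  explicit SchemaApiSketch(std::vector<ColumnInfo> cols)
      : cols_(std::move(cols)) {}

  // Option 1: return a const reference. No copy is made, but the reference
  // is valid only while this schema object is alive and unmodified.
  const ColumnInfo& get_column(std::size_t idx) const {
    assert(idx < cols_.size());
    return cols_[idx];
  }

  // Option 2: answer the specific question by value. Nothing can dangle,
  // but each property (type, encoding, ...) needs its own method.
  bool is_column_nullable(std::size_t idx) const {
    assert(idx < cols_.size());
    return cols_[idx].nullable;
  }

  std::size_t num_columns() const { return cols_.size(); }

 private:
  std::vector<ColumnInfo> cols_;
};

// Example hot loop using option 2; no column object is copied.
std::size_t CountNullableColumns(const SchemaApiSketch& schema) {
  std::size_t n = 0;
  for (std::size_t col = 0; col < schema.num_columns(); ++col) {
    if (schema.is_column_nullable(col)) ++n;
  }
  return n;
}
```

With option 1, a caller that stores the returned reference must make sure the schema outlives it; with option 2 that concern disappears, at the price of a wider interface.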

Issue Links

is related to: IMPALA-8284 KuduTableSink spends a lot of CPU copying KuduColumnSchemas (Resolved)

People

Assignee: Unassigned
Reporter: William Berkeley
Votes: 0
Watchers: 2

Dates

Created: 05/Mar/19 22:54
Updated: 03/Jun/20 14:38