  1. Kudu
  2. KUDU-2731

Getting column schema information from KuduSchema requires copying a KuduColumnSchema object

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.9.0
    • Fix Version/s: None
    • Component/s: perf
    • Labels: None

    Description

      I'm looking at a CPU profile of Impala inserting into Kudu. KuduTableSink::Send has code that schematically does the following:

      for each row in the batch {
        for each column col_idx {
          // Column() returns a KuduColumnSchema by value, so this copies it.
          if (schema.Column(col_idx).is_nullable()) {
            write->mutable_row()->SetNull(col_idx);
          }
        }
      }
      

      See kudu-table-sink.cc. However, KuduSchema::Column copies the column schema and returns it by value, so the if statement constructs and destroys a column schema object just to check if the column is nullable.
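To make the cost concrete, here is a minimal sketch using stand-in types rather than the real Kudu client classes. `MockColumnSchema`, `MockSchema`, `LoopStats`, `NaiveLoop`, and `HoistedLoop` are all hypothetical names introduced for illustration; only the by-value `Column()` shape is taken from the description above. It counts copy constructions in the per-cell loop and compares that with a client-side variant that reads nullability once per column before iterating over rows. The hoisted variant is one possible mitigation under these assumptions, not necessarily what Impala does.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for KuduColumnSchema that counts copy constructions.
class MockColumnSchema {
 public:
  MockColumnSchema(std::string name, bool nullable)
      : name_(std::move(name)), nullable_(nullable) {}
  MockColumnSchema(const MockColumnSchema& other)
      : name_(other.name_), nullable_(other.nullable_) {
    ++copies;
  }
  bool is_nullable() const { return nullable_; }

  static int copies;  // incremented by every copy construction

 private:
  std::string name_;
  bool nullable_;
};
int MockColumnSchema::copies = 0;

// Hypothetical stand-in for KuduSchema with a by-value Column() accessor.
class MockSchema {
 public:
  void AddColumn(const std::string& name, bool nullable) {
    cols_.emplace_back(name, nullable);
  }
  // Returns a copy, mirroring the behavior described in this issue.
  MockColumnSchema Column(std::size_t idx) const { return cols_[idx]; }
  std::size_t num_columns() const { return cols_.size(); }

 private:
  std::vector<MockColumnSchema> cols_;
};

struct LoopStats {
  int copies;          // column schema copies made during the loop
  int nullable_cells;  // cells for which the nullable branch was taken
};

// Per-cell check: calls Column() once for every (row, column) pair.
LoopStats NaiveLoop(const MockSchema& schema, int num_rows) {
  MockColumnSchema::copies = 0;  // ignore copies made while building the schema
  int nullable_cells = 0;
  for (int row = 0; row < num_rows; ++row) {
    for (std::size_t col = 0; col < schema.num_columns(); ++col) {
      if (schema.Column(col).is_nullable()) ++nullable_cells;
    }
  }
  return LoopStats{MockColumnSchema::copies, nullable_cells};
}

// Hoisted check: calls Column() once per column, before the row loop.
LoopStats HoistedLoop(const MockSchema& schema, int num_rows) {
  MockColumnSchema::copies = 0;
  std::vector<bool> nullable(schema.num_columns());
  for (std::size_t col = 0; col < schema.num_columns(); ++col) {
    nullable[col] = schema.Column(col).is_nullable();
  }
  int nullable_cells = 0;
  for (int row = 0; row < num_rows; ++row) {
    for (std::size_t col = 0; col < nullable.size(); ++col) {
      if (nullable[col]) ++nullable_cells;
    }
  }
  return LoopStats{MockColumnSchema::copies, nullable_cells};
}

// Builds a three-column example schema: one required and two nullable columns.
MockSchema MakeMockSchema() {
  MockSchema schema;
  schema.AddColumn("id", false);
  schema.AddColumn("name", true);
  schema.AddColumn("score", true);
  return schema;
}
```

With the three-column schema from `MakeMockSchema()` and 100 rows, `NaiveLoop` makes 300 copies in this mock, while `HoistedLoop` makes 3. Both take the nullable branch for the same 200 cells.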

      This is by far the biggest user of CPU in the Impala process (35% or so). The workload might be I/O bound writing to Kudu anyway, though. Nevertheless, we should provide a way to avoid this copying in the API, either by adding a method like

      class KuduSchema {
       public:
        const KuduColumnSchema& get_column(int idx) const;
      };
      

      or a method like

      class KuduSchema {
       public:
        bool is_column_nullable(int idx) const;
      };
      

      The former is the most flexible, while the latter frees the client from worrying about holding the reference longer than the KuduColumnSchema object lives. We might need to add a number of methods similar to the latter to cover other potentially useful checks, such as encoding and type.
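The trade-off between the two proposed shapes can be sketched with stand-in classes. `SketchColumnSchema`, `SketchSchema`, and `MakeSketchSchema` are hypothetical names, and the vector-backed storage is an assumption made only for this illustration; the real KuduSchema internals may differ. Only the method names `get_column` and `is_column_nullable` come from the proposal above.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for KuduColumnSchema; not the real client class.
class SketchColumnSchema {
 public:
  SketchColumnSchema(std::string name, bool nullable)
      : name_(std::move(name)), nullable_(nullable) {}
  const std::string& name() const { return name_; }
  bool is_nullable() const { return nullable_; }

 private:
  std::string name_;
  bool nullable_;
};

// Hypothetical stand-in for KuduSchema exposing both proposed accessors.
class SketchSchema {
 public:
  explicit SketchSchema(std::vector<SketchColumnSchema> cols)
      : cols_(std::move(cols)) {}

  // Option 1: no copy, full access to the column schema. The reference is
  // valid only while this SketchSchema object is alive and unmodified.
  const SketchColumnSchema& get_column(std::size_t idx) const {
    return cols_.at(idx);
  }

  // Option 2: no copy and no lifetime concern, but one such method is needed
  // per property (nullability here; type, encoding, etc. would need more).
  bool is_column_nullable(std::size_t idx) const {
    return cols_.at(idx).is_nullable();
  }

 private:
  std::vector<SketchColumnSchema> cols_;
};

// Builds a two-column example schema: "id" is required, "note" is nullable.
SketchSchema MakeSketchSchema() {
  std::vector<SketchColumnSchema> cols;
  cols.emplace_back("id", false);
  cols.emplace_back("note", true);
  return SketchSchema(std::move(cols));
}
```

A caller that needs only nullability could use `schema.is_column_nullable(i)` inside the row loop. A caller that needs several properties could bind `const SketchColumnSchema& col = schema.get_column(i);` once and must then keep `schema` alive for as long as `col` is used.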

            People

              Assignee: Unassigned
              Reporter: William Berkeley (wdberkeley)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated: