Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-11204

OrcStringColumnReader should be a template class

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • Impala 4.1.0
    • Backend
    • None

    Description

      There are some duplicated checks in OrcStringColumnReader::ReadValue() which is executed for each row. They are static and should only be performed once per ORC stripe.

      Status OrcStringColumnReader::ReadValue(int row_idx, Tuple* tuple, MemPool* pool) {
        if (IsNull(DCHECK_NOTNULL(batch_), row_idx)) {
          SetNullSlot(tuple);
          return Status::OK();
        }
        char* src_ptr;
        int src_len;
      
        if (batch_->isEncoded) {
          orc::EncodedStringVectorBatch* currentBatch =
              static_cast<orc::EncodedStringVectorBatch*>(batch_);
      
          orc::DataBuffer<int64_t>& offsets = currentBatch->dictionary->dictionaryOffset;
          int64_t index = currentBatch->index[row_idx];
          if (UNLIKELY(index < 0  || static_cast<uint64_t>(index) + 1 >= offsets.size())) {
            return Status(Substitute("Corrupt ORC file: $0. Index ($1) out of range [0, $2) in "
                "StringDictionary.", scanner_->filename(), index, offsets.size()));;
          }
          src_ptr = blob_ + offsets[index];
          src_len = offsets[index + 1] - offsets[index];
        } else {
          // The pointed data is now in blob_, a buffer handled by Impala.
          src_ptr = blob_ + (batch_->data[row_idx] - batch_->blob.data());
          src_len = batch_->length[row_idx];
        }
        int dst_len = slot_desc_->type().len;
        if (slot_desc_->type().type == TYPE_CHAR) {
          int unpadded_len = min(dst_len, static_cast<int>(src_len));
          char* dst_char = reinterpret_cast<char*>(GetSlot(tuple));
          memcpy(dst_char, src_ptr, unpadded_len);
          StringValue::PadWithSpaces(dst_char, dst_len, unpadded_len);
          return Status::OK();
        }
        StringValue* dst = reinterpret_cast<StringValue*>(GetSlot(tuple));
        if (slot_desc_->type().type == TYPE_VARCHAR && src_len > dst_len) {
          dst->len = dst_len;
        } else {
          dst->len = src_len;
        }
        dst->ptr = src_ptr;
        return Status::OK();
      }
      

      This method is executed for each row. The complexity of it causes the compiler unable to inline it. I can see this method takes some portion of the time in TPC-H queries (e.g. 21% in Q12).

      Actually, the slot type is determined when we are creating the reader. Whether the orc vector batch is encoded is determined when we start reading a new stripe. If the string column is in dictionary encodings in the stripe and we set EnableLazyDecoding to true in RowReaderOptions, the orc vector batch will be encoded.
      https://github.com/apache/orc/blob/21259815ae665f6a4beac29edb4fab52c2403e13/c%2B%2B/src/Reader.cc#L1205
      https://github.com/apache/orc/blob/21259815ae665f6a4beac29edb4fab52c2403e13/c%2B%2B/src/ColumnReader.cc#L1843-L1847

      We can switch to a template implementation so the compile can remove these checks and inline the method.

      Attachments

        1. perf-report-tpch30-orc-snap.txt
          43 kB
          Quanlong Huang

        Activity

          People

            stigahuang Quanlong Huang
            stigahuang Quanlong Huang
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: