HIVE-11665: ORC StringDictionaryReader should not use Chunked buffers


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.3.0, 2.0.0
    • Fix Version/s: None
    • Component/s: File Formats
    • Labels: None

Description

The ORC StringDictionaryReader is slow due to the chunking of the input stream.

From ql/src/java/org/apache/hadoop/hive/ql/io/orc/TreeReaderFactory.java#L1678:
    private void readDictionaryStream(InStream in) throws IOException {
      if (in != null) { // Guard against an empty dictionary stream.
        if (in.available() > 0) {
          dictionaryBuffer = new DynamicByteArray(64, in.available());
          dictionaryBuffer.readAll(in);
          // Since this is the start of a stripe, invalidate the cache.
          dictionaryBufferInBytesCache = null;
        }
        in.close();
      } else {
        dictionaryBuffer = null;
      }
    }
      

Chunking the data offers no advantage on the read path: the buffer is sized up front from in.available(), so there is no grow() operation where the chunked layout would save memory, while every dictionary access still pays the cost of locating the right chunk.
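
A minimal sketch of one possible alternative is shown below: the dictionary stream is read into a single contiguous byte[] up front, so later offset lookups are plain array indexing. It assumes, as the snippet above does, that available() reports the full remaining length of the dictionary stream; the class and method names are hypothetical, not part of the Hive codebase, and standard java.io types stand in for ORC's InStream.

    import java.io.IOException;
    import java.io.InputStream;

    // Hypothetical sketch: materialize the dictionary in one contiguous
    // byte[] instead of a chunked buffer.
    public final class ContiguousDictionaryRead {

      static byte[] readDictionaryContiguous(InputStream in) throws IOException {
        if (in == null) {
          return null;                      // no dictionary stream present
        }
        try {
          int length = in.available();      // full stream length, per the assumption above
          if (length <= 0) {
            return new byte[0];             // empty dictionary
          }
          byte[] buffer = new byte[length];
          int offset = 0;
          while (offset < length) {
            int read = in.read(buffer, offset, length - offset);
            if (read < 0) {
              throw new IOException("Unexpected EOF after " + offset + " bytes");
            }
            offset += read;
          }
          return buffer;
        } finally {
          in.close();
        }
      }
    }

The trade-off is a single allocation that must hold the whole dictionary, but that requirement is already implied by sizing the buffer from in.available() in the current code.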

Attachments

    1. orc-stringdict-reader.png (27 kB, Gopal Vijayaraghavan)


People

    Assignee: Prasanth Jayachandran (prasanth_j)
    Reporter: Gopal Vijayaraghavan (gopalv)
    Votes: 0
    Watchers: 1
