HIVE-11665: ORC StringDictionaryReader should not use Chunked buffers


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.3.0, 2.0.0
    • Fix Version/s: None
    • Component/s: File Formats
    • Labels: None

Description

The ORC StringDictionaryReader is slow due to the chunking of the input stream.

From ql/src/java/org/apache/hadoop/hive/ql/io/orc/TreeReaderFactory.java#L1678:
    private void readDictionaryStream(InStream in) throws IOException {
      if (in != null) { // Guard against an empty dictionary stream.
        if (in.available() > 0) {
          dictionaryBuffer = new DynamicByteArray(64, in.available());
          dictionaryBuffer.readAll(in);
          // Since this is the start of a stripe, invalidate the cache.
          dictionaryBufferInBytesCache = null;
        }
        in.close();
      } else {
        dictionaryBuffer = null;
      }
    }
      

Chunking the data offers no advantage on the read path: the buffer is sized up front from in.available(), so there is no grow() operation where the chunked layout would save memory, while every dictionary access still pays the cost of locating the right chunk.
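
A minimal sketch of one possible alternative is shown below: the dictionary stream is read into a single contiguous byte[] up front, so later offset lookups are plain array indexing. It assumes, as the snippet above does, that available() reports the full remaining length of the dictionary stream; the class and method names are hypothetical, not part of the Hive codebase, and standard java.io types stand in for ORC's InStream.

    import java.io.IOException;
    import java.io.InputStream;

    // Hypothetical sketch: materialize the dictionary in one contiguous
    // byte[] instead of a chunked buffer.
    public final class ContiguousDictionaryRead {

      static byte[] readDictionaryContiguous(InputStream in) throws IOException {
        if (in == null) {
          return null;                      // no dictionary stream present
        }
        try {
          int length = in.available();      // full stream length, per the assumption above
          if (length <= 0) {
            return new byte[0];             // empty dictionary
          }
          byte[] buffer = new byte[length];
          int offset = 0;
          while (offset < length) {
            int read = in.read(buffer, offset, length - offset);
            if (read < 0) {
              throw new IOException("Unexpected EOF after " + offset + " bytes");
            }
            offset += read;
          }
          return buffer;
        } finally {
          in.close();
        }
      }
    }

The trade-off is a single allocation that must hold the whole dictionary, but that requirement is already implied by sizing the buffer from in.available() in the current code.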

Attachments

    1. orc-stringdict-reader.png (27 kB, Gopal Vijayaraghavan)


People

    Assignee: Prasanth Jayachandran (prasanth_j)
    Reporter: Gopal Vijayaraghavan (gopalv)
    Votes: 0
    Watchers: 1
