[HIVE-3874] Create a new Optimized Row Columnar file format for Hive - ASF JIRA

XML

Word

Printable

JSON

There are several limitations of the current RC File format that I'd like to address by creating a new format:

each column value is stored as a binary blob, which means:
- the entire column value must be read, decompressed, and deserialized
- the file format can't use smarter type-specific compression
- push down filters can't be evaluated
the start of each row group needs to be found by scanning
user metadata can only be added to the file when the file is created
the file doesn't store the number of rows per a file or row group
there is no mechanism for seeking to a particular row number, which is required for external indexes.
there is no mechanism for storing light weight indexes within the file to enable push-down filters to skip entire row groups.
the type of the rows aren't stored in the file

is depended upon by

HIVE-4097 ORC file doesn't properly interpret empty hive.io.file.readcolumn.ids

HIVE-4098 OrcInputFormat assumes Hive always calls createValue

ORC-15 Add floating point compression to ORC file

HIVE-4059 Make Column statistics for ORC optional

HIVE-4060 Make streams for types for ORC pluggable

HIVE-4063 Negative tests for types not supported by ORC

HIVE-4058 make ORC versioned

HIVE-4061 skip columns which are not accessed in the query for ORC

HIVE-4062 use column statistics for ORC to evaluate predicates for ORC

HIVE-4015 Add ORC file to the grammar as a file format

is related to

HIVE-4376 Document ORC file format in Hive wiki

(5 is depended upon by, 1 is related to)