Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
There are several limitations of the current RCFile format that I'd like to address by creating a new format:
- each column value is stored as a binary blob, which means:
  - the entire column value must be read, decompressed, and deserialized
  - the file format can't use smarter type-specific compression
  - push-down filters can't be evaluated
- the start of each row group has to be found by scanning
- user metadata can only be added to the file when the file is created
- the file doesn't store the number of rows per file or per row group
- there is no mechanism for seeking to a particular row number, which is required for external indexes
- there is no mechanism for storing lightweight indexes within the file to enable push-down filters to skip entire row groups
- the types of the rows aren't stored in the file
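To illustrate the last two points, here is a minimal sketch (not the actual ORC reader API; all names are hypothetical) of how a lightweight per-row-group min/max index lets a reader evaluate a push-down predicate such as `x > 100` and skip entire row groups without decompressing them:

```java
import java.util.ArrayList;
import java.util.List;

public class StatsSkipDemo {
    // Hypothetical lightweight index entry: min/max of one column
    // over one row group, stored in the file's metadata.
    static final class RowGroupStats {
        final long min, max;
        RowGroupStats(long min, long max) { this.min = min; this.max = max; }
    }

    // Return the indexes of the row groups that might contain rows
    // satisfying "value > threshold"; all other groups can be skipped.
    static List<Integer> groupsToRead(List<RowGroupStats> stats, long threshold) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < stats.size(); i++) {
            // If even the group's maximum is <= threshold, no row matches.
            if (stats.get(i).max > threshold) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<RowGroupStats> stats = List.of(
            new RowGroupStats(0, 50),     // group 0: all values <= 50, skipped
            new RowGroupStats(40, 200),   // group 1: may contain matches
            new RowGroupStats(150, 900)); // group 2: may contain matches
        System.out.println(groupsToRead(stats, 100)); // prints [1, 2]
    }
}
```

Because the statistics are tiny compared to the column data, the reader only pays the decompression cost for the groups that survive this check.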
Attachments
Issue Links
- is depended upon by
  - HIVE-4097 ORC file doesn't properly interpret empty hive.io.file.readcolumn.ids (Closed)
  - HIVE-4098 OrcInputFormat assumes Hive always calls createValue (Closed)
  - ORC-15 Add floating point compression to ORC file (Open)
  - HIVE-4059 Make Column statistics for ORC optional (Open)
  - HIVE-4060 Make streams for types for ORC pluggable (Open)
  - HIVE-4063 Negative tests for types not supported by ORC (Open)
  - HIVE-4058 make ORC versioned (Resolved)
  - HIVE-4061 skip columns which are not accessed in the query for ORC (Resolved)
  - HIVE-4062 use column statistics for ORC to evaluate predicates for ORC (Resolved)
  - HIVE-4015 Add ORC file to the grammar as a file format (Closed)
- is related to
  - HIVE-4376 Document ORC file format in Hive wiki (Resolved)