There are several limitations of the current RC File format that I'd like to address by creating a new format:
- each column value is stored as a binary blob, which means:
  - the entire column value must be read, decompressed, and deserialized
  - the file format can't use smarter type-specific compression
  - push-down filters can't be evaluated
- the start of each row group needs to be found by scanning
- user metadata can only be added to the file when the file is created
- the file doesn't store the number of rows per file or per row group
- there is no mechanism for seeking to a particular row number, which is required for external indexes
- there is no mechanism for storing lightweight indexes within the file to enable push-down filters to skip entire row groups
- the type of the rows isn't stored in the file
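
To make the last few points concrete, here is a minimal sketch of what a lightweight in-file index could provide. It is only an illustration and not a proposed layout: the names `RowGroupStats`, `build_index`, `groups_to_read`, and `group_for_row` are hypothetical, and it assumes a single integer column with per-row-group row counts and min/max values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group index entry for one integer column."""
    first_row: int   # row number of the first row in the group
    row_count: int   # number of rows stored in the group
    min_value: int   # smallest column value in the group
    max_value: int   # largest column value in the group


def build_index(values: List[int], rows_per_group: int) -> List[RowGroupStats]:
    """Split the column into fixed-size row groups and record stats for each."""
    index = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        index.append(RowGroupStats(start, len(chunk), min(chunk), max(chunk)))
    return index


def groups_to_read(index: List[RowGroupStats], low: int, high: int) -> List[RowGroupStats]:
    """Keep only the row groups whose [min, max] range overlaps [low, high]."""
    return [g for g in index if g.max_value >= low and g.min_value <= high]


def group_for_row(index: List[RowGroupStats], row: int) -> Optional[RowGroupStats]:
    """Find the row group that holds a given row number, without scanning data."""
    for g in index:
        if g.first_row <= row < g.first_row + g.row_count:
            return g
    return None


values = [1, 2, 3, 10, 11, 12, 20, 21, 22, 23]
index = build_index(values, rows_per_group=3)

# A filter such as "value BETWEEN 11 AND 20" only needs two of the four groups.
selected = groups_to_read(index, 11, 20)
print([g.first_row for g in selected])

# The row counts also give the total row count and allow seeking by row number.
print(sum(g.row_count for g in index))
print(group_for_row(index, 7))
```

With entries like these, a reader could answer row-count queries, seek to a row number, and skip whole row groups for a push-down filter without decompressing any column data.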