RCFile is a column-based file format introduced in HIVE-352. Column-based storage has been shown to give a better compression ratio. On our internal data set (30 columns, most of them short integer strings), gzip-compressed RCFile is at least 20% smaller than gzip-compressed SequenceFile.
RCFile also has the potential to make reads much more efficient: because it compresses each column separately, a reader only needs to decompress the columns a query actually uses.
We should integrate RCFile with Hive's column pruning results to make reads faster.
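
The idea behind combining per-column compression with column pruning can be sketched with a toy model. This is not RCFile's actual on-disk layout or Hive's API: the function names `write_row_group` and `read_columns`, the newline delimiter, and the use of zlib are all assumptions made for illustration, and values are assumed not to contain newlines.

```python
import zlib


def write_row_group(rows):
    """Store a batch of rows column by column, compressing each column separately."""
    num_columns = len(rows[0])
    compressed_columns = []
    for col in range(num_columns):
        # Concatenate all values of one column (newline is a hypothetical delimiter).
        column_bytes = "\n".join(row[col] for row in rows).encode("utf-8")
        compressed_columns.append(zlib.compress(column_bytes))
    return compressed_columns


def read_columns(row_group, needed):
    """Decompress only the requested column indices; the rest are never touched."""
    result = {}
    for col in needed:
        values = zlib.decompress(row_group[col]).decode("utf-8").split("\n")
        result[col] = values
    return result


rows = [
    ["1", "alice", "30"],
    ["2", "bob", "25"],
    ["3", "carol", "41"],
]
group = write_row_group(rows)

# A query that references only columns 0 and 2 skips decompressing column 1.
pruned = read_columns(group, needed=[0, 2])
print(pruned)
```

In this sketch the list passed as `needed` plays the role of the column pruning result: the fewer columns a query references, the less data the reader has to decompress.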