Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
0.4.0
None
Reviewed
HIVE-461. Optimize RCFile reading by using column pruning results. (Yongqiang He via zshao)
Description
RCFile is a column-based file format introduced in HIVE-352. Column-based storage has shown a better compression ratio: on our internal data set (30 columns, most of them short integer strings), gzip-compressed RCFile is more than 20% smaller than gzip-compressed SequenceFile.
RCFile also has the potential to improve read efficiency considerably. Because it compresses each column separately, a reader can skip decompressing the columns a query does not need.
We should integrate RCFile with the column pruning results from Hive to make the reading faster.
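To illustrate the idea, here is a minimal sketch of pruned reading. It is not the actual RCFile reader or the HIVE-461 patch: the class `PrunedRowGroupReader`, its methods, and the `decompressCalls` counter are hypothetical names invented for this example. It assumes each column of a row group is stored as an independent gzip stream and that the query planner supplies the set of needed column indexes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; these are not Hive or RCFile classes.
class PrunedRowGroupReader {
    // Counts decompressions so the effect of pruning can be observed.
    static int decompressCalls = 0;

    // Compress one column's values as an independent gzip stream.
    static byte[] gzip(String columnValues) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(columnValues.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Decompress one column's gzip stream back into its values.
    static String gunzip(byte[] compressed) {
        decompressCalls++;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    // Decompress only the columns the query needs; pruned columns stay null.
    static String[] readColumns(byte[][] compressedColumns,
                                Set<Integer> neededColumns) {
        String[] result = new String[compressedColumns.length];
        for (int i = 0; i < compressedColumns.length; i++) {
            if (neededColumns.contains(i)) {
                result[i] = gunzip(compressedColumns[i]);
            }
        }
        return result;
    }
}
```

With three stored columns and a query that only touches column 1, `readColumns` performs a single decompression instead of three. A real reader would presumably also avoid reading the pruned columns' bytes from disk where the file layout allows it.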