Description
The attached test writes/reads X GB to/from the default filesystem through SequenceFileInputFormat and TextInputFormat, using LzoCodec, GzipCodec, and no compression, and using both block and record compression for SequenceFiles.
The following results, using 10GB of data through RawLocalFileSystem with 5-word keys and 20-word values (as generated by RandomTextWriter with the same seed for each file), are pretty stable:
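The write side of such a test can be sketched with the SequenceFile API; this is a minimal illustration, not the attached test itself, and it assumes a Hadoop jar on the classpath. The output path and sample record are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;

public class SeqWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Local filesystem, as in the RawLocalFileSystem runs above.
    FileSystem fs = FileSystem.getLocal(conf);
    Path out = new Path("/tmp/bench.seq"); // hypothetical path

    // Block-compressed SequenceFile with GzipCodec; swap in LzoCodec,
    // CompressionType.RECORD, or CompressionType.NONE for the other rows.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, Text.class,
        CompressionType.BLOCK, new GzipCodec());
    try {
      // The real test appends RandomTextWriter-style 5-word/20-word records.
      writer.append(new Text("sample key"), new Text("sample value"));
    } finally {
      writer.close();
    }
  }
}
```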
Writes:
Format | Compression | Type | Time (sec) | Filesize (bytes) |
---|---|---|---|---|
SEQ | LZO | BLOCK | 318 | 8 604 288 397 |
SEQ | LZO | RECORD | 367 | 11 689 969 413 |
SEQ | ZIP | BLOCK | 929 | 2 827 697 769 |
SEQ | ZIP | RECORD | 1737 | 9 324 730 365 |
SEQ | NONE | | 201 | 11 282 745 683 |
TXT | LZO | | 742 | 12 671 065 769 |
TXT | ZIP | | 1320 | 2 597 397 680 |
TXT | NONE | | 392 | 10 818 058 643 |
Reads:
Format | Compression | Type | Time (sec) |
---|---|---|---|
SEQ | LZO | BLOCK | 150 |
SEQ | LZO | RECORD | 281 |
SEQ | ZIP | BLOCK | 155 |
SEQ | ZIP | RECORD | 548 |
SEQ | NONE | | 209 |
TXT | LZO | | 620 |
TXT | ZIP | | 355 |
TXT | NONE | | 284 |
Of note:
- LZO-compressed TextOutput is larger than the uncompressed output (HADOOP-2402); lzop cannot read it.
- Zip compression is expensive. Short values are responsible for the unimpressive compression for record-compressed SequenceFiles.
- TextInputFormat is slow (HADOOP-2285). TextOutputFormat also looks suspect.
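The first two observations can be checked directly against the filesize column of the writes table; a small calculation over those numbers:

```java
public class CompressionRatios {
  public static void main(String[] args) {
    // Filesizes (bytes) from the writes table above.
    long txtPlain = 10_818_058_643L, txtLzo = 12_671_065_769L;
    long seqPlain = 11_282_745_683L;
    long seqZipBlock = 2_827_697_769L, seqZipRecord = 9_324_730_365L;

    // LZO TextOutput overhead vs. uncompressed (HADOOP-2402): ~17% larger.
    System.out.printf("LZO TXT overhead: %.1f%%%n",
        100.0 * (txtLzo - txtPlain) / txtPlain);

    // Record compression barely helps for short values (~83% of the
    // uncompressed size), while block compression shrinks it to ~25%.
    System.out.printf("ZIP RECORD size vs. plain: %.0f%%%n",
        100.0 * seqZipRecord / seqPlain);
    System.out.printf("ZIP BLOCK size vs. plain: %.0f%%%n",
        100.0 * seqZipBlock / seqPlain);
  }
}
```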