Description
The attached test writes/reads X GB to/from the default filesystem through SequenceFileInputFormat and TextInputFormat, using LzoCodec, GzipCodec, and no compression, and using both block and record compression for SequenceFiles.
The following results, using 10GB of data through RawLocalFileSystem with 5-word keys and 20-word values (as generated by RandomTextWriter with the same seed for each file), are pretty stable:
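The write side of such a test can be sketched with the SequenceFile API; this is a minimal illustration, not the attached test itself, and it assumes a Hadoop jar on the classpath. The output path and sample record are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;

public class SeqWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Local filesystem, as in the RawLocalFileSystem runs above.
    FileSystem fs = FileSystem.getLocal(conf);
    Path out = new Path("/tmp/bench.seq"); // hypothetical path

    // Block-compressed SequenceFile with GzipCodec; swap in LzoCodec,
    // CompressionType.RECORD, or CompressionType.NONE for the other rows.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, Text.class,
        CompressionType.BLOCK, new GzipCodec());
    try {
      // The real test appends RandomTextWriter-style 5-word/20-word records.
      writer.append(new Text("sample key"), new Text("sample value"));
    } finally {
      writer.close();
    }
  }
}
```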
Writes:
Format | Compression | Type | Time (sec) | Filesize (bytes) |
---|---|---|---|---|
SEQ | LZO | BLOCK | 318 | 8 604 288 397 |
SEQ | LZO | RECORD | 367 | 11 689 969 413 |
SEQ | ZIP | BLOCK | 929 | 2 827 697 769 |
SEQ | ZIP | RECORD | 1737 | 9 324 730 365 |
SEQ | NONE | | 201 | 11 282 745 683 |
TXT | LZO | | 742 | 12 671 065 769 |
TXT | ZIP | | 1320 | 2 597 397 680 |
TXT | NONE | | 392 | 10 818 058 643 |
Reads:
Format | Compression | Type | Time (sec) |
---|---|---|---|
SEQ | LZO | BLOCK | 150 |
SEQ | LZO | RECORD | 281 |
SEQ | ZIP | BLOCK | 155 |
SEQ | ZIP | RECORD | 548 |
SEQ | NONE | | 209 |
TXT | LZO | | 620 |
TXT | ZIP | | 355 |
TXT | NONE | | 284 |
Of note:
- LZO-compressed TextOutput is larger than the uncompressed output (HADOOP-2402); lzop cannot read it.
- Zip compression is expensive. Short values are responsible for the unimpressive compression for record-compressed SequenceFiles.
- TextInputFormat is slow (HADOOP-2285). TextOutputFormat also looks suspect.
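The first two observations can be checked directly against the filesize column of the writes table; a small calculation over those numbers:

```java
public class CompressionRatios {
  public static void main(String[] args) {
    // Filesizes (bytes) from the writes table above.
    long txtPlain = 10_818_058_643L, txtLzo = 12_671_065_769L;
    long seqPlain = 11_282_745_683L;
    long seqZipBlock = 2_827_697_769L, seqZipRecord = 9_324_730_365L;

    // LZO TextOutput overhead vs. uncompressed (HADOOP-2402): ~17% larger.
    System.out.printf("LZO TXT overhead: %.1f%%%n",
        100.0 * (txtLzo - txtPlain) / txtPlain);

    // Record compression barely helps for short values (~83% of the
    // uncompressed size), while block compression shrinks it to ~25%.
    System.out.printf("ZIP RECORD size vs. plain: %.0f%%%n",
        100.0 * seqZipRecord / seqPlain);
    System.out.printf("ZIP BLOCK size vs. plain: %.0f%%%n",
        100.0 * seqZipBlock / seqPlain);
  }
}
```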