Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Many streaming applications process textual data with one record per line and fields separated by a delimiter. In such cases there is no point in using any of Hadoop's input/output formats, since the streaming script/binary itself parses the input and breaks it into records and fields. We should therefore provide users with a binary input/output format that simply sends 64 KB (or so) blocks of data directly from HDFS to the streaming application.
I did something very similar for Pig-Streaming (PIG-94 - BinaryStorage), which resulted in a 300%+ speedup when scanning data (identity mapper, map-only jobs); the parsing done by the input/output formats in those cases was pure overhead.
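A minimal sketch of the core idea, assuming nothing beyond the Java standard library: read raw, fixed-size blocks from a stream and hand them through untouched, leaving record and field parsing entirely to the streaming script. The class and method names (`BinaryChunkReader`, `nextBlock`) are hypothetical, not part of any Hadoop API; a real implementation would wrap this logic in an `InputFormat`/`RecordReader` pair over an HDFS stream.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: pass ~64 KB blocks through verbatim, with no
// record or field parsing on the framework side.
public class BinaryChunkReader {
    static final int BLOCK_SIZE = 64 * 1024; // 64 KB per block

    // Returns the next block of up to BLOCK_SIZE bytes, or null at
    // end of stream. The final block may be shorter.
    static byte[] nextBlock(InputStream in) throws IOException {
        byte[] buf = new byte[BLOCK_SIZE];
        int off = 0;
        while (off < BLOCK_SIZE) {
            int n = in.read(buf, off, BLOCK_SIZE - off);
            if (n < 0) break;             // stream exhausted
            off += n;
        }
        if (off == 0) return null;        // nothing left to read
        if (off < BLOCK_SIZE) {           // short final block: trim
            byte[] tail = new byte[off];
            System.arraycopy(buf, 0, tail, 0, off);
            return tail;
        }
        return buf;
    }
}
```

Because blocks are cut at arbitrary byte offsets rather than line boundaries, the consuming script must tolerate records spanning block edges; that is the trade-off that removes the per-record parsing overhead described above.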
Attachments
Issue Links
- duplicates: MAPREDUCE-598 Streaming: better conrol over input splits (Resolved)
- is related to: HADOOP-1722 Make streaming to handle non-utf8 byte array (Closed)
- is superceded by: MAPREDUCE-5018 Support raw binary data with Hadoop streaming (Patch Available)