Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Many streaming applications process textual data with one record per line and fields separated by a delimiter. In such cases there is no point in using any of Hadoop's input/output formats, since the streaming script/binary itself parses the input and breaks it into records and fields. We should therefore provide users with a binary input/output format that simply sends 64 KB (or so) blocks of data directly from HDFS to the streaming application.
I did something very similar for Pig-Streaming (PIG-94 - BinaryStorage), which resulted in a 300%+ speedup when scanning data (identity mapper, map-only jobs); the parsing done by the input/output formats in those cases was pure overhead.
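A minimal sketch of the core idea, assuming nothing beyond the Java standard library: read raw, fixed-size blocks from a stream and hand them through untouched, leaving record and field parsing entirely to the streaming script. The class and method names (`BinaryChunkReader`, `nextBlock`) are hypothetical, not part of any Hadoop API; a real implementation would wrap this logic in an `InputFormat`/`RecordReader` pair over an HDFS stream.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: pass ~64 KB blocks through verbatim, with no
// record or field parsing on the framework side.
public class BinaryChunkReader {
    static final int BLOCK_SIZE = 64 * 1024; // 64 KB per block

    // Returns the next block of up to BLOCK_SIZE bytes, or null at
    // end of stream. The final block may be shorter.
    static byte[] nextBlock(InputStream in) throws IOException {
        byte[] buf = new byte[BLOCK_SIZE];
        int off = 0;
        while (off < BLOCK_SIZE) {
            int n = in.read(buf, off, BLOCK_SIZE - off);
            if (n < 0) break;             // stream exhausted
            off += n;
        }
        if (off == 0) return null;        // nothing left to read
        if (off < BLOCK_SIZE) {           // short final block: trim
            byte[] tail = new byte[off];
            System.arraycopy(buf, 0, tail, 0, off);
            return tail;
        }
        return buf;
    }
}
```

Because blocks are cut at arbitrary byte offsets rather than line boundaries, the consuming script must tolerate records spanning block edges; that is the trade-off that removes the per-record parsing overhead described above.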
Attachments
Issue Links
- duplicates: MAPREDUCE-598 Streaming: better conrol over input splits (Resolved)
- is related to: HADOOP-1722 Make streaming to handle non-utf8 byte array (Closed)
- is superceded by: MAPREDUCE-5018 Support raw binary data with Hadoop streaming (Patch Available)