  1. Hadoop Map/Reduce
  2. MAPREDUCE-606

Implement a binary input/output format for Streaming


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: contrib/streaming
    • Labels: None

      Description

      Many streaming applications process textual data with one record per line and fields separated by a delimiter. In such cases there is no point in using any of Hadoop's input/output formats, since the streaming script or binary parses the input itself and breaks it into records and fields. For these cases we should provide users with a binary input/output format that simply sends blocks of data (64 KB or so) directly from HDFS to the streaming application.

      I did something very similar for Pig-Streaming (PIG-94, BinaryStorage), which resulted in a 300%+ speedup for scanning data (identity mapper and map-only jobs). The parsing done by the input/output formats in these cases was pure overhead.
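
To illustrate the script-side parsing the description refers to, here is a minimal sketch of a streaming mapper that reads raw blocks from standard input and splits them into records and fields itself. It is not taken from the attached patch; the names `iter_records` and `split_fields`, the newline record separator, and the tab field delimiter are all assumptions made for illustration.

```python
import io
import sys

BLOCK_SIZE = 64 * 1024  # 64 KB, the approximate block size suggested in the description


def iter_records(stream, block_size=BLOCK_SIZE, record_sep=b"\n"):
    """Yield complete records from a binary stream that is read in fixed-size blocks.

    Hypothetical helper: the framework only hands over raw bytes, so the
    script is responsible for finding record boundaries.
    """
    leftover = b""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        parts = (leftover + block).split(record_sep)
        # The last piece may be an incomplete record; carry it into the next block.
        leftover = parts.pop()
        for record in parts:
            yield record
    if leftover:
        # Final record without a trailing separator.
        yield leftover


def split_fields(record, field_sep=b"\t"):
    """Split one record into fields (tab delimiter is an assumption)."""
    return record.split(field_sep)


if __name__ == "__main__":
    # Identity mapper: parse each record, then write it back out unchanged.
    out = sys.stdout.buffer
    for record in iter_records(sys.stdin.buffer):
        out.write(b"\t".join(split_fields(record)) + b"\n")
```

Because records can straddle block boundaries, the sketch carries the trailing partial record over to the next read instead of assuming that each block ends on a record separator.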

        Attachments

        1. hadoop-3227.patch
          6 kB
          weimin zhu


              People

              • Assignee: Unassigned
              • Reporter: Arun Murthy (acmurthy)
              • Votes: 0
              • Watchers: 12
