Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-606

Implement a binary input/output format for Streaming

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • None
    • contrib/streaming
    • None

    Description

      Lots of streaming applications process textual data with 1 record per line and fields separated by a delimiter. It turns out that there is no point in using any of Hadoop's input/output formats since the streaming script/binary itself will parse the input and break into records and fields. In such cases we should provide users with a binary input/output format which just sends 64k (or so) blocks of data directly from HDFS to the streaming application.

      I did something very similar for Pig-Streaming (PIG-94 - BinaryStorage) which resulted in 300%+ speedup for scanning (identity mapper & map-only jobs) data... the parsing done by input/output formats in these cases were pure-overhead.

      Attachments

        1. hadoop-3227.patch
          6 kB
          weimin zhu

        Issue Links

          Activity

            People

              Unassigned Unassigned
              acmurthy Arun Murthy
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: