Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.1.2
    • Fix Version/s: None
    • Component/s: contrib/streaming
    • Labels:
    • Target Version/s:
    • Release Note:
      Add "-io justbytes" I/O format to allow raw binary streaming.

      Description

      People often need to run older programs over many files, and turn to Hadoop streaming as a reliable, performant batch system. There are good reasons for this:

      1. Hadoop is convenient: they may already be using it for MapReduce jobs, and it is easy to spin up a cluster in the cloud.
      2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
      3. It is reasonably performant: it moves the code to the data, maintaining locality, and scales with the number of nodes.

      Historically, Hadoop has been oriented toward processing key/value pairs, and so it needs to interpret the data passing through it. Unfortunately, this makes it difficult to use Hadoop streaming with programs that don't deal in key/value pairs, or with binary data in general. For example, something as simple as running md5sum to verify the integrity of files will not give the correct result, due to Hadoop's interpretation of the data.
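
      To make the md5sum problem concrete, here is a minimal sketch of how record-oriented handling can change a binary payload. The function line_oriented_round_trip is a deliberately simplified toy model introduced for illustration (split into line records, re-emit each with a newline); it is an assumption, not Hadoop's actual code path, and the sample payload is hypothetical.

```python
import hashlib


def line_oriented_round_trip(data: bytes) -> bytes:
    """Toy model (an assumption, not Hadoop code): split the input into
    line records, then re-emit each record followed by a single newline."""
    records = data.splitlines()  # splits on \n, \r and \r\n
    return b"".join(record + b"\n" for record in records)


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical binary payload containing a CRLF and no trailing newline.
original = b"\x00\x01\r\nbinary\x00payload"
round_tripped = line_oriented_round_trip(original)

# The CRLF became LF and a trailing newline was added, so the digests differ.
print(md5_hex(original) == md5_hex(round_tripped))  # False
```

      Any layer that treats the stream as delimited records can normalize or add delimiters like this, which is enough to change a checksum even when the payload looks intact.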

      There have been several attempts at binary serialization schemes for Hadoop streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed at efficiently encoding key/value pairs, and not passing data through unmodified. Even the "RawBytes" serialization scheme adds length fields to the data, rendering it not-so-raw.
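
      The effect of length fields can be sketched as follows. The layout used here, a 4-byte big-endian length prefix per record, is an assumption for illustration and may not match the actual wire format in every detail; frame and unframe are hypothetical helper names.

```python
import struct


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length as a 4-byte big-endian integer
    (illustrative layout, assumed for this sketch)."""
    return struct.pack(">I", len(payload)) + payload


def unframe(framed: bytes) -> bytes:
    """Strip the assumed 4-byte length prefix and return the payload."""
    (length,) = struct.unpack(">I", framed[:4])
    return framed[4:4 + length]


payload = b"abc"
framed = frame(payload)

print(framed)                      # b'\x00\x00\x00\x03abc'
print(unframe(framed) == payload)  # True
```

      A program that is unaware of the framing sees the four extra bytes as part of its input, so an unmodified Unix filter cannot consume such a stream directly.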

      I often need to run a Unix filter on files stored in HDFS; currently, the only way I can do this on the raw data is to copy the data out and run the filter on one machine, which is inconvenient, slow, and unreliable. It would be very convenient to run the filter as a map-only job, allowing me to build on existing (well-tested!) building blocks in the Unix tradition instead of reimplementing them as MapReduce programs.

      However, most existing tools don't know about file splits, and so want to process whole files; and of course many expect raw binary input and output. The solution is to run a map-only job with an InputFormat and OutputFormat that just pass raw bytes and don't split. It turns out to be a little more complicated with streaming; I have attached a patch with the simplest solution I could come up with. I call the format "JustBytes" (as "RawBytes" was already taken), and it should be usable with most recent versions of Hadoop.
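
      Why splitting matters for whole-file tools can be shown with a short sketch. It does not use the patch or any Hadoop API; split_bytes is a hypothetical helper that cuts a byte string into fixed-size pieces, standing in for file splits, and the sample data and split size are arbitrary.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def split_bytes(data: bytes, split_size: int) -> list:
    """Cut data into consecutive chunks of at most split_size bytes
    (a stand-in for input splits, assumed for this sketch)."""
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]


data = bytes(range(256)) * 4          # 1024 bytes of sample binary data
splits = split_bytes(data, 300)       # chunks of 300, 300, 300 and 124 bytes

whole_digest = md5_hex(data)
per_split_digests = [md5_hex(chunk) for chunk in splits]

print(len(splits))                        # 4
print(whole_digest in per_split_digests)  # False
```

      A tool that expects a complete file produces a per-split result that cannot, in general, be combined into the whole-file result, which is why the format described above avoids splitting.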

      Attachments

      1. MAPREDUCE-5018.patch (24 kB, Steven Willis)
      2. MAPREDUCE-5018-branch-1.1.patch (20 kB, Steven Willis)
      3. mapstream (2 kB, Jay Hacker)
      4. justbytes.jar (17 kB, Jay Hacker)
      5. MAPREDUCE-5018.patch (20 kB, Jay Hacker)

        Issue Links

          Activity

          Jay Hacker created issue
          Jay Hacker made changes
            Field: Original Value -> New Value
            Status: Open [ 1 ] -> Patch Available [ 10002 ]
            Release Note: Add "-io justbytes" I/O format to allow raw binary streaming.
            Target Version/s: trunk [ 12320360 ]
          Jay Hacker made changes
            Attachment: MAPREDUCE-5018.patch [ 12570317 ]
          Jay Hacker made changes
            Link: This issue supersedes MAPREDUCE-606 [ MAPREDUCE-606 ]
          Jay Hacker made changes
            Attachment: justbytes.jar [ 12570327 ]
            Attachment: mapstream [ 12570328 ]
          Jay Hacker made changes
            Link: This issue is related to HADOOP-1722 [ HADOOP-1722 ]
          Steven Willis made changes
            Attachment: MAPREDUCE-5018-branch-1.1.patch [ 12644852 ]
          Steven Willis made changes
            Target Version/s: trunk [ 12320360 ] -> 1.1.2, trunk [ 12323594, 12320360 ]
          Steven Willis made changes
            Attachment: MAPREDUCE-5018.patch [ 12644886 ]
          Steven Willis made changes
            Status: Patch Available [ 10002 ] -> Open [ 1 ]
            Target Version/s: trunk, 1.1.2 [ 12320360, 12323594 ] -> 1.1.2, trunk [ 12323594, 12320360 ]
            Assignee: Steven Willis [ stevenwillis ]
          Steven Willis made changes
            Status: Open [ 1 ] -> Patch Available [ 10002 ]
            Affects Version/s: 1.1.2 [ 12323594 ]
            Affects Version/s: trunk [ 12320360 ]
            Target Version/s: trunk, 1.1.2 [ 12320360, 12323594 ] -> 1.1.2, trunk [ 12323594, 12320360 ]
          Steven Willis made changes
            Status: Patch Available [ 10002 ] -> Open [ 1 ]
            Target Version/s: trunk, 1.1.2 [ 12320360, 12323594 ] -> 1.1.2, trunk [ 12323594, 12320360 ]
          Steven Willis made changes
            Status: Open [ 1 ] -> Patch Available [ 10002 ]
            Target Version/s: trunk, 1.1.2 [ 12320360, 12323594 ] -> 1.1.2, trunk [ 12323594, 12320360 ]
          Allen Wittenauer made changes
            Link: This issue duplicates MAPREDUCE-598 [ MAPREDUCE-598 ]
          Allen Wittenauer made changes
            Affects Version/s: trunk [ 12320360 ]
          Allen Wittenauer made changes
            Labels: BB2015-05-TBR

            People

            • Assignee: Steven Willis
            • Reporter: Jay Hacker
            • Votes: 1
            • Watchers: 13
