Hadoop Map/Reduce / MAPREDUCE-5048

Streaming combiner feature breaks when input is binary and output is text

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.2
    • Fix Version/s: None
    • Component/s: contrib/streaming
    • Labels:
      None
    • Environment:

      CentOS 6.2

      Description

      When running a Hadoop streaming job with binary (typedbytes) input and shuffle but text output, and with a combiner enabled, the job fails with the error:

      java.lang.RuntimeException: java.io.IOException: wrong key class: class org.apache.hadoop.io.Text is not class org.apache.hadoop.typedbytes.TypedBytesWritable

      To reproduce:

      hadoop jar <streaming jar> -D 'stream.map.input=typedbytes' -D 'stream.map.output=typedbytes' -D 'stream.reduce.input=typedbytes' -input <sequence file containing typedbytes> -output <any valid dir> -mapper cat -combiner cat -reducer cat -inputformat 'org.apache.hadoop.streaming.AutoInputFormat'

      If you remove the -combiner option, the job works, with only performance implications. If you additionally specify -D 'stream.reduce.output=typedbytes', the job succeeds but writes raw typedbytes (without the sequence file superstructure).
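
For anyone scripting the comparison, here is a minimal sketch that assembles the same command line with and without the combiner. The helper name `build_streaming_command` and the placeholder paths (`streaming.jar`, `in.seq`, `out`) are hypothetical and not part of Hadoop; only the flags are taken from the command in this report.

```python
import shlex


def build_streaming_command(streaming_jar, input_path, output_path, combiner="cat"):
    """Return the argv list for the streaming job described in this report.

    Passing combiner=None omits the -combiner option (the workaround).
    This helper is hypothetical; it only assembles arguments and does not
    run Hadoop.
    """
    args = [
        "hadoop", "jar", streaming_jar,
        "-D", "stream.map.input=typedbytes",
        "-D", "stream.map.output=typedbytes",
        "-D", "stream.reduce.input=typedbytes",
        "-input", input_path,
        "-output", output_path,
        "-mapper", "cat",
    ]
    if combiner is not None:
        args += ["-combiner", combiner]
    args += [
        "-reducer", "cat",
        "-inputformat", "org.apache.hadoop.streaming.AutoInputFormat",
    ]
    return args


# Placeholder paths; substitute the real streaming jar and HDFS locations.
failing = build_streaming_command("streaming.jar", "in.seq", "out")
workaround = build_streaming_command("streaming.jar", "in.seq", "out", combiner=None)

print(shlex.join(failing))     # configuration that triggers the error
print(shlex.join(workaround))  # same job without -combiner
```

The resulting list can be passed to `subprocess.run` on a machine where `hadoop` is on the PATH; the two variants differ only in the `-combiner cat` pair.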

      I asked in the discussion of HADOOP-1722 (where typedbytes was first introduced) whether this is a bug or my misunderstanding of that spec, and a committer chimed in to say it looks like a bug to him too.
      The problem was originally reported by a user of the rmr2 package for R, and I filed it here: https://github.com/RevolutionAnalytics/rmr2/issues/16

            People

            • Assignee:
              Unassigned
              Reporter:
              piccolbo Antonio Piccolboni