Hadoop Common / HADOOP-2806

Streaming has no way to force entire record (or null) as key


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels: None

    Description

      I think streaming may need an "-allkey" or "-nullkey" option. Failing that, I'm concerned there is a subtle problem in the streaming documentation.

      These two docs:

      http://hadoop.apache.org/core/docs/current/streaming.html
      http://wiki.apache.org/hadoop/HadoopStreaming (Should be merged with above?)

      ... seem to ignore that streaming, by default, splits key/value on TAB. Sure, they mention it, but none of the simple (no separator) examples takes into account that streaming decides line by line whether the key is the whole line or only the text up to the first tab, should one occur. This means that records containing a tab may be sorted differently from records without one.

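      Here's a very simple pair of examples that, to the naive, should produce the same output, but do not. In str-tabs the two fields on each line are separated by a TAB; in str-notabs they are separated by a space: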

      > [hod] (marco) >> run dfs -fs local -cat str-tabs
      > a 1
      > b 3
      > a 4
      >
      > [hod] (marco) >> run dfs -put str-tabs str-tabs
      >
      > [hod] (marco) >> run jar hadoop-streaming.jar -input str-tabs -output str-tabs.out -mapper /bin/cat -reducer /bin/cat
      > [blah blah blah]
      >
      > [hod] (marco) >> run dfs -cat str-tabs.out/part-00000
      > a 4
      > a 1
      > b 3

      Compare to this negative-test:
      > [hod] (marco) >> run dfs -fs local -cat str-notabs
      > a 1
      > b 3
      > a 4
      >
      > [hod] (marco) >> run dfs -put str-notabs str-notabs
      >
      > [hod] (marco) >> run jar hadoop-streaming.jar -input str-notabs -output str-notabs.out -mapper /bin/cat -reducer /bin/cat
      > [blah blah blah]
      >
      > [hod] (marco) >> run dfs -cat str-notabs.out/part-00000
      > a 1
      > a 4
      > b 3
      >
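
The difference between the two runs can be illustrated with a small Python sketch of the default split rule as the streaming docs describe it: the text before the first TAB is the key, the rest is the value, and a line with no TAB becomes a key with a null value. This is not Hadoop code. The names split_record and sort_by_key are hypothetical, and the sort is only a stand-in for the framework's sort by key.

```python
def split_record(line, sep="\t"):
    """Split one line into (key, value) on the first separator.

    If the separator is absent, the whole line is the key and the
    value is None (standing in for a null value).
    """
    key, found, value = line.partition(sep)
    return (key, value if found else None)


def sort_by_key(lines):
    """Sort records by key only, ignoring the value (a simplification)."""
    records = [split_record(line) for line in lines]
    return sorted(records, key=lambda kv: kv[0])


# Contents of the two input files from the report (assumed: TAB vs. space).
with_tabs = ["a\t1", "b\t3", "a\t4"]
no_tabs = ["a 1", "b 3", "a 4"]

tab_keys = [key for key, _ in sort_by_key(with_tabs)]
space_keys = [key for key, _ in sort_by_key(no_tabs)]

print(tab_keys)    # keys are only the first field
print(space_keys)  # keys are the entire lines
```

With tabs, both "a" records share one key, so a sort by key alone says nothing about which of them comes first. Python's sort happens to be stable, which the job in the report evidently was not: it emitted "a 4" before "a 1". Without tabs, each whole line is its own key, so the output is fully ordered.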

      Attachments

        1. patch-2806.txt
          3 kB
          Amareshwari Sriramadasu

          People

            Assignee: Amareshwari Sriramadasu
            Reporter: Marco Nicosia