  1. Hadoop Common
  2. HADOOP-1284

clean up the protocol between stream mapper/reducer and the framework

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 0.13.0
    • None
    • None

    Description

      Right now, the protocol between the stream mapper/reducer and the framework is very inflexible.
      The mapper/reducer generates line-oriented output. The framework reads it line by line and splits
      each line into a key/value pair. By default, the substring up to the first tab character is the key, and the
      substring after the first tab character is the value.
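
      The current default can be sketched roughly as follows. This is only an illustration of the
      behavior described above, not the framework's actual code: the class name DefaultTabSplit is
      hypothetical, and treating a line without a tab as "whole line is the key, empty value" is an assumption.

```java
final class DefaultTabSplit {
    // Splits one output line at the first tab character.
    // Returns {key, value}: key is the text before the first tab, value is the text after it.
    // Assumption: a line with no tab becomes key = whole line, value = "".
    static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        String[] kv = split("apple\t3\tred");
        // Only the first tab is significant; later tabs stay inside the value.
        System.out.println("key=" + kv[0]);
        System.out.println("value has tab: " + kv[1].contains("\t"));
    }
}
```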

      However, in many cases, the application wants some control over how the pair is split.
      Here, I'd like to introduce the following configuration variables for that:

      1. "streaming.output.field.separator": the field separator. The default value is the tab character,
      but the user can specify a different one (e.g. ':' or ', ').
      A map output line can be considered a list of fields separated by the separator.

      2. "streaming.num.fields.for.mapout.key": the number of leading fields to use as the map output key
      (and for sorting on the reduce side).
      The default value is 1.
      The remaining fields will be used as the value. For example, I can specify the first 5 fields as my mapout key.

      3. "streaming.num.fields.for.partitioning": Sometimes, I want to use fewer fields for partitioning to
      achieve the "primary/secondary" composite key effect proposed in HADOOP-485. The default value is 1.
      For example, I can set "streaming.num.fields.for.partitioning" to 3
      and "streaming.num.fields.for.mapout.key" to 5.
      This effectively says that fields 1 through 3 are my primary key and fields 4 and 5 are my secondary key.
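
      To make the three settings concrete, here is a minimal sketch of how a line could be split and
      partitioned under them. Everything named here (StreamFieldSketch, nthSeparator, splitKeyValue,
      partition) is hypothetical and not part of Hadoop. The hash-modulo partitioning and the handling
      of lines with too few fields (whole line becomes the key) are assumptions made for illustration.

```java
final class StreamFieldSketch {
    // Index of the n-th occurrence (1-based) of sep in s, or -1 if there are fewer than n.
    static int nthSeparator(String s, String sep, int n) {
        if (sep.isEmpty() || n < 1) {
            throw new IllegalArgumentException("need a non-empty separator and n >= 1");
        }
        int pos = -1;
        int from = 0;
        for (int i = 0; i < n; i++) {
            pos = s.indexOf(sep, from);
            if (pos < 0) {
                return -1;
            }
            from = pos + sep.length();
        }
        return pos;
    }

    // Returns {key, value}: the key is the first numKeyFields fields, the value is the rest.
    // Assumption: a line with fewer separators than numKeyFields is all key, with an empty value.
    static String[] splitKeyValue(String line, String sep, int numKeyFields) {
        int pos = nthSeparator(line, sep, numKeyFields);
        if (pos < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + sep.length())};
    }

    // Chooses a reducer from the first numPartitionFields fields of the key only,
    // so keys that differ only in later ("secondary") fields land on the same reducer.
    static int partition(String key, String sep, int numPartitionFields, int numReducers) {
        int pos = nthSeparator(key, sep, numPartitionFields);
        String prefix = (pos < 0) ? key : key.substring(0, pos);
        return (prefix.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // separator ":", 5 key fields, 3 partitioning fields
        String[] kv = splitKeyValue("us:ca:sf:2007:04:clicks=12", ":", 5);
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
        // prints "key=us:ca:sf:2007:04 value=clicks=12"
        System.out.println(partition(kv[0], ":", 3, 4) == partition("us:ca:sf:1999:12", ":", 3, 4));
        // prints "true": both keys share the primary fields us:ca:sf
    }
}
```

      With separator set to tab and both counts set to 1, splitKeyValue behaves like the current
      first-tab split, which matches the backward-compatibility goal of the proposal.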

      With the above default values, it is compatible with the current behavior
      while introducing a new desirable feature in a clean way.

      Thoughts?

          People

            Assignee: Runping Qi
            Reporter: Runping Qi