Hadoop Map/Reduce: MAPREDUCE-5063

Transferring mapper output (key,value) pairs to multiple reducers

    Details

    • Type: Wish
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.0.3
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Tags:
      shuffling (key,value) pair to multiple reducer

      Description

      Currently in Hadoop MapReduce, a mapper output (key,value) pair can be transferred to only one reducer.

      Our goal is to be able to transfer/shuffle a (key,value) pair to multiple reducers.

      Note: we need to shuffle the same pair to a number of reducers.
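For context, the behavior requested here can be emulated today by having the mapper emit one tagged copy of the record per target reducer. The plain-Java sketch below (no Hadoop API involved; the class name, the `#` tag convention, and the target indices are illustrative assumptions) shows the emission step, which is also the duplication discussed later in the thread:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// Plain-Java sketch (no Hadoop dependency) of the user-code workaround:
// the mapper emits one copy of the record per target reducer, tagging
// each copy's key with its destination partition.
public class MulticastEmit {
    // Emit (key, value) once per target reducer index.
    static List<Entry<String, String>> multicast(String key, String value, int... targets) {
        List<Entry<String, String>> out = new ArrayList<>();
        for (int t : targets) {
            // Tagged key, e.g. "word#2", to be routed to reducer 2 by a partitioner.
            out.add(new SimpleEntry<>(key + "#" + t, value));
        }
        return out;
    }

    public static void main(String[] args) {
        // Send the same pair to reducers 0 and 2.
        for (Entry<String, String> e : multicast("word", "1", 0, 2)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Note that each extra target produces an extra map-output record, which is exactly the additional local-disk write the reporter raises below.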

        Activity

        Harsh J added a comment -

        The need is not that highly common, although it exists for certain problems. Given that the pluggability allows users to achieve this, I'm not sure it would be worth the trouble trying to maintain it upstream here as well. Look at the join package, for instance: it's serving as a good example, but the point was also that folks would use it - and I don't see that happening frequently, as people prefer to write their own equivalents since it's not highly reusable in all scenarios.

        The problem of a broadcast is simple: send it to all partitions - easy to define and perform on flagged keys. The problem of a multicast is not that simple: you want the behavior defined on a per-key basis, so flags get more specific. Besides, these needs only crop up for special MR optimization cases, for which a custom implementation should suffice.

        Adding in APIs for special things is easy, but we also have to consider long-term support and usefulness when adding it upstream. If it can be provided by a third-party repository, that's better and more flexible. We can also recommend it to anyone who comes along with such a question.

        Vikas Jadhav added a comment -

        Hi Harsh,

        I had also planned it this way, and I am currently going to implement it in user code. But here is why I thought a separate API would be good to have:

        As you already know, there is a lot of data reading and writing in MapReduce. If we implement this in user code, it may increase the number of writes to local disk, because we may have to write the same pair more than once.

        Also, I think there may be a problem with the original (key,value) pair: we may change the key into (key_r1,value) and (key_r2,value), and here we are changing "key", which is not desirable.

        So my point is: can we have an approach where we can shuffle a pair without writing it two times?

        Thanks and regards,
        Vikas Jadhav
        Harsh J added a comment -

        This is easily possible in user code with a custom key structure and a custom partitioner. There are no gotchas in implementing this for a specific need.

        I don't see why we should add a dedicated API for this. If you agree, let's close this out as invalid, as this qualifies as a simple user question and belongs on the mailing lists.
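The custom-partitioner half of this suggestion can be sketched in plain Java. The method below mirrors the shape of Hadoop's `Partitioner.getPartition(key, value, numPartitions)`, but it is not the actual API; the `#` suffix convention and the class name are assumptions for illustration:

```java
// Plain-Java sketch of a partitioner for keys that carry their destination
// reducer as a "#<index>" suffix. Untagged keys fall back to hash
// partitioning, as Hadoop's default HashPartitioner does.
public class TaggedPartitioner {
    // Mirrors the shape of Partitioner<K,V>.getPartition(key, value, numPartitions).
    static int getPartition(String taggedKey, int numPartitions) {
        int sep = taggedKey.lastIndexOf('#');
        if (sep < 0) {
            // No tag: ordinary hash partitioning (mask keeps the value non-negative).
            return (taggedKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
        // Tagged: route to the reducer index encoded in the key.
        return Integer.parseInt(taggedKey.substring(sep + 1)) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("word#2", 4)); // prints 2
    }
}
```

The reducer side would strip the suffix before using the key, which addresses the earlier concern that the "real" key gets altered in transit.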

        Karthik Kambatla added a comment -

        While it is an interesting ask, I am not sure changing the PL semantics and guarantees of MapReduce is a good idea. Please share a design doc if you have a particular approach in mind. If we can continue to support the same semantics/guarantees we have had so far with the proposed changes, maybe we can include it.

        Mariappan Asokan added a comment -

        MAPREDUCE-4049 made shuffle pluggable. Will writing your own plugin satisfy your needs?

        – Asokan


          People

          • Assignee: Unassigned
          • Reporter: Vikas Jadhav
          • Votes: 0
          • Watchers: 6
