Hadoop Map/Reduce / MAPREDUCE-5461

Let users get the latest Key in reduce()

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2.1
    • Fix Version/s: None
    • Component/s: task
    • Environment: Any environment

    Description

      The reducer framework hands reduce() a <K, List<V>> pair. In some cases, such as SecondarySort, the current V and the next V belong to the same group and therefore share the K passed to reduce(), but the keys they were actually emitted with are different. For example, in SecondarySort, map() outputs:
      Key Value
      <1, 3> 3
      <1, 1> 1
      <2, 5> 5
      <1, 8> 8

      After partitioning by Key.getFirst(), sorting by the full Key, and grouping by Key.getFirst(),
      the reducer gets:
      Key Value
      -----Group 1-----
      <1, 1> 1
      <1, 3> 3
      <1, 8> 8
      -----Group 2-----
      <2, 5> 5

      reduce() receives:

      Key List<Value>
      <1, 1> List<1, 3, 8>
      <2, 5> List<5>
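
The shuffle described above can be reproduced in a small stand-alone simulation. This is only a sketch: it uses no Hadoop classes, assumes a single partition, and the names `SecondarySortDemo`, `IntPair`, `ReduceInput`, and `shuffle` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Simulation of the SecondarySort shuffle; no Hadoop classes are used. */
final class SecondarySortDemo {

    /** Composite key <first, second>, as in the example above. */
    static final class IntPair {
        final int first;
        final int second;
        IntPair(int first, int second) { this.first = first; this.second = second; }
        @Override public String toString() { return "<" + first + ", " + second + ">"; }
    }

    /** What one reduce() call sees: the group's first key plus all of its values. */
    static final class ReduceInput {
        final IntPair key;
        final List<Integer> values = new ArrayList<>();
        ReduceInput(IntPair key) { this.key = key; }
        @Override public String toString() { return key + " " + values; }
    }

    /** Sorts by the full key, then groups adjacent records whose first fields match. */
    static List<ReduceInput> shuffle(List<IntPair> mapOutputKeys) {
        List<IntPair> sorted = new ArrayList<>(mapOutputKeys);
        sorted.sort(Comparator.<IntPair>comparingInt(k -> k.first)
                              .thenComparingInt(k -> k.second));

        List<ReduceInput> groups = new ArrayList<>();
        ReduceInput current = null;
        for (IntPair key : sorted) {
            if (current == null || current.key.first != key.first) {
                current = new ReduceInput(key); // only the first key of the group is kept
                groups.add(current);
            }
            current.values.add(key.second);     // in this example the value equals key.second
        }
        return groups;
    }

    public static void main(String[] args) {
        List<IntPair> mapOutput = Arrays.asList(
            new IntPair(1, 3), new IntPair(1, 1), new IntPair(2, 5), new IntPair(1, 8));
        for (ReduceInput input : shuffle(mapOutput)) {
            System.out.println(input);
        }
    }
}
```

Running `main` prints one line per group, `<1, 1> [1, 3, 8]` and `<2, 5> [5]`, which shows that the keys `<1, 3>` and `<1, 8>` are no longer visible to reduce().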

      When we call next() on the values iterator, we get the next V (e.g., 3), but there is no API to get its corresponding Key (e.g., <1, 3>). We can only get the first Key of the group (e.g., <1, 1>).

      If users could get the latest Key, SecondarySort would not need to emit the value in map() at all, because reduce() could read it from the Key. This would reduce network traffic during the shuffle.
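
One possible shape for such an accessor is sketched below in plain Java. Everything here is hypothetical: `GroupCursor`, `getLatestKey()`, and the other names are not existing Hadoop APIs, and the sketch only illustrates how reduce() could recover each value from the keys when map() emits no value payload.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Sketch of a "latest key" accessor; every name here is hypothetical. */
final class LatestKeyDemo {

    /** Composite key <first, second>. */
    static final class IntPair {
        final int first;
        final int second;
        IntPair(int first, int second) { this.first = first; this.second = second; }
    }

    /** Walks one reduce group and remembers the key of the record most recently read. */
    static final class GroupCursor {
        private final Iterator<IntPair> keys;
        private IntPair latestKey;

        GroupCursor(List<IntPair> sortedGroupKeys) {
            this.keys = sortedGroupKeys.iterator();
        }

        boolean hasNext() { return keys.hasNext(); }

        /** Advances to the next record; there is no value payload to return. */
        void next() { latestKey = keys.next(); }

        /** Hypothetical accessor of the kind this issue asks for. */
        IntPair getLatestKey() {
            if (latestKey == null) {
                throw new IllegalStateException("next() has not been called yet");
            }
            return latestKey;
        }
    }

    /** A reduce() body that recovers every value from the keys alone. */
    static List<Integer> reduce(GroupCursor group) {
        List<Integer> seconds = new ArrayList<>();
        while (group.hasNext()) {
            group.next();
            seconds.add(group.getLatestKey().second);
        }
        return seconds;
    }

    public static void main(String[] args) {
        GroupCursor group1 = new GroupCursor(Arrays.asList(
            new IntPair(1, 1), new IntPair(1, 3), new IntPair(1, 8)));
        System.out.println(reduce(group1)); // prints [1, 3, 8]
    }
}
```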

      Another example is Join. If we could get the latest Key, we would not need to put the table label in both the Key and the Value.
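
The Join case can be sketched the same way. This is an illustration under assumptions, not Hadoop code: `TaggedKey`, `Record`, and the table labels "A" and "B" are invented, and the `latestKey` variable stands in for whatever accessor would expose the key of the current record. The rows themselves carry no table label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of a reduce-side join that reads the table label from the key only. */
final class JoinLatestKeyDemo {

    /** Composite key: the join key plus the label of the source table. */
    static final class TaggedKey {
        final int joinKey;
        final String table;
        TaggedKey(int joinKey, String table) { this.joinKey = joinKey; this.table = table; }
    }

    /** One shuffled record: the key it was emitted with and an untagged row. */
    static final class Record {
        final TaggedKey key;
        final String row;
        Record(int joinKey, String table, String row) {
            this.key = new TaggedKey(joinKey, table);
            this.row = row;
        }
    }

    /** reduce() for one group, assumed sorted so that table "A" rows precede "B" rows. */
    static List<String> reduce(List<Record> group) {
        List<String> leftRows = new ArrayList<>();
        List<String> joined = new ArrayList<>();
        for (Record record : group) {
            TaggedKey latestKey = record.key; // stands in for the proposed accessor
            if (latestKey.table.equals("A")) {
                leftRows.add(record.row);
            } else {
                for (String left : leftRows) {
                    joined.add(left + "," + record.row);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Record> group = Arrays.asList(
            new Record(1, "A", "alice"),
            new Record(1, "B", "order-7"),
            new Record(1, "B", "order-9"));
        System.out.println(reduce(group)); // prints [alice,order-7, alice,order-9]
    }
}
```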

          People

            Assignee: Unassigned
            Reporter: Lijie Xu (jerrylead)
            Votes: 0
            Watchers: 1

              Time Tracking

                Original Estimate: 12h
                Remaining Estimate: 12h
                Time Spent: Not Specified