  1. Hadoop Common
  2. HADOOP-717

When there are few reducers, sorting should be done by mappers

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels: None

      Description

      If I understand correctly, sorting currently happens on the reducer side.
      So if a few hundred mappers produce a few (or many) gigabytes of data, and there is just ONE reducer to consume it, copying and sorting take forever.

      It may make sense to add a special-case optimization for a single reducer (e.g. "when there is only one reducer and many mappers, the sort is done by the mappers, and the reducer does only a merge").

      Alternatively, a smarter policy could make sure that sorting uses as many CPUs as makes sense. If the map step has produced data on all the nodes of the cluster, it makes sense to use all of those nodes for sorting.
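
      The proposal above (mappers sort, the lone reducer only merges) can be illustrated with a minimal sketch. This is not Hadoop code and not necessarily how the fix was implemented; the record shape and the names `map_side_sort` and `reduce_side_merge` are hypothetical, chosen only for illustration. Each mapper sorts its own output in parallel, and the reducer performs a k-way merge over the already-sorted runs.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record shape: a (key, value) pair.
Record = Tuple[str, int]


def map_side_sort(map_output: Iterable[Record]) -> List[Record]:
    """Sort one mapper's output by key; this can run on every mapper in parallel."""
    return sorted(map_output, key=lambda kv: kv[0])


def reduce_side_merge(sorted_runs: List[List[Record]]) -> Iterator[Record]:
    """Merge runs that are already sorted by key; no full sort on the reducer."""
    return heapq.merge(*sorted_runs, key=lambda kv: kv[0])


# Three hypothetical mappers, each holding unsorted output.
mapper_outputs = [
    [("pear", 1), ("apple", 2)],
    [("fig", 3), ("banana", 4)],
    [("cherry", 5), ("apple", 6)],
]

runs = [map_side_sort(out) for out in mapper_outputs]
merged = list(reduce_side_merge(runs))
```

      With n records spread over m mappers, the per-mapper sorts run concurrently, and the single reducer's merge costs roughly O(n log m) comparisons instead of a full O(n log n) sort.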

              People

              • Assignee: Owen O'Malley (owen.omalley)
              • Reporter: arkady borkovsky (arkady)