  1. Hadoop Map/Reduce
  2. MAPREDUCE-6712

Support grouping values for reducer on java-side

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: contrib/streaming
    • Labels:
      None

      Description

      In Hadoop Streaming with TextInputWriter, the reducer program reads one line per (k, v) tuple from stdin, and values that share a key are not grouped.
      This is inefficient, especially for interpreter-based runtimes (e.g. CPython), because:
      A. the user program has to compare each key with the previous one, even though on the Java side records already reach the reducer in groups;
      B. the user program has to read every record and then find the separator or split on it, even when many values share the same key;
      C. if keys are long, repeating the key on every record apparently hurts caching as well.
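
To make points A and B concrete, here is a minimal sketch of what a Python streaming reducer has to do today. It assumes the default tab separator between key and value; the function names (`parse_line`, `grouped`, `count_values`) are illustrative and not part of Hadoop.

```python
from itertools import groupby


def parse_line(line, sep="\t"):
    """Split one streaming record into (key, value) at the first separator."""
    key, _, value = line.rstrip("\n").partition(sep)
    return key, value


def grouped(lines, sep="\t"):
    """Yield (key, [values]) by re-detecting group boundaries in user code.

    Every record is split (point B) and its key is compared with the
    previous record's key (point A), although the Java side already knew
    where each group started and ended.
    """
    records = (parse_line(line, sep) for line in lines)
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


def count_values(lines):
    """Example reduce step: count the values seen for each key."""
    return [(key, len(values)) for key, values in grouped(lines)]


# In a real reducer the argument would be sys.stdin; a small in-memory
# sample is used here so the sketch runs on its own.
sample = ["a\t1\n", "a\t2\n", "b\t3\n"]
for key, n in count_values(sample):
    print(f"{key}\t{n}")
```

The per-record `partition` call and the key comparison inside `groupby` are the work that grouping on the Java side would remove.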

      I suppose we need another InputWriter. But that alone is not enough, because the InputWriter interface defines writeKey and writeValue, not writeValues. We could compare keys and group them inside a custom InputWriter, but that is also inefficient. Some other changes are needed as well.
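
For illustration only, the sketch below shows one possible grouped wire format and how a reducer could consume it. The format is purely an assumption (a `key<TAB>count` header line followed by `count` value lines); the issue does not specify one and Hadoop does not implement it. `write_grouped` stands in for what a grouping-aware writer on the Java side might emit, and `read_grouped` is the hypothetical reducer-side reader.

```python
from itertools import islice


def write_grouped(groups):
    """Hypothetical encoder: emit 'key<TAB>count' once, then one value per line."""
    lines = []
    for key, values in groups:
        lines.append(f"{key}\t{len(values)}\n")
        lines.extend(f"{value}\n" for value in values)
    return lines


def read_grouped(lines):
    """Hypothetical reducer-side reader for the format above.

    The key is parsed once per group; value lines need no split and no
    key comparison. Values containing newlines are not handled here.
    """
    it = iter(lines)
    for header in it:
        # rpartition keeps any tabs that are part of the key itself.
        key, _, count = header.rstrip("\n").rpartition("\t")
        values = [v.rstrip("\n") for v in islice(it, int(count))]
        yield key, values


groups = [("a", ["1", "2"]), ("b", ["3"])]
for key, values in read_grouped(write_grouped(groups)):
    print(key, values)
```

With a format along these lines, the cost of transmitting and parsing the key is paid once per group rather than once per record, which is the saving points A to C are after.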

            People

            • Assignee:
              Unassigned
            • Reporter:
              He Tianyi
            • Votes:
              0
            • Watchers:
              2
