Hadoop Common / HADOOP-331

map outputs should be written to a single output file with an index

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.2
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None

      Description

      The current strategy of writing a file per target map consumes a lot of unused buffer space (causing out-of-memory crashes) and puts a heavy burden on the FS (many opens, inodes used, etc.).

      I propose that we write a single file containing all output and also write an index file identifying which byte range in the file goes to each reduce. This will remove the issue of buffer waste, address scaling issues with the number of open files, and generally set us up better for scaling. It will also have advantages with very small inputs, since the buffer cache will reduce the number of seeks needed, and the data-serving node can open a single file and keep it open rather than needing to do directory and open operations on every request.
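The single-file-plus-index layout proposed above can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patches: the class `MapOutputIndexSketch`, the `IndexEntry` type, and both method names are hypothetical, and an in-memory byte array stands in for the on-disk output file.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;

public class MapOutputIndexSketch {
    /** Hypothetical index record: byte range of one reduce's data in the single file. */
    public static final class IndexEntry {
        public final long offset;
        public final long length;

        IndexEntry(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Concatenates the per-reduce partitions into one buffer (standing in for the
     * single output file) and appends one index entry per partition to indexOut.
     */
    public static byte[] writeOutput(List<byte[]> partitions, List<IndexEntry> indexOut) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        for (byte[] partition : partitions) {
            // The partition starts at the current end of the file.
            indexOut.add(new IndexEntry(data.size(), partition.length));
            data.write(partition, 0, partition.length);
        }
        return data.toByteArray();
    }

    /** Returns only the byte range that the index assigns to the given reduce. */
    public static byte[] readPartition(byte[] data, List<IndexEntry> index, int reduce) {
        IndexEntry entry = index.get(reduce);
        return Arrays.copyOfRange(data, (int) entry.offset, (int) (entry.offset + entry.length));
    }
}
```

In this sketch a serving node would keep one file open and answer each reduce's request with a single seek to `offset` followed by a read of `length` bytes.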

      The only issue I see is that in cases where the task output is substantially larger than its input, we may need to spill multiple times. In this case, we can do a merge after all spills are complete (or during the final spill).
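The merge of multiple spills mentioned above might look roughly like the sketch below. It rests on an assumption the description does not state: that each spill keeps every reduce partition's records sorted by key, so the final pass can be a k-way merge per partition. The class `SpillMergeSketch`, its method names, and the use of plain `String` keys are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class SpillMergeSketch {
    /** K-way merge of sorted runs (one run per spill) for a single reduce partition. */
    public static List<String> mergeRuns(List<List<String>> runs) {
        // Heap entries are {runIndex, positionInRun}, ordered by the key they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] {r, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            // Advance within the run the smallest key came from.
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    /**
     * Assumes spills.get(s).get(p) is the sorted run for partition p in spill s.
     * Returns one merged run per partition, in partition order, ready to be
     * written to the single output file.
     */
    public static List<List<String>> mergeSpills(List<List<List<String>>> spills, int numPartitions) {
        List<List<String>> merged = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<List<String>> runs = new ArrayList<>();
            for (List<List<String>> spill : spills) {
                runs.add(spill.get(p));
            }
            merged.add(mergeRuns(runs));
        }
        return merged;
    }
}
```

If the spilled records were not sorted, the same per-partition loop could simply concatenate the runs instead of merging them.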

      1. 331.patch (60 kB, Devaraj Das)
      2. 331-initial3.patch (67 kB, Devaraj Das)
      3. 331-design.txt (4 kB, Devaraj Das)
      4. 331.txt (4 kB, Devaraj Das)

          Activity

          eric baldeschwieler created issue -
          Doug Cutting made changes -
          Field Original Value New Value
          Workflow no-reopen-closed [ 12374612 ] no-reopen-closed, patch-avail [ 12377507 ]
          Doug Cutting made changes -
          Fix Version/s 0.5.0 [ 12311939 ]
          Fix Version/s 0.6.0 [ 12312025 ]
          Doug Cutting made changes -
          Fix Version/s 0.6.0 [ 12312025 ]
          Owen O'Malley made changes -
          Link This issue is cloned as HADOOP-570 [ HADOOP-570 ]
          Devaraj Das made changes -
          Assignee Yoram Arnon [ yarnon ] Devaraj Das [ devaraj ]
          Devaraj Das made changes -
          Attachment 331.txt [ 12343303 ]
          Devaraj Das made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Devaraj Das made changes -
          Attachment 331-design.txt [ 12344023 ]
          Yoram Arnon made changes -
          Link This issue incorporates HADOOP-717 [ HADOOP-717 ]
          Devaraj Das made changes -
          Attachment 331-initial3.patch [ 12346140 ]
          Devaraj Das made changes -
          Attachment 331.patch [ 12346611 ]
          Devaraj Das made changes -
          Attachment 331.patch [ 12346611 ]
          Devaraj Das made changes -
          Attachment 331.patch [ 12346642 ]
          Devaraj Das made changes -
          Status In Progress [ 3 ] Patch Available [ 10002 ]
          Doug Cutting made changes -
          Fix Version/s 0.10.0 [ 12312207 ]
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Doug Cutting made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Owen O'Malley made changes -
          Component/s mapred [ 12310690 ]

            People

            • Assignee:
              Devaraj Das
              Reporter:
              eric baldeschwieler
            • Votes:
              2
              Watchers:
              2
