Hadoop Map/Reduce / MAPREDUCE-902

Map output merge still uses unnecessary seeks

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: None
    • Component/s: task
    • Labels: None

    Description

      HADOOP-3638 improved the merge of the map output by caching the index files.

      But why not also cache the data files?

      In our use case (still on hadoop-0.18.3, where HADOOP-3638 would help only partially), an individual map task finishes in less than 30 minutes but then needs 4 hours to merge 70 spills for 20,000 partitions (with LZO compression). The merge reads about 10 kB from each spill file per partition, and each spill file is re-opened for every partition. Since this is just a merge sort, there is no reason not to keep the input files open and eliminate seeks altogether by reading them sequentially.
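      For scale, 70 spills times 20,000 partitions is 1.4 million segment reads, so 4 hours works out to roughly 10 ms per segment. The idea of keeping every spill open and reading it strictly front to back can be sketched as below. This is only an illustration under stated assumptions, not the Hadoop implementation: the tab-separated "partition, key" text format and the names `SpillReader` and `merge_spills` are hypothetical, and each spill is assumed to be sorted by partition and then by key.

```python
import heapq
import io


class SpillReader:
    """Sequential reader over one spill; the stream is opened once and never seeked."""

    def __init__(self, stream):
        self._stream = stream
        self._pending = self._read()  # one-record lookahead

    def _read(self):
        line = self._stream.readline()
        if not line:
            return None  # end of spill
        partition, key = line.rstrip("\n").split("\t", 1)
        return int(partition), key

    def segment(self, partition):
        """Yield this spill's keys for `partition`, stopping at the next partition."""
        while self._pending is not None and self._pending[0] == partition:
            yield self._pending[1]
            self._pending = self._read()


def merge_spills(streams, num_partitions):
    """Merge already-open spills partition by partition.

    Assumes (hypothetically) that every spill lists partitions in ascending
    order, so each stream is consumed in a single forward pass.
    """
    readers = [SpillReader(s) for s in streams]
    for partition in range(num_partitions):
        segments = [r.segment(partition) for r in readers]
        # heapq.merge does a lazy k-way merge of the sorted segments.
        for key in heapq.merge(*segments):
            yield partition, key


# Two tiny in-memory "spills" standing in for on-disk files.
spill_a = io.StringIO("0\tapple\n0\tpear\n1\tkiwi\n")
spill_b = io.StringIO("0\tfig\n1\tlime\n1\tplum\n")
merged = list(merge_spills([spill_a, spill_b], num_partitions=2))
print(merged)
```

      The one-record lookahead is what lets each reader stop at a partition boundary without seeking back; the cost is one open file handle and one read buffer per spill for the duration of the merge.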

          People

            Assignee: Unassigned
            Reporter: Christian Kunz (ckunz)
            Votes: 0
            Watchers: 9