Hadoop Common / HADOOP-38

default splitter should incorporate fs block size


    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels:


      By default, the file splitting code should operate as follows.

      inputs are <file>*, numMapTasks, minSplitSize, fsBlockSize
      output is <file,start,length>*

      totalSize = sum of all file sizes;

      desiredSplitSize = totalSize / numMapTasks;
      if (desiredSplitSize > fsBlockSize) /* new */
        desiredSplitSize = fsBlockSize;
      if (desiredSplitSize < minSplitSize)
        desiredSplitSize = minSplitSize;

      chop input files into desiredSplitSize chunks & return them

      In other words, the numMapTasks is a desired minimum. We'll try to chop input into at least numMapTasks chunks, each ideally a single fs block.

      If there's not enough input data to create numMapTasks tasks, each with an entire block, then we'll permit tasks whose input is smaller than a filesystem block, down to a minimum split size.
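The pseudocode above can be sketched in Java roughly as follows. This is only an illustration of the proposal, not the code that was committed: the class `SplitSketch`, the `Split` holder, and the method names are hypothetical, and the guards against a zero task count or a zero split size are assumptions added so the sketch cannot divide by zero or loop forever.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed default splitter.
// None of these names are taken from Hadoop's actual API.
class SplitSketch {

    /** One <file,start,length> triple, as in the description. */
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "<" + file + "," + start + "," + length + ">";
        }
    }

    /** Computes the split size: cap at the block size, then raise to the minimum. */
    static long desiredSplitSize(long totalSize, int numMapTasks,
                                 long minSplitSize, long fsBlockSize) {
        // Assumption: treat a task count below 1 as 1 to avoid dividing by zero.
        long desired = totalSize / Math.max(1, numMapTasks);
        if (desired > fsBlockSize) {   // the step marked "new"
            desired = fsBlockSize;
        }
        if (desired < minSplitSize) {
            desired = minSplitSize;
        }
        // Assumption: never return 0, so the chopping loop always advances.
        return Math.max(1L, desired);
    }

    /** Chops each file into chunks of the desired split size. */
    static List<Split> getSplits(String[] files, long[] sizes, int numMapTasks,
                                 long minSplitSize, long fsBlockSize) {
        long totalSize = 0;
        for (long size : sizes) {
            totalSize += size;
        }
        long splitSize =
            desiredSplitSize(totalSize, numMapTasks, minSplitSize, fsBlockSize);

        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < files.length; i++) {
            for (long start = 0; start < sizes[i]; start += splitSize) {
                // The last chunk of a file may be shorter than splitSize.
                long length = Math.min(splitSize, sizes[i] - start);
                splits.add(new Split(files[i], start, length));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up input: two files of 250 and 100 bytes, 2 map tasks,
        // a minimum split of 10 bytes, and a block size of 100 bytes.
        List<Split> splits = getSplits(new String[] {"a", "b"},
                                       new long[] {250L, 100L}, 2, 10L, 100L);
        for (Split s : splits) {
            System.out.println(s);
        }
    }
}
```

Because the block-size cap is applied before the minimum, a `minSplitSize` larger than `fsBlockSize` would win in this sketch, which matches the order of the two checks in the pseudocode.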

      This handles cases where:

      • each input record takes a lot of time to process. In this case we want to make sure we use all of the cluster. Thus it is important to permit splits smaller than the fs block size.
      • input i/o dominates. In this case we want to permit the placement of tasks on hosts where their data is local. This is only possible if splits are fs block size or smaller.
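To make the two cases concrete, here is a small worked example of the split-size arithmetic. All numbers (a 32 MB block size, a 1 MB minimum split, 16 map tasks, and the input sizes) are made up for illustration, and the class name `SplitSizeCases` is hypothetical.

```java
// Hypothetical worked example; the sizes below are illustrative assumptions.
class SplitSizeCases {

    /** Direct transcription of the split-size rule from the description. */
    static long desiredSplitSize(long totalSize, int numMapTasks,
                                 long minSplitSize, long fsBlockSize) {
        long desired = totalSize / numMapTasks;
        if (desired > fsBlockSize) {   // the step marked "new"
            desired = fsBlockSize;
        }
        if (desired < minSplitSize) {
            desired = minSplitSize;
        }
        return desired;
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        long fsBlockSize = 32 * MB;   // assumed block size
        long minSplitSize = 1 * MB;   // assumed minimum split size
        int numMapTasks = 16;

        // Case 1: expensive records, small input (64 MB).
        // 64 MB / 16 tasks = 4 MB, which is below the block size,
        // so all 16 tasks still get work.
        long cpuBound =
            desiredSplitSize(64 * MB, numMapTasks, minSplitSize, fsBlockSize);
        System.out.println(cpuBound / MB + " MB");   // prints "4 MB"

        // Case 2: i/o dominates, large input (10240 MB).
        // 10240 MB / 16 tasks = 640 MB, which exceeds the block size,
        // so the split is capped at one 32 MB block.
        long ioBound =
            desiredSplitSize(10240 * MB, numMapTasks, minSplitSize, fsBlockSize);
        System.out.println(ioBound / MB + " MB");    // prints "32 MB"
    }
}
```

In the second case the cap yields 10240 / 32 = 320 splits rather than 16, which is why numMapTasks behaves as a minimum rather than an exact count.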

      Are there other common cases that this algorithm does not handle well?

      The part marked 'new' above is not currently implemented, but I'd like to add it.

      Does this sound reasonable?


        Owen O'Malley made changes -
        Component/s mapred [ 12310690 ]
        Doug Cutting made changes -
        Workflow no-reopen-closed [ 12373345 ] no-reopen-closed, patch-avail [ 12377652 ]
        Doug Cutting made changes -
        Workflow no reopen closed [ 12373009 ] no-reopen-closed [ 12373345 ]
        Doug Cutting made changes -
        Workflow jira [ 12347054 ] no reopen closed [ 12373009 ]
        Doug Cutting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Doug Cutting made changes -
        Fix Version/s 0.1.0 [ 12310812 ]
        Doug Cutting made changes -
        Field Original Value New Value
        Resolution Fixed [ 1 ]
        Status Open [ 1 ] Resolved [ 5 ]
        Doug Cutting created issue -


          • Assignee:
            Doug Cutting
          • Votes:
            0
          • Watchers:
            0


            • Created: