Hadoop Common / HADOOP-38

default splitter should incorporate fs block size

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels:
      None

      Description

      By default, the file splitting code should operate as follows.

      inputs are <file>*, numMapTasks, minSplitSize, fsBlockSize
      output is <file,start,length>*

      totalSize = sum of all file sizes;

      desiredSplitSize = totalSize / numMapTasks;
      if (desiredSplitSize > fsBlockSize) /* new */
        desiredSplitSize = fsBlockSize;
      if (desiredSplitSize < minSplitSize)
        desiredSplitSize = minSplitSize;

      chop input files into desiredSplitSize chunks & return them

      In other words, the numMapTasks is a desired minimum. We'll try to chop input into at least numMapTasks chunks, each ideally a single fs block.

      If there's not enough input data to create numMapTasks tasks, each with an entire block, then we'll permit tasks whose input is smaller than a filesystem block, down to a minimum split size.
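
      As a rough illustration, a minimal Java sketch of the split-size rule above (a standalone helper written for this description, not the actual splitting code; the class and method names are made up):

      public class SplitSize {
        /** Compute the desired split size per the rule described above. */
        public static long desiredSplitSize(long totalSize, int numMapTasks,
                                            long minSplitSize, long fsBlockSize) {
          long desiredSplitSize = totalSize / numMapTasks;
          if (desiredSplitSize > fsBlockSize)   // the part marked 'new'
            desiredSplitSize = fsBlockSize;
          if (desiredSplitSize < minSplitSize)
            desiredSplitSize = minSplitSize;
          return desiredSplitSize;
        }
      }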

      This handles cases where:

      • each input record takes a lot of time to process. In this case we want to make sure we use all of the cluster. Thus it is important to permit splits smaller than the fs block size.
      • input i/o dominates. In this case we want to permit the placement of tasks on hosts where their data is local. This is only possible if splits are fs block size or smaller.

      Are there other common cases that this algorithm does not handle well?

      The part marked 'new' above is not currently implemented, but I'd like to add it.

      Does this sound reasonable?

        Activity

        Bryan Pendleton added a comment -

        The idea sounds sound, but is blocksize the best unit? There's a certain overhead for each additional task added to a job - for jobs with really large input, this could cause really large task lists. Is there going to be any code for pre-replicating blocks? Maybe sequences, so there'd be a natural "first choice" node for many chunkings of larger than one block? Obviously, as datanodes come and go this might not always work ideally, but it could help in the 80% case.

        Doug Cutting added a comment -

        The surest way to get larger chunks is to increase the block size.

        The default DFS blocksize is currently 32MB, which gives 31k tasks for terabyte inputs, which is reasonable. I think we should design things to be able to handle perhaps a million tasks, which, with the current block size, would get us to 32 terabyte inputs.

        Perhaps the default should be 1GB/block. With a million tasks, that would get us to a maximum of a petabyte per job. On a 10k node cluster, a petabyte takes hours to read (100GB/node @ 10MB/second = 10k seconds).

        We'll also need to revise the web UI to better handle a million tasks...
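
        As a back-of-envelope check of these numbers, a small standalone Java sketch (decimal units assumed, 1TB = 10^12 bytes; not Hadoop code):

        public class BlockSizeMath {
          public static void main(String[] args) {
            long MB = 1000L * 1000, GB = 1000 * MB, TB = 1000 * GB, PB = 1000 * TB;
            System.out.println(TB / (32 * MB));           // 31250: tasks for 1TB of input at 32MB/block
            System.out.println(1000000 * 32 * MB / TB);   // 32: TB of input for 1M tasks at 32MB/block
            System.out.println(1000000 * GB / PB);        // 1: PB of input for 1M tasks at 1GB/block
            System.out.println(PB / 10000 / (10 * MB));   // 10000: seconds to read 1PB on 10k nodes @ 10MB/s
          }
        }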

        eric baldeschwieler added a comment -

        1GB blocks have a lot of issues. Until your networks get faster and RAMs get bigger, this is probably too big. For many of our current tasks 1GB is too much input for reasonable restartability too. I think 32M to 128M are a lot closer to the current sweet spot.

        Doug Cutting added a comment -

        I'm not sure what RAM size or network speeds have to do with it: we stream blocks into a task, we don't read them all at once.

        Restartability could be an issue. If you have a petabyte of input, and you want restartability at 100M chunks, then that means you need to be able to support up to 10M tasks per job. This is possible, but means the job tracker has to be even more careful not to store too much in RAM per task, nor iterate over all tasks, etc.

        But I'm not convinced that 1GB would cause problems for restartability. With a petabyte of input on 10k nodes (my current horizon), 1GB blocks give 1M tasks, or 100 per node. So each task will average around 1% of the execution time, and even those that are restarted near the end of the job won't add much to the overall time.

        eric baldeschwieler added a comment -

        some thoughts:

        1) Eventually you are going to want raid / erasure coding style things. The simplest way to do this without breaking reads is to batch several blocks, keeping them linear and then generate parity once all blocks are full. This gets more expensive as block size increases. At current sizes, this can all be buffered in RAM in some cases. 1GB blocks rule that out.

        2) Currently you can trivially keep a block in RAM for a MAP task. Depending on the scaling factor, you can probably keep the output in RAM for sorting, reduction, etc. too. This is nice. As block size increases you lose this property.

        3) When you lose a node, the finer-grained the lost data, the fewer hotspots you have in the system. Today in a large cluster you can easily have choke points with ~33mbit aggregate all-to-all. We've seen larger data sizes slow recovery times to the point of being a real problem. 1GB blocks take 10x as long to transmit, and this turns into minutes, which will require more sophisticated management.

        None of these are show stoppers, but one of the main reasons we are interested in hadoop is in getting off of our current very large storage chunk system, so I'd hate to see the default move quickly to something as large as 1GB.

        I can see the advantages of pushing the block size up to manage task tracker RAM size, but I doubt that alone will prove a compelling reason for us to change our default block size. On the other hand, I also don't think we'll be pumping 1 peta byte through a single m/r in the near term, so we can assume the zero code solution, change block size, until we have more data to support some other approach.

        Of course at 1M tasks, you will want to be careful about linear scans anyway...

        I've no concern with the proposal in this bug. We can probably take this discussion elsewhere.

        Doug Cutting added a comment -

        I just committed this.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Doug Cutting
          • Votes:
            0
          • Watchers:
            0
