Hadoop Map/Reduce
MAPREDUCE-2046

A CombineFileInputSplit cannot be less than a dfs block

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: None
    • Labels: None

      Description

      I ran into this while testing some hive features.

      Whether we use HiveInputFormat or CombineHiveInputFormat, a split cannot be smaller than a dfs block.
      This is a problem if we want to increase the block size for older data to reduce memory consumption on the
      name node.

      It would be useful if the input split was independent of the dfs block size.

      Attachments

      1. patch-2046-ydist.txt (6 kB) - Amareshwari Sriramadasu
      2. combineFileInputFormatMaxSize2.txt (7 kB) - dhruba borthakur
      3. combineFileInputFormatMaxSize.txt (7 kB) - dhruba borthakur

        Activity

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #523 (See https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/523/)

        Amareshwari Sriramadasu added a comment -

        Patch for Yahoo! distribution on top of MAPREDUCE-2021.

        Amareshwari Sriramadasu added a comment -

        +1
        I just committed this. Thanks, Dhruba!

        Joydeep Sen Sarma added a comment -

        +1 from my side.

        dhruba borthakur added a comment -

        Joydeep, Amareshwari: can you please review the patch again? Thanks.

        dhruba borthakur added a comment -

        If the remainder is between max and 2*max, then instead of creating splits of size max and left-max, we create splits of size left/2 and left/2. This is a heuristic to avoid creating really small splits.

        dhruba borthakur added a comment -

        Have to address Joydeep's comments.

        Joydeep Sen Sarma added a comment -

        One concern is whether we will have a lot of small 'runts' broken out, which will lead to inefficient IO (when those are combined into other splits). Suggestion:

        When carving up the block, if the remainder (R) is between max and 2*max, then instead of creating splits of size <max, R-max>, create splits of size <R/2, R/2>.

        Also, when packing blocks within a node/rack, it would be better to sort them by size first (ascending). I think it will lead to better packing; what do you think?
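
        A minimal sketch of that carving rule (the class and names below, such as SplitCarver and nextSplitLength, are illustrative and not taken from the patch):

          // Carve a block into pieces of at most maxSize, halving any remainder
          // that falls between maxSize and 2*maxSize.
          public class SplitCarver {

            // Length of the next piece to carve; assumes maxSize > 0.
            static long nextSplitLength(long left, long maxSize) {
              if (left > maxSize && left < 2 * maxSize) {
                // remainder between max and 2*max: halve it so we emit two
                // medium pieces instead of one max-sized piece plus a runt
                return left / 2;
              }
              return Math.min(maxSize, left);
            }

            public static void main(String[] args) {
              long blockLen = 160L * 1024 * 1024; // example: a 160 MB block
              long maxSize = 64L * 1024 * 1024;   // example: a 64 MB cap
              for (long left = blockLen, off = 0; left > 0; ) {
                long len = nextSplitLength(left, maxSize);
                System.out.println("piece at offset " + off + ", length " + len);
                off += len;
                left -= len;
              }
            }
          }

        With a 64 MB cap, a 160 MB block comes out as 64 MB + 48 MB + 48 MB rather than 64 MB + 64 MB + 32 MB, so no piece ends up smaller than half the cap unless the whole block already is.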

        Amareshwari Sriramadasu added a comment -

        +1. Patch looks good. Can you update the test results, as Hudson is not responding?

        dhruba borthakur added a comment -

        Right after calling getBlockLocations, we look at the block locations returned by HDFS and then split them into location entries, where each entry has a maximum length specified by maxSize.
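
        As an illustration of that flow (not the actual patch code), something along these lines breaks each block location into capped entries, using the public FileSystem.getFileBlockLocations call as a stand-in for the getBlockLocations lookup; the Chunk and BlockChunker names are hypothetical:

          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.List;

          import org.apache.hadoop.fs.BlockLocation;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          // One capped slice of a dfs block: offset, length, and the hosts holding it.
          class Chunk {
            final long offset;
            final long length;
            final String[] hosts;
            Chunk(long offset, long length, String[] hosts) {
              this.offset = offset; this.length = length; this.hosts = hosts;
            }
          }

          class BlockChunker {
            // Break every block location of a file into entries no longer than maxSize.
            static List<Chunk> chunk(FileSystem fs, Path file, long maxSize) throws IOException {
              FileStatus stat = fs.getFileStatus(file);
              BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
              List<Chunk> chunks = new ArrayList<Chunk>();
              for (BlockLocation blk : blocks) {
                long offset = blk.getOffset();
                long left = blk.getLength();
                while (left > 0) {
                  long len = (maxSize > 0) ? Math.min(maxSize, left) : left;
                  chunks.add(new Chunk(offset, len, blk.getHosts()));
                  offset += len;
                  left -= len;
                }
              }
              return chunks;
            }
          }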

        Namit Jain added a comment -

        Dhruba, I have also seen the same problem with HiveInputFormat, which I think (not sure) does not go through CombineFileInputSplit.
        Are we missing something else?

        Owen O'Malley added a comment -

        Ah. I don't know that one. If you want to reopen the bug for that specific input format, that would make sense.

        dhruba borthakur added a comment -

        I think that if you use org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat and set the maximum split size (via setMaxSplitSize) to be smaller than a dfs block size, the splits produced are still at least as big as a dfs block.
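
        For reference, the configuration being described looks roughly like the sketch below; it assumes the new-API CombineFileInputFormat, whose setMaxSplitSize method is protected and therefore has to be called from a subclass. The class name and the 64 MB value are examples, not from this issue.

          import java.io.IOException;

          import org.apache.hadoop.io.LongWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.InputSplit;
          import org.apache.hadoop.mapreduce.RecordReader;
          import org.apache.hadoop.mapreduce.TaskAttemptContext;
          import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;

          public class SmallSplitCombineFormat extends CombineFileInputFormat<LongWritable, Text> {

            public SmallSplitCombineFormat() {
              // Ask for splits of at most 64 MB, below a larger dfs block size.
              setMaxSplitSize(64L * 1024 * 1024);
            }

            @Override
            public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException {
              // A real job would return a CombineFileRecordReader here; it is
              // omitted to keep the focus on the split-size cap.
              throw new UnsupportedOperationException("record reader omitted in this sketch");
            }
          }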

        Owen O'Malley added a comment -

        This isn't true. InputSplits can be arbitrarily sized by the InputFormat. With mapred.TextInputFormat, if you set the number of maps very high, you will generate a large number of maps. In the new mapreduce.lib.input.TextInputFormat, there are knobs that set the minimum and maximum split size.
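
        A minimal sketch of those knobs, using the static helpers on the new-API FileInputFormat; the class name SplitSizeKnobs and the 32 MB / 128 MB values are examples, not from this issue:

          import java.io.IOException;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
          import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

          public class SplitSizeKnobs {
            // Configure a job whose TextInputFormat splits are bounded between 32 MB and 128 MB.
            public static Job configure(Configuration conf) throws IOException {
              Job job = new Job(conf);
              job.setInputFormatClass(TextInputFormat.class);
              FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);
              FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
              return job;
            }
          }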


          People

          • Assignee: dhruba borthakur
          • Reporter: Namit Jain
          • Votes: 0
          • Watchers: 11
