Hadoop Common
  1. Hadoop Common
  2. HADOOP-5369

Small tweaks to reduce MapFile index size

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Two minor tweaks can help reduce the memory overhead of the MapFile index a bit:

      1) Because the index file is a sequence file, it's length is not known. That means the index is built using the standard "mulitply the size of the buffer on overflow" with a factor of 3/2. With small keys, the slack in the index can be substantial. This patch has a constant upper bound on the amount of slack allowed.

      2) In block compressed map files the index file often has entries with the same offset (because the compressed block had more than index interval keys). The entries with identical offsets do not help MapFile do random access any faster. This patch eliminates these types of entries from new map files, and ignores them while reading old map files. This patch greatly helped with memory usage on a compressed hbase table.

      1. smaller_mapfile.patch
        6 kB
        Sharad Agarwal
      2. smaller_mapfile.patch
        6 kB
        Ben Maurer
      3. smaller_mapfile.patch
        6 kB
        Ben Maurer
      4. smaller_mapfile.patch
        5 kB
        Ben Maurer
      5. mapfile.patch
        4 kB
        Ben Maurer
      6. smaller_mapfile.patch
        4 kB
        Ben Maurer

        Activity

        Hide
        Sharad Agarwal added a comment -

        I committed this. Thanks Ben!

        Show
        Sharad Agarwal added a comment - I committed this. Thanks Ben!
        Hide
        Sharad Agarwal added a comment -

        The tests failures are due to HADOOP-5847

        Show
        Sharad Agarwal added a comment - The tests failures are due to HADOOP-5847
        Hide
        Sharad Agarwal added a comment -

        The patch is good to go. Ben, could you verify that the test failures are unrelated to this patch ?

        Show
        Sharad Agarwal added a comment - The patch is good to go. Ben, could you verify that the test failures are unrelated to this patch ?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12407970/smaller_mapfile.patch
        against trunk revision 774433.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/334/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/334/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/334/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/334/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12407970/smaller_mapfile.patch against trunk revision 774433. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/334/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/334/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/334/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/334/console This message is automatically generated.
        Hide
        Sharad Agarwal added a comment -

        updated to trunk. minor modification to move src/test/org/.. changes to src/test/core/org...

        Show
        Sharad Agarwal added a comment - updated to trunk. minor modification to move src/test/org/.. changes to src/test/core/org...
        Hide
        Sharad Agarwal added a comment -

        unfortunately the patch got stale.

        Show
        Sharad Agarwal added a comment - unfortunately the patch got stale.
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12401772/smaller_mapfile.patch
        against trunk revision 751994.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/42/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/42/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/42/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/42/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12401772/smaller_mapfile.patch against trunk revision 751994. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/42/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/42/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/42/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/42/console This message is automatically generated.
        Hide
        Ben Maurer added a comment -

        Heh, I went to all the effort of writing tests for those cases just to not include them on the svn diff command .

        Show
        Ben Maurer added a comment - Heh, I went to all the effort of writing tests for those cases just to not include them on the svn diff command .
        Hide
        Doug Cutting added a comment -

        > Another hbase triggered bug - midKey appeared to throw an exception if the file was empty ...

        It would be great to add test cases for these to TestMapFile. Also, whenever you attach a new version of the patch, please move the issue to the "Patch Available" state so that Hudson will test it.

        Show
        Doug Cutting added a comment - > Another hbase triggered bug - midKey appeared to throw an exception if the file was empty ... It would be great to add test cases for these to TestMapFile. Also, whenever you attach a new version of the patch, please move the issue to the "Patch Available" state so that Hudson will test it.
        Hide
        Ben Maurer added a comment -

        Another hbase triggered bug – midKey appeared to throw an exception if the file was empty, however, the code (count - 1) / 2 turned in to -1/2, which in java is zero. Because the arrays were never resized, it ended up just returning null. I've restored this behavior (as users were depending on it), documented and tested it.

        Show
        Ben Maurer added a comment - Another hbase triggered bug – midKey appeared to throw an exception if the file was empty, however, the code (count - 1) / 2 turned in to -1/2, which in java is zero. Because the arrays were never resized, it ended up just returning null. I've restored this behavior (as users were depending on it), documented and tested it.
        Hide
        Ben Maurer added a comment -

        Fix a bug triggered by hbase – midKey throws an exception if no indexes are written.

        Show
        Ben Maurer added a comment - Fix a bug triggered by hbase – midKey throws an exception if no indexes are written.
        Hide
        Ben Maurer added a comment -

        No tests because this doesn't change externally visible behavior.

        Show
        Ben Maurer added a comment - No tests because this doesn't change externally visible behavior.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12401423/mapfile.patch
        against trunk revision 750237.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/11/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/11/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/11/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/11/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12401423/mapfile.patch against trunk revision 750237. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/11/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/11/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/11/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/11/console This message is automatically generated.
        Hide
        Doug Cutting added a comment -

        +1 This looks good to me.

        Show
        Doug Cutting added a comment - +1 This looks good to me.
        Hide
        Ben Maurer added a comment -

        Updated patch. I used ArrayList/Arrays.copyOf for resizing (much cleaner), and a variation of entriesSinceLastIndex to make the code a bit more symmetrical.

        Show
        Ben Maurer added a comment - Updated patch. I used ArrayList/Arrays.copyOf for resizing (much cleaner), and a variation of entriesSinceLastIndex to make the code a bit more symmetrical.
        Hide
        Doug Cutting added a comment -

        Overall these look good. A few minor comments:

        • Should we switch from using 'size % indexInterval' to 'entriesSinceIndex > indexInterval'? Then we wouldn't need the indexWanted flag, since entriesSinceIndex would only be zeroed when an index entry is added.
        • data.getLength() is used three times and might thus be put it in a local 'position' variable.
        • The lastIndex optimization looks good.
        • Why not always trim the arrays to the exact length? We'll nearly always need to do it when arrays are big, and, when they're small it won't matter. Also, please add a private method to re-size the arrays to avoid repeating those 6 lines of code.
        Show
        Doug Cutting added a comment - Overall these look good. A few minor comments: Should we switch from using 'size % indexInterval' to 'entriesSinceIndex > indexInterval'? Then we wouldn't need the indexWanted flag, since entriesSinceIndex would only be zeroed when an index entry is added. data.getLength() is used three times and might thus be put it in a local 'position' variable. The lastIndex optimization looks good. Why not always trim the arrays to the exact length? We'll nearly always need to do it when arrays are big, and, when they're small it won't matter. Also, please add a private method to re-size the arrays to avoid repeating those 6 lines of code.
        Hide
        Ben Maurer added a comment -

        Patch to apply these tweaks

        Show
        Ben Maurer added a comment - Patch to apply these tweaks

          People

          • Assignee:
            Ben Maurer
            Reporter:
            Ben Maurer
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development