Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      This changes the fsimage format to be
      root directory-1 directory-2 ... directory-n.
      Each directory stores all its children in the following format:
      Directory_full_path_name num_of_children child-1 ... child-n.
      Each inode stores only the last component of its path name into the fsimage.
      This change requires an upgrade at deployment.

      Description

      Currently each inode stores its full path in the fsimage. I'd propose to store the local name instead. In order for each inode to identify its parent, all inodes in a directory tree are stored in the image in-order. This proposal also requires that each directory store the number of its children in the image.

      This proposal would bring a few benefits, as noted below, and thereby speed up image loading and saving.

      1. Remove the overhead of converting a java-UTF8-encoded local name to a string-represented full path, and then to a UTF8-encoded full path, when saving an image (and vice versa when loading it).
      2. Remove the overhead of traversing the full path when inserting an inode into its parent inode.
      3. Reduce the number of temporary java objects created during image loading or saving, and therefore reduce GC overhead.
      4. Reduce the size of an image.
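
      The proposed layout can be sketched as follows. This is a minimal illustration with hypothetical class and method names, not the actual FSImageFormat serialization, and a readable text rendering stands in for the real binary encoding: each directory record holds its full path and child count, parents precede their children, and every child is written by local name only.

      ```java
      import java.util.ArrayList;
      import java.util.List;

      public class FsImageLayoutSketch {
          static class Dir {
              final String name;                       // local name only
              final List<String> files = new ArrayList<>();
              final List<Dir> subdirs = new ArrayList<>();
              Dir(String name) { this.name = name; }
          }

          // Writes "full_path num_children child-1 ... child-n" per directory,
          // parents before their children, so a loader can rebuild the tree
          // without any inode storing its full path.
          static void save(Dir d, String fullPath, StringBuilder out) {
              out.append(fullPath).append(' ')
                 .append(d.files.size() + d.subdirs.size());
              for (String f : d.files) out.append(' ').append(f);
              for (Dir s : d.subdirs) out.append(' ').append(s.name);
              out.append('\n');
              String prefix = fullPath.equals("/") ? "/" : fullPath + "/";
              for (Dir s : d.subdirs) save(s, prefix + s.name, out);
          }

          public static void main(String[] args) {
              Dir root = new Dir("");
              root.files.add("README");
              Dir user = new Dir("user");
              user.files.add("a.txt");
              root.subdirs.add(user);
              StringBuilder sb = new StringBuilder();
              save(root, "/", sb);
              System.out.print(sb);
              // Prints:
              //   / 2 README user
              //   /user 1 a.txt
          }
      }
      ```

      Note that each name ("README", "user", "a.txt") is written exactly once; in the old format every inode carried its full path, so deep trees repeated long prefixes many times.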
      1. trunkLocalNameImage9.patch
        14 kB
        Hairong Kuang
      2. trunkLocalNameImage9.patch
        14 kB
        Hairong Kuang
      3. trunkLocalNameImage8.patch
        14 kB
        Hairong Kuang
      4. trunkLocalNameImage7.patch
        13 kB
        Hairong Kuang
      5. trunkLocalNameImage6.patch
        12 kB
        Hairong Kuang
      6. trunkLocalNameImage5.patch
        12 kB
        Hairong Kuang
      7. trunkLocalNameImage4.patch
        12 kB
        Hairong Kuang
      8. trunkLocalNameImage3.patch
        33 kB
        Hairong Kuang
      9. trunkLocalNameImage1.patch
        27 kB
        Hairong Kuang
      10. trunkLocalNameImage.patch
        21 kB
        Hairong Kuang

        Issue Links

          Activity

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)

          Todd Lipcon added a comment -

          I'm curious: why did we bump the version number from -28 to -30 for this change? Is -29 considered unlucky, or am I missing something?

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #595 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/595/)
          HDFS-1070. Speedup namenode image loading and saving by storing only local file names. Contributed by Hairong Kuang.

          Hairong Kuang added a comment -

          I just committed this. Thanks to Matt for his great effort in pushing this jira forward!

          Matt Foley added a comment -

          +1
          These failures seem unrelated to this patch, and the same patch already passed the system test framework yesterday.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12476364/trunkLocalNameImage9.patch
          against trunk revision 1092507.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.datanode.TestBlockReport
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          -1 contrib tests. The patch failed contrib unit tests.

          -1 system test framework. The patch failed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/368//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/368//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/368//console

          This message is automatically generated.

          Hairong Kuang added a comment -

          Re-submitting to trigger Hudson.

          Matt Foley added a comment -

          Brilliant! Thanks for the explanation.

          Hairong Kuang added a comment -

          Patch 9 fixes the tests that failed in build 347. The problem was that if an empty directory happened to be at the end of the image, loadImage did not know that it needed to continue loading the empty directory, since it had already loaded all of its files. The next load of a FileUnderConstruction then got an error. This was a tricky bug to figure out.

          Let me resubmit the tests so everybody can rest assured.
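
          A simplified sketch of the failure mode (hypothetical code, not the actual FSImageFormat loader): when loading is driven by a running count of inodes, a trailing empty directory is counted as a child of its parent, but its own zero-child record at the end of the image is never consumed, so whatever is read next (the FileUnderConstruction section) starts at the wrong offset.

          ```java
          import java.util.List;

          public class EmptyDirBugSketch {
              static final class DirRecord {
                  final String path;
                  final int numChildren;
                  DirRecord(String path, int numChildren) {
                      this.path = path;
                      this.numChildren = numChildren;
                  }
              }

              // Returns how many directory records were consumed.
              static int load(List<DirRecord> records, int totalInodes) {
                  int inodesLoaded = 0;
                  int i = 0;
                  while (inodesLoaded < totalInodes) { // bug: trailing empty dir never read
                      inodesLoaded += records.get(i++).numChildren;
                  }
                  return i;
              }

              public static void main(String[] args) {
                  // Root has a single child: the empty directory "/empty", whose
                  // zero-child record sits at the end of the image.
                  List<DirRecord> image = List.of(
                      new DirRecord("/", 1),
                      new DirRecord("/empty", 0)); // leftover record
                  int consumed = load(image, 1);
                  System.out.println(consumed + " of " + image.size() + " records consumed");
                  // Prints: 1 of 2 records consumed
              }
          }
          ```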

          Matt Foley added a comment -

          Agreed, the failed tests don't seem to have anything to do with this. Looks like Hudson got its tail in a knot and couldn't get it out. Maybe it would be best to re-submit the change and let the tests run on it again?

          In the previous build (347), both errors were in FSImage.loadFSImage. Did this change directly fix those? I don't understand why it would, since it appears the previous behavior was sub-optimal but not incorrect. Thanks.

          Tsz Wo Nicholas Sze added a comment -

          > The failed tests seem not be caused by this patch. ...

          I agree. Build 364 was executed by hadoop3. The machine keeps failing HDFS builds.

          Hairong Kuang added a comment -

          The failed tests do not seem to be caused by this patch. This patch does not add any unit tests because existing tests already cover it. If nobody is against it, I will commit the patch later this evening.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12476213/trunkLocalNameImage9.patch
          against trunk revision 1091874.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.fs.TestHDFSFileContextMainOperations
          org.apache.hadoop.hdfs.TestFileAppend4
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.TestLargeBlock
          org.apache.hadoop.hdfs.TestWriteConfigurationToDFS

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/364//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/364//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/364//console

          This message is automatically generated.

          Giridharan Kesavan added a comment -

          submit patch

          Giridharan Kesavan added a comment -

          re-submitting for the patch test hudson job

          Hairong Kuang added a comment -

          Patch 9 fixed a bug that wrote the full path of an empty directory to the image.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12476058/trunkLocalNameImage8.patch
          against trunk revision 1091131.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.TestFileCreationNamenodeRestart
          org.apache.hadoop.hdfs.TestRenameWhileOpen

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/347//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/347//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/347//console

          This message is automatically generated.

          Matt Foley added a comment -

          +1. Code review looks good to me!

          Hairong Kuang added a comment -

          This patch fixes the failed oiv-related unit tests.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12475829/trunkLocalNameImage7.patch
          against trunk revision 1090357.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.datanode.TestBlockReport
          org.apache.hadoop.hdfs.TestDFSShell
          org.apache.hadoop.hdfs.TestFileAppend4
          org.apache.hadoop.hdfs.TestFileCreationNamenodeRestart
          org.apache.hadoop.hdfs.TestLargeBlock
          org.apache.hadoop.hdfs.TestRenameWhileOpen
          org.apache.hadoop.hdfs.TestWriteConfigurationToDFS
          org.apache.hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewer
          org.apache.hadoop.hdfs.tools.offlineImageViewer.TestOIVCanReadOldVersions

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/336//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/336//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/336//console

          This message is automatically generated.

          Hairong Kuang added a comment -

          Thanks to Matt for reviewing the patch and performing the test. trunkLocalNameImage7.patch addresses his comments.

          Matt Foley added a comment -

          At Hairong's request, I did some perf testing on the patch.

          I ran the original trunk and Hairong's patches against a synthetic namespace image with 16M files, two blocks per file, and 100 files per directory. It needs about 14GB of JVM heap to load on the NN.
          The image used rather short numeric directory names, which decreased the benefit of the patch compared to Hairong's real-world measurements, but the benefits were still quite nice.
          The FSImage was uncompressed.

          There was no observed performance difference between trunkLocalNameImage5.patch and trunkLocalNameImage6.patch, or trunkLocalNameImage6.patch with the above suggested modification.

                       Full Path Image   Local Name Image   % Improvement
          Image Size   3.07GB            2.27GB             -26%
          Load Time    72sec             60sec              -17%
          Save Time    50sec             38sec              -24%

          A very nice improvement, and well worthwhile. Based on Hairong's measurements, we expect even larger benefits with real-world FSImages, which have longer path names and more files per directory.

          If it passes full testing, I recommend for commit.

          Matt Foley added a comment -

          I reviewed the new patch. I focused on the differences and changes, rather than doing a detailed read-through again, but it looks generally fine. Comments (all in FSImageFormat):

          The loadLocalNameINodes() javadoc says it returns the number of inodes read, but the method returns void.

          Suggest merging loadDirectories() into its caller (loadLocalNameINodes()), so the file counting logic is in one place.

          "direcotry" is misspelled in comment in body of saveImage

          Optional: FSImageFormat.saveImage() is confusing. It currently has three variables prefixLength, prefixLen, and newPrefixLength. (I think two of the three were inherited from the old code.) I understand that the point of the manipulations is to special-case the "root directory" directory name so it doesn't generate a "//" at the beginning of paths. But I would suggest something like the following instead:

              private static void saveImage(ByteBuffer currentDirName,
                                            INodeDirectory current,
                                            DataOutputStream out) throws IOException {
                if (current.getChildrenRaw() == null)
                  return;
                // print prefix (parent directory name)
                int prefixLength = currentDirName.position();
                if (prefixLength == 0) {
                  //special-case the root directory name
                  out.writeShort(PATH_SEPARATOR.length);
                  out.write(PATH_SEPARATOR, 0, PATH_SEPARATOR.length);
                } else {
                  //all non-root directories
                  out.writeShort(prefixLength);
                  out.write(currentDirName.array(), 0, prefixLength);
                }
                List<INode> children = current.getChildren();
                out.writeInt(children.size());
                for(INode child : children) {
                  // print all children first
                  FSImageSerialization.saveINode2Image(child, out);
                }
                for(INode child : children) {
                  if(!child.isDirectory() || ((INodeDirectory)child).getChildrenRaw() == null)
                    continue;
                  // append the subdirectory name to path buffer
                  currentDirName.put(PATH_SEPARATOR).put(child.getLocalNameBytes());
                  // save the subdirectory
                  saveImage(currentDirName, (INodeDirectory)child, out);
                  // restore the path buffer
                  currentDirName.position(prefixLength);
                }
              }
          
          (This change requires, in save(), changing the line "strbuf.put(PATH_SEPARATOR);" to "strbuf.position(0);", as well as fixing the call signature of the saveImage() call.)
          
          

          While it isn't great to do a conditional every pass through the method, it is a fair trade for saving a bunch of assignments and an additional arg passed through every call. And it's very clear and maintainable.

          Performance test results in the next comment.

          Hairong Kuang added a comment -

          Here is a patch that stores the image in the modified format suggested by Matt. I did some preliminary testing but have not had time to do any performance testing yet.

          Tsz Wo Nicholas Sze added a comment -

          BTW, I am not saying that we should do the usernames/groups improvement in this JIRA. I just want to express the idea and see if anyone would like to try it later.

          Tsz Wo Nicholas Sze added a comment -

          Since the mappings are already available in NameNode, if they are also used in FSImage, then we can directly read/write the serial numbers but don't have to look up the usernames/groups when reading from/writing to an image. I agree that we should do some measurement.

          dhruba borthakur added a comment -

          Completely agree with you Nicholas that the size of the final image (Item 4 of the jira description) could be same in all cases when compression is used. However, the effect of Items 1 - 3 could be quite different, and perhaps an experiment will be able to measure the benefit.

          Tsz Wo Nicholas Sze added a comment -

          Hi Dhruba, I am not yet a compression expert. Here is a thought – if a compression algorithm is able to somehow compress usernames and groups, should it also be able to compress common path prefixes?

          dhruba borthakur added a comment -

          hi nicholas, when we use "compression" to compress the fsimage, won't the storage for all these duplicate path names be compressed out anyways?

          Tsz Wo Nicholas Sze added a comment -

          Here is an orthogonal idea for reducing image size:

          In NameNode, we have internal maps for usernames, groups to serial numbers (see SerialNumberManager) in order to save memory in the NameNode. How about we do the same for FSImage? I.e. write the maps in the beginning of FSImage and then use the serial numbers in the INode entries.

          Suppose the saving is 10 bytes per name, that is 20 bytes per INode. Then, it is about 1.1 GB for a namespace with 60 million files/directories.
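The interning idea above can be sketched as a small string table. This is a hypothetical illustration (the class and method names are made up, not the actual SerialNumberManager API): each distinct username/group is assigned a serial number once; the table would be written at the head of the image, and each inode entry stores only a small integer id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the serial-number idea; not HDFS code.
public class SerialNumberSketch {
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    // Return the serial number for a name, assigning the next id if new.
    int serialOf(String name) {
        return ids.computeIfAbsent(name, k -> ids.size());
    }

    // The table of distinct names, in id order (written once per image).
    List<String> table() {
        return new ArrayList<>(ids.keySet());
    }

    public static void main(String[] args) {
        SerialNumberSketch owners = new SerialNumberSketch();
        String[] perInodeOwner = {"hdfs", "hairong", "hdfs", "hdfs", "mapred"};
        int[] perInodeId = new int[perInodeOwner.length];
        for (int i = 0; i < perInodeOwner.length; i++)
            perInodeId[i] = owners.serialOf(perInodeOwner[i]);
        System.out.println(owners.table());              // written at the image head
        System.out.println(Arrays.toString(perInodeId)); // stored per inode
    }
}
```

With many inodes sharing few owners, each inode entry shrinks from a variable-length string pair to a couple of small integers, which is where the estimated savings above come from.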

          Matt Foley added a comment -

          Thanks, Hairong. I'm sorry the request comes in so long after your original proposal.

          Hairong Kuang added a comment -

          I agree that the old design has better potential robustness. I can go ahead and try to implement the new proposal.

          But I want to point out one fact: the "old" image does not have explicit "sync" points. So if a portion of the image gets corrupted, there is no way the image loader could find a sync point and recover the remaining image.

          Also, if we turn on image compression, the "resync" feature becomes less significant.

          Sanjay Radia added a comment -

          > With the old "full names" format, an arbitrary chunk of the file could be lost or corrupted,
          Or even a bug in the code ...
          Matt's proposal reintroduces some of the robustness of the old design.

          Matt Foley added a comment -

          I was talking with Sanjay and he would like us to consider making the serialized tree format more robust. With the old "full names" format, an arbitrary chunk of the file could be lost or corrupted, and the remainder would still be recoverable. With this tree format, if a chunk is lost or corrupted, everything thereafter is difficult or impossible to reconstruct because the tree information is all implicit in the ordering.

          Things that would help are

          • making more of the tree-structure information explicit
          • providing re-synchronization points or labels in the format

          The simplest idea that occurred to me (and I haven't had time to discuss it with Sanjay yet) is to go back to a breadth-first ordering, so that the members of each directory are grouped together, and output the full-path name of the directory at the beginning of each such group. The full-path string provides both the structure info and the re-sync capability, just as the full names do now. Maybe we can avoid actually parsing that string in most cases. This would be going back to something like the old saveImage(), but only output the local name with each inode, and output the full-path name once when beginning each new breadth-first directory.

          This will decrease the benefit of the change, but if we believe that number of files is typically much larger than number of directories, there could still be a big benefit. What do you think? Would it be hard to re-run your experiment with this mod to see the cost of this change?
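The grouped layout described above can be sketched as follows. This is a toy illustration under assumed names (GroupedLayout and save are not HDFS identifiers): directories are written breadth-first, each group headed by the directory's full path (the structure info and re-sync point) and its child count, followed by the children's local names only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the proposed directory-grouped image layout; not HDFS code.
public class GroupedLayout {
    // tree: directory full path -> local names of its children
    static List<String> save(Map<String, List<String>> tree) {
        List<String> out = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add("/");
        while (!queue.isEmpty()) {
            String dir = queue.poll();
            List<String> children = tree.get(dir);
            out.add(dir);                          // full path, once per directory
            out.add(String.valueOf(children.size()));
            for (String c : children) {
                out.add(c);                        // local name only
                String childPath = dir.equals("/") ? "/" + c : dir + "/" + c;
                if (tree.containsKey(childPath))   // enqueue subdirectories
                    queue.add(childPath);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Namespace: /d1/d2/f1, /d1/d3, /f2
        Map<String, List<String>> tree = new LinkedHashMap<>();
        tree.put("/", Arrays.asList("d1", "f2"));
        tree.put("/d1", Arrays.asList("d2", "d3"));
        tree.put("/d1/d2", Arrays.asList("f1"));
        tree.put("/d1/d3", Collections.emptyList());
        System.out.println(save(tree));
    }
}
```

Since each full path appears once per directory rather than once per inode, the per-file cost stays at a local name, which is why the benefit survives when files greatly outnumber directories.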

          Matt Foley added a comment -

          Hi Hairong, nice clean patch, thanks for this improvement. A couple of minor comments:

          1. The original code had an implicit consistency check between numFiles and the number of files actually read in by load(). Can you add a check after the call to loadLocalNameINodes(in) to assure that namesystem.dir.rootDir.numItemsInTree() == numFiles?

          2. Since typically #files >> #directories, I suggest that the last loop in saveINodes() be changed to avoid unnecessary recursion on every file inode:

          //from 
                  for(INode child : children) {
                    saveINodes(child, out);
                  }
          //to
                  for(INode child : children) {
                    if (child.isDirectory())
                      saveINodes(child, out);
                    else
                      FSImageSerialization.saveINode2Image(child, out);
                  }
          
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12465863/trunkLocalNameImage5.patch
          against trunk revision 1087160.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.datanode.TestBlockReport
          org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/308//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/308//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/308//console

          This message is automatically generated.

          Matt Foley added a comment -

          Hi Hairong, the speedups you measured with this patch are very impressive. It would be great to have it included in the system.

          The patch still applied successfully to current trunk when I tried it. So can you please go ahead and "submit patch" so we see how it does with test-patch? And I'll review it, and try to get someone else to also. Thanks!

          Hairong Kuang added a comment -

          I rebased the patch trunkLocalNameImage5.patch against the latest trunk. I forgot to mention that the latest patches have addressed Todd's review comments. Could a kind soul give it another review?

          The patch does not have a unit test; the existing unit tests already seem to cover it. Does anybody have a good idea for a unit test?

          Hairong Kuang added a comment -

          Since HDFS-1526 is committed, I am uploading a new patch trunkLocalNameImage4.patch

          Todd Lipcon added a comment -

          Looks pretty good, very nice results!

          Few things:

          • There's this random empty block in ImageLoaderCurrent:
            +    for(long i = 0; i < numInodes; i++) {
            +    }
            
          • This loop:
            +       // read all children
            +       for( int i = in.readInt(); i>0; i--) {
            

            might be clearer as:

            int numChildren = in.readInt();
            for (int i = 0; i < numChildren; i++) {
            }
            
          • This code:
            +        in.read(localName); // read local name
            

            might have a bad bug since the return value is unchecked. Instead consider IOUtils.readFully()?
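  Todd's point can be demonstrated with stdlib classes alone. This is an illustration, not the Hadoop code: java.io.DataInputStream.readFully() has the same fill-until-complete contract as the suggested Hadoop IOUtils.readFully(), while InputStream.read(byte[]) is allowed to return fewer bytes than requested, so ignoring its return value can leave the tail of the buffer unfilled.

  ```java
  import java.io.DataInputStream;
  import java.io.IOException;
  import java.io.InputStream;
  import java.nio.charset.StandardCharsets;

  // Stdlib-only illustration of the short-read hazard; not the Hadoop code.
  public class ShortReadDemo {
      // An InputStream that hands out at most one byte per read() call,
      // as a socket or decompressing stream legitimately may.
      static class TrickleStream extends InputStream {
          private final byte[] data;
          private int pos;
          TrickleStream(byte[] data) { this.data = data; }
          @Override public int read() {
              return pos < data.length ? data[pos++] & 0xff : -1;
          }
          @Override public int read(byte[] b, int off, int len) {
              int c = read();
              if (c < 0) return -1;
              b[off] = (byte) c;
              return 1;                       // short read: only 1 byte delivered
          }
      }

      public static void main(String[] args) throws IOException {
          byte[] name = "localName".getBytes(StandardCharsets.UTF_8);

          byte[] buf = new byte[name.length];
          int n = new TrickleStream(name).read(buf);   // the unchecked pattern
          System.out.println("plain read() returned " + n + " of " + name.length);

          byte[] full = new byte[name.length];
          new DataInputStream(new TrickleStream(name)).readFully(full);
          System.out.println("readFully() filled: "
              + new String(full, StandardCharsets.UTF_8));
      }
  }
  ```

  The unchecked read leaves most of the local-name buffer as zero bytes, silently corrupting the name; readFully() either fills the whole buffer or throws.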

          Hairong Kuang added a comment -

          I did some experiments with one of our internal images.
          1. The image is not compressed:

                              Full Path Image   Local Name Image
             Image Size       14.2G             7.3G
             Loading Time(s)  443               309
             Saving Time(s)   224               126

          2. The image is compressed using the LZOP codec:

                              Full Path Image   Local Name Image
             Image Size       3.0G              2.8G
             Loading Time(s)  565               467
             Saving Time(s)   251               231

          We can see that this feature can cut the image size in half and significantly cut the loading/saving time if the image is not compressed. If the image is compressed, the benefit is less pronounced, but it still improves loading and saving time.

          Hairong Kuang added a comment -

          This patch is applicable to the trunk. It also includes the changes made in HDFS-1506, so reviewers can get a better idea of why HDFS-1506 is needed.

          Hairong Kuang added a comment -

          trunkLocalnameImage1.patch fixes a few bugs in the previous patch.

          Hairong Kuang added a comment -

          Sanjay, the idea is simple. The changes are:
          1. full name -> local name
          2. each directory additionally needs to store its size (number of children)

          Nodes are stored in depth-first order. For example, a namespace like
          /d1/d2/f1
          /d1/d3
          /f2

          will be stored as: root 2 d1 2 d2 1 f1 d3 0 f2

          I will post performance data later.
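The scheme above can be sketched as a save/load roundtrip. This is toy code with made-up names, not the FSImage serialization: each inode writes only its local name, each directory also writes its child count, and the loader rebuilds parent links purely from the depth-first ordering. Unlike the shorthand above, this toy writes an explicit D/F type marker, standing in for the type information the real per-inode records carry.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy sketch of a depth-first, local-name image layout; not HDFS code.
public class DepthFirstSketch {
    static class Node {
        final String name;
        final boolean isDir;
        final List<Node> children = new ArrayList<>();
        Node(String name, boolean isDir) { this.name = name; this.isDir = isDir; }
    }

    // Save depth-first: type, local name, then (for directories) count + children.
    static void save(Node n, List<String> out) {
        out.add(n.isDir ? "D" : "F");
        out.add(n.name);
        if (n.isDir) {
            out.add(String.valueOf(n.children.size()));
            for (Node c : n.children) save(c, out);
        }
    }

    // Load: each child attaches to the directory currently being read,
    // so no full paths (and no path parsing) are needed.
    static Node load(Iterator<String> in) {
        boolean isDir = in.next().equals("D");
        Node n = new Node(in.next(), isDir);
        if (isDir) {
            int count = Integer.parseInt(in.next());
            for (int i = 0; i < count; i++)
                n.children.add(load(in));
        }
        return n;
    }

    public static void main(String[] args) {
        // Namespace: /d1/d2/f1, /d1/d3, /f2
        Node root = new Node("root", true);
        Node d1 = new Node("d1", true), d2 = new Node("d2", true);
        root.children.add(d1);
        root.children.add(new Node("f2", false));
        d1.children.add(d2);
        d1.children.add(new Node("d3", true));
        d2.children.add(new Node("f1", false));

        List<String> tokens = new ArrayList<>();
        save(root, tokens);
        System.out.println(tokens);

        List<String> again = new ArrayList<>();
        save(load(tokens.iterator()), again);
        System.out.println("roundtrip ok: " + tokens.equals(again));
    }
}
```

Note that the loader never materializes a full path, which is exactly where the original proposal saves the string-conversion and path-traversal overhead.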

          Sanjay Radia added a comment -

          You have probably done some breakdown of where the NN startup time goes (process image, process edits, process block reports).
          Could you please post a short summary? This patch will reduce the process-image phase.

          Sanjay Radia added a comment -

          Hairong, could you add a short design doc (say, 1 page)?
          Are inode numbers persisted?

          Hairong Kuang added a comment -

          This patch changes the fsimage format.
          1. Each inode saves only its local file name in fsimage (instead of full path name);
          2. Each directory inode in addition saves the number of its children;
          3. INodes are saved in depth-first order.

          luoli added a comment -

          Is there any chance that we can load the fsimage with multiple threads? I tried it and found that when I launch a thread to addToParent for each file or dir read from the fsimage, and the number of dirs and files in the fsimage is huge, the JVM core-dumps and the load gets corrupted. Has anyone ever tried this?


            People

            • Assignee:
              Hairong Kuang
              Reporter:
              Hairong Kuang
            • Votes:
              1
              Watchers:
              28
