Hadoop HDFS
  HDFS-2126 Improve Namenode startup time [umbrella task]
  HDFS-1071

savenamespace should write the fsimage to all configured fs.name.dir in parallel

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: namenode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      If you have a large number of files in HDFS, the fsimage file is very big. When the namenode restarts, it writes a copy of the fsimage to all directories configured in fs.name.dir. This takes a long time, especially if there are many directories in fs.name.dir. Make the NN write the fsimage to all these directories in parallel.
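The approach in the description can be sketched as follows. This is a minimal, hypothetical stand-in for the real NameNode code (class and method names are illustrative): one saver thread per configured directory, all started together, then joined so the total time is roughly the slowest directory rather than the sum.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch (not the actual NameNode code): write the same image bytes
// to every configured directory in parallel, one saver thread per directory.
public class ParallelSave {
    public static void saveInParallel(List<File> dirs, byte[] image)
            throws InterruptedException {
        List<Thread> savers = new ArrayList<>();
        for (File dir : dirs) {
            Thread t = new Thread(() -> {
                try (FileOutputStream out =
                         new FileOutputStream(new File(dir, "fsimage"))) {
                    out.write(image);
                } catch (IOException e) {
                    // Real code would mark this storage directory as failed.
                    throw new RuntimeException(e);
                }
            }, "FSImageSaver for " + dir.getPath());
            t.start();
            savers.add(t);
        }
        for (Thread t : savers) {
            t.join(); // total time ~ slowest directory, not the sum
        }
    }
}
```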

      1. HDFS-1071.patch
        11 kB
        Dmytro Molkov
      2. HDFS-1071.2.patch
        12 kB
        Dmytro Molkov
      3. HDFS-1071.3.patch
        12 kB
        Dmytro Molkov
      4. HDFS-1071.4.patch
        13 kB
        Dmytro Molkov
      5. HDFS-1071.5.patch
        13 kB
        Dmytro Molkov
      6. HDFS-1071.6.patch
        14 kB
        Dmytro Molkov
      7. HDFS-1071.7.patch
        15 kB
        Dmytro Molkov

        Issue Links

          Activity

          Allen Wittenauer added a comment -

          I'm looking forward to this fix, because it will make having a quorum for image files that much easier.

          Dmytro Molkov added a comment -

          Please have a look. The patch is pretty simple.
          The test copies TestRestartDFS and adds a part that compares the checksums of all the images.

          Todd Lipcon added a comment -

          A few small notes:

          • Please give the threads names - e.g. "FSImageSaver for /path/to/dir".
          • I'm not sure the behavior under InterruptedException is right - don't we want to retry joining on that thread? Or perhaps interrupt those threads themselves? I'm worried about leaving a straggling thread saving the namespace during a ^C shutdown, for example.
          • The code that joins on all the threads in the list is repeated a lot - maybe factor it into a static method?
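The factored-out join helper suggested in the last bullet, combined with the InterruptedException concern, might look like the sketch below. `ThreadJoiner` is a hypothetical name, not code from the patch: it retries `join()` until each thread is dead and restores the interrupt flag so callers can still observe a ^C-style shutdown request.

```java
import java.util.List;

// Hypothetical helper: wait for every saver thread, retrying join() if
// interrupted, then restore the caller's interrupt status.
public class ThreadJoiner {
    public static void joinAll(List<Thread> threads) {
        boolean interrupted = false;
        for (Thread t : threads) {
            while (t.isAlive()) {
                try {
                    t.join();
                } catch (InterruptedException e) {
                    interrupted = true; // remember the interrupt, keep waiting
                }
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // restore interrupt status
        }
    }
}
```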
          Dmytro Molkov added a comment -

          Todd, please have a look at the new patch. I've addressed all of your comments. The only thing I wanted to revisit was the behaviour of the join.
          With this modification we keep trying to join until the thread is dead. However, if the namenode is ^C'd while writing, it will have to wait for the write to finish, since the FSImageSaver threads would not stop.
          Do you think we should address that issue?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12444712/HDFS-1071.2.patch
          against trunk revision 944566.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/365/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/365/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/365/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/365/console

          This message is automatically generated.

          Todd Lipcon added a comment -

          Looks good. My only remaining comment is that we should add an assert in the test to make sure it's actually got multiple name directories. I'm pretty sure that's the case for MiniDFSCluster, but we should verify that it's actually testing something in case someone changes MiniDFS later.

          Dmytro Molkov added a comment -

          Added a check that we have more than 1 checksum of image files, so that the test is valid.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12445020/HDFS-1071.3.patch
          against trunk revision 946488.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/371/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/371/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/371/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/371/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          +1. Code looks good to me. I will commit this in a day or two unless somebody has other comments.

          Konstantin Shvachko added a comment -
          1. FSImage.processIOError() and FSEditsLog.processIOError() go as a pair, sort of. If you change the prototype of one you should also change the other, so FSEditsLog.processIOError() should also take List rather than ArrayList.
          2. TestParallelImageWrite has 2 warnings (import and unused variable).
          3. The main question is how saveNamespace() works on a running NN.
            The scenario here is that NN grabs FSNamesystem lock, then spawns several threads,
            and starts traversing the namespace tree and write inodes to the image file.
            My understanding here is that the new threads are completely unaware of the FSNamesystem synchronization or any other locks.
            Could you please elaborate on this. I don't understand the intentions, but it doesn't look right as is.
          4. It would be good to have a test case for the above condition. That is testing
            saveNamespace() with multiple directories on a running mini-cluster.
          5. Last but not least, could you please share any performance measurements, if you have any?
            How much faster does the save go if the directories are
            • on different drives,
            • on the same drive,
            • one on disk drive and another on NFS filer.
          Dmytro Molkov added a comment -

          I added a test that also does saveNamespace on a running cluster and then checks the image files written.
          As far as the locking is concerned, saveNamespace can only be done in safemode, so we are only performing read operations on a data structure that is essentially read-only while the parent thread is holding the lock, right?

          And for the performance: the current image on our biggest cluster is ~11G and it takes 1.5-2 minutes to write it out to the disk and the filer each. With parallel writes those latencies completely overlap, so it will take 1.5-2 minutes for both, which gives us about 1.5 minutes of savings (80-100 seconds is the time of the faster write).

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12446062/HDFS-1071.4.patch
          against trunk revision 950231.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/184/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/184/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/184/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/184/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          Hi Konstantin, it appears that Dmytro's last comment addresses all of your questions.

          Dmytro Molkov added a comment -

          Attaching new patch that applies to trunk since there are some conflicts with the older version.

          Konstantin Shvachko added a comment -

          > saveNamespace can only be done when in safemode, so we are only performing read operations on the datastructure that is essentially read only

          Yes, you can saveNamespace only in safe mode, but the data structures are not read-only. For example, getBlockLocations() updates accessTime for file inodes. This means that different threads in your implementation may record different atimes for the same file, depending on when they write it, so the file system images will not be identical across directories. They have been so far.

          > while the parent thread is holding a lock, right?

          Are you relying on the parent thread lock? Could you please explain how that works?

          In fact I thought you would implement something like this: one thread traverses the tree, serializes image objects and puts them into a queue, where other writing threads pick them up and write to disk in parallel. Then it is guaranteed that images are exactly the same.
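The serializer/writer pipeline proposed here could be sketched roughly as below. All names are hypothetical, and in-memory byte sinks stand in for the image files: one thread produces each serialized chunk exactly once and fans it out to a bounded queue per writer, so every image is byte-for-byte identical. An empty-array sentinel marks end-of-stream, and the bounded queues mean the producer naturally waits for the slowest writer instead of growing memory without limit.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the one-serializer/many-writers idea: each chunk is
// serialized once and offered to every writer's bounded queue.
public class FanOutWriter {
    static final byte[] EOF = new byte[0]; // sentinel, compared by reference

    public static List<byte[]> writeAll(List<byte[]> chunks, int nWriters)
            throws InterruptedException {
        List<BlockingQueue<byte[]>> queues = new ArrayList<>();
        List<ByteArrayOutputStream> sinks = new ArrayList<>();
        List<Thread> writers = new ArrayList<>();
        for (int i = 0; i < nWriters; i++) {
            BlockingQueue<byte[]> q = new ArrayBlockingQueue<>(16); // bounded: no unbounded growth
            ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for a file
            queues.add(q);
            sinks.add(sink);
            Thread w = new Thread(() -> {
                try {
                    while (true) {
                        byte[] chunk = q.take();
                        if (chunk == EOF) break;
                        sink.write(chunk, 0, chunk.length);
                    }
                } catch (InterruptedException ignored) { }
            }, "writer-" + i);
            w.start();
            writers.add(w);
        }
        // Single serializer: each chunk is produced once, then fanned out.
        for (byte[] chunk : chunks) {
            for (BlockingQueue<byte[]> q : queues) q.put(chunk);
        }
        for (BlockingQueue<byte[]> q : queues) q.put(EOF);
        for (Thread w : writers) w.join();
        List<byte[]> images = new ArrayList<>();
        for (ByteArrayOutputStream s : sinks) images.add(s.toByteArray());
        return images;
    }
}
```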

          > so it will take 1.5-2 minutes for both.

          Have you actually tested it? Is it the same if both directories are on the same drive?

          Dmytro Molkov added a comment -

          Well, what I mean by the parent thread holding the lock is the following:

          the saveNamespace method is synchronized on the FSNamesystem, and currently, while holding this lock, the handler thread walks the tree N times and writes N files; in effect we assume the tree is guarded from all modifications by the FSNamesystem lock.

          The same is true for the patch, except in this case the tree is walked by N different threads, operating under the same assumption that the tree is not modified while we hold the FSNamesystem lock; the handler thread waits for all worker threads to finish writing their files before returning from the section synchronized on FSNamesystem.

          We just deployed this patch internally to our production cluster:

          2010-06-22 10:12:59,714 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 11906663754 saved in 140 seconds.
          2010-06-22 10:13:50,626 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 11906663754 saved in 191 seconds.

          This saved us 140 seconds on the current image.

          As far as both copies being on the same drive is concerned - I guess this patch will not give much of an improvement there.
          However, I am not sure there is much value in storing two copies of the image on the same drive.
          Please correct me if I am wrong, but I thought multiple copies of the image should be stored on different drives to help in case of drive failure (or on a filer to protect against the machine dying); storing two copies on the same drive only helps with file corruption (accidental deletion), and that is a weak argument for having multiple copies on one physical drive.

          I like your approach with one thread doing serialization and others doing writes, but it seems a lot more complicated than the one in this patch.
          I am simply executing one call in a newly spawned thread, while the serializer-writer approach raises more implementation questions, such as what to do with writers that consume their queues at different speeds. You cannot grow the queue indefinitely, since the namenode would simply run out of memory; on the other hand, you want to feed the faster consumers as quickly as possible.
          The main benefit I see is doing the serialization of the tree only once, but since we are holding the FSNamesystem lock at that time the NameNode doesn't do much anyway; it is also no worse than what was in place before (serialization took place once per image location).

          Konstantin Shvachko added a comment -

          So this has nothing to do with the safe mode then.
          As I understand the main thread holds the global (FSNamesystem) lock, and nothing else is going to be executed on the NN at that time. This seems to be the answer to the locking question. Could you please add JavaDoc to FSImageSaver class with the summary of the locking and the image identity issues.

          The approach with one serializing thread and others doing writes is in fact not harder. The queue growth issue is not a problem imo. The speed of the total write depends on the slowest writer, so everybody can simply wait until the slowest writer completes its assignment; they would have to wait for it in the end anyway.
          The advantage of this approach is that we guarantee everybody writes the exactly same bytes into the image files.
          With your approach, although the implementation is simpler, it is not obvious the contents of the image files will be the same, well at least was not for me.

          Out of pure curiosity: do you know whether there is any benefit to multithreaded writing to directories on the same drive?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12447486/HDFS-1071.5.patch
          against trunk revision 957669.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/410/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/410/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/410/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/410/console

          This message is automatically generated.

          Dmytro Molkov added a comment -

          I added documentation for FSImageSaver that describes the assumptions behind how the parallel writes are done.

          As far as writing the image to multiple directories on a single disk: doing it in parallel might only hurt performance, since the disk will be seeking all the time.

          Konstantin Shvachko added a comment -

          It looks to me that with HDFS-903 in progress we should guarantee in this patch that all images are identical; otherwise the MD5s of the different images will not match. So maybe my suggestion of implementing this with one thread traversing the namespace tree and other threads writing to the disk is more relevant now.
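
          Konstantin's one-traversal design could look roughly like the sketch below: a single producer serializes the tree once and fans the same chunks out to one writer per directory, so the images are byte-identical by construction. All names here are hypothetical and this is not the patch's actual code; in-memory StringBuilders stand in for files on disk.

```java
// Hypothetical sketch: one traversal's output fanned out to several
// per-directory writer threads. None of these names come from the patch.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FanOutSaveSketch {
    private static final String EOF = "<eof>";

    public static void main(String[] args) throws Exception {
        // Stand-in for the serialized namespace produced by a single traversal.
        String[] chunks = {"dir:/", "file:/a", "file:/b"};
        int writerCount = 3;  // one writer per configured fs.name.dir

        List<BlockingQueue<String>> queues = new ArrayList<>();
        List<StringBuilder> images = new ArrayList<>();
        List<Thread> writers = new ArrayList<>();
        for (int i = 0; i < writerCount; i++) {
            BlockingQueue<String> queue = new LinkedBlockingQueue<>();
            StringBuilder image = new StringBuilder();  // stands in for a file
            Thread writer = new Thread(() -> {
                try {
                    String chunk;
                    while (!EOF.equals(chunk = queue.take())) {
                        image.append(chunk).append('\n');
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            writer.start();
            queues.add(queue);
            images.add(image);
            writers.add(writer);
        }

        // Single traversal: every writer receives exactly the same chunks.
        for (String chunk : chunks) {
            for (BlockingQueue<String> queue : queues) {
                queue.put(chunk);
            }
        }
        for (BlockingQueue<String> queue : queues) {
            queue.put(EOF);
        }
        for (Thread writer : writers) {
            writer.join();
        }

        // Identical input implies identical images, regardless of write timing.
        Set<String> distinct = new HashSet<>();
        for (StringBuilder image : images) {
            distinct.add(image.toString());
        }
        System.out.println(distinct.size() == 1 ? "identical" : "divergent");
    }
}
```

          The design choice here is that identical images fall out of the structure itself, rather than relying on a lock to freeze the namespace for the duration of all writes.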

          Jakob Homan added a comment -

          implementing this with one thread traversing the namespace tree and other threads writing to the disk is more relevant now.

          This seems like a good way to go forward. It would be good to get a patch that implements this approach.

          dhruba borthakur added a comment -

          I think the FSNamesystem lock ensures that the same image is written to all the directories. How could the images be different? saveFSImage acquires the writeLock and then writes all the copies in parallel, doesn't it?

          Konstantin Shvachko added a comment -

          This is what Dmytro and I discussed in detail above. As I understood it, he is not holding the FSNamesystem lock while writing ALL the images, so they can differ. And safe mode does not guarantee that the tree state is not changing. Could you please verify this? If the images are the same, I'm fine with the implementation.

          Jakob Homan added a comment -

          Could you please verify this. If the images are the same I'm fine with the implementation.

          In the patch, FSNameSystem::saveNamespace() acquires the writeLock before calling FSImage::saveNamespace(renewCheckpointTime). The writing is done in parallel and each of the writer threads is joined (in waitForThreads) before returning from the method, where the writeLock is surrendered. So this should be safe.

          There are other calls to saveNamespace that should be considered, though. FSImage::saveNamespace(renewCheckpointTime) is called from several other locations: In FSDirectory::loadFSImage, which is called by FSNameSystem's constructors, by BackupStorage::saveCheckpoint(), by CheckpointStorage::doMerge(), and by FSImage::doImportCheckpoint. Assuming no new operations are coming in, which they shouldn't be, the checkpoint and backupnode calls are safe. The others are as well, assuming we're in safemode. Does this sound reasonable?

          I believe this addresses Konstantin's concerns.
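
          The lock discipline Jakob describes (take the write lock, fan out one writer thread per directory, join them all before releasing) can be sketched outside Hadoop as below. This is a minimal illustration under stated assumptions, not the real FSImage/FSImageSaver code; the lock, temp directories, and checksum comparison are stand-ins.

```java
// Minimal sketch of the save-under-write-lock discipline from the discussion.
// The lock, directories, and file names are stand-ins, not the real code.
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ParallelSaveSketch {
    public static void main(String[] args) throws Exception {
        byte[] image = "fsimage-contents".getBytes(StandardCharsets.UTF_8);
        ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock();

        List<Path> nameDirs = new ArrayList<>();  // stands in for fs.name.dir
        for (int i = 0; i < 3; i++) {
            nameDirs.add(Files.createTempDirectory("name-dir-"));
        }

        List<Thread> savers = new ArrayList<>();
        nsLock.writeLock().lock();  // no namespace mutations while saving
        try {
            for (Path dir : nameDirs) {
                Thread saver = new Thread(() -> {
                    try {
                        Files.write(dir.resolve("fsimage"), image);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
                saver.start();
                savers.add(saver);
            }
            // Analogous to waitForThreads: join every writer before the lock
            // is surrendered, so all copies see the same namespace state.
            for (Thread saver : savers) {
                saver.join();
            }
        } finally {
            nsLock.writeLock().unlock();
        }

        // Every copy written under the same lock should be byte-identical.
        Set<String> digests = new HashSet<>();
        for (Path dir : nameDirs) {
            byte[] data = Files.readAllBytes(dir.resolve("fsimage"));
            digests.add(Base64.getEncoder()
                .encodeToString(MessageDigest.getInstance("MD5").digest(data)));
        }
        System.out.println(digests.size() == 1 ? "identical" : "divergent");
    }
}
```

          The key point is the try/finally shape: the join happens inside the locked region, so no mutation can slip in between one directory's write and another's.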

          A couple of nits with the current patch (6):

          • Java's Collections documentation is pretty adamant about traversing synchronized collections with a lock on the collection (http://download.oracle.com/javase/6/docs/api/java/util/Collections.html#synchronizedList(java.util.List)), which isn't done currently in the patch in processIOErrors for the sds parameter. This isn't necessary at the moment, as only one thread is guaranteed to be iterating, but it may be better to synchronize now to avoid problems in the future.
          • The MiniDFSCluster constructors have been deprecated since this patch was generated. It should be updated to use the new Builder.
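
          The first nit refers to a documented requirement of Collections.synchronizedList: iteration must be guarded by the wrapper's own monitor. A standalone illustration follows; the directory names are made up for the example.

```java
// Illustration of the Collections.synchronizedList iteration rule:
// iteration must hold the wrapper's monitor. Directory names are invented.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SyncListIteration {
    public static void main(String[] args) {
        List<String> sds = Collections.synchronizedList(new ArrayList<>());
        sds.add("/data/1/name");
        sds.add("/data/2/name");

        // Individual calls such as add() are synchronized internally, but an
        // iterator is not; without this synchronized block a concurrent
        // mutation could throw ConcurrentModificationException mid-iteration.
        StringBuilder seen = new StringBuilder();
        synchronized (sds) {
            for (String dir : sds) {
                seen.append(dir).append(' ');
            }
        }
        System.out.println(seen.toString().trim());
    }
}
```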
          Jakob Homan added a comment -

          Also, in TestParallelImageWrite.java, there is an unused import (URI) and an unused local variable (imageIndex) that should be cleaned up.

          Dmytro Molkov added a comment -

          Hey Jakob, sorry it took me so long to reply.
          Yes, I will make the changes shortly and upload the new patch.

          Dmytro Molkov added a comment -

          Got rid of imageIndex and the unused import of URI in the test case.
          Converted the MiniDFSCluster initialization to use the builder.
          Synchronizing on sds when processing IO errors.

          Should be good now. Jakob, can you take another look at this one? Thanks.

          Jakob Homan added a comment -

          Thanks for the updates. +1 on latest patch. Barring any more objections and pending Hudson, I'll commit this tomorrow.

          Jakob Homan added a comment -

          Re-triggering the mythical Hudson...

          Jakob Homan added a comment -

          Looks like Hudson didn't make it all the way through due to an unrelated error: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/3/console Dmytro, can you post test and test-patch results and then I'll commit. Thanks.

          Jakob Homan added a comment -

          I'd like to get this one into 0.22, so I ran the tests myself. All tests passed except TestDatanodeBlockScanner, TestBlockRecovery, TestStorageRestore, and TestDatanodeDeath, which also fail on trunk.

              [exec] -1 overall.  
              [exec] 
              [exec]     +1 @author.  The patch does not contain any @author tags.
              [exec] 
              [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
              [exec] 
              [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
              [exec] 
              [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
              [exec] 
              [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
              [exec] 
              [exec]     -1 release audit.  The applied patch generated 98 release audit warnings (more than the trunk's
          current 1 warnings).
              [exec] 
              [exec]     +1 system test framework.  The patch passed system test framework compile.

          I'm going to commit this.

          Jakob Homan added a comment -

          Forgot to mention, the audit issue from test-patch above is a known bug. I've committed this. Resolving as fixed. Thanks, Dmytro!

          Todd Lipcon added a comment -

          It looks like this commit causes TestSaveNamespace to fail - see HDFS-1503


            People

            • Assignee: Dmytro Molkov
            • Reporter: dhruba borthakur
            • Votes: 0
            • Watchers: 8
