Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2-alpha
    • Fix Version/s: 2.0.3-alpha
    • Component/s: datanode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      See discussion in HDFS-744. The actual sync/flush operation in BlockReceiver is not on a synchronous path from the DFSClient, hence it is possible for a DN to lose data that it has already acknowledged to the client as persisted.

      Edit: Spelling.
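
      To make the race concrete, here is a minimal, hypothetical Java sketch of the ordering problem and of the fix discussed in this issue (the class, field, and method names below are simplified illustrations, not the actual BlockReceiver internals): if the ack is enqueued for the responder before the local flush/sync completes, the client's hsync can return before the data is durable on the datanode; moving the enqueue after the sync closes that window.

      import java.io.FileOutputStream;
      import java.io.IOException;
      import java.util.concurrent.BlockingQueue;
      import java.util.concurrent.LinkedBlockingQueue;

      // Toy model of a datanode packet receiver; a separate responder thread is
      // assumed to drain ackQueue and send acks upstream toward the client.
      class PacketReceiver {
          private final FileOutputStream out;
          private final BlockingQueue<Long> ackQueue = new LinkedBlockingQueue<>();

          PacketReceiver(FileOutputStream out) {
              this.out = out;
          }

          // Ordering before the fix: the ack is queued first, so the client can see
          // "persisted" even though the write/sync below has not happened yet or fails.
          void receivePacketBefore(long seqno, byte[] data, boolean syncBlock) throws IOException {
              ackQueue.add(seqno);
              out.write(data);
              if (syncBlock) {
                  out.getFD().sync();
              }
          }

          // Ordering after the fix: the local sync completes before the ack is queued,
          // and any IOException prevents the ack from ever being enqueued.
          void receivePacketAfter(long seqno, byte[] data, boolean syncBlock) throws IOException {
              out.write(data);
              if (syncBlock) {
                  out.getFD().sync();
              }
              ackQueue.add(seqno);
          }
      }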

      Attachments

      1. hdfs-3979-v4.txt
        3 kB
        Lars Hofhansl
      2. hdfs-3979-v3.txt
        3 kB
        Lars Hofhansl
      3. hdfs-3979-v2.txt
        2 kB
        Lars Hofhansl
      4. hdfs-3979-sketch.txt
        3 kB
        Lars Hofhansl


          Activity

          Luke Lu added a comment -

          So the test you're looking for is starting 3 DNs, and then have the write permanently fail at any of them, and in all cases have the hsync fail on the client, right?

          Yes, that should provide basic coverage for the failure cases.

          Lars Hofhansl added a comment -

          Thanks Nicholas.

          Luke, so the test you're looking for is starting 3 DNs, and then have the write permanently fail at any of them, and in all cases have the hsync fail on the client, right?

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1249 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1249/)
          HDFS-3979. For hsync, datanode should wait for the local sync to complete before sending ack. Contributed by Lars Hofhansl (Revision 1406382)

          Result = FAILURE
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1406382
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1219 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1219/)
          HDFS-3979. For hsync, datanode should wait for the local sync to complete before sending ack. Contributed by Lars Hofhansl (Revision 1406382)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1406382
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #29 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/29/)
          HDFS-3979. For hsync, datanode should wait for the local sync to complete before sending ack. Contributed by Lars Hofhansl (Revision 1406382)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1406382
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #2967 (See https://builds.apache.org/job/Hadoop-trunk-Commit/2967/)
          HDFS-3979. For hsync, datanode should wait for the local sync to complete before sending ack. Contributed by Lars Hofhansl (Revision 1406382)

          Result = SUCCESS
          szetszwo : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1406382
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
          Tsz Wo Nicholas Sze added a comment -

          I have committed this. Thanks, Lars!

          Luke Lu added a comment -

          I think it will decrease the performance for non-sync write.

          It'll be nice if we can show/quantify the decrease in performance for non-sync writes. It may not be wise to introduce complexity and make hflush less robust if this is a non-issue.

          The existing tests: TestFiPipelines and TestFiHFlush do not cover the other scenarios you worry about?

          It seems that TestFiHFlush doesn't cover the failure scenarios. All the test cases are positive assertions (the pipeline can recover in spite of disk error exceptions), which seems not very useful given that the ack is sent before the disk error exceptions are triggered. A new TestFiHSync seems necessary, especially for the new patch, where the ack code path diverges from hflush. Basically, I want to make sure that hsync is guaranteed to get an error if the pipeline cannot be recovered (e.g., because the required datanodes ran out of disk space).

          Anyway, I'm fine with filing another jira for these hflush/hsync improvements.

          Tsz Wo Nicholas Sze added a comment -

          Thanks for the update, Lars.
          +1, the patch looks good. I will commit it if there are no more comments.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12552061/hdfs-3979-v4.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3442//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3442//console

          This message is automatically generated.

          Lars Hofhansl added a comment -

          Updated patch with Nicholas' suggestion.

          I agree that the previous patch would have slowed all writes that reach the DN.
          We can't distinguish between an hflush from the client and a "normal" packet from the client.
          On the other hand, this no longer deals with Luke's "kill -9" scenario (where a cluster management tool would kill -9 datanodes in parallel), but in the end no tool really should do that.

          Lars Hofhansl added a comment -

          I'll make that change.

          Kan Zhang added a comment -

          I think it will decrease the performance for non-sync write.

          I'd welcome some clarity on whether writing to OS buffers is a real concern here.

          How about keeping enqueue early when syncBlock == false?

          To be on the conservative side, I'm OK with this.

          Tsz Wo Nicholas Sze added a comment -

          The patch moves ack to the end in order to fix the sync semantics. I think it will decrease the performance for non-sync write. How about keeping enqueue early when syncBlock == false?
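
          As an editorial sketch of what this suggestion amounts to (again with simplified, hypothetical names rather than the real BlockReceiver code), only packets that request a sync would pay for waiting on the local fsync, while ordinary packets keep the early ack:

          import java.io.FileOutputStream;
          import java.io.IOException;
          import java.util.concurrent.BlockingQueue;
          import java.util.concurrent.LinkedBlockingQueue;

          // Variant of the toy receiver from the issue description: only sync packets
          // wait for the local fsync before their ack is queued.
          class SyncAwarePacketReceiver {
              private final FileOutputStream out;
              private final BlockingQueue<Long> ackQueue = new LinkedBlockingQueue<>();

              SyncAwarePacketReceiver(FileOutputStream out) {
                  this.out = out;
              }

              void receivePacket(long seqno, byte[] data, boolean syncBlock) throws IOException {
                  if (!syncBlock) {
                      ackQueue.add(seqno);  // unchanged fast path for normal (non-sync) writes
                  }
                  out.write(data);
                  if (syncBlock) {
                      out.getFD().sync();   // data durable on this datanode ...
                      ackQueue.add(seqno);  // ... before the ack can be sent to the client
                  }
              }
          }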

          Tsz Wo Nicholas Sze added a comment -

          Sure, will check the patch.

          Kan Zhang added a comment -

          +1

          Thanks, Lars. Patch looks good to me.

          Nicholas, would appreciate if you could also take a look. Thx!

          Lars Hofhansl added a comment -

          Hi Kan, the only difference between v2 and v3 is that in v3 the "fsync" metric is updated after the actual sync to the FS (BlockReceiver.flushOrSync).

          This exposes the race condition we want to fix and makes TestHSync fail almost every run (the client returns from hsync before the datanode has updated the metric). With the rest of this patch applied, this race is removed and TestHSync never fails.

          So now we have a test case for the race condition.

          Luke Lu: The existing tests, TestFiPipelines and TestFiHFlush, do not cover the other scenarios you worry about?

          Kan Zhang added a comment -

          This little change makes TestHSync fail most of the time - without the rest of the patch, and never with this patch.

          Lars, I don't quite understand your above comment. What's the behavior of TestHSync with and w/o your latest patch?

          Luke Lu added a comment -

          The patch lgtm, even though it lacks tests for failure cases for hsync.

          This issue is a blocker for HBASE-5954; it would be better to resolve it asap.

          You can help by testing the patch and show some results.

          Liang Xie added a comment -

          Are there any objections, or more comments?
          This issue is a blocker for HBASE-5954; it would be better to resolve it asap.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12549002/hdfs-3979-v3.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3327//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3327//console

          This message is automatically generated.

          Lars Hofhansl added a comment -

          This little change makes TestHSync fail most of the time - without the rest of the patch, and never with this patch.

          (In HDFS-744 I had avoided this race, by updating the sync metric first. I know that was a hack... By updating the metric last in BlockReceiver.flushOrSync, this race becomes apparent again).

          We do have pipeline tests that seem to verify correct pipeline behavior in the face of failures via fault injection: TestFiPipelines and TestFiHFlush.

          In terms of the API3/API4 discussion, I think we agree that hflush should follow API4, right? (otherwise we'd have unduly complex code)

          Kan Zhang added a comment -

          Why API4 is needed for HBase?

          API3 or API4, it probably doesn't make a huge difference, IMHO. On the other hand, assuming the performance penalty of going from API3 to API4 is negligible, it's probably not worth complicating the code to support API3 (instead of API4).

          Lastly, we can play with this. For example, only one of the replicas could sync to disk and the others just guarantee the data is in the OS buffers (API4.5).

          Yes, it would be very interesting to see if it saves to sync only the local replica or acknowledge to the client upon the first successful sync of any replica.

          Lars Hofhansl added a comment -

          API4 is hflush (with change in OS buffers).

          That's an interesting discussion by itself. hsync'ing every edit in HBase is prohibitive.
          I have some simple numbers in HBASE-5954.

          Although, I need to do that test again with the sync_file_range changes in HDFS-2465 (that would hopefully do most of the data sync'ing asynchronously and only sync the last changes and metadata synchronously upon client request).

          Many applications do not need every edit to be guaranteed on disk, but have "sync points". That is what I am aiming for in HBase. The application will know the specific semantics.

          What is really important for HBase (IMHO) is that every block is synced to disk when it is closed. HBase constantly rewrites existing data via compactions so without syncing arbitrarily old data can be lost during a rack or DC outage.

          Lastly, we can play with this. For example, only one of the replicas could sync to disk and the others just guarantee the data is in the OS buffers (API4.5).
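
          For readers following the API discussion, here is a hypothetical client-side example of the trade-off described above (the path, edit format, and batch size are made up; it assumes a running HDFS and the standard FileSystem API): hflush pushes the written data to the datanodes in the pipeline, typically into OS buffers, while hsync additionally forces it to disk, so an application can hflush every edit and reserve hsync for its "sync points".

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class SyncPointExample {
            public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              try (FSDataOutputStream out = fs.create(new Path("/tmp/wal-example"))) {
                for (int i = 0; i < 1000; i++) {
                  out.write(("edit-" + i + "\n").getBytes("UTF-8"));
                  out.hflush();    // cheap per-edit guarantee: data pushed to all DNs in the pipeline
                  if (i % 100 == 99) {
                    out.hsync();   // periodic "sync point": data also forced to disk on the DNs
                  }
                }
              }
            }
          }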

          Luke Lu added a comment -

          Why API4 is needed for HBase?

          Many configuration management systems (simplest: pdsh -a hadoop-daemon.sh stop datanode) shut down/restart HDFS by kill -9'ing datanodes in parallel. Having to quiesce any OLTP-like workload first is error prone. How about a simple ops error: pdsh -a killall -9 java sent to the wrong window (hence the wrong cluster)? IMO, API4 is not robust enough for HBase. Unless the performance difference is huge (> 20% for hflush), which I doubt, it's not worth the risk, again IMO.

          Tsz Wo Nicholas Sze added a comment -

          For applications like HBase we'd like API4 as well as API5.
          (API4 allows a hypothetical kill -9 of all DNs without loss of acknowledged data, API5 allows HW failures of all data nodes - i.e. a DC outage - with loss of acknowledged data)

          Why API4 is needed for HBase?

          As everyone knows, there are usually 3 replicas in HDFS. If only one of the datanodes is killed, the data is still available on the other two datanodes. That's why we invented "hflush" (i.e. API 3) in HDFS-265.

          Lars Hofhansl added a comment -

          Thanks Luke and Kan. I'll come up with a test once I get some spare cycles (quite busy with HBase atm).

          Luke Lu added a comment -

          It should be eliminated with this patch. When there is write error, ack will not be queued.

          I think so too, but it'll be nice to have a test to cover the case for future "maintenance/refactor".

          Kan Zhang added a comment -

          The race between ack and write errors should be reduced (eliminated) with this patch.

          It should be eliminated with this patch. When there is write error, ack will not be queued.

          Lars Hofhansl added a comment -

          I've seen that race when I wrote a test for HDFS-744. I "fixed" it there by updating the metrics first... Ugh

          I think I can make a test that fails at least with reasonable probability with the current semantics.

          The race between ack and write errors should be reduced (eliminated) with this patch.

          Luke Lu added a comment -

          You don't think the existing pipeline tests cover the failure scenarios?

          Given the existing hflush/hsync semantics (ack can reach client before any pipeline exceptions), I don't think the new semantics is covered by existing tests. I'm worried about the race between the ack and write errors.

          Lars Hofhansl added a comment -

          You don't think the existing pipeline tests cover the failure scenarios?
          I'll see if I can get some performance numbers.

          Luke Lu added a comment -

          Do we want the change?

          I do think that the change is required for the correct hsync semantics (and better hflush guarantee). I'm not too sure if the change is complete without some reasonable test cases for failure scenarios.

          BTW, do you have any new performance numbers for comparison as well?

          Lars Hofhansl added a comment -

          Do we want this change?
          Seems to me that HDFS-265 broke hsync/hflush and this would fix it.

          Lars Hofhansl added a comment -

          I see. Thanks Kan. So now we have API4 and (with HDFS-744) API5.

          For applications like HBase we'd like API4 as well as API5.
          (API4 allows a hypothetical kill -9 of all DNs without loss of acknowledged data, API5 allows HW failures of all data nodes - i.e. a DC outage - with loss of acknowledged data)

          Kan Zhang added a comment -

          I wonder why this was changed?

          My guess is HDFS-265 intends to implement API3 rather than API4. https://issues.apache.org/jira/browse/HDFS-265?focusedCommentId=12710542&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12710542

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12547049/hdfs-3979-v2.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3247//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3247//console

          This message is automatically generated.

          Lars Hofhansl added a comment -

          New patch. Order of local operations and waiting for downstream DNs now reflects the pre HDFS-265 logic.

          Lars Hofhansl added a comment -

          Enqueuing the seqno at the end seems like the best approach. (Indeed this is done in the 0.20.x code, as both of you said.)
          I wonder why this was changed? Will have a new patch momentarily.

          Lars Hofhansl added a comment -

          Should we simply do the enqueue at the end of receivePacket(), then?

          So just to make sure: in the current code the seqno is already enqueued at the beginning, so if there's an exception later in the code it won't have any effect on the enqueued seqno. The finally just preserves this existing behavior.

          What happens when there is an exception and the seqno is never enqueued? (And if that is OK, why is it not a problem now?)

          Kan Zhang added a comment -

          I agree it's probably not a good idea to enqueue in a finally block. The original code before HDFS-265 didn't do that either.

          Lars Hofhansl added a comment -

          I'm not sure either. I am trying not to change the existing behavior.

          The enqueue used to happen at the beginning of receivePacket(...), so if the latter part of the method fails, the ack would already be enqueued.

          Luke Lu added a comment -

          TestHSync only tests the success code path, which makes me a bit nervous, as I'm not sure if putting the ack enqueue in the finally block is the right thing to do. I think you want the pipeline to fail and restart if there is an io exception.
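
          To spell out the concern with a small, hypothetical contrast (simplified names again, not the real code): with the enqueue in a finally block the ack is sent even when the write or sync throws, whereas enqueuing only after a successful sync lets the IOException propagate so the pipeline can fail and be rebuilt.

          import java.io.FileOutputStream;
          import java.io.IOException;
          import java.util.concurrent.BlockingQueue;
          import java.util.concurrent.LinkedBlockingQueue;

          class AckOrderingContrast {
              private final FileOutputStream out;
              private final BlockingQueue<Long> ackQueue = new LinkedBlockingQueue<>();

              AckOrderingContrast(FileOutputStream out) {
                  this.out = out;
              }

              // Questionable: the finally block enqueues the ack even when write()/sync()
              // threw, so the client may get a success ack for data that is not durable.
              void receiveWithFinally(long seqno, byte[] data) throws IOException {
                  try {
                      out.write(data);
                      out.getFD().sync();
                  } finally {
                      ackQueue.add(seqno);
                  }
              }

              // Preferred: the ack is enqueued only after a successful sync; on failure the
              // IOException propagates and the pipeline fails and restarts instead.
              void receiveAfterSync(long seqno, byte[] data) throws IOException {
                  out.write(data);
                  out.getFD().sync();
                  ackQueue.add(seqno);
              }
          }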

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12546734/hdfs-3979-sketch.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3239//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3239//console

          This message is automatically generated.

          Lars Hofhansl added a comment -

          Let's try HadoopQA.

          TestHSync still passes. I'll also do some tests with HBase...

          Lars Hofhansl added a comment -

          (and sorry for misspelling your name)

          Lars Hofhansl added a comment -

          Something like this.
          (This is a sketch, the only test I performed was compiling)

          Kan Zhang added a comment -

          Thanks, Lars! BTW, you spell my name wrong.

          Lars Hofhansl added a comment -

          More good discussion on HDFS-744.

          Looks like we can just enqueue the seqno for the packet after the sync/flush is finished. (Khan's idea)

          Lars Hofhansl added a comment -

          Also see my comment here: https://issues.apache.org/jira/browse/HDFS-744?focusedCommentId=13279619&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13279619

            People

            • Assignee: Lars Hofhansl
            • Reporter: Lars Hofhansl
            • Votes: 0
            • Watchers: 20
