Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels: None

      Description

      HADOOP-1700 and related issues put a lot of effort into providing the first implementation of append. However, append is a complex feature, and issues that initially seemed trivial turned out to need careful design. This jira revisits append, aiming for a design and implementation that support semantics acceptable to its users.

      Attachments

      1. AppendSpec.pdf
        48 kB
        Hairong Kuang
      2. appendDesign.pdf
        580 kB
        Hairong Kuang
      3. appendDesign.pdf
        408 kB
        Konstantin Shvachko
      4. TestPlanAppend.html
        50 kB
        Konstantin Boudnik
      5. appendDesign1.pdf
        624 kB
        Hairong Kuang
      6. AppendTestPlan.html
        55 kB
        Konstantin Boudnik
      7. AppendTestPlan.html
        62 kB
        Konstantin Boudnik
      8. AppendTestPlan.html
        62 kB
        Konstantin Boudnik
      9. appendDesign2.pdf
        868 kB
        Hairong Kuang
      10. AppendTestPlan.html
        63 kB
        Konstantin Boudnik
      11. AppendTestPlan.html
        65 kB
        Konstantin Boudnik
      12. AppendTestPlan.html
        65 kB
        Konstantin Boudnik
      13. AppendTestPlan.html
        65 kB
        Konstantin Boudnik
      14. a.sh
        1 kB
        Tsz Wo Nicholas Sze
      15. AppendTestPlan.html
        65 kB
        Konstantin Boudnik
      16. AppendTestPlan.html
        64 kB
        Konstantin Boudnik
      17. appendDesign3.pdf
        869 kB
        Hairong Kuang

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          Hairong Kuang added a comment -

          Updated the design document to give credit to Konstantin, Nicholas, Sanjay, and Rob from the Yahoo! HDFS team, who contributed to the append design. Also many thanks to Ben Reed from Yahoo! Research for many hallway & lunch discussions, and to Dhruba from Facebook for many trips to Yahoo! for append discussions and design reviews.

          stack added a comment -

          Yapeeeee!++

          Konstantin Boudnik added a comment -

          Yapeeeee!

          Hairong Kuang added a comment -

          A few recent updates to the jira:
          1. Block recovery tests have been committed to both 0.21 and the trunk.
          2. Stress/performance/scalability tests are going to be tracked in HDFS-708.

          Since all sub-tasks are committed, let me close this jira.

          Hairong Kuang added a comment -

          Todd, the concurrent tailer is a very good use case! Thanks for testing it and those are nice bugs.

          Todd Lipcon added a comment -

          Is the concurrent tailer use case part of this feature? I opened HDFS-1060 with some test cases that show it is not functional on trunk.
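
          For readers unfamiliar with the term, the "concurrent tailer" use case is roughly the pattern sketched below: one process keeps the file open and makes new bytes visible with hflush(), while another process repeatedly reopens the file and reads whatever has become visible. This is only an illustration of the intended behaviour against the 0.21 FileSystem API, not a claim that it works on trunk (see HDFS-1060); the path and loop details are hypothetical.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataInputStream;
            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class TailerSketch {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                Path p = new Path("/logs/tailed-file");          // hypothetical path

                // Writer side: append a record and expose it with hflush().
                FSDataOutputStream out = fs.create(p);
                out.writeBytes("record 1\n");
                out.hflush();                                    // flushed bytes become visible to new readers

                // Tailer side (normally a separate process): reopen the file and
                // read from the last known offset; repeat on a timer to keep tailing.
                long offset = 0;
                FSDataInputStream in = fs.open(p);
                in.seek(offset);
                byte[] buf = new byte[4096];
                for (int n; (n = in.read(buf)) > 0; ) {
                  offset += n;                                   // remember how far we have tailed
                }
                in.close();
                out.close();
              }
            }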

          ryan rawson added a comment -

          The head of the mail conversation is here:

          http://mail-archives.apache.org/mod_mbox/hadoop-general/201002.mbox/<dfe484f01002161412h8bc953axee2a73d81a234bdf@mail.gmail.com>

          The discussion was never brought to a close. At this point there do not seem to be any plans for a Hadoop 0.21 release.

          Hairong Kuang added a comment -

          This feature has been implemented in 0.21 and tested by HBase and our Yahoo! developers. I think 0.21 in general still needs large-scale tests. Regarding 0.21, unfortunately Yahoo! is going to skip this release. There are some discussions on the hadoop mailing list about how to release 0.21 in the community; please check them out there.

          Jeremy added a comment -

          I, like most people, have been eagerly awaiting this feature for a while now. I've searched the web and can't seem to find any comments on when this feature is expected to be out. Since we are coming up on 0.20 + 1 year, I wonder if anyone would like to comment on the 0.21 release: is there a particular time frame you all are shooting for?

          Thanks.

          Hairong Kuang added a comment -

          Min, we implemented the first algorithm, which was done in HDFS-570 and HDFS-585.

          Thanks for your feedback on the design document.

          Min Zhou added a comment -

          @Hairong Kuang

          Which algorithm did you pick for the read operation of hdfs? How can I find it in the trunk?

          btw, I think "A DataNode changes its (BA, BR) to be (a+b, a+b) right after step 2.a " should be ".. 3.a" on Page 5 of your appendDesign2.pdf

          Min

          Konstantin Boudnik added a comment -

          Latest and perhaps final update of the test plan: fixing the block recovery FI test descriptions and syncing the document with the actual status of the pipeline tests developed.

          Hudson added a comment -

          Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #148 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/148/)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #170 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/170/)

          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #83 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/83/)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #139 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/139/)
          HDFS-797. TestHDFSCLI much slower after merge. Contributed by Todd Lipcon.

          Konstantin Boudnik added a comment -

          Test plan update with latest state of HDFS-520

          Jean-Daniel Cryans added a comment -

          We have been trying this patch for some time now on sequence files in HBase for the write-ahead logs, by syncing the writer's output stream. I experimented by doing fast increments on a single cell at a rate of 800-900 operations a second and then killing -9 both the DN and RS after some time. I then waited for the master to recover the WAL, and all edits (I log every one of them locally) were present. I tried it dozens of times, killing machine after machine, and it works every time.

          +1
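
          For anyone wanting to reproduce this kind of test, the write path being exercised is roughly the sketch below. It is only a rough illustration that assumes the SequenceFile.Writer#syncFs() call discussed further down in this jira (HADOOP-4379); the WAL path and key/value types are placeholders, not HBase's actual WAL code.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.io.SequenceFile;
            import org.apache.hadoop.io.Text;

            public class WalWriteSketch {
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                Path wal = new Path("/hbase/.logs/wal-000001");  // placeholder WAL path

                SequenceFile.Writer writer =
                    SequenceFile.createWriter(fs, conf, wal, Text.class, Text.class);
                for (int i = 0; i < 1000; i++) {
                  writer.append(new Text("row-" + i), new Text("increment"));
                  // Push the edit through the pipeline before acking the caller, so the
                  // edit survives a kill -9 of the region server and datanode.
                  writer.syncFs();
                }
                writer.close();
              }
            }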

          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #20 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/20/)

          Konstantin Shvachko added a comment -

          A quick status update:
          The append branch has been merged into hdfs-0.21 and the trunk, thanks to Hairong, who did it when everybody was out of town. The mainstream functionality has been tested through a comprehensive set of tests, including fault injection tests, written by Konstantin Boudnik and Nicholas. Nicholas and I tried the append branch on our pet clusters before the merge.
          Looking ahead, this issue will be closed when all sub-tasks listed here are completed, most of which are more tests, and the append functionality is turned on by default. Please create new (not sub-task) jiras linked to this one for append-related issues.

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #101 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/101/). Merge -c 820487 https://svn.apache.org/repos/asf/hadoop/hdfs/branches/branch-0.21

          Tsz Wo Nicholas Sze added a comment -

          I ran NNBenchWithoutMR to test the append branch with 10, 100 files and 4k, 4m, 40m block sizes on a 10-node cluster. Everything worked fine.

          a.sh: the script I used to compile everything.

          Konstantin Shvachko added a comment -

          Stack, check the sub-tasks rather than linked issues. Some of them like 26 (HDFS-624) or 17 (HDFS-570) are not closed yet.

          Hairong Kuang added a comment -

          There are outstanding ones. Please feel free to review. Your comments are very welcome!

          stack added a comment -

          @Konstantin: I took a look at all the issues linked to from this issue but they have all been committed (too late for review). Are there others outstanding that we could help review? Thanks.

          Konstantin Shvachko added a comment -

          Stack, Ryan: It's a bit early. The append branch is still very dynamic - a lot of changes are going in. The plan is to stabilize it by the end of Sept. and ask people to try it then. It would be too easy to break it right now - no fun for you. Code reviews are welcome any time.

          Konstantin Boudnik added a comment -

          To me it sounds like a good idea. However, it might be advisable to wait until at least all development sub-tasks are completed.
          Also, feel free to pick up some of the test development subtasks and help with them.

          stack added a comment -

          Konstantin: Any chance of an update on state of the branch? Should we start trying to use it/break it? Thanks boss.

          Konstantin Boudnik added a comment -

          Certainly, it could be found in the append branch at https://svn.apache.org/repos/asf/hadoop/hdfs/branches/HDFS-265

          ryan rawson added a comment -

          hey guys, it's been 4 months since this issue started moving, can we get some source so we can review and possibly test?

          Thanks!

          Konstantin Boudnik added a comment -

          Normal pipeline tests information is updated per HDFS-608

          Konstantin Boudnik added a comment -

          Three pipeline_FI... testcases were removed as they seem to be excessive.

          Konstantin Boudnik added a comment -

          More fault injection test cases are added for hflush() per latest update of the patch

          Konstantin Boudnik added a comment -

          Per this comment I'm adding an additional test case for hflush()

          Hairong Kuang added a comment -

          Thanks to Konstantin and Suresh for their time reading the append design document and for their valuable feedback. Here is an updated version. Major changes include:
          1. Change the GS/len finalized block state to a Committed block state
          2. Add how to persist block/replica states and how to handle DataNode upgrade
          3. Add how to handle access token errors
          4. Add how to handle the case when the pipeline becomes empty
          5. Rewrite sections 3.1 and 3.2 and redraw the pictures in these sections to make the ideas clearer.

          Konstantin Boudnik added a comment -

          Proofreading of the test plan and fixes to the hflush() test descriptions.

          Konstantin Boudnik added a comment -

          Per my conversation with Konstantin, it is a new exception to be introduced (the specs are yet to be updated, I believe, so I'm a little bit ahead of the steamer here). It is to be thrown when a datanode is too busy and can't participate in the recovery process.

          Tsz Wo Nicholas Sze added a comment -

          In the latest test plan, what is a TimeOutException?

          Konstantin Boudnik added a comment -

          Test plan update including Hflush, performance/scalability, and stress tests.

          It is likely that perf/scalability will have to be taken care of much later in the process.

          Tsz Wo Nicholas Sze added a comment -

          Found 2 typos:

          • There is only one new test case in Pipeline_FI. The test numbers are changed and there is no Pipeline_FI_25 in the new test plan.
          • HADOOP-265 should be HDFS-265
          Konstantin Boudnik added a comment -
          • Modifications are made to consider latest changes in the spec. document.
          • Two more pipeline FI tests are added for new corner cases (per Konstantin's review)
          • Placeholders for hflush(), stress, and scalability test cases are added
          • Test plan file is renamed to be consistent with other features test plans.
          Hairong Kuang added a comment -

          Thanks Cos and Nicholas for the test plan. All tests are very good. I have two comments:
          1. Release 0.21 will focus more on hflush than append. Could we add tests on hflush?
          2. Could we have some performance/scalability tests to make sure that the implementation does not cause any performance/scalability regression?

          Hairong Kuang added a comment -

          In this design, a new generation stamp is always fetched from the NameNode before a new pipeline is set up when handling errors. So if an access token is also fetched along with the generation stamp, things should be OK.

          Kan Zhang added a comment -

          > Regarding Kan's comment, I believe the new design should work well with the access token expiration problem if a new access token is returned along with a new generation stamp.

          This is not sufficient. My point was whenever a new pipeline needs to be set up, the client needs to call the block recovery code to get a newly generated access token, even in cases where a new generation stamp is not needed.

          Hairong Kuang added a comment -

          This version of design document incorporates all review comments.

          Regarding Kan's comment, I believe the new design should work well with the access token expiration problem if a new access token is returned along with a new generation stamp.

          Hairong Kuang added a comment -

          > Page 4: "This algorithm chooses to do it right after 1.b and ..." Should "1.b" be "2.b"?
          Writing data to disk is chosen to be done right after 1.b. Steps 1, 2, and 3 do not need to be done in sequence.

          > Page 3: when are the replica files in Datanodes created? During Stage 1 "Set up a pipeline"?
          Replica files are created at pipeline set-up time.

          > Page 3: What will happen if a client sets up a pipeline and then closes it? Will it result in some empty replica files?
          A pipeline does not get set up unless an application has already written a packet of data. So this scenario should not happen.

          > Page 4: In "3. Simplify buffer management since ...", would it slow down the pipeline?
          This is the same as what we have in the trunk. So it won't make the performance worse.

          Hairong Kuang added a comment -

          It looks like there is some confusion regarding the pipeline diagram on page 3. Each solid line does not represent one packet. Instead, it represents the start of one packet and/or the end of a packet. For example, t1 is the time that packet 0 starts to be sent and t2 is the time that packet 0 is finished and packet 1 starts to be sent. Packet 2 is an hflushed packet, so packet 3 cannot be sent before the ack for packet 2 is received.

          This diagram assumes that the application writes faster than packets get sent to the network, so there is no waiting in between two packets. Hflush is called before t3.

          Tsz Wo Nicholas Sze added a comment -

          > First draft of the test plan for append, pipelines, block and lease recovery
          Thanks Cos for combining and posting the test plan. I am working on the pipeline FI tests.

          Konstantin Boudnik added a comment -

          First draft of the test plan for append, pipelines, block and lease recovery.

          Tsz Wo Nicholas Sze added a comment -

          > ... Nicholas will post a test plan.

          Don't want to complicate this issue. Will file new issues for test plans. Just have filed HDFS-451 for testing DataTransferProtocol.

          Kan Zhang added a comment -

          I haven't read the new design yet. FWIW, just wanted to add a note on dealing with access token errors based on past experience. Access tokens are used as capability tokens for accessing datanodes. They are checked only during pipeline setup. A client can't generate access tokens by itself; tokens have to be obtained from the namenode (or datanode during block recovery). An access token has an expiration date. Since a client may cache access tokens, access token error may occur during pipeline setup due to expired access tokens. Specifically, we have the following situations.

          1) Pipeline setup for writing to a new block. In this case, any error (including access token error) will cause the client to abandon the block and get a new blockID with a newly generated access token from the namenode before re-establishing the pipeline. Hence, no need to worry about expired access tokens when re-establishing the pipeline.

          2) A pipeline for writing has been successfully set up, but an error occurs during writing. In this case, the client needs to go through a block recovery process and subsequently re-establish the pipeline. At this time, a previously used access token may have expired. The idea behind HDFS-195 is that the block recovery process will always return a newly generated access token to the client, which the client can then use to re-establish the pipeline. One nice thing about using a newly generated access token is that if a subsequent access token error occurs, we can conclude it is the complaining datanode that is misbehaving and we can exclude it from the set of targets in our next retry.

          3) When setting up a pipeline initially for an append operation. Since in the current implementation (before the new design in this JIRA) the client needs to go through the same block recovery process as in case 2) before setting up the pipeline, the client will get a newly generated access token as a result and use it to set up the pipeline. So the same arguments apply.
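
          A schematic of the retry behaviour described in cases 1)-3) might look like the sketch below. This is purely illustrative: the ClientStub interface and its method names are hypothetical stand-ins for DFSClient internals, not real HDFS APIs.

            import java.io.IOException;

            // Illustrative only: ClientStub and its methods are hypothetical stand-ins
            // for DFSClient internals, not real HDFS APIs.
            interface ClientStub {
              boolean isNewBlock();
              void setupPipeline() throws IOException;   // presents the cached access token to the datanodes
              void streamPackets() throws IOException;
              void abandonBlockAndGetNewOne() throws IOException;
              void recoverBlockAndGetToken() throws IOException;
            }

            class PipelineRetrySketch {
              static void writeBlockWithRetries(ClientStub client) throws IOException {
                while (true) {
                  try {
                    client.setupPipeline();
                    client.streamPackets();
                    return;
                  } catch (IOException e) {
                    if (client.isNewBlock()) {
                      // Case 1: abandon the block; the namenode returns a new block id
                      // together with a freshly generated access token.
                      client.abandonBlockAndGetNewOne();
                    } else {
                      // Cases 2 and 3: run block recovery, which always hands back a newly
                      // generated access token, so an expired cached token cannot wedge the retry.
                      client.recoverBlockAndGetToken();
                    }
                  }
                }
              }
            }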

          Konstantin Boudnik added a comment - edited

          I have a few corrections and a couple of proposals about the specs:

          Corrections:

          • inconsistency of states' abbreviations: in some places it is all lower case, somewhere it's capitalized, etc.
          • page 1
            "...DHFS needs..." -> "...HDFS needs..."
          • page 2
            "...bytes in a bbw replica" -> "...bytes in a rbw replica"
          • page 5
            "...throws and EndOfFileException." -> "...throws EOFException." Otherwise it might sound like a different type of exception
          • page 10
            9.5.4.a.ii.1: ReplicaNotExistException -> ReplicaNotExistsException
            9.5.4.b.iii: "If max(Len for all returned DN) == 0..." -> max(Len from all reported DN) == 0..."
          • page 11
            9.5.4.c.iii: "...memeory..." -> "...memory..."

          Suggestions:

          • page 10
            9.5.4.a.ii.3: rename RecoveryUnderProgressException to RecoveryInProgressException
            9.5.4.a.ii.3: rename CorruptReplicaException to CorruptedReplicaException
          Konstantin Shvachko added a comment -

          I added page numbers, section numbers, name and date.
          Same document other than that.

          stack added a comment -

          Agree with Tsz Wo that page numbers and sections would make discussion of this doc easier (the doc should also have an author and date).

          Konstantin Boudnik added a comment -

          I believe there's one more issue with this diagram:

          • the ack for packet 3 seems to be never received

          Is this intended to be so?

          > there are 7 solid arrows but only 5 packets (packet 0 to packet 4). I cannot tell the correspondence between the arrows and the packets.

          Tsz Wo Nicholas Sze added a comment -

          Came up with some questions when I was working on the test plan:

          • Page 3: when are the replica files in Datanodes created? During Stage 1 "Set up a pipeline"?
          • Page 3: What will happen if a client sets up a pipeline and then closes it? Will it result in some empty replica files?
          • Page 4: In "3. Simplify buffer management since ...", would it slow down the pipeline?
          Tsz Wo Nicholas Sze added a comment -

          Hairong, thanks for putting all the details in the doc. It is very useful for understanding the new design.

          Some comments:

          • Page 3: It is not easy to understand the pipeline diagram. It is clear that the time between t1 and t9 is the data streaming phase. However, there are 7 solid arrows but only 5 packets (packet 0 to packet 4). I cannot tell the correspondence between the arrows and the packets.
          • Page 3: when was hflush called exactly in the pipeline diagram?
          • Page 4: "This algorithm chooses to do it right after 1.b and ..."
            Should "1.b" be "2.b"?
          • Could you add page numbers and section numbers? It will make it easier to refer to the doc.
          Hairong Kuang added a comment -

          A draft of the design document is attached. Please comment. Nicholas will post a test plan.

          I will be on vacation for the next few weeks. Please do not be surprised if my response is delayed.

          Tsz Wo Nicholas Sze added a comment -

          The new append design should be compatible with the recently committed access tokens feature. See HADOOP-4359.

          stack added a comment -

          @ Hairong

          > ...do you still need to recover the lease?

          No. If readers will see data near-immediately after it is hflushed, there is no need to do the append, wait-on-lease, close, and then new-open sequence.

          > I might be wrong. But I do not think HADOOP-4379 addresses the problem that I raised. HADOOP-4379 tries to make flushed data visible to a new reader.

          To be clear, this is our use case. When we notice a process has died, a different process opens a new reader on the file the dead process had been writing (later, in your '18/May/09 05:03 PM' comment, you say new readers will see up to the last hflush -- just making sure you understand our usage).

          Hairong Kuang added a comment -

          Jim/Stack, thanks for your comments. This helps us understand our users' requirements better.

          > It doesn't have to be closed as long as the reader is able to go all the way up to the last sync made by the writer before the crash.
          Yes, all flushed data are guaranteed to be readable by a new reader. I also want you to be aware that the reader may read beyond the flushed data.

          > The file may have more bytes than when it was previously read. This is the normal case. Will this be an issue to hbase? How would this circumstance arise?
          Our idea is that not all the data received by a datanode are visible to a user. For example, if a dfs application calls flush, the data have left the client machine and are being pushed to all datanodes, but the application dies before it receives the flush ack. This flush is considered a failure. The flushed data may have reached none, one, or all datanodes, so the data are not guaranteed to be visible to a reader. But lease recovery MAY still keep the data in the file if the data have reached one or all datanodes.
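
          To make the failed-flush case concrete: an application has to treat everything written after its last successful hflush() as unacknowledged, since those bytes may have reached none, one, or all datanodes. A minimal sketch of that bookkeeping (my own illustration against the 0.21 FSDataOutputStream API, not code from this jira):

            import java.io.IOException;
            import org.apache.hadoop.fs.FSDataOutputStream;

            public class FlushTracker {
              private long lastAckedOffset = 0;   // bytes known to be visible to new readers

              /** Returns true if the record is known to be visible to new readers. */
              public boolean writeAndFlush(FSDataOutputStream out, byte[] record) {
                try {
                  out.write(record);
                  out.hflush();                   // returns only after every datanode in the pipeline has the bytes
                  lastAckedOffset = out.getPos(); // readers are now guaranteed to see up to here
                  return true;
                } catch (IOException e) {
                  // The flush was not acknowledged: bytes past lastAckedOffset may have
                  // reached none, one, or all datanodes, so treat them as uncertain.
                  return false;
                }
              }

              public long getLastAckedOffset() {
                return lastAckedOffset;
              }
            }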

          Hairong Kuang added a comment -

          Note this jira no longer uses "sync", instead we use hflush.

          The spec posted in this jira aims at API 3.

          > another process knows that the writer has crashed and needs to be able to read all the data up to the last sync()

          As I said, this spec guarantees that another process can read all the data up to the last flush without the need to close the file. We also guarantee that the flushed data won't be removed as a result of lease recovery when the file is finally closed.

          > recover the lease (immediately)
          If another process can read all the flushed data without the file being closed, do you still need to recover the lease?

          > this is what HADOOP-4379 is trying to address with Doug Cutting's comment.
          I might be wrong. But I do not think HADOOP-4379 addresses the problem that I raised. HADOOP-4379 tries to make flushed data visible to a new reader.

          stack added a comment -

          @ Hairong

          > My question is why the file needs to be closed before it is read.

          It doesn't have to be closed, as long as the reader is able to go all the way up to the last sync made by the writer before the crash.

          > Is it OK for your client to trigger the close of the file but not wait for it to close?

          Yes, as long as the reader is able to go all the way up to the last sync made....

          > (1) may have more bytes than when it was previously read. This is the normal case. Will this be an issue to hbase?

          How would this circumstance arise?

          > Note that the current implementation in 0.20 does not provide the second guarantee described above.

          In my testing of hadoop-4379, I've only been killing the writer application. I should play with killing the writer application AND the local datanode.

          Jim Kellerman added a comment -

          > Hairong Kuang added a comment - 18/May/09 03:53 PM
          > Note that the current implementation in 0.20 does not provide the second guarantee described above.
          > A byte seen by a reader becomes invisible when the replica it read from fails. Even when the replica that
          > a reader is reading from is alive, the recovery of an error caused by another replica in the pipeline may
          > temporarily remove a byte seen by a reader from dfs but recover the byte later. A removed byte may never
          > get recovered if the writer dies or the writer no longer has this byte.

          Yes, this is what HADOOP-4379 is trying to address with Doug Cutting's comment:

          > Doug Cutting added a comment - 26/Jan/09 11:20 AM
          >
          >I think it would be better to add a SequenceFile.Writer method that flushes and calls
          > FSDataOutputStream.sync. The problem is that SequenceFile.Writer#sync() should
          > probably be called flush(), not sync(), but that'd be hard to change. So perhaps we
          > should add a SequenceFile.Writer#syncFs() method?

          Jim Kellerman added a comment -

          The problem we need to solve is:

          • process writing a file crashes after doing some sync() operations
          • another process knows that the writer has crashed and needs to
            • recover the lease (immediately)
            • be able to read all the data up to the last sync()

          Of the APIs described in the email from Sanjay below, APIs 1-2 are inadequate, API3 is OK provided HDFS does not fail, and API4 works if the datanode(s) fail but not if the machine crashes. Only API5 will guarantee that we can read the data that has been sync'd.

          > From: Sanjay Radia sradia@yahoo-inc.com
          > Sent: Thursday, May 14, 2009 1:46 PM
          > To: Jim Kellerman (POWERSET)
          > Cc: Michael Stack; Chad Walters; Dhruba Borthakur; Sameer Paranjpye; Hairong Kuang; Robert Chansler
          > Subject: Re: Append, flush, sync write and HBase
          >
          >> On May 13, 2009, at 1:20 PM, Jim Kellerman (POWERSET) wrote:
          >>
          >> What we need are two things:
          >>
          >> 1. When we call sync() we want to be assured that any buffered data can be read by another process
          >
          > Actually I wanted to have a larger discussion to understand your
          > current and future requirements on append/flush/sync also on latency
          > of HDFS. I am trying to document current and future
          > requirements. Which is why I wanted to do a quick chat on the phone. I
          > will try this via email for now.
          >
          > BTW Hairong @ Y! is driving the re-implementation of append. See
          > HADOOP-5744. She has defined the semantics she is considering. Please
          > comment whether you agree or disagree.
          >
          > We are also looking at variation on semantics that may have lower
          > latencies and lesser guarantees. We would like to get your initial
          > feedback. Eventually we will update the Jira when we have semantics
          > and apis better formulated.
          >
          > Below is a list of APIs/semantics variations we are considering.
          > Which ones do you absolutely needed for HBase in the short term and
          > which ones may be useful to HBase in the longer term.
          >
          > API1: flushes out from the address space of client into the socket to the data nodes.
          >
          > On the return of the call there is no guarantee that that data is
          > out of the underlying node and no guarantee of having reached a
          > DN. Readers will see this data soon if there are no failures.
          >
          > For example, I suspect Scribe and chukwa will like the lower
          > latency of this API and are prepared to lose some records
          > occasionally in case of failures. Clearly a journal will not find
          > this api acceptable.
          >
          > API2: flushes out to at least one data node and receives an ack.
          >
          > New readers will eventually see the data
          >
          > API3: flushes out to all replicas of the block. The data is in the buffers of the DNs but not on the DN's OS buffers
          >
          > New readers will see the data after the call has returned. (Hadoop
          > 5744 calls API3 hflush for now).
          >
          > API4: flushes out to all replicas and all replica DNs have done a posix fflush equivalent - ie data is out to the underlying OS file system of the DNs
          >
          > API5: flushes out to all replicas and all replicas have done a posix fsync equivalent - ie the OS has flushed it to the disk device (but the disk may have it in its cache).
          >
          > Does the HBase edits journal require API 3, 4 or 5?
          >
          > What are your latency requirements for the write operation. For
          > example can you tolerate occasional larger latency for the
          > fflush/fsycn operation?
          >

          Hairong Kuang added a comment -

          Note that the current implementation in 0.20 does not provide the second guarantee described above. A byte seen by a reader becomes invisible when the replica it read from fails. Even when the replica that a reader is reading from is alive, the recovery of an error caused by another replica in the pipeline may temporarily remove a byte seen by a reader from DFS and then restore it later. A removed byte may never be restored if the writer dies or the writer no longer has that byte.

          Hairong Kuang added a comment -

          Dhruba told me that HBase depends on a client call to append to trigger the close of a file that has lost its writer. Once the file is closed, the client reads the file and starts to work from the state defined in the closed file.

          My question is why the file needs to be closed before it is read. The read semantics defined in this jira guarantee that
          (1) any hflushed data becomes visible to any new reader;
          (2) once a byte becomes visible to a reader, it continues to be visible to that reader except when all replicas containing the byte fail. This implies that a reader continues to see a byte it saw before even when the replica it read from fails, during any error recovery, and after any error recovery, as long as one replica containing the byte is available.

          Is it OK for your client to trigger the close of the file but not wait for it to close? The idea is to read the file and resume working before the file gets closed. When the file finally gets closed, the file
          (1) may have more bytes than when it was previously read. This is the normal case. Will this be an issue for HBase?
          (2) may end up with fewer bytes if all replicas go down between the time the close is triggered and the time the file is closed. This is a rare case. The default length of this period is 10 minutes, so the chance of losing visible bytes is very slim. Can HBase tolerate this?
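
          For illustration only (not from the design document), a rough sketch of the pattern asked about above, using the public FileSystem API: attempt an append to trigger recovery of the abandoned file, then read whatever is currently visible instead of waiting for the file to actually close. The path, buffer sizing, and error handling are placeholders.

          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataInputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class ReadWithoutWaitingForClose {
            public static void main(String[] args) throws IOException {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              Path log = new Path(args[0]);   // placeholder: the file whose writer died

              // Ask to append to the file. If the dead writer still holds the lease,
              // this throws and the NameNode starts lease/block recovery in the
              // background; if it succeeds, closing the stream closes the file.
              try {
                fs.append(log).close();
              } catch (IOException expected) {
                // Recovery has been initiated; do not block until the file is closed.
              }

              // Read whatever is currently visible and resume work immediately.
              // Note: the length reported by the NameNode may lag behind hflushed data.
              long visibleLength = fs.getFileStatus(log).getLen();
              byte[] buf = new byte[(int) Math.min(visibleLength, 64 * 1024)];
              FSDataInputStream in = fs.open(log);
              try {
                in.readFully(0, buf);
              } finally {
                in.close();
              }
              // When the file is eventually closed it may contain more bytes than were
              // read here, so the application must tolerate re-reading the tail.
            }
          }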

          stack added a comment -

          Thanks Hairong. Any comment on the unqualified 'delay' mentioned in the spec? What kind of time scale should we be thinking in? If it is an 'hour', as I believe is the current default for the lease timeout, hbase-like applications will have given up the wait long before. If it is in the tens of seconds – ~30 seconds – as I'm currently seeing in the HBASE-4379 patches I'm helping Dhruba test, then that is still too long but tolerable in a first release, as long as it can be improved on over time ("instantaneous" would be the ideal). Thanks.

          Hairong Kuang added a comment -

          Flush needs to make data visible before a block is completed. This has changed a lot of assumptions in HDFS. The previous append effort put in a tremendous amount of great work and set up a foundation for improvement. However, there are issues that seemed trivial initially but turned out to need a thorough design. HADOOP-4379, 4663, 5027, 5133, and 4692, etc. were filed as bugs but are really caused by a lack of design. This issue aims at a design which solves them all. It is targeted for 0.21.

          If no hflush is called, the DFS client pushes data to the datanodes only when a packet is filled up. The packet size is configurable, with a default size of 64K.
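
          For illustration, a minimal sketch of that difference, assuming the 0.21 Syncable API on FSDataOutputStream (hflush()); the path is a placeholder:

          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class HflushExample {
            public static void main(String[] args) throws IOException {
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              FSDataOutputStream out = fs.create(new Path("/tmp/append-demo"));

              out.write("record 1\n".getBytes("UTF-8"));
              // Without hflush, these bytes may still sit in the client's current
              // packet (default 64K) and are not yet pushed to the datanodes.

              out.hflush();
              // After hflush returns, everything written so far has reached the
              // datanodes in the pipeline and is visible to new readers, although
              // it is not guaranteed to have hit the disk.

              out.write("record 2\n".getBytes("UTF-8"));
              out.close();   // close completes the block: the stronger durability case
            }
          }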

          stack added a comment -

          A few comments and questions:

          The document is without context. It is undated and has no author. In particular, I'd think there'd be some positioning of this document in relation to previous append work (Is hadoop-4379, etc., dead? Are we starting over, as the name of this issue portends? If starting over, is the 0.21 fix version optimistic?).

          What kind of 'delay' are we talking about here, do you think, in the section Read, item #4? "If a writer/appender does not call hflush, a reader can still progressively see previously written data with a delay"

          Konstantin Boudnik added a comment - edited

          Hairong,

          I have a couple of comments about the semantics of two clauses of the document:
          1. DFS provides "best effort durability" to data in an unclosed file:

          • ... If no replica of a block being written is available, the write fails

          While it really does sound like best effort, it has some implications from the standpoint of deterministic testing of the feature. In this test case, shall we always expect a negative result and dismiss occasional positive results as non-significant? Shall we apply some kind of heuristic to judge the test case's execution results?

          2. DFS provides "strong durability" to data in a closed file:

          • ... However, file close doesn't guarantee that data has hit the disk. If data does not hit the disk, restarting a DataNode may cause the loss of the data.

          While I totally understand the somewhat non-deterministic behavior of this feature (e.g. a hardware failure might happen in the very instant between a file's close invocation and actual data storage), I still have a concern or two about the testing perspective of it. As I understand it, the synchronous close() call returns when all DNs have reported a successful flush for their respective blocks of the file. However, and this is where non-determinism kicks in, a particular DN's OS might be slow in flushing its own IO buffers and writing the data to disk, and this is where the actual data loss could happen in case of a sudden restart of the DN. From the testing perspective this creates certain issues, as follows:

          • let's assume that a test client appends a file in DFS
          • at some point it calls close() and waits for its return
          • let's suppose the test client runs in a distributed harness where another piece of the test can simulate a failure on a DN where a replica of the file is located, and the timing of the failure happens to be right after this very DN reported the success of the flush() call. I.e. our assumed test harness can guarantee that the data hasn't hit the disk. However, our test client will see a successful return from the close() method and report test success respectively. So, I'm not sure how to deal with this situation either.
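
          For the deterministically testable part of these semantics (hflushed bytes staying visible while at least one replica is alive), a rough, hypothetical sketch of a single-process check against MiniDFSCluster follows; class and method names are assumed from the 0.21 test code, and the failure-right-after-close scenario above would still need an external fault-injection harness.

          import java.io.IOException;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataInputStream;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hdfs.MiniDFSCluster;

          public class HflushVisibilityCheck {
            public static void main(String[] args) throws IOException {
              Configuration conf = new Configuration();
              MiniDFSCluster cluster = new MiniDFSCluster(conf, 3, true, null);
              try {
                cluster.waitActive();
                FileSystem fs = cluster.getFileSystem();
                Path p = new Path("/test/hflush-visibility");

                FSDataOutputStream out = fs.create(p, (short) 3);
                byte[] data = "hflushed bytes".getBytes("UTF-8");
                out.write(data);
                out.hflush();   // the bytes should now be visible to new readers

                // Simulate the failure of one datanode in the pipeline.
                cluster.stopDataNode(0);

                // A new reader must still see every hflushed byte, because at least
                // one replica containing them is still alive.
                byte[] read = new byte[data.length];
                FSDataInputStream in = fs.open(p);
                try {
                  in.readFully(0, read);
                } finally {
                  in.close();
                }
                if (!new String(read, "UTF-8").equals("hflushed bytes")) {
                  throw new AssertionError("hflushed data not visible after DN failure");
                }
                out.close();   // the writer's pipeline is expected to recover from the failed DN
              } finally {
                cluster.shutdown();
              }
            }
          }
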
          Hairong Kuang added a comment -

          Attached is the first draft of the append specification. Please comment on whether this specification meets your requirements.


            People

            • Assignee:
              Hairong Kuang
            • Reporter:
              Hairong Kuang
            • Votes:
              5
            • Watchers:
              58
