Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-6450

Support non-positional hedged reads in HDFS

    Details

    • Type: Improvement Improvement
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Target Version/s:

      Description

      HDFS-5776 added support for hedged positional reads. We should also support hedged non-position reads (aka regular reads).

        Issue Links

          Activity

          Hide
          Liang Xie added a comment -

          will dive into it a couple of days later.

          Show
          Liang Xie added a comment - will dive into it a couple of days later.
          Hide
          Liang Xie added a comment -

          Colin Patrick McCabe Stack I made a prototype already(still adding more test cases), but i think we need to clear one thing right now: how to handle the block reader for read().
          In current impl, the read() maintains a blockreader to reuse for furture read calls, and pread() will create a new blockreader in each call always.
          So if we want to enhance read() to have hedged read ability, should we assign the first completed read request's block reader to above maintained blockreader variable?(i guess most of situations, this behavior probably will be like change the local block reader to a remote block reader?)

          Show
          Liang Xie added a comment - Colin Patrick McCabe Stack I made a prototype already(still adding more test cases), but i think we need to clear one thing right now: how to handle the block reader for read(). In current impl, the read() maintains a blockreader to reuse for furture read calls, and pread() will create a new blockreader in each call always. So if we want to enhance read() to have hedged read ability, should we assign the first completed read request's block reader to above maintained blockreader variable?(i guess most of situations, this behavior probably will be like change the local block reader to a remote block reader?)
          Hide
          Liang Xie added a comment -

          Attached is a preliminary patch similar with pread() doing, means creating block reader always. It should be the simiplest way, but not very friendly for read(), since no block reader be reusable...

          Show
          Liang Xie added a comment - Attached is a preliminary patch similar with pread() doing, means creating block reader always. It should be the simiplest way, but not very friendly for read(), since no block reader be reusable...
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12654295/HDFS-6450-like-pread.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7285//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/7285//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7285//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12654295/HDFS-6450-like-pread.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. -1 findbugs . The patch appears to introduce 4 new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7285//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/7285//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7285//console This message is automatically generated.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12654948/HDFS-6450-like-pread.txt
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.namenode.TestProcessCorruptBlocks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7320//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7320//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12654948/HDFS-6450-like-pread.txt against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.TestProcessCorruptBlocks +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7320//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7320//console This message is automatically generated.
          Hide
          Liang Xie added a comment -

          After a deep looking, it's kind of hard to reuse/maintain block reader as before.
          In pread(), we don't have this trouble, because we always create new block reader.
          In read(), if we want to support hedged read ability, in general:
          1) first read(r1) using the old block reader if possible, then wait hedged read timeout setting
          2) second read(r2) must create a new block reader, and submit into thread pool
          3) wait the first completed task, and return final read result to client side. Here we need to set(remember) this task's block reader to DFIS's block reader variable, and should keep it open, but we also need to close the other block reader to avoid leak.

          Another thing need to know is that if we remember the faster block reader, if it's a remote block reader, then the following read() will bypass local read in the following r1 operations...
          Any thought ? Colin Patrick McCabe, Stack ...

          Show
          Liang Xie added a comment - After a deep looking, it's kind of hard to reuse/maintain block reader as before. In pread(), we don't have this trouble, because we always create new block reader. In read(), if we want to support hedged read ability, in general: 1) first read(r1) using the old block reader if possible, then wait hedged read timeout setting 2) second read(r2) must create a new block reader, and submit into thread pool 3) wait the first completed task, and return final read result to client side. Here we need to set(remember) this task's block reader to DFIS's block reader variable, and should keep it open, but we also need to close the other block reader to avoid leak. Another thing need to know is that if we remember the faster block reader, if it's a remote block reader, then the following read() will bypass local read in the following r1 operations... Any thought ? Colin Patrick McCabe , Stack ...
          Hide
          Colin Patrick McCabe added a comment -

          Yeah... if the first read, done with the old block reader, doesn't finish first, then I think we should forget about that block reader... even if it's a BlockReaderLocal and the new one is remote. After all, slow or misbehaving local disks are one of the main problems we're trying to cover up with hedged reads. As you pointed out, we need to close the old block reader in this situation.

          I think from an implementation point of view, you might consider unsetting DFSInputStream#blockReader at the beginning of the hedged read (set to null and move the old block reader into the first future). Then have the "winning" future set it to the block reader that it used.

          There will be some performance regression just due to using threads and futures here, instead of just doing the read from the same thread that needs the data. Passing data between threads is slower because you might be going between CPU cores. I don't know if there's really a good way to address this without doing something like HDFS-6695, which is out of scope for this change. I think in the short term, applications will have to turn on non-positional hedged reads explicitly, and accept some small loss in throughput for a major reduction in long-tail latency.

          Show
          Colin Patrick McCabe added a comment - Yeah... if the first read, done with the old block reader, doesn't finish first, then I think we should forget about that block reader... even if it's a BlockReaderLocal and the new one is remote. After all, slow or misbehaving local disks are one of the main problems we're trying to cover up with hedged reads. As you pointed out, we need to close the old block reader in this situation. I think from an implementation point of view, you might consider unsetting DFSInputStream#blockReader at the beginning of the hedged read (set to null and move the old block reader into the first future). Then have the "winning" future set it to the block reader that it used. There will be some performance regression just due to using threads and futures here, instead of just doing the read from the same thread that needs the data. Passing data between threads is slower because you might be going between CPU cores. I don't know if there's really a good way to address this without doing something like HDFS-6695 , which is out of scope for this change. I think in the short term, applications will have to turn on non-positional hedged reads explicitly, and accept some small loss in throughput for a major reduction in long-tail latency.
          Hide
          Liang Xie added a comment -

          I think in the short term, applications will have to turn on non-positional hedged reads explicitly, and accept some small loss in throughput for a major reduction in long-tail latency.

          yes, it's a trade-off

          Show
          Liang Xie added a comment - I think in the short term, applications will have to turn on non-positional hedged reads explicitly, and accept some small loss in throughput for a major reduction in long-tail latency. yes, it's a trade-off
          Hide
          stack added a comment -

          I like Colin Patrick McCabe's take and his suggestion on unsetting current blockReader while the race is on.

          In the patch, nit, you have to make two instances of Random? Can't share?

          How is testHedgedReadBasic exercising the new functionality? Or is it just verifying nothing broke when it is turned on?

          readWithStrategy has synchronized added (you've moved the synchronize down a level it seems) but hedgedReadWithStrategy is not synchronized. Makes for interesting synchronizes inside hedgedReadWithStrategy. Could do w/ comments explaining what is going on. For example, the below is a little baffling until I look closely and see pos accesses are always inside synchronized methods (I think).

          + synchronized (this) {
          + if (pos >= getFileLength())

          { + return -1; + }

          + }

          I took a look at this seems to be the only data member that needs synchronize protection in this method. Any danger if someone else changes it under you while this method is running? (between sync blocks)? And it is ok having multiple threads inside hedgedReadWithStrategy at the one time when say readWithStrategy doesn't allow this to happen?

          I like your use of a an old english catchall when problem WARN logging a hedgedReadWithStrategy return. Might want to change that in v2 (smile).

          Lots of overlap w/ the pread version. Would be coolio if could have the two methods share code.

          Thanks for working on this one Liang Xie

          Show
          stack added a comment - I like Colin Patrick McCabe 's take and his suggestion on unsetting current blockReader while the race is on. In the patch, nit, you have to make two instances of Random? Can't share? How is testHedgedReadBasic exercising the new functionality? Or is it just verifying nothing broke when it is turned on? readWithStrategy has synchronized added (you've moved the synchronize down a level it seems) but hedgedReadWithStrategy is not synchronized. Makes for interesting synchronizes inside hedgedReadWithStrategy. Could do w/ comments explaining what is going on. For example, the below is a little baffling until I look closely and see pos accesses are always inside synchronized methods (I think). + synchronized (this) { + if (pos >= getFileLength()) { + return -1; + } + } I took a look at this seems to be the only data member that needs synchronize protection in this method. Any danger if someone else changes it under you while this method is running? (between sync blocks)? And it is ok having multiple threads inside hedgedReadWithStrategy at the one time when say readWithStrategy doesn't allow this to happen? I like your use of a an old english catchall when problem WARN logging a hedgedReadWithStrategy return. Might want to change that in v2 (smile). Lots of overlap w/ the pread version. Would be coolio if could have the two methods share code. Thanks for working on this one Liang Xie
          Hide
          Liang Xie added a comment -

          How is testHedgedReadBasic exercising the new functionality? Or is it just verifying nothing broke when it is turned on?

          It was not finished yet, this patch is a preliminary one, and i hold on due to the block reader stuff, i think i have got a feasible direction per Colin's last comments, i'll continue to do the rest asap

          Show
          Liang Xie added a comment - How is testHedgedReadBasic exercising the new functionality? Or is it just verifying nothing broke when it is turned on? It was not finished yet, this patch is a preliminary one, and i hold on due to the block reader stuff, i think i have got a feasible direction per Colin's last comments, i'll continue to do the rest asap
          Hide
          Zhe Zhang added a comment -

          In the erasure coding project we need to implement the logic of reading from a striped file (HDFS-7782). Hedged reading is a natural fit to support striping layout, because each I/O is likely to cover multiple DNs. So I'd like to dig a little deeper on this JIRA to see if hedged non-positional read should be added to DFIS or only the subclass for erasure coding.

          Yeah... if the first read, done with the old block reader, doesn't finish first, then I think we should forget about that block reader... even if it's a BlockReaderLocal and the new one is remote. After all, slow or misbehaving local disks are one of the main problems we're trying to cover up with hedged reads. As you pointed out, we need to close the old block reader in this situation.

          Colin Patrick McCabe I like the idea of reusing the blockReader at first Future, and updating it to the winning reader. The only concern is we'll never come back to the local reader if it goes slow once. Should we give it another chance?

          Show
          Zhe Zhang added a comment - In the erasure coding project we need to implement the logic of reading from a striped file ( HDFS-7782 ). Hedged reading is a natural fit to support striping layout, because each I/O is likely to cover multiple DNs. So I'd like to dig a little deeper on this JIRA to see if hedged non-positional read should be added to DFIS or only the subclass for erasure coding. Yeah... if the first read, done with the old block reader, doesn't finish first, then I think we should forget about that block reader... even if it's a BlockReaderLocal and the new one is remote. After all, slow or misbehaving local disks are one of the main problems we're trying to cover up with hedged reads. As you pointed out, we need to close the old block reader in this situation. Colin Patrick McCabe I like the idea of reusing the blockReader at first Future, and updating it to the winning reader. The only concern is we'll never come back to the local reader if it goes slow once. Should we give it another chance?
          Hide
          Zhe Zhang added a comment -

          I just posted a patch under HDFS-7782. It makes some changes to DFSInputStream that should partially support this JIRA too:

          1. A small HedgedReadResult class to return both a result buffer and a generated BlockReader which should be used to adjust the maintained blockBreader
          2. In hedgedFetchBlockByteRange, return the reader from the fastest Future
          3. That JIRA has the hedged read logic in readBuffer() instead of readWithStrategy level. I'm not so sure which option is better though.

          Liang Xie I wonder if you still plan to continue this work? In either case I'm happy to help complete this JIRA.

          Show
          Zhe Zhang added a comment - I just posted a patch under HDFS-7782 . It makes some changes to DFSInputStream that should partially support this JIRA too: A small HedgedReadResult class to return both a result buffer and a generated BlockReader which should be used to adjust the maintained blockBreader In hedgedFetchBlockByteRange , return the reader from the fastest Future That JIRA has the hedged read logic in readBuffer() instead of readWithStrategy level. I'm not so sure which option is better though. Liang Xie I wonder if you still plan to continue this work? In either case I'm happy to help complete this JIRA.
          Hide
          Liang Xie added a comment -

          Thanks Zhe Zhang for the comments, i am not active in this area recently, please go ahead

          Show
          Liang Xie added a comment - Thanks Zhe Zhang for the comments, i am not active in this area recently, please go ahead
          Hide
          Colin Patrick McCabe added a comment -

          I think it would be possible to support hedged non-positional reads in BlockReaderLocal, but difficult. First we would have to stop re-using the same FD for all instances of a BlockReaderLocal that were reading the same replica. Perhaps we could use dup to create a new FD per blockreader without doing multiple opens. Then we could close the blockreader FD if the local read were being slow.

          I think it's much easier to just implement hedged non-positional reads in the erasure coding-specific subclass of DFSInputStream.

          I also think we may want to create a base class for DFSInputStream that both the raid and the non-raid code path inherit from. Inheriting from the non-raid code path is weird because there is a lot of stuff that is not relevant.

          Show
          Colin Patrick McCabe added a comment - I think it would be possible to support hedged non-positional reads in BlockReaderLocal , but difficult. First we would have to stop re-using the same FD for all instances of a BlockReaderLocal that were reading the same replica. Perhaps we could use dup to create a new FD per blockreader without doing multiple opens. Then we could close the blockreader FD if the local read were being slow. I think it's much easier to just implement hedged non-positional reads in the erasure coding-specific subclass of DFSInputStream. I also think we may want to create a base class for DFSInputStream that both the raid and the non-raid code path inherit from. Inheriting from the non-raid code path is weird because there is a lot of stuff that is not relevant.
          Hide
          Zhe Zhang added a comment -

          Thanks for the suggestion Colin. Hedged pread is already handling BlockReaderLocal (by wrapping it as a Future). I guess it's reasonable to do the same for non-positional read too?

          Is it correct to understand hedged vs. non-hedged and positional vs. non-positional as orthogonal dimensions? If so, the only new requirement in hedged non-positional read is to utilize and maintain the states (pos, blockReader).

          I'm still getting myself around this complex reader code. So please let me know if I missed something.

          Show
          Zhe Zhang added a comment - Thanks for the suggestion Colin. Hedged pread is already handling BlockReaderLocal (by wrapping it as a Future). I guess it's reasonable to do the same for non-positional read too? Is it correct to understand hedged vs. non-hedged and positional vs. non-positional as orthogonal dimensions? If so, the only new requirement in hedged non-positional read is to utilize and maintain the states (pos, blockReader). I'm still getting myself around this complex reader code. So please let me know if I missed something.
          Hide
          Andrew Purtell added a comment -

          I think it's much easier to just implement hedged non-positional reads in the erasure coding-specific subclass of DFSInputStream.

          Would be a (small) shame though if to get hedged non-positional reads one has to bring in erasure coding.

          Show
          Andrew Purtell added a comment - I think it's much easier to just implement hedged non-positional reads in the erasure coding-specific subclass of DFSInputStream. Would be a (small) shame though if to get hedged non-positional reads one has to bring in erasure coding.
          Hide
          Colin Patrick McCabe added a comment -

          Would be a (small) shame though if to get hedged non-positional reads one has to bring in erasure coding.

          I understand the frustration. It would certainly be nice to have hedged non-positional reads, for HBase and other things. The problem that we have always had with hedged non-positional reads is that they don't work well (or really at all) with short-circuit reads. We simply don't have any way to interrupt a blocking read from a file descriptor, other than closing the file descriptor. And closing the FD will also disrupt any other clients that are reading from that FD (the short-circuit code shares FDs across multiple readers).

          This problem doesn't exist for erasure encoding because erasure encoding doesn't support short-circuit.

          Yes, the problem can be solved by doing the read in a Future, but the overhead of passing data across threads (and hence across CPU caches, much of the time) would be a significant performance regression.

          Show
          Colin Patrick McCabe added a comment - Would be a (small) shame though if to get hedged non-positional reads one has to bring in erasure coding. I understand the frustration. It would certainly be nice to have hedged non-positional reads, for HBase and other things. The problem that we have always had with hedged non-positional reads is that they don't work well (or really at all) with short-circuit reads. We simply don't have any way to interrupt a blocking read from a file descriptor, other than closing the file descriptor. And closing the FD will also disrupt any other clients that are reading from that FD (the short-circuit code shares FDs across multiple readers). This problem doesn't exist for erasure encoding because erasure encoding doesn't support short-circuit. Yes, the problem can be solved by doing the read in a Future, but the overhead of passing data across threads (and hence across CPU caches, much of the time) would be a significant performance regression.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Moving features/enhancements out of previously closed releases into the next minor release 2.8.0.

          Show
          Vinod Kumar Vavilapalli added a comment - Moving features/enhancements out of previously closed releases into the next minor release 2.8.0.

            People

            • Assignee:
              Liang Xie
              Reporter:
              Colin Patrick McCabe
            • Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:

                Development