Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.1
    • Fix Version/s: 2.7.0
    • Component/s: hdfs-client
    • Labels:
      None

      Description

Reviews of the patch posted on the parent task suggest that we be more explicit about how DFSIS is expected to behave when read by contending threads. It was also suggested that presumptions made internally be made explicit by documenting expectations.

Before we put up a patch, we've made a document of assertions we'd like to make into tenets of DFSInputStream. If there is agreement, we'll attach to this issue a patch that weaves the assumptions into DFSIS as javadoc and class comments.

      1. HDFS-6803v3.txt
        5 kB
        stack
      2. HDFS-6803v2.txt
        4 kB
        stack
      3. fsdatainputstream.md.v3.html
        45 kB
        stack
      4. DocumentingDFSClientDFSInputStream (1).pdf
        113 kB
        stack
      5. DocumentingDFSClientDFSInputStream.v2.pdf
        131 kB
        stack
      6. 9117.md.txt
        2 kB
        stack

        Issue Links

          Activity

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #18 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/18/)
          HDFS-6803 Document DFSClient#DFSInputStream expectations reading and preading in concurrent context. (stack via stevel) (stevel: rev aa7dac335960950d2254a5a78bd1f0786a290538)

          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1946 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1946/)
          HDFS-6803 Document DFSClient#DFSInputStream expectations reading and preading in concurrent context. (stack via stevel) (stevel: rev aa7dac335960950d2254a5a78bd1f0786a290538)

          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #756 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/756/)
          HDFS-6803 Document DFSClient#DFSInputStream expectations reading and preading in concurrent context. (stack via stevel) (stevel: rev aa7dac335960950d2254a5a78bd1f0786a290538)

          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #18 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/18/)
          HDFS-6803 Document DFSClient#DFSInputStream expectations reading and preading in concurrent context. (stack via stevel) (stevel: rev aa7dac335960950d2254a5a78bd1f0786a290538)

          • hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          stevel@apache.org Steve Loughran added a comment -

          hang on to it until I need you to vote up something of mine

          stack stack added a comment -

          Steve Loughran Where do I send the $1.00?

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #17 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/17/)
          HDFS-6803 Document DFSClient#DFSInputStream expectations reading and preading in concurrent context. (stack via stevel) (stevel: rev aa7dac335960950d2254a5a78bd1f0786a290538)

          • hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md
          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1969 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1969/)
          HDFS-6803 Document DFSClient#DFSInputStream expectations reading and preading in concurrent context. (stack via stevel) (stevel: rev aa7dac335960950d2254a5a78bd1f0786a290538)

          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #6609 (See https://builds.apache.org/job/Hadoop-trunk-Commit/6609/)
          HDFS-6803 Document DFSClient#DFSInputStream expectations reading and preading in concurrent context. (stack via stevel) (stevel: rev aa7dac335960950d2254a5a78bd1f0786a290538)

          • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md
          stevel@apache.org Steve Loughran added a comment -

          committed patch -thanks!

          stevel@apache.org Steve Loughran added a comment -

          sorry, been at apachecon and neglecting that jira-list.

          +1 from me

          ps

          Would it be better marking those fs that do NOT support ConcurrentPositionalReads? (Probably not – too much work).

I think ultimately we may want to make the contract XML files describing filesystems more visible, so you can see whether an FS is case sensitive, raises EOF on out-of-bounds seeks, etc. But having all that information would only complicate code, wouldn't it? You don't want to start looking for the specific features of the FS you are working with. Simpler to say "assume it is HDFS or something which acts like it" and be done.

I am planning to add a Blobstore tag to object stores though, to highlight that they may be way off what HDFS offers, especially in the consistency model (which would then be listed per-instance to let callers understand what's going to be unreliable)

          stack stack added a comment -

          Steve Loughran Hey boss, $1.00 for a review?

          cmccabe Colin P. McCabe added a comment -

          +1 from me. Steve Loughran, what do you think?

          Maybe we can open another JIRA to have a longer discussion about whether input streams should be thread-safe (I think they should)

          stack stack added a comment -

Patch that goes more in the direction Colin P. McCabe suggests. It doesn't go all the way (I am afraid of getting my wrist slapped for overstepping). Hopefully it's clear on what's guaranteed. I've also attached the markup rendered as an html doc to make it easier on reviewers. Thanks.

          cmccabe Colin P. McCabe added a comment -

          I'm having trouble reconciling the idea that input streams "are not thread-safe" with the idea that multiple positional reads can be going on in parallel. It seems like if clients are going to have multiple pread calls in flight, they are counting on thread-safety. Maybe we can say that streams which implement PositionedReadable are thread-safe?

Are there still Hadoop FileSystem implementations out there that have input streams that are not thread-safe? That seems like a recipe for broken code that runs on HDFS but not on other FSes. It seems like if we do have any such FS implementations, they could be fixed pretty easily by putting synchronized on the methods.
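The "putting synchronized on the methods" fix can be sketched with a toy seekable stream (hypothetical class, not Hadoop's actual code): the seek+read+restore pread below mutates the shared position, so it is correct under contention only because every method holds the object lock.

```java
// Toy seekable stream, a sketch of the "synchronized on the methods" fix.
// All state-touching methods hold the object lock, so concurrent preads are
// serialized but correct.
public class SynchronizedStream {
    private final byte[] data;
    private long pos = 0;

    SynchronizedStream(byte[] data) { this.data = data; }

    synchronized long getPos() { return pos; }

    synchronized int read(byte[] dest, int off, int len) {
        int n = Math.min(len, data.length - (int) pos);
        if (n <= 0) return -1;   // EOF
        System.arraycopy(data, (int) pos, dest, off, n);
        pos += n;
        return n;
    }

    // seek+read+restore pread: safe only because the whole method holds the
    // lock; without "synchronized", a concurrent pread could move pos mid-call
    // and the wrong bytes would be returned.
    synchronized int read(long position, byte[] dest, int off, int len) {
        long oldPos = pos;
        try {
            pos = position;
            return read(dest, off, len);
        } finally {
            pos = oldPos;   // getPos() is unchanged once the call returns
        }
    }

    public static void main(String[] args) {
        SynchronizedStream s = new SynchronizedStream("hello world".getBytes());
        byte[] buf = new byte[5];
        s.read(6, buf, 0, 5);
        System.out.println(new String(buf) + " / getPos=" + s.getPos()); // world / getPos=0
    }
}
```

This is the "serialized but correct" sense of thread-safe discussed here: two preads never corrupt each other, but they also never run in parallel.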

          stack stack added a comment -

          Am I in the right ballpark? Thanks (Need license to hack on dfsinputstream to make it more 'live' – thanks).

          stack stack added a comment -

          Thanks Steve Loughran. Here is v2.

          stevel@apache.org Steve Loughran added a comment -

          maybe we should say

          MUST be consistent with serialized operations
          SHOULD be concurrent

          What we really want is for two parallel operations to always produce the right data; concurrency boosts throughput, but is not guaranteed

           read(pos1, dest, len) ->
            dest[0..len-1] = [data(FS, path, pos1), data(FS, path, pos1+1) ... data(FS, path, pos1+len-1)]

          and read(pos2, dest2, len2) does the same for pos2..pos2+len2-1

          This defines the isolation; the SHOULD/MAY sets the policy.
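The postcondition above can be checked mechanically; a minimal sketch (toy in-memory data standing in for data(FS, path, pos), hypothetical names) that asserts a pread returns exactly the bytes at pos..pos+len-1:

```java
import java.util.Arrays;

public class PreadPostcondition {
    // Toy stand-in for data(FS, path, pos): the file's contents as a byte array.
    static byte[] fileData = new byte[]{10, 20, 30, 40, 50, 60, 70, 80};

    // A pread that honours the postcondition: dest[0..len-1] = data[pos..pos+len-1].
    static void pread(long pos, byte[] dest, int len) {
        System.arraycopy(fileData, (int) pos, dest, 0, len);
    }

    public static void main(String[] args) {
        byte[] dest = new byte[3];
        pread(2, dest, 3);
        // The postcondition from the comment, checked directly: whatever else
        // runs concurrently, this slice must match the file data.
        if (!Arrays.equals(dest, Arrays.copyOfRange(fileData, 2, 5))) {
            throw new AssertionError("pread violated its postcondition");
        }
        System.out.println("postcondition holds");
    }
}
```

The check depends only on (pos, len) and the file contents, which is the isolation property: two preads at pos1 and pos2 can each be validated independently.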

          stack stack added a comment -

          Oh, Steve Loughran feedback appreciated too....

          stack stack added a comment -

          Adding tenets 2.1 and 2.2. to fsdatainputstream.md

Changes the voice of the PositionedReadable notes so that NOT being thread-safe is presented as a problem, rather than, as before, thread-safety seeming like an inappropriate label.

          Steve Loughran A few pointers please sir. What else would you like me do in this area?

          stack stack added a comment -

          Need license to improve pread so back here again (pardon the absence).

Attached v2 of the doc addressing the nice feedback. Dropped assertion 2.3 on file length after Colin P. McCabe's feedback above ("too much" and "too little"), so now we have just the assertions "2.1. Positional read and non-positional read can run concurrently" and "2.2. Two or more positional reads can run concurrently"

On adding a ConcurrentPositionalReads marker interface, I'd say it's probably too late for that now; we've always assumed we could do concurrent positional reads and it sort of works (slowly). Would it be better marking those fs that do NOT support ConcurrentPositionalReads? (Probably not – too much work).

          I think stack's point 2.1 and 2.2 imply that pread can safely be called from multiple threads concurrently. I guess we should document this too so that there's no confusion.

          This is in the attached doc. Will forward port it.

          cmccabe Colin P. McCabe added a comment -

Should we require that implementations of FSInputStream be thread-safe (or at least PositionedReadable)? And modify some implementations, such as WebHDFS, to make pread concurrent?

          I think stack's point 2.1 and 2.2 imply that pread can safely be called from multiple threads concurrently. I guess we should document this too so that there's no confusion.

          hitliuyi Yi Liu added a comment -

          It seems like we are all in agreement on Stack's point 2.1 (positional read and non-positional can run concurrently), and 2.2 (two or more positional reads can run concurrently)

          Yes.

Should we require that implementations of FSInputStream be thread-safe (or at least PositionedReadable)? And modify some implementations, such as WebHDFS, to make pread concurrent?

          There's always the strategy of adding a marker interface, say ConcurrentPositionalReads, which indicates the operations are concurrent.

Maybe using the existing PositionedReadable is OK, and we keep the concurrency guarantee in the javadoc as we do now?

          stack stack added a comment -

Let me write up a proposal for 2.1 and 2.2, folding in the good stuff from the above discussion, when I get back to town. Thanks lads.

          cmccabe Colin P. McCabe added a comment -

          S3 might have a seek+read+seek implementation of pread right now, but that doesn't mean that it has to be that way forever. If it had something like HDFS's BlockReader abstraction, it could easily support preads which didn't affect the output of getPos, and preads that were concurrent. The s3 protocol itself lets you start reading an object at any offset you want.

          It seems like we are all in agreement on Stack's point 2.1 (positional read and non-positional can run concurrently), and 2.2 (two or more positional reads can run concurrently)? Perhaps we should just document those and move some of the other discussions to follow-up JIRAs. This is an important performance improvement for HBase and it seems like we have consensus on the concurrent pread issue at least.

          There's always the strategy of adding a marker interface, say ConcurrentPositionalReads, which indicates the operations are concurrent. Would it help HBase if this were added and looked for?

          Interesting, that might help use HBase on s3 or other alternate fses...
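Reading an object at any offset, as the S3 protocol allows, can be sketched as a pread that carries its own position and touches no shared stream state, so calls can proceed fully in parallel (toy code with hypothetical names; the real ranged-GET or BlockReader plumbing is omitted):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentPread {
    static final byte[] data = "0123456789abcdef".getBytes();

    // pread with a per-call position: no shared cursor, no lock, so any number
    // of threads can call it at once (analogous to an HTTP range GET).
    static int pread(long position, byte[] dest, int off, int len) {
        int n = Math.min(len, data.length - (int) position);
        if (n <= 0) return -1;   // EOF
        System.arraycopy(data, (int) position, dest, off, n);
        return n;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Two preads in flight at once; neither observes the other.
        Future<byte[]> a = pool.submit(() -> {
            byte[] buf = new byte[4];
            pread(0, buf, 0, 4);
            return buf;
        });
        Future<byte[]> b = pool.submit(() -> {
            byte[] buf = new byte[4];
            pread(10, buf, 0, 4);
            return buf;
        });
        System.out.println(new String(a.get()) + " " + new String(b.get())); // 0123 abcd
        pool.shutdown();
    }
}
```

Because there is no shared position, getPos() of the owning stream is trivially unaffected, which is the property HDFS's BlockReader-style pread provides and seek+read+seek implementations lack.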

          stevel@apache.org Steve Loughran added a comment -

          +1 to not making non-positional reads concurrent. It confuses the notion of "whereness".

HDFS is probably the least-likely-to-fail input stream; the long-haul ones are different: s3n:// etc. These also appear to be the ones with seek/read/seek implementations, where they often go as far as closing the remote input stream on the seek (or the following read) and reconnecting with a different range on the GET.

HADOOP-9565 proposes a blobstore tag on s3n, swift, avs, google storage, which could be taken as a warning that getPos() may show intermediate steps during positioned reads. The other filesystems in the hadoop source tree could be reviewed to ensure isolation and re-entrancy.

          WebHDFS is a special case; it behaves like an object store on positioned read.

          There's always the strategy of adding a marker interface, say ConcurrentPositionalReads, which indicates the operations are concurrent. Would it help HBase if this were added and looked for? Or does HBase need to care that on some filesystems the calls block? Because if it's just an optimisation, it's a detail they can rely on for HDFS, and something they can hope for from external (true) filesystems.

          cmccabe Colin P. McCabe added a comment -

Steve Loughran: Hmm. I don't see much advantage in making non-positional reads concurrent. When two threads do non-positional reads, they inherently interfere with each other by modifying the position. So concurrent non-positional reads would not be very useful for most programmers, since you would basically not know what offset your read was starting at. It would depend on the peculiarities of thread timing.

          Concurrent positional reads (preads) are useful precisely because they don't have this problem. You're not sharing a stream position with any other thread, so you know what you're getting with your pread.

          I think if we do allow concurrent non-positional reads, we should also document that this is optional, and that the stream never reads the same byte offset more than once.

          getPos() may block for an arbitrary amount of time if another thread is attempting to perform a positioned read and is having some problem communicating with the far end. Is that something we really want? Is it something people expect?

HDFS has had this behavior for a long time. I checked back in Hadoop 0.20 and the synchronized is there on getPos and read. I would be OK with getPos returning the position locklessly (perhaps from an AtomicLong?) but to my knowledge, nobody has ever requested that we change this.

pread should never affect the output of getPos. That would go against the basic guarantee of positional read: that it doesn't alter the current stream position. It doesn't really help FSes that implement pread as seek+read+seek, either. Those filesystems have a basic problem, the inability to do concurrent preads, that weakening the getPos guarantee can't possibly solve. The real solution is to add a better pread implementation to those filesystems.

          (I do not think that concurrent pread should be required of all hadoop FSes, but it should be highly encouraged for all implementors. And implementing pread as seek+read+seek should be highly discouraged)

I like the idea of saying that operations in "group P" (read, seek, skip, zero-copy read, releaseBuffer) can block each other, and every other operation is asynchronous. I think that fits the needs of HBase, MR, and other clients very well; what do you think?
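The lockless getPos mentioned above could look like this (a sketch, not HDFS's actual code): the cursor lives in an AtomicLong, so getPos is a single atomic load and never waits on a read that is stuck talking to the far end.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: stream position kept in an AtomicLong so getPos() is lock-free,
// while read() still serializes under the object lock.
public class LocklessPosStream {
    private final byte[] data;
    private final AtomicLong pos = new AtomicLong(0);

    LocklessPosStream(byte[] data) { this.data = data; }

    // No synchronization needed: a single atomic load, so getPos() cannot
    // block behind a slow read.
    long getPos() { return pos.get(); }

    synchronized int read(byte[] dest, int off, int len) {
        int p = (int) pos.get();
        int n = Math.min(len, data.length - p);
        if (n <= 0) return -1;   // EOF
        System.arraycopy(data, p, dest, off, n);
        pos.addAndGet(n);        // publish the new position atomically
        return n;
    }

    public static void main(String[] args) {
        LocklessPosStream s = new LocklessPosStream(new byte[]{1, 2, 3, 4});
        byte[] buf = new byte[2];
        s.read(buf, 0, 2);
        System.out.println("pos after read = " + s.getPos()); // pos after read = 2
    }
}
```

A caller may observe a position mid-update relative to other stream state, which is the trade-off against the fully synchronized getPos HDFS has shipped since 0.20.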

          stevel@apache.org Steve Loughran added a comment -

          I'm very tempted to push for a lax model which says

1. either the preads block, or they are concurrent if the FS supports it
          2. preads and classic (read@pos) reads may be concurrent or blocking
          3. getPos() may expose the position of a positioned read

I know #3 goes up against all the rules of hiding things, but think of this: if we mandate that getPos() hides all intermediate positions on pread, then any class which uses the base implementation of seek+read+seek will require getPos() to be synced with the read, which implies that:

          getPos() may block for an arbitrary amount of time if another thread is attempting to perform a positioned read and is having some problem communicating with the far end.

          Is that something we really want? Is it something people expect?

          stack stack added a comment -

          Colin P. McCabe Thanks for the very nice feedback. Let me incorporate/fix what has been posted.

          Steve Loughran Thanks for jumping on boss.

          Consistency with actual file data & metadata....

          Yes. Let's fold in your text. You are describing the FS as it is today.

          ...The second read() would succeed/return -1 depending on the position

          Do you mean the third read in above?

          When a pread is in progress, should that change be visible in getPos()?

          I like Colin P. McCabe's groupings which implies pread does not change getPos.

          How should I proceed with this issue Steve Loughran? I'd like to get 2.1 and 2.2 (from attached doc) blessed as scripture. Seems like a bit of cleanup is all that is needed in HDFS (and as Yi Liu suggests, we could probably remove some synchronizes). My guess is the other FS implementations have not been implemented the way HDFS has been and that backfilling a pread to run independent of a read would be a bunch of work. Would this work be a blocker on adding 2.1/2.2 to the spec and HDFS? Thanks Steve.

          hitliuyi Yi Liu added a comment -

          positional read and non-positional can run concurrently

          Agreed, it's better to do it like this.

          Steve Loughran said:

          Isolation of pread operations
          When a pread is in progress, should that change be visible in getPos()?
          If not, the method will need to be made synchronized on all implementations (it isn't right now; I checked). I
          If it can be visible, then we could pull the synchronized marker off some implementations and remove that as a lock point.


          I think it depends on the InputStream implementation, and on whether pread needs to modify pos and rely on other ops like seek().
          org.apache.hadoop.hdfs.DFSInputStream will not modify the pos, so we don't need to synchronize it with other methods such as read(); we just need to synchronize a few object variables like failures, and we can make failures local to resolve this problem.
          On the other hand, if pread relies on pos or on ops like seek(), then we should synchronize pread with all other methods.
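
          The "make failures local" idea can be sketched like so; this is a hypothetical illustration, not DFSInputStream's actual code, and readFromReplica / MAX_RETRIES are invented stand-ins for the real retry machinery:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: keep the failure counter local to each pread call rather than in
// a shared field, so concurrent preads need no lock on it. Names invented.
class PreadRetrySketch {
    static final int MAX_RETRIES = 3;
    final AtomicInteger attempts = new AtomicInteger(); // demo hook only

    // Stand-in for reading from a datanode replica: fails twice, then works.
    int readFromReplica(long position) throws IOException {
        if (attempts.incrementAndGet() <= 2) {
            throw new IOException("replica unavailable");
        }
        return 42; // pretend this many bytes were read
    }

    int pread(long position) throws IOException {
        int failures = 0; // local: no shared state, no lock with read()/seek()
        while (true) {
            try {
                return readFromReplica(position);
            } catch (IOException e) {
                if (++failures > MAX_RETRIES) {
                    throw e;
                }
                // pick another replica and retry, touching no shared fields
            }
        }
    }
}
```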

          hitliuyi Yi Liu added a comment -

          Hi Steve Loughran, yes, the thread-safety model for Hadoop InputStream/OutputStream is a bit chaotic: most methods are synchronized, but some others are not. Positioned read requires that implementations be thread-safe, but the WebHDFS input stream (HDFS-6813) and HarInputStream (HADOOP-10930) don't follow this, and DFSInputStream is not thread-safe (though maybe it is after synchronizing a few object variables). We should make them consistent.
          If we don't want to make them thread-safe, we'd better remove the related synchronized keywords, which would be a bit more efficient. If we want thread safety, we should ensure it.

          stevel@apache.org Steve Loughran added a comment -

          HADOOP-10938 now raises the discussion of what the DFS thread-safety model should be, with HDFS-6813 looking at the WebHDFS view, where the time for read and seek operations to complete can be significant.

          stevel@apache.org Steve Loughran added a comment -

          This is fun, stack's just opened up a whole new bag of inconsistencies.

          Consistency with actual file data & metadata

          We should state that changes to a file (length, contents, existence, perms) may not be visible to an open stream; if they do become visible, there are no guarantees about when those changes become visible. That could include partway through a readFully operation; this cannot be guaranteed to be atomic.

          Isolation of pread operations

          When a pread is in progress, should that change be visible in getPos()?

          1. If not, the method will need to be made synchronized on all implementations (it isn't right now; I checked).
          2. If it can be visible, then we could pull the synchronized marker off some implementations and remove that as a lock point.

          Failure Modes in concurrent/serialized operations

          One problem with concurrency on read+pread is something I hadn't thought of before: on any failure of a pread, the pos value must be reset to the previous one. Everything appears to do this; the test would be

          read();
          try {
            read(EOF + 2);          // positioned read past the end of the file
          } catch (IOException expected) {
            // pread must fail without disturbing the stream position
          }
          assertTrue(getPos() <= EOF);
          read();
          

          The second read() would succeed/return -1 depending on the position, and not an EOFException. The same outcome must happen for a negative pread attempt.

          If someone were to add this to AbstractContractSeekTest it'd get picked up by all the implementations and we could see what happens.

          Looking at the standard impl, it does seek() back in a finally block; but if there is an exception in the read(), then a subsequent exception in the final seek() would lose it. I think it should be reworked to catch any IOE in the read operation and do an exception-swallowing seek-back in that case. Or just do it for EOFException, now that Hadoop 2.5+ has all the standard filesystems throwing EOFException consistently.
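
          A sketch of that rework, on a toy in-memory stream; this is hypothetical, not the standard implementation, and failNextRead is an invented test hook. The point is that a failing restore-seek must not mask the original read failure:

```java
import java.io.IOException;

// Sketch: if read() throws during an emulated pread, seek back on a
// best-effort basis and swallow any secondary failure so callers see the
// original exception. Toy implementation; all names are illustrative.
class SafeSeekBackStream {
    private final byte[] data;
    private long pos;
    boolean failNextRead; // test hook to force a read() failure

    SafeSeekBackStream(byte[] data) { this.data = data; }

    void seek(long p) throws IOException { pos = p; }
    long getPos() { return pos; }

    int read(byte[] buf, int off, int len) throws IOException {
        if (failNextRead) {
            failNextRead = false;
            throw new IOException("read failed");
        }
        if (pos >= data.length) return -1;
        int n = Math.min(len, data.length - (int) pos);
        System.arraycopy(data, (int) pos, buf, off, n);
        pos += n;
        return n;
    }

    int pread(long position, byte[] buf, int off, int len) throws IOException {
        long oldPos = getPos();
        try {
            seek(position);
            int n = read(buf, off, len);
            seek(oldPos);              // normal path: restore position
            return n;
        } catch (IOException primary) {
            try {
                seek(oldPos);          // best-effort restore
            } catch (IOException suppressed) {
                // swallow: the read failure is the one callers must see
            }
            throw primary;
        }
    }
}
```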

          cmccabe Colin P. McCabe added a comment -

          I agree with 2.1 (positional read and non-positional can run concurrently), and 2.2 (Two or more positional reads can run concurrently.)

          2.3 seems both too strict and too loose at the same time, if that makes any sense. Too strict, because it talks about some internal details of HDFS (a file's length will not change if lastBlock is complete). When using a POSIX filesystem like Ceph or Lustre, a file's length can change at any time. We should try to accommodate the existence of those systems, even though we don't plan on adding random write support to the HCFS. Others may modify the files we're reading from outside of Hadoop. This problem exists with LocalFileSystem as well, of course.

          2.3 is too loose because it doesn't specify HOW getFileLength interacts with read, pread, and other calls. If we are using the new HDFS-6633 feature (HDFS tail) and new data is coming in, does getFileLength return that new length all the time? Or does it keep returning the old length? Can getFileLength run concurrently with any other functions?

          I would argue that getFileLength should be able to run concurrently with read and pread. I would also argue that it should be allowed to change over time, and even get shorter. (Of course it will never get shorter in the specific case of HDFS, but for LocalFileSystem... it can.) For HDFS, getFileLength should be able to return the last known file length without blocking or waiting for anything; i.e., check an AtomicLong or take a mutex on something smaller than the whole stream.

          Also, it would be nice to add a section specifying that when we do two non-positional reads at the same time, they may wait for each other to complete before proceeding. And getPos, seek, and skip may wait for non-positional reads to complete before running.

          Basically, what this looks like is grouping the functions into two sets:
          Group P: read, getPos, seek, skip, zero-copy read, releaseBuffer
          Group N: pread, getFileLength, setReadahead, setDropBehind, getReadStatistics

          Functions in group P can all block each other (probably they grab the same mutex, although this isn't guaranteed).

          Functions in group N do not ever block each other or functions in group P for a long time (although they may take a mutex or two for a very short amount of time, it's not the same mutex as for group P, and they don't hang on to it while doing I/O.)
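
          The two-group split can be sketched like this; a hypothetical toy, not DFSInputStream's actual locking, with invented names. Group-P operations (here read and getPos) share one mutex, while group-N operations (pread, getFileLength) never take it:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the group P / group N split: one mutex for positional state,
// lock-free (or briefly locked) paths for everything else. Names invented.
class TwoGroupStream {
    private final Object pLock = new Object(); // group P mutex
    private final byte[] data;
    private long pos;                          // guarded by pLock
    private final AtomicLong fileLength;       // group N: lock-free reads

    TwoGroupStream(byte[] data) {
        this.data = data;
        this.fileLength = new AtomicLong(data.length);
    }

    int read() {                               // group P: may block other P ops
        synchronized (pLock) {
            return pos < data.length ? data[(int) pos++] & 0xff : -1;
        }
    }

    long getPos() {                            // group P
        synchronized (pLock) { return pos; }
    }

    // Group N: never touches pLock or pos, so it cannot block read()/seek(),
    // and concurrent preads cannot block each other.
    int pread(long position, byte[] buf, int off, int len) {
        long length = fileLength.get();
        if (position >= length) return -1;
        int n = (int) Math.min(len, length - position);
        System.arraycopy(data, (int) position, buf, off, n);
        return n;
    }

    long getFileLength() {                     // group N: no blocking
        return fileLength.get();
    }
}
```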

          stack stack added a comment -

          Colin P. McCabe, are you ok w/ our converting our assertions to HDFS 'optimizations' (that may or may not hold in future versions)? Any others you'd suggest? Thanks sir.

          stack stack added a comment -

          Thank you mighty Steve Loughran

          Makes sense. Recasting "2.1. Positional read and non-positional read can run concurrently", and "2.2. Two or more positional reads can run concurrently" as HDFS 'optimizations' is how we should high-level characterize these asserts.

          Will make a patch for fsdatainputstream.md to do as you suggest along with adding code and javadoc comment to DFSIS itself.

          Sounds like you don't have a problem with the asserts; you are just asking that we categorize them properly and ensure we doc our expectations in the right place (DFSIS is not enough).

          Thanks

          stevel@apache.org Steve Loughran added a comment -

          Stack, Liang, the HADOOP-9361 spec of the FS API tries to define what is going on, though the semantics of concurrent relative reads are vague. Some of the javadocs say "isolated operation"; a lot of the implementations just do seek, read, seek-back.

          1. Any tenets on concurrency are things that should go there.
          2. I had put in "There is no requirement for the stream implementation to be thread-safe.", which was the conclusion I'd reached from looking at the code, especially the handling of read(offset) in most implementations. That is even though PositionedReadable.read() says "thread safe". That statement needs to be looked at.

          I would propose that you start a new section to hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md which lays down more rigorously what the thread-safety requirements of all implementations are, with reviews of all the existing input streams to see that they conform to what is defined. This isn't something that can be easily tested, so a code-review is all we have.

          Then for HDFS the concurrency tenets can be defined to state that certain operations are non-blocking as an optimisation. That's because other filesystems may behave differently, and it's potentially dangerous to make assumptions about concurrency that don't hold everywhere, including in future versions of HDFS.

          stack stack added a comment -

          First cut. Please review and advise if we overstep. Thanks.


  People

    • Assignee: stack
    • Reporter: stack
    • Votes: 0
    • Watchers: 15