Solr / SOLR-5150

HdfsIndexInput may not fully read requested bytes.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.4
    • Fix Version/s: 4.5, 6.0
    • Component/s: None
    • Labels: None

      Description

      Patrick Hunt noticed that our HdfsDirectory code was a bit behind Blur here - the read call we are using may not read all of the requested bytes; it returns the number of bytes actually read, which we ignore.

      Blur moved to using a seek and then readFully call - synchronizing across the two calls to deal with clones.

      We have seen that this really kills performance, and using the readFully call that lets you pass the position, rather than first doing a seek, performs much better and does not require the synchronization.

      I also noticed that the seekInternal impl should not seek but should be a no-op, since we are seeking on the read.
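
      As a rough illustration of the fix described above (a sketch only - the inputStream field name and exact signatures are assumptions, not the committed patch):

        // Sketch: readInternal hands the position to HDFS via the positional
        // readFully, which loops internally until 'length' bytes are read, so no
        // return value can be silently dropped and no shared seek position needs
        // to be synchronized across clones.
        @Override
        protected void readInternal(byte[] b, int offset, int length) throws IOException {
          long position = getFilePointer();
          inputStream.readFully(position, b, offset, length); // FSDataInputStream
        }

        @Override
        protected void seekInternal(long pos) throws IOException {
          // no-op: the position is passed explicitly on every read
        }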

        Activity

        Uwe Schindler added a comment -

        Nice catch! As there is a positioned readFully we can handle that in a good way without losing performance. Otherwise I would have suggested an approach like the one in NIOFSDir (chunking plus a while (remaining) loop that updates the position pointer).

        I also noticed that the seekInternal impl should not seek but be a no-op since we are seeking on the read.

        Right! I don't know why seekInternal in BufferedIndexInput is still there. IMHO, it should be removed from the base class, as it is no longer used anywhere (at least it should default to an empty method). No IndexInput in Lucene implements it anymore, because with positional reads it is not applicable, and in the case of separate seek/read, the seek and read must be synchronized because of clones (unless every IndexInput has a separate file descriptor).
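
        A rough sketch of the NIOFSDir-style chunked loop mentioned above, purely illustrative (the chunk size and the inputStream field are assumptions):

          // Alternative sketch: read in chunks and advance by however many bytes
          // each positional read() actually returned, instead of ignoring the
          // return value. (EOFException is java.io.EOFException.)
          private static final int CHUNK_SIZE = 8192; // assumed chunk size

          @Override
          protected void readInternal(byte[] b, int offset, int length) throws IOException {
            long position = getFilePointer();
            while (length > 0) {
              int toRead = Math.min(CHUNK_SIZE, length);
              int read = inputStream.read(position, b, offset, toRead); // positional read
              if (read < 0) {
                throw new EOFException("read past EOF: " + this);
              }
              position += read;
              offset += read;
              length -= read;
            }
          }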

        Mark Miller added a comment - edited

        I've held off on committing this because some performance tests indicate the upstream Blur patch may have been more performant for merging/flushing, while the current patch is much more performant for queries.

        We might be able to use one or the other based on the IOContext.

        I'm waiting until I can get some more results and testing done though - I've seen lots of random deadlock situations in some of my testing with the upstream Blur fix (synchronization around two calls).
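
        A hedged sketch of picking one strategy or the other based on the IOContext (the two IndexInput class names, getFileSystem(), hdfsDirPath, and bufferSize are hypothetical names for this illustration; this is not the patch):

          // Sketch: choose the read strategy when a file is opened, based on the IOContext.
          @Override
          public IndexInput openInput(String name, IOContext context) throws IOException {
            Path path = new Path(hdfsDirPath, name);
            FSDataInputStream in = getFileSystem().open(path, bufferSize);
            if (context.context == IOContext.Context.MERGE) {
              // single-threaded, mostly sequential access: synchronized seek + readFully (Blur-style)
              return new SequentialHdfsIndexInput(name, in);
            }
            // concurrent query access: positional readFully, no shared seek position
            return new PositionalHdfsIndexInput(name, in);
          }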

        Mark Miller added a comment - edited

        Patrick Hunt was on vacation, but is now back and may have some thoughts on this issue as well.

        Mark Miller added a comment -

        To describe that more fully: not deadlock, just really long pauses - no CPU or hard drive usage by either the HDFS processes or Solr for a long time, with threads seemingly hanging out in socket waits of some kind.

        That is how I first saw the slowdown with the Blur fix - I was running one of the HdfsDirectory tests on my Mac and it took 10 minutes instead of 14 seconds. On Linux, the test was still fast. Some other perf tests around querying took a nosedive on Linux as well, though. Meanwhile, some tests involving indexing sped up.

        The current patch sped that test back up on my Mac and fixed the query perf test.

        We might be able to get the best of both worlds, or the synchronized version might not be worth it.

        Uwe Schindler added a comment - edited

        Hi Mark,
        I think your version should be preferred in both cases. The Apache Blur upstream version looks like SimpleFSIndexInput (which has synchronization on the RandomAccessFile). The difference here is that reading from a real file has no network involved (at least not for local filesystems), so the time spent in the locked code block is shorter. Still, SimpleFSDir is bad for queries.
        When merging, everything works single-threaded per file, so you would see no difference between the two approaches. If the positional readFully approach were slower, that would clearly be a bug in HDFS.
        Another alternative would be: when cloning a file, also clone the underlying HDFS connection. With RandomAccessFile we cannot do this in the JDK (we have no dup() for file descriptors), but if HDFS supports some dup()-like approach with delete-on-last-close semantics (the file could already be deleted when you dup the file descriptor), you could create a separate connection for each thread.
        The downside: Lucene never closes clones - one reason why I gave up on implementing a Windows-optimized directory that would clone the underlying file descriptor: the clone would never close the dup.

        Mark Miller added a comment -

        If the positional readFully approach were slower, that would clearly be a bug in HDFS.

        Right - if that turns out to be the case, I'd raise an issue with the HDFS team. The performance difference actually looks fairly large at first glance though, so it might be worth working around for a while if possible. I don't really know yet.

        Patrick Hunt added a comment -

        Hi Mark Miller, thanks for filing this while I was out. I was trying to track down another issue and happened across it while reviewing code (then noticed that Blur had changed from the original).

        I realized the seekInternal change while on vacation and was going to mention it, but it looks like you fixed it already.

        I reviewed the HDFS client code for readInternal with a member of our HDFS team before generating the original patch. Based on the feedback I got, the understanding was that doing the seek followed by the readFully should have given the highest performance. It's interesting that the query performance was so negatively impacted. We should follow up with those folks again; perhaps you could provide more insight (than I can) into how Lucene accesses the underlying filesystem for query-based reads vs other access patterns? That might help get more insight from the HDFS devs. Perhaps there is some way to trace those accesses...

        We have not yet tried "short circuit local HDFS client reads" (see 12.11.2 here http://hbase.apache.org/book/perf.hdfs.html) but we should at some point (soon), and that will further complicate things. Based on the results other clients have seen, we should see significant performance benefits from that (at least when the blocks are indeed local).

        Aaron McCurry added a comment -

        First off, I'm really happy to see other people trying to improve the performance of the HdfsDirectory. So I will offer some reasons why I have landed on the current implementation in Blur.

        Why Blur doesn't clone the HDFS file handle for clone in Lucene.

        • Mainly because, since Lucene 4, cloned file handles don't seem to get closed all the time, so I didn't want to have all those objects hanging around for long periods of time without being closed. Related: for those who are interested, Blur also has a Directory reference counter so that files deleted by Lucene stick around long enough for running queries to finish.

        Why Blur doesn't use the read[Fully](position,buf,off,len) method instead of the seek plus read[Fully](buf,off,len).

        • When accessing the local file system, the positional call would take a huge amount of time because of some internal setup Hadoop was doing for every call. This didn't seem to be an issue when using HDFS, but if you start using short-circuit reads it might become a problem. I have not tested this for 6 months, so it may have been improved in newer versions of Hadoop.

        Why Blur uses readFully versus read.

        • Laziness? Not sure - I'm fairly sure I thought that a single seek + readFully against the filesystem would be better (even if it meant more operations) than multiple calls with multiple seeks + reads. Perhaps, though, it would be better not to use readFully, as you all are discussing, because of the sync call.

        How would I really like to implement it?

        • I would like to implement the file access system as a pool of file handles for each file, so that each file would have up to N (configured by default to 10 or something like that) handles open, and all accesses from the base file object and its clones would check out a handle and release it when finished. That way there is some limit to the number of handles, but some parallel access is still allowed.

        Hope this helps to explain why Blur has the implementation that it does.

        Aaron
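
        A rough sketch of the pooled-handle idea described above, purely illustrative (the class name, pool sizing, and read signature are assumptions, not Blur code):

          import java.io.Closeable;
          import java.io.IOException;
          import java.util.concurrent.ArrayBlockingQueue;
          import java.util.concurrent.BlockingQueue;
          import org.apache.hadoop.fs.FSDataInputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          // Hypothetical sketch: a bounded pool of open streams per file. The base
          // IndexInput and its clones borrow a stream for each read and return it,
          // capping the number of open handles while still allowing parallel reads.
          class HdfsStreamPool implements Closeable {
            private final BlockingQueue<FSDataInputStream> pool;

            HdfsStreamPool(FileSystem fs, Path path, int size) throws IOException {
              pool = new ArrayBlockingQueue<>(size);
              for (int i = 0; i < size; i++) {
                pool.add(fs.open(path));
              }
            }

            void read(long position, byte[] b, int offset, int length) throws IOException {
              FSDataInputStream in;
              try {
                in = pool.take(); // block until a handle is free
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException(e);
              }
              try {
                in.seek(position);            // safe: this handle is ours until we return it
                in.readFully(b, offset, length);
              } finally {
                pool.add(in);                 // always hand the stream back
              }
            }

            @Override
            public void close() throws IOException {
              for (FSDataInputStream in : pool) {
                in.close();
              }
            }
          }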

        Mark Miller added a comment -

        Based on the feedback I got, the understanding was that doing the seek followed by the readFully should have given the highest performance.

        That would go along with the speed improvements we saw in indexing. As Uwe said, that does seem kind of weird or buggish, but the numbers seem to bear it out.

        I think the issue is that in the indexing case it's one thread doing the reading - and our perf tests do show it to be quite a bit faster. But the sync simply kills concurrent query reads. We see throughput drop from around 38 qps to 3 qps. Totally unacceptable performance.

        This is why I've been looking at doing either the synced seek + read or just the positional read, based on the IOContext.

        Aaron McCurry added a comment -

        Just as an FYI, the IOContext won't give the expected behavior, because if the writer already has an index input open for searching it won't reopen the file. If my memory is correct, it will just clone the existing index input that was opened with an IOContext of DEFAULT (or whatever it uses for searching). So if you are doing NRT updates you may never see an IOContext of MERGE in the directory implementation. I know this was true in 4.2-ish, but I haven't reviewed the code for this situation lately.

        Aaron

        Mark Miller added a comment -

        Michael McCandless, do you have any insight into the above comment?

        Michael McCandless added a comment -

        That's correct: if IndexWriter already has a SegmentReader open on that segment because it's pooling readers, in turn because an NRT reader was pulled, then if that segment is merged it will re-use that already opened SegmentReader rather than open a new one.

        Mark Miller added a comment -

        But the sync simply kills concurrent query reads.

        Sorry, I was not being very careful with my words. The 'sync' option (with the seek + read) kills concurrent query reads - but I don't think it's the sync at all. The first perf tests I looked at, with just a readFully, had a sync as well - which seems to make sense because this is not an NRT test or anything. Everything seems to be related to the HDFS calls.

        Mark Miller added a comment -

        I'm just going to commit the current fix and worry about any performance improvements in another issue.

        ASF subversion and git services added a comment -

        Commit 1523693 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1523693 ]

        SOLR-5150: HdfsIndexInput may not fully read requested bytes.

        ASF subversion and git services added a comment -

        Commit 1523694 from Mark Miller in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1523694 ]

        SOLR-5150: HdfsIndexInput may not fully read requested bytes.

        ASF subversion and git services added a comment -

        Commit 1523698 from Mark Miller in branch 'dev/branches/lucene_solr_4_5'
        [ https://svn.apache.org/r1523698 ]

        SOLR-5150: HdfsIndexInput may not fully read requested bytes.

        Adrien Grand added a comment -

        4.5 release -> bulk close


          People

          • Assignee: Mark Miller
          • Reporter: Mark Miller
          • Votes: 0
          • Watchers: 8
