Lucene - Core
LUCENE-709

[PATCH] Enable application-level management of IndexWriter.ramDirectory size

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1
    • Fix Version/s: None
    • Component/s: core/index
    • Labels: None
    • Environment: All

    • Lucene Fields:
      New, Patch Available

      Description

      IndexWriter currently only supports bounding the in-memory index cache via maxBufferedDocs, which limits it to a fixed number of documents. When document sizes vary substantially, especially when documents cannot be truncated, this leads either to inefficiencies from a too-small value or OutOfMemoryErrors from a too-large value.

      This simple patch exposes IndexWriter.flushRamSegments() and provides access to size information about IndexWriter.ramDirectory, so that an application can manage the cache based on the total number of bytes it consumes, thereby allowing a larger number of smaller documents or a smaller number of larger documents. This can lead to much better performance while eliminating the possibility of OutOfMemoryErrors.

      The actual job of managing to a size constraint, or any other constraint, is left up to the application.

      The addition of synchronized to flushRamSegments() is only for the safety of an external call. It has no significant effect on internal calls since they all come from a synchronized caller.
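      To make the intended usage concrete, here is a minimal sketch of an application-level policy, assuming the members this patch exposes (IndexWriter.ramSizeInBytes() and a public flushRamSegments()); the 32 MB threshold and helper name are purely illustrative:

          import java.io.IOException;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.index.IndexWriter;

          // Flush the in-memory segments whenever the RAM cache grows past a
          // byte budget, instead of after a fixed number of documents.
          class RamBoundedIndexer {
              private static final long MAX_RAM_BYTES = 32 * 1024 * 1024;

              static void add(IndexWriter writer, Document doc) throws IOException {
                  writer.addDocument(doc);
                  if (writer.ramSizeInBytes() > MAX_RAM_BYTES) {
                      writer.flushRamSegments(); // write buffered segments to the real directory
                  }
              }
          }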

      1. ramdir.patch
        5 kB
        Yonik Seeley
      2. ramdir.patch
        2 kB
        Yonik Seeley
      3. ramDirSizeManagement.patch
        17 kB
        Chuck Williams
      4. ramDirSizeManagement.patch
        13 kB
        Chuck Williams
      5. ramDirSizeManagement.patch
        3 kB
        Chuck Williams
      6. ramDirSizeManagement.patch
        3 kB
        Chuck Williams

        Activity

        Yonik Seeley added a comment -

        Thanks Chuck, I think I like this additional view/control into IndexWriter, and I don't think opening this up more will further constrain future implementations. I'll wait a few days to see if others have comments though.

        I think there might be a thread safety issue with your patch: you use an unsynchronized fail-fast iterator in RAMDirectory.sizeInBytes(). I think using an Enumerator here should work, right?

        Too bad there doesn't seem to be an easy way to incrementally maintain sizeInBytes... walking over the whole Hashtable for each document addition isn't pretty for large maxBufferedDocs, especially if the number of indexed fields is large. At least this only affects people using this functionality though.
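        For reference, a sketch of what an Enumeration-based sizeInBytes() might look like (RAMFileStub stands in for Lucene's RAMFile; this is not the actual patch):

            import java.util.Enumeration;
            import java.util.Hashtable;

            // Walk the synchronized Hashtable with an Enumeration instead of a
            // fail-fast Iterator, summing the length of every in-memory file.
            class RamDirSketch {
                static class RAMFileStub { long length; }

                final Hashtable files = new Hashtable(); // file name -> RAMFileStub

                long sizeInBytes() {
                    long size = 0;
                    for (Enumeration e = files.elements(); e.hasMoreElements(); ) {
                        size += ((RAMFileStub) e.nextElement()).length; // unsynchronized read of a long
                    }
                    return size;
                }
            }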

        Mark Harwood added a comment -

        This is a really nice addition to have which takes some of the guess-work out of buffer sizing based on numbers of docs.

        I implemented something similar a while back and found that although it looks ugly, the performance cost of calculating RAMDirectory.sizeInBytes in this way for each doc addition was negligible. That code also lacked the thread-safety measure Yonik suggests, so I don't know what overhead that will add.

        Yonik Seeley added a comment -

        > That code too was without the thread-safety measure Yonik suggests so I don't know what overhead that will add.

        Switching to an enumerator should be negligibly faster, since Hashtable's iterator is implemented as its enumerator plus extra concurrent-modification checks. That might not be sufficient for total thread safety though.

        Enumerating through the Hashtable while not synchronized means you can encounter an object that was just added by another thread. The other thread synchronized while adding the new object, but the enumerating thread didn't execute a read barrier. The new memory model provides "out-of-thin-air safety" and "initialization safety" guarantees, so we are guaranteed to see a complete instance of RAMFile (just not necessarily a current one). In this specific use case, I think it boils down to whether updating the long length is atomic, which we can't guarantee for all platforms. Your count could be off by 4 GB if you "see" the bottom 32 bits before the top.

        In this IndexWriter use case, though, we should never see a long length that uses both 32-bit words, because we are talking about single segments.

        Bottom line (I think): if you want getSizeBytes to work correctly 100% of the time, in all instances and on all platforms, you need to synchronize it (and hence block any gets/puts during that time... blech)

        Chuck Williams added a comment -

        Mea Culpa! Bad bug on my part. Thanks for spotting it!

        I believe the solution is simple. RAMDirectory.files is a Hashtable, i.e. it is synchronized. Hashtable.values() tracks all changes to the RAM directory as they occur. The fail-fast iterator does not accept concurrent modifications. So the answer is to stop concurrent modifications during sizeInBytes(). This is accomplished by synchronizing on the same object the modifications already use, i.e. files. I'm attaching a new version of the patch that I believe is correct.
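        A sketch of the fix described here, written against the RamDirSketch stand-in from above (the attached patch is the authoritative version):

            // Hold the Hashtable's own monitor for the whole walk so no entries
            // can be added or removed mid-iteration; the fail-fast Iterator then
            // cannot throw ConcurrentModificationException.
            long sizeInBytes() {
                synchronized (files) {
                    long size = 0;
                    java.util.Iterator it = files.values().iterator();
                    while (it.hasNext()) {
                        size += ((RAMFileStub) it.next()).length;
                    }
                    return size;
                }
            }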

        Please embarrass me again if there is another mistake!

        Yonik Seeley added a comment -

        Thinking a little further on this:
        Synchronizing on the Hashtable here does not solve the whole problem; it only slows things down. The problem isn't the Hashtable (using an Enumerator rather than an Iterator would solve the fail-fast concurrent-modification thing).

        The problem is unsynchronized access to RAMFile.length.
        RAMFile and IndexInput/IndexOutput aren't meant to be MT-safe.
        The correct solution would be to synchronize that (have a RAMFile.getLength() and a RAMFile.setLength()).

        The question is... is it worth it? Probably...
        I don't think the cost should be too bad since RAMInputStream makes a local copy of the length, and RAMOutputStream inherits from BufferedOutputStream and only updates the length every buffer flush.
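        A sketch of the accessors proposed above, using the names from the comment (the real RAMFile carries more state than shown here):

            // Encapsulate the length field behind synchronized accessors so a
            // reader can never observe a torn 64-bit value.
            class RAMFile {
                private long length;

                synchronized long getLength() { return length; }
                synchronized void setLength(long length) { this.length = length; }
            }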

        Chuck Williams added a comment -

        Not synchronizing on the Hashtable, even if using an Enumerator, creates problems as the contents of the hash table may change during the sizeInBytes() iteration. Files might be deleted and/or added to the directory concurrently, causing the size to be computed from an invalid intermediate state. Using an Enumerator would cause the invalid value to be returned without an exception, while using an Iterator instead generates a ConcurrentModificationException. Synchronizing on files avoids the problem altogether without much cost as the loop is fast.

        Hashtable uses a single class, Hashtable.Enumerator, for both its iterator and its enumerator. There are a couple minor differences in the respective methods, such as the above, but not much.

        RAMFile.length being a long is indeed an issue, but this bug already exists in Lucene without sizeInBytes(). See RAMDirectory.fileLength(), which has the same problem now.

        I'll submit another version of the patch that encapsulates RAMFile.length in a synchronized getter and setter. It's only used in a few places (RAMDirectory, RAMInputStream and RAMOutputStream).

        Yonik Seeley added a comment -

        > the contents of the hash table may change during the sizeInBytes() iteration.

        Yes, but that's OK.

        > Files might be deleted and/or added to the directory concurrently, causing the size to be computed from an invalid intermediate state

        Synchronizing at that low level doesn't make the computed size more valid though... you need synchronization at a higher level if you want to say more about what the size you are computing represents.

        Consider the case of two different uncoordinated threads... one adding a new file to the RAMDirectory, and the other calculating the size of the directory. In the unsynchronized case, you don't know if the size will include the new file or not. If sizeInBytes() is synchronized, you still don't know which thread will acquire the lock first, so you still don't know if the size will include the new file. Synchronizing sizeInBytes() does nothing but add a bottleneck.

        > Synchronizing on files avoids the problem altogether without much cost as the loop is fast.

        I disagree that the loop will be fast... simpler loops have proven to take some time:
        LUCENE-388: Improve indexing performance when maxBufferedDocs is
        large by keeping a count of buffered documents rather than
        counting after each document addition.
        That was just counting the documents, not the number of files in each segment (which will be larger).
        Consider maxBufferedDocs of 1000 to 10000 with 10 or 20 indexed fields, and you end up with 17000 to 270000 files to calculate the size over.
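        (The arithmetic appears to assume each buffered document is a single-document segment of roughly 7 core files plus one norms file per indexed field: 1000 × (7 + 10) = 17000 files and 10000 × (7 + 20) = 270000 files.)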

        Yonik Seeley added a comment -

        > The issue with RAMFile.length being a long is an issue, but, this bug already exists in Lucene without sizeInBytes(). See RAMDirectory.fileLength(),

        Yes, fileLength() isn't mt-safe, but then again, it didn't need to be as it's never used in an unsafe manner in Lucene.
        sizeInBytes() introduces a new use of RAMFile.length that makes things unsafe, unless one specifies that it's illegal to call IndexWriter.ramSizeInBytes() concurrently with addDocument().

        Chuck Williams added a comment -

        I hadn't considered the case of such large values for maxBufferedDocs, and I agree that the loop execution time is non-trivial in such cases. Incremental management of the size seems most important, especially considering that this will also eliminate the cost of the synchronization.

        I still think the synchronization adds safety, since it guarantees that the loop sees a state of the directory that did exist at some point in time. At that time, the directory did have the reported size. Without the synchronization the loop may compute a size for a set of files that never comprised the contents of the directory at any instant. Consider this case:

        1. Thread 1 adds a new document, creating a new segment with new index files, leading to segment merging, that creates new larger segment index files, and then deletes all replaced segment index files. Thread 1 then adds a second document, creating new segment index files.
        2. Thread 2 is computing sizeInBytes and happens to see a state where all the new files from both the first and second documents are added, but the deletions are not seen. This could happen if the deleted files happen to be earlier in the hash array than the added files for either document.

        In this case sizeInBytes() without the synchronization computes a larger size for the directory than ever actually existed.

        Re. RAMDirectory.fileLength(): it is not used within Lucene at all, but it is public, and the restriction that it is not valid while index operations are happening concurrently is not specified. I think that is a bug.

        I'll rethink the patch based on your observations, Yonik, and resubmit. Thanks.

        Yonik Seeley added a comment -

        A couple of points:

        • Synchronizing the Hashtable doesn't help solve the scenario outlined above, since the logic that first adds merged segments and then removes old segments doesn't synchronize on the Hashtable. This is another case that requires synchronization at a higher level.
        • For the specific case of the buffered docs in the IndexWriter, they are not merged to the same RAMDirectory anyway.
        Yonik Seeley added a comment -

        What do people think of this patch to RAMDirectory that keeps track of the size of closed files with a minimum of overhead, in addition to fixing the non-atomic rename (as well as its javadoc)?

        Yonik Seeley added a comment -

        I'm referring to ramdir.patch that I just attached (It would be nice if JIRA would show a link in the comments to what was just attached...)

        Doug Cutting added a comment -

        This looks workable to me. Alternately, we could add a bufferAdded(int size) method to RAMOutputStream, and increment the directory size whenever this is called.
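        A minimal sketch of that idea, with illustrative names rather than the committed API: the directory keeps a running byte total that streams bump as they allocate buffers, so nothing ever has to walk the file table.

            // Illustrative only: a running size total maintained as buffers come and go.
            class RamDirSizeSketch {
                private long sizeInBytes = 0;

                // Called by an output stream each time it allocates a new buffer.
                synchronized void bufferAdded(int size) { sizeInBytes += size; }

                // Called when a file is deleted or overwritten.
                synchronized void buffersRemoved(long size) { sizeInBytes -= size; }

                synchronized long sizeInBytes() { return sizeInBytes; }
            }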

        Also, should renameFile throw an exception if the source doesn't exist?

        Yonik Seeley added a comment -

        > we could add a bufferAdded(int size) method to RAMOutputStream, and increment the directory size whenever this is called.

        Hmmm, that does reflect the heap usage better than summing the length of the files. For this specific usage, I think that is more what we want.

        > should renameFile throw an exception if the source doesn't exist?
        Definitely, I'll add that while I'm at it.

        Yonik Seeley added a comment -

        Keeping track of buffers does complicate figuring out how much to subtract on a delete.
        We could:
        1) iterate over RAMFile.buffers(), subtracting the size of each (1024-byte buffers means many iterations for a big file though)
        2) calculate the total buffer size assuming that all buffers are of size 1024 (fragile assumption?)
        3) store the cumulative buffer sizes in the RAMFile? (extra space... 8 bytes per RAMFile)

        Doug Cutting added a comment -

        If we add a RAMFile.totalBufferSize() method then we can easily change the implementation later. For now, I'd choose to multiply the number of buffers by the size of each buffer, since the size of the buffers is currently a constant. If we ever make this non-constant, then we might have to change this (and a lot of other things).
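        Under that assumption the method is a one-liner; a sketch, with BUFFER_SIZE mirroring the 1024-byte constant discussed above:

            // All buffers currently have the same size, so count times size suffices.
            class RAMFileSizeSketch {
                static final int BUFFER_SIZE = 1024;
                private final java.util.ArrayList buffers = new java.util.ArrayList();

                long totalBufferSize() {
                    return (long) buffers.size() * BUFFER_SIZE;
                }
            }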

        Yonik Seeley added a comment -

        Attaching a new, untested ramdir.patch:

        • keeps track of buffers added instead of file length.
        • handles overwrites
        • adds IOExceptions, mirroring FSDirectory when files don't exist.
          Note that this adds IOException to the method signatures, just like its base class and FSDirectory.
        Chuck Williams added a comment -

        I've just attached my version of this patch. It includes a multi-threaded test case. I believe it is sound.

        A few notes:

        1. Re. Yonik's comment about my synchronization scenario: synchronizing as described does resolve the issue. No higher-level synchronization is required. It doesn't matter how concurrent operations on the directory are ordered or interleaved, so long as any computation that does a loop sees some instance of the directory that corresponds to its actual content at some point in time. The result of the loop will then be accurate for that instant.

        2. Lucene has this same synchronization bug today in RAMDirectory.list(). It can return a list of files that never comprised the contents of the directory. This is fixed in the attached.

        3. Also, the long synchronization bug exists in RAMDirectory.fileModified() as well as RAMDirectory.fileLength(), since both are public. These are fixed in the attached.

        4. I moved the synchronization off of the Hashtable (replacing it with a HashMap) up to the RAMDirectory, as there are some operations that require synchronization at the directory level. Using just one lock seems better (see the sketch after these notes). As all Hashtable operations were already synchronized, I don't believe any material additional synchronization is added.

        5. Lucene currently makes the assumption that if a file is being written by a stream then no other streams are simultaneously reading or writing it. I've maintained this assumption as an optimization, allowing the streams to access fields directly without synchronization. This is documented in the comments, as is the locking order.

        6. sizeInBytes is now maintained incrementally and efficiently.

        7. Yonik, your version (which I just now saw) has a bug in RAMDirectory.renameFile(). The to file may already exist, in which case it is overwritten and its size must be subtracted. I actually hit this in my test case for my implementation and fixed it (since Lucene renames a new version of the segments file).

        All Lucene tests, including the new test, pass. Some contrib tests fail; I believe none of those failures are in any way related to this patch.
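        A sketch of the locking scheme points 4 and 6 describe, with illustrative names (the attached patch is the real implementation): one directory-level monitor guards both the HashMap and an incrementally maintained byte count.

            import java.util.HashMap;

            // One lock (the directory itself) guards both the file table and the
            // running size, so sizeInBytes() is O(1) and always self-consistent.
            class RamDirLockingSketch {
                private final HashMap files = new HashMap(); // name -> file, guarded by 'this'
                private long sizeInBytes = 0;                // maintained incrementally

                synchronized long sizeInBytes() { return sizeInBytes; }

                synchronized void fileAdded(String name, Object file, long bytes) {
                    files.put(name, file);
                    sizeInBytes += bytes;
                }

                synchronized void fileDeleted(String name, long bytes) {
                    files.remove(name);
                    sizeInBytes -= bytes;
                }
            }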

        Chuck Williams added a comment -

        I didn't see Yonik's new version or comments until after my attach.

        Throwing IOExceptions when files that should exist don't is clearly a good thing. I'll add that to mine if you guys decide it is the one you would like to use.

        Counting buffer sizes rather than file lengths may be slightly more accurate, but at least for me it is not material. There are other inaccuracies as well (non-file-storage space in the RAMFiles and the RAMDirectory).

        If you guys decide to go with Yonik's version, I think my test case should still be used, and the other synchronization errors I've fixed should be fixed as well (e.g., RAMDirectory.list()).

        Yonik Seeley added a comment -

        > 1. Re. Yonik's comment about my synchronization scenario. Synhronizing as described does resolve the issue.

        In your merging documents scenario, you state "Thread 1 adds a new document, creating a new segment with new index files, leading to segment merging, that creates new larger segment index files, and then deletes all replaced segment index files."

        If a different thread calls getSizeInBytes() after the merge but before the deletes, you will see both the old segments and new segments created by the merge and will be double counting. Synchronizing the directory-level getSizeInBytes() will not solve that... it requires higher level synchronization.

        Anyway, I think the point is moot as I think we should handle the size incrementally.

        >Counting buffer sizes rather than file length may be slightly more accurate, but at least for me it is not material.

        It could be much more accurate though. All buffering of documents in IndexWriter is done with single doc segments. That 1 byte norm file takes up 1024 bytes of buffer space!

        > I think my test case should still be used

        +1, I didn't do any testing

        BTW, some of the synchronization bugs were fixed in the recent lockless patch.

        Chuck Williams added a comment -

        > In your merging documents scenario, you state "Thread 1 adds a new document, creating a new segment with new index files, leading to segment merging, that creates new larger segment index files, and then deletes all replaced segment index files."

        > If a different thread calls getSizeInBytes() after the merge but before the deletes, you will see both the old segments and new segments created by the merge and will be double counting. Synchronizing the directory-level getSizeInBytes() will not solve that... it requires higher level synchronization.

        Except there is no double counting there. The size after the merge, before the deletes, really is that big! This is what I mean by any computation involving a loop being accurate at that instant. Without the synchronization, you can get a result that was never accurate, i.e. one that represents a file set that never existed. For a size computation, that result could be larger or smaller than any actual size the directory ever attained. That is the point of my example. For a list() computation with an unprotected loop (as in Lucene now), you can get a set of files that were never the contents of the directory at any instant.

        No higher-level synchronization is required to achieve the semantics that a looping computation is accurate at the instant it is performed. Without directory (or files Hashtable) synchronization protecting the whole loop, the result can be random, having no correlation to any actual state the directory ever attained.

        > Anyway, I think the point is moot as I think we should handle the size incrementally.

        Not quite, because the bug already exists in Lucene in RAMDirectory.list(). My version of the patch fixes this. It should be fixed.

        >>Counting buffer sizes rather than file length may be slightly more accurate, but at least for me it is not material.

        > It could be much more accurate though. All buffering of documents in IndexWriter is done with single doc segments. That 1 byte norm file takes up 1024 bytes of buffer space!

        Point taken that this is important in general. (These numbers are still small in my app because maxBufferedDocs is not large and I have some very large documents that cannot be truncated.)

        I can update my version of the patch with this improvement if that would be helpful. Or, if you are going to merge my test case into your version of the patch (and, I hope, fix the remaining synchronization issues in RAMDirectory.list(), the long synchronization issues in fileLength() and fileModified(), and the rename bug, which will need to be fixed for the test case to succeed), then I'll just hold off.

        Yonik, thanks for your interest and effort in this issue!

        Yonik Seeley added a comment -

        Sorry for not being clearer before, Chuck; I actually did understand your point-in-time points.
        I was just trying to point out that for the use cases I had in mind, the extra sync didn't buy one much. Perhaps you have different use cases in mind where you can take action based on the size of a RAMDirectory without regard to what other modifiers are doing.

        > Not quite, because the bug already exists in lucene in RAMDirectory.list().

        I agree. On the first quick pass I only commented on its non-thread-safe behavior because I thought it was just a debugging method... looking at it again, I see it should be fixed. IIRC, Michael may have already fixed it in his lockless patch.

        I'm +1 on your other changes such as converting to a HashMap (too bad we can't use ConcurrentHashMap yet).
        I'd also be OK with changing the buffers Vector to an ArrayList while our eyes are on this part of the code. That might be more cosmetic than anything else though.

        Please continue with your patch if you would like.

        Chuck Williams added a comment -

        This one should be golden, as it addresses all the issues that have been raised and I believe the synchronization is fairly well optimized.

        Size is now computed based on buffer size, and so is a more accurate accounting of actual memory usage.

        I've added all the various checking and FileNotFoundExceptions that Doug suggested.

        I've also changed RAMFile.buffers to an ArrayList per Yonik's last suggestion. This is probably better than cosmetic, since it allows some unnecessary synchronization to be eliminated.

        Unfortunately, my local Lucene now differs fairly substantially from the head (wish you guys would commit more of my patches so merging wasn't so difficult), so I'm not using the version submitted here, but I did merge it into the head carefully and all tests pass, including the new RAMDirectory tests specifically for the functionality this patch provides.

        Yonik Seeley added a comment -

        Committed. Thanks for bearing with me through this, Chuck!


          People

          • Assignee: Unassigned
          • Reporter: Chuck Williams
          • Votes: 1
          • Watchers: 0
