Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      I've noticed that when doing thousands of searches in a single thread, the average time is quite low, i.e. a few milliseconds. When adding more concurrent threads doing exactly the same search, the average time increases drastically.
      I've profiled the search classes and found that the whole of Lucene blocks on

      org.apache.lucene.index.SegmentCoreReaders.getTermsReader
      org.apache.lucene.util.VirtualMethod
      public synchronized int getImplementationDistance
      org.apache.lucene.util.AttributeSource.getAttributeInterfaces

      These cause search times to increase from a few milliseconds to up to 2 seconds when doing 500 concurrent searches on the same in-memory index. Note that the index is not being updated at all, so no refresh methods are called at any stage.

      Some questions:
      Why do we need synchronization here?
      There must be a lock-free solution for these; as they stand, they make Lucene fine for single-threaded applications but disastrous for any concurrent implementation.

      I'll do some experiments by removing the synchronization from the methods of these classes.

      1. Threads-LUCENE-3653.patch.png
        128 kB
        Gerrit Jansen van Vuuren
      2. profile_2_c.png
        135 kB
        Gerrit Jansen van Vuuren
      3. profile_2_b.png
        160 kB
        Gerrit Jansen van Vuuren
      4. profile_2_a.png
        128 kB
        Gerrit Jansen van Vuuren
      5. profile_1_d.png
        171 kB
        Gerrit Jansen van Vuuren
      6. profile_1_c.png
        162 kB
        Gerrit Jansen van Vuuren
      7. profile_1_b.png
        143 kB
        Gerrit Jansen van Vuuren
      8. profile_1_a.png
        133 kB
        Gerrit Jansen van Vuuren
      9. lucene-unsync.diff
        17 kB
        Gerrit Jansen van Vuuren
      10. LUCENE-3653-VirtualMethod+AttributeSource.patch
        17 kB
        Uwe Schindler
      11. LUCENE-3653-VirtualMethod+AttributeSource.patch
        17 kB
        Uwe Schindler
      12. LUCENE-3653-VirtualMethod+AttributeSource.patch
        21 kB
        Uwe Schindler
      13. LUCENE-3653-sync-.png
        32 kB
        Simon Willnauer
      14. LUCENE-3653-no-sync.png
        33 kB
        Simon Willnauer
      15. LUCENE-3653.patch-BiasedLockingStartupDelay_3.png
        129 kB
        Gerrit Jansen van Vuuren
      16. LUCENE-3653.patch-BiasedLockingStartupDelay_2.png
        129 kB
        Gerrit Jansen van Vuuren
      17. LUCENE-3653.patch-BiasedLockingStartupDelay_1.png
        130 kB
        Gerrit Jansen van Vuuren
      18. LUCENE-3653.patch
        2 kB
        Simon Willnauer
      19. LUCENE-3653.patch
        0.9 kB
        Simon Willnauer
      20. App.java
        3 kB
        Gerrit Jansen van Vuuren

        Activity

        Uwe Schindler added a comment -

        The problems you are mentioning are no issues at all:

        • VirtualMethod is only used during class instantiation and class loading and must be synchronized. There is unlikely to be any contention at all; just because it's synchronized does not mean it's slow.
        • getAttributeInterfaces must be synchronized, too, as it has a reflection cache and is also only used during TokenStream instantiation. Analyzers should reuse TokenStreams, so it's not an issue at all. Fix your analyzers to reuse TokenStreams.

        Under concurrency the average time increases because of eventual contention in your file system directory implementation, not because methods may be synchronized.
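Uwe's reuse advice can be illustrated with a toy model in plain Java (this is not Lucene code; `ReusePattern`, `inspect`, and the counter are hypothetical names): every construction pays one trip through a synchronized reflection cache, while a reused instance pays it only once.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the synchronized reflection caches that are hit on every
// TokenStream construction (VirtualMethod.getImplementationDistance,
// AttributeSource.getAttributeInterfaces). All names here are hypothetical.
public class ReusePattern {
    static final Map<Class<?>, Integer> CACHE = new HashMap<Class<?>, Integer>();
    static int synchronizedCalls = 0; // instrumentation for the demo

    static synchronized int inspect(Class<?> c) { // the lock every ctor takes
        synchronizedCalls++;
        Integer cached = CACHE.get(c);
        if (cached == null) {
            cached = c.getDeclaredMethods().length; // stands in for heavy reflection
            CACHE.put(c, cached);
        }
        return cached;
    }

    static class Stream {
        Stream() { inspect(getClass()); } // construction always enters the monitor
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) new Stream();      // a new stream per request
        int perRequest = synchronizedCalls;

        synchronizedCalls = 0;
        Stream shared = new Stream();                      // construct once...
        for (int i = 0; i < 1000; i++) shared.hashCode();  // ...reuse per request
        System.out.println(perRequest + " vs " + synchronizedCalls); // prints "1000 vs 1"
    }
}
```

The cache itself stays correct either way; what reuse removes is the per-request trip into the monitor.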

        Gerrit Jansen van Vuuren added a comment -

        The profile_1_[x].png images show Lucene 3.4 used in the sample application App.java, profiled using YourKit.

        1_a : shows the threads blocking.
        1_b : shows where the threads block. I went through all of the methods and almost all were blocking on SegmentCoreReaders.
        1_c : shows the monitors on which the threads are blocking.

        Gerrit Jansen van Vuuren added a comment -

        The profile_2_[X].png images show how Lucene performs after having removed all synchronized methods from SegmentCoreReaders and from some other areas which I'm still investigating to see whether they had an effect or not.

        2_a : shows the thread view. Threads are not blocking any more, but there is some thread sleeping going on.

        Gerrit Jansen van Vuuren added a comment -

        Hi Uwe,

        Thanks for the fast response. Maybe it would help you to understand my use case:

        -------------- Start ------------------------
        I receive a customer request.
        Then create a Query based on the customer data.
        Search a Lucene index in memory; no disk activity is done, i.e. I use RAMDirectory.
        Then return the response to the customer.

        With a single request I get on average 3ms response times.
        If I ramp it up to 200 requests I start getting 200-500ms response times.
        ------------ End ------------------------
        For my application every millisecond counts; it just does, and it's out of my control.

        Assumptions:
        (1) I create a new instance of Query on each request. The Query object is never shared. So no locking is needed and no thread blocking should ever happen.
        (2) The index is read-only and in memory, so no OS file locking is applicable or will ever occur.
        (3) No index refresh happens and no writing; this is purely read-only.

        The point is there is no need for locking anywhere.

        Gerrit Jansen van Vuuren added a comment -

        Just to explain App.java:

        I've created this app to show the problem.
        Basically I create 500 threads.
        Each thread does the following:
        Creates a single Query.
        Does 10000 search requests against the Lucene index.
        Calculates max, min, average.
        The main thread then waits for all threads to complete and prints out max, min, average.

        The images attached are taken from profiling while running this class.
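The harness described above can be sketched in plain Java (a hypothetical stand-in, not the actual App.java: `doSearch()` is a placeholder for the real `searcher.search()` call, and the thread and iteration counts are scaled down for illustration):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the benchmark described above: N threads, each running a fixed
// number of searches, recording min/max/average latency across all threads.
public class SearchBench {
    static void doSearch() { Math.sqrt(42); } // placeholder workload

    // returns {min, max, avg} in nanoseconds
    static long[] run(int threads, int perThread) throws InterruptedException {
        final AtomicLong min = new AtomicLong(Long.MAX_VALUE);
        final AtomicLong max = new AtomicLong(Long.MIN_VALUE);
        final AtomicLong total = new AtomicLong();
        final CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    long start = System.nanoTime();
                    doSearch();
                    long elapsed = System.nanoTime() - start;
                    min.accumulateAndGet(elapsed, Math::min);
                    max.accumulateAndGet(elapsed, Math::max);
                    total.addAndGet(elapsed);
                }
                done.countDown();
            }).start();
        }
        done.await(); // wait for all threads to complete
        return new long[] { min.get(), max.get(),
                            total.get() / ((long) threads * perThread) };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] stats = run(8, 1000); // scaled down from 500 threads x 10000 searches
        System.out.printf("min=%dns max=%dns avg=%dns%n", stats[0], stats[1], stats[2]);
    }
}
```

With a synchronized hot path inside `doSearch()`, the gap between min and avg widens as the thread count grows, which is exactly the symptom reported in the description.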

        Uwe Schindler added a comment -

        Hi,

        short comment about your code and how to remove parts of the sync:

        • The SegmentCoreReaders sync cannot be removed in Lucene 3.x, as segments are read/write. You can partly remove the synchronization in your own Lucene instance by patching it, but the risk is on your side! This is not a bug in Lucene. In Lucene trunk, most of this sync is removed as IndexReaders will be purely read-only. We are currently working on removing contention in SegmentReader's SegmentCore.
        • You are not reusing the Analyzer. Create one analyzer instance and reuse it for indexing and for QueryParser! When you do this, you will see no contention on TokenStream creation (VirtualMethod.getImplementationDistance, AttributeSource.getAttributeInterfaces). TokenStream creation by Analyzers is a heavy operation (not only the reflection cache contention), so use only one Analyzer per app and pass it to all your QueryParsers. Removing sync from the AttributeSource/VirtualMethod caches would corrupt the cache and cause bugs - but that's unneeded if you reuse them as already said.
        Simon Willnauer added a comment -

        The SegmentCoreReaders sync cannot be removed in Lucene 3.x, as segments are read/write. You can remove the synchronization partly in your Lucene instance by patching it, the risk is on your side! This is not a bug in Lucene. In Lucene trunk, most of this sync is removed as IndexReaders will be pure read-only. We are currently working on removing contention in SegmentReader's SegmentCore.

        Uwe, I had a quick look at it and I think we can remove this. We set the tis in the ctor or when we load an NRT reader. So basically we can assign the tisNoIndex reader to tis instead of leaving it with a null ref, and use a second boolean to actually signal whether it was loaded or not. For read access this is not necessarily required to be synced, since if you pull an NRT reader you can rely on the IW sync / memory barrier to see the latest reference. I will take a closer look at this next week.
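Simon's suggestion can be sketched in plain Java (a simplified, hypothetical model; class and field names follow his comment, not the actual Lucene source): set the terms reader once at construction and rely on final-field publication instead of a synchronized getter.

```java
// Sketch of the idea above: assign the no-index reader to tis instead of
// leaving a null ref, record which variant was loaded in a boolean, and
// read tis without synchronization. A final field gives safe publication
// under the Java Memory Model's final-field rule.
class TermsReader {
    final boolean hasIndex;
    TermsReader(boolean hasIndex) { this.hasIndex = hasIndex; }
}

class SegmentCore {
    private final TermsReader tis;     // always non-null: falls back to the no-index reader
    private final boolean indexLoaded; // signals whether the full index variant was loaded

    SegmentCore(TermsReader full, TermsReader noIndex) {
        this.indexLoaded = (full != null);
        this.tis = indexLoaded ? full : noIndex; // never leave a null ref
    }

    // No synchronized needed: any thread that sees this SegmentCore sees a
    // fully constructed TermsReader, because tis is final.
    TermsReader getTermsReader() { return tis; }

    boolean isIndexLoaded() { return indexLoaded; }
}
```

The trade-off is that `tis` can no longer be swapped after construction; a reopened (NRT) reader would build a fresh core object instead, relying on the publication of that new object for visibility.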

        Uwe Schindler added a comment -

        Thanks Simon, Mike is currently working on that in Trunk (SegmentReader is getting so simple now...). For 3.x your solution might work.

        Uwe Schindler added a comment - edited

        Based on my work for another issue, I added removal of synchronization for the caches in VirtualMethod and AttributeSource. As those caches are mostly read (which should work without synchronization) and seldom populated, ConcurrentHashMap is the best choice, as gets are done without locks.
        The problem is that those reflection caches need a WeakHashMap, but there is no ConcurrentWeakHashMap. Based on the work on issue LUCENE-3531, I reactivated my WeakIdentityMap and made the backing map exchangeable (there are 2 static factory methods allowing ConcurrentHashMap or simple HashMaps as the backing map). For the reflection caches, identity is also fine (Class.equals/hashCode is based on class identity). The reflection caches also have no problem doing the heavy reflection work twice if two threads request the same class at the same time - the work is done twice and stored twice in the map, but who cares?

        This patch improves concurrency on creation of TokenStreams and instantiation of all classes using VirtualMethod for backwards compatibility.
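A minimal sketch of the approach (not the actual Lucene WeakIdentityMap, which also drains a ReferenceQueue and offers the pluggable backing maps mentioned above; all names here are illustrative): weak identity keys over a ConcurrentHashMap, so gets take no lock and a lost race simply does the reflection work twice.

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Weak key compared by identity - exactly right for Class keys, since
// Class equality is identity anyway.
final class IdentityWeakKey {
    private final WeakReference<Object> ref;
    private final int hash;

    IdentityWeakKey(Object key) {
        ref = new WeakReference<Object>(key);
        hash = System.identityHashCode(key); // stable even after the referent is cleared
    }

    @Override public int hashCode() { return hash; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof IdentityWeakKey)) return false;
        Object a = ref.get(), b = ((IdentityWeakKey) o).ref.get();
        return a != null && a == b; // identity, not equals()
    }
}

// Cache with a lock-free read path; putIfAbsent tolerates duplicate work.
final class WeakIdentityCache<V> {
    private final ConcurrentMap<IdentityWeakKey, V> map =
        new ConcurrentHashMap<IdentityWeakKey, V>();

    public V get(Object key) { return map.get(new IdentityWeakKey(key)); }

    public V putIfAbsent(Object key, V value) {
        return map.putIfAbsent(new IdentityWeakKey(key), value);
    }
}
```

If two threads race on the same class, both compute the value and one `putIfAbsent` wins; as the comment says, that duplicate work is harmless for a pure reflection cache.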

        Uwe Schindler added a comment -

        Updated patch (code cleanup, javadocs)

        Gerrit Jansen van Vuuren added a comment - edited

        Thanks, I'll keep an eye on the things happening in the trunk.

        Creating a single Tokenizer does help, but the thread blocking still happens because of the synchronization used in several classes.

        I've made some quick changes over the last few days on lucene-3.5 and have included the diff (this is not an svn diff, sorry).

        I agree: if anybody has to decide between concurrency and storing things twice, then concurrency wins; eventually all the cache data will be available to all threads, and the overhead goes away. But with synchronization the overhead never goes away.

        Some other points of contention are:
        RAMFile: all methods are synchronized.
        RAMInputStream.clone(): this method came up a lot during profiling. I changed it from calling clone to just creating a new instance directly.

        I'll try to clean up some of the code and add a better diff.

        Uwe Schindler added a comment - edited

        Creating a Single -Tokenizer-Analyzer does help, but the thread blocking still happens because of the synchronization used in several classes.

        Not reusing is stupid because of the heavy construction cost (not contention).

        I agree, if anybody has to decide between, concurrency or storing things twice then concurrency wins, eventually all the cache data will be available to all threads, and the overhead goes away. But with synchronization the overhead never goes away.

        There are places where we cannot remove synchronization - and those places are no issue at all. Just because there is synchronization, there is not necessarily a bottleneck. Not everything you mention is an issue.

        RAMFile : all methods are synchronized.

        There is contention, but it will not slow down your search. Please keep the synchronization there; every RAMFile is only opened once and then the contention is gone. Not everything your profiler shows as contention really is contention; only the first query will have some minor contention.

        RAMInputStream: clone() This method came up during the profiling allot. I changed it from calling clone to: just create an new instance directly.

        That's fine, but the same applies here. You only have contention on the first few queries.

        I'll try to cleanup some of the code and add a better diff.

        VirtualMethod and AttributeSource are already fixed in my patch.

        On the time-line of your profiler output I see no improvement in speed. How much faster does your code get?

        Gerrit Jansen van Vuuren added a comment - edited
        There is contention, but will not slowdown your search. Please keep synchronization there. every RAMFile is
        only opened once and then contention is gone.

        I'm not clear on this one: do you mean every RAMFile is opened once per search, or is it reused across all searches? If the latter, all threads will always block on RAMFile; that is not minor but major, especially taking into account that I want to move from 200 concurrent requests to 8000.

        Not everything what your profiler shows as contention is one, only the first query will have some minor contention.

        Yes indeed. That's why I've run several tests with warmups and tried to fish out problems based on call frequency, total time spent, and total time a thread was blocked on a certain monitor.

        On the time-line of your profiler output I see no improvement in speed. How much faster does your code get?

        I've not profiled for individual search speed, but rather for contention; I'm doing the tests on my laptop first and will move these changes to a production system next week to test total search time under load.

        With the current version, i.e. with the synchronization in there, the search times go up drastically, as I've mentioned above. On our production deployment and indexes, search times go from 3ms for single requests to 200-500ms and more on average when 200 concurrent requests are made.

        Uwe Schindler added a comment -

        I'm not clear on this one, you mean every RAMFile is opened once per search? or Will be reused across all searches? If the later all threads will block on RAMFile always, this is not minor but major, especially taking into account that I want to move from 200 concurrent requests to 8000.

        Once per IndexReader open; during searches there is no file opening/closing done at all. Synchronization on the directory of files is only done when writing the files and when opening files during IndexReader open, but not on every access. When files are deleted there are other contention points, but not during searches, as the IndexReader is opened read-only.

        Gerrit Jansen van Vuuren added a comment -

        OK, so for searching (after loading) the RAMFile synced methods would not be called. Now bear with me for a moment:

        This does not seem to be the case when you retrieve the Document, which is probably to be expected.
        In my code I do:

        searcher.doc(matched[i].doc) // searcher == IndexSearcher and matched[] == ScoreDoc[]

        In my code I see the following call trace (top to bottom):

        IndexSearcher.doc
        IndexReader.document
        DirectoryReader.document
        SegmentReader.document
        FieldsReader.doc
        RAMInputStream.seek
        RAMInputStream.switchCurrentBuffer
        RAMFile.getBuffer

        Which means I can search concurrently, but as soon as I try to retrieve something I again hit contention.
        Now I appreciate that with file IO this is required, but a fully in-memory index should not have these problems. I'm trying to change the RAMFile usage so that it does not require synchronization.
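One way the proposed read-only RAMFile could look (a hypothetical simplification, not the real RAMFile, which also tracks length and sizeInBytes): snapshot the buffers into a final array when the read-only file is created, so getBuffer() needs no monitor at all.

```java
import java.util.List;

// Hypothetical sketch of a read-only RAMFile variant: once writing is
// finished, the buffers are copied into a final array, so concurrent
// readers on the retrieval path never take a lock.
class ReadonlyRAMFile {
    private final byte[][] buffers; // immutable after construction

    ReadonlyRAMFile(List<byte[]> source) {
        this.buffers = source.toArray(new byte[0][]); // defensive snapshot
    }

    byte[] getBuffer(int index) { return buffers[index]; } // no synchronization needed
    int numBuffers() { return buffers.length; }
}
```

Because `buffers` is final and never mutated, the Java Memory Model guarantees every thread sees the fully populated array without locking; the cost is that the file must be completely written before the read-only view is built.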

        Uwe Schindler added a comment -

        Updated patch with test for the concurrent weak hash map. I will commit this to 3.x and trunk as it improves the sophisticated™ backwards layers and creation cost of AttributeSource (because ConcurrentHashMap has lockless get).

        I will keep this issue open for Simon to check the SegmentCoreReaders.

        Uwe Schindler added a comment -

        Committed VirtualMethod/AttributeSource improvements in trunk revision: 1220458, 3.x revision: 1220464

        Uwe Schindler added a comment -

        Committed improvement for WeakIdentityMap to not use the ReferenceQueue on get/contains/remove operations, only on put (trunk: r1220555, 3.x: r1220557)

        Simon Willnauer added a comment -

        Gerrit, before I spend time on fixing this single, IMO likely uncontended, lock in org.apache.lucene.index.SegmentCoreReaders.getTermsReader(), can I ask you to run your tests / benchmarks with -XX:BiasedLockingStartupDelay=0? As far as I can see you are running micro benchmarks which start up very quickly, and locks accessed right after startup behave differently in the JVM than other locks, i.e. sync blocks. Can you retest and report back?

        simon

        Simon Willnauer added a comment -

        Here is a patch that removes the sync on SegmentCoreReaders#getTermsReader() - since this method is accessed on basically every search, we should try to prevent any sync, even if it is cheap in the uncontended case.

        Gerrit Jansen van Vuuren added a comment -

        Hi Simon,
        Thanks, I'll try these on our production systems.
        One other change I've made, which I'll submit a patch for (pending testing), is introducing a ReadonlyRAMFile.

        Search for my use case is split into two:
        (1) Search
        (2) Get stored data from the index.
        Now in an in-memory index almost zero locking can be achieved.
        All the patches will do great for (1); the ReadonlyRAMFile seems to solve item (2).

        Gerrit Jansen van Vuuren added a comment -

        These are the results from profiling with the latest trunk checkout as of 2011-12-19 11:03.

        The Threads-LUCENE-3653.patch.png is without BiasedLockingStartupDelay.

        The others are with this option added to the JVM, plus the patch.

        I've included 3 images of the latter to show that contention has improved, but other areas of contention still exist that are probably related to the RAMFile itself being fully synchronized.

        Robert Muir added a comment -

        I think this is all profiler ghosts.

        Guys, let's be careful not to introduce bugs over what isn't a performance problem at all.

        Michael McCandless added a comment -

        Why are we using ConcurrentHashMap when we know there's a JVM deadlock bug, on certain 1.5.x's and certain CPUs...? (LUCENE-3235)

        Shouldn't we move in the other direction here (don't use CHM unless it cannot be avoided)?

        For example it's crazy to use this class to back the core/reader closed listeners.... and I don't think we should back the AttrSource with it...?

        Uwe Schindler added a comment -

        For example it's crazy to use this class to back the core/reader closed listeners.... and I don't think we should back the AttrSource with it...?

        CHM is only risky on contention. Contention here is unlikely as once all classes are "inspected" there are only read accesses.

        In my tests, creating lots of AttributeSources and IndexSearchers is improved. The problem is that both VirtualMethod and AttributeSource block on these cache locks during construction.

        Simon Willnauer added a comment -

        Gerrit, I looked at your test, and what you are seeing with so many threads is certainly profiler ghosts. 500 threads is a lot even on a highly concurrent system. The blocking you see might not be due to a monitor.

        I think this is all profiler ghosts.

        I disagree. I ran Gerrit's simple test with 100 threads and it's 2x faster without the locking on SegmentCoreReaders#getTermsReader(). I agree this is not the part which takes time on a big index, but this locking is certainly unnecessary and it takes up CPU resources. We should make the path down the IR as efficient as possible.

        I will attach my profiling sessions with and without the patch, and you will see the differences.

        Simon Willnauer added a comment -

        Here are two runs with 100 threads and 100,000 iterations per thread. I think this is definitely something we can prevent with the LUCENE-3653.patch.

        Robert Muir added a comment -

        I disagree. I ran Gerrit's simple test with 100 threads and it's 2x faster without the locking on SegmentCoreReaders#getTermsReader()

        Wait, if this stuff is a big problem, surely we can just use luceneutil with lots of threads, right?

        I want to see the QPS change. I don't trust the benchmarks or the various fancy JVM tools being used on this issue. I think it's all ghosts.

        Simon Willnauer added a comment -

        Wait, if this stuff is a big problem, surely we can just use luceneutil with lots of threads right?

        As I said, the index is tiny, so it's just measuring the overhead of the lock on SegmentCoreReaders#getTermsReader() - this is low-hanging fruit IMO, which we should pick even if there is no real change in QPS. It's unnecessary and the patch is simple - any objections?

        Uwe Schindler added a comment -

        In general, if we can remove contention anywhere, we should do it. As I said, contention on something that is generally only read but very seldom written is horrible. Just write a test that creates 200 threads and creates TokenStreams or IndexSearchers (around a single IndexReader). They will all lock on the cache (OK, without the reflection cache it would be even slower...).
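
        A minimal JDK-only sketch of the kind of test being described (all names here are illustrative, not Lucene code): many threads repeatedly call a synchronized getter on one shared object, so any monitor contention shows up directly in the elapsed wall-clock time.

```java
import java.util.concurrent.CountDownLatch;

// Illustrative contention micro-test: N threads hammer a synchronized
// getter on a single shared instance, and measure() reports how long the
// whole run took. Comparing runs with and without the synchronized keyword
// exposes the cost of the monitor under contention.
class GetterContentionTest {
    private final Object value = new Object();

    synchronized Object get() { return value; } // the contended getter

    long measure(final int threads, final int iters) {
        final CountDownLatch done = new CountDownLatch(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < iters; i++) get();
                    done.countDown();
                }
            }).start();
        }
        try {
            done.await(); // wait for all threads, then report elapsed time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return System.nanoTime() - start;
    }
}
```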

        Robert Muir added a comment -

        Just write a test that creates 200 threads and creates TokenStreams or IndexSearchers (around a single IndexReader). They will all lock on the cache (ok, without reflection cache it would even be slower...)

        I don't care. They should be reusing tokenstreams in their application.

        We don't need to make lucene more complicated to (possibly) speed up someones broken code.

        Lots of commits here, a title that says 'Lucene doesn't scale', lots of arguing that there are locking problems - but not one luceneutil benchmark run?

        Seriously?

        Uwe Schindler added a comment -

        The code for the commit was standing here for more than 24 hours and was tested extensively; it removes contention on construction of all classes using VirtualMethod (IndexSearcher and TokenStream are only examples). What's your problem with it? The latest commit 1 hr ago was only a fix in the tests and had nothing to do with ConcurrentHashMap.

        As I said, CHM is used for read access only, but the guard is needed if two ctors run at the same time and need to add a new entry to the cache. All other cases are read-only, so lockless is better. It is just stupid to wait on a ctor just because there is a lock on a map that's never updated once your code is running and no new classes using VirtualMethod are loaded. We could also use Google Guava's MapMaker, or wait until Java 8, where Doug Lea's ConcurrentWeakHashMap may be added (JSR-166).

        In fact the code of AttributeSource got simpler because we have no sync blocks.
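
        The read-mostly cache pattern described above can be sketched like this (an illustrative stand-in, not the actual VirtualMethod code): reads are lock-free, only the rare first computation may race, and putIfAbsent keeps the cache consistent without a monitor.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative read-mostly cache: the common case is a lock-free get();
// on a cache miss, the value may be computed twice under a race, but
// putIfAbsent guarantees every caller sees one consistent winning entry.
final class DistanceCache {
    private final ConcurrentHashMap<Class<?>, Integer> cache =
        new ConcurrentHashMap<Class<?>, Integer>();

    int getImplementationDistance(Class<?> clazz) {
        Integer distance = cache.get(clazz); // lock-free read, the common case
        if (distance == null) {
            Integer computed = compute(clazz);
            Integer prior = cache.putIfAbsent(clazz, computed);
            distance = (prior != null) ? prior : computed; // keep the winner
        }
        return distance;
    }

    private Integer compute(Class<?> clazz) {
        // Stand-in for the expensive reflective inspection: here we just
        // count superclasses up to (but not including) Object.
        int depth = 0;
        for (Class<?> c = clazz; c != null && c != Object.class; c = c.getSuperclass()) {
            depth++;
        }
        return depth;
    }
}
```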

        Uwe Schindler added a comment - - edited

        For example it's crazy to use this class [ConcurrentHashMap] to back the core/reader closed listeners

        This is indeed totally crazy, as this is the wrong use-case for this map. This map should be used when you have lots of reads and few writes. For this case, you have only one thread reading (on close only, LOL), but maybe many adding listeners. So it is in general a bug to use CHM here. It should be a simple synchronized HashMap (which performs better on writes, as it has less complexity inside).

        => We should open an issue for that!
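
        The alternative being argued for might look like this (illustrative names, not the actual Lucene listener code): when a collection is written often but read only once, on close, a plain synchronized collection is simpler and cheaper than ConcurrentHashMap.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative write-heavy, read-once listener registry: addListener() is
// the frequent operation, notifyOnClose() runs only once, so a plain
// synchronized collection fits better than ConcurrentHashMap here.
final class ClosedListeners {
    private final Set<Runnable> listeners =
        Collections.synchronizedSet(new LinkedHashSet<Runnable>());

    void addListener(Runnable listener) {
        listeners.add(listener); // the frequent, write-side operation
    }

    void notifyOnClose() {
        // Iterating a synchronized collection requires holding its lock;
        // this happens only once, when the reader is closed.
        synchronized (listeners) {
            for (Runnable listener : listeners) {
                listener.run();
            }
        }
    }
}
```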

        Gerrit Jansen van Vuuren added a comment - - edited

        Just to clear up on some comments:

        Guys, I'm testing this on a production deploy and the use case is simple: parse query, search, get doc fields. Contention was seen without profiling, and profiling was used to find where the contention in the application was.

        I don't care. They should be reusing tokenstreams in their application.

        Token streams are not the only place of contention discussed here. Please look at the test attached (App.java). It does not recreate the token streams on each search, only (by mistake) for each thread. But still, this only has a very small impact. Searches are repeated from each thread reusing the same token stream.

        We don't need to make lucene more complicated to (possibly) speed up someones broken code.

        Please write an example and show me your findings. I've done the same already. Have a look at App.java

        I want to see the QPS change, I don't trust the benchmarks or the various fancy jvm tools being used on this issue.
        I think its all ghosts.

        Yourkit is a standard profiling tool. I didn't just wake up and say, hey, I'll profile Lucene. I only started after I noticed the contention on a production application where we are trying to use Lucene in memory.

        Lots of commits here, a title that says 'lucene doesn't scale, lots of arguing that there are locking problems, but not one luceneutil benchmark run

        You don't need to be a scientist, nor use a profiling tool, to see that all threads will block on a synchronized block or method if it is called during each search request; from that it doesn't take much to know that this does not scale.

        Again, this is not a go at Lucene. The heading 'Lucene Search not scaling' is causing too many problems here, I see; I could have chosen a better heading, and if we can change it to something else, please do so. I think Lucene is really great - that's why I'm trying to use it. Thanks for the support, ideas and patches so far.

        Robert Muir added a comment -

        Sorry, I don't care what Yourkit says: it's wrong.

        A synchronized method that is just a getter, called only once per segment, is NOT locking up your search.

        Simon Willnauer added a comment -

        Gerrit, what Robert is saying is that if your index has a reasonable size, the cost of a search is not dominated by calling the getter, nor by the lock it's using. Your App is using 1k documents with a tiny number of terms. You are also using 500 threads, where scheduling actually dominates the cost. A reasonable benchmark will not see a significant impact from this synchronization in SegmentCoreReaders#getTermsReader(); it will be dominated by retrieving the documents etc. What you are measuring is literally the impact of the lock in a case where your search is super, super cheap. I still think we can and should get rid of the SegmentCoreReaders#getTermsReader() sync, but in general, saving a couple of CPU cycles while paying the price of more complicated code is not worth it IMO. So I agree with Robert: there will likely be no contention there.

        Gerrit Jansen van Vuuren added a comment -

        This is not a problem if you're OK with 200 or fewer requests on a server and your search time can be 0.5 seconds or more. But if your whole execution window is measured in milliseconds and you get 7000-10000 requests per second on a server, then yes, it does matter.

        If that method is called in each thread that is trying to search on the index, then it becomes a point of contention; if there are several separately synchronized methods, it only becomes worse. You're basically saying it's OK for all threads to wait sequentially in several areas of the code during an in-memory read-only index search.

        Gerrit Jansen van Vuuren added a comment - - edited

        Just to clarify, my index is 2 GB (it may grow to 10 GB; I've got 72 GB of RAM to work with), with thousands of documents; the documents are not big. The total processing time in my app without the index searches is on average 2-3 milliseconds (after warmup); this includes searching through a TreeMap with 11 million entries.

        Contention was not only seen on the single method SegmentCoreReaders#getTermsReader(), but also on all methods in RAMFile and SegmentNorms etc. It's not just about a single method; that's what I've been trying to make clear in my comments from the start.

        Robert Muir added a comment -

        Contention was not only seen on the single method in SegmentCoreReaders#getTermsReader() but also on all methods in RAMFile and SegmentNorms etc.

        Not actually seen, just spit out by a profiling tool. I really don't think you should interpret what Yourkit is telling you as gospel truth.

        Gerrit Jansen van Vuuren added a comment -

        IndexSearcher.doc calls RAMFile indirectly through RAMInputStream; once a search is complete, you need to know what's been found, so IndexSearcher.doc needs to be called.
        i.e. RAMInputStream calls RAMFile.numBuffers and getBuffer in switchCurrentBuffer, which happens a lot during my searches.

        IndexSearcher calls TermQuery$TermWeight.scorer, which calls SegmentReader.norms, which calls SegmentNorms.bytes on each search.
        SegmentNorms has almost all methods synchronized.

        Uwe Schindler added a comment - - edited

        IndexSearcher calls TermQuery$TermWeight.scorer, which calls SegmentReader.norms, which calls SegmentNorms.bytes on each search. SegmentNorms has almost all methods synchronized.

        That's the idea behind this class: it caches norms, as they are heavy to load. Even accidentally loading them twice, as broken double-checked locking could allow, is too expensive. Hotspot removes the sync later!

        Have you thought about Yourkit effectively disabling Hotspot, making your analysis bogus here?

        Gerrit Jansen van Vuuren added a comment -

        OK got you. Just keep in mind that I'm still profiling this and will post my findings as I go along.

        Uwe Schindler added a comment - - edited

        IndexSearch.doc calls RAMFile indirectly through RAMInputStream, once search is complete you need to know what's been found so IndexSearch.doc needs to be called. i.e. RAMInputStream calls RAMFile.numBuffers and getBuffer on the switchCurrentBuffer which happens allot during my searches.

        If you use RAMDirectory on a large index, it slows things down, as it drives the garbage collector crazy. Use an on-disk index with MMapDirectory, which has no locking at all (only on the sometimes-called IndexInput.clone, but if you remove that, your JVM will SIGSEGV if you use Lucene incorrectly with multiple threads).

        RAMDirectory is written for tests, not for production use. There are already plans to remove it from Lucene trunk and make it tests-only. Have you seen that it allocates buffers in 8-kilobyte blocks? Calculate how many byte[] you have on a 50-gigabyte index... The GC will go crazy when it starts to clean up. And then it stops your whole application - not because it locks inside RAMFile, but because it does a stop-the-world GC.

        We are working on a RAMDirectory-like approach storing the files outside the Java heap using a large DirectByteBuffer (which is the same code as MMapDirectory). The problem is writing to such a directory, but reading is as fast as (or even faster than) RAMDirectory, without locks.
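
        The underlying mechanism MMapDirectory builds on is java.nio memory mapping; a minimal JDK-only sketch (not Lucene code, all names here are illustrative) of what a mapped read looks like:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Illustrative sketch: the file is mapped into the process address space,
// so reads are served from the OS page cache with no Java-side locking and
// no copy of the data onto the Java heap.
class MmapReadSketch {
    // Writes a few bytes to a temp file, maps it read-only, and reads one
    // byte back through the mapping; returns that byte.
    static byte writeThenMappedRead(byte[] data, int offset) {
        try {
            File file = File.createTempFile("mmap-sketch", ".bin");
            file.deleteOnExit();
            RandomAccessFile raf = new RandomAccessFile(file, "rw");
            try {
                raf.write(data);
                FileChannel channel = raf.getChannel();
                MappedByteBuffer map =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                return map.get(offset); // a plain memory read, no monitor involved
            } finally {
                raf.close();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```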

        Gerrit Jansen van Vuuren added a comment -

        Oh, ok I thought RAMDirectory would make things faster, good to know. I will try with MMapDirectory then post back.

        Uwe Schindler added a comment -

        Oh, ok I thought RAMDirectory would make things faster, good to know. I will try with MMapDirectory then post back.

        That's a hard-to-eliminate misbelief - same as "optimizing indexes is good".

        Just remember that MMapDirectory will use memory outside the JVM heap. If the whole index is in the file system cache, Lucene can directly access the cache using MMapDirectory (that's like a swap file, loaded on demand). So reduce the JVM heap and supply lots of space for the OS kernel to cache file contents, as those are mapped by mmap into Java's process space.

        DM Smith added a comment -

        Thats a hard to eliminate misbelief - same like "optimizing indexes is good".

        Not sure why it is a misbelief? On Win98 and XP with Lucene 1.4.x, when creating an index with Symantec AV and Windows fast indexing both on, indexing 64K small documents took 40+ minutes. With them off, it took about 4 minutes. Going with a RAM dir took over 40 seconds.

        From what I could tell (and remember), Lucene was writing and deleting tens of thousands of files. It appeared that the OS was updating its fast search index for each file change hitting the disk. And Symantec seemed to be scanning on every file creation.

        As this was a desktop application, we couldn't ask end users to turn off those features or to tune them.

        Does MMapDirectory avoid this too many transient files problem?

        Uwe Schindler added a comment - - edited

        DM: In Lucene 1.4 times the indexer worked differently and really did create lots of files. With the new merging this is no longer the case.

        In fact, we were talking here about searching, not indexing. It makes no sense to clone a huge RAM directory from disk to heap and run searches on it.

        Michael McCandless added a comment -

        I still think we can and should get rid of the SegmentCoreReaders#getTermsReader() sync

        +1, I think Simon's patch is a good improvement, even if in practice this sync'd getter should be a tiny cost.

        DM Smith added a comment - - edited

        In fact we were talking here about searching not indexing. It makes no sense to clone a huge RAM directory from disk to heap and run searches on it.

        I saw that this issue is about searching, not indexing. I didn't mean to hijack it. I was responding to the statement that RAMDirectory is going away. (BTW, the statement on optimize regards an indexing feature, not a search one.) I'll have to test to see if the same indexing problem is still around.

        Regarding MMap, there is an open issue with it on Windows: LUCENE-1669.

        Uwe Schindler added a comment - - edited

        LUCENE-1669 seems to be a relic and should already have been solved since Lucene 2.9 on newer Windows versions? And BTW, MMap is only used for reading indexes; writing goes through NIO/SimpleFS.

        BTW, the statement on optimize does not regard a search feature but an index one

        This statement was just a comparison, to underline that both statements "RAMDir is fast" and "Optimize is needed" are myths.

        I was responding to the statement that RAMDirectory going away

        See my previous comment:

        We are working on a RAM-Dir-like approach that stores the files outside the Java heap using a large DirectByteBuffer (which is the same code as MMapDirectory). The problem is writing to such a directory, but reading is as fast as (or even faster than) RAMDirectory, without locks.
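The off-heap storage idea described above can be sketched with a plain direct ByteBuffer. This is illustrative only, not the actual directory implementation; the class name is made up for the example:

```java
import java.nio.ByteBuffer;

public class DirectBufferSketch {
    public static void main(String[] args) {
        // Allocate 1 KiB outside the Java heap; reads and writes go
        // through the buffer without occupying GC-managed memory
        ByteBuffer buf = ByteBuffer.allocateDirect(1024);
        buf.put((byte) 42);
        buf.flip(); // switch from writing to reading
        System.out.println(buf.isDirect()); // prints "true"
        System.out.println(buf.get());      // prints "42"
    }
}
```

Reading from such a buffer needs no synchronization as long as the data is immutable after being published, which matches the read-only search case discussed here.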

        Gerrit Jansen van Vuuren added a comment -

        Hi,

        I've tested my application using

        • a single Tokenizer,
        • MMapDirectory,
        • the latest trunk build that contains the patches committed here,
        • the LUCENE-3653.patch applied,
        • plus the -XX:BiasedLockingStartupDelay=0 JVM option.

        My app request time has gone down from a terrible 400 ms to 120-140 ms on average, and is constant over 10/200/500 threads (after a 3-hour warmup). (Running Java 1.6.0_29, 64-bit CentOS 5.4.)

        It is still strange that I can run a single-threaded request 10000 times and get a 3-5 ms average response time.

        Thanks, Uwe, Simon; your input, comments, and patches have helped a lot.

        Simon Willnauer added a comment -

        +1, I think Simon's patch is a good improvement, even if in practice this sync'd getter should be a tiny cost.

        I agree. However, I don't think we can get rid of a memory barrier here entirely. Even if this is synced by the IW, a thread could reopen the reader and another search thread could get the new instance without yet another barrier. This means that search thread might not see the dictionary, resulting in a possibly slow search. I removed the sync and made the tis member volatile. That seems the safest option to me.
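The change described above can be sketched as a volatile field replacing a synchronized getter. The member name tis follows the discussion, but the code is illustrative, not the actual patch:

```java
class TermInfosReader {} // stand-in for the real Lucene class

class CoreReaders {
    // volatile read/write supplies the memory barrier without a
    // monitor, so concurrent search threads never block on the getter
    private volatile TermInfosReader tis;

    TermInfosReader getTermsReader() {
        return tis; // lock-free volatile read
    }

    void setTermsReader(TermInfosReader reader) {
        this.tis = reader; // volatile write safely publishes the new instance
    }
}

public class VolatileGetterSketch {
    public static void main(String[] args) {
        CoreReaders core = new CoreReaders();
        TermInfosReader reader = new TermInfosReader();
        core.setTermsReader(reader);
        System.out.println(core.getTermsReader() == reader); // prints "true"
    }
}
```

The volatile read is far cheaper than a contended monitor, while still guaranteeing that a thread which obtains a freshly opened reader sees the fully initialized tis reference.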

        Uwe Schindler added a comment -

        Hi Simon, for 3.x I agree. In trunk we already have no synchronization at all (committed 2 days ago; improved yesterday: LUCENE-3631), so there is no need to change anything.

        Simon Willnauer added a comment -

        Hi Simon, for 3.x I agree. In trunk we already have no synchronization at all (committed 2 days ago; improved yesterday: LUCENE-3631), so there is no need to change anything.

        Uwe, that patch is against 3.x - I didn't intend to apply this to 4.0 since we do totally different things there. I will commit this in a bit if nobody objects.

        Uwe Schindler added a comment -

        +1


          People

          • Assignee: Simon Willnauer
          • Reporter: Gerrit Jansen van Vuuren
          • Votes: 0
          • Watchers: 1
