Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, Trunk
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently LuceneTaxonomyReader does not support NRT - i.e., on changes to LuceneTaxonomyWriter, you cannot have the reader updated, like IndexReader/Writer. In order to do that we need to do the following:

      1. Add ctor to LuceneTaxonomyReader to allow you to instantiate it with LuceneTaxonomyWriter.
      2. Add API to LuceneTaxonomyWriter to expose its internal IndexReader
      3. Change LTR.refresh() to return an LTR, rather than void. This is actually not strictly related to that issue, but since we'll need to modify refresh() impl, I think it'll be good to change its API as well. Since all of facet API is @lucene.experimental, no backwards issues here (and the sooner we do it, the better).
      1. LUCENE-3441.patch
        103 kB
        Shai Erera
      2. LUCENE-3441.patch
        91 kB
        Shai Erera

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1412232

          LUCENE-3441: facets NRT support

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1412232 LUCENE-3441 : facets NRT support
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1412252

          LUCENE-3441: add another test that reproduced a bug on a wrong assert

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1412252 LUCENE-3441 : add another test that reproduced a bug on a wrong assert
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1412235

          LUCENE-3441: fix bug in ParentArray.initFromReader

          Show
          Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1412235 LUCENE-3441 : fix bug in ParentArray.initFromReader
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1412249

          LUCENE-3441: add another test that reproduced a bug on a wrong assert

          Show
          Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1412249 LUCENE-3441 : add another test that reproduced a bug on a wrong assert
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1412232

          LUCENE-3441: facets NRT support

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1412232 LUCENE-3441 : facets NRT support
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1412252

          LUCENE-3441: add another test that reproduced a bug on a wrong assert

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1412252 LUCENE-3441 : add another test that reproduced a bug on a wrong assert
          Hide
          Shai Erera added a comment -

          Added another test which uncovered a wrong assert in ParentArray ctor.

          All tests pass on 4x and trunk. Hopefully it'll remain like that .

          Resolving for now. Thanks all for the help !

          Show
          Shai Erera added a comment - Added another test which uncovered a wrong assert in ParentArray ctor. All tests pass on 4x and trunk. Hopefully it'll remain like that . Resolving for now. Thanks all for the help !
          Hide
          Shai Erera added a comment -

          Found the bug. If a new NRT TaxoReader was opened, after the taxonomy segments were merged, it missed few ordinals. This was introduced when I handled a TODO to avoid MultiFields (obviously I didn't handle it well ). I re-introduced MultiFields because the code only iterates on a single posting list, and MultiFields help to keep the code simple, and obviously bug free !

          I added a non-random test which exposes the bug.

          Will commit to 4x and trunk, and port the new test + fix to trunk as well.

          Show
          Shai Erera added a comment - Found the bug. If a new NRT TaxoReader was opened, after the taxonomy segments were merged, it missed few ordinals. This was introduced when I handled a TODO to avoid MultiFields (obviously I didn't handle it well ). I re-introduced MultiFields because the code only iterates on a single posting list, and MultiFields help to keep the code simple, and obviously bug free ! I added a non-random test which exposes the bug. Will commit to 4x and trunk, and port the new test + fix to trunk as well.
          Hide
          Shai Erera added a comment -

          I added a TODO to consider switching to use DoubleBarrleLRUCache (to DirTaxoReader). Also added some javadocs to DTR.doOpenIfChanged, noting about the cache sharing.

          I ran all tests with 30 iterations on trunk and committed.
          After I ported to 4x, I hit this failure:

          org.apache.lucene.index.CorruptIndexException: Missing parent data for category 0
          at __randomizedtesting.SeedInfo.seed([A790471136DA8197:3D43CAE660B7F604]:0)
          at org.apache.lucene.facet.taxonomy.directory.ParentArray.initFromReader(ParentArray.java:116)
          at org.apache.lucene.facet.taxonomy.directory.ParentArray.<init>(ParentArray.java:66)
          at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader.<init>(DirectoryTaxonomyReader.java:92)
          at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader.doOpenIfChanged(DirectoryTaxonomyReader.java:238)
          at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader.doOpenIfChanged(DirectoryTaxonomyReader.java:1)
          at org.apache.lucene.facet.taxonomy.TaxonomyReader.openIfChanged(TaxonomyReader.java:95)
          at org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.testOpenIfChangedManySegments(TestDirectoryTaxonomyReader.java:224)

          NOTE: reproduce with: ant test -Dtestcase=TestDirectoryTaxonomyReader -Dtests.method="testOpenIfChangedManySegments

          {#66 seed=[A790471136DA8197:3D43CAE660B7F604]}

          " -Dtests.seed=A790471136DA8197 -Dtests.slow=true -Dtests.locale=lt_LT -Dtests.timezone=Australia/South -Dtests.file.encoding=UTF-8
          NOTE: test params are: codec=SimpleText, sim=DefaultSimilarity, locale=lt_LT, timezone=Australia/South
          NOTE: Windows 7 6.1 build 7600 amd64/IBM Corporation 1.6.0 (64-bit)/cpus=2,threads=1,free=1071336,total=15596544
          NOTE: All tests run in this JVM: [TestDirectoryTaxonomyReader]

          Even though the test is single-threaded, it's not reproducible no matter if I use the master seed or the iteration seed. I can reproduce quite easily though if I add -Dtests.iters=100 and -Dtests.failfast=yes – but it makes it hard to debug ...

          I'm digging. I won't revert the commit to trunk yet, hopefully I'll get to the bottom of it quickly.

          Show
          Shai Erera added a comment - I added a TODO to consider switching to use DoubleBarrleLRUCache (to DirTaxoReader). Also added some javadocs to DTR.doOpenIfChanged, noting about the cache sharing. I ran all tests with 30 iterations on trunk and committed. After I ported to 4x, I hit this failure: org.apache.lucene.index.CorruptIndexException: Missing parent data for category 0 at __randomizedtesting.SeedInfo.seed( [A790471136DA8197:3D43CAE660B7F604] :0) at org.apache.lucene.facet.taxonomy.directory.ParentArray.initFromReader(ParentArray.java:116) at org.apache.lucene.facet.taxonomy.directory.ParentArray.<init>(ParentArray.java:66) at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader.<init>(DirectoryTaxonomyReader.java:92) at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader.doOpenIfChanged(DirectoryTaxonomyReader.java:238) at org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader.doOpenIfChanged(DirectoryTaxonomyReader.java:1) at org.apache.lucene.facet.taxonomy.TaxonomyReader.openIfChanged(TaxonomyReader.java:95) at org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.testOpenIfChangedManySegments(TestDirectoryTaxonomyReader.java:224) NOTE: reproduce with: ant test -Dtestcase=TestDirectoryTaxonomyReader -Dtests.method="testOpenIfChangedManySegments {#66 seed=[A790471136DA8197:3D43CAE660B7F604]} " -Dtests.seed=A790471136DA8197 -Dtests.slow=true -Dtests.locale=lt_LT -Dtests.timezone=Australia/South -Dtests.file.encoding=UTF-8 NOTE: test params are: codec=SimpleText, sim=DefaultSimilarity, locale=lt_LT, timezone=Australia/South NOTE: Windows 7 6.1 build 7600 amd64/IBM Corporation 1.6.0 (64-bit)/cpus=2,threads=1,free=1071336,total=15596544 NOTE: All tests run in this JVM: [TestDirectoryTaxonomyReader] Even though the test is single-threaded, it's not reproducible no matter if I use the master seed or the iteration seed. I can reproduce quite easily though if I add -Dtests.iters=100 and -Dtests.failfast=yes – but it makes it hard to debug ... I'm digging. I won't revert the commit to trunk yet, hopefully I'll get to the bottom of it quickly.
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Shai Erera
          http://svn.apache.org/viewvc?view=revision&revision=1412149

          LUCENE-3441: facets NRT support

          Show
          Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1412149 LUCENE-3441 : facets NRT support
          Hide
          Gilad Barkai added a comment -

          Patch looks very good.

          +1.

          Show
          Gilad Barkai added a comment - Patch looks very good. +1.
          Hide
          Shai Erera added a comment -

          Patch addresses most of Sivan's comments. Also fixes the cache sharing by sharing the instances and stop putting in the cache the fact that a category wasn't found. Also DirTaxoReader does a bound check when needed, because of cache sharing.

          I added few tests (w/ the help of Gilad) and on the way realized that I didn't handle DirTaxoWriter.replaceTaxonomy(), so had to fix doOpenIfChanged to check the taxoEpoch member from DirTaxoWriter, in case we're in NRT.

          All tests pass. In general I think that it's ready to go in.

          Show
          Shai Erera added a comment - Patch addresses most of Sivan's comments. Also fixes the cache sharing by sharing the instances and stop putting in the cache the fact that a category wasn't found. Also DirTaxoReader does a bound check when needed, because of cache sharing. I added few tests (w/ the help of Gilad) and on the way realized that I didn't handle DirTaxoWriter.replaceTaxonomy(), so had to fix doOpenIfChanged to check the taxoEpoch member from DirTaxoWriter, in case we're in NRT. All tests pass. In general I think that it's ready to go in.
          Hide
          Shai Erera added a comment -

          would the old TR ever receive a request to look up an out-of-bounds ord

          In fact, it could. Like IR, the old TR still lives for old queries in flight. It is possible that TR-1 received a FacetRequest with a category that it doesn't recognize, but because the cache is now shared, and the TR-2 recognizes that category and put it in the cache, TR-1 will receive its ordinal, which is out-of-bound.

          Took a re-look at the code, and I think that it's solvable by having TR do a bounds check. It is a very simple and cheap check, only one 'if', which will happen, in practice, only for the top-K categories and the FacetRequest initially. I.e., the entire faceted search process – aggregating the categories, determining top-K etc., all happen on the ordinals that are known to the search index (IR-1, that is coupled with TR-1).

          In fact, the labeling that IR-1/TR-1 will do to the top-K cannot out-of-bound, so actually only the FacetRequest may out-of-bound. Still, I think that it's safer to add the bounds check to both caches.

          I will post a new patch soon, compiled with the comments from Sivan (thanks !), and removing the nocommit.

          Show
          Shai Erera added a comment - would the old TR ever receive a request to look up an out-of-bounds ord In fact, it could. Like IR, the old TR still lives for old queries in flight. It is possible that TR-1 received a FacetRequest with a category that it doesn't recognize, but because the cache is now shared, and the TR-2 recognizes that category and put it in the cache, TR-1 will receive its ordinal, which is out-of-bound. Took a re-look at the code, and I think that it's solvable by having TR do a bounds check. It is a very simple and cheap check, only one 'if', which will happen, in practice, only for the top-K categories and the FacetRequest initially. I.e., the entire faceted search process – aggregating the categories, determining top-K etc., all happen on the ordinals that are known to the search index (IR-1, that is coupled with TR-1). In fact, the labeling that IR-1/TR-1 will do to the top-K cannot out-of-bound, so actually only the FacetRequest may out-of-bound. Still, I think that it's safer to add the bounds check to both caches. I will post a new patch soon, compiled with the comments from Sivan (thanks !), and removing the nocommit.
          Hide
          Michael McCandless added a comment -

          Since the usage of openIfChanged is well defined, and assuming that people who interact with the taxonomy are aware Lucen's NRT, should we rely on the old taxo reader instances to not be used after openIfChanged?

          Hmm, for NRT the old reader should continue working just fine even if a new reader was openIfChange'd off of it. EG there may be queries in-flight still against the old reader.

          However, for TR, would the old TR ever receive a request to look up an out-of-bounds ord? It shouldn't, right? Can we declare that the results when you pass out of bounds ords are not defined, ie that's abuse of the API?

          Show
          Michael McCandless added a comment - Since the usage of openIfChanged is well defined, and assuming that people who interact with the taxonomy are aware Lucen's NRT, should we rely on the old taxo reader instances to not be used after openIfChanged? Hmm, for NRT the old reader should continue working just fine even if a new reader was openIfChange'd off of it. EG there may be queries in-flight still against the old reader. However, for TR, would the old TR ever receive a request to look up an out-of-bounds ord? It shouldn't, right? Can we declare that the results when you pass out of bounds ords are not defined, ie that's abuse of the API?
          Hide
          Sivan Yogev added a comment -

          Went through the patch and it looks good.

          One symmetry question I have - TR is now abstract class and TW still interface. Can it be that we should make TW also abstract class?

          A few minor comments:

          ParentArray:

          • constructors are public but have no Javadoc. Are they meant for external (outside the taxonomy package) usage?

          DTR:

          • Semi-cosmetic issue - I would move the first (non-public) constructor after the two public ones, confused me a bit to see that one first.
          • Line 127: unnecessary "to" in Javadoc of param directory.
          • 4000 appears 4 times in the code, would rather have it once as a constant.
          Show
          Sivan Yogev added a comment - Went through the patch and it looks good. One symmetry question I have - TR is now abstract class and TW still interface. Can it be that we should make TW also abstract class? A few minor comments: ParentArray: constructors are public but have no Javadoc. Are they meant for external (outside the taxonomy package) usage? DTR: Semi-cosmetic issue - I would move the first (non-public) constructor after the two public ones, confused me a bit to see that one first. Line 127: unnecessary "to" in Javadoc of param directory. 4000 appears 4 times in the code, would rather have it once as a constant.
          Hide
          Shai Erera added a comment -

          The side effect of sharing the cache is for TR-1 to receive an ordinal which doesn't actually exist. We can document that if you use openIfChanged, but continue to use the old reader, then this sort of thing can happen. Really, this is not how people should use openIfChanged ...

          Another idea I had is to wrap the cache with an object that keeps track of 'upto'. I.e. TR-1's instance's upto will be e.g. 123. On openIfChanged we allocate a new instance for TR-2, with upto=1035, sharing the same LRU instance. On each access to the cache, we check if the returned ordinal is <= upto, otherwise return INVALID.

          But that adds another 'if' to every cache-check. How expensive is it, not sure yet, since the taxonomy is not accessed for so many facets.

          It feels like we need to make some compromise here - either at runtime performance or possible correctness. Since the usage of openIfChanged is well defined, and assuming that people who interact with the taxonomy are aware Lucen's NRT, should we rely on the old taxo reader instances to not be used after openIfChanged?

          Show
          Shai Erera added a comment - The side effect of sharing the cache is for TR-1 to receive an ordinal which doesn't actually exist. We can document that if you use openIfChanged, but continue to use the old reader, then this sort of thing can happen. Really, this is not how people should use openIfChanged ... Another idea I had is to wrap the cache with an object that keeps track of 'upto'. I.e. TR-1's instance's upto will be e.g. 123. On openIfChanged we allocate a new instance for TR-2, with upto=1035, sharing the same LRU instance. On each access to the cache, we check if the returned ordinal is <= upto, otherwise return INVALID. But that adds another 'if' to every cache-check. How expensive is it, not sure yet, since the taxonomy is not accessed for so many facets. It feels like we need to make some compromise here - either at runtime performance or possible correctness. Since the usage of openIfChanged is well defined, and assuming that people who interact with the taxonomy are aware Lucen's NRT, should we rely on the old taxo reader instances to not be used after openIfChanged?
          Hide
          Shai Erera added a comment -

          why not have a single instance of the LRUCache for all time, and never call .clear() on it?

          That will help as long as previous TR instances are indeed on their way to die. Otherwise, if e.g. an app, for some reason, reopens a TR but doesn't close the old one and uses both (again, for some really unknown reason), then two TR instances might affect each other.

          Now, since that's a very stupid thing to do, I'm not sure that I care about this much, as long as we preserve correctness. Meaning, that that one instance may reduce the size of the cache, while another increases it - that's the app problem. That that the two instances might evict entries from the LRU cache left and center, that's the app problem.

          The correctness issues that I'm worried about is (suppose that TR-1 and TR-2 share the same instance):

          • TR-1 looks for a category "foo", doesn't find it and adds to the cache the fact that the category is unknown
          • TR-2 looks for the category "foo", which exists in its newer version of the taxonomy, and receives the ordinal -1, which denotes that the category doesn't exist — WRONG !!

          To solve that, we could not store the fact that a category does not exist in the cache. Really, this shouldn't happen - apps do not ask the taxonomy for random categories. So then:

          • TR-1 looks for a category "foo", doesn't find it in the cache and DOES NOT update the cache w/ that info. It goes to disk, doesn't find it there, returns -1.
          • TR-2 looks for the category "foo", which exists in its newer version of the taxonomy, fetches it from disk and stores the ordinal in the cache.
          • TR-1 looks for the category "foo" again, now receives an ordinal which is larger than its taxonomy size — might be a problem !!

          In general, since I don't think that apps access the taxonomy for random ordinals or categories, the second solution might be good. Never store in the cache the fact that an ordinal/category is not found + don't clear() the cache, only nullify its reference + hope for the best ?

          Show
          Shai Erera added a comment - why not have a single instance of the LRUCache for all time, and never call .clear() on it? That will help as long as previous TR instances are indeed on their way to die. Otherwise, if e.g. an app, for some reason, reopens a TR but doesn't close the old one and uses both (again, for some really unknown reason), then two TR instances might affect each other. Now, since that's a very stupid thing to do, I'm not sure that I care about this much, as long as we preserve correctness. Meaning, that that one instance may reduce the size of the cache, while another increases it - that's the app problem. That that the two instances might evict entries from the LRU cache left and center, that's the app problem. The correctness issues that I'm worried about is (suppose that TR-1 and TR-2 share the same instance): TR-1 looks for a category "foo", doesn't find it and adds to the cache the fact that the category is unknown TR-2 looks for the category "foo", which exists in its newer version of the taxonomy, and receives the ordinal -1, which denotes that the category doesn't exist — WRONG !! To solve that, we could not store the fact that a category does not exist in the cache. Really, this shouldn't happen - apps do not ask the taxonomy for random categories. So then: TR-1 looks for a category "foo", doesn't find it in the cache and DOES NOT update the cache w/ that info. It goes to disk, doesn't find it there, returns -1. TR-2 looks for the category "foo", which exists in its newer version of the taxonomy, fetches it from disk and stores the ordinal in the cache. TR-1 looks for the category "foo" again, now receives an ordinal which is larger than its taxonomy size — might be a problem !! In general, since I don't think that apps access the taxonomy for random ordinals or categories, the second solution might be good. Never store in the cache the fact that an ordinal/category is not found + don't clear() the cache, only nullify its reference + hope for the best ?
          Hide
          Michael McCandless added a comment -

          I don't fully follow all changes in the patch, but generally this sounds like an awesome change!

          On that nocommit ... why not have a single instance of the LRUCache for all time, and never call .clear() on it? I suspect in practice that .clear() isn't helping much? Ie it's rare that an app would close the LTR and hold onto the reference to it ...?

          Show
          Michael McCandless added a comment - I don't fully follow all changes in the patch, but generally this sounds like an awesome change! On that nocommit ... why not have a single instance of the LRUCache for all time, and never call .clear() on it? I suspect in practice that .clear() isn't helping much? Ie it's rare that an app would close the LTR and hold onto the reference to it ...?
          Hide
          Shai Erera added a comment -

          While I could just share the instance

          Sorry, forgot to complete the idea – simply sharing the instances currently won't work, b/c e.g. when TR.close() is called, the cache is clear()'ed, which will affect the other instance as well. A simple solution would be to remove the 'final' modifier from the cache member, and nullify the cache on close(), instead of clear()-ing.

          That way, if we rely on TR instances that are reopened to be immediately closed by the caller, and sharing the cache instances may not be so bad.

          Show
          Shai Erera added a comment - While I could just share the instance Sorry, forgot to complete the idea – simply sharing the instances currently won't work, b/c e.g. when TR.close() is called, the cache is clear()'ed, which will affect the other instance as well. A simple solution would be to remove the 'final' modifier from the cache member, and nullify the cache on close(), instead of clear()-ing. That way, if we rely on TR instances that are reopened to be immediately closed by the caller, and sharing the cache instances may not be so bad.
          Hide
          Shai Erera added a comment -

          Patch introduces NRT support by doing the following:

          • Add a constructor which takes DirTaxoWriter, from which DirTaxoReader obtains the internal IndexWriter instance, to obtain NRT readers.
          • Remove refresh() in exchange for a static TaxonomyReader.openIfChanged. Similar to DirectoryReader, the method either returns null if no changes were made to the taxonomy, or a new TR instance otherwise.
          • Extracted the logic of creating the ParentArray and ChildrenArrays from DirTaxoReader into their own classes. As a result:
            • DirTaxoReader code greatly simplified
            • These classes are now immutable, which simplified even more the logic of DirTaxoReader.
          • TaxonomyReader made abstract class instead of an interface, and few methods (e.g. close(), incRef(), decRef() etc.) were pulled to it from DirTaxoReader and made final.

          Not strictly related, but I could not avoid these changes too:

          • Removed the over-verbosing in DirTaxoReader. Some is unnecessary anymore b/c DirTaxoReader is simplified, other was just too much IMO.
          • Improved the documentation of the different methods, again mostly by shortening them and keep them focused.

          NOTE: I put a CHANGES entry under the back-compat section of 4.1. I intend to commit this to 4.x, and it is sort of a back-compat break, even though a simple one.

          There's one nocommit which I'd love if someone can take a look at and perhaps propose a solution. I documented it there, but I'll repeat the issue here - DirTaxoReader maintains two LRU caches which I'd like to share with the new instance returned from openIfChanged. Currently the code copies them fully, which is not so efficient in an NRT case.

          While I could just share the instance, I'm worried that two TR instances have e.g. the ability to change the cache size, or add/remove entries from it.

          Also note the weird behavior I mentioned about cloning the cache, as opposed to add it all to a new instance. I still didn't get to the bottom of why cloning the cache is so horribly slow, but adding it to a fresh new instance is so cheap ...

          Show
          Shai Erera added a comment - Patch introduces NRT support by doing the following: Add a constructor which takes DirTaxoWriter, from which DirTaxoReader obtains the internal IndexWriter instance, to obtain NRT readers. Remove refresh() in exchange for a static TaxonomyReader.openIfChanged. Similar to DirectoryReader, the method either returns null if no changes were made to the taxonomy, or a new TR instance otherwise. Extracted the logic of creating the ParentArray and ChildrenArrays from DirTaxoReader into their own classes. As a result: DirTaxoReader code greatly simplified These classes are now immutable, which simplified even more the logic of DirTaxoReader. TaxonomyReader made abstract class instead of an interface, and few methods (e.g. close(), incRef(), decRef() etc.) were pulled to it from DirTaxoReader and made final. Not strictly related, but I could not avoid these changes too: Removed the over-verbosing in DirTaxoReader. Some is unnecessary anymore b/c DirTaxoReader is simplified, other was just too much IMO. Improved the documentation of the different methods, again mostly by shortening them and keep them focused. NOTE: I put a CHANGES entry under the back-compat section of 4.1. I intend to commit this to 4.x, and it is sort of a back-compat break, even though a simple one. There's one nocommit which I'd love if someone can take a look at and perhaps propose a solution. I documented it there, but I'll repeat the issue here - DirTaxoReader maintains two LRU caches which I'd like to share with the new instance returned from openIfChanged. Currently the code copies them fully, which is not so efficient in an NRT case. While I could just share the instance, I'm worried that two TR instances have e.g. the ability to change the cache size, or add/remove entries from it. Also note the weird behavior I mentioned about cloning the cache, as opposed to add it all to a new instance. I still didn't get to the bottom of why cloning the cache is so horribly slow, but adding it to a fresh new instance is so cheap ...
          Hide
          Shai Erera added a comment -

          Shouldn't you also commit in the constructor

          LuceneTaxonomyWriter behaves just like IndexWriter. Today (I think since 3.1), opening an IndexWriter is just another transaction that you should commit if you want IndexReaders to see it. So if you try:

          IndexWriter w = new IndexWriter(emptyDir, new IWC());
          IndexReader r = IndexReader.open(emptyDir);
          

          you'll get an exception as well.

          If you want that to work, you must insert a commit() call after line #1, and LTW follows this logic.

          Also an explanation of what it's doing underneath

          Refreshing LTR means reopening its internal IndexReader instance. If it has changed, then LTR updates its parents array with the newly added categories. Usually, assuming the taxonomy does not grow a lot (i.e., usually after some point your taxonomy is relatively fixed, and new categories are not added often – much like an index lexicon), this additional update of the parents array is quick.

          Show
          Shai Erera added a comment - Shouldn't you also commit in the constructor LuceneTaxonomyWriter behaves just like IndexWriter. Today (I think since 3.1), opening an IndexWriter is just another transaction that you should commit if you want IndexReaders to see it. So if you try: IndexWriter w = new IndexWriter(emptyDir, new IWC()); IndexReader r = IndexReader.open(emptyDir); you'll get an exception as well. If you want that to work, you must insert a commit() call after line #1, and LTW follows this logic. Also an explanation of what it's doing underneath Refreshing LTR means reopening its internal IndexReader instance. If it has changed, then LTR updates its parents array with the newly added categories. Usually, assuming the taxonomy does not grow a lot (i.e., usually after some point your taxonomy is relatively fixed, and new categories are not added often – much like an index lexicon), this additional update of the parents array is quick.
          Hide
          Jason Rutherglen added a comment -

          It would be great if the cost of (re)opening a new LTR is. Also an explanation of what it's doing underneath.

          Show
          Jason Rutherglen added a comment - It would be great if the cost of (re)opening a new LTR is. Also an explanation of what it's doing underneath.
          Hide
          Mihai Caraman added a comment - - edited

          Newb question: Shouldn't you also commit in the constructor, so you can create a reader right after? For example, to later work with the taxReader through refresh(), when i start clean, I have to initialize:
          w= LuceneTaxonomyWriter(...),
          w.commit(),
          new LuceneTaxonomyReader(...),
          else it throws no segment exception(segments which you'd expect to be there because of the taxWriter ctor, or is that just me ?).

          Show
          Mihai Caraman added a comment - - edited Newb question: Shouldn't you also commit in the constructor, so you can create a reader right after? For example, to later work with the taxReader through refresh(), when i start clean, I have to initialize: w= LuceneTaxonomyWriter(...), w.commit(), new LuceneTaxonomyReader(...), else it throws no segment exception(segments which you'd expect to be there because of the taxWriter ctor, or is that just me ?).

            People

            • Assignee:
              Shai Erera
              Reporter:
              Shai Erera
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development