Lucene - Core
  1. Lucene - Core
  2. LUCENE-3573

TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore.
      As result, calling TR.refresh() will be incorrect at best, but usually throw an AIOOBE.

      1. LUCENE-3573.patch
        3 kB
        Doron Cohen
      2. LUCENE-3573.patch
        19 kB
        Doron Cohen
      3. LUCENE-3573.patch
        17 kB
        Doron Cohen

        Activity

        Hide
        Doron Cohen added a comment -

        Attached patch for trunk adds two tests:

        • one of them is opening a new TR and passes
        • the other is refreshing the TR and fails.
        Show
        Doron Cohen added a comment - Attached patch for trunk adds two tests: one of them is opening a new TR and passes the other is refreshing the TR and fails.
        Hide
        Shai Erera added a comment -

        +1. I think that we should nuke refresh() and adopt the IR approach, even though I don't like the 'maybe' and 'if', might as well make the API consistent. So instead of refresh() we'll have a static TR.openIfChanged that either returns null (no changes, or the taxonomy wasn't recreated) or a new instance in case it was recreated.

        Note that unlike IndexReader, if the taxonomy index wasn't recreated, openIfChanged will modify the internal state of TR. That's ok since the taxonomy index was built for it: existing TR instances (that weren't refreshed) won't be affected as they won't know about the new categories (and taxonomy index doesn't support deletes) and the caller can use the same TR instance in that case.

        Whatever we end up doing, we should remove refresh(). Even though we're not committed to back-compat yet (it's all experimental), I think it is dangerous if we'll simply modify refresh() behavior, because users may not be aware of the change. So a new method is a must.

        Besides that, the test looks good. Was there any reason to add it to TestTaxonomyCombined?

        Show
        Shai Erera added a comment - +1. I think that we should nuke refresh() and adopt the IR approach, even though I don't like the 'maybe' and 'if', might as well make the API consistent. So instead of refresh() we'll have a static TR.openIfChanged that either returns null (no changes, or the taxonomy wasn't recreated) or a new instance in case it was recreated. Note that unlike IndexReader, if the taxonomy index wasn't recreated, openIfChanged will modify the internal state of TR. That's ok since the taxonomy index was built for it: existing TR instances (that weren't refreshed) won't be affected as they won't know about the new categories (and taxonomy index doesn't support deletes) and the caller can use the same TR instance in that case. Whatever we end up doing, we should remove refresh(). Even though we're not committed to back-compat yet (it's all experimental), I think it is dangerous if we'll simply modify refresh() behavior, because users may not be aware of the change. So a new method is a must. Besides that, the test looks good. Was there any reason to add it to TestTaxonomyCombined?
        Hide
        Doron Cohen added a comment -

        I agree about keeping the same notions as IR.

        returns null (no changes, or the taxonomy wasn't recreated)

        In fact I was thinking of a different "contract".

        So we have two approaches here for the returned value:

        • Option A:
          1. new TR - if the taxonomy was recreated.
          2. null - if the taxonomy was either not modified or just grew.
        • Option B:
          1. new TR - if the taxonomy was modified (either recreated or just grew)
          2. null - if the taxonomy was not modified.

        Option A is simpler to implement, but I think it has two drawbacks:

        • it is confusingly different from that of IR
        • the fact that the TR was refreshed is hidden from the caller.

        Option B is a bit more involved to implement:

        • would need to copy arrays' data from old TR to new one in case the taxonomy only grew

        I started to implement option B but now rethinking this...

        Was there any reason to add it to TestTaxonomyCombined?

        Good point, should probably move this to TestDirectoryTaxonomyReader.

        Show
        Doron Cohen added a comment - I agree about keeping the same notions as IR. returns null (no changes, or the taxonomy wasn't recreated) In fact I was thinking of a different "contract". So we have two approaches here for the returned value: Option A: new TR - if the taxonomy was recreated. null - if the taxonomy was either not modified or just grew. Option B: new TR - if the taxonomy was modified (either recreated or just grew) null - if the taxonomy was not modified. Option A is simpler to implement, but I think it has two drawbacks: it is confusingly different from that of IR the fact that the TR was refreshed is hidden from the caller. Option B is a bit more involved to implement: would need to copy arrays' data from old TR to new one in case the taxonomy only grew I started to implement option B but now rethinking this... Was there any reason to add it to TestTaxonomyCombined? Good point, should probably move this to TestDirectoryTaxonomyReader.
        Hide
        Doron Cohen added a comment -

        One more thing

        • In approach B, the fact that the taxonomy just grew simply allows an optimization (read only the new ordinals), and so it is not a part of the API logic, and the only logic is - was the taxonomy modified or not.
        • In approach A, this fact is part of the API logic.
        Show
        Doron Cohen added a comment - One more thing In approach B, the fact that the taxonomy just grew simply allows an optimization (read only the new ordinals), and so it is not a part of the API logic, and the only logic is - was the taxonomy modified or not. In approach A, this fact is part of the API logic.
        Hide
        Shai Erera added a comment -

        My thinking is that today refresh() modifies the internal state, and does so in an optimized manner (however I think we thought there might be concurrency issues with it). And people (whoever they are) got used to it. And the TaxonomyReader does not imply in any way that it is an index (i.e. we've been thinking about implementing the taxonomy as a serialized FST maybe).

        So in fact, let's not call it openIfChanged, because may not be meaningful. We can call it reload() maybe (just to have a better name). Now the question is whether we want to trust javadocs to note that the returned value may not be equals to the given one (ala the old IndexReader.reopen()), or follow the same IR concept where reload() is static.

        If we want to follow IR semantics fully, then option B is the right one. But can it be done in an optimized manner? I bet we won't have any concurrency issues with it ...

        Show
        Shai Erera added a comment - My thinking is that today refresh() modifies the internal state, and does so in an optimized manner (however I think we thought there might be concurrency issues with it). And people (whoever they are) got used to it. And the TaxonomyReader does not imply in any way that it is an index (i.e. we've been thinking about implementing the taxonomy as a serialized FST maybe). So in fact, let's not call it openIfChanged, because may not be meaningful. We can call it reload() maybe (just to have a better name). Now the question is whether we want to trust javadocs to note that the returned value may not be equals to the given one (ala the old IndexReader.reopen()), or follow the same IR concept where reload() is static. If we want to follow IR semantics fully, then option B is the right one. But can it be done in an optimized manner? I bet we won't have any concurrency issues with it ...
        Hide
        Shai Erera added a comment -

        I thought about it more, and I really think that we should go with option A. The common case is that the taxonomy index is not recreated, however it may be updated very frequently. TR.refresh() denotes exactly that – it refreshes the internal state of the TaxonomyReader. This method must be very efficient. I.e., with NRT, people rely on IR.openIfChanged to only open new segments, which is what DTR.refresh() does (calls IR.openIfChanged).

        So putting the name aside, we should have the method X() which either returns null (and updates or not the internal state) or returns a new TR. Unfortunately refresh() is not a bad name (reload() sounds more drastic and less efficient), so maybe refreshIfChanged?

        Show
        Shai Erera added a comment - I thought about it more, and I really think that we should go with option A. The common case is that the taxonomy index is not recreated, however it may be updated very frequently. TR.refresh() denotes exactly that – it refreshes the internal state of the TaxonomyReader. This method must be very efficient. I.e., with NRT, people rely on IR.openIfChanged to only open new segments, which is what DTR.refresh() does (calls IR.openIfChanged). So putting the name aside, we should have the method X() which either returns null (and updates or not the internal state) or returns a new TR. Unfortunately refresh() is not a bad name (reload() sounds more drastic and less efficient), so maybe refreshIfChanged?
        Hide
        Doron Cohen added a comment -

        So in fact, let's not call it openIfChanged, because may not be meaningful.

        Yes this bothered me too.

        so maybe refreshIfChanged?

        ... let's stick to refresh() (but...)

        Current refresh impl is efficient in that (1) arrays only grow if needed (2) caches only cleaned from wrong 'invalid ordinals'. In that, it relies on that the taxonomy can only grow (unless it is recreated, hence this issue).

        So I now think it would be best to modify slightly refresh() - in case it detects that the taxonomy was created, it will throw a new (checked) exception - telling the application that this TR cannot be refreshed but the app can open a new TR.

        This way there is no 3-way logic - either the TR was refreshed or it was not.

        And while we are at this, refresh() is void. I think it would be useful to return boolean, indicating whether any refresh took place.

        Show
        Doron Cohen added a comment - So in fact, let's not call it openIfChanged, because may not be meaningful. Yes this bothered me too. so maybe refreshIfChanged? ... let's stick to refresh() (but...) Current refresh impl is efficient in that (1) arrays only grow if needed (2) caches only cleaned from wrong 'invalid ordinals'. In that, it relies on that the taxonomy can only grow (unless it is recreated, hence this issue). So I now think it would be best to modify slightly refresh() - in case it detects that the taxonomy was created, it will throw a new (checked) exception - telling the application that this TR cannot be refreshed but the app can open a new TR. This way there is no 3-way logic - either the TR was refreshed or it was not. And while we are at this, refresh() is void. I think it would be useful to return boolean, indicating whether any refresh took place.
        Hide
        Shai Erera added a comment -

        This way there is no 3-way logic - either the TR was refreshed or it was not.

        +1.

        And while we are at this, refresh() is void. I think it would be useful to return boolean, indicating whether any refresh took place.

        +1.

        Show
        Shai Erera added a comment - This way there is no 3-way logic - either the TR was refreshed or it was not. +1. And while we are at this, refresh() is void. I think it would be useful to return boolean, indicating whether any refresh took place. +1.
        Hide
        Doron Cohen added a comment -

        Patch, in principle ready to commit, though I plan to go through it once more.

        In this patch:

        • new tests moved to TestDirectoryTaxonomyReader
        • an exception added: InconsistentTaxonomyException
        • when the reader cannot refresh because the taxonomy was recreated since the last time open/refresh, that exception is thrown and the application should open a fresh taxonomy reader.

        Bumped into 3 TODO's while working on this:

        • FilterIndexReader does not implement getCommitUserData(). Once this is fixed can resolvethe TODO in TestIndexClose. I'll open an issue later.
        • TR.refresh() should return a boolean indicating anything was changed (issue).
        • DTW.rollback() seems wrong to me - it rollback the internal IW, which also closes it, but then it refreshes its internal TR, seems wrong...
        Show
        Doron Cohen added a comment - Patch, in principle ready to commit, though I plan to go through it once more. In this patch: new tests moved to TestDirectoryTaxonomyReader an exception added: InconsistentTaxonomyException when the reader cannot refresh because the taxonomy was recreated since the last time open/refresh, that exception is thrown and the application should open a fresh taxonomy reader. Bumped into 3 TODO's while working on this: FilterIndexReader does not implement getCommitUserData(). Once this is fixed can resolvethe TODO in TestIndexClose. I'll open an issue later. TR.refresh() should return a boolean indicating anything was changed (issue). DTW.rollback() seems wrong to me - it rollback the internal IW, which also closes it, but then it refreshes its internal TR, seems wrong...
        Hide
        Shai Erera added a comment -

        Patch looks good ! Few minor comments:

        • DirTW: shoudl --> should
        • Can we name 'taxoCreateTimeToCommit' just 'taxoCreateTime'?
        • OpenMode.CREATE.equals(openMode) can be OpenMode.CREATE == openMode
        • Perhaps DirTW.commit() can call commit(null) to reduce code duplication? IndexWriter anyway calls commit(null) in its commit(), so it's not an error to call iw.commit(null).
          • Same for prepareCommit()
        • I think that DirectoryTaxonomyReader.DIR_TAXONOMY_CREATE_TIME_PROP should be in DirTW since it is the one writing it. We also have this notion in other places in the code, i.e. TermInfosWriter declares the format constants, and other classes reference them.
          • Also, how about renaming it to INDEX_CREATE_TIME, since technically it is the index that has been recreated. This is to avoid confusion when storing this time in e.g. the search index's commit data, which we do in order to synchronize the taxo and search indices.
        • Maybe for debugging add a message to InconsistentTaxonomyException which will tell us that the taxo has been recreated + the recreate time?

        About the other comments:

        FilterIndexReader does not implement getCommitUserData(). Once this is fixed can resolvethe TODO in TestIndexClose. I'll open an issue later.

        We need a separate issue to write a test which ensures FilterIndexReader overrides all non-final methods of IndexReader. But why not fix FIR to override getCommitData in this issue? It's a simple fix, no need for the TODO (or any change to TestIndexClose) and it is technically related to the issue, as it exposes a bug in FIR.

        TR.refresh() should return a boolean indicating anything was changed (issue).

        I prefer that we change the method signature once. Why not modify it to return a boolean? Isn't that a simple change?

        DTW.rollback() seems wrong to me

        I added it when TW was changed to extend TwoPhaseCommit. I'll take a look. Thanks for raising that.

        Show
        Shai Erera added a comment - Patch looks good ! Few minor comments: DirTW: shoudl --> should Can we name 'taxoCreateTimeToCommit' just 'taxoCreateTime'? OpenMode.CREATE.equals(openMode) can be OpenMode.CREATE == openMode Perhaps DirTW.commit() can call commit(null) to reduce code duplication? IndexWriter anyway calls commit(null) in its commit(), so it's not an error to call iw.commit(null). Same for prepareCommit() I think that DirectoryTaxonomyReader.DIR_TAXONOMY_CREATE_TIME_PROP should be in DirTW since it is the one writing it. We also have this notion in other places in the code, i.e. TermInfosWriter declares the format constants, and other classes reference them. Also, how about renaming it to INDEX_CREATE_TIME, since technically it is the index that has been recreated. This is to avoid confusion when storing this time in e.g. the search index's commit data, which we do in order to synchronize the taxo and search indices. Maybe for debugging add a message to InconsistentTaxonomyException which will tell us that the taxo has been recreated + the recreate time? About the other comments: FilterIndexReader does not implement getCommitUserData(). Once this is fixed can resolvethe TODO in TestIndexClose. I'll open an issue later. We need a separate issue to write a test which ensures FilterIndexReader overrides all non-final methods of IndexReader. But why not fix FIR to override getCommitData in this issue? It's a simple fix, no need for the TODO (or any change to TestIndexClose) and it is technically related to the issue, as it exposes a bug in FIR. TR.refresh() should return a boolean indicating anything was changed (issue). I prefer that we change the method signature once. Why not modify it to return a boolean? Isn't that a simple change? DTW.rollback() seems wrong to me I added it when TW was changed to extend TwoPhaseCommit. I'll take a look. Thanks for raising that.
        Hide
        Shai Erera added a comment -

        DTW.rollback() seems wrong to me

        I modified the method impl to call close() instead of refreshReader() and added an appropriate test to DirTWTest to assert that following rollback, no more actions are allowed on DirTW.

        Separately, I think that we should add ensureOpen() to DirTW so that if you call its API after rollback()/close(), you get a proper exception rather than random exceptions (like NPE). I will open an issue for that (and cover the rollback changes there too).

        Show
        Shai Erera added a comment - DTW.rollback() seems wrong to me I modified the method impl to call close() instead of refreshReader() and added an appropriate test to DirTWTest to assert that following rollback, no more actions are allowed on DirTW. Separately, I think that we should add ensureOpen() to DirTW so that if you call its API after rollback()/close(), you get a proper exception rather than random exceptions (like NPE). I will open an issue for that (and cover the rollback changes there too).
        Hide
        Doron Cohen added a comment -

        Thanks for the review and comments Shai, and also thanks for taking care of DTW.rollback() in LUCENE-3579.

        I fixed the typos and the == as suggested.

        name 'taxoCreateTimeToCommit' just 'taxoCreateTime'

        Given your further comment renamed it to 'taxoIndexCreateTime'

        Perhaps DirTW.commit() can call commit(null)

        I considered this when first coding, as it would have compacted the code. But felt uncomfortable (still do) relying on a non documented behavior of IW.commit(null).

        DirectoryTaxonomyReader.DIR_TAXONOMY_CREATE_TIME_PROP should be in DirTW + renaming to INDEX_CREATE_TIME

        Done.

        add a message to InconsistentTaxonomyException

        Added.

        But why not fix FIR to override getCommitData in this issue?

        Done. Now it feels a bit wrong that this will not appear in lucene/CHANGES since this issue is in lucene/contrib. Guess this is not too bad...?

        TR.refresh() should return a boolean indicating anything was changed (issue). I prefer that we change the method signature once...

        Good point. Added a test case in TestDirectoryTaxonomyReader for this.

        Show
        Doron Cohen added a comment - Thanks for the review and comments Shai, and also thanks for taking care of DTW.rollback() in LUCENE-3579 . I fixed the typos and the == as suggested. name 'taxoCreateTimeToCommit' just 'taxoCreateTime' Given your further comment renamed it to 'taxoIndexCreateTime' Perhaps DirTW.commit() can call commit(null) I considered this when first coding, as it would have compacted the code. But felt uncomfortable (still do) relying on a non documented behavior of IW.commit(null). DirectoryTaxonomyReader.DIR_TAXONOMY_CREATE_TIME_PROP should be in DirTW + renaming to INDEX_CREATE_TIME Done. add a message to InconsistentTaxonomyException Added. But why not fix FIR to override getCommitData in this issue? Done. Now it feels a bit wrong that this will not appear in lucene/CHANGES since this issue is in lucene/contrib. Guess this is not too bad...? TR.refresh() should return a boolean indicating anything was changed (issue). I prefer that we change the method signature once... Good point. Added a test case in TestDirectoryTaxonomyReader for this.
        Hide
        Doron Cohen added a comment -

        Hmm, now that there is a test for LTW.rollback() my changes fail LTW's testRollback() because LTW.close() now may call IW.commit(Map) (which it did not do before my changes).

        For fixing this:

        • added private doClose() which closes IW and nullifies it, and calls closeResources().
        • rollback() calls doClose() instead of close().

        Also, rollback() now made synchronized.

        Show
        Doron Cohen added a comment - Hmm, now that there is a test for LTW.rollback() my changes fail LTW's testRollback() because LTW.close() now may call IW.commit(Map) (which it did not do before my changes). For fixing this: added private doClose() which closes IW and nullifies it, and calls closeResources(). rollback() calls doClose() instead of close(). Also, rollback() now made synchronized.
        Hide
        Shai Erera added a comment -

        For fixing this:

        added private doClose() which closes IW and nullifies it, and calls closeResources().
        rollback() calls doClose() instead of close().

        Also, rollback() now made synchronized.

        Good catch. It is dangerous that rollback is not sync'ed when commit/prepare/close are.

        +1 to commit !

        Show
        Shai Erera added a comment - For fixing this: added private doClose() which closes IW and nullifies it, and calls closeResources(). rollback() calls doClose() instead of close(). Also, rollback() now made synchronized. Good catch. It is dangerous that rollback is not sync'ed when commit/prepare/close are. +1 to commit !
        Hide
        Doron Cohen added a comment -

        Final patch.

        Also updated the user-guide about refresh() behavior.

        Removed the changes entry - for facet this would go only into 3x.

        Planning to commit this soon.

        Show
        Doron Cohen added a comment - Final patch. Also updated the user-guide about refresh() behavior. Removed the changes entry - for facet this would go only into 3x. Planning to commit this soon.
        Hide
        Doron Cohen added a comment -

        Committed:

        • r1202754 - trunk
        • r1203165 - 3x
        Show
        Doron Cohen added a comment - Committed: r1202754 - trunk r1203165 - 3x

          People

          • Assignee:
            Doron Cohen
            Reporter:
            Doron Cohen
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development