Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6, 5.0
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      In LUCENE-4258 we started to work on incremental field updates; however, the amount of changes is immense and hard to follow/consume. The reason is that we targeted postings, stored fields, DV, etc., all from the get-go.

      I'd like to start afresh here, with numeric-DV-field updates only. There are a couple of reasons for that:

      • NumericDV fields should be easier to update, if e.g. we write all the values of all the documents in a segment for the updated field (similar to how livedocs work, and previously norms).
      • It's a fairly contained issue, attempting to handle just one data type to update, yet requires many changes to core code which will also be useful for updating other data types.
      • It has value in and of itself, and we don't need to allow updating all of Lucene's data types at once ... we can do that gradually.

      I already have a working patch, which I'll upload next, explaining the changes.
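      A minimal usage sketch of the API this issue introduces, based on the IW.updateNumericDocValue(Term, String, Long) signature discussed in the comments below; the writer, field, and term names are illustrative assumptions:

      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.index.Term;

      // Update the numeric doc-values field "price" of every document matching
      // id:doc-7, without re-indexing the documents. 'writer' is assumed to be an
      // open IndexWriter over an index whose docs carry a NumericDocValuesField.
      void markPrice(IndexWriter writer) throws java.io.IOException {
        writer.updateNumericDocValue(new Term("id", "doc-7"), "price", 99L);
        writer.commit(); // updates become visible on a flush point such as commit()
      }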

      Attachments

      1. LUCENE-5189-updates-order.patch (9 kB, Shai Erera)
      2. LUCENE-5189-updates-order.patch (12 kB, Shai Erera)
      3. LUCENE-5189-segdv.patch (11 kB, Shai Erera)
      4. LUCENE-5189-renames.patch (221 kB, Shai Erera)
      5. LUCENE-5189-no-lost-updates.patch (9 kB, Shai Erera)
      6. LUCENE-5189-4x.patch (216 kB, Shai Erera)
      7. LUCENE-5189-4x.patch (398 kB, Shai Erera)
      8. LUCENE-5189.patch (125 kB, Shai Erera)
      9. LUCENE-5189.patch (125 kB, Shai Erera)
      10. LUCENE-5189.patch (114 kB, Shai Erera)
      11. LUCENE-5189.patch (118 kB, Shai Erera)
      12. LUCENE-5189.patch (123 kB, Shai Erera)
      13. LUCENE-5189.patch (129 kB, Shai Erera)
      14. LUCENE-5189.patch (131 kB, Shai Erera)
      15. LUCENE-5189.patch (149 kB, Shai Erera)
      16. LUCENE-5189.patch (164 kB, Shai Erera)
      17. LUCENE-5189.patch (164 kB, Shai Erera)
      18. LUCENE-5189.patch (167 kB, Shai Erera)
      19. LUCENE-5189_process_events.patch (2 kB, Simon Willnauer)
      20. LUCENE-5189_process_events.patch (4 kB, Simon Willnauer)

        Activity

        Mikhail Khludnev added a comment -

        One more note for adopters: if you start by experimenting with a test, make sure you suppress older codecs (check how one of the DV update tests does it); otherwise you can spend quite some time digging into a really odd exception.

        Shai Erera added a comment -

        I checked the code and it looks the same with e.g. deleteDocuments(Term) - the Term isn't cloned internally. So your comment pertains to other IW methods.

        Mikhail Khludnev added a comment -

        Just want to leave one caveat for the record. When you call

        IW.updateNumericDocValue(Term, String, Long)

        make sure that the term is deep-cloned beforehand. Otherwise, if you later modify the term or its bytes, the modified version will be applied. That might be a problem.
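        A minimal sketch of that caveat, assuming an open IndexWriter 'writer' and illustrative field names; the Term bytes are deep-copied because IndexWriter buffers the Term as-is:

        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.util.BytesRef;

        // 'reusedId' is a BytesRef buffer the caller keeps mutating between updates.
        static void safeUpdate(IndexWriter writer, BytesRef reusedId) throws java.io.IOException {
          // deepCopyOf detaches the term bytes from the reusable buffer, so later
          // mutations of 'reusedId' cannot change the buffered update.
          Term term = new Term("id", BytesRef.deepCopyOf(reusedId));
          writer.updateNumericDocValue(term, "price", 42L);
        }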

        Shai Erera added a comment -

        You're right Simon. The updates are buffered in their raw form in memory until a flush is needed (e.g. commit(), or an NRT open). At that point they are resolved and written to the Directory. This is where it differs from deletes - while deletes are small enough to keep in their resolved form in memory, updates aren't: a single update can affect millions of documents, each taking a long (the updated value) ... perhaps future work could be to distinguish between small and large updates, and still keep the small updates in memory. But I believe that will affect a lot more code, e.g. SegReader will then need to be aware of both the in-memory and on-disk NDVs and do a kind of merge between them when an NDV is requested for such a field ... it's not going to be pretty-looking code, I imagine.
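        A small sketch of that flush behavior, assuming an open IndexWriter and an existing NRT reader; names are illustrative:

        import org.apache.lucene.index.DirectoryReader;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.Term;

        // The update is only resolved and written on a flush point (commit() or an
        // NRT reopen); a reader opened before that point will not see the new value.
        DirectoryReader refresh(IndexWriter writer, DirectoryReader oldReader) throws java.io.IOException {
          writer.updateNumericDocValue(new Term("id", "doc-7"), "price", 150L);
          DirectoryReader newReader = DirectoryReader.openIfChanged(oldReader, writer, true);
          if (newReader == null) {
            return oldReader;   // nothing changed
          }
          oldReader.close();
          return newReader;     // reflects the new doc-values generation
        }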

        Simon Willnauer added a comment -

        Mikhail Khludnev, the updates are buffered in memory "just like deletes" and then batched/applied when deletes are flushed to disk / are applied. Correct me if I am wrong, Shai.

        Mikhail Khludnev added a comment -

        Shai Erera let me ask a trivial question.

        1. am I right that it writes to a file when I call IW.updateNumericDocValue()?
        2. and obviously, if it does so, I suppose it would be more efficient to somehow batch a series of updates and flush them afterwards?
        3. or am I wrong and there are some implicit flush semantics?

        thanks!

        ASF subversion and git services added a comment -

        Commit 1538258 from Shai Erera in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1538258 ]

        LUCENE-5189: add @Deprecated annotation to SegmentInfo.attributes

        Shai Erera added a comment -

        Finished backporting the changes to 4x. Thanks all for your valuable comments and help to get this feature in!

        ASF subversion and git services added a comment -

        Commit 1538146 from Shai Erera in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1538146 ]

        LUCENE-5189: rename internal API following NumericDocValues updates

        ASF subversion and git services added a comment -

        Commit 1538143 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1538143 ]

        LUCENE-5189: rename internal API following NumericDocValues updates

        Shai Erera added a comment -

        Handle "TODO (DVU_RENAME)" comments. It's really a rote renaming + I renamed some members that referred to these objects.

        Shai Erera added a comment -

        Backported to 4x. I'll handle the "TODO (DVU_RENAME)" tasks (rote renaming of internal API).

        ASF subversion and git services added a comment -

        Commit 1537832 from Shai Erera in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1537832 ]

        LUCENE-5189: add NumericDocValues field updates

        ASF subversion and git services added a comment -

        Commit 1537829 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1537829 ]

        LUCENE-5189: add CHANGES

        Michael McCandless added a comment -

        +1 to backport.

        Shai Erera added a comment -

        Patch covers the work from all the issues, ported to 4x (created w/ --show-copies-as-adds). I think it's ready (tests pass several times).

        If there are no objections, I will add a CHANGES entry and commit it.

        ASF subversion and git services added a comment -

        Commit 1535526 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1535526 ]

        LUCENE-5189: factor out SegmentDocValues from SegmentReader

        Shai Erera added a comment -

        I reviewed the issue and noticed I forgot to handle the refactoring Robert and Uwe asked for (SegmentDocValues). Here's a patch that handles it - quite straightforward.

        Shai Erera added a comment -

        I think it's ready to backport to 4x.

        ASF subversion and git services added a comment -

        Commit 1531620 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1531620 ]

        LUCENE-5189: add field updates to TestIndexWriterDelete.testNoLostDeletesOrUpdatesOnIOException

        Shai Erera added a comment -

        Thanks for the comments Mike, I fixed the code.

        That's a great catch on forceMerge carrying the wrapped IOExc forward. Why is jenkins not hitting this already...?

        I think because liveDocs aren't written during forceMerge, i.e. they are carried in RAM?

        Michael McCandless added a comment -

        The test case changes look good!

        This confused me:

        +            } else if (!fieldUpdate || random().nextBoolean()) { // sometimes do both deletes and updates
        

        Shouldn't that just be its own if (not else if)? Otherwise I think the comment is wrong ...

        Also I think this:

        +          if (liveDocs != null && liveDocs.get(i)) {
        

        should be:

        +          if (liveDocs == null || liveDocs.get(i)) {
        

        ?

        That's a great catch on forceMerge carrying the wrapped IOExc forward. Why is jenkins not hitting this already...?

        Shai Erera added a comment -

        Add field updates to TestIndexWriterDelete.testNoLostDeletesOrUpdates. I had to change the test to catch IOException and ignore it if it's a FakeIOE, or if ioe.getCause() is a FakeIOE. The reason is that if the exception happens during a merge (in mergeMiddle), IW registers the exception in mergeExceptions and later throws it as a wrapped IOE. This caused the test to falsely fail.

        ASF subversion and git services added a comment -

        Commit 1531496 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1531496 ]

        LUCENE-5189: test unsetting a document's value while the segment is merging

        Shai Erera added a comment -

        If SR itself doesn't need ref-counting, perhaps we can pull this out of SR then? (rote-refactor into DV-thingy or something).

        You mean something like SegmentDocValues? It's doable I guess. SR would need to keep track of the DV.gens it uses though, so that in SR.doClose it can call segDV.decRef(gens) so that the latter can decRef all the DVPs that are used for these gens. If it also removes a gen from the map when it's no longer referenced by any SR, we don't need to take care of clearing the genDVP map when all SRs were closed (otherwise I think we'll need to refCount SegDV too, like SCR).

        I'll give it a shot.

        Robert Muir added a comment -

        Thanks shai... I think for me, this is the main one:

        SR itself doesn't need any ref-counting, it's the DVPs that need them. Putting them in SCR I think only complicates things (or at least doesn't simplify).

        If SR itself doesn't need ref-counting, perhaps we can pull this out of SR then? (rote-refactor into a DV-thingy or something).

        Shai Erera added a comment -

        because there isn't a single DVP that all SRs share

        let me clarify that – if you don't use DV updates, you can say that the DVPs for gen=-1 are shared between all the readers. But with updates, gen=-1 may soon be unused by any reader. I prefer for all DVPs management to be in one place than to split it between SCR and SR.

        Nevertheless, if you think this can be simplified somehow, I'm open to suggestions. I don't want to move them to SCR just for the sake of saying they are in SCR. The code which decides if to reuse a certain DVP or create a new one will still be in SR, and so it makes sense to me that SR manages them.

        Shai Erera added a comment -

        Let me explain: SCR is still used for all the shared objects between all SRs, objects that are closed when the last of the SRs using this 'core' is closed. With gen'd DVPs, they no longer belong in SCR, because there isn't a single DVP that all SRs share. For example, suppose that you start with an index without updates, then the DV fields gen is -1. So a DVP for field 'f' will be created. Next you update a field and reopen, the gen for 'f' is incremented to 1 and a new DVP is needed. The DVP for gen=-1 is no longer needed.

        Rob, if we did what you propose - store the initial DVPs in SCR and the updated ones in the SRs (as they are updated and reopened) - then soon the DVPs in SCR would not be used by any SR and would just hang there (possibly consuming expensive RAM, e.g. MemoryDVF) until the last SR is closed.

        Rather, DVPs are shared between SRs. RefCount is a simple object which keeps ref-counting for an object. When an SR is opened anew (e.g. initializes a new SCR), the DVPs it initializes all have RefCount(1). When an SR is opened by sharing another SR (e.g. NRT, DirReader.openIfChanged), it lists all the fields with DV. If the other SR has a DVP for their dvGen, it reuses it and inc-ref it, otherwise it opens a new DVP with RC(1).

        SR itself doesn't need any ref-counting, it's the DVPs that need it. Putting them in SCR I think only complicates things (or at least doesn't simplify). For example, currently when SR.doClose is called, it dec-refs all the DVPs that it uses, and DocValuesRefCount.release() closes the DVP when its ref-count reaches 0. If we moved all the DVPs to SCR, then SR.doClose would need to go and dec-ref all the DVPs that it uses in SCR .. but how does it know which DVPs it uses if all DVPs just sit there in SCR - even ones that it doesn't use? I think that having them in SR actually simplifies and keeps the code clear.

        Rob, if you don't use DV updates, all DV fields have dvGen=-1 and they are shared between all SRs.
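        A loose sketch of the per-DVP ref-counting described above (illustrative only, not the actual SegmentReader/RefCount code): each generation's producer starts at one reference, sharing readers inc-ref it, and the last dec-ref closes it.

        import java.io.Closeable;
        import java.io.IOException;
        import java.util.concurrent.atomic.AtomicInteger;

        final class RefCounted<T extends Closeable> {
          private final T object;                                   // e.g. a DocValuesProducer for one gen
          private final AtomicInteger refCount = new AtomicInteger(1);

          RefCounted(T object) { this.object = object; }

          T get() { return object; }

          void incRef() { refCount.incrementAndGet(); }             // another SR shares this gen

          void decRef() throws IOException {
            if (refCount.decrementAndGet() == 0) {
              object.close();                                       // last SR using this gen releases it
            }
          }
        }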

        Uwe Schindler added a comment -

        Hi, to me the whole code in SegmentReader looks too complicated and contains things that should not be in SegmentReader: SegmentReaders are unmodifiable and final. Why does SegmentReader need to use refcounting? The refcounting should be in SegmentCoreReaders; SegmentReaders should only have final fields and unmodifiable data structures - please, no refcounts!

        If a DirectoryReader is reopened you get a new instance of SegmentReader with the same core cache key, so why is it holding modifiable stuff?

        Maybe it is just a matter of moving stuff to SegmentCoreReaders, but I did so much hard work to keep this class simple when we removed modifications from IndexReaders in 4.x (thanks to Robert for the help!). But now it's as complicated as, or more complicated than, before (see the number of code lines in Lucene 3.6!).

        Robert Muir added a comment -

        Or maybe it's a simple oversight that that RefCount thing is in SR versus SCR?

        I don't understand why an SR would need refcounting, since it's unmodifiable?

        Robert Muir added a comment -

        To get NRT working for docvalues again, I feel like the logic should be something like this in SegmentReader:

        fooDV(field X) {
          final Producer producer;
          if (gen != core.gen) {
            producer = this.producer;
          } else {
            producer = core.producer;
          }
          ...
        }

        this way, SR only opens up docvalues for ones that have been updated since the SCR was open, but otherwise uses the shared producer as before, and NRT works with docvalues again.

        Robert Muir added a comment -

        If someone isn't doing numeric updates, and is just using e.g. memory docvalues, are we really re-loading docvalues for each segmentreader now, versus holding them in the segment core?

        Shai Erera added a comment -

        Committed a fix to SegmentReader (r1529611) to clean up unused memory (un-referenced DVPs) and to stop using an anonymous RefCount class, since the latter resulted in a memory leak: those anonymous classes held a reference to their SR, so we always referenced unused DVPs.

        ASF subversion and git services added a comment -

        Commit 1528837 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1528837 ]

        LUCENE-5189: fix updates-order and docsWithField bugs

        Shai Erera added a comment -

        As I continued to work on LUCENE-5248, I realized that the code in ReaderAndLiveDocs does not work well for documents without the field (i.e. the segment has the field, but some docs don't have a value and aren't updated). It falsely read their value from the existing ndv (reading 0L), which means that before the update docsWithField returned false and after the update it returned true. I consider this a bug as well, and incorporated a fix and a test in this patch.

        Shai Erera added a comment -

        While debugging LUCENE-5248, I've hit a bug when the same terms update the same doc multiple times. E.g. if the updates are sent keyed by the term sequence t1, t2, t1 – the applied value was that of 't2' and not 't1'. This is caused because LinkedHashMap traverses in insertion order, so when we encounter the second occurrence of 't1' we should remove it and re-add it to the map. But the fix isn't that simple, because BufDeletes currently holds a Map<Term,Map<String,NumericUpdate>> (for each Term, all the fields that it affects). We cannot remove and re-add a Term entry in the outer map, because that would move all of its affected fields to the end of the iteration, which is wrong.

        I changed the map to be Map<String,LinkedHashMap<Term,NumericUpdate>> (per field, all the terms that update it, in order) and wrote a simple test case which reproduces this. TestNumericDVUpdates passed for a few hundred iterations.
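        A standalone illustration (plain JDK code, not the Lucene classes) of the ordering pitfall: re-putting an existing key into a LinkedHashMap keeps its original position, so the entry has to be removed and re-added to move it to the end of the iteration order.

        import java.util.LinkedHashMap;
        import java.util.Map;

        public class UpdateOrderDemo {
          public static void main(String[] args) {
            Map<String, Long> updates = new LinkedHashMap<>();
            updates.put("t1", 1L);
            updates.put("t2", 2L);
            updates.put("t1", 3L);       // value replaced, but t1 keeps its first position
            System.out.println(updates); // {t1=3, t2=2} -> t1 applied before t2 (wrong)

            updates.remove("t1");        // the fix: remove, then re-add,
            updates.put("t1", 3L);       // so the latest t1 update is applied last
            System.out.println(updates); // {t2=2, t1=3} -> correct apply order
          }
        }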

        Simon Willnauer added a comment -

        backported the renaming to 4x as well.

        ASF subversion and git services added a comment -

        Commit 1528081 from Simon Willnauer in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1528081 ]

        LUCENE-5189: s/doApplyAllDeletes/getAndResetApplyAllDeletes/

        ASF subversion and git services added a comment -

        Commit 1528077 from Simon Willnauer in branch 'dev/trunk'
        [ https://svn.apache.org/r1528077 ]

        LUCENE-5189: s/doApplyAllDeletes/getAndResetApplyAllDeletes/

        ASF subversion and git services added a comment -

        Commit 1528076 from Simon Willnauer in branch 'dev/trunk'
        [ https://svn.apache.org/r1528076 ]

        LUCENE-5189: Check if deletes need to be applied when docvalues are updated and process IW event if needed

        Michael McCandless added a comment -

        +1 for getAndResetApplyDeletes ... I don't think it needs a separate issue ... thanks!

        Simon Willnauer added a comment - edited

        So, basically, if you only call IW.updateNumericDocValue, then we were failing to flush once RAM usage was over the limit?

        yes

        Maybe it should be named "getAndClearDoApplyAllDeletes" to make this behavior clear?

        I think we should change it to "getAndResetApplyDeletes()" I agree. Will do that in a separate commit and open a dedicated issue for it.

        Michael McCandless added a comment -

        +1 for Simon's patch; good catch. Sort of sneaky that
        "doApplyAllDeletes" as a side effect clears that bit. Maybe it should
        be named "getAndClearDoApplyAllDeletes" to make this behavior clear?

        So, basically, if you only call IW.updateNumericDocValue, then we were
        failing to flush once RAM usage was over the limit?
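        A tiny sketch of the get-and-clear semantics being suggested, assuming the flag were backed by an AtomicBoolean (illustrative, not the actual flush-control code):

        import java.util.concurrent.atomic.AtomicBoolean;

        final class ApplyDeletesFlag {
          private final AtomicBoolean applyAllDeletes = new AtomicBoolean();

          void set() { applyAllDeletes.set(true); }

          // The name makes the side effect explicit: reading the flag also resets it.
          boolean getAndResetApplyAllDeletes() { return applyAllDeletes.getAndSet(false); }
        }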

        Shai Erera added a comment -

        Looks good. Thanks Simon.

        Simon Willnauer added a comment -

        here is an updated patch that contains a test for this as well - will commit this later

        Shai Erera added a comment -

        Thanks Simon, good catch. I think there's no test for it because updates are not lost, they are just not applied based on RAM buffer settings. So e.g. if we reopened a reader, committed or kicked off a merge, they will be applied, right? Nevertheless, we should fix it as I'm seeing this affects my attempts to work on LUCENE-5248, where the updates are only applied when IW.close is called (since I don't reopen, merge or commit).

        Simon Willnauer added a comment -

        I found a problem with how updates are processed here. The updateNumericDocValues method on DW doesn't correctly handle the case where flushControl.doApplyAllDeletes() returns true, so we might miss flushing deletes altogether. It also doesn't process IW events at all, which this patch fixes.

        What worries me more here is that we don't have a test failing on this.

        Michael McCandless added a comment -

        I'd not be happy with this being released tomorrow.

        Can you give more details here? I.e., do you consider the optimizations necessary for release? Or are there other things?

        Shai Erera added a comment -

        I opened LUCENE-5248 to handle the data structure (irrespective of whether we choose to backport first or not).

        Shai Erera added a comment -

        I don't mind if we backport the whole thing in one commit. I just thought it would be cleaner to backport each issue's commits. I doubt anyone would "hit" an issue within the couple of hours it will take. But I'll do this in one backport.

        Only port stuff to the stable branch unless you'd be happy to release it tomorrow

        I agree, though what if we decide to release 5.0 in one month? Do we revert the whole feature? I just think that it's software, and software always improves. Even if we optimize the way updates are kept (the problem is in ReaderAndLiveDocs), it can always be improved tomorrow even more. That's why the feature is marked @lucene.experimental – it may not be the most optimized thing, but it works and more importantly - it doesn't affect users that don't use it ("do no harm").

        I will look into improving the way updates are kept in RALD (Map<String,Map<Integer,Long>>), though honestly, we have no data points yet as to whether it's inefficient, or whether a new structure would be more efficient. What I think we can do is keep the updates in, conceptually, a pair of parallel int[] and long[] arrays (maybe one of those *Buffer classes we have, for better compression). I'll start with that.
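        A rough sketch of that parallel-arrays idea, using plain arrays instead of the packed/compressed buffers mentioned above (illustrative only):

        import java.util.Arrays;

        // Buffers (docId, value) pairs in two parallel arrays instead of a Map<Integer,Long>.
        final class FieldUpdatesBuffer {
          private int[] docs = new int[16];
          private long[] values = new long[16];
          private int size;

          void add(int doc, long value) {
            if (size == docs.length) {                  // grow both arrays together
              docs = Arrays.copyOf(docs, size * 2);
              values = Arrays.copyOf(values, size * 2);
            }
            docs[size] = doc;
            values[size] = value;
            size++;
          }

          int size() { return size; }
          int doc(int i) { return docs[i]; }
          long value(int i) { return values[i]; }
        }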

        Simon Willnauer added a comment -

        I suppose we could also wait some more (the test did uncover a big issue recently), but I don't think we need to...

        I think we should apply a general rule here that we (at least I believe so) in Lucene land had for a long time. ->> "Only port stuff to the stable branch unless you'd be happy to release it tomorrow." I looked at the change and honestly this looks pretty scary to me. I'd not be happy with this being released tomorrow. We can happily aim for this being in 4.6 but baking in should happen in trunk. This is a huge change and I am thankful for the work that has been done on it but I think we should not rush this into the stable branch. This should IMO bake in more in trunk and have several rounds of optimization to it in terms of datastructures before we go and release it.

        Robert Muir added a comment -

        I don't think we should do that: then tests maybe fail, users hit too many open files, hit exceptions when segments don't contain their field, etc., in 4.x, and we say "sorry, 4.x is only temporarily broken: maybe you should try trunk for a more stable user experience?"

        Shai Erera added a comment -

        I wrote (somewhere, don't remember where) that I'd like to backport issue-by-issue. So here I backport the changes done in this issue, after that I'll backport 5215, 5216 and 5246. That way the commits to 4x will be associated with the proper issues. Do you see a problem w/ that strategy? I definitely intend to backport all the changes in one go, only do that in multiple commits, one per issue.

        Robert Muir added a comment -

        I'm only confused by the strategy of the backport: I see things like SegmentWriteState.isFieldUpdate here that were fixed in other issues.

        Is this intentional? If the argument is to get user testing, I think to the user this would actually be "unbaking" if such a term exists...?

        Can you svn merge the revs to LUCENE-5215 and any other issues (e.g. bugfixes) so we have one coherent backport?

        Michael McCandless added a comment -

        I think backporting now is OK: we are at the start of the release cycle for 4.6 (so Jenkins will have lots of time to bake the changes for 4.6), the change has baked for ~ 2 weeks in trunk now, perf graphs look fine ( http://people.apache.org/~mikemccand/lucenebench/ ).

        I agree we can optimize in-RAM data structures, but I don't think that needs to hold up backporting.

        I suppose we could also wait some more (the test did uncover a big issue recently), but I don't think we need to...

        Shai Erera added a comment -

        I don't understand - why can't it live in 4x and be tested by Jenkins as well as users? DocValues went through two releases before they were overhauled, I don't see how this issue is different. It baked in trunk for 2+ weeks, I think it can bake in the 4x branch as well (I believe we have enough time until 4.6 will be released?). Optimizations, which are important, can come next as improvements. For instance, we don't know yet if the data structures used (Maps!) are inefficient, because nobody uses this feature yet - maybe we'll spend a lot of time optimizing while in practice it won't make a big difference? I think that as long as this feature does not affect anyone who doesn't use numeric DV updates, it's safe to backport.

        Simon Willnauer added a comment -

        I don't think we should backport this to 4.x yet. I'd want this to bake further and maybe optimize the memory usage of the internal data structures before this goes into the stable branch. This is a significant change and we should be careful with backporting things like this. Can we wait a bit more with this?

        Shai Erera added a comment -

        Backported all the changes related to this issue to 4x. All tests pass, I think it's ready.

        ASF subversion and git services added a comment -

        Commit 1527147 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1527147 ]

        LUCENE-5189: disable merges in testChangeCodec

        ASF subversion and git services added a comment -

        Commit 1524901 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1524901 ]

        LUCENE-5189: remove leftover

        ASF subversion and git services added a comment -

        Commit 1524900 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1524900 ]

        LUCENE-5189: fixed concurrency bug

        Shai Erera added a comment -

        Jenkins reported this failure, which I'm unable to reproduce with and without the seed (master and child), with iters.

        1 tests failed.
        REGRESSION:  org.apache.lucene.index.TestNumericDocValuesUpdates.testManyReopensAndFields
        
        Error Message:
        invalid value for doc=351, field=f1 expected:<15> but was:<14>
        
        Stack Trace:
        java.lang.AssertionError: invalid value for doc=351, field=f1 expected:<15> but was:<14>
                at __randomizedtesting.SeedInfo.seed([5E1E0079E35D52E:331D82281FC0B632]:0)
                at org.junit.Assert.fail(Assert.java:93)
                at org.junit.Assert.failNotEquals(Assert.java:647)
                at org.junit.Assert.assertEquals(Assert.java:128)
                at org.junit.Assert.assertEquals(Assert.java:472)
                at org.apache.lucene.index.TestNumericDocValuesUpdates.testManyReopensAndFields(TestNumericDocValuesUpdates.java:757)
                ...
        
        Build Log:
        [...truncated 776 lines...]
           [junit4] Suite: org.apache.lucene.index.TestNumericDocValuesUpdates
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestNumericDocValuesUpdates -Dtests.method=testManyReopensAndFields -Dtests.seed=5E1E0079E35D52E -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=tr -Dtests.timezone=Etc/GMT-6 -Dtests.file.encoding=US-ASCII
           [junit4] FAILURE 1.40s J0 | TestNumericDocValuesUpdates.testManyReopensAndFields <<<
           [junit4]    > Throwable #1: java.lang.AssertionError: invalid value for doc=351, field=f1 expected:<15> but was:<14>
           [junit4]    >        at __randomizedtesting.SeedInfo.seed([5E1E0079E35D52E:331D82281FC0B632]:0)
           [junit4]    >        at org.apache.lucene.index.TestNumericDocValuesUpdates.testManyReopensAndFields(TestNumericDocValuesUpdates.java:757)
           [junit4]    >        at java.lang.Thread.run(Thread.java:724)
           [junit4]   2> NOTE: test params are: codec=Asserting, sim=RandomSimilarityProvider(queryNorm=false,coord=no): {}, locale=tr, timezone=Etc/GMT-6
           [junit4]   2> NOTE: Linux 3.2.0-53-generic amd64/Oracle Corporation 1.8.0-ea (64-bit)/cpus=8,threads=1,free=66621176,total=210272256
           [junit4]   2> NOTE: All tests run in this JVM: [TestSegmentReader, TestStressNRT, TestSort, TestShardSearching, TestEliasFanoSequence, TestBytesRefHash, TestPhrasePrefixQuery, TestLucene45DocValuesFormat, TestFastCompressionMode, TestEliasFanoDocIdSet, TestSearchForDuplicates, TestFixedBitSet, TestIsCurrent, TestFilteredSearch, TestFieldCacheSanityChecker, TestSegmentTermEnum, TestDeletionPolicy, TestSimpleExplanations, TestRegexpRandom, TestIndexCommit, TestCloseableThreadLocal, TestNumericRangeQuery32, TestTwoPhaseCommitTool, TestIndexWriterOnDiskFull, TestPhraseQuery, TestSearchAfter, TestParallelReaderEmptyIndex, TestMaxTermFrequency, TestFlushByRamOrCountsPolicy, TestSimilarity, TestNumericRangeQuery64, TestByteSlices, TestSameScoresWithThreads, TestDocValuesWithThreads, TestMockAnalyzer, TestArrayUtil, TestPostingsOffsets, TestCompressingTermVectorsFormat, TestSentinelIntSet, TestCustomNorms, TestExternalCodecs, TestNumericDocValuesUpdates]
           [junit4] Completed on J0 in 83.46s, 24 tests, 1 failure <<< FAILURES!
        
        Shai Erera added a comment -

        SegmentWriteState flushState; in DWPT is unused

        +1 to remove it. Indeed it's unused, but because it's package-private, eclipse doesn't complain about it.

        In DW the `updateNumericDocValue` method is synchronized

        I followed the other two delete methods. I'm fine with opening a separate issue to remove the synchronization, especially if it's not trivial.

        I really like the way how this is implemented piggybacking on the delete queue to get a total ordering

        Thanks, it was very helpful to have deletes already covered like that. I only had to follow their breadcrumbs.

        Simon Willnauer added a comment -

        I only briefly looked at the changes in DW, DWPT, IW & BDS and I have 2 questions:

        • SegmentWriteState flushState; in DWPT is unused - can we remove it? (I generally want this class to have only final members as well if possible)
        • In DW the `updateNumericDocValue` method is synchronized - I don't think it needs to be. The other two delete methods don't need to be synced either - maybe we can open another issue to remove the synchronization? It won't be possible to just drop it, but it won't be much work.

        I really like the way how this is implemented piggybacking on the delete queue to get a total ordering
        nice one!

        ASF subversion and git services added a comment -

        Commit 1523525 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1523525 ]

        LUCENE-5189: add testcase

        Shai Erera added a comment -

        Thanks, committed to trunk, revision 1523461. After we resolve all corner issues, and let Jenkins sleep on it for a while, I'll port to 4x.

        ASF subversion and git services added a comment -

        Commit 1523461 from Shai Erera in branch 'dev/trunk'
        [ https://svn.apache.org/r1523461 ]

        LUCENE-5189: add NumericDocValues updates

        Robert Muir added a comment -

        +1 to go to trunk. thanks Shai.

        Shai Erera added a comment -

        Added some javadocs, converted all nocommits to TODOs. I think it's ready for trunk. I'd like to handle FIS.gen next.

        Shai Erera added a comment -

        Thanks Mike. Fixed the two ctors to call doClose if (!success), as well as added a comment about gen=-1.

        I also fixed the two tests which ensure we don't allow updating a field in a segment where it doesn't exist (testUpdateSegmentWithPostingButNoDocValues() and testUpdateSegmentWithNoDocValues()) to set NoMergePolicy, otherwise the segments could merge and the update becomes legit. This exposes the weirdness of the exception – you may not hit it if segments are merged. Once we gen FIS, these tests will change though, to ensure these cases ARE allowed.
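
        (For illustration, a minimal sketch of the scenario these tests cover, assuming the updateNumericDocValue(Term, String, long) signature this patch adds; the "id"/"ndv" field names are made up, and this is not the actual test code:)

        import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.NumericDocValuesField;
        import org.apache.lucene.document.StringField;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.index.NoMergePolicy;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.RAMDirectory;
        import org.apache.lucene.util.Version;

        public class SparseDVUpdateSketch {
          public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();
            IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_46, new WhitespaceAnalyzer(Version.LUCENE_46));
            conf.setMergePolicy(NoMergePolicy.COMPOUND_FILES); // keep the two segments separate, as in the tests above
            IndexWriter writer = new IndexWriter(dir, conf);

            // first segment: has the "ndv" DocValues field
            Document d1 = new Document();
            d1.add(new StringField("id", "doc-0", Field.Store.NO));
            d1.add(new NumericDocValuesField("ndv", 5L));
            writer.addDocument(d1);
            writer.commit();

            // second segment: no "ndv" field at all
            Document d2 = new Document();
            d2.add(new StringField("id", "doc-1", Field.Store.NO));
            writer.addDocument(d2);
            writer.commit();

            // doc-1 lives in a segment whose FieldInfos don't list "ndv", so with the
            // current patch this update is expected to throw rather than add the field
            writer.updateNumericDocValue(new Term("id", "doc-1"), "ndv", 17L);
            writer.close();
            dir.close();
          }
        }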

        Michael McCandless added a comment -

        New patch looks great! The ref counting is very clean. Maybe add a comment that gen is allowed to be -1, just means the field has no DV updates yet, in SegmentReader when building up the map? And then call doClose() in a finally clause if !success so on exception we don't leak open files.

        Shai Erera added a comment -

        Fixed NRT support – DVProducers moved from SegmentCoreReaders to SegmentReader, where they are being wrapped by RefCount which keeps track of ref counts (new class I added). When an SR is shared, SR inc-refs the DVProducers that it uses (according to SIPC.getDVGen(field)) and dec-ref on doClose().

        All Lucene tests pass, including the new numeric DV updates ones. A review would be appreciated though.

        The remaining nocommits are for renames and FieldInfos.gen. I think I'll leave the renames as TODOs, to handle prior to merging to 4x (after this one bakes in trunk), that way avoiding potential messy merges. I can handle FieldInfos.gen either as part of this issue, or a separate one. Preferences?
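
        (For readers following along, a rough sketch of what such a ref-counting wrapper could look like; the actual RefCount class in the patch may differ in its details:)

        import java.io.IOException;
        import java.util.concurrent.atomic.AtomicInteger;

        // minimal ref-counting wrapper sketch; release() would e.g. close a DocValuesProducer
        abstract class RefCount<T> {
          private final AtomicInteger refCount = new AtomicInteger(1); // the creator holds the first reference
          final T object;

          RefCount(T object) {
            this.object = object;
          }

          // invoked exactly once, when the last reference is dropped
          protected abstract void release() throws IOException;

          final void incRef() {
            refCount.incrementAndGet();
          }

          final void decRef() throws IOException {
            int rc = refCount.decrementAndGet();
            assert rc >= 0 : "too many decRef calls";
            if (rc == 0) {
              release();
            }
          }
        }

        A sharing SegmentReader would then incRef() each producer it reuses and decRef() from doClose(), along the lines described above.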

        Shai Erera added a comment -

        I've been busy hunting down concurrency bugs and making field updates work with index sorting (thought it would be an easy nocommit to handle... boy, I was wrong!). Patch adds field updates to TestSortingMP as well as improves TestNumericDVUpdates.testSegmentMerges to stress that more.

        There are some nocommits (RENAME) which I'd love to get some feedback on. Also there are two outstanding nocommits around FieldInfos.gen and an optimization to SegmentCoreReaders/SegmentReader around reusing DVProducers of update gens. I think these two can be handled in separate issues, just to get this issue focused.

        I'd like to commit this to trunk, so that I can move on to the other issues. Therefore I'd appreciate some review on the patch.

        Shai Erera added a comment -

        Correct, that's a problem that Rob identified a few days ago and it can be solved if we gen FieldInfos, because ReaderAndLiveDocs will detect that case and add a new FieldInfo, as well as create a new gen for this segment's FIS. I have two tests in TestNumericDVUpdates which currently test that this is not supported – once we gen FIS, we'll change them to assert it is supported.

        Yonik Seeley added a comment -

        The problem that Mike highlights "some segments might be missing the field entirely so you cannot update those", is pretty bad though. Things work differently (i.e. your update may fail) depending on exactly how segment flushes and merges are done.

        Shai Erera added a comment -

        I think global FIS is an interesting idea, but per-segment FIS.gen is a lower hanging fruit. I did it once and it was quite straightforward (maybe someone will have reservations on how I did it though):

        • SIS tracks fieldInfosGen (in this patch, rename all dvGen in SIS to fisGen)
        • FI tracks dvGen
        • A new FIS45Format reads/writes each FI's dvGen
        • ReaderAndLiveDocs writes a new FIS gen, containing the entire FIS, so SR only reads the latest gen to load FIS

        I think we should explore global FIS separately, because it brings its own issues, e.g. do we keep FISFormat or nuke it? Who invokes it (probably SIS)? It's also quite orthogonal to that issue, or at least, we can proceed with it and improve FIS gen'ing later with global FIS.

        As for SI.attributes(), I think we can move them under SIS. We should open an issue to do that.

        Michael McCandless added a comment -

        Actually, that would also solve the other problems as well?

        Ie, the global FieldInfos would be gen'd: on commit we'd write a new FIS file, which all segments in that commit point would use.

        Any attribute changes to a FieldInfo would be saved, even on update; new fields could be created via update; any segments that have no documents with the field won't be an issue.

        Michael McCandless added a comment -

        One option, to solve the "some segments might be missing the field entirely so you cannot update those" would be to have the FieldInfos accumulate across segments, i.e. a more global FieldInfos, maybe written to a separate global file (not per segment).

        This way, if any doc in any segment has added the field, then the global FieldInfos would contain it.

        Not saying this is an appealing option (there are tons of tradeoffs), but I think it would address that limitation.

        Michael McCandless added a comment -

        Frankly I am tired of hearing this phrase being used in this way

        Actually, I think this is a fair use of "progress not perfection".
        Either that or I don't understand what you're calling "broken APIs" in
        the current patch.

        As I understand it, what's "broken" here is that you cannot set the
        attributes in SegmentInfo nor FieldInfo from your DocValuesFormat
        writer when it's an update being written: the changes won't be saved.

        So, I proposed that we document this as a limitation of the SI/FI
        attributes API: when writing updates, any changes will be lost. For
        "normal" segment flushes, they work correctly. It'd be a documented
        limitation, and we can later fix it.

        I think this situation is very similar to LUCENE-5197, which I would
        also call "progress not perfection": we are adding a new API
        (SegmentReader.ramBytesUsed), with an initial implementation that we
        think might be improved by later cutting over to RamUsageEstimator.
        But I think we should commit the initial approach (it's useful, it
        should work well) and later improve the implementation.

        Shai Erera added a comment -

        It does not solve problem #2 (SegmentInfos.attributes)

        Correct. So this API is broken today for LiveDocsFormat (since it's the only updateable thing), but field updates only broaden the broken-ness into other formats (now only DVF, but in the future others too). Correct?

        I think that moving this API into the commit is not overkill. I remember Mike and I once discussed whether we could use that API to save per-segment facets "schema details". I don't remember how this ended, but maybe we shouldn't remove it? Alternatively, we could gen SIFormat too ... that may be overkill though. Recording per-segment StringStringMap in SIS seems simple enough.

        Regarding FIS.gen, I honestly thought to keep it simple by writing all FIS entirely in each gen and not complicate the code by writing parts of an FI in different gens and merging them by SR. This is what I plan to do in this issue.

        Robert Muir added a comment -

        Just so I understand, if we gen FieldInfos, does that solve the brokenness of the Codec APIs (in addition to the other things that it solves)? If not, in what way are they broken, and is this break a new thing that NDV updates cause/expose, or is it a break that exists in general? Can you list the breaks here (because I think that FIS.gen solves all the points you raised above).

        It does not solve problem #2 (SegmentInfos.attributes). This API should be removed, deprecated, made internal-only, or something like that. Another option is to move this stuff into the commit, but that might be overkill: today this stuff is only used as a backwards-compatibility crutch (I think) to read 3.x indexes, so it can possibly be just removed in trunk right now.

        Gen'ing FieldInfos brings about its own set of questions as far as when/how/if any new FieldInfo information is merged and when/how it's visible to the codec API. It's very scary, but I don't see any alternative at the moment.

        Robert Muir added a comment -

        This would let us proceed (progress not perfection) and then later, we address it. Ie, I think the added boolean is a fair compromise.

        It's not a fair compromise at all.

        To me, as a search engine library, this is not progress; it's going backwards.
        Yes: I'm looking at it solely from an API perspective.
        Yes: others look at things only from a features/performance perspective and do not seem to care about APIs.

        But as a library, the API is all that matters.

        So I just want to make it clear: saying "progress not perfection" is not a good excuse for leaving broken APIs around the codebase and shoving in features as fast as possible: it's not progress to me, so I simply do not see it that way.

        Frankly I am tired of hearing this phrase being used in this way, and when I see it in the future, it will encourage me to take a closer inspection of APIs and do pickier reviews.

        Shai Erera added a comment -

        Just so I understand, if we gen FieldInfos, does that solve the brokenness of the Codec APIs (in addition to the other things that it solves)? If not, in what way are they broken, and is this break a new thing that NDV updates cause/expose, or is it a break that exists in general? Can you list the breaks here (because I think that FIS.gen solves all the points you raised above).

        Robert Muir added a comment -

        But I also don't mind moving forward with SWS.isFieldUpdate and remove it in a follow on issue ... as long as it's done before 4.5.

        I don't think that will be an issue at all.

        if we want to iterate and leave the codec APIs broken, I won't object, but one simple rule:

        Trunk only.

        We can't do this kind of stuff on the stable branch at all: Things that get backported there need to be "ready to ship".

        Shai Erera added a comment -

        I think it's important to solve FIS.gen, either on this issue or a separate one, but before 4.5 is out, because now SegmentInfos records per-field dvGen and if we gen FIS, this will be recorded by a new Lucene45FieldInfosFormat, and SIS will need to record fieldInfosGen. I actually don't mind doing it in this issue. It's work that's needed and affects NDV updates (e.g. sparse fields, which currently hit a cryptic exception too late).

        But I also don't mind moving forward with SWS.isFieldUpdate and remove it in a follow on issue ... as long as it's done before 4.5.

        Michael McCandless added a comment -

        We could simply document this as a limitation, today? Ie, that if it's an update, the DVFormat cannot use the attributes APIs. This would let us proceed (progress not perfection) and then later, we address it. Ie, I think the added boolean is a fair compromise.

        Or, we can pursue gen'ing FIS on this patch, but this is going to add a lot of trickiness/complexity; I think it'd be better to explore it separately.

        Shai Erera added a comment -

        OK, so now I get your point. The problem is that we pass to Codec FI.attributes with say an attribute 'foo=bar'. The Codec, unaware that this is an update, looks at the given numericFields and decides to encode them using method "bar2", so it encodes into the attributes 'foo=bar2', but those attributes get lost because they're not rewritten to FIS. Do I understand correctly?

        Of course, we could say that since the Codec has to peek into SWS.isFieldUpdate, thereby making it updates-aware, it should not encode stuff in a different format, but SWS.isFieldUpdate is not enough to enforce that.

        I don't think that gen'ing FIS solves the problem of obtaining the right DVF in the first place. Sure, after we do that, the Codec can put whatever attributes that it wants, they will be recorded in the new FIS.gen.

        But maybe we can solve these two problems by gen'ing FIS:

        • Add FieldInfo.dvGen. The Codec will receive the FieldInfos with their dvGen bumped up.
        • Codec can choose to look at FI.dvGen and pull the right DVF e.g. like PerField does.
          • Or it can choose to completely ignore it, and always write updates using the new format.
        • Codec is free to record whatever attributes it wants on this FI. Since we gen FIS, they will be recorded and used by the reader.

        What do you think?

        Robert Muir added a comment -

        By the way: the "general" issue is that for updates, its unfortunately not enough to concern ourselves with data, we have to worry about metadata too:

        I see at least 4 problems (and I have not thought about it completely):

        1. FieldInfo.attributes: these "writes" by the NumericDocValues impl will be completely discarded during update, because it's per-segment, not per-commit.
        2. SegmentInfo.attributes: same as the above
        3. Field doesn't exist in FieldInfo at all: (because the segment the update applies to happens to have no values for the field)
        4. Field exists in FieldInfo, but is incomplete: (because the segment the update applies to, had say a stored-only or stored+indexed value for the field, but no dv one).

        PerFieldDVF is just one implementation that happens to use #1. Fixing it is fixing the symptom; that's why I say we really need to instead fix the disease, or things will get very ugly.

        The only reason you don't see more problems with #1 and #2 is that currently it's not used very much (only by PerField and back-compat). If we had more codecs exercising the APIs, you would be seeing these problems already.

        A perfectly good solution would be to remove these APIs completely for public use (which would solve #1 and #2). PerField(PF/DVF) could write its own .per file instead. Back-compat cruft could then use these now-internal-only APIs (and it won't matter since they don't support updates), or we could implement their hacks in another way.

        But this still leaves issues like #3 and #4.

        Adding a boolean 'isFieldUpdate' doesn't really solve anything, and it totally breaks the whole concept of the codec being unaware of updates.

        It is the wrong direction.

        Robert Muir added a comment -

        It really doesn't work: it's definitely a blocker for me!

        This leaves the general api (FieldInfo.attributes and SegmentInfo.attributes) broken for codecs, and only hacks a specific implementation that uses them.

        With or without the current boolean, if a numeric docvalues impl puts something in FieldInfo.attributes during an update, it will go into a black hole, because FieldInfos is write-once per-segment (and not per-commit). Same goes with SegmentInfo.attributes.

        Shai Erera added a comment -

        I don't understand the problem that you raise. Until then, I think that SWS.isFieldUpdate is fine. It works, it's simple, and most importantly, it allows me to move forward. Let's discuss how to improve it even further, but I don't think this is a blocker. We can always improve that later on.

        Robert Muir added a comment -

        Patch adds per-field support. I currently do that by adding a boolean 'isFieldUpdate' to SegWriteState which is set to true only by ReaderAndLiveDocs. PerFieldDVF then peeks into that boolean and if it's true, it reads the format name from FieldInfo.attributes() instead of relying on Codec.getPerFieldDVF(). If we'll eventually gen FieldInfos, there won't be a need for this boolean as PerFieldDVF will get that from FI.dvGen.

        We can't move forward really with this boolean: it only attacks the symptom (puts a HACK in per-field) without fixing the disease (the codec API).

        In general, if a codec needs to write to and read from FieldInfos/SegmentInfos.attributes, it doesn't work here: this API needs to be fixed.

        Shai Erera added a comment -

        Patch improves RAM accounting in BufferedDeletes and FrozenBD. I added NumericUpdate.sizeInBytes() so most of the logic is done there. BD adds two constants - for adding to outer Map (includes inner map OBJ_HEADER) and adding an actual update to inner map (excludes map's OBJ_HEADER). Only the pointers are taken into account.
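
        (Roughly, a sizeInBytes() along those lines could look like the sketch below; the field names and the exact terms counted are assumptions for illustration, not the patch's actual code:)

        import org.apache.lucene.index.Term;
        import org.apache.lucene.util.RamUsageEstimator;

        class NumericUpdateSketch {
          Term term;    // term selecting the documents to update
          String field; // the NumericDocValues field being updated
          Long value;   // the new value

          long sizeInBytes() {
            long size = RamUsageEstimator.NUM_BYTES_OBJECT_HEADER;              // the update object itself
            size += 3 * RamUsageEstimator.NUM_BYTES_OBJECT_REF;                 // refs to term, field, value
            size += term.bytes().length;                                        // term bytes
            size += term.field().length() * RamUsageEstimator.NUM_BYTES_CHAR;   // term's field name
            size += field.length() * RamUsageEstimator.NUM_BYTES_CHAR;          // updated field's name
            size += RamUsageEstimator.NUM_BYTES_OBJECT_HEADER + RamUsageEstimator.NUM_BYTES_LONG; // boxed Long
            return size;
          }
        }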

        Shai Erera added a comment -

        Thanks for the review Mike. I nuked the two unused methods, and I like SegmentCommitInfo, so I changed the nocommit text.

        I changed the nocommit in SIPC to a TODO. Don't think we need to tackle it in this issue.

        I'm working on improving the RAM accounting. I want to add a NumericUpdate.sizeInBytes() and count it per update that is actually added. Then, because it's a Map<Term,Map<String,NumericUpdate>> and the Term and String are both in NumericUpdate already (and will be accounted for in its calculation), only their PTR needs to be taken into account. Also, the sizeInBytes should grow by a new entry to the outer map only when one is actually added, same for the inner map. Therefore I don't think we can have a single constant here, but instead maybe two: one for every Entry<Term,Map> added to the outer map and one for every Entry<String,NumericUpdate> added to the inner map. I think, because I need to compute the shallow sizes only (since Term and String are accounted for in NumericUpdate), it's a single constant for Entry<Object,Object>?

        Shai Erera added a comment -

        Patch adds per-field support. I currently do that by adding a boolean 'isFieldUpdate' to SegWriteState which is set to true only by ReaderAndLiveDocs. PerFieldDVF then peeks into that boolean and if it's true, it reads the format name from FieldInfo.attributes() instead of relying on Codec.getPerFieldDVF(). If we'll eventually gen FieldInfos, there won't be a need for this boolean as PerFieldDVF will get that from FI.dvGen.

        So far all Codecs work. I had to remove an assert from SimpleText which tested that all fields read from the file are in the state.fieldInfos. But it doesn't use that information, only an assert. And SegCoreReader now passes to each DVProducer only the fields it needs to read.

        Added some tests too.

        Michael McCandless added a comment -

        Patch looks great! I review some of the remaining nocommits:

        // nocommit no one calls this method, why do we have it? and if we need it, do we need one for docValuesGen too?
        public void setDelGen(long delGen) {

        Nuke it! We only use .advanceNextWriteDelGen (and the patch adds this
        for DVs too).

        // nocommit no one calls this, remove?
        void clearDelGen() {

        Nuke it!

        class ReadersAndLiveDocs { // nocommit (RENAME) to ReaderAndUpdates?

        +1 for ReaderAndUpdates

        // nocommit why do we do that, vs relying on TrackingDir.getCreatedFiles(),
        // like we do for updates?

        That's a good question ... I'm not sure. We in fact already use
        TrackingDirWrapper (in ReadersAndLiveDocs.writeLiveDocs)... so we
        could in theory record those files in SIPC and remove
        LiveDocsFormat.files(). Maybe make this a TODO though?

        // nocommit: review!
        final static int BYTES_PER_NUMERIC_UPDATE = BYTES_PER_DEL_TERM + 2*RamUsageEstimator.NUM_BYTES_OBJECT_REF + RamUsageEstimator.NUM_BYTES_INT + RamUsageEstimator.NUM_BYTES_LONG;

        I think it makes sense to start from BYTES_PER_DEL_TERM, but then
        instead of mapping to value Integer we map to value
        Map<String,NumericUpdate> whose per-Term RAM usage is something like:

          PTR (for LinkedHashMap, since it must link each entry to the next?)
        
          Map
            HEADER
            PTR (to array)
            3 INT
            1 FLOAT
        
          for each occupied Entry<String,NumericUpdate>
            PTR (from Map's entries array) * 2 (overhead for load factor)
            HEADER
            PTR * 2 (key, value)
        
            String key
              HEADER
              INT
              PTR (to char[])
              
              ARRAY_HEADER + 2 * length-of-string (field)
        
            NumericUpdate value
              HEADER
              PTR (to Term; ram already accounted for)
              PTR (to String; ram already accounted for)
              PTR (to Long value) + HEADER + 8 (long)    
              INT
        

        The thing is, this is so hairy ... that I think maybe we should
        instead use RamUsageEstimator to "calibrate" this? Ie, make a
        standalone test that keeps adding Term + fields into this structure
        and measures the RAM with RUE? Do this on 32 bit and on 64 bit JVM,
        and then conditionalize the constants. You'll still need to add in
        bytes according to field/term lengths...
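
        (A rough sketch of that calibration idea; the Update class below is only a stand-in, since NumericUpdate is new in this patch, and the printed per-entry deltas would then be turned into the constants:)

        import java.util.HashMap;
        import java.util.LinkedHashMap;
        import java.util.Map;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.util.RamUsageEstimator;

        public class CalibrateUpdateRam {
          // stand-in for the patch's NumericUpdate
          static class Update {
            final Term term; final String field; final Long value;
            Update(Term term, String field, Long value) { this.term = term; this.field = field; this.value = value; }
          }

          public static void main(String[] args) {
            Map<Term,Map<String,Update>> updates = new HashMap<Term,Map<String,Update>>();
            long prev = RamUsageEstimator.sizeOf(updates);
            for (int i = 0; i < 100; i++) {
              Term t = new Term("id", "doc-" + i);
              Map<String,Update> inner = new LinkedHashMap<String,Update>();
              inner.put("field", new Update(t, "field", 17L));
              updates.put(t, inner);
              long cur = RamUsageEstimator.sizeOf(updates); // deep, reflective measurement
              System.out.println("entry " + i + ": +" + (cur - prev) + " bytes");
              prev = cur;
            }
          }
        }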

        +public class SegmentInfoPerCommit { // nocommit (RENAME) to SegmentCommit?

        Not sure about that rename ... since this class is just the "metadata"
        about a commit, not an "actual" commit. Maybe SegmentCommitInfo?

        Shai Erera added a comment -

        Patch adds testStressMultiThreading and testUpdateOldSegments to ensure that updating segments written with an older format than "Lucene45" is not supported.

        Uwe Schindler added a comment -

        This callback would need to also get the docid I guess (it's missing in your API example)?

        Of course we could add this. Java 8 would also support this cool syntax, something like: writer.updateDocValues(term, (docid, value) -> value+1);

        The Java 8 example here was just syntactic sugar: For all this its only important that it is an interface with only one method that gets as many parameters as needed and returns one value. We automatically get the cool java 8 syntax for users, if we design the callback interface to these guidelines. One common example is the "Comparator" interface in Java. Every Comparator<T> can be written in this cool syntax

        Shai Erera added a comment -

        I definitely want to add update by query, but in a separate issue. And the callback idea is interesting. This callback would need to also get the docid I guess (it's missing in your API example)?

        Uwe Schindler added a comment -

        Hi,
        I had an idea yesterday when thinking about this. Currently (like for deletes) we can update DocValues based on an ID term (updating by docid is not easily possible with IndexWriter). As the ID term can be anything, you could also use some (group) key that updates lots of documents (like you can delete all documents with a specific term). The current code updates the given field for all those documents to a fixed value. My two ideas are:

        • we could also support update by query (means like for deletes you provide a query that selects the documents to update)
        • we could make "modifications" possible: Instead of giving a value that is set for all selected documents, we could provide a "callback" interface that is used to modify the current docvalue (numeric or String) of the document to update and returns a changed value. This would be a one-method interface, so it could be used as closure in Java 8, like writer.updateDocValues(term, value -> value+1); (in Java 6/7 this would be writer.updateDocValues(term, new NumericDocValuesUpdater() { public long update(long value) { return value+1; }});). Servers like Solr or ElasticSearch could implement this interface/closure using e.g. javascript, so one could execute a docvalues update and pass a javascript function applied to every value. We just have to think about concurency: What happens if 2 threads are updating the same value at the same time - maybe this is already handled by the BufferedDeletesQueue!?

        I just wanted to write this down in this issue, so we could think about allowing to implement this. Of course the current patch is more important to get the whole game running! Updating by term/query is just one thing which is often requested by users. The typical example is a webapp where you can vote for a document. In that case one would execute the closure value -> value+1. If we implement this at such a low level, the whole concurrency should be easier than how it is currently implemented e.g. in Solr or ES.
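
        (To make the callback idea concrete, a purely hypothetical sketch; neither NumericDocValuesUpdater nor an updateNumericDocValues(Term, String, updater) method exists in the current patch:)

        // hypothetical single-method callback interface
        interface NumericDocValuesUpdater {
          long update(int docID, long currentValue);
        }

        // Java 6/7 style, assuming an IndexWriter 'writer' in scope:
        writer.updateNumericDocValues(new Term("userid", "u17"), "votes",
            new NumericDocValuesUpdater() {
              public long update(int docID, long currentValue) {
                return currentValue + 1; // e.g. count a vote
              }
            });

        // Java 8 style, since the interface has exactly one method:
        writer.updateNumericDocValues(new Term("userid", "u17"), "votes",
            (docID, currentValue) -> currentValue + 1);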

        Shai Erera added a comment -

        Regarding problem 1, I hardwired the test to use Lucene45Codec for now so that I'm not blocked. I thought about Codec.serialize/attributes and now I realize it's not a good idea since those attributes must be recorded per-segment, yet the Codec is single-instance for all segments. We can however record these in SegmentInfo.attributes(). The documentation suggests this is where the Codec should record stuff per-segment. Would it work if PerFieldDVF recorded the per-field format in SegWriteState.si.attributes() and read them later, instead of FieldInfo.attributes?

        Shai Erera added a comment -

        Regarding problem 3, Mike helped me construct a simple test which reproduces the bug - I opened LUCENE-5192 to fix.

        Shai Erera added a comment -

        BTW, this may generally not be a bad idea, to let the Codec serialize some stuff which is later given to it in Codec.init(BytesRef). E.g. if a Codec is initialized with some parameters that are also important during search (e.g FacetsCodec can be initialized with FacetIndexingParams, which get lost during search because the Codec is initialized with default ctor), this could be a way for it to serialize/deserialize itself. The name will be used for the newInstance(), the rest to initialize the Codec.

        Shai Erera added a comment -

        Regarding problem 1, I don't know if it's a valid solution, but maybe if we recorded a per-field-format map for each SegInfo, Lucene45Codec could initialize its dvFormat accordingly? This is not generic though .. it's like we need to have a Codec.serialize() method which dumps stuff to SegInfo (or returns a BytesRef/String from which it can later initialize itself). We'd then not need the attributes on FieldInfo. We have to somehow employ the same logic as we do in PerFieldDVF.FieldsReader, in PerFieldDVF.FieldsWriter for updating existing segments. Whatever solution we'll do here, will help us when we come to implement field updates for postings.

        Shai Erera added a comment -

        Patch adds some nocommits and tests that expose some problems:

        Problem 1
        If you run the test with -Dtests.method=testSegmentMerges -Dtests.seed=7651E2AEEBC55BDF, you'll hit an exception:

        NOTE: reproduce with: ant test  -Dtestcase=TestNumericDocValuesUpdates -Dtests.method=testSegmentMerges -Dtests.seed=7651E2AEEBC55BDF -Dtests.locale=en_AU -Dtests.timezone=Etc/GMT+11 -Dtests.file.encoding=UTF-8
        Aug 29, 2013 11:57:35 AM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
        WARNING: Will linger awaiting termination of 1 leaked thread(s).
        Aug 29, 2013 11:57:35 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
        WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread #0,6,TGRP-TestNumericDocValuesUpdates]
        org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError: formatName=Lucene45 prevValue=Memory
        	at __randomizedtesting.SeedInfo.seed([7651E2AEEBC55BDF]:0)
        	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
        	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
        Caused by: java.lang.AssertionError: formatName=Lucene45 prevValue=Memory
        	at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.getInstance(PerFieldDocValuesFormat.java:133)
        	at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addNumericField(PerFieldDocValuesFormat.java:105)
        	at org.apache.lucene.index.ReadersAndLiveDocs.writeLiveDocs(ReadersAndLiveDocs.java:389)
        	at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:178)
        	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3732)
        	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3401)
        	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
        	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
        

        What happens is the test uses RandomCodec and picks MemoryDVF for writing that field. Later, when ReaderAndLiveDocs applies updates to that field, it uses SI.codec, which is not RandomCodec anymore, but Lucene45Codec (or in this case Facet45Codec - based on Codec.forName("Lucene45")), and its DVF returns for that field Lucene45DVF, because Lucene45Codec always returns that. The way it works during search is that PerFieldDVF.FieldsReader does not rely on the Codec at all, but rather looks up an attribute in FieldInfo which tells it the DVFormat.name and then it calls DVF.forName. But for writing, it relies on the Codec.

        I am not sure how to resolve this. I don't think ReaderAndLiveDocs is doing anything wrong – per-field is not exposed in the Codec API, therefore it shouldn't assume it should do any per-field stuff. But on the other hand, Lucene45Codec instances return a per-field DVF based on what the instance says, and don't look at the FieldInfo attributes as PerFieldDVF.FieldsReader does. Any ideas?
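        For contrast, the read-side lookup described above boils down to roughly this (simplified; the actual attribute key used by PerFieldDocValuesFormat may be named differently):

            import org.apache.lucene.codecs.DocValuesFormat;
            import org.apache.lucene.index.FieldInfo;

            class PerFieldReadLookupSketch {
              // the FieldsReader resolves the format per field from a FieldInfo attribute, not from the Codec
              static DocValuesFormat formatFor(FieldInfo fi) {
                String formatName = fi.getAttribute("PerFieldDocValuesFormat.format"); // key name assumed
                return formatName == null ? null : DocValuesFormat.forName(formatName);
              }
            }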

        Problem 2
        Robert thought of this usecase: if you have a sparse DocValues field 'f', such that, say, in segment 1 only doc1 has a value but in segment 2 none of the documents have values, you cannot really update documents in segment 2, because the FieldInfos for that segment won't list the field as having DocValues at all. For now, I catch that case in ReaderAndLiveDocs and throw an exception. The workaround is to make sure you always have values for the field in every segment, e.g. by always setting some default value (sketched below). But this is ugly and exposes internal stuff (e.g. segments) to users. Also, it's bad because e.g. if segments 1+2 are merged, you can suddenly update documents that were in segment 2 before.
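        A minimal sketch of that workaround, assuming an open IndexWriter named writer and made-up field names - every document indexes a (default) value for 'f', so each segment's FieldInfos registers the field as having doc values:

            Document doc = new Document();
            doc.add(new StringField("id", "some-id", Field.Store.NO));
            // always add the DV field, using a default value for documents that "don't have" f
            doc.add(new NumericDocValuesField("f", 0L));
            writer.addDocument(doc);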

        A way to solve it is to gen FieldInfos as well. That will allow us to additionally support adding new fields through field updates, though that's optional and we can still choose to forbid it. If we gen FieldInfos though, the changes I've done to SegmentInfos (recording per-field dvGen) need to be reverted. So it's important that we come to a resolution about this in this issue. This is somewhat of a corner case (sparse fields), but I don't like the fact that users can trip on exceptions that depend whether or not the segment was merged...

        Problem 3
        FieldInfos.Builder neglects to update the globalFieldNumbers.docValuesType map if it updates a FieldInfo's DocValuesType. It's an easy fix, and I added a test to numeric updates. If someone has an idea how to reproduce this outside of the numeric updates scope, I'll be happy to handle it in a separate issue. The problem is that if you add two fields to a document with the same name, one as a posting and one as DV, globalFieldNumbers does not record that field name in its docValuesType map. This map, however, is currently used by IW.updateNumericDVField and by an assert in FieldInfos.Builder, therefore I wasn't able to reproduce it outside of the numeric updates scope.
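        For illustration, this is the shape of document that triggers the problem (sketch; assumes an open IndexWriter named writer and made-up field names):

            Document doc = new Document();
            doc.add(new StringField("key", "some-token", Field.Store.NO)); // postings
            doc.add(new NumericDocValuesField("key", 5L));                 // numeric doc values, same field name
            writer.addDocument(doc);
            // before the fix, globalFieldNumbers.docValuesType ended up with no entry for "key"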

        Robert Muir added a comment -

        In the case of old codecs: what we do is pretty tricky for testing:

        • we make them read-only officially for the user (so that new segments are written in the latest format, but old segments can still be read).
        • this has the additional caveat that they are not "purely" read-only, because we actually allow liveDocs updates (deletes) against the old formats. So they are "mostly read-only".
        • tests have read-write versions (like in branch4x: PreFlexRWCodec). These allow in tests for us to override the read-only-ness, and write like the old formats did and read them in transparently in tests.
        • Of course they cannot support the newest features with this "impersonator" testing we do, but in general we get a lot more test coverage than if we relied solely upon TestBackwardsCompatibility.
        Shai Erera added a comment -

        Thanks Rob. I forgot about SuppressCodecs. I guess I was confused about why Lucene40 was picked in the first place, as I thought we don't test writing indexes with old Codecs?

        Robert Muir added a comment -

        You can add @SuppressCodecs({"Lucene40", "SomethingElse", ...}) annotation to the top of your test for this.
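        For example (the codec names listed here are illustrative; the annotation is LuceneTestCase.SuppressCodecs from the test framework):

            @SuppressCodecs({"Lucene40", "Lucene41", "Lucene42"})
            public class TestNumericDocValuesUpdates extends LuceneTestCase {
              // tests only run with codecs that can write the gen'd doc-values files
            }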

        Shai Erera added a comment -

        Patch addresses Rob's idea:

        • ReaderAndLiveDocs and SegCoreReaders set segmentSuffix to docValuesGen and also set SegReadState.directory accordingly (CFS or si.info.dir if dvGen != -1); see the naming sketch after this list.
        • All the changes to DVFormat were removed (including the 45Producer/Consumer). I had to fix a bug in PerFieldDVF which ignored state.segmentSuffix (and also resolved a TODO along the way, since it now respects it).
        • Removed the nocommit in ReaderAndLiveDocs about letting TrackingDirWrapper forbid createOutput on a file which is referenced by a commit, since now Codecs are not aware of dvGen at all. As long as they don't ignore segmentSuffix (and they'd better not, otherwise they're broken), they can be upgraded safely to support DV updates. We can still do that under a separate issue, as another safety mechanism.
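        A minimal sketch of the naming idea from the first bullet above; the variable names are illustrative, and the real file names also carry the per-field format name and suffix (e.g. _0_Lucene45_0_1.dvd):

            // fold the docValuesGen into the segment suffix so the codec just writes "another"
            // set of files and never has to know about generations
            String suffix = docValuesGen == -1 ? "" : Long.toString(docValuesGen, Character.MAX_RADIX);
            String dataFile = IndexFileNames.segmentFileName(segmentInfo.name, suffix, "dvd");
            // gen 1 on segment "_0" -> "_0_1.dvd", written outside the segment's CFS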

        I wanted to get rid of the nocommit in TestNumericDocValuesUpdates which sets the default Codec to Lucene45, since now presumably all Codecs should support dv-update. But when the test runs with Lucene40 (I haven't tried other codecs, it's the first one that failed), I hit an exception as if trying to write to the same CFS file. Looking at Lucene40DVF.fieldsProducer, I see that it defaults to the CFS extension, and Lucene40DVWriter uses a hard-coded segmentSuffix="dv" and ignores state.segmentSuffix. I guess that the actual Codec that was used is Lucene40RWDocValuesFormat, otherwise fieldsProducer should have hit an exception. I didn't know our tests pick "old" codecs at random too. How can I avoid picking the "old" Codecs (40, 42)? I still want to test other codecs, such as Asserting, maybe MemoryDVF (if it's chosen at random).

        Robert Muir added a comment -

        I think it's true we can tackle this in a separate issue, but I'd rather have SR/SCR just pass the correct directory always in the SegmentReadState/SegmentWriteState to the different codec components (e.g. SegmentReadState.dir is always the 'correct' directory the codec component should use, and even when CFS is enabled, the LiveDocsFormat always gets the inner one and so on).

        It's OK if we want to have the 'inner dir' accessible in SegmentInfo for SR/SCR to do this: we could make it package-private and everything would then just work?

        This would greatly reduce the possibility of mistakes. I think having CFS fall back on its inner directory on FileNotFoundException is less desirable.

        Shai Erera added a comment -

        The problem is that there are two Directories, and the logic of where the file is read from depends on whether it's gen'd or not (so far it has been only livedocs). Maybe what we can do is have CFS revert to directory.openInput if the file does not exist? We can do that in a separate issue? If we fix that, then I think we might really be able to "hide" the gen from the Codec cleanly. Actually, if the fix is that simple (CFD.openInput reverting to dir.openInput), I can do it as part of this issue, it's small enough?
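        A small sketch of that fallback (not actual CompoundFileDirectory code; class and method names are made up):

            import java.io.FileNotFoundException;
            import java.io.IOException;
            import java.nio.file.NoSuchFileException;
            import org.apache.lucene.store.Directory;
            import org.apache.lucene.store.IOContext;
            import org.apache.lucene.store.IndexInput;

            class CfsFallbackSketch {
              // try the compound file first, fall back to the segment's real directory where gen'd files live
              static IndexInput open(Directory cfsDir, Directory segDir, String name, IOContext ctx) throws IOException {
                try {
                  return cfsDir.openInput(name, ctx);
                } catch (FileNotFoundException | NoSuchFileException e) {
                  return segDir.openInput(name, ctx);
                }
              }
            }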

        Shai Erera added a comment -

        Rename fieldInfosGen to docValuesGen. I think I got all places fixed.

        Robert Muir added a comment -

        By the way, if we want to fix the state.directory vs state.segmentInfo.dir issue, I would love to help with this; it's bothered me forever.

        Robert Muir added a comment -

        Maybe for writing it can work, but the producer needs to know from which Directory to read the file, e.g. if it's CFS, the gen'd files are written outside.

        Wait... we shouldn't design around this bug though (and it is an API bug). This problem you point out is definitely existing bogusness: I think we should fix this instead, so a codec gets a single "directory" and doesn't need to know or care what its impl is, and whether it's got TrackingDirectoryWrapper or CFS or whatever around it.

        Shai Erera added a comment -

        Can we refer to this consistently as docValuesGen

        Yes, I think that makes sense. At some point I supported this by gen'ing FieldInfos hence the name, but things have changed since. I'll rename.

        Maybe we shouldnt pass this parameter to the codec at all. Instead IndexWriter can just put this into the segment suffix and the codec can be blissfully unaware?

        Maybe for writing it can work, but the producer needs to know from which Directory to read the file, e.g. if it's CFS, the gen'd files are written outside. I have this code in Lucene45DVProducer:

            final Directory dir;
            if (fieldInfosGen != -1) {
              dir = state.segmentInfo.dir; // gen'd files are written outside CFS, so use SegInfo directory
            } else {
              dir = state.directory;
            }
        

        I think that if we want to mask that away from the Codec entirely, we should somehow tell the Codec the segmentSuffix and the Directory from which to read the file. Would another Directory parameter be confusing (since we also have it in SegReadState)?

        I hope we can do this in a cleaner way than 3.x did it for setNorm, that was really crazy

        Well ... I don't really know how setNorm worked in 3.x, so I'll do what I think and you tell me if it's crazy or not?

        Robert Muir added a comment -

        That way you can end up with e.g. _0_Lucene45_0_1.dvd and *.dvm for field 'f'
        ...
        I put a nocommit in DVFormat.fieldsConsumer/Producer by adding another variant which takes fieldInfosGen.
        ...
        I want to have only one variant of that method, thereby breaking the API. This is important IMO cause we need to ensure that whatever custom DVFormats out there pay attention to the new fieldInfosGen parameter, or otherwise they might overwrite previously created files.

        Maybe we shouldn't pass this parameter to the codec at all. Instead IndexWriter can just put this into the segment suffix and the codec can be blissfully unaware?

        SegmentCoreReaders no longer has a single DVConsumer it uses, but rather per field it uses the appropriate DVConsumer (depends on the 'gen').

        I put a nocommit to remove DVConsumers from SegCoreReaders into a RefCount'd object in SegmentReader so that we can keep SegCoreReaders manage the 'readers' that are shared between all SegReaders, and also make sure to reuse same DVConsumer by multiple SegReaders. I'll do that later.

        I hope we can do this in a cleaner way than 3.x did it for setNorm, that was really crazy

        I put a nocommit in DVFormat.fieldsConsumer/Producer by adding another variant which takes fieldInfosGen.

        Can we refer to this consistently as docValuesGen or something else (I see the patch already does this in some places, but in other places it's called fieldInfosGen)? I don't think this should ever be referred to as fieldInfosGen because it's not a generation for the fieldinfos data, and that would be horribly scary!

        Shai Erera added a comment -

        I forgot to mention – this work started from a simple patch Rob sent me a few weeks ago, so thanks Rob! And also, thanks Mike for helping me get around Lucene core code! At times I felt like I was literally hammering through the code to get the thing working.

        Shai Erera added a comment -

        Patch adds numeric-dv field updates capabilities:

        • IndexWriter.updateNumericDocValue(term, field, value) updates the value of 'field' of all documents associated with 'term' to the new 'value' (a short usage sketch follows this list).
        • When you update the value of field 'f' of a few documents, a new pair of .dvd/.dvm files is created, with the values of all documents for that field.
          • That way you can end up with e.g. _0_Lucene45_0_1.dvd and *.dvm for field 'f' and the _0.cfs for other fields which were not updated.
          • SegmentInfoPerCommit tracks for each field in which 'gen' it's recorded, and SegmentCoreReaders uses that map to read the values of the field from the respective gen.
        • TestNumericDocValuesUpdates contains a dozen or so test cases which cover different angles, from simple updates to unsetting values, merging segments, deletes etc. During development I ran into many interesting scenarios.
        • ReaderAndLiveDocs.writeLiveDocs applies in addition to the deletes, the field updates too. BufferedDeletes tracks the updates, similar to how it tracks deletes.
        • SegmentCoreReaders no longer has a single DVConsumer it uses, but rather per field it uses the appropriate DVConsumer (depends on the 'gen').
          • I put a nocommit to remove DVConsumers from SegCoreReaders into a RefCount'd object in SegmentReader so that we can keep SegCoreReaders manage the 'readers' that are shared between all SegReaders, and also make sure to reuse same DVConsumer by multiple SegReaders. I'll do that later.
        • Segment merging is supported in that when a segment with updates is merged, the correct values are written to the merged segment and the resulting segment has no 'gen' .dvd.
        • I put a nocommit in DVFormat.fieldsConsumer/Producer by adding another variant which takes fieldInfosGen. The default impl throws UnsupportedOpException, while Lucene45 implements it.
          • I want to have only one variant of that method, thereby breaking the API. This is important IMO cause we need to ensure that whatever custom DVFormats out there pay attention to the new fieldInfosGen parameter, or otherwise they might overwrite previously created files.
          • There is also a nocommit touching that with a suggestion to forbid createOutput call in TrackingDir if the file is already referenced by an IndexCommit.
          • It is important that we break something here so that users/apps pay attention to the new feature – suggestions are welcome!
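        A short usage sketch of the API described in the first bullet, assuming a Lucene 4.6-style setup and the usual imports; the directory, analyzer and field names are made up:

            Directory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46)));

            Document doc = new Document();
            doc.add(new StringField("id", "doc-1", Field.Store.NO));
            doc.add(new NumericDocValuesField("price", 10L));
            writer.addDocument(doc);
            writer.commit();

            // change 'price' for every document matching id:doc-1, without re-indexing those documents
            writer.updateNumericDocValue(new Term("id", "doc-1"), "price", 20L);
            writer.commit();
            writer.close();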

        Few remarks:

        • For now, only updating by a single term is supported (simplicity).
        • You cannot add a new field through a field update, only update existing fields. This is a 'schema' change, and there are other ways to do it, e.g. through addIndexes and FilterAtomicReader. Attempting to support it means we would need to create a gen'd FieldInfosFormat as well, which complicates matters.
        • I dropped some nocommits about renaming classes/methods. I didn't want to do it yet, because it creates an unnecessarily bloated patch. Feel free to comment, we can take care of the renames later.
        • I will probably create a branch for this feature because there are some things that still need to be taken care of (add some tests, finish Codecs support etc.).
        • Also, I haven't yet benchmarked the effect of field updates on indexing/search ... I will get to it at some point, but if someone wants to help, I promise not to say no.

        I may have forgotten to describe some changes; feel free to ask for clarification!


          People

          • Assignee:
            Shai Erera
          • Reporter:
            Shai Erera
          • Votes:
            8
          • Watchers:
            17
