Lucene - Core
LUCENE-4258

Incremental Field Updates through Stacked Segments

    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Shai and I would like to start working on the proposal for Incremental Field Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).

      1. LUCENE-4258-API-changes.patch
        13 kB
        Sivan Yogev
      2. LUCENE-4258.r1423010.patch
        446 kB
        Sivan Yogev
      3. LUCENE-4258.r1422495.patch
        445 kB
        Sivan Yogev
      4. LUCENE-4258.r1416617.patch
        264 kB
        Sivan Yogev
      5. LUCENE-4258.r1416438.patch
        263 kB
        Sivan Yogev
      6. LUCENE-4258.r1412262.patch
        256 kB
        Sivan Yogev
      7. LUCENE-4258.r1410593.patch
        244 kB
        Sivan Yogev
      8. LUCENE-4258.branch3.patch
        61 kB
        Sivan Yogev
      9. LUCENE-4258.branch.6.patch
        54 kB
        Sivan Yogev
      10. LUCENE-4258.branch.6.patch
        621 kB
        Sivan Yogev
      11. LUCENE-4258.branch.5.patch
        33 kB
        Sivan Yogev
      12. LUCENE-4258.branch.4.patch
        29 kB
        Sivan Yogev
      13. LUCENE-4258.branch.2.patch
        134 kB
        Sivan Yogev
      14. LUCENE-4258.branch.1.patch
        113 kB
        Sivan Yogev
      15. IncrementalFieldUpdates.odp
        26 kB
        Sivan Yogev

        Issue Links

          Activity

          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Sunny Khatri added a comment -

          Do we know where this magic number of 10000 is defined? The position increments (0, 1, 2, etc.) come from getPositionIncrementGap() for the particular analyzer, but I did not notice 10000 being used as the start offset (I'm on the Lucene4258 trunk; maybe things are different there?).

          Also, it seems there are more pressing issues with the patch apart from this one. Are all the known issues and limitations of the patch listed anywhere?

          Sivan Yogev added a comment -

          Sunny - you are correct that the new terms are added starting at position 0, but this is handled by the reader, which (if I remember correctly) adds a padding of 10,000 for each new stacked segment. Therefore the actual positions returned by the reader will be:
          term4:::10000 term5:::10001
          term1:::10000 term2:::10001 term3:::10002
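          The padding scheme described above can be sketched as a toy helper. This is an editor's illustration, not code from the patch: the class, the constant name, and the assumption that the base level is left unshifted are all hypothetical; only the 10,000 gap per stacked segment comes from the thread.

          ```java
          // Toy sketch of the per-stacked-segment position padding described above.
          // Names and the treatment of the base level are assumptions, not patch code.
          public class StackedPositionPadding {
              // The thread reports a padding of 10,000 added for each new stacked segment.
              static final int STACKED_SEGMENT_GAP = 10_000;

              /**
               * Logical position returned by the reader for a term stored at
               * storedPos in stack level `level` (0 = base segment).
               */
              static int logicalPosition(int level, int storedPos) {
                  return level * STACKED_SEGMENT_GAP + storedPos;
              }
          }
          ```

          Under this reading, term4/term5 in the first stacked segment come back at 10,000 and 10,001, so they no longer collide with the base segment's positions 0-2.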

          Shai Erera added a comment -

          Sunny, there are many issues w/ the current patch. It doesn't handle many corner cases and it's very complicated. I'm not saying the approach won't work, just that the patch attempts to solve many update types in one go. For that reason I opened LUCENE-5189 which laid out the ground for updating DocValues fields. Even though it currently supports only NumericDocValues, we already have a new issue LUCENE-5513 to add support for BinaryDV and I assume after that we'll move to cover the rest of the DV types.

          As for updating postings, I haven't yet given up on doing that, but it's more complicated and we may do it in baby steps too (e.g. allow to only add tokens for StringFields). But it will take time.

          Sunny Khatri added a comment - edited

          Hi Guys,

          I've been looking at this patch and wanted to know if there's any update on the release date for this patch.

          I was able to try out this patch and observed some issues regarding the term positions for the stacked-up segment data. It seems that when a new update is made on top of the stack (Operation.ADD_FIELDS), its positions begin again from 0. For example (and a use case): let a document be

          { term1 term2 term3 term4 term5 }

          Now we send the whole document in multiple chunks.
          Update 1: term1 term2 term3
          Update 2: term4 term5

          Now the stack looks like (along with their positions):
          term4:::0 term5:::1
          term1:::0 term2:::1 term3:::2

          So what we end up getting is two terms appearing at position 0, two at position 1, etc.
          CONS: Phrase queries, for instance, won't work in this case, e.g. a search for "term3 term4".

          Just wanted to get your take on whether that issue could be resolved easily.

          PS: Not sure if it's trivial to resolve, since we'll need to know the maximum length of the actual document chunk in the previous stack, not the maximum position of the last term added to the stack, because the last term in the actual doc could be a stopword and hence not appear in the index, depending on the configuration.

          Ravikumar added a comment -

          I have integrated this awesome patch and it seems to work nicely so far. There is one suggestion I would like to get an opinion on...

          There must be many users pursuing an update codec external to Lucene. One such PoC is described at http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/

          Since all the hard work of buffering updates to the del-queue etc. is already done in this patch, how about exposing the actual write via something along the lines of Codec, like an UpdateCodec? This would free implementations from worrying about segment merges, NRT, etc.

          Ex:

          interface UpdateCodec { // May be exposed via IWConfig??
              UpdatedSegment writeUpdatedSegment(SegmentReader reader,
                                                 UpdatedSegmentData data,
                                                 SegmentInfoPerCommit sipc);
          }

          public final class UpdatedSegment {
              public UpdatedSegment(SegmentInfo updtInfo, Set<String> genReplaceFiles) { ... }
          }

          Implementations not wishing to take a generational approach [knowing fully the risks] would probably return new UpdatedSegment(null, null), and DocWriter will handle it accordingly.

          The StackedReaders should also handle a non-generational update regime. Ex: instead of a StackedTermsEnum, it should actually get the TermsEnum from the UpdateCodec, etc.

          What do you think?


          Ravi

          Sivan Yogev added a comment -

          Started benchmark work using enwiki data. So far so good - running the same indexing and search on both trunk and the branch results in similar timings.

          Next step is to have an additional indexing phase, where for some of the documents titles are replaced through the field updates mechanism. The title will be replaced with the same value so the queries will be the same, trying to understand the effect of having updates to segments on the reader/searcher.

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Scott Schneider added a comment -

          Gotcha. Could my help speed this up significantly, or not much? I hesitate to even ask, because on the one hand, I'm not familiar with Lucene internals, and on the other hand (or the same hand), I doubt I could get time to work on this... but I am an experienced Java developer and if my help would make a difference, I could make a case for it!

          Shai Erera added a comment -

          I think we're looking at not less than 4 months, and that assumes performance shows no big concerns. Otherwise, it's game open again. But in a release .. that could take a while.

          Depending how adventurous you are, you can compile the branch and run with it for a while. That will surely help us pinpoint issues sooner.

          Scott Schneider added a comment -

          Thanks, Shai! Do you have a rough idea of when this feature (the current phase – not necessarily all IFU functionality) might be finished? And how long after that might it get backported? If you can give me a ~2 month range, that would be helpful. This feature is important to our next release and I'd like to just have this info for planning.

          Sivan Yogev added a comment -

          Thanks Shai, I will use this version in the performance tests.

          Shai Erera added a comment -

          Sivan, I committed the patch to the branch

          Shai Erera added a comment -

          Well, we definitely hope to release it in 5.0, but that of course is still subject to performance results and an overall review of the code. Since 5.0 is nowhere in sight as far as I know, I have high hopes it will make it there. Whether it's then back-ported to 4.x is still an open question that we will address after it has baked for a while in trunk.

          As for timing, we've started to work on a benchmark for this. However, we don't work on it at full capacity. If anyone is interested in helping, feel free to chime in. I hope that we'll have some numbers within a few weeks' time.

          Scott Schneider added a comment -

          I'd like to echo Jack's comment. I have been looking for exactly this feature and I'm thrilled to see people working on it! I'd like the same high level info: What version will have this feature, when will it be released (roughly), is there a high-level functional spec, and what impact do performance tests show? Or if this information isn't available yet, when might it be (roughly)?

          Jack Krupansky added a comment -

          It would be nice to have the feature description incorporated right here in the Jira itself. At least a concise summary of what it will do, how it will do it, and an example or two of how it would be used. Solr "atomic update" has an "increment" feature for numeric fields as well as an "add" for multivalued fields - does that proposal include (or at least be compatible with) those extensions as well? I mean, it would be nice if Solr could switch over to this approach, if possible. Of course, that begs the question of how this proposal would "stack up" against the technique that Solr uses today for incremental field update.

          Also, it sounds like this is more of a 5.0 feature, even if maybe it does eventually make it into some future 4.x release as well. I mean, 4.4 is expected within the next few weeks, right? While this feature sounds like it will require a lot of testing and baking to be ready for prime time.

          It sounds like it has an assignee, as well. Or does that have to be a committer?

          Sivan Yogev added a comment -

          New version of the previous patch, incorporating changes from trunk up to revision 1487827.
          Shai, can you please commit this to the branch?

          Sivan Yogev added a comment -

          New patch, all tests pass. After adding a test that mixed updates and deletes, I realized that mixing should be constrained, since it opens the door to complex scenarios and would require changes to the way deletes are saved, which is not a good idea. So I put a flag in IndexWriter which marks whether there are deletes pending; when a fields update is added and the flag is set, there is an automatic commit. So in FrozenBufferedDeletes we can be sure that deletes are to be applied after updates.
          I suspect my implementation is not safe. Shai & Mike, can you please take a look?
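          The constraint described in this comment might be sketched roughly as follows. This is an editor's illustration only: the class, field, and method names are hypothetical and are not the patch's actual IndexWriter code.

          ```java
          // Rough sketch of the ordering constraint: if deletes are pending when a
          // fields update arrives, force a commit first, so FrozenBufferedDeletes
          // can assume deletes are always applied after updates. Names are illustrative.
          public class UpdateOrderingWriter {
              private boolean deletesPending = false;

              public void deleteDocuments(String term) {
                  deletesPending = true; // deletes are buffered until flush/commit
              }

              public void updateFields(String term, String fieldsUpdate) {
                  if (deletesPending) {
                      commit(); // flush buffered deletes before accepting the update
                  }
                  // ... buffer the fields update for the next stacked segment ...
              }

              public void commit() {
                  deletesPending = false; // buffered deletes are applied here
              }

              public boolean hasDeletesPending() {
                  return deletesPending;
              }
          }
          ```

          The design choice being debated is exactly this forced commit: it keeps the delete bookkeeping simple at the cost of committing more often when deletes and updates are interleaved.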

          Shai Erera added a comment -

          Committed this to the branch + upgraded to trunk. Sivan, there are tests still failing – is this expected?

          Sivan Yogev added a comment -

          New patch handling another scenario of an update relating to a previous update in the same segment.

          Sivan Yogev added a comment -

          1. Removed some assertions which were based on assumptions that stacked segments break
          2. Added a mechanism to apply updates on already applied ones, now all tests pass
          3. Did some house cleaning
          What's left? Improve calculation of bytes used, make merge policy updates-aware, add the option to collapse stacked segments for segments that cannot be merged, check performance...

          Sivan Yogev added a comment -

          Thanks Shai, will look into it, modifying FrozenBufferedDeletes is the main challenge so far.

          Shai Erera added a comment -

          I upgraded branch to trunk. Sivan, I get many NPEs from tests like TestFieldReplacements. The NPEs come from FrozenBufferedDeletes which I had some difficulty merging. The NPEs are caused because 'deletes' are null. I checked trunk code, and FrozenBufferedDeletes assumes they are not null (as in the branch), so either I broke something while merging, or something has changed in trunk that the updates code doesn't respect. Can you take a look?

          Shai Erera added a comment -

          Committed the patch to the branch. I'll also upgrade branch to trunk.

          Sivan Yogev added a comment -

          Shai, can you please commit the patch to the branch? Thanks.

          Sivan Yogev added a comment - edited

          Added tests and fixed bugs. The most thorough test is testIndexEquality() which runs a complicated scenario of adding and replacing documents and fields and compares the resulting index to an index created only using addDocument. The IndexWriterConfig used can be either created using new IndexWriterConfig() or newIndexWriterConfig() which introduces randomness. When there is no randomness the test passes, with randomness and -Dtests.seed=FFC28997A6951FFB it fails.

          Shai Erera added a comment -

          Committed the latest patch to the branch and upgraded branch to latest trunk.

          Sivan Yogev added a comment -

          Updated patch over branch after merging with trunk. New tests, still with bugs in handling replacements.

          Sivan Yogev added a comment - edited

          Thanks Shai for applying the patch, it will take me some time to do adjustments to the changes in trunk, including some fixes to the branch.

          Shai Erera added a comment -

          Sivan, I tried to apply the patch to the branch (after the upgrade to trunk) but it does not apply. Some files are missing, some have issues. Can you perhaps bring this patch up to the current branch's version and post another one?

          Shai Erera added a comment -

          I upgraded branch to trunk (w/o latest patch). TestFieldUpdate fails on norms assertion. I'm not sure exactly what happened but SegmentReader had many strange conflicts, so I hope I didn't screw something up when resolving them. Will try to get to the bottom of it later.

          Sivan Yogev added a comment -

          New patch over the issue branch. Inversion of updated fields is done directly when they are added into a RAMDirectory, and updated segments are created by merging such directories. If more than one update is to be applied to the same document, the later update is moved to another updated segment.

          Still missing:
          1. Implement RAM usage computation for updates,
          2. Fix TestFieldReplacements.testIndexEquality().

          Sivan Yogev added a comment -

          Started switching to the invert-first approach, following Mike's advice. My thought was to have a single directory for each fields update, and when flushing do something similar to IndexWriter.addIndexes(IndexReader...) and build the stacked segment. However, I encountered two problems with this approach:
          1. If a certain document is updated more than once in a certain generation, two inverted documents should be merged into one,
          2. an extension of 1, where a field added in the first update is to be replaced in the second one.
          So, what I will try to do in such cases is to move the later updates to a new update generation. This will increase the number of generations, but I think it's a fair price to pay in light of the benefits offered by the invert-first approach.
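          The generation-bumping rule described above could be sketched like this (an editor's illustrative helper, not code from the branch; the class and method names are assumptions). Each generation holds at most one update per document, so a second update to the same document is deferred to the next generation:

          ```java
          import java.util.ArrayList;
          import java.util.HashSet;
          import java.util.List;
          import java.util.Set;

          // Illustrative sketch: assign each fields update to the earliest update
          // generation that does not yet contain an update for that document.
          public class UpdateGenerations {
              // generation index -> docIDs already updated in that generation
              private final List<Set<Integer>> generations = new ArrayList<>();

              /** Returns the generation this update is assigned to. */
              public int assign(int docId) {
                  for (int gen = 0; gen < generations.size(); gen++) {
                      if (generations.get(gen).add(docId)) {
                          return gen;
                      }
                  }
                  Set<Integer> next = new HashSet<>();
                  next.add(docId);
                  generations.add(next);
                  return generations.size() - 1;
              }
          }
          ```

          A document updated n times thus lands in generations 0 through n-1, which is the "increase the number of generations" cost mentioned above.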

          Shai Erera added a comment -

          I branched https://svn.apache.org/repos/asf/lucene/dev/branches/lucene4258. The patch is really immense and I think it'll be easier to work on this feature in a separate branch. Committed rev 1425441.

          Shai Erera added a comment - edited

          Woops, resolved wrong issue

          Michael McCandless added a comment -

          After rethinking the point-of-inversion issue, it seems the right time to do it is ASAP - not to hold the added fields and invert them later, but rather to invert them immediately and save their inverted version. 3 reasons for that:
          1. It takes out the constraint I inserted into the API, so update fields can be reused and contain a Reader/TokenStream,
          2. NRT support: we cannot search until we invert, and if we invert earlier NRT support will be less complicated, probably some variation on multi-reader to view uncommitted updates,
          3. You are correct that we currently do not account for the RAM usage of the FieldsUpdate, since I thought using RAMUsageEstimator will be too costly. It will probably be more efficient to calculate RAM usage of the inverted fields, maybe even during inversion?

          +1

          I would also add "4. Inversion of updates is single-threaded", ie once
          we move inversion into .updateFields it will be multi-threaded again.

          So my question in that regard is how can I invert a document and hold its inverted form to be used by NRT and later inserted into a stacked segment? Should I create a temporary Directory and invert into it? Is there another way to do this?

          I think we should somehow re-use the existing code that inverts (eg
          FreqProxTermsWriter)? Ie, invert into an in-RAM segment, with
          "temporary" docIDs, and then when it's time to apply the updates, you
          need to rewrite the postings to disk with the re-mapped docIDs.
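The in-RAM inversion with temporary docIDs can be modeled in plain Java (no Lucene classes; all names here are hypothetical): postings keyed by term hold temporary docIDs assigned at inversion time, and applying the update rewrites them through a docMap into final stacked-segment docIDs.

```java
import java.util.*;

// Plain-Java model (not Lucene API) of rewriting postings whose docIDs
// were assigned temporarily at inversion time. "docMap" maps a temporary
// docID to its final docID in the stacked segment.
public class PostingsRemap {
    // Rewrites each term's temporary docIDs through docMap and re-sorts,
    // since final docIDs need not preserve the temporary order.
    public static Map<String, int[]> remap(Map<String, int[]> postings, int[] docMap) {
        Map<String, int[]> out = new TreeMap<>();
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            int[] docs = e.getValue().clone();
            for (int i = 0; i < docs.length; i++) {
                docs[i] = docMap[docs[i]];
            }
            Arrays.sort(docs); // postings must stay sorted by docID
            out.put(e.getKey(), docs);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new TreeMap<>();
        postings.put("lucene", new int[]{0, 1});
        postings.put("update", new int[]{1});
        // temporary doc 0 becomes final doc 7, temporary doc 1 becomes final doc 3
        int[] docMap = {7, 3};
        Map<String, int[]> fixed = remap(postings, docMap);
        System.out.println(Arrays.toString(fixed.get("lucene"))); // [3, 7]
        System.out.println(Arrays.toString(fixed.get("update"))); // [3]
    }
}
```

The real implementation would write the remapped postings back through the codec; this only shows the docID bookkeeping.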

          I wouldn't do anything special for NRT for starters, meaning, from
          NRT's standpoint, it opens these stacked segments from disk as it
          would if a new non-NRT reader was being opened. So I would leave that
          TODO in SegmentReader as a TODO for now. Later, we can optimize
          this and have updates carried in RAM like we do for deletes, but I
          wouldn't start with that ...

          Merging is very important. Hmm, are we able to just merge all updates down to a single update? Ie, without merging the base segment? We can't express that today from MergePolicy right? In an NRT setting this seems very important (ie it'd be best bang (= improved search performance) for the buck (= merge cost)).

          Shai is helping in the creation of a benchmark to test performance in various scenarios. I will start adding update aspects to the merge policy. I am not sure if merging just updates of a segment is feasible. In what cases would it be better than collapsing all updates into the base segment?

          Imagine a huge segment that's accumulating updates ... say it has 20
          stacked segments. First off, those stacked segments are each tying up
          N file descriptors on open, right? (Well, only one if it's CFS). But
          second off, I would expect search perf with 1 base + 20 stacked is
          worse than 1 base + 1 stacked? We need to test if that's true
          ... it's likely that the most perf loss is going from no stacked
          segments to 1 stacked segment ... and then going from 1 to 20 stacked
          segments doesn't hurt "that much". We have to test and see.

          Simply merging that big base segment with its 20 stacked segments is
          going to be too costly to do very often.

          I think we need a test that indexes a known (randomly generated) set of documents, randomly sometimes using add and sometimes using update/replace field, mixing in deletes (just like TestField.addDocuments()), for the first index, and for the second index only using addDocument on the "surviving" documents, and then we assertIndexEquals(...) in the end? Maybe we can factor out code from TestDuelingCodecs or TestStressIndexing2.

          TestFieldReplacements already had a test which randomly adds documents, replaces documents, adds fields and replaces fields. I refactored it to enable using a seed, and created a "clean" version with only addDocument(...) calls. However, the FieldInfos of the "clean" version do not include things that the "full" version includes, because in the full version fields possessing certain field traits were added and then deleted. I will look at the other suggestions.

          It should be fine if the FieldInfos don't match? Ie, when comparing
          the two indices we should not compare field numbers? We should be
          comparing only external things like fieldName, which ids we had
          indexed, etc.
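The dueling-index test idea above could look roughly like this in plain Java (a toy model: each "index" is just a map from a doc's id term to its fields; the real test would drive IndexWriter and compare readers):

```java
import java.util.*;

// Toy model of the proposed dueling test: apply adds, field replacements,
// and deletes to a "full" index, apply only addDocument of the surviving
// documents to a "clean" index, then assert both end states match.
public class DuelingIndexModel {
    public static void main(String[] args) {
        Map<String, Map<String, String>> full = new HashMap<>();
        // add + replace-field + delete against the "full" index
        full.put("doc1", new HashMap<>(Map.of("title", "old", "body", "x")));
        full.put("doc2", new HashMap<>(Map.of("title", "t2")));
        full.get("doc1").put("title", "new");   // replace field
        full.remove("doc2");                    // delete

        // only addDocument of survivors against the "clean" index
        Map<String, Map<String, String>> clean = new HashMap<>();
        clean.put("doc1", new HashMap<>(Map.of("title", "new", "body", "x")));

        // compare only external things (field names, values), not
        // internals like field numbers in FieldInfos
        if (!full.equals(clean)) throw new AssertionError("indices differ");
        System.out.println("indices match");
    }
}
```

The randomized version would drive the same sequence from a shared seed, as the refactored TestFieldReplacements already does.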

          Shai Erera added a comment -

          I am not sure if merging just updates of a segment is feasible. In what cases would it be better than collapsing all updates into the base segment?

          Just like expungeDeletes, I think that we should have collapseFieldUpdates() which can be called explicitly by the app, but also IW should call MP.findSegmentsForFieldUpdates() (or some such name). And it should collapse all updates into the segment, which implies rewriting that segment. If we collapse all updates but keep the base segment + a single stacked segment, I don't think that we're doing much. The purpose is to get rid of updates entirely.
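A findSegmentsForFieldUpdates-style hook (the name is only the one proposed here; nothing like it exists yet) might simply pick segments whose stacked-update count crosses a threshold. A sketch without any Lucene types:

```java
import java.util.*;

// Sketch of the proposed (hypothetical) selection logic: collapse a
// segment's stacked updates once it carries more than maxStacked of them.
public class CollapseUpdatesPolicy {
    static class SegInfo {
        final String name;
        final int stackedUpdates;
        SegInfo(String name, int stackedUpdates) {
            this.name = name;
            this.stackedUpdates = stackedUpdates;
        }
    }

    public static List<String> findSegmentsForFieldUpdates(List<SegInfo> segs, int maxStacked) {
        List<String> toCollapse = new ArrayList<>();
        for (SegInfo s : segs) {
            if (s.stackedUpdates > maxStacked) {
                toCollapse.add(s.name); // rewrite base + stacked into a fresh segment
            }
        }
        return toCollapse;
    }

    public static void main(String[] args) {
        List<SegInfo> segs = Arrays.asList(
            new SegInfo("_0", 20), new SegInfo("_1", 0), new SegInfo("_2", 3));
        System.out.println(findSegmentsForFieldUpdates(segs, 2)); // [_0, _2]
    }
}
```

A separate threshold from the regular merge settings would let apps collapse updates more often than they merge, as suggested above.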

          Also, regarding statistics. I think that as a first step, we should not go out of our way to return the correct statistics. Just like the stats today do not account for deleted documents, so should the updates. I realize that it's not the same as deleted documents, but it certainly simplifies matters. Stats will be correct following collapseFieldUpdates or regular segment merges.

          As a second step, we can try to return statistics including stacked segments more efficiently. I.e., if a term appears in both the base and stacked segment, we return the stats from base. But if it exists only in the stacked segment, we can return the stats from there? I'm not too worried about the stats though, because that's a temporary thing, which gets fixed once updates are collapsed.

          And if the MergePolicy will have separate settings for collapsing field updates (I think it should!), then the collapsing could occur more frequently than regular merges (and expunging deleted documents). Also, it will give apps a way to control how often do they want to get accurate statistics.

          Can we leave statistics outside the scope of this issue? And for now change CheckIndex to detect that it's a segment with field updates, and therefore check stats from the base segment only? I think it does something like that with deleted documents already, no?

          Sivan Yogev added a comment -

          After rethinking the point-of-inversion issue, seems like the right time to do it is ASAP - not to hold the added fields and invert them later, but rather invert them immediately and save their inverted version. 3 reasons for that:
          1. Take out the constraint I inserted into the API, so update fields can be reused and contain Reader/TokenStream,
          2. NRT support: we cannot search until we invert, and if we invert earlier NRT support will be less complicated, probably some variation on multi-reader to view uncommitted updates,
          3. You are correct that we currently do not account for the RAM usage of the FieldsUpdate, since I thought using RAMUsageEstimator will be too costly. It will probably be more efficient to calculate RAM usage of the inverted fields, maybe even during inversion?

          So my question in that regard is how can I invert a document and hold its inverted form to be used by NRT and later inserted into stacked segment? Should I create a temporary Directory and invert into it? Is there another way to do this?

          Merging is very important. Hmm, are we able to just merge all updates down to a single update? Ie, without merging the base segment? We can't express that today from MergePolicy right? In an NRT setting this seems very important (ie it'd be best bang (= improved search performance) for the buck (= merge cost)).

          Shai is helping in the creation of a benchmark to test performance in various scenarios. I will start adding update aspects to the merge policy. I am not sure if merging just updates of a segment is feasible. In what cases would it be better than collapsing all updates into the base segment?

          I think we need a test that indexes a known (randomly generated) set of documents, randomly sometimes using add and sometimes using update/replace field, mixing in deletes (just like TestField.addDocuments()), for the first index, and for the second index only using addDocument on the "surviving" documents, and then we assertIndexEquals(...) in the end? Maybe we can factor out code from TestDuelingCodecs or TestStressIndexing2.

          TestFieldReplacements already had a test which randomly adds documents, replaces documents, adds fields and replaces fields. I refactored it to enable using a seed, and created a "clean" version with only addDocument(...) calls. However, the FieldInfos of the "clean" version do not include things that the "full" version includes, because in the full version fields possessing certain field traits were added and then deleted. I will look at the other suggestions.

          Michael McCandless added a comment -

          Are stored fields now sparse? Meaning if I have a segment w/ many docs, and I update stored fields on one doc, in that tiny stacked segments will the stored fields files also be tiny?

          In such a case you will get the equivalent of a segment with multiple docs, only one of them containing stored fields. I assume the impls of stored fields handle these cases well and you will indeed get tiny stored fields.

          You're right, this is up to the codec ... hmm but the API isn't sparse (you have
          to .addDocument 1M times to "skip over" 1M docs right?), and I'm not sure how well our
          current default (Lucene41StoredFieldsFormat) handles it. Have you tested it?

          Regarding the API - I made some cleanup, and removed also Operation.ADD_DOCUMENT. Now there is only one way to perform each operation, and updateFields only allows you to add or replace fields given a term.

          OK thanks!

          This means you cannot reuse fields, you have to be careful with pre-tokenized fields (can't reuse the TokenStream), etc.

          This is referred to in the Javadoc of updateFields; let me know if there's a better way to address it.

          Maybe also state that one cannot reuse Field instances, since the
          Field may not be actually "consumed" until some later time (we should
          be vague since this really is an implementation detail).
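The reuse hazard can be shown with a toy buffer (plain Java, no Lucene types): because an update may be buffered and consumed only later, mutating a reused field object retroactively changes the earlier buffered update.

```java
import java.util.*;

// Toy demonstration of why field instances must not be reused: updates
// are buffered and consumed later, so a mutable field object shared
// between two update calls corrupts the earlier buffered update.
public class FieldReuseHazard {
    static class MutableField {
        String name, value;
        MutableField(String name, String value) { this.name = name; this.value = value; }
    }

    public static void main(String[] args) {
        List<MutableField> buffered = new ArrayList<>();

        MutableField reused = new MutableField("title", "first");
        buffered.add(reused);           // "updateFields" call #1: only buffered
        reused.value = "second";        // caller reuses the instance
        buffered.add(reused);           // "updateFields" call #2

        // when the buffer is finally consumed, update #1 has been clobbered
        System.out.println(buffered.get(0).value); // second (not "first")
    }
}
```

This is exactly why the Javadoc wording can stay vague about timing while still forbidding reuse.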

          As for the heavier questions. NRT support should be considered separately, but the guideline I followed was to keep things as close as possible to the way deletions are handled. Therefore, we need to add to SegmentReader a field named liveUpdates - an equivalent to liveDocs. I already put a TODO for this (SegmentReader line 131), implementing it won't be simple...

          OK ... yeah it's not simple!

          The performance tradeoff you are rightfully concerned about should be handled through merging. Once you merge an updated segment all updates are "cleaned", and the new segment has no performance issues. Apps that perform updates should make sure (through MergePolicy) to avoid reader-side updates as much as possible.

          Merging is very important. Hmm, are we able to just merge all updates
          down to a single update? Ie, without merging the base segment? We
          can't express that today from MergePolicy right? In an NRT setting
          this seems very important (ie it'd be best bang (= improved search
          performance) for the buck (= merge cost)).

          I suspect we need to do something with merging before committing
          here.

          Hmm I see that
          StackedTerms.size()/getSumTotalTermFreq()/getSumDocFreq() pulls a
          TermsEnum and goes and counts/aggregates all terms ... which in
          general is horribly costly? EG these methods are called per-query to
          setup the Sim for scoring ... I think we need another solution here
          (not sure what). Also getDocCount() just returns -1 now ... maybe we
          should only allow updates against DOCS_ONLY/omitsNorms fields for now?
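The cost concern is that sumDocFreq/sumTotalTermFreq are recomputed by walking every term. Modeled in plain Java (a TreeMap stands in for the TermsEnum; names are illustrative only), the per-call work is O(number of terms), which is far too much to repeat for every query:

```java
import java.util.*;

// Plain-Java model of why StackedTerms-style stats are costly: each call
// walks the whole term dictionary (here a TreeMap standing in for a
// TermsEnum) instead of returning a precomputed value.
public class StackedStatsCost {
    // term -> {docFreq, totalTermFreq}
    static long sumDocFreq(TreeMap<String, long[]> terms) {
        long sum = 0;
        for (long[] stats : terms.values()) {
            sum += stats[0]; // O(#terms) work on every call
        }
        return sum;
    }

    public static void main(String[] args) {
        TreeMap<String, long[]> terms = new TreeMap<>();
        terms.put("a", new long[]{3, 5});
        terms.put("b", new long[]{2, 2});
        terms.put("c", new long[]{1, 4});
        System.out.println(sumDocFreq(terms)); // 6
    }
}
```

Precomputing these sums at write time (or deferring exact stats until updates are collapsed, as discussed below) avoids the per-query walk.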

          Have you done any performance tests on biggish indices?

          I think we need a test that indexes a known (randomly generated) set
          of documents, randomly sometimes using add and sometimes using
          update/replace field, mixing in deletes (just like TestField.addDocuments()),
          for the first index, and for the second index only using addDocument
          on the "surviving" documents, and then we assertIndexEquals(...) in the
          end? Maybe we can factor out code from TestDuelingCodecs or
          TestStressIndexing2.

          Where do we account for the RAM used by these buffered updates? I see
          BufferedUpdates.addTerm has some accounting the first time it sees a
          given term, but where do we actually add in the RAM used by the
          FieldsUpdate itself?

          David Smiley added a comment -

          It's exciting to see progress here! I too live in the "world" that Shai speaks of – DOCS_ONLY (w/ no norms). I don't need to update a title field, I need to update ACLs and various categorical "tag" fields that will subsequently influence faceting or filtering.

          Hey Rob, early on you made this excellent point:

          A second problem (not solved by the above) is that many people are using scoring factors with a variety
          of signals and these are changing often. I think unfortunately, people are often putting these in
          a normal indexed field and uninverting these on the fieldcache, requiring the whole document to
          be reindexed just because of how they implemented the scoring factor. People could instead solve this
          by putting their apps primary key into a docvalues field, allowing them to keep these scoring factors
          completely external to lucene (e.g. their own array or whatever), indexed by their own primary key. But
          the problem is I think people want lucene to manage this, they don't want to implement themselves whats
          necessary to make it consistent with commits etc.

          So true. What if Lucene had more hooks to make it easier to manage commit-consistency with side-car data? I have no clue what's needed exactly, only that I don't dare do this without such hooks because I fear the complexity. With hooks and documentation, it can become clear how to maintain data alongside Lucene's index, and this opens doors. Like making it easier to store data in something custom (e.g. a DB) instead of stored-fields (won't have to pay needless merge cost), or putting metrics that influence scoring somewhere as you hinted at above.
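The pattern Rob describes, a per-doc primary key in a docvalues field pointing at scoring data kept outside Lucene, can be modeled in plain Java (the array below stands in for a NumericDocValues field; the external map is the app's own store, and all names are hypothetical):

```java
import java.util.*;

// Model of keeping scoring factors outside the index: per-document
// primary keys (stand-in for a NumericDocValues field) are used to look
// up boosts the app maintains itself and can change without reindexing.
public class ExternalScoringFactors {
    public static float boostFor(int docID, long[] pkByDoc, Map<Long, Float> externalBoosts) {
        // unknown keys fall back to a neutral boost
        return externalBoosts.getOrDefault(pkByDoc[docID], 1.0f);
    }

    public static void main(String[] args) {
        long[] pkByDoc = {1001L, 1002L, 1003L};  // docID -> app primary key
        Map<Long, Float> boosts = new HashMap<>();
        boosts.put(1002L, 2.5f);                 // updated any time, no reindex

        System.out.println(boostFor(1, pkByDoc, boosts)); // 2.5
        System.out.println(boostFor(0, pkByDoc, boosts)); // 1.0
    }
}
```

The missing piece David points at is exactly the commit-consistency hooks: keeping the external store in sync with Lucene's commits is left entirely to the app in this sketch.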

          Sivan Yogev added a comment -

          New patch, concurrency bugs fixed. All tests pass.

          Sivan Yogev added a comment -

          Some existing tests fail with the latest patch, working on fixes.

          Sivan Yogev added a comment -

          Patch with some additional bug fixes and more elaborate tests, all working. Ready to commit?

          Sivan Yogev added a comment -

          Let me start with the last question.

          Are stored fields now sparse? Meaning if I have a segment w/ many docs, and I update stored fields on one doc, in that tiny stacked segments will the stored fields files also be tiny?

          In such a case you will get the equivalent of a segment with multiple docs, only one of them containing stored fields. I assume the impls of stored fields handle these cases well and you will indeed get tiny stored fields.

          Regarding the API - I made some cleanup, and removed also Operation.ADD_DOCUMENT. Now there is only one way to perform each operation, and updateFields only allows you to add or replace fields given a term.

          This means you cannot reuse fields, you have to be careful with pre-tokenized fields (can't reuse the TokenStream), etc.

          This is referred to in the Javadoc of updateFields; let me know if there's a better way to address it.

          As for the heavier questions. NRT support should be considered separately, but the guideline I followed was to keep things as close as possible to the way deletions are handled. Therefore, we need to add to SegmentReader a field named liveUpdates - an equivalent to liveDocs. I already put a TODO for this (SegmentReader line 131), implementing it won't be simple...

          The performance tradeoff you are rightfully concerned about should be handled through merging. Once you merge an updated segment all updates are "cleaned", and the new segment has no performance issues. Apps that perform updates should make sure (through MergePolicy) to avoid reader-side updates as much as possible.

          Michael McCandless added a comment -

          Why do we have FieldsUpdate.Operation.ADD_DOCUMENT? It seems weird to pass that to IW.updateFields? Shouldn't apps just use IW.addDocument?

          We have ADD_ and REPLACE_ for FIELDS, and also REPLACE_DOCUMENTS, so having ADD_DOCUMENT would allow applications to work only with updateFields. There certainly are actions that can be performed in more than one way in this API; do you find this too confusing?

          Well I just generally prefer that there is one [obvious] way to do
          something ... it can cause confusion otherwise, ie users will wonder
          what's the difference between addDocument and
          updateFields(Operation.ADD_DOCUMENT, ...)

          Why do we need SegmentInfoReader.readFilesList? ...

          I considered the alternative you propose of having a SegmentInfo for each stacked segment, and it seemed more complex to manage than what is done with .del files, so I chose the .del files approach. You are right about its privacy; I removed it from SegmentInfoReader and the actual readers keep it private.

          OK.

          It looks like merge policies don't yet know about / target stacked segments ...

          I was planning to have it in another issue. should I create it already?

          Another issue is a good idea! No need to create it yet ... but it
          seems like it will be important for real usage.

          Do we have any sense of how performance degrades as the stack gets
          bigger? It's more on-the-fly merging at search-time...

          I'm worried about that search-time merge cost ... I think it's usually
          better to pay a higher indexing cost in exchange for faster search
          time, which makes LUCENE-4272 a compelling alternate approach...

          It seems like we don't invert the document updates until the updates are applied? ...

          I went for the simple solution, trying to introduce as few new concepts as possible (and still the patch size is >7000 lines). Your proposal should certainly be considered and maybe tested. I need to make sure I do the RAM calculations right; the added documents must be reflected in the RAM consumption of the deletions queue.

          OK that makes sense; we should definitely do whatever's
          easiest/fastest to get to a dirt path.

          We should think through the tradeoffs. I think it may confuse apps
          that the Field is not "consumed" after IW.updateFields returns, but
          rather cached and processed later. This means you cannot reuse
          fields, you have to be careful with pre-tokenized fields (can't reuse
          the TokenStream), etc.

          It also means NRT reopen is unexpectedly costly, because only on flush
          will we invert & index the documents, and it's a single-threaded
          operation during reopen (vs per-thread if we invert up front).

          Still it makes sense to do this for starters ... it's simpler.

          Why does StoredFieldsReader.visitDocument need a Set for ignored fields?

          When fetching stored fields from a segment with replacements, it is possible that all contents of a certain field for the base and first n stacked segments should be ignored. Therefore, the implementation starts the visiting from the most recent updates. If we encounter at some stage a field replacement, that field name is added to the Set of ignored fields, and later the content of that field in the stacked segments we encounter (which are older updates) is ignored.

          Ahhh right.

          Are stored fields now sparse? Meaning if I have a segment w/ many
          docs, and I update stored fields on one doc, in that tiny stacked
          segments will the stored fields files also be tiny?

          Sivan Yogev added a comment -

          Erick - thanks for the support!

          Mike - thanks for the comments, here's an attempt to supply answers:

          Regarding formatting and all deletes - I will check and try to fix those.

          Why do we have FieldsUpdate.Operation.ADD_DOCUMENT? It seems weird to
          pass that to IW.updateFields? Shouldn't apps just use
          IW.addDocument?

          We have ADD_ and REPLACE_ for FIELDS, and also REPLACE_DOCUMENTS, so having ADD_DOCUMENT would allow applications to work only with updateFields. There certainly are actions that can be performed in more than one way in this API; do you find this too confusing?

          Why do we need SegmentInfoReader.readFilesList? ...

          I considered the alternative you propose of having a SegmentInfo for each stacked segment, and it seemed more complex to manage than what is done with .del files, so I chose the .del files approach. You are right about its privacy; I removed it from SegmentInfoReader and the actual readers keep it private.

          It looks like merge policies don't yet know about / target stacked
          segments ...

          I was planning to have it in another issue. should I create it already?

          It seems like we don't invert the document updates until the updates
          are applied? ...

          I went for the simple solution, trying to introduce as few new concepts as possible (and still the patch size is >7000 lines). Your proposal should certainly be considered and maybe tested. I need to make sure I do the RAM calculations right; the added documents must be reflected in the RAM consumption of the deletions queue.

          Why does StoredFieldsReader.visitDocument need a Set for ignored
          fields?

          When fetching stored fields from a segment with replacements, it is possible that all contents of a certain field for the base and first n stacked segments should be ignored. Therefore, the implementation starts the visiting from the most recent updates. If we encounter at some stage a field replacement, that field name is added to the Set of ignored fields, and later the content of that field in the stacked segments we encounter (which are older updates) is ignored.
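          The newest-first visiting logic described above can be sketched in plain Java. All names here are hypothetical illustrations, not the patch's actual classes; stacked segments are modeled as simple maps from field name to stored values:

```java
import java.util.*;

// Hypothetical sketch (names are not from the patch) of the newest-first
// stored-fields visiting described above: once a newer stacked segment
// replaces a field, that field's values in older segments are ignored.
class StackedStoredFieldsSketch {

  /**
   * @param field        the stored field to fetch
   * @param newestFirst  stacked segments, newest first; each maps a field
   *                     name to its stored values in that segment
   * @param replacements per-segment set of field names that the segment
   *                     replaced (as opposed to merely added to)
   */
  static List<String> visitField(String field,
                                 List<Map<String, List<String>>> newestFirst,
                                 List<Set<String>> replacements) {
    List<String> values = new ArrayList<>();
    Set<String> ignored = new HashSet<>();
    for (int i = 0; i < newestFirst.size(); i++) {
      if (!ignored.contains(field)) {
        List<String> v = newestFirst.get(i).get(field);
        if (v != null) {
          values.addAll(v);
        }
      }
      if (replacements.get(i).contains(field)) {
        ignored.add(field); // older occurrences of this field are stale
      }
    }
    return values;
  }
}
```

          Note that a replacement only hides occurrences in older segments; the replacing segment's own values are still visited before the field is added to the ignored set.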

          Michael McCandless added a comment -

          Trying to catch up here ... I just have a bunch of random questions
          (don't fully understand the patch yet):

          Not sure why some files show as all deleted / all added lines, eg at
          least FrozenBufferedDeletes.java.

          Patch also has tabs, which should be spaces... (eg IndexWriter.java).

          Why do we have FieldsUpdate.Operation.ADD_DOCUMENT? It seems weird to
          pass that to IW.updateFields? Shouldn't apps just use
          IW.addDocument?

          Why do we need SegmentInfoReader.readFilesList? It seems like it's
          only privately used inside the codec? I'm confused why the "normal"
          file tracking we have on write is insufficient... oh I see, a single
          SegmentInfo references all stacked segments too? But since they are
          written "later" their files won't be automatically tracked ... ok. I
          wonder if each stacked segment should get its own SegmentInfo, linked
          to the base segment...

          It looks like merge policies don't yet know about / target stacked
          segments ...

          It seems like we don't invert the document updates until the updates
          are applied? Ie, we just buffer the IndexableField provided by the
          user and when it's time to apply updates, we then analyzing/invert?
          How do we track RAM in this case? (Eg the field could be something
          arbitrary, eg pre-tokenized). Another option is to do the invert and
          buffer the resulting postings, and then later "replay" them (remapping
          docIDs) when it's time to apply.

          Why does StoredFieldsReader.visitDocument need a Set for ignored
          fields?

          Sivan Yogev added a comment -

          Patch with stored fields bug fixed.

          Erick Erickson added a comment -

          Guys:

          It's great that you're tackling this. I wanted to encourage you to be patient about responses; there are just a few people in the universe who understand the details well enough to comment on the code (I'm sure not one of them!), so it might feel like you're shouting down a well...

          Sivan Yogev added a comment -

          Solved the problem with stored fields, I understood that in order to have the right number of documents for the stored fields reader the last document must have a stored field. Term Vectors still failing...

          Sivan Yogev added a comment -

          New patch. This patch contains some failing tests, some probably due to problems in my implementation of SegmentReader.getTermVectors(int), and others in handling of stored fields.

          Can anyone with knowledge of these two areas check what is it that I do wrong? Thanks.

          Sivan Yogev added a comment -

          New patch with additional testing and bug fixes.

          Currently the term statistics do not take field replacements into account, and therefore term counts are wrong and CheckIndex fails.

          I can think of two possible solutions for this. The first is for CheckIndex to identify updated segments and ignore term statistics - is there a similar mechanism for deletions?

          The other solution is to pre-compute term statistics for updated segments. However, this will be costly - it requires going through the entire posting list for every term and counting non-replaced occurrences.

          Any suggestions?

          Sivan Yogev added a comment -

          New patch implementing some previously missing parts, with preliminary code to enable field replacements.

          Sivan Yogev added a comment -

          New patch, naive test of adding updates to a single-document segment before or after update working. Working on more complex tests with multiple segments, documents and updates.

          Shai Erera added a comment -

          We will take care of DocValues too, eventually. I think this can be handled in a separate issue though.

          and would already solve most use-cases

          I have a problem with that statement. Robert thinks that the most common use cases are to replace a field's content entirely. In our world (Sivan's and mine), updating a field's terms (removing / adding single terms) is the most common use case. And perhaps in your world updating DocValues for scoring purposes is the most common use case.

          Therefore I don't think that there is one common use case, and therefore IMO we shouldn't aim at solving one first. Personally, DocValues are relatively new (compared to posting lists and payloads) and therefore I believe that being able to update them should come second (just because I estimate they are not as widely used as the others). But that's just my opinion – obviously one that relies solely on DocValues would state otherwise.

          The design currently doesn't cover DocValues at all. I think, in order to keep this issue focused, we should handle that in a separate issue after we land updateable fields.

          On slide 4 one of the enumerated operations is field deletion but I am not sure how to do it with the proposed API on slide 5?

          Good point. Well, you could say that replaceField("f", "value") called as replaceField("f", null) would mean "delete 'f'". That should work, but perhaps we can come up with something more explicit.
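          As a toy illustration of the null-replacement convention being discussed (hypothetical names, not a proposed Lucene API; stored fields modeled as a plain map):

```java
import java.util.*;

// Toy model of the convention discussed above: replaceField(name, null)
// is interpreted as "delete field name". Hypothetical sketch only.
class FieldUpdateSketch {
  final Map<String, String> fields = new HashMap<>();

  void replaceField(String name, String value) {
    if (value == null) {
      fields.remove(name); // a null replacement means: delete the field
    } else {
      fields.put(name, value);
    }
  }
}
```

          An explicit removeField(name) would make the intent clearer at the call site, at the cost of one more API method.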

          Adrien Grand added a comment -

          On slide 4 one of the enumerated operations is field deletion but I am not sure how to do it with the proposed API on slide 5?

          It is just a thought, but your work plan only mentions Lucene fields. Wouldn't it be easier to start working with DocValues? I guess it would help us get started with document updates and would already solve most use-cases (I'm especially thinking of scoring factors).

          Sivan Yogev added a comment -

          Forgot to mention that the implementation patch is still missing many components...

          Sivan Yogev added a comment -

          Adding a design proposal presentation, and two patches following the proposal concepts. The first patch includes the proposed API changes (it does not compile), and the other one the inner changes, for those interested in the implementation details. The second patch contains a new test named TestFieldsUpdates which currently fails.

          Sivan Yogev added a comment -

          Working on the details, it seems that we need to add a new layer of information for stacked segments. For each field that was added with REPLACE_FIELDS, we need to hold the documents in which a replace took place, with the number of the latest generation that had the replacement. Name this list the "generation vector". That way, TermDocs provided by StackedSegmentReader for a certain term is a special merge of that term's TermDocs for all stacked segments. The "special" part about it is that we ignore occurrences from documents in which the term's field was replaced in a later generation.

          An example. Assume we have doc 1 with title "I love bananas" and doc 2 with title "I love oranges", and the segment is flushed. We will have the following base segment (ignoring positions):

          bananas: doc 1
          I: doc 1, doc 2
          love: doc 1, doc 2
          oranges: doc 2

          Now we add to doc 1 additional title field "I hate apples", and replace the title of doc 2 with "I love lemons", and flush. We will have the following segment for generation 1:

          apples: doc 1
          hate: doc 1
          I: doc 1, doc 2
          lemons: doc 2
          love: doc 2
          generation vector for field "title": (doc 2, generation 1)

          TermDocs for a few terms:

          • title:bananas : {1}, uses the TermDocs of the base segment; not affected by the "title" generation vector.
          • title:oranges : {}, uses the TermDocs of the base segment; doc 2's title is affected for generations < 1, and this occurrence is from generation 0, so it is dropped.
          • title:lemons : {2}, uses the TermDocs of generation 1; doc 2's title is affected for generations < 1, but the term appears in generation 1.
          • title:love : {1,2}, uses the TermDocs of both segments; doc 2's title is affected for generations < 1, but the term appears in generation 1.

          I propose to initially use PackedInts for the generation vector, since we know how many generations the current segment has upon flushing. Later we might consider special treatment for sparse vectors.
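          The filtering rule can be stated compactly: an occurrence of a term from generation g in doc d is live iff g >= the latest generation that replaced the term's field in d (0 if it never was). A rough sketch of that rule, with hypothetical names, ignoring positions and the PackedInts encoding:

```java
import java.util.*;

// Rough sketch (hypothetical names) of the generation-filtered doc
// iteration described above. An occurrence of the term in doc d coming
// from generation g is live only if g >= the latest generation in which
// the term's field was replaced for d (0 if it never was).
class GenerationFilterSketch {
  static List<Integer> liveDocs(Map<Integer, List<Integer>> docsByGeneration,
                                Map<Integer, Integer> generationVector) {
    SortedSet<Integer> live = new TreeSet<>();
    for (Map.Entry<Integer, List<Integer>> e : docsByGeneration.entrySet()) {
      int gen = e.getKey();
      for (int doc : e.getValue()) {
        if (gen >= generationVector.getOrDefault(doc, 0)) {
          live.add(doc); // not superseded by a later replacement
        }
      }
    }
    return new ArrayList<>(live);
  }
}
```

          Running the example above through this sketch, title:oranges (generation 0, doc 2, replaced in generation 1) yields no live docs, while title:love yields doc 1 from the base segment plus doc 2 from generation 1.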

          Sivan Yogev added a comment -

          How does this relate (if at all, I confess I just looked at the title) to Andrzej's proposal here?

          The basic idea is the same. One major difference is that in Andrzej's proposal the stacked updates are added to a new index with different doc IDs, and then the SegmentReader needs to map them to the original doc IDs. The plan in this proposal (Shai, correct me if I'm wrong) is for the stacked updates not to be stand-alone segments. Although they will have the structure of regular segments, they will be tightly coupled with the original segment, with doc IDs matching those of the original segment.

          Shai Erera added a comment -

          BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.

          Ok. I think to not confuse though, we should call it UPDATE_TERMS (not FIELDS). Then someone can updateFields() twice, once for all the fields which he wants to REPLACE and second for the fields he just wants to update their terms.

          What about merging?

          I wrote about it above – MergePolicy will need to take care of these stacked segments, and we'll add something like merge/expungeFieldUpdates so the app can call it deliberately.

          seems like the app should be using doc values instead, and we should (eventually) make doc values updatable?

          I agree we should not UPDATE_TERMS fields that record norms. I'm not sure that every use case of storing info in the payload today can be translated to using DocValues, so I don't want to limit things. So, let's start with UPDATE_TERMS taking care of fields that omit norms. Then, handling payloads for the few use cases that need them can come as an optimization later on. In the meanwhile, apps will just need to replace the entire field.

          Progress, not perfection!

          Robert Muir added a comment -

          I don't see how UPDATE_FIELDS can do so unless we somehow save the raw
          stats (FieldInvertState) in the index. It seems like UPDATE_FIELDS
          should forever be limited to DOCS_ONLY, no norms updating?

          Actually it's DOCS_ONLY plus OMIT_NORMS.

          Anyway why not start with updating the entire contents of a field as I suggested?
          It seems to be the most general solution, and there is some discussion about how scoring can
          work correctly on LUCENE-3837 (the stats, not just norms).

          I do think this is a common use case (ACLs, filters, social
          tags)... though I'm not sure how bad it'd really be in practice for
          the app to simply REPLACE_FIELDS with the full set of tags. I guess
          if we build REPLACE_FIELDS first we can test that.

          This is why we should do 'replace contents of a field' first. It's the most well-defined and general.

          It's also still controversial; I myself am not convinced it will actually help most people that think
          they want it. I think it will just slow down searches.

          Erick Erickson added a comment -

          How does this relate (if at all, I confess I just looked at the title) to Andrzej's proposal here? https://issues.apache.org/jira/browse/LUCENE-3837

          Michael McCandless added a comment -

          BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.

          +1

          I think this design sounds good! REPLACE_FIELDS should easily be able
          to update norms correctly, right? Because the full per-field stats
          are recomputed from scratch. So then scores should be identical:
          should be a nice simple testcase to create

          I don't see how UPDATE_FIELDS can do so unless we somehow save the raw
          stats (FieldInvertState) in the index. It seems like UPDATE_FIELDS
          should forever be limited to DOCS_ONLY, no norms updating? Positions
          also seems hard to update, and if the only reason to do so is for
          payloads... seems like the app should be using doc values instead, and
          we should (eventually) make doc values updatable?

          I do think this is a common use case (ACLs, filters, social
          tags)... though I'm not sure how bad it'd really be in practice for
          the app to simply REPLACE_FIELDS with the full set of tags. I guess
          if we build REPLACE_FIELDS first we can test that.

          The implementation should be able to piggy-back on all the
          buffering/tracking we currently do for buffered deletes.

          I think this change should live entirely above Codec? Ie Codec just
          thinks it's writing a segment, not knowing if that segment is the base
          segment, or one of the stacked ones. If the +postings and -postings
          are simply 2 terms then the Codec need not know...

          Seems like only SegmentInfos needs to track how segments stack up, and
          then I guess we'd need a new StackedSegmentReader that is atomic,
          holds N SegmentReaders, and presents the merged codec APIs by merging
          down the stack on the fly? I suspect this (having to use a PQ to
          merge the docIDs in the postings) will be a huge search performance
          hit....

          I think UnionDocs/AndPositionsEnum (in MultiPhraseQuery.java) is
          already doing what we want? (Except it doesn't handle negative
          postings).

          What about merging? Seems like the merge policy should know about
          stacking and should sometimes (aggressively?) merge a stack down?
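The PQ-based merge described above can be simulated without any Lucene types. The sketch below merges sorted docID arrays from stacked layers with a priority queue, letting a higher layer's negative postings suppress docs from lower layers; all class and method names here are hypothetical, not actual Lucene code:

```java
import java.util.*;

public class StackedDocsMerge {

    // layers[i] = sorted docIDs contributed by layer i; removes[i] = docIDs that
    // layer i removes from all lower layers. Higher index = more recent layer.
    public static List<Integer> merge(int[][] layers, int[][] removes) {
        // Min-heap of {docID, layerIndex, positionInLayer}, ordered by docID.
        PriorityQueue<int[]> pq = new PriorityQueue<>(Comparator.comparingInt(a -> a[0]));
        for (int i = 0; i < layers.length; i++) {
            if (layers[i].length > 0) pq.add(new int[] {layers[i][0], i, 0});
        }
        // Collect each layer's removals into a set for the suppression check.
        List<Set<Integer>> removed = new ArrayList<>();
        for (int[] r : removes) {
            Set<Integer> s = new HashSet<>();
            for (int d : r) s.add(d);
            removed.add(s);
        }
        List<Integer> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            int doc = top[0], layer = top[1], pos = top[2];
            // A doc from layer L survives unless some higher (newer) layer removed it.
            boolean alive = true;
            for (int higher = layer + 1; higher < layers.length; higher++) {
                if (removed.get(higher).contains(doc)) { alive = false; break; }
            }
            // De-duplicate docs contributed by several layers (heap emits them adjacently).
            if (alive && (out.isEmpty() || out.get(out.size() - 1) != doc)) out.add(doc);
            if (pos + 1 < layers[layer].length) {
                pq.add(new int[] {layers[layer][pos + 1], layer, pos + 1});
            }
        }
        return out;
    }
}
```

A real StackedSegmentReader would merge codec DocsEnums rather than int arrays, but the heap bookkeeping (and the per-doc suppression cost that worries Mike) is the same.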

          Sivan Yogev added a comment -

          BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.

          Sivan Yogev added a comment -

          Seems like in any case we need to have a separation between fields given with UPDATE_FIELD and REPLACE_FIELD. There are two ways I could think of for implementing this separation.

          The first is at the segment level, where we can have separate "update" and "replace" segments, where the semantic is that a field in an "update" segment is merged with fields in previous segments, while a field in a "replace" segment ignores previous segments.

          The second option is to separate at the field level, choosing one type as the default behavior (maybe this can be configurable) and marking the fields of the non-default type by altering the field name or some other solution.

          I lean towards the segment-level separation, since it requires fewer conventions and will probably require less work for Codec implementations to handle.

          Shai Erera added a comment -

          I had a chat about this with Robert a couple of days ago, figured it'll be easier to discuss the differences in approaches/opinions, rather than back and forth JIRA comments. Our idea of incremental field updates is not much different. Robert stressed that in his opinion we should tackle first the REPLACE_FIELD operation, which replaces the content of a field entirely by a new content, because he believes that's the most common scenario (i.e., update the title field). I believe that term-based updates are very important too, at least in the scenarios that I face (i.e. adding/removing one ACL, one social tag, one category etc.).

          We concluded that the design should take REPLACE_FIELD into consideration from the get go. Whether we'll also implement UPDATE_FIELD (or UPDATE_TERMS as a better name?) depends on the complexity of it. Because initially UPDATE_TERMS can be implemented through REPLACE_FIELD, so we don't lose functionality. UPDATE_TERMS can come later as an optimization.

          Robert, if I misrepresented our conclusions, please correct me.

          Shai Erera added a comment -

          can only update individual terms, but not fields

          Who said that? One of the update operations I listed is REPLACE_FIELD, which means replace the field's content entirely with the new content.

          I don't think it's progress if we add a design that can only work with omitTFAP and no norms

          I never said that will be the design. What I said is that in order to update a field at the term level, we'll start with such fields only. The rest of the fields (i.e. w/ norms, payloads and what not) will be updated through REPLACE_FIELD. The way I see it, we still address all the issues, only for some fields we require a whole field replace, and not an optimized term-based update. That can be improved further along, or not.

          In fact this issue shouldn't be called incremental field updates: it's not. It's "term updates" or something else entirely different.

          That is my idea of incremental field updates, and I'm not sure that it's not your idea as well. You seem to only want to support REPLACE_FIELD, while I say that for some field types we can support UPDATE_FIELD (i.e. at the term level), that's it!

          Robert Muir added a comment -

          I don't think it's progress if we add a design that can only work with omitTFAP and no norms,
          and can only update individual terms, but not fields.

          It means that to support these things we would have to totally clear out what's there, and then introduce a new design.

          In fact this issue shouldn't be called incremental field updates: it's not. It's "term updates" or something else entirely different.

          Shai Erera added a comment -

          ...which is to update the contents of one field, without reindexing the entire document

          I agree, but I distinguish between two operations:

          1. replacing the content of a field entirely with a new content (or remove the field)
          2. update the field's content by adding/removing individual terms

          I think requiring no positions, no frequencies, and no norms makes it even more fringe. This means it's not really useful for any search purposes. And we are a search engine library.

          I disagree. Where I come from, the most common use case where such an operation will be useful is when a single change affects hundreds and sometimes thousands of documents. An example is a document-library-like application which manages folders with ACLs. You can add an ACL to a top-level folder and it affects all the documents and folders beneath it. That results in reindexing, sometimes, a huge number of documents.

          I don't diminish the use case of updating a field for scoring purposes, not at all. Just saying that starting by supporting one use case is more than supporting no use case.

          Now, and this probably stems from my lack of understanding of the Lucene internals – I see "supporting terms that omit TFAP" as a starting point because that is the easiest case, and even that requires a lot of understanding of the internals. After we do that, I'll feel more comfortable discussing other types of updates for other field types ... at least, I'll feel that I have more intelligent things to say.

          Regarding your other concerns, I share them, and we of course need to benchmark everything. I don't know yet how much this affects search. But those updates will get merged away when segments are merged, so while I'm sure search will be affected, it's not for eternity - only until that segment is merged. And I think we need to add a capability to MergePolicy to findSegmentsForMergeUpdates, just like we expungeDeletes today.

          If the first step means that in order to update a field used for scoring (i.e. w/ norms) you need to replace the content of the field entirely with new content, I'm ok with it. As one esteemed member of this community always says, "progress, not perfection" - I'm totally sold on that!

          Robert Muir added a comment -

          Codecs didn't handle all segment files first... stored fields and such were added later. I do agree though that we should keep in mind the full range of scenarios.

          I don't think that's really comparable at all, for two reasons:
          1. Codecs can be considered a "rote" refactoring of the XXXWriter in 3.x. I'm not trying to diminish the value, but it's just an introduced abstraction layer. Something like this is different in that it's algorithmic.
          2. The fact that Codecs only handled postings at first wasn't easy to fix after they were introduced as postings-only. Once they handled postings initially, extending them was a significant refactoring.

          I'm not trying to pick on your proposal, I'm just saying there are things I don't like about the design.

          • I think that updating individual terms is a fringe use-case, and not the major use case for incremental updates, which is to update the contents of one field, without reindexing the entire document. This was also noted by someone else on the discussion thread. This issue seems to be solely about supporting the 'tagging' use case, which is just one of many.
          • I think requiring no positions, no frequencies, and no norms makes it even more fringe. This means its not really useful for any search purposes. And we are a search engine library.
          • I think that negatives won't compress well, as compression algorithms for IR in recent years generally focus on positive integers.
          • I think merging the postings will be slow: I don't like the tradeoff of slowing down searching so much for what I'm not even sure will be a significant speedup to indexing.
          Shai Erera added a comment -

          I think it's ok if we introduce IFU for DOCS_ONLY fields at first, throwing exceptions otherwise. E.g., UpdateField overrides setOmitNorms and such and throws UOE... at first.

          Everything else will still work as it is today...

          Codecs didn't handle all segment files first... stored fields and such were added later. I do agree though that we should keep in mind the full range of scenarios.

          Sorry for the short response, JIRA isn't great on smartphones :-).

          Robert Muir added a comment -

          I don't think I'm sold on introducing the feature in steps.

          I think it's critical for something of this magnitude that we figure out the design totally, up-front,
          so it will work for the major use-cases. I think it's fine to implement in steps if we need though.

          Honestly I think we should throw it all out on the table and get to the real problems I think
          that most people face today:

          1. For many document sizes, use-cases (especially rapidly changing stuff): The real problem is not the
            speed of lucene reindexing the document, its that the user must rebuild the entire document. Solr solved
            this by providing an option where you just say "update field X" and internally it reindexes the
            document from stored fields (for that feature to work, the whole thing must be stored). We shouldn't
            discard the possibility of implementing cleaner support for a solution like this, which wouldn't
            complicate IndexWriter at all.
          2. A second problem (not solved by the above) is that many people are using scoring factors with a variety
            of signals and these are changing often. I think unfortunately, people are often putting these in
            a normal indexed field and uninverting these on the fieldcache, requiring the whole document to
            be reindexed just because of how they implemented the scoring factor. People could instead solve this
            by putting their apps primary key into a docvalues field, allowing them to keep these scoring factors
            completely external to lucene (e.g. their own array or whatever), indexed by their own primary key. But
            the problem is I think people want Lucene to manage this; they don't want to implement themselves what's
            necessary to make it consistent with commits etc.
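Point 1 above (the Solr-style "update field X" that internally reindexes from stored fields) can be illustrated with a toy in-memory store. `StoredFieldUpdate` below is purely illustrative, not Lucene or Solr code; note it only works if every field of the document is stored:

```java
import java.util.*;

public class StoredFieldUpdate {

    // docKey -> (field -> stored value); stands in for stored fields plus the index.
    private final Map<String, Map<String, String>> store = new HashMap<>();

    public void addDocument(String key, Map<String, String> fields) {
        store.put(key, new HashMap<>(fields));  // "index" the full document
    }

    // "Update field X": fetch the stored document, change one field, reindex all of it.
    public void updateField(String key, String field, String newValue) {
        Map<String, String> doc = store.get(key);          // fetch all stored fields
        Map<String, String> rebuilt = new HashMap<>(doc);  // rebuild the document
        rebuilt.put(field, newValue);                      // change just one field
        addDocument(key, rebuilt);                         // reindex everything
    }

    public String get(String key, String field) {
        return store.get(key).get(field);
    }
}
```

The cost of this approach is that the whole document is still reindexed internally; the benefit is that IndexWriter itself stays untouched.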

          So we can look at several approaches to solving this stuff. I feel like both of these problems could be
          solved via a contrib module without modifying IndexWriter at all for many use cases: maybe better if
          we go for tighter integration. And with those simple approaches I describe above, searching doesn't
          get any slower.

          But if we really feel like we need a "full incremental update API" (I know there are a few use cases
          where it can help, I'm not discarding that), then I feel like there are a few things I want:

          • I want scoring to be correct: this is a must. If we provide a incremental update API on IW and it doesnt
            achieve the same thing as updateDocument today, then its broken. But I think its ok for things to
            be temporarily off (as long as this is in a consistent way) until merging takes place, just like
            deletes today.
          • I want to know for any incremental update API, the cost to search performance.
            I want to know, at what document size is any incremental update API actually faster than us just
            reindexing the document internally, and how much faster is it? I also want us to consider that
            compared to the slowdown in search performance. We should know what the tradeoffs are before committing
            such APIs.

          I strongly feel like if we just add these incremental APIs to IndexWriter without being careful about these
          things, the end result could be that people use them without thinking and end up with slower search
          and worse relevance; that's why I am asking so many questions.

          Shai Erera added a comment -

          There is more to it than just the referenced email. I've had a couple of discussions in the past about this with various people (and it is my fault that I didn't write them down and share them with the rest of you) – I'll try to summarize below a more detailed proposal:

          API
          Add an updateFields method which takes a Constraint and OP (eventually, it might replace today's updateDocument):

          • Constraint defines 'which documents' should be updated, and follows today's deleteDocument API (takes Term, Query and arrays of each)
          • OP defines the actual update to do on those documents:
            • It has a TYPE, with 3 values (At least for now):
              1. REPLACE_DOC – replaces an entire document (essentially what updateDocument is today)
              2. UPDATE_FIELD – incrementally update a field
              3. REPLACE_FIELD – replaces a field entirely
            • In addition, it takes a Field[] (or Iterable) to remove/add.
            • In light of the recent changes to IndexableField and friends, perhaps what it should take is a concrete UpdateField with a boolean specifying whether to add/remove its content. Suggestions are welcome!
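Under the assumptions above, the proposed API surface might look roughly like the following Java sketch. `Constraint`, `FieldUpdateOp`, and `describe` are invented names for illustration, not actual Lucene signatures:

```java
import java.util.List;

public class UpdateApiSketch {

    // Which documents to update; mirrors today's deleteDocuments(Term/Query...) style.
    public interface Constraint {
        boolean matches(int docId);
    }

    // The three operation types listed above.
    public enum UpdateType { REPLACE_DOC, UPDATE_FIELD, REPLACE_FIELD }

    // An update operation: a type plus the fields to add/remove/replace.
    public static final class FieldUpdateOp {
        public final UpdateType type;
        public final List<String> fields;  // stand-in for Field[] / Iterable of fields
        public FieldUpdateOp(UpdateType type, List<String> fields) {
            this.type = type;
            this.fields = fields;
        }
    }

    // Stand-in for the proposed IndexWriter.updateFields(Constraint, FieldUpdateOp) entry point.
    public static String describe(Constraint constraint, FieldUpdateOp op) {
        return op.type + " on " + op.fields.size() + " field(s)";
    }
}
```

The point of the sketch is only the shape: one "which docs" argument, one "what to do" argument, with the operation type carried on the op rather than in separate methods.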

          Implementation
          The idea is to create StackedSegments, which, well, stack on top of current segments. The inspiration came from deletes, which can be viewed as a segment stacked on an existing segment, that marks which documents are deleted.

          Following that semantics, a segment could be comprised of these files:

          • Layer 1: _0.prx, _0.fnm, _0.fdt ...
          • Layer 2: _0_1.prx, _0_1.fdt (no updates to .fnm) – override/merge info from layer 1
          • Layer 3: _0_2.prx – override/merge info from layer 2
          • Layer 4: _0_1.del – deletes are always the last layer, regardless of their 'layer id' – _0_1.del overrides everything, even _0_100.prx.
            • And they can be stacked on themselves as today, e.g. _0_2.del etc.
              I believe that we'll need an UpdateCodec or something ... this is the part of the internal API that we still need to understand better. Help from folks like you, Robert, will be greatly appreciated!
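
The resolution order above (later layers override earlier ones, deletes always last regardless of generation) could be sketched like this, assuming the illustrative file naming from the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of stacked-layer resolution order: layers apply by generation,
// except .del files, which always apply last. Names are illustrative.
public class LayerOrderSketch {

  static List<String> resolutionOrder(List<String> files) {
    List<String> ordered = new ArrayList<>(files);
    ordered.sort(Comparator
        .comparing((String f) -> f.endsWith(".del"))   // deletes sort last
        .thenComparingInt(LayerOrderSketch::generation)); // then by layer id
    return ordered;
  }

  static int generation(String f) {
    // _0.prx -> gen 0; _0_2.prx -> gen 2; _0_1.del -> gen 1
    String base = f.substring(0, f.lastIndexOf('.'));
    int idx = base.indexOf('_', 1);
    return idx < 0 ? 0 : Integer.parseInt(base.substring(idx + 1));
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList("_0_1.del", "_0_2.prx", "_0.prx", "_0_1.prx");
    System.out.println(resolutionOrder(files));
    // [_0.prx, _0_1.prx, _0_2.prx, _0_1.del]
  }
}
```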

          Two options to encode the posting lists:

          • Option 1: field:value --> +1, -5, +8, +12, -17 ... (simple, but cannot be encoded efficiently)
          • Option 2: two separate posting lists per term:
            1. +field:value --> 1, 8, 12
            2. -field:value --> 5, 17

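Applying an update layer encoded as in the second option could look roughly like this at read time: merge the base posting list with the "+" list, then mask the "-" list. A sketch under that assumption, with illustrative names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Sketch: resolve one term's effective posting list from a base segment
// plus one stacked layer holding separate add ("+") and remove ("-") lists.
public class StackedPostingsSketch {

  static List<Integer> apply(List<Integer> base, List<Integer> adds, List<Integer> removes) {
    TreeSet<Integer> docs = new TreeSet<>(base); // keeps doc ids sorted/unique
    docs.addAll(adds);                           // +field:value layer
    docs.removeAll(removes);                     // -field:value layer
    return new ArrayList<>(docs);
  }

  public static void main(String[] args) {
    List<Integer> base = Arrays.asList(5, 17);       // original postings
    List<Integer> adds = Arrays.asList(1, 8, 12);    // +field:value --> 1, 8, 12
    List<Integer> removes = Arrays.asList(5, 17);    // -field:value --> 5, 17
    System.out.println(apply(base, adds, removes));  // [1, 8, 12]
  }
}
```

A real codec would stream and merge the sorted lists instead of materializing a set, but the override semantics are the same.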
          Ideally, the way incremental updates will be applied will follow how deletes are applied today:

          • An update always applies to all documents that are flushed
          • And to all documents currently in the RAM buffer
          • But never to documents that are indexed later
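
The three rules above mirror how buffered deletes record a doc-id cutoff in IndexWriter today, and could be sketched like this (the Update class and applies() helper are hypothetical):

```java
// Sketch of the "updates apply only to already-indexed documents" rule:
// each buffered update records how many docs existed when it was issued,
// analogous to the doc-id-upto tracked for buffered deletes today.
public class BufferedUpdatesSketch {

  static class Update {
    final String term;
    final int docIdUpto; // exclusive: docs indexed when the update was buffered

    Update(String term, int docIdUpto) {
      this.term = term;
      this.docIdUpto = docIdUpto;
    }
  }

  // Covers flushed docs and docs in the RAM buffer at issue time,
  // but never docs indexed afterwards.
  static boolean applies(Update u, int docId) {
    return docId < u.docIdUpto;
  }

  public static void main(String[] args) {
    Update u = new Update("ACL:SHAI", 100); // 100 docs existed when issued
    System.out.println(applies(u, 42));     // true: doc existed before the update
    System.out.println(applies(u, 100));    // false: indexed later
  }
}
```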

          Again, this is an internal detail that I'd appreciate if someone can give us a pointer to where that happens in the code today (now with concurrent flushing). I remember PackedDeletes existed at some point – has that changed?

          If it's a new Codec, then SegmentReader may not even need to change ...

          The REPLACE_FIELD OP is tricky ... perhaps it's like how deletes are materialized on disk – as a sparse bit vector that marks the documents that are no longer associated with it ...
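
The sparse-bit-vector idea could be a per-layer "field tombstone": documents marked in a layer ignore that field's data from all earlier layers, analogous to how a .del file masks whole documents. A minimal sketch, purely illustrative:

```java
import java.util.BitSet;

// Sketch of a per-layer field tombstone for REPLACE_FIELD: marked docs
// suppress the field's older data, just as .del suppresses whole docs.
public class FieldTombstoneSketch {

  public static void main(String[] args) {
    BitSet replaced = new BitSet(); // docs whose field was replaced in this layer
    replaced.set(3);
    replaced.set(7);

    // When reading the field for doc 3, skip all layers below this one.
    System.out.println(replaced.get(3)); // true: older field data is masked
    System.out.println(replaced.get(4)); // false: older field data still visible
  }
}
```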

          I also think that we should introduce this feature in steps:

          1. Support only fields that omit TFAP (i.e. DOCS_ONLY). This is very valuable for fields like ACL, TAGS, CATEGORIES etc.
            • Ideally, the app would just need to say "add/remove ACL:SHAI to/from document X", rather than passing the entire list of ACLs on every update operation.
            • This I believe is also the most common use case for incremental field updates
          2. Support stored fields, whether as part of (1) or as a follow-on – adding TAG:LUCENE to the postings but not to the stored fields is limiting ...
          3. Support terms with positions, but no norms. What I'm thinking about are terms that store stuff in the payload, but don't care about the positions themselves. An example are the category dimensions of the facet module, which stores category ordinals in the payload
            • Positions are tricky, and we'll need to do this carefully, I know. But I don't rule it out at this point.
          4. Then, support fields with norms. I get your concern Robert, and I agree it's a challenge, hence why I leave it to last. The scenario I have in mind is: a search engine that lets you comment on a result or tag it, and the comment/tag should be added to the document's 'catchall' field for later searches. I think it's a valuable scenario, and this is something I'd like to support. If we cannot find a way to deal with it and the norms, then I see two options:
            1. Document a limitation to updating a field with norms, at your own risk.
            2. Enforce REPLACE_FIELD OP on fields with norms.
          • Since norms are under DocValues now, maybe that's solvable, I don't know. At the moment I think that we have a lot to do before we worry about norms ...
          • I also think that we should start with the simpler ADD_FIELD operation, and not support REMOVE_FIELD ... really to keep things simple at start.

          I suggest we do this work in a dedicated branch of course. Ideally, we can port everything to 4.x at some point, as I think most of the changes are internal details ...

          Hide
          Robert Muir added a comment -

          A few things I don't understand:

          • when updating a document, how do you know which terms to apply negative postings to?
          • how can the idea of updating "individual terms" work as far as length normalization information? How will that be reconciled?

          In truth I think a term is too fine-grained a level for updates in Lucene because of norms;
          only updating whole fields of a document will really work (as then the norm can simply be recomputed and replaced).


            People

            • Assignee: Unassigned
            • Reporter: Sivan Yogev
            • Votes: 13
            • Watchers: 34


                Time Tracking

                • Original Estimate: 2,520h
                • Remaining Estimate: 2,520h
                • Time Spent: Not Specified
