Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      I've been reviewing the ideas for updatable fields and have an alternative
      proposal that I think would address my biggest concern:

      • not slowing down searching

      When I look at what Solr and Elasticsearch do here (basically reindexing from stored fields), I think they solve a lot of the problem: users don't have to "rebuild" their document from scratch just to update one tiny piece.

      But I think we can do this more efficiently: by avoiding reindexing of the unaffected fields.

      The basic idea is that we would require term vectors for this approach (as they already store a serialized, indexed version of the doc), so we could just take the other pieces from the existing vectors for the doc.
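
      As a rough illustration, here is a minimal sketch (in the spirit of Lucene 4.x APIs) of walking a doc's existing term vectors; "reader" and "docID" are assumed to come from the surrounding update logic:

      import java.io.IOException;
      import org.apache.lucene.index.*;
      import org.apache.lucene.util.BytesRef;

      // Sketch: re-consume a doc's already-inverted fields from its term
      // vectors instead of re-analyzing the original text. Fields being
      // replaced would be skipped; everything else is copied as-is.
      void copyUnaffectedFields(IndexReader reader, int docID) throws IOException {
        Fields vectors = reader.getTermVectors(docID); // null if no vectors stored
        for (String field : vectors) {
          TermsEnum termsEnum = vectors.terms(field).iterator(null);
          BytesRef term;
          while ((term = termsEnum.next()) != null) {
            // Positions/offsets come along if they were stored in the vectors;
            // these would be fed straight into the in-memory postings.
            DocsAndPositionsEnum postings = termsEnum.docsAndPositions(null, null);
          }
        }
      }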

      I don't think we should discard the idea just because vectors are slow/big today; that seems like something we could fix.

      Personally, I like the idea of solving the problem without slowing down search performance. I think we should really start from that angle and work towards making the indexing side more efficient, not vice versa.


          Activity

          Michael McCandless added a comment -

          This is an interesting idea! And it makes sense to factor this down from ElasticSearch/Solr.

          So we have the codec approach (LUCENE-3837), the stacked-segments approach (LUCENE-4258), and this new approach (copy over already-inverted fields).

          We could quite efficiently add the already-inverted doc (term vectors) to the in-memory postings. Then there'd be zero impact on search performance, and no (well, small) index format changes.

          The only downside is the use case of replacing tiny fields on otherwise massive docs: in this case the other approaches would be faster at indexing (but still slower at searching). I agree not slowing down search is a big plus for this approach.

          We'd also need to open up the TV APIs so we can get TVs for a doc in the current segment, for the case where app adds a doc and later (before flush), replaces some fields. And we need to pool readers in IW so the updates can on-demand resolve the Term to docIDs. Hmm and we'd need to be able to do so for the in-memory segment (I think we should not support replaceFields by Query for starters).
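
          For the Term -> docID resolution piece, a minimal sketch against a single (pooled) segment reader could look like this; Lucene 4.x-era APIs are assumed, and exact TermsEnum signatures varied across releases:

          import java.io.IOException;
          import org.apache.lucene.index.*;
          import org.apache.lucene.search.DocIdSetIterator;

          // Sketch: resolve the update Term to the first live docID in one segment.
          int resolveDocID(AtomicReader segmentReader, Term updateTerm) throws IOException {
            Terms terms = segmentReader.terms(updateTerm.field());
            if (terms == null) return -1;
            TermsEnum termsEnum = terms.iterator(null);
            if (!termsEnum.seekExact(updateTerm.bytes(), false)) return -1;
            DocsEnum docs = termsEnum.docs(segmentReader.getLiveDocs(), null);
            int docID = docs.nextDoc();
            return docID == DocIdSetIterator.NO_MORE_DOCS ? -1 : docID;
          }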

          Robert Muir added a comment -

          Well I think there are a few other advantages:

          • complexity: not having to stack segments keeps the number of "dimensions" the same. The general structure of the index would be unchanged as well.

          • to IndexSearcher/Similarity/etc., everything would appear just as if someone had deleted and re-added the document completely, like today: this means we don't have to change our search APIs to add maxDoc(field) or anything else; scoring works just fine.

          • it seems possible we could support tryXXX incremental updates by docID, just like LUCENE-4203, though that's just an optimization.

          As far as tiny fields on otherwise massive docs go, I think we can break this down into 3 layers:

          1. document 'build' <-- retrieving from your SQL database / sending over the wire / etc
          2. field 'analyze' <-- actually doing the text analysis etc on the doc
          3. field 'indexing' <-- consuming the already-analyzed pieces through the indexer chain/codec flush/etc

          Today people 'pay' for 1, 2, and 3. If they use the Solr/ES approach they only pay 2 and 3, I think?
          With this approach it's just 3. I think for the vast majority of apps it will be fast enough, as I
          am totally convinced 1 and 2 are the biggest burden on people. I think these are totally possible
          to fix without hurting search performance. I can't imagine many real-world apps where it's 3, not
          1 and 2, that is their bottleneck AND they are willing to trade off significant search performance for that.
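
          To make the layers concrete, here's a hypothetical usage sketch; updateFields and loadFromDatabase don't exist, they're only stand-ins for what the proposal would let apps skip:

          import org.apache.lucene.document.*;
          import org.apache.lucene.index.*;

          // Today: pay layers 1 + 2 + 3 for every field of the doc.
          Document doc = loadFromDatabase(id);            // layer 1 (hypothetical helper)
          doc.removeField("views");
          doc.add(new IntField("views", newViews, Field.Store.YES));
          writer.updateDocument(new Term("id", id), doc); // layers 2 + 3, all fields

          // Proposed: pay layer 3 for the changed field only; the unaffected
          // fields are copied from the existing term vectors.
          // (updateFields is hypothetical, not an existing IndexWriter method.)
          writer.updateFields(new Term("id", id),
              new IntField("views", newViews, Field.Store.YES));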

          Robert Muir added a comment -

          We'd also need to open up the TV APIs so we can get TVs for a doc in the current segment, for the case where app adds a doc and later (before flush), replaces some fields.

          Realistically, I'd like to support that anyway for the norms case, so that codecs can index term impacts (LUCENE-4198),
          as this is going to involve length normalization in addition to TF. But currently the postings writer has no way
          to "see" this.

          So it would be nice if we could solve that too; then we wouldn't need norms/dvs in the vectors (they are already per-doc).
          This would make for a faster way of updating docvalues fields: for that specific case I think more can be done
          but it would be an improvement and fit well.

          Shai Erera added a comment -

          That's an interesting idea, Robert. I agree that (1) is sometimes more expensive than re-indexing, and I'll admit that in the cases I've seen, fetching docs from the DB was a huge bottleneck, because the DB was used for many other application transactions, while search was not the majority of them. Also, (2) is not so cheap either. So I agree your approach would leave users paying only (3).

          There is a downside to this approach, in that it requires the app to store everything in the index too (in addition to the DB). Even if it's just term vectors, that's still extra storage. I know that for large applications, the index stores only the minimal set of fields required to build the search results. For really large apps, the content isn't even there; the search snippets are computed on a different cluster.
          Just want to point that out. It may not be a big deal for small applications ... but then reindexing documents when you have a small application isn't a big deal either ...

          I also think that your approach may not work well for apps with a relatively high frequency of tiny updates? I mean, today they need to re-index the entire document, doing steps 1-3, and with your approach they'll need to do just #3. But in the approach on LUCENE-4258, the cost of indexing an update is proportional to the size of the update? We still don't know the impact on the search side, but we know for sure that if updates are frequently merged down to the segment (à la expunge deletes), there is no effect on search?

          Perhaps what we should do on LUCENE-4258 is run a benchmark on an index w/ low, mid and high number of updates and measure the impact on search.

          Robert Muir added a comment -

          Perhaps what we should do on LUCENE-4258 is run a benchmark on an index w/ low, mid and high number of updates and measure the impact on search.

          Yes. Especially the impact on mean average precision.

          Shai Erera added a comment -

          Especially the impact on mean average precision.

          I'll focus on performance first because I think that we should give a good solution for DOCS_ONLY type of fields.

          Also, constructing a test which can reliably check the effect on MAP is not trivial. Maybe if, e.g., I replace the entire content field, or some part of it.

          But to measure MAP I'd need to use the TREC (GOV, GOV2) collection, for which I have judgements. But then I believe I'm the only one who can run the test? Unless anyone else has access to that collection? Do you know of any other open collection with judgements that I can use?

          Not saying that it's not important to measure, but to me that comes second in the list, at least for the first step of field updates.

          Robert Muir added a comment -

          I'll focus on performance first because I think that we should give a good solution for DOCS_ONLY type of fields.

          I don't know about this.

          To me it's not a case of "progress not perfection". I don't see the design for LUCENE-4258 scaling beyond DOCS_ONLY + OMIT_NORMS fields.

          Shai Erera added a comment -

          That remains to be seen. Storing entire documents (term vectors or not) is not going to scale either, I think. Merging will just merge this data over and over .. unless you put it in another index or something. Sivan and I tried that (before 4258) in a project, and it didn't perform so well: fetching the content from a stored field for every tiny update (yes, we did #2 and #3, not just #3) simply didn't perform.

          I think we're coming from different worlds. We may need to develop two different solutions for field updates, each better suited to certain scenarios.

          Or, hopefully, the approach on 4258 will prove performant enough that we stick with just one approach.

          Tim Smith added a comment -

          +1 on term vector approach

          I would like to see the following added to IndexableField:
          /** Expert. index inverted terms for field */
          public Terms invertedTerms();

          This would allow partial updates via term vectors without having to flatten back into a TokenStream first.

          This would also facilitate things like the following:

          • index document into memory index
          • run "alert" queries/per-doc analysis against memory index
          • get "terms" from memory index for all fields and index into on disk index using IndexableField.invertedTerms()
          • double tokenization/analysis/inversion is now avoided
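
          A rough sketch of that flow, assuming Lucene's MemoryIndex (4.x APIs) plus the proposed, still-hypothetical invertedTerms() hook:

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.index.AtomicReader;
          import org.apache.lucene.index.Terms;
          import org.apache.lucene.index.memory.MemoryIndex;
          import org.apache.lucene.search.Query;

          // 1. Index the document into a MemoryIndex (true = store offsets).
          MemoryIndex mi = new MemoryIndex(true);
          mi.addField("body", text, analyzer);

          // 2. Run the "alert" queries / per-doc analysis against it.
          float score = mi.search(alertQuery);

          // 3. Pull the already-inverted terms back out of the in-memory reader.
          AtomicReader r = (AtomicReader) mi.createSearcher().getIndexReader();
          Terms bodyTerms = r.terms("body");

          // 4. Hand bodyTerms to the on-disk indexer via the proposed
          //    IndexableField.invertedTerms() (hypothetical), avoiding a second
          //    tokenization/analysis/inversion pass.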
          Robert Muir added a comment -

          Edit: just to make it clear, we don't need to change the index format if we want to implement this: it's "just code".

          Norms for unaffected fields can be reused as-is. For the affected fields, when digesting the Terms, we could just process them as normal.


            People

            • Assignee: Unassigned
            • Reporter: Robert Muir
            • Votes: 5
            • Watchers: 10
