Lucene - Core
LUCENE-3112

Add IW.add/updateDocuments to support nested documents

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I think nested documents (LUCENE-2454) is a very compelling addition
      to Lucene. It's also a popular (many votes) issue.

      Beyond supporting nested document querying, which is already an
      incredible addition since it preserves the relational model on
      indexing normalized content (eg, DB tables, XML docs), LUCENE-2454
      should also enable speedups in grouping implementation when you group
      by a nested field.

      For the same reason, it can also enable a very fast post-group facet
      counting implementation (LUCENE-3097) when you want to
      count(distinct(nestedField)), instead of unique documents, as your
      "identifier". I expect many apps that use faceting need this ability
      (to count(distinct(nestedField)), not distinct(docID)).

      To support these use cases, I believe the only core change needed is
      the ability to atomically add or update multiple documents, which you
      cannot do today since in between add/updateDocument calls a flush (eg
      due to commit or getReader()) could occur.

      This new API (addDocuments(Iterable<Document>), updateDocuments(Term
      delTerm, Iterable<Document>)) would also further guarantee that the
      documents are assigned sequential docIDs in the order the iterator
      provided them, and that the docIDs all reside in one segment.

      Segment merging never splits segments apart, so this invariant would
      hold even as merges/optimizes take place.
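
      To make the proposed API concrete, below is a minimal sketch of
      indexing one "block" of nested documents atomically, written against
      the 3.2-era API; the field names, values, and index path are
      hypothetical:

        import java.io.File;
        import java.util.ArrayList;
        import java.util.List;

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;

        public class NestedBlockExample {
          public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/path/to/index"));
            IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_32,
                    new StandardAnalyzer(Version.LUCENE_32)));

            // One block: child documents first, parent document last (a
            // common convention for nested documents).
            List<Document> block = new ArrayList<Document>();
            for (String sku : new String[] {"sku-1", "sku-2"}) {
              Document child = new Document();
              child.add(new Field("sku", sku,
                  Field.Store.YES, Field.Index.NOT_ANALYZED));
              block.add(child);
            }
            Document parent = new Document();
            parent.add(new Field("productId", "p42",
                Field.Store.YES, Field.Index.NOT_ANALYZED));
            block.add(parent);

            // Atomically add the block: no flush can land in the middle,
            // so the docs get sequential docIDs in one segment.
            writer.addDocuments(block);

            // ... or atomically replace a previously added block:
            writer.updateDocuments(new Term("productId", "p42"), block);

            writer.close();
          }
        }

      Since List implements both Collection and Iterable, the same calls
      compile against either the branch_3x (Collection) or trunk (Iterable)
      variant of the API.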

      1. LUCENE-3112.patch
        12 kB
        Michael McCandless
      2. LUCENE-3112.patch
        42 kB
        Michael McCandless

          Activity

          Robert Muir added a comment -

          Bulk closing for 3.2

          Steve Rowe added a comment -

          > IndexSplitter.java has javadocs references

          Oops, I got the trunk javadocs problem mixed up with the branch_3x javadocs problem...

          The incorrect references are actually in IndexWriter.java itself, on updateDocuments() methods.

          Michael McCandless added a comment -

          Argh! Thanks Steven – I will fix those jdocs the 2nd time around.

          Steve Rowe added a comment -

          Mike, when you put the branch_3x port back, contrib/misc's IndexSplitter.java has javadocs references to IndexWriter#addDocuments(Iterable) instead of IndexWriter#addDocuments(Collection) - this triggers javadocs warnings, which fail the build.

          Michael McCandless added a comment -

          Fixed.

          The only hitch was that in 3.x the APIs take Collection<Document> (vs trunk's Iterable<Document>). If we backport DWPT we can change it back to Iterable<Document>...
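
          For reference, a sketch of the two shapes of the API being
          discussed in this comment:

            // trunk (4.0): iterator-based
            void addDocuments(Iterable<Document> docs)
                throws CorruptIndexException, IOException;
            void updateDocuments(Term delTerm, Iterable<Document> docs)
                throws CorruptIndexException, IOException;

            // branch_3x (3.2): collection-based, pending a DWPT backport
            void addDocuments(Collection<Document> docs)
                throws CorruptIndexException, IOException;
            void updateDocuments(Term delTerm, Collection<Document> docs)
                throws CorruptIndexException, IOException;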

          Michael McCandless added a comment -

          New patch, I think it's ready to commit but it could use some healthy reviewing...

          I fixed up TestNRTThreads to add/update doc blocks and verify the docs in each block remain adjacent, and also added a couple of other test cases to make sure we test non-aborting exceptions when adding a doc block.

          And I put a warning in the jdocs about possible future full re-indexing.
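
          As a sketch of the kind of adjacency check such a test can make
          (3.x APIs; the field name "blockId" and the writer setup are
          hypothetical):

            // After indexing a block whose docs all carry blockId=b1:
            IndexReader r = IndexReader.open(writer, true); // NRT reader
            TermDocs td = r.termDocs(new Term("blockId", "b1"));
            int prev = -1;
            while (td.next()) {
              // Docs of one block must have consecutive docIDs:
              if (prev != -1 && td.doc() != prev + 1) {
                throw new AssertionError("block docs not adjacent");
              }
              prev = td.doc();
            }
            td.close();
            r.close();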

          Robert Muir added a comment -

          > I suppose we could consider changing the index format today to
          > record which docs are subs... but I think we don't need to. Maybe
          > I should strengthen the @experimental to explain the risk that a
          > future reindexing could be required?

          I think this would be perfect. I certainly don't want to hold up
          this improvement; I just didn't want us to be in a situation in the
          future where we say "well, if only we had recorded this
          information; now it's not possible to do XYZ, because someone COULD
          have used add/updateDocuments() for some arbitrary reason and we
          would 'split' their grouped ids".

          We could also include in the note that various existing
          IndexSorters/Splitters are unaware of this, so use with caution.

          Michael McCandless added a comment -

          > We should really think through the consequences of this though.
          >
          > If core features of Lucene become implemented in a way that they rely upon these sequential docids, we then lock ourselves out of future optimizations such as reordering docids for optimal index compression.

          I agree it's somewhat dangerous we are making an (experimental)
          guarantee that these docIDs will remain adjacent "forever". We
          normally are very protective about letting apps rely on docID
          assignment/order.

          But, I think this will not be "core" functionality that relies on
          sub-docs (adjacent docs), but rather modules – grouping, faceting,
          nested queries. And, even if you use these modules, it's
          optional whether the app did sub-docs. Ie, we would still have the
          "generic" grouping collector, but then also an optimized one that
          takes advantage of sub-docs.

          Finally, I think doing this today would not preclude doing docID
          reordering in the future because the sub docs would be recomputable
          based on the "identifier" field which grouped them in the first
          place.

          Ie, the worst-case future scenario (an app uses this new sub-docs
          feature, but then has a big index they don't want to reindex and
          wants to take advantage of a future docID reordering compression we
          add) would still be solvable because we could use this identifier
          field to find blocks of sub-docs.
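
          As a sketch (3.x APIs; "reader" is an assumed open IndexReader and
          the field name "groupId" is hypothetical), blocks could be
          recomputed by walking the identifier field's terms:

            TermEnum te = reader.terms(new Term("groupId", ""));
            try {
              do {
                Term t = te.term();
                if (t == null || !"groupId".equals(t.field())) break;
                TermDocs td = reader.termDocs(t);
                while (td.next()) {
                  // td.doc() belongs to the block identified by t.text()
                }
                td.close();
              } while (te.next());
            } finally {
              te.close();
            }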

          I suppose we could consider changing the index format today to record
          which docs are subs... but I think we don't need to. Maybe I should
          strengthen the @experimental to explain the risk that a future
          reindexing could be required?

          Michael McCandless added a comment -

          > Yet, I think you should push the document iteration etc into DWPT to actually apply the delterm only once to make it really atomic.

          Ahh good point – it's wrong just passing that delTerm down N times, too. I'll fix.

          > I also wonder if we should allow multiple delTerms, e.g. Tuple<DelTerm, Document>; otherwise you would be bound to one delTerm per "collection", but what if you want to remove only one of the "sub-documents"?

          So, this won't work today w/ nested querying, if I understand it right. Ie, if you only update one of the subs, now your subdocs are no longer sequential (nor in one segment). So I think "design for today" here...?

          Someday, when we implement incremental field updates correctly, so that updates are written as stacked segments against the original segment containing the document, at that point I think we can add an API that lets you update multiple docs atomically?

          Jason Rutherglen added a comment -

          I think that, perhaps like a Hadoop input-format split, we can define metadata at the segment level as to where the documents live, so that if one is 'splitting' the index (as is being implemented with HBase), the 'splitter' can be 'smart'.

          Robert Muir added a comment -

          We should really think through the consequences of this though.

          If core features of Lucene become implemented in a way that they rely upon these sequential docids, we then lock ourselves out of future optimizations such as reordering docids for optimal index compression.

          Simon Willnauer added a comment -

          > Initial patch.

          Nice, simple idea! I like the refactorings into pre/postUpdate - looks much cleaner. Yet, I think you should push the document iteration etc into DWPT to actually apply the delTerm only once to make it really atomic. I also wonder if we should allow multiple delTerms, e.g. Tuple<DelTerm, Document>; otherwise you would be bound to one delTerm per "collection", but what if you want to remove only one of the "sub-documents"? So if we had those tuples, you really want to push the iteration into DWPT to make a final finishDocument(Term[] terms) call pushing the terms into a single DeleteItem.
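
          A purely hypothetical shape for the per-document delete-term idea
          floated here (Tuple is not an existing Lucene class, and this
          variant was not committed):

            // Each sub-document carries its own delete term, so a single
            // sub-document of a block could be replaced atomically:
            void updateDocuments(Iterable<Tuple<Term, Document>> block)
                throws CorruptIndexException, IOException;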

          Michael McCandless added a comment -

          Initial patch.

          It's not done yet (needs tests, and the nocommit needs to be addressed).


            People

            • Assignee: Michael McCandless
            • Reporter: Michael McCandless
            • Votes: 0
            • Watchers: 0