Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      While working on SOLR-5944, I needed a way to know whether applying an update to a DV is possible (i.e. the DV exists or not), while deciding upon whether or not to apply the update as an in-place update or a regular full document update. This information is present at the IndexWriter in a FieldInfos instance, and can be exposed.

      1. LUCENE-7659.patch
        2 kB
        Ishan Chattopadhyaya
      2. LUCENE-7659.patch
        2 kB
        Ishan Chattopadhyaya
      3. LUCENE-7659.patch
        2 kB
        Ishan Chattopadhyaya
      4. LUCENE-7659.patch
        3 kB
        Ishan Chattopadhyaya
      5. LUCENE-7659.patch
        2 kB
        Ishan Chattopadhyaya

        Issue Links

          Activity

          Hide
          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          Here's the patch for the change. For context, please see the SOLR-5944 patch (which includes this change at the moment).

          Show
          ichattopadhyaya Ishan Chattopadhyaya added a comment - Here's the patch for the change. For context, please see the SOLR-5944 patch (which includes this change at the moment).
          Hide
          jpountz Adrien Grand added a comment -

          It should not be necessary to know the list of the current field names. If I understand the Solr issue correctly, your use-case is to check whether an update can be applied using dv-updates only, or whether it requires an regular update. Do I get it right? If I do then maybe a better way to address this use-case would be to either try the dv-only update and fallback to a regular update if it failed (which should be fine since dv updates are atomic), or change the semantics of dv updates to create fields if they did not exist already (then the only thing the Solr code should check is whether any of the fields is already used for index sorting)? Note that I am not very familiar with IndexWriter, hopefully others can comment about whether that makes any sense.

          Show
          jpountz Adrien Grand added a comment - It should not be necessary to know the list of the current field names. If I understand the Solr issue correctly, your use-case is to check whether an update can be applied using dv-updates only, or whether it requires an regular update. Do I get it right? If I do then maybe a better way to address this use-case would be to either try the dv-only update and fallback to a regular update if it failed (which should be fine since dv updates are atomic), or change the semantics of dv updates to create fields if they did not exist already (then the only thing the Solr code should check is whether any of the fields is already used for index sorting)? Note that I am not very familiar with IndexWriter , hopefully others can comment about whether that makes any sense.
          Hide
          ichattopadhyaya Ishan Chattopadhyaya added a comment - - edited

          Thanks Adrien Grand for looking into this.

          If I understand the Solr issue correctly, your use-case is to check whether an update can be applied using dv-updates only, or whether it requires an regular update. Do I get it right?

          Yes, exactly.

          maybe a better way to address this use-case would be to either try the dv-only update and fallback to a regular update if it failed

          There are few issues with that approach: 1. When a user's command comes in, it has operations like ("set": 3), or ("inc": 5). At the UpdateProcessor, we resolve it to a merged document (either partial document, or a regular full document) by pulling the last document from the index (or transaction log) to merge the command with that document. We then send the "resolved" document (partial or full) to the DirectUpdateHandler, which performs the IW update. However, by this time, if the IW were to throw an exception for a partial update from the IW.updateDocValues() method, we have already lost the information about the original operation ("set", "inc" etc.), but instead just have the merged values.
          2. The second problem is that if we wish to handle the exception for IW.updateDocValues() and decide to fallback on regular update, we could now potentially be merging against a different previous document than the one that was merged with in the failed attempt. 3. The performance cost of a regular update would increase due to merging twice against the previously indexed document.

          change the semantics of dv updates to create fields if they did not exist already

          I agree that this is the cleanest way forward. From the IndexWriter's API standpoint, I think it would certainly be cleanest if updateDocValues() method were to create non-existent DVs. Till the time we have such functionality in the updateDocValues() method, do you think we could expose the field names through a method marked as internal and/or experimental, with the intention of phasing it out after we have such functionality in IW's updateDocValues()? I think it would be suitable (interim) workaround for applications who find themselves in such a situation.

          Show
          ichattopadhyaya Ishan Chattopadhyaya added a comment - - edited Thanks Adrien Grand for looking into this. If I understand the Solr issue correctly, your use-case is to check whether an update can be applied using dv-updates only, or whether it requires an regular update. Do I get it right? Yes, exactly. maybe a better way to address this use-case would be to either try the dv-only update and fallback to a regular update if it failed There are few issues with that approach: 1. When a user's command comes in, it has operations like ("set": 3), or ("inc": 5). At the UpdateProcessor, we resolve it to a merged document (either partial document, or a regular full document) by pulling the last document from the index (or transaction log) to merge the command with that document. We then send the "resolved" document (partial or full) to the DirectUpdateHandler, which performs the IW update. However, by this time, if the IW were to throw an exception for a partial update from the IW.updateDocValues() method, we have already lost the information about the original operation ("set", "inc" etc.), but instead just have the merged values. 2. The second problem is that if we wish to handle the exception for IW.updateDocValues() and decide to fallback on regular update, we could now potentially be merging against a different previous document than the one that was merged with in the failed attempt. 3. The performance cost of a regular update would increase due to merging twice against the previously indexed document. change the semantics of dv updates to create fields if they did not exist already I agree that this is the cleanest way forward. From the IndexWriter's API standpoint, I think it would certainly be cleanest if updateDocValues() method were to create non-existent DVs. Till the time we have such functionality in the updateDocValues() method, do you think we could expose the field names through a method marked as internal and/or experimental, with the intention of phasing it out after we have such functionality in IW's updateDocValues()? I think it would be suitable (interim) workaround for applications who find themselves in such a situation.
          Hide
          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          Adding @lucene.internal and @lucene.experimental annotations to the method.

          Show
          ichattopadhyaya Ishan Chattopadhyaya added a comment - Adding @lucene.internal and @lucene.experimental annotations to the method.
          Hide
          mikemccand Michael McCandless added a comment -

          I'm confused here: doesn't Solr know, from its schema, whether a field was indexed as doc values or not?

          Show
          mikemccand Michael McCandless added a comment - I'm confused here: doesn't Solr know, from its schema, whether a field was indexed as doc values or not?
          Hide
          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          I'm confused here: doesn't Solr know, from its schema, whether a field was indexed as doc values or not?

          Fields that have DVs enabled and have not been indexed before cannot be used for DV updates. Dynamic fields are examples. We know that *_l_dvo are docValues fields. But if someone tries to update a field for that pattern, say price_l_dvo, it wouldn't exist as a DV field in the index.

          Show
          ichattopadhyaya Ishan Chattopadhyaya added a comment - I'm confused here: doesn't Solr know, from its schema, whether a field was indexed as doc values or not? Fields that have DVs enabled and have not been indexed before cannot be used for DV updates. Dynamic fields are examples. We know that *_l_dvo are docValues fields. But if someone tries to update a field for that pattern, say price_l_dvo, it wouldn't exist as a DV field in the index.
          Hide
          mikemccand Michael McCandless added a comment -

          OK I see, tricky. I think it's OK to add this (experimental) method to IW, and I agree it would be cleaner if IW could just bring a new DV field into existence on update.

          Such a thing used to be terrifying, because you were in fact bringing an entire column into existence, but in 7.0 we've fixed sparse doc values to be written sparsely.

          The patch wraps in Collections.unmodifiableSet twice now ... maybe remove the one in IW and add a comment saying FieldInfos already did so?

          Show
          mikemccand Michael McCandless added a comment - OK I see, tricky. I think it's OK to add this (experimental) method to IW, and I agree it would be cleaner if IW could just bring a new DV field into existence on update. Such a thing used to be terrifying, because you were in fact bringing an entire column into existence, but in 7.0 we've fixed sparse doc values to be written sparsely. The patch wraps in Collections.unmodifiableSet twice now ... maybe remove the one in IW and add a comment saying FieldInfos already did so?
          Hide
          jpountz Adrien Grand added a comment -

          I think this change is not thread-safe? It currently returns a view (Map.keySet()) of the field numbers map which may be written to at any time by IndexWriter, I think it should rather take a snapshot under the lock? Ie. something like this:

          +    synchronized Set<String> getFieldNames() {
          +      return Collections.unmodifiableSet(new HashSet<>(nameToNumber.keySet()));
          +    }
          
          Show
          jpountz Adrien Grand added a comment - I think this change is not thread-safe? It currently returns a view ( Map.keySet() ) of the field numbers map which may be written to at any time by IndexWriter , I think it should rather take a snapshot under the lock? Ie. something like this: + synchronized Set< String > getFieldNames() { + return Collections.unmodifiableSet( new HashSet<>(nameToNumber.keySet())); + }
          Hide
          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          Thanks Michael McCandless and Adrien Grand for your reviews. I've updated the patch here based on your reviews.

          Show
          ichattopadhyaya Ishan Chattopadhyaya added a comment - Thanks Michael McCandless and Adrien Grand for your reviews. I've updated the patch here based on your reviews.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 733060121dc6f5cbc1b0e0e1412e396a3241240b in lucene-solr's branch refs/heads/master from Ishan Chattopadhyaya
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7330601 ]

          LUCENE-7659: Added IndexWriter#getFieldNames() to return all visible field names

          Show
          jira-bot ASF subversion and git services added a comment - Commit 733060121dc6f5cbc1b0e0e1412e396a3241240b in lucene-solr's branch refs/heads/master from Ishan Chattopadhyaya [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7330601 ] LUCENE-7659 : Added IndexWriter#getFieldNames() to return all visible field names
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit aa467e39f04a5592e97c11c15fc936be60ad2f10 in lucene-solr's branch refs/heads/branch_6x from Ishan Chattopadhyaya
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=aa467e3 ]

          LUCENE-7659: Added IndexWriter#getFieldNames() to return all visible field names

          Show
          jira-bot ASF subversion and git services added a comment - Commit aa467e39f04a5592e97c11c15fc936be60ad2f10 in lucene-solr's branch refs/heads/branch_6x from Ishan Chattopadhyaya [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=aa467e3 ] LUCENE-7659 : Added IndexWriter#getFieldNames() to return all visible field names
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 733060121dc6f5cbc1b0e0e1412e396a3241240b in lucene-solr's branch refs/heads/apiv2 from Ishan Chattopadhyaya
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7330601 ]

          LUCENE-7659: Added IndexWriter#getFieldNames() to return all visible field names

          Show
          jira-bot ASF subversion and git services added a comment - Commit 733060121dc6f5cbc1b0e0e1412e396a3241240b in lucene-solr's branch refs/heads/apiv2 from Ishan Chattopadhyaya [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7330601 ] LUCENE-7659 : Added IndexWriter#getFieldNames() to return all visible field names

            People

            • Assignee:
              ichattopadhyaya Ishan Chattopadhyaya
              Reporter:
              ichattopadhyaya Ishan Chattopadhyaya
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development