Lucene - Core
  1. Lucene - Core
  2. LUCENE-5833

Suggestor Version 2 doesn't support multiValued fields

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.8.1
    • Fix Version/s: 5.0, 6.0
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      So if you use a multiValued field in the new suggestor it will not pick up terms for any term after the first one. So it treats the first term as the only term it will make it's dictionary from.

      This is the suggestor I'm talking about:
      https://issues.apache.org/jira/browse/SOLR-5378

      1. LUCENE-5833_branch4_10.patch
        10 kB
        Jan Høydahl
      2. LUCENE-5833.patch
        8 kB
        Varun Thacker
      3. LUCENE-5833.patch
        6 kB
        Varun Thacker
      4. LUCENE-5833.patch
        9 kB
        Varun Thacker
      5. SOLR-6210.patch
        5 kB
        Varun Thacker

        Issue Links

          Activity

          Hide
          Varun Thacker added a comment -

          This should be a lucene issue.

          I have added support to DocumentDictionary to support multiValued fields. It's a hack I guess but I could not think of a better approach. Opinions?

          Also DocumentDictionaryTest passes and I added a very basic test for multi valued field docs. I will improve it in the next iteration just wanted to get a quick patch out.

          Show
          Varun Thacker added a comment - This should be a lucene issue. I have added support to DocumentDictionary to support multiValued fields. It's a hack I guess but I could not think of a better approach. Opinions? Also DocumentDictionaryTest passes and I added a very basic test for multi valued field docs. I will improve it in the next iteration just wanted to get a quick patch out.
          Hide
          Varun Thacker added a comment -
          • Updated it with trunk
          • Added a better test for multiValued fields.

          Does anyone have a better suggestion on how to add multi valued field than the logic the patch uses?

          Show
          Varun Thacker added a comment - Updated it with trunk Added a better test for multiValued fields. Does anyone have a better suggestion on how to add multi valued field than the logic the patch uses?
          Hide
          Varun Thacker added a comment -

          Hi Michael McCandless ,

          Can you also have a look at this patch please and let me know if this is the correct approach or do you have a better way to tackle the problem.

          Show
          Varun Thacker added a comment - Hi Michael McCandless , Can you also have a look at this patch please and let me know if this is the correct approach or do you have a better way to tackle the problem.
          Hide
          Michael McCandless added a comment -

          Net/net I agree it makes sense for DocumentDictionary to be able to handle multi-valued fields.

          Isn't that first if (term.size() == 0) always be true? If terms is non-empty we return the first one at the start of the loop?

          Also it seems like you can remove the if in the end? The loop will just cycle around and return the first term if there is one? Maybe we can somehow simplify this "loop within a loop"...

          I don't like that there's O(N^2) here (terms.remove(0) from an ArrayList) ... can we fix that? Yeah, N should be small in general, but you never know ... I don't like having secret adversaries to our code ... maybe you can just hang onto the StorableField[] from the current doc and step through it; then you don't need to alloc this separate ArrayList...

          Can you rename terms to something like currentDocTerms? (Separately it's strange we call these "terms"; they are really intended to be "suggestions" right?).

          Styling: there should be a space after "if" and before ( in "if(...)"

          Show
          Michael McCandless added a comment - Net/net I agree it makes sense for DocumentDictionary to be able to handle multi-valued fields. Isn't that first if (term.size() == 0) always be true? If terms is non-empty we return the first one at the start of the loop? Also it seems like you can remove the if in the end? The loop will just cycle around and return the first term if there is one? Maybe we can somehow simplify this "loop within a loop"... I don't like that there's O(N^2) here (terms.remove(0) from an ArrayList) ... can we fix that? Yeah, N should be small in general, but you never know ... I don't like having secret adversaries to our code ... maybe you can just hang onto the StorableField[] from the current doc and step through it; then you don't need to alloc this separate ArrayList... Can you rename terms to something like currentDocTerms? (Separately it's strange we call these "terms"; they are really intended to be "suggestions" right?). Styling: there should be a space after "if" and before ( in "if(...)"
          Hide
          Varun Thacker added a comment -

          HI Michael McCandless,

          Thanks for reviewing. Here is a new patch which hangs on to the StorableField . This patch also folds in your other suggestions.

          Show
          Varun Thacker added a comment - HI Michael McCandless , Thanks for reviewing. Here is a new patch which hangs on to the StorableField . This patch also folds in your other suggestions.
          Hide
          Varun Thacker added a comment -

          Hi Michael McCandless,

          Have you had a chance to review the latest patch? Let me know if there are any other changes that you recommend.

          Show
          Varun Thacker added a comment - Hi Michael McCandless , Have you had a chance to review the latest patch? Let me know if there are any other changes that you recommend.
          Hide
          Michael McCandless added a comment -

          Thanks Varun.

          Maybe instead of currentFieldsPosition name it nextFieldsPosition?
          I.e, it's the index for the next value the iterator should return?

          Maybe simplify this:

            if (fieldVal.binaryValue() == null && fieldVal.stringValue() == null) {
              continue;
            }
            return (fieldVal.stringValue() != null) ? new BytesRef(fieldVal.stringValue()) : fieldVal.binaryValue();
          

          to something like this:

            if (fieldVal.binaryValue() != null) {
              return fieldVal.binaryValue();
            } else if (fieldValue.stringValue() != null) {
              return new BytesRef(fieldValue.stringValue());
            } else {
              continue;
            }
          

          (And also where this pattern is used again towards the end of the
          while loop). And I think also fix the similarly confusing
          pre-existing payload null checking?

          Can we move the BytesRef tempTerm type declaration down to where
          we actually assign it?

          Can you pull out the atLeast(N) from the for loops,
          e.g. for(int j=0; j<atLeast(2); j++)? That is eval'd on each loop
          iteration, so it's ... strange to call it over and over.

          Show
          Michael McCandless added a comment - Thanks Varun. Maybe instead of currentFieldsPosition name it nextFieldsPosition? I.e, it's the index for the next value the iterator should return? Maybe simplify this: if (fieldVal.binaryValue() == null && fieldVal.stringValue() == null) { continue; } return (fieldVal.stringValue() != null) ? new BytesRef(fieldVal.stringValue()) : fieldVal.binaryValue(); to something like this: if (fieldVal.binaryValue() != null) { return fieldVal.binaryValue(); } else if (fieldValue.stringValue() != null) { return new BytesRef(fieldValue.stringValue()); } else { continue; } (And also where this pattern is used again towards the end of the while loop). And I think also fix the similarly confusing pre-existing payload null checking? Can we move the BytesRef tempTerm type declaration down to where we actually assign it? Can you pull out the atLeast(N) from the for loops, e.g. for(int j=0; j<atLeast(2); j++) ? That is eval'd on each loop iteration, so it's ... strange to call it over and over.
          Hide
          Varun Thacker added a comment -

          New patch with the changes that you recommended -

          1. Renamed currentFieldsPosition name it nextFieldsPosition
          2. Moved to the following syntax -

           if (fieldVal.binaryValue() != null) {
              return fieldVal.binaryValue();
            } else if (fieldValue.stringValue() != null) {
              return new BytesRef(fieldValue.stringValue());
            } else {
              continue;
            }

          Although I like the payload condition check the way it is currently because of the added null check.

          Can you pull out the atLeast(N) from the for loops,

          Absolutely. I don't know how I overlooked that

          Show
          Varun Thacker added a comment - New patch with the changes that you recommended - 1. Renamed currentFieldsPosition name it nextFieldsPosition 2. Moved to the following syntax - if (fieldVal.binaryValue() != null ) { return fieldVal.binaryValue(); } else if (fieldValue.stringValue() != null ) { return new BytesRef(fieldValue.stringValue()); } else { continue ; } Although I like the payload condition check the way it is currently because of the added null check. Can you pull out the atLeast(N) from the for loops, Absolutely. I don't know how I overlooked that
          Hide
          Varun Thacker added a comment -

          Hi Michael McCandless ,

          Can you please review the patch and let me know if anything else needs to be done

          Show
          Varun Thacker added a comment - Hi Michael McCandless , Can you please review the patch and let me know if anything else needs to be done
          Hide
          Michael McCandless added a comment -

          Oh thanks for the reminder Varun Thacker, I'll have a look.

          Show
          Michael McCandless added a comment - Oh thanks for the reminder Varun Thacker , I'll have a look.
          Hide
          ASF subversion and git services added a comment -

          Commit 1640886 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1640886 ]

          LUCENE-5833: allow suggesters to build off of each value from multi-valued fields

          Show
          ASF subversion and git services added a comment - Commit 1640886 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1640886 ] LUCENE-5833 : allow suggesters to build off of each value from multi-valued fields
          Hide
          ASF subversion and git services added a comment -

          Commit 1640887 from Michael McCandless in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1640887 ]

          LUCENE-5833: allow suggesters to build off of each value from multi-valued fields

          Show
          ASF subversion and git services added a comment - Commit 1640887 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1640887 ] LUCENE-5833 : allow suggesters to build off of each value from multi-valued fields
          Hide
          Michael McCandless added a comment -

          Thanks Varun!

          Show
          Michael McCandless added a comment - Thanks Varun!
          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          Anshum Gupta added a comment - Bulk close after 5.0 release.
          Hide
          Jan Høydahl added a comment -

          Reopening to add patch for 4_10 branch.
          It was a simple merge from 5.x and adding VERSION param to one of the test methods. Tests and precommit succeeds, hope to include in 4.10.5 or 4.10.4 RCx.

          Show
          Jan Høydahl added a comment - Reopening to add patch for 4_10 branch. It was a simple merge from 5.x and adding VERSION param to one of the test methods. Tests and precommit succeeds, hope to include in 4.10.5 or 4.10.4 RCx.
          Hide
          Erick Erickson added a comment -

          Jan:

          As it happens, I'm doing some other suggester stuff for 4.10.5 (didn't want to try to shoe-horn it in to 4.10.4 at the last second).

          I'll make it happen.

          Show
          Erick Erickson added a comment - Jan: As it happens, I'm doing some other suggester stuff for 4.10.5 (didn't want to try to shoe-horn it in to 4.10.4 at the last second). I'll make it happen.
          Hide
          Michael McCandless added a comment -

          I'm not sure this should go back to 4.10.x? It's more like a feature (adding multi-valued support to DocumentDictionary) than a bug? And I think there is some risk here ... it's a non-trivial change.

          Show
          Michael McCandless added a comment - I'm not sure this should go back to 4.10.x? It's more like a feature (adding multi-valued support to DocumentDictionary) than a bug? And I think there is some risk here ... it's a non-trivial change.
          Hide
          Erick Erickson added a comment -

          Michael McCandless Truth be told I haven't looked at it yet. Kind of waiting for the 4.10.4 tag to be set before getting too far into this.

          Of the 4 suggester-related things I marked for putting in 4.10.5, this is the one I could be most easily convinced to leave out. I'll look a bit more deeply, but I'll almost certainly defer to your expertise here.

          The other three JIRAs are either very low-risk (the bit about requiring weightField in solrconfig.xml configuration) or make using the Suggester stuff from Solr unworkable for a large corpus, so I'm willing to argue about those. This one kind of snuck in there by chance though.

          At least with Jan's work, we have a viable alternative for people on 4.x that MUST have this functionality, "apply the patch and go". And if that isn't possible, construct the field up-front so it's not multiValued.

          Hmmm, I'm pretty much convincing myself to not put this in 4.10.5. Although I'll go for trunk and 5x.

          Thanks! It's always great to have an informed viewpoint!

          Show
          Erick Erickson added a comment - Michael McCandless Truth be told I haven't looked at it yet. Kind of waiting for the 4.10.4 tag to be set before getting too far into this. Of the 4 suggester-related things I marked for putting in 4.10.5, this is the one I could be most easily convinced to leave out. I'll look a bit more deeply, but I'll almost certainly defer to your expertise here. The other three JIRAs are either very low-risk (the bit about requiring weightField in solrconfig.xml configuration) or make using the Suggester stuff from Solr unworkable for a large corpus, so I'm willing to argue about those. This one kind of snuck in there by chance though. At least with Jan's work, we have a viable alternative for people on 4.x that MUST have this functionality, "apply the patch and go". And if that isn't possible, construct the field up-front so it's not multiValued. Hmmm, I'm pretty much convincing myself to not put this in 4.10.5. Although I'll go for trunk and 5x. Thanks! It's always great to have an informed viewpoint!
          Hide
          Jan Høydahl added a comment - - edited

          As a Lucene issue, this is probably not a bug. However it started as a SOLR issue, where it appears more like a bug:

          The Solr ref guide does not document the single-value only constraint; in fact it encourages using copyField to populate the "field" to use for suggestions, which in fact requires multiValued=true. So we can either treat it as a (Solr) bug and backport this issue, or add a (Solr) patch which adds a warning or exception when attempting to use multiVal; or treat it as a (Solr) documentation bug and add a note in refguide Errata page for 4.10.x. I'm happy with any of these, my customer who got bit by this can run a patched version if need be.

          Perhaps there should have been a separate SOLR issue for this?

          Show
          Jan Høydahl added a comment - - edited As a Lucene issue, this is probably not a bug. However it started as a SOLR issue, where it appears more like a bug: The Solr ref guide does not document the single-value only constraint; in fact it encourages using copyField to populate the "field" to use for suggestions, which in fact requires multiValued=true. So we can either treat it as a (Solr) bug and backport this issue, or add a (Solr) patch which adds a warning or exception when attempting to use multiVal; or treat it as a (Solr) documentation bug and add a note in refguide Errata page for 4.10.x. I'm happy with any of these, my customer who got bit by this can run a patched version if need be. Perhaps there should have been a separate SOLR issue for this?
          Hide
          Steve Rowe added a comment -

          I'm not sure this should go back to 4.10.x? It's more like a feature (adding multi-valued support to DocumentDictionary) than a bug? And I think there is some risk here ... it's a non-trivial change.

          Sounds to me like a 4.11 release is in our future.

          Show
          Steve Rowe added a comment - I'm not sure this should go back to 4.10.x? It's more like a feature (adding multi-valued support to DocumentDictionary) than a bug? And I think there is some risk here ... it's a non-trivial change. Sounds to me like a 4.11 release is in our future.
          Hide
          Jan Høydahl added a comment - - edited

          I did some testing with multi valued field input, and random weight input between 0-100 for the docs in the index. What seems to happen then is that the suggester outputs one suggestion per unique weight:

          {
            "suggest":{"languages":{
                "engl":{
                  "numFound":101,
                  "suggestions":[{
                      "term":"<b>Engl</b>ish",
                      "weight":100,
                      "payload":"0"},
                    {
                      "term":"<b>Engl</b>ish",
                      "weight":99,
                      "payload":"0"},
                    {
                      "term":"<b>Engl</b>ish",
                      "weight":98,
                      "payload":"0"},
          ---etc all the way down to 0---
          

          Using Solr with BlendedInfixLookupFactory. Can anyone explain this behavior? I tested on solr5.0.0

          Show
          Jan Høydahl added a comment - - edited I did some testing with multi valued field input, and random weight input between 0-100 for the docs in the index. What seems to happen then is that the suggester outputs one suggestion per unique weight: { "suggest" :{ "languages" :{ "engl" :{ "numFound" :101, "suggestions" :[{ "term" : "<b>Engl</b>ish" , "weight" :100, "payload" : "0" }, { "term" : "<b>Engl</b>ish" , "weight" :99, "payload" : "0" }, { "term" : "<b>Engl</b>ish" , "weight" :98, "payload" : "0" }, ---etc all the way down to 0--- Using Solr with BlendedInfixLookupFactory . Can anyone explain this behavior? I tested on solr5.0.0
          Hide
          Jan Høydahl added a comment -

          Closing this again. Let's open a new issue for the duplication issue, if I have not completely misunderstood how this is supposed to work..

          Show
          Jan Høydahl added a comment - Closing this again. Let's open a new issue for the duplication issue, if I have not completely misunderstood how this is supposed to work..

            People

            • Assignee:
              Erick Erickson
              Reporter:
              Greg Harris
            • Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development