Lucene - Core
  1. Lucene - Core
  2. LUCENE-6486

DocumentDictionary entry iterator skips items with optional null payload field

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.10.3
    • Fix Version/s: 5.2, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available
    • Flags:
      Patch

      Description

      As denoted in the ticket SOLR-7086 the DocumentDictionary entry iterator shouldn't skip entries from the dictionary having null value for the payload field due to the fact that this field is optional.
      This behaviour causes inconsistencies in the Solr suggester which simply skips valid documents due to the fact that they don't have values for the payload field.
      As agreed with Michael McCandless I am attaching a patch to this Lucene issue.

      1. LUCENE-6486.patch
        23 kB
        Marius Grama
      2. LUCENE-6486.patch
        22 kB
        Marius Grama
      3. LUCENE-6486.patch
        20 kB
        Michael McCandless
      4. LUCENE-6486.patch
        21 kB
        Marius Grama

        Issue Links

          Activity

          Hide
          Marius Grama added a comment -

          Patch used to fix this inconsistency.

          Michael McCandless could you please have a look on this patch?
          NOTE : As in the class FileDictionary, when the dictionary is configured with hasPayloads field set to true, there will be an empty value instead of null for the field payload when the payload field in the document is null.

          Show
          Marius Grama added a comment - Patch used to fix this inconsistency. Michael McCandless could you please have a look on this patch? NOTE : As in the class FileDictionary, when the dictionary is configured with hasPayloads field set to true, there will be an empty value instead of null for the field payload when the payload field in the document is null.
          Hide
          Michael McCandless added a comment -

          Thanks Marius Grama!

          I started from your patched and tweaked javadocs / CHANGES wording, and made a small code style improvement (attached).

          Could you add a dedicated test case, e.g. something like "testPayloadsAreOptional", just to make it clear we intend this functionality?

          Also, I'm nervous about pretending we saw a 0-byte payload when the payload was in fact missing, and also the empty set when no contexts were specified: we lose information by doing this, e.g. we can no longer distinguish if the document did in fact have a 0-byte payload.

          Can we simply return null (in both cases) when there was no payload or no context for a given document?

          Show
          Michael McCandless added a comment - Thanks Marius Grama ! I started from your patched and tweaked javadocs / CHANGES wording, and made a small code style improvement (attached). Could you add a dedicated test case, e.g. something like "testPayloadsAreOptional", just to make it clear we intend this functionality? Also, I'm nervous about pretending we saw a 0-byte payload when the payload was in fact missing, and also the empty set when no contexts were specified: we lose information by doing this, e.g. we can no longer distinguish if the document did in fact have a 0-byte payload. Can we simply return null (in both cases) when there was no payload or no context for a given document?
          Hide
          Marius Grama added a comment -

          Attached patch with slightly modified version of DocumentDictionaryTest class to outline the handling of optional payload field.

          Show
          Marius Grama added a comment - Attached patch with slightly modified version of DocumentDictionaryTest class to outline the handling of optional payload field.
          Hide
          Marius Grama added a comment -

          Also, I'm nervous about pretending we saw a 0-byte payload when the payload was in fact missing, and also the empty set when no contexts were specified: we lose information by doing this, e.g. we can no longer distinguish if the document did in fact have a 0-byte payload.

          NOTE : As in the class FileDictionary, when the dictionary is configured with hasPayloads field set to true, there will be an empty value instead of null for the field payload when the payload field in the document is null.

          Michael McCandless I would gladly change the behaviour of returning an empty BytesRef instead of null, but I find this task rather difficult because there are several classes having their logic depending on InputIterator#hasPayloads() , InputIterator#hasContexts() methods (SortedInputIterator, BufferedInputIterator, AnalyzingSuggester to name a few).
          I guess that one solution would be to remove the previously mentioned InputIterator methods, but this would mean to rewrite all the classes where these methods are used and also to find other ways to encode null values for payloads.
          e.g. :
          SortedInputIterator(line 88-90)

                  if (hasPayloads) {
                    payload = decodePayload(bytes, input);
                  }
          

          SortedInputIterator (line 240-243)

              if (hasPayloads) {
                output.writeBytes(payload.bytes, payload.offset, payload.length);
                output.writeShort((short) payload.length);
              }
          

          Any ideas how to encode null values for the non-existing payload fields so that there can we differentiate when reading from persistence whether we're dealing with null and not with an empty BytesRef instance? I am thinking of writing -1 for the length of the payload, but I guess there are better solutions at hand than this one.

          Show
          Marius Grama added a comment - Also, I'm nervous about pretending we saw a 0-byte payload when the payload was in fact missing, and also the empty set when no contexts were specified: we lose information by doing this, e.g. we can no longer distinguish if the document did in fact have a 0-byte payload. NOTE : As in the class FileDictionary, when the dictionary is configured with hasPayloads field set to true, there will be an empty value instead of null for the field payload when the payload field in the document is null. Michael McCandless I would gladly change the behaviour of returning an empty BytesRef instead of null, but I find this task rather difficult because there are several classes having their logic depending on InputIterator#hasPayloads() , InputIterator#hasContexts() methods (SortedInputIterator, BufferedInputIterator, AnalyzingSuggester to name a few). I guess that one solution would be to remove the previously mentioned InputIterator methods, but this would mean to rewrite all the classes where these methods are used and also to find other ways to encode null values for payloads. e.g. : SortedInputIterator(line 88-90) if (hasPayloads) { payload = decodePayload(bytes, input); } SortedInputIterator (line 240-243) if (hasPayloads) { output.writeBytes(payload.bytes, payload.offset, payload.length); output.writeShort(( short ) payload.length); } Any ideas how to encode null values for the non-existing payload fields so that there can we differentiate when reading from persistence whether we're dealing with null and not with an empty BytesRef instance? I am thinking of writing -1 for the length of the payload, but I guess there are better solutions at hand than this one.
          Hide
          Michael McCandless added a comment -

          Marius Grama OK thank you for digging, it sounds risky to return "null" for payload and contexts since consumers don't expect that ... so let's just "cast" missing -> empty for both payloads and contexts, like in your last patch.

          On the new test case, it looks like testBasic just got renamed to testWithOptionalPayload, with a few added checks?? Can you instead put testBasic back and make a new very minimal testWithOptionalPayload? I was picturing a tiny test case that indexes 1 doc (and does not use generateIndexDocuments) that's missing its payload and then confirms it's an empty BytesRef when iterating ... I think separate single-purpose test cases are nicer than the all-encompassing testBasic seems to be...

          Show
          Michael McCandless added a comment - Marius Grama OK thank you for digging, it sounds risky to return "null" for payload and contexts since consumers don't expect that ... so let's just "cast" missing -> empty for both payloads and contexts, like in your last patch. On the new test case, it looks like testBasic just got renamed to testWithOptionalPayload, with a few added checks?? Can you instead put testBasic back and make a new very minimal testWithOptionalPayload? I was picturing a tiny test case that indexes 1 doc (and does not use generateIndexDocuments) that's missing its payload and then confirms it's an empty BytesRef when iterating ... I think separate single-purpose test cases are nicer than the all-encompassing testBasic seems to be...
          Hide
          Marius Grama added a comment -

          Michael McCandless I've attached a patch containing the unit test testWithOptionalPayload. Please let me know if there is something else to be changed.

          Show
          Marius Grama added a comment - Michael McCandless I've attached a patch containing the unit test testWithOptionalPayload. Please let me know if there is something else to be changed.
          Hide
          Michael McCandless added a comment -

          Thanks Marius Grama, new test looks great! I'll commit soon ... thank you.

          Show
          Michael McCandless added a comment - Thanks Marius Grama , new test looks great! I'll commit soon ... thank you.
          Hide
          ASF subversion and git services added a comment -

          Commit 1680641 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1680641 ]

          LUCENE-6486: make payloads optional in DocumentDictionary

          Show
          ASF subversion and git services added a comment - Commit 1680641 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1680641 ] LUCENE-6486 : make payloads optional in DocumentDictionary
          Hide
          ASF subversion and git services added a comment -

          Commit 1680645 from Michael McCandless in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1680645 ]

          LUCENE-6486: make payloads optional in DocumentDictionary

          Show
          ASF subversion and git services added a comment - Commit 1680645 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1680645 ] LUCENE-6486 : make payloads optional in DocumentDictionary
          Hide
          Michael McCandless added a comment -

          Thanks Marius Grama!

          Show
          Michael McCandless added a comment - Thanks Marius Grama !
          Hide
          Anshum Gupta added a comment -

          Bulk close for 5.2.0.

          Show
          Anshum Gupta added a comment - Bulk close for 5.2.0.

            People

            • Assignee:
              Unassigned
              Reporter:
              Marius Grama
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development