Lucene - Core
  1. Lucene - Core
  2. LUCENE-5329

Make DocumentDictionary and co more lenient to dirty documents

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently DocumentDictionary errors out whenever any document does not have value for any relevant stored fields. It would be nice to make it lenient and instead ignore the invalid documents.

      Another "issue" with the DocumentDictionary is that it only allows string fields as suggestions and binary fields as payloads. When exposing these dictionaries to solr (via https://issues.apache.org/jira/browse/SOLR-5378), it is inconvenient for the user to ensure that a suggestion field is a string field and a payload field is a binary field. It would be nice to have the dictionary "just work" whenever a string/binary field is passed to suggestion/payload field. The patch provides one solution to this problem (by accepting string or binary values), though it would be great if there are any other solution to this, without making the DocumentDictionary "too flexible"

      1. LUCENE-5329.patch
        31 kB
        Areek Zillur
      2. LUCENE-5329.patch
        32 kB
        Areek Zillur
      3. LUCENE-5329.patch
        21 kB
        Areek Zillur
      4. LUCENE-5329.patch
        21 kB
        Areek Zillur
      5. LUCENE-5329.patch
        14 kB
        Areek Zillur

        Activity

        Hide
        Areek Zillur added a comment -

        Initial patch:

        • skip documents that have any invalid fields
        • improved tests to test it
        • allow suggestion/payload field to accept string/binary values
        • improved documentation
        Show
        Areek Zillur added a comment - Initial patch: skip documents that have any invalid fields improved tests to test it allow suggestion/payload field to accept string/binary values improved documentation
        Hide
        Varun Thacker added a comment -

        Hi Areek,

        I was asking myself the same question the other day.

        I felt making payloads binary values was too restrictive. Also I thought if we don't use binary values you could have multiple payload fields.

        A use case where multiple payload fields would be useful:
        You have product autosuggest in an eCommerce store.
        You might want back in the suggested document - link, image url, price etc. as payloads making it easier for someone for integrate.

        Is there a negative side to making any such changes?

        Show
        Varun Thacker added a comment - Hi Areek, I was asking myself the same question the other day. I felt making payloads binary values was too restrictive. Also I thought if we don't use binary values you could have multiple payload fields. A use case where multiple payload fields would be useful: You have product autosuggest in an eCommerce store. You might want back in the suggested document - link, image url, price etc. as payloads making it easier for someone for integrate. Is there a negative side to making any such changes?
        Hide
        Areek Zillur added a comment -

        These are my thoughts on this:

        • From the Lucene Suggester perspective, it makes perfect sense to have the payload as a binary field. Because the payload is just 'stored' as is and then returned, no processing is done on it, hence it makes sense that the lucene suggester just treats it as binary data.
        • Having said that when exposing it from Solr, it would be nice to make it "just work", rather than the user having to make sure what field will spit what out, hence the proposed changes
        • Regarding the use-case with ecommerce store, the payload does not necessarily have to be a field, it can be the aggregate of other fields or some arbitrary associated data. (though there is no way to do so in Solr now, but I plan to make it possible with the new Solr Suggester (SOLR-5378) .
        • As far as I understand, payloads should remain binary in the Lucene Suggesters, but the dilemma is whether the input to the suggester be flexible
        Show
        Areek Zillur added a comment - These are my thoughts on this: From the Lucene Suggester perspective, it makes perfect sense to have the payload as a binary field. Because the payload is just 'stored' as is and then returned, no processing is done on it, hence it makes sense that the lucene suggester just treats it as binary data. Having said that when exposing it from Solr, it would be nice to make it "just work", rather than the user having to make sure what field will spit what out, hence the proposed changes Regarding the use-case with ecommerce store, the payload does not necessarily have to be a field, it can be the aggregate of other fields or some arbitrary associated data. (though there is no way to do so in Solr now, but I plan to make it possible with the new Solr Suggester ( SOLR-5378 ) . As far as I understand, payloads should remain binary in the Lucene Suggesters, but the dilemma is whether the input to the suggester be flexible
        Hide
        Areek Zillur added a comment -

        Patch Updated:

        • Added ctor for DocumentExpressionDictionary (can take in ValueSource) [wondering if the name should be more general, as it can now compute weights using ValueSource directly]
        • Allow DocumentDictionary to use NumericDocValuesField for suggestion weights
        • Updated tests to reflect new changes

        NOTE: using ant documenation-lint gives me this error (any advice on fixing this javadoc is greatly appreciated):
        [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/DocumentExpressionDictionary.html
        [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/queries.function.ValueSource.html
        [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/queries.function.ValueSource.html
        [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/queries.function.ValueSource.html
        [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/queries.function.ValueSource.html

        Show
        Areek Zillur added a comment - Patch Updated: Added ctor for DocumentExpressionDictionary (can take in ValueSource) [wondering if the name should be more general, as it can now compute weights using ValueSource directly] Allow DocumentDictionary to use NumericDocValuesField for suggestion weights Updated tests to reflect new changes NOTE: using ant documenation-lint gives me this error (any advice on fixing this javadoc is greatly appreciated): [exec] file:///build/docs/suggest/org/apache/lucene/search/suggest/DocumentExpressionDictionary.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/queries.function.ValueSource.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/queries.function.ValueSource.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/queries.function.ValueSource.html [exec] BROKEN LINK: file:///build/docs/core/org/apache/lucene/queries.function.ValueSource.html
        Hide
        Areek Zillur added a comment -

        Updated patch:

        • fixed documentation-lint issue (added queries javadoc in build)
        • changed # of docs generated in tests to make it a little faster
        Show
        Areek Zillur added a comment - Updated patch: fixed documentation-lint issue (added queries javadoc in build) changed # of docs generated in tests to make it a little faster
        Hide
        Areek Zillur added a comment -

        Updated Patch:

        • Properly handle reader with no documents [instead of erroring out, return no suggester inputs]
        • Minor refactoring of ctor logic
        • Added tests for empty readers [DocumentDictionary & DocumentExpressionDictionary]
        • Added ValueSource test for DocumentExpressionDictionary
        Show
        Areek Zillur added a comment - Updated Patch: Properly handle reader with no documents [instead of erroring out, return no suggester inputs] Minor refactoring of ctor logic Added tests for empty readers [DocumentDictionary & DocumentExpressionDictionary] Added ValueSource test for DocumentExpressionDictionary
        Hide
        Michael McCandless added a comment -

        Patch looks good!

        When you say:

         *      if any of the term or (optionally) payload fields supplied
         *      do not have a value for a document, then the document is 
         *      rejected by the dictionary
        

        Does that mean the document is silently skipped? (rejected sounds like it may throw an exception).

        Why did liveDocs need to become non-final?

        Instead of trying to pull the NumericDocValues for weight on every getWeight call, could we do it up front in the ctor? And, just use MultiDocValues (instead of holding leaves, starts and doing the bin search ourselves).

        Show
        Michael McCandless added a comment - Patch looks good! When you say: * if any of the term or (optionally) payload fields supplied * do not have a value for a document, then the document is * rejected by the dictionary Does that mean the document is silently skipped? (rejected sounds like it may throw an exception). Why did liveDocs need to become non-final? Instead of trying to pull the NumericDocValues for weight on every getWeight call, could we do it up front in the ctor? And, just use MultiDocValues (instead of holding leaves, starts and doing the bin search ourselves).
        Hide
        Areek Zillur added a comment -

        Updated Patch:

        • Updated Document [rejected -> skipped]
        • Use MultiDocValues for DocumentDictionary
        • Refactored ctor
        Show
        Areek Zillur added a comment - Updated Patch: Updated Document [rejected -> skipped] Use MultiDocValues for DocumentDictionary Refactored ctor
        Hide
        Areek Zillur added a comment -

        Thanks Michael for the review! I have addressed your comments. One question though, I was wondering if we have something similar to MultiDocValues for ValueSource (MultValueSource doesnt seem like what I wanted). I think it would be nice if we can get rid of the leaves and starts logic from the DocExprDict too.

        Show
        Areek Zillur added a comment - Thanks Michael for the review! I have addressed your comments. One question though, I was wondering if we have something similar to MultiDocValues for ValueSource (MultValueSource doesnt seem like what I wanted). I think it would be nice if we can get rid of the leaves and starts logic from the DocExprDict too.
        Hide
        Michael McCandless added a comment -

        Thanks Areek, patch looks great! I'll commit soon...

        That's a good question: I don't think we have a MultiValueSource.

        Show
        Michael McCandless added a comment - Thanks Areek, patch looks great! I'll commit soon... That's a good question: I don't think we have a MultiValueSource.
        Hide
        ASF subversion and git services added a comment -

        Commit 1544570 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1544570 ]

        LUCENE-5329: Document/ExpressionDictionary are now lenient if a doc is missing term/payload/weight

        Show
        ASF subversion and git services added a comment - Commit 1544570 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1544570 ] LUCENE-5329 : Document/ExpressionDictionary are now lenient if a doc is missing term/payload/weight
        Hide
        ASF subversion and git services added a comment -

        Commit 1544572 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1544572 ]

        LUCENE-5329: Document/ExpressionDictionary are now lenient if a doc is missing term/payload/weight

        Show
        ASF subversion and git services added a comment - Commit 1544572 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1544572 ] LUCENE-5329 : Document/ExpressionDictionary are now lenient if a doc is missing term/payload/weight
        Hide
        Michael McCandless added a comment -

        Thanks Areek!

        Show
        Michael McCandless added a comment - Thanks Areek!

          People

          • Assignee:
            Unassigned
            Reporter:
            Areek Zillur
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development