Lucene - Core
  1. Lucene - Core
  2. LUCENE-5251

New Dictionary Implementation for Suggester consumption

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6, 6.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      With the vast array of new suggester, It would be nice to have a dictionary implementation that could feed the suggesters terms, weights and (optionally) payloads from the lucene index.

      The idea of this dictionary implementation is to grab stored documents from the index and use user-configured fields for terms, weights and payloads.

      use-case: If you have a document with three fields

      • product_id
      • product_name
      • product_popularity_score

      then using this implementation would enable you to have a suggester for product_name using the weight of product_popularity_score and return you the payload of product_id, with which you can do further processing on (example: construct a url etc).

      1. LUCENE-5251.patch
        14 kB
        Areek Zillur
      2. LUCENE-5251.patch
        14 kB
        Areek Zillur
      3. LUCENE-5251.patch
        13 kB
        Areek Zillur
      4. LUCENE-5251.patch
        18 kB
        Areek Zillur

        Activity

        Hide
        Areek Zillur added a comment -

        Uploaded initial patch with tests

        Show
        Areek Zillur added a comment - Uploaded initial patch with tests
        Hide
        Michael McCandless added a comment -

        This is a great idea!

        Hmm, the patch duplicates files, i.e. DocumentDictionary.java and DocumentDictionaryTest.java appear twice; looks like git included some extra local revisions or something?

        Unfortunately, the older suggesters do an instanceof check for TermFreqPayloadIterator and refuse to build() if so, since they don't support payloads, which means they can't be used w// DocumentDictionary? Maybe we should fix those to instead fail if payload() is ever non-null; and maybe we should nuke the separate iterator and just add payload() to TermFreqIterator? (We can do all this separately...).

        Maybe you should pull the payload from StorableField.binaryValue() instead of new BytesRef(StorableField.stringValue())? Maybe javadoc that all fields are required (cannot be sparse) for every document.

        You already check that weight cannot be null, so you don't need the extra null check after that?

        Show
        Michael McCandless added a comment - This is a great idea! Hmm, the patch duplicates files, i.e. DocumentDictionary.java and DocumentDictionaryTest.java appear twice; looks like git included some extra local revisions or something? Unfortunately, the older suggesters do an instanceof check for TermFreqPayloadIterator and refuse to build() if so, since they don't support payloads, which means they can't be used w// DocumentDictionary? Maybe we should fix those to instead fail if payload() is ever non-null; and maybe we should nuke the separate iterator and just add payload() to TermFreqIterator? (We can do all this separately...). Maybe you should pull the payload from StorableField.binaryValue() instead of new BytesRef(StorableField.stringValue())? Maybe javadoc that all fields are required (cannot be sparse) for every document. You already check that weight cannot be null, so you don't need the extra null check after that?
        Hide
        Areek Zillur added a comment -

        Hey Michael,

        Thanks for reviewing the patch!

        The patch does not duplicate the files but shows two git commits on the new files. The first one was the actual implementation + tests and the second one removes all the debug printlns that were put in for inspection. (someday I will try to use svn ). It does look confusing.

        I am aware of that problem and was thinking along the same lines, have ONLY TermFreqPayloadIterator (maybe should be named something like EntryIterator? as there will be no TermFreqIterator and co) and have it accepted for all the suggester and let the suggester throw exceptions if it does not support payload/weight. I would be happy to open up an issue and work on it!

        I will upload another patch incorporating all your other comments!

        Show
        Areek Zillur added a comment - Hey Michael, Thanks for reviewing the patch! The patch does not duplicate the files but shows two git commits on the new files. The first one was the actual implementation + tests and the second one removes all the debug printlns that were put in for inspection. (someday I will try to use svn ). It does look confusing. I am aware of that problem and was thinking along the same lines, have ONLY TermFreqPayloadIterator (maybe should be named something like EntryIterator? as there will be no TermFreqIterator and co) and have it accepted for all the suggester and let the suggester throw exceptions if it does not support payload/weight. I would be happy to open up an issue and work on it! I will upload another patch incorporating all your other comments!
        Hide
        Areek Zillur added a comment -

        Uploaded patch (merged commits).

        • updated javadocs to note that fields supplied are required for all documents
        • for payload and field try to get binaryValues if available; fallback to stringValue if not
        • gotten rid of redundant null checks
        Show
        Areek Zillur added a comment - Uploaded patch (merged commits). updated javadocs to note that fields supplied are required for all documents for payload and field try to get binaryValues if available; fallback to stringValue if not gotten rid of redundant null checks
        Hide
        Michael McCandless added a comment -

        The patch does not duplicate the files but shows two git commits on the new files.

        OK, got it. That's cool; it's nice to see the separate revisions
        spelled out. I just wish "patch" or "svn patch" did the right thing
        here...

        Thank you for merging commits on the new patch

        I am aware of that problem and was thinking along the same lines, have ONLY TermFreqPayloadIterator (maybe should be named something like EntryIterator? as there will be no TermFreqIterator and co) and have it accepted for all the suggester and let the suggester throw exceptions if it does not support payload/weight. I would be happy to open up an issue and work on it!

        That would be awesome! I agree we should do this separately... for
        this issue I think we just document that it works only with the
        "newer" suggesters?

        New patch looks great, thanks Areek! Some comments:

        • For the "field", I think we should use .stringValue() not
          .binaryValue()? Ie, an app would typically have stored a
          StringField? (But for payloadField, I think it should be
          .binaryValue()).
        • If the payload field exists, but its .binaryValue() is null, I
          think we should throw an IAE too?
        • Maybe use RandomIndexWriter in the tests? (It's more evil.) The
          only issue is, you should then also set a .newLogMergePolicy()
          onto those IndexWriterConfigs, else eventually you'll get the
          AlcoholicMergePolicy and the docIDs will be in a different order
          than how you indexed and the test will [falsely] fail.
        Show
        Michael McCandless added a comment - The patch does not duplicate the files but shows two git commits on the new files. OK, got it. That's cool; it's nice to see the separate revisions spelled out. I just wish "patch" or "svn patch" did the right thing here... Thank you for merging commits on the new patch I am aware of that problem and was thinking along the same lines, have ONLY TermFreqPayloadIterator (maybe should be named something like EntryIterator? as there will be no TermFreqIterator and co) and have it accepted for all the suggester and let the suggester throw exceptions if it does not support payload/weight. I would be happy to open up an issue and work on it! That would be awesome! I agree we should do this separately... for this issue I think we just document that it works only with the "newer" suggesters? New patch looks great, thanks Areek! Some comments: For the "field", I think we should use .stringValue() not .binaryValue()? Ie, an app would typically have stored a StringField? (But for payloadField, I think it should be .binaryValue()). If the payload field exists, but its .binaryValue() is null, I think we should throw an IAE too? Maybe use RandomIndexWriter in the tests? (It's more evil.) The only issue is, you should then also set a .newLogMergePolicy() onto those IndexWriterConfigs, else eventually you'll get the AlcoholicMergePolicy and the docIDs will be in a different order than how you indexed and the test will [falsely] fail.
        Hide
        Areek Zillur added a comment -

        Thanks for the quick review! I will upload a patch soon, incorporating the suggested changes.

        Show
        Areek Zillur added a comment - Thanks for the quick review! I will upload a patch soon, incorporating the suggested changes.
        Hide
        Areek Zillur added a comment -

        Uploaded new patch (diff rather than git patch):

        • field is expected to have stringvalue(); payload is expected to binaryValue()
        • throw IAE if any of the fields do not have the desired value (including weight)
        • Use RandomIndexWriter for tests!
        • Added documentation that DocumentDictionary will not work with older suggesters

        I also opened another issue (https://issues.apache.org/jira/browse/LUCENE-5260) to make the new Dictionary work for the older suggesters!

        Show
        Areek Zillur added a comment - Uploaded new patch (diff rather than git patch): field is expected to have stringvalue(); payload is expected to binaryValue() throw IAE if any of the fields do not have the desired value (including weight) Use RandomIndexWriter for tests! Added documentation that DocumentDictionary will not work with older suggesters I also opened another issue ( https://issues.apache.org/jira/browse/LUCENE-5260 ) to make the new Dictionary work for the older suggesters!
        Hide
        Michael McCandless added a comment -

        Patch looks great! I think it's basically done ... I found just minor things:

        Maybe change the javadoc to say "Dictionary with terms, weights and optionally payload information taken from stored fields in a Lucene index"?

        Also, that list under the NOTE: is not HTML, so it will render in the browser incorrectly. Maybe make it a <ul> ... <li> ... </ul>?

        It's also not compatible w/ AnalyzingInfixSuggester (which refuses to look at payloads), at least until we fix LUCENE-5260.

        Show
        Michael McCandless added a comment - Patch looks great! I think it's basically done ... I found just minor things: Maybe change the javadoc to say "Dictionary with terms, weights and optionally payload information taken from stored fields in a Lucene index"? Also, that list under the NOTE: is not HTML, so it will render in the browser incorrectly. Maybe make it a <ul> ... <li> ... </ul>? It's also not compatible w/ AnalyzingInfixSuggester (which refuses to look at payloads), at least until we fix LUCENE-5260 .
        Hide
        Areek Zillur added a comment -

        Patch Uploaded

        • Changed Javadoc as suggested and noted AnalyzingInfix suggester incompatibility
        • Used valid html for listing

        Michael Thanks for the quick reviews!

        Show
        Areek Zillur added a comment - Patch Uploaded Changed Javadoc as suggested and noted AnalyzingInfix suggester incompatibility Used valid html for listing Michael Thanks for the quick reviews!
        Hide
        ASF subversion and git services added a comment -

        Commit 1530229 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1530229 ]

        LUCENE-5251: add DocumentDictionary

        Show
        ASF subversion and git services added a comment - Commit 1530229 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1530229 ] LUCENE-5251 : add DocumentDictionary
        Hide
        ASF subversion and git services added a comment -

        Commit 1530231 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1530231 ]

        LUCENE-5251: add DocumentDictionary

        Show
        ASF subversion and git services added a comment - Commit 1530231 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1530231 ] LUCENE-5251 : add DocumentDictionary
        Hide
        Michael McCandless added a comment -

        Thanks Areek, I just committed this!

        I just had to fix one assert in testWithDeletions; it didn't hold up under beasting.

        And fixed up a few things on backporting to 4.x ...

        Show
        Michael McCandless added a comment - Thanks Areek, I just committed this! I just had to fix one assert in testWithDeletions; it didn't hold up under beasting. And fixed up a few things on backporting to 4.x ...
        Hide
        Areek Zillur added a comment -

        Thanks for committing the patch, Michael!

        Show
        Areek Zillur added a comment - Thanks for committing the patch, Michael!
        Hide
        ASF subversion and git services added a comment -

        Commit 1554207 from Michael McCandless in branch 'dev/branches/lucene5376'
        [ https://svn.apache.org/r1554207 ]

        LUCENE-5376, LUCENE-5251: expose DocumentDictionary (to build suggestor from stored documents) in demo server

        Show
        ASF subversion and git services added a comment - Commit 1554207 from Michael McCandless in branch 'dev/branches/lucene5376' [ https://svn.apache.org/r1554207 ] LUCENE-5376 , LUCENE-5251 : expose DocumentDictionary (to build suggestor from stored documents) in demo server

          People

          • Assignee:
            Unassigned
            Reporter:
            Areek Zillur
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development