Lucene - Core
  1. Lucene - Core
  2. LUCENE-5404

Add support to get number of entries a Suggester Lookup was built with and minor refactorings

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      It would be nice to be able to tell the number of entries a suggester lookup was built with. This would let components using lookups to keep some stats regarding how many entries were used to build a lookup.
      Additionally, Dictionary could use InputIterator rather than the BytesRefIteratator, as most of the implmentations now use it.

      1. LUCENE-5404.patch
        59 kB
        Areek Zillur
      2. LUCENE-5404.patch
        52 kB
        Areek Zillur
      3. LUCENE-5404.patch
        37 kB
        Areek Zillur
      4. LUCENE-5404.patch
        37 kB
        Areek Zillur

        Activity

        Hide
        Areek Zillur added a comment -

        Initial patch:

        • adds count() method to InputIterator to get the # of entries used by consuming Lookup implementations
        • Dictionary now uses InputIterator rather than BytesRefIterator
        • renamed Dictionary.getWordsIterator() to Dictionary.getEntryIterator()
        • Lookup.build returns number of entries used for building it
        • minor renaming
        Show
        Areek Zillur added a comment - Initial patch: adds count() method to InputIterator to get the # of entries used by consuming Lookup implementations Dictionary now uses InputIterator rather than BytesRefIterator renamed Dictionary.getWordsIterator() to Dictionary.getEntryIterator() Lookup.build returns number of entries used for building it minor renaming
        Hide
        Michael McCandless added a comment -

        Thanks Areek, this looks great.

        Can we name the method .getCount() instead, just to make it clear it's "just" getting a simple property? .count() makes it sound like it may go off and do a bunch of work to count something...

        Show
        Michael McCandless added a comment - Thanks Areek, this looks great. Can we name the method .getCount() instead, just to make it clear it's "just" getting a simple property? .count() makes it sound like it may go off and do a bunch of work to count something...
        Hide
        Robert Muir added a comment -

        I'm a little confused with the count() being on the iterator API. When is it valid to call count()?
        One is example is the HIghFrequencyDictionary case where it counts-as-it-goes.

        Show
        Robert Muir added a comment - I'm a little confused with the count() being on the iterator API. When is it valid to call count()? One is example is the HIghFrequencyDictionary case where it counts-as-it-goes.
        Hide
        Areek Zillur added a comment - - edited

        Updated Patch (Thanks for the review Michael and Robert):

        • count() -> getCount()
        • doc to make it clear that count represents only the entries the iterator has seen so far

        Regarding the getCount() in the InputIterator API, it is just the number of entries the iterator has seen so far. So it will be valid to call it only after a Lookup has consumed all the entries from the Iterator. I think it will be something useful to know the # of entries a lookup was built with, given some of the iterator skips 'invalid' entries as it feeds it to the lookups? I can also see this being useful when suggesters are run in distributed mode, maybe when shards dont respond, this can be used to report the % of entries that the lookup ran against?

        Show
        Areek Zillur added a comment - - edited Updated Patch (Thanks for the review Michael and Robert): count() -> getCount() doc to make it clear that count represents only the entries the iterator has seen so far Regarding the getCount() in the InputIterator API, it is just the number of entries the iterator has seen so far. So it will be valid to call it only after a Lookup has consumed all the entries from the Iterator. I think it will be something useful to know the # of entries a lookup was built with, given some of the iterator skips 'invalid' entries as it feeds it to the lookups? I can also see this being useful when suggesters are run in distributed mode, maybe when shards dont respond, this can be used to report the % of entries that the lookup ran against?
        Hide
        Areek Zillur added a comment -

        Robert Muir Any thoughts on the proposed API change?

        Show
        Areek Zillur added a comment - Robert Muir Any thoughts on the proposed API change?
        Hide
        Robert Muir added a comment -

        I guess mainly I wonder if we should just add it to the Lookup api directly (rather than a return value from build()). This would probably remove the necessity of having it in the iterator API (the lookup could just count, and then write the count into its data file if need be).

        Otherwise a lookup would only be able to provide this information when build() was called (rather than when it was loaded from e.g. a stored FST).

        Show
        Robert Muir added a comment - I guess mainly I wonder if we should just add it to the Lookup api directly (rather than a return value from build()). This would probably remove the necessity of having it in the iterator API (the lookup could just count, and then write the count into its data file if need be). Otherwise a lookup would only be able to provide this information when build() was called (rather than when it was loaded from e.g. a stored FST).
        Hide
        Areek Zillur added a comment -

        I like this idea, I can see how the count stuff is more appropriate in the Lookup API rather than the InputIterator. I am in the process of cooking up a patch that will do this.

        Show
        Areek Zillur added a comment - I like this idea, I can see how the count stuff is more appropriate in the Lookup API rather than the InputIterator. I am in the process of cooking up a patch that will do this.
        Hide
        Areek Zillur added a comment -

        Updated Patch:

        • added getCount in Lookup API
        • added store/load methods to persist # of entries in Lookup
        • minor refactoring of Lookup API

        Now we can get the # of entries even when the datastructure is loaded from disk. The new store/load methods wrap around the abstract ones and uniformly writes the # of entries.

        Show
        Areek Zillur added a comment - Updated Patch: added getCount in Lookup API added store/load methods to persist # of entries in Lookup minor refactoring of Lookup API Now we can get the # of entries even when the datastructure is loaded from disk. The new store/load methods wrap around the abstract ones and uniformly writes the # of entries.
        Hide
        Michael McCandless added a comment -

        I think it's a little strange to load the count in super's .load(InputStream) method, and then fwd to sub's .load(DataInput)?

        (Separately I think it's good to switch .store/.load from OutputStream/InputStream to DataOutput/DataInput).

        Maybe ... we can just do the change from "void build" to "int build", and not worry about adding a .getCount() from a load'd suggester. This way the app knows how many suggestions were built when creating the suggester and if really cares to persist this info it can do so itself? Or, we can keep the .getCount, but have each suggester handle writing/reading it itself? I just think it's odd to have super write/read this ...

        It's also not great how these suggesters don't use CodecUtil.write/readHeader to make sure the file they are loading is in fact something that this suggester had previously .stored ... but we can fix that separately.

        Show
        Michael McCandless added a comment - I think it's a little strange to load the count in super's .load(InputStream) method, and then fwd to sub's .load(DataInput)? (Separately I think it's good to switch .store/.load from OutputStream/InputStream to DataOutput/DataInput). Maybe ... we can just do the change from "void build" to "int build", and not worry about adding a .getCount() from a load'd suggester. This way the app knows how many suggestions were built when creating the suggester and if really cares to persist this info it can do so itself? Or, we can keep the .getCount, but have each suggester handle writing/reading it itself? I just think it's odd to have super write/read this ... It's also not great how these suggesters don't use CodecUtil.write/readHeader to make sure the file they are loading is in fact something that this suggester had previously .stored ... but we can fix that separately.
        Hide
        Robert Muir added a comment -

        Or, we can keep the .getCount, but have each suggester handle writing/reading it itself?

        Thats what we should do. Apps should not have to track this stuff, thats crazy.

        Show
        Robert Muir added a comment - Or, we can keep the .getCount, but have each suggester handle writing/reading it itself? Thats what we should do. Apps should not have to track this stuff, thats crazy.
        Hide
        Areek Zillur added a comment -

        Thanks for the feedback
        I was thinking the "count" stuff should be common to all suggesters, hence put it in super. I will change it so that the suggesters themselves take care of it.
        I will create an issue to let suggesters use CodecUtil.write/checkHeader. Thanks for pointing that out.
        Regarding .store/.load, I was thinking if we could use something more general so that we can read/write directories (for AnalyzingInfixSuggester and co), along with files (for other suggesters)? I think that will let all suggester impl to respect the Lookup API. Any thoughts on this?

        Show
        Areek Zillur added a comment - Thanks for the feedback I was thinking the "count" stuff should be common to all suggesters, hence put it in super. I will change it so that the suggesters themselves take care of it. I will create an issue to let suggesters use CodecUtil.write/checkHeader. Thanks for pointing that out. Regarding .store/.load, I was thinking if we could use something more general so that we can read/write directories (for AnalyzingInfixSuggester and co), along with files (for other suggesters)? I think that will let all suggester impl to respect the Lookup API. Any thoughts on this?
        Hide
        Michael McCandless added a comment -

        Thanks Areek.

        Regarding .store/.load, I was thinking if we could use something more general so that we can read/write directories (for AnalyzingInfixSuggester and co), along with files (for other suggesters)? I think that will let all suggester impl to respect the Lookup API. Any thoughts on this?

        I'm not really sure what to do w/ the store/load APIs. It may be too much to ask that all suggesters use a common API for it. E.g, it's sort of weird to 1) create a new suggester class, and 2) call its load API; it's more natural to create the suggester, passing in a Dir/File where it should load its state from. Ie, you are either loading a previously built suggester, or you creating a new one. Today the API allows you to load a previously built one and then also .build() a new one over it, which is strange. I think LUCENE-4492 is getting at this too ...

        E.g., AnalyzingInfixSuggester cannot do this: it loads itself based on what you passed to the ctor, so it's .load does nothing.

        Show
        Michael McCandless added a comment - Thanks Areek. Regarding .store/.load, I was thinking if we could use something more general so that we can read/write directories (for AnalyzingInfixSuggester and co), along with files (for other suggesters)? I think that will let all suggester impl to respect the Lookup API. Any thoughts on this? I'm not really sure what to do w/ the store/load APIs. It may be too much to ask that all suggesters use a common API for it. E.g, it's sort of weird to 1) create a new suggester class, and 2) call its load API; it's more natural to create the suggester, passing in a Dir/File where it should load its state from. Ie, you are either loading a previously built suggester, or you creating a new one. Today the API allows you to load a previously built one and then also .build() a new one over it, which is strange. I think LUCENE-4492 is getting at this too ... E.g., AnalyzingInfixSuggester cannot do this: it loads itself based on what you passed to the ctor, so it's .load does nothing.
        Hide
        Areek Zillur added a comment -

        Thanks Michael for the clarification on the API and the pointer to the jira. I understand the concern with the API better now. I will try to come up with something to improve the API in the future, hopefully.

        I have attached an updated patch, the changes are as follows:

        • let individual suggesters deal with their count in .store/.load
        • added abstract getCount to Lookup API
        • tests to make sure getCount works
        • left the load/store(Input/OutputStream) for sugar in Lookup API

        On a different note, I also see the different options that can be fed in the lookup() for different suggesters, I was thinking an object (LookupOptions?) can be passed instead (which will encapsulate all the params). I think this will make the API 'cleaner' and allow suggester specific options to be passed by just using the Lookup class, Thoughts? (I will probably just do this separately)

        Show
        Areek Zillur added a comment - Thanks Michael for the clarification on the API and the pointer to the jira. I understand the concern with the API better now. I will try to come up with something to improve the API in the future, hopefully. I have attached an updated patch, the changes are as follows: let individual suggesters deal with their count in .store/.load added abstract getCount to Lookup API tests to make sure getCount works left the load/store(Input/OutputStream) for sugar in Lookup API On a different note, I also see the different options that can be fed in the lookup() for different suggesters, I was thinking an object (LookupOptions?) can be passed instead (which will encapsulate all the params). I think this will make the API 'cleaner' and allow suggester specific options to be passed by just using the Lookup class, Thoughts? (I will probably just do this separately)
        Hide
        Michael McCandless added a comment -

        +1, patch looks great, thanks Areek.

        On a different note, I also see the different options that can be fed in the lookup() for different suggesters, I was thinking an object (LookupOptions?) can be passed instead (which will encapsulate all the params). I think this will make the API 'cleaner' and allow suggester specific options to be passed by just using the Lookup class, Thoughts? (I will probably just do this separately)

        I think maybe each suggester should just have its own lookup method, taking its additional params? Ie, I'm not sure how consistently each one will have options that the others would want to use. E.g. AnalyzingInfixSuggester accepts two additional booleans: allTermsRequired, doHighlight. But I don't think other suggesters can support these options...

        Show
        Michael McCandless added a comment - +1, patch looks great, thanks Areek. On a different note, I also see the different options that can be fed in the lookup() for different suggesters, I was thinking an object (LookupOptions?) can be passed instead (which will encapsulate all the params). I think this will make the API 'cleaner' and allow suggester specific options to be passed by just using the Lookup class, Thoughts? (I will probably just do this separately) I think maybe each suggester should just have its own lookup method, taking its additional params? Ie, I'm not sure how consistently each one will have options that the others would want to use. E.g. AnalyzingInfixSuggester accepts two additional booleans: allTermsRequired, doHighlight. But I don't think other suggesters can support these options...
        Hide
        Areek Zillur added a comment - - edited

        Thanks for the review Michael! I will commit this shortly if there are no further objections.

        I think maybe each suggester should just have its own lookup method, taking its additional params? Ie, I'm not sure how consistently each one will have options that the others would want to use. E.g. AnalyzingInfixSuggester accepts two additional booleans: allTermsRequired, doHighlight. But I don't think other suggesters can support these options...

        I am aware that all suggesters can not support all params but, it is the same idea behind InputIterator? (not all suggesters support payload or weight). So if this object has params set that is not supported by suggesters, the suggester can raise an exception? The way I see it:

          Lookup suggester = ..;
          LookupOptions lookupOptions = ...;
          suggester.lookup(lookupOptions); // if lookupOptons set doHighlight but the suggester does not support it throw exception
        

        is nicer than

          Lookup suggester = ..;
          if (suggester instanceof AnalyzingInfixSuggester)
            ((AnalyzingInfixSuggester) suggster).lookup("foo", 3, false, true);
         else
           suggester.lookup("foo", false, 3);
        

        But this does add another level of 'abstraction'? This will allow for adding new params by suggesters and be used by just using the Lookup class. Just a thought.

        Show
        Areek Zillur added a comment - - edited Thanks for the review Michael! I will commit this shortly if there are no further objections. I think maybe each suggester should just have its own lookup method, taking its additional params? Ie, I'm not sure how consistently each one will have options that the others would want to use. E.g. AnalyzingInfixSuggester accepts two additional booleans: allTermsRequired, doHighlight. But I don't think other suggesters can support these options... I am aware that all suggesters can not support all params but, it is the same idea behind InputIterator? (not all suggesters support payload or weight). So if this object has params set that is not supported by suggesters, the suggester can raise an exception? The way I see it: Lookup suggester = ..; LookupOptions lookupOptions = ...; suggester.lookup(lookupOptions); // if lookupOptons set doHighlight but the suggester does not support it throw exception is nicer than Lookup suggester = ..; if (suggester instanceof AnalyzingInfixSuggester) ((AnalyzingInfixSuggester) suggster).lookup( "foo" , 3, false , true ); else suggester.lookup( "foo" , false , 3); But this does add another level of 'abstraction'? This will allow for adding new params by suggesters and be used by just using the Lookup class. Just a thought.
        Hide
        ASF subversion and git services added a comment -

        Commit 1565810 from Areek Zillur in branch 'dev/trunk'
        [ https://svn.apache.org/r1565810 ]

        LUCENE-5404: Add .getCount method to all suggesters (Lookup), persist count metadata on .store(), Dictionary returns InputIterator, Dictionary.getWordIterator renamed to .getEntryIterator

        Show
        ASF subversion and git services added a comment - Commit 1565810 from Areek Zillur in branch 'dev/trunk' [ https://svn.apache.org/r1565810 ] LUCENE-5404 : Add .getCount method to all suggesters (Lookup), persist count metadata on .store(), Dictionary returns InputIterator, Dictionary.getWordIterator renamed to .getEntryIterator
        Hide
        ASF subversion and git services added a comment -

        Commit 1565811 from Areek Zillur in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1565811 ]

        LUCENE-5404: Add .getCount method to all suggesters (Lookup), persist count metadata on .store(), Dictionary returns InputIterator, Dictionary.getWordIterator renamed to .getEntryIterator

        Show
        ASF subversion and git services added a comment - Commit 1565811 from Areek Zillur in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1565811 ] LUCENE-5404 : Add .getCount method to all suggesters (Lookup), persist count metadata on .store(), Dictionary returns InputIterator, Dictionary.getWordIterator renamed to .getEntryIterator
        Hide
        Areek Zillur added a comment -

        Minor change made count is long (was int). Hopefully all hell did not break loose, this is my first commit!

        Show
        Areek Zillur added a comment - Minor change made count is long (was int). Hopefully all hell did not break loose, this is my first commit!
        Hide
        Areek Zillur added a comment -

        Thanks Michael and Robert for the review and feedback.

        Show
        Areek Zillur added a comment - Thanks Michael and Robert for the review and feedback.

          People

          • Assignee:
            Areek Zillur
            Reporter:
            Areek Zillur
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development