SOLR-877: Access to Lucene's TermEnum capabilities

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: None
    • Labels:
      None

      Description

      I wrote a simple SearchComponent on the plane the other day that gives access to Lucene's TermEnum capabilities. I think this will be useful for doing auto-suggest and other term based operations. My first draft is not distributed, but it probably should be made to do so eventually.

      1. SOLR-877.patch
        12 kB
        Grant Ingersoll
      2. SOLR-877.patch
        16 kB
        Grant Ingersoll
      3. SOLR-877.patch
        4 kB
        Khee Chin
      4. SOLR-877_2.patch
        13 kB
        Yonik Seeley


          Activity

          Grant Ingersoll added a comment -

          First draft

          Grant Ingersoll added a comment -

          This is close to ready to commit.

          Grant Ingersoll added a comment -

          Committed revision 721491.

          This commit is slightly different from the last patch. Fixed a couple of minor issues and added the ability to exclude the lower bound.

          Yonik Seeley added a comment - edited

          Looks useful, esp for distributed idf when we get around to it.
          A quick review:

          • "terms.fl", fl normally stands for field list. Would this make more sense as "terms.f"?
          • strings are returned instead of integers for the term count (it's even more obvious in JSON output)
          • how does one ask for "all terms after foo?" the docs suggest that upper or rows must be set... is the only way to set a really high rows value? If so, allowing terms.rows=-1 for "unlimited" might be nicer.

          Actually, this is very much like faceting.... perhaps we should use the same parameters:
          terms.field
          terms.offset
          terms.limit
          terms.mincount (future)
          terms.sort (future)

          If this is to be useful for distributed search, it needs to be able to handle the direct specification of terms in multiple fields. We don't necessarily need to implement this now, but we should think about having an output format that doesn't need to be deprecated when it is added. At a minimum it seems like there should be an extra level.... something like

          "terms" = {
            "myfield" = { "foo"=10, "bar"=5, ...}
          }
          

          Or if we want to get even more like faceting with the output:

          "terms" = {
            "fields" = {
              "myfield" = {"foo"=10,...}
            }
          }
          
          • "terms.upr.incl" and "terms.lwr.incl" hurt the eyes a little since we already have "lower" and "upper" in the names of other terms - seems error prone (but this is a purely aesthetic thing).
          • This is supposedly useful for auto suggest - how do I go about asking for all terms starting with "abc"? Shouldn't there be a "terms.prefix" (just like faceting)?
          Grant Ingersoll added a comment -

          Yeah, it is sort of like faceting, but I didn't want to get too close to it, otherwise we're just duplicating effort. I wanted it to be really lightweight. We can add the pieces suggested here. I'll switch to int output right now.

          Grant Ingersoll added a comment -

          Fixed the int issue.

          Committed revision 721534.

          Yonik Seeley added a comment -

          It's not about duplicating effort (implementation), it's about reusing interface conventions, and what makes the most sense.

          Grant Ingersoll added a comment -

          The field thing makes sense, too. I thought about it and originally decided against it to avoid ever having to go too deep into nested NamedLists, but in hindsight it does make sense to be able to send in multiple fields, which means we can keep the terms.fl param.

          Erik Hatcher added a comment -

          Do note that suggestions using this component will be across the entire index, not constrained by q/fq. For that capability, look to the facet.prefix feature of the facet component.

          Grant Ingersoll added a comment -

          It's not about duplicating effort (implementation), it's about reusing interface conventions, and what makes the most sense.

          True, I see what you mean. For some reason, though, I tend to think of TermEnum in terms of lower bound and upper bound instead of offset, b/c offset implies you are a certain number of items into an array (i.e. foo[10]), whereas lower bound just feels looser to me. Semantics, I know. As for terms.limit, isn't that faceting param duplicating the "rows" parameter?


          I just committed several changes:

          • Added terms.prefix so now the auto-suggest should be possible. See the Wiki for an example
          • Changed upr.incl -> upper.incl and lwr.incl -> lower.incl
          • Fixed a bug w/ lower.incl=false that skipped the next term even if the first term was not a match for the input lower bound term.

          Committed revision 721681.
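          The exclusive-lower-bound behavior can be illustrated in isolation. A standalone sketch over a sorted term list (hypothetical names, not the actual TermsComponent code): the bug was skipping the first enumerated term whenever lower.incl=false, even when it didn't equal the lower bound; the fix is to skip only on an exact match.

          ```java
          import java.util.ArrayList;
          import java.util.List;

          class LowerBoundDemo {
              /**
               * Returns terms from a sorted list that are >= lower, or > lower
               * when lowerIncl is false (skip only the exact match, never the
               * first term that merely comes after the bound).
               */
              static List<String> termsFrom(List<String> sortedTerms, String lower, boolean lowerIncl) {
                  List<String> out = new ArrayList<>();
                  for (String t : sortedTerms) {
                      if (t.compareTo(lower) < 0) continue;        // before the lower bound
                      if (!lowerIncl && t.equals(lower)) continue; // exclusive: skip exact match only
                      out.add(t);
                  }
                  return out;
              }
          }
          ```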

          Yonik Seeley added a comment -

          For the purposes of this component, I think of TermEnum as an implementation, not the interface.... people will eventually want to do things like sort by high docfreq (just as in faceting), or only list terms above a certain count, or only list terms matching a certain pattern, etc. All of these can make sense since we can do it more efficiently closer to the data.

          I tend to think of TermEnum in terms of lower bound and upper bound instead of offset

          Right, offset doesn't make as much sense with the current semantics (but it might later).

          As for terms.limit, isn't that faceting duplicating the "rows" parameter?

          Yes, we unfortunately have two ways of specifying this ("rows" and "limit"). I think limit is the better name though (and this will be highly associated with faceting in people's mind I think).

          Khee Chin added a comment - edited

          As a Solr user who uses this function for auto-complete, I'd like to filter out terms with a low frequency count, so I've implemented a quick hack against a 28 Nov checkout.

          /src/java/org/apache/solr/common/params/TermsParams.java
          
            // Optional.  The minimum value of docFreq to be returned.  1 by default
            public static final String TERMS_FREQ_MIN = TERMS_PREFIX + "freqmin";
             // Optional.  The maximum value of docFreq to be returned.  -1 by default means no boundary
            public static final String TERMS_FREQ_MAX = TERMS_PREFIX + "freqmax";
          
          /src/java/org/apache/solr/handler/component/TermsComponent.java
          
              // At lines 55-56, after initializing boolean upperIncl and lowerIncl
              int freqmin = params.getInt(TermsParams.TERMS_FREQ_MIN,1); // initialize freqmin
              int freqmax = params.getInt(TermsParams.TERMS_FREQ_MAX,-1); // initialize freqmax
              
              // At line 69, replacing terms.add(theText, termEnum.docFreq());
              if (termEnum.docFreq() >= freqmin && (freqmax == -1 || termEnum.docFreq() <= freqmax)) {
                  terms.add(theText, termEnum.docFreq());
              } else {
                  i--; // skip this term without counting it toward the requested number of terms
              }
          

          The new parameters could be used by calling
          terms.freqmin=<value>
          terms.freqmax=<value>
          both of which are optional.
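          The filtering rule from the patch can be isolated into a tiny predicate (a hypothetical helper for illustration, not code from the patch): a term is kept when its docFreq is at least freqmin and, unless freqmax is -1 ("no upper bound"), at most freqmax.

          ```java
          class FreqFilterDemo {
              // Mirrors the patch's condition: freqmin defaults to 1,
              // freqmax of -1 means no upper bound.
              static boolean keep(int docFreq, int freqmin, int freqmax) {
                  return docFreq >= freqmin && (freqmax == -1 || docFreq <= freqmax);
              }
          }
          ```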

          Grant Ingersoll added a comment -

          Can you supply as a patch with some simple unit tests?

          Khee Chin added a comment -

          As requested. However, I have only done a single test case; it should be trivial to add more test cases in the future.

          Noble Paul added a comment -

          We may need a faster solution for the autoSuggest feature.
          This can be quite slow because we are doing a string compare for each term. Considering that autoSuggest gets very many hits on a typical website, it should not be doing so much processing.

          We should use something like a radix tree (http://en.wikipedia.org/wiki/Radix_tree) to make it efficient, building the tree at startup (and after every commit).

          SOLR-706 can be closed if that is included
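          For illustration, a minimal in-memory character trie, a simplification of the radix tree Noble suggests (single-character edges rather than compressed ones; all names here are hypothetical). Prefix lookup walks to the prefix node, then collects the subtree, instead of scanning every term with startsWith():

          ```java
          import java.util.ArrayList;
          import java.util.List;
          import java.util.Map;
          import java.util.TreeMap;

          class PrefixTrie {
              private final Map<Character, PrefixTrie> children = new TreeMap<>(); // sorted -> sorted output
              private boolean isTerm;

              void add(String term) {
                  PrefixTrie node = this;
                  for (char c : term.toCharArray()) {
                      node = node.children.computeIfAbsent(c, k -> new PrefixTrie());
                  }
                  node.isTerm = true;
              }

              /** Collect all indexed terms starting with the given prefix. */
              List<String> withPrefix(String prefix) {
                  PrefixTrie node = this;
                  for (char c : prefix.toCharArray()) {
                      node = node.children.get(c);
                      if (node == null) return List.of(); // no term has this prefix
                  }
                  List<String> out = new ArrayList<>();
                  node.collect(prefix, out);
                  return out;
              }

              private void collect(String path, List<String> out) {
                  if (isTerm) out.add(path);
                  for (Map.Entry<Character, PrefixTrie> e : children.entrySet()) {
                      e.getValue().collect(path + e.getKey(), out);
                  }
              }
          }
          ```

          As discussed below, the trade-off is memory: the tree must hold every term of the field, so it would need to be optional.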

          Yonik Seeley added a comment -

          lol - didn't take too long for other faceting-like features to pop up (i.e. facet.mincount).
          We really should reuse the facet interface terminology here: limit, mincount, etc.

          Yonik Seeley added a comment -

          Noble: would optionally using something like a radix tree change the external interface? That's what we should be focused most on now in order to enable seamlessly adding optimizations in the future.

          Noble Paul added a comment -

          autosuggest is a very commonly used feature, and when it is used, the hits are just too many.

          we can add an extra option to optimize or not (say memTree=true/false).

          If the option is set we can build the data structure. Potentially this tree can consume a lot of memory if there are too many terms, so users must have an option to turn it off.

          The feature may be added to the faceting component or the TermsComponent. The problem here is that these components are already overloaded with features; adding this small option into them can cause more confusion.

          IMHO we should not pack too many features into one component unless we are sure that they are mostly used together. (For instance, faceting and autosuggest are rarely done as part of one request.) It would be better to write separate components for each piece of functionality. Internally the components can share code, and users can mix and match if they need to.

          Yonik Seeley added a comment -

          Regardless of how optimizations are selected or turned on/off, do you see anything in the current API that we should change now to enable optimization later (or now for all I care)? I'm only asking about the API.

          Noble Paul added a comment -

          No changes in the TermsComponent API (I mean the HTTP API).
          Maybe a config param (in solrconfig.xml).

          Grant Ingersoll added a comment -

          lol - didn't take too long for other faceting like features to pop up (i.e. facet.mincount).

          We really should reuse the facet interface terminology here: limit, mincount, etc.

          Yeah, Yonik, I'm starting to think of this as Term Faceting. I still like rows better than limit, but will change Khee's params to be mincount and maxcount

          Yonik Seeley added a comment -

          I still like rows better than limit,

          So are you advocating deprecating facet.limit and adding facet.rows,
          or that aligning the APIs doesn't matter in this case?

          Grant Ingersoll added a comment -

          Noble, have you done any performance testing of this approach versus the radix tree (or other tree/trie approaches)?

          AIUI, if you do the tree approach, doesn't that mean you need to build the tree from all of the terms in a given field? And then what if you want to go across multiple fields? Seems like that would be a pretty large footprint. In some sense, the Term dictionary in Lucene is already very similar to this structure, except it can't do the character matching you are proposing (but it does encode the terms very efficiently).

          Grant Ingersoll added a comment -

          Committed revision 723985. Thanks, Khee! I slightly changed the patch to use "mincount" and "maxcount" per Yonik's suggestion to overlap w/ faceting.

          Grant Ingersoll added a comment -

          So are you advocating deprecating facet.limit and adding facet.rows,

          or that aligning the APIs doesn't matter in this case?

          I don't know. Does every param need to be consistent? If that is the case, then I guess we should either decide on rows or limit across all of them.

          Otherwise, I mean, all of these end up having a prefix attached to them (i.e. terms.rows), so it may not be a big deal, either. I'm fine either way, I guess. Your call.

          Yonik Seeley added a comment -

          Does every param need to be consistent?

          No... there's certainly no mandate... "rows" and "facet.limit" are established enough now that neither should be changed IMO.
          It just seems like more consistency rather than less is generally a good thing, balanced with other factors of course. It also affects usability - if people think about this more like faceting, they are more likely to type terms.limit by mistake.

          Of course these things are aesthetic and subjective - would be nice to hear from someone else on preferences before a release.

          Yonik Seeley added a comment -

          Another thought I'll just throw out there w/o advocating: could this just be a faceting optimization?

          If the faceting base query is *:*, and one can ignore deleted docs, then the facet counts are equivalent to the term df here. So this could be implemented as an optimization to faceting w/ the addition of a facet.ignoreDeletes parameter. Then the distributed part would already be done (once facet.sort=false is implemented for distributed search).

          Noble Paul added a comment -

          have you done any performance testing of this approach versus the radix tree (or other tree/trie approaches)?

          I haven't done a perf comparison, but it occurred to me as I looked at the code: it goes through each Term one by one and does a startsWith().

          It can be quite expensive for a large number of Terms.

          Memory consumption can be a problem.
          That is why I suggested a config param: the user can make the call as to whether he wants to pay that price and can afford it.

          Grant Ingersoll added a comment -

          Another thought I'll just throw out there w/o advocating: could this just be a faceting optimization?

          If the faceting base query is *:*, and one can ignore deleted docs, then the facet counts are equivalent to the term df here. So this could be implemented as an optimization to faceting w/ the addition of a facet.ignoreDeletes parameter. Then the distributed part would already be done (once facet.sort=false is implemented for distributed search).

          Hmmm, probably true. I tend to think of this as pretty lightweight, but I see no reason not to make your suggested changes in the facet code either.

          Yonik Seeley added a comment -

          I've been reviewing some interfaces in prep for the 1.4 release.... Here's a patch to this request handler:

          • fixes the labels for non-text fields (makes them human readable)
          • adds a terms.raw parameter just in case you really do want the raw internal indexed form of the term
          • changes terms.rows to terms.limit to match faceting (as predicted, people want to sort by freq too, so this is much closer to faceting than anything else)
          • changes the name of the request handler in the example schema to the more generic /terms (from /autoSuggest). I could see this becoming a standard useful request handler, and limiting it to /autoSuggest doesn't sound right

          I'll commit shortly barring objections.

          Matt Weber added a comment -

          I wrote a patch for freq. sorting that is attached to SOLR-1156. I will update that patch once you commit your latest changes.

          Yonik Seeley added a comment -

          Committed latest changes, will update wiki shortly.

          Yonik Seeley added a comment -

          Hmmm, looking at the wiki examples, I think there are some more quick improvements we can do...

          From the wiki:

          To use in auto-suggest, add in a lower bound, an upper bound and make the lower bound exclusive of the input term, as in: http://localhost:8983/solr/terms?terms.fl=name&terms.lower=at&terms.prefix=at&terms.lower.incl=false&terms.upper=b&indent=true

          Unless I'm missing something, it doesn't make sense to exclude the lower bound... seems like it would often be useful to know if what the user typed in matched a term in the index.... excluding it would lead one to believe that it doesn't exist.

          But the improvement is that one should simply be able to specify a prefix:
          http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=at

          The rest should be implementation details in Solr to make it efficient (i.e. Solr should know to start at "at" in the TermEnum if that's the prefix.)
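          The "start at the prefix" idea can be sketched against a plain sorted list (illustration only, with hypothetical names; in Lucene 2.x the seek happens inside IndexReader.terms(Term), which positions a TermEnum at the first term >= the given one):

          ```java
          import java.util.Collections;
          import java.util.List;

          class PrefixSeekDemo {
              /**
               * Binary-search to the first term >= prefix, then scan forward
               * while terms still start with the prefix - so a prefix query
               * never visits terms before the prefix.
               */
              static List<String> termsWithPrefix(List<String> sortedTerms, String prefix) {
                  int i = Collections.binarySearch(sortedTerms, prefix);
                  if (i < 0) i = -i - 1; // insertion point: first term >= prefix
                  int start = i;
                  while (i < sortedTerms.size() && sortedTerms.get(i).startsWith(prefix)) i++;
                  return sortedTerms.subList(start, i);
              }
          }
          ```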

          Yonik Seeley added a comment -

          Found another bug: null pointer exception if you try a term enum past the end of the index (where lucene will return null from the term enum):
          http://localhost:8983/solr/terms?terms.fl=zzz_s

          Grant Ingersoll added a comment -

          Unless I'm missing something, it doesn't make sense to exclude the lower bound... seems like it would often be useful to know if what the user typed in matched a term in the index.... excluding it would lead one to believe that it doesn't exist.

          I think excluding the lower bound allows you to get the next item for suggestion, but I suppose it is up to the app to decide whether they want to confirm the existing word, or just suggest what could come next.

          Yonik Seeley added a comment -

          OK, I just committed a fix for the null pointer exceptions, changed to use intern'd comparisons, and a little restructuring. Hopefully that's it.

          Grant Ingersoll added a comment -

          Bulk close for Solr 1.4


  People

  • Assignee: Yonik Seeley
  • Reporter: Grant Ingersoll
  • Votes: 1
  • Watchers: 0
