Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10
    • Component/s: query parsers
    • Labels:
      None

      Description

      Some applications require filtering documents by a large number of terms. It's often related to security filtering. Naively this is done this way:

          fq={!df=myfield q.op=OR}code1 code2 code3 code4 code5...
      

      And this ends up being a BooleanQuery. Users then wind up hitting BooleanQuery.maxClauseCount (sometimes in production, sadly) and they raise it to a huge number to get the job done.

      Solr should offer a QParser based on TermsFilter. I propose it be named "terms" (plural of term), and have a "separator" option defaulting to a space. When it's a space, the values also get trimmed, which wouldn't otherwise happen. The analysis logic should be the same as that for "term" QParser which is to call FieldType.readableToIndexed.
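To make the proposed separator behaviour concrete, here is a minimal Python sketch of the splitting rule as worded above. It is only an illustration of the description, not the Solr implementation; the function name `split_terms` is hypothetical, and returning an empty list for an empty query string is an assumption.

```python
def split_terms(raw, separator=" "):
    """Split a raw query string into term values.

    Assumption: this mirrors the wording of the proposal only. A space
    separator splits on any run of whitespace and trims the values; any
    other separator splits literally and leaves the values untrimmed.
    """
    if separator == " ":
        # str.split() with no argument splits on runs of whitespace and
        # ignores leading and trailing whitespace.
        return raw.split()
    if raw == "":
        # Assumption: an empty query string yields no values at all.
        return []
    return raw.split(separator)
```

With a non-space separator the values keep their surrounding whitespace, so `split_terms("a, b,c", separator=",")` keeps the leading space on the second value.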

        Activity

        David Smiley added a comment -

        I've read somewhere (in the Lucene source, I forget) that BooleanQuery was shown to be faster than TermsFilter when the number of terms is less than some number, based on a bunch of assumptions of course. It would be nice to have a threshold option to switch between BooleanQuery & TermsFilter. I've also seen a suggestion that TermsFilter should use or be replaced by AutomatonQuery LUCENE-3893. It would be easy to use any of these options.

        David Smiley added a comment -

        Here it is, with test.
        From the javadoc:

        Finds documents whose specified field has any of the specified values. It's like TermQParserPlugin but multi-valued, and supports a variety of internal algorithms. Parameters:

        • f: The field name (mandatory)
        • separator: the separator delimiting the values in the query string. By default it's a " " which is special in that it splits on any consecutive whitespace.
        • method: Any of termsFilter (default), booleanQuery, automaton, docValuesTermsFilter.

        Note that if no values are specified then the query matches no documents.

        It would be cool if somebody did some benchmarking that would allow us to choose between some of the algorithms based on heuristics... but this is fine for now. For example, use method=X when the number of values is > some value, and use docValuesTermsFilter if docValues is enabled. Note that DocValuesTermsFilter (trunk) is known as FieldCacheTermsFilter on 4x. On 4x this feature doesn't support DocValues (just FieldCache), whereas on trunk it supports both depending on whether you indexed DocValues or not (I think). That method is also limited to single-valued fields, but there's no explicit check.

        I'll commit this in a couple days, pending input.
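The heuristic floated in this comment could look roughly like the Python sketch below. Everything numeric here is an assumption: no benchmark in this issue supplies a threshold, so `small_threshold` is a placeholder, and the function name `choose_method` is hypothetical. Only the method names come from the javadoc quoted above.

```python
def choose_method(num_values, has_doc_values=False, multi_valued=False,
                  small_threshold=16):
    """Pick a value for the parser's `method` parameter.

    Assumption: small_threshold is a placeholder, not a benchmarked value.
    """
    # docValuesTermsFilter is described as limited to single-valued fields,
    # so only pick it when the field qualifies.
    if has_doc_values and not multi_valued:
        return "docValuesTermsFilter"
    # BooleanQuery is said to be faster below "some number" of terms.
    if num_values < small_threshold:
        return "booleanQuery"
    # Otherwise fall back to the parser's default method.
    return "termsFilter"
```

A real cut-over point would have to come from measurements on representative data.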

        ASF subversion and git services added a comment -

        Commit 1616558 from David Smiley in branch 'dev/trunk'
        [ https://svn.apache.org/r1616558 ]

        SOLR-6318: New terms QParser

        ASF subversion and git services added a comment -

        Commit 1616559 from David Smiley in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1616559 ]

        SOLR-6318: New terms QParser

        Yonik Seeley added a comment -

        Heh - I recently committed a TermsQParserPlugin to Heliosearch... its syntax is

        {!terms f=myfield}term1,term2,term3
        

        Before the paint dries on this too much, perhaps we should set the default separator to ","?

        • comma separated lists of things (ids, terms, etc) are much more frequently used in Solr in general, and I think we should try to standardize on this
        • comma results in nicer to read URLs since they don't get URL encoded
        • comma works better for embedded queries in lucene syntax:
          A OR {!terms f=myfield}term1,term2,term3 OR C
          

        The few things that are not comma separated now constantly trip me up... but there are only a few of them (like edismax qf). I'm forever writing qf=field1,field2 instead of qf=field1 field2
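As an illustration of the comma-separated form, a client might assemble the filter query as in the Python sketch below. The helper name `terms_fq` and the choice to reject values containing a comma are assumptions made for this example; the `{!terms f=...}` syntax itself is the one shown in this comment.

```python
from urllib.parse import urlencode


def terms_fq(field, values):
    """Return an fq value such as {!terms f=myfield}term1,term2,term3."""
    for value in values:
        # Assumption: a value containing the separator cannot be expressed
        # in this form, so refuse it instead of silently splitting it.
        if "," in value:
            raise ValueError("value contains the separator: %r" % value)
    return "{!terms f=%s}%s" % (field, ",".join(values))


fq = terms_fq("myfield", ["term1", "term2", "term3"])
# urlencode percent-encodes the parameter values for use in a request URL.
query_string = urlencode({"q": "*:*", "fq": fq})
```

The same `fq` string can be embedded inside a larger lucene-syntax query, as in the `A OR ... OR C` example above.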

        David Smiley added a comment -

        Works for me Yonik. I'll make the change now.

        ASF subversion and git services added a comment -

        Commit 1616609 from David Smiley in branch 'dev/trunk'
        [ https://svn.apache.org/r1616609 ]

        SOLR-6318: New terms QParser defaults to comma delimited now

        ASF subversion and git services added a comment -

        Commit 1616610 from David Smiley in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1616610 ]

        SOLR-6318: New terms QParser defaults to comma delimited now

        Yonik Seeley added a comment -

        The performance I got for id filters (term queries on the id field) varied from being 4 times faster to almost 9 times faster.
        I was only able to test up to 100K ids though... when I tried 1M, something failed in Jetty I think (maybe just hit the POST limit...)
        http://heliosearch.org/solr-terms-query/

        David Smiley added a comment -

        Cool Yonik Seeley; thanks for spending the time benchmarking it. Could you try some of the other methods supported besides termsFilter: method=automaton and method=docValuesTermsFilter

        Yonik Seeley added a comment -

        Could you try some of the other methods supported besides termsFilter: method=automaton and method=docValuesTermsFilter

        Good idea - I forgot to look into that. I'll put it on the queue...


          People

          • Assignee:
            David Smiley
            Reporter:
            David Smiley
          • Votes:
            0
            Watchers:
            7
