Lucene - Core / LUCENE-6539

Add DocValuesNumbersQuery, like DocValuesTermsQuery but works only with long values

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This query accepts any document where any of the provided set of longs
      was indexed into the specified field as a numeric DV field
      (NumericDocValuesField or SortedNumericDocValuesField). You can use
      it instead of DocValuesTermsQuery when you have field values that can
      be represented as longs.

      Like DocValuesTermsQuery, this is slowish in general, since it doesn't
      use an inverted data structure, but in certain cases (many
      terms/numbers and fewish matching hits) it should be faster than using
      TermsQuery because it's done as a "post filter" when other (faster)
      query clauses are MUST'd with it.

      In such cases it should also be faster than DocValuesTermsQuery since
      it skips having to resolve terms -> ords.
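The matching rule described above can be sketched without Lucene as a forward lookup plus a set-membership check. This is only a toy model: the class name `NumbersFilterSketch`, the `long[][]` layout standing in for doc values, and the sample data are all assumptions made for illustration, not the code from the patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy model of a doc-values "numbers" filter; hypothetical, not Lucene code. */
class NumbersFilterSketch {

    /**
     * values[docId] holds the longs stored for that document. A single entry
     * models a NumericDocValuesField, several entries model a
     * SortedNumericDocValuesField, and an empty array models a missing value.
     * A document matches if ANY of its values is in the wanted set.
     */
    static boolean matches(long[][] values, int docId, Set<Long> wanted) {
        for (long v : values[docId]) {
            if (wanted.contains(v)) { // v is boxed to Long for the lookup
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[][] values = { {7L}, {3L, 42L}, {}, {100L} };
        Set<Long> wanted = new HashSet<>(Arrays.asList(42L, 100L));
        for (int doc = 0; doc < values.length; doc++) {
            System.out.println("doc " + doc + " matches: " + matches(values, doc, wanted));
        }
    }
}
```

Note that no term dictionary is consulted: the check works directly on the stored longs, which is the "skips having to resolve terms -> ords" point made above.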

      1. LUCENE-6539.patch
        70 kB
        Michael McCandless
      2. LUCENE-6539.patch
        13 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Initial rough patch ... test is passing.

        Adrien Grand added a comment -

        This new query looks good to me. However, instead of continuing to add such queries to core, I think we should consider moving all our doc values queries to misc, since they have complicated trade-offs and are only useful in expert use-cases?

        +  private static Set<Long> toSet(Long[] array) {
        +    Set<Long> numbers = new HashSet<>();
        +    for(Long number : array) {
        +      numbers.add(number);
        +    }
        +    return numbers;
        +  }
        

        FYI you don't need this helper and could do just: new HashSet<Long>(Arrays.asList(array)).
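        A quick way to check that the suggested one-liner is a drop-in replacement for the helper (the class and method names here are hypothetical, chosen only for this comparison):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Compares the patch's loop helper with the suggested one-liner. */
class ToSetSketch {

    // The helper as it appears in the patch excerpt above.
    static Set<Long> toSetLoop(Long[] array) {
        Set<Long> numbers = new HashSet<>();
        for (Long number : array) {
            numbers.add(number);
        }
        return numbers;
    }

    // The suggested replacement: copy a List view of the array into a HashSet.
    static Set<Long> toSetOneLiner(Long[] array) {
        return new HashSet<Long>(Arrays.asList(array));
    }

    public static void main(String[] args) {
        Long[] ids = { 5L, 17L, 5L, 99L };
        // Both sets contain the same elements; duplicates collapse either way.
        System.out.println(toSetLoop(ids).equals(toSetOneLiner(ids)));
    }
}
```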

        in certain cases (many terms/numbers and fewish matching hits) it should be faster than using TermsQuery

        This comment got me confused: I think in general these queries are more efficient when they match many documents, i.e. even when an equivalent TermsQuery would not be used as a lead iterator in a conjunction? I think the only case where such a query matching few documents would be useful is in a prohibited clause, since prohibited clauses can never be used to lead iteration anyway and are only used in a random-access fashion?

        Robert Muir added a comment -

        I don't think this query should be a standalone one. It forces users to decide which one to use, and they will fuck this up.

        every time.

        It's OK in its current form to go to sandbox, but I think this needs to be integrated into the inverted approach so that, based on index stats, Lucene can just do the right thing.

        Michael McCandless added a comment -

        new HashSet<Long>(Arrays.asList(array)).

        Good, I'll fix.

        However, instead of continuing to add such queries to core, I think we should consider moving all our doc values queries to misc, since they have complicated trade-offs and are only useful in expert use-cases?

        +1, I can move them here.

        in certain cases (many terms/numbers and fewish matching hits) it should be faster than using TermsQuery

        This comment got me confused: I think in general these queries are more efficient when they match many documents, i.e. even when an equivalent TermsQuery would not be used as a lead iterator in a conjunction? I think the only case where such a query matching few documents would be useful is in a prohibited clause, since prohibited clauses can never be used to lead iteration anyway and are only used in a random-access fashion?

        Hmm, this is hard to think about, but yes: I was thinking of the use case where some other MUST'd clause is the primary one and this query is a MUST_NOT over a big list of numeric IDs.

        The per-hit cost is higher with these DocValuesXXX queries (the forward lookup + check) vs. visiting postings and ORing bitsets, which is what TermsQuery does (when there are enough terms), but the setup cost is higher with TermsQuery since it must look up many terms across N segments, which is why I thought "not matching too many total hits" would favor DocValuesXXXQuery with a large number of terms.

        E.g. in the extreme case where you pass a single term to your TermsQuery or DocValuesTermsQuery, matching many docs, and it's the primary (only) clause in the query, TermsQuery should be much faster.
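        The setup-cost vs. per-hit-cost trade-off can be made concrete with a toy count of work units. Everything in this sketch is an assumption for illustration: the class `FilterCostSketch`, the single-segment in-memory "index", and the idea that one term lookup, one posting visit and one forward check each cost one unit. It is not a benchmark and not how Lucene measures cost.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Counts work units for an inverted strategy vs. a forward post-filter. */
class FilterCostSketch {

    /** Inverted: look up every number, OR its postings into a bitset. Returns {lookups, postingsVisited}. */
    static long[] invertedWork(Map<Long, int[]> postings, Set<Long> numbers, BitSet out) {
        long lookups = 0, visited = 0;
        for (Long n : numbers) {
            lookups++; // one dictionary lookup per number (per segment in a real index)
            int[] docs = postings.get(n);
            if (docs == null) continue;
            for (int d : docs) { out.set(d); visited++; }
        }
        return new long[] { lookups, visited };
    }

    /** Forward: check only the docs produced by the lead clause. Returns the number of per-doc checks. */
    static long forwardWork(long[] valueByDoc, int[] leadDocs, Set<Long> numbers, BitSet out) {
        long checks = 0;
        for (int d : leadDocs) {
            checks++; // forward lookup + set membership test
            if (numbers.contains(valueByDoc[d])) out.set(d);
        }
        return checks;
    }

    public static void main(String[] args) {
        int maxDoc = 1000;
        long[] valueByDoc = new long[maxDoc];
        Map<Long, int[]> postings = new HashMap<>();
        for (int i = 0; i < maxDoc; i++) { // each doc carries a unique numeric ID
            valueByDoc[i] = i;
            postings.put((long) i, new int[] { i });
        }
        Set<Long> numbers = new HashSet<>();
        for (int i = 0; i < 500; i++) numbers.add((long) i); // big list of IDs
        int[] leadDocs = { 3, 700, 999 }; // few hits from the other MUST clause

        long[] inv = invertedWork(postings, numbers, new BitSet(maxDoc));
        long fwd = forwardWork(valueByDoc, leadDocs, numbers, new BitSet(maxDoc));
        System.out.println("inverted: lookups=" + inv[0] + " postings=" + inv[1]);
        System.out.println("forward: checks=" + fwd);
    }
}
```

        With many numbers and few lead hits, the forward strategy does far less work in this model; with a single number matching many docs as the only clause, the balance flips, since the forward strategy would have to check every document.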

        It's OK in its current form to go to sandbox, but I think this needs to be integrated into the inverted approach so that, based on index stats, Lucene can just do the right thing.

        OK, or I can just WONTFIX this ... I just thought there are use cases where this post-filter approach would be much faster than the choices we have today, e.g. when an app has numeric IDs and wants to make big "NOT in list" clauses.

        I agree it would be better if we had only TermsQuery, and then it would figure out which strategy is best (use doc values, use numeric doc values if IDs are really numeric, use postings) depending on index stats, whether the clause is primary or not, etc... but this seems very tricky: I can't even properly think about the cases, see Adrien's comment above.

        Adrien Grand added a comment -

        OK, or I can just WONTFIX this

        I think you should commit it: it is a missing piece today, since you can do this on SORTED or SORTED_SET but not on NUMERIC or SORTED_NUMERIC, even though this new query is cheaper. Let's put it into sandbox if we want to be "safe"?

        Agreed that integration with TermsQuery would be wonderful, but I also see challenges on the way.

        Michael McCandless added a comment -

        OK, how about this: for this issue, I move DocValuesTermsQuery, DocValuesRangeQuery and this new one (DocValuesNumbersQuery) to sandbox, add warnings / mark them experimental, and commit there?

        I think it would be wonderful if TermsQuery did all this magically, but I don't think it should hold up adding this query.

        Adrien Grand added a comment -

        +1

        Michael McCandless added a comment -

        New patch, moving these 3 "typically slow" queries to sandbox. I think it's ready...

        Adrien Grand added a comment -

        +1

        ASF subversion and git services added a comment -

        Commit 1685540 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1685540 ]

        LUCENE-6539: Add DocValuesNumbersQuery

        ASF subversion and git services added a comment -

        Commit 1685585 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1685585 ]

        LUCENE-6539: Add DocValuesNumbersQuery

        ASF subversion and git services added a comment -

        Commit 1685597 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1685597 ]

        LUCENE-6539: Intellij config: add sandbox module dependency to solr-core and solr-analysis-extras modules

        ASF subversion and git services added a comment -

        Commit 1685598 from Steve Rowe in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1685598 ]

        LUCENE-6539: Intellij config: add sandbox module dependency to solr-core and solr-analysis-extras modules (merged trunk r1685597)

        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            4
