Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 4.9, 5.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently CommonGramsFilter expects users to remove the common words around which output token ngrams are formed, by appending a StopFilter to the analysis pipeline. This is inefficient in two ways: captureState() is called on (trailing) stopwords, and then the whole stream has to be re-examined by the following StopFilter.

      The current ctor should be deprecated, and another ctor added with a boolean option controlling whether the common words should be output as unigrams.

      If common words are configured to be output as unigrams, captureState() will still need to be called, as it is now.

      If the common words are not configured to be output as unigrams, rather than calling captureState() for the trailing token in each output token ngram, the term text, position and offset can be maintained in the same way as they are now for the leading token: using a System.arrayCopy()'d term buffer and a few ints for positionIncrement and offsetd. The user then no longer would need to append a StopFilter to the analysis chain.

      An example illustrating both possibilities should also be added.

      1. commit-6402a55.patch
        23 kB
        Itamar Syn-Hershko

        Activity

        Hide
        Robert Muir added a comment -

        +1, this would be a great improvement.

        there are two basic use cases (that I see):

        1. you still aren't ever removing any stopwords, but using this solely to speed up phrase queries by forming bigrams around the common terms.
        2. you are using commongrams+stopfilter as a "stopfilter replacement", which gives a more reasonable index size, the relevance benefits of stopwords, but a user can always refine the query with double quotes and the stopwords are taken into consideration, and fast.

        the latter case currently requires you to also use a stopfilter, but it means we are doing needless captureState very very often (by definition, on common terms!). It also means you are specifying your stopwords list twice, and hashing two chararraysets, etc. So it would be nice to add the boolean and accelerate case #2.

        Show
        Robert Muir added a comment - +1, this would be a great improvement. there are two basic use cases (that I see): you still aren't ever removing any stopwords, but using this solely to speed up phrase queries by forming bigrams around the common terms. you are using commongrams+stopfilter as a "stopfilter replacement", which gives a more reasonable index size, the relevance benefits of stopwords, but a user can always refine the query with double quotes and the stopwords are taken into consideration, and fast. the latter case currently requires you to also use a stopfilter, but it means we are doing needless captureState very very often (by definition, on common terms!). It also means you are specifying your stopwords list twice, and hashing two chararraysets, etc. So it would be nice to add the boolean and accelerate case #2.
        Hide
        Jason Rutherglen added a comment -

        you still aren't ever removing any stopwords, but using this solely to speed up phrase queries by forming bigrams around the common terms.

        Isn't ShingleFilter for that case?

        Show
        Jason Rutherglen added a comment - you still aren't ever removing any stopwords, but using this solely to speed up phrase queries by forming bigrams around the common terms. Isn't ShingleFilter for that case?
        Hide
        Robert Muir added a comment -

        No, ShingleFilter forms bigrams around all terms, not just common ones.

        Show
        Robert Muir added a comment - No, ShingleFilter forms bigrams around all terms, not just common ones.
        Hide
        Steve Rowe added a comment -

        you still aren't ever removing any stopwords, but using this solely to speed up phrase queries by forming bigrams around the common terms.

        Isn't ShingleFilter for that case?

        On the index side: ShingleFilter generates token ngrams for all input tokens, not just those around and including common words, so although it could be used to speed up phrase queries, it would be at the expense of a much larger term dicitionary.

        On the query side: ShingleFilter could be a useful replacement for CommonGramsQueryFilter if you don't have access to the list of words used by CommonGramsFilter on the index side.

        Show
        Steve Rowe added a comment - you still aren't ever removing any stopwords, but using this solely to speed up phrase queries by forming bigrams around the common terms. Isn't ShingleFilter for that case? On the index side: ShingleFilter generates token ngrams for all input tokens, not just those around and including common words, so although it could be used to speed up phrase queries, it would be at the expense of a much larger term dicitionary. On the query side: ShingleFilter could be a useful replacement for CommonGramsQueryFilter if you don't have access to the list of words used by CommonGramsFilter on the index side.
        Hide
        Itamar Syn-Hershko added a comment -

        Adding option to CommonGramsFilter to not unigram common words

        Show
        Itamar Syn-Hershko added a comment - Adding option to CommonGramsFilter to not unigram common words
        Hide
        Itamar Syn-Hershko added a comment -

        Attached is a patch to fix this, including tests. There is no regression, and the new behavior when keepOrig is set to true is as described in the comments here.

        The only thing I wasn't sure about was CommonGramsQueryFilter - should it be deprecated? or how should it be made to work with this change?

        Show
        Itamar Syn-Hershko added a comment - Attached is a patch to fix this, including tests. There is no regression, and the new behavior when keepOrig is set to true is as described in the comments here. The only thing I wasn't sure about was CommonGramsQueryFilter - should it be deprecated? or how should it be made to work with this change?
        Hide
        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Show
        Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
        Hide
        Uwe Schindler added a comment -

        Move issue to Lucene 4.9.

        Show
        Uwe Schindler added a comment - Move issue to Lucene 4.9.
        Hide
        Itamar Syn-Hershko added a comment -

        Can anyone review and comment?

        Show
        Itamar Syn-Hershko added a comment - Can anyone review and comment?

          People

          • Assignee:
            Unassigned
            Reporter:
            Steve Rowe
          • Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development