Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 4.9, 5.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently, CommonGramsFilter expects users to remove the common words around which output token n-grams are formed by appending a StopFilter to the analysis pipeline. This is inefficient in two ways: captureState() is called on (trailing) stopwords, and the whole stream then has to be re-examined by the following StopFilter.

      The current ctor should be deprecated, and another ctor added with a boolean option controlling whether the common words should be output as unigrams.

      If common words are configured to be output as unigrams, captureState() will still need to be called, as it is now.

      If the common words are not configured to be output as unigrams, then rather than calling captureState() for the trailing token in each output token n-gram, the term text, position and offsets can be maintained in the same way as they are now for the leading token: using a term buffer copied with System.arraycopy() and a few ints for the positionIncrement and offsets. The user would then no longer need to append a StopFilter to the analysis chain.
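
The buffer-copy bookkeeping described above could look roughly like the following. This is a minimal sketch in plain Java, not code from the attached patch or from Lucene itself: the class name SavedToken and all of its members are hypothetical, and it only shows how a reusable char[] plus a few ints can stand in for a full captureState() snapshot.

```java
/** Hypothetical holder for one token's term text, position increment and offsets. */
final class SavedToken {
    private char[] buffer = new char[16]; // reused across tokens, grown on demand
    private int length;
    private int positionIncrement;
    private int startOffset;
    private int endOffset;

    /** Copies the token's values so that later changes to termBuffer do not affect them. */
    void copyFrom(char[] termBuffer, int termLength, int posInc, int start, int end) {
        if (buffer.length < termLength) {
            // Grow geometrically so repeated long terms do not reallocate every time.
            buffer = new char[Math.max(termLength, buffer.length * 2)];
        }
        System.arraycopy(termBuffer, 0, buffer, 0, termLength);
        length = termLength;
        positionIncrement = posInc;
        startOffset = start;
        endOffset = end;
    }

    String term() { return new String(buffer, 0, length); }
    int positionIncrement() { return positionIncrement; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }
}
```

Because the saved copy owns its own buffer, the upstream token stream is free to overwrite its shared term buffer when it advances to the next token.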

      An example illustrating both possibilities should also be added.
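
As a rough illustration of the two behaviours, the sketch below simulates them on plain string lists rather than on a real TokenStream. It assumes bigrams are formed for every adjacent pair in which either token is a common word and are joined with an underscore; the class CommonGramsSketch, the method commonGrams and the flag keepCommonUnigrams are hypothetical names, and position increments and offsets are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CommonGramsSketch {

    /**
     * Emits each token followed by a bigram whenever the token or its successor
     * is a common word. Common-word unigrams are dropped when
     * keepCommonUnigrams is false (hypothetical flag).
     */
    static List<String> commonGrams(List<String> tokens,
                                    Set<String> commonWords,
                                    boolean keepCommonUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            boolean currentIsCommon = commonWords.contains(current);
            if (keepCommonUnigrams || !currentIsCommon) {
                out.add(current);
            }
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (currentIsCommon || commonWords.contains(next)) {
                    out.add(current + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("rain", "of", "terror");
        Set<String> common = new HashSet<>(Arrays.asList("of"));
        // Unigrams kept: [rain, rain_of, of, of_terror, terror]
        System.out.println(commonGrams(tokens, common, true));
        // Unigrams dropped: [rain, rain_of, of_terror, terror]
        System.out.println(commonGrams(tokens, common, false));
    }
}
```

With the flag set to false, the output matches what a trailing StopFilter would leave behind, without a second pass over the stream.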

      1. commit-6402a55.patch
        23 kB
        Itamar Syn-Hershko

        Activity

        Uwe Schindler made changes -
        Fix Version/s 4.9 [ 12326730 ]
        Fix Version/s 5.0 [ 12321663 ]
        Fix Version/s 4.8 [ 12326269 ]
        David Smiley made changes -
        Fix Version/s 4.8 [ 12326269 ]
        Fix Version/s 4.7 [ 12325572 ]
        Simon Willnauer made changes -
        Fix Version/s 4.7 [ 12325572 ]
        Fix Version/s 4.6 [ 12324999 ]
        Adrien Grand made changes -
        Fix Version/s 4.6 [ 12324999 ]
        Fix Version/s 5.0 [ 12321663 ]
        Fix Version/s 4.5 [ 12324742 ]
        Steve Rowe made changes -
        Fix Version/s 5.0 [ 12321663 ]
        Fix Version/s 4.5 [ 12324742 ]
        Fix Version/s 4.4 [ 12324323 ]
        Uwe Schindler made changes -
        Fix Version/s 4.4 [ 12324323 ]
        Fix Version/s 4.3 [ 12324143 ]
        Robert Muir made changes -
        Fix Version/s 4.3 [ 12324143 ]
        Fix Version/s 5.0 [ 12321663 ]
        Fix Version/s 4.2 [ 12323899 ]
        Mark Miller made changes -
        Fix Version/s 4.2 [ 12323899 ]
        Fix Version/s 4.1 [ 12321140 ]
        Mark Miller made changes -
        Fix Version/s 5.0 [ 12321663 ]
        Itamar Syn-Hershko made changes -
        Attachment commit-6402a55.patch [ 12562327 ]
        Robert Muir made changes -
        Fix Version/s 4.1 [ 12321140 ]
        Fix Version/s 4.0 [ 12314025 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12563912 ] jira [ 12585411 ]
        Mark Thomas made changes -
        Workflow jira [ 12541293 ] Default workflow, editable Closed status [ 12563912 ]
        Robert Muir made changes -
        Field Original Value New Value
        Fix Version/s 3.1 [ 12314822 ]
        Steve Rowe created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Steve Rowe
          • Votes:
            1
            Watchers:
            3
