Lucene - Core / LUCENE-7619

Add WordDelimiterGraphFilter

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.5, 7.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Currently, WordDelimiterFilter (WDF) doesn't try to set the posLen attribute (PositionLengthAttribute), so it creates graphs like this (see attachment before.png):

      but with this patch (still a work in progress) it creates this graph instead (see attachment after.png):

      This means that, today, positional queries are buggy when WDF is used at search time. Since LUCENE-7603 is now fixed, with this change you should be able to use positional queries with WordDelimiterGraphFilter (WDGF).

      I'm also trying to produce holes properly (this removes logic from the current WDF that swallows a hole when the whole token is just delimiters).
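
      As a rough illustration of what "swallowing a hole" means, here is a minimal sketch in plain Java. It is not the filter's code: the class HoleSketch, its methods, and the sample tokens are hypothetical, and it assumes every incoming token has a position increment of 1. When a delimiter-only token is dropped, its increment is either discarded (the hole is swallowed) or carried over to the next surviving token (the hole is preserved).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model; it does not use or reproduce Lucene classes.
class HoleSketch {

    // True if the term has no letters or digits, i.e. it is "just delimiters".
    static boolean onlyDelimiters(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (Character.isLetterOrDigit(term.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Returns the absolute position of each surviving token.
    static int[] positions(String[] terms, boolean preserveHoles) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;        // the first increment moves to position 0
        int pendingInc = 0;  // increments inherited from dropped tokens
        for (String term : terms) {
            int posInc = 1 + pendingInc; // assumption: each input token has posInc 1
            pendingInc = 0;
            if (onlyDelimiters(term)) {
                if (preserveHoles) {
                    pendingInc = posInc; // carry the gap forward
                }
                continue; // the token itself is never emitted
            }
            pos += posInc;
            out.add(pos);
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        String[] terms = {"foo", "--", "bar"};
        System.out.println(Arrays.toString(positions(terms, false))); // prints [0, 1]
        System.out.println(Arrays.toString(positions(terms, true)));  // prints [0, 2]
    }
}
```

      With the hole preserved, "bar" stays two positions after "foo", so an exact phrase query for "foo bar" would no longer match across the dropped token.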

      Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike e.g. SynonymGraphFilter) because it's already creating the necessary new positions, and its output graph never has side paths, except for single tokens that skip nodes because they have posLen > 1. I.e. the only fix to make, I think, is to set posLen properly. And it really helps that it does its own "new token buffering + sorting" already.
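
      To make the posLen point concrete, here is a minimal sketch in plain Java of how a token's position increment and position length map to an arc in the token graph. The class TokenGraphSketch, its Token type, and the sample tokens for "PowerShot" are hypothetical; the exact tokens and their order depend on the filter's flags, so treat this as a model rather than actual WDF or WDGF output.

```java
// Hypothetical, simplified model; it does not use or reproduce Lucene classes.
class TokenGraphSketch {

    // Stand-in for a token's term, position increment, and position length.
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    // Each token becomes an arc from its start node to (start node + posLen).
    static String[] arcs(Token... tokens) {
        String[] out = new String[tokens.length];
        int from = -1; // the first increment moves to node 0
        for (int i = 0; i < tokens.length; i++) {
            from += tokens[i].posInc;
            int to = from + tokens[i].posLen;
            out[i] = from + " -> " + to + ": " + tokens[i].term;
        }
        return out;
    }

    public static void main(String[] args) {
        // posLen left at 1 everywhere: the catenated token ends at node 1,
        // so the graph wrongly allows the path "PowerShot Shot".
        for (String arc : arcs(new Token("PowerShot", 1, 1),
                               new Token("Power", 0, 1),
                               new Token("Shot", 1, 1))) {
            System.out.println(arc);
        }
        // prints:
        // 0 -> 1: PowerShot
        // 0 -> 1: Power
        // 1 -> 2: Shot

        // posLen set to 2 on the catenated token: it now spans both parts.
        for (String arc : arcs(new Token("PowerShot", 1, 2),
                               new Token("Power", 0, 1),
                               new Token("Shot", 1, 1))) {
            System.out.println(arc);
        }
        // prints:
        // 0 -> 2: PowerShot
        // 0 -> 1: Power
        // 1 -> 2: Shot
    }
}
```

      Only posLen differs between the two runs; the positions are identical, which matches the observation that the filter already creates the necessary positions and only needs to set posLen.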

      Attachments

        1. LUCENE-7619.patch
          165 kB
          Michael McCandless
        2. LUCENE-7619.patch
          162 kB
          Michael McCandless
        3. LUCENE-7619.patch
          124 kB
          Michael McCandless
        4. after.png
          41 kB
          Michael McCandless
        5. before.png
          37 kB
          Michael McCandless

          People

            Assignee: mikemccand Michael McCandless
            Reporter: mikemccand Michael McCandless
            Votes: 1
            Watchers: 9

            Dates

              Created:
              Updated:
              Resolved: