  1. Lucene - Core
  2. LUCENE-7619

Add WordDelimiterGraphFilter


    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.5, 7.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New


      Currently, WordDelimiterFilter doesn't set the posLen (position length) attribute, so it creates graphs like this (see the attached before.png):

      With this patch (still a work in progress) it creates this graph instead (see the attached after.png):
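
      To make the difference concrete, here is a rough sketch of how posInc and posLen turn a token stream into graph arcs. It is a toy model, not Lucene's API: the class `TokenArcsSketch`, the method `arcs`, and the sample tokens for an input like "wi-fi" are all assumptions chosen for illustration, and the real filter's output depends on its configuration flags.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a token stream (hypothetical, not Lucene's API).
final class TokenArcsSketch {

    // Turns parallel (term, posInc, posLen) arrays into "term:from->to" arcs.
    static List<String> arcs(String[] terms, int[] posInc, int[] posLen) {
        List<String> out = new ArrayList<>();
        int pos = -1; // position before the first token
        for (int i = 0; i < terms.length; i++) {
            pos += posInc[i];         // node where this token starts
            int to = pos + posLen[i]; // node where this token ends
            out.add(terms[i] + ":" + pos + "->" + to);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed tokens for "wi-fi": the catenated form plus the two parts.
        String[] terms = {"wifi", "wi", "fi"};
        int[] posInc = {1, 0, 1};
        // posLen left at 1 everywhere: "wifi" ends at the same node as "wi".
        System.out.println(arcs(terms, posInc, new int[] {1, 1, 1}));
        // posLen set to 2 on the catenated token: "wifi" spans both parts.
        System.out.println(arcs(terms, posInc, new int[] {2, 1, 1}));
    }
}
```

      In the first call "wifi" and "wi" occupy the same arc, so "fi" appears to follow "wifi"; in the second, "wifi" reaches the same end node as "fi".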

      This means that, today, positional queries are buggy when WordDelimiterFilter (WDF) is used at search time. Since LUCENE-7603 is fixed, with this change you should be able to use positional queries with WordDelimiterGraphFilter (WDGF).
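
      One way to see why positional queries go wrong is to enumerate the term sequences a graph admits. The sketch below is only a toy path walker, not how Lucene's query parsers actually consume a graph token stream; `GraphPathsSketch`, its methods, and the sample arcs for an input like "wi-fi network" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy path enumeration over a token graph (hypothetical, not Lucene's API).
final class GraphPathsSketch {

    // arcs[i] = {fromNode, toNode}, labeled by terms[i]; every arc must go forward.
    static List<String> paths(String[] terms, int[][] arcs, int start, int end) {
        List<String> out = new ArrayList<>();
        walk(terms, arcs, start, end, "", out);
        return out;
    }

    private static void walk(String[] terms, int[][] arcs, int node, int end,
                             String prefix, List<String> out) {
        if (node == end) {
            out.add(prefix.trim()); // reached the final node: record this phrase
            return;
        }
        for (int i = 0; i < arcs.length; i++) {
            if (arcs[i][0] == node) {
                walk(terms, arcs, arcs[i][1], end, prefix + " " + terms[i], out);
            }
        }
    }

    public static void main(String[] args) {
        String[] terms = {"wifi", "wi", "fi", "network"};
        // All posLen = 1: "wifi" is followed by "fi", which yields a bogus phrase.
        System.out.println(paths(terms, new int[][] {{0, 1}, {0, 1}, {1, 2}, {2, 3}}, 0, 3));
        // posLen = 2 on "wifi": the phrases are the ones a user would expect.
        System.out.println(paths(terms, new int[][] {{0, 2}, {0, 1}, {1, 2}, {2, 3}}, 0, 3));
    }
}
```

      With every posLen left at 1 the walker produces "wifi fi network", which never occurred in the text; with the longer arc it produces "wifi network" and "wi fi network".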

      I'm also trying to produce holes properly: this removes the logic in the current WDF that swallows a hole when the whole token is just delimiters.
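
      As an illustration of what "producing a hole" means, the sketch below drops tokens that consist only of delimiters but carries their position increment over to the next surviving token. It is a simplified model under stated assumptions (whitespace tokenization, "delimiter" meaning any character that is not a letter or digit); `HoleSketch` and `withHoles` are hypothetical names, not part of the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of hole preservation (hypothetical, not the filter's real code).
final class HoleSketch {

    // Returns the surviving tokens annotated with their position increment.
    static List<String> withHoles(String text) {
        List<String> out = new ArrayList<>();
        int pendingInc = 0;
        for (String tok : text.trim().split("\\s+")) {
            pendingInc++; // every input token advances one position
            boolean onlyDelims = tok.chars().noneMatch(Character::isLetterOrDigit);
            if (onlyDelims) {
                continue; // drop the token but keep its increment: this is the hole
            }
            out.add(tok + "(posInc=" + pendingInc + ")");
            pendingInc = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(withHoles("foo -- bar"));
    }
}
```

      Here "bar" comes out with posInc=2, so a phrase query for "foo bar" with no slop would not match across the dropped token.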

      Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike e.g. SynonymGraphFilter) because it's already creating the necessary new positions, and its output graph never has side paths, except for single tokens that skip nodes because they have posLen > 1. I.e. the only fix to make, I think, is to set posLen properly. And it really helps that it does its own "new token buffering + sorting" already.
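
      The "buffer, sort, then set posLen" idea can be sketched roughly as follows. This is not the code from the patch: `BufferSortSketch`, `emit`, the half-open part ranges, and the sort order (by start part, then longer spans first) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy version of buffering sub-tokens, sorting them, and deriving
// posInc/posLen (hypothetical, not the filter's real code).
final class BufferSortSketch {

    // parts: sub-words of one input token.
    // ranges[i] = {startPart, endPart}, half-open: the parts a buffered token covers.
    static List<String> emit(String[] parts, int[][] ranges) {
        int[][] sorted = ranges.clone();
        // Assumed order: smaller start part first; on ties, the longer span first.
        Arrays.sort(sorted, (a, b) ->
            a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(b[1], a[1]));

        List<String> out = new ArrayList<>();
        int lastStart = -1; // so the first emitted token gets posInc = 1
        for (int[] r : sorted) {
            String term = String.join("", Arrays.copyOfRange(parts, r[0], r[1]));
            int posInc = r[0] - lastStart; // 0 when stacked on the previous token
            int posLen = r[1] - r[0];      // number of parts the token spans
            out.add(term + "(posInc=" + posInc + ",posLen=" + posLen + ")");
            lastStart = r[0];
        }
        return out;
    }

    public static void main(String[] args) {
        // Buffered in arbitrary order: "wi", "fi", then the catenated "wifi".
        System.out.println(emit(new String[] {"wi", "fi"},
                                new int[][] {{0, 1}, {1, 2}, {0, 2}}));
    }
}
```

      Because the positions already come out right from the sort, the only graph-specific step in this model is computing posLen from the span width.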


      Attachments:
        1. LUCENE-7619.patch (124 kB, Michael McCandless)
        2. LUCENE-7619.patch (162 kB, Michael McCandless)
        3. LUCENE-7619.patch (165 kB, Michael McCandless)
        4. before.png (37 kB, Michael McCandless)
        5. after.png (41 kB, Michael McCandless)



            Assignee: Michael McCandless (mikemccand)
            Reporter: Michael McCandless (mikemccand)