Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.5
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Currently, WordDelimiterFilter (WDF) doesn't try to set the posLen attribute (PositionLengthAttribute), and so it creates graphs like the one in the attached before.png.

      With this patch (still a work in progress) it creates the graph in the attached after.png instead.

      This means that, today, positional queries are buggy when WDF is used at search time. Since LUCENE-7603 is now fixed, with this change you should be able to use positional queries with WordDelimiterGraphFilter (WDGF).

      I'm also trying to produce holes properly, by removing the logic in the current WDF that swallows a hole when the whole token is just delimiters.

      Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike e.g. SynonymGraphFilter) because it's already creating the necessary new positions, and its output graph never has side paths, except for single tokens that skip nodes because they have posLen > 1. I.e. the only fix to make, I think, is to set posLen properly. And it really helps that it does its own "new token buffering + sorting" already.
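To make the effect of posLen concrete, here is a minimal sketch in plain Java. It does not use Lucene's TokenStream API: `Tok`, `paths`, `flat`, and `graph` are hypothetical names, and the token lists for "wi-fi network" are an assumed illustration, not output captured from either filter. Each token becomes an arc from its position to position + posLen, and the sketch lists every path through the resulting graph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in only: this is NOT Lucene's TokenStream API.
final class TokenGraphSketch {

  /** Hypothetical token: term text, position increment, position length. */
  static final class Tok {
    final String term;
    final int posInc;
    final int posLen;

    Tok(String term, int posInc, int posLen) {
      this.term = term;
      this.posInc = posInc;
      this.posLen = posLen;
    }
  }

  /** Returns every path from node 0 to the last node as a space-joined phrase. */
  static List<String> paths(List<Tok> toks) {
    int n = toks.size();
    int[] from = new int[n];
    int[] to = new int[n];
    int pos = -1; // positions start at -1, so a first posInc of 1 lands on node 0
    int last = 0;
    for (int i = 0; i < n; i++) {
      Tok t = toks.get(i);
      pos += t.posInc;
      from[i] = pos;            // arc leaves the token's position
      to[i] = pos + t.posLen;   // and arrives posLen nodes later
      last = Math.max(last, to[i]);
    }
    List<String> out = new ArrayList<>();
    walk(toks, from, to, 0, last, "", out);
    return out;
  }

  private static void walk(List<Tok> toks, int[] from, int[] to,
                           int node, int last, String prefix, List<String> out) {
    if (node == last) {
      out.add(prefix);
      return;
    }
    for (int i = 0; i < toks.size(); i++) {
      if (from[i] == node) {
        String next = prefix.isEmpty() ? toks.get(i).term : prefix + " " + toks.get(i).term;
        walk(toks, from, to, to[i], last, next, out);
      }
    }
  }

  /** Assumed "before" shape: the catenated token keeps the default posLen of 1. */
  static List<Tok> flat() {
    return Arrays.asList(
        new Tok("wifi", 1, 1), new Tok("wi", 0, 1),
        new Tok("fi", 1, 1), new Tok("network", 1, 1));
  }

  /** Assumed "after" shape: the catenated token spans two positions. */
  static List<Tok> graph() {
    return Arrays.asList(
        new Tok("wifi", 1, 2), new Tok("wi", 0, 1),
        new Tok("fi", 1, 1), new Tok("network", 1, 1));
  }

  public static void main(String[] args) {
    System.out.println(paths(flat()));   // [wifi fi network, wi fi network]
    System.out.println(paths(graph()));  // [wifi network, wi fi network]
  }
}
```

In the flat variant the phrase "wifi fi" appears as a path even though it was never in the input, which is the kind of error that breaks positional queries. Setting posLen to 2 on the catenated token removes that path without changing any position increments.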

      1. after.png
        41 kB
        Michael McCandless
      2. before.png
        37 kB
        Michael McCandless
      3. LUCENE-7619.patch
        165 kB
        Michael McCandless
      4. LUCENE-7619.patch
        162 kB
        Michael McCandless
      5. LUCENE-7619.patch
        124 kB
        Michael McCandless

        Activity

        mikemccand Michael McCandless added a comment -

        Initial dirty, work-in-progress, overly verbose patch ... it's still buggy in some cases but the basic idea is working. I also moved FlattenGraphFilter from oal.analysis.synonym to oal.analysis.core.

        jpountz Adrien Grand added a comment -

        Wow, I did not think WDF would ever be fixed to produce correct positions, this is very exciting!

        dsmiley David Smiley added a comment -

        Very cool!

        mikemccand Michael McCandless added a comment -

        Another iteration ... I think this one is very close, except there are some failing test cases for AnalyzingInfixSuggester.

        I changed how WordDelimiterGraphFilter buffers its tokens, to store the absolute start/end position (instead of pos inc) and use that for sorting all buffered tokens before returning them. I think this simplified the code somewhat.
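As a rough illustration of that buffering idea, the sketch below stores each buffered part with absolute start and end positions, sorts the parts, and only then derives posInc and posLen. It is plain Java, not the patch's code: `Part`, `emit`, and `sample` are hypothetical names, and the tie-break (longer part first when start positions are equal) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names and the tie-break rule are assumptions.
final class BufferedPartsSketch {

  /** Hypothetical buffered part with absolute start/end positions. */
  static final class Part {
    final String term;
    final int startPos;
    final int endPos;

    Part(String term, int startPos, int endPos) {
      this.term = term;
      this.startPos = startPos;
      this.endPos = endPos;
    }
  }

  /**
   * Sorts the buffered parts and converts absolute positions into
   * posInc/posLen. lastPos is the position of the previously emitted
   * token (-1 before the first token).
   */
  static List<String> emit(List<Part> buffered, int lastPos) {
    List<Part> sorted = new ArrayList<>(buffered);
    sorted.sort((a, b) -> {
      // smaller start position first
      int cmp = Integer.compare(a.startPos, b.startPos);
      // assumed tie-break: the part spanning more positions comes first
      return cmp != 0 ? cmp : Integer.compare(b.endPos, a.endPos);
    });
    List<String> out = new ArrayList<>();
    int prev = lastPos;
    for (Part p : sorted) {
      int posInc = p.startPos - prev;      // distance from the previous token's position
      int posLen = p.endPos - p.startPos;  // number of positions the part spans
      out.add(p.term + " posInc=" + posInc + " posLen=" + posLen);
      prev = p.startPos;
    }
    return out;
  }

  /** Parts for "wi-fi" in an arbitrary generation order. */
  static List<Part> sample() {
    return Arrays.asList(
        new Part("wi", 0, 1), new Part("fi", 1, 2), new Part("wifi", 0, 2));
  }

  public static void main(String[] args) {
    // [wifi posInc=1 posLen=2, wi posInc=0 posLen=1, fi posInc=1 posLen=1]
    System.out.println(emit(sample(), -1));
  }
}
```

Keeping absolute positions until the last moment means the order in which parts are generated no longer matters: the increments fall out of the sorted order instead of having to be patched up as parts are added.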

        I also added a fun random test with a slowWDF method that emulates WordDelimiterGraphFilter slowly but hopefully bug-free, and compares its output on random strings against the real thing.

        Finally, I fixed TokenStreamToAutomaton to use the ending posInc, instead of "cheating" by looking at the ending offsets.

        mikemccand Michael McCandless added a comment -

        Another iteration; I think it's ready. I added an option to TS2G (default off) to interpret an ending offset gap as a hole ...

        jira-bot ASF subversion and git services added a comment -

        Commit 637915b890d9f0e5cfaa6887609f221029327a25 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=637915b ]

        LUCENE-7619: add WordDelimiterGraphFilter (replacing WordDelimiterFilter) to produce a correct token stream graph when splitting words

        jira-bot ASF subversion and git services added a comment -

        Commit 4e4ec082caa7c1fc8fca24ddfb6a8633a4ae9506 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4e4ec08 ]

        LUCENE-7619: add WordDelimiterGraphFilter (replacing WordDelimiterFilter) to produce a correct token stream graph when splitting words

        jira-bot ASF subversion and git services added a comment -

        Commit 0bdcfc291fceab26e1c62a7e9791ce417671eacd in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0bdcfc2 ]

        LUCENE-7619: don't let offsets go backwards

        jira-bot ASF subversion and git services added a comment -

        Commit 03ffb1287d9908f8e1bb1417b7f18ca4645f209f in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=03ffb12 ]

        LUCENE-7619: don't let offsets go backwards

        jigaronline Jigar Shah added a comment -

        Hello Michael McCandless

        +1

        Many thanks for fixing this!

        I am using WordDelimiterFilter, which often breaks phrase queries on words with punctuation. I am currently using Lucene 6.4.1 in production. Can you please suggest which classes I should patch on Lucene 6.4.1 to use this feature? Would it be fine to patch just WordDelimiterGraphFilter and use it in the token stream instead of WordDelimiterFilter, or are there other dependent classes that I also have to patch? (Please provide a list if there are.)

        Once Lucene 6.5 is released I will upgrade to it to get a better-tested fix, but for now I would like to patch Lucene 6.4.1 if the patch is compatible and simple.

        mikemccand Michael McCandless added a comment -

        Hi Jigar Shah, besides WDGF itself, there have also been a number of changes to the query parser, to properly consume a graph and build the correct queries, e.g. LUCENE-7702, LUCENE-7699, LUCENE-7698, LUCENE-7638, etc.

        It may be simpler for you to test a snapshot build of Lucene 6.5.0?

        jigaronline Jigar Shah added a comment -

        I will go with a snapshot build for now. Thanks for suggesting a direction.

        Many thanks!


          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            1
            Watchers:
            10
