Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      WordDelimiterFilter is documented as broken in TestRandomChains (LUCENE-4641). Given how widely used it is, we should try to fix it.

      1. LUCENE-5111.patch
        52 kB
        Robert Muir
      2. LUCENE-5111.patch
        11 kB
        Robert Muir

          Activity

          Robert Muir added a comment -

          Here is a patch. It's not super-optimized, but the 3 common conditions (no delimiters, all delimiters, just one word surrounded by delimiters) are just as fast. For the concatenation+parts stuff I used captureState (we can avoid it; it was just about correctness for me).

          I think this is fairly important to fix so users can use e.g. postings highlighter and don't hit bugs like http://stackoverflow.com/questions/20324016/shingle-filter-factory-startoffset-must-be-non-negative-and-endoffset-must-be
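
          The error in the linked question's title is an offset sanity check. A minimal sketch of that kind of check, using only the JDK, is shown here. The class name, method name, and sample offsets are hypothetical and are not Lucene code; the third rule (start offsets must not go backwards) is an assumption about what the indexer enforces in addition to the two rules named in the error message.

```java
public class OffsetInvariant {
    /**
     * Checks a sequence of token offsets, each row being {startOffset, endOffset}.
     * Returns null if every token passes, otherwise a description of the first violation.
     * Hypothetical helper for illustration; not part of Lucene.
     */
    public static String firstViolation(int[][] offsets) {
        int lastStart = 0;
        for (int i = 0; i < offsets.length; i++) {
            int start = offsets[i][0];
            int end = offsets[i][1];
            if (start < 0) {
                return "token " + i + ": startOffset must be non-negative";
            }
            if (end < start) {
                return "token " + i + ": endOffset must be >= startOffset";
            }
            // Assumed rule: a token may not start before the previous token started.
            if (start < lastStart) {
                return "token " + i + ": offsets must not go backwards";
            }
            lastStart = start;
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sequence: two parts and a concatenation inside the span [0, 5).
        int[][] ok = { {0, 2}, {0, 5}, {3, 5} };
        // Made-up sequence where the second token starts before the first one did.
        int[][] backwards = { {3, 5}, {0, 2} };
        System.out.println(firstViolation(ok));
        System.out.println(firstViolation(backwards));
    }
}
```

          A filter that emits sub-word parts and concatenations has to keep every emitted token inside these rules, which is why the offsets matter for consumers such as highlighters.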

          Michael McCandless added a comment -

          +1

          I use WDF at http://jirasearch.mikemccandless.com (for CamelCaseTokenization) ... very happy to see this finally getting fixed!

          Robert Muir added a comment -

          I cleaned it up, beefed up tests, and added backwards compatibility (in case someone depends on the old behavior for some reason).

          I think it's ready; I would like to let it bake in trunk in case TestRandomChains finds some surprises.

          ASF subversion and git services added a comment -

          Commit 1578993 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1578993 ]

          LUCENE-5111: Fix WordDelimiterFilter offsets

          Robert Muir added a comment -

          I set up a Jenkins job to beat on the analyzers in trunk: http://builds.flonkings.com/job/Lucene-trunk-Linux-java7-64-analyzers/

          ASF subversion and git services added a comment -

          Commit 1579089 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1579089 ]

          LUCENE-5111: Fix WordDelimiterFilter offsets

          Michael McCandless added a comment -

          Should we backport this to 4.7.2? Or is it too big a change...? (E.g. we'd need matchVersion to distinguish 4.7.0,1 vs 4.7.2).

          Uwe Schindler added a comment -

          > Should we backport this to 4.7.2? Or is it too big a change...? (E.g. we'd need matchVersion to distinguish 4.7.0,1 vs 4.7.2).

          -1 We should really not change the behaviour of analysis components in minor releases. And we should not add new constants. So sorry, no chance to get this into 4.7!

          I think we should simply get 4.8 out soon! I would be the RM, so I will send a request to the ML.

          Hoss Man added a comment -

          > -1 We should really not change the behaviour of analysis components in minor releases.

          Agreed, -1

          Uwe Schindler added a comment -

          Close issue after release of 4.8.0


            People

            • Assignee: Adrien Grand
            • Reporter: Adrien Grand
            • Votes: 1
            • Watchers: 9
