  1. Lucene - Core
  2. LUCENE-6582

SynonymFilter should generate a correct (or, at least, better) graph

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Some time ago, I had a problem with synonyms and phrase-type queries. (Actually, it was in elasticsearch, where I was using a match query with multiple terms and the "and" operator; it is explained in more detail here: https://github.com/elastic/elasticsearch/issues/10394.)

      That issue led to some work on Lucene: LUCENE-6400 (where I helped a little with tests) and LUCENE-6401. This issue is also related to LUCENE-3843.

      Starting from the discussion on LUCENE-6400, I'm attempting to implement a solution. Here is a patch with a first step: the implementation that lets "SynFilter ... be able to 'make positions'" (as was mentioned on that issue). With this change, the synonym filter generates a correct (or, at least, better) graph.

      Because synonym matching is greedy, I only had to worry about fixing the position lengths of the rules in the current match: no future or past synonym would "span" over this match (please correct me if I'm wrong!). It did require more buffering, twice as much.
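
      To make the "position length" idea concrete, here is a minimal, self-contained sketch. It is not the code from the patch and does not use Lucene classes: SynonymGraphSketch, Tok, Arc and expand are hypothetical names, and it handles only a single rule in which the original tokens are kept. It shows how, once a match is consumed greedily, both paths (original and synonym) can be made to span the same range of positions by stretching the last token of the shorter path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SynonymGraphSketch {

    /** Hypothetical stand-in for a token carrying the two attributes that encode a graph. */
    static final class Tok {
        final String term;
        final int posInc; // distance from the previous token's start position
        final int posLen; // how many positions this token spans
        Tok(String term, int posInc, int posLen) {
            this.term = term; this.posInc = posInc; this.posLen = posLen;
        }
        @Override public String toString() {
            return term + "(+" + posInc + ",len=" + posLen + ")";
        }
    }

    /** Internal arc: a term starting at an absolute position and spanning len positions. */
    private static final class Arc {
        final String term; final int start; final int len;
        Arc(String term, int start, int len) { this.term = term; this.start = start; this.len = len; }
    }

    /** Lays one path over positions base..base+width; its last word absorbs any leftover width. */
    private static void addPath(List<Arc> arcs, List<String> words, int base, int width) {
        for (int k = 0; k < words.size(); k++) {
            boolean last = (k == words.size() - 1);
            arcs.add(new Arc(words.get(k), base + k, last ? width - k : 1));
        }
    }

    /** Applies one rule (match -> synonym, keeping the original) left to right, greedily. */
    static List<Tok> expand(List<String> input, List<String> match, List<String> synonym) {
        List<Arc> arcs = new ArrayList<>();
        int node = 0;
        int i = 0;
        while (i < input.size()) {
            boolean hit = !match.isEmpty() && i + match.size() <= input.size()
                    && input.subList(i, i + match.size()).equals(match);
            if (hit) {
                // Both paths must start and end at the same positions.
                int width = Math.max(match.size(), synonym.size());
                addPath(arcs, match, node, width);
                addPath(arcs, synonym, node, width);
                node += width;
                i += match.size(); // greedy: nothing later can overlap this match
            } else {
                arcs.add(new Arc(input.get(i), node, 1));
                node++;
                i++;
            }
        }
        // A token stream must be ordered by start position; List.sort is stable.
        arcs.sort(Comparator.comparingInt((Arc a) -> a.start));
        List<Tok> out = new ArrayList<>();
        int prev = -1; // so that the first token gets a position increment of 1
        for (Arc a : arcs) {
            out.add(new Tok(a.term, a.start - prev, a.len));
            prev = a.start;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(
                Arrays.asList("dns", "is", "down"),
                Arrays.asList("dns"),
                Arrays.asList("domain", "name", "system")));
    }
}
```

      In this sketch "dns" gets a position length of 3, so it ends exactly where "system" ends and "is" follows both paths. Without that adjustment, "name" and "system" would share positions with "is" and "down", which is the kind of graph that confuses phrase queries.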

      The new behavior is not active by default; it is enabled by passing a new parameter to a new SynonymFilter constructor. My changes alter the token stream generated by the synonym filter, so I thought it would be better to make that a voluntary decision for now.

      I did some refactoring of the code, but mostly limited to what I had to change for my implementation, so that the patch would not be too hard to read. I created specific unit tests for the new implementation (TestMultiWordSynonymFilter) that should show how things will work with the new behavior.

        Attachments

        1. after.png
          23 kB
          Michael McCandless
        2. after2.png
          12 kB
          Ian Ribas
        3. after3.png
          21 kB
          Michael McCandless
        4. before.png
          19 kB
          Michael McCandless
        5. LUCENE-6582.patch
          94 kB
          Ian Ribas
        6. LUCENE-6582.patch
          92 kB
          Ian Ribas
        7. LUCENE-6582.patch
          92 kB
          Ian Ribas

              People

              • Assignee:
                Unassigned
                Reporter:
                ianribas Ian Ribas
              • Votes:
                1
                Watchers:
                10