Lucene - Core
  1. Lucene - Core
  2. LUCENE-3662

extend LevenshteinAutomata to support transpositions as primitive edits

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This would be a nice improvement for spell correction: currently a transposition counts as 2 edits,
      which means users of DirectSpellChecker must use larger values of n (e.g. 2 instead of 1) and
      larger priority queue sizes, plus some sort of re-ranking with another distance measure for good results.

      Instead if we can integrate "chapter 7" of http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
      then you can just build an alternative DFA where a transposition is only a single edit
      (http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)

      According to the benchmarks in the original paper, the performance for LevT looks to be very similar to Lev.

      Support for this is now in moman (https://bitbucket.org/jpbarrette/moman/) thanks to Jean-Philippe
      Barrette-LaPierre.

      1. lev1.rev115.txt
        4 kB
        Robert Muir
      2. lev1.rev119.txt
        3 kB
        Robert Muir
      3. lev1t.txt
        4 kB
        Robert Muir
      4. LUCENE-3662_upgrade_moman.patch
        14 kB
        Robert Muir
      5. LUCENE-3662.patch
        53 kB
        Robert Muir
      6. LUCENE-3662.patch
        48 kB
        Robert Muir
      7. LUCENE-3662.patch
        48 kB
        Robert Muir
      8. LUCENE-3662.patch
        37 kB
        Robert Muir
      9. LUCENE-3662.patch
        34 kB
        Robert Muir
      10. update-moman.patch
        18 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        as a start, i thought we could try upgrading moman, passing 'False' for transpositions (as if you pass true, it uses the extended state notation including t-positions and this really makes our code-generator angry).

        But already there is some silly problem (likely in our code generator). I'll attach some files showing how the difference in the moman output changed

        Show
        Robert Muir added a comment - as a start, i thought we could try upgrading moman, passing 'False' for transpositions (as if you pass true, it uses the extended state notation including t-positions and this really makes our code-generator angry). But already there is some silly problem (likely in our code generator). I'll attach some files showing how the difference in the moman output changed
        Hide
        Robert Muir added a comment -

        here is the before and after... at a glance the changes only seem cosmetic, maybe mike has an idea

        I attached a lev1T just for fun to show the new notation as well.

        Show
        Robert Muir added a comment - here is the before and after... at a glance the changes only seem cosmetic, maybe mike has an idea I attached a lev1T just for fun to show the new notation as well.
        Hide
        Robert Muir added a comment -

        ok i found the bug, it was a (no longer valid) optimization in the parser for this null state case... no longer a problem now.

        Tests pass with this patch, which simply upgrades our moman to the latest revision.

        From here we have to modify the code generator to also generate "T" tables, and deal with the new t-position syntax.

        Show
        Robert Muir added a comment - ok i found the bug, it was a (no longer valid) optimization in the parser for this null state case... no longer a problem now. Tests pass with this patch, which simply upgrades our moman to the latest revision. From here we have to modify the code generator to also generate "T" tables, and deal with the new t-position syntax.
        Hide
        Robert Muir added a comment -

        updated patch with hacks to the parser and generated Lev1T and Lev2T parametric descriptions... i have no idea if any of this works yet.

        Show
        Robert Muir added a comment - updated patch with hacks to the parser and generated Lev1T and Lev2T parametric descriptions... i have no idea if any of this works yet.
        Hide
        Robert Muir added a comment -

        updated patch, hooking this into LevenshteinAutomata.

        I wrote a trivial test for "dog" and it passes... this is promising.

        I'll try to come up with a good testing strategy.

        Show
        Robert Muir added a comment - updated patch, hooking this into LevenshteinAutomata. I wrote a trivial test for "dog" and it passes... this is promising. I'll try to come up with a good testing strategy.
        Hide
        Dawid Weiss added a comment -

        Avanti Robert!

        Show
        Dawid Weiss added a comment - Avanti Robert!
        Hide
        Uwe Schindler added a comment -

        How many beers did you need for that?

        Show
        Uwe Schindler added a comment - How many beers did you need for that?
        Hide
        Robert Muir added a comment -

        updated patch with tests. this reveals a bug in moman for LevT(2)...

        Show
        Robert Muir added a comment - updated patch with tests. this reveals a bug in moman for LevT(2)...
        Hide
        Robert Muir added a comment -

        patch updated to revision 120 of moman, fixing the bug. tests pass now.

        Show
        Robert Muir added a comment - patch updated to revision 120 of moman, fixing the bug. tests pass now.
        Hide
        Mark Miller added a comment -

        Sweet. I'd estimate many beers.

        Show
        Mark Miller added a comment - Sweet. I'd estimate many beers.
        Hide
        Robert Muir added a comment -

        patch, hooking this into directspellchecker by default. I think its ready to commit.

        I did some rough perf tests, the transpositions costs us nothing. But the suggestions are much more relevant in some situations:

        for example, I typed "Wahsington" into the geonames database, asking for top-5 suggestions:

        maxEdits=1

        before:
        (no suggestions)

        after:

        Suggestion Score DocFreq
        Washington 0.9 61
        maxEdits=2

        before:

        Suggestion Score DocFreq
        Washington 0.8 61
        Warrington 0.8 13
        Waddington 0.8 10
        Wallington 0.8 7
        Watlington 0.8 5

        after:

        Suggestion Score DocFreq
        Washington 0.9 61
        Warrington 0.8 13
        Waddington 0.8 10
        Wallington 0.8 7
        Watlington 0.8 5

        About the beers: the story is that I made some small progress towards implementing this (https://bitbucket.org/rmuir/moman) with many many beers, but got stuck, Jean-Phillipe merged my commits, emailed me that he is confident he can implement it in 15 hours, and did just that. I found a bug in the test, he fixed it the next day as before... and go figure it looks like the bug might have been in the part I did.

        Show
        Robert Muir added a comment - patch, hooking this into directspellchecker by default. I think its ready to commit. I did some rough perf tests, the transpositions costs us nothing. But the suggestions are much more relevant in some situations: for example, I typed "Wahsington" into the geonames database, asking for top-5 suggestions: maxEdits=1 before: (no suggestions) after: Suggestion Score DocFreq Washington 0.9 61 maxEdits=2 before: Suggestion Score DocFreq Washington 0.8 61 Warrington 0.8 13 Waddington 0.8 10 Wallington 0.8 7 Watlington 0.8 5 after: Suggestion Score DocFreq Washington 0.9 61 Warrington 0.8 13 Waddington 0.8 10 Wallington 0.8 7 Watlington 0.8 5 About the beers: the story is that I made some small progress towards implementing this ( https://bitbucket.org/rmuir/moman ) with many many beers, but got stuck, Jean-Phillipe merged my commits, emailed me that he is confident he can implement it in 15 hours, and did just that. I found a bug in the test, he fixed it the next day as before... and go figure it looks like the bug might have been in the part I did.
        Hide
        Uwe Schindler added a comment -

        A nice Xmas present! +1

        I think we should send Jean-Philippe some beers, too.

        Show
        Uwe Schindler added a comment - A nice Xmas present! +1 I think we should send Jean-Philippe some beers, too.
        Hide
        Dawid Weiss added a comment -

        +1 for beers for Jean-Phillipe.

        Show
        Dawid Weiss added a comment - +1 for beers for Jean-Phillipe.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development