Lucene - Core
  1. Lucene - Core
  2. LUCENE-3830

MappingCharFilter could be improved by switching to an FST.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Lucene Fields:
      New

      Description

      MappingCharFilter stores an overly complex tree-like structure for matching input patterns. The input is a union of fixed strings mapped to a set of fixed strings; an fst matcher would be ideal here and provide both memory and speed improvement I bet.

      1. LUCENE-3830.patch
        35 kB
        Michael McCandless
      2. PerfTestMappingCharFilter.java
        10 kB
        Michael McCandless
      3. LUCENE-3830.patch
        35 kB
        Michael McCandless
      4. LUCENE-3830.patch
        28 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Dawid Weiss added a comment -

          I think these are related – the logic in MappingCharFilter was vague to me when I looked at it last time, I'm sure it could be cleaned up if an fst was used for matching input.

          Show
          Dawid Weiss added a comment - I think these are related – the logic in MappingCharFilter was vague to me when I looked at it last time, I'm sure it could be cleaned up if an fst was used for matching input.
          Hide
          Michael McCandless added a comment -

          Patch; I think it's basically working.

          I disallow empty string match to be added (this was accepted, but I think ignored, before), so I had to fix a couple places that were doing that.

          I also added another random test that separately (slowly) computes the output & corrected offsets and then compares ... it's passing.

          I still need to do a perf test.. does anyone have a "typical" set of mappings + content I could use? Else I'll try to come up with something...

          Show
          Michael McCandless added a comment - Patch; I think it's basically working. I disallow empty string match to be added (this was accepted, but I think ignored, before), so I had to fix a couple places that were doing that. I also added another random test that separately (slowly) computes the output & corrected offsets and then compares ... it's passing. I still need to do a perf test.. does anyone have a "typical" set of mappings + content I could use? Else I'll try to come up with something...
          Hide
          Robert Muir added a comment -

          just laughing a little bit to stumble on this:

          http://www.cis.uni-muenchen.de/people/Schulz/Pub/dictle5.ps

          Show
          Robert Muir added a comment - just laughing a little bit to stumble on this: http://www.cis.uni-muenchen.de/people/Schulz/Pub/dictle5.ps
          Hide
          Dawid Weiss added a comment -

          Stoyan Mihov is everywhere, he's done everything

          Show
          Dawid Weiss added a comment - Stoyan Mihov is everywhere, he's done everything
          Hide
          Dawid Weiss added a comment -

          I briefly looked at the patch. Looks good (didn't check indexes and most of the logic, just went through the general idea)

          + * An FST {@link Outputs} implementation where each output
          + * is a sequence of bytes.
          + *
          + * @lucene.experimental
          + */
          +
          +public final class CharSequenceOutputs extends Outputs<CharsRef> { 
          

          Javadoc is probably wrong (sequence of bytes?).

          Other than that I think it's all right. I'd probably implement it based on the stack of currently "active" tokens moving through the fst – this way you know when you have a greedy first longest match simply by looking at the set of active tokens when a match fires instead of restarting from each position.

          But then – Mihov's algorithm claims to reduce this to O(1) basically so it'd be very nice to implement that instead

          Show
          Dawid Weiss added a comment - I briefly looked at the patch. Looks good (didn't check indexes and most of the logic, just went through the general idea) + * An FST {@link Outputs} implementation where each output + * is a sequence of bytes. + * + * @lucene.experimental + */ + + public final class CharSequenceOutputs extends Outputs<CharsRef> { Javadoc is probably wrong (sequence of bytes?). Other than that I think it's all right. I'd probably implement it based on the stack of currently "active" tokens moving through the fst – this way you know when you have a greedy first longest match simply by looking at the set of active tokens when a match fires instead of restarting from each position. But then – Mihov's algorithm claims to reduce this to O(1) basically so it'd be very nice to implement that instead
          Hide
          Robert Muir added a comment -

          I only glanced at the Schulz/Mihov paper, but it seemed like in order for it to work you have
          to add extra transitions (loops) which would make the FST infinite? (could be totally wrong here)

          Maybe there is some creative way we could do this, but thats just in addition to the fact the
          paper is ... like their other papers and would take some work to decode anyway

          Show
          Robert Muir added a comment - I only glanced at the Schulz/Mihov paper, but it seemed like in order for it to work you have to add extra transitions (loops) which would make the FST infinite? (could be totally wrong here) Maybe there is some creative way we could do this, but thats just in addition to the fact the paper is ... like their other papers and would take some work to decode anyway
          Hide
          Michael McCandless added a comment -

          Just laughing a little bit to stumble on this:

          Only 23 pages!

          Javadoc is probably wrong (sequence of bytes?).

          Thanks Dawid, I'll fix.

          I'd probably implement it based on the stack of currently "active" tokens moving through the fst – this way you know when you have a greedy first longest match simply by looking at the set of active tokens when a match fires instead of restarting from each position.

          Hmm I don't quite understand ... can you describe more?

          If there's something simple in between what MappingCharFilter (and my patch) does today ("restart on each position"), and the 23 page paper, that would be nice

          Show
          Michael McCandless added a comment - Just laughing a little bit to stumble on this: Only 23 pages! Javadoc is probably wrong (sequence of bytes?). Thanks Dawid, I'll fix. I'd probably implement it based on the stack of currently "active" tokens moving through the fst – this way you know when you have a greedy first longest match simply by looking at the set of active tokens when a match fires instead of restarting from each position. Hmm I don't quite understand ... can you describe more? If there's something simple in between what MappingCharFilter (and my patch) does today ("restart on each position"), and the 23 page paper, that would be nice
          Hide
          Dawid Weiss added a comment -

          Hmm I don't quite understand ... can you describe more?

          I'm on vacation this week, so short: from the brief code inspection I concluded you're using a search from each position to get the maximum match, is this right? If so, the pessimistic time is quite bad for patterns like "aaa*b" and input strings "aaaa*c" (replace * with a large number of repeated sequences). The way I would try to implement it is to do a state-tracking much like in non-deterministic automata, where you have a stack of "active" states which you follow in the automaton and you move them from state to state on each input symbol. If there a given state fires as a match then you have some conditions to check – are there any states that may be potentially longer but haven't fired yet (if so, you need to delay this firing state because it can be subsumed by others), otherwise when the no other active state blocks that one you can do the replacement.

          I haven't really thought about how twisted the logic above can become but I suspect it's not going to be really that bad. The gain is that each input symbol advances your state (at most linearly with the number of active states). It helps with degenerate cases like the one above. I suppose what Mihov et al. do is to statically (on the fst) determine which states can lead to this "deferred match queue" and simply eliminate them, but haven't really looked into it.

          It's an improvement, doesn't need to be implemented right away. Sorry for being brief – the network is flaky here, I'm in the middle of nowhere.

          Show
          Dawid Weiss added a comment - Hmm I don't quite understand ... can you describe more? I'm on vacation this week, so short: from the brief code inspection I concluded you're using a search from each position to get the maximum match, is this right? If so, the pessimistic time is quite bad for patterns like "aaa*b" and input strings "aaaa*c" (replace * with a large number of repeated sequences). The way I would try to implement it is to do a state-tracking much like in non-deterministic automata, where you have a stack of "active" states which you follow in the automaton and you move them from state to state on each input symbol. If there a given state fires as a match then you have some conditions to check – are there any states that may be potentially longer but haven't fired yet (if so, you need to delay this firing state because it can be subsumed by others), otherwise when the no other active state blocks that one you can do the replacement. I haven't really thought about how twisted the logic above can become but I suspect it's not going to be really that bad. The gain is that each input symbol advances your state (at most linearly with the number of active states). It helps with degenerate cases like the one above. I suppose what Mihov et al. do is to statically (on the fst) determine which states can lead to this "deferred match queue" and simply eliminate them, but haven't really looked into it. It's an improvement, doesn't need to be implemented right away. Sorry for being brief – the network is flaky here, I'm in the middle of nowhere.
          Hide
          Michael McCandless added a comment -

          from the brief code inspection I concluded you're using a search from each position to get the maximum match, is this right?

          Right. I mean, that's how the code currently works ... I'm just
          replacing the recursive Map w/ an FST.

          If so, the pessimistic time is quite bad for patterns like "aaa*b" and input strings "aaaa*c" (replace * with a large number of repeated sequences).

          Yes... though I suspect in practice the impact is minimal for Lucene's
          typical use cases. Still it would be nice to fix

          The way I would try to implement it is to do a state-tracking much like in non-deterministic automata, where you have a stack of "active" states which you follow in the automaton and you move them from state to state on each input symbol.

          OK! I think I understand... I think this amounts to putting a .*
          cycle on the FST's start node, and then doing subset construction to
          "determinize" as you traverse? Ie track all states that the input
          characters could be in... so that you only traverse the input once.

          I suppose what Mihov et al. do is to statically (on the fst) determine which states can lead to this "deferred match queue" and simply eliminate them, but haven't really looked into it.

          Yeah I think I think the paper (and Aho/Corasick) essentially do that
          determinization "up front".

          SynonymFilter also has the same limitation....

          Show
          Michael McCandless added a comment - from the brief code inspection I concluded you're using a search from each position to get the maximum match, is this right? Right. I mean, that's how the code currently works ... I'm just replacing the recursive Map w/ an FST. If so, the pessimistic time is quite bad for patterns like "aaa*b" and input strings "aaaa*c" (replace * with a large number of repeated sequences). Yes... though I suspect in practice the impact is minimal for Lucene's typical use cases. Still it would be nice to fix The way I would try to implement it is to do a state-tracking much like in non-deterministic automata, where you have a stack of "active" states which you follow in the automaton and you move them from state to state on each input symbol. OK! I think I understand... I think this amounts to putting a .* cycle on the FST's start node, and then doing subset construction to "determinize" as you traverse? Ie track all states that the input characters could be in... so that you only traverse the input once. I suppose what Mihov et al. do is to statically (on the fst) determine which states can lead to this "deferred match queue" and simply eliminate them, but haven't really looked into it. Yeah I think I think the paper (and Aho/Corasick) essentially do that determinization "up front". SynonymFilter also has the same limitation....
          Hide
          Michael McCandless added a comment -

          New patch: I made a simple perf test (attached:
          PerfTestMappingCharFilter.java), and found the FST was slower... I
          fixed RollingCharBuffer to bulk read, and also pre-cache the FST arcs
          in a HashMap (FST pre-caches only latin bytes), and now perf is a bit
          faster. I also switched to builder API (like SynFilter) so
          the NormalizeCharMap instance is immutable.

          I think it's ready!

          We can separately improve the matching algo based on the 23 page
          paper...

          Show
          Michael McCandless added a comment - New patch: I made a simple perf test (attached: PerfTestMappingCharFilter.java), and found the FST was slower... I fixed RollingCharBuffer to bulk read, and also pre-cache the FST arcs in a HashMap (FST pre-caches only latin bytes), and now perf is a bit faster. I also switched to builder API (like SynFilter) so the NormalizeCharMap instance is immutable. I think it's ready! We can separately improve the matching algo based on the 23 page paper...
          Hide
          Robert Muir added a comment -

          Why does mappingcharfilter itself cache FST arcs?

          This seems like a slow thing to do per-reader. Shouldnt this be done
          when you build the NormalizeCharMap instead?

          Show
          Robert Muir added a comment - Why does mappingcharfilter itself cache FST arcs? This seems like a slow thing to do per-reader. Shouldnt this be done when you build the NormalizeCharMap instead?
          Hide
          Michael McCandless added a comment -

          Shouldnt this be done when you build the NormalizeCharMap instead?

          Duh, yes! I'll fix.

          Show
          Michael McCandless added a comment - Shouldnt this be done when you build the NormalizeCharMap instead? Duh, yes! I'll fix.
          Hide
          Michael McCandless added a comment -

          Patch moving the root arc cache creation to NormalizeCharMap...

          Show
          Michael McCandless added a comment - Patch moving the root arc cache creation to NormalizeCharMap...
          Hide
          Robert Muir added a comment -

          patch looks good: I guess the bulk read in RollingCharBuffer should help
          other things like Kuromoji that use it too?!

          Show
          Robert Muir added a comment - patch looks good: I guess the bulk read in RollingCharBuffer should help other things like Kuromoji that use it too?!
          Hide
          Michael McCandless added a comment -

          I guess the bulk read in RollingCharBuffer should help other things like Kuromoji that use it too?!

          I haven't tested but I think it should help!

          Show
          Michael McCandless added a comment - I guess the bulk read in RollingCharBuffer should help other things like Kuromoji that use it too?! I haven't tested but I think it should help!

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Dawid Weiss
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development