Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The current SynonymFilter uses a lot of RAM and CPU, especially at build time.

      I think yesterday I heard about "huge synonyms files" three times.

      So, I think we should use an FST-based structure, sharing the inputs and outputs.
      And we should be more efficient with the TokenStream API, e.g. using save/restoreState() instead of cloneAttributes().

      1. synonyms.zip
        575 kB
        Robert Muir
      2. LUCENE-3233.patch
        16 kB
        Robert Muir
      3. LUCENE-3233.patch
        48 kB
        Michael McCandless
      4. LUCENE-3233.patch
        44 kB
        Robert Muir
      5. LUCENE-3233.patch
        52 kB
        Michael McCandless
      6. LUCENE-3233.patch
        54 kB
        Michael McCandless
      7. LUCENE-3233.patch
        73 kB
        Robert Muir
      8. LUCENE-3233.patch
        78 kB
        Robert Muir
      9. LUCENE-3233.patch
        80 kB
        Robert Muir
      10. LUCENE-3233.patch
        83 kB
        Robert Muir
      11. LUCENE-3233.patch
        89 kB
        Michael McCandless
      12. LUCENE-3233.patch
        95 kB
        Robert Muir
      13. LUCENE-3233.patch
        91 kB
        Michael McCandless
      14. LUCENE-3233.patch
        94 kB
        Michael McCandless
      15. LUCENE-3233.patch
        252 kB
        Robert Muir
      16. LUCENE-3233.patch
        252 kB
        Robert Muir
      17. LUCENE-3233.patch
        248 kB
        Michael McCandless
      18. LUCENE-3223.patch
        29 kB
        Michael McCandless

        Issue Links

          Activity

          Robert Muir added a comment -

          Here's a rough start to building a data structure that I think makes good tradeoffs between RAM and processing.

          No matter what, the processing on the filter-side will be hairy because of the 'interleaving' with the tokenstream.

          This one is just an FST<CharsRef,Int[]>(BYTE4) where Int is an ord to a BytesRefHash, containing the output Bytes for each term.

          This way, at input time we can walk the FST with codePointAt()

          On both sides, the Chars/Bytes are actually phrases, using \u0000 as a word separator.
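
          The structure described above can be sketched in plain Java. This is a hypothetical illustration (the class and method names are mine, not from the patch): a HashMap stands in for the FST, and an ArrayList plus HashMap stand in for the BytesRefHash, but the dedup idea is the same — inputs map to lists of ords, each distinct output phrase is stored once, and words within a phrase are joined with the '\u0000' separator.

          ```java
          import java.util.*;

          // Sketch of the deduplicated synonym map idea: a HashMap stands in
          // for the FST<CharsRef,Int[]> and an ArrayList + HashMap stand in
          // for the BytesRefHash of output phrases.
          class SynonymMapSketch {
            private final Map<String, Integer> outputOrds = new HashMap<>(); // dedup table
            private final List<String> outputs = new ArrayList<>();          // ord -> phrase
            private final Map<String, List<Integer>> map = new HashMap<>();  // input -> ords

            // Join a multi-word phrase with the \u0000 word separator used on both sides.
            static String phrase(String... words) {
              return String.join("\u0000", words);
            }

            void add(String input, String output) {
              Integer ord = outputOrds.get(output);
              if (ord == null) {               // first time we see this output phrase
                ord = outputs.size();
                outputs.add(output);
                outputOrds.put(output, ord);
              }
              map.computeIfAbsent(input, k -> new ArrayList<>()).add(ord);
            }

            List<String> lookup(String input) {
              List<String> result = new ArrayList<>();
              for (int ord : map.getOrDefault(input, Collections.emptyList())) {
                result.add(outputs.get(ord));
              }
              return result;
            }

            int numDistinctOutputs() {
              return outputs.size();
            }
          }
          ```

          The point of the ord indirection is that an output phrase shared by many inputs costs its bytes only once.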

          Michael McCandless added a comment -

          Dumping my current state on FSTSynonymFilter – it compiles but it's got tons of bugs I'm sure! I added a trivial initial test.

          Michael McCandless added a comment -

          New patch w/ current state.

          I think it's closer; the test has more cases now (but I'd still like to make a random test), fewer nocommits, etc.

          Robert Muir added a comment -

          Patch with a first random test. This one currently does 10 iterations where it adds random entries to the synonym map, then analyzes 10k random strings (each time capturing the output and replaying it back to ensure the thing is deterministic and doesn't have reuse bugs).

          I also added the ignoreCase support.

          The filter might have a reuse bug; see ant test -Dtestcase=TestFSTSynonymMapFilter -Dtestmethod=testRandom -Dtests.seed=-4122723628721952592:244824441557739968

          Michael McCandless added a comment -

          New patch, folding in Robert's changes and the random stress test. All tests pass. I think it's now functionally correct, but I still need to compare perf vs existing syn filter, and there are still a few minor nocommits to work out.

          Ideally, if we get perf close enough, since RAM is much much less w/ this new syn filter, I think we should replace the old one with this new one.

          Michael McCandless added a comment -

          New patch, adding dedup option to the builder, removing a couple nocommits, cutting back on iters/counts in testRandom2.

          Robert Muir added a comment -

          Updated patch:

          • added a SolrSynonymsParser and test to the analyzers module, that parses the existing solr synonyms format.
          • added a Solr factory for this thing (untested!) that uses this when "format=solr" (the default)

          This way, the idea is the factory would be more extensible, e.g. you could load syns from a database, or we could add parsers for wordnet and nuke contrib/wordnet, etc etc.

          Still need to do some basic benchmarking.

          Robert Muir added a comment -

          Fixed some bugs and added some tests, but there is a problem: when I started to add a little benchmark, I hit this on my largish synonyms file:

          java.lang.IllegalStateException: max arc size is too large (445)
          

          Just run TestFSTSynonymFilterFactory and you will see it. I enabled some prints and it doesn't appear that anything totally stupid is going on... giving up for the night.

          Robert Muir added a comment -

          Attaching the synonyms.txt test file I was using; it's derived from WordNet.

          Michael McCandless added a comment -

          java.lang.IllegalStateException: max arc size is too large (445)

          Ahh – to fix this we have to call Builder.setAllowArrayArcs(false), i.e., disable the array arcs in the FST (and thus the binary-search lookup for finding arcs!). I had to do this also for MemoryCodec, since postings encoded as output per arc can be more than 256 bytes, in general.

          This will hurt perf, i.e., the arc lookup cannot use a binary search; it's because of a silly limitation in the FST representation: we use a single byte to hold the max size of all arcs, so if any arc is > 256 bytes we are unable to encode the node as an array. We could fix this (e.g., use vInt); however, arcs with widely varying sizes (due to widely varying outputs on each arc) would be very wasteful of space, because all arcs use up a fixed number of bytes when represented as an array.

          For now I think we should just call the above method, and then test the resulting perf.
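
          The space tradeoff described above is easy to see with some illustrative arithmetic (this is not Lucene code, just the layout math): with array-encoded arcs every arc occupies a fixed slot of maxArcSize bytes, so one oversized arc inflates the whole node, while packed arcs waste nothing but forbid binary search.

          ```java
          // Layout arithmetic for a single FST node's outgoing arcs.
          class ArcLayoutSketch {
            // Bytes used when arcs are laid out as a fixed-width array
            // (binary-searchable: every slot is as wide as the biggest arc).
            static int arrayEncodedBytes(int[] arcSizes) {
              int max = 0;
              for (int s : arcSizes) max = Math.max(max, s);
              return max * arcSizes.length;
            }

            // Bytes used when arcs are packed back to back (linear scan only).
            static int packedBytes(int[] arcSizes) {
              int sum = 0;
              for (int s : arcSizes) sum += s;
              return sum;
            }
          }
          ```

          For a node with three 3-byte arcs and one 445-byte arc, the array layout needs 4 × 445 = 1780 bytes versus 454 packed, which is why widely varying arc sizes make the array encoding so wasteful.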

          Michael McCandless added a comment -

          Actually, maybe a better general fix for FST would be for it to dynamically decide whether to make an array based on how many bytes will be wasted (in addition to the number of arcs / depth of the node). This way we could turn on arcs always, and FST would pick the right times to use it. If we stick to only 1 byte for the number of bytes per arc, the FST could simply not use the array when an arc is > 256 bytes.

          Robert Muir added a comment -

          Thanks Mike, I will set the option for now, we can address any potential perf hit in a number of different ways here (besides modifying FST itself).

          Robert Muir added a comment -

          I ran some quick numbers, using the syn file example here, just best of 3 runs:

          Impl                      Build time   RAM usage
          SynonymFilterFactory      6619 ms      207.92 MB
          FSTSynonymFilterFactory   463 ms       3.51 MB

          I modified the builder slightly to build the FST more efficiently for this, will upload the updated patch.

          So I think the build time and RAM consumption are really improved; the next thing is to benchmark the runtime perf.

          David Smiley added a comment -

          Wow that's striking; nice work guys. FSTs are definitely one of those killer pieces of technology in Lucene.

          The difference in build time is surprising to me. Any theory why SynonymFilterFactory takes so much more time to build?

          Robert Muir added a comment -

          The difference in build time is surprising to me. Any theory why SynonymFilterFactory takes so much more time to build?

          Yes, it's the n^2 portion, where you have a synonym entry like this: a, b, c, d
          In reality this creates entries like this:
          a -> a
          a -> b
          a -> c
          a -> d
          b -> a
          b -> b
          ...

          In the current impl, this is done using some inefficient data structures (like nested CharArrayMaps with Token), as well as calling merge().

          In the FST impl, we don't use any nested structures (instead, input and output entries are just phrases), and we explicitly deduplicate both inputs and outputs during construction; the FST output is just a List<Integer>, basically pointing to ords in the deduplicated BytesRefHash.

          So during construction, add() is just a HashMap lookup on the input phrase, a BytesRefHash get/put via UTF16toUTF8WithHash to get the output ord, and an append to an ArrayList.

          This code isn't really optimized right now and we can definitely speed it up even more in the future, but the main thing right now is to ensure the filter performance is good.
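
          The n^2 expansion described above can be sketched directly (hypothetical helper, not the patch's code): an undirected rule "a, b, c, d" becomes every input -> output pair, including the identity mappings, so n terms produce n × n entries.

          ```java
          import java.util.*;

          // Expand one undirected synonym rule into all of its
          // input -> output pairs, as the current impl effectively does.
          class RuleExpansion {
            static List<String[]> expand(String[] terms) {
              List<String[]> pairs = new ArrayList<>();
              for (String in : terms) {
                for (String out : terms) {
                  pairs.add(new String[] { in, out });  // e.g. a -> b
                }
              }
              return pairs;
            }
          }
          ```

          Four terms yield 16 pairs; a rule with 50 alternatives yields 2500, which is where the build-time cost comes from when each pair is inserted into nested map structures.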

          Robert Muir added a comment -

          Here is a patch with a little microbenchmark... so we have some tuning to do.

          The benchmark analyzes a short string a million times that doesn't match any synonyms (actually the Solr default).

          Impl                     ms
          SynonymsFilter           1692
          FST with array arcs      2794
          FST with no array arcs   8823

          So, disabling the array arcs is a pretty crucial hit here. But we could do other things to speed up this common case, e.g. with Daciuk we could build a CharRunAutomaton of the K-prefixes of the synonyms; this would be really fast at rejecting terms that don't match any syns.

          Or we could explicitly put our BytesRef output in a byte[], and use long pointers as outputs.

          Or we could speed up FST! But I think it's interesting to see how important this parameter is.
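
          The fast-reject idea mentioned above can be illustrated with a much simpler stand-in (this is a sketch, not the proposed CharRunAutomaton): precompute a set of the first word of every rule, and bail out of the synonym machinery for any token that can't start a match. The automaton over K-character prefixes is the same trick with a denser structure.

          ```java
          import java.util.*;

          // Reject most tokens without touching the full synonym structure,
          // using a precomputed set of rule-starting words.
          class PrefixReject {
            private final Set<String> firstWords = new HashSet<>();

            PrefixReject(List<String[]> rules) {
              for (String[] rule : rules) {
                firstWords.add(rule[0]);  // rule[0] = first input word of the rule
              }
            }

            // Returns false when the token cannot possibly start a synonym match.
            boolean mightMatch(String token) {
              return firstWords.contains(token);
            }
          }
          ```

          For the common "no synonyms match" case being benchmarked, most tokens would be rejected by this check alone and never walk the FST.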

          Michael McCandless added a comment -

          Wow, it's very important to allow arcs to be encoded as arrays (for the binary search on lookup). I think we should just fix FST... I'll think about it. MemoryCodec would also get big gains here.

          Robert Muir added a comment -

          I agree, this would be the best solution. Maybe we should just open a separate issue for that?

          We can let this one be for now until that is resolved, and can even continue working on other parts of it.

          Michael McCandless added a comment -

          New patch, including some optimizing to FST (which we can commit under a separate issue): array arcs can now be any size, and I re-use the BytesReader inner class that's created for parsing arcs.

          Robert Muir added a comment -

          New patch, including some optimizing to FST (which we can commit under a separate issue)

          Works! I don't think we need to open a new issue; I didn't think you would come back with a patch in just two hours!

          I'll play with the patch some now and see what I can do with it.

          Robert Muir added a comment -

          Updated patch: this tableizes the first FST arcs for Latin-1.

          Precomputing this tiny table speeds up this filter a ton (~3000 ms -> ~2000 ms), and I think it is a cheap, easy win for the terms index too.
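
          The root-arc table can be sketched like this (hypothetical types; the real patch caches Arc objects inside the FST): labels below 0x100 get a direct slot, so the first transition for Latin-1 input is a single array read, and anything else falls back to the general arc lookup, signalled here by -1.

          ```java
          // Direct-index cache for the FST's root arcs, covering Latin-1 labels.
          class RootArcCache {
            private final int[] table = new int[0x100];  // label -> target node, -1 = absent

            RootArcCache() {
              java.util.Arrays.fill(table, -1);
            }

            void put(int label, int target) {
              if (label < 0x100) table[label] = target;
            }

            // O(1) for Latin-1 labels; -1 means "use the slow path".
            int find(int label) {
              return label < 0x100 ? table[label] : -1;
            }
          }
          ```

          Since the root node is visited once per input token, replacing its arc search with one array read is where the ~3000 ms -> ~2000 ms win comes from.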

          Robert Muir added a comment -

          Benchmark with the synonyms.zip attached to this issue (so that we are actually matching synonyms):
          In this case I only analyzed the text 100,000 times, as it's a lot more output.
          I also checked they are emitting the same stuff:

          Impl             ms
          SynonymsFilter   112527
          FST              22872

          So, that's 5x faster, probably due to avoiding the expensive cloning.

          I think we are fine on performance.

          Robert Muir added a comment -

          So I don't forget: let's not waste an arc bitflag marking an arc as 'first'...

          I hear the secret is instead arc.target == startNode

          Michael McCandless added a comment -

          New patch, moving the root arcs cache into FST, not using up our last precious arc bit.

          Michael McCandless added a comment -

          Another rev of the patch: I did a hard bump of the FST version (so existing trunk indices must be rebuilt), added a NOTE in suggest's FST impl that the file format is experimental, removed maxVerticalContext, and fixed a false test failure.

          Michael McCandless added a comment -

          I think this is ready to commit, but I'd like to rename existing syn filter to SlowSynonymFilter and rename the new one to SynonymFilter.

          Because there are some minor diffs (deduping rules, lowercasing), for Solr to cutover I think we need some back compat logic; I'll open a separate issue for this.

          Robert Muir added a comment -

          I'll try to add some not-so-sophisticated backwards here!

          Yonik Seeley added a comment -

          but I'd like to rename existing syn filter to SlowSynonymFilter and rename the new one to SynonymFilter.

          But the lookup on the original is still faster, right? And if someone has small synonym dicts (actually pretty common in my experience, since SynonymFilter isn't necessarily used to inject synonyms in the traditional sense, but for any mapping task), then build time and mem use won't be much of an issue (esp if the input to match is mostly single words).

          This looks great for large synonym maps, but perhaps instead of Slow* or Fast* we could name them for the implementation and either name the new one FSTSynonymFilter or rename the current one to MapSynonymFilter? Or is the plan to actually deprecate the current SynonymFilter?

          Michael McCandless added a comment -

          But the lookup on the original is still faster, right?

          That was before we optimized FST for this usage case.

          Now, from the testing above, it looks like we are faster when syns actually match; if no syns match the two are around the same speed.

          Separately: shouldn't we not have any syns in the default text_en field type? Like we can have a synonyms.txt but comment out all the rules in there?

          I don't think we should keep the old one around, ie, we should [eventually] replace it with the new one.

          Robert Muir added a comment -

          This one is faster for the simple reason that the TokenFilter uses captureState() and restoreState() [and has logic to minimize cloning in more cases] instead of AttributeSource.cloneAttributes()/copyTo()
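
          The difference can be sketched conceptually (this is not the Lucene API, just the idea behind it): captureState() copies only the attribute values into a lightweight State object, and restoreState() copies them back, instead of deep-cloning the whole AttributeSource for every buffered token the way cloneAttributes()/copyTo() does.

          ```java
          // Minimal model of save/restore-style token state handling.
          class TokenStateSketch {
            String term;  // stand-in for CharTermAttribute
            int posInc;   // stand-in for PositionIncrementAttribute

            static final class State {
              final String term;
              final int posInc;
              State(String term, int posInc) { this.term = term; this.posInc = posInc; }
            }

            State captureState() {
              return new State(term, posInc);  // cheap value copy, no new attribute objects
            }

            void restoreState(State s) {
              term = s.term;
              posInc = s.posInc;
            }
          }
          ```

          Buffering a token costs one small value object here, versus allocating and wiring up an entire cloned attribute set per token in the cloneAttributes() approach.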

          Yonik Seeley added a comment -

          Now, from the testing above, it looks like we are faster when syns actually match; if no syns match the two are around the same speed.

          Oh cool! I was looking at "1692" for the SynonymsFilter and a drop from "~3000ms -> ~2000ms" for the FST version. I assumed Robert's last benchmark was building and not lookup (the 112527/22872).

          Separately: shouldn't we not have any syns in the default text_en field type?

          I dunno... it's nice for both demonstration and testing (and it's in the current tutorial).

          Robert Muir added a comment -

          Updated patch:

          • renamed to SynonymFilter
          • added not-so-sophisticated backwards layer
          • added tests
          • added parser for format="wordnet"
          • removed contrib/wordnet

          But I found some bugs (well, one surely is) in the new tests, so I added nocommits here.

          Robert Muir added a comment -

          Just updating the patch to trunk; the nocommits remain...

          Michael McCandless added a comment -

          New patch, also setting offsets in the produced tokens (the wordnet test passes), and adding a NOTE about the dup output words issue.

          I think it's finally ready to commit!

          Glen Stampoultzis added a comment -

          Does this support multi word synonyms? I tried one of the 3.4 builds and I was getting odd results on multi word synonyms.

          Michael McCandless added a comment -

          It does support multi-word synonyms – can you give some details on the odd behavior? Maybe email dev@lucene.apache.org, or open a new Jira issue?

          Glen Stampoultzis added a comment -

          I'll try and put together a simple test case just to make sure I've got something repeatable and post to the list. I think it might have been related to synonym fields tokenized using KeywordTokenizerFactory.

          Yonik Seeley added a comment -

          Looks like when Solr's synonym parsing was moved to the analysis module, it was also rewritten, introducing escaping bugs.

          Examples:
          a\,a is no longer treated as a single token
          a\=>a is no longer treated as a single token
          a\ta is treated as "ata" instead of containing a tab character

          I didn't do a full review, so I'm not sure if there are other differences in behavior.

          Robert Muir added a comment -

          Do you have a test?

          Because we have tests for this:

              String testFile = 
                "a\\=>a => b\\=>b\n" +
                "a\\,a => b\\,b";
          ...
              assertAnalyzesTo(analyzer, "a=>a",
                  new String[] { "b=>b" },
                  new int[] { 1 });
              
              assertAnalyzesTo(analyzer, "a,a",
                  new String[] { "b,b" },
                  new int[] { 1 });
          
          Yonik Seeley added a comment -

          I just tested by hand:
          I added a line to synonyms.txt "a\,a => b\,b", fired up the example server and then executed the following query:
          http://localhost:8983/solr/select?q=a,a&debugQuery=true

          I then verified that the synonyms were in effect in general, via:
          http://localhost:8983/solr/select?q=fooaaa&debugQuery=true

          Yonik Seeley added a comment -

          User error - the field type changes in the example schema tripped up the user (and me too). The standard tokenizer does not keep "a,a" as a single token.


            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
               0
              Watchers:
               3

              Dates

              • Created:
                Updated:
                Resolved:
