Solr
  1. Solr
  2. SOLR-1674

improve analysis tests, cut over to new API

    Details

    • Type: Test Test
    • Status: Reopened
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      This patch

      • converts all analysis tests to use the new tokenstream api
      • converts most tests to use the more stringent assertion mechanisms from lucene
      • adds new tests to improve coverage

      Most bugs found by more stringent testing have been fixed, with the exception of SynonymFilter.
      The problems with this filter are more serious, the previous tests were essentially a no-op.
      The new tests for SynonymFilter test the current behavior, but have FIXMEs with what I think the old test wanted to expect in the comments.

      1. SOLR-1674.patch
        180 kB
        Robert Muir
      2. SOLR-1674.patch
        182 kB
        Robert Muir
      3. SOLR-1674_speedup.patch
        19 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          here is an updated patch.
          I think I managed to resolve some problems with synonymfilter, especially the recursion tests (I believe they were simply typos and there isnt a bug), and the position increments (this was a problem in the tests)

          so in my opinion, the only problem left is SOLR-1670, the repeat problem.

          Show
          Robert Muir added a comment - here is an updated patch. I think I managed to resolve some problems with synonymfilter, especially the recursion tests (I believe they were simply typos and there isnt a bug), and the position increments (this was a problem in the tests) so in my opinion, the only problem left is SOLR-1670 , the repeat problem.
          Hide
          Robert Muir added a comment - - edited

          Hello, I see Uwe has commented on SOLR-1657 that he would help convert tokenstreams to the new api, but he needs this patch (the tests) applied first.

          Is it possible for someone to take a look at this patch to get things moving along? its only tests, no source code changes.

          Show
          Robert Muir added a comment - - edited Hello, I see Uwe has commented on SOLR-1657 that he would help convert tokenstreams to the new api, but he needs this patch (the tests) applied first. Is it possible for someone to take a look at this patch to get things moving along? its only tests, no source code changes.
          Hide
          Mark Miller added a comment -

          I think TestCapitalizationFilter and TestMultiWordSynonyms need to be brought back up to trunk? A quick patch attempt is giving me problems.

          Show
          Mark Miller added a comment - I think TestCapitalizationFilter and TestMultiWordSynonyms need to be brought back up to trunk? A quick patch attempt is giving me problems.
          Hide
          Robert Muir added a comment -

          this is because you use $Id$. its not my fault...

          Show
          Robert Muir added a comment - this is because you use $Id$. its not my fault...
          Hide
          Mark Miller added a comment - - edited

          Thats BS - they are fixable in the patch

          I hate those damn $id tags - every time ... I'll fix them.

          Show
          Mark Miller added a comment - - edited Thats BS - they are fixable in the patch I hate those damn $id tags - every time ... I'll fix them.
          Hide
          Robert Muir added a comment -

          I sent an email... if no one cares about these $id$ tags then when i get back from vacation i will gladly volunteer to submit a patch to remove them

          A quick workaround is to change them back to $Id$ in your local, then apply the patch...

          Show
          Robert Muir added a comment - I sent an email... if no one cares about these $id$ tags then when i get back from vacation i will gladly volunteer to submit a patch to remove them A quick workaround is to change them back to $Id$ in your local, then apply the patch...
          Hide
          Mark Miller added a comment -

          Fixed a small issue with protWords.txt not matching prowrods.txt on unix systems.

          If there are no objections I will commit this beautiful addition to our analysis tests soon.

          Show
          Mark Miller added a comment - Fixed a small issue with protWords.txt not matching prowrods.txt on unix systems. If there are no objections I will commit this beautiful addition to our analysis tests soon.
          Hide
          Shalin Shekhar Mangar added a comment -

          All tests pass after renaming protWords.txt to protwords.txt. Unfortunately, this is too big to review in detail right now but I trust Robert to do the right thing

          If there are no objections I will commit this beautiful addition to our analysis tests soon.

          +1

          Show
          Shalin Shekhar Mangar added a comment - All tests pass after renaming protWords.txt to protwords.txt. Unfortunately, this is too big to review in detail right now but I trust Robert to do the right thing If there are no objections I will commit this beautiful addition to our analysis tests soon. +1
          Hide
          Mark Miller added a comment -

          Thanks a lot Robert! Test contributions are rare and exciting!

          Show
          Mark Miller added a comment - Thanks a lot Robert! Test contributions are rare and exciting!
          Hide
          Robert Muir added a comment -

          hmm it appears i may have slowed down the junit tests with the previous patch, unfortunately.

          attached is a patch to speed them up... (maybe not necessary, but they were very very slow on my laptop)

          Show
          Robert Muir added a comment - hmm it appears i may have slowed down the junit tests with the previous patch, unfortunately. attached is a patch to speed them up... (maybe not necessary, but they were very very slow on my laptop)
          Hide
          Yonik Seeley added a comment -

          Hmmm, isn't it a bug that this passes?
          assertTokenizesTo(map, "a b", new String[]

          { "ab", "ab", "ab" }

          );

          w/o the 1670 fix, we get "ab/ab/ab" (repeated tokens all at the same position). That's not the same as three "ab" tokens in a row.

          Also, we seem to have lost matching flexibility with overlapping tokens. "a/aa" should be the same as "aa/a", but if you change the order of overlapping tokens now, the tests fail.

          Didn't ya guys like my a/aa syntax to indicate overlapping tokens? It certainly made it faster for me to write the original testcases

          Show
          Yonik Seeley added a comment - Hmmm, isn't it a bug that this passes? assertTokenizesTo(map, "a b", new String[] { "ab", "ab", "ab" } ); w/o the 1670 fix, we get "ab/ab/ab" (repeated tokens all at the same position). That's not the same as three "ab" tokens in a row. Also, we seem to have lost matching flexibility with overlapping tokens. "a/aa" should be the same as "aa/a", but if you change the order of overlapping tokens now, the tests fail. Didn't ya guys like my a/aa syntax to indicate overlapping tokens? It certainly made it faster for me to write the original testcases
          Hide
          Mark Miller added a comment -

          Robert is on holiday I think - will reopen for now.

          Show
          Mark Miller added a comment - Robert is on holiday I think - will reopen for now.
          Hide
          Robert Muir added a comment -

          Hmmm, isn't it a bug that this passes?

          w/o the 1670 fix, we get "ab/ab/ab" (repeated tokens all at the same position). That's not the same as three "ab" tokens in a row.

          The tests pass because it "ab", "ab", "ab". If we want to validate pos incs, we should change the test to:

          assertTokenizesTo(map, "a b", 
            new String[] { "ab", "ab", "ab" },
            new int [] { 1, 0, 0 });
          

          this way the posIncs are tested too.

          Also, we seem to have lost matching flexibility with overlapping tokens. "a/aa" should be the same as "aa/a", but if you change the order of overlapping tokens now, the tests fail.

          This "flexibility" caused things such as SOLR-1670, SOLR-1667, SOLR-1662, and SOLR-1660. When I switched to less "flexible" tests, these bugs were found. So sorry to see it go.

          Show
          Robert Muir added a comment - Hmmm, isn't it a bug that this passes? w/o the 1670 fix, we get "ab/ab/ab" (repeated tokens all at the same position). That's not the same as three "ab" tokens in a row. The tests pass because it "ab", "ab", "ab". If we want to validate pos incs, we should change the test to: assertTokenizesTo(map, "a b" , new String [] { "ab" , "ab" , "ab" }, new int [] { 1, 0, 0 }); this way the posIncs are tested too. Also, we seem to have lost matching flexibility with overlapping tokens. "a/aa" should be the same as "aa/a", but if you change the order of overlapping tokens now, the tests fail. This "flexibility" caused things such as SOLR-1670 , SOLR-1667 , SOLR-1662 , and SOLR-1660 . When I switched to less "flexible" tests, these bugs were found. So sorry to see it go.
          Hide
          Mark Miller added a comment -

          I've committed the speed up patch, thanks Robert!

          Leaving open for posInc tests

          Show
          Mark Miller added a comment - I've committed the speed up patch, thanks Robert! Leaving open for posInc tests
          Hide
          Hoss Man added a comment -

          Correcting Fix Version based on CHANGES.txt, see this thread for more details...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3

          Show
          Hoss Man added a comment - Correcting Fix Version based on CHANGES.txt, see this thread for more details... http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3
          Hide
          Mark Miller added a comment -

          Going to close this if no one objects...

          Show
          Mark Miller added a comment - Going to close this if no one objects...
          Hide
          Robert Muir added a comment -

          i'd still like to add posinc tests for some of these tokenstreams,
          but also other ones in the analyzers module too (e.g. ones from lucene contrib).

          i'll set 3.2 for now.

          Show
          Robert Muir added a comment - i'd still like to add posinc tests for some of these tokenstreams, but also other ones in the analyzers module too (e.g. ones from lucene contrib). i'll set 3.2 for now.
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Solr 4.9.

            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:

                Development