Lucene - Core: LUCENE-3919

more thorough testing of analysis chains

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA, 3.6.1
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      In Lucene we essentially test each analysis component separately. We also give some
      good testing to the example Analyzers we provide that combine them.

      But we don't test the various combinations that are possible, which is bad because
      it doesn't cover the possibilities for custom analyzers (especially since lots of Solr
      users etc. define their own).
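
To make the idea concrete, here is a minimal sketch of testing random combinations. It does not use Lucene: the "tokenizers" and "filters" are hypothetical stand-ins, and the class and method names are invented for illustration. It only shows the shape of the approach: pick a random tokenizer, stack a random number of random filters on top, and record the seed so a failing chain can be rebuilt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Illustrative stand-ins only: none of these are Lucene classes.
final class RandomChainSketch {

  // Hypothetical "tokenizers": turn the input text into a list of terms.
  static final List<Function<String, List<String>>> TOKENIZERS =
      Arrays.<Function<String, List<String>>>asList(
          text -> new ArrayList<String>(Arrays.asList(text.split("\\s+"))), // whitespace-like
          text -> new ArrayList<String>(Arrays.asList(text)));              // keyword-like: one term

  // Hypothetical "filters": transform each term independently.
  static final List<UnaryOperator<String>> FILTERS =
      Arrays.<UnaryOperator<String>>asList(
          term -> term.toLowerCase(),
          term -> new StringBuilder(term).reverse().toString(),
          term -> term.length() > 3 ? term.substring(0, 3) : term);

  // Builds a chain from the seed and runs it, so the same seed replays the same chain.
  static List<String> runRandomChain(long seed, String text) {
    Random random = new Random(seed);
    List<String> terms = TOKENIZERS.get(random.nextInt(TOKENIZERS.size())).apply(text);
    int numFilters = random.nextInt(4); // 0 to 3 filters
    for (int i = 0; i < numFilters; i++) {
      terms.replaceAll(FILTERS.get(random.nextInt(FILTERS.size())));
    }
    return terms;
  }

  // Tries seeds 0..iterations-1; returns the first seed whose chain throws, or -1 if none does.
  static long firstFailingSeed(int iterations, String text) {
    for (long seed = 0; seed < iterations; seed++) {
      try {
        runRandomChain(seed, text);
      } catch (RuntimeException e) {
        return seed;
      }
    }
    return -1;
  }
}
```

Returning the failing seed is the important design choice: a random test is only useful if the exact chain that failed can be reproduced afterwards.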

      Attachments

      1. LUCENE-3919-generics-fixes.patch (9 kB, Uwe Schindler)
      2. LUCENE-3919.patch (9 kB, Robert Muir)
      3. LUCENE-3919.patch (9 kB, Robert Muir)
      4. LUCENE-3919.patch (10 kB, Robert Muir)

          Activity

          Robert Muir added a comment -

          Really rough initial stab.

          The first time I ran this, it seems to have found a bug:

              [junit] Exception from random analyzer: tokenizer=class org.apache.lucene.analysis.core.KeywordTokenizer
              [junit] filters=class org.apache.lucene.analysis.cz.CzechStemFilter,class org.apache.lucene.analysis.cjk.CJKWidthFilter
              [junit] java.lang.ArrayIndexOutOfBoundsException: -1
              [junit] 	at org.apache.lucene.analysis.cz.CzechStemmer.normalize(CzechStemmer.java:148)
              [junit] 	at org.apache.lucene.analysis.cz.CzechStemmer.stem(CzechStemmer.java:47)
              [junit] 	at org.apache.lucene.analysis.cz.CzechStemFilter.incrementToken(CzechStemFilter.java:52)
              [junit] 	at org.apache.lucene.analysis.cjk.CJKWidthFilter.incrementToken(CJKWidthFilter.java:62)
          
          Robert Muir added a comment -

          That one reproduces with: ant test -Dtestcase=TestRandomChains -Dtestmethod=testRandomChains -Dtests.seed=104b56460756fb6:33a429fcfb5503db:-1d952b2910440c7d -Dargs="-Dfile.encoding=UTF-8"

          I'll see if I can figure out what's going on.

          Michael McCandless added a comment -

          Awesome!

          Robert Muir added a comment -

          By the way: generics are totally broken with the test!

          Uwe Schindler added a comment -

          Please don't commit... I will take care - tomorrow!

          The ctor detection should also be improved: all Tokenizers/TokenFilters with matchVersion will not work. I will think about some more intelligent ctor parsing: Class.getConstructors() -> choose one that has at least a Reader/TokenStream param; if a Version is also there, fill in matchVersion, and maybe fill all other parameters with random values (int, bool, ...)? Random params should always produce something correct, or the ctor should throw IllegalArgumentException/... .
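
A rough sketch of the ctor selection described above, using plain reflection. This is not the code from the patch: FakeVersion and FakeTokenizer are hypothetical stand-ins for Lucene's Version and a real Tokenizer, and the range of the random ints is an arbitrary assumption. It picks the first public ctor that takes a Reader, fills in the version constant, randomizes int/boolean params, and treats an IllegalArgumentException from the ctor as an acceptable rejection.

```java
import java.io.Reader;
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.util.Random;

final class CtorPickerSketch {

  // Hypothetical stand-in for a version constant; not Lucene's Version class.
  enum FakeVersion { CURRENT }

  // Hypothetical component with a (version, Reader, int) ctor that validates its args.
  public static final class FakeTokenizer {
    final int maxLen;

    public FakeTokenizer(FakeVersion version, Reader in, int maxLen) {
      if (maxLen <= 0) {
        throw new IllegalArgumentException("maxLen must be > 0");
      }
      this.maxLen = maxLen;
    }
  }

  // Returns an instance, or null if no ctor is usable or the random args were rejected.
  static Object instantiate(Class<?> clazz, Reader input, Random random) {
    for (Constructor<?> ctor : clazz.getConstructors()) {
      Class<?>[] types = ctor.getParameterTypes();
      Object[] args = new Object[types.length];
      boolean hasInput = false;
      boolean supported = true;
      for (int i = 0; i < types.length; i++) {
        if (types[i] == Reader.class) {
          args[i] = input;
          hasInput = true;
        } else if (types[i] == FakeVersion.class) {
          args[i] = FakeVersion.CURRENT;
        } else if (types[i] == int.class) {
          args[i] = random.nextInt(21) - 10; // assumed range: -10..10, includes invalid values
        } else if (types[i] == boolean.class) {
          args[i] = random.nextBoolean();
        } else {
          supported = false; // a parameter type this sketch cannot fill in
        }
      }
      if (!hasInput || !supported) {
        continue;
      }
      try {
        return ctor.newInstance(args);
      } catch (InvocationTargetException e) {
        if (e.getCause() instanceof IllegalArgumentException) {
          return null; // the ctor rejected the random params, which is allowed
        }
        throw new RuntimeException(e.getCause());
      } catch (ReflectiveOperationException e) {
        throw new RuntimeException(e);
      }
    }
    return null;
  }
}
```

Any other exception from the ctor is rethrown, since under the rule above it would indicate a bug in the component rather than bad random input.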

          Robert Muir added a comment -

          The CzechStemmer bug is easy: it's because of a zero-length term from KeywordTokenizer.
          I'll commit a trivial fix and test for that.

          The next time I ran the test, I got a new failure:
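
For illustration, this is how a zero-length term can produce index -1 and what a guard against it might look like. The method below is a made-up suffix-stripping step, not the real CzechStemmer code, and not necessarily the fix that was committed.

```java
final class EmptyTermGuardSketch {

  // Hypothetical stemming step: drops one trailing vowel and returns the new length.
  // Without the len == 0 check, buffer[len - 1] would read index -1 for an empty
  // term and throw ArrayIndexOutOfBoundsException.
  static int stripTrailingVowel(char[] buffer, int len) {
    if (len == 0) {
      return 0; // zero-length term: nothing to stem
    }
    char last = buffer[len - 1];
    if (len > 1 && "aeiouy".indexOf(last) >= 0) {
      return len - 1;
    }
    return len;
  }
}
```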

              [junit] TEST FAIL: useCharFilter=true text=⩀⪴⫈⪆⩞ ye ه
              [junit] Exception from random analyzer: tokenizer=class org.apache.lucene.analysis.ngram.NGramTokenizer
              [junit] filters=class org.apache.lucene.analysis.shingle.ShingleFilter
              [junit] NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtestmethod=testRandomChains -Dtests.seed=104b56460756fb6:33a429fcfb5503db:-1d952b2910440c7d -Dargs="-Dfile.encoding=UTF-8"
              [junit] java.lang.AssertionError: endOffset must be >= startOffset
              [junit] java.lang.RuntimeException: java.lang.AssertionError: endOffset must be >= startOffset
              [junit] 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:397)
          

          This is gonna be fun...
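
The failed assertion is an offset sanity check. A minimal standalone sketch of that kind of check follows; it is not the actual BaseTokenStreamTestCase code. The endOffset rule and its message come from the log above, while the non-negative startOffset rule is an added assumption.

```java
final class OffsetCheckSketch {

  // Validates parallel arrays of token start/end offsets.
  static void checkOffsets(int[] startOffsets, int[] endOffsets) {
    for (int i = 0; i < startOffsets.length; i++) {
      if (startOffsets[i] < 0) {
        // Assumption of this sketch, not taken from the log above.
        throw new AssertionError("startOffset must be >= 0");
      }
      if (endOffsets[i] < startOffsets[i]) {
        throw new AssertionError("endOffset must be >= startOffset");
      }
    }
  }
}
```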

          Robert Muir added a comment -

          "Please don't commit... I will take care - tomorrow!"

          Thank you Uwe! We can just leave this issue open.

          In the meantime I will run the test and try to fix the bugs it finds!

          Robert Muir added a comment -

          I committed a fix and tests for the empty term case (only the Czech stemmer seemed to have this bug).

          I opened LUCENE-3920 for the strange NGram+Shingle offsets bug.

          Robert Muir added a comment -

          Updated patch: looks for Version+Reader ctors and avoids CachingTokenFilter.

          Robert Muir added a comment -

          Updated patch: disables the n-gram filters (see LUCENE-3920) and looks for Version+TokenStream ctors to get a few more filters. Also sped up the test a bit...

          Now it passes, so Uwe can do his work.

          Robert Muir added a comment -

          I'm going to commit this. It's a test: we can improve it later.

          Robert Muir added a comment -

          Committed the first iteration... let's improve the test later.

          Uwe Schindler added a comment -

          Here are the generics fixes and some additional checks to exclude all shit of non-public, anonymous, or member classes.
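
The exclusion checks can be sketched with standard reflection calls. This is not the code from the attached patch; the class and method names are invented, and the exact set of conditions in the patch may differ.

```java
import java.lang.reflect.Modifier;

final class ClassFilterSketch {

  // Accepts only public, top-level, named classes; a sketch of the filtering idea.
  static boolean isCandidate(Class<?> clazz) {
    return Modifier.isPublic(clazz.getModifiers())
        && !clazz.isAnonymousClass()
        && !clazz.isMemberClass();
  }
}
```

With this filter, java.util.ArrayList is accepted, while java.util.Map.Entry (a member class) and any anonymous class are rejected.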

          Uwe Schindler added a comment -

          Bulk close for 3.6.1


            People

            • Assignee: Robert Muir
            • Reporter: Robert Muir
            • Votes: 0
            • Watchers: 1
