Lucene - Core
  1. Lucene - Core
  2. LUCENE-3720

OOM in TestBeiderMorseFilter.testRandom

    Details

    • Type: Test Test
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 4.0, 5.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This has been OOM'ing a lot... we should see why, its likely a real bug.

      ant test -Dtestcase=TestBeiderMorseFilter -Dtestmethod=testRandom -Dtests.seed=2e18f456e714be89:310bba5e8404100d:-3bd11277c22f4591 -Dtests.multiplier=3 -Dargs="-Dfile.encoding=ISO8859-1"

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1386670

          LUCENE-3720: fix BeiderMorseFilter OOM issues

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1386670 LUCENE-3720 : fix BeiderMorseFilter OOM issues
          Hide
          Robert Muir added a comment -

          Thanks for fixing this Thomas!

          I opened LUCENE-4400 to add support for the new 1.7 encoder.

          Show
          Robert Muir added a comment - Thanks for fixing this Thomas! I opened LUCENE-4400 to add support for the new 1.7 encoder.
          Hide
          Robert Muir added a comment -
          Show
          Robert Muir added a comment - Release is officially out: http://mail-archives.apache.org/mod_mbox/www-announce/201209.mbox/%3CCACZkXPz4wOAzRVf41h6PSVDp9PV-3uj%3D8e2ZViByeX5EpmYWzw%40mail.gmail.com%3E I'll run tests/checks with the patch and commit shortly.
          Hide
          Robert Muir added a comment -

          I looked, it seems they are still in the release process (its just that 1.7 is starting to appear on mirrors).

          we should wait until the official release announcement for the upgrade.

          Show
          Robert Muir added a comment - I looked, it seems they are still in the release process (its just that 1.7 is starting to appear on mirrors). we should wait until the official release announcement for the upgrade.
          Hide
          Robert Muir added a comment -

          patch upgrading the jar, re-enabling the test, and removing the warnings.

          I ran the following 100 times from a shell script with no fails:

          ant test -Dtestcase=TestBeiderMorseFilter -Dtests.multiplier=3 -Dtests.nightly=true -Dtestmethod=testRandom

          Show
          Robert Muir added a comment - patch upgrading the jar, re-enabling the test, and removing the warnings. I ran the following 100 times from a shell script with no fails: ant test -Dtestcase=TestBeiderMorseFilter -Dtests.multiplier=3 -Dtests.nightly=true -Dtestmethod=testRandom
          Hide
          Robert Muir added a comment -

          Commons-codec 1.7 has been released with fixes. I'm testing them out now.

          Show
          Robert Muir added a comment - Commons-codec 1.7 has been released with fixes. I'm testing them out now.
          Hide
          Robert Muir added a comment -

          Thanks so much for digging into this Thomas!

          When the release comes out we can add support and tests for the new encoder too.

          Show
          Robert Muir added a comment - Thanks so much for digging into this Thomas! When the release comes out we can add support and tests for the new encoder too.
          Hide
          Thomas Neidhart added a comment - - edited

          Ok, it took me some time to understand the test and filter code. In fact my original statement was wrong, the output order of phoneme tokens from the codec was not deterministic. This has been fixed in commons-codec, and the test runs through now.

          btw. the next release will happen quite soon afaik and will also include a new phonetic encoder called Nysiis, which should perform slightly better than Soundex (see https://issues.apache.org/jira/browse/CODEC-63). Any feedback is very welcome!

          Show
          Thomas Neidhart added a comment - - edited Ok, it took me some time to understand the test and filter code. In fact my original statement was wrong, the output order of phoneme tokens from the codec was not deterministic. This has been fixed in commons-codec, and the test runs through now. btw. the next release will happen quite soon afaik and will also include a new phonetic encoder called Nysiis, which should perform slightly better than Soundex (see https://issues.apache.org/jira/browse/CODEC-63 ). Any feedback is very welcome!
          Hide
          Robert Muir added a comment -

          Thomas: the filter is ok, it does not need to be thread safe.
          Analyzer has as threadlocal for the tokenizer chain, each
          thread gets its own filter.

          Show
          Robert Muir added a comment - Thomas: the filter is ok, it does not need to be thread safe. Analyzer has as threadlocal for the tokenizer chain, each thread gets its own filter.
          Hide
          Thomas Neidhart added a comment -

          The BeiderMorseFilter uses internally a java.util.regex.Matcher which is not supposed to be thread-safe. When doing the random test, the same analyzer is used by different threads, so my first guess would be that the problems are related to this somehow.

          Show
          Thomas Neidhart added a comment - The BeiderMorseFilter uses internally a java.util.regex.Matcher which is not supposed to be thread-safe. When doing the random test, the same analyzer is used by different threads, so my first guess would be that the problems are related to this somehow.
          Hide
          Thomas Neidhart added a comment -

          The test starts multiple threads for the analysis, so maybe there is a concurrency problem in the BeiderMorseCodec.
          When doing a similar test sequentially, the result is always the same.

          I will look further into this.

          Show
          Thomas Neidhart added a comment - The test starts multiple threads for the analysis, so maybe there is a concurrency problem in the BeiderMorseCodec. When doing a similar test sequentially, the result is always the same. I will look further into this.
          Hide
          Robert Muir added a comment -

          Hmm i havent tested the patched version yet myself, but thats probably where we check that
          the output is deterministic (we replay the same random strings again and check the output is the same as before).

          Show
          Robert Muir added a comment - Hmm i havent tested the patched version yet myself, but thats probably where we check that the output is deterministic (we replay the same random strings again and check the output is the same as before).
          Hide
          Thomas Neidhart added a comment -

          Hi,

          I looked at the testcase and it runs through with the patched commons-codec, but I received an assertion:

              [junit] Testcase: testRandom(org.apache.lucene.analysis.phonetic.TestBeiderMorseFilter):	FAILED
              [junit] term 8 expected:<skript[Skriptsk]ript> but was:<skript[dZriptdZ]ript>
              [junit] junit.framework.AssertionFailedError: term 8 expected:<skript[Skriptsk]ript> but was:<skript[dZriptdZ]ript>
          
          Show
          Thomas Neidhart added a comment - Hi, I looked at the testcase and it runs through with the patched commons-codec, but I received an assertion: [junit] Testcase: testRandom(org.apache.lucene.analysis.phonetic.TestBeiderMorseFilter): FAILED [junit] term 8 expected:<skript[Skriptsk]ript> but was:<skript[dZriptdZ]ript> [junit] junit.framework.AssertionFailedError: term 8 expected:<skript[Skriptsk]ript> but was:<skript[dZriptdZ]ript>
          Hide
          Robert Muir added a comment -

          There is now a patch on CODEC-132 , I haven't had the time to try to apply-rebuild-test with it, but it seems like the right fix.

          Show
          Robert Muir added a comment - There is now a patch on CODEC-132 , I haven't had the time to try to apply-rebuild-test with it, but it seems like the right fix.
          Hide
          Robert Muir added a comment -

          I suspect (as noted on CODEC-132) that this is caused by certain punctuation. I can dig a little, could be we apply a temporary
          workaround sanitising the offending character (and long term/better, create a fix for CODEC-132)

          Show
          Robert Muir added a comment - I suspect (as noted on CODEC-132 ) that this is caused by certain punctuation. I can dig a little, could be we apply a temporary workaround sanitising the offending character (and long term/better, create a fix for CODEC-132 )
          Hide
          Michael McCandless added a comment -

          I don't think we need to remove the Tokenfilter just because it has a known bug. It could be this is rarely hit in practice / on more normal input. All software has bugs...

          Show
          Michael McCandless added a comment - I don't think we need to remove the Tokenfilter just because it has a known bug. It could be this is rarely hit in practice / on more normal input. All software has bugs...
          Hide
          Uwe Schindler added a comment -

          I think we should disable the whole TokenFilter for now until this is fixed. It was not yet released as far as I know. So I would suggest to temporary "svn rm" it and revert this once this is fixed. This makes it disappear from 3.6 Release, but it's not lost. We should keep this issue open, until the root cause is fixed.

          Show
          Uwe Schindler added a comment - I think we should disable the whole TokenFilter for now until this is fixed. It was not yet released as far as I know. So I would suggest to temporary "svn rm" it and revert this once this is fixed. This makes it disappear from 3.6 Release, but it's not lost. We should keep this issue open, until the root cause is fixed.
          Hide
          Robert Muir added a comment -

          For now I added a big red bold warning and disabled the test.

          Show
          Robert Muir added a comment - For now I added a big red bold warning and disabled the test.
          Hide
          Robert Muir added a comment -

          Linking this issue to CODEC-132, I attached a test case to that issue.

          Show
          Robert Muir added a comment - Linking this issue to CODEC-132 , I attached a test case to that issue.
          Hide
          Robert Muir added a comment -

          Here's an initial test:

            public void testOOM2() throws Exception {
              String test = "200697900'-->&#1913348150;</  bceaeef >aadaabcf\"aedfbff<!--\'-->?>cae" +
                  "cfaaa><?&#<!--</script>&lang&fc;aadeaf?>>&bdquo<    cc =\"abff\"    /></   afe  >" +
                  "<script><!-- f(';<    cf aefbeef = \"bfabadcf\" ebbfeedd = fccabeb >";
              TokenStream ts = analyzer.tokenStream("bogus", new StringReader(test));
              ts.reset();
              while (ts.incrementToken()) {
                ;
              }
              ts.end();
              ts.close();
            }
          

          Ill see if i can make it blow up with a smaller string, and then port the test to just use commons-codec apis (not lucene ones).

          Show
          Robert Muir added a comment - Here's an initial test: public void testOOM2() throws Exception { String test = "200697900'-->&#1913348150;</ bceaeef >aadaabcf\"aedfbff<!--\'-->?>cae" + "cfaaa><?&#<!--</script>&lang&fc;aadeaf?>>&bdquo< cc =\"abff\" /></ afe >" + "<script><!-- f(';< cf aefbeef = \"bfabadcf\" ebbfeedd = fccabeb >"; TokenStream ts = analyzer.tokenStream("bogus", new StringReader(test)); ts.reset(); while (ts.incrementToken()) { ; } ts.end(); ts.close(); } Ill see if i can make it blow up with a smaller string, and then port the test to just use commons-codec apis (not lucene ones).
          Hide
          Robert Muir added a comment -

          I'm still working on a testcase for this.

          I think the underlying commons-codec algorithm "blows up" on some inputs, especially those from randomHtmlIshString.
          Then given multiple threads its easy for it to OOM.

          So far the best i have is: (ant test -Dtestcase=TestBeiderMorseFilter -Dtestmethod=testOOM -Dtests.seed=320e23c1f5dbf9c5:-766b86e7dc5e81df:148a8a4955f89b5e -Dargs="-Dfile.encoding=UTF-8" -Dtests.heapsize=64M

          This blows up, so I just have to get it to log the string that blows up I think, and we have a start to a testcase.

            public void testOOM() throws Exception {
              int numIter = 100000;
              int numTokens = 0;
              for (int i = 0; i < numIter; i++) {
                String s = _TestUtil.randomHtmlishString(random, 30);
                TokenStream ts = analyzer.tokenStream("bogus", new StringReader(s));
                ts.reset();
                while (ts.incrementToken()) {
                  numTokens++;
                }
                ts.end();
                ts.close();
              }
              System.out.println(numTokens);
            }
          
          Show
          Robert Muir added a comment - I'm still working on a testcase for this. I think the underlying commons-codec algorithm "blows up" on some inputs, especially those from randomHtmlIshString. Then given multiple threads its easy for it to OOM. So far the best i have is: (ant test -Dtestcase=TestBeiderMorseFilter -Dtestmethod=testOOM -Dtests.seed=320e23c1f5dbf9c5:-766b86e7dc5e81df:148a8a4955f89b5e -Dargs="-Dfile.encoding=UTF-8" -Dtests.heapsize=64M This blows up, so I just have to get it to log the string that blows up I think, and we have a start to a testcase. public void testOOM() throws Exception { int numIter = 100000; int numTokens = 0; for ( int i = 0; i < numIter; i++) { String s = _TestUtil.randomHtmlishString(random, 30); TokenStream ts = analyzer.tokenStream( "bogus" , new StringReader(s)); ts.reset(); while (ts.incrementToken()) { numTokens++; } ts.end(); ts.close(); } System .out.println(numTokens); }

            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development