SOLR-1706

wrong tokens output from WordDelimiterFilter depending upon options

    Details

      Description

      Below you can see that when I have requested only numeric concatenations (not words), some words are still sometimes output, ignoring the options I have provided, and even then in a very inconsistent way.

        assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
          new String[] { "42", "AutoCoder" },
          new int[] { 18, 21 },
          new int[] { 20, 30 },
          new int[] { 1, 1 });
      
        assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
          new String[] { "42", "AutoCoder", "56" },
          new int[] { 18, 21, 33 },
          new int[] { 20, 30, 35 },
          new int[] { 1, 1, 1 });
      
        assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
          new String[] {  },
          new int[] {  },
          new int[] {  },
          new int[] {  });
      
        assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
          new String[] { "42" },
          new int[] { 18 },
          new int[] { 20 },
          new int[] { 1 });
      

      where assertWdf is

        // Builds a WordDelimiterFilter with the given flags over whitespace-split
        // text and checks the emitted terms, offsets and position increments.
        void assertWdf(String text, int generateWordParts, int generateNumberParts,
            int catenateWords, int catenateNumbers, int catenateAll,
            int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
            int stemEnglishPossessive, CharArraySet protWords, String[] expected,
            int[] startOffsets, int[] endOffsets, int[] posIncs)
            throws IOException {
          TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
          WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
              generateNumberParts, catenateWords, catenateNumbers, catenateAll,
              splitOnCaseChange, preserveOriginal, splitOnNumerics,
              stemEnglishPossessive, protWords);
          assertTokenStreamContents(wdf, expected, startOffsets, endOffsets,
              posIncs);
        }
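
      Because the nine flags are passed positionally, the calls above are hard to read at a glance. As a reading aid only, here is a minimal sketch that maps a flag tuple to the parameter names from the assertWdf signature. The class WdfFlagNames and its method enabled are hypothetical helpers for illustration, not part of Solr.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reading aid, not part of Solr: names the positional flags
// in the order they appear in the assertWdf signature.
class WdfFlagNames {
  static final String[] NAMES = {
      "generateWordParts", "generateNumberParts", "catenateWords",
      "catenateNumbers", "catenateAll", "splitOnCaseChange",
      "preserveOriginal", "splitOnNumerics", "stemEnglishPossessive" };

  // Returns the names of the flags that are set (non-zero).
  static List<String> enabled(int... flags) {
    if (flags.length != NAMES.length) {
      throw new IllegalArgumentException("expected " + NAMES.length + " flags");
    }
    List<String> on = new ArrayList<>();
    for (int i = 0; i < flags.length; i++) {
      if (flags[i] != 0) {
        on.add(NAMES[i]);
      }
    }
    return on;
  }

  public static void main(String[] args) {
    // The flag tuple used by the four assertions in the description.
    System.out.println(enabled(0, 0, 0, 1, 0, 0, 0, 0, 1));
  }
}
```

      For the tuple used in the description this lists catenateNumbers and stemEnglishPossessive, which is why word tokens such as "AutoCoder" are not expected in the output.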
      

          Activity

          Robert Muir added a comment -

          By the way, I do not have a patch here. I am putting the finishing touches on converting this TokenStream to the new TokenStream API, so one alternative is to fix it under SOLR-1657.

          The problem is that I am autogenerating many test cases for all 512 combinations of the 9 boolean options across various strings, and seeing things like this.

          So, at the least, I would like agreement that this is buggy behavior. If someone knows how to fix the existing code, that would be even better, as it would make testing easier for me.
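
          The exhaustive enumeration mentioned in this comment (2^9 = 512 combinations of nine 0/1 flags) could look roughly like the sketch below. This is not the reporter's generator: the class WdfFlagCombos and the method flagsFor are hypothetical, and the most-significant-bit-first ordering is an assumption, chosen because it is consistent with the test names test0, test32 and test128 that appear elsewhere in this thread.

```java
// Hypothetical sketch, not the reporter's generator: enumerates every
// combination of the nine 0/1 flags accepted by assertWdf.
class WdfFlagCombos {
  static final int NUM_FLAGS = 9;

  // Decodes a combination index (0..511) into nine 0/1 flags, most
  // significant bit first, so the first flag is generateWordParts.
  static int[] flagsFor(int combo) {
    int[] flags = new int[NUM_FLAGS];
    for (int i = 0; i < NUM_FLAGS; i++) {
      flags[i] = (combo >> (NUM_FLAGS - 1 - i)) & 1;
    }
    return flags;
  }

  public static void main(String[] args) {
    // Print one line per combination, e.g. as the header of a generated test.
    for (int combo = 0; combo < (1 << NUM_FLAGS); combo++) {
      int[] flags = flagsFor(combo);
      StringBuilder sb = new StringBuilder("test" + combo + ": ");
      for (int i = 0; i < flags.length; i++) {
        if (i > 0) {
          sb.append(',');
        }
        sb.append(flags[i]);
      }
      System.out.println(sb);
    }
  }
}
```

          Under this assumed ordering, index 32 decodes to 0,0,0,1,0,0,0,0,0 (catenateNumbers only) and index 128 decodes to 0,1,0,0,0,0,0,0,0 (generateNumberParts only).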

          Robert Muir added a comment -

          OK, I narrowed this one down some. It appears to be completely unrelated to possessives, and is instead some other off-by-one bug:

          public void test0() throws Exception {
            assertWdf("1-a-2 3-b-c-4 5-d-e 6-f", 0,0,0,0,0,0,0,0,0, null,
              new String[] {  },
              new int[] {  },
              new int[] {  },
              new int[] {  });
          }
          
          public void test32() throws Exception {
            assertWdf("1-a-2 3-b-c-4 5-d-e 6-f", 0,0,0,1,0,0,0,0,0, null,
              new String[] { "1", "a", "2", "3", "4", "5", "6", "f" },
              new int[] { 0, 2, 4, 6, 12, 14, 20, 22 },
              new int[] { 1, 3, 5, 7, 13, 15, 21, 23 },
              new int[] { 1, 1, 1, 1, 1, 1, 1, 1 });
          }
          
          Robert Muir added a comment -

          It's not just the concatenation, but also the subword generation.

          In the case below, Autocoder should not be emitted, as only numeric subword generation is turned on.

            public void test128() throws Exception {
              assertWdf("word 1234 Super-Duper-XL500-42-Autocoder x'sbd123 a4b3c-", 0,1,0,0,0,0,0,0,0, null,
                new String[] { "word", "1234", "42", "Autocoder", "a4b3c" },
                new int[] { 0, 5, 28, 31, 50 },
                new int[] { 4, 9, 30, 40, 55 },
                new int[] { 1, 1, 1, 1, 2 });
            }
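
          The rule stated in this comment (an alphabetic part such as "Autocoder" should only be emitted when word-part generation is on, a numeric part such as "42" only when number-part generation is on) can be sketched as follows. This is a simplified illustration, not the WordDelimiterFilter implementation: the class SubwordEmitRule and its methods are hypothetical, and parts that mix letters and digits are deliberately left out of scope.

```java
// Hypothetical sketch of the emission rule described in the comment;
// this is not how WordDelimiterFilter is actually implemented.
class SubwordEmitRule {
  static boolean isAllDigits(String s) {
    return !s.isEmpty() && s.chars().allMatch(Character::isDigit);
  }

  static boolean isAllLetters(String s) {
    return !s.isEmpty() && s.chars().allMatch(Character::isLetter);
  }

  // Decides whether an already-split part should be emitted on its own.
  // Purely numeric parts follow generateNumberParts, purely alphabetic
  // parts follow generateWordParts.
  static boolean shouldEmitPart(String part, boolean generateWordParts,
      boolean generateNumberParts) {
    if (isAllDigits(part)) {
      return generateNumberParts;
    }
    if (isAllLetters(part)) {
      return generateWordParts;
    }
    // Mixed or empty parts are outside the scope of this sketch.
    throw new IllegalArgumentException("unsupported part: " + part);
  }

  public static void main(String[] args) {
    // Flags as in test128: generateWordParts off, generateNumberParts on.
    System.out.println(shouldEmitPart("42", false, true));
    System.out.println(shouldEmitPart("Autocoder", false, true));
  }
}
```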
          
          Yonik Seeley added a comment -

          Yep, certainly bugs. IMO, no need to worry about trying to match (even for compat) - these look like real configuration edge cases to me.

          Robert Muir added a comment -

          This was resolved in revision 922957.

          Hoss Man added a comment -

          Correcting Fix Version based on CHANGES.txt, see this thread for more details...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E


            People

            • Assignee: Mark Miller
            • Reporter: Robert Muir