SOLR-1706

wrong tokens output from WordDelimiterFilter depending upon options

      Description

      As the tests below show, when I request only numeric concatenations (not words), some words are still output despite the options I provided, and even then inconsistently.

        assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
          new String[] { "42", "AutoCoder" },
          new int[] { 18, 21 },
          new int[] { 20, 30 },
          new int[] { 1, 1 });
      
        assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
          new String[] { "42", "AutoCoder", "56" },
          new int[] { 18, 21, 33 },
          new int[] { 20, 30, 35 },
          new int[] { 1, 1, 1 });
      
        assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
          new String[] {  },
          new int[] {  },
          new int[] {  },
          new int[] {  });
      
        assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
          new String[] { "42" },
          new int[] { 18 },
          new int[] { 20 },
          new int[] { 1 });
      

      where assertWdf is

        // Runs WordDelimiterFilter over whitespace-tokenized text and checks the
        // emitted terms, offsets, and position increments (token types are not checked).
        void assertWdf(String text, int generateWordParts, int generateNumberParts,
            int catenateWords, int catenateNumbers, int catenateAll,
            int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
            int stemEnglishPossessive, CharArraySet protWords, String[] expected,
            int[] startOffsets, int[] endOffsets, int[] posIncs)
            throws IOException {
          TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
          WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
              generateNumberParts, catenateWords, catenateNumbers, catenateAll,
              splitOnCaseChange, preserveOriginal, splitOnNumerics,
              stemEnglishPossessive, protWords);
          // The calls above supply no token types, so null is passed for them here.
          assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, null,
              posIncs);
        }
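
      The nine positional flags in each call are easy to misread, so here is a minimal standalone sketch that maps them onto the parameter names declared by assertWdf and recomputes the character offsets used in the tests. The class WdfCallDecoder and its methods are hypothetical helpers for illustration only; they do not use or reproduce WordDelimiterFilter itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WdfCallDecoder {
    // Parameter names in the order assertWdf declares them.
    private static final String[] FLAG_NAMES = {
        "generateWordParts", "generateNumberParts", "catenateWords",
        "catenateNumbers", "catenateAll", "splitOnCaseChange",
        "preserveOriginal", "splitOnNumerics", "stemEnglishPossessive"
    };

    /** Returns the names of all flags that are set to a non-zero value. */
    static List<String> enabledFlags(int... flags) {
        if (flags.length != FLAG_NAMES.length) {
            throw new IllegalArgumentException("expected " + FLAG_NAMES.length + " flags");
        }
        List<String> enabled = new ArrayList<>();
        for (int i = 0; i < flags.length; i++) {
            if (flags[i] != 0) {
                enabled.add(FLAG_NAMES[i]);
            }
        }
        return enabled;
    }

    /** Returns {start, end} of the first occurrence of token in text (end exclusive). */
    static int[] offsetsOf(String text, String token) {
        int start = text.indexOf(token);
        if (start < 0) {
            throw new IllegalArgumentException("token not found: " + token);
        }
        return new int[] { start, start + token.length() };
    }

    public static void main(String[] args) {
        String text = "Super-Duper-XL500-42-AutoCoder's";
        // prints [catenateNumbers, stemEnglishPossessive]
        System.out.println(enabledFlags(0, 0, 0, 1, 0, 0, 0, 0, 1));
        // prints [18, 20]
        System.out.println(Arrays.toString(offsetsOf(text, "42")));
        // prints [21, 30]
        System.out.println(Arrays.toString(offsetsOf(text, "AutoCoder")));
    }
}
```

      With 0,0,0,1,0,0,0,0,1 only catenateNumbers and stemEnglishPossessive are enabled, which is why a word token such as "AutoCoder" in the output is unexpected.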
      

              People

              • Assignee:
                Mark Miller (markrmiller@gmail.com)
                Reporter:
                Robert Muir (rcmuir)