SOLR-1706

wrong tokens output from WordDelimiterFilter depending upon options


Details

    Description

      Below you can see that when I request only numeric concatenations (not words), some words are still output, ignoring the options I provided, and even then in a very inconsistent way. For example, "AutoCoder" is emitted in the first two cases but not in the last two.

        assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
          new String[] { "42", "AutoCoder" },
          new int[] { 18, 21 },
          new int[] { 20, 30 },
          new int[] { 1, 1 });
      
        assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
          new String[] { "42", "AutoCoder", "56" },
          new int[] { 18, 21, 33 },
          new int[] { 20, 30, 35 },
          new int[] { 1, 1, 1 });
      
        assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
          new String[] {  },
          new int[] {  },
          new int[] {  },
          new int[] {  });
      
        assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
          new String[] { "42" },
          new int[] { 18 },
          new int[] { 20 },
          new int[] { 1 });
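
As a quick sanity check on the offsets used in these assertions, the following sketch locates each reported token in the input with plain String.indexOf, without any Lucene classes. The class name OffsetCheck and the span helper are illustrative assumptions and are not part of the Solr test suite.

```java
import java.util.Arrays;

public class OffsetCheck {
    // Returns {start, end} of the first occurrence of token in text at or
    // after index from. The end is exclusive, like a token end offset.
    static int[] span(String text, String token, int from) {
        int start = text.indexOf(token, from);
        if (start < 0) {
            throw new IllegalArgumentException("token not found: " + token);
        }
        return new int[] { start, start + token.length() };
    }

    public static void main(String[] args) {
        String text = "Super-Duper-XL500-42-AutoCoder's-56";
        int from = 0;
        // Walk the tokens left to right so each search starts after the previous hit.
        for (String token : new String[] { "42", "AutoCoder", "56" }) {
            int[] s = span(text, token, from);
            System.out.println(token + " -> " + Arrays.toString(s));
            from = s[1];
        }
    }
}
```

For the second input this reports the spans 18 to 20, 21 to 30, and 33 to 35, which agree with the start and end offset arrays in the assertion. The offsets are therefore consistent; the open question is which tokens should be emitted at all.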
      

      where assertWdf is defined as:

        // Runs text through WhitespaceTokenizer and WordDelimiterFilter with the
        // given options, then checks the tokens, offsets, and position increments.
        void assertWdf(String text, int generateWordParts, int generateNumberParts,
            int catenateWords, int catenateNumbers, int catenateAll,
            int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
            int stemEnglishPossessive, CharArraySet protWords, String expected[],
            int startOffsets[], int endOffsets[], int posIncs[])
            throws IOException {
          TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
          WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
              generateNumberParts, catenateWords, catenateNumbers, catenateAll,
              splitOnCaseChange, preserveOriginal, splitOnNumerics,
              stemEnglishPossessive, protWords);
          // Token types are not checked (null): the call sites pass no types array.
          assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, null,
              posIncs);
        }
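
Because the nine int flags are passed positionally, a call such as 0,0,0,1,0,0,0,0,1 is hard to read at a glance. Here is a minimal sketch that maps the positions back to the parameter names in the assertWdf signature. FlagNames and enabled are hypothetical helpers for illustration only, not part of WordDelimiterFilter or the Solr tests.

```java
import java.util.ArrayList;
import java.util.List;

public class FlagNames {
    // Parameter names in the order they appear in the assertWdf signature.
    static final String[] NAMES = {
        "generateWordParts", "generateNumberParts", "catenateWords",
        "catenateNumbers", "catenateAll", "splitOnCaseChange",
        "preserveOriginal", "splitOnNumerics", "stemEnglishPossessive"
    };

    // Returns the names of all options whose flag is non-zero.
    static List<String> enabled(int... flags) {
        if (flags.length != NAMES.length) {
            throw new IllegalArgumentException(
                "expected " + NAMES.length + " flags, got " + flags.length);
        }
        List<String> on = new ArrayList<>();
        for (int i = 0; i < flags.length; i++) {
            if (flags[i] != 0) {
                on.add(NAMES[i]);
            }
        }
        return on;
    }

    public static void main(String[] args) {
        // The flag combination used in every assertion in this report.
        System.out.println(enabled(0, 0, 0, 1, 0, 0, 0, 0, 1));
    }
}
```

For the flags used in this report, only catenateNumbers and stemEnglishPossessive come out as enabled, which is why a word token such as "AutoCoder" in the output is unexpected.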
      


          People

            markrmiller@gmail.com Mark Miller
            rcmuir Robert Muir