[SOLR-1706] wrong tokens output from WordDelimiterFilter depending upon options - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.4
Fix Version/s: 1.4.1, 3.1, 4.0-ALPHA
Component/s: Schema and Analysis
Labels: None

Description

Below you can see that when I request only numeric concatenations (not words), some words are still sometimes output, ignoring the options I have provided, and in a very inconsistent way.

  assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42", "AutoCoder" },
    new int[] { 18, 21 },
    new int[] { 20, 30 },
    new int[] { 1, 1 });

  assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42", "AutoCoder", "56" },
    new int[] { 18, 21, 33 },
    new int[] { 20, 30, 35 },
    new int[] { 1, 1, 1 });

  assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
    new String[] {  },
    new int[] {  },
    new int[] {  },
    new int[] {  });

  assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
    new String[] { "42" },
    new int[] { 18 },
    new int[] { 20 },
    new int[] { 1 });

where assertWdf is

  // Runs WordDelimiterFilter over the whitespace-tokenized text and checks the
  // resulting terms, offsets, and position increments.
  void assertWdf(String text, int generateWordParts, int generateNumberParts,
      int catenateWords, int catenateNumbers, int catenateAll,
      int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
      int stemEnglishPossessive, CharArraySet protWords, String[] expected,
      int[] startOffsets, int[] endOffsets, int[] posIncs)
      throws IOException {
    TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
    WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
        generateNumberParts, catenateWords, catenateNumbers, catenateAll,
        splitOnCaseChange, preserveOriginal, splitOnNumerics,
        stemEnglishPossessive, protWords);
    // The calls above pass no token types, so null is passed here to skip that check.
    assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, null,
        posIncs);
  }
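
The nine positional int flags in the calls above are easy to misread, so here is a minimal sketch that maps them back to the parameter names of assertWdf. The class name WdfFlags and the method enabled are hypothetical helpers for illustration only; they are not part of Solr or Lucene, and the sketch does not run the filter itself.

```java
import java.util.ArrayList;
import java.util.List;

public class WdfFlags {
    // Parameter names in the positional order used by the assertWdf helper.
    static final String[] NAMES = {
        "generateWordParts", "generateNumberParts", "catenateWords",
        "catenateNumbers", "catenateAll", "splitOnCaseChange",
        "preserveOriginal", "splitOnNumerics", "stemEnglishPossessive"
    };

    // Returns the names of the flags that are set to a non-zero value.
    static List<String> enabled(int... flags) {
        if (flags.length != NAMES.length) {
            throw new IllegalArgumentException(
                "expected " + NAMES.length + " flags, got " + flags.length);
        }
        List<String> on = new ArrayList<>();
        for (int i = 0; i < flags.length; i++) {
            if (flags[i] != 0) {
                on.add(NAMES[i]);
            }
        }
        return on;
    }

    public static void main(String[] args) {
        // The flag values used in every assertWdf call in this report.
        System.out.println(enabled(0, 0, 0, 1, 0, 0, 0, 0, 1));
        // prints [catenateNumbers, stemEnglishPossessive]
    }
}
```

With generateWordParts, catenateWords, and catenateAll all 0, none of the word tokens such as "AutoCoder" were requested, which is why their appearance in the first two cases is reported as a bug.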