Lucene - Core / LUCENE-4641

Fix analyzer bugs documented in TestRandomChains


Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      TestRandomChains.java found a lot of bugs, some of which are hard to fix. So we blacklisted certain analysis components from the test.

      But we really need to fix these: some of these bugs are bad, and they impact users of features such as highlighting (SOLR-4137 and so on):

        // TODO: fix those and remove
        private static final Set<Class<?>> brokenComponents = Collections.newSetFromMap(new IdentityHashMap<Class<?>,Boolean>());
        static {
          // TODO: can we promote some of these to be only
          // offsets offenders?
          Collections.<Class<?>>addAll(brokenComponents,
            // TODO: fix BaseTokenStreamTestCase not to trip because this one has no CharTermAtt
            EmptyTokenizer.class,
            // doesn't actually reset itself!
            CachingTokenFilter.class,
            // doesn't consume whole stream!
            LimitTokenCountFilter.class,
            // Not broken: we forcefully add this, so we shouldn't
            // also randomly pick it:
            ValidatingTokenFilter.class,
            // NOTE: these by themselves won't cause any 'basic assertions' to fail.
            // But see https://issues.apache.org/jira/browse/LUCENE-3920: if any
            // tokenfilter that combines words (e.g. shingles) comes after them,
            // this will create bogus offsets because their 'offsets go backwards',
            // causing shingle or whatever to make a single token with a
            // startOffset that's > its endOffset
            // (see LUCENE-3738 for a list of other offenders here;
            // a minimal repro sketch follows this block)
            // broken!
            NGramTokenizer.class,
            // broken!
            NGramTokenFilter.class,
            // broken!
            EdgeNGramTokenizer.class,
            // broken!
            EdgeNGramTokenFilter.class,
            // broken!
            WordDelimiterFilter.class,
            // broken!
            TrimFilter.class
          );
        }
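
      For example, the 'offsets go backwards' problem described above can be reproduced with a chain like the following. This is a minimal sketch, assuming the 4.x-era constructors NGramTokenizer(Reader, minGram, maxGram) and ShingleFilter(TokenStream, maxShingleSize); the class name is hypothetical and this is not code from the attached patch:

        import java.io.StringReader;

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.Tokenizer;
        import org.apache.lucene.analysis.ngram.NGramTokenizer;
        import org.apache.lucene.analysis.shingle.ShingleFilter;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

        public class BackwardsOffsetsRepro {
          public static void main(String[] args) throws Exception {
            // NGramTokenizer emits all 1-grams, then all 2-grams, etc., so its
            // offsets jump backwards between gram sizes: "e"[4,5] is followed
            // by "ab"[0,2]. ShingleFilter then combines adjacent tokens, taking
            // startOffset from the first and endOffset from the second, which
            // can yield a token whose startOffset is greater than its endOffset.
            Tokenizer ngrams = new NGramTokenizer(new StringReader("abcde"), 1, 2);
            TokenStream shingles = new ShingleFilter(ngrams, 2);
            CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = shingles.addAttribute(OffsetAttribute.class);
            shingles.reset();
            while (shingles.incrementToken()) {
              String flag = offset.startOffset() > offset.endOffset() ? "  <-- broken" : "";
              System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset() + "]" + flag);
            }
            shingles.end();
            shingles.close();
          }
        }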
      
        // TODO: also fix these and remove (maybe):
        // Classes that don't produce consistent graph offsets:
        private static final Set<Class<?>> brokenOffsetsComponents = Collections.newSetFromMap(new IdentityHashMap<Class<?>,Boolean>());
        static {
          Collections.<Class<?>>addAll(brokenOffsetsComponents,
            ReversePathHierarchyTokenizer.class,
            PathHierarchyTokenizer.class,
            HyphenationCompoundWordTokenFilter.class,
            DictionaryCompoundWordTokenFilter.class,
            // TODO: corrupts graphs (offset consistency check):
            PositionFilter.class,
            // TODO: it seems to mess up offsets!?
            WikipediaTokenizer.class,
            // TODO: doesn't handle graph inputs
            ThaiWordFilter.class,
            // TODO: doesn't handle graph inputs
            CJKBigramFilter.class,
            // TODO: doesn't handle graph inputs (or even look at positionIncrement)
            HyphenatedWordsFilter.class,
            // LUCENE-4065: only if you pass 'false' to enablePositionIncrements!
            TypeTokenFilter.class,
            // TODO: doesn't handle graph inputs
            CommonGramsQueryFilter.class
          );
        }
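
      For reference, the offset invariants that trip these components look roughly like the following. This is a simplified, hypothetical filter in the spirit of the checks in BaseTokenStreamTestCase, not the actual test code:

        import java.io.IOException;

        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

        // Hypothetical sketch: throws as soon as a component in the chain
        // breaks one of the offset invariants that TestRandomChains checks.
        public final class OffsetInvariantsFilter extends TokenFilter {
          private final OffsetAttribute offset = addAttribute(OffsetAttribute.class);
          private int lastStartOffset;

          public OffsetInvariantsFilter(TokenStream in) {
            super(in);
          }

          @Override
          public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
              return false;
            }
            // invariant 1: a token may not end before it starts
            if (offset.startOffset() > offset.endOffset()) {
              throw new IllegalStateException(
                  "startOffset " + offset.startOffset() + " > endOffset " + offset.endOffset());
            }
            // invariant 2: offsets may not go backwards across tokens
            if (offset.startOffset() < lastStartOffset) {
              throw new IllegalStateException(
                  "offsets went backwards: " + offset.startOffset() + " < " + lastStartOffset);
            }
            lastStartOffset = offset.startOffset();
            return true;
          }

          @Override
          public void reset() throws IOException {
            super.reset();
            lastStartOffset = 0;
          }
        }

      ValidatingTokenFilter, which the test forcefully inserts (per the comment in the first block), performs checks along these lines at each stage of the random chain.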
      

      Attachments

        1. LUCENE-4641_tests.patch (35 kB, Robert Muir)


            People

              Assignee: Unassigned
              Reporter: Robert Muir (rcmuir)
              Votes: 3
              Watchers: 8
