[OAK-9145] OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in wrong order - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: indexing, jcr, lucene
Labels:
- easyfix
- pull-request-available
Environment: Discovered while performing DAM searches in Adobe Experience Manager.
Flags: Patch
External issue URL: https://github.com/apache/jackrabbit-oak/pull/242

Description

I believe OakAnalyzer applies LowerCaseFilter and WordDelimiterFilter in the wrong order. WordDelimiterFilter is invoked with the GENERATE_WORD_PARTS flag, which splits camelCase/PascalCase tokens into separate terms. Because LowerCaseFilter is applied first, the case information is already gone by the time WordDelimiterFilter sees the token, so it can no longer split on case changes.

For example, a search for savings against the damAssetLucene index (which uses the default OakAnalyzer) does not find an asset named savingsAccount.svg.
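
To illustrate why the order matters, here is a minimal Python sketch that simulates the two filters on the asset name above. It is not Lucene code: lower_case and split_word_parts are hypothetical stand-ins that only approximate LowerCaseFilter and the case-change splitting of WordDelimiterFilter with GENERATE_WORD_PARTS, and the input is assumed to arrive as a single token.

```python
import re


def lower_case(tokens):
    """Lowercase every token (hypothetical stand-in for LowerCaseFilter)."""
    return [t.lower() for t in tokens]


def split_word_parts(tokens):
    """Split tokens on non-alphanumeric characters and on lower-to-upper
    case changes (rough approximation of GENERATE_WORD_PARTS only)."""
    parts = []
    for token in tokens:
        for chunk in re.split(r"[^A-Za-z0-9]+", token):
            # Mark each lowercase-to-uppercase boundary with a space, then split.
            parts.extend(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", chunk).split())
    return parts


# Assumption: the tokenizer hands the file name over as one token.
tokens = ["savingsAccount.svg"]

# Order described in this issue: lowercase first, then word delimiter.
lower_first = split_word_parts(lower_case(tokens))

# Proposed order: word delimiter first, then lowercase.
split_first = lower_case(split_word_parts(tokens))

print(lower_first)  # ['savingsaccount', 'svg']
print(split_first)  # ['savings', 'account', 'svg']
```

With lowercasing first, the only indexed word part containing "savings" is savingsaccount, so a query for savings has nothing to match. With the delimiter first, savings is emitted as its own term.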

After configuring the index's analyzers (/oak:index/damAssetLucene/analyzers) to apply WordDelimiterFilter before LowerCaseFilter, the search returned the asset as expected. The configuration used was:

{
  "jcr:primaryType": "nt:unstructured",
  "default": {
    "jcr:primaryType": "nt:unstructured",
    "tokenizer": {
      "jcr:primaryType": "nt:unstructured",
      "name": "Standard"
    },
    "filters": {
      "jcr:primaryType": "nt:unstructured",
      "WordDelimiter": {"jcr:primaryType": "nt:unstructured"},
      "LowerCase": {"jcr:primaryType": "nt:unstructured"}
    }
  }
}