[JENA-1488] SelectiveFoldingFilter for jena-text - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: Jena 3.6.0
Fix Version/s: Jena 3.8.0
Component/s: Text
Labels: None

Description

Currently there's some support for accent folding in jena-text, because Lucene provides an ASCIIFoldingFilter. When this filter is enabled, a search for "deja vu" will match the literal "déjà vu" in the data.

But we can't use it here at the National Library of Finland (for Finto.fi / Skosmos), because it folds too much! In the Finnish alphabet, in addition to the Latin a-z (which are in ASCII) we use the letters åäö and these should not be folded to ASCII. So we need a Lucene analyzer that can be configured with an exclude list, something like

new SelectiveFoldingFilter(String excludeChars)

and that can also be configured via the Jena assembler, just like the other analyzers supported by jena-text.
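To make the intended behaviour concrete, here is a minimal standalone sketch of selective folding. It is not the Lucene filter and not the code that was eventually merged: the class name SelectiveFoldingSketch and the method fold are hypothetical, and it assumes that stripping combining marks after Unicode decomposition is a good enough stand-in for ASCIIFoldingFilter. That assumption is weaker than the real filter, which also maps characters with no decomposition (for example ø).

```java
import java.text.Normalizer;

// Hypothetical illustration only; the names below are not part of jena-text or Lucene.
final class SelectiveFoldingSketch {

    // Folds accented characters to their base letters, except those listed in excludeChars.
    static String fold(String input, String excludeChars) {
        // Compose first so that "a" + combining diaeresis is compared as the single character "ä".
        String text = Normalizer.normalize(input, Normalizer.Form.NFC);
        String keep = Normalizer.normalize(excludeChars, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (cp < 128 || keep.contains(ch)) {
                // ASCII and excluded characters pass through unchanged.
                out.append(ch);
            } else {
                // Decompose, then drop the combining marks (Unicode category M).
                String decomposed = Normalizer.normalize(ch, Normalizer.Form.NFD);
                out.append(decomposed.replaceAll("\\p{M}+", ""));
            }
        });
        return out.toString();
    }
}
```

With the exclude list "åäö", fold("déjà vu", "åäö") gives "deja vu", while the Finnish word "hääyö" comes back unchanged.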

This was also briefly discussed on the skosmos-users mailing list:
https://groups.google.com/d/msg/skosmos-users/x3zR_uRBQT0/Q90-O_iDAQAJ
Apparently Norwegians have the same problem...

I've discussed this with kinow and he has some initial code to implement this feature, so I think we can turn this into a PR fairly soon.

Issue Links

Blocked

JENA-1506 Add configurable filters and tokenizers

Closed

depends upon

JENA-1506 Add configurable filters and tokenizers

Closed

links to

GitHub Pull Request #385

GitHub Pull Request #395

People

Assignee: Bruno P. Kinoshita

Reporter: Osma Suominen

Votes: 0

Watchers: 7

Dates

Created: 13/Feb/18 13:38

Updated: 29/Jun/18 10:13

Resolved: 25/Apr/18 09:00