[NUTCH-2278] Handle alpha-2 language codes consistently - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.12
Fix Version/s: 1.21
Component/s: plugin
Labels:
None

Description

The language-identifier plugin provides two extraction policies: detect and identify.

However, the two policies handle region-qualified codes differently:

- 'identify' reduces the value to its alpha-2 language code, e.g. if the identified language is 'en-US', it injects 'en' into the meta tags.
- 'detect' keeps the full value, e.g. if the detected language is 'en-US', it injects 'en-US' into the meta tags.

Could we make this consistent and always reduce the value to the alpha-2 code?
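
For illustration, the requested behaviour could look roughly like the sketch below. It is not the attached patch and not existing Nutch code: the class name LanguageCodeNormalizer and the method toPrimaryLanguage are hypothetical, and the sketch assumes that both '-' and '_' can appear as separators and that blank input should yield null.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of Nutch): reduces a language tag such as
 * "en-US" to its primary language subtag, "en".
 */
final class LanguageCodeNormalizer {

    private LanguageCodeNormalizer() {
        // static utility, no instances
    }

    /**
     * Returns the lower-cased part of the tag before the first '-' or '_',
     * or null when the input is null, blank, or starts with a separator.
     */
    static String toPrimaryLanguage(String tag) {
        if (tag == null) {
            return null;
        }
        String trimmed = tag.trim();

        // Find the first separator; assumption: both "en-US" and "en_US" occur.
        int cut = -1;
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            if (c == '-' || c == '_') {
                cut = i;
                break;
            }
        }

        String primary = (cut < 0) ? trimmed : trimmed.substring(0, cut);
        if (primary.isEmpty()) {
            return null;
        }
        // Locale.ROOT avoids locale-sensitive case mapping (e.g. Turkish dotless i).
        return primary.toLowerCase(Locale.ROOT);
    }
}
```

If both policies passed their result through one helper like this before writing the meta tag, 'en-US' and 'en' would be stored identically regardless of which policy produced the value.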

Attachments


- NUTCH-2278.patch, 11/Jun/16 02:56, 0.7 kB, Fengtan
- NUTCH-2278.patch, 04/Aug/16 03:09, 4 kB, Fengtan

Issue Links

relates to

- NUTCH-1397 language-identifier incorrectly handles double-barreled language properties (Open)
- NUTCH-2449 Usage of Tika LanguageIdentifier in language-identifier plugin (Closed)

People

Assignee: Unassigned

Reporter: Fengtan

Votes: 0

Watchers: 2

Dates

Created: 11/Jun/16 02:56

Updated: 30/Mar/24 17:19