Description
The language-identifier plugin provides two extraction policies: detect and identify.
However, the two policies handle language codes that carry a region suffix differently:
- 'identify' reduces the value to its alpha-2 code, e.g. if the identified language is 'en-US' then it will inject 'en' in the meta tags
- 'detect' keeps the full value, e.g. if the detected language is 'en-US' then it will inject 'en-US' in the meta tags
Any chance we can make this consistent and always reduce the value to the alpha-2 code?
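To make the request concrete, here is a minimal sketch of the normalization both policies could share. It is not taken from the plugin source: the class name LanguageCodeNormalizer and the method toAlpha2 are hypothetical, and it assumes the region is separated from the language code by '-' or '_'.

```java
import java.util.Locale;

// Hypothetical helper, not part of the Nutch language-identifier plugin.
final class LanguageCodeNormalizer {

    private LanguageCodeNormalizer() {
    }

    /**
     * Returns the part of the language value before the first '-' or '_',
     * lower-cased. Values without a separator are returned lower-cased as-is.
     * Null is passed through unchanged.
     */
    static String toAlpha2(String lang) {
        if (lang == null) {
            return null;
        }
        String trimmed = lang.trim();
        int cut = -1;
        // Find the first separator; assumption: '-' or '_' introduces the region.
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            if (c == '-' || c == '_') {
                cut = i;
                break;
            }
        }
        String primary = (cut < 0) ? trimmed : trimmed.substring(0, cut);
        return primary.toLowerCase(Locale.ROOT);
    }
}
```

With this sketch, toAlpha2("en-US") and toAlpha2("en") both give "en", so applying it in both 'detect' and 'identify' would produce the same meta tag value regardless of the policy.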
Attachments
Issue Links
- relates to
  - NUTCH-1397 language-identifier incorrectly handles double-barreled language properties (Open)
  - NUTCH-2449 Usage of Tika LanguageIdentifier in language-identifier plugin (Closed)