[ANY23-335] Improve document default language detection - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.2
Fix Version/s: 2.8
Component/s: core
Labels: None

Description

Currently, the document's default language that is passed to an ExtractionContext is determined by checking only the "xml:lang" attribute on the root HTML node.

However, according to the W3C article on document language declaration and the W3C article on meta declarations, it appears that we should also check the "lang" attribute and, as a fallback, any META elements with http-equiv="Content-Language".
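To make the proposed lookup order concrete, here is a minimal sketch using only the JDK's DOM API. It is not Any23 code: the class and method names (`DefaultLanguageSketch`, `detectDefaultLanguage`, `parse`) are hypothetical, and checking "xml:lang" before "lang" is an assumption about precedence that the issue itself does not settle. It also assumes lowercase element names, which may not match the DOM that Any23 actually builds.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Any23.
final class DefaultLanguageSketch {

    /** Returns the first non-blank declared language, or empty if none is declared. */
    static Optional<String> detectDefaultLanguage(Document doc) {
        Element root = doc.getDocumentElement();
        if (root != null) {
            // 1. "xml:lang" on the root element (the only check described in the issue).
            String lang = root.getAttribute("xml:lang").trim();
            // 2. Plain "lang" on the root element (assumed to rank below "xml:lang").
            if (lang.isEmpty()) {
                lang = root.getAttribute("lang").trim();
            }
            if (!lang.isEmpty()) {
                return Optional.of(lang);
            }
        }
        // 3. Fallback: <meta http-equiv="Content-Language" content="...">.
        //    Element name matching is case-sensitive here; lowercase is assumed.
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            if ("Content-Language".equalsIgnoreCase(meta.getAttribute("http-equiv"))) {
                // The content may be a comma-separated list; take the first non-blank tag.
                for (String tag : meta.getAttribute("content").split(",")) {
                    if (!tag.trim().isEmpty()) {
                        return Optional.of(tag.trim());
                    }
                }
            }
        }
        return Optional.empty();
    }

    /** Parses well-formed XHTML from a string; wraps checked exceptions for brevity. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }
}
```

Returning an `Optional` keeps "no language declared" distinct from an empty string, so a caller can decide separately what default, if any, to use.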

Also, there seems to be some overlap here with (at least) the HTMLMetaExtractor, which does the opposite: it appears to check the "lang" attribute but not the "xml:lang" attribute. Could the HTMLMetaExtractor simply retrieve the default document language from the ExtractionContext rather than looking it up in the document all over again?
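The reuse suggested above could look roughly like the following sketch. Everything in it is a hypothetical stand-in: `SketchContext` is not Any23's ExtractionContext, and `languageFor` is not a method of HTMLMetaExtractor. It only illustrates the idea that the extractor consults a language already stored in the context instead of re-reading the root element.

```java
import java.util.Optional;

// Hypothetical illustration only; these types do not exist in Any23.
final class SharedLanguageSketch {

    /** Stand-in for an extraction context that already carries the document default. */
    static final class SketchContext {
        private final String defaultLanguage; // null when the document declared none

        SketchContext(String defaultLanguage) {
            this.defaultLanguage = defaultLanguage;
        }

        Optional<String> defaultLanguage() {
            return Optional.ofNullable(defaultLanguage);
        }
    }

    /**
     * Chooses the language for a single meta element. A language declared on the
     * element itself wins; otherwise the context's document default is reused.
     */
    static Optional<String> languageFor(String metaLangAttribute, SketchContext context) {
        if (metaLangAttribute != null && !metaLangAttribute.trim().isEmpty()) {
            return Optional.of(metaLangAttribute.trim());
        }
        return context.defaultLanguage();
    }
}
```

With this shape the detection rules live in one place, so the two extractors could not disagree about which attribute they check.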

People

Assignee: Unassigned

Reporter: Hans Brende

Votes: 0

Watchers: 1

Dates

Created: 27/Mar/18 06:48

Updated: 21/Feb/22 18:24