Apache Any23 (Retired) / ANY23-335

Improve document default language detection

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.2
    • Fix Version/s: 2.8
    • Component/s: core
    • Labels: None

    Description

      Currently, the default language passed to an ExtractionContext is determined by checking only the "xml:lang" attribute on the root HTML element.

      However, according to the W3C article on document language declaration and the W3C article on meta declarations, we should also be checking the "lang" attribute and, as a fallback, the META http-equiv="Content-Language" elements.
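
      The lookup order proposed above could be sketched roughly as follows. This is only an illustration against a plain org.w3c.dom Document, not Any23's actual code: the class name DefaultLanguageSketch and the methods detectDefaultLanguage and parse are hypothetical, the element name "meta" is assumed to be lower case (an HTML parser may report it differently), and taking the first entry of a comma-separated Content-Language value is a simplifying assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; none of these names come from Any23 itself.
class DefaultLanguageSketch {

    /** Returns the trimmed value, or null if it is absent or blank. */
    private static String nonBlank(String value) {
        if (value == null) {
            return null;
        }
        String trimmed = value.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    /**
     * Tries, in order: "xml:lang" on the root element, "lang" on the root
     * element, then any META http-equiv="Content-Language" element.
     * Returns null if none of them declares a language.
     */
    static String detectDefaultLanguage(Document doc) {
        Element root = doc.getDocumentElement();
        if (root == null) {
            return null;
        }
        // 1. The attribute that is checked today.
        String lang = nonBlank(root.getAttribute("xml:lang"));
        if (lang != null) {
            return lang;
        }
        // 2. The plain "lang" attribute.
        lang = nonBlank(root.getAttribute("lang"));
        if (lang != null) {
            return lang;
        }
        // 3. Fallback: META http-equiv="Content-Language".
        NodeList metas = doc.getElementsByTagName("meta"); // assumes lower-case names
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            if ("Content-Language".equalsIgnoreCase(meta.getAttribute("http-equiv"))) {
                String content = nonBlank(meta.getAttribute("content"));
                if (content != null) {
                    // Assumption: if several languages are listed, use the first.
                    String first = nonBlank(content.split(",")[0]);
                    if (first != null) {
                        return first;
                    }
                }
            }
        }
        return null;
    }

    /** Test helper: parses a well-formed XHTML string into a DOM Document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html lang=\"fr\"><head/></html>");
        System.out.println(detectDefaultLanguage(doc));
    }
}
```

      Computing the value once in a helper like this and storing it in the ExtractionContext would let individual extractors read it from the context instead of repeating the lookup.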

      Also, there seems to be some overlap here with (at least) the HTMLMetaExtractor, which, conversely, appears to check the "lang" attribute but not the "xml:lang" attribute. Could the HTMLMetaExtractor simply retrieve the default document language from the ExtractionContext rather than looking it up in the document again?

          People

            Assignee: Unassigned
            Reporter: Hans Brende (hansbrende)
            Votes: 0
            Watchers: 1