  1. Nutch
  2. NUTCH-2421

parse-html to prioritize HTML5 charset definitions

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.15
    • Fix Version/s: 1.21
    • Component/s: parser
    • Labels: None

    Description

      NUTCH-1733 added support for HTML5 charset definitions.
      In some cases a web site declares multiple meta elements with different charsets:
      <meta charset="utf-8">
      <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
      (e.g. http://www.edga.fr/)
      In this case the charset from the second element (iso-8859-1) is detected.
      Should parse-html prioritize the HTML5 charset definition instead?
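
      The proposed priority could be sketched roughly as follows. This is a minimal, standalone illustration and not the parse-html plugin's actual code: the class name MetaCharsetSniffer, the method sniff, and the regular expressions are all hypothetical, and it assumes the document head is already available as a string. It looks for an HTML5 <meta charset> first and falls back to the http-equiv form only when none is found, regardless of the order in which the elements appear.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): prefers <meta charset> over http-equiv. */
final class MetaCharsetSniffer {

  // HTML5 form: <meta charset="utf-8">.
  // Simplification: only matches when charset is the first attribute.
  private static final Pattern HTML5_CHARSET = Pattern.compile(
      "<meta\\s+charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
      Pattern.CASE_INSENSITIVE);

  // Legacy form: <meta http-equiv="Content-Type" content="text/html; charset=...">
  private static final Pattern HTTP_EQUIV_META = Pattern.compile(
      "<meta\\s+[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*>",
      Pattern.CASE_INSENSITIVE);

  // The charset parameter inside a legacy meta element.
  private static final Pattern CHARSET_PARAM = Pattern.compile(
      "charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
      Pattern.CASE_INSENSITIVE);

  private MetaCharsetSniffer() {}

  /** Returns the declared charset in lower case, HTML5 declaration first. */
  static Optional<String> sniff(String head) {
    // 1. An HTML5 declaration wins wherever it appears in the head.
    Matcher html5 = HTML5_CHARSET.matcher(head);
    if (html5.find()) {
      return Optional.of(html5.group(1).toLowerCase(Locale.ROOT));
    }
    // 2. Otherwise fall back to the first http-equiv declaration with a charset.
    Matcher meta = HTTP_EQUIV_META.matcher(head);
    while (meta.find()) {
      Matcher param = CHARSET_PARAM.matcher(meta.group());
      if (param.find()) {
        return Optional.of(param.group(1).toLowerCase(Locale.ROOT));
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    String head = "<meta charset=\"utf-8\">\n"
        + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">";
    System.out.println(sniff(head).orElse("(none)"));
  }
}
```

      With the two meta elements from the description, this sketch yields utf-8 rather than iso-8859-1. A real fix would also need to decide how this interacts with the HTTP Content-Type header and any other encoding detection already in place.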

          People

            Assignee: Unassigned
            Reporter: Laurent Hervaud (lhervaud)
