  1. Nutch
  2. NUTCH-2042

parse-html increase chunk size used to detect charset

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3, 1.10
    • Fix Version/s: 2.3.1, 1.12
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The chunk used to detect the encoding of a document is limited to 2000 bytes. Although it is definitely best practice to declare the character set at the top of the document, 2000 bytes are sometimes not enough: 20 longer <link> elements pointing to JavaScript and CSS libraries can push the <meta> element containing the content type and encoding beyond the end of the chunk, "hiding" it from the detector. The same problem was observed in TIKA-357 and solved there by increasing the buffer size to 8 kB.
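
      The effect described above can be reproduced with a minimal sketch. This is not the actual parse-html code: the class name CharsetSniffSketch, the simplified regular expression, and the cdn.example.com URLs are assumptions made for illustration only. It decodes the first chunkSize bytes as ASCII and looks for a charset declaration inside a <meta> element.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffSketch {

  // Hypothetical, simplified pattern: a charset=... declaration inside a <meta> tag.
  private static final Pattern META_CHARSET = Pattern.compile(
      "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
      Pattern.CASE_INSENSITIVE);

  /** Returns the charset declared within the first chunkSize bytes, or null if none is found. */
  public static String sniffCharset(byte[] content, int chunkSize) {
    int length = Math.min(content.length, chunkSize);
    // Decode as ASCII: only the ASCII-compatible markup matters for sniffing.
    String head = new String(content, 0, length, StandardCharsets.US_ASCII);
    Matcher m = META_CHARSET.matcher(head);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    // Build a document whose <meta> element follows 20 long <link> elements.
    StringBuilder html = new StringBuilder("<html><head>\n");
    for (int i = 0; i < 20; i++) {
      html.append("<link rel=\"stylesheet\" type=\"text/css\" ")
          .append("href=\"https://cdn.example.com/assets/vendor/library-")
          .append(i)
          .append("/dist/css/library.min.css\" media=\"all\">\n");
    }
    html.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\n");
    html.append("</head><body></body></html>");
    byte[] bytes = html.toString().getBytes(StandardCharsets.US_ASCII);

    System.out.println("2000-byte chunk: " + sniffCharset(bytes, 2000));
    System.out.println("8192-byte chunk: " + sniffCharset(bytes, 8192));
  }
}
```

      In this sample the 20 <link> elements take up more than 2000 bytes, so the 2000-byte call returns null, while the 8192-byte call finds utf-8.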

        Attachments

        1. NUTCH-2042-trunk-v1.patch
          2 kB
          Sebastian Nagel
        2. NUTCH-2042-2x-v1.patch
          2 kB
          Sebastian Nagel
        3. NUTCH-2042-trunk-v2.patch
          2 kB
          Sebastian Nagel

            People

            • Assignee:
              snagel Sebastian Nagel
              Reporter:
              snagel Sebastian Nagel
            • Votes:
              0
              Watchers:
              2