NUTCH-2042

parse-html increase chunk size used to detect charset


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3, 1.10
    • Fix Version/s: 2.3.1, 1.12
    • Component/s: parser
    • Labels: None
    • Patch Info: Patch Available

    Description

      The chunk used to detect the encoding of a document is set to 2000 bytes. Although it is definitely best practice to declare the character set at the top of the document, 2000 bytes are sometimes not enough: 20 long <link> elements pointing to JavaScript and CSS libraries can push the <meta> element containing the content type and encoding beyond the chunk, so it is never seen. The same problem was observed in TIKA-357 and solved by increasing the buffer size to 8 kB.
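
      The effect described above can be reproduced with a minimal sketch. This is not the parse-html plugin's actual code: the class name, the simplified regular expression, and the sample page are hypothetical, and 8192 is only one assumed reading of "8 kB". It looks for a charset declaration in the first N bytes of a page whose <meta> element follows 20 long <link> elements.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch or Tika.
class CharsetSniffDemo {

    // Simplified pattern (an assumption for illustration): finds
    // charset=... inside a <meta> element, quoted or unquoted.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([a-zA-Z0-9_\\-]+)",
        Pattern.CASE_INSENSITIVE);

    // Inspects only the first chunkSize bytes of the content and
    // returns the declared charset name, or null if none is found there.
    public static String sniff(byte[] content, int chunkSize) {
        int length = Math.min(content.length, chunkSize);
        // Decode as ASCII: the markup around the declaration is ASCII-compatible.
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // Builds a page with 20 long <link> elements ahead of the <meta>
    // element, so the declaration starts after byte offset 2000.
    public static byte[] buildSamplePage() {
        StringBuilder html = new StringBuilder("<html><head>\n");
        for (int i = 0; i < 20; i++) {
            html.append("<link rel=\"stylesheet\" href=\"")
                .append("https://cdn.example.com/assets/libs/some-library-")
                .append(i)
                .append("/dist/css/some-library.min.css\">\n");
        }
        html.append("<meta http-equiv=\"Content-Type\" ")
            .append("content=\"text/html; charset=ISO-8859-2\">\n")
            .append("</head><body></body></html>\n");
        return html.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] page = buildSamplePage();
        System.out.println("2000-byte chunk: " + sniff(page, 2000));
        System.out.println("8192-byte chunk: " + sniff(page, 8192));
    }
}
```

      With the 2000-byte chunk the declaration lies outside the inspected range and the sketch finds nothing; with the larger chunk it is found. A larger chunk costs slightly more scanning per document, which is presumably an acceptable trade for fewer misdetected encodings.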

      Attachments

        1. NUTCH-2042-trunk-v2.patch
          2 kB
          Sebastian Nagel
        2. NUTCH-2042-trunk-v1.patch
          2 kB
          Sebastian Nagel
        3. NUTCH-2042-2x-v1.patch
          2 kB
          Sebastian Nagel

          People

            Assignee: Sebastian Nagel (snagel)
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 3
