Tika / TIKA-357

Increase buffer size for meta tag sniffing

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels: None

    Description

      Some web pages (such as makler.su, see attached) have lots of script data before the body of the HTML.

      When this happens, the sniffing code fails to find the charset info in the meta tag, because it currently examines only the first 4 KB of the document.

      Bumping the limit to 8 KB would cover all of the cases that I (Ken) have seen during a test crawl.
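
      To illustrate the failure mode, here is a minimal Java sketch. It is not Tika's actual sniffing code: the class name MetaCharsetSniffer, the helper pageWithPreamble, the regular expression, and the windows-1251 charset are all assumptions made for the example. It decodes only the first `limit` bytes of a page and searches them for a meta charset declaration, so a script preamble longer than the limit hides the tag.

      ```java
      import java.nio.charset.StandardCharsets;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      // Hypothetical illustration only; not the code from the attached patches.
      final class MetaCharsetSniffer {

          // Assumed pattern: a <meta ...> tag containing charset=NAME (case-insensitive).
          private static final Pattern META_CHARSET = Pattern.compile(
                  "(?is)<meta\\s[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_.:-]+)");

          // Looks for a charset declaration in at most the first `limit` bytes.
          // Returns the declared charset name, or null if none is found in that window.
          static String sniff(byte[] data, int limit) {
              int n = Math.min(data.length, limit);
              // ISO-8859-1 maps every byte to one char, so the ASCII markup survives decoding.
              String head = new String(data, 0, n, StandardCharsets.ISO_8859_1);
              Matcher m = META_CHARSET.matcher(head);
              return m.find() ? m.group(1) : null;
          }

          // Builds a test page whose meta tag is preceded by `preambleBytes` of script filler.
          // The windows-1251 declaration is an arbitrary example value.
          static byte[] pageWithPreamble(int preambleBytes) {
              StringBuilder sb = new StringBuilder("<html><head><script>");
              for (int i = 0; i < preambleBytes; i++) {
                  sb.append('x');
              }
              sb.append("</script>");
              sb.append("<meta http-equiv=\"Content-Type\" ");
              sb.append("content=\"text/html; charset=windows-1251\">");
              sb.append("</head><body></body></html>");
              return sb.toString().getBytes(StandardCharsets.ISO_8859_1);
          }
      }
      ```

      With a page built by pageWithPreamble(5000), sniff(page, 4096) finds nothing because the meta tag starts after byte 4096, while sniff(page, 8192) reaches it. A fixed window is still a heuristic: a page with a preamble longer than 8 KB would be missed in the same way.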

      Attachments

        1. TIKA-357-2.patch (2 kB, Kenneth William Krugler)
        2. TIKA-357.patch (0.9 kB, Kenneth William Krugler)
        3. makler.html (47 kB, Kenneth William Krugler)
        4. big-preamble.html (47 kB, Kenneth William Krugler)

            People

              Assignee: Chris A. Mattmann (chrismattmann)
              Reporter: Kenneth William Krugler (kkrugler)
