Tika / TIKA-1154

Tika hangs on format detection of a malformed HTML file

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: mime
    • Labels: None

    Description

      We are using Tika on large web archives, which also contain some malformed files. In particular, we found an HTML file with binary characters in its DOCTYPE declaration. Format detection on this file hangs Tika, both when embedded and when run from the command line.

      An example file is attached.
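
Until a fix is available, one possible way to keep a batch job moving is to run each detection call under a time limit. The following is only a minimal sketch under stated assumptions, not part of Tika: the class `GuardedDetection` and the method `detectWithTimeout` are hypothetical names, and the detection call itself is passed in as a `Callable` so the sketch depends only on the Java standard library.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not a Tika class): bounds the time spent in one detection call.
public class GuardedDetection {

    /**
     * Runs the given detection call on a worker thread and waits at most timeoutMillis.
     * Returns the detected type, or Optional.empty() if the call did not finish in time.
     */
    public static Optional<String> detectWithTimeout(Callable<String> detection, long timeoutMillis)
            throws ExecutionException, InterruptedException {
        // Daemon thread: a worker stuck in a loop that ignores interruption
        // cannot keep the JVM alive after the caller has given up on it.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "guarded-detection");
            t.setDaemon(true);
            return t;
        });
        Future<String> future = executor.submit(detection);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Best effort only: cancellation helps if the worker responds to interruption.
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In an embedded setup, the callable might be something like `() -> new Tika().detect(file)`, with the timeout chosen by the caller. Note that a thread spinning in a tight loop does not stop when interrupted, so an abandoned worker can keep consuming CPU; for very large archives, running detection in a separate process that can be killed may be more robust.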

      Attachments

        1. tika-breaker.html (0.3 kB, uploaded by Andrew Jackson)

            People

              Assignee: Unassigned
              Reporter: Andrew Jackson (anjackson)
              Votes: 0
              Watchers: 5
