Tika / TIKA-1001

Tika no longer seems to honor the HTTP meta tag for Arabic text in the ISO-8859-6 charset

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

      Description

      The attached document extracts correctly in Tika 1.1 but incorrectly in Tika 1.2.

      The difference appears to be that Tika 1.1 honors the meta http-equiv Content-Type tag, which specifies the charset as ISO-8859-6, and correctly converts the output to UTF-8.
      Tika 1.2 appears to ignore the charset specified in the meta tag.

      Some noodling seems to indicate that the problem is the charset.

      It doesn't matter which mode Tika is used in (server, app, etc.): even if the Content-Type is specified with a charset, the output is still garbage.
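
To illustrate the symptom described above, here is a minimal Python sketch (not Tika code). It assumes, purely for illustration, that a parser ignoring the declared charset falls back to ISO-8859-1; the ticket does not say which charset Tika 1.2 actually chose. The sample word is also an assumption and is not taken from badarabic.html.

```python
# Arabic sample text, stored as ISO-8859-6 bytes the way the page would be served.
sample = "مرحبا"
raw = sample.encode("iso-8859-6")

# Honoring the declared charset recovers the text, which can then be emitted as UTF-8.
honored = raw.decode("iso-8859-6")
utf8_out = honored.encode("utf-8")

# Ignoring the declaration (hypothetical ISO-8859-1 fallback) yields accented
# Latin letters instead of Arabic, i.e. the "garbage" output.
ignored = raw.decode("iso-8859-1")

print(honored)
print(ignored)
```

The bytes are identical in both cases; only the charset used to interpret them changes.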

      Attachments

      1. TIKA-1001v1.tar.gz (4 kB, Tim Allison)
      2. badarabic.html (2 kB, david lemon)

          Activity

          david lemon added a comment -

          file which no longer extracts correctly

          Tim Allison added a comment -

          This is a draft that simplifies the extraction of the charset attribute within a <meta> tag (both older HTML and HTML5) and should make charset extraction more robust to noisy meta headers.

          The strategy is:
          1) find the <meta> tags
          2) find charset=x within the meta tag
          3) return the first valid charset
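
The three steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the attached patch or Tika's actual detector: the function name `sniff_meta_charset`, both regular expressions, and the use of `codecs.lookup` as the validity check are all assumptions made for the example.

```python
import codecs
import re

# Step 1: any <meta ...> tag, case-insensitive (hypothetical pattern).
META_TAG = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)
# Step 2: charset=x anywhere inside the tag, with an optional opening quote.
CHARSET = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE)


def sniff_meta_charset(head: bytes):
    """Return the first charset name in a <meta> tag that Python recognizes, else None."""
    for tag in META_TAG.finditer(head):
        for match in CHARSET.finditer(tag.group(0)):
            name = match.group(1).decode("ascii")
            try:
                # Step 3: accept only names that resolve to a known codec.
                codecs.lookup(name)
            except LookupError:
                continue
            return name
    return None


legacy = b'<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-6">'
html5 = b'<meta charset="UTF-8">'
print(sniff_meta_charset(legacy), sniff_meta_charset(html5))
```

Because the sketch accepts charset=x anywhere inside a <meta> tag, a tag such as `<meta name="description" content="we use charset=utf-8">` would also match, which is one concrete form the false-positive concern could take.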

          Is the proposed strategy too broad? Will there be false positives?

          Will commit in a few days if there is no feedback. Thank you!

          P.S. Ignore the patch.xml file, of course.

          Tim Allison added a comment -

          Fixed as of r1514126. Thank you for submitting this issue with test file!

          david lemon added a comment -

          thanks for fixing it!

          Tim Allison added a comment -

          David,

          Thank you for submitting this. I fixed the issue triggered by your file and a few other variants that occurred to me. I wouldn't be surprised if we need to make more modifications. Please submit any other issues you find. Thank you again.


            People

            • Assignee: Unassigned
            • Reporter: david lemon
            • Votes: 0
            • Watchers: 2
