Tika / TIKA-431

Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: general
    • Labels:
      None

      Description

      Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.

      Content-Encoding is not for the charset. It is for values like gzip, deflate, compress, or identity.

      Charset is passed in with the Content-Type. For instance: text/html; charset=iso-8859-1
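As a sketch of that distinction, the charset parameter can be pulled out of a Content-Type value with plain string handling. This is illustrative code, not Tika's actual implementation; a real parser should follow the full MIME parameter syntax:

```java
public class ContentTypeCharset {
    // Extract the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=iso-8859-1" -> "iso-8859-1".
    static String charsetOf(String contentType) {
        if (contentType == null) return null;
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase().startsWith("charset=")) {
                // Strip optional quotes around the parameter value
                return p.substring("charset=".length()).replace("\"", "").trim();
            }
        }
        return null; // no charset parameter present
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=iso-8859-1")); // iso-8859-1
        System.out.println(charsetOf("text/plain"));                    // null
    }
}
```

Note that Content-Encoding values like gzip would never appear here; they describe the transfer encoding of the bytes, not the character set of the text.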

      Tika should, in my opinion, do the following:

      1. Stop using Content-Encoding, unless it wants me to be able to pass in gzipped content in an input stream.

      2. Parse and understand charset=... declarations if passed in the Metadata object

      3. Return charset=... declarations in the Metadata object if a charset is detected.

      Attachments

      1. TIKA-431.patch (29 kB) Ken Krugler


          Activity

          Erik Hetzner added a comment -

          See TIKA-341, apparently my suggestion (2) above is implemented already.

          Thank you for anticipating this issue in advance!

          Jukka Zitting added a comment -

          Agreed, we should be using the charset parameter of the media type instead of the Content-Encoding header.

          AFAICT we need to adjust the HtmlParser, MboxParser and TXTParser classes to do this. Any volunteers?

          Ken Krugler added a comment -

          I should have some time soon to do a once-over on a bunch of encoding-related issues.

          Jan Høydahl added a comment -

          ping()
          We've just been bitten by this - any chance of a fix for v1.0?

          Ken Krugler added a comment -

          Hi Jan - sorry for the delay. Would end of week be soon enough?

          – Ken

          Nick Burch added a comment -

          Any chance someone could work up a failing unit test for this, so when Ken's fix is done we'll be able to verify it works (and ensure it doesn't get broken in future!)

          Jan Høydahl added a comment -

          End of week is good. As soon as 1.0 gets released, I'll try to get it into the next Solr release.

          Ken Krugler added a comment -

          This is a pretty big change, so I'm going to let the patch sit for a bit.

          This should also address TIKA-539.

          Ken Krugler added a comment -

          Some other things I should have mentioned with regards to this patch:

          • Anybody who was returning the charset via CONTENT_ENCODING now returns it via CONTENT_TYPE, in the ...; charset=xxx parameter.
          • If the charset is specified via Metadata.CONTENT_TYPE, and it's valid (supported, etc.), then TXTParser will no longer use ICU4J's charset detector code. This matches the behavior of the HtmlParser code, and is the essence of TIKA-539. We could revisit this if people feel that depending on things like server response headers is too risky.
          • The charset detection algorithm in HtmlParser matches what was proposed/discussed in TIKA-539. Namely, the <meta> tag charset or the incoming metadata content-type charset is used if it's valid and either appears in only one of the two or is the same in both. Otherwise ICU4J is used.
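The "valid (supported, etc.)" check above can be sketched with the standard library. The helper name is hypothetical; Tika's actual validity test may differ:

```java
import java.nio.charset.Charset;

public class CharsetValidity {
    // Check whether a declared charset name is one the JVM actually
    // supports before trusting it as the document's encoding.
    static boolean isUsable(String name) {
        if (name == null || name.isEmpty()) return false;
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false; // syntactically illegal charset name
        }
    }

    public static void main(String[] args) {
        System.out.println(isUsable("UTF-8"));              // true
        System.out.println(isUsable("no-such-charset-xyz")); // false
    }
}
```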
          Robert Muir added a comment -

          Shouldn't the charset from the HTTP response header instead be supplied to CharsetDetector.setDeclaredEncoding?

          http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html#setDeclaredEncoding(java.lang.String)

          It seems bad to trust this completely.

          Ken Krugler added a comment -

          Hi Robert,

          I'm assuming you're talking about the case where all we have is the server response header (versus the case where it's in the HTML meta tag), right?

          If so, then I agree with you - I think it would be better not to trust that. Given what I've seen coming back from web servers, they lie too often. Though the ICU detection code isn't very good either, as I found out after doing an analysis.

          Anyway, if I made that change, then the current code would go ahead and pass it as the hint to ICU.

          Robert Muir added a comment -

          I'm not sure it should be trusted even if it's in both. E.g., some logic like:

          • if there is a charset in either the header or the meta tag, or in both and it's the same, use it as setDeclaredEncoding (a suggestion)
          • if there is any ambiguity, then it's clearly wrong already, and don't setDeclaredEncoding to anything.
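That logic could be sketched as follows (a hypothetical helper, not actual Tika or ICU4J code; the non-null result would then be the only value passed to CharsetDetector.setDeclaredEncoding):

```java
public class DeclaredEncodingHint {
    // Pick a declared-encoding hint: use a charset only when the HTTP
    // header and the <meta> tag do not contradict each other. Any
    // disagreement means at least one is wrong, so pass no hint at all.
    static String hint(String headerCharset, String metaCharset) {
        if (headerCharset == null) return metaCharset;
        if (metaCharset == null) return headerCharset;
        return headerCharset.equalsIgnoreCase(metaCharset) ? headerCharset : null;
    }

    public static void main(String[] args) {
        System.out.println(hint("utf-8", null));         // utf-8
        System.out.println(hint("utf-8", "UTF-8"));      // utf-8
        System.out.println(hint("utf-8", "iso-8859-1")); // null (ambiguous)
    }
}
```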
          Robert Muir added a comment -

          Though the iCU detection code isn't very good either, as I found out after doing an analysis.

          Really? What kind of analysis? Other tests seem to show these statistical techniques work pretty well (98% or 99%), though I haven't looked too hard into their methodology.

          http://philip.html5.org/data/charsets.html#sniffing-bytes
          http://cs229.stanford.edu/proj2007/KimPark-AutomaticDetectionOfCharacterEncodingAndLanguages.pdf

          I think that's a lot more accurate than web server configurations or meta tags.

          Ken Krugler added a comment -

          For analysis, I used Tika charset detection and compared to <meta> http-equiv charset. See my email to the Tika list with subject "HUG talk on Public Terabyte Dataset project".

          As I mentioned in that post, it's possible my analysis had errors, but the results weren't great for the % of time that ICU4J matched the meta tag charset. When I looked at mismatches manually, they mostly seemed to be issues with ICU, versus a bad meta tag charset.

          From the page at http://philip.html5.org/data/charsets.html#sniffing-bytes, I don't see stats on comparing various declared encodings (e.g. what percentage of the time did response header == meta == detected), which would be useful.

          I've got some crawl data which, if I had time, I could run through a similar analysis but this time dump out all of the cases where ICU (with and without hints) differs from both.

          Ken Krugler added a comment -

          Re "if there is any ambiguity, then its clearly wrong already". If there's ambiguity between the response header and the meta tag, then it's clear that one is wrong, but in my experience meta tags are a lot more accurate than the server response headers.

          Jukka Zitting added a comment -

          In revision 1358858 I made the text and html parsers return character encoding information in the charset parameter of the returned content type. The content encoding field is still present for backwards compatibility, but I added a note to the CHANGES.txt mentioning that it should be considered deprecated.

          Tomas Safarik added a comment -

          Hello,

          It seems that I created a duplicate issue, TIKA-952 (which I closed).

          But I am not sure why the Content-Encoding and charset parameter values differ. Shouldn't they be the same?


            People

            • Assignee: Jukka Zitting
            • Reporter: Erik Hetzner
            • Votes: 2
            • Watchers: 2