Tika / TIKA-2047

TXTParser overwrites mime type/masks types that are subtype of text

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.14
    • Component/s: parser
    • Labels:
      None

      Description

      For vcal and other mime types that are subtypes of text/plain, the TXTParser overwrites the mime type with "text/plain". We should check what mime type was passed in via the Metadata and add the charset to that, e.g. "text/calendar; charset=ISO-8859-1"... right?

                  Charset charset = reader.getCharset();
                  MediaType type = new MediaType(MediaType.TEXT_PLAIN, charset);
                  metadata.set(Metadata.CONTENT_TYPE, type.toString());
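
The proposal above can be sketched without any Tika dependency. This is only a minimal illustration, not the committed fix: the class `ContentTypeSketch` and the method `withCharset` are hypothetical names, the incoming type is handled as a plain string rather than through Tika's `MediaType`, and any parameters already present on the incoming type are assumed to be safe to drop.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Tika: keeps the passed-in mime type
// and attaches the detected charset, falling back to text/plain.
public class ContentTypeSketch {

    static String withCharset(String incomingType, Charset detected) {
        String base = "text/plain"; // fallback when nothing usable was passed in
        if (incomingType != null) {
            // Drop existing parameters such as an old "; charset=..." (assumption).
            int semi = incomingType.indexOf(';');
            String candidate =
                    (semi >= 0 ? incomingType.substring(0, semi) : incomingType).trim();
            // Accept only something shaped like "type/subtype".
            if (candidate.indexOf('/') > 0) {
                base = candidate.toLowerCase(Locale.ROOT);
            }
        }
        return base + "; charset=" + detected.name();
    }

    public static void main(String[] args) {
        // prints "text/calendar; charset=ISO-8859-1"
        System.out.println(withCharset("text/calendar", StandardCharsets.ISO_8859_1));
        // prints "text/plain; charset=UTF-8"
        System.out.println(withCharset(null, StandardCharsets.UTF_8));
    }
}
```

In this sketch the hard-coded text/plain is used only when no usable type was passed in, which is the behavior the description asks about.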
      

          Activity

          tallison@mitre.org Tim Allison added a comment -

          This fix breaks unit tests testUsingCharsetInContentTypeHeader() and testCharsetDetectionWithShortSnipet() for TIKA-341, TIKA-771, and TIKA-868. The issue is that the unit tests send in a mime type that is not "text/plain", and they expect it to be overwritten. Given the issues that those tests are linked to, I don't think that was the original intent. I think the original intent was only to carry the encoding information through.

          Ken Krugler and all, do you have any problems if I modify the unit tests, like so:

              public void testUsingCharsetInContentTypeHeader() throws Exception {
          ...
          -        assertEquals("text/plain; charset=ISO-8859-15", metadata.get(Metadata.CONTENT_TYPE));
          +        assertEquals("text/html; charset=ISO-8859-15", metadata.get(Metadata.CONTENT_TYPE));
          ...
          
              @Test
              public void testCharsetDetectionWithShortSnipet() throws Exception {
          ...
          -        assertEquals("text/plain; charset=UTF-8", metadata.get(Metadata.CONTENT_TYPE));
          +        assertEquals("application/binary; charset=UTF-8", metadata.get(Metadata.CONTENT_TYPE));
          ...
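
To make the changed expectations concrete, here is a small standalone sketch contrasting the two behaviors. It is an illustration only: the class and method names are hypothetical, and the incoming types "text/html" and "application/binary" are assumed from the updated assertions above rather than taken from the test source.

```java
// Hypothetical comparison, not Tika code: old vs. proposed content-type handling.
public class TxtParserExpectationSketch {

    // Old behavior: the passed-in type is ignored and text/plain always wins.
    static String oldBehavior(String incomingType, String charsetName) {
        return "text/plain; charset=" + charsetName;
    }

    // Proposed behavior: keep the passed-in type, only carry the charset through.
    static String newBehavior(String incomingType, String charsetName) {
        String base = (incomingType == null) ? "text/plain" : incomingType;
        return base + "; charset=" + charsetName;
    }

    public static void main(String[] args) {
        // prints "text/plain; charset=ISO-8859-15"
        System.out.println(oldBehavior("text/html", "ISO-8859-15"));
        // prints "text/html; charset=ISO-8859-15"
        System.out.println(newBehavior("text/html", "ISO-8859-15"));
        // prints "application/binary; charset=UTF-8"
        System.out.println(newBehavior("application/binary", "UTF-8"));
    }
}
```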
          
          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x-windows #49 (See https://builds.apache.org/job/tika-2.x-windows/49/)

          • Maintain passed-in mime in TXTParser (TIKA-2047). (tallison: rev 32d9ece8d84986de240087a580e094de3f879f3c)
          • (edit) CHANGES.txt
          • (edit) tika-parser-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/TXTParserTest.java
          • (edit) tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/TXTParser.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1103 (See https://builds.apache.org/job/Tika-trunk/1103/)
          TIKA-2047 – maintain mime info for mimes that are subtype of text/plain (tallison: rev 415381212291e843e9091f43f6db8c432eb02aa9)

          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/txt/TXTParser.java
          • (edit) CHANGES.txt
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/txt/TXTParserTest.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #145 (See https://builds.apache.org/job/tika-2.x/145/)

          • Maintain passed-in mime in TXTParser (TIKA-2047). (tallison: rev 32d9ece8d84986de240087a580e094de3f879f3c)
          • (edit) CHANGES.txt
          • (edit) tika-parser-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/txt/TXTParserTest.java
          • (edit) tika-parser-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/txt/TXTParser.java

            People

            • Assignee:
              tallison@mitre.org Tim Allison
              Reporter:
              tallison@mitre.org Tim Allison
            • Votes:
              0
              Watchers:
              2
