TIKA-309

Mime type application/rdf+xml not correctly detected

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: mime
    • Labels: None

      Description

      Mime type detector using AutoDetectParser and Metadata returns "application/xml" for the URL http://www.w3.org/2002/07/owl#, where it should be "application/rdf+xml". The correct mime type is also suggested here: http://www.w3.org/TR/owl-ref/#MIMEType.

      P.S., Tika was downloaded from svn and built with Maven last week.

          Activity

          Jukka Zitting added a comment -

          Added improved RDF/XML type metadata and test cases to verify that the type is correctly detected.

          Yuan-Fang Li added a comment -

          This fix worked for me until yesterday. After I updated to the latest svn revision (829668), my test cases for the application/rdf+xml mime type failed again for the URLs "http://www.ai.sri.com/daml/services/owl-s/1.2/Process.owl" and "http://www.w3.org/2002/07/owl#". The mime type returned is "application/xml" for the first one and "text/html" for the second one, so I'm reopening this issue.

          Chris A. Mattmann added a comment -

          Hey Guys, I think we just need another line in the tika-mimetypes.xml file for this. I'll take a crack at it, if there are no objections. Thanks!

          Chris A. Mattmann added a comment - edited

          This turned out to be a tricky nightmare. Yuan-Fang,

          1. the first file, http://www.ai.sri.com/daml/services/owl-s/1.2/Process.owl
          a. uses XML entities to obfuscate the rdf, rdfs, owl, etc. URIs and localname, which defeats rootXML detection.
          b. includes XML byte chars that match application/xml magic detection, which overrides the bytes that I put in to detect RDF and OWL
          2. the second file, http://www.w3.org/2002/07/owl#
          a. is easier to detect, since the URIs and localname are not obfuscated, but magic detection still doesn't work out of the box

          I'm going to add some better magic detection for RDF/OWL files that start with <rdf:RDF or <owl:Ontology, as well as better detection based on glob patterns and so forth, but in the end, my suggestion for this particular problem is:

          1. use the o.a.tika.detect.NameDetector and set the Metadata.RESOURCE_NAME_KEY value before calling (pseudo-code):

          AutoDetectParser parser = new AutoDetectParser();
          parser.setDetector(new NameDetector());
          Metadata met = new Metadata();
          met.set(Metadata.RESOURCE_NAME_KEY, "name or url of your file");
          parser.parse(InputStream stream, some ContentHandler, met);

          A commit with my updated tika-mimetypes.xml, plus unit tests showing that your examples are correctly detected when RESOURCE_NAME_KEY is set, is forthcoming.
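
For illustration only, the kind of magic check mentioned above (content that starts with <rdf:RDF or <owl:Ontology) could be sketched with plain JDK classes as follows. The class and method names are hypothetical, and this is not Tika's implementation, whose rules live in tika-mimetypes.xml. Note that a plain prefix check like this would still miss documents that put comments, a DOCTYPE, or entity-obfuscated names before the root element.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of Tika: classify a document from its first bytes.
class RdfMagicSketch {

    static String sniff(byte[] head) {
        // ISO-8859-1 maps every byte to a char, so decoding never fails.
        String text = new String(head, StandardCharsets.ISO_8859_1).trim();
        boolean sawDeclaration = false;

        // Skip an optional XML declaration such as <?xml version='1.0'?>.
        if (text.startsWith("<?xml")) {
            int end = text.indexOf("?>");
            if (end < 0) {
                // The declaration is cut off; all we know is that this is XML.
                return "application/xml";
            }
            text = text.substring(end + 2).trim();
            sawDeclaration = true;
        }

        // The two prefixes named in the comment above.
        if (text.startsWith("<rdf:RDF") || text.startsWith("<owl:Ontology")) {
            return "application/rdf+xml";
        }
        return sawDeclaration ? "application/xml" : "application/octet-stream";
    }
}
```

The sketch works the same with or without a leading XML declaration, which is the property a stream-based detector needs when no file name is available as a hint.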

          Chris A. Mattmann added a comment -
          • fixed in r836035:
          • correctly identified RDF/OWL mime types using magic by changing the regex pattern for localName in MimeTypes.java (this covers the case where only the <ns:localName... prefix is read, with no closing ">", since we only read the first N bytes of the magic header)
          • added unit tests and URLs from this issue for regression
          • refactored o.a.tika.mime.MimeDetectionTest to support URLs as InputStreams (as well as Files)
          • took out <match value="<!--" type="string" offset="0"/> for HTML detection since comments can appear in HTML, XML, etc., and aren't specific to HTML
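
As a rough illustration of the localName point above: when only the first N bytes are read, the root element's start tag may be cut off before its closing ">", so a pattern that requires ">" never matches. The sketch below uses a hypothetical pattern and hypothetical names; it is not the regex from MimeTypes.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Tika code: find the root element's local name
// in a possibly truncated document prefix.
class RootElementSketch {

    // "<", an optional "prefix:", then the local name. No closing ">" is required,
    // so a start tag that was cut off mid-attribute still matches.
    private static final Pattern ROOT =
            Pattern.compile("<(?:([A-Za-z_][\\w.-]*):)?([A-Za-z_][\\w.-]*)");

    static String rootLocalName(String head) {
        Matcher m = ROOT.matcher(head);
        // "<?xml" and "<!--" are skipped because "?" and "!" cannot start a name.
        return m.find() ? m.group(2) : null;
    }
}
```

A caveat of this simplified pattern is that an element name appearing inside a leading comment would be picked up as the root; a real detector would need to handle that case.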
          Yuan-Fang Li added a comment -

          Hi Chris,

          Thanks a lot for the fix. However, I have to reopen the ticket due to some problems with InputStream, and some other issues.

          1. In your comment you suggested that I do the following (pseudo code):

          AutoDetectParser parser = new AutoDetectParser();
          parser.setDetector(new NameDetector());
          Metadata met = new Metadata();
          met.set(Metadata.RESOURCE_NAME_KEY, "name or url of your file");
          parser.parse(InputStream stream, some ContentHandler, met);

          Since NameDetector takes a map as the parameter for the constructor, I have to do the following:

          parser.setDetector(new NameDetector(new HashMap<Pattern, MediaType>()));

          Doing so invalidates my tests: because the map in NameDetector is empty, the mime type returned is always "application/octet-stream". Is there another way to initialize the NameDetector?

          2. The detection for the 2 URLs works perfectly now based on your suggestion (not adding NameDetector to the parser but adding met.set(Metadata.RESOURCE_NAME_KEY, "name or url of your file"); ). However, if my input is an input stream, the test still fails since the parser doesn't have the hint from file/URL names.
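
To make the two points above concrete, here is a minimal JDK-only sketch of name-based detection driven by a pattern-to-type map. All names and the single rule are hypothetical; this is not Tika's NameDetector. It shows why an empty map always falls back to "application/octet-stream", and why a bare stream with no resource name gets no help from name hints.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch, not Tika's NameDetector: map resource names to media types.
class NameHintSketch {

    static String detect(Map<Pattern, String> rules, String resourceName) {
        if (resourceName != null) {
            for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
                if (rule.getKey().matcher(resourceName).matches()) {
                    return rule.getValue();
                }
            }
        }
        // No name, no rules, or no matching rule: nothing to go on.
        return "application/octet-stream";
    }

    // An assumed rule set for illustration: names ending in .owl or .rdf.
    static Map<Pattern, String> rdfRules() {
        Map<Pattern, String> rules = new LinkedHashMap<Pattern, String>();
        rules.put(Pattern.compile(".*\\.(owl|rdf)", Pattern.CASE_INSENSITIVE),
                "application/rdf+xml");
        return rules;
    }
}
```

With this assumed rule, the Process.owl URL is matched by its extension, while "http://www.w3.org/2002/07/owl#" is not, so content-based detection is still needed for the second URL and for plain input streams.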

          Chris A. Mattmann added a comment -

          Yuan-Fang,

          There is a unit test that should show whether this is working on your system. Does
          /lucene/tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
          pass on your system?

          Regarding #1 and #2 above, I was assuming you could pass in a regex pattern->MediaType map to NameDetector. If you didn't want to pass that in, you may want to take a look at the other Detectors in the o.a.t.detect package. For #2, if the test above passes, it should prove that InputStream detection properly works?

          Cheers,
          Chris

          Jukka Zitting added a comment -

          Re-resolving this as Fixed, as the test case we have works. Please file a new issue with a clear test case in case the current behaviour does not work for you.

          Note that with the fix Chris made, you should be able to auto-detect the mentioned RDF files with the normal AutoDetectParser even without any setDetector() customizations.

          Yuan-Fang Li added a comment -

          Hi Chris, Jukka,

          Yes, the Tika tests are passing for me. However, my test for one of the ontologies ("http://www.w3.org/2002/07/owl#") is still failing, and here is why.

          In the test tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java, the method testUrl(String expected, String url, String file) actually tests the content of the file named "file", with the url serving only as a clue for the detection. My test, however, opens an input stream on the actual URL and uses that to detect the mime type. For the above URL, Tika tests against the file named "test-difficult-rdf2.xml". The only difference I can see between this file and the actual content of the URL is one line at the top: "<?xml version='1.0' encoding='ISO-8859-1'?>". This line is present in the Tika test file but not in the content served by the URL.

          So, if you remove or comment out that line from "test-difficult-rdf2.xml" and run the test with the Maven command mvn -Dtest=MimeDetectionTest test, it will fail. Alternatively, you could use the following test case to test against the real URL.

          @Test
          public void testRDFStreamMimeType() throws IOException {
              URL url = new URL("http://www.w3.org/2002/07/owl#");
              final InputStream stream = new BufferedInputStream(url.openStream());
              try {
                  MimeTypes mimeTypes = TikaConfig.getDefaultConfig().getMimeRepository();
                  Metadata metadata = new Metadata();
                  String mime = mimeTypes.detect(stream, metadata).toString();
                  assertEquals("application/rdf+xml", mime);
              } finally {
                  stream.close();
              }
          }

          Cheers
          Yuan-Fang
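
To reproduce the difference locally without editing the test resource by hand, the leading XML declaration can be stripped from the file content before it is passed to detection. The helper below is a hypothetical JDK-only sketch; neither the class nor the method exists in Tika.

```java
// Hypothetical helper, not part of Tika: make a local test file look like
// the content served by the URL by dropping its leading XML declaration.
class XmlDeclSketch {

    static String stripXmlDeclaration(String xml) {
        // Removes one leading <?xml ... ?> declaration plus surrounding whitespace.
        // Content without a declaration is returned unchanged.
        return xml.replaceFirst("^\\s*<\\?xml[^>]*\\?>\\s*", "");
    }
}
```

Feeding both the original and the stripped content to the detector under test should show whether detection depends on the declaration being present.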

          Chris A. Mattmann added a comment -

          Yuan-Fang:

          I've confirmed what you mentioned. When the XML declaration on the first line is removed from test-difficult-rdf2.xml (so that it matches what the remote URL serves), I get this:

          [chipotle:~/src/tika/trunk] mattmann% mvn -Dtest=MimeDetectionTest clean test
          [INFO] Scanning for projects...
          [INFO] Reactor build order:
          [INFO] Apache Tika parent
          [INFO] Apache Tika core
          [INFO] Apache Tika parsers
          [INFO] Apache Tika application
          [INFO] Apache Tika
          [INFO] ------------------------------------------------------------------------
          [INFO] Building Apache Tika parent
          [INFO] task-segment: [clean, test]
          [INFO] ------------------------------------------------------------------------
          [INFO] [clean:clean]
          [INFO] Setting property: classpath.resource.loader.class => 'org.codehaus.plexus.velocity.ContextClassLoaderResourceLoader'.
          [INFO] Setting property: velocimacro.messages.on => 'false'.
          [INFO] Setting property: resource.loader => 'classpath'.
          [INFO] Setting property: resource.manager.logwhenfound => 'false'.
          [INFO] [remote-resources:process {execution: default}]
          [INFO] ------------------------------------------------------------------------
          [INFO] Building Apache Tika core
          [INFO] task-segment: [clean, test]
          [INFO] ------------------------------------------------------------------------
          [INFO] [clean:clean]
          [INFO] [remote-resources:process {execution: default}]
          [INFO] [resources:resources]
          [INFO] Using 'UTF-8' encoding to copy filtered resources.
          [INFO] Copying 20 resources
          [INFO] Copying 3 resources
          [INFO] [compiler:compile]
          [INFO] Compiling 86 source files to /Users/mattmann/src/tika/trunk/tika-core/target/classes
          [INFO] [resources:testResources]
          [INFO] Using 'UTF-8' encoding to copy filtered resources.
          [INFO] Copying 24 resources
          [INFO] Copying 3 resources
          [INFO] [compiler:testCompile]
          [INFO] Compiling 19 source files to /Users/mattmann/src/tika/trunk/tika-core/target/test-classes
          [INFO] [surefire:test]
          [INFO] Surefire report directory: /Users/mattmann/src/tika/trunk/tika-core/target/surefire-reports

          -------------------------------------------------------
          T E S T S
          -------------------------------------------------------
          Running org.apache.tika.mime.MimeDetectionTest
          Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.568 sec <<< FAILURE!

          Results :

          Failed tests:
          testDetection(org.apache.tika.mime.MimeDetectionTest)

          Tests run: 2, Failures: 1, Errors: 0, Skipped: 0

          [INFO] ------------------------------------------------------------------------
          [ERROR] BUILD FAILURE
          [INFO] ------------------------------------------------------------------------
          [INFO] There are test failures.

          Please refer to /Users/mattmann/src/tika/trunk/tika-core/target/surefire-reports for the individual test results.
          [INFO] ------------------------------------------------------------------------
          [INFO] For more information, run Maven with the -e switch
          [INFO] ------------------------------------------------------------------------
          [INFO] Total time: 8 seconds
          [INFO] Finished at: Wed Nov 25 14:45:52 PST 2009
          [INFO] Final Memory: 15M/31M
          [INFO] ------------------------------------------------------------------------
          [chipotle:~/src/tika/trunk] mattmann%

          [chipotle:~/src/tika/trunk] mattmann% more tika-core/target/surefire-reports/org.apache.tika.mime.MimeDetectionTest.txt
          -------------------------------------------------------------------------------
          Test set: org.apache.tika.mime.MimeDetectionTest
          -------------------------------------------------------------------------------
          Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.573 sec <<< FAILURE!
          testDetection(org.apache.tika.mime.MimeDetectionTest) Time elapsed: 0.44 sec <<< FAILURE!
          junit.framework.ComparisonFailure: http://www.w3.org/2002/07/owl# is not properly detected. expected:<application/rdf+xml> but was:<text/plain>
          at junit.framework.Assert.assertEquals(Assert.java:81)
          at org.apache.tika.mime.MimeDetectionTest.testStream(MimeDetectionTest.java:87)
          at org.apache.tika.mime.MimeDetectionTest.testUrl(MimeDetectionTest.java:71)
          at org.apache.tika.mime.MimeDetectionTest.testDetection(MimeDetectionTest.java:54)

          I'm looking into this right now, and I'll file another issue for it.


            People

            • Assignee: Chris A. Mattmann
            • Reporter: Yuan-Fang Li