  1. Tika
  2. TIKA-772

media type detection fails for html documents, results in text/plain instead of text/html


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.10
    • Fix Version/s: None
    • Component/s: mime

    Description

      Hey, I was testing media type detection on most of the major document types. It fails for HTML documents of about 5,000 words that start with:
      <?xml version="1.0" encoding="UTF-8"?>

      and consist of a root "html" element containing only "p" elements. For these, detection always returns text/plain instead of text/html.

      Bar.java
      @Test
      public void testMediaType() throws Exception {
          List<Document> allDocs = DocumentProvider.docsAsList();
          // Maps each misdetected document to the type Tika reported for it.
          Map<Document, String> failed = new HashMap<Document, String>();
          // One facade instance is enough for all documents.
          Tika tika = new Tika();

          for (Document doc : allDocs) {
              String type;
              // detect(InputStream) is called without a file name argument;
              // the stream is closed once detection is done.
              try (TikaInputStream stream = TikaInputStream.get(doc.getFile())) {
                  type = tika.detect(stream);
              }
              if (!doc.getMediaType().toString().equals(type)) {
                  failed.put(doc, type);
              }
          }

          for (Document doc : failed.keySet()) {
              log.error("expected: " + doc.getMediaTypeString() + "; actual: " + failed.get(doc) + ";  path to file: " + doc.getFile().getAbsolutePath());
          }
          assertTrue(failed.isEmpty(), "mime type was incorrectly detected for : " + failed.size() + " documents;");
      }
      

      Am I doing anything wrong?
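For reproducing this without the attachments, the document shape described above (XML declaration, a root "html" element, "p" children only, roughly 5,000 words) can be generated in memory. The following is a minimal sketch using only the JDK; `SampleHtmlBuilder`, its method names, and the filler words are hypothetical and are not part of Tika or of the reporter's test suite. It only approximates the attached files and makes no claim about what Tika detects for its output.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: builds a document shaped like the ones in the report. */
public class SampleHtmlBuilder {

    /** Returns an XML declaration followed by an "html" root with only "p" children. */
    public static String build(int paragraphs, int wordsPerParagraph) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<html>\n");
        for (int p = 0; p < paragraphs; p++) {
            sb.append("<p>");
            for (int w = 0; w < wordsPerParagraph; w++) {
                if (w > 0) {
                    sb.append(' ');
                }
                // Filler text; the real attachments contain ordinary prose.
                sb.append("word").append(w);
            }
            sb.append("</p>\n");
        }
        sb.append("</html>\n");
        return sb.toString();
    }

    /** Same document, encoded as UTF-8 to match the declared encoding. */
    public static byte[] buildBytes(int paragraphs, int wordsPerParagraph) {
        return build(paragraphs, wordsPerParagraph).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 50 paragraphs of 100 words each gives 5,000 words.
        byte[] doc = buildBytes(50, 100);
        System.out.println("generated " + doc.length + " bytes");
    }
}
```

The resulting bytes can be wrapped in a `java.io.ByteArrayInputStream` and passed to `tika.detect(InputStream)` in the same way as the file streams in the test above.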

      Attachments

        1. tika.png
          99 kB
          Joseph Vychtrle
        2. it.html
          35 kB
          Joseph Vychtrle
        3. html.zip
          263 kB
          Joseph Vychtrle

        Activity

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: Joseph Vychtrle (vychtrle)

            Dates

              Created:
              Updated:
              Resolved: