[TIKA-772] media type detection fails for html documents, results in text/plain instead of text/html - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 0.10
Fix Version/s: None
Component/s: mime
Labels:
- detection
- media-type

Description

Hey, I was testing media type detection on most of the major document types. When testing HTML documents of about 5,000 words that start with:
<?xml version="1.0" encoding="UTF-8"?>

and consist only of a root "html" element and "p" elements, detection always returns text/plain instead of text/html.

Bar.java

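@Test
public void testMediaType() throws Exception {
    List<Document> allDocs = DocumentProvider.docsAsList();
    // Maps each misdetected document to the type that was actually returned.
    Map<Document, String> failed = new HashMap<Document, String>();
    // One Tika facade is enough; there is no need to create it per document.
    Tika tika = new Tika();
    for (Document doc : allDocs) {
        // Detect the media type from the file's content stream.
        String type = tika.detect(TikaInputStream.get(doc.getFile()));
        if (!doc.getMediaType().toString().equals(type)) {
            failed.put(doc, type);
        }
    }

    // Log every mismatch before failing, so all affected files are visible.
    for (Map.Entry<Document, String> entry : failed.entrySet()) {
        Document doc = entry.getKey();
        log.error("expected: " + doc.getMediaTypeString() + "; actual: " + entry.getValue() + ";  path to file: " + doc.getFile().getAbsolutePath());
    }
    assertTrue(failed.isEmpty(), "mime type was incorrectly detected for : " + failed.size() + " documents;");
}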

Am I doing anything wrong?
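
One way to narrow this down is to generate a small input with exactly the structure described above and run detection on that single file. The sketch below only builds such a file with the standard library. The class name `HtmlSampleBuilder`, its method names, and the paragraph and word counts are hypothetical and not part of Tika or of the test above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for building a reproduction input; not part of Tika.
class HtmlSampleBuilder {

    // Builds a document shaped like the one in the report: an XML declaration,
    // a root "html" element, and nothing but "p" elements inside it.
    static String build(int paragraphs, int wordsPerParagraph) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<html>\n");
        for (int p = 0; p < paragraphs; p++) {
            sb.append("<p>");
            for (int w = 0; w < wordsPerParagraph; w++) {
                if (w > 0) {
                    sb.append(' ');
                }
                sb.append("word").append(w);
            }
            sb.append("</p>\n");
        }
        sb.append("</html>\n");
        return sb.toString();
    }

    // Writes the sample as UTF-8 to a temporary .html file and returns its path.
    static Path writeTemp(String content) throws IOException {
        Path file = Files.createTempFile("tika-772-", ".html");
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws IOException {
        // 50 paragraphs of 100 words approximates the 5000 words mentioned above.
        Path file = writeTemp(build(50, 100));
        System.out.println(file.toAbsolutePath());
        // With Tika on the classpath, the printed file could then be passed to
        // tika.detect(TikaInputStream.get(file.toFile())), the call used in the test.
    }
}
```

If detection on the generated file returns text/html while the attached documents still return text/plain, the difference probably lies in the content of those specific files rather than in the XML declaration itself.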

Attachments

- tika.png, 05/Nov/11 20:43, 99 kB, Joseph Vychtrle
- it.html, 05/Nov/11 22:34, 35 kB, Joseph Vychtrle
- html.zip, 05/Nov/11 18:12, 263 kB, Joseph Vychtrle

People

Assignee: Jukka Zitting
Reporter: Joseph Vychtrle
Votes: 0
Watchers: 0

Dates

Created: 03/Nov/11 21:13
Updated: 05/Nov/11 23:05
Resolved: 05/Nov/11 19:31