It should be possible to disable language detection in the AutoDetectParser.
Between 0.4 and the current trunk, the time Tika spent parsing my test data (100 MB of compressed web crawl data: mixed HTML, images, etc.) increased considerably. After profiling, I determined that most of the time was spent in language detection.
time results of indexing my test data with Lucene using AutoDetectParser:
time results on the same test data using the same code as AutoDetectParser, but with language detection disabled:
Obviously these numbers are worthless in their particulars, but I think they demonstrate that one ought to be able to turn off language detection, as it can massively slow down parsing.
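To make the request concrete, here is a minimal sketch of the kind of switch I have in mind. It is not Tika code: `ToggleableDetectionSketch`, `SketchParser`, `setLanguageDetectionEnabled` and the stand-in detector are all hypothetical names, and the "parsing" is just a trim. It only illustrates that the expensive detection step is skipped entirely when the flag is off.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch, NOT Tika's API: an opt-out switch around a costly stage.
public class ToggleableDetectionSketch {

    /** Parse result: extracted text plus a language code, if detection ran. */
    static final class Parsed {
        final String text;
        final Optional<String> language;

        Parsed(String text, Optional<String> language) {
            this.text = text;
            this.language = language;
        }
    }

    /** Hypothetical parser wrapper whose language detection can be disabled. */
    static final class SketchParser {
        private final Function<String, String> detector; // stand-in for the slow step
        private boolean languageDetectionEnabled = true; // on by default
        int detectorCalls = 0;                           // counts detector invocations

        SketchParser(Function<String, String> detector) {
            this.detector = detector;
        }

        void setLanguageDetectionEnabled(boolean enabled) {
            this.languageDetectionEnabled = enabled;
        }

        Parsed parse(String rawText) {
            String text = rawText.trim(); // placeholder for real content extraction
            if (!languageDetectionEnabled) {
                // Skip the detector completely, so its cost is never paid.
                return new Parsed(text, Optional.empty());
            }
            detectorCalls++;
            return new Parsed(text, Optional.of(detector.apply(text)));
        }
    }

    public static void main(String[] args) {
        SketchParser parser = new SketchParser(t -> "en"); // assumed trivial detector
        System.out.println(parser.parse(" hello ").language.isPresent());
        parser.setLanguageDetectionEnabled(false);
        System.out.println(parser.parse(" hello ").language.isPresent());
    }
}
```

Callers that need the language would leave the default alone; bulk indexing jobs like mine would turn the flag off and skip the cost.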
|Transition||Time In Source Status||Execution Times||Last Executer||Last Execution Date|
|Open to Resolved||8h 14m||1||Jukka Zitting||13/Nov/09 03:20|
|Resolved to Closed||31d 18h 56m||1||Jukka Zitting||14/Dec/09 22:16|
|Field||Original Value||New Value|
|Status||Resolved [ 5 ]||Closed [ 6 ]|
|Status||Open [ 1 ]||Resolved [ 5 ]|
|Assignee||Jukka Zitting [ jukkaz ]|
|Fix Version/s||0.5 [ 12314095 ]|
|Resolution||Fixed [ 1 ]|