Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
2.2.1
None
None
Description
jmssiera filed the enhancement request TIKA-3325, where he originally asked for the write limit to be specified as a number of bytes. I see that issue was marked as Resolved, but writeLimit counts characters, not bytes.
I have a use case where the consumer side (an indexer) has a control for the maximum number of bytes to index. When I use the writeLimit header with Tika to extract text from a document that mixes ASCII and multi-byte characters, I can't get back exactly, for example, 6MB worth of text, because I don't know a priori which characters will be in the file.
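To illustrate the mismatch, here is a small sketch that compares character counts with encoded byte counts. It assumes the indexer measures UTF-8 bytes; the class and method names are made up for this example and are not part of Tika.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika.
class CharsVsBytes {
    // Number of bytes the text occupies once encoded as UTF-8 (assumed encoding).
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        // "héllo 世界" written with escapes so the source file encoding does not matter.
        String mixed = "h\u00e9llo \u4e16\u754c";
        // prints "5 chars, 5 bytes"
        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " bytes");
        // prints "8 chars, 13 bytes"
        System.out.println(mixed.length() + " chars, " + utf8Length(mixed) + " bytes");
    }
}
```

The same character limit can therefore produce very different byte sizes depending on the content.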
My ask here is for a new control, maybe "writeLimitBytes", where the returned text is cut off at the last complete character that still fits within the byte limit. The returned text would therefore be <= writeLimitBytes in size, but would usually be close to that value.
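A rough sketch of the behavior I have in mind, assuming UTF-8 output. The helper name truncateToUtf8Bytes is hypothetical, and this is only an illustration of the semantics, not how Tika implements writeLimit:

```java
// Hypothetical helper illustrating the requested "writeLimitBytes" semantics.
class Utf8Truncate {
    // Returns the longest prefix of text whose UTF-8 encoding fits in maxBytes,
    // never splitting a code point (so surrogate pairs stay intact).
    static String truncateToUtf8Bytes(String text, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            // UTF-8 encoded size of this code point: 1 to 4 bytes.
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break; // the next character would exceed the limit
            }
            bytes += size;
            i += Character.charCount(cp); // advance 1 or 2 Java chars
        }
        return text.substring(0, i);
    }
}
```

For example, truncateToUtf8Bytes("a\u4e16b", 3) would return "a", because the three-byte character after it does not fit, while a limit of 4 would return "a\u4e16".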