Tika / TIKA-3643

writeLimit for bytes in addition to characters

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: core
    • Labels: None

    Description

      jmssiera wrote up the enhancement request TIKA-3325, in which he originally asked for the write limit to be specified as a number of bytes. That issue was marked as Resolved, but writeLimit is a number of characters rather than a number of bytes.

      I have a use case where the consumer side (an indexer) has a control for the maximum number of bytes to index. When I use the writeLimit header with Tika to extract text from a document that mixes ASCII and multi-byte characters, I can't get back a specific byte size, for example exactly 6 MB of text, because I don't know a priori which characters the file contains.
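
      To illustrate the mismatch, here is a minimal sketch that compares the character count of a string with its encoded size. It assumes the indexer measures UTF-8 bytes, which is not stated above; the class and method names are hypothetical and are not part of Tika.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: compares char count with UTF-8 size.
final class CharsVersusBytes {

    /** Returns the size of the text once encoded as UTF-8 (assumed encoding). */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "index";
        // "naive" with U+00EF, a space, then three CJK characters.
        String mixed = "na\u00efve \u65e5\u672c\u8a9e";

        // A character-based limit treats both strings by their length() ...
        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " bytes");
        // ... but the mixed string occupies far more bytes per character.
        System.out.println(mixed.length() + " chars, " + utf8Length(mixed) + " bytes");
    }
}
```

      Because the ratio of bytes to characters depends on the content, a character limit alone cannot be converted into a byte limit ahead of time.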

      My ask is for a new control, perhaps "writeLimitBytes", under which the returned text is cut at the last complete character that fits within the limit. The returned text would therefore be <= writeLimitBytes in size, but close to that value.
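
      The requested behavior could be sketched roughly as follows. This is only an illustration on an in-memory string, assuming UTF-8 output; it is not how Tika implements writeLimit, and the class and method names are hypothetical.

```java
// Hypothetical sketch, not Tika code: cut text to a UTF-8 byte budget
// without splitting a code point.
final class Utf8ByteLimit {

    /** Number of bytes UTF-8 uses for one Unicode code point. */
    static int utf8Width(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /**
     * Returns the longest prefix of text whose UTF-8 encoding fits in
     * maxBytes, ending on a whole code point.
     */
    static String truncateToUtf8Bytes(String text, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int width = utf8Width(cp);
            if (bytes + width > maxBytes) {
                break; // the next character would exceed the budget
            }
            bytes += width;
            i += Character.charCount(cp); // 2 chars for surrogate pairs
        }
        return text.substring(0, i);
    }

    public static void main(String[] args) {
        String text = "a\u65e5b"; // 1-byte, 3-byte, 1-byte characters
        System.out.println(truncateToUtf8Bytes(text, 3)); // only "a" fits
        System.out.println(truncateToUtf8Bytes(text, 4)); // "a" plus the CJK char
    }
}
```

      With this approach the result never exceeds the byte budget and falls short of it by at most three bytes, since a single UTF-8 character is at most four bytes long. A streaming implementation would need to apply the same accounting as characters are written rather than to a complete string.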

          People

            Assignee: Unassigned
            Reporter: Josh Burchard (jmbox80)