Tika / TIKA-388

Don't trust streams that claim mark support

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7
    • Component/s: parser
    • Labels:
      None

      Description

      As seen on tika-dev@ and in JCR-2576, there are some InputStream implementations that claim to support the mark feature, but lose the mark as soon as the end of stream has been reached. There's no way for a client to detect such behaviour, so it's probably best for Tika to always use BufferedInputStream to wrap incoming streams when mark support is needed. This may cause one layer of extra buffering, but avoids problems with such broken streams.
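The failure mode described above can be reproduced with a small sketch. `ForgetfulStream` and `peek` are hypothetical names made up for this illustration; they are not taken from Tika or from JCR-2576, and the real broken streams may lose the mark in a different way. The sketch assumes a stream that reports `markSupported() == true` but refuses `reset()` once end of stream has been seen, and shows that a `BufferedInputStream` wrapper keeps the mark in its own buffer instead.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class MarkDemo {

    /** Hypothetical broken stream: claims mark support, but forgets the mark at end of stream. */
    static final class ForgetfulStream extends FilterInputStream {
        private boolean markLost;

        ForgetfulStream(byte[] data) {
            super(new ByteArrayInputStream(data)); // markSupported() returns true
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b == -1) markLost = true; // simulate the bug: EOF drops the mark
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n == -1) markLost = true; // simulate the bug: EOF drops the mark
            return n;
        }

        @Override
        public synchronized void mark(int readLimit) {
            markLost = false;
            super.mark(readLimit);
        }

        @Override
        public synchronized void reset() throws IOException {
            if (markLost) throw new IOException("mark lost at end of stream");
            super.reset();
        }
    }

    /** Reads up to limit bytes for type detection, then rewinds the stream. */
    static byte[] peek(InputStream in, int limit) throws IOException {
        in.mark(limit);
        try {
            byte[] buf = new byte[limit];
            int total = 0;
            int n;
            while (total < limit && (n = in.read(buf, total, limit - total)) != -1) {
                total += n;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset(); // throws on the broken stream if EOF was reached
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3}; // shorter than the peek limit, so EOF is hit

        try {
            peek(new ForgetfulStream(data), 8);
            System.out.println("raw stream: reset worked");
        } catch (IOException e) {
            System.out.println("raw stream: " + e.getMessage());
        }

        // The wrapper serves mark/reset from its own buffer.
        InputStream safe = new BufferedInputStream(new ForgetfulStream(data));
        System.out.println("wrapped stream: peeked " + peek(safe, 8).length + " bytes");
    }
}
```

The wrapper never calls `reset()` on the underlying stream, which is why the broken behaviour stops mattering once the stream is wrapped.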

        Activity

        Jukka Zitting created issue -
        Chris A. Mattmann added a comment -

        +1! I've run into this issue myself, and the overhead IMHO is worth it for the ease of use...

        Jukka Zitting added a comment -

        As of revision 925217 the AutoDetectParser wraps all incoming streams in a BufferedInputStream regardless of whether they claim mark support or not. Resolving as fixed.
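The change in policy can be sketched as follows. This is not the code from revision 925217; `WrapPolicy`, `trusting`, and `always` are hypothetical names used only to contrast the two approaches, on the assumption that the earlier behaviour wrapped a stream only when it did not claim mark support.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

final class WrapPolicy {

    /** Assumed earlier approach: believe the stream's own markSupported() claim. */
    static InputStream trusting(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Approach described in the comment: wrap unconditionally. */
    static InputStream always(InputStream in) {
        return new BufferedInputStream(in);
    }
}
```

With `always`, a stream that already buffers gets a second buffer layer, which is the extra overhead the issue description accepts.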

        Jukka Zitting made changes -
        Field          Original Value   New Value
        Status         Open [ 1 ]       Resolved [ 5 ]
        Assignee                        Jukka Zitting [ jukkaz ]
        Fix Version/s                   0.7 [ 12314528 ]
        Resolution                      Fixed [ 1 ]
        Daan de Wit added a comment -

        I did not test it, and it might be a premature optimization, but wouldn't it be better to check if the stream is already a BufferedInputStream?
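The suggested check could look like the sketch below. `WrapOnce.buffered` is a hypothetical helper, not part of Tika, and whether the saved buffer layer is worth the extra branch is untested here, as the comment itself notes.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

final class WrapOnce {

    /** Returns the stream itself if it is already buffered, otherwise wraps it. */
    static InputStream buffered(InputStream in) {
        if (in instanceof BufferedInputStream) {
            return in; // already buffered: skip the second layer
        }
        return new BufferedInputStream(in);
    }
}
```

One caveat: `instanceof` also accepts subclasses of `BufferedInputStream`, which could in principle override `mark` and `reset`. A stricter variant would compare `in.getClass()` with `BufferedInputStream.class`.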

        Transition         Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> Resolved   1d 4h 21m               1                 Jukka Zitting   19/Mar/10 13:52

          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Jukka Zitting
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:
