TIKA-153

Allow passing of files or memory buffers to parsers

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels:
      None

      Description

      Some of our parsers need to move back and forth within a source document, so they need either a file or (for smaller documents) an in-memory buffer that contains the full document. Currently we use temporary files for such cases, which sometimes means making an extra copy of a file before it gets parsed. We should come up with some way for clients to pass in a file or a memory buffer if one is available.

        Activity

        Babak Farhang added a comment -

        I suggest java.nio.FileChannel be used as the random access abstraction. This would allow implementations such as Skwish [ http://skwish.sourceforge.net/ ] to be used as the source of a document.

        Ignoring certain of its niche capabilities (such as its map method), FileChannel, it turns out, allows one to slice and dice and to construct filters (facades) in the same way Java uses FilterInputStream and FilterOutputStream. As this idea is fleshed out a bit in Skwish [see http://skwish.sourceforge.net/doc/com/faunos/util/io/package-summary.html ], I thought I'd share.

        -Babak
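
        To make the random access idea concrete, here is a minimal sketch of positional reads through java.nio.FileChannel. The class name ChannelSlices, the helper readSlice and the sample data are hypothetical and appear nowhere in Tika or Skwish; the sketch only illustrates jumping around a document without re-reading it from the start.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class ChannelSlices {
    private ChannelSlices() {}

    /**
     * Reads up to length bytes starting at offset. Uses the positional
     * read(ByteBuffer, long) overload, so the channel's own position
     * is left untouched.
     */
    static byte[] readSlice(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        long position = offset;
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, position);
            if (n < 0) {
                break; // reached end of file before filling the buffer
            }
            position += n;
        }
        buffer.flip();
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("slice-demo", ".bin");
        try {
            Files.write(tmp, "header|body|trailer".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                // Read the trailer first, then go back to the header.
                System.out.println(new String(readSlice(ch, 12, 7), StandardCharsets.US_ASCII));
                System.out.println(new String(readSlice(ch, 0, 6), StandardCharsets.US_ASCII));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

        A parser written against such a helper could visit sections of a document in any order, which is the access pattern the issue description asks for.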

        Jukka Zitting added a comment -

        I have an idea on how to implement this...

        The current Tika APIs are already pretty good, and I'd hate to complicate the clean Parser interface with extra methods for different kinds of inputs. Instead I'm thinking of adding a TikaInputStream utility class that extends InputStream with methods that allow accessing the input document as a File.

        The TikaInputStream class would have at least the following constructors:

        public TikaInputStream(InputStream stream) { ... }
        public TikaInputStream(File file) { ... }

        And would in addition to the standard InputStream methods provide at least the following:

        public File getFile() { ... }

        If the TikaInputStream instance was created from a normal InputStream, then the getFile() method would automatically copy the stream into a temporary file that'll get removed when the stream is closed.

        The Tika facade would always pass TikaInputStreams to the underlying parsers, and we'd recommend that downstream projects also use this class when directly accessing the Parser API, but doing so would not be necessary. Instead, the TikaInputStream class would have a static method like the following that our parsers could use to access the extra functionality:

        public static TikaInputStream getTikaInputStream(InputStream stream) {
            if (stream instanceof TikaInputStream) {
                return (TikaInputStream) stream;
            } else {
                return new TikaInputStream(stream);
            }
        }
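
        The proposal above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the class that was eventually committed: the name SpoolingInputStream, the static wrap method and the "spool-" temp file prefix are hypothetical stand-ins, and details such as mark/reset support are ignored.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in for the proposed class; not Tika's implementation.
class SpoolingInputStream extends FilterInputStream {
    private File file;          // backing file, if given or already spooled
    private boolean temporary;  // true if this class created the file

    SpoolingInputStream(InputStream stream) {
        super(stream);
    }

    SpoolingInputStream(File file) throws IOException {
        super(new FileInputStream(file));
        this.file = file;
    }

    /** Returns the stream unchanged if it is already wrapped, else wraps it. */
    static SpoolingInputStream wrap(InputStream stream) {
        if (stream instanceof SpoolingInputStream) {
            return (SpoolingInputStream) stream;
        }
        return new SpoolingInputStream(stream);
    }

    /**
     * Returns a file with the document content. For a plain stream, the
     * remaining bytes are copied to a temporary file on the first call and
     * subsequent reads are served from that file.
     */
    File getFile() throws IOException {
        if (file == null) {
            File tmp = File.createTempFile("spool-", ".tmp");
            try {
                Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                tmp.delete(); // do not leave a partial copy behind
                throw e;
            }
            in.close();
            in = new FileInputStream(tmp);
            file = tmp;
            temporary = true;
        }
        return file;
    }

    /** Closes the stream and removes the temporary file, if one was created. */
    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            if (temporary && file != null) {
                file.delete();
                temporary = false;
            }
        }
    }
}
```

        With this shape, a parser that needs a File calls wrap(stream).getFile(), while a client that already has a File avoids the extra copy by using the File constructor. A file supplied by the client is never deleted on close; only the temporary copy is.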

        Chris A. Mattmann added a comment -

        The current Tika APIs are already pretty good, and I'd hate to complicate the clean Parser interface with extra methods for different kinds of inputs. Instead I'm thinking of adding a TikaInputStream utility class that extends InputStream with methods that allow accessing the input document as a File.

        The TikaInputStream class would have at least the following constructors:

        public TikaInputStream(InputStream stream) { ... }
        public TikaInputStream(File file) { ... }

        +100!! I could have used this for TIKA-400, since NetCDF expects (and only provides a means) to deal with input as a File. This happens a lot where streaming doesn't make much sense, as with data-intensive files that have a huge memory footprint...

        Cheers,
        Chris

        Jukka Zitting added a comment -

        The TikaInputStream class is now in place and being used by many parsers. Resolving as fixed.


          People

          • Assignee: Jukka Zitting
          • Reporter: Jukka Zitting
          • Votes: 2
          • Watchers: 1