Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      As discussed on the mailing list:
      http://mail-archives.apache.org/mod_mbox/tika-dev/201009.mbox/%3Calpine.DEB.1.10.1009010000250.5637@urchin.earth.li%3E

      This service will operate in a push mode, using streaming where possible (not all container formats will support that). Users can control recursion, and will be given the chance to process each embeded file in turn. It's up to them if they process a file or skip it.

      It will work similar to the current Parser code, with each container having its own extractor in the parsers package, and the interface defined in the core package. There will be an Auto extractor in the core package, configured with a list of parser extractors just like AutoDetectParser does.

        Activity

        Hide
        Nick Burch added a comment -

        Initial work committed in r995157.

        This commit shows a rough cut of the interfaces and key classes, so everyone can review them now there's some concrete code to look at. One change required though is to not pass an open inputstream in for every firing of the handler. Instead, the handler should be given a chance to decline the embeded resource first, before the extractor potentially does lots of work to generate/unpack the resource

        Also in the commit is a basic POIFS extractor. It only does nested embeded office files, no images, and only for word+excel. However, it should give an idea of how things may work

        The config for the auto-detector isn't wired in yet, that'll be done once we have further extractors.

        Show
        Nick Burch added a comment - Initial work committed in r995157. This commit shows a rough cut of the interfaces and key classes, so everyone can review them now there's some concrete code to look at. One change required though is to not pass an open inputstream in for every firing of the handler. Instead, the handler should be given a chance to decline the embeded resource first, before the extractor potentially does lots of work to generate/unpack the resource Also in the commit is a basic POIFS extractor. It only does nested embeded office files, no images, and only for word+excel. However, it should give an idea of how things may work The config for the auto-detector isn't wired in yet, that'll be done once we have further extractors.
        Hide
        Nick Burch added a comment -

        Also on the config side of things, currently there's just a straight list of ContainerExtractors, and each one does its own mini-detection to decide if it's suitable. We may wish to change this later to work with MediaTypes as the Parsers do, however there are issues around how much extra work might be involved in the detection step - we don't want to duplicate the container parsing by forcing people to run it through a ContainerAwareDetector and then process the container again in a ContainerExtractor. Probably one to hold off deciding on until we have several extractors in place, when we'll have a better idea of how they might work and fit together?

        Show
        Nick Burch added a comment - Also on the config side of things, currently there's just a straight list of ContainerExtractors, and each one does its own mini-detection to decide if it's suitable. We may wish to change this later to work with MediaTypes as the Parsers do, however there are issues around how much extra work might be involved in the detection step - we don't want to duplicate the container parsing by forcing people to run it through a ContainerAwareDetector and then process the container again in a ContainerExtractor. Probably one to hold off deciding on until we have several extractors in place, when we'll have a better idea of how they might work and fit together?
        Hide
        Jukka Zitting added a comment -

        I'm not too excited about the idea of introducing a completely new mechanism in parallel with the Parser API we already have. AFAIUI the Parser API already supports all the functionality you're looking for.

        See the attached patch that copies the embedded document handling code from the POIFSContainerExtractor class to our existing OfficeParser implementation, and adds a generic ParserContainerExtractor class that implements the ContainerExtractor interface based on our existing Parser and Detector APIs.

        This solution passes all the current test cases (see the modifications I made to POIFSContainerExtractorTest), implements the embedded document support asked for in TIKA-489, and as a bonus gives you ContainerExtractor support for all the package formats (zip, tar, cpio, etc.) that we already have parsers for.

        Show
        Jukka Zitting added a comment - I'm not too excited about the idea of introducing a completely new mechanism in parallel with the Parser API we already have. AFAIUI the Parser API already supports all the functionality you're looking for. See the attached patch that copies the embedded document handling code from the POIFSContainerExtractor class to our existing OfficeParser implementation, and adds a generic ParserContainerExtractor class that implements the ContainerExtractor interface based on our existing Parser and Detector APIs. This solution passes all the current test cases (see the modifications I made to POIFSContainerExtractorTest), implements the embedded document support asked for in TIKA-489 , and as a bonus gives you ContainerExtractor support for all the package formats (zip, tar, cpio, etc.) that we already have parsers for.
        Hide
        Chris A. Mattmann added a comment -

        Wow, Jukka! +1!

        Show
        Chris A. Mattmann added a comment - Wow, Jukka! +1!
        Hide
        Nick Burch added a comment -

        Jukka - your patch looks good, just thought I'd check a few things

        Are you thinking that the ContainerExtractor interface and ContainerEmbededResourceHandler will remain? And that this would be the way for people who's main interest is getting at the embeded documents to work?

        In terms of the parser / ParserContainerExtractor, are you thinking that we should try to make the container related Parsers call the nested parser from the ParseContext? The only issue with that I can see is that it isn't then possible for for users to say "I don't want that file, don't bother doing lots of work to extract it". Admittedly the ContainerExtractor doesn't support that either, but I had an idea for how to do that, and I can't easily see how that'd fit in with parsers

        I'll apply your patch shortly, then carry on with my work on making the office format embeded resources available, but using your new pattern

        Show
        Nick Burch added a comment - Jukka - your patch looks good, just thought I'd check a few things Are you thinking that the ContainerExtractor interface and ContainerEmbededResourceHandler will remain? And that this would be the way for people who's main interest is getting at the embeded documents to work? In terms of the parser / ParserContainerExtractor, are you thinking that we should try to make the container related Parsers call the nested parser from the ParseContext? The only issue with that I can see is that it isn't then possible for for users to say "I don't want that file, don't bother doing lots of work to extract it". Admittedly the ContainerExtractor doesn't support that either, but I had an idea for how to do that, and I can't easily see how that'd fit in with parsers I'll apply your patch shortly, then carry on with my work on making the office format embeded resources available, but using your new pattern
        Hide
        Jukka Zitting added a comment -

        Yes, I think the ContainerExtractor and ContainerEmbeddedResourceHandler (rename to EmbeddedResourceHandler?) interfaces should remain as they provide a much more convenient way to achieve this use case.

        > [...] we should try to make the container related Parsers call the nested parser from the ParseContext?

        Yes, that's the way the PackageParser was designed and how I'd like to see also other container formats handled.

        > "I don't want that file, don't bother doing lots of work to extract it"

        We could implement that with an optional strategy object that gets passed through the parse context along with the component parser.

        Show
        Jukka Zitting added a comment - Yes, I think the ContainerExtractor and ContainerEmbeddedResourceHandler (rename to EmbeddedResourceHandler?) interfaces should remain as they provide a much more convenient way to achieve this use case. > [...] we should try to make the container related Parsers call the nested parser from the ParseContext? Yes, that's the way the PackageParser was designed and how I'd like to see also other container formats handled. > "I don't want that file, don't bother doing lots of work to extract it" We could implement that with an optional strategy object that gets passed through the parse context along with the component parser.
        Hide
        Nick Burch added a comment -

        Interface rename makes sense to me, I've done that

        Not sure about the best way to do the strategy object, but hopefully someone else will do!

        Recursion was broken in ParserContainerExtractor - I've re-enabled this, but I'm not sure if it's the best way to handle it or not? See r995438 for how I've made it work for now.

        Next up, images in word!

        Show
        Nick Burch added a comment - Interface rename makes sense to me, I've done that Not sure about the best way to do the strategy object, but hopefully someone else will do! Recursion was broken in ParserContainerExtractor - I've re-enabled this, but I'm not sure if it's the best way to handle it or not? See r995438 for how I've made it work for now. Next up, images in word!
        Hide
        Nick Burch added a comment -

        Support is now in place for .doc, .docx, .xls and .xlsx

        There's a couple more office formats to support, and unit tests are needed for the general container formats which are supported via the PackageExtractor

        Show
        Nick Burch added a comment - Support is now in place for .doc, .docx, .xls and .xlsx There's a couple more office formats to support, and unit tests are needed for the general container formats which are supported via the PackageExtractor
        Hide
        Jukka Zitting added a comment -

        In revision 995946 I started drafting a possible solution for the selective extraction feature.

        Show
        Jukka Zitting added a comment - In revision 995946 I started drafting a possible solution for the selective extraction feature.
        Hide
        Nick Burch added a comment -

        I think we're almost there. All office formats except .ppt are now supported, and we now have unit tests which show that the PackageParser based extraction was already worked as required (yey!)

        Show
        Nick Burch added a comment - I think we're almost there. All office formats except .ppt are now supported, and we now have unit tests which show that the PackageParser based extraction was already worked as required (yey!)
        Hide
        Jukka Zitting added a comment -

        Good stuff! I guess we can resolve this as fixed.

        Show
        Jukka Zitting added a comment - Good stuff! I guess we can resolve this as fixed.
        Hide
        Jukka Zitting added a comment -

        Resolving as fixed as discussed above.

        Show
        Jukka Zitting added a comment - Resolving as fixed as discussed above.

          People

          • Assignee:
            Nick Burch
            Reporter:
            Nick Burch
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development