Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: packaging
    • Labels:
      None

      Description

      To make it easy to deploy Tika, and especially the Tika parsers, it would be convenient to have an almost complete bundle consisting of Tika Core, Tika Parsers, and the most important parser dependencies. Any remaining dependencies not included in the bundle should be declared as optional imports so that bundle resolution does not fail if one, several, or all of those imports are missing.
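As a rough illustration of what "optional import" means at the manifest level, the sketch below builds an `Import-Package` header in which every package carries the OSGi `resolution:=optional` directive. The class name, the `Bundle-SymbolicName` value, and the choice of packages are assumptions made for this example; the actual bundle manifest is generated by the build and is not reproduced here.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.Manifest;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Tika: shows the shape of optional imports.
final class OptionalImports {

    // Builds an Import-Package value in which every package is marked optional,
    // so a missing exporter does not prevent the bundle from resolving.
    static String optionalImportHeader(List<String> packages) {
        return packages.stream()
                .map(p -> p + ";resolution:=optional")
                .collect(Collectors.joining(","));
    }

    // Wraps the header in a minimal manifest using the JDK's Manifest class.
    static Manifest manifestWithOptionalImports(List<String> packages) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be set, otherwise Manifest.write emits no main attributes.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.putValue("Bundle-ManifestVersion", "2");
        main.putValue("Bundle-SymbolicName", "example.tika.bundle"); // assumed name
        main.putValue("Import-Package", optionalImportHeader(packages));
        return manifest;
    }

    public static void main(String[] args) throws IOException {
        // Package names chosen for illustration only.
        Manifest manifest = manifestWithOptionalImports(
                Arrays.asList("org.apache.log4j", "org.apache.commons.logging"));
        manifest.write(System.out);
    }
}
```

In a Maven build the same directive would normally be written in the bundle plugin configuration rather than assembled by hand; the code only makes the resulting header visible.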

      1. osgi-logging.patch
        14 kB
        Jukka Zitting
      2. TIKA-340-2.patch
        1 kB
        Felix Meschberger
      3. TIKA-340.patch
        17 kB
        Felix Meschberger

        Activity

        Felix Meschberger added a comment -

        Patch providing a tika-full project which creates an almost complete bundle of the dependencies.

        The dependencies included are derived from the included dependencies of the Tika App project but omit some embeddings, which should be shared in the framework (most importantly the XML oriented APIs like W3C DOM and SAX).

        Felix Meschberger added a comment -

        Note: On my machine, I had to increase the Java heap size to prevent the build from aborting with an OutOfMemoryError. I set the maximum heap size to 512MB and the build ultimately used 400MB.

        Jukka Zitting added a comment -

        Excellent, thanks! I committed the patch in revision 885807.

        Would you mind if we rather called the package tika-bundle or tika-osgi instead of tika-full? I think that would make it easier to distinguish between this package and the tika-app jar that's also a "full" package.

        Some further improvements would be to automatically wire all logging to the OSGi log service and to make it possible for Tika to automatically leverage Parser implementations from other bundles.

        Jukka Zitting added a comment -

        Is there a particular reason why you configured some of the dependencies to be inlined and some to be just included as embedded jar files? Unless there's a good reason not to do so, I'd go for a fully inlined package to keep the jar structure simpler.

        Felix Meschberger added a comment -

        Thanks for committing.

        > Would you mind if we rather called the package tika-bundle or tika-osgi instead of tika-full?

        Not at all ...

        > Some further improvements would be to automatically wire all logging to the OSGi log service

        Well, the bundle as it currently stands imports Log4J and Commons Logging. Both APIs are generally available from some logging support bundle, for example the Sling Log Service implementation or PAX logging. I am not sure whether it is worth trying to converge the logging approaches into the OSGi LogService in the Tika bundle itself.

        > some of the dependencies to be inlined

        Generally, I have come to prefer embedding JAR files. This makes it a lot easier to inspect the JAR files and AFAICT has no drawbacks for usability in an OSGi environment. I have inlined one JAR file because I had to exclude an incomplete org.w3c.dom package, which would have caused resolution issues.

        OTOH, if you deem the JAR file useful in general (that is, non-OSGi) environments, it would probably make perfect sense to inline the embedded libraries. In that case, though, the name of the library should probably not contain the words "osgi" or "bundle". WDYT?

        Felix Meschberger added a comment -

        Here is a patch against trunk inlining all previously embedded libraries.

        Interestingly, the jar file now grows from 20MB to 25MB ... (well, my gut feeling is that both sizes are horrendous given the task at hand, but that is probably another story)

        Jukka Zitting added a comment -

        Re: logging; AFAIUI using the OSGi log service directly makes it possible for the log backend to sort out log messages based on the bundle from which they originated. That doesn't seem possible if we just depend on a support bundle that exposes the commons-logging API.

        Re: size; Yep, that's another story. See http://jukkaz.wordpress.com/2009/10/16/putting-poi-on-a-diet/ for the gory details.

        Re: inlining; The double compression of embedded jars explains the size difference you're seeing. That double compression seems a bit troublesome to me given the large number of non-class resources (PDF font mapping data, OOXML schemas, etc.) we have there. Ideally the classloader should be able to load such resources on demand without having to uncompress the entire archive. But I guess OSGi runtimes may already avoid that problem in similar ways as servlet containers do with embedded jars in WEB-INF/lib.
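The size effect of double compression can be explored with a small experiment. The sketch below uses only `java.util.zip` and purely synthetic data (many tiny, similar entries under an invented `org/example/lib/` path); it compares an archive that inlines the entries with one that stores them as a single embedded archive. It is an assumption-laden toy, not a reproduction of the Tika build, and real libraries with large binary resources may behave differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical experiment class; names and payloads are invented.
final class JarSizeComparison {

    // Writes `count` small synthetic entries under the given path prefix.
    static void writeEntries(ZipOutputStream zip, String prefix, int count) throws IOException {
        for (int i = 0; i < count; i++) {
            zip.putNextEntry(new ZipEntry(prefix + "Resource" + i + ".txt"));
            zip.write(("synthetic payload " + i).getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
    }

    // Inlined layout: every library entry is stored directly in the outer archive.
    static int inlinedSize(int count) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            writeEntries(zip, "org/example/lib/", count);
        }
        return out.size();
    }

    // Embedded layout: the library is its own archive, stored as one compressed
    // entry, so its entry headers and names are compressed a second time.
    static int embeddedSize(int count) throws IOException {
        ByteArrayOutputStream inner = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(inner)) {
            writeEntries(zip, "org/example/lib/", count);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("lib/example.jar"));
            zip.write(inner.toByteArray());
            zip.closeEntry();
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        int count = 1000;
        System.out.println("inlined bytes:  " + inlinedSize(count));
        System.out.println("embedded bytes: " + embeddedSize(count));
    }
}
```

With many small entries, the per-entry headers and file names of the inner archive are stored uncompressed, which is what the second compression pass can shrink.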

        Felix Meschberger added a comment -

        Re: logging:

        Yes, that might be true. Still, there is some API to implement, and if you implement logging based on LogService you lose all the log categories previously used, because the OSGi LogService does not have such a concept.

        Also, if you want to reuse the library in non-OSGi environments, using the LogService will not work, and it creates an OSGi dependency.
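To make the category problem concrete, here is a minimal sketch of one possible workaround: folding the category into the message text. `SimpleLogService` is a stand-in interface declared only for this example so that it runs without an OSGi framework; it mirrors the level-plus-message shape of the OSGi `LogService.log(int, String)` call but is not the real API, and `CategoryLogger` is likewise hypothetical.

```java
// Stand-in for the OSGi LogService (assumption: only the level/message shape is
// mirrored here; the real interface lives in org.osgi.service.log).
interface SimpleLogService {
    int LOG_ERROR = 1;
    int LOG_WARNING = 2;
    int LOG_INFO = 3;
    int LOG_DEBUG = 4;

    void log(int level, String message);
}

// Hypothetical adapter: keeps a per-logger category by prefixing it to the
// message, since the target service has no category concept of its own.
final class CategoryLogger {
    private final SimpleLogService service;
    private final String category;

    CategoryLogger(SimpleLogService service, String category) {
        this.service = service;
        this.category = category;
    }

    void info(String message) {
        service.log(SimpleLogService.LOG_INFO, format(message));
    }

    void error(String message) {
        service.log(SimpleLogService.LOG_ERROR, format(message));
    }

    private String format(String message) {
        return "[" + category + "] " + message;
    }
}
```

A backend could then filter on the prefix, although this is weaker than real categories because the information is carried only as text.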

        Re: size:

        I knew there is some activity in this area. Thanks for the pointer.

        Re: inlining

        Why is the double compression troublesome?

        What the OSGi framework actually does is unpack the bundle jar. Thus embedded libraries are unpacked as jar files; that is, unpacking is not recursive. The regular classes are then loaded as usual, while the embedded JAR files are loaded as JAR URLs.

        Thus in the end, it might even be better to embed the libraries than to inline them.

        But this decision depends on whether you want to use the result of the build in a non-OSGi environment or not. If you only target OSGi frameworks, then I would go for embedded libraries. Otherwise I would go for inlined libraries at the expense of 20% of the size of the resulting JAR file.
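The point about embedded JAR files being read through JAR URLs can be illustrated with the JDK's own `jar:` URL handler. The sketch below writes a small JAR to a temporary file and reads a single entry from it on demand, without extracting the archive. How a particular OSGi framework loads embedded libraries is implementation-specific; the class name, entry name, and temporary-file prefix here are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

// Hypothetical demo class, not framework code.
final class JarUrlDemo {

    // Creates a temporary JAR file containing one text entry.
    static Path writeJar(String entryName, String content) throws IOException {
        Path jar = Files.createTempFile("embedded-lib", ".jar");
        jar.toFile().deleteOnExit();
        try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar))) {
            out.putNextEntry(new JarEntry(entryName));
            out.write(content.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return jar;
    }

    // Reads one entry through a jar: URL of the form jar:<file-url>!/<entry>.
    static String readViaJarUrl(Path jar, String entryName) throws IOException {
        URL url = URI.create("jar:" + jar.toUri() + "!/" + entryName).toURL();
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path jar = writeJar("data/mapping.txt", "example resource");
        System.out.println(readViaJarUrl(jar, "data/mapping.txt"));
    }
}
```

Note that the standard handler addresses an entry inside a JAR file on disk, which matches the description above of embedded libraries ending up as JAR files once the bundle has been unpacked.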

        Ken Krugler added a comment -

        Funny, I was just looking at the size of the Hadoop job jar I generate for Bixo. It was suddenly 26MB, and pushing it up to EC2 was taking a long time.

        As Jukka's blog post says, it's all about the ooxml-schemas-1.0.jar file - almost 14MB. And the 2.5MB xmlbeans-2.3.0.jar that this schema jar depends on. Excluding POI would cut about 18MB from my 26MB, which I might need to do (as an option for a smaller build).

        Andrzej Bialecki added a comment -

        The vast majority of classes in these JARs are never used. Perhaps one of the steps in preparing this bundle could be to pass it through a code shrinker, such as ProGuard (http://proguard.sourceforge.net), not to obfuscate it, but simply to remove unused cruft.

        Felix Meschberger added a comment -

        Interesting idea; there is even a Maven plugin at http://pyx4me.com/pyx4me-maven-plugins/proguard-maven-plugin/

        Jukka Zitting added a comment -

        OK, you have a better view of the best practices for logging with OSGi. See the attached osgi-logging.patch for a quick and dirty experiment of what we could do if we did want to directly use the OSGi log service.

        If the OSGi runtimes already unpack the bundle jar, then I have no problem with the embedded jars. Could we even avoid inlining the tika-core and tika-parsers jars, or is that something that's needed for the Export-Package rules to work? If the latter, can we exclude org.apache.tika.parser subpackages from being exported so that only tika-core gets inlined?

        Let's take the size issue to tika-dev@.

        Jukka Zitting added a comment -

        I've renamed the bundle to tika-bundle. With that I think we have all the basics in place so I'm resolving this issue as Fixed. Thanks for the contribution!

        Let's use follow-up issues to track the further improvements/features that have already been mentioned.

        Felix Meschberger added a comment - - edited

        > Could we even avoid inlining the tika-core and tika-parsers jars

        Yes, I will create a new issue with patch.


          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Felix Meschberger