Jackrabbit Content Repository
JCR-2885

Move tika-parsers dependency to deployment packages

    Details

      Description

      As discussed on the mailing list, it would be better if the tika-parsers dependency (and all the parser libraries it pulls in transitively) were included in our deployment packages rather than directly in jackrabbit-core. This would make it easier for people to set up custom lightweight deployments with partial or no full text extraction functionality.
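
      A minimal sketch of what this could look like in a deployment package's POM, assuming the usual Maven coordinates for the two artifacts. The version properties (jackrabbit.version and tika.version) are hypothetical placeholders, not names taken from the actual build:

```xml
<dependencies>
  <!-- Core repository implementation. Under this proposal it would -->
  <!-- no longer pull in the parser libraries transitively. -->
  <dependency>
    <groupId>org.apache.jackrabbit</groupId>
    <artifactId>jackrabbit-core</artifactId>
    <!-- hypothetical property name -->
    <version>${jackrabbit.version}</version>
  </dependency>
  <!-- Parser libraries for full text extraction. A lightweight -->
  <!-- deployment could leave this dependency out entirely. -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <!-- hypothetical property name -->
    <version>${tika.version}</version>
  </dependency>
</dependencies>
```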

      To do this we'll first need to wait for Tika 0.9, as we currently have a custom PDFParser class in jackrabbit-core as a workaround for a problem in Tika 0.8.

      At the same time we should do a more thorough review of the transitive parser dependencies we include. At least the rome and bouncycastle libraries were flagged as potentially unnecessary.

        Activity

        Jukka Zitting added a comment -

        I moved the dependency from jackrabbit-core in revision 1076635.

        At the same time I went through the list of dependencies, and made the following exclusions:

        <exclusions>
          <!-- Exclude the NetCDF and the related commons-httpclient -->
          <!-- libraries since the related NetCDF and HDF file -->
          <!-- formats are not widely used beyond scientific data. -->
          <exclusion>
            <groupId>edu.ucar</groupId>
            <artifactId>netcdf</artifactId>
          </exclusion>
          <exclusion>
            <groupId>commons-httpclient</groupId>
            <artifactId>commons-httpclient</artifactId>
          </exclusion>
          <!-- Exclude the Apache MIME4J library as it's used for -->
          <!-- parsing raw email messages and mbox files, which are -->
          <!-- typically only needed by a file-based email system. -->
          <exclusion>
            <groupId>org.apache.james</groupId>
            <artifactId>apache-mime4j</artifactId>
          </exclusion>
          <!-- Exclude the Commons Compress library as we don't want -->
          <!-- to parse compressed archives like zips by default. -->
          <exclusion>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-compress</artifactId>
          </exclusion>
          <!-- Exclude the ASM library as it's only used for parsing -->
          <!-- Java class files, for which there's typically no need -->
          <!-- in a content repository. -->
          <exclusion>
            <groupId>asm</groupId>
            <artifactId>asm</artifactId>
          </exclusion>
          <!-- Exclude the extractor library for EXIF and other -->
          <!-- image metadata as we normally don't want to parse -->
          <!-- images for full text indexing. -->
          <exclusion>
            <groupId>com.drewnoakes</groupId>
            <artifactId>metadata-extractor</artifactId>
          </exclusion>
          <!-- Exclude the Rome library as we normally don't want to -->
          <!-- parse RSS and Atom feeds for full text indexing. -->
          <exclusion>
            <groupId>rome</groupId>
            <artifactId>rome</artifactId>
          </exclusion>
          <!-- Exclude the Boilerpipe library as we don't use the -->
          <!-- BoilerpipeContentHandler functionality from Tika. -->
          <exclusion>
            <groupId>de.l3s.boilerpipe</groupId>
            <artifactId>boilerpipe</artifactId>
          </exclusion>
        </exclusions>

        After these exclusions we'd still keep the following dependencies:

        PDF: pdfbox, fontbox, jempbox, bcmail, bcprov
        MS Office: poi, poi-ooxml, poi-ooxml-schemas, poi-scratchpad, xmlbeans
        HTML: tagsoup

        Basic formats like plain text and XML (plus rudimentary support for OpenOffice) are handled with the standard Java class library.
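
        A custom deployment that does need one of the excluded formats could presumably add the corresponding library back as a direct dependency. The following is only a sketch using Commons Compress as the example: it assumes the matching Tika parser works again once the library is on the classpath, and the version property (commons-compress.version) is a hypothetical placeholder that would have to match what the Tika release in use expects:

```xml
<!-- Re-enable parsing of compressed archives in a custom deployment -->
<!-- by declaring the excluded library directly. -->
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-compress</artifactId>
  <!-- hypothetical property name; pick the version Tika was built against -->
  <version>${commons-compress.version}</version>
</dependency>
```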

        Jukka Zitting added a comment -

        Resolving as fixed based on the above changes.


          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Jukka Zitting