Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3
    • Component/s: parser
    • Labels: None

      Description

      Apache POI has recently released the first betas of their support for Office XML file formats. We should use that in Tika.

      1. TIKA-152.patch (42 kB, Guillermo Arribas)
      2. testWORD.docx (10 kB, Guillermo Arribas)
      3. testEXCEL-formats.xlsx (8 kB, Guillermo Arribas)
      4. testEXCEL.xlsx (9 kB, Guillermo Arribas)
      5. testPPT.pptx (49 kB, Guillermo Arribas)

        Activity

        Jukka Zitting added a comment -

        I upgraded the POI dependency to 3.5-beta4.

        Note that if we want to use the new Office XML support in POI 3.5 we probably also need to add some of the extra XML dependencies. Any NOTICE and LICENSE changes related to POI 3.5 and potential other dependencies should be reviewed before our next release.

        There's a problem with a GPLv3 file being included in the HDGF part of POI that we use for text extraction from Visio diagrams. I filed a bug for that (see https://issues.apache.org/bugzilla/show_bug.cgi?id=46361) and I think we need to find some resolution to the issue before our next release.

        Guillermo Arribas added a comment -

        Parser with support for structured text extraction for OOXML formats.
        New dependency on artifactId "poi-ooxml" 3.5-beta4 required.
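
For readers unfamiliar with the formats: an OOXML file is a ZIP package of XML parts, which is why extra XML dependencies come up in this thread. The sketch below is not the code in TIKA-152.patch, which relies on POI's "poi-ooxml" artifact. It is a JDK-only illustration, under the assumption that a .docx keeps its body in a part named word/document.xml, of what paragraph-level text extraction looks like. The class name DocxTextSketch and its helper methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical illustration class; not part of Tika, POI, or TIKA-152.patch.
class DocxTextSketch {
    // Namespace of WordprocessingML elements such as w:p and w:t.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of each paragraph in word/document.xml, one per line. */
    static String extractText(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return paragraphs(readCurrentEntry(zip));
                }
            }
        }
        throw new IOException("No word/document.xml part found");
    }

    // Copies the current ZIP entry into memory before handing it to the XML parser.
    private static byte[] readCurrentEntry(ZipInputStream zip) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = zip.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Concatenates the w:t text runs of every w:p paragraph, in document order.
    private static String paragraphs(byte[] xml) throws IOException {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
            StringBuilder text = new StringBuilder();
            NodeList paras = dom.getElementsByTagNameNS(W_NS, "p");
            for (int i = 0; i < paras.getLength(); i++) {
                NodeList runs = ((Element) paras.item(i)).getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < runs.getLength(); j++) {
                    text.append(runs.item(j).getTextContent());
                }
                text.append('\n');
            }
            return text.toString();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IOException("Could not parse word/document.xml", e);
        }
    }

    /** Builds a tiny in-memory package with two paragraphs, for demonstration only. */
    static byte[] samplePackage() throws IOException {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> Tika</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(extractText(new ByteArrayInputStream(samplePackage())));
    }
}
```

The sample package is deliberately incomplete (a real .docx carries further parts such as content types and relationships), and the sketch ignores headers, footnotes, spreadsheets and presentations. Covering those cases is the kind of work a library such as POI is meant to take on.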

        Kumar Raja added a comment -

        This parser seems to work fine but the config files are a bit outdated. What is the procedure to get this patch integrated with the main code? Is there any timeline defined for patch integration?

        Jukka Zitting added a comment -

        Patch applied in revision 744290. Great work, thanks!

        I'm leaving this issue open until we've updated the LICENSE and NOTICE files to match the new dependencies.

        Jukka Zitting added a comment -

        I've now updated the required legal bits. I also upgraded the POI dependency to the latest 3.5-beta5 release.

        Resolving as Fixed.


          People

          • Assignee: Jukka Zitting
          • Reporter: Jukka Zitting
          • Votes: 0
          • Watchers: 3
