Details

      Description

      Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types.
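As a rough illustration of what "calling ExifTool" from Java can look like, the sketch below runs the `exiftool` executable and parses its text output into prefixed field names. It is not the code from the attached patches: the class and method names (`ExiftoolCliSketch`, `parseLine`, `parseOutput`, `run`) are hypothetical, and the `[Group] Tag : value` line shape assumed for `exiftool -G1 -s` should be checked against the installed ExifTool version.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExiftoolCliSketch {

    // Assumed line shape for "exiftool -G1 -s": "[Group]   TagName   : value"
    private static final Pattern LINE =
            Pattern.compile("^\\[([^\\]]+)\\]\\s*(\\S+)\\s*:\\s?(.*)$");

    /** Parses one line into {"Group:Tag", value}; returns null if the line does not match. */
    static String[] parseLine(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1) + ":" + m.group(2), m.group(3).trim()};
    }

    /** Collects all matching lines; a key that repeats keeps every value. */
    static Map<String, List<String>> parseOutput(Iterable<String> lines) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String line : lines) {
            String[] kv = parseLine(line);
            if (kv != null) {
                fields.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
            }
        }
        return fields;
    }

    /** Runs exiftool on one file; throws IOException if the executable cannot be started. */
    static Map<String, List<String>> run(String file) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("exiftool", "-G1", "-s", file)
                .redirectErrorStream(true) // error text will not match LINE and is skipped
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }
        p.waitFor();
        return parseOutput(lines);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            run(args[0]).forEach((k, v) -> System.out.println(k + " = " + v));
        }
    }
}
```

Keeping the parsing separate from the process call means the mapping logic can be unit tested on captured output, without ExifTool installed.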

      In the core project:
      An ExifTool interface is added which contains Property objects that define the metadata fields available.
      An additional Property constructor is added for the internalTextBag type.

      In the parsers project:
      An ExiftoolMetadataExtractor is added which does the work of calling ExifTool on the command line and mapping the response to Tika metadata fields. This extractor could be called instead of, or in addition to, the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser, but those parsers have not been changed at this time.
      An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
      An ExiftoolTikaMapper is added which, if enabled, maps the ExifTool metadata fields to existing Tika and Drew Noakes metadata fields.
      An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag implementations in XML files.
      An ExifToolParserTest is added which tests several expected XMP and IPTC metadata values in testJPEG_IPTC_EXT.jpg.
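To make the idea of extracting a multi-valued RDF Bag more concrete, here is a minimal SAX sketch that collects every `rdf:li` inside an `rdf:Bag` under one target element. It is an illustration only, not the ElementRdfBagMetadataHandler from the patch: the class name `RdfBagSketch` and its `extract` helper are hypothetical, and it uses only the JDK's SAX API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class RdfBagSketch extends DefaultHandler {
    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    private final String targetNs;
    private final String targetLocalName;
    private final List<String> values = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inTarget;
    private boolean inBag;
    private boolean inItem;

    RdfBagSketch(String targetNs, String targetLocalName) {
        this.targetNs = targetNs;
        this.targetLocalName = targetLocalName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (targetNs.equals(uri) && targetLocalName.equals(localName)) {
            inTarget = true;
        } else if (inTarget && RDF_NS.equals(uri) && "Bag".equals(localName)) {
            inBag = true;
        } else if (inBag && RDF_NS.equals(uri) && "li".equals(localName)) {
            inItem = true;
            text.setLength(0); // start collecting a new item
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inItem) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inItem && RDF_NS.equals(uri) && "li".equals(localName)) {
            values.add(text.toString().trim());
            inItem = false;
        } else if (inBag && RDF_NS.equals(uri) && "Bag".equals(localName)) {
            inBag = false;
        } else if (inTarget && targetNs.equals(uri) && targetLocalName.equals(localName)) {
            inTarget = false;
        }
    }

    /** Parses the XML string and returns the Bag items found under the target element. */
    static List<String> extract(String xml, String ns, String localName) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so uri/localName are populated
            RdfBagSketch handler = new RdfBagSketch(ns, localName);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Items in other containers (for example `rdf:Alt`) or under other elements are ignored, so each multi-valued field gets its own handler instance in this sketch.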

      1. testJPEG_IPTC_EXT.jpg
        27 kB
        Ray Gauss II
      2. tika-parsers-exiftool-parser-patch.txt
        34 kB
        Ray Gauss II
      3. tika-core-exiftool-parser-patch.txt
        20 kB
        Ray Gauss II

        Issue Links

          Activity

          Jukka Zitting added a comment -

          Some notes:

          • We already have existing places for metadata schemas like Dublin Core and XMPDM. It would be better if the new metadata properties you're adding were located next to the already existing similar properties instead of in the separate new ExifTool interface.
          • We already have parsers for JPEG, PNG and TIFF. Instead of adding a conflicting new parser for the same formats, it would be better if the existing parsers could be extended with this new functionality.
          Ray Gauss II added a comment -

          Indeed, a tighter integration of the Property definitions would be ideal, but unprefixed Tika metadata constants could collide across namespaces. Keeping the prefixed Properties in ExifTool lets users explicitly fetch the field they need and optionally map it to existing fields like DublinCore via the ExiftoolTikaMapper. As a noob to the project, and with so many new properties, I thought a separate interface might be the best approach, but I'm happy to move them elsewhere.

          I also agree that the ExiftoolMetadataExtractor could be called instead of or in addition to the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser since the absence of ExifTool in the environment should just fail silently, but again, as a noob, I didn't want to bust through the door with sweeping changes right off the bat.
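The trade-off described here, keeping prefixed fields and optionally mapping them onto existing ones, could be sketched roughly as follows. This is not the ExiftoolTikaMapper from the patch: the class name `PrefixedFieldMapperSketch`, the mapping table, and the target keys such as `dc:title` are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PrefixedFieldMapperSketch {

    // Hypothetical mapping from prefixed ExifTool-style names to existing-style keys.
    private static final Map<String, String> TO_EXISTING = new LinkedHashMap<>();
    static {
        TO_EXISTING.put("XMP-dc:Title", "dc:title");
        TO_EXISTING.put("IPTC:ObjectName", "dc:title"); // two sources, one target
        TO_EXISTING.put("XMP-dc:Description", "dc:description");
    }

    /**
     * Always keeps the prefixed fields; when mapping is enabled, also copies values
     * to the mapped key. The first source seen for a target wins.
     */
    static Map<String, String> map(Map<String, String> prefixedFields, boolean mapToExisting) {
        Map<String, String> out = new LinkedHashMap<>(prefixedFields);
        if (mapToExisting) {
            for (Map.Entry<String, String> e : prefixedFields.entrySet()) {
                String target = TO_EXISTING.get(e.getKey());
                if (target != null) {
                    out.putIfAbsent(target, e.getValue());
                }
            }
        }
        return out;
    }
}
```

Because the prefixed keys are never dropped, a caller who needs the IPTC value specifically can still read `IPTC:ObjectName` even when `dc:title` was filled from XMP.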

          Ray Gauss II added a comment -

          I've refactored much of this and will be splitting some of it out to submit as separate, smaller issues.

          Chris A. Mattmann added a comment -
          • push out to 1.2
          Ray Gauss II added a comment -

          The code for the ExifTool parser has been moved to https://github.com/Alfresco/tika-exiftool which contains versions of tika-core and tika-parsers patched with TIKA-842 and TIKA-859.

          Chris A. Mattmann added a comment -
          • push to 1.3
          Chris A. Mattmann added a comment -
          • push to 1.3
          Chris A. Mattmann added a comment -
          • push out to 1.4
          Chris A. Mattmann added a comment -
          • push out to 1.4
          Chris A. Mattmann added a comment -
          • push to 1.5, get ready for 1.4 RC #1.
          Dave Meikle added a comment -

          Pushed out to 1.6, preparing for 1.5 RC

          Chris A. Mattmann added a comment -
          • push to 1.8
          Tyler Palsulich added a comment - - edited

          Do we still want to integrate this? Is this a semi-duplicate of TIKA-762? I agree that we should not create another conflicting Parser for image types.

          Our decision on this is related to TIKA-776.

          Dave Meikle added a comment -
          • Pushed to 1.11 following 1.10 release
          ASF GitHub Bot added a comment -

          GitHub user rgauss opened a pull request:

          https://github.com/apache/tika/pull/92

          TIKA-774: ExifTool Parser

          Contribution of tika-exiftool for review

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/Alfresco/tika tika-exiftool

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/92.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #92


          commit 8eb474b06e1463ca172128b59b713782eb4bece8
          Author: rgauss <rgauss@rgauss.com>
          Date: 2016-03-19T20:37:37Z

          Initial commit of tika-exiftool as is

          commit 5ff139d68bebd39382d5ed9626bff42797ece01d
          Author: rgauss <rgauss@rgauss.com>
          Date: 2016-03-19T22:44:00Z

          Added git ignore of properties override

          commit c8f4fb062ce809661527c91df89b230da95f592c
          Author: rgauss <rgauss@rgauss.com>
          Date: 2016-03-21T18:49:38Z

          Merge branch 'master' into tika-exiftool

          commit e8a2fa30b16f8b947d118b61ca12476420e9bee0
          Author: rgauss <rgauss@rgauss.com>
          Date: 2016-03-21T21:24:29Z

          TIKA-774: ExifTool Parser

          • Moved tika-exiftool from separate project to parsers
          • Updated license headers
          • Removed author Javadoc
          • Fixed a few forbiddenapi violations

          commit 37aae337c5ca3b5a45c2e45804e3768e08a8bbb6
          Author: rgauss <rgauss@rgauss.com>
          Date: 2016-03-21T21:31:31Z

          TIKA-774: ExifTool Parser

          • Removed more author Javadocs

          commit 90f8550c03aa873a81975dfa10cfd77aa557fc6f
          Author: rgauss <rgauss@rgauss.com>
          Date: 2016-03-21T22:00:00Z

          TIKA-774: ExifTool Parser

          • Renamed ExecutableUtils to ExiftoolExecutableUtils
          • Changed ExifToolImageParserTest to skip when exiftool is not
            available

          Tim Allison added a comment -

          Ray, this looks like an absolutely fantastic contribution. I've only had a chance to look at it quickly. My one recommendation is alluded to in your comments: we should add a static check for whether exiftool is available and adjust "handled" mimes at that point.

          I should have a chance to look more closely early next week, but I doubt there's reason to wait for my feedback.
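One way to picture the suggested static check is the sketch below: the availability of `exiftool` is probed once when the class loads (here by running `exiftool -ver`), and the set of handled types is empty when the probe fails. This is only an assumption about how such a check might look, not the code in the pull request. The class name `ExiftoolAvailabilitySketch` is hypothetical, and plain strings stand in for Tika's media type objects to keep the example dependency-free.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

class ExiftoolAvailabilitySketch {

    // Illustrative subset of types; a real parser would list many more.
    private static final Set<String> IMAGE_TYPES = Collections.unmodifiableSet(
            new LinkedHashSet<>(Arrays.asList("image/jpeg", "image/tiff", "image/png")));

    // Evaluated once, when the class is initialized.
    private static final boolean AVAILABLE = canRun("exiftool", "-ver");

    /** Returns true only if the command starts and exits with status 0. */
    static boolean canRun(String... command) {
        try {
            Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
            p.getOutputStream().close();
            // Drain the output so the child process cannot block on a full pipe.
            while (p.getInputStream().read() != -1) {
                // discard
            }
            return p.waitFor() == 0;
        } catch (IOException e) {
            return false; // typically: executable not found on the PATH
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Separated from the static flag so the decision logic can be tested directly. */
    static Set<String> supportedTypes(boolean available) {
        return available ? IMAGE_TYPES : Collections.<String>emptySet();
    }

    static Set<String> getSupportedTypes() {
        return supportedTypes(AVAILABLE);
    }
}
```

With an empty set of handled types, a parser that is present on the classpath but has no ExifTool to call would simply never be selected, rather than failing at parse time.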

          Mattmann, Chris A (388J) added a comment -

          Is this a replacement for the one I hacked together? If so we should update docs and I'd like to review

          Ray Gauss II added a comment -

          > we should add a static check for whether exiftool is available and adjust "handled" mimes at that point.

          I think we'll find other areas to improve on as well. I just wanted to get the ball rolling again on the contribution and review, as we had to close the source on the stand-alone project mentioned above.

          > I should have a chance to look more closely early next week, but I doubt there's reason to wait for my feedback.

          We'd value your feedback, and since it's been over 4 years, we can wait a few more weeks.

          > Is this a replacement for the one I hacked together?

          There's the possibility for the two to coexist, perhaps requiring this parser to be explicitly called programmatically.

          At a high level the biggest differences are:

          1. As mentioned in TIKA-1639, there's an extensive mapping from ExifTool's namespace to proper Tika properties (currently done programmatically)
          2. It includes the ability to embed, i.e. to write metadata back into binary files (TIKA-776).
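For the embedding side, ExifTool's command line writes a tag with an argument of the form `-TAG=VALUE` followed by the file name. The sketch below only builds such an argument list and does not run anything. It is a hypothetical helper (`ExiftoolEmbedArgsSketch.buildCommand`), not the embedder from the contribution, and the tag name used in the usage example is illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ExiftoolEmbedArgsSketch {

    /**
     * Builds an argument list such as: exiftool -XMP-dc:Title=New title photo.jpg
     * No shell quoting is needed when the list is handed to ProcessBuilder,
     * because each element is passed to the process as a single argument.
     */
    static List<String> buildCommand(Map<String, String> fields, String file) {
        List<String> command = new ArrayList<>();
        command.add("exiftool");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            command.add("-" + e.getKey() + "=" + e.getValue());
        }
        command.add(file);
        return command;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("XMP-dc:Title", "New title");
        // Print the command instead of running it, so the sketch has no side effects.
        System.out.println(buildCommand(fields, "photo.jpg"));
    }
}
```

By default ExifTool keeps a backup of the file it modifies, with `_original` appended to the name, so a real embedder would need to decide what to do with that copy.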
          Chris A. Mattmann added a comment -

          Hey Ray Gauss II, great work. Let's keep them co-existing for now, until you/me/we/others have time to update this page: http://wiki.apache.org/tika/EXIFToolParser.

          You did great work. I'll try and crowd source one of my students to do the work to combine the parsers down the road if no one beats me to it, or I'll do it myself.


            People

            • Assignee:
              Unassigned
              Reporter:
              Ray Gauss II
            • Votes:
              0
              Watchers:
              7

              Dates

              • Created:
                Updated:

                Development