Details

      Description

      Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types.

      In the core project:
      An ExifTool interface is added which contains Property objects that define the metadata fields available.
      An additional Property constructor is added for the internalTextBag type.

      In the parsers project:
      An ExiftoolMetadataExtractor is added which does the work of calling ExifTool on the command line and mapping the response to Tika metadata fields. This extractor could be called instead of, or in addition to, the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser, but those have not been changed at this time.
      An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
      An ExiftoolTikaMapper is added which, if enabled, is responsible for mapping the ExifTool metadata fields to existing Tika and Drew Noakes metadata fields.
      An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag implementations in XML files.
      An ExifToolParserTest is added which tests several expected XMP and IPTC metadata values in testJPEG_IPTC_EXT.jpg.
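
As a rough illustration of what extracting a multi-valued RDF Bag involves, the sketch below collects every rdf:li value found inside an rdf:Bag under a given element. It is not the code from the patch: the class name RdfBagSketch and the method bagValues are hypothetical, it uses the JDK's DOM parser rather than a streaming handler, and the sample XML in main is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika or of the attached patches.
final class RdfBagSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the text of every rdf:li inside an rdf:Bag under the named element. */
    static List<String> bagValues(String xml, String namespace, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> values = new ArrayList<>();
            NodeList fields = doc.getElementsByTagNameNS(namespace, localName);
            for (int i = 0; i < fields.getLength(); i++) {
                NodeList bags = ((Element) fields.item(i)).getElementsByTagNameNS(RDF_NS, "Bag");
                for (int b = 0; b < bags.getLength(); b++) {
                    NodeList items = ((Element) bags.item(b)).getElementsByTagNameNS(RDF_NS, "li");
                    for (int k = 0; k < items.getLength(); k++) {
                        values.add(items.item(k).getTextContent().trim());
                    }
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document with a two-item bag.
        String xml = "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:dc='http://purl.org/dc/elements/1.1/'>"
                + "<rdf:Description><dc:subject><rdf:Bag>"
                + "<rdf:li>beach</rdf:li><rdf:li>sunset</rdf:li>"
                + "</rdf:Bag></dc:subject></rdf:Description></rdf:RDF>";
        System.out.println(bagValues(xml, "http://purl.org/dc/elements/1.1/", "subject"));
    }
}
```

Each list item becomes one value of the multi-valued field, which is the shape a text-bag property would need to hold.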

      1. testJPEG_IPTC_EXT.jpg
        27 kB
        Ray Gauss II
      2. tika-parsers-exiftool-parser-patch.txt
        34 kB
        Ray Gauss II
      3. tika-core-exiftool-parser-patch.txt
        20 kB
        Ray Gauss II

          Activity

          Jukka Zitting added a comment -

          Some notes:

          • We already have existing places for metadata schemas like Dublin Core and XMPDM. It would be better if the new metadata properties you're adding were located next to the already existing similar properties instead of in the separate new ExifTool interface.
          • We already have parsers for JPEG, PNG and TIFF. Instead of adding a conflicting new parser for the same formats, it would be better if the existing parsers could be extended with this new functionality.
          Ray Gauss II added a comment -

          Indeed, a tighter integration of the Property definitions would be ideal, but there is potential for namespace collisions with Tika metadata constants that aren't prefixed. Having the prefixed Properties in ExifTool allows users to explicitly fetch the field they need and optionally map those to existing fields like DublinCore via the ExiftoolTikaMapper. As a noob to the project, and with so many new properties, I thought a separate interface might be the best approach, but I'm happy to move them elsewhere.
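
To make the collision point concrete, here is a minimal sketch of keeping prefixed fields alongside an optional mapping to shared fields. Everything in it is an assumption for illustration: the class PrefixedFieldMapperSketch, the key strings, and the mapping table are hypothetical and are not taken from the ExiftoolTikaMapper.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; key names and mapping table are invented.
final class PrefixedFieldMapperSketch {
    // Prefixed, tool-specific key -> shared key it may optionally be copied to.
    private static final Map<String, String> SHARED_KEY_FOR = new HashMap<>();
    static {
        SHARED_KEY_FOR.put("exiftool:XMP-dc:Title", "dc:title");
        SHARED_KEY_FOR.put("exiftool:IPTC:Keywords", "dc:subject");
    }

    /**
     * Always keeps the prefixed fields. When mapToShared is true, also copies
     * mapped values to their shared keys without overwriting existing values.
     */
    static Map<String, String> map(Map<String, String> extracted, boolean mapToShared) {
        Map<String, String> result = new LinkedHashMap<>(extracted);
        if (!mapToShared) {
            return result;
        }
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String shared = SHARED_KEY_FOR.get(entry.getKey());
            if (shared != null) {
                result.putIfAbsent(shared, entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("exiftool:XMP-dc:Title", "Sunset");
        System.out.println(map(extracted, true));
        System.out.println(map(extracted, false));
    }
}
```

Because the prefixed key is never dropped, a caller can still fetch the exact source field even when another parser has already populated the shared one.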

          I also agree that the ExiftoolMetadataExtractor could be called instead of, or in addition to, the existing ImageMetadataExtractor and JempboxExtractor under TiffParser and/or JpegParser, since the absence of ExifTool in the environment should just fail silently. But again, as a noob, I didn't want to bust through the door with sweeping changes right off the bat.
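
The "fail silently" behaviour could look roughly like the sketch below, which runs an external command and returns no output when the executable is missing or exits with an error. This is an assumption about one way to do it, not the extractor's actual code; the class ExternalToolSketch and the method runOrEmpty are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not the ExiftoolMetadataExtractor from the patch.
final class ExternalToolSketch {
    /** Runs the command and returns its output lines, or an empty list on any failure. */
    static List<String> runOrEmpty(String... command) {
        try {
            // Merge stderr into stdout so an unread stderr pipe cannot block the process.
            Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
            List<String> lines = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            // A non-zero exit status is treated the same as a missing tool.
            return process.waitFor() == 0 ? lines : Collections.<String>emptyList();
        } catch (IOException e) {
            // ProcessBuilder.start() throws IOException when the executable cannot be found.
            return Collections.<String>emptyList();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Collections.<String>emptyList();
        }
    }

    public static void main(String[] args) {
        // Prints the ExifTool version if it is installed, otherwise an empty list.
        System.out.println(runOrEmpty("exiftool", "-ver"));
    }
}
```

With a helper like this, a parser that also runs the existing extractors would simply contribute no extra fields on machines where ExifTool is not installed.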

          Ray Gauss II added a comment -

          I've refactored much of this and will be splitting some of it out to submit as separate, smaller issues.

          Chris A. Mattmann added a comment -
          • push out to 1.2
          Ray Gauss II added a comment -

          The code for the ExifTool parser has been moved to https://github.com/Alfresco/tika-exiftool which contains versions of tika-core and tika-parsers patched with TIKA-842 and TIKA-859.

          Chris A. Mattmann added a comment -
          • push to 1.3
          Chris A. Mattmann added a comment -
          • push to 1.3
          Chris A. Mattmann added a comment -
          • push out to 1.4
          Chris A. Mattmann added a comment -
          • push out to 1.4
          Chris A. Mattmann added a comment -
          • push to 1.5, get ready for 1.4 RC #1.
          Dave Meikle added a comment -

          Pushed out to 1.6, preparing for 1.5 RC


            People

            • Assignee:
              Unassigned
              Reporter:
              Ray Gauss II
            • Votes:
              0
              Watchers:
              3
