TIKA-1894: Add XMPMM metadata extraction to JempboxExtractor

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.13
    • Component/s: None
    • Labels: None

      Description

      The XMP Media Management (XMPMM) section of XMP carries some useful information. tika-core's o.a.t.metadata.XMPMM already defines keys for many of the important attributes, and JempBox extracts the XMPMM schema, but the two have not yet been wired together.
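A minimal sketch of what that wiring might look like, for illustration only. It does not use the real JempBox or Tika classes: `MediaManagementSchema` is a hypothetical stand-in for the schema object an XMP library exposes, a plain `Map` stands in for Tika's Metadata object, and the key strings are the XMP property names, which may or may not match the keys in o.a.t.metadata.XMPMM exactly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class XmpmmWiringSketch {

    /** Hypothetical stand-in for the XMPMM schema object returned by an XMP library. */
    static final class MediaManagementSchema {
        final String documentId;
        final String instanceId;

        MediaManagementSchema(String documentId, String instanceId) {
            this.documentId = documentId;
            this.instanceId = instanceId;
        }
    }

    /** Copies the XMPMM values that are present into the metadata map. */
    static void copy(MediaManagementSchema schema, Map<String, String> metadata) {
        if (schema == null) {
            return; // the packet carried no XMPMM section
        }
        putIfPresent(metadata, "xmpMM:DocumentID", schema.documentId);
        putIfPresent(metadata, "xmpMM:InstanceID", schema.instanceId);
    }

    /** Skips absent values instead of storing null. */
    static void putIfPresent(Map<String, String> metadata, String key, String value) {
        if (value != null) {
            metadata.put(key, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        copy(new MediaManagementSchema("uuid:doc-1", null), metadata);
        System.out.println(metadata); // prints {xmpMM:DocumentID=uuid:doc-1}
    }
}
```

Treating both the schema and each individual value as optional keeps the copy step safe for files that carry only part of the XMPMM section.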

        Activity

        tallison@mitre.org Tim Allison added a comment - - edited

        Update made to trunk with commit c5d4ec6c50824a9a40fdd2b492bf7557d8d693f3.

        In 2.0, I'm not sure how to share JempboxExtractor between the multi-media-module and the pdf-module. As expected, we get a cyclic dependency error if I add the multi-media-module as a dependency to the pdf-module, and, even if it did work, that wouldn't be a good option.

        Some options:

        1. Create a tika-parser-xmp-module that would include helper functionality for extracting xmp packets & metadata. Is this enough to warrant a separate module?
        2. Duplicate code (no!!!).
        3. Other options?

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #924 (See https://builds.apache.org/job/tika-trunk-jdk1.7/924/)
        TIKA-1894: Add XMPMM support to PDFParser and JpegParser via Jempbox (tallison: rev c5d4ec6c50824a9a40fdd2b492bf7557d8d693f3)

        • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
        • tika-parsers/src/test/java/org/apache/tika/parser/jpeg/JpegParserTest.java
        • CHANGES.txt
        • tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java
        • tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
        • tika-core/src/main/java/org/apache/tika/metadata/XMPMM.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-2.x #47 (See https://builds.apache.org/job/tika-2.x/47/)
        TIKA-1894 - Add XMPMM support to PDFParser and JpegParser via Jempbox (tallison: rev dc4ca999c2855814158868af97e877cbcc74079a)

        • CHANGES.txt
        • tika-core/src/main/java/org/apache/tika/metadata/XMPMM.java
        • tika-parser-modules/tika-parser-multimedia-module/pom.xml
        • tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/xmp/XMPPacketScanner.java
        • tika-parser-modules/pom.xml
        • tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/jpeg/JpegParser.java
        • tika-parser-modules/tika-parser-xmp-module/src/main/java/org/apache/tika/module/xmp/internal/Activator.java
        • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
        • tika-parser-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
        • tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/image/xmp/JempboxExtractorTest.java
        • tika-parser-modules/tika-parser-xmp-module/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
        • tika-parser-modules/tika-parser-xmp-module/src/main/java/org/apache/tika/parser/xmp/XMPPacketScanner.java
        • tika-parser-modules/tika-parser-pdf-module/pom.xml
        • tika-parser-modules/tika-parser-xmp-module/pom.xml
        • tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/TiffParser.java
        • tika-parser-modules/tika-parser-xmp-module/src/test/java/org/apache/tika/parser/xmp/JempboxExtractorTest.java
        • tika-parser-bundles/tika-parser-multimedia-bundle/pom.xml
        • tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java
        • tika-parser-bundles/tika-parser-pdf-bundle/pom.xml
        • tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/jpeg/JpegParserTest.java
        bobpaulin Bob Paulin added a comment -

        Tim Allison, after looking at this, I'm thinking a new module might be overkill here. There are no parsers in it, so there's no need for an Activator class. Also, I see a number of the image classes instantiating objects that do not need to be instantiated.

        new JempboxExtractor(metadata).parse(tis);
        

        could be

        JempboxExtractor.parse(metadata, tis);
        

        I feel the pain that there is shared code between pdf and multimedia now. Maybe just a simple shared util jar?
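To make the suggestion concrete, here is a small self-contained sketch contrasting the two call styles. It is not the actual JempboxExtractor: the class names, the `Map` standing in for Tika's Metadata, and the placeholder "extraction" (counting bytes) are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class StaticParseSketch {

    /** Instance style: the caller must create an object just to make one call. */
    static final class InstanceExtractor {
        private final Map<String, String> metadata;

        InstanceExtractor(Map<String, String> metadata) {
            this.metadata = metadata;
        }

        void parse(InputStream in) {
            StaticExtractor.parse(metadata, in);
        }
    }

    /** Static style: no per-call object, everything is passed as arguments. */
    static final class StaticExtractor {
        private StaticExtractor() {
            // utility class, not meant to be instantiated
        }

        static void parse(Map<String, String> metadata, InputStream in) {
            // Placeholder for real XMP extraction: record how many bytes were read.
            long count = 0;
            try {
                while (in.read() != -1) {
                    count++;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            metadata.put("bytesRead", Long.toString(count));
        }
    }

    static InputStream sample() {
        return new ByteArrayInputStream("abc".getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        Map<String, String> viaInstance = new HashMap<>();
        new InstanceExtractor(viaInstance).parse(sample());

        Map<String, String> viaStatic = new HashMap<>();
        StaticExtractor.parse(viaStatic, sample());

        System.out.println(viaInstance.equals(viaStatic)); // prints true
    }
}
```

Both styles produce the same result here; the static form simply avoids a short-lived object when the extractor keeps no state beyond the metadata reference.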

        tallison@mitre.org Tim Allison added a comment -

        Makes sense. There isn't a parser in there now, but at some point I'd like to add a parser that combines the PacketScanner and the XMP extractor, though I won't have time for a while.

        By "shared jar", would that be a tika-utils package at the main level?

        Does this belong in the tika-xmp module...or would we run into circular references eventually? Ray Gauss II, any recommendations?

        rgauss Ray Gauss II added a comment -

        The tika-xmp project deals with converting a populated Tika Metadata object into XMP.

        Perhaps that project should be renamed to something more specific at some point, but regardless, I don't think it's the right spot for this sort of shared parser code.

        I'd vote for the simpler shared util jar, but I think it can still live next to the modules, something like /tika-parser-modules/tika-parser-xmp-commons?

        tallison@mitre.org Tim Allison added a comment -

        Thank you, Ray Gauss II. Bob Paulin, if you're ok with this, I'll rename the module today.

        bobpaulin Bob Paulin added a comment -

        I think that sounds like a good idea.

        tallison@mitre.org Tim Allison added a comment -

        Let's see if Hudson likes it... I just pushed this clean-up in 2.x. Thank you!

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-2.x #52 (See https://builds.apache.org/job/tika-2.x/52/)
        TIKA-1894 – clean up following recommendations from Ray Gauss and Bob (tallison: rev c58af959b6cc3f3a3d8f555d53b147388e36b01d)

        • tika-parser-modules/tika-parser-xmp-module/src/main/java/org/apache/tika/module/xmp/internal/Activator.java
        • tika-parser-modules/tika-parser-xmp-module/pom.xml
        • tika-parser-modules/tika-parser-multimedia-module/pom.xml
        • tika-parser-bundles/tika-parser-multimedia-bundle/pom.xml
        • tika-parser-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/XMPPacketScanner.java
        • tika-parser-modules/tika-parser-xmp-module/src/main/java/org/apache/tika/parser/xmp/XMPPacketScanner.java
        • tika-parser-modules/tika-parser-pdf-module/pom.xml
        • tika-parser-modules/pom.xml
        • tika-parser-modules/tika-parser-xmp-module/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
        • tika-parser-modules/tika-parser-xmp-module/src/test/java/org/apache/tika/parser/xmp/JempboxExtractorTest.java
        • tika-parser-modules/tika-parser-xmp-commons/src/test/java/org/apache/tika/parser/xmp/JempboxExtractorTest.java
        • tika-parser-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
        • tika-parser-bundles/tika-parser-pdf-bundle/pom.xml
        • tika-parser-modules/tika-parser-xmp-commons/pom.xml
        tallison@mitre.org Tim Allison added a comment -

        NPE discovered during TIKA-1302 regression tests in prep for 1.13 release.

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #971 (See https://builds.apache.org/job/tika-trunk-jdk1.7/971/)
        TIKA-1894 – fix potential NPE in XMPMM extraction (tallison: rev 92a4835d02d94fddbc7d70c0507b8a32345662d9)

        • tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java

        TIKA-1894 – fix potential NPE in XMPMM extraction (tallison: rev ee60bc6e1b10e7abdb1d36464fb564b195f37dcc)

        • tika-parsers/pom.xml
        • CHANGES.txt

          People

          • Assignee: Unassigned
          • Reporter: Tim Allison (tallison@mitre.org)