Tika / TIKA-1844

PooledTimeSeriesParser takes precedence over MP4Parser

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.13
    • Component/s: None
    • Labels:
      None

      Description

      The PooledTimeSeriesParser currently takes precedence over the MP4Parser even if the pooled-time-series application is not installed. This means that clients will lose metadata formerly extracted by the MP4Parser unless they remove the PooledTimeSeriesParser.

      This is similar to what happened with the integration of the Tesseract Parser (TIKA-1445). We should probably follow a similar pattern to that...run both parsers and combine metadata.
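The "run both parsers and combine metadata" pattern can be sketched roughly as follows. This is a minimal, self-contained illustration and not the committed fix: `SketchParser`, `CombineMetadataSketch`, and `parseBoth` are hypothetical names, and a plain `Map` stands in for Tika's metadata object.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a parser; this is NOT Tika's Parser interface.
interface SketchParser {
    boolean isAvailable();
    Map<String, String> extract(String resourceName);
}

final class CombineMetadataSketch {
    // Always runs the primary parser; runs the optional parser only when it
    // reports itself available. Keys set by the primary parser are kept.
    static Map<String, String> parseBoth(SketchParser primary,
                                         SketchParser optional,
                                         String resourceName) {
        Map<String, String> merged = new LinkedHashMap<>(primary.extract(resourceName));
        if (optional.isAvailable()) {
            for (Map.Entry<String, String> e : optional.extract(resourceName).entrySet()) {
                merged.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return merged;
    }

    // Builds a parser with fixed output, for illustration only.
    static SketchParser fixed(final boolean available, final Map<String, String> out) {
        return new SketchParser() {
            public boolean isAvailable() { return available; }
            public Map<String, String> extract(String resourceName) { return out; }
        };
    }
}
```

With this shape, a client still gets the primary parser's metadata when the optional tool is missing, which is the behavior the description asks for.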

          Activity

          tallison@mitre.org Tim Allison added a comment -

          Aditya Dhulipala and Chris A. Mattmann, any chance we could get this in for 1.13?

          1ceb00da Aditya Dhulipala added a comment -

          Hi Tim.

          I was able to reproduce the error.

          I have a patch in mind:
          ---> do not advertise PooledTimeSeries parser if it's not installed

          I will work on a patch today and tomorrow. Will submit it by end of tomorrow.

          Is that ok?

          Do you have a specific deadline in mind?
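The idea of not advertising a parser whose external tool is missing might look something like the sketch below. It is only an assumption about the general shape, not the submitted patch: `AdvertiseIfInstalledSketch` and `canRun` are hypothetical names, and `video/mp4` is used as an illustrative media type.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class AdvertiseIfInstalledSketch {
    private static final Set<String> VIDEO_TYPES;
    static {
        Set<String> types = new LinkedHashSet<>();
        types.add("video/mp4"); // illustrative; the real parser's type list is not shown here
        VIDEO_TYPES = Collections.unmodifiableSet(types);
    }

    private final BooleanSupplier toolCheck;

    AdvertiseIfInstalledSketch(BooleanSupplier toolCheck) {
        this.toolCheck = toolCheck;
    }

    // Advertises no types when the tool is missing, so other parsers
    // registered for the same types are chosen instead.
    Set<String> getSupportedTypes() {
        return toolCheck.getAsBoolean() ? VIDEO_TYPES : Collections.<String>emptySet();
    }

    // Returns true if the command starts and exits within five seconds.
    // Output is not read, so a tool that fills the pipe buffer would hit
    // the timeout and be treated as unavailable.
    static boolean canRun(String... command) {
        try {
            Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
            boolean finished = p.waitFor(5, TimeUnit.SECONDS);
            if (!finished) {
                p.destroyForcibly();
            }
            return finished;
        } catch (IOException e) {
            return false; // command not found or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller could wire the two together with something like `new AdvertiseIfInstalledSketch(() -> AdvertiseIfInstalledSketch.canRun("pooled-time-series", "--help"))`; the `--help` flag is a guess, since the tool's command line is not described in this thread.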

          chrismattmann Chris A. Mattmann added a comment -

          +1 works great Aditya Dhulipala

          tallison@mitre.org Tim Allison added a comment -

          +1 ditto.

          OCRParser might be a good model. Thank you!

          githubbot ASF GitHub Bot added a comment -

          GitHub user cafed00d4j opened a pull request:

          https://github.com/apache/tika/pull/107

          Tika 1844

          Fix for Tika-1844
          https://issues.apache.org/jira/browse/TIKA-1844

          Fixes an issue where PooledTimeSeries parser takes precedence over other
          video parsers even if pooled-time-series is not installed

          Added the functionality to combine metadata extracted from other video parsers in addition to
          that extracted from PooledTimeSeriesParser

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/cafed00d4j/tika TIKA-1844

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/107.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #107


          commit 2e0c9bdc83af8659c01848886bdfc80e055a5f2a
          Author: Aditya Dhulipala <adhulipa@usc.edu>
          Date: 2016-04-19T02:34:40Z

          Skip PooledTimeSeriesParser if it's not available

          commit 1e2bd89e73888ed293d820d2bf33fb56a5402ce7
          Author: Aditya Dhulipala <adhulipa@usc.edu>
          Date: 2016-04-19T02:46:28Z

          Added CompositeParser workaround

          Added a workaround to collate metadata
          from multiple parsers (such as MP4Parser)

          commit 9f3a32a171d2f45b057bb1f7bf8be5c28daa45d9
          Author: Aditya Dhulipala <adhulipa@usc.edu>
          Date: 2016-04-20T23:54:56Z

          Added imports


          1ceb00da Aditya Dhulipala added a comment -

          Thanks Tim, Prof. Chris

          Used OCRParser as a reference and opened a pull request

          githubbot ASF GitHub Bot added a comment -

          GitHub user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/107

          tallison@mitre.org Tim Allison added a comment -

          Thank you Aditya Dhulipala! I applied your patch and then did some cleanup: fixing indentation and making sure the streams are closed if an exception is thrown.

          Please confirm that I didn't break anything. Thank you, again!
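For readers unfamiliar with the stream cleanup mentioned here, one common way to guarantee a stream is closed when an exception is thrown is try-with-resources. The sketch below shows the general technique only; it is not the code that was committed, and `CloseOnExceptionSketch`, `TrackingStream`, and `firstByteOrFail` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class CloseOnExceptionSketch {
    // Test helper that records whether close() was called.
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Reads one byte, or throws to simulate a failure partway through parsing.
    // try-with-resources closes the stream on both the normal and the
    // exceptional path.
    static int firstByteOrFail(InputStream in, boolean fail) throws IOException {
        try (InputStream stream = in) {
            if (fail) {
                throw new IOException("simulated parse failure");
            }
            return stream.read();
        }
    }
}
```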

          hudson Hudson added a comment -

          FAILURE: Integrated in tika-trunk-jdk1.7 #961 (See https://builds.apache.org/job/tika-trunk-jdk1.7/961/)
          TIKA-1844 pass through POT if it isn't available – via Aditya (tallison: rev 72d76f884e3997ecee322e8327f51053d0a34bad)
          • tika-parsers/src/main/java/org/apache/tika/parser/pot/PooledTimeSeriesParser.java

          TIKA-1844 clean up indentation, clean up streams in case of exceptions, (tallison: rev 2ec36ff4b817d1378b75f0654648a260c3d5928f)
          • tika-parsers/src/main/java/org/apache/tika/parser/pot/PooledTimeSeriesParser.java

          TIKA-1844 clean up indentation, clean up streams in case of exceptions, (tallison: rev 8487fa73d0110fdead3475b2980425b1ef6662fe)
          • CHANGES.txt
          hudson Hudson added a comment -

          UNSTABLE: Integrated in tika-2.x #87 (See https://builds.apache.org/job/tika-2.x/87/)
          TIKA-1844 Extract metadata from MP4 videos whether or not the (tallison: rev a8225a0cfddefd7c7968981a68e191e1c7119a57)

          • tika-parser-modules/tika-parser-scientific-module/pom.xml
          • CHANGES.txt
          • tika-parser-modules/tika-parser-scientific-module/src/main/java/org/apache/tika/parser/pot/PooledTimeSeriesParser.java
          tallison@mitre.org Tim Allison added a comment -

          Bob Paulin, the only way I could get this to work in 2.x was to add the multimedia-module as a dependency of the scientific-module, which undoes some of the great work you've done to separate the modules.

          If there's a better way, please let me know.

          We might consider reverting the integration of MP4 and POT in 2.x on the theory that "we'll figure it out by then." I'll open an issue to track a checklist of things in that category.

          bobpaulin Bob Paulin added a comment -

          I should be able to take a look tomorrow. After removing POI from all but Office I think anything is possible.

          bobpaulin Bob Paulin added a comment -

          Tim Allison So the simplest solution would be to just move the PooledTimeSeriesParser to the multimedia bundle. Its supported media types are all video, so it could fit. Otherwise I could set up the MP4Parser using a ParserProxy so multimedia is just optional rather than required.
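The "optional rather than required" option can be pictured as loading the delegate by class name and falling back when the class is not on the classpath. The sketch below illustrates that general idea only. It is not Tika's ParserProxy, whose API is not shown in this thread, and `OptionalDelegateSketch` and `loadOrFallback` are hypothetical names.

```java
final class OptionalDelegateSketch {
    // Tries to create an instance of className through its no-argument
    // constructor. Returns the fallback if the class is missing, cannot be
    // linked or instantiated, or is not of the expected type.
    static <T> T loadOrFallback(String className, Class<T> type, T fallback) {
        try {
            Class<?> loaded = Class.forName(className);
            return type.cast(loaded.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            return fallback;
        }
    }
}
```

Because the class is referenced only by name, the calling module compiles and runs without a hard dependency on the module that provides it.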

          tallison@mitre.org Tim Allison added a comment -

          Moved...duh. Thank you.

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-2.x #90 (See https://builds.apache.org/job/tika-2.x/90/)
          TIKA-1844 - move PooledTimeSeries to multimedia (tallison: rev b876fa5cb2dc61b2e0ae0c8e5727aa5e86de4b01)

          • tika-parser-modules/tika-parser-scientific-module/src/main/java/org/apache/tika/parser/pot/PooledTimeSeriesParser.java
          • tika-parser-modules/tika-parser-multimedia-module/pom.xml
          • tika-parser-modules/pom.xml
          • tika-parser-bundles/tika-parser-scientific-bundle/pom.xml
          • tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/pot/PooledTimeSeriesParser.java
          • tika-parser-bundles/tika-parser-scientific-bundle/src/test/java/org/apache/tika/module/scientific/BundleIT.java
          • tika-parser-modules/tika-parser-multimedia-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • tika-parser-bundles/tika-parser-multimedia-bundle/src/test/java/org/apache/tika/module/multimedia/BundleIT.java
          • tika-parser-bundles/tika-parser-multimedia-bundle/pom.xml
          • tika-parser-modules/tika-parser-scientific-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • tika-parser-modules/tika-parser-scientific-module/pom.xml
          bobpaulin Bob Paulin added a comment -

          +1. Good time to be running into this stuff since we're still pre-release for 2.x


            People

            • Assignee: Unassigned
            • Reporter: tallison@mitre.org Tim Allison
            • Votes: 1
            • Watchers: 7
