Tika / TIKA-2374

Tika App -z should extract PDF inline images by default

    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.14
    • Fix Version/s: 1.16
    • Component/s: cli
    • Labels:
      None

      Description

      As discussed on dev@: if you use the Tika App with the default config and the -z extract option, it extracts embedded resources but not PDF inline images. This is unexpected for new users, who won't know that they need to pass in a custom config with the extractInlineImages PDF parser option set.

      If the user passes an explicit config to the app, we should respect it. However, if they don't pass one and take the default, the -z option (and only that option) should enable whatever settings are needed to make extraction work properly and fully (currently just extractInlineImages).

      If possible/easy, the -z option should print some info to let affected users know that the default config was tweaked to extract extra embedded resources.
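
      The behaviour requested above could be sketched roughly as follows. This is a standalone illustration, not the actual TikaCLI code: the class InlineImageDefaults, the Decision type, and the notice wording are all hypothetical, and it assumes the config is supplied as --config=<file> and extraction is requested with -z or --extract.

```java
import java.util.List;

/** Hypothetical sketch of the requested behaviour; not taken from TikaCLI. */
final class InlineImageDefaults {

    /** Result of the decision: whether to force the option on, plus an optional notice. */
    static final class Decision {
        final boolean forceInlineImages;
        final String notice; // null when the user does not need to be told anything

        Decision(boolean forceInlineImages, String notice) {
            this.forceInlineImages = forceInlineImages;
            this.notice = notice;
        }
    }

    /**
     * Force extractInlineImages on only when extraction was requested
     * and the user did not supply a config of their own.
     */
    static Decision decide(List<String> args) {
        boolean hasExplicitConfig = false;
        boolean extractRequested = false;
        for (String arg : args) {
            if (arg.startsWith("--config=")) {
                hasExplicitConfig = true;   // assumed flag form
            } else if (arg.equals("-z") || arg.equals("--extract")) {
                extractRequested = true;    // assumed flag forms
            }
        }
        if (extractRequested && !hasExplicitConfig) {
            // Hypothetical wording; the caller would print this to stderr.
            return new Decision(true,
                "No config given: enabling extractInlineImages for -z");
        }
        // An explicit config is respected as-is; other modes are left alone.
        return new Decision(false, null);
    }
}
```

      A caller would print a non-null notice to stderr and, when forceInlineImages is true, switch the PDF parser option on before parsing. Keeping the decision separate from the parsing makes the "respect an explicit config" rule easy to test on its own.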

        Activity

        tallison@mitre.org Tim Allison added a comment -

        If a user does not supply a TikaConfig on the command line, then extractInlineImages is set to true for all functionality of tika-app's cli. A warning is written to stderr.

        Nick Burch, if you feel strongly that we should limit this to -z, let me know. Also, please recommend improvements to the warning message as you see fit.

        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build Tika-trunk #1300 (See https://builds.apache.org/job/Tika-trunk/1300/)
        TIKA-2374 – tika-app cli should extract inline images by default (tallison: https://github.com/apache/tika/commit/2deadf4c4d3d396d4d9f3cc5cee6ed3cb0bce868)

        • (edit) CHANGES.txt
        • (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
        • (add) tika-app/src/test/resources/test-data/testPDF_childAttachments.pdf
        • (edit) tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
        tallison@mitre.org Tim Allison added a comment -

        Reopening...we should actually limit this to the -z option. See TIKA-2434


          People

          • Assignee: Unassigned
          • Reporter: gagravarr Nick Burch
          • Votes: 0
          • Watchers: 3
