TIKA-2389

Warn log level is pretty strong for missing JBIG2ImageReader

    Details

    • Type: Wish
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.15
    • Fix Version/s: 1.17
    • Component/s: parser
    • Labels:
      None

      Description

      Given the license of jbig2-imageio, many projects (Apache or LGPL projects, for example) won't include it and will always end up with a warning because of it, even though they probably don't care much about this image format.

      Ideally, ImageParser should be made more extensible and the jbig2 part moved into an optional module, but in the meantime, is this warning really necessary?

        Activity

        tallison@mitre.org Tim Allison added a comment -

        For those who aren't using OCR, yes, I agree that this is overkill.

        However, if you are trying to run OCR on inline images in PDFs and you don't include jbig2, you'll silently get no text for some PDFs, and you might not win the Pulitzer. I'd be OK moving this to "info", maybe, but I defer to the community. Matthew Caruana Galizia, any recommendations?

        mcaruanagalizia Matthew Caruana Galizia added a comment -

        Please don't move this to info.

        Before seeing this warning, I didn't even know that the JBIG2 format existed. And then, who knows, we might never have found the things that we found after adding support.

        For want of a nail... the battle was lost.

        tallison@mitre.org Tim Allison added a comment -

        I didn't either...

        Now that I think about it, we should probably add warnings for the other image types that require non-ASL-compatible libs.

        tmortagne Thomas Mortagne added a comment - - edited

        Maybe something like a "don't bother me with optional parsers" property in TikaConfig?

        Any clean way to disable it would be good enough for my use case.

        tallison@mitre.org Tim Allison added a comment - - edited

        We can add a method to Initializable, something like checkConfiguration(defaultResponse), where the options would be ignore, warn, or throw, similar to what we do with ServiceLoader problems. Users could set a global default but also override it for specific parsers via TikaConfig. The default would be warn.
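
A minimal sketch of the ignore/warn/throw idea described above. The enum name `ProblemResponse` and its methods are hypothetical and chosen for illustration; this is not the Tika API, and it uses java.util.logging only to stay self-contained.

```java
import java.util.Locale;
import java.util.logging.Logger;

/** Hypothetical sketch: the three responses named in the comment above. */
enum ProblemResponse {
    IGNORE, WARN, THROW;

    private static final Logger LOG = Logger.getLogger(ProblemResponse.class.getName());

    /** Parses a configured value such as "warn"; null or blank falls back to WARN. */
    static ProblemResponse fromConfig(String value) {
        if (value == null || value.trim().isEmpty()) {
            return WARN;
        }
        // valueOf throws IllegalArgumentException for unknown values.
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /** Reacts to a non-fatal initialization problem reported by a component. */
    void handle(String component, String message) {
        String text = component + ": " + message;
        switch (this) {
            case IGNORE:
                return;
            case WARN:
                LOG.warning(text);
                return;
            case THROW:
                throw new IllegalStateException(text);
            default:
                throw new AssertionError(this);
        }
    }
}
```

A component that finds an optional dependency missing could then call something like `ProblemResponse.fromConfig(configured).handle("ImageParser", "JBIG2ImageReader not loaded")`, where `configured` is whatever the user set.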

        tallison@mitre.org Tim Allison added a comment - - edited

        Draft patch. This would allow users to turn off warnings from parsers.

        The idea is that there is a difference between a class loading error (class not found/game over), which we currently handle well, and a potential problem that 1) the user should be warned of and/or 2) the user should be able to turn off. Examples of the latter include optional dependencies, like the jbig2 dependency that prompted this issue.

        I added this functionality to the Initializable interface. Users can set the global default in the service-loader element (e.g. <service-loader initializableProblemHandler="throw"/>), and they can override it per Initializable (e.g. <parser class="org.apache.tika.parser.DummyInitializableParser" initializableProblemHandler="info">).

        I don't like the amount of code this adds, but it does differentiate between a parser complaining at initialization time and a parser complaining while parsing, something a user couldn't do by setting the log level at the parser level.

        Recommendations?
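
The "global default, per-Initializable override" lookup described in this comment can be sketched roughly as follows. `HandlerLevels` and its methods are hypothetical names for illustration, not classes from the patch, and the fallback to "warn" assumes the default proposed earlier in the thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of resolving a handler level for one class. */
final class HandlerLevels {
    private final String globalDefault;
    private final Map<String, String> perClass = new HashMap<>();

    HandlerLevels(String globalDefault) {
        // Assumption: with nothing configured, the level is "warn".
        this.globalDefault = (globalDefault == null) ? "warn" : globalDefault;
    }

    /** Records a per-class override, as the per-parser attribute would. */
    HandlerLevels override(String className, String level) {
        perClass.put(className, level);
        return this;
    }

    /** The per-class setting wins; otherwise the global default applies. */
    String levelFor(String className) {
        return perClass.getOrDefault(className, globalDefault);
    }

    public static void main(String[] args) {
        HandlerLevels levels = new HandlerLevels("throw")
                .override("org.apache.tika.parser.DummyInitializableParser", "info");
        // Prints "info": the per-parser override applies.
        System.out.println(levels.levelFor("org.apache.tika.parser.DummyInitializableParser"));
        // Prints "throw": no override, so the global default applies.
        System.out.println(levels.levelFor("org.example.OtherParser"));
    }
}
```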

        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build Tika-trunk #1299 (See https://builds.apache.org/job/Tika-trunk/1299/)
        TIKA-2389 and fix CHANGES.txt file (tallison: https://github.com/apache/tika/commit/93f941e2f539bf4d63165122f1de4e72c2460833)

        • (edit) CHANGES.txt
          TIKA-2389 – allow users to configure warnings for problems during (tallison: https://github.com/apache/tika/commit/4161f2281bbc5b476b80e00c4fad7d54d1d12827)
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        • (edit) tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
        • (add) tika-core/src/main/java/org/apache/tika/config/InitializableProblemHandler.java
        • (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-2389-throw-per-parser.xml
        • (edit) tika-core/src/test/java/org/apache/tika/parser/DummyInitializableParser.java
        • (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-2389-illegal.xml
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/sentiment/analysis/SentimentParser.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
        • (edit) tika-dl/src/main/java/org/apache/tika/dl/imagerec/DL4JInceptionV3Net.java
        • (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-2389-warn-per-parser.xml
        • (edit) tika-core/src/main/java/org/apache/tika/config/ServiceLoader.java
        • (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-2389-throw-default-overridden.xml
        • (add) tika-core/src/test/resources/org/apache/tika/config/TIKA-2389-throw-default.xml
        • (edit) tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3Parser.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowImageRecParser.java
        • (edit) tika-core/src/main/java/org/apache/tika/config/Initializable.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/recognition/tf/TensorflowRESTRecogniser.java
        tallison@mitre.org Tim Allison added a comment -

        This doesn't work as desired...

        The problem is that if a Parser is excluded, we still instantiate it before excluding it. This means that one instance of the parser is created, and can emit its warning, before the initializable warning level is read from the parser's configuration.
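
A toy reproduction of the ordering problem described above, under the assumption that the check runs in the constructor. All names here (`EagerLoadingSketch`, `NoisyParser`, `load`) are hypothetical and stand in for the real loading code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: instantiate first, exclude afterwards. */
final class EagerLoadingSketch {
    /** Collects warnings so the effect is observable without a logger. */
    static final List<String> WARNINGS = new ArrayList<>();

    /** Stand-in for a parser that checks an optional dependency when constructed. */
    static final class NoisyParser {
        NoisyParser() {
            WARNINGS.add("optional dependency missing");
        }
    }

    /** Creates the parser, then applies the exclusion. */
    static List<Object> load(boolean excludeNoisy) {
        List<Object> loaded = new ArrayList<>();
        Object candidate = new NoisyParser(); // the constructor has already warned
        if (!excludeNoisy) {
            loaded.add(candidate);
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<Object> parsers = load(true);
        // Prints "0 parsers, 1 warning(s)": the excluded parser still warned.
        System.out.println(parsers.size() + " parsers, " + WARNINGS.size() + " warning(s)");
    }
}
```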

        hudson Hudson added a comment -

        ABORTED: Integrated in Jenkins build Tika-trunk #1320 (See https://builds.apache.org/job/Tika-trunk/1320/)
        TIKA-2389 – add static checks to PDFParser, Tesseract, SQLLite to make (tallison: https://github.com/apache/tika/commit/05f8f89fe6b531caacb8b39d1f344e96db834a39)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
        • (edit) tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3Parser.java
        • (edit) tika-core/src/main/java/org/apache/tika/config/ServiceLoader.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build Tika-trunk #1327 (See https://builds.apache.org/job/Tika-trunk/1327/)
        TIKA 2262 : Adopt changes in TIKA-2389 (thejanwijesinghe.14: https://github.com/apache/tika/commit/44082d3dd456a88c9d79b5a600db88a1a865416c)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/captioning/tf/TensorflowRESTCaptioner.java

          People

          • Assignee: Unassigned
          • Reporter: tmortagne Thomas Mortagne
          • Votes: 1
          • Watchers: 5