Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 2.0, 1.15
    • Component/s: parser
    • Labels:
      None
    • Environment:

      Any

      Description

      If you are interested, I would like to add support for JBIG2 image files (.jb2 or .jbig2). I have encountered them in PDFs.

      I will make a pull-request shortly.

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user essiembre opened a pull request:

          https://github.com/apache/tika/pull/144

          JBIG2 support for TIKA-2232 contributed by pascal.essiembre

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/essiembre/tika TIKA-2232

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/144.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #144


          commit e6cbaa03be9e362c6572f6cde351e777abd4c763
          Author: Pascal Essiembre <pascal.essiembre@norconex.com>
          Date: 2017-01-06T02:26:06Z

          Fix for TIKA-2232 contributed by pascal.essiembre


          githubbot ASF GitHub Bot added a comment -

          GitHub user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/144

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1171 (See https://builds.apache.org/job/Tika-trunk/1171/)
          Fix for TIKA-2232 contributed by pascal.essiembre (pascal.essiembre: rev e6cbaa03be9e362c6572f6cde351e777abd4c763)

          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • (add) tika-parsers/src/test/resources/test-documents/testPDF_JBIG2.pdf
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/image/ImageParserTest.java
          • (add) tika-parsers/src/test/resources/test-documents/testJBIG2.jb2
          • (edit) tika-parsers/pom.xml
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          TIKA-2232 – shorten one unit test and update changes. (tallison: rev 86dbde4ec93d4dc962d99349d23f08c326e607a3)
          • (edit) CHANGES.txt
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          tallison@mitre.org Tim Allison added a comment -

          Do we want to check for JBIG2 on the classpath in ImageParser before ImageParser includes jbig2 in its supported types?

          The old behavior was that users could see that the EmptyParser was applied (I think?). The new behavior, if the jbig2 libs are not on the classpath, is that the image is processed by the ImageParser, but no metadata is added.

          For PDFs, users without jbig2 on the classpath will receive a stack trace for the missing library in their metadata, which makes sense.

          tallison@mitre.org Tim Allison added a comment -

          Added PDF OCR test. Tesseract can't process jbig2 directly, but for jbig2 embedded in pdfs, if users go with option 2 for PDF OCR and use PDFBox to generate an image of the full page, embedded jbig2's are OCR'd.

          Thank you, Pascal Essiembre!

          Let's reopen if we want to check for jbig2 on classpath before ImageParser claims that it can handle jbig2.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1174 (See https://builds.apache.org/job/Tika-trunk/1174/)
          TIKA-2232 add unit test for OCR of jbig2 embedded in PDF. (tallison: rev ba26f6ee01574702f5eaa56bf45aaf06e043d6df)

          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/image/ImageParserTest.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #193 (See https://builds.apache.org/job/tika-2.x/193/)
          TIKA-2232 – add processing of jbig2 (with necessary non ASL 2.0 libs) (tallison: rev 0bc9bd89675d866b6ccd9e8b9e04ecfed8988544)

          • (add) tika-test-resources/src/test/resources/test-documents/testPDF_JBIG2.pdf
          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • (edit) tika-parser-modules/tika-parser-multimedia-module/pom.xml
          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/ImageParser.java
          • (edit) CHANGES.txt
          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/image/ImageParserTest.java
          • (add) tika-test-resources/src/test/resources/test-documents/testJBIG2.jb2
          pascal.essiembre Pascal Essiembre added a comment -

          Either way. I think the most important thing is not to have JBIG2 images "silently" ignored when the library is not on the classpath. So having some sort of indication when encountering such files without the library would be nice (either a log message or an exception).

          nicholas.dipiazza Nicholas DiPiazza added a comment - - edited

          Pascal Essiembre, totally.

          Obviously, with the GPL3 license, most people cannot use the jbig2-imageio library. So can we please provide a way to turn off this exception?

          org.apache.pdfbox.filter.MissingImageReaderException: Cannot read JBIG2 image: jbig2-imageio is not installed
          	at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:128) ~[pdfbox-2.0.1.jar:2.0.1]
          	at org.apache.pdfbox.filter.JBIG2Filter.decode(JBIG2Filter.java:55) ~[pdfbox-2.0.1.jar:2.0.1]
          	at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:69) ~[pdfbox-2.0.1.jar:2.0.1]
          	at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:163) ~[pdfbox-2.0.1.jar:2.0.1]
          	at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:235) ~[pdfbox-2.0.1.jar:2.0.1]
          	at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:147) ~[pdfbox-2.0.1.jar:2.0.1]
          	at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:70) ~[pdfbox-2.0.1.jar:2.0.1]
          	at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:385) ~[pdfbox-2.0.1.jar:2.0.1]	
          
          tallison@mitre.org Tim Allison added a comment -

          We should be catching that and storing it in a metadata:warn key. Are you getting that with trunk?
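A minimal sketch of the idea in this comment, written against plain Java collections rather than Tika's own classes: the image-extraction step runs inside a try/catch, and any exception is recorded as a stack trace under a warning key instead of aborting the parse. The key name is the one shown in the proposal later in this thread; the `ImageStep` interface, the `runRecordingWarnings` helper, and the use of a `Map` for metadata are assumptions made for illustration and are not Tika's API.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WarnMetadataSketch {
    // Key name as it appears later in this thread.
    public static final String WARN_KEY = "X-TIKA:EXCEPTION:warn";

    /** Hypothetical stand-in for "extract one inline image". */
    public interface ImageStep {
        void run() throws Exception;
    }

    /** Renders the full stack trace of a throwable as a string. */
    public static String stackTraceOf(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    /** Runs the step; on failure, records the stack trace as a warning and carries on. */
    public static void runRecordingWarnings(ImageStep step, Map<String, List<String>> metadata) {
        try {
            step.run();
        } catch (Exception e) {
            metadata.computeIfAbsent(WARN_KEY, k -> new ArrayList<>()).add(stackTraceOf(e));
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        // Simulate the missing-reader failure with a generic exception.
        runRecordingWarnings(() -> {
            throw new IllegalStateException("Cannot read JBIG2 image: jbig2-imageio is not installed");
        }, metadata);
        System.out.println(metadata.get(WARN_KEY).size() + " warning(s) recorded");
    }
}
```

The caller still gets whatever text and metadata could be extracted, and the missing library shows up as data that downstream code can inspect.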

          tallison@mitre.org Tim Allison added a comment -

          Reopening to handle the case where jbig2 is not on the classpath.

          tallison@mitre.org Tim Allison added a comment - - edited

          Proposed change if jbig2 is not on the classpath:

          PDFParser extractInlineImages adds:

          X-TIKA:EXCEPTION:warn : org.apache.pdfbox.filter.MissingImageReaderException: Cannot read JBIG2 image: jbig2-imageio is not installed
          	at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:128)
          	at org.apache.pdfbox.filter.JBIG2Filter.decode(JBIG2Filter.java:54)
          

          to the metadata of the PDF...

          ImageParser checks for JBIG2 in a try { Class.forName ... } block before adding jbig2 to SUPPORTED_TYPES. If jbig2 is not on the classpath, then the files are handled by the EmptyParser, as they used to be.
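The classpath check described above could look roughly like the following sketch. It is not the code committed to ImageParser: the reader class name `com.levigo.jbig2.JBIG2ImageReader`, the media type strings, and the helper names are assumptions for illustration, and media types are plain strings here rather than Tika `MediaType` objects.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class SupportedTypesSketch {
    // Assumed name of a class shipped by the optional JBIG2 library (not taken from the thread).
    public static final String JBIG2_READER_CLASS = "com.levigo.jbig2.JBIG2ImageReader";

    /** Returns true if the named class can be loaded, without initializing it. */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, SupportedTypesSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Builds the supported-type set, adding JBIG2 only when its reader class is loadable. */
    public static Set<String> supportedTypes(String optionalReaderClass) {
        Set<String> types = new LinkedHashSet<>();
        types.add("image/png");
        types.add("image/bmp");
        if (isOnClasspath(optionalReaderClass)) {
            types.add("image/x-jbig2"); // assumed media type string for JBIG2
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(supportedTypes(JBIG2_READER_CLASS));
    }
}
```

Because the parser never claims JBIG2 when the library is absent, such files fall through to whatever handled them before, which matches the fallback behavior described in the comment.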

          tallison@mitre.org Tim Allison added a comment -

          Let me know if you'd like different behavior.

          mcaruanagalizia Matthew Caruana Galizia added a comment -

          Could we at least log a warning once when the ClassNotFoundException is thrown? Otherwise I feel like we're sweeping the problem under the rug.

          In the meantime I've asked one of the Levigo developers if they'd consider switching to a license which is compatible with the ASL v2.

          tallison@mitre.org Tim Allison added a comment -

          >In the meantime I've asked one of the Levigo developers if they'd consider switching to a license which is compatible with the ASL v2.

          Great!

          >Could we at least log a warning once when the ClassNotFoundException is thrown? Otherwise I feel like we're sweeping the problem under the rug.

          Done in ImageParser.
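One way to log the warning only once, sketched with `java.util.logging` and an `AtomicBoolean` guard. This illustrates the request above and is not the change that was committed; the class and method names are hypothetical, and the real ImageParser may use a different logging framework.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Logger;

final class WarnOnceSketch {
    private static final Logger LOG = Logger.getLogger(WarnOnceSketch.class.getName());

    // Flips to true the first time a warning is emitted by this instance.
    private final AtomicBoolean warned = new AtomicBoolean(false);

    /** Logs the message on the first call only; returns true if this call logged it. */
    public boolean warnOnce(String message) {
        if (warned.compareAndSet(false, true)) {
            LOG.warning(message);
            return true;
        }
        return false;
    }

    /** Checks for an optional class and warns (once) when it cannot be found. */
    public boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            warnOnce("Optional class " + className + " is not on the classpath; JBIG2 files will not be parsed");
            return false;
        }
    }

    public static void main(String[] args) {
        WarnOnceSketch check = new WarnOnceSketch();
        // Hypothetical class name; the second call stays silent.
        check.isAvailable("example.missing.Jbig2Reader");
        check.isAvailable("example.missing.Jbig2Reader");
    }
}
```

The `compareAndSet` guard keeps the log readable when many JBIG2 files are processed, while still leaving a visible trace that the library is missing.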

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #203 (See https://builds.apache.org/job/tika-2.x/203/)
          TIKA-2232 – log/warn if jbig2 is not on classpath (tallison: rev 8d783d27a5e0ad5b7c617ffe9a0afee5e37928f8)

          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/image/ImageParser.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1184 (See https://builds.apache.org/job/Tika-trunk/1184/)
          TIKA-2232 – log/warn if jbig2 parser is not on classpath (tallison: rev 9d97e16e594f27dc850de6cdf3831c7a61454d57)

          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java

            People

            • Assignee:
              tallison@mitre.org Tim Allison
              Reporter:
              pascal.essiembre Pascal Essiembre
            • Votes:
              0
              Watchers:
              6
