Tika / TIKA-2174

Too few formats in support declared by TesseractOCRParser

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 2.0, 1.15
    • Component/s: parser
    • Labels:
      None

      Description

      A complete install of Leptonica with Tesseract will add support for formats that are not declared by TesseractOCRParser. These include JP2, JPX and PPM.

      Tesseract produces OCR output fine for JPX images as of this version:

        $ tesseract -v
        tesseract 3.04.01
         leptonica-1.73
          libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
      

      However, these types are not declared by getSupportedTypes, so no OCR output is produced for them. This affects, for example, PDFs that contain JPX images of scanned documents.
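      To illustrate why an undeclared type yields no output, here is a minimal sketch of type-gated dispatch. It is not Tika's actual code: the class name, the `routedToOcr` helper and the "before" list of types are assumptions made for illustration. Only the three added types (JP2, JPX, PPM) come from this issue.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Simplified, hypothetical model of type-gated dispatch (not Tika's real classes). */
class DeclaredTypeDispatch {

    // Assumed "before" list, for illustration only.
    static final Set<String> DECLARED_BEFORE = new HashSet<>(Arrays.asList(
            "image/png", "image/jpeg", "image/tiff", "image/gif", "image/bmp"));

    // The same list plus the three types named in this issue.
    static final Set<String> DECLARED_AFTER = new HashSet<>(DECLARED_BEFORE);
    static {
        DECLARED_AFTER.addAll(Arrays.asList(
                "image/jp2", "image/jpx", "image/x-portable-pixmap"));
    }

    /** True if a resource of the detected type would be handed to the OCR parser. */
    static boolean routedToOcr(Set<String> declaredTypes, String detectedType) {
        return declaredTypes.contains(detectedType);
    }

    public static void main(String[] args) {
        for (String type : Arrays.asList(
                "image/png", "image/jpx", "image/x-portable-pixmap")) {
            System.out.println(type
                    + " before=" + routedToOcr(DECLARED_BEFORE, type)
                    + " after=" + routedToOcr(DECLARED_AFTER, type));
        }
    }
}
```

      The point of the sketch is that the OCR engine's real capabilities never enter the decision: a type missing from the declared set is skipped even if Tesseract could read it.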

        Activity

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #174 (See https://builds.apache.org/job/tika-2.x/174/)
        TIKA-2174 – clean up (tallison: rev 9a68f4ccc12a633ab1ae7837d561480cc3e0c05c)

        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
        • (edit) tika-parser-modules/tika-parser-multimedia-module/pom.xml
        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build tika-2.x-windows #75 (See https://builds.apache.org/job/tika-2.x-windows/75/)
        TIKA-2174 – clean up (tallison: rev 9a68f4ccc12a633ab1ae7837d561480cc3e0c05c)

        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
        • (edit) tika-parser-modules/tika-parser-multimedia-module/pom.xml
        hudson Hudson added a comment -

        UNSTABLE: Integrated in Jenkins build Tika-trunk #1140 (See https://builds.apache.org/job/Tika-trunk/1140/)
        TIKA-2174/TIKA-2175 – clean up (tallison: rev b97045aea303bac75bd3c937cde6b42c7a3b3c48)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #173 (See https://builds.apache.org/job/tika-2.x/173/)
        TIKA-2174 – add ppm and update changes.txt (tallison: rev 3f24e6c3e2514a7be2d966305c53a3da0f397ef9)

        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
        • (edit) CHANGES.txt
        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1139 (See https://builds.apache.org/job/Tika-trunk/1139/)
        TIKA-2174 – add .ppm to tesseract (tallison: rev 1aff6380d46b9104835909c31e7f2f36f621eca0)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        • (edit) CHANGES.txt
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java

        TIKA-2174 – fix jp2 (tallison: rev 98de2882842cecdf5d160c76c3b1f1e62c57d563)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        tallison@mitre.org Tim Allison added a comment -

        Ugh. Just added now. Thank you. Sorry, a bit distracted lately.

        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build tika-2.x-windows #74 (See https://builds.apache.org/job/tika-2.x-windows/74/)
        TIKA-2174 – add ppm and update changes.txt (tallison: rev 3f24e6c3e2514a7be2d966305c53a3da0f397ef9)

        • (edit) CHANGES.txt
        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        mcaruanagalizia Matthew Caruana Galizia added a comment -

        Thank you! I've also confirmed that Tesseract can handle image/x-portable-pixmap (PPM) files, so perhaps we could add that too?

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #172 (See https://builds.apache.org/job/tika-2.x/172/)
        TIKA-2174 add jpx and jp2 to Tesseract (tallison: rev f2661f997e69fcaf388561f122b306021928a5d4)

        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build tika-2.x-windows #73 (See https://builds.apache.org/job/tika-2.x-windows/73/)
        TIKA-2174 add jpx and jp2 to Tesseract (tallison: rev f2661f997e69fcaf388561f122b306021928a5d4)

        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        • (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
        tallison@mitre.org Tim Allison added a comment -

        There's more work to be done on jpx/jp2 extraction from PDFs, but I've added those file formats to our Tesseract parser for now.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1136 (See https://builds.apache.org/job/Tika-trunk/1136/)
        TIKA-2174 add jp2 and jpx to file formats handled by TesseractOCRParser (tallison: rev c17d1b8a6bef4409787aa2b58b96f691dfcf1170)

        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
        tallison@mitre.org Tim Allison added a comment -

        OK, yes, we're seeing the same thing. I asked about this on the PDFBox users' list. I don't know whether this is a PDFBox issue or a Tika issue, but something is wrong.

        On memory and time for extracting inline images, also see this. Again, this could be caused by a misuse of PDFBox.

        See our updated wiki.

        mcaruanagalizia Matthew Caruana Galizia added a comment -

        That issue went away once I added 'jp2' and 'jpx' to the list of supported types in TesseractOCRParser via a new proxy parser that declares support for these types. It seems the embedded images are then handed off to Tesseract but nothing is OCRed, although that seems to be a separate issue arising from PDFBox.
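        The proxy approach described above could look roughly like the following sketch. It uses a hypothetical `SketchParser` interface rather than Tika's real parser API, and `WideningProxyParser`, `ProxyDemo` and `fakeOcr` are names invented here. The proxy delegates all parsing and only widens the set of declared types.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical minimal parser contract, standing in for a real parser interface. */
interface SketchParser {
    Set<String> supportedTypes();
    String parse(byte[] data);
}

/** Proxy that delegates parsing unchanged but advertises additional types. */
class WideningProxyParser implements SketchParser {
    private final SketchParser delegate;
    private final Set<String> types;

    WideningProxyParser(SketchParser delegate, String... extraTypes) {
        this.delegate = delegate;
        // Union of the delegate's declared types and the extra ones.
        Set<String> merged = new HashSet<>(delegate.supportedTypes());
        merged.addAll(Arrays.asList(extraTypes));
        this.types = Collections.unmodifiableSet(merged);
    }

    @Override
    public Set<String> supportedTypes() {
        return types;
    }

    @Override
    public String parse(byte[] data) {
        return delegate.parse(data);
    }
}

class ProxyDemo {
    /** Stand-in for an OCR parser that declares too few types. */
    static SketchParser fakeOcr() {
        return new SketchParser() {
            @Override
            public Set<String> supportedTypes() {
                return new HashSet<>(Arrays.asList("image/png", "image/tiff"));
            }

            @Override
            public String parse(byte[] data) {
                return "ocr text for " + data.length + " bytes";
            }
        };
    }

    public static void main(String[] args) {
        SketchParser proxy = new WideningProxyParser(fakeOcr(), "image/jp2", "image/jpx");
        System.out.println(proxy.supportedTypes().contains("image/jpx"));
    }
}
```

        A proxy like this only changes which resources reach the delegate. As noted above, the images were handed off but still not OCRed, so widening the declaration does not by itself guarantee usable input.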

        tallison@mitre.org Tim Allison added a comment -

        Thank you. If you could share the stacktrace on this issue that you shared via Twitter, that'd help. One problem in the one test file (via Johan van der Knijff) I've looked at so far is that PDFBox's ImageIOUtil is not extracting any bytes for the embedded jp2, so Tika is identifying it as "octet-stream", which I think is what you saw in your stacktrace, no?
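        As a rough illustration of why an empty byte stream ends up as "octet-stream", here is a sketch of magic-byte detection with a generic fallback. This is not Tika's detector: `MagicSketch` and `detect` are hypothetical names, and the check is simplified. It tests the 12-byte JPEG 2000 signature box without reading the file-type brand, so it cannot tell JP2 from JPX, and it recognizes only the `P3`/`P6` markers for PPM.

```java
/** Simplified, hypothetical magic-byte detector (not Tika's detection code). */
class MagicSketch {

    // JPEG 2000 signature box: the first 12 bytes of a JP2-family file.
    static final byte[] JP2_SIGNATURE = {
        0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20, 0x0D, 0x0A, (byte) 0x87, 0x0A
    };

    /** Returns a MIME type guess, falling back to the generic binary type. */
    static String detect(byte[] data) {
        if (startsWith(data, JP2_SIGNATURE)) {
            // Simplification: JPX files share this signature and are not distinguished here.
            return "image/jp2";
        }
        if (data.length >= 2 && data[0] == 'P' && (data[1] == '3' || data[1] == '6')) {
            return "image/x-portable-pixmap";
        }
        // No recognizable magic, including the zero-byte case.
        return "application/octet-stream";
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[0]));
        System.out.println(detect(JP2_SIGNATURE));
    }
}
```

        With zero bytes extracted there is nothing to match against, so the fallback type is returned and a parser gated on image types never sees the resource.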

        mcaruanagalizia Matthew Caruana Galizia added a comment -

        Both on inline and independent files. I've renamed the issue and added PPM (image/x-portable-pixmap) to the list of formats that could be supported.

        tallison@mitre.org Tim Allison added a comment -

        Thank you for opening this. Will fix. To confirm, you're running OCR on extracted inline images? We added a new way to handle OCR of PDFs that uses PDFBox to generate a single image of each page and then runs OCR on that. I suspect that one strategy will be better for some PDFs and the other for others. Will document on our wiki shortly.


          People

          • Assignee:
            Unassigned
            Reporter:
            mcaruanagalizia Matthew Caruana Galizia
          • Votes:
            0
            Watchers:
            3
