TIKA-1948

Catch exceptions per page in PDFParser

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.13
    • Component/s: None
    • Labels: None

      Description

      In a discussion with Tilman Hausherr (I don't recall where), I think he observed that we aren't doing a try/catch for each page. If an exception occurs on an early page, it might still be possible to extract text from later pages of a problematic PDF.

      With very minimal modifications we could add a try/catch per page, store the caught exceptions, and then throw the first caught exception after the parse finishes.
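A minimal sketch of that idea, not the actual PDF2XHTML change: each page is read inside its own try/catch, failures are stored, and the first one is rethrown only after every page has been tried. The `PerPageExtraction` class and the `PageReader` interface are hypothetical stand-ins for whatever reads a single page.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these names do not come from Tika or PDFBox.
class PerPageExtraction {

    /** Stand-in for whatever extracts the text of one page (assumption). */
    interface PageReader {
        String readPage(int pageIndex) throws IOException;
    }

    /**
     * Appends the text of every readable page to out. A failing page is
     * skipped and its exception is stored; once all pages have been tried,
     * the first stored exception is rethrown.
     */
    static void extractAll(PageReader reader, int pageCount, StringBuilder out)
            throws IOException {
        List<IOException> caught = new ArrayList<>();
        for (int i = 0; i < pageCount; i++) {
            try {
                out.append(reader.readPage(i)).append('\n');
            } catch (IOException e) {
                caught.add(e); // remember the failure, keep going
            }
        }
        if (!caught.isEmpty()) {
            throw caught.get(0); // surface the first problem after the parse
        }
    }
}
```

A caller that passes its own StringBuilder keeps whatever text was appended before the exception reaches it, so content from the good pages is not lost.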

        Activity

        tallison@mitre.org Tim Allison added a comment -

        I ran this proposed change against ~325k PDFs. The gains were very modest, with no downsides: we extracted more content (and it looked like good content, not mojibake) from 40 files. Any objections to adding this?

        nicholasc Nick C added a comment -

        This is a good change. I had a patch to do something similar, but it only ignored two errors. I wish there were a way to set an option, or have a handler set in ParserContext, to determine whether the error should be thrown or ignored entirely.

        tallison@mitre.org Tim Allison added a comment -

        Behavior is configurable via PDFParserConfig. The default is the more robust option; users will have to change the config to get the old behavior.
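The comment does not name the setting, so the following is only a rough sketch of how such a switch might work. `PageErrorConfig`, its `catchPerPageExceptions` flag, and `ConfigurablePageLoop` are hypothetical names, not the real PDFParserConfig API; the flag defaults to the robust behavior, as described above.

```java
import java.io.IOException;

// Hypothetical sketch; names are illustrative, not Tika's actual classes.
class ConfigurablePageLoop {

    /** Stand-in for the real config class (assumption). */
    static class PageErrorConfig {
        // Assumed default: keep going past bad pages.
        private boolean catchPerPageExceptions = true;

        boolean isCatchPerPageExceptions() {
            return catchPerPageExceptions;
        }

        void setCatchPerPageExceptions(boolean value) {
            this.catchPerPageExceptions = value;
        }
    }

    /** Stand-in for whatever extracts the text of one page (assumption). */
    interface PageReader {
        String readPage(int pageIndex) throws IOException;
    }

    static void extractAll(PageReader reader, int pageCount,
                           PageErrorConfig config, StringBuilder out)
            throws IOException {
        IOException first = null;
        for (int i = 0; i < pageCount; i++) {
            try {
                out.append(reader.readPage(i)).append('\n');
            } catch (IOException e) {
                if (!config.isCatchPerPageExceptions()) {
                    throw e; // old behavior: stop at the first bad page
                }
                if (first == null) {
                    first = e; // robust behavior: remember it and continue
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}
```

With the flag left at its default, pages after a failure are still extracted; setting it to false restores the stop-at-first-error behavior.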

        hudson Hudson added a comment -

        FAILURE: Integrated in tika-trunk-jdk1.7 #951 (See https://builds.apache.org/job/tika-trunk-jdk1.7/951/)
        TIKA-1948 – handle per page IOExceptions more robustly in PDFParser (tallison: rev b4404c33f641d14507c5a13cc0b0f5e7c2cffab1)

        • tika-parsers/src/test/resources/test-documents/testPDF_bad_page_303226.pdf
        • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
        • tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
        • tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
        • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-2.x #81 (See https://builds.apache.org/job/tika-2.x/81/)
        TIKA-1948 handle IOExceptions per page in PDFParser (tallison: rev 2e2b96a1b940d6e4de649931b337ab290e8504e2)

        • tika-parser-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
        • CHANGES.txt
        • tika-parser-modules/tika-parser-pdf-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
        • tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
        • tika-test-resources/src/test/resources/test-documents/testPDF_bad_page_303226.pdf
        • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
        • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
        hudson Hudson added a comment -

        FAILURE: Integrated in tika-trunk-jdk1.7 #952 (See https://builds.apache.org/job/tika-trunk-jdk1.7/952/)
        TIKA-1948...not sure why these weren't comitted..argh. (tallison: rev e032ac61996be58cba306a69165b3986e7b4256a)

        • CHANGES.txt
        • tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java

          People

          • Assignee: tallison@mitre.org Tim Allison
          • Reporter: tallison@mitre.org Tim Allison
          • Votes: 2
          • Watchers: 4
