Tika / TIKA-1588

Upgrade to PDFBox 1.8.10 when available

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: parser
    • Labels: None

      Description

      Let's use this ticket to discuss/prepare for the release and integration of PDFBox 1.8.10 when it is available.

          Activity

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #798 (See https://builds.apache.org/job/tika-trunk-jdk1.7/798/)
          TIKA-1588 upgrade to PDFBox 1.8.10 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1692341)

          • /tika/trunk/CHANGES.txt
          • /tika/trunk/tika-parsers/pom.xml
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          tallison@mitre.org Tim Allison added a comment -

          r1692341

          tallison@mitre.org Tim Allison added a comment -

          Interesting. This must be another case of the multithreading nondeterminism caused by the static caching of fonts in 1.8.x. It may also explain some of the apparent differences in the recent NaN comparison I ran.

          Sorry to waste your time!

          tilman Tilman Hausherr added a comment -

          The weird thing is that I can't find any differences with ExtractText and default settings. "respondæ" appears in both extractions. "æ" is an arrow in the PDF.

          tallison@mitre.org Tim Allison added a comment - - edited

          Attached is the current version of the reports comparing PDFBox 1.8.9 with PDFBox 1.8.10 on the PDFs in govdocs1.

          Overall takeaway: no new exceptions, no fixed exceptions.

          Without looking carefully at the files, it looks like there is a slight improvement in 005937.pdf and 722558.pdf. There might be a very small regression in 167853.pdf, where one instance of "respond" has become "respondæ".

          I realize now that I should try this again with the PDFBOX-2823 catch blocks removed...doh!
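
          The kind of comparison described above can be illustrated with a small token-level diff of two text extracts. This is only a sketch: the class name ExtractDiff, the whitespace tokenization, and the sample strings are assumptions made for illustration, not the tooling that produced the attached reports.

```java
import java.util.Map;
import java.util.TreeMap;

public class ExtractDiff {
    // Count whitespace-separated tokens in one text extract.
    static Map<String, Integer> tokenCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Tokens whose counts differ between two extracts.
    // Value = count in 'after' minus count in 'before'.
    static Map<String, Integer> diff(String before, String after) {
        Map<String, Integer> delta = new TreeMap<>(tokenCounts(after));
        tokenCounts(before).forEach((tok, n) -> delta.merge(tok, -n, Integer::sum));
        delta.values().removeIf(n -> n == 0);
        return delta;
    }

    public static void main(String[] args) {
        // Hypothetical extracts standing in for the output of two parser versions.
        String before = "please respond by mail or respond by phone";
        String after = "please respond by mail or respondæ by phone";
        System.out.println(diff(before, after));
    }
}
```

          A negative value marks a token that became rarer in the newer extract and a positive value marks one that appeared, so a single changed word shows up as a matched pair of entries.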

          tallison@mitre.org Tim Allison added a comment -

          We should be able to remove the catch blocks around dates once we upgrade to 1.8.10.
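
          For readers unfamiliar with the pattern, the sketch below shows what a defensive catch block around a date getter looks like. It assumes the getter can throw a checked exception on a malformed date string. DocInfo, addCreationDate, and the "created" key are hypothetical names for illustration only; this is not the actual PDFParser code.

```java
import java.io.IOException;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.LinkedHashMap;
import java.util.Map;

public class DateCatchSketch {
    // Hypothetical stand-in for a document-info object whose date getter may throw.
    interface DocInfo {
        Calendar getCreationDate() throws IOException;
    }

    // Defensive pattern: a malformed date must not abort the whole parse.
    static void addCreationDate(DocInfo info, Map<String, Calendar> metadata) {
        try {
            Calendar created = info.getCreationDate();
            if (created != null) {
                metadata.put("created", created);
            }
        } catch (IOException e) {
            // Invalid date string: skip this field and keep parsing.
        }
    }

    public static void main(String[] args) {
        Map<String, Calendar> metadata = new LinkedHashMap<>();
        // A getter that fails leaves the metadata untouched.
        addCreationDate(() -> { throw new IOException("bad date"); }, metadata);
        System.out.println(metadata.containsKey("created"));
        // A getter that succeeds populates the field.
        addCreationDate(() -> new GregorianCalendar(2015, Calendar.JULY, 23), metadata);
        System.out.println(metadata.containsKey("created"));
    }
}
```

          If the upgraded library handles malformed dates itself, the try/catch in a method like addCreationDate becomes dead code and can be dropped, which is the cleanup this comment anticipates.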


            People

            • Assignee:
              tallison@mitre.org Tim Allison
              Reporter:
              tallison@mitre.org Tim Allison