Tika / TIKA-1285

Upgrade to PDFBox 2.0.0 when available

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6
    • Fix Version/s: 1.13
    • Component/s: parser
    • Labels:
      None

      Description

      This issue tracks the fixes required to upgrade the PDFBox dependency to 2.0.0 Final once it is available, using PDFBox's daily builds until then.

      See TIKA-1268 comment.

      Relates to PDFBOX-1893

      1. pdfbox_reports_2_0_0_20150709.zip
        30 kB
        Tim Allison
      2. testPDF_childAttachments.pdf
        2.21 MB
        Tim Allison
      3. TIKA-1285_rev1641423.patch
        39 kB
        Jeremy Anderson
      4. TIKA-1285.patch
        33 kB
        Jeremy Anderson
      5. TIKA-1285v3.patch
        45 kB
        Tim Allison

          Activity

          rpialum Jeremy Anderson added a comment - - edited

          font.AdobeFontMetricParser (small change)

          pdf.PDF2XHTML (removal of xobject package, please review changes under TODO)

          rpialum Jeremy Anderson added a comment - - edited

          Updated patch to include fixes as of revision 1621674 on Sept 4th. Major fixes include syncing up to Snapshot of PDFBox post Jempbox replacement by XmpBox.

          XmpBox still requires some refinement to properly handle all of the XMP packages encountered by Tika's unit tests. Some of these cases have been commented out until DomXmpParser can resolve them.

          Issues have not yet been reported in the PDFBOX JIRA, as I'm not sure how to proceed with them. The common DomXmpParser issues encountered:

          • Invalid array definition, expecting Alt and found nothing [prefix=dc; name=title]
          • Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
          • No type defined for {http://ns.adobe.com/pdf/1.3/}Trapped
          • Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
          • Cannot find a definition for the namespace http://ns.adobe.com/lightroom/1.0/
          • xmp should start with a processing instruction

          Patch works in conjunction with PDFBOX-2318

          rpialum Jeremy Anderson added a comment -

          Updated patch to work with PDF & POI Snapshot builds as of revision 1641423 from 11/24/2014. Note that this is still a work in progress, and the refactoring around the core PDF changes can probably be done better, especially when loading encrypted PDFs. The patch should compile and pass unit tests.

          tallison@mitre.org Tim Allison added a comment -

          Unit test file that is crazily slow in the current version of 2.0.0. Note, too, that the tiff file is no longer extracted by ExtractImages.

          tallison@mitre.org Tim Allison added a comment -

          Thank you Jeremy Anderson for getting us started on this. I used your patch and made a few modifications. I'm attaching my current version. Several unit tests no longer pass. More work remains...

          tallison@mitre.org Tim Allison added a comment -

          First dump of stack traces in govdocs1 from the integration with pdfbox 2.0.0.

          Notes:
          I stopped the batch run early. This only covered ~50k pdfs.

          I forgot to turn on accesspermission checking. Some of the pdfs in here would normally have been skipped.

          I haven't reviewed any of the exceptions. They may be caused by code on the Tika side.

          jayesh_ag jayesh added a comment - - edited

          Any idea when PDFBox 2.0 will be included in the Tika binary?

          Thanks.

          tallison@mitre.org Tim Allison added a comment -

          Still hammering out some issues. If regression tests go well, I'd say a few weeks after PDFBox 2.0 is released. There's still quite a bit of important performance work going on in PDFBox.

          Are there specific features that 2.0 has that you need?

          jayesh_ag jayesh added a comment -

          org.apache.fontbox.ttf.TrueTypeFont initializeTable
          SEVERE: An error occured when reading table hmtx
          java.io.EOFException

          org.apache.fontbox.util.FontManager findTTFontname
          WARNING: Font not found: Verdana

          After searching, I found that the above errors and some others were fixed in PDFBox 2.0.
          Hence I was curious to know when that will be available in Tika.

          arkadyzalko Arkady Zalkowitsch added a comment -

          I've opened an issue that should be resolved once you upgrade PDFBox.
          https://issues.apache.org/jira/browse/PDFBOX-3004

          Good luck

          chengas123 Ben McCann added a comment -

          I expect a Pdfbox 2.0 RC soon. There are only 5 issues still open marked as Fix Version 2.0 - https://issues.apache.org/jira/browse/PDFBOX-2883?jql=project%20%3D%20PDFBOX%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%202.0.0%20ORDER%20BY%20priority%20DESC

          It'd probably be worth testing against the latest pdfbox again now to be able to give them a heads up if there are any issues we know of

          tallison@mitre.org Tim Allison added a comment -

          Completely agree. If I update the PDFBox 2.0 branch of Tika on my github site, would you be willing to run tests on your documents?

          chengas123 Ben McCann added a comment -

          Yeah, that'd be great

          tallison@mitre.org Tim Allison added a comment - - edited

          Thank you, Ben McCann! The more eyes we have on this the better for both projects.

          Updated working wrapper is available here. Some clean up remains...

          Arkady Zalkowitsch and jayesh, would you be willing to run this on your batches of docs and let us know what you find? Extra points if you can compare memory usage and time to parse vs. 1.8.10!

          Also extra points for running this with the extract embedded images parameter turned on.

          arkadyzalko Arkady Zalkowitsch added a comment -

          Ok, I will do this tomorrow. I have a project release today. =P
          Thanks a lot!

          tallison@mitre.org Tim Allison added a comment -

          No problem at all...I still need to run against our batch as well.

          chengas123 Ben McCann added a comment -

          I did a bunch of testing today. It works pretty much as well as 1.8 did. There was one issue which caused me some trouble which is that it seems to be inserting extraneous spaces. See https://issues.apache.org/jira/browse/PDFBOX-3019

          tallison@mitre.org Tim Allison added a comment -

          Thank you for testing the dev wrapper and PDFBox 2.0, and thank you for the comments over on github.

          Out of curiosity, what type of testing did you do? How many docs? How did you compare, etc?

          My sense is that my Linux vm is killing the batch process quite a bit more often with 2.0 than with 1.8.x...because of memory issues.

          What type of load were you running? Did you see any memory issues?

          tboehme Timo Boehme added a comment -

          Did you try the new memory usage settings? You can define a maximum amount of main memory for storing PDF streams; if more is required, a temporary file can be used (see load(File file, MemoryUsageSetting memUsageSetting) in PDDocument).

          tallison@mitre.org Tim Allison added a comment -

          Y, that's the first thing on my todo list on our wrapper – integrate the MemoryUsageSetting, which is very, very cool. I should have a chance to add that by the end of this week, and then we'll see.

          chengas123 Ben McCann added a comment -

          I didn't really do any load or memory testing. My testing was focused on accuracy of converting pdfs to text on a few hundred documents.

          tallison@mitre.org Tim Allison added a comment -

          Finished comparison of ~100k docs: here

          tallison@mitre.org Tim Allison added a comment -

          PDFBox 2.0.0 was released this morning. Will upgrade Tika over the next few days.

          hudson Hudson added a comment -

          UNSTABLE: Integrated in tika-trunk-jdk1.7 #931 (See https://builds.apache.org/job/tika-trunk-jdk1.7/931/)
          TIKA-1285 – upgrade to PDFBox 2.0.0 (tallison: rev 98eb56ec78f2e1d27de644f4f6647ea1cfbc930b)

          • tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          • tika-parsers/src/main/java/org/apache/tika/parser/font/TrueTypeParser.java
          • tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
          • tika-bundle/pom.xml
          • tika-parsers/pom.xml
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFEncodedStringDecoder.java
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
          • CHANGES.txt
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
          • tika-parsers/src/main/java/org/apache/tika/parser/font/AdobeFontMetricParser.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #932 (See https://builds.apache.org/job/tika-trunk-jdk1.7/932/)
          TIKA-1285 – upgrade to PDFBox 2.0.0 – for now turn off tests with (tallison: rev 9ebf066dd96783c952f4c2a37a2a02af2b0c5aa0)

          • tika-parsers/src/test/java/org/apache/tika/parser/image/ImageParserTest.java
          tallison@mitre.org Tim Allison added a comment -

          We'll see what Hudson says, but I just pushed the mods to Tika's 2.x branch as well.

          A few notes:
          1) XMPBox is currently designed to handle PDF/A. There were exceptions on roughly 40% of XMPs extracted from our test corpus. We'll stick with jempbox 1.8.x for now for XMP parsing. We may consider migrating to Adobe's xmpcore. If anyone wants to help make XMPBox more robust, that'd be a huge service. Ref: this email

          2) PDFBox 2.0 has gotten rid of the classic parser, and now all parsing is done by the non-sequential parser. In my opinion, the PDFBox devs put a tremendous amount of work into making this new parser quite robust. However, for truncated or other truly damaged files, users may have some luck with the classic parser in 1.8.x.

          3) PDFBox 2.0 no longer extracts tiff files. See this exchange, and consider adding the optional dependencies to handle Tiffs, jpeg2000 and ...

          Other than those major points, in my opinion, PDFBox 2.0.0 should fix quite a few issues and is far more robust for bidi documents.

          Many thanks to the PDFBox devs, especially Andreas Lehmkühler, Maruan Sahyoun and Tilman Hausherr, for their work on PDFBox and for their collaboration on the eval process. More work remains on the latter.

          tallison@mitre.org Tim Allison added a comment -

          Thank you to all who contributed on the Tika side, especially: Jeremy Anderson, Ben McCann and Arkady Zalkowitsch.

          Onward...

          tallison@mitre.org Tim Allison added a comment -

          Finally, tika-devs, for the sake of tests, I followed PDFBox's test-scope inclusion of imageio:

              <!-- Copied from PDFBox:
                 For legal reasons (incompatible license), jai-imageio-core is to be used
                 only in the tests and may not be distributed. See also LEGAL-195 -->
          
              <dependency>
                <groupId>com.github.jai-imageio</groupId>
                <artifactId>jai-imageio-core</artifactId>
                <version>1.3.1</version>
                <scope>test</scope>
              </dependency>
          

          If we don't want to include this even in the test scope, I'm happy taking it out. We'll have to modify a unit test or two, but it will be trivial.

          hudson Hudson added a comment -

          UNSTABLE: Integrated in tika-2.x #57 (See https://builds.apache.org/job/tika-2.x/57/)
          TIKA-1285 – upgrade PDFBox to 2.0.0 in 2.x (tallison: rev 7bc3eae94d79bbbf5dc50143c404af22c02446bc)

          • tika-parser-modules/tika-parser-pdf-module/pom.xml
          • tika-parser-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFEncodedStringDecoder.java
          • tika-bundle/pom.xml
          • tika-parser-modules/tika-parser-pdf-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
          • tika-parser-modules/pom.xml
          • tika-parser-modules/tika-parser-xmp-commons/pom.xml
          • tika-parser-bundles/tika-parser-pdf-bundle/pom.xml
          • tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/font/TrueTypeParser.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
          • tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/font/AdobeFontMetricParser.java
          • CHANGES.txt
          • tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/image/ImageParserTest.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          chengas123 Ben McCann added a comment -

          Thanks so much Tim! Do you know what Tika release this will be a part of?

          tallison@mitre.org Tim Allison added a comment -

          1.13...not sure of timeframe for that

          lfcnassif Luis Filipe Nassif added a comment -

          Hi Tim Allison

          Is there any magic/recommendation for using both PDFBox 2.0 and 1.8 in the same app? Running ExtractText externally? Is there a better way? I am still interested in parsing truncated and damaged PDF files...

          tallison@mitre.org Tim Allison added a comment -

          Y, I've been thinking about this, too. I wonder if we could shade/relocate PDFBox 1.8 ourselves, or perhaps ask our PDFBox colleagues to distribute a shaded+relocated 1.8 (o.a.pdfbox18...) that we could call with PDFParser18 or something.

          If we can get the shading to work, this would be a perfect use case for the back-off composite parser (still in planning stages)-- if there's an exception with PDFBox 2.0.0, retry with PDFBox 1.8.x.
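
The back-off idea described above can be sketched in plain Java. This is only an illustration, not Tika's composite parser (which is still in the planning stages): `BackOffDemo`, `TextExtractor` and `extractWithBackOff` are hypothetical names, and the two lambdas in `main` merely stand in for a PDFBox 2.0.0-based parser and a shaded PDFBox 1.8.x-based one.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class BackOffDemo {

    /** Hypothetical stand-in for a text extractor; this is NOT Tika's Parser interface. */
    interface TextExtractor {
        String extract(byte[] pdfBytes) throws Exception;
    }

    /**
     * Tries each extractor in order and returns the first successful result.
     * If all of them fail, the first exception is rethrown with the later
     * failures attached as suppressed exceptions.
     */
    static String extractWithBackOff(byte[] pdfBytes, List<TextExtractor> extractors)
            throws Exception {
        if (extractors.isEmpty()) {
            throw new IllegalArgumentException("at least one extractor is required");
        }
        Exception first = null;
        for (TextExtractor extractor : extractors) {
            try {
                return extractor.extract(pdfBytes);
            } catch (Exception e) {
                if (first == null) {
                    first = e;               // remember the primary failure
                } else {
                    first.addSuppressed(e);  // keep later failures for diagnostics
                }
            }
        }
        throw first;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the primary extractor fails, as it might on a truncated file.
        TextExtractor primary = bytes -> {
            throw new IOException("simulated failure on a truncated file");
        };
        // Assumption: the fallback extractor manages to recover something.
        TextExtractor fallback = bytes -> "recovered " + bytes.length + " bytes";

        System.out.println(
                extractWithBackOff(new byte[42], Arrays.asList(primary, fallback)));
    }
}
```

Keeping the first exception as the one that is rethrown means that a caller still sees the primary parser's failure if the fallback cannot help either.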

          lfcnassif Luis Filipe Nassif added a comment - - edited

          If the PDFBox team could distribute an o.a.pdfbox18 that would be great!

          jahewson John Hewson added a comment - - edited

          The parser and the rest of PDFBox are tightly coupled, so it's not possible to switch out the 2.0 parser for the 1.8 parser. You'd have to switch out the whole of PDFBox, which of course you could do if you wanted.

          jahewson John Hewson added a comment -

          It would be better to open JIRA issues for problem PDFs so that we can improve the 2.0 parser.

          tallison@mitre.org Tim Allison added a comment -

          Y, that's what I was thinking about doing with shading+relocating.

          tallison@mitre.org Tim Allison added a comment - - edited

          As I mentioned on the PDFBox dev list, I'm hesitant to waste your time by submitting issues for truncated files. If AR (Adobe Reader) can't parse it, I wouldn't expect PDFBox to have much luck.

          However, the classic parser in 1.8 was able to get some text+metadata out of some truncated files.

          If you go to my last pre-release-2.0.0 reports zip here: https://github.com/tballison/share/blob/master/pdfbox_comparisons/reports_pdfbox_2_0_20160310.zip?raw=true

          there's a file called textLostFromACausedByNewExceptionsInB.xlsx. That documents what text 1.8.11 (with the classic parser) was able to extract from files that 2.0.0 (with nonsequential parser) was not. Nearly all of the "new" exceptions in 2.0.0 were caused by truncated files.

          tallison@mitre.org Tim Allison added a comment -

          I opened TIKA-1912 to track this issue.


            People

            • Assignee:
              Unassigned
              Reporter:
              rpialum Jeremy Anderson
            • Votes:
              5
              Watchers:
              13
