PDFBox / PDFBOX-3058

Support TIKA Migration to PDFBox 2.0

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Text extraction
    • Labels:
      None

      Description

      This issue tracks fixes for problems that came up as part of TIKA-1285 (Upgrade to PDFBox 2.0.0 when available), mainly:

      • new exceptions compared to PDFBox 1.8.x
      • regressions in text extraction
      • lower quality text extraction

      Each task or bug arising from that should be tracked in its own issue.

        Attachments

        1. content_diffs-1.8-to-2.0.xlsx (1.52 MB, Tilman Hausherr)
        2. content_diffs-4.xlsx (3.14 MB, Tilman Hausherr)
        3. NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_1_8_10.json (2 kB, Tim Allison)
        4. NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_2_0.json (2 kB, Tim Allison)
        5. textLostFromACausedByNewExceptionsInB.zip (42 kB, Tim Allison)

          Issue Links

          1. COSArray.getObject() incorrect handling of indirect reference to COSNull (Sub-task, Closed, Tilman Hausherr)
          2. NPE in CFFParser.parseType1Dicts() (Sub-task, Closed, Tilman Hausherr)
          3. Text extraction fails with type 3 fonts (Sub-task, Closed, Tilman Hausherr)
          4. NPE in PDFStreamEngine.ShowText when no font set (Sub-task, Closed, Tilman Hausherr)
          5. java.io.IOException: Error: Unknown annotation type COSNull{} (Sub-task, Closed, Unassigned)
          6. Catalog cannot be found (Sub-task, Closed, Andreas Lehmkühler)
          7. Word concatenation in 2.0 not in 1.8 (Sub-task, Closed, Tilman Hausherr)
          8. Text extraction and height different in 2.0 (Sub-task, Closed, Tilman Hausherr)
          9. Null metadata in 2.0 in some files that had metadata in 1.8.10 with old parser (Sub-task, Closed, Tilman Hausherr)
          10. Avoid crazy /Length1 values in font descriptor (Sub-task, Closed, Tilman Hausherr)
          11. Text extraction partially garbled in this file, was OK in 1.8 (Sub-task, Closed, Tilman Hausherr)
          12. Text extraction garbled in this file, was OK in 1.8 (Sub-task, Closed, Tilman Hausherr)
          13. IndexOutOfBoundsException in PDFont.getWidth() (Sub-task, Closed, Tilman Hausherr)
          14. IndexOutOfBoundsException in PfbParser.parsePfb (Sub-task, Closed, Tilman Hausherr)
          15. NullPointerException in PDFStreamEngine.showText() (Sub-task, Closed, Tilman Hausherr)
          16. Text with vertical font not extracted correctly (Sub-task, Closed, Andreas Lehmkühler)
          17. Text extraction garbled in this file, was OK in 1.8 (Sub-task, Closed, Unassigned)
          18. Parsing fails when XRef stream object is 1 byte later (Sub-task, Closed, Andreas Lehmkühler)
          19. The trailer rebuild mechanism doesn't work (Sub-task, Closed, Andreas Lehmkühler)
          20. One 32kb truncated file causes OOM in 2.0.0-trunk (Sub-task, Closed, Andreas Lehmkühler)
          21. Rare new NPE in 2.0.0-trunk (Sub-task, Open, Unassigned)

              People

              • Assignee: Andreas Lehmkühler (lehmi)
              • Reporter: Maruan Sahyoun (msahyoun)
              • Votes: 0
              • Watchers: 4
