  1. PDFBox
  2. PDFBOX-3710

Text Stripper in 2.0 lost some texts - regression

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None

      Description

      After migrating our app from PDFBox 1.8 to 2.0, we noticed a regression: four lines of text have disappeared. These are three lines of text followed by a black bullet, plus the word "OVERALL", which appears above them in the table.

      Problematic PDF attached - highlight19.pdf_page1.pdf

      Also attached is the output of the DrawPrintTextLocations example:
      highlight19.pdf_page1-marked-1.png

      Notice that the Unicode values and the red and blue boxes are missing for the problematic text. The main problem is that these glyphs are absent from the textPositions parameter passed to the writeString function (line #275). In version 1.8 these characters ARE present, so our app could extract their positions and character codes without trouble.

      Also attached is a picture of the regression in our app: regression_in_blue.png. The blue boxes are drawn where text WAS present before and has since disappeared. (The purple boxes are OK and should be ignored.)
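
      For anyone trying to reproduce this, one way to list which lines went missing is to compare the lines extracted by each version. The following is only a minimal sketch using the Java standard library: the class name LostTextFinder, the method findLostLines and the sample strings are hypothetical, and it assumes the extracted text from each PDFBox version has already been split into lines by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox or of the reporter's app.
public class LostTextFinder {

    // Returns the lines that occur in `before` but are missing from `after`.
    // Both lists are treated as multisets, so a line that appeared twice
    // before and once after is reported once.
    static List<String> findLostLines(List<String> before, List<String> after) {
        Map<String, Integer> remaining = new HashMap<>();
        for (String line : after) {
            remaining.merge(line.trim(), 1, Integer::sum);
        }
        List<String> lost = new ArrayList<>();
        for (String line : before) {
            String key = line.trim();
            if (key.isEmpty()) {
                continue; // ignore blank lines
            }
            int count = remaining.getOrDefault(key, 0);
            if (count > 0) {
                remaining.put(key, count - 1); // matched, consume one occurrence
            } else {
                lost.add(key); // no counterpart in the newer output
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        // Made-up sample data; NOT the contents of highlight19.pdf_page1.pdf.
        List<String> from18 = Arrays.asList("OVERALL", "Header", "first item", "second item");
        List<String> from20 = Arrays.asList("Header");
        System.out.println(findLostLines(from18, from20));
    }
}
```

      The comparison is by exact text after trimming, so it only flags lines that vanish entirely; it does not detect text that moved or changed position.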

        Attachments

        1. highlight19.pdf_page1.pdf
          168 kB
          Roman
        2. highlight19.pdf_page1-marked-1.png
          617 kB
          Roman
        3. regression_in_blue.png
          211 kB
          Roman

            People

            • Assignee:
              Unassigned
              Reporter:
              rmakarov Roman
            • Votes:
              0
              Watchers:
              2
