PDFBox / PDFBOX-3710

Text stripper in 2.0 loses some text - regression

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      After migrating our App from pdfbox 1.8 to 2.0, we noticed a regression: four lines of text have disappeared. These are the three lines of text that follow a black bullet, plus the word "OVERALL", which is placed above them in the table.

      The problematic PDF is attached: highlight19.pdf_page1.pdf

      Also attached is the output of the DrawPrintTextLocations example: highlight19.pdf_page1-marked-1.png

      Notice that the Unicode values and the red and blue boxes are missing for the problematic text. The main problem is that these glyphs are absent from the textPositions parameter that is passed to the writeString function (line #275). In version 1.8 these characters ARE present, so our App could extract their positions and character codes without trouble.

      Also attached is a picture of the regression in our App: regression_in_blue.png. The blue boxes are drawn where text WAS present before and has since disappeared. (The purple boxes are OK and should be ignored.)
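
      For anyone who wants to reproduce the comparison, here is a minimal sketch of how the disappeared glyphs could be listed. It assumes the glyphs extracted under each version (character plus position) have already been collected into two lists. MissingGlyphs and Glyph are hypothetical names made up for this illustration, not PDFBox classes, and rounding the coordinates to one decimal is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: lists glyphs seen by one extraction run but not by another. */
final class MissingGlyphs {

    /** Hypothetical value type: one extracted glyph and its position on the page. */
    static final class Glyph {
        final String unicode;
        final float x;
        final float y;

        Glyph(String unicode, float x, float y) {
            this.unicode = unicode;
            this.x = x;
            this.y = y;
        }

        /** Comparison key; coordinates are rounded to one decimal (an assumption). */
        String key() {
            return String.format(Locale.ROOT, "%s@%.1f,%.1f", unicode, x, y);
        }
    }

    /** Returns the glyphs of 'before' whose key does not occur in 'after', in order. */
    static List<Glyph> missing(List<Glyph> before, List<Glyph> after) {
        Set<String> seen = new HashSet<>();
        for (Glyph g : after) {
            seen.add(g.key());
        }
        List<Glyph> result = new ArrayList<>();
        for (Glyph g : before) {
            if (!seen.contains(g.key())) {
                result.add(g);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the 1.8 and 2.0 extraction results.
        List<Glyph> from18 = Arrays.asList(
                new Glyph("O", 72.0f, 100.0f),
                new Glyph("K", 80.0f, 100.0f));
        List<Glyph> from20 = Arrays.asList(
                new Glyph("K", 80.0f, 100.0f));

        for (Glyph g : missing(from18, from20)) {
            System.out.println("missing in 2.0: " + g.key());
        }
    }
}
```

      The two input lists would be filled from whatever each version passes to writeString. That collection step is left out here because it depends on the PDFBox version in use.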

      Attachments

        1. regression_in_blue.png
          211 kB
          Roman
        2. highlight19.pdf_page1-marked-1.png
          617 kB
          Roman
        3. highlight19.pdf_page1.pdf
          168 kB
          Roman

          People

            Assignee: Unassigned
            Reporter: Roman (rmakarov)
            Votes: 0
            Watchers: 2