PDFBox / PDFBOX-5213

PDFTextStripper inserts a line break after superscript values (regression)


    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.22, 2.0.23, 2.0.24
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:

      Description

      Since version 2.0.22, PDFTextStripper inserts a line break after superscript values.

      Previously (2.0.21 and earlier), the text was extracted as one line:

      "Other (12) 1,505 832"

      Now it is split across two lines:

      "Other (12)
      1,505 832"
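
      To confirm that the two outputs above differ only by the inserted line break and not by any other character, a check along these lines could be used. This is a minimal sketch using only the Java standard library; the class and method names are hypothetical and not part of PDFBox.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox.
final class WhitespaceCheck {
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Collapses every run of whitespace (including line breaks) to one space.
    static String normalize(String text) {
        return WHITESPACE_RUN.matcher(text).replaceAll(" ").trim();
    }

    // True when the texts are not equal as-is but become equal
    // once whitespace differences are ignored.
    static boolean differOnlyInWhitespace(String expected, String actual) {
        return !expected.equals(actual)
                && normalize(expected).equals(normalize(actual));
    }
}
```

      Calling WhitespaceCheck.differOnlyInWhitespace with the "earlier" and "now" strings from this report returns true, which matches the description of the regression as a pure line-break change.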

       

      The difference is visible when comparing the attached files GS-2010-q4-earnings.pdf_expected.html (output of 2.0.21 and earlier) and GS-2010-q4-earnings.pdf_result.html (output of 2.0.22 and later).
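
      When comparing two extraction results by hand is tedious, the first diverging line can be located programmatically. The following is a rough sketch using only the Java standard library; FirstDivergence and firstDifferingLine are hypothetical names, and the inputs are assumed to be the two extracted texts already loaded as strings.

```java
// Hypothetical helper for illustration; not part of PDFBox.
final class FirstDivergence {
    // Returns the zero-based index of the first line that differs between
    // the two texts, or -1 if both consist of the same lines.
    static int firstDifferingLine(String expected, String actual) {
        // "\\R" matches any line break sequence; the -1 limit keeps
        // trailing empty lines so they take part in the comparison.
        String[] expectedLines = expected.split("\\R", -1);
        String[] actualLines = actual.split("\\R", -1);
        int common = Math.min(expectedLines.length, actualLines.length);
        for (int i = 0; i < common; i++) {
            if (!expectedLines[i].equals(actualLines[i])) {
                return i;
            }
        }
        // All shared lines match: the texts differ only if one is longer.
        return expectedLines.length == actualLines.length ? -1 : common;
    }
}
```

      For a text where "Other (12) 1,505 832" is the second line, the method returns 1 when the other text breaks that line after "(12)".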

       

      If I use the latest PDFBox version (2.0.24) but replace its PDFTextStripper with the PDFTextStripper source from 2.0.21, the issue does not occur. So the regression is confined to PDFTextStripper.

       

      To reproduce, use the following code (adapted from the PDFBox examples), where pageBytes holds the contents of GS-2010-q4-earnings.pdf and limit is the maximum number of pages to process:

      List<String> pages = new ArrayList<>();

      PDDocument pdDocument = null;
      try {
          String pass = "";
          PDFParser parser = new PDFParser(new RandomAccessReadBuffer(pageBytes), pass);
          pdDocument = parser.parse();

          // Process at most `limit` pages.
          int numberOfPages = pdDocument.getNumberOfPages();
          if (limit < numberOfPages) {
              numberOfPages = limit;
          }

          // Extract each page separately; PDFTextStripper page numbers are 1-based.
          for (int i = 0; i < numberOfPages; i++) {
              PDFTextStripper stripper = new PDFTextStripper();
              stripper.setStartPage(i + 1);
              stripper.setEndPage(i + 1);
              pages.add(stripper.getText(pdDocument));
          }
      } catch (Exception e) {
          log.error(e.getMessage(), e);
      } finally {
          if (pdDocument != null) {
              try {
                  pdDocument.close();
              } catch (IOException e) {
                  log.error(e.getMessage(), e);
              }
          }
      }

       

       

       

        Attachments

        1. image-2021-06-14-14-50-08-236.png
          48 kB
          Vladimir
        2. GS-2010-q4-earnings.pdf_result.html
          48 kB
          Vladimir
        3. GS-2010-q4-earnings.pdf_expected.html
          48 kB
          Vladimir
        4. GS-2010-q4-earnings.pdf
          230 kB
          Vladimir


              People

              • Assignee: Unassigned
              • Reporter: Postrigan Vladimir
              • Votes: 0
              • Watchers: 3
