PDFBox / PDFBOX-5213

PDFTextStripper adds next line symbol after sup values (regression)

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.22, 2.0.23, 2.0.24
    • Fix Version/s: None
    • Component/s: Text extraction

    Description

      Since version 2.0.22, PDFTextStripper inserts a line break after superscript values.

      Previously:

      "Other (12) 1,505 832"

      Now:

      "Other (12)
      1,505 832"
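
A minimal sketch of how this could be checked in a test, assuming the page text has already been extracted into a string. The class and method names here are hypothetical and are not part of PDFBox; the helper only tests whether the expected row appears in full on a single line.

```java
import java.util.Arrays;

class SplitRowCheck {
    // Hypothetical helper (not a PDFBox API): returns true if expectedRow
    // appears in full on one line of the extracted page text.
    static boolean rowOnSingleLine(String pageText, String expectedRow) {
        return Arrays.stream(pageText.split("\\R"))   // split on any line break
                .map(String::trim)
                .anyMatch(line -> line.contains(expectedRow));
    }

    public static void main(String[] args) {
        String row = "Other (12) 1,505 832";
        String before = "Other (12) 1,505 832\n";     // output as in 2.0.21 and earlier
        String after = "Other (12)\n1,505 832\n";     // output as in 2.0.22 and later
        System.out.println(rowOnSingleLine(before, row)); // prints "true"
        System.out.println(rowOnSingleLine(after, row));  // prints "false"
    }
}
```

In a real test, the string would come from stripper.getText(pdDocument) for the affected page rather than from a literal.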

       

      You can see this by comparing the attached files GS-2010-q4-earnings.pdf_expected.html (output of 2.0.21 and earlier) and GS-2010-q4-earnings.pdf_result.html (output of 2.0.22 and later).
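
For a comparison like this, a small line-by-line diff can point at the first place where the two outputs diverge. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes both outputs have already been read into lists of lines.

```java
import java.util.List;

class ExtractionDiff {
    // Hypothetical helper: returns the index of the first line that differs
    // between the expected and actual output, or -1 if they are identical.
    static int firstDifference(List<String> expected, List<String> actual) {
        int common = Math.min(expected.size(), actual.size());
        for (int i = 0; i < common; i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return i;
            }
        }
        // All shared lines match; the outputs differ only if one is longer.
        return expected.size() == actual.size() ? -1 : common;
    }
}
```

With the example from this report, the first difference would be at the line where "Other (12) 1,505 832" becomes "Other (12)".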

       

      If I take the latest version of PDFBox (e.g. 2.0.24), copy in the PDFTextStripper code from 2.0.21, and use that instead, the issue does not appear. So the regression is confined to PDFTextStripper.

       

      To reproduce, you can use the following simple code (copied from your examples). pageBytes holds the contents of the file GS-2010-q4-earnings.pdf.

      // pageBytes, limit, and log are assumed to be defined by the surrounding code.
      List<String> pages = new ArrayList<>();

      PDDocument pdDocument = null;
      try {
          String pass = "";
          PDFParser parser = new PDFParser(new RandomAccessReadBuffer(pageBytes), pass);
          pdDocument = parser.parse();

          // Extract at most `limit` pages.
          int numberOfPages = pdDocument.getNumberOfPages();
          if (limit < numberOfPages) {
              numberOfPages = limit;
          }

          // Extract each page separately (page numbers are 1-based).
          for (int i = 0; i < numberOfPages; i++) {
              PDFTextStripper stripper = new PDFTextStripper();
              stripper.setStartPage(i + 1);
              stripper.setEndPage(i + 1);
              pages.add(stripper.getText(pdDocument));
          }
      } catch (Exception e) {
          log.error(e.getMessage(), e);
      } finally {
          if (pdDocument != null) {
              try {
                  pdDocument.close();
              } catch (IOException e) {
                  log.error(e.getMessage(), e);
              }
          }
      }

       

       

       

      Attachments

        1. GS-2010-q4-earnings.pdf (230 kB, Vladimir)
        2. GS-2010-q4-earnings.pdf_expected.html (48 kB, Vladimir)
        3. GS-2010-q4-earnings.pdf_result.html (48 kB, Vladimir)
        4. image-2021-06-14-14-50-08-236.png (48 kB, Vladimir)

            People

              Assignee: Unassigned
              Reporter: Postrigan Vladimir
              Votes: 1
              Watchers: 3
