Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 2.0.22, 2.0.23, 2.0.24
- Fix Version/s: None
Description
Since version 2.0.22, PDFTextStripper inserts a line break after superscript values.
Previously the extracted text was:
"Other (12) 1,505 832"
Now:
"Other (12)
1,505 832"
You can see this by comparing the files GS-2010-q4-earnings.pdf_expected.html (2.0.21 and earlier) and GS-2010-q4-earnings.pdf_result.html (2.0.22 and later).
If I take the latest PDFBox version (2.0.24), copy the PDFTextStripper code from 2.0.21 into it, and use that copy, the issue does not appear. So the regression is confined to PDFTextStripper.
To reproduce, you can use the following simple code (copied from your examples), where pageBytes holds the contents of GS-2010-q4-earnings.pdf:
List<String> pages = new ArrayList<>();
PDDocument pdDocument = null;
try {
    String pass = "";
    PDFParser parser = new PDFParser(new RandomAccessReadBuffer(pageBytes), pass);
    pdDocument = parser.parse();
    int numberOfPages = pdDocument.getNumberOfPages();
    if (limit < numberOfPages)
    // //
    for (int i = 0; i < numberOfPages; i++) {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(i + 1);
        stripper.setEndPage(i + 1);
        pages.add(stripper.getText(pdDocument));
    }
} catch (Exception e) {
    log.error(e.getMessage(), e);
} finally {
    if (pdDocument != null) {
        try {
            pdDocument.close();
        } catch (IOException e) {
            log.error(e.getMessage(), e);
        }
    }
}
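To confirm the regression without reading the two HTML files by eye, the text extracted by 2.0.21 can be compared with the text extracted by a later version. The following is a minimal sketch only: LineBreakRegressionCheck and findSplitLines are hypothetical names and not part of PDFBox. It assumes both outputs are already available as plain strings, and it only detects a line that was split into exactly two consecutive lines.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox.
public class LineBreakRegressionCheck {

    // Returns the lines of `expected` that are missing from `actual` as a
    // single line but reappear when two consecutive lines of `actual` are
    // joined with one space.
    public static List<String> findSplitLines(String expected, String actual) {
        String[] actualLines = actual.split("\\R");

        // Lines present in the new output as-is.
        Set<String> actualSet = new HashSet<>();
        for (String line : actualLines) {
            actualSet.add(line.trim());
        }

        // Every pair of consecutive lines in the new output, re-joined.
        Set<String> joinedPairs = new HashSet<>();
        for (int i = 0; i + 1 < actualLines.length; i++) {
            joinedPairs.add(actualLines[i].trim() + " " + actualLines[i + 1].trim());
        }

        List<String> split = new ArrayList<>();
        for (String line : expected.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()
                    && !actualSet.contains(trimmed)
                    && joinedPairs.contains(trimmed)) {
                split.add(trimmed);
            }
        }
        return split;
    }

    public static void main(String[] args) {
        String before = "Other (12) 1,505 832";   // output reported for 2.0.21
        String after = "Other (12)\n1,505 832";   // output reported for 2.0.22+
        System.out.println(findSplitLines(before, after));
        // prints [Other (12) 1,505 832]
    }
}
```

Running this over each entry of pages from both versions would list every row affected by the extra line break.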
Attachments
Issue Links
- is broken by PDFBOX-5002 PDFTextStripper sometimes fuses two words on different lines (Closed)