Details
Bug
Status: Closed
Major
Resolution: Fixed
1.2.1
None
Windows server, .NET
Description
Using text extraction on a file like this one, http://www.organictogo.com/pdf/catering/Menu_WA_032509.pdf, the text "THAI VEGGIE WRAP" (in all caps) is extracted as "ThAI VeGGIe wRAP".
However, examining the PDF shows that it looks like this: "Thai V eggi e Wrap". The related text on the following lines, such as "Crisp red cabbage, cucumbers, carrots and lettuce with Thai", parses just fine.
We are using this code to get the text in C#:
// Download the PDF and load it into a PDDocument.
byte[] pdfData = myWebClient.DownloadData(pdfUrl);
string text = string.Empty;
ByteArrayInputStream stream = new ByteArrayInputStream(pdfData);
PDDocument doc = PDDocument.load(stream);
try
{
    PDFTextStripper stripper = new PDFTextStripper();
    text = stripper.getText(doc);
}
finally
{
    doc.close(); // release the document even if extraction throws
}