[PDFBOX-846] TextExtraction mixes case of text - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.2.1
Fix Version/s: 1.3.1
Component/s: Text extraction
Labels: None
Environment: Windows server, .NET

Description

Using text extraction on a file like http://www.organictogo.com/pdf/catering/Menu_WA_032509.pdf, the text "THAI VEGGIE WRAP" (in all caps) is extracted as "ThAI VeGGIe wRAP". However, examining the PDF shows that the text looks like this: "Thai V eggi e Wrap". The related text on the following lines, such as "Crisp red cabbage, cucumbers, carrots and lettuce with Thai", is parsed just fine.

We are using this code to get the text in C#:

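// Download the PDF and load it into a PDFBox document.
byte[] pdfData = myWebClient.DownloadData(pdfUrl);
string text = string.Empty;

ByteArrayInputStream stream = new ByteArrayInputStream(pdfData);
PDDocument doc = PDDocument.load(stream);
try
{
    // Extract the text of the whole document.
    PDFTextStripper stripper = new PDFTextStripper();
    text = stripper.getText(doc);
}
finally
{
    // Release the document even if extraction throws.
    doc.close();
}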
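
To make the symptom easier to confirm, the following is a minimal diagnostic sketch in Java (the language PDFBox itself is written in). It is not part of PDFBox and not the fix for this issue: `CaseDiff`, `differsOnlyInCase`, and `caseMismatches` are hypothetical names introduced here for illustration. It only checks whether an extracted string matches the expected text letter for letter, apart from case, and reports where the case differs, using the two strings quoted in the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of PDFBox.
final class CaseDiff {

    // True when the strings are equal ignoring case but not equal exactly.
    static boolean differsOnlyInCase(String expected, String extracted) {
        return expected.equalsIgnoreCase(extracted) && !expected.equals(extracted);
    }

    // Indices at which both strings hold the same letter in a different case.
    static List<Integer> caseMismatches(String expected, String extracted) {
        List<Integer> positions = new ArrayList<>();
        int n = Math.min(expected.length(), extracted.length());
        for (int i = 0; i < n; i++) {
            char a = expected.charAt(i);
            char b = extracted.charAt(i);
            if (a != b && Character.toLowerCase(a) == Character.toLowerCase(b)) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String expected = "THAI VEGGIE WRAP";   // all-caps text from the description
        String extracted = "ThAI VeGGIe wRAP";  // extraction result reported in this issue
        System.out.println(differsOnlyInCase(expected, extracted)); // prints "true"
        System.out.println(caseMismatches(expected, extracted));    // prints "[1, 6, 10, 12]"
    }
}
```

For the strings in this report, the extracted text contains the right letters in the right order, and only the letters at positions 1, 6, 10, and 12 (`h`, `e`, `e`, `w`) come out in the wrong case.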

Attachments

PDFBOX846-Menu_WA_032509.pdf (400 kB), 16/Oct/10 17:41, Andreas Lehmkühler
PDFBOX846-Menu_WA_032509.txt (20 kB), 16/Oct/10 17:41, Andreas Lehmkühler

People

Assignee: Andreas Lehmkühler

Reporter: Mark Looi

Votes: 0

Watchers: 0

Dates

Created: 30/Sep/10 18:51

Updated: 26/Oct/10 09:34

Resolved: 16/Oct/10 17:45
