Details
- Type: Bug
- Status: Closed
- Resolution: Fixed
Description
[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1185058
Originally submitted by seblaunay on 2005-04-18 01:59.
First, thanks for this wonderful API.
I have a problem extracting text from a PDF document
shipped with Adobe Acrobat Reader: ENUtxt.pdf.
The PDF contains text in Chinese fonts, which cannot
be extracted.
However, it also contains the following text (as extracted with xpdf or
Acrobat Reader):
---------------------------------------
Lorem ipsum dolor
ad minim
---------------------------------------
The problem is that PDFTextStripper.writeText writes something like
this to my Writer:
---------------------------------------
-PSFNJQTVNEPMPS
BENJOJNWFSOJBNôH
---------------------------------------
Between these valid characters, there are also the following
invalid characters:
0x0, 0x1, 0x5, 0x6, 0x18.
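One possible reading of the garbled output, offered as an observation and not as a confirmed diagnosis: the first garbled line matches the expected text with every character code lowered by 0x1F, which turns each space (0x20) into 0x1, one of the invalid characters listed above. The sketch below only checks that arithmetic on the first sample line. The class and method names are hypothetical, and the cause of the offset is an assumption that the report does not confirm.

```java
// Hypothetical helper for checking the observed offset; not part of PDFBox.
final class CodeShiftCheck {
    private CodeShiftCheck() {}

    /** Adds delta to every UTF-16 code unit of text. */
    static String shift(String text, int delta) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            out.append((char) (text.charAt(i) + delta));
        }
        return out.toString();
    }

    /** Drops characters below 0x20, which a viewer typically does not display. */
    static String dropControls(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x20) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Expected text from the report, shifted down by 0x1F.
        String shifted = shift("Lorem ipsum dolor", -0x1F);
        // Prints -PSFNJQTVNEPMPS, the first garbled line from the report.
        System.out.println(dropControls(shifted));
    }
}
```

If this reading is right, the extracted values would be raw character codes from the font rather than Unicode text, but that remains a guess based on the single sample line.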
Because I write the document content to XML through SAX,
the resulting XML is not valid, since it contains
these invalid characters.
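Until the extraction itself is fixed, a caller could filter the extracted text before handing it to SAX. The following is a minimal sketch of such a workaround, assuming the output must satisfy the XML 1.0 `Char` production (0x9, 0xA, 0xD, 0x20 to 0xD7FF, 0xE000 to 0xFFFD, 0x10000 to 0x10FFFF). `XmlCharFilter` and its methods are hypothetical names, not part of PDFBox, and the positions of the control characters in the sample string are made up for illustration.

```java
// Hypothetical workaround: remove characters that XML 1.0 does not allow.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns text with every code point that is invalid in XML 1.0 removed. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input; where the control characters sit is an assumption.
        String extracted = "-PSFN\u0001JQTVN\u0001EPMPS\u0000\u0005\u0006\u0018";
        // Prints -PSFNJQTVNEPMPS
        System.out.println(stripInvalidXmlChars(extracted));
    }
}
```

This only keeps the XML well-formed. It does not recover the intended text, so the underlying extraction problem described above still applies.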
[attachment on SourceForge]
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1185058&file_id=130664
ENUtxt.pdf (application/pdf), 7582 bytes
The PDF used.
[comment on SourceForge]
Originally sent by seblaunay.
Document to test added.
Attachments
Issue Links
- depends upon PDFBOX-654 Extracting CJK text (Closed)