[PDFBOX-55] Invalid character while extracting text from a chinese pdf - ASF JIRA

Details

Type: Bug
Status: Closed
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1.0
Component/s: Text extraction
Labels: None

Description

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1185058
Originally submitted by seblaunay on 2005-04-18 01:59.

First, thanks for this wonderful API.
I have a problem extracting text from a PDF document
provided with Adobe Acrobat Reader: ENUtxt.pdf.
The PDF contains text in Chinese fonts, which cannot
be extracted.
However, it also contains the following text (as extracted
with xpdf or Acrobat Reader):
---------------------------------------
Lorem ipsum dolor
ad minim
---------------------------------------

The problem is that PDFTextStripper.writeText writes
something like this to my Writer:
---------------------------------------
-PSFNJQTVNEPMPS
BENJOJNWFSOJBNôH
---------------------------------------
Between these valid characters there are also the following
invalid characters:
0x0, 0x1, 0x5, 0x6, 0x18.
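
One thing that can be checked from the sample alone: every character of the garbled output equals the expected character with its code lowered by 0x1F, so a space (0x20) comes out as 0x1, one of the invalid characters listed. The sketch below only verifies that arithmetic on the sample text. It is an observation about this report, not a confirmed diagnosis of the underlying defect, and the class and method names are hypothetical.

```java
final class OffsetCheck {
    private OffsetCheck() {}

    /** Returns a copy of text with every UTF-16 code unit shifted by delta. */
    public static String shift(String text, int delta) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            out.append((char) (text.charAt(i) + delta));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String expected = "Lorem ipsum dolor";
        String shifted = shift(expected, -0x1F);
        // Print control characters as hex so they stay visible.
        for (char c : shifted.toCharArray()) {
            System.out.print(c < 0x20 ? String.format("<0x%X>", (int) c) : String.valueOf(c));
        }
        System.out.println(); // prints "-PSFN<0x1>JQTVN<0x1>EPMPS"
    }
}
```

With the control characters dropped, this matches the first garbled line quoted above.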

Because I write the content of the document to XML via SAX,
the resulting XML is not valid: it contains
these invalid characters.
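
Until the extraction itself is correct, a caller can at least keep the XML well-formed by dropping code points that XML 1.0 does not allow before passing the text to the SAX handler. The following is a minimal sketch of such a filter. It is a caller-side workaround, not the fix made in PDFBox, it does not recover the original text, and the class and method names are hypothetical.

```java
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    public static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns text with every code point that XML 1.0 forbids removed. */
    public static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample resembling the reported output, with control characters embedded.
        String extracted = "-PSFN\u0001JQTVN\u0001EPMPS\u0000\u0005\u0006\u0018";
        System.out.println(stripInvalidXmlChars(extracted)); // prints "-PSFNJQTVNEPMPS"
    }
}
```

The filter would be applied to each string just before it is handed to the SAX `characters` callback. Tab, line feed, and carriage return are kept because XML 1.0 permits them.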

[attachment on SourceForge]
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1185058&file_id=130664
ENUtxt.pdf (application/pdf), 7582 bytes
The PDF used

[comment on SourceForge]
Originally sent by seblaunay.
Logged In: YES
user_id=1261395

Test document added.

Attachments

PDFBOX55-ENUtxt.pdf, 10/Mar/10 18:34, 7 kB, Andreas Lehmkühler

Issue Links

depends upon: PDFBOX-654 Extracting CJK text (Closed)

People

Assignee: Unassigned
Reporter: Anonymous
Votes: 1
Watchers: 1

Dates

Created: 18/Apr/05 08:59
Updated: 30/Mar/10 08:23
Resolved: 10/Mar/10 18:37