Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-55

Invalid character while extracting text from a chinese pdf

    Details

    • Type: Bug
    • Status: Closed
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: Text extraction
    • Labels:
      None

      Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1185058
      Originally submitted by seblaunay on 2005-04-18 01:59.

      First, thanks for this wonderful api.
      I have a problem extracting text from a pdf document
      provided with adobe acrobat reader : ENUtxt.pdf.
      The pdf contains text with chinese fonts which cannot
      be extracted.
      But, it contains also this text (extract with xpdf or
      acrobat reader) :
      ---------------------------------------
      Lorem ipsum dolor
      ad minim
      ---------------------------------------

      The problem is i obtain on my Writer with
      PDFTextStripper.WriteText something like this :
      ---------------------------------------
      -PSFNJQTVNEPMPS
      BENJOJNWFSOJBNôH
      ---------------------------------------
      And between this valid characters, there are these
      invalid characters :
      0x0, 0x1, 0x5, 0x6, 0x18.

      Because, i sax the content of a document into a xml,
      the resulting xml is not valid because it contains
      invalid characters...

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1185058&file_id=130664
      ENUtxt.pdf (application/pdf), 7582 bytes
      The pdf used

      [comment on SourceForge]
      Originally sent by seblaunay.
      Logged In: YES
      user_id=1261395

      Document to test added.

        Attachments

        1. PDFBOX55-ENUtxt.pdf
          7 kB
          Andreas Lehmkühler

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                Anonymous
              • Votes:
                1 Vote for this issue
                Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: