  1. PDFBox
  2. PDFBOX-55

Invalid character while extracting text from a Chinese PDF

Details

    • Type: Bug
    • Status: Closed
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: Text extraction
    • Labels: None

    Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1185058
      Originally submitted by seblaunay on 2005-04-18 01:59.

      First, thanks for this wonderful API.
      I have a problem extracting text from a PDF document
      provided with Adobe Acrobat Reader: ENUtxt.pdf.
      The PDF contains text in Chinese fonts, which cannot
      be extracted.
      However, it also contains this text (as extracted with xpdf or
      Acrobat Reader):
      ---------------------------------------
      Lorem ipsum dolor
      ad minim
      ---------------------------------------

      The problem is that on my Writer, PDFTextStripper.writeText
      produces something like this:
      ---------------------------------------
      -PSFNJQTVNEPMPS
      BENJOJNWFSOJBNôH
      ---------------------------------------
      Between these valid characters, there are also these
      invalid characters:
      0x0, 0x1, 0x5, 0x6, 0x18.
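
The sample output above looks consistent with every character code being shifted down by a constant 31: 'o' (0x6F) appears as 'P' (0x50), 'L' (0x4C) as '-' (0x2D), and a space (0x20) would land on the reported 0x1. The Java sketch below only illustrates that observation. The class name, the method name, and the idea that the offset is a fixed 31 are assumptions drawn from the two sample lines, not a statement about how PDFBox decodes this font.

```java
// Hypothetical diagnostic helper: undoes the constant shift seen in the report.
final class ShiftProbe {
    // Assumed offset, inferred only from the sample lines in this report.
    static final int OFFSET = 31;

    private ShiftProbe() {}

    /** Adds OFFSET back to every UTF-16 code unit of the garbled text. */
    static String shiftBack(String garbled) {
        StringBuilder out = new StringBuilder(garbled.length());
        for (int i = 0; i < garbled.length(); i++) {
            out.append((char) (garbled.charAt(i) + OFFSET));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // First garbled line from the report; the control characters are not shown there.
        System.out.println(shiftBack("-PSFNJQTVNEPMPS"));
    }
}
```

If this reading is right, the stripper is emitting raw glyph or character identifiers instead of mapped Unicode values, which would also explain why control codes show up where spaces are expected.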

      Because I write the content of the document to XML through SAX,
      the resulting XML is not valid, since it contains
      these invalid characters.
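
As a stopgap on the caller's side, the extracted text could be filtered before it is handed to the SAX handler. The sketch below is a minimal example under that assumption: it keeps only code points allowed by the XML 1.0 `Char` production and drops everything else, including 0x0, 0x1, 0x5, 0x6 and 0x18. The class and method names are hypothetical and are not part of PDFBox. It hides the symptom only and does not repair the wrong text.

```java
// Hypothetical helper: removes characters that XML 1.0 does not allow.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the text with all invalid code points dropped. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two UTF-16 units depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

A caller would pass the string collected from the Writer through `stripInvalidXmlChars` before invoking `ContentHandler.characters`.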

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1185058&file_id=130664
      ENUtxt.pdf (application/pdf), 7582 bytes
      The PDF used

      [comment on SourceForge]
      Originally sent by seblaunay.

      Test document added.

      Attachments

        1. PDFBOX55-ENUtxt.pdf
          7 kB
          Andreas Lehmkühler

People

    • Assignee: Unassigned
    • Reporter: Anonymous