PDFBox / PDFBOX-398

Russian extraction encoding failure

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.7.3, 0.8.0-incubator
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: Windows XP 32-bit, CentOS 5.2 32-bit

    Description

      I am extracting text from Russian documents with PDFTextStripper, and some of them do not extract correctly.
      When I extract on Windows with UTF-8 encoding specified, the output is garbage.
      When I extract on Linux with any encoding, the output is garbage.
      The only way I can get usable output is to extract the PDF on Windows without specifying an encoding. The output then displays correctly in UltraEdit but not in Notepad; I can view it in Notepad only after converting the file to UTF-8 with iconv.
      It appears to me that the encoding isn't being read correctly from the PDF, and that when the text is
      written out as UTF-8 it is being double encoded or something similar. As a workaround, I can detect this double encoding,
      rerun the extraction with no encoding specified, and then convert the result to UTF-8 using iconv, and it is OK.
      However, this workaround does not work on Linux, because I cannot get the file to extract correctly with any encoding
      there.
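
The "double encoded" symptom described above can be illustrated with a minimal Java sketch that uses only the standard library. It is not PDFBox code and does not claim to reproduce what PDFTextStripper did internally. The class name `DoubleEncodingDemo`, its methods, and the choice of ISO-8859-1 as the charset the bytes are misread in are all assumptions made for illustration; the report does not say which charset was involved.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // Simulates the suspected fault: correct UTF-8 bytes are misread as
    // ISO-8859-1 (an assumed charset) and then encoded as UTF-8 a second time.
    static byte[] doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Undoes one layer of the fault: decode as UTF-8, map each character back
    // to the single byte it came from, then decode those bytes as UTF-8.
    static String repair(byte[] doubled) {
        String misread = new String(doubled, StandardCharsets.UTF_8);
        byte[] original = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A six-letter Russian word, written with escapes so the source file's
        // own encoding cannot affect the result.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";

        byte[] doubled = doubleEncode(russian);

        // 12 bytes of UTF-8 grow to 24, because every byte of a Cyrillic
        // letter is >= 0x80 and becomes a two-byte sequence when re-encoded.
        System.out.println(doubled.length);
        System.out.println(repair(doubled).equals(russian));
    }
}
```

If a garbled output file round-trips through `repair` into readable Cyrillic, that would support the double-encoding guess for that file. If it does not, a different intermediate charset or a different fault is more likely.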

      Attachments

        1. 7.pdf
          241 kB
          Adrian Romano
        2. garbage output.jpg
          118 kB
          Adrian Romano
        3. working output.jpg
          112 kB
          Adrian Romano

          People

            Assignee: Unassigned
            Reporter: Adrian Romano (romanoad)
