  1. PDFBox
  2. PDFBOX-1956

Wrong character on conversion PDF to TXT


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: 1.8.4
    • Fix Version/s: None
    • Component/s: Parsing
    • Environment: Windows

    Description

      I am trying to convert PDFs to TXT, and for some PDFs the extracted String contains wrong characters. Could this be a Unicode problem? Can somebody help me?

      I observed the problem when converting a PDF created by PDFCreator to text. The characters are wrong. Any suggestions?

      The code:

      import java.io.File;
      import java.io.FileInputStream;

      import org.apache.pdfbox.cos.COSDocument;
      import org.apache.pdfbox.pdfparser.PDFParser;
      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.pdmodel.PDDocumentInformation;
      import org.apache.pdfbox.util.PDFTextStripper;

      public class PDFTextParser {

          PDFParser parser;
          String parsedText;
          PDFTextStripper pdfStripper;
          PDDocument pdDoc;
          COSDocument cosDoc;
          PDDocumentInformation pdDocInfo;

          // PDFTextParser constructor
          public PDFTextParser() {
          }

          // Extract text from a PDF document; returns null on any failure
          public String pdftoText(String fileName) {
              System.out.println("Parsing text from PDF file " + fileName + "....");
              File f = new File(fileName);

              if (!f.isFile()) {
                  System.out.println("File " + fileName + " does not exist.");
                  return null;
              }

              try {
                  parser = new PDFParser(new FileInputStream(f));
              } catch (Exception e) {
                  System.out.println("Unable to open PDF Parser.");
                  return null;
              }

              try {
                  parser.parse();
                  cosDoc = parser.getDocument();
                  pdfStripper = new PDFTextStripper();
                  pdDoc = new PDDocument(cosDoc);
                  parsedText = pdfStripper.getText(pdDoc);
              } catch (Exception e) {
                  System.out.println("An exception occurred in parsing the PDF Document.");
                  e.printStackTrace();
                  return null;
              } finally {
                  // Release the documents on success as well as on failure
                  try {
                      if (cosDoc != null) cosDoc.close();
                      if (pdDoc != null) pdDoc.close();
                  } catch (Exception e1) {
                      e1.printStackTrace();
                  }
              }
              System.out.println("Done.");
              return parsedText;
          }
      }
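
To check whether the wrong characters are a Unicode issue, it may help to list the code points of the extracted String rather than judging by how a console or editor displays it. The following is only a minimal sketch using the Java standard library: the class name CodePointDump and the method describe are hypothetical and not part of PDFBox, and it assumes the String returned by pdftoText is passed in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: lists every code point of a string
// so that unexpected characters in extracted text are easy to spot.
final class CodePointDump {

    // Returns one entry per code point, e.g. "U+0041 LATIN CAPITAL LETTER A".
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // Character.getName returns null for unassigned code points
            String name = Character.getName(cp);
            out.add(String.format("U+%04X %s", cp, name == null ? "(unassigned)" : name));
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return out;
    }

    public static void main(String[] args) {
        // Pass the suspicious text as the first argument
        String sample = args.length > 0 ? args[0] : "";
        for (String line : describe(sample)) {
            System.out.println(line);
        }
    }
}
```

If the listed code points are not the ones expected for the visible glyphs, the mapping from glyphs to Unicode is probably going wrong somewhere before the String is returned; if they are correct, the display or the output file encoding is the more likely suspect.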

      Attachments

        1. example b.pdf
          52 kB
          Vicente
        2. itext_pdfabc-sample.pdf
          371 kB
          Vicente

          People

            Assignee: Unassigned
            Reporter: Vicente
            Votes: 0
            Watchers: 3
