Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-1039

Arabic Text Extraction using PDFTextStripper working partially

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Not A Problem
    • 1.5.0
    • None
    • Text extraction
    • Windows XP, Java 1.6

    Description

      I have been trying to extract the contents of PDF file (so as to index it with lucene). The PDF file contains arabic.

      Both PDF files contain the exact same information. The strange thing is PDFTextStripper extract data from one file correctly(gives proper arabic) but not from the other(gives complete question marks ???? or [][][][][] )

      Below is the code being used

      import java.io.File;
      import java.io.FileInputStream;
      import java.io.IOException;
      import org.apache.pdfbox.cos.COSDocument;
      import org.apache.pdfbox.pdfparser.PDFParser;
      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.util.PDFTextStripper;

      public class TesExtraction {

      // Extract text from PDF Document
      static String pdftoText(String fileName) {
      PDFParser parser;
      String parsedText = null;;
      PDFTextStripper pdfStripper = null;
      PDDocument pdDoc = null;
      COSDocument cosDoc = null;
      File file = new File(fileName);
      if (!file.isFile())

      { System.err.println("File " + fileName + " does not exist."); return null; }

      try

      { parser = new PDFParser(new FileInputStream(file)); }

      catch (IOException e)

      { System.err.println("Unable to open PDF Parser. " + e.getMessage()); return null; }

      try

      { parser.parse(); cosDoc = parser.getDocument(); pdfStripper = new PDFTextStripper("CP-1252"); pdDoc = new PDDocument(cosDoc); pdfStripper.setStartPage(1); pdfStripper.setEndPage(5); parsedText = pdfStripper.getText(pdDoc); }

      catch (Exception e)

      { System.err .println("An exception occured in parsing the PDF Document." + e.getMessage()); }

      finally {
      try

      { if (cosDoc != null) cosDoc.close(); if (pdDoc != null) pdDoc.close(); }

      catch (Exception e)

      { e.printStackTrace(); }

      }
      return parsedText;
      }
      public static void main(String args[])

      { System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestWord.pdf")); System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestPDFCreator.pdf")); }

      }

      NOTE: Where can I upload the pdf files ?

      Attachments

        1. TestWord.pdf
          62 kB
          Franklin
        2. TestPDFCreator.pdf
          15 kB
          Franklin

        Activity

          People

            lehmi Andreas Lehmkühler
            frankee787 Franklin
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 168h
                168h
                Remaining:
                Remaining Estimate - 168h
                168h
                Logged:
                Time Spent - Not Specified
                Not Specified