  1. PDFBox
  2. PDFBOX-1039

Arabic Text Extraction using PDFTextStripper working partially

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Environment:
      Windows XP, Java 1.6

      Description

      I have been trying to extract the contents of two PDF files (so as to index them with Lucene). The PDF files contain Arabic text.

      Both PDF files contain exactly the same information. Strangely, PDFTextStripper extracts the text from one file correctly (proper Arabic) but not from the other (only question marks ???? or [][][][][]).

      Below is the code being used:

      import java.io.File;
      import java.io.FileInputStream;
      import java.io.IOException;
      import org.apache.pdfbox.cos.COSDocument;
      import org.apache.pdfbox.pdfparser.PDFParser;
      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.util.PDFTextStripper;

      public class TesExtraction {

          // Extract text from pages 1 to 5 of a PDF document
          static String pdftoText(String fileName) {
              PDFParser parser;
              String parsedText = null;
              PDFTextStripper pdfStripper = null;
              PDDocument pdDoc = null;
              COSDocument cosDoc = null;
              File file = new File(fileName);
              if (!file.isFile()) {
                  System.err.println("File " + fileName + " does not exist.");
                  return null;
              }
              try {
                  parser = new PDFParser(new FileInputStream(file));
              } catch (IOException e) {
                  System.err.println("Unable to open PDF Parser. " + e.getMessage());
                  return null;
              }
              try {
                  parser.parse();
                  cosDoc = parser.getDocument();
                  // Encoding name passed to the stripper, exactly as used in this report
                  pdfStripper = new PDFTextStripper("CP-1252");
                  pdDoc = new PDDocument(cosDoc);
                  pdfStripper.setStartPage(1);
                  pdfStripper.setEndPage(5);
                  parsedText = pdfStripper.getText(pdDoc);
              } catch (Exception e) {
                  System.err.println("An exception occurred in parsing the PDF Document. " + e.getMessage());
              } finally {
                  try {
                      if (cosDoc != null) cosDoc.close();
                      if (pdDoc != null) pdDoc.close();
                  } catch (Exception e) {
                      e.printStackTrace();
                  }
              }
              return parsedText;
          }

          public static void main(String args[]) {
              System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestWord.pdf"));
              System.out.println(pdftoText("C:\\LuceneTest\\Data\\TestPDFCreator.pdf"));
          }
      }

      NOTE: Where can I upload the PDF files?
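
      The two outcomes described above (proper Arabic versus question marks or brackets) can also be told apart programmatically, which may help when comparing the output for each file. The following is only a minimal sketch and not part of the original report: the class name ExtractionCheck, its methods, and the choice of which characters count as placeholders ('?', '[', ']' and U+FFFD) are assumptions made for illustration. It uses only the standard Java library and does not diagnose why a given PDF fails to extract.

```java
public class ExtractionCheck {

    // True if the code point lies in one of the Unicode Arabic blocks
    // (basic Arabic or the two presentation-form blocks).
    static boolean isArabic(int cp) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
        return block == Character.UnicodeBlock.ARABIC
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // Assumption: these characters stand in for text that could not be extracted.
    static boolean isPlaceholder(int cp) {
        return cp == '?' || cp == '[' || cp == ']' || cp == 0xFFFD;
    }

    // Counts the Arabic code points in the given text.
    static int countArabic(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isArabic(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    // Counts the placeholder code points in the given text.
    static int countPlaceholders(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isPlaceholder(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    // Returns "ARABIC", "PLACEHOLDERS", or "NONE" for the extracted text.
    static String classify(String text) {
        if (text == null) {
            return "NONE";
        }
        int arabic = countArabic(text);
        int placeholders = countPlaceholders(text);
        if (arabic == 0 && placeholders == 0) {
            return "NONE";
        }
        return arabic > placeholders ? "ARABIC" : "PLACEHOLDERS";
    }

    public static void main(String[] args) {
        // Sample strings written with Unicode escapes so the source file encoding does not matter.
        System.out.println(classify("\u0645\u0631\u062D\u0628\u0627"));
        System.out.println(classify("?????"));
    }
}
```

      In the reporter's program, the string returned by pdftoText for each file could be passed to classify to confirm which file yields Arabic text and which yields only placeholders.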

        Attachments

        1. TestPDFCreator.pdf
          15 kB
          Franklin
        2. TestWord.pdf
          62 kB
          Franklin

            People

            • Assignee: Andreas Lehmkühler (lehmi)
            • Reporter: Franklin (frankee787)
            • Votes: 0
            • Watchers: 1

                Time Tracking

                • Original Estimate: 168h
                • Remaining Estimate: 168h
                • Time Spent: Not Specified