Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.5.0
Fix Version/s: None
Environment: Windows XP, Java 1.6
Description
I have been trying to extract the contents of two PDF files (so as to index them with Lucene). The PDF files contain Arabic text.
Both PDF files contain exactly the same information. The strange thing is that PDFTextStripper extracts the text from one file correctly (it gives proper Arabic) but not from the other (it gives only question marks ???? or [][][][][]).
Below is the code being used:
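import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// NOTE: the bodies of the if, try, finally and main blocks were lost when this
// was posted; the ones shown here are reconstructed from the declarations.
public class TesExtraction {
    // Extract text from PDF Document
    static String pdftoText(String fileName) {
        PDFParser parser;
        String parsedText = null;
        PDFTextStripper pdfStripper = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File file = new File(fileName);
        if (!file.isFile()) {
            System.err.println("File " + fileName + " does not exist.");
            return null;
        }
        try {
            parser = new PDFParser(new FileInputStream(file));
        } catch (IOException e) {
            System.err.println("Unable to open PDF Parser. " + e.getMessage());
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.err.println("An exception occurred in parsing the PDF Document." + e.getMessage());
        } finally {
            try {
                if (cosDoc != null) cosDoc.close();
                if (pdDoc != null) pdDoc.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return parsedText;
    }

    public static void main(String args[]) {
        // Assumed usage: path of the PDF as the first argument.
        System.out.println(pdftoText(args[0]));
    }
}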
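One check that might help narrow down the "????" output, assuming the question marks come from how the extracted string is written out rather than from PDFTextStripper itself: the sketch below uses only the JDK (no PDFBox) and shows that Arabic text pushed through a charset that cannot represent it turns into question marks, while UTF-8 keeps it intact. The class name EncodingCheck and the method roundTrip are hypothetical names chosen for this illustration. It does not account for the [][][][][] output, which may have a different cause.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of PDFBox: shows how Arabic text degrades
// when it passes through a charset that cannot represent it.
public class EncodingCheck {

    // Encode the text to bytes in the given charset, then decode it again.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // five Arabic letters
        // ISO-8859-1 has no Arabic letters, so each one is replaced by '?'.
        System.out.println(roundTrip(arabic, "ISO-8859-1").equals("?????"));
        // UTF-8 can represent every character, so the text survives intact.
        System.out.println(roundTrip(arabic, "UTF-8").equals(arabic));
    }
}
```

If the same effect is at work here, writing the result of pdftoText to a file through an OutputStreamWriter created with "UTF-8", instead of printing it to the console, would be one way to tell the two cases apart.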
NOTE: Where can I upload the PDF files?