Description
Hi,
I ran into parsing problems with Tika 1.1: some PDF, DOC and PPT files could not be parsed.
I then tried Tika 1.3, but some PDF, DOC and PPT files still fail to parse.
My code (Test.java):
import java.io.File;
import java.io.InputStream;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class Test {
    // File names with these extensions are routed through Boilerpipe.
    private static final String validBoilerpipeFilenameRegEx = ".*(\\.)(htm|html|shtml|php|asp|aspx)$";

    public String parseFile(File inFile) {
        if (inFile == null || !inFile.isFile() || !inFile.canRead()) return null;
        InputStream is = null;
        String outputText = "";
        try {
            // Open input stream
            is = new FileInputStream(inFile);
            // Prepare parser
            BodyContentHandler contenthandler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
            ParseContext pc = new ParseContext();
            // Call parse with boilerpipe if valid boilerpipe extension; otherwise, call regular parse.
            if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
                Parser parser = new AutoDetectParser();
                parser.parse(is, contenthandler, metadata, pc);
            } else {
                Parser parser = new HtmlParser();
                BoilerpipeContentHandler bh = new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
                parser.parse(is, bh, metadata, pc);
            }
            // Prepare text for write
            outputText = contenthandler.toString();
        } catch (Exception e) {
            System.out.println(e);
            return null;
        } finally {
            try {
                if (is != null) is.close();
            } catch (Exception e) {
                // Ignore failures while closing the stream
            }
        }
        return outputText;
    }
}
======
output:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@3a6ac461
url_4080_ETS11_TAGMatrix_rev070111.pdf
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2b03be0
url_2275_Paper26Pages253-269.pdf
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@4f9a32e0
url_5889_viz.96.pdf
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@4e513d61
url_1556_sensys_awoo03.pdf
org.apache.tika.exception.TikaException: Unable to extract PDF content
url_1763_approx-alg-notes.pdf
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@426295eb
url_5300_sudoku2.pdf?referrer=webcluster&
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7c2e1f1f
url_1441_ChoosingYourFirstCSCourse2011-FINAL.pdf
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7eda18ac
url_4272_20080218121324_723.pdf
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6f0ffb38
url_2491_2106_crime_scene.doc
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4cedf389
url_5227_Romano-Library%20Research%20Series%20-%20March%2029%202007%20FINAL(small).ppt
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6126f827
url_5250_linked%20list.ppt
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@3749eb9f
url_2011_undergrad-brochure.pdf
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3a289d2e
url_5709_final_presentation_bak.ppt
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@5ddc0e7a
url_5319_2011_2012_advising_guidelines.pdf
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@7dc5ddc9
url_3502_TheEvolvingRoleTech.pdf
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4963f7a1
url_2403_class_presentation_Btree.ppt
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7ba85d38
url_4040_fukunaga_jair07_bin.pdf
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6a8046f4
url_2472_COP3530OverheadsF99.doc
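The output above only shows the wrapper messages. As far as I can tell, "Unexpected RuntimeException from ..." and "TIKA-198: Illegal IOException from ..." are TikaExceptions that wrap the real failure as their cause, so the cause chain is probably what matters for diagnosing these files. Below is a minimal sketch of a helper that could be called from the catch block in parseFile to print it. The names RootCause, of and describe are hypothetical (not part of Tika), and the code relies only on Throwable.getCause() from the JDK:

```java
// Hypothetical helper, not part of Tika: reports the innermost cause of an exception.
final class RootCause {
    private RootCause() {}

    // Follows Throwable.getCause() until there is no further cause.
    static Throwable of(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // Builds one line: file name, wrapper message, then root cause class and message.
    static String describe(String fileName, Throwable t) {
        Throwable root = of(t);
        return fileName + ": " + t.getMessage()
                + " <- " + root.getClass().getName() + ": " + root.getMessage();
    }
}
```

In the catch block, System.out.println(RootCause.describe(inFile.getName(), e)); could stand in for System.out.println(e);. Calling e.printStackTrace() is a simpler alternative that prints the whole chain with stack frames.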
Thanks,
Qian