Tika / TIKA-1097

Not able to parse PDFs/DOCs/PPTs using the Tika 1.1 and 1.3 parsers


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 1.1, 1.3
    • Fix Version/s: 1.1, 1.3
    • Component/s: parser
    • Labels: None
    • Environment: Linux (Red Hat)

    Description

      Hi,

      I ran into parsing problems with Tika 1.1: some PDF, DOC, and PPT files could not be parsed.
      I then tried 1.3, but some PDF/DOC/PPT files still cannot be parsed.

      My code (Test.java):

      import java.io.File;
      import java.io.FileInputStream;
      import java.io.InputStream;

      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.parser.Parser;
      import org.apache.tika.parser.html.BoilerpipeContentHandler;
      import org.apache.tika.parser.html.HtmlParser;
      import org.apache.tika.sax.BodyContentHandler;

      import de.l3s.boilerpipe.extractors.ArticleExtractor;

      public class Test {

          // File names ending in one of these extensions are parsed as HTML through boilerpipe.
          private static final String validBoilerpipeFilenameRegEx = ".*(\\.)(htm|html|shtml|php|asp|aspx)$";

          public String parseFile(File inFile) {
              if (inFile == null || !inFile.isFile() || !inFile.canRead()) return null;

              InputStream is = null;
              String outputText = "";
              try {
                  // Open input stream
                  is = new FileInputStream(inFile);

                  // Prepare parser; a write limit of -1 disables the character limit
                  BodyContentHandler contenthandler = new BodyContentHandler(-1);
                  Metadata metadata = new Metadata();
                  metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
                  ParseContext pc = new ParseContext();

                  // Call parse with boilerpipe if valid boilerpipe extension; otherwise, call regular parse.
                  if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
                      Parser parser = new AutoDetectParser();
                      parser.parse(is, contenthandler, metadata, pc);
                  } else {
                      Parser parser = new HtmlParser();
                      BoilerpipeContentHandler bh = new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
                      parser.parse(is, bh, metadata, pc);
                  }

                  // Prepare text for write
                  outputText = contenthandler.toString();
              } catch (Exception e) {
                  // Prints only the outermost exception; the underlying cause is not shown
                  System.out.println(e);
                  return null;
              } finally {
                  try {
                      if (is != null) is.close();
                  } catch (Exception e) {
                      // Ignore failures while closing the stream
                  }
              }
              return outputText;
          }
      }
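
      The extension check in parseFile decides which parser a file is routed to. As a minimal standalone sketch, assuming the pattern was meant to read ".*(\\.)(htm|html|shtml|php|asp|aspx)$" (the escaped dot appears to have been lost when the code was pasted), the following shows how String.matches behaves on a few file names. The class name ExtensionCheck and the method usesBoilerpipe are hypothetical and used only for illustration.

```java
public class ExtensionCheck {

    // Assumed intended pattern: any name ending in a literal dot plus one of the listed extensions.
    static final String VALID_BOILERPIPE_FILENAME_REGEX = ".*(\\.)(htm|html|shtml|php|asp|aspx)$";

    // Returns true when the file name would be routed to the HTML/boilerpipe branch.
    static boolean usesBoilerpipe(String fileName) {
        // String.matches requires the whole name to match, and is case-sensitive here.
        return fileName.matches(VALID_BOILERPIPE_FILENAME_REGEX);
    }

    public static void main(String[] args) {
        String[] names = {
            "index.html",
            "page.PHP",
            "url_5889_viz.96.pdf",
            "url_5300_sudoku2.pdf?referrer=webcluster&"
        };
        for (String name : names) {
            System.out.println(name + " -> boilerpipe: " + usesBoilerpipe(name));
        }
    }
}
```

      Under this assumption, none of the failing PDF, DOC, or PPT names match, so they all take the AutoDetectParser branch. Matching is case-sensitive, so an upper-case extension such as ".PHP" would also fall through to that branch.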

      ======

      Output (each exception is followed by the name of the file that triggered it):

      org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@3a6ac461
      url_4080_ETS11_TAGMatrix_rev070111.pdf

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@2b03be0
      url_2275_Paper26Pages253-269.pdf

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@4f9a32e0
      url_5889_viz.96.pdf

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@4e513d61
      url_1556_sensys_awoo03.pdf

      org.apache.tika.exception.TikaException: Unable to extract PDF content
      url_1763_approx-alg-notes.pdf

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@426295eb
      url_5300_sudoku2.pdf?referrer=webcluster&

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7c2e1f1f
      url_1441_ChoosingYourFirstCSCourse2011-FINAL.pdf

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7eda18ac
      url_4272_20080218121324_723.pdf

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6f0ffb38
      url_2491_2106_crime_scene.doc

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4cedf389
      url_5227_Romano-Library%20Research%20Series%20-%20March%2029%202007%20FINAL(small).ppt

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6126f827
      url_5250_linked%20list.ppt

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@3749eb9f
      url_2011_undergrad-brochure.pdf

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3a289d2e
      url_5709_final_presentation_bak.ppt

      org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@5ddc0e7a
      url_5319_2011_2012_advising_guidelines.pdf

      org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@7dc5ddc9
      url_3502_TheEvolvingRoleTech.pdf

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4963f7a1
      url_2403_class_presentation_Btree.ppt

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@7ba85d38
      url_4040_fukunaga_jair07_bin.pdf

      org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6a8046f4
      url_2472_COP3530OverheadsF99.doc

      Thanks,

      Qian
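
      The output in the description shows only the outer TikaException messages, because the catch block prints the exception itself and not its cause. Wrapper messages such as "Unexpected RuntimeException from ..." normally carry the underlying failure as the exception's cause, so walking the cause chain is one way to see what actually went wrong for each file. The following is a minimal sketch using only the standard library; the class name RootCause and the stand-in exceptions in main are hypothetical, and real code would apply rootCause to the exception caught around parser.parse.

```java
public class RootCause {

    // Follows getCause() links until the innermost exception is reached.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for a wrapped parser failure; the messages here are illustrative only.
        Exception wrapped = new Exception(
                "Unexpected RuntimeException from parser",
                new IllegalStateException("hypothetical underlying failure"));

        System.out.println(wrapped);                            // what the original catch block prints
        System.out.println("Caused by: " + rootCause(wrapped)); // the innermost exception
    }
}
```

      Calling e.printStackTrace() in the catch block is a simpler alternative that prints the full chain, including the "Caused by" sections.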


          People

            Assignee: Unassigned
            Reporter: Qian Diao (qiandiao)