  1. Tika
  2. TIKA-1999

org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.14
    • Component/s: parser
    • Labels:
      None
    • Environment:

      Ubuntu 16.04 (64 bit)
      Oracle Java 1.8.0_91-b14 (64 bit)

      Description

      When trying to read the following PDF document:

      http://www.arcadiz.com/content/assets/Artikel_CloudWorks_Vernieuwingen_zorg_vragen_om_veel_snellere_verbindingen.pdf

      Tika crashes for me with a java.lang.StackOverflowError, caused by very deep recursion in:

          at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:58)
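
      To illustrate how a single repeated frame like the one above can exhaust the stack, here is a minimal sketch. It assumes the prefix lookup delegates to the parent element's scope whenever the prefix is not declared locally. The names Scope, getPrefixRecursive, getPrefixIterative and nest are hypothetical and are not Tika's actual source; the sketch only contrasts a recursive parent walk with an equivalent loop.

```java
import java.util.Collections;
import java.util.Map;

public class PrefixLookupSketch {

    /** Hypothetical stand-in for a per-element namespace scope (not Tika's class). */
    public static final class Scope {
        final Scope parent;
        final Map<String, String> prefixes; // namespace URI -> prefix declared on this element

        public Scope(Scope parent, Map<String, String> prefixes) {
            this.parent = parent;
            this.prefixes = prefixes;
        }

        /** Recursive lookup: uses one stack frame per ancestor that is visited. */
        public String getPrefixRecursive(String uri) {
            String prefix = prefixes.get(uri);
            if (prefix != null) {
                return prefix;
            }
            return parent != null ? parent.getPrefixRecursive(uri) : null;
        }

        /** Iterative lookup: same result, constant stack depth. */
        public String getPrefixIterative(String uri) {
            for (Scope s = this; s != null; s = s.parent) {
                String prefix = s.prefixes.get(uri);
                if (prefix != null) {
                    return prefix;
                }
            }
            return null;
        }
    }

    /** Builds a chain of `depth` scopes without declarations below `root`. */
    public static Scope nest(Scope root, int depth) {
        Scope current = root;
        for (int i = 0; i < depth; i++) {
            current = new Scope(current, Collections.<String, String>emptyMap());
        }
        return current;
    }

    public static void main(String[] args) {
        Scope root = new Scope(null, Collections.singletonMap("urn:example", "ex"));
        Scope deep = nest(root, 1_000_000);

        // The loop walks all ancestors without growing the call stack.
        System.out.println("iterative: " + deep.getPrefixIterative("urn:example"));

        try {
            System.out.println("recursive: " + deep.getPrefixRecursive("urn:example"));
        } catch (StackOverflowError e) {
            // Whether this is reached depends on the thread's stack size.
            System.out.println("recursive lookup overflowed the stack");
        }
    }
}
```

      Whether the recursive variant overflows depends on the nesting depth and on the JVM's thread stack size, which may also explain why the same document behaves differently in different setups.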
      

      For some reason, the Tika App doesn't exhibit this behavior, but the following minimal working example (MWE) exposes the issue for me:

      import java.io.ByteArrayOutputStream;
      import java.io.File;
      import java.io.FileInputStream;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.sax.ToHTMLContentHandler;
      
      public class test
      {
          public static void main(String[] args) throws Exception {
              // Local copy of the PDF linked above
              String p = "/home/eggie/faulty_pdf_document.pdf";
              
              AutoDetectParser tk = new AutoDetectParser();
              ByteArrayOutputStream os = new ByteArrayOutputStream();
              // ToHTMLContentHandler extends ToXMLContentHandler, which appears in the stack trace
              ToHTMLContentHandler handler = new ToHTMLContentHandler(os, "UTF-8");
              ParseContext pc = new ParseContext();
              System.out.println("Parsing");
              // try-with-resources closes the stream even if parsing fails
              try (FileInputStream input = new FileInputStream(new File(p))) {
                  tk.parse(input, handler, new Metadata(), pc);
              }
          }
      }
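
      One way to check whether an overflow like this comes from deep but finite recursion, rather than from an endless loop, is to rerun the work on a thread that requests a larger stack. The following is only a diagnostic sketch, not a fix: BigStackRunner, callWithStack and depth are hypothetical names, depth is a stand-in for the parse call, and the JVM treats the requested stack size as a hint that it may round or ignore.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class BigStackRunner {

    /** Runs `task` on a new thread that requests `stackBytes` of stack and returns its result. */
    public static <T> T callWithStack(Supplier<T> task, long stackBytes) {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();

        // Thread(ThreadGroup, Runnable, String, long): the last argument is the stack size hint.
        Thread worker = new Thread(null, () -> {
            try {
                result.set(task.get());
            } catch (Throwable t) { // also captures StackOverflowError
                failure.set(t);
            }
        }, "big-stack-worker", stackBytes);

        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        if (failure.get() != null) {
            throw new RuntimeException("task failed on worker thread", failure.get());
        }
        return result.get();
    }

    /** Stand-in for a deeply recursive call such as the lookup in the stack trace. */
    public static int depth(int n) {
        return n == 0 ? 0 : 1 + depth(n - 1);
    }

    public static void main(String[] args) {
        Integer reached = callWithStack(() -> depth(50_000), 256L * 1024 * 1024);
        System.out.println("reached depth " + reached);
    }
}
```

      To apply this to the MWE, the parse call would go inside the supplier, with its checked exceptions wrapped in an unchecked one. If the parse then completes, the recursion is deep but finite; if it still overflows, an unbounded loop is more likely.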
      

            People

            • Assignee:
              Tim Allison (tallison@apache.org)
            • Reporter:
              Egbert (MadEgg)
            • Votes:
              0
            • Watchers:
              3
