Tika / TIKA-2442

Non-terminal interactive form fields not handled recursively

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.17
    • Component/s: parser
    • Labels: None

    Description

      (I am not sure whether this is a Tika or a PDFBox problem. I looked for a form extractor in PDFBox, but the app API does not have one. PDFDebugger does show the expected tree structure.)

      The attached PDF has a non-terminal field named “parent” and two children, “child1” and “child2.” According to the PDF spec in section 8.6, the fully qualified field names should be parent.child1 and parent.child2. That is the output given by pdftk:

      > pdftk simple-form.pdf dump_data_fields

      FieldType: Text
      FieldName: parent.child1
      FieldFlags: 0
      FieldValue: child1 value
      FieldJustification: Left

      FieldType: Text
      FieldName: parent.child2
      FieldFlags: 0
      FieldValue: child2 value
      FieldJustification: Left

      Tika with the ToXMLContentHandler, however, seems to silently ignore the children and returns only the parent, with no value.

      Calling code:

      import java.io.FileInputStream;
      import java.io.InputStream;
      import org.apache.tika.detect.DefaultDetector;
      import org.apache.tika.detect.Detector;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.parser.Parser;
      import org.apache.tika.sax.ToXMLContentHandler;

      class ReadAsXHTML {
          // Parses the file with Tika's auto-detecting parser and returns the XHTML output.
          public static String readAsXHTML(String filename) throws Exception {
              ToXMLContentHandler handler = new ToXMLContentHandler();
              Detector detector = new DefaultDetector();
              Parser parser = new AutoDetectParser(detector);
              ParseContext context = new ParseContext();
              Metadata metadata = new Metadata();

              // try-with-resources closes the stream even if parsing fails
              try (InputStream in = new FileInputStream(filename)) {
                  parser.parse(in, handler, metadata, context);
                  return handler.toString();
              }
          }
      }

      Abbreviated output:

      <body><div class="page"><p />
      </div>
      <div class="acroform"><ol> <li>parent: </li>
      </ol>
      </div>
      </body>

      Expected:
      <body><div class="page"><p />
      </div>
      <div class="acroform"><ol>
      <li>parent.child1: child1 value</li>
      <li>parent.child2: child2 value</li>
      </ol>
      </div>
      </body>
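
      For illustration, the expected output corresponds to a recursive walk over the field tree that joins partial names with a period. The following is only a minimal sketch of that idea: FormField, collect and FieldTreeSketch are hypothetical names made up for this example, not PDFBox or Tika classes, and this is not the code Tika actually uses.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldTreeSketch {

    /** Hypothetical stand-in for a PDF form field; not a PDFBox or Tika class. */
    static final class FormField {
        final String partialName;
        final String value; // null for non-terminal fields
        final List<FormField> kids = new ArrayList<>();

        FormField(String partialName, String value) {
            this.partialName = partialName;
            this.value = value;
        }

        FormField add(FormField kid) {
            kids.add(kid);
            return this;
        }
    }

    /** Adds one "qualified.name: value" line per terminal field at or below the given field. */
    static void collect(FormField field, String prefix, List<String> out) {
        String name = prefix.isEmpty() ? field.partialName : prefix + "." + field.partialName;
        if (field.kids.isEmpty()) {
            // Terminal field: emit its fully qualified name and value.
            out.add(name + ": " + (field.value == null ? "" : field.value));
            return;
        }
        // Non-terminal field: descend into the children instead of stopping here.
        for (FormField kid : field.kids) {
            collect(kid, name, out);
        }
    }

    static List<String> collect(FormField root) {
        List<String> out = new ArrayList<>();
        collect(root, "", out);
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the structure of the attached simple-form.pdf as described above.
        FormField parent = new FormField("parent", null)
                .add(new FormField("child1", "child1 value"))
                .add(new FormField("child2", "child2 value"));
        for (String line : collect(parent)) {
            System.out.println(line);
        }
    }
}
```

      Run against the parent/child1/child2 structure described above, this prints "parent.child1: child1 value" and "parent.child2: child2 value", matching the pdftk field names.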

      Attachments

        1. simple-form.pdf
          2 kB
          Christopher Creutzig

            People

              Assignee: Unassigned
              Reporter: Christopher Creutzig (ccreutzig)
              Votes: 0
              Watchers: 4