Tika / TIKA-973

PDF form data isn't included in extracted content.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.5
    • Component/s: general
    • Labels:
      None

      Description

      When extracting content from PDFs, PDF form data isn't extracted.

      The following code extracts this data via PDFBox, but it seems like something Tika should be doing itself.

      // 'load' is presumably the already-opened PDDocument; 'documentContent' collects the extracted text.
      PDDocumentCatalog docCatalog = load.getDocumentCatalog();
      if (docCatalog != null) {
          PDAcroForm acroForm = docCatalog.getAcroForm();
          if (acroForm != null) {
              @SuppressWarnings("unchecked")
              List<PDField> fields = acroForm.getFields();
              if (fields != null && !fields.isEmpty()) {
                  documentContent.append(" ");
                  // Append the value of every top-level field that has one.
                  for (PDField field : fields) {
                      if (field.getValue() != null) {
                          documentContent.append(field.getValue());
                          documentContent.append(" ");
                      }
                  }
              }
          }
      }

      1. TIKA-973.patch.tar.gz
        524 kB
        Tim Allison
      2. i-9_screenshot.png
        227 kB
        Tim Allison
      3. TIKA-973-patch.tar.gz
        469 kB
        Tim Allison

        Activity

        Tim Allison added a comment -

        Will submit patch and tests by end of the week.

        Tim Allison added a comment -

        Patch attached. It dumps the contents of PDF forms at the end of the document.

        AcroForm field name metadata is emitted as attribute values. The basic format is an <ol>.

        Let me know how this looks.

        Thank you, Ben Litchfield, for org.apache.pdfbox.examples.fdf.PrintFields

        Nick Burch added a comment -

        The patch looks promising to me, but I don't know enough about PDF, so I've not been able to give it a thorough review.

        Let's give it a few days before applying, to give others a chance to offer feedback.

        One thing that might be good in the unit test is to check for data from each form in turn, so we cover more cases.

        Tim Allison added a comment -

        Agree on both. I would also appreciate feedback on what the output should be. The current code extracts this unseemly XHTML:

        <div class="acroform">
        <ol> <li partialName="form1[0]" fullName="form1[0]"/>
        <ol> <li partialName="#subform[6]" fullName="form1[0].#subform[6]"/>
        <li partialName="MiddleInitial[0]" fullName="form1[0].#subform[6].MiddleInitial[0]" altName="Enter Middle Initial (MI)">X</li>
        <li partialName="FamilyName[0]" fullName="form1[0].#subform[6].FamilyName[0]" altName="Section 1. Employee Information and Attestation. Family Name (Last Name)">Doe</li>
        <li partialName="GivenName[0]" fullName="form1[0].#subform[6].GivenName[0]" altName="Given Name (First Name)">John</li>
        <li partialName="OtherNamesUsed[0]" fullName="form1[0].#subform[6].OtherNamesUsed[0]" altName="Maiden Name">Mr. Doe</li>
        <li partialName="StreetNumberName[0]" fullName="form1[0].#subform[6].StreetNumberName[0]" altName=" Street Number and Name">123 Main St.</li>
        >

        ...

        Another idea I had was to include the partialName in the contents and not fill out the attrs:
        <li>StreetNumberName[0]: 123 Main St</li>

        More unit tests on the way...
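
        In the XHTML above, each fullName is the partial names of a field and its ancestors joined with dots. A minimal sketch of that composition, assuming a hypothetical Field class as a stand-in for the PDFBox field objects (it is not the PDFBox API and not the code in the patch):

```java
import java.util.ArrayList;
import java.util.List;

class FieldNames {
    /** Hypothetical stand-in for a form field: partial name, optional value, child fields. */
    static final class Field {
        final String partialName;
        final String value; // null for non-terminal fields such as subforms
        final List<Field> kids = new ArrayList<>();

        Field(String partialName, String value) {
            this.partialName = partialName;
            this.value = value;
        }

        Field add(Field kid) {
            kids.add(kid);
            return this;
        }
    }

    /** Returns "fullName=value" for every field that has a value, depth first. */
    static List<String> flatten(Field root) {
        List<String> out = new ArrayList<>();
        walk(root, "", out);
        return out;
    }

    private static void walk(Field field, String parentName, List<String> out) {
        // The full name is the parent's full name plus this field's partial name.
        String fullName = parentName.isEmpty()
                ? field.partialName
                : parentName + "." + field.partialName;
        if (field.value != null) {
            out.add(fullName + "=" + field.value);
        }
        for (Field kid : field.kids) {
            walk(kid, fullName, out);
        }
    }

    public static void main(String[] args) {
        Field form = new Field("form1[0]", null)
                .add(new Field("#subform[6]", null)
                        .add(new Field("GivenName[0]", "John"))
                        .add(new Field("FamilyName[0]", "Doe")));
        for (String line : flatten(form)) {
            System.out.println(line);
        }
    }
}
```

        Unlike the loop in the description, which only reads top-level fields, this walk descends into child fields, which is what nested output like the above requires.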

        Nick Burch added a comment -

        To make reviewing easier, it might be handy if you could upload a PNG screenshot of one of these forms, so it's quick to view that alongside the suggested HTML.

        I'd be minded to go for something like:
        <li title="Street Number and Name">StreetNumberName[0]: 123 Main St</li>

        So we'd have the alt name, the partial name, and the value, but not the full name (though we would have the form/subform name elsewhere).

        Tim Allison added a comment -

        Screenshot attached. Thanks again to: http://benlitchfield.sys-con.com/node/48543?page=0,1 for the code example and example doc.

        The middle ground that you recommend makes sense.

        Tim Allison added a comment -

        Middle-road change made. The alternate name is an attribute and partial name is added to content followed by a ":".

        I also added a few more tests.
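
        For readers following along, the middle-road format can be sketched roughly as below. This is an illustration only, not the committed code: the class and method names are hypothetical, the use of a title attribute follows the suggestion earlier in this thread, and trimming the alternate name is an added assumption.

```java
class AcroFormItem {
    /** Escapes characters that are special in XML text and attribute values. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds one list item: alternate name as an attribute, "partialName: value" as content. */
    static String toListItem(String partialName, String altName, String value) {
        StringBuilder sb = new StringBuilder("<li");
        // Assumption: omit the attribute when there is no alternate name.
        if (altName != null && !altName.trim().isEmpty()) {
            sb.append(" title=\"").append(escape(altName.trim())).append('"');
        }
        sb.append('>').append(escape(partialName)).append(": ");
        sb.append(escape(value == null ? "" : value)).append("</li>");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(
                toListItem("StreetNumberName[0]", " Street Number and Name", "123 Main St."));
        // prints <li title="Street Number and Name">StreetNumberName[0]: 123 Main St.</li>
    }
}
```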

        Tim Allison added a comment -

        Fixed in r1549727.

        Tim Allison added a comment -

        In hindsight, I would prefer to use test documents that are unequivocally consistent with the Apache License. I've removed the docs from trunk and commented out the test cases (r1550725). If anyone would like to contribute an example doc that is unequivocally consistent with Apache License 2.0, I'll modify the test case for that doc. I'll be on the lookout for test docs and will leave this open until the test cases are turned back on. The functionality within Tika is still available, of course.


          People

          • Assignee:
            Tim Allison
            Reporter:
            Michael Graessle
          • Votes:
            0
            Watchers:
            3
