  1. Tika
  2. TIKA-1857

Enhance PDFParser to extract text from XFA forms

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.13
    • Component/s: parser
    • Labels:
    • Flags:
      Patch

      Description

      Extract text from PDF Forms (XFA). Information about XFA: https://en.wikipedia.org/wiki/XFA

      1. 041617_filled_out.pdf
        815 kB
        Tim Allison
      2. doc8.pdf
        109 kB
        Kenneth Lui
      3. govdocs1_xfas.zip
        8.26 MB
        Tim Allison
      4. xfa_in_govdocs1.txt
        3 kB
        Tim Allison

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user essiembre opened a pull request:

          https://github.com/apache/tika/pull/74

          XFA support to PDFParser for TIKA-1857 contributed by pascal.essiembre

          Pull request to add XFA support to PDFParser.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/essiembre/tika TIKA-1857

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/74.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #74


          commit 3340c4546c77feac5b2bc5b0ad864329e6e7bfce
          Author: Pascal Essiembre <pascal.essiembre@norconex.com>
          Date: 2016-02-16T04:07:15Z

          Added XFA support to PDFParser for TIKA-1857 contributed by pascal.essiembre


          tallison@mitre.org Tim Allison added a comment - - edited

          from TIKA-1607's comment

          In the case of XFA forms, the form IS the content.

          Got it. Doh. Thank you.

          As I look at a few of these docs from govdocs1 w/ XFA data, it looks like the form also contains the PDF's standard metadata...(author etc.) which is not necessarily stored in the older mechanism: COSDictionary. govdocs1's 517660.pdf shows this – the author and title can be extracted from the XFA, but that info is not extracted with our current methods.

          I'll support whichever way you pick, but I personally can't see use cases where extracting that workaround message is the intent when using Tika. I do see value in keeping the entire DOM though. Maybe you can do as you suggest, but "in addition" to returning the XFA text as the content?

          Y, that would be in addition. Thank you, again.

          tallison@mitre.org Tim Allison added a comment -

          list of PDFs in govdocs1 that have a non-null PDXFAResource object, found with PDFBox 2.0's trunk.

          tallison@mitre.org Tim Allison added a comment - - edited

          I've only looked at a handful of files that contain xfa...this metadata is entirely new to me. The files I've looked at come from govdocs1 and are fairly old by now.

          In the attached 041617_filled_out.pdf, I've added content to the forms and saved the document.

          With the patch, I'm getting all of the boilerplate from the xfa extraction, but I'm not getting any content from the form because it isn't in <(speak|text|exData)> elements. However, with our old code, I am seeing the entered data, e.g. my_exhibitor.

          Is this PDF storing the contents of the form in both the xfa and in the traditional AcroForm?

          I imagine that won't happen in all PDFs, and there will be an either/or?

          To avoid duplication of content, do we want to skip processing of AcroForm data if XFA exists? Will we miss anything?

          The other major question: I like the narrow focus that the current regexes yield, but why wouldn't we want to run our HtmlParser or our DcXMLParser against the bytes and pull everything out? We'd have to skip inline/embedded images or handle those properly at some point...but any other reasons?

          Tilman Hausherr, have you worked with XFA? Any recommendations for pulling as much info as we can without duplication?

          We could make this configurable, of course.

          tilman Tilman Hausherr added a comment -

          Sorry, I have no experience with XFA. Maruan Sahyoun might know more.

          msahyoun Maruan Sahyoun added a comment -

          The reason you are not getting the data is that it is stored in the data node of an XML data structure that matches the binding information in the field. That data is in xfa.datasets.data, with the my_exhibitor value stored in the Exhibitorname field.

          Extracting speak|text|exData will give you the boilerplate text but not the field value.

          Now there are two types of XFA forms - static and dynamic. Static XFA forms will have an XFA entry and AcroForm fields. Dynamic XFA forms will only have an XFA entry and no AcroForm fields.

          When a static XFA form is filled out with an XFA-aware PDF processor, both the xfa.datasets.data information and the V entry of the AcroForm form field are updated. If you fill out a static form with a non-XFA-aware PDF processor, it will only see the AcroForm information and, as a result, will only update the V entry of the AcroForm form fields.

          When trying to fill a dynamic XFA form with a non XFA aware PDF processor it will not see any form fields at all.

          I'm happy to provide more information on that topic but thought that this will give you a first outline.
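To illustrate the point about xfa.datasets.data, here is a minimal sketch (not Tika's or PDFBox's actual code) that reads field values out of a datasets packet using StAX from the JDK. The class name XfaDataSketch, the method fieldValues, and the sample XML in the usage note are hypothetical; only the Exhibitorname/my_exhibitor pairing comes from the comment above. It assumes values are leaf elements inside the <data> element and keys them by local element name, so repeated field names would overwrite each other.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only.
final class XfaDataSketch {

    /** Returns the non-empty text of leaf elements found inside <data>, keyed by local name. */
    static Map<String, String> fieldValues(String datasetsXml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The XML comes from an untrusted PDF, so disable DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        Map<String, String> values = new LinkedHashMap<>();
        int depthInData = 0;      // 0 = outside <data>; 1 = directly inside it, and so on
        String current = null;    // element opened most recently with no child seen yet
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(datasetsXml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if (depthInData > 0) {
                            depthInData++;
                            current = reader.getLocalName();
                            text.setLength(0);
                        } else if ("data".equals(reader.getLocalName())) {
                            depthInData = 1;
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        if (current != null) {
                            text.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (depthInData > 0) {
                            if (current != null) {
                                // Only a leaf reaches its end tag with current still set.
                                String value = text.toString().trim();
                                if (!value.isEmpty()) {
                                    values.put(current, value);
                                }
                                current = null;
                            }
                            depthInData--;
                        }
                        break;
                    default:
                        break;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XFA datasets XML", e);
        }
        return values;
    }
}
```

Called with a made-up packet such as `<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/"><xfa:data><form1><Exhibitorname>my_exhibitor</Exhibitorname></form1></xfa:data></xfa:datasets>`, the returned map holds the entered value under Exhibitorname, which a scan of speak|text|exData elements alone would not find.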

          tallison@mitre.org Tim Allison added a comment -

          This is great. Thank you!

          So, to get the best coverage for extracted content, should we do the following:

          Check for fields in the AcroForm.

          a) If those exist (Static XFA), use the content extracted from the AcroForm and ignore the XFA
          b) If they don't exist (Dynamic XFA), scrape/extract info from the XFA

          In your experience, will we miss any info if we ignore the XFA for Static XFAs and rely solely on the AcroForm?

          msahyoun Maruan Sahyoun added a comment - - edited

          Sorry for my delay in answering your question.

          May I propose the following strategy:

          a) for static XFA if there is datasets.data use that content for the field values otherwise extract from the AcroForm.
          b) for dynamic XFA scrape/extract info from the XFA.

          Why does my proposal for a) differ from yours? Adobe Reader/Acrobat uses the information from datasets.data for the field value in preference to possibly differing content in the AcroForm (which might happen if the form was filled out with an XFA-aware processor and afterwards amended with a non-XFA-aware processor).

          tallison@mitre.org Tim Allison added a comment -

          No problem at all. I think this will take some time for me to get right...there's no rush.

          Do I understand correctly then: no matter whether static or dynamic, try to pull data from XFA; if that doesn't exist, fall back to the AcroForm?

          Also, is there an obvious way to determine static vs. dynamic aside from checking to see if there are fields in the AcroForm?

          Thank you, again!

          msahyoun Maruan Sahyoun added a comment -

          Do I understand correctly then: no matter whether static or dynamic, try to pull data from XFA; if that doesn't exist, fall back to the AcroForm?

          if you'd like to replicate Adobe Reader/Acrobat behavior - yes. BTW don't know what will happen with PDF 2.0 as there XFA is deprecated which might have an implication for future versions.

          Also, is there an obvious way to determine static vs. dynamic aside from checking to see if there are fields in the AcroForm?

          there is PDAcroForm.xfaIsDynamic() which will give you the information (which checks if there is XFA and no AcroForm fields)
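The strategy agreed on in the comments above can be sketched as a small decision function. This is only an illustration of the rule as discussed, not the code that went into Tika: the class FormSourceSketch, the enum, and the three boolean inputs are hypothetical stand-ins. In a real parser the inputs would presumably come from PDFBox, for example from PDAcroForm.xfaIsDynamic() as mentioned above, and from inspecting the XFA XML for a data section.

```java
// Hypothetical helper, for illustration only.
final class FormSourceSketch {

    enum Source { XFA, ACROFORM, NONE }

    /**
     * Picks where field values should be read from.
     *
     * @param hasXfa            the PDF has an XFA entry
     * @param hasAcroFormFields the AcroForm contains fields
     * @param hasDatasetsData   the XFA contains datasets.data
     */
    static Source chooseSource(boolean hasXfa, boolean hasAcroFormFields,
                               boolean hasDatasetsData) {
        if (!hasXfa) {
            // Plain AcroForm PDF, or no form at all.
            return hasAcroFormFields ? Source.ACROFORM : Source.NONE;
        }
        // Dynamic XFA: an XFA entry and no AcroForm fields, so XFA is the only source.
        boolean dynamic = !hasAcroFormFields;
        if (dynamic) {
            return Source.XFA;
        }
        // Static XFA: prefer datasets.data, as Adobe Reader/Acrobat does;
        // otherwise fall back to the AcroForm field values.
        return hasDatasetsData ? Source.XFA : Source.ACROFORM;
    }
}
```

For example, `FormSourceSketch.chooseSource(true, true, false)` describes a static form without datasets.data and selects the AcroForm, while any dynamic form selects the XFA.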

          tallison@mitre.org Tim Allison added a comment -

          Doh! Sorry. I was looking at PDXFAResource. Thank you, again.

          PDF 2.0 as there XFA is deprecated

          Oh, no...I guess we could copy/paste from the current PDFBox if XFA goes away in PDFBox...less than ideal. I don't see deprecation tags in PDXFAResource or PDAcroForm's getXFA()...which XFA handling might go away?

          msahyoun Maruan Sahyoun added a comment -

          XFA is not deprecated in PDFBox. It will be deprecated in the PDF 2.0 specification (as it currently stands)

          tallison@mitre.org Tim Allison added a comment -

          Ha. Sorry. Figured that was a typo. We'll still have it around for a while to process though. Thank you, again.

          tallison@mitre.org Tim Allison added a comment -

          194 xfas from govdocs1 as exported with PDFBox 2.0 (trunk built from within the last few weeks).

          tallison@mitre.org Tim Allison added a comment -

          I implemented a first-attempt XFA scraper with StAX; it pulls the content from the fields that Pascal identified into the ContentHandler, and it merges the "values" from the data section with the fields section.

          Currently, if XFA exists, I process that and skip the AcroForm data.

          I'm not certain what the best path is for ignoring/processing content extracted from the "regular" PDF if there is XFA data.

          For now, I'm also processing the contents of the rest of the PDF. I'm more averse to losing data than to duplication because my main use case is search...but I realize this will be really frustrating to users who want "just one copy" of the content.

          In looking at the pdfs with xfa data in govdocs1, it looks like there would be lost content in some files if we processed only the XFA and did not do the regular text extraction. On the other hand, for most of the files I examined, it looked like the content is entirely duplicative – Pascal Essiembre's point above.

          I propose adding a parameter to the PDFParserConfig along the lines of ifXFAExistsProcessItAlone...this would allow the behavior of Pascal's patch. I propose that the default be set to "false", erring on the side of extracting more content at the cost of duplication.

          Is this ok? Or, is there an easy way to determine if regular content is entirely duplicative of XFA content?

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/74

          tallison@mitre.org Tim Allison added a comment -

          Pascal Essiembre, thank you for this pull request! I made a few modifications, but we now have basic XFA processing, thanks to you. To obtain the XFA-only behavior, you'll need to do something like this:

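                  import org.apache.tika.parser.ParseContext;
                  import org.apache.tika.parser.pdf.PDFParserConfig;

                  // Extract only the XFA content when XFA is present; pass this context to the parser's parse(...) call
                  ParseContext context = new ParseContext();
                  PDFParserConfig config = new PDFParserConfig();
                  config.setIfXFAExtractOnlyXFA(true);
                  context.set(PDFParserConfig.class, config);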
          

          Maruan Sahyoun, thank you, again, for helping me understand XFA and Acroforms!

          For posterity, here are some areas for improvement in XFA parsing:

          • handle metadata stored in <desc> section (govdocs1: 754282.pdf, 982106.pdf)
          • handle pdf metadata (access permissions, etc.) in <pdf> element
          • extract different types of uris as metadata
          • add extraction of <image> data (govdocs1: 754282.pdf)
          • add computation of traversal order for fields
          • figure out when text extracted from xfa fields is duplicative of that
            extracted from the rest of the pdf...and do this efficiently and quickly
          • avoid duplication with <speak> and <tooltip> elements
          hudson Hudson added a comment -

          UNSTABLE: Integrated in tika-trunk-jdk1.7 #916 (See https://builds.apache.org/job/tika-trunk-jdk1.7/916/)
          TIKA-1857: add basic XFA extraction support via Pascal Essiembre. (tallison: rev dbefe9830b26d05f9ce53503565a069bcc63d7c1)

          • tika-parsers/src/test/resources/test-documents/testPDF_XFA_govdocs1_258578.pdf
          • tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
          • tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java
            TIKA-1857: add basic XFA extraction support via Pascal Essiembre. (tallison: rev 7c245fa87507cf0887838001c54c65b79b7e7cbc)
          • CHANGES.txt
          hudson Hudson added a comment -

          UNSTABLE: Integrated in tika-2.x #41 (See https://builds.apache.org/job/tika-2.x/41/)
          TIKA-1857: add basic XFA extraction via Pascal Essiembre. (tallison: rev f1e4ebdb422d24b7080d02620f3c38f6dda57910)

          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
          • CHANGES.txt
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
          • tika-parser-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
          • tika-test-resources/src/test/resources/test-documents/testPDF_XFA_govdocs1_258578.pdf
          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #919 (See https://builds.apache.org/job/tika-trunk-jdk1.7/919/)
          Fix for side effect of TIKA-1857-- javax.xml.stream is no longer (tallison: rev 9a1ba9494cf2a786e4615f0d72ca5f7c303840fa)

          • tika-bundle/pom.xml
          tallison@mitre.org Tim Allison added a comment -

          Pascal Essiembre, we may be headed towards a release of 1.13 within the month (ish). Will the current update meet your needs? Thank you, again, for your patch!

          pascal.essiembre Pascal Essiembre added a comment -

          Yes, it looks like the changes will do just fine. Thank you!

          hkkenneth Kenneth Lui added a comment -

          Hi, I tried to use this feature but it doesn't seem to work. I understand this is not the right place for troubleshooting questions, so I put the details at http://stackoverflow.com/questions/42217327/apache-tika-extract-only-field-names-from-pdf-xfa-forms-but-not-the-text-content . Could you please help me determine whether I misconfigured Tika or whether this is an issue with the feature implementation? Thanks!

          tallison@mitre.org Tim Allison added a comment - - edited

          Are you able to share mocked up xml, sanitized of patient data?

          hkkenneth Kenneth Lui added a comment - - edited

          I cannot copy the file out of the secured environment. But here is a file I found on the Internet that has the same issue, and I used it to test my pdfbox script as well.

          Edit: it may not be obvious from this comment that I attached doc8.pdf. That is the file I am referring to.

          tallison@mitre.org Tim Allison added a comment - - edited
          <etd:PelnaNazwa>IT IS EASY</etd:PelnaNazwa>
          <etd:ImiePierwsze>JUST TRY</etd:ImiePierwsze>
          <etd:Nazwisko>DUDE</etd:Nazwisko>
          <etd:Wojewodztwo>DO YOUR OWN JOB</etd:Wojewodztwo>
          <etd:Powiat>DON'T EXPECT ME TO DO IT!</etd:Powiat>
          <etd:Gmina>IT'S XML!</etd:Gmina>
          <etd:Miejscowosc>READ THE DOCUMENTATION</etd:Miejscowosc>
          <etd:KodPocztowy>DUDE</etd:KodPocztowy>
          <etd:Poczta>LEARN BEFORE YOU CODE</etd:Poczta>
          

          Is now extracted as:

          <li fieldName="PelnaNazwa">Nazwa pełna: IT IS EASY</li>
          <li fieldName="Nazwisko">Nazwisko: DUDE</li>
          <li fieldName="ImiePierwsze">ImiePierwsze: JUST TRY</li>
          <li fieldName="Wojewodztwo">Województwo: DO YOUR OWN JOB</li>
          <li fieldName="Powiat">Powiat: DON'T EXPECT ME TO DO IT!</li>
          <li fieldName="Gmina">Gmina: IT'S XML!</li>
          <li fieldName="Miejscowosc">Miejscowość: READ THE DOCUMENTATION</li>
          <li fieldName="KodPocztowy">Kod pocztowy: DUDE</li>
          <li fieldName="Poczta">Poczta: LEARN BEFORE YOU CODE</li>

          Once our git is back up and running, I'll push the fix. Thank you for raising this issue and sharing a triggering document.
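The output above pairs each data value with a caption from the form definition. A rough sketch of that kind of merge follows; it is an assumption about the general shape, not the code in XFAExtractor. The class XfaMergeSketch, the method toListItems, and the two input maps (field name to caption, field name to value) are hypothetical. Falling back to the field name when no caption is known is inferred from the ImiePierwsze line above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class XfaMergeSketch {

    /** Builds one <li> per data value, labelled with its caption or, failing that, its field name. */
    static List<String> toListItems(Map<String, String> captions, Map<String, String> values) {
        List<String> items = new ArrayList<>();
        for (Map.Entry<String, String> entry : values.entrySet()) {
            String name = entry.getKey();
            String label = captions.getOrDefault(name, name);
            items.add("<li fieldName=\"" + escape(name) + "\">"
                    + escape(label + ": " + entry.getValue()) + "</li>");
        }
        return items;
    }

    // Minimal XML escaping for text and attribute values.
    private static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```

With a caption map containing PelnaNazwa and a value map containing PelnaNazwa and ImiePierwsze (in an insertion-ordered map), this yields one item labelled with the caption and one labelled with the bare field name, in the order the values were supplied.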

          tallison@mitre.org Tim Allison added a comment -

          I pushed the fix to our new repo. Let me know if that fixes this issue. Thank you.


            People

            • Assignee: Unassigned
            • Reporter: Pascal Essiembre
            • Votes: 0
            • Watchers: 7
