TIKA-3138

PDF parser with XFA produces malformed XML

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.24.1
    • Fix Version/s: 1.25
    • Component/s: parser
    • Labels: None

    Description

      When running

      curl -T xfa.pdf http://<tika-server>/rmeta/xml --header "Accept: application/json"

      the response contains malformed XML:

      <div class="xfa_content">....</xfa_content> 

      instead of 

      <div class="xfa_content">....</div>
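The difference between the two fragments can be checked mechanically. The following is a minimal sketch, assuming only the JDK's built-in XML parser; the class name `WellFormedCheck` and the method `isWellFormed` are hypothetical helpers for illustration and are not part of Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): reports whether a string parses as well-formed XML.
final class WellFormedCheck {

    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Mismatched start and end tags are a fatal well-formedness error.
            return false;
        }
    }
}
```

Under these assumptions, `WellFormedCheck.isWellFormed` should reject the fragment ending in `</xfa_content>` and accept the one ending in `</div>`.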

       

      The element opened at

      https://github.com/apache/tika/blob/main/tika-parsers/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java#L75

      is not closed with a matching element name at

      https://github.com/apache/tika/blob/main/tika-parsers/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java#L109

      and

      https://github.com/apache/tika/blob/main/tika-parsers/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java#L138
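The mismatch described here is between the name used to open the element and the name used to close it: with SAX-style output, the class `xfa_content` belongs in the attributes, while both the start and end calls must use the element name `div`. The following is a minimal sketch of that pairing, assuming the JDK's own `TransformerHandler` as the serializer instead of Tika's content handler; the class name `XfaDivSketch` and the method `render` are hypothetical, and this is not the actual change made in `XFAExtractor`.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration (not Tika code): emit a div with class "xfa_content"
// and close it with the same element name it was opened with.
final class XfaDivSketch {

    static String render(String text) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // The class name is an attribute value, not an element name.
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "class", "class", "CDATA", "xfa_content");

        handler.startDocument();
        handler.startElement("", "div", "div", attrs);
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        // The end call must name "div"; passing "xfa_content" here would
        // reproduce the mismatched closing tag from the report.
        handler.endElement("", "div", "div");
        handler.endDocument();
        return out.toString();
    }
}
```

Calling `XfaDivSketch.render("....")` should yield a `div` that is closed by `</div>`. The serializer does not validate that start and end names agree, which is why a wrong name in the end call passes silently into the output.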

          People

            Assignee: Tim Allison (tallison)
            Reporter: wiwi