  1. Tika
  2. TIKA-4171

Tika server only returns last value for PDFs that have multiple of the same key

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0-BETA, 2.9.2
    • Component/s: tika-server
    • Labels: None

    Description

      Thanks for the great work on Tika Server; it is the only open-source tool that can handle the protected Adobe form format that FERC uses.

      One problem I'm hitting is that the FERC form I am parsing has multiple values for the same key name; for example, in the attached screenshot, rows 1-7 all share the same key name. When Tika Server parses this PDF, it returns only the value in row 7, losing the previous six values.

      My hunch is that somewhere in Tika Server the values are stored in a dictionary-like object keyed by name, so each new value overwrites the previous one and only the final value survives. Would it be possible for Tika Server to return all of the values for a repeated key as a list?
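
      To illustrate the hunch above, here is a minimal sketch in plain Java. It is not Tika's actual code, and the cause inside Tika Server may be different. The class name `RepeatedKeySketch`, the field name `rowValue`, and the values `v1` to `v3` are hypothetical stand-ins for the repeated rows in the form. The first method shows how a single-valued map keeps only the last value for a repeated key; the second shows one way all values could be kept as a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a hypothetical model of the suspected behaviour, not Tika code.
public class RepeatedKeySketch {

    // Single-valued map: each put() on an existing key replaces the earlier value,
    // so only the last value for a repeated key survives.
    public static Map<String, String> lastValueWins(String[][] fields) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String[] field : fields) {
            result.put(field[0], field[1]);
        }
        return result;
    }

    // Multi-valued map: values for a repeated key are appended to a list
    // in the order they are encountered, so nothing is lost.
    public static Map<String, List<String>> keepAllValues(String[][] fields) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String[] field : fields) {
            result.computeIfAbsent(field[0], k -> new ArrayList<>()).add(field[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical key/value pairs standing in for repeated form rows.
        String[][] fields = {
            {"rowValue", "v1"}, {"rowValue", "v2"}, {"rowValue", "v3"}
        };
        System.out.println(lastValueWins(fields)); // {rowValue=v3}
        System.out.println(keepAllValues(fields)); // {rowValue=[v1, v2, v3]}
    }
}
```

      The list-valued form is the shape of output I am hoping for: one key with every row's value preserved in order.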

      Example PDF attached - thank you for taking a look!

      Attachments

        1. testPDF_XFA_govdocs1_258578.pdf.html
          5 kB
          Tilman Hausherr
        2. screenshot.png
          70 kB
          Cassandra Xia
        3. example-output.txt
          79 kB
          Tim Allison
        4. 876503.pdf
          946 kB
          Tilman Hausherr
        5. 20230801-5207_QF20-270 East River Solar Form 556 recert FINAL.pdf
          1.62 MB
          Cassandra Xia

            People

              Assignee: tallison Tim Allison
              Reporter: cssndrx Cassandra Xia
              Votes: 0
              Watchers: 5
