Tika / TIKA-2057

Extract PDF DocInfo fields into separate metadata fields

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.14
    • Component/s: metadata
    • Labels: None

      Description

      Hi,

      I have a PDF in which the title has been set twice, once as Dublin Core metadata:

      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">
            Consumer credit cards - conditions of use
          </rdf:li>
        </rdf:Alt>
      </dc:title>

      and again in the PDF DocInfo section:

      /Title(Consumer Credit Card - Conditions of Use)

      When I use Tika to transform the PDF into HTML

      java -jar tika-app-1.13.jar int_Consumer_Conditions_of_use.pdf

      it outputs this metadata:

      <meta name="dc:title" content="Consumer credit cards - conditions of use"/>

      and this <title> tag:

      <title>Consumer credit cards - conditions of use</title>

      meaning we no longer have access to the DocInfo title.

      Could Tika be adapted to carry the PDF DocInfo values forward during conversion under a separate metadata key, e.g.

      <meta name="docinfo:title" content="Consumer Credit Card - Conditions of Use"/>

        Activity

        jhaynes John Haynes added a comment -

        (Attaching file with dc:title and DocInfo title)

        tallison@mitre.org Tim Allison added a comment -

        Yes, I've been wanting to do this for a while. I'd want to keep the current behavior as is for the pure Dublin Core keys, but we could add other keys to preserve the DocInfo information.

        Recommendations? pdf:docinfo:title
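
A minimal sketch of the keying scheme proposed here, not the code that was committed to Tika. It assumes that `dc:title` keeps the XMP value when one exists (as in the reporter's output) and otherwise falls back to the DocInfo value; that fallback is an assumption. The class name `DocInfoKeys` and the method `collectTitles` are hypothetical, and a plain map stands in for Tika's metadata object.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DocInfoKeys {
    // Key names follow the thread: "dc:title" is the existing key,
    // "pdf:docinfo:" is the prefix proposed in this comment.
    static final String DC_TITLE = "dc:title";
    static final String DOCINFO_PREFIX = "pdf:docinfo:";

    /**
     * Builds a metadata map in which dc:title is unchanged (XMP title if
     * present, else the DocInfo title; the fallback is an assumption) and
     * the DocInfo title is additionally stored under its own key.
     */
    static Map<String, String> collectTitles(String xmpTitle, String docInfoTitle) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String preferred = (xmpTitle != null) ? xmpTitle : docInfoTitle;
        if (preferred != null) {
            metadata.put(DC_TITLE, preferred);
        }
        if (docInfoTitle != null) {
            // The DocInfo value is kept even when it differs from dc:title.
            metadata.put(DOCINFO_PREFIX + "title", docInfoTitle);
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> m = collectTitles(
                "Consumer credit cards - conditions of use",
                "Consumer Credit Card - Conditions of Use");
        m.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With the two titles from the description, both values end up in the map under different keys, so neither one is lost.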

        jhaynes John Haynes added a comment -
        pdf:docinfo:title

        Looks good!

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1109 (See https://builds.apache.org/job/Tika-trunk/1109/)
        TIKA-2057 - maintain DocInfo metadata in PDFs (tallison: rev ce07d8a10499fae015f07ca4fd4daf3473ca5193)

        • (edit) CHANGES.txt
        • (add) tika-parsers/src/test/resources/test-documents/testPDF_diffTitles.pdf
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
        • (add) tika-core/src/main/java/org/apache/tika/metadata/PDF.java
        tallison@mitre.org Tim Allison added a comment -

        Fixed. Thank you for opening this.
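
One way to check the result from the command-line output is to read the `<meta>` tags back out of the generated XHTML. The sketch below uses only the JDK's XML parser; the class name `MetaTagReader` is hypothetical, the sample markup is hand-written rather than captured Tika output, and the key `pdf:docinfo:title` is assumed from the proposal in this thread.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MetaTagReader {
    /** Collects name/content pairs from the meta elements of well-formed XHTML. */
    static Map<String, String> readMeta(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList metas = doc.getElementsByTagName("meta");
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            // Skip meta elements without a name attribute.
            if (meta.hasAttribute("name")) {
                result.put(meta.getAttribute("name"), meta.getAttribute("content"));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample for illustration, not real Tika output.
        String sample = "<html><head>"
                + "<meta name=\"dc:title\" content=\"Consumer credit cards - conditions of use\"/>"
                + "<meta name=\"pdf:docinfo:title\" content=\"Consumer Credit Card - Conditions of Use\"/>"
                + "<title>Consumer credit cards - conditions of use</title>"
                + "</head><body/></html>";
        Map<String, String> meta = readMeta(sample);
        System.out.println(meta.get("dc:title"));
        System.out.println(meta.get("pdf:docinfo:title"));
    }
}
```

If the DocInfo key is missing from the map, the document either has no DocInfo title or the Tika version in use predates the fix versions listed above.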


          People

          • Assignee: Tim Allison (tallison@mitre.org)
          • Reporter: John Haynes (jhaynes)
          • Votes: 0
          • Watchers: 3
