Tika / TIKA-2057

Extract PDF DocInfo fields into separate metadata fields

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.14
    • Component/s: metadata
    • Labels: None

      Description

      Hi,

      I have a PDF in which the title has been set twice, once as Dublin Core metadata:

      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">
            Consumer credit cards - conditions of use
          </rdf:li>
        </rdf:Alt>
      </dc:title>

      and again in the PDF DocInfo section:

      /Title(Consumer Credit Card - Conditions of Use)
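To make the difference between the two sources concrete, here is a minimal sketch that pulls the title out of each fragment quoted above and compares them. It is not how Tika or PDFBox read a PDF: the class `TitleSourcesSketch` and both helper methods are hypothetical, the XML is parsed without namespace awareness so the fragment works without its `xmlns` declarations, and the regular expression only handles simple literal strings with no escapes or nested parentheses.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper class, for illustration only.
final class TitleSourcesSketch {

    // Simplified: matches /Title(...) only when the string has no escapes or nested parentheses.
    private static final Pattern DOCINFO_TITLE =
            Pattern.compile("/Title\\s*\\(([^()\\\\]*)\\)");

    // Returns the trimmed text of the first rdf:li element, or null if there is none.
    // The default (non-namespace-aware) parser treats "rdf:li" as a plain element name.
    static String xmpTitle(String xmpFragment) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmpFragment.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("rdf:li");
            return items.getLength() == 0 ? null : items.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    // Returns the DocInfo title, or null if no simple /Title(...) entry is present.
    static String docInfoTitle(String docInfoFragment) {
        Matcher m = DOCINFO_TITLE.matcher(docInfoFragment);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xmp = "<dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">"
                + "Consumer credit cards - conditions of use"
                + "</rdf:li></rdf:Alt></dc:title>";
        String docInfo = "/Title(Consumer Credit Card - Conditions of Use)";

        String fromXmp = xmpTitle(xmp);
        String fromDocInfo = docInfoTitle(docInfo);
        System.out.println("XMP title:     " + fromXmp);
        System.out.println("DocInfo title: " + fromDocInfo);
        System.out.println("Identical:     " + fromXmp.equals(fromDocInfo));
    }
}
```

The two strings differ in both wording and capitalization, so neither can be reconstructed from the other.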

      When I use Tika to transform the PDF into HTML

      java -jar tika-app-1.13.jar int_Consumer_Conditions_of_use.pdf

      it outputs this metadata:

      <meta name="dc:title" content="Consumer credit cards - conditions of use"/>

      and this <title> tag:

      <title>Consumer credit cards - conditions of use</title>

      so the DocInfo title (note the different wording and capitalization) no longer appears anywhere in the output.

      Could Tika be adapted to carry the PDF DocInfo fields forward during conversion under a separate metadata key, e.g.

      <meta name="docinfo:title" content="Consumer Credit Card - Conditions of Use"/>
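A minimal sketch of the requested behaviour, assuming metadata is held as a simple key/value map. It is not Tika's actual implementation: the class `DocInfoMetaSketch`, its methods, and the `docinfo:` prefix (taken from the example above) are hypothetical. The point is only that DocInfo entries are stored under their own prefixed keys, so they never overwrite `dc:title`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, for illustration only.
final class DocInfoMetaSketch {

    // Assumed prefix, copied from the example in this report.
    static final String DOCINFO_PREFIX = "docinfo:";

    // Copies every DocInfo entry into a new map under a prefixed, lower-cased key,
    // leaving the existing keys (such as dc:title) untouched.
    static Map<String, String> merge(Map<String, String> existing, Map<String, String> docInfo) {
        Map<String, String> merged = new LinkedHashMap<>(existing);
        for (Map.Entry<String, String> e : docInfo.entrySet()) {
            merged.put(DOCINFO_PREFIX + e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return merged;
    }

    // Escapes the characters that are unsafe inside a double-quoted attribute value.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Renders one <meta> element per metadata entry, in insertion order.
    static String toMetaTags(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            sb.append("<meta name=\"").append(escape(e.getKey()))
              .append("\" content=\"").append(escape(e.getValue()))
              .append("\"/>\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> xmp = new LinkedHashMap<>();
        xmp.put("dc:title", "Consumer credit cards - conditions of use");

        Map<String, String> docInfo = new LinkedHashMap<>();
        docInfo.put("Title", "Consumer Credit Card - Conditions of Use");

        // Prints the dc:title element followed by the docinfo:title element.
        System.out.print(toMetaTags(merge(xmp, docInfo)));
    }
}
```

Keeping the two values under distinct keys leaves the choice of which title to trust to the consumer of the output.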

            People

            • Assignee: Tim Allison (tallison@mitre.org)
            • Reporter: John Haynes (jhaynes)