Tika / TIKA-2057

Extract PDF DocInfo fields into separate metadata fields


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14, 2.0.0
    • Component/s: metadata
    • Labels: None

    Description

      Hi,

      I have a PDF in which the title has been set twice. It is set once as Dublin Core metadata:

      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">
            Consumer credit cards - conditions of use
          </rdf:li>
        </rdf:Alt>
      </dc:title>

      and again in the PDF DocInfo section:

      /Title(Consumer Credit Card - Conditions of Use)

      When I use Tika to transform the PDF into HTML:

      java -jar tika-app-1.13.jar int_Consumer_Conditions_of_use.pdf

      it outputs this metadata:

      <meta name="dc:title" content="Consumer credit cards - conditions of use"/>

      and this <title> tag:

      <title>Consumer credit cards - conditions of use</title>

      meaning we no longer have access to the DocInfo title.

      Is there some way you could adapt Tika to carry the PDF DocInfo fields forward during a conversion under a new type of metadata? For example:

      <meta name="docinfo:title" content="Consumer Credit Card - Conditions of Use"/>
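The requested behaviour can be illustrated with a minimal, self-contained sketch. It is not Tika's implementation: the class name `DocInfoMetadataSketch`, its methods, and the `docinfo:` prefix are assumptions taken from the example above, and the actual key names chosen in the fix may differ. The idea is only that DocInfo fields are stored under their own prefixed names, so they can never be overwritten by the Dublin Core values.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class DocInfoMetadataSketch {

    // Hypothetical prefix, copied from the example in this report (not a confirmed Tika key).
    static final String DOCINFO_PREFIX = "docinfo:";

    /**
     * Combines XMP-derived fields with DocInfo fields. Each DocInfo key is
     * lower-cased and prefixed, e.g. "Title" becomes "docinfo:title", so both
     * titles survive side by side.
     */
    static Map<String, String> merge(Map<String, String> xmp, Map<String, String> docInfo) {
        Map<String, String> out = new LinkedHashMap<>(xmp);
        for (Map.Entry<String, String> e : docInfo.entrySet()) {
            String key = DOCINFO_PREFIX + e.getKey().toLowerCase(Locale.ROOT);
            out.put(key, e.getValue());
        }
        return out;
    }

    /** Renders the metadata as one <meta> element per line. */
    static String toMetaTags(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            sb.append("<meta name=\"").append(escape(e.getKey()))
              .append("\" content=\"").append(escape(e.getValue()))
              .append("\"/>\n");
        }
        return sb.toString();
    }

    /** Escapes the characters that are unsafe inside an attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Map<String, String> xmp = new LinkedHashMap<>();
        xmp.put("dc:title", "Consumer credit cards - conditions of use");

        Map<String, String> docInfo = new LinkedHashMap<>();
        docInfo.put("Title", "Consumer Credit Card - Conditions of Use");

        // Prints the dc:title element followed by the docinfo:title element.
        System.out.print(toMetaTags(merge(xmp, docInfo)));
    }
}
```

With the two titles from this PDF, the sketch emits both the `dc:title` element and the `docinfo:title` element, which is the output being asked for.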

          People

            Assignee: Tim Allison (tallison)
            Reporter: John Haynes (jhaynes)