[TIKA-2057] Extract PDF DocInfo fields into separate metadata fields - ASF JIRA

Hi,

I have a PDF in which the title has been set twice, once as Dublin Core metadata:

<dc:title>
  <rdf:Alt>
    <rdf:li xml:lang="x-default">
      Consumer credit cards - conditions of use
    </rdf:li>
  </rdf:Alt>
</dc:title>

and again in the PDF DocInfo section:

/Title(Consumer Credit Card - Conditions of Use)

When I use Tika to transform the PDF into HTML:

java -jar tika-app-1.13.jar int_Consumer_Conditions_of_use.pdf

it outputs this metadata:

<meta name="dc:title" content="Consumer credit cards - conditions of use"/>

and this <title> tag:

<title>Consumer credit cards - conditions of use</title>

Both come from the Dublin Core title, so the DocInfo title, which differs in wording and capitalization, is no longer accessible in the output.

Could Tika be adapted to carry the PDF DocInfo fields forward during conversion under a separate metadata namespace? For example:

<meta name="docinfo:title" content="Consumer Credit Card - Conditions of Use"/>
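
To make the request concrete, here is a minimal sketch of the behaviour I have in mind. It is not Tika's actual parser code: the class `DocInfoMetadataSketch`, the method `collectTitles`, and the key `docinfo:title` are all hypothetical names used for illustration only. The idea is that each source writes to its own key, so neither value overwrites the other.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: this is not how Tika's PDF parser is implemented.
final class DocInfoMetadataSketch {

    // Stores the XMP (Dublin Core) title and the DocInfo title under separate
    // keys. "docinfo:title" is the key proposed in this issue, not an existing
    // Tika metadata property.
    static Map<String, String> collectTitles(String xmpTitle, String docInfoTitle) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (xmpTitle != null) {
            metadata.put("dc:title", xmpTitle.trim());
        }
        if (docInfoTitle != null) {
            metadata.put("docinfo:title", docInfoTitle.trim());
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = collectTitles(
                "Consumer credit cards - conditions of use",
                "Consumer Credit Card - Conditions of Use");
        // Print each entry in the same <meta> form that appears in the HTML output.
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            System.out.println("<meta name=\"" + entry.getKey()
                    + "\" content=\"" + entry.getValue() + "\"/>");
        }
    }
}
```

With both values kept, a downstream consumer could choose which title to use instead of only ever seeing the Dublin Core one.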