TIKA-2559

Expose language metadata from PDF documents

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 1.18, 2.0.0
    • Component/s: parser
    • Labels: None

      Description

      Tika does not currently return the language from a PDF's metadata. I see this with an example PDF that I'm seeking permission to share with you, and it may be true for all PDFs.

      It would be useful to me (and I imagine others) if it could do so.


      The example PDF I have does yield a language when processed with exiftool:

      $ exiftool -X /tmp/my-example.pdf |grep -i lang
       <PDF:Language>en-US</PDF:Language>

      whereas it does not with Tika.

      I looked briefly into the PDF parsing code, and the language value in question appears to be available from PDFBox's document catalog, so it could be passed through with a change such as:

      diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
      index b2a15cab6..66b1c9343 100644
      --- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
      +++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
      @@ -224,7 +224,10 @@ public class PDFParser extends AbstractParser implements Initializable {
               metadata.set(AccessPermissions.CAN_PRINT_DEGRADED,
                       Boolean.toString(ap.canPrintDegraded()));
      
      -
      +        if (document.getDocumentCatalog().getLanguage() != null) {
      +            metadata.set(Metadata.CONTENT_LANGUAGE, document.getDocumentCatalog().getLanguage());
      +        }
      +
               //now go for the XMP
               Document dom = loadDOM(document.getDocumentCatalog().getMetadata(), metadata, context);
      
      diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
      index 93966e4f2..7b7ba14fe 100644
      --- a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
      +++ b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
      @@ -1310,6 +1310,14 @@ public class PDFParserTest extends TikaTest {
               assertContains("Tika - Content", content);
           }
      
      +    @Test
      +    public void testMissingLanguage() throws Exception {
      +        Metadata metadata = getXML("my-example.pdf").metadata;
      +        System.out.println(metadata);
      +        assertEquals("application/pdf", metadata.get(Metadata.CONTENT_TYPE));
      +        assertEquals("en-US", metadata.get(Metadata.CONTENT_LANGUAGE));
      +    }
      +
           @Test
           public void testConfiguringMoreParams() throws Exception {
               try (InputStream configIs = getClass().getResourceAsStream("/org/apache/tika/parser/pdf/tika-inline-config.xml")) {
      

       

      It's my first time looking at this code, so that change may be a bit naive, but hopefully it shows what I'm getting at.
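
      For illustration only, here is a minimal, dependency-free sketch of the guard in the diff above. It does not use Tika or PDFBox: the `Map` is a hypothetical stand-in for Tika's `Metadata`, the `catalogLanguage` argument stands in for the result of `document.getDocumentCatalog().getLanguage()`, and the key string is assumed to match `Metadata.CONTENT_LANGUAGE`. The class and method names are made up for this example. It reads the language once rather than twice, and also shows one way a consumer might interpret a tag such as `en-US`.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in code: not part of Tika or PDFBox.
final class LanguageMetadataSketch {

    // Assumed to be the same string as Tika's Metadata.CONTENT_LANGUAGE.
    static final String CONTENT_LANGUAGE = "Content-Language";

    // Stores the language only when the document actually declares one.
    // catalogLanguage stands in for document.getDocumentCatalog().getLanguage().
    static void copyLanguage(String catalogLanguage, Map<String, String> metadata) {
        if (catalogLanguage != null) {
            metadata.put(CONTENT_LANGUAGE, catalogLanguage);
        }
    }

    // Interprets the stored tag (for example "en-US") as a Locale,
    // or returns null when no language was recorded.
    static Locale toLocale(Map<String, String> metadata) {
        String tag = metadata.get(CONTENT_LANGUAGE);
        return tag == null ? null : Locale.forLanguageTag(tag);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();

        // A PDF without a language entry leaves the metadata untouched.
        copyLanguage(null, metadata);
        System.out.println("has language: " + metadata.containsKey(CONTENT_LANGUAGE));

        // A PDF declaring en-US, as in the exiftool output above.
        copyLanguage("en-US", metadata);
        Locale locale = toLocale(metadata);
        System.out.println(locale.getLanguage() + " / " + locale.getCountry());
    }
}
```

      Skipping the key when the value is null means callers can tell "no language declared" apart from an empty value, which is the same behaviour the proposed `if` block gives.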


            People

            • Assignee: Unassigned
            • Reporter: Matt Sheppard (mattsheppard)