TIKA-2559

Expose language metadata from PDF documents

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 1.18, 2.0.0
    • Component/s: parser
    • Labels: None

    Description

      Tika does not currently return the language recorded in a PDF's metadata. I see this with an example PDF that I'm seeking permission to share with you, and it may well be the case for all PDFs.

      It would be useful to me (and I imagine others) if it could do so.


      The example PDF I have does get a language when processed with exiftool...

      $ exiftool -X /tmp/my-example.pdf |grep -i lang
       <PDF:Language>en-US</PDF:Language>

      whereas it does not with Tika.

       

      I looked briefly into the PDF parsing code, and it appears that the language value in question is available within PDFBox's document catalog, so I can pass it through with a change such as...

      diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
      index b2a15cab6..66b1c9343 100644
      --- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
      +++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
      @@ -224,7 +224,10 @@ public class PDFParser extends AbstractParser implements Initializable {
               metadata.set(AccessPermissions.CAN_PRINT_DEGRADED,
                       Boolean.toString(ap.canPrintDegraded()));
      
      -
      +        if (document.getDocumentCatalog().getLanguage() != null) {
      +            metadata.set(Metadata.CONTENT_LANGUAGE, document.getDocumentCatalog().getLanguage());
      +        }
      +
               //now go for the XMP
               Document dom = loadDOM(document.getDocumentCatalog().getMetadata(), metadata, context);
      
      diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
      index 93966e4f2..7b7ba14fe 100644
      --- a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
      +++ b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
      @@ -1310,6 +1310,14 @@ public class PDFParserTest extends TikaTest {
               assertContains("Tika - Content", content);
           }
      
      +    @Test
      +    public void testMissingLanguage() throws Exception {
      +        Metadata metadata = getXML("my-example.pdf").metadata;
      +        System.out.println(metadata);
      +        assertEquals("application/pdf", metadata.get(Metadata.CONTENT_TYPE));
      +        assertEquals("en-US", metadata.get(Metadata.CONTENT_LANGUAGE));
      +    }
      +
           @Test
           public void testConfiguringMoreParams() throws Exception {
               try (InputStream configIs = getClass().getResourceAsStream("/org/apache/tika/parser/pdf/tika-inline-config.xml")) {
      

       

      It's my first time looking at this code, so that change may be a bit naive, but hopefully it shows what I'm getting at.
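
      The guard in the patch can also be sketched in isolation. This is only an illustration under stated assumptions: `LanguageGuard` and `usableLanguage` are hypothetical names, not part of Tika or PDFBox, and the plain `String` argument stands in for the value returned by `document.getDocumentCatalog().getLanguage()`. Unlike the patch, the sketch reads the value once and also skips blank strings, which may or may not be wanted.

```java
import java.util.Optional;

public class LanguageGuard {

    // Hypothetical helper (not Tika or PDFBox API).
    // The argument stands in for the catalog's language value, which may be null
    // when the PDF does not declare a language.
    static Optional<String> usableLanguage(String catalogLanguage) {
        if (catalogLanguage == null) {
            return Optional.empty();
        }
        String trimmed = catalogLanguage.trim();
        // Assumption: a blank value is treated the same as a missing one.
        return trimmed.isEmpty() ? Optional.empty() : Optional.of(trimmed);
    }

    public static void main(String[] args) {
        // Each call prints either the cleaned language tag or a placeholder.
        System.out.println(usableLanguage("en-US").orElse("(none)"));
        System.out.println(usableLanguage(null).orElse("(none)"));
        System.out.println(usableLanguage("   ").orElse("(none)"));
    }
}
```

      In the parser, the equivalent would presumably be to store the catalog's language in a local variable and call `metadata.set(Metadata.CONTENT_LANGUAGE, ...)` only when a value is present.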

          People

            Assignee: Unassigned
            Reporter: Matt Sheppard (mattsheppard)
            Votes: 0
            Watchers: 3
