Description
Tika does not currently return the language from a PDF's metadata. I see this with an example PDF that I'm seeking permission to share with you, and it may well be the case for all PDFs.
It would be useful to me (and I imagine others) if it could do so.
The example PDF I have does get a language when processed with exiftool...
$ exiftool -X /tmp/my-example.pdf | grep -i lang
 <PDF:Language>en-US</PDF:Language>
whereas it does not with Tika.
I looked briefly into the PDF parsing code, and it appears that the language value in question is available within PDFBox's document catalog, so I can pass it through with a change such as...
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
index b2a15cab6..66b1c9343 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
@@ -224,7 +224,10 @@ public class PDFParser extends AbstractParser implements Initializable {
         metadata.set(AccessPermissions.CAN_PRINT_DEGRADED,
                 Boolean.toString(ap.canPrintDegraded()));
-
+        if (document.getDocumentCatalog().getLanguage() != null) {
+            metadata.set(Metadata.CONTENT_LANGUAGE, document.getDocumentCatalog().getLanguage());
+        }
+
         //now go for the XMP
         Document dom = loadDOM(document.getDocumentCatalog().getMetadata(), metadata, context);
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
index 93966e4f2..7b7ba14fe 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
@@ -1310,6 +1310,14 @@ public class PDFParserTest extends TikaTest {
         assertContains("Tika - Content", content);
     }
 
+    @Test
+    public void testMissingLanguage() throws Exception {
+        Metadata metadata = getXML("my-example.pdf").metadata;
+        System.out.println(metadata);
+        assertEquals("application/pdf", metadata.get(Metadata.CONTENT_TYPE));
+        assertEquals("en-US", metadata.get(Metadata.CONTENT_LANGUAGE));
+    }
+
     @Test
     public void testConfiguringMoreParams() throws Exception {
         try (InputStream configIs = getClass().getResourceAsStream("/org/apache/tika/parser/pdf/tika-inline-config.xml")) {
It's my first time looking at this code, so that change may be a bit naive, but hopefully shows what I'm getting at.
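One possible refinement of the idea, as a sketch only: read the language value once and skip it when it is null or blank, so that an empty catalog entry does not end up in the metadata. To keep it runnable on its own, the sketch uses a plain Map in place of Tika's Metadata and a String argument in place of the value from document.getDocumentCatalog().getLanguage(). The class LanguagePassThrough and the method setLanguageIfPresent are hypothetical names, not part of Tika or PDFBox.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch: a Map stands in for Tika's Metadata, and a plain String
// stands in for the value returned by PDFBox's getDocumentCatalog().getLanguage().
final class LanguagePassThrough {

    // Assumed to mirror the string value behind Metadata.CONTENT_LANGUAGE.
    static final String CONTENT_LANGUAGE = "Content-Language";

    // Hypothetical helper: copies the language only when one is actually present.
    static void setLanguageIfPresent(Map<String, String> metadata, String language) {
        if (language == null || language.trim().isEmpty()) {
            return; // nothing usable declared in the catalog, leave metadata untouched
        }
        metadata.put(CONTENT_LANGUAGE, language.trim());
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        // "en-US" is the value exiftool reported for the example PDF.
        setLanguageIfPresent(metadata, "en-US");
        System.out.println(metadata.get(CONTENT_LANGUAGE));
    }
}
```

In the parser itself the same shape would mean storing the result of getLanguage() in a local variable before the null check, rather than calling it twice.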