Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Won't Fix
Description
While working on TIKA-1285 (migrate to PDFBox 2.0.0), rpialum noticed that the DomXmpParser was failing on some XMP with:
org.apache.xmpbox.xml.XmpParsingException: Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
    at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:449)
    at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:338)
    at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:305)
    at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:234)
    at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:198)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:202)
One file that triggers this is available on TIKA-1252.
The raw XMP for that file includes:
<dc:creator>
  <rdf:Bag>
    <rdf:li>Single Author</rdf:li>
  </rdf:Bag>
</dc:creator>
On TIKA-1252, I confirmed that this violates the spec (dc:creator should be an rdf:Seq, not an rdf:Bag), and alexandre.madurell@gmail.com confirmed that this is what Acrobat generates.
So, would it be easy enough to allow for some divergence from the standard?
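To make the request concrete, here is a minimal sketch of what "some divergence" could look like. It is not xmpbox code: the class LenientArraySketch, its methods and the strict flag are hypothetical names, and it uses only the JDK's DOM parser. The assumption is that in non-strict mode an rdf:Bag would be accepted wherever an rdf:Seq is expected for dc:creator, while strict mode keeps rejecting it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; this is not how DomXmpParser is implemented.
class LenientArraySketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Builds a small XMP-like fragment whose dc:creator uses the given array type.
    static String sampleXmp(String arrayType) {
        return "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\" xmlns:dc=\"" + DC_NS + "\">"
                + "<rdf:Description rdf:about=\"\">"
                + "<dc:creator><rdf:" + arrayType + ">"
                + "<rdf:li>Single Author</rdf:li>"
                + "</rdf:" + arrayType + "></dc:creator>"
                + "</rdf:Description></rdf:RDF>";
    }

    // Reads the dc:creator values. rdf:Seq is always accepted;
    // rdf:Bag is accepted only when strict is false.
    static List<String> readCreators(String xml, boolean strict) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        List<String> values = new ArrayList<>();
        NodeList creators = doc.getElementsByTagNameNS(DC_NS, "creator");
        for (int i = 0; i < creators.getLength(); i++) {
            Element array = firstChildElement(creators.item(i));
            if (array == null) {
                continue; // no array container inside this dc:creator
            }
            boolean isRdf = RDF_NS.equals(array.getNamespaceURI());
            boolean isSeq = isRdf && "Seq".equals(array.getLocalName());
            boolean isBag = isRdf && "Bag".equals(array.getLocalName());
            if (!isSeq && !(isBag && !strict)) {
                throw new IllegalArgumentException(
                        "Invalid array type, expecting Seq and found " + array.getLocalName());
            }
            NodeList items = array.getElementsByTagNameNS(RDF_NS, "li");
            for (int j = 0; j < items.getLength(); j++) {
                values.add(items.item(j).getTextContent().trim());
            }
        }
        return values;
    }

    // Returns the first child that is an element, skipping whitespace text nodes.
    static Element firstChildElement(Node parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Lenient mode tolerates the Bag that Acrobat writes.
        System.out.println(readCreators(sampleXmp("Bag"), false));
    }
}
```

With this shape, readCreators(sampleXmp("Bag"), true) still throws, so callers that need spec conformance lose nothing, while callers such as Tika could opt into the tolerant behaviour.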
Code to reproduce the issue in a Tika setup:
@Test
public void oneOffMetadataTest() throws Exception {
    // try-with-resources closes the document once the test is done
    try (PDDocument doc = PDDocument.load(
            this.getClass().getResourceAsStream("/test-documents/sampleAcrobat_4_x.pdf"))) {
        DomXmpParser p = new DomXmpParser();
        p.setStrictParsing(false);
        // Throws XmpParsingException even though strict parsing is disabled
        p.parse(doc.getDocumentCatalog().getMetadata().exportXMPMetadata());
    }
}
Issue Links
- is depended upon by: TIKA-1285 Upgrade to PDFBox 2.0.0 when available (Closed)