[PDFBOX-2855] Allow some flexibility for divergences from the standard on Seq vs Bag in DomXMPParser - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

While working on TIKA-1285 (migrate to PDFBox 2.0.0), rpialum noticed that DomXmpParser was failing on some XMP with:

org.apache.xmpbox.xml.XmpParsingException: Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
	at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:449)
	at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:338)
	at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:305)
	at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:234)
	at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:198)
	at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:202)

One file that triggers this is available on TIKA-1252.

The raw XMP for that file includes:

         <dc:creator>
            <rdf:Bag>
               <rdf:li>Single Author</rdf:li>
            </rdf:Bag>
         </dc:creator>

On TIKA-1252, I confirmed that this is against the spec (the parser expects dc:creator to be an rdf:Seq, not an rdf:Bag), and alexandre.madurell@gmail.com confirmed that this is what Acrobat was generating.

So, would it be easy enough to allow for some divergence from the standard?

Code to reproduce the issue in a Tika setup:

    @Test
    public void oneOffMetadataTest() throws Exception {
        // Sample PDF whose XMP wraps dc:creator in rdf:Bag instead of rdf:Seq
        try (PDDocument doc = PDDocument.load(
                getClass().getResourceAsStream("/test-documents/sampleAcrobat_4_x.pdf"))) {
            DomXmpParser p = new DomXmpParser();
            // Strict parsing is disabled here; the issue is reported with this setting
            p.setStrictParsing(false);
            p.parse(doc.getDocumentCatalog().getMetadata().exportXMPMetadata());
        }
    }
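
Since the divergence is a single element name, one possible caller-side workaround is to rewrite the XMP before handing it to the parser. The following is only a sketch under stated assumptions: it uses the JDK's DOM and transformer APIs, not any PDFBox facility, and the class name XmpArrayNormalizer and method creatorBagToSeq are hypothetical names introduced for illustration. It has not been checked against DomXmpParser itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of PDFBox or Tika.
final class XmpArrayNormalizer {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns a copy of the XMP in which any rdf:Bag directly under dc:creator is renamed to rdf:Seq. */
    static byte[] creatorBagToSeq(byte[] xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required to match elements by namespace URI
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));

        NodeList creators = dom.getElementsByTagNameNS(DC_NS, "creator");
        for (int i = 0; i < creators.getLength(); i++) {
            for (Node child = creators.item(i).getFirstChild(); child != null; child = child.getNextSibling()) {
                boolean isBag = child instanceof Element
                        && RDF_NS.equals(child.getNamespaceURI())
                        && "Bag".equals(child.getLocalName());
                if (isBag) {
                    // Keep whatever prefix the document uses for the RDF namespace
                    String prefix = child.getPrefix();
                    dom.renameNode(child, RDF_NS, prefix == null ? "Seq" : prefix + ":Seq");
                    break; // a dc:creator holds a single array element
                }
            }
        }

        // Serialize the modified DOM back to bytes with an identity transform
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(dom), new StreamResult(out));
        return out.toByteArray();
    }
}
```

In the reproduction above, the bytes read from exportXMPMetadata() would be passed through creatorBagToSeq and the result given to the parser (for example wrapped in a ByteArrayInputStream). Only dc:creator is touched; other Bag arrays are left as they are.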

Issue Links

is depended upon by: TIKA-1285 Upgrade to PDFBox 2.0.0 when available (Closed)

People

Assignee: Unassigned

Reporter: Tim Allison

Votes: 0

Watchers: 2

Dates

Created: 07/Jul/15 17:45

Updated: 17/Mar/16 19:14

Resolved: 07/Jul/15 19:40