  1. PDFBox
  2. PDFBOX-2855

Allow some flexibility for divergences from the standard on Seq vs Bag in DomXmpParser

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      While working on TIKA-1285 (migrate to PDFBox 2.0.0), Jeremy Anderson noticed that the DomXmpParser was failing on some XMP with:

      org.apache.xmpbox.xml.XmpParsingException: Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
      	at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:449)
      	at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:338)
      	at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:305)
      	at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:234)
      	at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:198)
      	at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:202)
      

      One file that triggers this is available on TIKA-1252.

      The raw XMP for that file includes:

               <dc:creator>
                  <rdf:Bag>
                     <rdf:li>Single Author</rdf:li>
                  </rdf:Bag>
               </dc:creator>
      

      On TIKA-1252, I confirmed that this violates the XMP specification, which defines dc:creator as an ordered array (rdf:Seq), and Alexandre Madurell confirmed that this is what Acrobat was generating.

      So, would it be easy enough to tolerate this kind of divergence from the standard, at least when strict parsing is disabled?

      Code to reproduce the issue in a Tika test setup:

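          import org.apache.pdfbox.pdmodel.PDDocument;
          import org.apache.xmpbox.xml.DomXmpParser;
          import org.junit.Test;

          @Test
          public void oneOffMetadataTest() throws Exception {
              // Close the document when done; the parse call below throws XmpParsingException.
              try (PDDocument doc = PDDocument.load(
                      getClass().getResourceAsStream("/test-documents/sampleAcrobat_4_x.pdf"))) {
                  DomXmpParser p = new DomXmpParser();
                  p.setStrictParsing(false); // non-strict mode still rejects the Bag in dc:creator
                  p.parse(doc.getDocumentCatalog().getMetadata().exportXMPMetadata());
              }
          }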
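
Until the parser tolerates this, one possible caller-side workaround is to rewrite the offending array type before the XMP reaches DomXmpParser. The following is only a minimal sketch under stated assumptions: `XmpArrayNormalizer` and `bagToSeqForCreator` are hypothetical names, not part of PDFBox or xmpbox; it uses only the JDK's DOM and transformer APIs; it handles only `dc:creator`; and it has not been tested against xmpbox. Re-serializing may also drop the whitespace padding that XMP packets usually carry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of PDFBox or xmpbox.
final class XmpArrayNormalizer {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the XMP with every rdf:Bag directly under dc:creator renamed to rdf:Seq. */
    static byte[] bagToSeqForCreator(byte[] xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required to match elements by namespace URI
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));

        NodeList creators = doc.getElementsByTagNameNS(DC_NS, "creator");
        for (int i = 0; i < creators.getLength(); i++) {
            Node child = creators.item(i).getFirstChild();
            while (child != null) {
                if (isRdfBag(child)) {
                    String prefix = child.getPrefix();
                    String qualifiedName = (prefix == null) ? "Seq" : prefix + ":Seq";
                    // renameNode may return a replacement node, so continue from its result.
                    child = doc.renameNode(child, RDF_NS, qualifiedName);
                }
                child = child.getNextSibling();
            }
        }

        // Serialize the modified DOM back to bytes without an XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    private static boolean isRdfBag(Node node) {
        return node.getNodeType() == Node.ELEMENT_NODE
                && RDF_NS.equals(node.getNamespaceURI())
                && "Bag".equals(node.getLocalName());
    }
}
```

The returned bytes could then be handed to DomXmpParser in place of the original metadata stream. Properties that the specification does define as a Bag are left untouched, since only children of `dc:creator` are renamed.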
      

              People

              • Assignee: Unassigned
              • Reporter: Tim Allison (tallison@mitre.org)