PDFBox / PDFBOX-2855

Allow some flexibility for divergences from the standard on Seq vs Bag in DomXmpParser


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix

    Description

      While working on TIKA-1285 (migrate to PDFBox 2.0.0), rpialum noticed that the DomXmpParser was failing on some XMP with:

      org.apache.xmpbox.xml.XmpParsingException: Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
      	at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:449)
      	at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:338)
      	at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:305)
      	at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:234)
      	at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:198)
      	at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:202)
      

      One file that triggers this is available on TIKA-1252.

      The raw xmp for that file includes:

               <dc:creator>
                  <rdf:Bag>
                     <rdf:li>Single Author</rdf:li>
                  </rdf:Bag>
               </dc:creator>
      

      On TIKA-1252, I confirmed that this is against the spec (the exception above shows that the parser expects an rdf:Seq for dc:creator), and alexandre.madurell@gmail.com confirmed that this is what Acrobat was generating.

      So, would it be easy enough to allow for some divergence from the standard?
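
      Until the parser tolerates this, one caller-side workaround is to read dc:creator directly from the XMP packet and accept either container. The following is only a minimal sketch using the JDK's own DOM API, not xmpbox and not how DomXmpParser works internally. The class name LenientCreatorReader and the method readCreators are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of PDFBox/xmpbox): reads dc:creator values
// from an XMP packet, accepting rdf:Seq (per the spec) as well as rdf:Bag.
class LenientCreatorReader {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    static List<String> readCreators(String xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the *NS lookups below
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

        List<String> creators = new ArrayList<>();
        NodeList props = doc.getElementsByTagNameNS(DC, "creator");
        for (int i = 0; i < props.getLength(); i++) {
            // Look at the direct children of dc:creator for an array container.
            for (Node c = props.item(i).getFirstChild(); c != null; c = c.getNextSibling()) {
                if (!(c instanceof Element) || !RDF.equals(c.getNamespaceURI())) {
                    continue;
                }
                String container = c.getLocalName();
                // Tolerate the out-of-spec Bag in addition to the expected Seq.
                if (!"Seq".equals(container) && !"Bag".equals(container)) {
                    continue;
                }
                NodeList items = ((Element) c).getElementsByTagNameNS(RDF, "li");
                for (int j = 0; j < items.getLength(); j++) {
                    creators.add(items.item(j).getTextContent().trim());
                }
            }
        }
        return creators;
    }
}
```

      With the XMP fragment above (wrapped in an rdf:RDF element that declares the rdf and dc namespaces), readCreators returns a one-element list containing "Single Author", whether the container is rdf:Bag or rdf:Seq. Because a Bag carries no ordering guarantee, the result should not be treated as an ordered author list when the source used Bag.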

      Code to reproduce the issue in a Tika setup:

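          import org.apache.pdfbox.pdmodel.PDDocument;
          import org.apache.xmpbox.xml.DomXmpParser;
          import org.junit.Test;
          @Test
          public void oneOffMetadataTest() throws Exception {
              // try-with-resources closes the document after the test
              try (PDDocument doc = PDDocument.load(
                      this.getClass().getResourceAsStream("/test-documents/sampleAcrobat_4_x.pdf"))) {
                  DomXmpParser p = new DomXmpParser();
                  p.setStrictParsing(false);
                  p.parse(doc.getDocumentCatalog().getMetadata().exportXMPMetadata());
              }
          }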
      

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)
              Votes: 0
              Watchers: 2
