PDFBox / PDFBOX-3058 Support TIKA Migration to PDFBox 2.0 / PDFBOX-3068

Null metadata in 2.0 in some files that had metadata in 1.8.10 with old parser

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.10, 1.8.11, 2.0.0
    • Fix Version/s: 1.8.11, 2.0.0
    • Component/s: Parsing
    • Labels: None

    Description

      Tilman's observation on 'Microsoft' below revealed two things: 1) we should use our BodyContentHandler so that title metadata doesn't slip into the body content, and 2) the title and all other metadata values from PDDocumentInformation are null for at least: NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU

              Path p = Paths.get("..NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU");
              PDDocument d = PDDocument.load(p.toFile());
              assertNull(d.getDocumentInformation().getTitle());
              assertEquals(8, d.getDocumentInformation().getMetadataKeys().size());
      

      Manually reviewing a handful of documents in the metadata/metadata_value_count_diffs.csv file here, this looks to be quite pervasive...unless I'm botching the right way to load the documents and metadata.
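
      For anyone who wants to spot-check the same kind of regression on their own files, here is a minimal sketch of the comparison, assuming the metadata from each version has already been dumped into a plain key/value map. The class `MetadataDiff` and the method `lostKeys` are hypothetical names for illustration only; they are not part of PDFBox or Tika, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox or Tika.
class MetadataDiff {

    // Returns the keys that had a non-null value in the old extraction
    // but are missing or null in the new one.
    static List<String> lostKeys(Map<String, String> oldValues,
                                 Map<String, String> newValues) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, String> e : oldValues.entrySet()) {
            if (e.getValue() != null && newValues.get(e.getKey()) == null) {
                lost.add(e.getKey());
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for one file's metadata.
        Map<String, String> fromOldParser = new LinkedHashMap<>();
        fromOldParser.put("Title", "Example title");
        fromOldParser.put("Author", "Example author");

        Map<String, String> fromNewParser = new LinkedHashMap<>();
        fromNewParser.put("Title", null); // key present, value lost
        fromNewParser.put("Author", "Example author");

        System.out.println(lostKeys(fromOldParser, fromNewParser)); // prints [Title]
    }
}
```

      A key that is present with a null value is counted the same as a key that is absent, since in both cases the value is no longer available to the caller.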

          People

            Assignee: Tilman Hausherr (tilman)
            Reporter: Tim Allison (tallison)