PDFBOX-3068 (sub-task of PDFBOX-3058: Support TIKA Migration to PDFBox 2.0)

Null metadata in 2.0 in some files that had metadata in 1.8.10 with old parser

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.10, 1.8.11, 2.0.0
    • Fix Version/s: 1.8.11, 2.0.0
    • Component/s: Parsing
    • Labels:
      None

      Description

      Tilman's observation on 'Microsoft' below revealed two things: 1) we should use our BodyContentHandler so that title metadata doesn't slip into the body content, and 2) the title and all other metadata values from PDDocumentInformation are null for at least one file: NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU

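              // Load the affected file
              Path p = Paths.get("..NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU");
              PDDocument d = PDDocument.load(p.toFile());
              // The title comes back null...
              assertNull(d.getDocumentInformation().getTitle());
              // ...even though the document information reports 8 metadata keys
              assertEquals(8, d.getDocumentInformation().getMetadataKeys().size());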
      

      I manually reviewed a handful of documents from the metadata/metadata_value_count_diffs.csv file here, and this looks to be quite pervasive, unless I'm loading the documents and metadata the wrong way.
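
      To check more documents for the same symptom (keys present, values null), a small helper along these lines might be useful. This is only a sketch: the class name NullMetadataCheck and the method nullValuedKeys are hypothetical and not part of PDFBox or Tika, and the helper takes a plain key set plus a lookup function so it can be tried without a PDF at hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of PDFBox or Tika.
class NullMetadataCheck {

    /**
     * Returns every key whose looked-up value is null,
     * in the iteration order of the given key set.
     */
    static List<String> nullValuedKeys(Set<String> keys, Function<String, String> lookup) {
        List<String> missing = new ArrayList<>();
        for (String key : keys) {
            if (lookup.apply(key) == null) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

      Assuming PDFBox is on the classpath and info is the result of d.getDocumentInformation(), it could presumably be called as NullMetadataCheck.nullValuedKeys(info.getMetadataKeys(), info::getCustomMetadataValue). For the file above, the expectation would be that all 8 keys are reported.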

            People

            • Assignee:
              tilman Tilman Hausherr
              Reporter:
              tallison@mitre.org Tim Allison
            • Votes:
              0
              Watchers:
              5
