[PDFBOX-3881] Handling of Byte Order Mark with Metadata-Fields - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.7
Fix Version/s: 2.0.8, 3.0.0 PDFBox
Component/s: Parsing
Labels:
- BOM
Environment:
Windows

Description

PDDocumentInformation getters such as getAuthor() honor the byte order of the extracted string and strip the byte order mark (BOM).

However, if the extracted string consists only of the BOM, the string "þÿ" is returned instead.

Is this the intended behavior?
I would appreciate it if the BOM were also removed when the extracted string contains nothing else.

Problematic code:

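public String getString()
{
    // A string holding only the two BOM bytes fails this length check
    if (this.bytes.length > 2)
    {
        // 0xFE 0xFF: UTF-16BE byte order mark
        if (((this.bytes[0] & 0xFF) == 254) && ((this.bytes[1] & 0xFF) == 255))
        {
            return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16BE);
        }
        // 0xFF 0xFE: UTF-16LE byte order mark
        if (((this.bytes[0] & 0xFF) == 255) && ((this.bytes[1] & 0xFF) == 254))
        {
            return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16LE);
        }
    }

    // Fallback: the bytes are decoded as PDFDocEncoding, BOM bytes included
    return PDFDocEncoding.toString(this.bytes);
}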

The attachment contains an example PDF.
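One possible adjustment, sketched here under assumptions and not necessarily the change that was committed for this issue, is to relax the length check from "> 2" to ">= 2" so that a BOM-only byte array decodes to an empty string. The class BomStringSketch and the method decode are hypothetical names used for illustration only. To keep the example self-contained, ISO-8859-1 stands in for PDFDocEncoding.toString in the fallback path.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not the PDFBox class in which getString() lives.
class BomStringSketch
{
    static String decode(byte[] bytes)
    {
        // ">= 2" instead of "> 2": a bare two-byte BOM is now recognized
        if (bytes.length >= 2)
        {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF)
            {
                // UTF-16BE BOM: decode the remainder (possibly zero bytes)
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
            if (b0 == 0xFF && b1 == 0xFE)
            {
                // UTF-16LE BOM: decode the remainder (possibly zero bytes)
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
        }
        // Assumption: ISO-8859-1 replaces PDFDocEncoding.toString(bytes) here
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        byte[] bomOnly = { (byte) 0xFE, (byte) 0xFF };
        byte[] bomPlusA = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };
        System.out.println("[" + decode(bomOnly) + "]");
        System.out.println("[" + decode(bomPlusA) + "]");
    }
}
```

With this check, a metadata field that contains only the BOM would yield an empty string rather than "þÿ", while strings with content after the BOM are decoded as before.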

Attachments

ERiCDruck_23776162_ESt_0_20170727_121644-pdfcreator.pdf
27/Jul/17 10:49
34 kB
Nico Prenzel

People

Assignee: Tilman Hausherr

Reporter: Nico Prenzel

Votes: 1

Watchers: 3

Dates

Created: 27/Jul/17 10:59

Updated: 02/Nov/17 21:00

Resolved: 28/Jul/17 17:47