  1. PDFBox
  2. PDFBOX-3881

Handling of Byte Order Mark with Metadata-Fields

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.7
    • Fix Version/s: 2.0.8, 3.0.0 PDFBox
    • Component/s: Parsing
    • Labels:
    • Environment:
      Windows

      Description

      The PDDocumentInformation getters, e.g. getAuthor(), honor the byte order of the extracted string and strip the byte order mark (BOM).

      However, if the extracted string consists of the byte order mark alone, the BOM is not stripped and the string "þÿ" is returned.
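
      To illustrate where "þÿ" comes from: the two bytes of a big-endian UTF-16 BOM (0xFE 0xFF), when decoded as single-byte text instead of being stripped, map to the characters þ (U+00FE) and ÿ (U+00FF). The sketch below uses ISO-8859-1 as a stand-in for PDFDocEncoding, on the assumption that both agree for these two byte values; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class BomAsText {
    // Hypothetical helper: decodes the bytes as single-byte text.
    // ISO-8859-1 is an assumed stand-in for PDFDocEncoding here.
    public static String decodeAsSingleByteText(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A string that consists of the UTF-16BE byte order mark only
        byte[] bomOnly = {(byte) 0xFE, (byte) 0xFF};
        String decoded = decodeAsSingleByteText(bomOnly);
        // Print the code points rather than the glyphs to avoid console encoding issues
        System.out.println(Integer.toHexString(decoded.charAt(0))); // fe
        System.out.println(Integer.toHexString(decoded.charAt(1))); // ff
    }
}
```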

      Is this the intended behavior?
      I would appreciate it if the byte order mark were also removed when the extracted string contains nothing else.

      Problematic code:

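      public String getString()
      {
          // Note: "> 2" skips the BOM check when the string is exactly two bytes long (BOM only)
          if (this.bytes.length > 2)
          {
              // 0xFE 0xFF: UTF-16 big-endian byte order mark
              if (((this.bytes[0] & 0xFF) == 254) && ((this.bytes[1] & 0xFF) == 255))
              {
                  return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16BE);
              }
              // 0xFF 0xFE: UTF-16 little-endian byte order mark
              if (((this.bytes[0] & 0xFF) == 255) && ((this.bytes[1] & 0xFF) == 254))
              {
                  return new String(this.bytes, 2, this.bytes.length - 2, Charsets.UTF_16LE);
              }
          }

          // No BOM handled: fall back to PDFDocEncoding
          return PDFDocEncoding.toString(this.bytes);
      }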
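
      One possible way to get the requested behavior, sketched under assumptions rather than taken from the actual fix: relax the length check from "> 2" to ">= 2" so that a string consisting of the BOM alone decodes to an empty string. The class name BomAwareDecoder and the static decode method are hypothetical, and ISO-8859-1 stands in for PDFDocEncoding.toString so that the example runs without PDFBox on the classpath.

```java
import java.nio.charset.StandardCharsets;

public class BomAwareDecoder {
    // Hypothetical standalone variant of getString(); not the PDFBox implementation.
    public static String decode(byte[] bytes) {
        // ">= 2" instead of "> 2", so a two-byte string (BOM only) is also checked
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                // Big-endian BOM: decode the remainder, which may be empty
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                // Little-endian BOM: decode the remainder, which may be empty
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
        }
        // Assumption: ISO-8859-1 as a stand-in for PDFDocEncoding.toString(bytes)
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bomOnly = {(byte) 0xFE, (byte) 0xFF};
        byte[] bomPlusA = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        System.out.println(decode(bomOnly).isEmpty()); // true
        System.out.println(decode(bomPlusA));          // A
    }
}
```

      With this variant, a metadata field whose raw bytes are only 0xFE 0xFF would come back as an empty string rather than "þÿ", while strings without a BOM still take the fallback path.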
      

      The attachment contains an example PDF.

            People

            • Assignee:
              Tilman Hausherr
            • Reporter:
              Nico Prenzel