drawString() in PDPageContentStream just writes the text into the PDF as any COSString would choose to represent it. This is not the right thing to do. When the font is a CID-keyed font, every glyph is 16 bits wide by definition, and COSString won't necessarily notice and write it correctly.
Not quite: every CID can be up to 16 bits wide, but many (or, for fonts with fewer than 256 glyphs, all) will fit inside 8 bits. The byte width of a string is controlled by the current font's CMap, but is always 16 bits with TTF.
Therefore, drawString() must know which font is currently in use, and ask that font to encode the String to whatever byte sequence it takes to draw those glyphs. So, PDFont must be added to the drawString() API, and PDFont ought to have a method "public byte[] encode(String)".
drawString() is only valid after setFont() has been called, so PDFont doesn't need to be added to its API; we can just use the current font. PDFont#encode is a good idea, yes.
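A minimal sketch of that idea, not actual PDFBox code: the stream remembers the font passed to setFont() and asks it to encode the text when drawString() is called. The names SketchFont, Latin1SketchFont and SketchContentStream are hypothetical, the Latin-1 font is an assumption made only so the example runs, and the text is written as a hex string to sidestep escaping.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for PDFont: the font decides how text becomes bytes.
interface SketchFont {
    byte[] encode(String text);
}

// Assumed single-byte font, present only to make the sketch runnable.
class Latin1SketchFont implements SketchFont {
    @Override
    public byte[] encode(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }
}

// Hypothetical stand-in for PDPageContentStream.
class SketchContentStream {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private SketchFont currentFont; // stays null until setFont() is called

    void setFont(String resourceName, SketchFont font, float size) {
        currentFont = font;
        write("/" + resourceName + " " + size + " Tf\n");
    }

    void drawString(String text) {
        if (currentFont == null) {
            throw new IllegalStateException("setFont() must be called before drawString()");
        }
        // The current font, not the stream, chooses the byte representation.
        StringBuilder hex = new StringBuilder("<");
        for (byte b : currentFont.encode(text)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        write(hex.append("> Tj\n").toString());
    }

    byte[] toByteArray() {
        return out.toByteArray();
    }

    private void write(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        out.write(bytes, 0, bytes.length);
    }
}
```

With this shape the drawString() signature stays as it is; only its implementation changes to consult the current font.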
PDFont needs a clearly specified API that transforms a Java String into the font-specific encoding.
Yes, as above.
Observe that there are no methods in PDFont called decode(), and I have a hard time figuring out what any one of these methods actually does, because everything seems to be called "encode" or "lookup". It seems that encode(byte[], int, int) performs decoding, so it should be renamed accordingly.
Yes, I don't know if anybody knows what those methods are actually doing, including the original author.
In general I'd recommend pushing the encode/decode job down to the font layer. Provide just two methods: "byte[] encode(String)" and "String decode(byte[])". Their job is to convert between the byte sequences required by that font and Java Strings, and they handle full runs of text, not just single characters. They will then use single-byte or multibyte encodings as the font requires, without the higher level having to do crazy stuff like processEncodedText() currently does in PDFStreamEngine.
processEncodedText() is indeed crazy and needs fixing, but what you propose won't work because the 16-bit string encoding is not set by the font, it's set on a per-string basis by having that string start with a BOM.
There are unfortunately very many ways to encode text in PDF, and especially if text needs to be decodable from the byte stream generated by other programs, the full complexity must be faced and implemented. These cases are to be solved on a case-by-case basis in the PDFont hierarchy. The encode and decode methods of the top-level PDFont class should be declared abstract, to reflect the fact that encoding depends on the particular subtype of the font.
Yes, though as far as decoding the correct text is concerned, all you have to do is make sure that the ToUnicode map is built correctly - you can put any old garbage in the actual strings (and many PDFs do).
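To make the two points above concrete, here is a rough sketch, not the PDFBox class hierarchy: an abstract base declares encode and decode, one subtype uses arbitrary single-byte codes and relies entirely on a code-to-Unicode table (standing in for a ToUnicode map), and another writes each code as two big-endian bytes. All class and method names are hypothetical, and the two-byte font assumes, purely for illustration, that the code equals the UTF-16 code unit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical top-level font type: subtypes decide the byte layout.
abstract class SketchBaseFont {
    abstract byte[] encode(String text);
    abstract String decode(byte[] bytes);
}

// Single-byte font whose codes are arbitrary; decoding works only
// because the code-to-Unicode table (a ToUnicode stand-in) is correct.
class SingleByteSketchFont extends SketchBaseFont {
    private final Map<Integer, Character> toUnicode = new HashMap<>();
    private final Map<Character, Integer> fromUnicode = new HashMap<>();

    void map(int code, char unicode) {
        toUnicode.put(code, unicode);
        fromUnicode.put(unicode, code);
    }

    @Override
    byte[] encode(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Integer code = fromUnicode.get(text.charAt(i));
            if (code == null) {
                throw new IllegalArgumentException("No code for character at index " + i);
            }
            out[i] = (byte) code.intValue();
        }
        return out;
    }

    @Override
    String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            Character c = toUnicode.get(b & 0xFF);
            if (c != null) {
                sb.append(c.charValue());
            } else {
                sb.append('\uFFFD'); // unmapped code
            }
        }
        return sb.toString();
    }
}

// Two-byte font: every code occupies two bytes, high byte first.
// Assumption for this sketch only: code == UTF-16 code unit.
class TwoByteSketchFont extends SketchBaseFont {
    @Override
    byte[] encode(String text) {
        byte[] out = new byte[text.length() * 2];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out[2 * i] = (byte) (c >>> 8);
            out[2 * i + 1] = (byte) c;
        }
        return out;
    }

    @Override
    String decode(byte[] bytes) {
        if (bytes.length % 2 != 0) {
            throw new IllegalArgumentException("Expected an even number of bytes");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i += 2) {
            sb.append((char) (((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF)));
        }
        return sb.toString();
    }
}
```

In the single-byte case the codes 1 and 2 could stand for any characters at all; the text is recoverable only through the table, which is the sense in which the string bytes themselves can be arbitrary.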
It may be that for some of these fonts the implementation is the same, because the actual mechanics can be handled by varying the Encoding instance, though.
Maybe, though the Encoding class is for Type1 fonts (and equivalent, e.g. Type1C) only.