We can encode whether the posting is embedded or not by storing a byte or a negative pointer, for example. There are ways to do it with little or no extra space.
Remember that vInt/vLong don't handle negative numbers well (they take the max # bytes, I think).
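To make the cost concrete, here is a rough sketch that counts how many bytes a value needs under a 7-bits-per-byte variable-length encoding, which is how I understand vInt/vLong to work. The class and method names are made up for illustration and are not Lucene's API. `packFlag` is one hypothetical way to carry the "embedded" bit in the low bit of the pointer instead of using the sign.

```java
// Sketch only: counts bytes for a 7-bits-per-byte varint layout.
// VarIntSize, vIntBytes, vLongBytes and packFlag are hypothetical names.
final class VarIntSize {

    /** Bytes needed for a 32-bit value written 7 bits at a time. */
    static int vIntBytes(int value) {
        int bytes = 1;
        // Unsigned shift: a negative value keeps high bits set until
        // all 32 bits are consumed, so it always needs 5 bytes.
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Bytes needed for a 64-bit value written 7 bits at a time. */
    static int vLongBytes(long value) {
        int bytes = 1;
        // Same loop over 64 bits: a negative value needs 10 bytes.
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Assumed alternative: keep the pointer non-negative and store the flag in bit 0. */
    static int packFlag(int pointer, boolean embedded) {
        return (pointer << 1) | (embedded ? 1 : 0);
    }
}
```

Under these assumptions, a pointer of 1000 negated costs 5 bytes as a vInt, while the same pointer with the flag packed into the low bit costs 2.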
The thing is - there is a performance penalty to storing too many bytes in the terms dict because it may affect terms lookup. docFreq may not be a very good basis for that decision.
True, but I'd expect "typically" rare terms (occurring in 1 or 2 docs across the corpus) also tend to have low frequency within those documents. Hmm, or maybe not – maybe there's only a single article about Dr. Froobalaz, but in that article Froobalaz is mentioned many, many times.
For example, a term may have one posting element with a huge payload.
True, though such apps (the exception not the rule) could override the codec.
Fixed #bytes might also allow for faster scanning, i.e. if we always leave a 20 byte slot we know we can seek +20 bytes ahead, vs the pulsing codec, which must decode the postings for the term when scanning over it. (Though if we thought this mattered we could also write the #bytes up front.)
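A minimal sketch of the two skipping strategies, just to show the difference in work per term while scanning. The layout is an assumption for illustration: the 20-byte slot comes from the example above, the one-byte length prefix is a simplification of "write the #bytes up front", and `TermEntrySkipping` and its methods are hypothetical names, not anything in the codec.

```java
import java.nio.ByteBuffer;

// Sketch only: hypothetical entry layouts, not the real terms dict format.
final class TermEntrySkipping {

    /** Slot size taken from the 20-byte example in the discussion. */
    static final int SLOT_BYTES = 20;

    /** Fixed-size slots: the offset of entry i is pure arithmetic, nothing is decoded. */
    static int fixedSlotOffset(int base, int entryIndex) {
        return base + entryIndex * SLOT_BYTES;
    }

    /**
     * Length-prefixed entries: read the length (assumed here to be a single
     * unsigned byte), then jump past the payload without decoding it.
     */
    static void skipLengthPrefixed(ByteBuffer buf) {
        int len = buf.get() & 0xFF;
        buf.position(buf.position() + len);
    }
}
```

Either way the scan avoids decoding the embedded postings; the fixed slot trades wasted space for the ability to compute an offset directly, while the length prefix costs one extra read per term.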
Net/net I think we should pursue this; we should probably keep both options available and then we can test.