Does it make sense to put this in an FST where the key is the term bytes and the value is what you're doing now for the positions, offsets, and payloads in a byte array?
That's a neat idea! We should [almost] just be able to use MemoryPostingsFormat, since it already stores all postings in an FST.
I think an FST would not compress as well as LZ4 or Deflate can? But maybe it could speed up TermsEnum.seekCeil on large documents, so it might be an interesting idea for random-access speed?
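To make the lookup side of this concrete, here is a minimal sketch of a per-document map from term to an opaque byte array, with a ceiling lookup. It is only a stand-in: it uses a JDK `TreeMap` rather than an FST, the class and method names (`TermVectorMapSketch`, `seekCeil`, `postings`) are hypothetical, and `String` ordering matches UTF-8 byte ordering only for simple cases such as ASCII terms.

```java
import java.util.TreeMap;

/** Hypothetical stand-in for a sorted term -> encoded-postings structure (not an FST). */
final class TermVectorMapSketch {
    // Sorted map: term -> opaque bytes holding positions, offsets and payloads.
    private final TreeMap<String, byte[]> terms = new TreeMap<>();

    void put(String term, byte[] encodedPostings) {
        terms.put(term, encodedPostings);
    }

    /** Returns the smallest stored term >= target, or null if there is none. */
    String seekCeil(String target) {
        return terms.ceilingKey(target);
    }

    /** Returns the encoded postings for an exact term, or null if absent. */
    byte[] postings(String term) {
        return terms.get(term);
    }
}
```

A sorted structure answers the ceiling lookup without scanning every term of the document, which is the property that would matter for large documents.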
Likely it would not compress as well, since LZ4/Deflate are able to share common infix fragments too, but FST only shares prefix/suffix. It'd be interesting to test ... but we should explore this (FST-backed TermVectorsFormat) in a new issue I think ... this issue seems awesome enough already
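As a rough illustration of the infix point, the sketch below measures how many bytes Deflate produces for a block of terms. It covers only the Deflate side (no FST is built, so it is not a comparison), and the helper names `deflatedSize` and `termBlock` are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Hypothetical helper: measures Deflate output size for a block of terms. */
final class InfixSharingSketch {
    /** Returns the number of bytes Deflate produces for the given input. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[256];
            int total = 0;
            // Drain the compressor; only the byte count is kept.
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end();
        }
    }

    /** Joins terms with newlines as a simple stand-in for a block of term bytes. */
    static byte[] termBlock(String... terms) {
        return String.join("\n", terms).getBytes(StandardCharsets.UTF_8);
    }
}
```

Feeding it terms such as "decompressed", "recompressing" and "uncompression" lets the compressor reuse the repeated "compress" fragment even though those terms share neither a prefix nor a suffix.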
Or... can we simply reference the terms by ord (an int) instead of writing each term bytes?
Using ords matching the main terms dict is a neat idea too! It would be much more compact ... but when reading the term vectors we'd need to resolve-by-ord against the main terms dictionary (not all postings formats support that: it's optional, and e.g. our default PF doesn't), which would likely be slower than today.
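A minimal sketch of the trade-off, assuming the main terms dictionary can be modelled as a sorted `String[]`. This is plain JDK code, not Lucene's API, and `OrdTermVectorSketch`, `toOrds` and `resolve` are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical sketch: a term vector stored as ords into a sorted dictionary. */
final class OrdTermVectorSketch {
    /** Write side: replace each term with its position in the sorted dictionary. */
    static int[] toOrds(String[] sortedDictionary, String[] docTerms) {
        int[] ords = new int[docTerms.length];
        for (int i = 0; i < docTerms.length; i++) {
            int ord = Arrays.binarySearch(sortedDictionary, docTerms[i]);
            if (ord < 0) {
                throw new IllegalArgumentException("term not in dictionary: " + docTerms[i]);
            }
            ords[i] = ord;
        }
        return ords;
    }

    /** Read side: every ord costs a lookup against the dictionary. */
    static String[] resolve(String[] sortedDictionary, int[] ords) {
        String[] terms = new String[ords.length];
        for (int i = 0; i < ords.length; i++) {
            terms[i] = sortedDictionary[ords[i]];
        }
        return terms;
    }
}
```

The stored form is one int per term instead of the term bytes, but the reader cannot produce a single term without going through the dictionary. Here that is an array access; against a real terms dictionary it would be a seek-by-ord, where supported.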
Is that information available somewhere when writing/merging term vectors?
Unfortunately, no. We only assign ords when it's time to flush the segment ... but we write term vectors "live" as we index each document. If we changed that, e.g. by buffering up term vectors, then we could get the ords when we wrote them.
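The buffering idea can be sketched as follows, under the assumption that a term's ord is simply its rank among all terms of the segment in sorted order. The class `BufferedTermVectorsSketch` and its methods are hypothetical and only model the ordering problem, not Lucene's indexing chain.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: buffer per-document terms, assign ords only at flush. */
final class BufferedTermVectorsSketch {
    private final List<String[]> bufferedDocs = new ArrayList<>();
    private final TreeSet<String> segmentTerms = new TreeSet<>();

    /** Indexing time: ords are not known yet, so keep the terms themselves. */
    void addDocument(String... terms) {
        bufferedDocs.add(terms.clone());
        for (String term : terms) {
            segmentTerms.add(term);
        }
    }

    /** Flush time: the term set is final, so ords are stable and can be written. */
    List<int[]> flush() {
        Map<String, Integer> ordOf = new HashMap<>();
        int nextOrd = 0;
        for (String term : segmentTerms) { // TreeSet iterates in sorted order
            ordOf.put(term, nextOrd++);
        }
        List<int[]> encoded = new ArrayList<>();
        for (String[] doc : bufferedDocs) {
            int[] ords = new int[doc.length];
            for (int i = 0; i < doc.length; i++) {
                ords[i] = ordOf.get(doc[i]);
            }
            encoded.add(ords);
        }
        return encoded;
    }
}
```

With documents ("term", "vector") and then ("fst", "term"), "term" would rank first after the first document alone but second once "fst" has been seen, which is why the ords cannot be written live. The cost of this design is holding every document's term vector in memory until flush.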