I wanted to see what we're losing with the removal of AutoPrefix, so I ran a small test with English Wikipedia titles.
I indexed the 12M titles in three indices:
- default: keyword analyzer and the default postings format
- auto_prefix: keyword analyzer and the AutoPrefixPostings format with minAutoPrefix=24, maxAutoPrefix=Integer.MAX_VALUE
- edge: edge ngram analyzer with minGram=1, maxGram=Integer.MAX_VALUE and the default postings format.
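To make the edge index concrete, here is a minimal sketch of what an edge ngram analyzer with minGram=1 and an unbounded maxGram emits for a single title. The function `edge_ngrams` is a hypothetical helper written for illustration, not Lucene's tokenizer; it only shows that every title of length n turns into n indexed terms.

```python
def edge_ngrams(term, min_gram=1, max_gram=None):
    """Return every leading substring of `term` whose length lies
    between min_gram and max_gram (hypothetical helper, not Lucene code)."""
    # An unbounded max_gram means "up to the full length of the term".
    limit = len(term) if max_gram is None else min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, limit + 1)]


grams = edge_ngrams("Lucene")
print(grams)  # one indexed term per character of the title
```

Since each of those terms gets its own postings entry, the index grows with the total length of the titles rather than with their number.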
|size in MB
This table shows the size that each index takes on disk. As you can see, the auto_prefix index is very close to the size of the default one even though we compute all the prefixes that match more than 24 terms. Compared to edge_ngram, which multiplies the index size by a factor of 7, auto prefix seems to be a good trade-off for fields where prefix queries are the norm. I didn't compare the query time, but any prefix that matches more than 24 terms could be resolved by a single inverted list in the auto_prefix index, so for those prefixes it is equivalent to the edge_ngram index.
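The size difference can be illustrated with a toy model. The sketch below is an assumption-laden simplification and not the AutoPrefix algorithm itself: it counts how many distinct terms share each prefix and keeps only the prefixes shared by more than `min_terms` terms, whereas edge ngrams would index every prefix. The names `prefix_counts` and `auto_prefixes` and the sample titles are made up for the example.

```python
from collections import Counter


def prefix_counts(terms):
    """Map every prefix to the number of distinct terms starting with it."""
    counts = Counter()
    for term in set(terms):
        for end in range(1, len(term) + 1):
            counts[term[:end]] += 1
    return counts


def auto_prefixes(terms, min_terms):
    """Toy model: keep only prefixes shared by more than min_terms terms."""
    return {p for p, n in prefix_counts(terms).items() if n > min_terms}


# Made-up sample data, with a tiny threshold instead of 24.
titles = ["apple", "apply", "apricot", "banana", "band"]
every_prefix = prefix_counts(titles)          # what edge ngrams would index
shared = auto_prefixes(titles, min_terms=2)   # what the toy model would add

print(len(every_prefix), sorted(shared))
```

Most prefixes are shared by very few terms, so a threshold leaves only a small set of extra terms to index, which is consistent with the auto_prefix index staying close to the default one in size.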
The downside of auto_prefix seems to be the merge: it takes more than 1 minute to optimize, which is 10 times slower than the default index. This is expected, though, since the default index uses a keyword analyzer.
I understand that the new points API is better for numeric prefix/range queries, but auto prefix seems to be a good fit for prefix string queries. It saves a lot of space compared to edge ngram, and indexing is faster. I am not saying we should restore the functionality inside the default BlockTreeTerms, but maybe we could create a separate postings format that exposes this feature?