Yeap, at the beginning of this project we tried to implement this autocomplete system using regular inverted indexes, but the response time required for autocomplete to work from a user perspective is very low (<50ms), and it would be quite hard to achieve such a performance with inverted indexes.
I still think this is the way to go, but as you say we have to be careful with the data generation part. Most of the work should be put in making sure that the data is well distributed and organized in order to avoid combinatorial explosion.
Let me go in detail with the sources of data permutations and the reasoning behind them:
1) With regards to infix matches, if a user types "barcelona" we want to match "hotels in barcelona". In order to achieve this, we generate:
hotels in barcelona => hotels in barcelona
in barcelona => hotels in barcelona
barcelona => hotels in barcelona
The FST should be able to conflate these prefixes nicely in just one path, right?. Therefore this part shouldn't be a problem.
2) In addition, another feature we want to achieve is to be able to match inputs without prepositions. That means that if the user types "hotels barcelona jacuzzi", we should be able to match "hoteles in barcelona with jacuzzi". Now the only way we envision of doing it properly is to generate this permutation within the data:
hotels barcelona jacuzzi => hotels in barcelona with jacuzzi
I can see how this can explode the FST by creating different branches. Theoretically this could be done at runtime without the need of generating the data, but we don't see a way to do it in a clean way. To make things more complicated we've implemented fuzzy matching at query time (we use a levenshtein automata generated with the user input + an edit distance and then we intersect with the FST), and this is making very complicated to do preposition handling at query time.
3) PP permutations (i.e. hoteles in barcelona with jacuzzi and hoteles with jacuzzi in barcelona). I don't really see a way to work around this. Probably we need to be careful and only generate these permutations for the top-K cities, in order to limit the potential size.
Summarizing, I believe that we can reduce the set of "bad permutations" a lot if we can figure out how to implement the prepositions at runtime. If you have any ideas, let me know. Thanks!