Here's my take on it: The UnifiedHighlighter (and PostingsHighlighter, from which it derives) processes the MultiTermQueries (e.g. wildcards) in the query and creates multiple CharacterRunAutomaton instances intended to match the same things. CharacterRunAutomaton takes an Automaton as input, and when it does its processing, it matches the character code points (integers from 0 to 0x10FFFF) against the integers in the Automaton. This strategy assumes that the Automaton was constructed based on character code points. But the automaton returned by AutomatonQuery.getAutomaton is intended to match byte by byte (integers 0 to 255). PrefixQuery.toAutomaton will get 2 bytes for the "я" in BytesRef form, and add 2 states. This does not line up with the assumptions of CharacterRunAutomaton.
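To make the mismatch concrete, here is a minimal sketch using only the JDK (no Lucene calls). The class name CodePointVsUtf8 and its helper methods are made up for illustration. It prints the integers a code-point-oriented matcher would feed in for "я" next to the integers a byte-oriented automaton would have transitions for.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class (not part of Lucene); uses only the JDK.
public class CodePointVsUtf8 {

    // The integers a code-point-oriented matcher would step through.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    // The integers (0..255) a byte-oriented automaton would have transitions for.
    static int[] utf8Bytes(String s) {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // Java bytes are signed; mask to 0..255
        }
        return out;
    }

    public static void main(String[] args) {
        String ya = "\u044F"; // Cyrillic "я"
        System.out.println("code points: " + Arrays.toString(codePoints(ya)));
        System.out.println("UTF-8 bytes: " + Arrays.toString(utf8Bytes(ya)));
    }
}
```

For "я" this gives a single code point (U+044F) but two UTF-8 bytes (0xD1, 0x8F), so a matcher that steps once per code point cannot follow a two-state byte path.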
A short-term "fix" is simply to put the AutomatonQuery check last in the if-else chain, as Dmitry indicated. With that, PrefixQuery will work again; it was broken by LUCENE-6367 (Lucene 5.1). TermRangeQuery, which also now extends AutomatonQuery, will likewise work; it was broken by LUCENE-5879 (Lucene 5.2). Again, back when MultiTermHighlighting was first written, neither of those queries extended AutomatonQuery. But there will still be bugs for other types of AutomatonQuery (namely WildcardQuery and RegexpQuery) that have yet to be reported.
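The ordering problem can be sketched in isolation. The classes below are hypothetical stand-ins that only mirror the inheritance relationship (PrefixQuery extends AutomatonQuery); they are not the Lucene classes, and the method names and returned strings are made up for illustration.

```java
// Hypothetical stand-ins; only the inheritance chain matters here.
public class DispatchOrder {
    static class Query {}
    static class AutomatonQuery extends Query {}
    static class PrefixQuery extends AutomatonQuery {}

    // Superclass check first: a PrefixQuery is swallowed by the generic branch.
    static String superclassFirst(Query q) {
        if (q instanceof AutomatonQuery) {
            return "generic-automaton";
        } else if (q instanceof PrefixQuery) {
            return "prefix"; // never reached for any input
        }
        return "unsupported";
    }

    // Superclass check last: the specific subclass branch gets its chance.
    static String superclassLast(Query q) {
        if (q instanceof PrefixQuery) {
            return "prefix";
        } else if (q instanceof AutomatonQuery) {
            return "generic-automaton";
        }
        return "unsupported";
    }

    public static void main(String[] args) {
        Query q = new PrefixQuery();
        System.out.println(superclassFirst(q));
        System.out.println(superclassLast(q));
    }
}
```

With the superclass check last, only queries without a dedicated branch fall through to the generic AutomatonQuery handling, which is where the remaining bugs would still show up.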
Robert Muir or Michael McCandless, I wonder if you have any thoughts on how to fix this. An idea I have is to not use a CharacterRunAutomaton in the UnifiedHighlighter and use a ByteRunAutomaton instead. Then, add a ByteRunAutomaton.run(char ...etc) that converts each character to the equivalent UTF-8 bytes to match. Even with that, I wonder if this points to areas where the automata API could be improved so that people don't fall into this trap in the future. For example, maybe have the Automaton self-report whether it's byte oriented, Unicode code point oriented, or something custom. Then, RunAutomaton could throw an exception if there is a mismatch. However, that would be a runtime error; maybe the Automaton could be typed instead.
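A rough sketch of the "convert characters to UTF-8 before matching" idea, using only the JDK. BytePrefixMatcher is a hypothetical stand-in for a byte-oriented automaton that accepts a fixed prefix; it is not ByteRunAutomaton, and the char-based run overload here is only an assumption about what such a method could look like. The third method reproduces the current mistake of comparing code points against byte labels.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a byte-oriented automaton that accepts a prefix.
public class BytePrefixMatcher {
    private final byte[] prefix; // byte labels, as a byte-oriented automaton would hold

    BytePrefixMatcher(String prefix) {
        this.prefix = prefix.getBytes(StandardCharsets.UTF_8);
    }

    // Byte-oriented matching: consistent with how the labels were built.
    boolean run(byte[] input, int offset, int length) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (input[offset + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Proposed shape: accept characters, encode to UTF-8, then match bytes.
    boolean run(CharSequence text) {
        byte[] utf8 = text.toString().getBytes(StandardCharsets.UTF_8);
        return run(utf8, 0, utf8.length);
    }

    // The mismatch: feeding code points to byte labels.
    boolean runCodePointsNaively(CharSequence text) {
        int[] cps = text.codePoints().toArray();
        if (cps.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (cps[i] != (prefix[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BytePrefixMatcher m = new BytePrefixMatcher("\u044F"); // prefix "я"
        String word = "\u044F\u0431\u043B\u043E\u043A\u043E";  // "яблоко"
        System.out.println(m.run(word));
        System.out.println(m.runCodePointsNaively(word));
    }
}
```

Encoding the whole text up front is the simplest form; a real implementation would presumably encode incrementally and would also need to map byte offsets back to character offsets for highlighting, which this sketch ignores. Note that for pure ASCII input both paths agree, which may be why the problem went unreported for so long.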
Anyway, what I'd like to do is make a short-term fix that addresses many common cases and the title of this issue, and then do a more thorough fix in a follow-on issue. Ishan Chattopadhyaya, do you think this could go into 6.4.2, or are you only looking for "critical" issues? It's debatable what's critical and what's not. This bug has been around since 5.1, so perhaps it isn't.
(a patch will follow shortly)