I refactored the code so that the tag and document probabilities are computed during the index creation phase, where they are used to find the most important document terms for a given tag term. These document terms, ranked by information gain, are stored as meta information in the index when it is created. I also added a class, TagIndexWriter, which extends IndexWriter and creates an index that can be used to run MoreLikeThisUsingTags queries.
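To make the ranking step concrete, here is a minimal sketch of scoring document terms by information gain against a tag. It is not the code from the patch: the class TagTermRanker, its method names, and the four-count input layout are all hypothetical, and it takes information gain to be H(tag) - H(tag | term) over document counts.

```java
import java.util.*;

// Hypothetical helper, not part of the patch: ranks document terms by
// information gain with respect to a single tag.
final class TagTermRanker {

    // One term of the entropy sum: -p * log2(p), taking 0 * log(0) as 0.
    static double h(double p) {
        return p <= 0.0 ? 0.0 : -p * (Math.log(p) / Math.log(2));
    }

    // Entropy in bits of a two-way split with counts a and b.
    static double entropy(long a, long b) {
        long n = a + b;
        if (n == 0) return 0.0;
        return h((double) a / n) + h((double) b / n);
    }

    // Information gain of a term about a tag, from four document counts:
    // tagAndTerm = has tag and term, tagOnly = has tag but not term,
    // termOnly = has term but not tag, neither = has neither.
    static double informationGain(long tagAndTerm, long tagOnly,
                                  long termOnly, long neither) {
        long n = tagAndTerm + tagOnly + termOnly + neither;
        if (n == 0) return 0.0;
        double hTag = entropy(tagAndTerm + tagOnly, termOnly + neither);
        long withTerm = tagAndTerm + termOnly;
        long withoutTerm = tagOnly + neither;
        double hTagGivenTerm =
            ((double) withTerm / n) * entropy(tagAndTerm, termOnly)
          + ((double) withoutTerm / n) * entropy(tagOnly, neither);
        return hTag - hTagGivenTerm;
    }

    // Returns up to k terms with the highest information gain.
    // Each long[] holds {tagAndTerm, tagOnly, termOnly, neither}.
    static List<String> topTerms(Map<String, long[]> countsByTerm, int k) {
        List<String> terms = new ArrayList<>(countsByTerm.keySet());
        terms.sort(Comparator.comparingDouble((String t) -> {
            long[] c = countsByTerm.get(t);
            return informationGain(c[0], c[1], c[2], c[3]);
        }).reversed());
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

A term that appears in exactly the tagged documents scores the full entropy of the tag, while a term spread evenly over tagged and untagged documents scores zero, so common words fall to the bottom of the ranking.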
I recreated a test index with one million documents and assigned tags (tag_0, ..., tag_4) to 10%, 20%, and so on of the documents.
The time taken to generate a query on an index created using TagIndexWriter:
tag name, number of documents, time in ms
tag_0, 10134, 22
tag_1, 19996, 29
tag_2, 30010, 6
tag_3, 39907, 6
tag_4, 50148, 9
Since the document terms corresponding to a tag term are computed during the indexing phase, the time taken to generate a MoreLikeThisUsingTags query does not grow with the number of documents carrying the tag: it stays between 6 and 29 ms for all five tags.
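The reason query generation is cheap can be sketched as a plain lookup. This is an illustration only: PrecomputedTagQuery, its in-memory map, and the "field:term OR ..." output string are assumptions made for the example, and it uses no Lucene classes, whereas the patch stores the terms as meta information in the index.

```java
import java.util.*;

// Hypothetical illustration: top terms per tag are stored once at index
// time, so building a query later is a lookup plus a short loop.
final class PrecomputedTagQuery {

    private final Map<String, List<String>> topTermsByTag = new HashMap<>();

    // Index time: remember the already-ranked terms for a tag.
    void storeTopTerms(String tag, List<String> rankedTerms) {
        topTermsByTag.put(tag, new ArrayList<>(rankedTerms));
    }

    // Query time: the cost depends on the number of stored terms,
    // not on how many documents carry the tag.
    String buildQuery(String tag, String field) {
        List<String> terms =
            topTermsByTag.getOrDefault(tag, Collections.emptyList());
        StringJoiner query = new StringJoiner(" OR ");
        for (String term : terms) {
            query.add(field + ":" + term);
        }
        return query.toString();
    }
}
```

With a fixed number of stored terms per tag, the work at query time is the same for tag_0 and tag_4 even though tag_4 covers about five times as many documents.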