I also agree with tokenized field caching, which is a use case for Nutch. Let me elaborate on the use case. In a Nutch deployment, we generate indexes from web documents, and the set of fields is indeed known a priori. The indexes are then distributed to several index servers, which communicate through Hadoop's RPC calls. A query is sent to all of the index servers, and the results are collected and merged on the fly. Since the indexes need not be disjoint (crawling is an adaptive process), the results must be merged without returning any document more than once, so we need a unique key to represent each document.

For this task, the default Nutch codebase uses the site field (the url's hostname), which is untokenized, and allows only 1-2 documents from a site in the search results. For obvious performance reasons, the site field is cached in the index servers with FieldCache.getStrings().

The problem arises when we want to show more than one result from a specific site (for example in a site:apache.org query) and the same url is indexed in more than one index server. If we use the tokenized url field in the FieldCache, deleting duplicates becomes error-prone. Since we use FieldCache.getStrings() rather than FieldCache.getStringIndex(), the problem here is not sorting on a tokenized field, but the tokenized field not being cached correctly. An example is an array like [com, edu, www, youtube, ...] returned from getStrings(): for each doc, only a single token is returned rather than the whole url.
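To illustrate why only one token per doc ends up in the cache, here is a minimal sketch in plain Java. It is a simplified model, not Lucene's actual code: it assumes the cache is filled by walking the terms of a field in sorted order and writing each term into the slot of every doc that contains it. The class TokenizedCacheDemo, the methods fillCache and index, and the split-on-punctuation tokenizer are all hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class TokenizedCacheDemo {

    // Simplified model of filling a per-document string cache by walking the
    // terms of one field in sorted order (an assumption, not Lucene's code).
    static String[] fillCache(SortedMap<String, List<Integer>> postings, int maxDoc) {
        String[] cache = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                // A later term overwrites whatever an earlier term stored for this doc.
                cache[doc] = entry.getKey();
            }
        }
        return cache;
    }

    // Builds a toy inverted index for a single field. The tokenizer is
    // hypothetical: it splits a url on non-alphanumeric characters.
    static SortedMap<String, List<Integer>> index(String[] urls, boolean tokenized) {
        SortedMap<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < urls.length; doc++) {
            String[] terms = tokenized
                    ? urls[doc].split("[^A-Za-z0-9]+")
                    : new String[] { urls[doc] };
            for (String term : terms) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, k -> new ArrayList<>()).add(doc);
                }
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        String[] urls = { "http://www.apache.org/", "http://lucene.apache.org/nutch" };
        // Untokenized: one term per doc, so the cache holds whole urls.
        System.out.println(Arrays.toString(fillCache(index(urls, false), urls.length)));
        // Tokenized: several terms per doc, so only the last one written survives.
        System.out.println(Arrays.toString(fillCache(index(urls, true), urls.length)));
    }
}
```

In this model, two different urls whose surviving token happens to be the same would look like duplicates, and two copies of the same url are only recognized through that one token, which is why using the cached value as a unique key is unreliable.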
Well, if you are still with me, here is my proposal:
1. In FieldCacheImpl.java, add the following to both the getStrings and getStringIndex functions:
Field docField = getField(reader, field);
if (docField != null && docField.isStored() && docField.isTokenized())
    throw new RuntimeException("Caching in Tokenized Fields is not allowed");
2. Subclass FieldCacheImpl as StoredFieldCacheImpl and implement stored field caching there, delegating untokenized fields to the superclass.
3. Add the implementation to FieldCache.java:
public static FieldCache DEFAULT = new FieldCacheImpl();
public static FieldCache STORED_CACHE = new StoredFieldCacheImpl();
This way Lucene internals will not be affected, and stored field caching can still be performed.
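The shape of the three steps above can be sketched without any Lucene dependency. Everything here is a toy stand-in and an assumption for illustration: ToyIndex, BaseCache and StoredCache are hypothetical names, not Lucene classes, and the real change would live in FieldCacheImpl and its subclass.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StoredCacheSketch {

    // Hypothetical stand-in for an index reader: field flags plus per-doc values.
    static class ToyIndex {
        final Set<String> tokenizedFields = new HashSet<>();
        final Map<String, String[]> storedValues = new HashMap<>(); // field -> stored value per doc
        final Map<String, String[]> termValues = new HashMap<>();   // field -> single term per doc
    }

    // Step 1: the base cache rejects fields that are both stored and tokenized.
    static class BaseCache {
        String[] getStrings(ToyIndex index, String field) {
            boolean stored = index.storedValues.containsKey(field);
            if (stored && index.tokenizedFields.contains(field)) {
                throw new RuntimeException("Caching in Tokenized Fields is not allowed");
            }
            return index.termValues.get(field);
        }
    }

    // Step 2: the subclass serves stored tokenized fields from their stored
    // values and delegates everything else to the superclass.
    static class StoredCache extends BaseCache {
        @Override
        String[] getStrings(ToyIndex index, String field) {
            String[] stored = index.storedValues.get(field);
            if (stored != null && index.tokenizedFields.contains(field)) {
                return stored.clone();
            }
            return super.getStrings(index, field);
        }
    }

    // Step 3: two shared instances, mirroring DEFAULT and STORED_CACHE.
    static final BaseCache DEFAULT = new BaseCache();
    static final BaseCache STORED_CACHE = new StoredCache();

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.tokenizedFields.add("url");
        index.storedValues.put("url", new String[] { "http://www.apache.org/" });
        index.termValues.put("url", new String[] { "www" });
        index.termValues.put("site", new String[] { "www.apache.org" });

        // The untokenized field is served the same way by both caches.
        System.out.println(Arrays.toString(DEFAULT.getStrings(index, "site")));
        // The stored, tokenized field is only available through the stored cache.
        System.out.println(Arrays.toString(STORED_CACHE.getStrings(index, "url")));
        try {
            DEFAULT.getStrings(index, "url");
        } catch (RuntimeException e) {
            System.out.println("DEFAULT rejected url: " + e.getMessage());
        }
    }
}
```

Callers that rely on the default cache keep the existing behavior for untokenized fields and get an explicit error instead of silently wrong values, while callers that need whole stored values opt in through the separate instance.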