Details
- New Feature
- Status: Closed
- Major
- Resolution: Fixed
- 0.3, 0.4, 0.5
- None
Description
Collocations generated with Mahout could serve as a whitelist of terms to index into a Lucene index. This patch will provide a way to generate a serialized BloomFilter from CollocationsOutput, plus a Lucene filter that takes a BloomFilter and emits only the tokens that are members of it. A set of interesting collocations could then be pre-computed for a corpus, and the documents indexed using only those collocations.
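The whitelist idea can be sketched in plain Java. This is only an illustration under stated assumptions, not the patch itself: `SimpleBloomFilter` and `keepWhitelisted` are hypothetical names that stand in for the serialized BloomFilter and the Lucene filter, and the example collocations are invented. It uses only the standard library so it can run on its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class CollocationWhitelist {

    /** Hypothetical stand-in for the real BloomFilter: k bit positions per key via double hashing. */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        /** Sets the k bits that correspond to the key. */
        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        /** Never false for an added key; may be true for a key that was not added. */
        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }

        // i-th bit position: (h1 + i * h2) mod numBits, kept non-negative by floorMod.
        private int index(String key, int i) {
            return Math.floorMod(key.hashCode() + i * fnv1a(key), numBits);
        }

        // 32-bit FNV-1a over the UTF-8 bytes, used as the second hash.
        private static int fnv1a(String key) {
            int hash = 0x811C9DC5;
            for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
                hash ^= (b & 0xff);
                hash *= 16777619;
            }
            return hash;
        }
    }

    /** Stand-in for the token filter: keeps only tokens the Bloom filter reports as members. */
    static List<String> keepWhitelisted(List<String> tokens, SimpleBloomFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (filter.mightContain(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed input: collocations pre-computed for the corpus.
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        filter.add("new york");
        filter.add("machine learning");

        List<String> tokens = Arrays.asList("new york", "the city", "machine learning");
        System.out.println(keepWhitelisted(tokens, filter));
    }
}
```

Because a Bloom filter can return false positives, a few tokens outside the collocation set may still be indexed. It never drops a token that was added to the filter.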