Details
- New Feature
- Status: Open
- Minor
- Resolution: Unresolved
- 3.1, 4.0-ALPHA
- None
- None
- New
Description
The co-occurrence filter proposed here will output sets of tokens that co-occur within a given window over a token stream.
These token sets can be ordered either lexically (to allow order-independent matching and counting) or positionally. For example, sliding windows of positionally ordered co-occurring terms that include every term in the window are n-grams, also called shingles.
The parameters to this filter will be:
- window size: a fixed sequence length, a sentence or paragraph context (which requires sentence/paragraph segmentation, not yet available in Lucene), or the entire token stream (full field width)
- minimum number of co-occurring terms: >= 2
- maximum number of co-occurring terms: <= window size
- token set ordering: lexical or positional
One use case for co-occurring token sets is as candidates for collocations.
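To make the parameters concrete, here is a minimal sketch in plain Java, not a Lucene TokenFilter and not the proposed implementation. It assumes a fixed-length window and an in-memory token list. The class name `CooccurrenceSketch`, the method `cooccurrences`, and its parameter names are hypothetical. Each set is anchored at its first token so that overlapping windows do not emit the same combination of positions twice; repeated tokens at different positions are treated as distinct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CooccurrenceSketch {

    /**
     * Returns every token set of size minTerms..maxTerms whose members all
     * fall inside one window of windowSize consecutive positions.
     */
    public static List<List<String>> cooccurrences(
            List<String> tokens, int windowSize, int minTerms, int maxTerms, boolean lexicalOrder) {
        if (minTerms < 2 || maxTerms > windowSize || minTerms > maxTerms) {
            throw new IllegalArgumentException("need 2 <= minTerms <= maxTerms <= windowSize");
        }
        List<List<String>> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            // The window covers positions [start, end); it is cut short at the end of the stream.
            int end = Math.min(start + windowSize, tokens.size());
            List<String> current = new ArrayList<>();
            current.add(tokens.get(start)); // anchor each set at the window's first token
            extend(tokens, start + 1, end, current, minTerms, maxTerms, lexicalOrder, out);
        }
        return out;
    }

    // Recursively adds later positions from the window to the current set.
    private static void extend(List<String> tokens, int from, int end, List<String> current,
            int minTerms, int maxTerms, boolean lexicalOrder, List<List<String>> out) {
        if (current.size() >= minTerms) {
            List<String> set = new ArrayList<>(current);
            if (lexicalOrder) {
                Collections.sort(set); // lexical order gives order-independent matching
            }
            out.add(set);
        }
        if (current.size() == maxTerms) {
            return;
        }
        for (int i = from; i < end; i++) {
            current.add(tokens.get(i));
            extend(tokens, i + 1, end, current, minTerms, maxTerms, lexicalOrder, out);
            current.remove(current.size() - 1);
        }
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("new", "york", "city", "hall");
        // Positional pairs and triples within a 3-token window.
        System.out.println(cooccurrences(tokens, 3, 2, 3, false));
        // Lexically ordered pairs within a 3-token window.
        System.out.println(cooccurrences(tokens, 3, 2, 2, true));
    }
}
```

In this sketch, setting both the minimum and maximum number of terms equal to the window size with positional ordering yields only the contiguous shingles, for example `[new, york, city]` and `[york, city, hall]` for a window of 3.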
Issue Links
- relates to LUCENE-5318 Co-occurrence counts from Concordance (Open)