  1. Lucene - Core
  2. LUCENE-2498

add sentence boundary charfilter

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      From the discussion of LUCENE-2167:

      It would be nice to have a CharFilter to mark sentence boundaries.
      Such functionality would be useful for:

      • preventing phrase queries with 0 slop from matching across sentences
      • inhibiting multiword synonyms, shingles, etc. from spanning sentence boundaries

      For sentence boundary detection we could use JFlex's support for the Unicode Sentence_Break property,
      with the UAX #29 definition as the default grammar.

      One idea is to just mark the boundaries with a user-provided String.
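
      To make the marker idea concrete, here is a minimal sketch. It is not the proposed CharFilter: it swaps in the JDK's java.text.BreakIterator for the JFlex/UAX #29 grammar discussed above, works on a whole String rather than a character stream, and does no offset correction. The class name SentenceMarkerSketch, the method mark, and the marker value SENTENCEEND are hypothetical names chosen for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical illustration only: not a Lucene CharFilter and not JFlex-based.
class SentenceMarkerSketch {

    // Returns a copy of text with marker inserted at every interior sentence boundary.
    static String mark(String text, String marker) {
        // Assumption: the JDK sentence iterator stands in for the UAX #29 grammar.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        StringBuilder out = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.append(text, start, end);
            if (end < text.length()) {
                // Keep the marker a separate whitespace-delimited token.
                if (out.length() > 0 && !Character.isWhitespace(out.charAt(out.length() - 1))) {
                    out.append(' ');
                }
                out.append(marker).append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mark("Hello world. Goodbye moon.", "SENTENCEEND"));
    }
}
```

      A real CharFilter would also have to map offsets in the marked text back to the original input so that highlighting still works; that part is left out here.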

      As a simple use case, a user could then add this string to a StopFilter's stop words; removing the marker token would introduce a position increment.
      This would inhibit phrase queries, etc.
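
      The position-increment effect can be simulated in plain Java. This is only a sketch of the behaviour, assuming a whitespace tokenizer and a stop filter that preserves position gaps; it does not use Lucene's API, and PositionGapSketch, Token, tokenize and adjacent are hypothetical names. The input is text that already contains the marker string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of a stop filter that leaves a position gap; not Lucene code.
class PositionGapSketch {

    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // Splits on whitespace and drops the marker token, but still counts its position.
    static List<Token> tokenize(String markedText, String marker) {
        List<Token> tokens = new ArrayList<>();
        int position = -1;
        for (String term : markedText.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            position++;                 // every token, kept or dropped, consumes a position
            if (term.equals(marker)) {
                continue;               // marker is removed, leaving a gap
            }
            tokens.add(new Token(term, position));
        }
        return tokens;
    }

    // True if first is immediately followed by second, i.e. an exact (0 slop) phrase match.
    static boolean adjacent(List<Token> tokens, String first, String second) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            Token a = tokens.get(i);
            Token b = tokens.get(i + 1);
            if (a.term.equals(first) && b.term.equals(second) && b.position == a.position + 1) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("Hello world. SENTENCEEND Goodbye moon.", "SENTENCEEND");
        System.out.println(adjacent(tokens, "Hello", "world."));
        System.out.println(adjacent(tokens, "world.", "Goodbye"));
    }
}
```

      Because the dropped marker still occupies a position, the last token of one sentence and the first token of the next are two positions apart, so a phrase check that requires consecutive positions does not match across the boundary.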

      A user could also use the sentence markers to do more advanced processing downstream.

          People

            Assignee: Unassigned
            Reporter: Robert Muir (rcmuir)
            Votes: 0
            Watchers: 2
