[LUCENE-9112] SegmentingTokenizerBase splits terms that occupy 1024th positions in text

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 9.0
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Lucene Fields: New

    Description

      The OpenNLP tokenizer shows weird behaviour when text contains spurious punctuation, such as triple dots trailing a sentence...

      1. the first dot becomes part of the token, so 'sentence.' is emitted as the token
      2. much further down the text, a seemingly unrelated token is suddenly split up; in my example (see the attached unit test) the name 'Baron' is split into 'Baro' and 'n'. This is the real problem.

      The problem never seems to occur with small texts in unit tests, but it certainly does in real-world examples. Depending on how many 'spurious' dots there are, a completely different term may be split, or the same term at a different location.

      I am not sure whether this is actually a problem in the Lucene code, but it is a problem, and I have a Lucene unit test proving it.
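
      A minimal sketch of this kind of reproduction (not the attached unit test itself) is below. It assumes the attached en-sent.bin and en-token.bin models are available in the working directory, that the analysis/opennlp module is on the classpath, and the exact amount of padding needed to push a term across the 1024-character buffer boundary may need adjusting.

          import java.nio.file.Paths;

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.custom.CustomAnalyzer;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

          public class OpenNlpSplitRepro {
            public static void main(String[] args) throws Exception {
              // Build an analyzer around the OpenNLP tokenizer, loading the sentence
              // and token models from the working directory.
              Analyzer analyzer = CustomAnalyzer.builder(Paths.get("."))
                  .withTokenizer("opennlp",
                      "sentenceModel", "en-sent.bin",
                      "tokenizerModel", "en-token.bin")
                  .build();

              // A sentence trailed by triple dots, padded so that a later term
              // ("Baron") lands near the 1024-character mark.
              StringBuilder text = new StringBuilder("This is a sentence...");
              while (text.length() < 1020) {
                text.append(" filler");
              }
              text.append(" Baron was mentioned much later in the text.");

              try (TokenStream ts = analyzer.tokenStream("body", text.toString())) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                  // Expect 'Baron' as a single token; the bug shows up as 'Baro' + 'n'.
                  System.out.println(term.toString());
                }
                ts.end();
              }
            }
          }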

      Attachments

        1. LUCENE-9112.patch
          6 kB
          Markus Jelsma
        2. LUCENE-9112.patch
          6 kB
          Markus Jelsma
        3. en-token.bin
          583 kB
          Markus Jelsma
        4. en-sent.bin
          35 kB
          Markus Jelsma
        5. LUCENE-9112-unittest.patch
          4 kB
          Markus Jelsma
        6. LUCENE-9112-unittest.patch
          4 kB
          Markus Jelsma


          People

            Assignee: Unassigned
            Reporter: Markus Jelsma (markus17)
            Votes: 0
            Watchers: 3
