James Mailbox / MAILBOX-301

Lucene terms length exceeded on some emails

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: master
    • Fix Version/s: master
    • Component/s: elasticsearch
    • Labels: None

    Description

      Lucene supports a maximum term size of 32766 bytes (just under 32KB).

      Some emails contain terms exceeding this size, which causes indexing to fail.

      To avoid this, the team set up "ignore_above" filters to drop overly long terms, and set their value to the Lucene maximum.

      However, as stated in https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html :

      Note:
      
      The value for ignore_above is the character count, but Lucene counts bytes. If you use UTF-8 text with many non-ASCII characters, you may want to set the limit to 32766 / 3 = 10922 since UTF-8 characters may occupy at most 3 bytes.
      

      Thus ES applies the limit to the string length in characters, whereas the Lucene limit applies to the length in bytes.

      It is therefore possible to craft a character sequence whose UTF-8 encoding exceeds the Lucene limit without triggering the ES limit.
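
      The mismatch between character count and byte count can be illustrated with a minimal Java sketch. The class name TermLengthDemo and its helpers are hypothetical and not part of James; the sketch only compares a string's length with its UTF-8 size, and does not call ES or Lucene.

```java
import java.nio.charset.StandardCharsets;

public class TermLengthDemo {
    // Lucene's maximum term size in bytes, as quoted in the ES note above.
    static final int LUCENE_MAX_TERM_BYTES = 32766;

    // Builds a string made of `count` copies of the given character.
    static String repeat(char c, int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Number of bytes the term occupies once encoded as UTF-8.
    static int utf8Length(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The euro sign (U+20AC) is one character but three bytes in UTF-8.
        String term = repeat('\u20AC', 20000);

        System.out.println("chars = " + term.length());
        System.out.println("bytes = " + utf8Length(term));
        // A character-based limit set to the Lucene maximum lets the term through...
        System.out.println("within char limit: " + (term.length() <= LUCENE_MAX_TERM_BYTES));
        // ...even though its encoded size is above what Lucene accepts.
        System.out.println("above byte limit: " + (utf8Length(term) > LUCENE_MAX_TERM_BYTES));
    }
}
```

      A term of 20000 euro signs stays below a character limit of 32766 while its UTF-8 form is three times larger, which is the situation described above.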

      A much lower value (like 4KB) seems more reasonable, as very long terms may not be significant.
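
      As a rough check of that proposal, the sketch below computes the worst-case encoded size for a given "ignore_above" value. It assumes at most 3 UTF-8 bytes per counted character, as the quoted note does, and it reads "4KB" as 4096 characters; both are assumptions, and the class IgnoreAboveBudget is hypothetical.

```java
public class IgnoreAboveBudget {
    // Lucene's maximum term size in bytes, as quoted in the ES note.
    static final int LUCENE_MAX_TERM_BYTES = 32766;
    // Assumption taken from the quoted note: at most 3 UTF-8 bytes per character.
    static final int MAX_BYTES_PER_CHAR = 3;

    // Largest possible encoded size of a term that passes the character limit.
    static int worstCaseBytes(int ignoreAbove) {
        return ignoreAbove * MAX_BYTES_PER_CHAR;
    }

    // True when no term accepted by the character limit can exceed the Lucene limit.
    static boolean isSafe(int ignoreAbove) {
        return worstCaseBytes(ignoreAbove) <= LUCENE_MAX_TERM_BYTES;
    }

    public static void main(String[] args) {
        int[] candidates = {4096, 10922, 32766};
        for (int limit : candidates) {
            System.out.println("ignore_above=" + limit
                    + " worstCaseBytes=" + worstCaseBytes(limit)
                    + " safe=" + isSafe(limit));
        }
    }
}
```

      Under these assumptions a limit of 4096 leaves a wide margin (12288 bytes at worst), 10922 sits exactly at the Lucene maximum, and 32766 does not protect against multi-byte text.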

      Note:

      • Implement tests:
        • Demonstrating this bug
        • Demonstrating that only overly long terms are ignored

          People

            Assignee: Unassigned
            Reporter: Benoit Tellier (btellier)