Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
master
-
None
Description
Lucene supports a maximum term size of 32KB (32766 bytes).
A term can exceed this size, which causes indexing to fail.
To avoid this, the team set up "ignore_above" filters to drop overly long terms, and set their value to the Lucene maximum.
However, as stated in https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html :
Note:
The value for ignore_above is the character count, but Lucene counts bytes. If you use UTF-8 text with many non-ASCII characters, you may want to set the limit to 32766 / 3 = 10922 since UTF-8 characters may occupy at most 3 bytes.
Thus the limit is applied to the string length (in characters) in ES, not to the byte length that Lucene enforces.
We can therefore craft a character sequence whose UTF-8 encoding exceeds the Lucene limit without triggering the ES limit.
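The mismatch can be illustrated with a small standalone sketch. It does not call Elasticsearch or Lucene. It only compares a character count with a UTF-8 byte count. The helper names are hypothetical, and the sketch assumes the character count is taken in UTF-16 code units, as a Java string length would be.

```python
# Minimal sketch: a term that passes a character-based limit
# but exceeds a byte-based one. All names here are illustrative.

LUCENE_MAX_TERM_BYTES = 32766  # byte limit quoted in the Elasticsearch note


def char_count(term: str) -> int:
    """Length in UTF-16 code units (assumed to match a Java String length)."""
    return len(term.encode("utf-16-le")) // 2


def utf8_byte_count(term: str) -> int:
    """Length of the term once encoded as UTF-8."""
    return len(term.encode("utf-8"))


def passes_ignore_above(term: str, ignore_above: int) -> bool:
    """True when a character-based filter would keep the term."""
    return char_count(term) <= ignore_above


def exceeds_lucene_limit(term: str) -> bool:
    """True when the UTF-8 encoding is larger than the byte limit."""
    return utf8_byte_count(term) > LUCENE_MAX_TERM_BYTES


# The euro sign is one character but three bytes in UTF-8.
crafted = "\u20ac" * 11000
ascii_term = "a" * 32766

if __name__ == "__main__":
    print(char_count(crafted), utf8_byte_count(crafted))
    print(passes_ignore_above(crafted, 32766), exceeds_lucene_limit(crafted))
```

With "ignore_above" set to 32766, the crafted term is kept by the character-based filter even though its encoded size is above the byte limit, while a pure ASCII term of 32766 characters stays within both.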
A much lower value (like 4KB) seems more reasonable, as long terms may not be significant.
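As a rough check of that proposal, the worst-case encoded size for a given "ignore_above" value can be computed from the 3-bytes-per-character bound quoted in the note. This is only arithmetic under that assumption. Reading "4KB" as 4096 characters is also an assumption, and the function names are hypothetical.

```python
# Minimal sketch: worst-case UTF-8 size for a character-based limit,
# assuming at most 3 bytes per character as stated in the quoted note.

LUCENE_MAX_TERM_BYTES = 32766
MAX_UTF8_BYTES_PER_CHAR = 3  # assumption taken from the quoted note


def worst_case_bytes(ignore_above: int) -> int:
    """Largest UTF-8 size a term of ignore_above characters could reach."""
    return ignore_above * MAX_UTF8_BYTES_PER_CHAR


def is_safe(ignore_above: int) -> bool:
    """True when even the worst case fits within the byte limit."""
    return worst_case_bytes(ignore_above) <= LUCENE_MAX_TERM_BYTES


if __name__ == "__main__":
    for limit in (4096, 10922, 32766):
        print(limit, worst_case_bytes(limit), is_safe(limit))
```

Under these assumptions, a limit of 4096 characters leaves a wide margin, 10922 is the largest value that still fits, and 32766 does not.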
Note:
- Implement tests:
  - Demonstrating this bug
  - Demonstrating that only overly long terms are ignored