Details
- Type: Bug
- Status: Done
- Priority: Major
- Resolution: Done
Description
The Bro uri field in [HTTP::Info](https://www.bro.org/sphinx/scripts/base/protocols/http/main.bro.html#type-HTTP::Info) can exceed the Lucene-imposed limit of 32766 bytes per term. Non-analyzed fields are treated as a single term, and we currently set uri to not_analyzed in the [bro_index.template](https://github.com/apache/incubator-metron/blob/master/metron-deployment/roles/metron_elasticsearch_templates/files/es_templates/bro_index.template). The resolution options that I've been able to find are:
1. Set index to "[no](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-index.html)", which will not add that field to the index, making it not queryable.
2. Change the type to [binary](https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html), which is not stored by default and is not searchable.
3. Use "[ignore_above](https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html)" to set a limit, above which strings are not indexed.
4. Set the field as "analyzed", so that the analyzer breaks the URI into smaller terms instead of indexing it as one.
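
As a rough illustration of option 3, the uri mapping could look something like the fragment below, written here as a Python dict and dumped to JSON. The value 8191 is an assumption: ignore_above counts characters while the Lucene limit is in bytes, and a UTF-8 character can take up to 4 bytes, so 32766 / 4 rounds down to 8191. The helper name `uri_mapping` is hypothetical, and the fragment is not taken from the actual bro_index.template.

```python
import json

LUCENE_MAX_TERM_BYTES = 32766  # limit quoted in the Lucene error message
MAX_UTF8_BYTES_PER_CHAR = 4    # worst case for a single UTF-8 character


def uri_mapping(max_chars=LUCENE_MAX_TERM_BYTES // MAX_UTF8_BYTES_PER_CHAR):
    """Hypothetical helper: not_analyzed string mapping capped with ignore_above."""
    return {
        "uri": {
            "type": "string",
            "index": "not_analyzed",
            # Strings longer than this many characters are not indexed.
            "ignore_above": max_chars,
        }
    }


print(json.dumps(uri_mapping(), indent=2))
```

With this approach an oversized uri would still be present in the document source, but it would not be searchable through the index.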
Here is an example error message:
```
[4]: index [bro_index_2016.10.25.21], type [bro_doc], id [AVf-iCuooLg3mHEm2PpH], message [java.lang.IllegalArgumentException: Document contains at least one immense term in field="uri" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[<redacted>]...', original message: bytes can be at most 32766 in length; got 38623]
```
Relevant Lucene documentation: https://lucene.apache.org/core/6_2_1/core/constant-values.html#org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH
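
Note that the limit applies to the UTF-8 encoded length of the term, not to its character count. A minimal sketch of a check that could be run before indexing, where `exceeds_term_limit` is a hypothetical helper and not part of Metron:

```python
MAX_TERM_LENGTH = 32766  # IndexWriter.MAX_TERM_LENGTH, in bytes


def exceeds_term_limit(value, limit=MAX_TERM_LENGTH):
    """Return True if the UTF-8 encoding of value is longer than limit bytes."""
    return len(value.encode("utf-8")) > limit


print(exceeds_term_limit("/index.html"))      # short URI
print(exceeds_term_limit("/" + "a" * 38622))  # 38623 bytes, the size reported in the error above
```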