Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 7.2.1
- Fix Version/s: None
- Component/s: None
Steps to reproduce:
1. Create the index:
PUT testindex
{
"settings" : {
"index" : {
"number_of_shards" : 2,
"number_of_replicas" : 2
},
"analysis": {
"filter": {
"wordDelimiter": {
"type": "word_delimiter",
"generate_word_parts": "true",
"generate_number_parts": "true",
"catenate_words": "false",
"catenate_numbers": "false",
"catenate_all": "false",
"split_on_case_change": "true",
"preserve_original": "true",
"split_on_numerics": "true",
"stem_english_possessive": "true"
}
},
"analyzer": {
"content_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"asciifolding",
"wordDelimiter",
"lowercase"
]
}
}
}
}
}

2. Analyze the text:
POST testindex/_analyze
{
"analyzer": "content_analyzer",
"text": "ElasticSearch.TestProject"
}

The following tokens are generated:
{ "token": "elasticsearch.testproject", "start_offset": 0, "end_offset": 25, "type": "word", "position": 0 },
{ "token": "elastic", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 },
{ "token": "search", "start_offset": 7, "end_offset": 13, "type": "word", "position": 1 },
{ "token": "test", "start_offset": 14, "end_offset": 18, "type": "word", "position": 2 },
{ "token": "project", "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 }

Expected result:
In addition to the above tokens, elasticsearch and testproject should also be generated, so that the phrase query "elasticsearch testproject" matches as well.

Another example: the text "Super-Duper-0-AutoCoder" run through the above analyzer generates the token autocoder, while the text "Super-Duper-AutoCoder" does NOT generate the token autocoder.
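For reference, the phrase query mentioned above could look like the following sketch (the field name content is an assumption, since the report does not include a mapping):

```
POST testindex/_search
{
  "query": {
    "match_phrase": {
      "content": "elasticsearch testproject"
    }
  }
}
```

With the tokens currently produced, this phrase query cannot match, because neither elasticsearch nor testproject exists as a single token in the index.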
Description
When using the word_delimiter token filter, some expected tokens are not generated.
When I analyze the text "ElasticSearch.TestProject",
I expect the tokens elastic, search, test, project, elasticsearch, testproject, and elasticsearch.testproject to be generated, since split_on_case_change and split_on_numerics are enabled, preserve_original is true, and the analyzer uses a whitespace tokenizer.
But actually I only see the following tokens:
elasticsearch.testproject, elastic, search, test, project
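The gap between the actual and expected token sets can be sketched with a rough Python model of the splitting behavior (an illustration only, not the Lucene word_delimiter implementation; the helper names are made up):

```python
import re

# Rough model of the tokens involved. The whitespace tokenizer keeps
# "ElasticSearch.TestProject" as a single token; word_delimiter then
# splits it on the "." and on lower->upper case changes,
# preserve_original keeps the whole token, and lowercase runs last.

CASE_PARTS = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+")

def actual_tokens(token):
    """Tokens the analyzer actually emits (per this report)."""
    out = [token.lower()]  # preserve_original keeps the whole token
    for segment in re.split(r"[^A-Za-z0-9]+", token):
        out.extend(part.lower() for part in CASE_PARTS.findall(segment))
    return out

def expected_tokens(token):
    """Tokens the report expects: additionally, each delimiter-separated
    segment as a whole, e.g. "TestProject" -> "testproject"."""
    out = actual_tokens(token)
    for segment in re.split(r"[^A-Za-z0-9]+", token):
        joined = segment.lower()
        if joined and joined not in out:
            out.append(joined)
    return out

print(actual_tokens("ElasticSearch.TestProject"))
# ['elasticsearch.testproject', 'elastic', 'search', 'test', 'project']
print(expected_tokens("ElasticSearch.TestProject"))
# additionally contains 'elasticsearch' and 'testproject'
```

The model makes the missing step concrete: the filter emits the case-split parts of each segment but never the whole segment itself, which is exactly the token the phrase query would need.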