Overview

PreAnalyzedField fails to index documents whose pre-analyzed value contains no tokens, such as the following:

{
  "v": "1",
  "str": "foo",
  "tokens": []
}

Details

PreAnalyzedField consumes field values that have already been analyzed. A pre-analyzed value has the following format:

{
  "v":"1",
  "str":"test",
  "tokens": [
    {"t":"one","s":123,"e":128,"i":22,"p":"DQ4KDQsODg8=","y":"word"},
    {"t":"two","s":5,"e":8,"i":1,"y":"word"},
    {"t":"three","s":20,"e":22,"i":1,"y":"foobar"}
  ]
}
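For illustration, the sketch below renders one entry of the "tokens" array. It assumes the key meanings given in the Solr Reference Guide ("t" token text, "s"/"e" start and end offsets, "i" position increment, "p" Base64-encoded payload, "y" type). PreAnalyzedToken is a hypothetical helper, not a Solr class.

```java
import java.util.Base64;

// Hypothetical helper for illustration only; it is not part of Solr.
final class PreAnalyzedToken {

    // Renders one "tokens" entry. Assumes text and type contain no characters
    // that would need JSON escaping.
    static String toJson(String text, int start, int end, int posInc,
                         byte[] payload, String type) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"t\":\"").append(text).append("\"")
          .append(",\"s\":").append(start)
          .append(",\"e\":").append(end)
          .append(",\"i\":").append(posInc);
        if (payload != null) {
            // The payload is binary, so it travels as Base64 text in "p".
            sb.append(",\"p\":\"")
              .append(Base64.getEncoder().encodeToString(payload))
              .append("\"");
        }
        sb.append(",\"y\":\"").append(type).append("\"}");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decode the payload of the first token in the example above.
        byte[] payload = Base64.getDecoder().decode("DQ4KDQsODg8=");
        System.out.println(toJson("one", 123, 128, 22, payload, "word"));
    }
}
```

Tokens without a payload, like the second and third entries above, simply omit the "p" key.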

As the documentation mentions, "str" and "tokens" are optional, i.e., each key may have an empty value or be omitted entirely. However, when "tokens" is empty or not defined, PreAnalyzedField throws an IOException and fails to index the document.

This error is related to the behavior of Field#tokenStream. This method tries to create a TokenStream in the following steps (NOTE: assume indexed=true):

1. If the field has a tokenStream value, return it.
2. Otherwise, create a tokenStream by parsing the stored value.

If the pre-analyzed value has no tokens, the second step is executed. Unfortunately, PreAnalyzedField always returns PreAnalyzedAnalyzer as the index analyzer, and the stored value (i.e., the value of "str") is not in the pre-analyzed format, so this step fails with a pre-analyzed format error (i.e., IOException).
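The failure path described above can be modelled with a small self-contained sketch. All names here are hypothetical stand-ins and the logic is deliberately simplified; it does not reproduce the Lucene or Solr source. A list of strings plays the role of a TokenStream, and null means that no token stream was attached to the field.

```java
import java.io.IOException;
import java.util.List;

// Simplified, hypothetical model of the behaviour described in this report.
final class TokenStreamFallbackDemo {

    // Stand-in for PreAnalyzedAnalyzer: accepts only the pre-analyzed JSON envelope.
    static List<String> preAnalyzedAnalyzer(String value) throws IOException {
        if (!value.trim().startsWith("{")) {
            throw new IOException("Invalid pre-analyzed format: " + value);
        }
        return List.of();
    }

    // Stand-in for Field#tokenStream with indexed=true.
    static List<String> tokenStream(List<String> attached, String storedValue)
            throws IOException {
        if (attached != null) {
            return attached;                      // step 1: use the attached stream
        }
        return preAnalyzedAnalyzer(storedValue);  // step 2: analyze the stored value
    }

    // Models the reported behaviour: nothing is attached when "tokens" is empty,
    // so the plain "str" value reaches the analyzer.
    static List<String> index(String str, List<String> tokens) throws IOException {
        List<String> attached = tokens.isEmpty() ? null : tokens;
        return tokenStream(attached, str);
    }

    public static void main(String[] args) {
        try {
            System.out.println(index("test", List.of("one", "two", "three")));
            index("foo", List.of());
        } catch (IOException e) {
            System.out.println("indexing failed: " + e.getMessage());
        }
    }
}
```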

How to reproduce

1. Download the latest Solr package and set up a Solr server according to the Solr Tutorial.
2. Add the following fieldType and field to the schema.

    <fieldType name="preanalyzed-with-analyzer" class="solr.PreAnalyzedField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
    <field name="pre_with_analyzer" type="preanalyzed-with-analyzer" indexed="true" stored="true" multiValued="false"/>

3. Index the following documents. Solr throws an IOException for the second and third.

// This is OK
{"id": 1, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document one\',\'tokens\':[{\'t\':\'one\'},{\'t\':\'two\'},{\'t\':\'three\',\'i\':100}]}"}

// Solr throws IOException
{"id": 2, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document two\', \'tokens\':[]}"}

// Solr throws IOException
{"id": 3, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document three\'}"}

How to fix

Because there is nothing to analyze again when "tokens" is empty or not set, we can avoid this error by setting an EmptyTokenStream as the tokenStream instead, as in the following code:

parse.hasTokenStream() ? parse : new EmptyTokenStream()
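The effect of this guard can be sketched in isolation. Parsed below is a hypothetical stand-in for the parsed pre-analyzed value, and an empty list stands in for EmptyTokenStream; the attached patch remains the authoritative change.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical model of the proposed guard; not the code from the patch.
final class EmptyStreamGuardDemo {

    // Stand-in for the parse result; tokens is null when the "tokens" key is absent.
    static final class Parsed {
        final List<String> tokens;

        Parsed(List<String> tokens) {
            this.tokens = tokens;
        }

        boolean hasTokenStream() {
            return tokens != null && !tokens.isEmpty();
        }
    }

    // Mirrors the ternary above: the result is never null, so the caller always
    // has a stream to attach and never falls back to analyzing the stored value.
    static List<String> streamFor(Parsed parse) {
        return parse.hasTokenStream() ? parse.tokens : Collections.emptyList();
    }

    public static void main(String[] args) {
        System.out.println(streamFor(new Parsed(List.of("one", "two"))));
        System.out.println(streamFor(new Parsed(List.of())).isEmpty());
        System.out.println(streamFor(new Parsed(null)).isEmpty());
    }
}
```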

Attachments

SOLR-12518.patch (7 kB), attached by Yuki Yano on 26/Jun/18 07:23

People

Assignee: Unassigned

Reporter: Yuki Yano

Dates

Created: 26/Jun/18 07:23

Updated: 08/Jun/19 15:01