Solr / SOLR-12518

PreAnalyzedField fails to index documents without tokens

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: update
    • Labels: None

    Description

      Overview

      PreAnalyzedField fails to index a document whose pre-analyzed value contains no tokens, such as the following:

      {
        "v": "1",
        "str": "foo",
        "tokens": []
      }
      

      Details

      PreAnalyzedField consumes field values that have been analyzed in advance. A pre-analyzed value has the following format:

      {
        "v":"1",
        "str":"test",
        "tokens": [
          {"t":"one","s":123,"e":128,"i":22,"p":"DQ4KDQsODg8=","y":"word"},
          {"t":"two","s":5,"e":8,"i":1,"y":"word"},
          {"t":"three","s":20,"e":22,"i":1,"y":"foobar"}
        ]
      }
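
      For readers unfamiliar with the format: as far as I understand the Solr Reference Guide, "t" is the term text, "s" and "e" are the start and end offsets, "i" is the position increment, "y" is the token type, and "p" is a base64-encoded payload. The following is a minimal sketch that decodes the payload of the first token above using only the JDK; the class and method names are made up for illustration and are not part of Solr.

```java
import java.util.Base64;

public class PayloadDecodeSketch {
    /** Decodes the base64 text of a "p" key into raw payload bytes. */
    static byte[] decodePayload(String p) {
        return Base64.getDecoder().decode(p);
    }

    /** Renders bytes as space-separated two-digit hex values. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Payload of the token "one" in the example above.
        System.out.println(toHex(decodePayload("DQ4KDQsODg8=")));
        // prints "0d 0e 0a 0d 0b 0e 0e 0f"
    }
}
```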
      

      As the documentation mentions, "str" and "tokens" are optional: each key may be omitted or given an empty value. However, when "tokens" is empty or not defined, PreAnalyzedField throws an IOException and fails to index the document.

      The error stems from the behavior of Field#tokenStream, which creates a TokenStream through the following steps (assuming indexed=true):

      • If the field has a tokenStream value, return it.
      • Otherwise, create a token stream by analyzing the stored value.

      If the pre-analyzed value has no tokens, the second step is executed. Unfortunately, PreAnalyzedField always returns PreAnalyzedAnalyzer as the index analyzer, and the stored value (i.e., the value of "str") is not in the pre-analyzed format, so this step fails with a pre-analyzed format error (i.e., an IOException).
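
      The failure path described above can be modeled without Solr on the classpath. The sketch below is a simplified stand-in, not the real Lucene or Solr code: Parsed, preAnalyzedAnalyze, tokenStream and failsToIndex are hypothetical names, token streams are reduced to lists of term texts, and "str" is assumed to be non-null. It only illustrates why a missing or empty token list ends up in the analyzer and fails there.

```java
import java.io.IOException;
import java.util.List;

public class TokenStreamFallbackSketch {
    /** Hypothetical stand-in for a parsed pre-analyzed value. */
    static final class Parsed {
        final String str;          // stored value ("str"), assumed non-null here
        final List<String> tokens; // term texts; may be null or empty

        Parsed(String str, List<String> tokens) {
            this.str = str;
            this.tokens = tokens;
        }

        boolean hasTokenStream() {
            return tokens != null && !tokens.isEmpty();
        }
    }

    /** Stand-in for PreAnalyzedAnalyzer: rejects input that is not pre-analyzed JSON. */
    static List<String> preAnalyzedAnalyze(String text) throws IOException {
        if (!text.trim().startsWith("{")) {
            throw new IOException("not in pre-analyzed format: " + text);
        }
        return List.of(); // real parsing is omitted in this sketch
    }

    /** Models the two steps: use the token stream if present, else analyze the stored value. */
    static List<String> tokenStream(Parsed p) throws IOException {
        if (p.hasTokenStream()) {
            return p.tokens;
        }
        return preAnalyzedAnalyze(p.str);
    }

    /** Returns true when building the token stream throws, as with documents 2 and 3 in the report. */
    static boolean failsToIndex(Parsed p) {
        try {
            tokenStream(p);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsToIndex(new Parsed("document one", List.of("one", "two", "three")))); // false
        System.out.println(failsToIndex(new Parsed("document two", List.of())));                      // true
        System.out.println(failsToIndex(new Parsed("document three", null)));                         // true
    }
}
```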

      How to reproduce

      1. Download the latest Solr package and set up a Solr server according to the Solr Tutorial.
      2. Add the following fieldType and field to the schema.

          <fieldType name="preanalyzed-with-analyzer" class="solr.PreAnalyzedField">
            <analyzer>
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            </analyzer>
          </fieldType>
          <field name="pre_with_analyzer" type="preanalyzed-with-analyzer" indexed="true" stored="true" multiValued="false"/>
      

      3. Index the following documents. Solr throws an IOException for the second and third.

      // This is OK
      {"id": 1, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document one\',\'tokens\':[{\'t\':\'one\'},{\'t\':\'two\'},{\'t\':\'three\',\'i\':100}]}"}
      
      // Solr throws IOException
      {"id": 2, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document two\', \'tokens\':[]}"}
      
      // Solr throws IOException
      {"id": 3, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document three\'}"}
      

      How to fix

      Because there is nothing to analyze when "tokens" is empty or not set, we can avoid this error by setting an EmptyTokenStream as the tokenStream instead, as in the following code:

      parse.hasTokenStream() ? parse : new EmptyTokenStream()
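
      To show the effect of this selection in isolation, here is a small JDK-only sketch. It is a simplified model under stated assumptions, not the attached patch: Parsed and selectTokenStream are hypothetical names, and an empty list stands in for EmptyTokenStream. The point is that a token stream is always supplied, so the fallback to the analyzer is never reached.

```java
import java.util.Collections;
import java.util.List;

public class EmptyStreamFixSketch {
    /** Hypothetical stand-in for a parsed pre-analyzed value. */
    static final class Parsed {
        final String str;          // stored value ("str")
        final List<String> tokens; // term texts; may be null or empty

        Parsed(String str, List<String> tokens) {
            this.str = str;
            this.tokens = tokens;
        }

        boolean hasTokenStream() {
            return tokens != null && !tokens.isEmpty();
        }
    }

    /** Mirrors the ternary above: parsed tokens if present, otherwise an empty stream. */
    static List<String> selectTokenStream(Parsed p) {
        return p.hasTokenStream() ? p.tokens : Collections.emptyList();
    }

    public static void main(String[] args) {
        // Tokens present: they are used as-is.
        System.out.println(selectTokenStream(new Parsed("document one", List.of("one", "two", "three"))).size()); // 3
        // Empty or missing tokens: an empty stream is used, so nothing is re-analyzed.
        System.out.println(selectTokenStream(new Parsed("document two", List.of())).size());  // 0
        System.out.println(selectTokenStream(new Parsed("document three", null)).size());     // 0
    }
}
```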
      

      Attachments

        1. SOLR-12518.patch
          7 kB
          Yuki Yano

          People

            Assignee: Unassigned
            Reporter: Yuki Yano (yuyano)