SOLR-4648: Create a PreAnalyzedUpdateProcessor

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3, 6.0
    • Fix Version/s: 4.3, 6.0
    • Component/s: update
    • Labels: None

      Description

      Spin-off from the discussion in SOLR-4619.

      Instead of using a PreAnalyzedField type we can use an UpdateRequestProcessor that converts any input field values to StorableField-s, using the PreAnalyzedParser-s, and then directly passes StorableField-s to DocumentBuilder for indexing.
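
      (A minimal sketch of the idea, assuming an implementation built on FieldMutatingUpdateProcessorFactory as discussed in the comments below. The parseOneField(...) helper is hypothetical and stands in for the PreAnalyzedParser plumbing, and IndexableField is used where the text above says StorableField. This is an illustration, not the attached patch.)

          package org.apache.solr.update.processor;

          import org.apache.lucene.index.IndexableField;
          import org.apache.solr.common.SolrInputField;
          import org.apache.solr.request.SolrQueryRequest;
          import org.apache.solr.response.SolrQueryResponse;
          import org.apache.solr.schema.IndexSchema;

          public class PreAnalyzedUpdateProcessorFactory extends FieldMutatingUpdateProcessorFactory {

            @Override
            public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                      UpdateRequestProcessor next) {
              final IndexSchema schema = req.getSchema();
              return new FieldMutatingUpdateProcessor(getSelector(), next) {
                @Override
                protected SolrInputField mutate(SolrInputField src) {
                  SolrInputField dst = new SolrInputField(src.getName());
                  for (Object o : src) {  // SolrInputField iterates over all of its values
                    // Hypothetical helper: parse the serialized (e.g. JSON) token stream and
                    // stored value into a ready-to-index Lucene field.
                    IndexableField f = parseOneField(schema, src.getName(), o.toString());
                    dst.addValue(f, 1.0f);
                  }
                  return dst;
                }
              };
            }

            // Placeholder for the PreAnalyzedParser plumbing; not part of the actual patch.
            private IndexableField parseOneField(IndexSchema schema, String name, String serialized) {
              throw new UnsupportedOperationException("sketch only");
            }
          }

      Such a factory would be registered in an updateRequestProcessorChain in solrconfig.xml like any other update processor factory.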

      Attachments

      1. SOLR-4648.patch
        25 kB
        Andrzej Bialecki
      2. SOLR-4648.patch
        20 kB
        Andrzej Bialecki
      3. SOLR-4648.patch
        16 kB
        Andrzej Bialecki
      4. SOLR-4648.patch
        16 kB
        Andrzej Bialecki

        Activity

        Andrzej Bialecki added a comment -

        Patch and unit test.

        Andrzej Bialecki added a comment -

        Oops, this needs to be visible after all...

        David Smiley added a comment -

        Great start Andrzej!

        Reviewing the patch...

        I think PreAnalyzedUpdateProcessorFactory should extend Jan Høydahl's FieldMutatingUpdateProcessorFactory, which will shorten some of your code and give it more flexibility on how fields are matched.

        In PreAnalyzedField line 112:

              LOG.warn("Error parsing pre-analyzed field '" + field.getName() + "': " + e.toString());
        

        Shouldn't you pass 'e' as the last arg to warn()?
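
        (A call of that shape, passing the exception as the final argument so the stack trace is logged, would look like the following; this is the suggested change, not the code as posted in the patch.)

              LOG.warn("Error parsing pre-analyzed field '" + field.getName() + "'", e);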

        It would be interesting to test this against the example Solr schema, pre-analyzing every field, to see if it works, comparing the index output using the SimpleTextCodec. I strongly suspect it won't, since there is more to the state of a Field than its tokenStream and stored value, which seem to be the only things the code in this patch accounts for; for example, its 'type' (Lucene FieldType).

        In summary, great start!

        Andrzej Bialecki added a comment -

        David, thanks for the review. I'll prepare a new patch, using the FieldMutatingUP* classes - indeed, I missed them and they help a lot.

        since there is more to the state of a Field than its tokenStream and stored value...

        I can try creating the StorableField using the SchemaField.createField(...) for this field's type, so that most of its state will be specific to the declared schema field, and the only mutations will be to set the optional stringValue and the tokenStreamValue, and to flip the type flags accordingly.

        Andrzej Bialecki added a comment -

        Updated patch, using FieldMutatingUP*, more tests and javadocs.

        David Smiley added a comment -

        Feedback on your patch:

        • in Solr, it's bad form to call Class.forName() as you're doing in PreAnalyzedField; you're supposed to use the SolrResourceLoader instead: schema.getResourceLoader().findClass(implName, PreAnalyzedParser.class);
        • minor: you're using 4-spaces indent instead of 2 in the main value loop in your URP
        • in those log.debug() calls, it's creating the string to potentially not even log it. Instead, use this feature of SLF4J: log.debug("Ignoring stored part - field '{}' is not stored.", sf.getName()); So few people use that great feature of SLF4J and instead rely on bulkier surrounding conditionals.
        • When you call sf.createFields(o, 1.0f) in the URP, isn't this doing analysis, the very analysis that you're trying to avoid by pre-analyzing? It is. And isn't 'o' the JSON (or similar) representation of the analyzed token stream, which therefore shouldn't be passed in here? When I run your test, I get a logged error because of this, actually. Stepping through with a debugger further validated this. This must be remedied. Looking at what FieldType.createField() is doing, I propose you do the same in this URP:
            org.apache.lucene.document.FieldType newType = new org.apache.lucene.document.FieldType();
            newType.setIndexed(field.indexed());
            newType.setTokenized(field.isTokenized());
            newType.setStored(field.stored());
            newType.setOmitNorms(field.omitNorms());
            newType.setIndexOptions(getIndexOptions(field, val));
            newType.setStoreTermVectors(field.storeTermVector());
            newType.setStoreTermVectorOffsets(field.storeTermOffsets());
            newType.setStoreTermVectorPositions(field.storeTermPositions());

            // getIndexOptions() is:
            IndexOptions options = IndexOptions.DOCS_AND_FREQS_AND_POSITIONS;
            if (field.omitTermFreqAndPositions()) {
              options = IndexOptions.DOCS_ONLY;
            } else if (field.omitPositions()) {
              options = IndexOptions.DOCS_AND_FREQS;
            } else if (field.storeOffsetsWithPositions()) {
              options = IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
            }

            // then what createField() does is:
            Field f = new Field(name, val, type);
            f.setBoost(boost);

        This won't work for all field types (e.g. numbers), but it will for TextField (the one with analysis!) and could potentially work for a custom FieldType. I don't think we should put much, if any, work into supporting other FieldTypes.

        The resulting "Field" instance can be shared from one document to the next I believe, and so you can cache this in the URP and reset its value & tokenStream.

        Andrzej Bialecki added a comment -

        Thanks again for a thorough review. New patch is attached.

        in Solr, it's bad form to call Class.forName() ..

        Fixed.

        minor: you're using 4-spaces indent instead of 2 in the main value loop in your URP

        Fixed (it was actually a left-over from removing an outer-level loop, all other indents are 2 spaces).

        in those log.debug() calls, it's creating the string to potentially not even log it

        Fixed.

        Looking at what FieldType.createField() is doing, I propose you do the same in this URP ...

        Funny thing, I had this in one version of the patch and then decided to reuse SF.createField(..) to avoid code duplication. The problem is that SchemaField.isTokenized() has package-level visibility, so it's not visible from the UP's package. I fixed this by providing a utility method in PreAnalyzedField that creates a FieldType. I also moved the chunk of logic for setting/resetting the Field content and type flags based on SchemaField there. Overall, it simplifies the UP.
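
        (For reference, a sketch of what such a utility method might look like, reusing the logic quoted from FieldType.createField() in the previous comment; the exact name and signature in the committed patch may differ. It needs to live in org.apache.solr.schema so the package-private SchemaField.isTokenized() is visible.)

            // Sketch only -- assumed imports: org.apache.lucene.index.FieldInfo.IndexOptions,
            // org.apache.solr.schema.SchemaField; placed in PreAnalyzedField (org.apache.solr.schema).
            public static org.apache.lucene.document.FieldType createFieldType(SchemaField field) {
              org.apache.lucene.document.FieldType newType = new org.apache.lucene.document.FieldType();
              newType.setIndexed(field.indexed());
              newType.setTokenized(field.isTokenized());  // package-private, hence this helper's location
              newType.setStored(field.stored());
              newType.setOmitNorms(field.omitNorms());
              IndexOptions options = IndexOptions.DOCS_AND_FREQS_AND_POSITIONS;
              if (field.omitTermFreqAndPositions()) {
                options = IndexOptions.DOCS_ONLY;
              } else if (field.omitPositions()) {
                options = IndexOptions.DOCS_AND_FREQS;
              } else if (field.storeOffsetsWithPositions()) {
                options = IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS;
              }
              newType.setIndexOptions(options);
              newType.setStoreTermVectors(field.storeTermVector());
              newType.setStoreTermVectorOffsets(field.storeTermOffsets());
              newType.setStoreTermVectorPositions(field.storeTermPositions());
              return newType;
            }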

        The resulting "Field" instance can be shared from one document to the next I believe, and so you can cache this in the URP and reset its value & tokenStream.

        Hmm, this doesn't seem feasible at all. First, this cache would have to be thread-safe and prevent reuse of Field instances until the document is actually processed by IndexWriter, and I don't think there's a mechanism to enforce that in the context of this class. Also, it would have to cache multiple Field instances, because processing a single document may result in creating several (at least one per pre-analyzed field, more if fields are multi-valued).

        David Smiley added a comment -

        I like how your refactoring has clarified the logic in the URP. And I like your new PreAnalyzedUpdateProcessorFactory.createFieldType() method.

        I think you're right that sharing Field instances isn't such a good idea after all.

        • minor: in the URP, "for (Object o : src.getValues()) {" can be "for (Object o : src) {" since it implements Iterable, and for the single-value case avoids redundant collection wrapping.

        The remaining thing I'm thinking of is whether PreAnalyzedUpdateProcessorFactory.createFieldType() should move down into Solr's FieldType, where it can be re-used so we don't duplicate this code. It seems pretty clear this is a good idea. Then, I'm wondering if these Lucene FieldType instances could be cached on Solr's SchemaField so that they don't have to be needlessly re-created for each indexed value that runs through Solr. The only obstacle I see is that getIndexOptions(field, val) takes the value, and if that value were to alter the logic then the FieldType couldn't be shared; but this is a protected method, I don't see anything that overrides it, and the default implementation doesn't use the value. I'll create another issue for that; I can get off track easily and try to fix the world in one issue.

        Andrzej Bialecki added a comment -

        minor: in the URP, "for (Object o : src.getValues()) {" can be "for (Object o : src) {" since it implements Iterable

        I'll fix this before committing. Thanks again for the review.

        PreAnalyzedUpdateProcessorFactory.createFieldType() ..

        +1 to create a separate issue.

        Andrzej Bialecki added a comment -

        Committed to trunk in rev. 1464889.
        Committed to branch_4x in rev. 1464902.

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee: Andrzej Bialecki
          • Reporter: Andrzej Bialecki
          • Votes: 1
          • Watchers: 3
