SOLR-17052

SchemaCodecFactory/IndexSchema/FieldType relationships are kludgy, buggy, and inefficient

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      While getting familiar with the SolrCore + CodecFactory + SchemaCodecFactory + FieldType related code relevant to SOLR-17045, SOLR-17046, & SOLR-17047, it occurred to me that there are a lot of inefficiencies and kludginess in how FieldType based "codec overrides" are used (and validated) by SchemaCodecFactory (and SolrCore.initCodec):

      • SolrCore.initCodec needs to be aware of all the possible ways a FieldType instance might support codec overrides
        • ... so it can fail if any are specified unless the CodecFactory instanceOf SolrCoreAware
          • ... even though that still doesn't ensure the factory supports those field type overrides
        • This validation currently just looks at getPostingsFormatForField & getDocValuesFormatForField
          • ... it's ignorant about DenseVectorField's assumptions about being able to override aspects of the KnnVectorsFormat
          • ... and AFAICT, what validation is done doesn't help if the Schema API is used to add new field types (w/ postingsFormat or docValuesFormat overrides)
      • In all of the SchemaCodecFactory "per-field" methods (getPostingsFormatForField, getDocValuesFormatForField, & getKnnVectorsFormatForField) ...
        • ... every call to these methods resolves a SchemaField instance – even though only the (Solr) FieldType is needed
          • Asking the IndexSchema for the SchemaField of a fieldName has more overhead than just asking for the FieldType
          • None of the things these methods care about can be configured on a per-fieldName basis anyway.
        • For PostingsFormat and DocValuesFormat, every call to these methods repeats the SPI lookup on the "format name" configured on the FieldType instance
        • For KnnVectorsFormat every call to this method constructs a new SolrDelegatingKnnVectorsFormat – even though the same instance could be re-used for every field of the same FieldType instance.
      • In FieldType ...
        • ... there is no validation anywhere that the postingsFormat or docValuesFormat are valid
          • ... bogus values only cause a problem when the SchemaCodecFactory tries to resolve them (when indexing)
      • In DenseVectorField ...
        • ... checkSchemaField validates (and logs warnings) based on the vectorEncoding and dimensions...
          • ... Even though these validations aren't "field" specific – they are "type" specific, and could be validated in DenseVectorField.init()
        • BUT! ... there is no validation anywhere that the knnAlgorithm is supported, or that the HNSW options make sense for it
          • These are only validated by the Codec.getKnnVectorsFormatForField(...) impl provided by SchemaCodecFactory ...
            • ... and they are redundantly validated on every call
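
      One possible way to avoid the repeated per-call work described above is to resolve each configured format once and reuse the result. The following is only a minimal sketch under stated assumptions: PerTypeFormatCache and FormatCacheDemo are hypothetical names, not existing Solr classes, and the resolver is a plain function so the example runs without Lucene on the classpath. In Solr the resolver would presumably be an SPI lookup such as Lucene's PostingsFormat.forName(String) or DocValuesFormat.forName(String).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of Solr): memoizes one resolved format
 * per configured format name, so the lookup is not repeated on every call.
 */
final class PerTypeFormatCache<F> {
  private final Map<String, F> resolved = new ConcurrentHashMap<>();
  private final Function<String, F> resolver;

  PerTypeFormatCache(Function<String, F> resolver) {
    this.resolver = resolver;
  }

  /** Runs the resolver at most once per name; later calls return the cached instance. */
  F get(String formatName) {
    return resolved.computeIfAbsent(formatName, resolver);
  }
}

/** Small demonstration: three requests over two distinct names trigger two lookups. */
final class FormatCacheDemo {
  static int countLookups() {
    AtomicInteger lookups = new AtomicInteger();
    // Stand-in resolver; a real one would perform the SPI lookup by name.
    PerTypeFormatCache<String> cache = new PerTypeFormatCache<>(name -> {
      lookups.incrementAndGet();
      return "resolved:" + name;
    });
    cache.get("formatA");
    cache.get("formatA"); // served from the cache, no second lookup
    cache.get("formatB");
    return lookups.get();
  }
}
```

      Because computeIfAbsent rethrows any unchecked exception from the resolver and records no mapping, calling get(...) once while a field type is being initialized would also surface a bogus format name at that point instead of at indexing time. Whether that hook fits into FieldType initialization is an open design question for this issue.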

            People

              Assignee: Unassigned
              Reporter: Chris M. Hostetter (hossman)
              Votes: 0
              Watchers: 2