Solr / SOLR-217

schema option to ignore unused fields

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.2
    • Component/s: update
    • Labels:
      None

      Description

      One thing that causes problems for me (and I assume others) is that Solr is schema-strict: unknown fields cause Solr to throw exceptions, and there is no way to relax this constraint. This can cause serious problems if you have automated feeding applications that do things like SELECT * FROM table1, or if you want to add other fields to the document for processing purposes before sending it to Solr but don't want to deal with 'cleanup'.

      1. ASF.LICENSE.NOT.GRANTED--ignoreNonIndexedNonStoredField.patch
        0.7 kB
        Will Johnson
      2. ignoreUnnamedFields_v3.patch
        3 kB
        Hoss Man
      3. ignoreUnnamedFields_v3.patch
        2 kB
        Will Johnson
      4. ignoreUnnamedFields.patch
        2 kB
        Will Johnson

        Activity

        Will Johnson added a comment -

        The attached patch solves this problem by adding a new option to schema.xml that allows unnamed fields, including those that don't match any dynamic field, to be ignored. The default is false if the attribute is missing, which is consistent with existing Solr functionality. To enable this feature, schema.xml would look like:

        .... blah blah blah ...
        <fields ignoreUnnamedFields="true">
        .... blah blah blah ...

        Yonik Seeley added a comment -

        This is a unique enough requirement that I'm not sure an additional configuration switch is warranted.

        However, another solution might be to allow fields to be unstored and unindexed (essentially doing nothing). That would allow you to map a dynamic field of "*" to an unstored + unindexed field.
        It would also allow people to transition schemas + older clients. They could change the old field to unstored + unindexed and use a copyField to move it to the new field.

        Will Johnson added a comment -

        I was actually taking this requirement from the other enterprise search
        engines that I've worked with, which do this by default; i.e., Solr is
        different in this case. Your *->nothing method sounds good as well, but it
        doesn't seem as obvious to someone reading the schema or trying to feed
        data. You might also run into problems later on when there are other
        properties for 'things to do' for fields other than indexing and searching.

        • will
        Erik Hatcher added a comment -

        I like Yonik's suggestion of allowing unstored+unindexed fields to be no-op.

        Hoss Man added a comment -

        Whatever mechanism we may add for supporting something like this, the default if unspecified should definitely be an error ... if Solr is asked to index data it doesn't know what to do with, it should complain, rather than silently ignoring it ... this will help people with typos in their schema or indexing code find their problems faster.

        As for the proposed solutions: my initial reaction to reading the comments so far was to agree with Will: having an explicit true/false option makes it much clearer to people reading the schema what's going on ... but in thinking about the possible use cases I prefer Yonik's approach: leveraging the existing field/dynamicField syntax will allow people to not only say "any unknown field should be ignored" but also "field XXXX should be ignored" and "any unknown field that starts with S_* should be ignored"

        (there's also the question as to what should happen if I did have a stored="true" dynamicField of "*" and I set ignoreUnnamedFields="true")

        For the example config, we might want to do something like this to make it more obvious what's going on, and to serve as a recommended config style...

        <!-- since fields of this type are by default not stored or indexed, any data added to
        them will be ignored outright
        -->
        <fieldtype name="ignored" stored="false" indexed="false" class="solr.StrField" />
        ...
        <!-- ignore any fields that don't already match an existing field name or dynamic field -->
        <dynamicField name="*" type="ignored" />
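
The lookup order this configuration relies on can be sketched in Java as follows. This is a simplified model for illustration, not Solr's schema code: the class and method names are hypothetical, the "*_s" pattern is an invented example, glob matching handles only a single leading or trailing "*", and the rule that the longest matching pattern wins is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of field lookup; hypothetical names, not Solr's own classes.
class FieldResolutionSketch {

    /** Hypothetical stand-in for a field type: only the two flags discussed here. */
    static final class FieldType {
        final String name;
        final boolean indexed;
        final boolean stored;

        FieldType(String name, boolean indexed, boolean stored) {
            this.name = name;
            this.indexed = indexed;
            this.stored = stored;
        }

        /** A type that is neither indexed nor stored does nothing with its data. */
        boolean isIgnored() {
            return !indexed && !stored;
        }
    }

    private final Map<String, FieldType> explicitFields = new LinkedHashMap<>();
    private final Map<String, FieldType> dynamicFields = new LinkedHashMap<>();

    void addField(String name, FieldType type) { explicitFields.put(name, type); }

    void addDynamicField(String glob, FieldType type) { dynamicFields.put(glob, type); }

    /** Simplified glob match: supports "*" alone, "*suffix", or "prefix*". */
    static boolean matches(String glob, String fieldName) {
        if (glob.equals("*")) return true;
        if (glob.startsWith("*")) return fieldName.endsWith(glob.substring(1));
        if (glob.endsWith("*")) return fieldName.startsWith(glob.substring(0, glob.length() - 1));
        return glob.equals(fieldName);
    }

    /** Explicit fields first, then the longest matching pattern (assumed rule), else an error. */
    FieldType resolve(String fieldName) {
        FieldType explicit = explicitFields.get(fieldName);
        if (explicit != null) return explicit;
        String bestGlob = null;
        for (String glob : dynamicFields.keySet()) {
            if (matches(glob, fieldName) && (bestGlob == null || glob.length() > bestGlob.length())) {
                bestGlob = glob;
            }
        }
        if (bestGlob == null) {
            // Without a catch-all pattern, unknown fields remain an error.
            throw new IllegalArgumentException("unknown field '" + fieldName + "'");
        }
        return dynamicFields.get(bestGlob);
    }

    /** One explicit field, an illustrative "*_s" pattern, and the catch-all "*". */
    static FieldResolutionSketch exampleSchema() {
        FieldType string = new FieldType("string", true, true);
        FieldType ignored = new FieldType("ignored", false, false);
        FieldResolutionSketch schema = new FieldResolutionSketch();
        schema.addField("id", string);
        schema.addDynamicField("*_s", string);
        schema.addDynamicField("*", ignored);
        return schema;
    }

    public static void main(String[] args) {
        FieldResolutionSketch schema = exampleSchema();
        for (String name : new String[] {"id", "title_s", "legacy_column"}) {
            FieldType type = schema.resolve(name);
            System.out.println(name + " -> " + type.name + (type.isIgnored() ? " (dropped)" : ""));
        }
    }
}
```

In this model, removing the "*" entry restores the strict behavior: resolving an unmatched name throws instead of silently dropping the data.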

        Will Johnson added a comment -

        I like that solution and I can definitely see the advantages of having
        dumb_*=ignored and so on. How does this patch sound instead of the
        previous:

        public Field createField(SchemaField field, String externalVal, float boost) {
          String val;
          try {
            val = toInternal(externalVal);
          } catch (NumberFormatException e) {
            throw new SolrException(500, "Error while creating field '" + field + "' from value '" + externalVal + "'", e, false);
          }
          if (val==null) return null;
          if (!field.indexed() && !field.stored()) {
            log.finest("Ignoring unindexed/unstored field: " + field);
            return null;
          }

          ... blah blah blah....

        • will
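
The effect of that guard on a caller can be sketched with a small self-contained Java model. The names below (IgnoredFieldSketch, SchemaFieldStub, buildDocument) are hypothetical stand-ins rather than Solr or Lucene classes, and the sketch assumes the caller skips null results instead of adding them to the document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; not Solr or Lucene classes.
class IgnoredFieldSketch {

    /** Minimal field description: a name plus the two flags the guard inspects. */
    static final class SchemaFieldStub {
        final String name;
        final boolean indexed;
        final boolean stored;

        SchemaFieldStub(String name, boolean indexed, boolean stored) {
            this.name = name;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    /** Returns null for a field that is neither indexed nor stored, like the guard in the patch. */
    static String createField(SchemaFieldStub field, String externalVal) {
        if (externalVal == null) return null;
        if (!field.indexed && !field.stored) {
            return null; // nothing to index and nothing to store: drop the value
        }
        return field.name + "=" + externalVal;
    }

    /** Builds a document, skipping every field for which createField returned null. */
    static List<String> buildDocument(SchemaFieldStub[] fields, String[] values) {
        List<String> doc = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            String created = createField(fields[i], values[i]);
            if (created != null) {
                doc.add(created);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        SchemaFieldStub[] fields = {
            new SchemaFieldStub("id", true, true),
            new SchemaFieldStub("scratch", false, false)
        };
        // The "scratch" value never reaches the document.
        System.out.println(buildDocument(fields, new String[] {"42", "temp data"}));
    }
}
```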
        J.J. Larrea added a comment -

        +1 to Hoss' elaboration of Yonik's suggested approach, except that for backward compatibility (where we DO want an error for unknown fields) schema.xml should probably read something like:

        <!-- since fields of this type are by default not stored or indexed, any data added to
        them will be ignored outright
        -->
        <fieldtype name="ignored" stored="false" indexed="false" class="solr.StrField" />
        ...
        <!-- uncomment the following to ignore any fields that don't already match an existing
        field name or dynamic field, rather than reporting them as an error.
        alternately, change the type="ignored" to some other type e.g. "text" if you want
        unknown fields indexed and/or stored by default -->
        <!--dynamicField name="*" type="ignored" /-->

        Will Johnson added a comment -

        Since we now have required fields (http://issues.apache.org/jira/browse/SOLR-181), any chance we can have ignored fields as well? Let me know if something else needs to be done to the patch, but as far as I can tell the code works and people seem to agree that it's the correct approach.

        • will
        Yonik Seeley added a comment -

        Will, could you please add the last patch again, and click "Grant License to ASF"?

        Will Johnson added a comment -

        v3 patch included. this version of the patch also takes into account the suggested example/solr/conf/schema.xml changes.

        Hoss Man added a comment -

        added a simple test to the existing patch.

        One thing to note is that this will result in the field being "ignored" if you try to query on it as well ... but this is a more general problem of what to do when people try to query on an unindexed field (see SOLR-223)

        will commit in a day or so barring objections

        Hoss Man added a comment -

        committed r536278


          People

          • Assignee: Hoss Man
          • Reporter: Will Johnson
          • Votes: 0
          • Watchers: 0
