Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.9, 5.0
    • Component/s: None
    • Labels:
      None

      Description

      Given the SOLR-2444 updated fl syntax and the SOLR-2719 regression, it would be useful to add some kind of validation regarding the field names you can use on Solr.
      The objective would be adding consistency, allowing only field names that you can then use within fl, sorting etc.

      The rules, taken from the actual StrParser behaviour, seem to be the following:

      • same used for java identifiers (Character#isJavaIdentifierPart), plus the use of trailing '.' and '-'
      • for the first character the rule is Character#isJavaIdentifierStart minus '$' (The dash can't be used as first character (SOLR-3191) for example)
      1. SOLR-3207.patch
        22 kB
        Luca Cavanna

        Issue Links

          Activity

          Uwe Schindler made changes -
          Fix Version/s 4.9 [ 12326731 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.8 [ 12326254 ]
          Hide
          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Solr 4.9.
          David Smiley made changes -
          Fix Version/s 4.8 [ 12326254 ]
          Fix Version/s 4.7 [ 12325573 ]
          Uwe Schindler made changes -
          Fix Version/s 4.7 [ 12325573 ]
          Fix Version/s 4.6 [ 12325000 ]
          Adrien Grand made changes -
          Fix Version/s 4.6 [ 12325000 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.5 [ 12324743 ]
          Steve Rowe made changes -
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.5 [ 12324743 ]
          Fix Version/s 4.4 [ 12324324 ]
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Uwe Schindler made changes -
          Fix Version/s 4.4 [ 12324324 ]
          Fix Version/s 4.3 [ 12324128 ]
          Robert Muir made changes -
          Fix Version/s 4.3 [ 12324128 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.2 [ 12323893 ]
          Mark Miller made changes -
          Fix Version/s 4.2 [ 12323893 ]
          Fix Version/s 5.0 [ 12321664 ]
          Fix Version/s 4.1 [ 12321141 ]
          Robert Muir made changes -
          Fix Version/s 4.1 [ 12321141 ]
          Fix Version/s 4.0 [ 12314992 ]
          Hide
          Hoss Man added a comment -

          the giant elephant in the room that doesn't seem to have been discussed is that trying to validate that field names meet some strict criteria when loading schema.xml doesn't really address dynamic fields – the patch ensures that <dynamicField name="..."/> configurations have names which are validated, but i don't see anything that would considering the actually field names people use with those dynamic fields – ie: "*_i" might be a perfectly valid dynamicField at startup, but that startup validation isn't going to help me if i index a document containing the field name "$ - foo_i"

          In general, i'm opposed to the idea of "locking down" what field names can be used across the board. My preference would be to let people us any field name their heart desires, but provide better documentation on what field name restrictions exist on which features and provide (ie: "using a field name in function requires that the field name match ValidatorX; using a field name in fl requires can only be used with field names conform to ValidatorX and ValidatorY; etc...").

          If we want to provide automated "validation" of these things for people, then let's make it part of the LukeRequestHandler: for any field name returned by the LukeRequestHandler, let's include a warnings section advising them which validation rules that field name doesn't match, and what features depend on that validation rule – this info could then easily be exposed in the admin UI.

          We could also provide an optional UpdateProcessor people could configure with a list of individual Validators which could reject any document containing a field that didn't match the validator (or optionally: translate the field name to something thta did conform) to help people enforce these even on dynamic fields.

          So by default: any field is allowed, but if i create one with a funky name (either explicitly or as a result of loading data using a dynamicField) the admin UI starts warning me that feature XYZ won't work with fields A, B, C; and if i want to ensure feature D works will all of my fields i add an update processor to ensure it.

          Show
          Hoss Man added a comment - the giant elephant in the room that doesn't seem to have been discussed is that trying to validate that field names meet some strict criteria when loading schema.xml doesn't really address dynamic fields – the patch ensures that <dynamicField name="..."/> configurations have names which are validated, but i don't see anything that would considering the actually field names people use with those dynamic fields – ie: "*_i" might be a perfectly valid dynamicField at startup, but that startup validation isn't going to help me if i index a document containing the field name " $ - foo_i " In general, i'm opposed to the idea of "locking down" what field names can be used across the board. My preference would be to let people us any field name their heart desires, but provide better documentation on what field name restrictions exist on which features and provide (ie: "using a field name in function requires that the field name match ValidatorX; using a field name in fl requires can only be used with field names conform to ValidatorX and ValidatorY; etc..."). If we want to provide automated "validation" of these things for people, then let's make it part of the LukeRequestHandler: for any field name returned by the LukeRequestHandler, let's include a warnings section advising them which validation rules that field name doesn't match, and what features depend on that validation rule – this info could then easily be exposed in the admin UI. We could also provide an optional UpdateProcessor people could configure with a list of individual Validators which could reject any document containing a field that didn't match the validator (or optionally: translate the field name to something thta did conform) to help people enforce these even on dynamic fields. So by default: any field is allowed, but if i create one with a funky name (either explicitly or as a result of loading data using a dynamicField) the admin UI starts warning me that feature XYZ won't work with fields A, B, C; and if i want to ensure feature D works will all of my fields i add an update processor to ensure it.
          Hide
          Yonik Seeley added a comment -

          Regarding the trailing characters, do you mean we shouldn't use isJavaIdentifierPart anymore but something else?

          That was just a shortcut... looking again, it's pretty open (maybe more open than we want?) esp since unicode changes over time. Anyway, isJavaIdentifierPart doesn't include "-" or "." If people do need another separator type character, we could allow "$" too (just not as the first char, since that's taken by variable dereferencing).

          That's even more restrictive than my patch since I've used the existing rules applied while parsing the fl parameter (ReturnFields class).

          Allowing '-' in the "fl" was just to resolve that "regression" for people who already used fieldnames like that and are upgrading.
          If we want to start validating field names strictly, then we should bump the schema version number (and should skip validating when the version number is less than that).

          Show
          Yonik Seeley added a comment - Regarding the trailing characters, do you mean we shouldn't use isJavaIdentifierPart anymore but something else? That was just a shortcut... looking again, it's pretty open (maybe more open than we want?) esp since unicode changes over time. Anyway, isJavaIdentifierPart doesn't include "-" or "." If people do need another separator type character, we could allow "$" too (just not as the first char, since that's taken by variable dereferencing). That's even more restrictive than my patch since I've used the existing rules applied while parsing the fl parameter (ReturnFields class). Allowing '-' in the "fl" was just to resolve that "regression" for people who already used fieldnames like that and are upgrading. If we want to start validating field names strictly, then we should bump the schema version number (and should skip validating when the version number is less than that).
          Hide
          Luca Cavanna added a comment -

          The first letter should be ok as checked in my patch. Regarding the trailing characters, do you mean we shouldn't use isJavaIdentifierPart anymore but something else? That's even more restrictive than my patch since I've used the existing rules applied while parsing the fl parameter (ReturnFields class). No problem for me, are we all sure we want to proceed this way?

          I'll update my patch later on. Then I'd document this within the Schema wiki page.

          That's a big change, any opinion is welcome!

          Show
          Luca Cavanna added a comment - The first letter should be ok as checked in my patch. Regarding the trailing characters, do you mean we shouldn't use isJavaIdentifierPart anymore but something else? That's even more restrictive than my patch since I've used the existing rules applied while parsing the fl parameter (ReturnFields class). No problem for me, are we all sure we want to proceed this way? I'll update my patch later on. Then I'd document this within the Schema wiki page. That's a big change, any opinion is welcome!
          Hide
          Yonik Seeley added a comment -

          same used for java identifiers (Character#isJavaIdentifierPart), plus the use of trailing '.' and '-'

          I think we should prob define it as I documented in the schema:

          <!-- field names should consist of alphanumeric or underscore characters only and
          not start with a digit. This is not currently strictly enforced,
          but other field names will not have first class support from all components
          and back compatibility is not guaranteed.
          -->

          Show
          Yonik Seeley added a comment - same used for java identifiers (Character#isJavaIdentifierPart), plus the use of trailing '.' and '-' I think we should prob define it as I documented in the schema: <!-- field names should consist of alphanumeric or underscore characters only and not start with a digit. This is not currently strictly enforced, but other field names will not have first class support from all components and back compatibility is not guaranteed. -->
          Luca Cavanna made changes -
          Attachment SOLR-3207.patch [ 12518475 ]
          Hide
          Luca Cavanna added a comment -

          First draft patch. I introduced a new FieldNameValidator class which is used within the IndexSchema class to validate every field name. The new class exposes also some boolean methods which are used within the ReturnFields class, in order to apply the same rules there to detect a field name. That's needed to make sure that we accept field names that we can handle within the fl parameter.

          Apparently, if you use a placeholder as field name you receive on IndexSchema the default value, which can be empty. That's why I'm allowing empty field names. I'm not even sure I understood correctly how placeholders work, can someone help me out with this?

          Let me know what you think about my patch!

          Show
          Luca Cavanna added a comment - First draft patch. I introduced a new FieldNameValidator class which is used within the IndexSchema class to validate every field name. The new class exposes also some boolean methods which are used within the ReturnFields class, in order to apply the same rules there to detect a field name. That's needed to make sure that we accept field names that we can handle within the fl parameter. Apparently, if you use a placeholder as field name you receive on IndexSchema the default value, which can be empty. That's why I'm allowing empty field names. I'm not even sure I understood correctly how placeholders work, can someone help me out with this? Let me know what you think about my patch!
          Luca Cavanna made changes -
          Link This issue is related to SOLR-2444 [ SOLR-2444 ]
          Luca Cavanna made changes -
          Field Original Value New Value
          Link This issue is related to SOLR-2719 [ SOLR-2719 ]
          Hide
          Luca Cavanna added a comment -

          Guys, let me know your thoughts about those rules, I probably missed something!

          Show
          Luca Cavanna added a comment - Guys, let me know your thoughts about those rules, I probably missed something!
          Luca Cavanna created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Luca Cavanna
            • Votes:
              1 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:

                Development