Solr
  1. Solr
  2. SOLR-2719

REGRESSION ReturnFields incorrect parse fields with hyphen - breaks existing "fl=my-field-name" type usecases

    Details

      Description

      fl=my-hyphen-field in query params parsed as "my" instead of "my-hyphen-field".

      OAS.search.ReturnFields use method getId() from OAS.search.QueryParsing
      in which check chars "if (!Character.isJavaIdentifierPart(ch) && ch != '.')"
      Hyphen is not JavaIdentifierPart and this check break when first "-" is found.

      This problem solve by passing '-' to check:
      if (!Character.isJavaIdentifierPart(ch) && ch != '.' && ch != '-') break;

      But I don't know how it can affect on whole project.

      1. SOLR-2719.patch
        10 kB
        Luca Cavanna
      2. SOLR-2719.patch
        12 kB
        Luca Cavanna
      3. SOLR-2719.patch
        1 kB
        Yonik Seeley

        Issue Links

          Activity

          Hide
          Hoss Man added a comment -

          setting this as a blocker for 4.0 since it is a fairly serious regression for anyone using field names with "-" in them

          Show
          Hoss Man added a comment - setting this as a blocker for 4.0 since it is a fairly serious regression for anyone using field names with "-" in them
          Hide
          Nik V. Babichev added a comment -

          Is it so hard to fix it?
          How can I help in fixing?

          Show
          Nik V. Babichev added a comment - Is it so hard to fix it? How can I help in fixing?
          Hide
          Luca Cavanna added a comment - - edited

          My patch isn't a fix but just a starting point: it adds a ReturnFieldsTest class which tests some of the new fl features. Some tests are of course failing. The biggest problem is the hyphen within the field name, which I guess is widely used. This could be corrected as suggested by Nik, but we have problems with other characters, even if less used within field names.

          Solr doesn't validate field names, but now a lot of potential field names can't actually be used within the fl parameter, or even worse they break the query. Some of my test methods are intentionally weird, like the ~idtest or id$test, but those field names are both allowed by Solr. I'm afraid we might have the same problem with sorting since the QueryParsing#parseSort uses the same StrParser#getId method.

          The main rule to identify the end of a field name in StrParser#getId seems to be the following:

          if (!Character.isJavaIdentifierPart(ch) && ch != '.')
          	break;
          

          I guess it should be extended, not just with the hyphen, but in my opinion the point here is not just correct the hyphen regression. I think we should introduce consistency between fl and decide which characters Solr should accept within a field name. I mean, if Solr accepts everything, we'll always have this fl problem. What are your thoughts guys?

          Show
          Luca Cavanna added a comment - - edited My patch isn't a fix but just a starting point: it adds a ReturnFieldsTest class which tests some of the new fl features. Some tests are of course failing. The biggest problem is the hyphen within the field name, which I guess is widely used. This could be corrected as suggested by Nik, but we have problems with other characters, even if less used within field names. Solr doesn't validate field names, but now a lot of potential field names can't actually be used within the fl parameter, or even worse they break the query. Some of my test methods are intentionally weird, like the ~idtest or id$test, but those field names are both allowed by Solr. I'm afraid we might have the same problem with sorting since the QueryParsing#parseSort uses the same StrParser#getId method. The main rule to identify the end of a field name in StrParser#getId seems to be the following: if (! Character .isJavaIdentifierPart(ch) && ch != '.') break ; I guess it should be extended, not just with the hyphen, but in my opinion the point here is not just correct the hyphen regression. I think we should introduce consistency between fl and decide which characters Solr should accept within a field name. I mean, if Solr accepts everything, we'll always have this fl problem. What are your thoughts guys?
          Hide
          Ryan McKinley added a comment -

          I added the tests with @Ignore in revision: 1296434

          Show
          Ryan McKinley added a comment - I added the tests with @Ignore in revision: 1296434
          Hide
          Yonik Seeley added a comment -

          I've been saying for a while that using roughly java identifiers for field names was best practice, but we should document it somewhere.

          I don't think we should change StrParser.getId to be more permissive though - that will just cause more problems in the future (say when we want to start adding infix and want a-b to be a minus b. There's not a regression in that specific code since the function parser has never accepted "-" as part of a field name.

          Show
          Yonik Seeley added a comment - I've been saying for a while that using roughly java identifiers for field names was best practice, but we should document it somewhere. I don't think we should change StrParser.getId to be more permissive though - that will just cause more problems in the future (say when we want to start adding infix and want a-b to be a minus b. There's not a regression in that specific code since the function parser has never accepted "-" as part of a field name.
          Hide
          Luca Cavanna added a comment -

          Yonik, I see your point. On the other hand, the dash is a widely used character within field names. The regression is on the Solr behaviour, and I think it's pretty annoying from a user perspective.

          Anyway, if that's the direction of the project, no problem. What matters more than anything else is consistency. We should document it somewhere as you wrote, but I'd also propose a field names validation at Solr startup, using the StrParser rules, so that Solr accepts only allowed field names and can guarantee the proper behaviour with all allowed field names.

          What do you think?

          Show
          Luca Cavanna added a comment - Yonik, I see your point. On the other hand, the dash is a widely used character within field names. The regression is on the Solr behaviour, and I think it's pretty annoying from a user perspective. Anyway, if that's the direction of the project, no problem. What matters more than anything else is consistency. We should document it somewhere as you wrote, but I'd also propose a field names validation at Solr startup, using the StrParser rules, so that Solr accepts only allowed field names and can guarantee the proper behaviour with all allowed field names. What do you think?
          Hide
          Mark Miller added a comment -

          but I'd also propose a field names validation at Solr startup,

          +1 - rather than playing loosey goosey on what's a valid field name, we should doc and validate for it explicitly.

          Show
          Mark Miller added a comment - but I'd also propose a field names validation at Solr startup, +1 - rather than playing loosey goosey on what's a valid field name, we should doc and validate for it explicitly.
          Hide
          Yonik Seeley added a comment -

          Yonik, I see your point. On the other hand, the dash is a widely used character within field names. The regression is on the Solr behaviour, and I think it's pretty annoying from a user perspective.

          The easiest way to handle this would be this code in ReturnFields:

          
                  // short circuit test for a really simple field name
                  String key = null;
                  String field = sp.getId(null);
                  char ch = sp.ch();
          

          Instead of using getId, we should hand-roll something that also accepts "-" as part of the field name. That would leave function parser (and other users of getId) alone, but allow fieldnames with dashes in the fl param.

          What matters more than anything else is consistency.

          If we really want to go for consistency, then we should not accept "-" anywhere (rather than attempting to expand it to everywhere).

          Show
          Yonik Seeley added a comment - Yonik, I see your point. On the other hand, the dash is a widely used character within field names. The regression is on the Solr behaviour, and I think it's pretty annoying from a user perspective. The easiest way to handle this would be this code in ReturnFields: // short circuit test for a really simple field name String key = null ; String field = sp.getId( null ); char ch = sp.ch(); Instead of using getId, we should hand-roll something that also accepts "-" as part of the field name. That would leave function parser (and other users of getId) alone, but allow fieldnames with dashes in the fl param. What matters more than anything else is consistency. If we really want to go for consistency, then we should not accept "-" anywhere (rather than attempting to expand it to everywhere).
          Hide
          Luca Cavanna added a comment -

          How about trying to achieve both? I mean, are there many other places where we should do the same (adding the dash support)? QueryParsing#parseSort has the same problem. Anything else? I'm probably missing something.

          Depending on where we need to add support for dash to add consistency, I would try to add support for the trailing dash here for backward compatibility (I'd have a patch almost ready), and work on validation as well.

          Show
          Luca Cavanna added a comment - How about trying to achieve both? I mean, are there many other places where we should do the same (adding the dash support)? QueryParsing#parseSort has the same problem. Anything else? I'm probably missing something. Depending on where we need to add support for dash to add consistency, I would try to add support for the trailing dash here for backward compatibility (I'd have a patch almost ready), and work on validation as well.
          Hide
          Yonik Seeley added a comment -

          How about trying to achieve both? I mean, are there many other places where we should do the same (adding the dash support)?

          I think this "regression" is limited to "fl" since that code was changed to support pseudo-fields.

          QueryParsing#parseSort has the same problem.

          I just tried trunk with "sort=a-b_s desc" and it seemed to work fine.

          Show
          Yonik Seeley added a comment - How about trying to achieve both? I mean, are there many other places where we should do the same (adding the dash support)? I think this "regression" is limited to "fl" since that code was changed to support pseudo-fields. QueryParsing#parseSort has the same problem. I just tried trunk with "sort=a-b_s desc" and it seemed to work fine.
          Hide
          Luca Cavanna added a comment -

          Right Yonik, I've been misled by the

          String field = sp.getId(null);
          

          at the beginning of QueryParsing#parseSort, while the method to look at was getSimpleName. Sorting is ok (but I don't completely understand the sp.getId at the beginning).

          I attached a new patch: I added to StrParser a getFieldName method and added to getId the possibility to pass a vararg parameter of allowed trailing chars.
          I made also some changes to ReturnFields to make the code a little bit more readable to me, hope is the same for you guys. Tests work.
          I removed the weird tests about trailing tilde and so on, kept just the trailing dash test which now works (removed @Ignore).

          Let me know what you think.

          I'm going to open a new issue to add field names validation.

          Show
          Luca Cavanna added a comment - Right Yonik, I've been misled by the String field = sp.getId( null ); at the beginning of QueryParsing#parseSort, while the method to look at was getSimpleName. Sorting is ok (but I don't completely understand the sp.getId at the beginning). I attached a new patch: I added to StrParser a getFieldName method and added to getId the possibility to pass a vararg parameter of allowed trailing chars. I made also some changes to ReturnFields to make the code a little bit more readable to me, hope is the same for you guys. Tests work. I removed the weird tests about trailing tilde and so on, kept just the trailing dash test which now works (removed @Ignore). Let me know what you think. I'm going to open a new issue to add field names validation.
          Hide
          Luca Cavanna added a comment -

          Since this issue is blocker for 4.0, it would great to close it soon. Let me know if there's something more I can do to help!

          Show
          Luca Cavanna added a comment - Since this issue is blocker for 4.0, it would great to close it soon. Let me know if there's something more I can do to help!
          Hide
          Yonik Seeley added a comment -

          Here's a simpler patch that tries to change less (only that first getId call).

          I didn't go with the varargs version because it creates objects on every call.

          Show
          Yonik Seeley added a comment - Here's a simpler patch that tries to change less (only that first getId call). I didn't go with the varargs version because it creates objects on every call.
          Hide
          Luca Cavanna added a comment -

          Thanks Yonik. I'm of course ok with your patch!
          I agree with your varargs comment, my code was also more generic than it should. But maybe some other changes could be useful to make the code a little more readable here. Furthermore, if we can avoid copy & pasting the same code it's better I guess.
          Anyway, the most important thing is closing this issue, up to you.

          Show
          Luca Cavanna added a comment - Thanks Yonik. I'm of course ok with your patch! I agree with your varargs comment, my code was also more generic than it should. But maybe some other changes could be useful to make the code a little more readable here. Furthermore, if we can avoid copy & pasting the same code it's better I guess. Anyway, the most important thing is closing this issue, up to you.
          Hide
          Yonik Seeley added a comment -

          committed. I also added a note about recommended field naming to the schema.

          Show
          Yonik Seeley added a comment - committed. I also added a note about recommended field naming to the schema.

            People

            • Assignee:
              Yonik Seeley
              Reporter:
              Nik V. Babichev
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development