Solr
  1. Solr
  2. SOLR-2606

Solr sort no longer works on field names with some punctuation in them

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 3.2, 3.3
    • Fix Version/s: 3.4, 4.0-ALPHA
    • Component/s: search
    • Labels:
      None
    • Environment:

      Linux

      Description

      We just upgraded from Solr 1.4 to 3.2. For the most part the upgrade went fine, however we discovered that sorting on field names with dashes in them is no longer working properly. For example, the following query used to work:

      http://[our solr server]/select/?q=computer&sort=static-need-binary+asc

      and now it gives this error:

      HTTP Status 400 - undefined field static

      type Status report

      message undefined field static

      description The request sent by the client was syntactically incorrect (undefined field static).

      It appears the parser for sorting has been changed so that it now tokenizes differently, and assumes field names cannot have dashes in them. However, field names clearly can have dashes in them. The exact same query which worked fine for us in 1.4 is now breaking in 3.2. Changing the sort field to use a field name that doesn't have a dash in it works just fine.

      1. SOLR-2606.patch
        9 kB
        Hoss Man
      2. SOLR-2606.patch
        3 kB
        Hoss Man
      3. SOLR-2606.test.only.patch
        2 kB
        Hoss Man

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          bulk close for 3.4

          Show
          Robert Muir added a comment - bulk close for 3.4
          Hide
          Hoss Man added a comment -

          Committed revision 1152657. - 3x

          If you are using Solr 3.3, you can test out this fix by applying the following patch...

          http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/core/src/java/org/apache/solr/search/QueryParsing.java?r1=1152657&r2=1152656&pathrev=1152657&view=patch

          Show
          Hoss Man added a comment - Committed revision 1152657. - 3x If you are using Solr 3.3, you can test out this fix by applying the following patch... http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/solr/core/src/java/org/apache/solr/search/QueryParsing.java?r1=1152657&r2=1152656&pathrev=1152657&view=patch
          Hide
          Hoss Man added a comment -

          Committed revision 1152653. - trunk

          I already know 3x isn't going to be a straight forward merge, at a minimum because of changes in SortField

          Show
          Hoss Man added a comment - Committed revision 1152653. - trunk I already know 3x isn't going to be a straight forward merge, at a minimum because of changes in SortField
          Hide
          Hoss Man added a comment -

          updated patch to include some randomized testing of potential esoteric sort field names.

          also tweaked parse logic to use isWhitespace(peekChar()) (instead of just comparing with ' ') to inspect the leftovers after the short circuit getId(null) call since that's more consistent with eastws() (but didn't see an easy way to generate randomized whitespace in the test to really sanity check it)

          Show
          Hoss Man added a comment - updated patch to include some randomized testing of potential esoteric sort field names. also tweaked parse logic to use isWhitespace(peekChar()) (instead of just comparing with ' ' ) to inspect the leftovers after the short circuit getId(null) call since that's more consistent with eastws() (but didn't see an easy way to generate randomized whitespace in the test to really sanity check it)
          Hide
          Hoss Man added a comment -

          My money's on this:

          Mike: close, but that's kind of a red herring.

          If you look at SOLR-1297 and the patch I mentioned you'll note that since using java identifier rules for field names is considered the common case, that test is done first as a short circuit (like the comment says) and function parsing is done after it – but if that fails then the logic i mentioned was suppose to kick in (by catching the exception and then doing the quick "is what's left a field name?" check)

          after reviewing it a bit, i realized that I was kind of stupid when i added the "is what's left a field name?" backcompat work in SOLR-1297, and overlooked some error cases:

          1) Other types of Exceptions can be thrown by the function parsing besides ParseException (notably: SolrException and NumberFormatException in this test)
          2) the "field" variable wasn't being reset to null if it was determined that getId(null) had parsed a valid id but that id hadn't used up enough of the sort spec (ie: wasn't followed by whitespace).

          Attached patch fixes these oversights and fixes the tests in the previous patch.

          it would be nice to have some more tests to be sure i'm still not missing something stupid.

          Show
          Hoss Man added a comment - My money's on this: Mike: close, but that's kind of a red herring. If you look at SOLR-1297 and the patch I mentioned you'll note that since using java identifier rules for field names is considered the common case, that test is done first as a short circuit (like the comment says) and function parsing is done after it – but if that fails then the logic i mentioned was suppose to kick in (by catching the exception and then doing the quick "is what's left a field name?" check) after reviewing it a bit, i realized that I was kind of stupid when i added the "is what's left a field name?" backcompat work in SOLR-1297 , and overlooked some error cases: 1) Other types of Exceptions can be thrown by the function parsing besides ParseException (notably: SolrException and NumberFormatException in this test) 2) the "field" variable wasn't being reset to null if it was determined that getId(null) had parsed a valid id but that id hadn't used up enough of the sort spec (ie: wasn't followed by whitespace). Attached patch fixes these oversights and fixes the tests in the previous patch. it would be nice to have some more tests to be sure i'm still not missing something stupid.
          Hide
          Mike Sokolov added a comment -

          My money's on this:

          // short circuit test for a really simple field name
          String field = sp.getId(null);

          in QueryParsing.parseSort.

          getId() uses isJavaIdentifer() to detect the end of a "simple" field name, which I'm sure can't match '-' or ':'.

          Show
          Mike Sokolov added a comment - My money's on this: // short circuit test for a really simple field name String field = sp.getId(null); in QueryParsing.parseSort. getId() uses isJavaIdentifer() to detect the end of a "simple" field name, which I'm sure can't match '-' or ':'.
          Hide
          Hoss Man added a comment -

          Lucene has no limits on fieldnames. I've long advocated sticking to java identifiers (minus the $) as best practice for a whole bunch of reasons. Maybe it was a mistake not enforcing it to begin with.

          That ship has kind of sailed. At this point it's really a matter of trying to do the best we can with what the users got.

          For those unfamiliar...

          Historically speaking, Solr has never enforced limits on field names except on a feature by feature basis – so you can have any character you want in field names in your schema.xml, but some characters don't play nice with some features: whitespace or commas in field names have never worked with the "fl" param, many punctuation characters would cause you problems with the Lucene query parser, etc...

          for sorting, the only limitation until SOLR-1297 has ever been that you can't sort on a field containing whitespace, and that naming a field "score" was going to cause you problems.

          With SOLR-1297 the decision was sorting by function was valuable enough that "things that look like functions" should be parsed as functions, even though it's theoretically possible that you might have a field named "log(price)" (or a "*" dynamicField) in your schema.

          but at the same time, i added logic to try and ensure that if the function parsing approach didn't produce a valid sort, it would fallback to the legacy parsing approach before assuming an error (see "SOLR-1297.better.field.support.patch")

          Apparently there are some combinations that we missed.

          Attached patch demonstrates the problem specifically with ":" and "-" in field names.

          given that rails uses ":" in field names when using solr, and "-" is a pretty common character for people to expect to work in field names, we should definitely try to get to the bottom of this.

          Show
          Hoss Man added a comment - Lucene has no limits on fieldnames. I've long advocated sticking to java identifiers (minus the $) as best practice for a whole bunch of reasons. Maybe it was a mistake not enforcing it to begin with. That ship has kind of sailed. At this point it's really a matter of trying to do the best we can with what the users got. For those unfamiliar... Historically speaking, Solr has never enforced limits on field names except on a feature by feature basis – so you can have any character you want in field names in your schema.xml, but some characters don't play nice with some features: whitespace or commas in field names have never worked with the "fl" param, many punctuation characters would cause you problems with the Lucene query parser, etc... for sorting, the only limitation until SOLR-1297 has ever been that you can't sort on a field containing whitespace, and that naming a field "score" was going to cause you problems. With SOLR-1297 the decision was sorting by function was valuable enough that "things that look like functions" should be parsed as functions, even though it's theoretically possible that you might have a field named "log(price)" (or a "*" dynamicField) in your schema. but at the same time, i added logic to try and ensure that if the function parsing approach didn't produce a valid sort, it would fallback to the legacy parsing approach before assuming an error (see " SOLR-1297 .better.field.support.patch") Apparently there are some combinations that we missed. Attached patch demonstrates the problem specifically with ":" and "-" in field names. given that rails uses ":" in field names when using solr, and "-" is a pretty common character for people to expect to work in field names, we should definitely try to get to the bottom of this.
          Hide
          Yonik Seeley added a comment -

          Perhaps a test-class producting randomized (legal) field names of could be of use for this and other tests?

          Define "legal"
          Lucene has no limits on fieldnames. I've long advocated sticking to java identifiers (minus the $) as best practice for a whole bunch of reasons. Maybe it was a mistake not enforcing it to begin with.

          Show
          Yonik Seeley added a comment - Perhaps a test-class producting randomized (legal) field names of could be of use for this and other tests? Define "legal" Lucene has no limits on fieldnames. I've long advocated sticking to java identifiers (minus the $) as best practice for a whole bunch of reasons. Maybe it was a mistake not enforcing it to begin with.
          Hide
          Tor Henning Ueland added a comment -

          Confirming this issue for 3.2 (3.2-SNAPSHOT 1133561)
          Data can be loaded from fields (fl), the issue only appears when trying to sort on such fields.

          Show
          Tor Henning Ueland added a comment - Confirming this issue for 3.2 (3.2-SNAPSHOT 1133561) Data can be loaded from fields (fl), the issue only appears when trying to sort on such fields.
          Hide
          Jan Høydahl added a comment -

          Perhaps a test-class producting randomized (legal) field names of could be of use for this and other tests?

          Show
          Jan Høydahl added a comment - Perhaps a test-class producting randomized (legal) field names of could be of use for this and other tests?
          Hide
          Hoss Man added a comment -

          thanks for reporting this.

          Another user on the mailing list reported a similar bug with 3.1 when trying to sort on a field named "scores:rails_f" (giving an error of "undefined field scores")

          I'm fairly certain this is caused by the enhancements made in SOLR-1297 to add sorting on functions.

          Work was done in that issue to try and ensure that "eccentric" field names could still be used in sorts even when the parsing code was changed, but evidently some basic cases involving punctuation still slipped through the cracks.

          Show
          Hoss Man added a comment - thanks for reporting this. Another user on the mailing list reported a similar bug with 3.1 when trying to sort on a field named "scores:rails_f" (giving an error of "undefined field scores") I'm fairly certain this is caused by the enhancements made in SOLR-1297 to add sorting on functions. Work was done in that issue to try and ensure that "eccentric" field names could still be used in sorts even when the parsing code was changed, but evidently some basic cases involving punctuation still slipped through the cracks.

            People

            • Assignee:
              Hoss Man
              Reporter:
              Mitsu Hadeishi
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development