Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: search
    • Labels:
      None

      Description

      Lucene allows sorting by docid, but Solr currently does not provide a way to specify it.

      1. SOLR-1478.patch
        0.6 kB
        Erik Hatcher

        Activity

        Erik Hatcher added a comment -

        This patch adds a special sort field (like "score" is implemented) to enable sorting by docid.

        The character "#" was used simply to avoid any potential field name overlap, but this requires URL encoding it to %23, so maybe some other string should be used?

        Here's an example URL: http://localhost:8983/solr/select?q=*:*&sort=%23%20desc&fl=id

        Seems like score and docid sorting should avoid using normal field name strings, so maybe _score_ and _docid_ or something.

        I marked this for 1.4, because it's a trivial patch. Discussion welcome.

        Shalin Shekhar Mangar added a comment - - edited

        Perhaps something like _ DOCID _ instead of #. I am even tempted to suggest just using DOCID like we have SCORE.

        [Edit] - Jira ate my suggestion

        Grant Ingersoll added a comment -

        Does Solr ever expose the docid to users?

        Erik Hatcher added a comment -

        Only the LukeRequestHandler, that I can tell, allows fetching a document by docid and returns it in the response too.

        I don't see a need to return the docid even if one is sorting by it. Sorting by docid allows for last-in-first-out, or first-in-first-out, sorting without any caching overhead of sorting by a field.

        Grant Ingersoll added a comment -

        Sounds good.

        Erik Hatcher added a comment -

        I committed, and left the special "field" as "#". I'd rather avoid a string that could potentially be a field name in use, and sorting by docid will be such a specialized case that the encoding confusion won't be too bad. Folks have to deal with URL encoding everywhere anyway. I kinda like that character to mean "number".

        Yonik Seeley added a comment -

        A few things I don't like about '#'

        • unlike many other characters, the browser can't encode it for you. For example, I can type in "sort=foo desc" into my browser and it can encode the space for me. If I type in a literal #, Solr will silently truncate the request at that point. People will have trouble with this one.
        • it can require lexical modification to other parsers (as opposed to semantic modification). Things like function queries or anything else that parse out field names or parameters would need to be modified at the lexical level to accept # - it's generally easier to just check for a special name.
        • it looks like a comment
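
The first point can be reproduced outside a browser. The sketch below uses Python's standard urllib.parse on the example URL from earlier in this issue; the variable names are only for illustration. In URL syntax a literal "#" starts the fragment, so everything after it drops out of the query string, while the encoded form %23 survives.

```python
from urllib.parse import urlsplit, parse_qs, quote

# Example URL from this issue, but with a literal '#' typed in by hand.
raw = "http://localhost:8983/solr/select?q=*:*&sort=# desc&fl=id"
parts = urlsplit(raw)

# Everything after '#' is treated as the fragment, not as query parameters.
print(repr(parts.query))     # the query stops right after 'sort='
print(repr(parts.fragment))  # the rest of the URL ended up here

# Percent-encoding the sort value keeps it inside the query string.
encoded = (
    "http://localhost:8983/solr/select?q=*:*&sort="
    + quote("# desc")
    + "&fl=id"
)
sort_values = parse_qs(urlsplit(encoded).query)["sort"]
print(encoded)
print(sort_values)
```

Since the fragment is never part of the query string, the server has no chance to report the missing sort parameter, which is presumably why the request appears to be silently truncated.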
        Yonik Seeley added a comment -

        Does this work with distributed search?

        Shalin Shekhar Mangar added a comment -

        I don't like having an arbitrary character like '#' signifying a sort type because it does not explain itself to a user. Once 1.4 goes out, it will be public API and we won't be able to change this easily. Erik, please consider this again.

        This also does not work with distributed search which should be clearly noted wherever we decide to document this. ShardDoc.java line 170 says that it is possible to support it but I'm not sure what Yonik had in mind.

        Shalin Shekhar Mangar added a comment -

        Does this work with distributed search?

        No, it throws an exception:

        SEVERE: java.lang.RuntimeException: Doc sort not supported
        	at org.apache.solr.handler.component.ShardFieldSortedHitQueue.getCachedComparator(ShardDoc.java:171)
        	at org.apache.solr.handler.component.ShardFieldSortedHitQueue.<init>(ShardDoc.java:96)
        	at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:393)
        	at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:298)
        	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
        
        Yonik Seeley added a comment -
        _id_
        _docid_
        

        ?

        The chance of collision is super low - I'd wager that no one has ever used _id_ in their schema (single underscores on either side... it's doubled to prevent wiki syntax from turning it into italics)

        Steve Rowe added a comment - - edited

        Providing aliases would allow all parties to get what they want. Downside: maintenance/documentation issues with multiple syntaxes (minor IMHO). Upside: collision probability goes down even further.

        Edit: oops, completely wrong on the "upside" – collision probability actually goes up, not down, since the set of noncolliding field names is reduced by each reserved pseudo-field name. Still, aliases totally rock.

        Steve Rowe added a comment -

        Another thought: the XML specification reserves names matching regex /^xml/i to itself for future use (see http://www.w3.org/TR/xml/#sec-common-syn). Maybe Solr should do the same? That way, this discussion wouldn't have to be repeated for each new pseudo-field.
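
For illustration, the XML-style rule amounts to a case-insensitive prefix test. A minimal sketch in Python; the function name is hypothetical, and nothing in this issue says Solr performs such a check:

```python
import re

# The regex quoted above, /^xml/i: any name starting with "xml" in any letter case.
RESERVED_XML_PREFIX = re.compile(r"^xml", re.IGNORECASE)


def is_reserved_xml_name(name: str) -> bool:
    """Return True if `name` falls in the range XML reserves for itself."""
    return RESERVED_XML_PREFIX.match(name) is not None


for candidate in ("xmlns", "XmlThing", "myxml", "id"):
    print(candidate, is_reserved_xml_name(candidate))
```

A Solr equivalent would need its own prefix or pattern; which one to reserve is exactly what the rest of this discussion is about.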

        Yonik Seeley added a comment -

        A Lucene field name can be anything... so '#' could also be a collision.
        If we wish to reserve certain names going forward, I'd vote for reserving ids with an underscore on either side.

        But really, the whole collision thing is overblown... this is a single name that people will not have used before. On a practical level, I don't believe it's an issue.
        We will need another one too - as a container for document metadata. I've suggested _meta_ for that in SOLR-705.

        We aren't adding these all the time... there was exactly one before this.. "score". No future document level metadata will collide since they will be contained in whatever _meta_ ends up being.

        Further advantages to _id_ (single underscores surrounding the id):

        • consistent with magic fieldnames _query_ and _val_ for nested queries in the query parser, and I could see supporting _id_:1 in the future
        • people may want to return the actual ids for documents... wherever that info goes (separate return vector like sort_field_values for distributed search or _meta_) it will be nicer for clients if the label for it is actually an identifier and not '#'
        Yonik Seeley added a comment -

        I've been thinking _docid_ instead of _id_ since it's further from "id", what we normally use as the unique key field for documents.

        OK, since Erik also proposed that as an alternative, and because Shalin also seems to be OK with that alternative, I'll commit that change unless I hear that more people favor a different alternative (keeping # or using _id_)
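
Assuming the _docid_ name proposed above is what gets committed, a request needs no special handling beyond ordinary form encoding, and a parser only has to compare against a known name. A rough Python sketch; parse_sort_clause and SPECIAL_SORT_NAMES are hypothetical helpers for illustration, not Solr code:

```python
from urllib.parse import urlencode, parse_qs

# Special sort names discussed in this issue (assumed final spelling).
SPECIAL_SORT_NAMES = {"score", "_docid_"}


def parse_sort_clause(clause):
    """Split '<name> asc|desc' and flag special names (a semantic check only)."""
    parts = clause.split()
    if len(parts) != 2 or parts[1].lower() not in ("asc", "desc"):
        raise ValueError(f"expected '<name> asc|desc', got {clause!r}")
    name, direction = parts[0], parts[1].lower()
    kind = "special" if name in SPECIAL_SORT_NAMES else "field"
    return kind, name, direction


# Standard form encoding is enough: underscores pass through untouched.
query_string = urlencode({"q": "*:*", "sort": "_docid_ desc", "fl": "id"})
print(query_string)

sort_param = parse_qs(query_string)["sort"][0]
print(parse_sort_clause(sort_param))
print(parse_sort_clause("id asc"))
```

The name check happens after tokenizing, so the lexical rules for field names stay unchanged, which is the advantage argued for earlier in the thread.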

        Shalin Shekhar Mangar added a comment -

        I've been thinking _docid_ instead of _id_ since it's further from "id", what we normally use as the unique key field for documents.

        +1

        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4


          People

          • Assignee:
            Unassigned
            Reporter:
            Erik Hatcher