Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: None
    • Labels:
      None

      Description

      1. RandomSortField.java
        6 kB
        Ryan McKinley
      2. RandomSortField.java
        6 kB
        Yonik Seeley
      3. RandomSortField.java
        6 kB
        Ryan McKinley
      4. RandomSortField.java
        4 kB
        Yonik Seeley
      5. RandomSortField.java
        2 kB
        Yonik Seeley
      6. SOLR-264-RandomSortField-2.patch
        2 kB
        Koji Sekiguchi
      7. SOLR-264-RandomSortField-2.patch
        2 kB
        Koji Sekiguchi
      8. SOLR-264-RandomSortOrder.patch
        4 kB
        Ryan McKinley
      9. SOLR-264-RandomSortOrder.patch
        4 kB
        Ryan McKinley
      10. SOLR-264-RandomSortOrder.patch
        4 kB
        Ryan McKinley

        Activity

        Hide
        Ryan McKinley added a comment -

        Implements a 'ScoreDocComparator' that returns a random number.

        I don't know if this is a fair distribution, but the results look ok.

        This is the meat:

        static final ScoreDocComparator RANDOM = new ScoreDocComparator() {
        final Random rand = new Random();

        public int compare (ScoreDoc i, ScoreDoc j)

        { return sortValue(i).compareTo( sortValue(j) ); }

        public Comparable sortValue (ScoreDoc i)

        { return new Float(rand.nextFloat()); }

        public int sortType()

        { return SortField.CUSTOM; }

        };

        • - - - -

        One thing that is wierd is that it seems to need a valid field name. Solr sets up sorting with a field cache, so I just pick any field name (by default the unique key) and use that. Any reason this is a bad idea?

        • - - -

        Old discussion on this topic:
        http://www.nabble.com/random-order-tf3198916.html#a8881481

        Show
        Ryan McKinley added a comment - Implements a 'ScoreDocComparator' that returns a random number. I don't know if this is a fair distribution, but the results look ok. This is the meat: static final ScoreDocComparator RANDOM = new ScoreDocComparator() { final Random rand = new Random(); public int compare (ScoreDoc i, ScoreDoc j) { return sortValue(i).compareTo( sortValue(j) ); } public Comparable sortValue (ScoreDoc i) { return new Float(rand.nextFloat()); } public int sortType() { return SortField.CUSTOM; } }; - - - - One thing that is wierd is that it seems to need a valid field name. Solr sets up sorting with a field cache, so I just pick any field name (by default the unique key) and use that. Any reason this is a bad idea? - - - Old discussion on this topic: http://www.nabble.com/random-order-tf3198916.html#a8881481
        Hide
        Yonik Seeley added a comment -

        Rather than mess with query parsing to hack in support for a "random" key word, a more elegant way would be to create a RandomFieldType specifically for sorting, then hook it in via the schema in the normal manner.

        Since this is such a unique requirement (probably not broadly applicable), one might want to make this as a custom component in a QueryComponent chain that we discussed recently.
        http://www.nabble.com/search-components-%28plugins%29-tf3898040.html#a11050274

        Show
        Yonik Seeley added a comment - Rather than mess with query parsing to hack in support for a "random" key word, a more elegant way would be to create a RandomFieldType specifically for sorting, then hook it in via the schema in the normal manner. Since this is such a unique requirement (probably not broadly applicable), one might want to make this as a custom component in a QueryComponent chain that we discussed recently. http://www.nabble.com/search-components-%28plugins%29-tf3898040.html#a11050274
        Hide
        Ryan McKinley added a comment -

        >
        > Rather than mess with query parsing to hack in support for a "random" key word, a more elegant way would be to create a RandomFieldType specifically for sorting, then hook it in via the schema in the normal manner.
        >

        That sounds good – then you don't run the risk of namespace overlap. The special keyword 'score' already makes me a bit nervous. It seems like you could make a field named score, but could not sort by it and could not refer to it (except as *)

        > Since this is such a unique requirement (probably not broadly applicable)

        Is it so unique? In 'full text search' sure, but if the domain is 'database' it seems pretty fundamental. I can't be the only person who wants to do:

        SELECT * FROM features ORDER BY rand();

        fq=feature:true&sort=random

        Show
        Ryan McKinley added a comment - > > Rather than mess with query parsing to hack in support for a "random" key word, a more elegant way would be to create a RandomFieldType specifically for sorting, then hook it in via the schema in the normal manner. > That sounds good – then you don't run the risk of namespace overlap. The special keyword 'score' already makes me a bit nervous. It seems like you could make a field named score, but could not sort by it and could not refer to it (except as *) > Since this is such a unique requirement (probably not broadly applicable) Is it so unique? In 'full text search' sure, but if the domain is 'database' it seems pretty fundamental. I can't be the only person who wants to do: SELECT * FROM features ORDER BY rand(); fq=feature:true&sort=random
        Hide
        Ryan McKinley added a comment -

        Random sorting implemented as a FieldType.

        to enable random sorting, you need to add something like:

        <fieldType name="random" class="solr.RandomSortField" />
        <field name="random" type="random" indexed="true" stored="false"/>

        to schema.xml

        Show
        Ryan McKinley added a comment - Random sorting implemented as a FieldType. to enable random sorting, you need to add something like: <fieldType name="random" class="solr.RandomSortField" /> <field name="random" type="random" indexed="true" stored="false"/> to schema.xml
        Hide
        Yonik Seeley added a comment -

        nextFloat() generates a random int and then does a floating point divide (one of the slowest operations around).
        Instead, use
        return rand.nextBoolean() ? -1 : 1;
        Or a branch-free version:
        return rand.nextInt() >>> 31;

        Show
        Yonik Seeley added a comment - nextFloat() generates a random int and then does a floating point divide (one of the slowest operations around). Instead, use return rand.nextBoolean() ? -1 : 1; Or a branch-free version: return rand.nextInt() >>> 31;
        Hide
        Ryan McKinley added a comment -

        using:
        rand.nextBoolean() ? -1 : 1;

        the fancy pants >>> returns a distribution 0,1 – to get a reasonable random sort, it needs to be 1 or -1

        Show
        Ryan McKinley added a comment - using: rand.nextBoolean() ? -1 : 1; the fancy pants >>> returns a distribution 0,1 – to get a reasonable random sort, it needs to be 1 or -1
        Hide
        Yonik Seeley added a comment -

        > the fancy pants >>> returns a distribution 0,1 – to get a reasonable random sort, it needs to be 1 or -1

        Hmmm, so
        (rand.nextInt() & 0x2) - 1

        Show
        Yonik Seeley added a comment - > the fancy pants >>> returns a distribution 0,1 – to get a reasonable random sort, it needs to be 1 or -1 Hmmm, so (rand.nextInt() & 0x2) - 1
        Hide
        Hoss Man added a comment -

        Uhh, guys ... this code violates the contract of Comparator (which is admittedly not an explicit part of the of ScoreDocComparator contract, but it is implied) because:

        1) the method isn't stable ... compare(x,y) is not garunteed to have the same sign as a subsequent compare(x,y)
        2) it doesn't guarantee that the sign of compare(x,y) is the negation of the sign of compare(y,x)
        3) it doesn't guarantee a transitive property ... if ((compare(x, y)>0) && (compare(y, z)>0)) then: compare(x, z)>0.

        Show
        Hoss Man added a comment - Uhh, guys ... this code violates the contract of Comparator (which is admittedly not an explicit part of the of ScoreDocComparator contract, but it is implied) because: 1) the method isn't stable ... compare(x,y) is not garunteed to have the same sign as a subsequent compare(x,y) 2) it doesn't guarantee that the sign of compare(x,y) is the negation of the sign of compare(y,x) 3) it doesn't guarantee a transitive property ... if ((compare(x, y)>0) && (compare(y, z)>0)) then: compare(x, z)>0.
        Hide
        Yonik Seeley added a comment -

        Yeah, I briefly thought of that... but decided it might be OK (won't cause an infinite loop) since sorting is done via a priority queue.

        But thinking on this further, it still has a lot of problems besides a single query:

        • caching... oops, there goes your randomness
        • code that needs to query more than once (first pass gets "n" docs, then requests more)
        Show
        Yonik Seeley added a comment - Yeah, I briefly thought of that... but decided it might be OK (won't cause an infinite loop) since sorting is done via a priority queue. But thinking on this further, it still has a lot of problems besides a single query: caching... oops, there goes your randomness code that needs to query more than once (first pass gets "n" docs, then requests more)
        Hide
        Ryan McKinley added a comment -

        Dooh. I was worried about this at first, but it did not seem to matter. Perhaps my dataset isn't big enough to get itself stuck in a keep sorting loop...

        Do you see any holes in:

        private static class RandomScoreDocComparator implements ScoreDocComparator {
        final long start = System.currentTimeMillis();

        public int compare (ScoreDoc i, ScoreDoc j)

        { int vI = new Random(start+i.doc).nextInt(); int vJ = new Random(start+j.doc).nextInt(); return vI - vJ; }

        public Comparable sortValue (ScoreDoc i)

        { return new Integer(new Random(start+i.doc).nextInt()); }

        public int sortType()

        { return SortField.CUSTOM; }

        };

        Show
        Ryan McKinley added a comment - Dooh. I was worried about this at first, but it did not seem to matter. Perhaps my dataset isn't big enough to get itself stuck in a keep sorting loop... Do you see any holes in: private static class RandomScoreDocComparator implements ScoreDocComparator { final long start = System.currentTimeMillis(); public int compare (ScoreDoc i, ScoreDoc j) { int vI = new Random(start+i.doc).nextInt(); int vJ = new Random(start+j.doc).nextInt(); return vI - vJ; } public Comparable sortValue (ScoreDoc i) { return new Integer(new Random(start+i.doc).nextInt()); } public int sortType() { return SortField.CUSTOM; } };
        Hide
        Yonik Seeley added a comment -

        Hold up a minute... I have a better way... gotta run to a meeting thoughj

        Show
        Yonik Seeley added a comment - Hold up a minute... I have a better way... gotta run to a meeting thoughj
        Hide
        Hoss Man added a comment -

        the only clean way to do this that i know of i to build aFieldCache-esque rray of size maxDoc, put a random number of each doc into that array, and then use that array to get the sortValue ... document order is randomized, but consistent for all uses of the same array (if you leverage the FieldCache custom type, that means its' consistent per IndexReader)

        this was brought up recently on java-user, after i suggested an alternative approach of just promoting N randomly selected docs to the front of the results (where N is greater then your expected pagination...

        http://www.nabble.com/Several-questions-about-scoring-sorting-%2B-random-sorting-in-an-image-related-application-tf3928435.html#a11141191

        Show
        Hoss Man added a comment - the only clean way to do this that i know of i to build aFieldCache-esque rray of size maxDoc, put a random number of each doc into that array, and then use that array to get the sortValue ... document order is randomized, but consistent for all uses of the same array (if you leverage the FieldCache custom type, that means its' consistent per IndexReader) this was brought up recently on java-user, after i suggested an alternative approach of just promoting N randomly selected docs to the front of the results (where N is greater then your expected pagination... http://www.nabble.com/Several-questions-about-scoring-sorting-%2B-random-sorting-in-an-image-related-application-tf3928435.html#a11141191
        Hide
        Yonik Seeley added a comment -

        I just attached a draft approach (completely untested, uncommented, etc)

        The approach would be to use a dynamic field random_* and then pass the seed for the random function in the field name. So q=foo&sort=rand_165623

        The docids are hashed, including the seed, to get random (but repeatable) values.
        This should satisfy repeatability, cacheability, and sanity (can repeat a sequence if desired).
        Since the SortComparatorSource isn't a singleton in this case, I implemented hashCode + equals for the queryCache.

        Show
        Yonik Seeley added a comment - I just attached a draft approach (completely untested, uncommented, etc) The approach would be to use a dynamic field random_* and then pass the seed for the random function in the field name. So q=foo&sort=rand_165623 The docids are hashed, including the seed, to get random (but repeatable) values. This should satisfy repeatability, cacheability, and sanity (can repeat a sequence if desired). Since the SortComparatorSource isn't a singleton in this case, I implemented hashCode + equals for the queryCache.
        Hide
        Yonik Seeley added a comment -

        Of course this could also implement a ValueSource if someone really wanted to throw randomness into scoring...

        Show
        Yonik Seeley added a comment - Of course this could also implement a ValueSource if someone really wanted to throw randomness into scoring...
        Hide
        Hoss Man added a comment -

        wow.

        why is it whenever i read a patch from yonik that starts with a bunch of bitshift operators, i start to worry that in 7 days a weird girl is going to climb out of a well and then i'm going to die?

        that's some freaky shit ... but i can't see anything wrong with the approach. (assuming the hash function does what it says it does)

        in general, any approach that uses a fixed seed per SortComparatorSource should work, and getting the seed from the "filed name" seems like a slick way to do it ... we wouldn't even have to require that the field name match any sort of pattern (ie: end with _ and a number) we could just hash on the field name.

        people could even choose to use a regular field (instead of a dynamic field) and accept that they'd get a fixed ordering per commit/> if we also used the IndexReader version in computing the seed.

        Show
        Hoss Man added a comment - wow. why is it whenever i read a patch from yonik that starts with a bunch of bitshift operators, i start to worry that in 7 days a weird girl is going to climb out of a well and then i'm going to die? that's some freaky shit ... but i can't see anything wrong with the approach. (assuming the hash function does what it says it does) in general, any approach that uses a fixed seed per SortComparatorSource should work, and getting the seed from the "filed name" seems like a slick way to do it ... we wouldn't even have to require that the field name match any sort of pattern (ie: end with _ and a number) we could just hash on the field name. people could even choose to use a regular field (instead of a dynamic field) and accept that they'd get a fixed ordering per commit/> if we also used the IndexReader version in computing the seed.
        Hide
        Yonik Seeley added a comment -

        Just for the fun of it, added prototype (read: completely untested) ValueSource.

        Show
        Yonik Seeley added a comment - Just for the fun of it, added prototype (read: completely untested) ValueSource.
        Hide
        Yonik Seeley added a comment -

        > we wouldn't even have to require that the field name match any sort of pattern (ie: end with _ and a number) we could just hash on the field name.

        Excellent point. I had considered adding the hash of the prefix to the field value... but just hashing over the complete name and not requiring a specific format is much better.

        You'll have to let me know the reference to the girl and the well stuff sometime

        Show
        Yonik Seeley added a comment - > we wouldn't even have to require that the field name match any sort of pattern (ie: end with _ and a number) we could just hash on the field name. Excellent point. I had considered adding the hash of the prefix to the field value... but just hashing over the complete name and not requiring a specific format is much better. You'll have to let me know the reference to the girl and the well stuff sometime
        Hide
        Ryan McKinley added a comment -

        wow.

        Here is an updated version that uses the field name hash for the seed. I added a few comments and I have seen it run and give random results.

        It gets configured with:
        <dynamicField name="rand*" type="random" indexed="true" stored="false"/>

        then you get nicely repeatably random results for:

        thanks yonik!

        Show
        Ryan McKinley added a comment - wow. Here is an updated version that uses the field name hash for the seed. I added a few comments and I have seen it run and give random results. It gets configured with: <dynamicField name="rand*" type="random" indexed="true" stored="false"/> then you get nicely repeatably random results for: http://localhost:8983/solr/select/?q=*:*&fl=name&sort=rand_1234%20desc http://localhost:8983/solr/select/?q=*:*&fl=name&sort=rand_2345%20desc http://localhost:8983/solr/select/?q=*:*&fl=name&sort=rand_ABDC%20desc http://localhost:8983/solr/select/?q=*:*&fl=name&sort=rand_21%20desc thanks yonik!
        Hide
        Yonik Seeley added a comment -

        If we want the comparator to be transitive, a subtraction doesn't work with negative numbers in there because of overflow. We could either do explicit comparisons, or only use positive ints. I chose positive ints because it might be a nicer range for ValueSource (function query).

        Show
        Yonik Seeley added a comment - If we want the comparator to be transitive, a subtraction doesn't work with negative numbers in there because of overflow. We could either do explicit comparisons, or only use positive ints. I chose positive ints because it might be a nicer range for ValueSource (function query).
        Hide
        Yonik Seeley added a comment -

        Might be nice to check cache statistics to see that the same seed results in a hit.
        same with function query: val(rand_1234)

        Show
        Yonik Seeley added a comment - Might be nice to check cache statistics to see that the same seed results in a hit. same with function query: val (rand_1234)
        Hide
        Hoss Man added a comment -

        i was kind of wondering about the negative overflow (isn't that underflow) but i was trusting you that it worked ... isn't there still a potential problem with the transitive property if i.doc + seed causes positive integer overflow?

        (i guess it's okay because the overflow is the same regardless of whether it's i or j, but in the previous case the overflow happend *after*the hash call.

        FWIW: i still think newComparator should be something like...

        public ScoreDocComparator newComparator(IndexReader reader, String fieldname) throws IOException

        { return new RandomComparator(seed ^ reader.getVersion()); }

        ...so that people who want orderings that are randomized per each <commit/> can just use...

        <field name="random" type="random" />

        With the attachment as is, most changes to the index (ie: add a few documents, delete a few documents) won't have a significant impact on the "random" order because they won't change the majority of the docIds... you have to change the "seed" to see any noticable effects.

        Show
        Hoss Man added a comment - i was kind of wondering about the negative overflow (isn't that underflow) but i was trusting you that it worked ... isn't there still a potential problem with the transitive property if i.doc + seed causes positive integer overflow? (i guess it's okay because the overflow is the same regardless of whether it's i or j, but in the previous case the overflow happend *after*the hash call. FWIW: i still think newComparator should be something like... public ScoreDocComparator newComparator(IndexReader reader, String fieldname) throws IOException { return new RandomComparator(seed ^ reader.getVersion()); } ...so that people who want orderings that are randomized per each <commit/> can just use... <field name="random" type="random" /> With the attachment as is, most changes to the index (ie: add a few documents, delete a few documents) won't have a significant impact on the "random" order because they won't change the majority of the docIds... you have to change the "seed" to see any noticable effects.
        Hide
        Yonik Seeley added a comment -

        > i guess it's okay because the overflow is the same regardless of whether it's i or j, but in the previous case the overflow happend *after*the hash call

        Yep, no bits of randomness are lost, so having an intermediate value over/underflow is fine.

        Including the reader version is an interesting idea, and would probably lead to less confusion (a user might otherwise be tricked into thinking that docs always sort the same relative to each other across a commit).

        Show
        Yonik Seeley added a comment - > i guess it's okay because the overflow is the same regardless of whether it's i or j, but in the previous case the overflow happend *after*the hash call Yep, no bits of randomness are lost, so having an intermediate value over/underflow is fine. Including the reader version is an interesting idea, and would probably lead to less confusion (a user might otherwise be tricked into thinking that docs always sort the same relative to each other across a commit).
        Hide
        Yonik Seeley added a comment -

        LOL... I'm obviously too tired for more coding tonight. Could the next person to modify this patch please change
        return key >>> 31; // only positive numbers
        to
        return key >>> 1;

        Show
        Yonik Seeley added a comment - LOL... I'm obviously too tired for more coding tonight. Could the next person to modify this patch please change return key >>> 31; // only positive numbers to return key >>> 1;
        Hide
        Ryan McKinley added a comment -

        changed to use:
        "return key >>> 1;"
        and
        seed ^ reader.getVersion()

        actually it needs to be:
        (int)(seed^reader.getVersion())

        the long->int overflow wrapping should not be a problem.

        Show
        Ryan McKinley added a comment - changed to use: "return key >>> 1;" and seed ^ reader.getVersion() actually it needs to be: (int)(seed^reader.getVersion()) the long->int overflow wrapping should not be a problem.
        Hide
        Koji Sekiguchi added a comment - - edited

        > public SortComparatorSource getFactory() {
        > :
        > return new SortComparatorSource() {
        > :
        > public boolean equals(Object o)

        { > return (o instanceof RandomSort) && getField().equals(((RandomSort) o).getField()); > }

        > };
        >}

        The equals() method should care of SortComparatorSource, instead of RandomSort...? The patch uses SortComparatorSource in equals() method.

        Show
        Koji Sekiguchi added a comment - - edited > public SortComparatorSource getFactory() { > : > return new SortComparatorSource() { > : > public boolean equals(Object o) { > return (o instanceof RandomSort) && getField().equals(((RandomSort) o).getField()); > } > }; >} The equals() method should care of SortComparatorSource, instead of RandomSort...? The patch uses SortComparatorSource in equals() method.
        Hide
        Koji Sekiguchi added a comment -

        Some non-ASCII characters were slipped into the previous patch. I'd like to upload new one.
        regards,

        Show
        Koji Sekiguchi added a comment - Some non-ASCII characters were slipped into the previous patch. I'd like to upload new one. regards,
        Hide
        Hoss Man added a comment -

        Good catch Koji, i applied your patch as well as fixed up a few other things i noticed (ValueSource wasn't using IndexReader in the seed, confusing "seed" terminology used because of getSeed(field), etc...)

        Show
        Hoss Man added a comment - Good catch Koji, i applied your patch as well as fixed up a few other things i noticed (ValueSource wasn't using IndexReader in the seed, confusing "seed" terminology used because of getSeed(field), etc...)
        Hide
        Hoss Man added a comment -

        this was all committed a little while ago and seems to be working.

        Show
        Hoss Man added a comment - this was all committed a little while ago and seems to be working.

          People

          • Assignee:
            Unassigned
            Reporter:
            Ryan McKinley
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development