Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 4.10, 6.0
    • Component/s: search
    • Labels:
      None

      Description

      This ticket allows Solr to export full sorted result sets. A new export request handler has been created that sets up the default writer type (SortingResponseWriter) and the required rank query (ExportQParserPlugin). The syntax is:

      /solr/collection1/export?q=*:*&fl=a,b,c&sort=a desc,b desc
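
A minimal sketch of how a client might assemble this request URL. The helper name `export_url` and the base address `http://localhost:8983` are assumptions for illustration; only the `/export` path and the `q`, `fl`, and `sort` parameters come from the ticket.

```python
from urllib.parse import urlencode


def export_url(base, collection, query, fields, sort):
    """Build an /export request URL (hypothetical helper, not part of Solr)."""
    params = {
        "q": query,
        "fl": ",".join(fields),   # fields to return
        "sort": ",".join(sort),   # sort clauses, e.g. "a desc"
    }
    # urlencode percent-encodes reserved characters and turns spaces into "+".
    return f"{base}/solr/{collection}/export?{urlencode(params)}"


url = export_url(
    "http://localhost:8983",      # assumed host and port
    "collection1",
    "*:*",
    ["a", "b", "c"],
    ["a desc", "b desc"],
)
print(url)
```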
      

      This capability will open up Solr for a whole range of uses that were typically done using aggregation engines like Hadoop. For example:

      Large Distributed Joins

      A client outside of Solr calls two different Solr collections and returns the results sorted by a join key. The client iterates through both streams and performs a merge join.
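
The client-side merge join described above could look roughly like the sketch below. It assumes both streams are already sorted ascending on the join key (a descending sort would flip the comparisons); the function name `merge_join` and the record shapes are illustrative only.

```python
def merge_join(left, right, key):
    """Inner-join two iterables that are both sorted ascending by key(record)."""
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None and r is not None:
        if key(l) < key(r):
            l = next(left_it, None)        # left is behind, advance it
        elif key(l) > key(r):
            r = next(right_it, None)       # right is behind, advance it
        else:
            k = key(l)
            # Buffer the run of right-hand records sharing this key.
            run = []
            while r is not None and key(r) == k:
                run.append(r)
                r = next(right_it, None)
            # Pair every left-hand record with that key against the run.
            while l is not None and key(l) == k:
                for match in run:
                    yield l, match
                l = next(left_it, None)


orders = [{"id": 1, "a": "w"}, {"id": 2, "a": "x"}, {"id": 2, "a": "y"}, {"id": 4, "a": "z"}]
users = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}, {"id": 4, "b": "r"}]
joined = [(l["a"], r["b"]) for l, r in merge_join(orders, users, key=lambda d: d["id"])]
```

Only one run of equal keys is held in memory at a time, so memory use stays bounded however large the streams are.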

      Fully Distributed Field Collapsing/Grouping

      A client outside of Solr makes individual calls to all the servers in a single collection and returns results sorted by the collapse key. The client merge joins the sorted lists on the collapse key to perform the field collapse.
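
As a rough sketch of the client side, assuming each shard's result list is sorted ascending on the collapse key: the lists are merged lazily and one document is kept per key. The name `collapse` and the choice of keeping the first document seen are assumptions; a real client would pick its own representative.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def collapse(shard_results, collapse_field):
    """Merge per-shard sorted lists and keep one document per collapse key."""
    key = itemgetter(collapse_field)
    # heapq.merge streams the sorted inputs without loading them fully.
    merged = heapq.merge(*shard_results, key=key)
    for _, group in groupby(merged, key=key):
        yield next(group)  # assumption: the first document represents the group


shard1 = [{"g": "a", "id": 1}, {"g": "c", "id": 2}]
shard2 = [{"g": "a", "id": 3}, {"g": "b", "id": 4}]
collapsed_ids = [doc["id"] for doc in collapse([shard1, shard2], "g")]
```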

      High Cardinality Distributed Aggregation

      A client outside of Solr makes individual calls to all the servers in a single collection and sorts on a high cardinality field. The client then merge joins the sorted lists to perform the high cardinality aggregation.

      Large Scale Time Series Rollups

      A client outside Solr makes individual calls to all servers in a collection and sorts on time dimensions. The client merge joins the sorted result sets and rolls up the time dimensions as it iterates through the data.
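
A sketch of such a rollup, under the assumption that each document carries an ISO-8601 UTC timestamp string (which sorts lexicographically in time order) and a numeric value to sum. The field names `ts` and `hits` and the function name are hypothetical.

```python
import heapq
from itertools import groupby


def rollup_by_day(shard_results, time_field="ts", value_field="hits"):
    """Merge per-shard lists sorted by timestamp and sum values per day."""
    merged = heapq.merge(*shard_results, key=lambda doc: doc[time_field])
    # The first 10 characters of an ISO-8601 timestamp are the date (YYYY-MM-DD).
    for day, docs in groupby(merged, key=lambda doc: doc[time_field][:10]):
        yield day, sum(doc[value_field] for doc in docs)


shard1 = [
    {"ts": "2014-08-01T10:00:00Z", "hits": 2},
    {"ts": "2014-08-02T09:00:00Z", "hits": 1},
]
shard2 = [{"ts": "2014-08-01T23:00:00Z", "hits": 3}]
daily = list(rollup_by_day([shard1, shard2]))
```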

      In these scenarios Solr is being used as a distributed sorting engine. Developers can write clients that take advantage of this sorting capability in any way they wish.

      Session Analysis and Aggregation

      A client outside Solr makes individual calls to all servers in a collection and sorts on the sessionID. The client merge joins the sorted results and aggregates sessions as it iterates through the results.

      1. 0001-SOLR_5244.patch
        15 kB
        Lianyi Han
      2. SOLR-5244.patch
        61 kB
        Joel Bernstein
      3. SOLR-5244.patch
        61 kB
        Joel Bernstein
      4. SOLR-5244.patch
        59 kB
        Joel Bernstein
      5. SOLR-5244.patch
        28 kB
        Joel Bernstein
      6. SOLR-5244.patch
        26 kB
        Joel Bernstein
      7. SOLR-5244.patch
        15 kB
        Joel Bernstein
      8. SOLR-5244.patch
        14 kB
        Joel Bernstein

        Activity

        Mark Miller added a comment -

        It would be great if Solr could efficiently export entire search result sets

        Definitely!

        without scoring or ranking documents.

        With scoring or ranking would be great too, though I'm sure it can wait.

        Joel Bernstein added a comment -

        I think scoring and ranking will be difficult because the priority queues will take up too much memory and be too slow.

        There are two use cases that I was mainly thinking of:

        1) Providing a searchable data source for Hadoop or other aggregation engines. So Map jobs could search Solr to bring in millions of records very quickly.

        2) Doing distributed joins. Allowing a remote search engine to pull data very quickly from Solr so it can filter local search results.

        Mark Miller added a comment -

        I think scoring and ranking will be difficult because the priority queues will take up too much memory and be too slow.

        I agree that it will be more difficult and certainly a slower operation, but if you are looking to export an entire results list, 'slow' is very relative and use case dependent.

        My main interest in this is in 1 - it's a pretty common want. Using search for sub-selection that can be processed by something else.

        I think it would be great if that sub selection could come out ranked though - I think that is also valuable for 1 - and while the other system could somehow rank, it would have to dupe the lucene logic to do it as well. It would be nice to just be able to dump either way and make your decision based on use case and speed reqs. It's obviously going to be much slower though. And you would have to deal with huge result sets and limited ram of course.

        Mark Miller added a comment -

        And you would have to deal with huge result sets and limited ram of course.

        When I was thinking about this before, I was thinking about perhaps the deep paging stuff that was done as a cursor - but I have not looked at that feature at all yet.

        Kranti Parisa added a comment -

        Joel,

        In one of my emails to the dev-group I have asked the following question


        I am sure this doesn't exist today, but just wondering about your thoughts.

        When we use Join queries (the first time, or without hitting the Filter Cache) and say debug=true, we are able to see a good amount of debug info in the response.

        Do we have any plans to support this debug info even when we hit the Filter Cache? I believe that this information will be helpful with or without hitting the caches.

        Consider this use case: in production, a request comes in and builds the Filter Cache for a Join Query. If at some later point we want to run that query manually with debug turned on, we can't see a bunch of very useful stats/numbers.

        Will ExportQParserPlugin save the BitSet into the request context even when we hit the caches? The idea of saving the BitSets into the request context is very helpful when we do Joins because, when we write the response, for each document we would want to specify which cores this document matched for the given criteria/filters.

        So, I think it is also a good idea to support an extra local_param in the new join implementations (SOLR-4787), say matchFlag="true", and if it's true save the BitSet into the request context (even in the case of a cache hit). By default it can be "false" so that we don't need to save the BitSet in memory.

        Example response:
        <doc>
          <long name="id">111</long>
          <str name="title">my title</str>
          <arr name="joinMatches">
            <str>coreA</str>
            <str>coreB</str>
          </arr>
        </doc>

        I was able to achieve that by saving the BitSet into the join debug info, but was not able to handle the case of cache hits. I think your idea of saving that into the request context makes more sense.

        Your thoughts?

        Joel Bernstein added a comment -

        The way it's currently written the ExportQParserPlugin will defeat all caching. So it always runs. PostFilters can't be cached in the FilterCache, so only the QueryResultCache is being bypassed here.

        I see your point that if you need caching and the ability to use the request context, then you are out of luck. Not sure exactly how to solve this though.

        Kranti Parisa added a comment - - edited

        hmm, it would be nice if there is a way to make both filter caches + request context work simultaneously.

        Joel Bernstein added a comment -

        On the topic of ranking the entire result set, one approach would be:

        1) Gather the bitset in the postfilter and hang onto the scores.
        2) In the output writer, loop over the bitset and add the docs to a priority queue with a size of 200. After one loop over the full set, send out the first 200 docs, turning them off in the bitset.
        3) Clear the priority queue and repeat step 2 until the BitSet is cleared.

        The main limitation here is the need to hold onto all the scores. We could lose this limitation by using the query() function query to generate scores on each loop over the BitSet. This would slow things down but allow Solr to sort an unlimited amount of data, fairly quickly.

        If we were sorting on a field and didn't need to hang onto scores we could sort very large sets of data using on-disk doc-values.
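
The repeated-pass approach in steps 1 to 3 can be sketched as follows. This is an illustration only: a Python set stands in for the BitSet, `score_of` is a hypothetical callable (it could read stored scores or recompute them on each pass), and ties are broken by lower doc ID as an assumption.

```python
import heapq


def export_ranked(matching_docs, score_of, batch_size=200):
    """Yield (doc, score) in descending score order using a bounded heap."""
    remaining = set(matching_docs)          # stand-in for the BitSet
    while remaining:
        heap = []                           # min-heap holding the current top batch
        for doc in remaining:
            entry = (score_of(doc), -doc)   # -doc: lower doc ID wins ties
            if len(heap) < batch_size:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)
        # Emit this batch best-first, then turn those docs off.
        for score, neg_doc in sorted(heap, reverse=True):
            remaining.discard(-neg_doc)
            yield -neg_doc, score


scores = {0: 1.0, 1: 3.0, 2: 2.0, 3: 3.0, 4: 0.5}
ranked = [doc for doc, _ in export_ranked(scores, scores.get, batch_size=2)]
```

Memory for the queue stays at `batch_size` entries, at the cost of one full pass over the remaining docs per batch.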

        Mikhail Khludnev added a comment -

        FWIW,
        https://github.com/m-khl/solr-patches/tree/streaming
        in this branch (a little bit old) I implemented something like that with two differences:

        • it doesn't allocate bitset for results, but streams into output straightforwardly
        • distributed search for indexes presorted by uniqKey. it merges these streamed results with increasing external ids.
        Joel Bernstein added a comment - - edited

        Mikhail,

        I like your approach, it solves the distributed search problem in a nice way. Getting access to the output writer from the search component solves a lot of problems. Does it cause any issues with the normal response writer flow?

        Upayavira added a comment -

        I may be being a luddite here, but surely when it comes to ranking, there are two parts: scoring and sorting. I would assume that it is the sorting portion that is going to be most expensive. How hard is it to export all documents, in index order, but with the score attached, so that an external system can handle the ordering, with access to the Solr-calculated scores?

        Joel Bernstein added a comment -

        Hi Upayavira,

        You're right, adding the scores would be a fairly easy change to this, while actually doing the ranking would cause a bigger change. I plan to get back to this feature soon and will look to add scores. The other major thing I want to do for this is to include a SolrCloud aware Solrj client that other applications can use to pull back the exports from each shard. I have some prototypes of this client working now so I hope to have something to share soon.

        Joel

        Joel Bernstein added a comment -

        More testing of this feature shows the real challenge will be performance of exporting string fields. Right now the docId->BytesRef lookup is way too slow to be interesting on a large scale, even with in-memory docValues. This must be due to the compression on the docValues.

        To get this working we'll need to have faster memory caches in place. I think we can build segment-level caches at commit time by caching the top X terms in a particular field based on docFrequency. The cache would be a read-only ord to BytesRef map (hppc IntObjectOpenHashMap), which should be able to perform in the neighborhood of 10 million lookups per second. The in-memory docId->BytesRef lookup performs at less than 1 million records per second.
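
A toy sketch of the proposed cache, with plain Python structures standing in for the Lucene ones: `doc_to_ord` is a list mapping doc ID to term ordinal, and `ord_to_bytes` is a hypothetical callable for the slow lookup. Neither name is a real Lucene or hppc API.

```python
from collections import Counter


def build_ord_cache(doc_to_ord, ord_to_bytes, top_n):
    """Cache the values of the top_n most frequent ordinals in a segment."""
    doc_freq = Counter(doc_to_ord)          # docs per ordinal
    return {o: ord_to_bytes(o) for o, _ in doc_freq.most_common(top_n)}


def lookup(doc, doc_to_ord, cache, ord_to_bytes):
    """Resolve a doc's value, using the cache before the slow path."""
    o = doc_to_ord[doc]
    value = cache.get(o)
    if value is None:
        value = ord_to_bytes(o)             # cache miss: slow lookup
    return value


terms = [b"apple", b"banana", b"cherry"]
doc_to_ord = [0, 1, 1, 2, 1, 0]
slow_calls = []


def slow_lookup(o):
    slow_calls.append(o)                    # record slow-path use for the demo
    return terms[o]


cache = build_ord_cache(doc_to_ord, slow_lookup, top_n=2)
slow_calls.clear()
values = [lookup(d, doc_to_ord, cache, slow_lookup) for d in range(len(doc_to_ord))]
```

In this toy segment only the one doc with the rare term takes the slow path; the benefit in practice depends on how skewed the term distribution is.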

        I think if we also move to a threaded approach we'll be able to increase throughput.

        I'm shooting to achieve an export rate of 5+ million small records per-second from a single server. This would scale linearly with the number of servers so a cluster of 100 servers could export 500+ million small records per-second.

        Mikhail Khludnev added a comment -

        Does it cause any issues with the normal response writer flow?

        I don't think so. It hits dedicated handlers, so it's well separated from the regular flow.

        More testing of this feature shows

        I wonder if you can post numbers and a profiler stack trace.
        How many fields are dumped in your test case?
        I have one thought: BinaryDocValuesImpl.get(int, BytesRef) hits docToOffset and then bytes for every given docnum. Assuming that sequential reads are faster than random ones, it makes sense to buffer an array of offsets and then loop through it to read the bytes. Also, looping over binaryFieldWriters for every doc seems like a performance killer for columnar data.

        I think we can build segment level caches..

        Can you highlight how it differs from the good old FieldCaches (I mean what's produced by FieldCacheImpl.BinaryDocValuesCache)?

        I'm shooting to achieve an export rate of 5+ million small records

        It sounds really ambitious to me. My expectation about average IO rate is 100-200 MB/sec (and I might be wrong here), so a few million might hit the ceiling.
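
To make the ceiling concrete, here is a small back-of-the-envelope calculation. The 100-200 MB/sec range is from the comment above; the record sizes (50 and 100 bytes) are assumptions chosen purely for illustration, as the thread never states what a "small record" weighs.

```python
def max_records_per_second(bandwidth_mb_per_s, record_bytes):
    """Upper bound on records/sec if IO bandwidth is the only limit."""
    return bandwidth_mb_per_s * 1_000_000 // record_bytes


for record_bytes in (50, 100):              # assumed record sizes
    low = max_records_per_second(100, record_bytes)
    high = max_records_per_second(200, record_bytes)
    print(f"{record_bytes}-byte records: {low:,} to {high:,} records/sec")
```

Under these assumptions, 5 million records per second at 200 MB/sec would require records of 40 bytes or less.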

        Joel Bernstein added a comment - - edited

        Mikhail,

        For my test I was extracting a single string field, using in memory docValues. Because docValues are column oriented, as you mentioned, each column lookup will be an additional hit on performance.

        I'm seeing two possible approaches to this:

        1) Add a special cache that speeds up the docId->BytesRef lookup. This would be a segment-level cache of the top N terms (by frequency) in the index. The cache would be a simple int to BytesRef hashmap, mapping the segment-level ord to the BytesRef. This cache would be much faster than the binaryDocValues docId->BytesRef lookup, so if there was a decent cache hit rate, performance could be improved dramatically.

        This approach would improve performance if the fields were kept separate so you could pick and choose what to export.

        2) Only export a single field. With this approach you would have one docValues field that would hold the entire extract record. You could use JSON or a binary format to structure this field any way you want. With this approach, caches wouldn't help but you'd eliminate the penalty for looking up data in multiple columns.

        I'm leaning towards this approach.

        With either approach, threading could be used to increase throughput. You could have a thread per segment extracting records and adding to a queue, and a single thread pulling from the queue and streaming the data out.
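
The producer/consumer layout described above might be sketched like this. Segments are modelled as plain iterables of records and the calling thread plays the role of the single writer; the function name, the sentinel, and the queue size are all assumptions.

```python
import queue
import threading

_DONE = object()  # sentinel a worker enqueues when its segment is exhausted


def export_segments(segments, write, queue_size=1000):
    """Run one extractor thread per segment; the caller's thread writes."""
    q = queue.Queue(maxsize=queue_size)     # bounded, so extractors cannot run far ahead

    def extract(segment):
        for record in segment:
            q.put(record)                   # blocks while the queue is full
        q.put(_DONE)

    workers = [
        threading.Thread(target=extract, args=(seg,), daemon=True)
        for seg in segments
    ]
    for w in workers:
        w.start()

    finished = 0
    while finished < len(workers):
        item = q.get()
        if item is _DONE:
            finished += 1
        else:
            write(item)                     # single consumer streams records out
    for w in workers:
        w.join()


out = []
export_segments([[1, 2, 3], [4, 5], [6]], out.append)
```

Records from different segments interleave in arrival order, so this layout on its own does not preserve a global sort.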

        You're right, 5 million is not going to happen with the network limitations. Then the goal could be to export data as fast as the network can send it out. You could throttle this by having fewer threads extracting records from the segments.

        Joel

        Mikhail Khludnev added a comment -

        1) Add a special cache that speeds up the docId-> bytesRef lookup. This would be a segment level cache of the top N terms (by frequency) in the index. The cache would be a simple int to BytesRef hashmap, mapping the segment level ord to the bytesRef

        That's exactly what you get from FieldCache.DEFAULT.getTerms() for an indexed field without docValues enabled. See FieldCacheImpl.BinaryDocValuesCache.

        Joel Bernstein added a comment - - edited

        I'll do some testing of the performance of this. Unless I'm missing something though, it looks like you have to go through a PagedBytes.Reader and a PackedInts.Reader to get the BytesRef. I think it would have similar performance to the in-memory BinaryDocValues I was using for my initial test.

        The cache I was thinking of building would be backed by an hppc IntObjectOpenHashMap, which should be able to do 10 million+ read operations per second.

        Lianyi Han added a comment - - edited

        This plugin works great in our project and we have made two small changes in this patch

        1> add omitHeader option
        2> allow the cost parameter with a default value of 200, which might help to order the post filters if you have more than one of them.

        Joel Bernstein added a comment -

        Ok, going to restart the work on this ticket. Here are my thoughts on design goals for the initial release.

        1) Change the ExportQParserPlugin to a RankQuery rather than a PostFilter. This will allow Solr to properly log the result counts for exported queries. The PostFilter really can't do that.

        2) Include a simple BinaryExportWriter that does not support complex types, multi-value fields, or scoring. This would mainly be used by internal Solr processes. In particular I have in mind callbacks from custom MergeStrategy implementations.

        3) Include an AvroExportWriter that supports complex fields and returning of scores. Avro looks like the best choice to me for a general purpose binary export writer.

        Joel Bernstein added a comment - - edited

        New patch with ExportQParserPlugin as a RankQuery, and a SortingResponseWriter that exports a full sorted result set in json format.

        David Smiley added a comment -

        Nice work but you've got to admit that fq={!export} isn't the slightest bit intuitive. And people may not even know of its existence and instead do rows=1000000 (some big number). Can we shoot for a better more intuitive user experience where Solr can "do the right thing" and have this behavior kick in when rows is big enough to encompass the entire results? Otherwise this is yet another detail users need to know in certain circumstances.

        Varun Thacker added a comment -

        Nice work but you've got to admit that fq={!export} isn't the slightest bit intuitive. And people may not even know of its existence and instead do rows=1000000 (some big number). Can we shoot for a better more intuitive user experience where Solr can "do the right thing" and have this behavior kick in when rows is big enough to encompass the entire results? Otherwise this is yet another detail users need to know in certain circumstances.

        First thing that came to my head - &rows=all or something like that. Any good?

        Joel Bernstein added a comment - - edited

        The description on the ticket is a little out of date. This is the latest syntax:

        q=*:*&rq={!xport}&wt=xsort&sort=...&fl=...
        

        The {!xport} has been changed from a PostFilter to a RankQuery.

        How about the syntax rows=-1 for exporting all rows? -1 might be easier to integrate than "all" because rows is expecting an integer value. This would automatically add the RankQuery parameter and the default wt (xsort).

        So the user syntax would simply be:

        q=*:*&rows=-1&sort=...&fl=...
        

        The output is nothing like a normal search result, though, so we wouldn't want this to trigger by mistake.

        David Smiley added a comment -

        I like rows=-1 for the same reason (vs. "all"). I'm not 100% sure I like the 'wt' being automatically changed from what the user might specify but I can think of no better alternative. It's too bad the response format can't simply be the same. I haven't dug into this issue to understand the rationale. I do appreciate that a binary output (e.g. Avro) is useful as an explicit option if asked for.

        Yonik Seeley added a comment -

        Yeah, it feels like this should just all be optimization (as opposed to changes in interface).
        An avro response writer would then be a separate / orthogonal issue. Don't we want faster big responses in JSON & CSV too?

        Joel Bernstein added a comment -

        This ticket is designed to work with a completely different set of use cases than Solr currently supports. So, having the interface look like a typical search might not be the best thing. Here are some example use cases:

        Large Distributed Joins:

        A client outside of Solr calls two different Solr collections and returns the results sorted by a join key. The client iterates through both streams and performs a merge join.

        Distributed Field Collapsing:

        A client outside of Solr makes individual calls to all the servers in a single collection and returns results sorted by the collapse key. The client merge joins the sorted lists on the collapse key to perform the field collapse.

        High Cardinality Distributed Aggregation:

        A client outside of Solr makes individual calls to all the servers in a single collection and sorts on a high cardinality field. The client then merge joins the sorted lists to perform the high cardinality aggregation.

        In these scenarios Solr is being used as a distributed sorting engine, in the same way Hadoop uses sorting in the "shuffle" stage.
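
A minimal sketch of the client-side merge join described in the first use case. It assumes each collection has already returned its documents as a stream of dicts sorted ascending on the join key; the function name merge_join and the field names in the usage note are hypothetical, and fetching the streams from Solr is left out.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Doc = Dict[str, Any]


def merge_join(left: Iterable[Doc], right: Iterable[Doc], key: str) -> Iterator[Tuple[Doc, Doc]]:
    """Inner-join two streams that are both sorted ascending on `key`."""
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None and r is not None:
        if l[key] < r[key]:
            l = next(left_it, None)      # left is behind, advance it
        elif l[key] > r[key]:
            r = next(right_it, None)     # right is behind, advance it
        else:
            # Buffer the run of right-side docs that share this key so that
            # every left-side doc with the same key is paired with all of them.
            current = l[key]
            run: List[Doc] = []
            while r is not None and r[key] == current:
                run.append(r)
                r = next(right_it, None)
            while l is not None and l[key] == current:
                for match in run:
                    yield l, match
                l = next(left_it, None)
```

Only one run of equal keys is buffered at a time, so memory use depends on the largest group of duplicate keys rather than on the size of either result set. For streams sorted descending, as in the ticket's sort=a desc example, the two comparisons would be reversed.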

        Joel Bernstein added a comment -

        Also, at this point I'm going to drop the Avro response writer and just start with JSON. I'll update the ticket description shortly to explain the use cases etc...

        Joel Bernstein added a comment -

        This patch supports exporting single and multivalue fields for the following types: int, long, float, double, string.

        This patch also includes tests for exporting these fields.

        Joel Bernstein added a comment - - edited

        New patch with all tests passing. Also added syntax error handling.

        It looks like rows=-1 is not the best way to signal the export because it seems to already be used to signal other behavior.

        So right now the syntax is:

        q=hello&rq={!xport}&wt=xsort&fl=...&sort=...
        

        In general the use of the RankQuery (rq param) is more intuitive than when a PostFilter was being used to collect the BitSet.

        Happy to try a different syntax though if there are more ideas.

        Joel Bernstein added a comment - - edited

        I ran some simple performance metrics on the export and found that it bogs down when sorting on string fields. Reconciling the segment level ordinals during segment changes is the issue.

        I see two possible changes:
        1) Use the BytesRef for sorting rather than the ordinals. This will eliminate the need to reconcile ordinals. It remains to be seen though if the BytesRef sorting will be fast enough.

        2) Use global ordinals for sorting. This will be the fastest approach but will incur a hit at commit time.
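
A toy sketch of option 2, not the patch's actual code. It assumes each segment exposes its sorted list of unique terms, so a segment-local ordinal is an index into that list; the names build_global_ordinals and sort_docs are hypothetical. Segment-local ordinals are not comparable across segments, so they are first translated into ordinals over the merged term space, after which sorting only compares integers.

```python
from typing import List, Tuple


def build_global_ordinals(segment_terms: List[List[str]]) -> List[List[int]]:
    """For each segment, map segment-local ordinal -> global ordinal."""
    # Merge every segment's terms into one sorted, de-duplicated term space.
    global_terms = sorted({term for terms in segment_terms for term in terms})
    global_ord = {term: i for i, term in enumerate(global_terms)}
    return [[global_ord[term] for term in terms] for terms in segment_terms]


def sort_docs(docs: List[Tuple[int, int, int]], ord_maps: List[List[int]]) -> List[int]:
    """Sort (doc_id, segment_index, segment_ordinal) triples by global ordinal."""
    ordered = sorted(docs, key=lambda d: ord_maps[d[1]][d[2]])
    return [doc_id for doc_id, _, _ in ordered]
```

In this sketch the mapping has to be rebuilt whenever the set of segments changes, which is presumably where the commit-time cost mentioned above would come from.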

        Joel Bernstein added a comment -

        Another approach would be to use a thread pool to do segment-level sorting on ordinals, then use the BytesRefs to merge the segment results to get the top X. This probably won't be as fast as sorting on top-level ordinals, but should be much faster than sorting the entire set on BytesRef.
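
A rough sketch of this approach, under the assumption that a segment can be modeled as a sorted term list plus a doc-to-ordinal map. The Segment class and top_x function are hypothetical stand-ins, not Lucene or Solr APIs, and Python strings play the role of BytesRef values. Each segment is sorted on its integer ordinals in a thread pool, and the per-segment results are then merged by comparing the term values themselves.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Dict, List, Tuple


class Segment:
    """Toy stand-in for an index segment with per-segment string ordinals."""

    def __init__(self, docs: Dict[int, str]) -> None:
        # Sorted unique terms; a term's ordinal is its index in this list.
        self.terms: List[str] = sorted(set(docs.values()))
        ord_of = {term: i for i, term in enumerate(self.terms)}
        self.doc_ords: Dict[int, int] = {doc: ord_of[value] for doc, value in docs.items()}

    def sorted_by_ordinal(self) -> List[Tuple[str, int]]:
        # Sort on integer ordinals (ties broken by doc id), then resolve
        # each ordinal to its term so segments can be merged by value.
        ordered = sorted(self.doc_ords.items(), key=lambda item: (item[1], item[0]))
        return [(self.terms[ordinal], doc) for doc, ordinal in ordered]


def top_x(segments: List[Segment], x: int) -> List[Tuple[str, int]]:
    """Return the first x (term, doc_id) pairs in ascending term order."""
    with ThreadPoolExecutor() as pool:
        per_segment = list(pool.map(Segment.sorted_by_ordinal, segments))
    # heapq.merge lazily merges the already-sorted lists by comparing values.
    return list(islice(heapq.merge(*per_segment), x))
```

Because the merge is lazy, only as many value comparisons are performed as are needed to produce the first x entries.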

        Joel Bernstein added a comment -

        New patch with custom sorting routines that use top-level String ordinals for sorting. This provides good sorting performance. The DocValues are also surprisingly fast to load.

        Erik Hatcher added a comment -

        Just a thought at first glance, those are some scary/hairy implementation details with the quirky parameter requirements, so maybe this could start out as a request handler (that can still be a SearchHandler subclass and thus support components) that gets mapped to /export (which sets as defaults or invariants the magic incantations). ??

        Joel Bernstein added a comment -

        I like this idea. The request would look something like this then:

        /export?q=blah&fl=field1,field2&sort=field+desc

        The defaults would specify the rq and wt parameters.
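
As an illustration of how a client might assemble such a request, here is a small sketch. The helper name export_url and the base URL in the usage note are assumptions; only the /export path and the q, fl and sort parameters come from the example above.

```python
from typing import Sequence
from urllib.parse import urlencode


def export_url(base: str, query: str, fields: Sequence[str], sort: str) -> str:
    """Build an /export request URL; rq and wt are left to the handler."""
    params = {"q": query, "fl": ",".join(fields), "sort": sort}
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return f"{base}/export?{urlencode(params)}"
```

For example, export_url("http://localhost:8983/solr/collection1", "blah", ["field1", "field2"], "field desc") would produce a URL whose sort parameter is encoded as field+desc, matching the form shown above.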

        David Smiley added a comment -

        +1 great idea Erik

        Carl Tremblay added a comment -

        After some initial tests, it seems that multi-shard queries/exports are not supported. Are there any plans to support them in the near future? Or are they supported already and I am misusing the patch?

        Thanks,

        Joel Bernstein added a comment -

        You are correct. The initial patch is designed for clients outside of Solr to handle the merging of the streams. I think the next step would be to build a SolrJ library that handles different types of merge algorithms (collapse, join, union, aggregate etc...). We could then embed that in a custom request handler inside Solr, or clients could embed that in their application.


        Joel Bernstein added a comment -

        Added the /export request handler. I'll update the ticket description with the new syntax shortly.

        Erik Hatcher added a comment -

        Joel Bernstein, could/should we go even further and hide the xport query parser and xsort response writer so that they aren't registered at all, and /export may be a SearchHandler subclass that hides that stuff completely internally? Or, if this feature is only useful for the "query" component, maybe it could be a non-SearchHandler request handler? I bring this up for a couple of reasons: it seems awkward that a "query parser" is used to trigger this sort of thing, and neither of these components would be useful for anything but exporting. Just thinking out loud. At the very least, instead of them being defined in /export "defaults" they should be in "invariants", no?
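
The difference between "defaults" and "invariants" can be illustrated with a small sketch. In Solr request handler configuration, defaults apply only when the request omits a parameter, whereas invariants override whatever the request sends. The function name resolve_params and the sample values are hypothetical, and the "appends" section that real handlers also support is left out.

```python
from typing import Dict


def resolve_params(
    request: Dict[str, str],
    defaults: Dict[str, str],
    invariants: Dict[str, str],
) -> Dict[str, str]:
    """Combine handler config with request params: defaults < request < invariants."""
    resolved = dict(defaults)      # lowest precedence
    resolved.update(request)       # the request may override defaults
    resolved.update(invariants)    # invariants always win
    return resolved
```

With rq and wt configured as defaults, a request carrying wt=json would replace the export writer; with the same values configured as invariants, the request's wt would be ignored.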

        Joel Bernstein added a comment -

        Yes I think we should do what you suggest. I may not have time to implement this before Solr 4.10 though. I don't think we need to hold this up because the client interface will remain stable and we can simply slide in the new SearchHandler in a later release.

        Also, moving forward there'll be different types of export functionality, and a specialized SearchHandler will be needed to sort out the different options.

        Joel Bernstein added a comment - - edited

        We also need a graceful way to keep the /export handler from going into distributed mode, which is not supported in this release.

        We could simply add distrib=false to the default params, which will work.

        I think when we introduce the SearchHandler we should also introduce some basic distributed support as well, but this will take some thought and design.

        As it stands now, users can write SolrCloud-aware clients and solve a lot of interesting use cases with this. So I think we should release it.

        Joel Bernstein added a comment -

        Made the /export init properties invariants and added distrib=false to ensure that the requests aren't distributed when in SolrCloud mode.

        In the initial release, developers can create SolrCloud-aware clients that query each node in the cluster and merge the results in any way they see fit.
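
One way such a client might merge per-node results is sketched below. It assumes each node's export has already been fetched and parsed into a list (or iterator) of dicts sorted ascending on the same field; the function name merge_and_count and the field name in the usage note are hypothetical, and the HTTP calls to each node are omitted.

```python
import heapq
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, Iterable, Iterator, Tuple

Doc = Dict[str, Any]


def merge_and_count(node_streams: Iterable[Iterable[Doc]], key: str) -> Iterator[Tuple[Any, int]]:
    """Merge sorted per-node streams and count documents per key value."""
    # heapq.merge keeps one pending doc per stream, so the full result
    # sets never need to be held in memory at once.
    merged = heapq.merge(*node_streams, key=itemgetter(key))
    # Equal keys are adjacent after the merge, so groupby can roll them up.
    for value, group in groupby(merged, key=itemgetter(key)):
        yield value, sum(1 for _ in group)
```

For example, merging two node streams sorted on a hypothetical sessionID field yields one (sessionID, count) pair per session. The same structure would serve for other rollups by replacing the count with a different aggregate.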

        ASF subversion and git services added a comment -

        Commit 1618068 from Joel Bernstein in branch 'dev/trunk'
        [ https://svn.apache.org/r1618068 ]

        SOLR-5244: Exporting Full Sorted Result Sets

        ASF subversion and git services added a comment -

        Commit 1618152 from Joel Bernstein in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1618152 ]

        SOLR-5244: Exporting Full Sorted Result Sets

        ASF subversion and git services added a comment -

        Commit 1618224 from Joel Bernstein in branch 'dev/trunk'
        [ https://svn.apache.org/r1618224 ]

        SOLR-5244: Exporting Full Sorted Result Sets

        ASF subversion and git services added a comment -

        Commit 1618228 from Joel Bernstein in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1618228 ]

        SOLR-5244: Exporting Full Sorted Result Sets

        ASF subversion and git services added a comment -

        Commit 1619783 from Joel Bernstein in branch 'dev/trunk'
        [ https://svn.apache.org/r1619783 ]

        SOLR-5244: Exporting Full Sorted Result Sets

        ASF subversion and git services added a comment -

        Commit 1619804 from Joel Bernstein in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1619804 ]

        SOLR-5244: Exporting Full Sorted Result Sets


          People

          • Assignee:
            Joel Bernstein
            Reporter:
            Joel Bernstein
          • Votes:
            12
            Watchers:
            24
