Solr
  1. Solr
  2. SOLR-1566

Allow components to add fields to outgoing documents

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: search
    • Labels:
      None

      Description

      Currently it is not possible for components to add fields to outgoing documents which are not in the the stored fields of the document. This makes it cumbersome to add computed fields/metadata .

      1. SOLR-1566-PageTool.patch
        1 kB
        Jan Høydahl
      2. SOLR-1566-DocTransformer.patch
        95 kB
        Ryan McKinley
      3. SOLR-1566_parsing.patch
        11 kB
        Yonik Seeley
      4. SOLR-1566-DocTransformer.patch
        93 kB
        Ryan McKinley
      5. SOLR-1566-DocTransformer.patch
        81 kB
        Ryan McKinley
      6. SOLR-1566-DocTransformer.patch
        84 kB
        Ryan McKinley
      7. SOLR-1566-DocTransformer.patch
        79 kB
        Ryan McKinley
      8. SOLR-1566-DocTransformer.patch
        84 kB
        Ryan McKinley
      9. SOLR-1566-rm.patch
        73 kB
        Ryan McKinley
      10. SOLR-1566-rm.patch
        16 kB
        Ryan McKinley
      11. SOLR-1566-rm.patch
        18 kB
        Ryan McKinley
      12. SOLR-1566-rm.patch
        15 kB
        Ryan McKinley
      13. SOLR-1566-rm.patch
        22 kB
        Ryan McKinley
      14. SOLR-1566-gsi.patch
        7 kB
        Grant Ingersoll
      15. SOLR-1566.patch
        22 kB
        Noble Paul
      16. SOLR-1566.patch
        18 kB
        Noble Paul
      17. SOLR-1566.patch
        10 kB
        Noble Paul
      18. SOLR-1566.patch
        10 kB
        Noble Paul

        Issue Links

          Activity

          Hide
          Noble Paul added a comment -

          idea as a patch.

          Any component must be able to add fields to the document. There are two ways to do it. Add the value directly or add a component (DocFieldSource) which will add the value just in time before the document is written out. The component can choose to add the values based on the other values or it can add some metadata to the document

          There is a new class SolrDocumentSlice extends DocSlice . The requirement is that all responsewriters must use SolrDocumentSlice#getSolrDocumentIterator() to write out documents.

          I have not made the necessary changes to the response writers yet. I wish to get the feedback from others on the approach before going ahead further

          Show
          Noble Paul added a comment - idea as a patch. Any component must be able to add fields to the document. There are two ways to do it. Add the value directly or add a component (DocFieldSource) which will add the value just in time before the document is written out. The component can choose to add the values based on the other values or it can add some metadata to the document There is a new class SolrDocumentSlice extends DocSlice . The requirement is that all responsewriters must use SolrDocumentSlice#getSolrDocumentIterator() to write out documents. I have not made the necessary changes to the response writers yet. I wish to get the feedback from others on the approach before going ahead further
          Hide
          Noble Paul added a comment -

          untested patch. But
          I am happy with the changes I have made. If others think it is fine , I can add testcases and commit

          Show
          Noble Paul added a comment - untested patch. But I am happy with the changes I have made. If others think it is fine , I can add testcases and commit
          Hide
          Grant Ingersoll added a comment -

          I think we should try to coordinate w/ SOLR-1298 a bit and take a step back

          Show
          Grant Ingersoll added a comment - I think we should try to coordinate w/ SOLR-1298 a bit and take a step back
          Hide
          Noble Paul added a comment - - edited

          Yeah. SOLR-1298 and this are trying to achieve something in common. It is not possible for these 2 issues to have 2 different directions.

          It is not enough to have just function results to be added as values. We should be able to add just about anything

          Show
          Noble Paul added a comment - - edited Yeah. SOLR-1298 and this are trying to achieve something in common. It is not possible for these 2 issues to have 2 different directions. It is not enough to have just function results to be added as values. We should be able to add just about anything
          Hide
          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Show
          Hoss Man added a comment - Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email... http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed. A unique token for finding these 240 issues in the future: hossversioncleanup20100527
          Hide
          Grant Ingersoll added a comment -

          I know this sounds simple and maybe I'm missing something, but why can't we have a simple data structure that is attached to the SolrResponse and then in the ResponseWriters, (pretty much where it adds the Score onto the item) it looks up the document in the data structure (it's likely a Map where the key is the doc id and the value is a list of values to add to the doc) and then adds the fields, optionally marking them as "pseudo" or annotative fields? I can't imagine the memory usage would get out of hand unless you are retrieving a large number of docs (but then you have memory issues anyway) and it is short-lived so it likely never proceeds out of Eden memory anyway. Everybody already has access to the SolrResponse to some extent. For those who want, we could likely make it a configuration item such that even the holding Map isn't created, but that likely isn't too big of a deal.

          Thoughts?

          Show
          Grant Ingersoll added a comment - I know this sounds simple and maybe I'm missing something, but why can't we have a simple data structure that is attached to the SolrResponse and then in the ResponseWriters, (pretty much where it adds the Score onto the item) it looks up the document in the data structure (it's likely a Map where the key is the doc id and the value is a list of values to add to the doc) and then adds the fields, optionally marking them as "pseudo" or annotative fields? I can't imagine the memory usage would get out of hand unless you are retrieving a large number of docs (but then you have memory issues anyway) and it is short-lived so it likely never proceeds out of Eden memory anyway. Everybody already has access to the SolrResponse to some extent. For those who want, we could likely make it a configuration item such that even the holding Map isn't created, but that likely isn't too big of a deal. Thoughts?
          Hide
          Chris Male added a comment -

          Thats basically where I was headed with SOLR-1298. Although I still believe that having them added as fields to the Documents properly, whether it be at the SolrIndexSearcher level or somewhere higher, means we can more easily use the values in other areas without having to write custom code in each place.

          Show
          Chris Male added a comment - Thats basically where I was headed with SOLR-1298 . Although I still believe that having them added as fields to the Documents properly, whether it be at the SolrIndexSearcher level or somewhere higher, means we can more easily use the values in other areas without having to write custom code in each place.
          Hide
          Grant Ingersoll added a comment -

          Thats basically where I was headed with SOLR-1298

          Hmm, missed that, let me check it out again.

          means we can more easily use the values in other areas

          True, but Solr doesn't really materialize the Document until the end anyway, so all you are doing is creating the store anyway.

          I've got a basic implementation already that simply adds a small storage piece onto SolrQueryResponse. It is a <code>Map<Integer, List<DocAnnotations>><code> where a DocAnnotation is a String name and a Object value. At RespWriting time, it simply adds the objects as fields onto the appropriate documents. The key to the Map is the Lucene doc id. Not sure if that is ideal, but I figure anyone adding to a document would have that handle, but that remains to be seen.

          Show
          Grant Ingersoll added a comment - Thats basically where I was headed with SOLR-1298 Hmm, missed that, let me check it out again. means we can more easily use the values in other areas True, but Solr doesn't really materialize the Document until the end anyway, so all you are doing is creating the store anyway. I've got a basic implementation already that simply adds a small storage piece onto SolrQueryResponse. It is a <code>Map<Integer, List<DocAnnotations>><code> where a DocAnnotation is a String name and a Object value. At RespWriting time, it simply adds the objects as fields onto the appropriate documents. The key to the Map is the Lucene doc id. Not sure if that is ideal, but I figure anyone adding to a document would have that handle, but that remains to be seen.
          Hide
          Chris Male added a comment -

          I've got a basic implementation already that simply adds a small storage piece onto SolrQueryResponse. It is a <code>Map<Integer, List<DocAnnotations>><code> where a DocAnnotation is a String name and a Object value. At RespWriting time, it simply adds the objects as fields onto the appropriate documents. The key to the Map is the Lucene doc id. Not sure if that is ideal, but I figure anyone adding to a document would have that handle, but that remains to be seen.

          Sounds reasonable to me. Except I still fear that by creating the DocAnnotation idea we are now adding another construct to Documents. Sure, as you say, Solr doesn't materialize Documents till it needs to write them out, but still a Document consists only of Fields. Now they consist of Fields and DocAnnotations. Its not a big issue I suppose, just wondering whether its needed. Perhaps a simple renaming to MetadataField or AnnotatedField or something would suffice.

          Show
          Chris Male added a comment - I've got a basic implementation already that simply adds a small storage piece onto SolrQueryResponse. It is a <code>Map<Integer, List<DocAnnotations>><code> where a DocAnnotation is a String name and a Object value. At RespWriting time, it simply adds the objects as fields onto the appropriate documents. The key to the Map is the Lucene doc id. Not sure if that is ideal, but I figure anyone adding to a document would have that handle, but that remains to be seen. Sounds reasonable to me. Except I still fear that by creating the DocAnnotation idea we are now adding another construct to Documents. Sure, as you say, Solr doesn't materialize Documents till it needs to write them out, but still a Document consists only of Fields. Now they consist of Fields and DocAnnotations. Its not a big issue I suppose, just wondering whether its needed. Perhaps a simple renaming to MetadataField or AnnotatedField or something would suffice.
          Hide
          Grant Ingersoll added a comment -

          Now they consist of Fields and DocAnnotations.

          Nah, they don't really. It's just sdoc.addField(annot.name, annot.value)

          The only thing that is missing in that is it might be nice to be able to mark that field as being a "pseudo" field.

          Show
          Grant Ingersoll added a comment - Now they consist of Fields and DocAnnotations. Nah, they don't really. It's just sdoc.addField(annot.name, annot.value) The only thing that is missing in that is it might be nice to be able to mark that field as being a "pseudo" field.
          Hide
          Chris Male added a comment -

          Nah, they don't really. It's just sdoc.addField(annot.name, annot.value)

          Ah okay, super!

          The only thing that is missing in that is it might be nice to be able to mark that field as being a "pseudo" field.

          That would be nice. Perhaps we can expand SolrDocument to include a list of pseudo fields? If that needs to be converted into a lucene Document then it can be merged with the rest.

          Show
          Chris Male added a comment - Nah, they don't really. It's just sdoc.addField(annot.name, annot.value) Ah okay, super! The only thing that is missing in that is it might be nice to be able to mark that field as being a "pseudo" field. That would be nice. Perhaps we can expand SolrDocument to include a list of pseudo fields? If that needs to be converted into a lucene Document then it can be merged with the rest.
          Hide
          Grant Ingersoll added a comment -

          I'd add, however, that it is no different from adding the score. Also, I don't know that marking the field is required, as, IMO, the Document that is written out is intended to be "truth" as far as Solr is concerned, not as far as the Lucene index is concerned.

          Show
          Grant Ingersoll added a comment - I'd add, however, that it is no different from adding the score. Also, I don't know that marking the field is required, as, IMO, the Document that is written out is intended to be "truth" as far as Solr is concerned, not as far as the Lucene index is concerned.
          Hide
          Chris Male added a comment -

          Would you change how score is handled so its like other pseudo fields?

          Show
          Chris Male added a comment - Would you change how score is handled so its like other pseudo fields?
          Hide
          Grant Ingersoll added a comment -

          Then again, there are things like sorting, etc. that we may or may not want to support, so you wouldn't want the app dev. to expect to sort by a pseudo field (or maybe you do)

          Show
          Grant Ingersoll added a comment - Then again, there are things like sorting, etc. that we may or may not want to support, so you wouldn't want the app dev. to expect to sort by a pseudo field (or maybe you do)
          Hide
          Chris Male added a comment -

          If we think in terms of spatial distances, then it might be useful to be able to sort by the pseudo field? I know we currently resolve that issue by sorting by function query, but if you already calculate the distance in some other spatial process, might as well be able to sort on it?

          Show
          Chris Male added a comment - If we think in terms of spatial distances, then it might be useful to be able to sort by the pseudo field? I know we currently resolve that issue by sorting by function query, but if you already calculate the distance in some other spatial process, might as well be able to sort on it?
          Hide
          Grant Ingersoll added a comment -

          Would you change how score is handled so its like other pseudo fields?

          I doubt it. It is a bit different, I think, than other pseudo-fields.

          Show
          Grant Ingersoll added a comment - Would you change how score is handled so its like other pseudo fields? I doubt it. It is a bit different, I think, than other pseudo-fields.
          Hide
          Grant Ingersoll added a comment -

          If we think in terms of spatial distances, then it might be useful to be able to sort by the pseudo field? I know we currently resolve that issue by sorting by function query, but if you already calculate the distance in some other spatial process, might as well be able to sort on it?

          Yeah, that is true, but sorting is such a low level operation and this issue is about any component doing it, no matter where in the list it occurs. I just don't see how to make the two sync up w/o having to resort and that would be a perf. killer in many ways (CPU and memory). IMO, I think this piece of functionality should be about "marking up" the result set that is going to be returned.

          Of course, even with the approach I propose, it has the potential to be a memory killer if done wrong. For instance, I doubt you'd want to stuff in the Map every single distance you calculate, you really only want to add those items that you know are going to be in the final result list. Perhaps the alternate way is to provide a call back mechanism at Document binding time (i.e. the ResponseWriter) where it gives some Interface the document and then the implementation can add what it wants per the application.

          Show
          Grant Ingersoll added a comment - If we think in terms of spatial distances, then it might be useful to be able to sort by the pseudo field? I know we currently resolve that issue by sorting by function query, but if you already calculate the distance in some other spatial process, might as well be able to sort on it? Yeah, that is true, but sorting is such a low level operation and this issue is about any component doing it, no matter where in the list it occurs. I just don't see how to make the two sync up w/o having to resort and that would be a perf. killer in many ways (CPU and memory). IMO, I think this piece of functionality should be about "marking up" the result set that is going to be returned. Of course, even with the approach I propose, it has the potential to be a memory killer if done wrong. For instance, I doubt you'd want to stuff in the Map every single distance you calculate, you really only want to add those items that you know are going to be in the final result list. Perhaps the alternate way is to provide a call back mechanism at Document binding time (i.e. the ResponseWriter) where it gives some Interface the document and then the implementation can add what it wants per the application.
          Hide
          Chris Male added a comment -

          Yeah, that is true, but sorting is such a low level operation and this issue is about any component doing it, no matter where in the list it occurs. I just don't see how to make the two sync up w/o having to resort and that would be a perf. killer in many ways (CPU and memory). IMO, I think this piece of functionality should be about "marking up" the result set that is going to be returned.

          I concur. I think its best not to do too much stuff in this issue. If we get pseudo fields being written out, then later on we can address advanced usages.

          Of course, even with the approach I propose, it has the potential to be a memory killer if done wrong. For instance, I doubt you'd want to stuff in the Map every single distance you calculate, you really only want to add those items that you know are going to be in the final result list.

          Yup, this is the crux of the issue.

          Perhaps the alternate way is to provide a call back mechanism at Document binding time (i.e. the ResponseWriter) where it gives some Interface the document and then the implementation can add what it wants per the application.

          I just wrote something about how I didn't see how that addressed the problem, but then it clicked what you are getting at. Through this Callback you can then re-calculate the distances for your page of results and recalculating 10 distances is a nominal cost.

          I think thats a really great idea, but I wonder if that shouldn't be an additional process to what you've already made? For spatial it makes perfect sense, but I wonder whether its appropriate for other use cases? I'm trying to imagine some use case where the value can only be captured as part of some convoluted query or analysis process. Consequently you want to store it when you can and not at binding time. But maybe a) those use cases dont exist and b) We should leave it up to the user to create a storage system that hooks into the callback.

          Show
          Chris Male added a comment - Yeah, that is true, but sorting is such a low level operation and this issue is about any component doing it, no matter where in the list it occurs. I just don't see how to make the two sync up w/o having to resort and that would be a perf. killer in many ways (CPU and memory). IMO, I think this piece of functionality should be about "marking up" the result set that is going to be returned. I concur. I think its best not to do too much stuff in this issue. If we get pseudo fields being written out, then later on we can address advanced usages. Of course, even with the approach I propose, it has the potential to be a memory killer if done wrong. For instance, I doubt you'd want to stuff in the Map every single distance you calculate, you really only want to add those items that you know are going to be in the final result list. Yup, this is the crux of the issue. Perhaps the alternate way is to provide a call back mechanism at Document binding time (i.e. the ResponseWriter) where it gives some Interface the document and then the implementation can add what it wants per the application. I just wrote something about how I didn't see how that addressed the problem, but then it clicked what you are getting at. Through this Callback you can then re-calculate the distances for your page of results and recalculating 10 distances is a nominal cost. I think thats a really great idea, but I wonder if that shouldn't be an additional process to what you've already made? For spatial it makes perfect sense, but I wonder whether its appropriate for other use cases? I'm trying to imagine some use case where the value can only be captured as part of some convoluted query or analysis process. Consequently you want to store it when you can and not at binding time. But maybe a) those use cases dont exist and b) We should leave it up to the user to create a storage system that hooks into the callback.
          Hide
          Grant Ingersoll added a comment -

          but I wonder if that shouldn't be an additional process to what you've already made?

          Totally agree.

          b) We should leave it up to the user to create a storage system that hooks into the callback.

          Yep, that would work too. Heck, it would even allow one to materialize stuff from a DB or Cassandra, etc. We'd probably want to be able to do bulk operations if that is the case. But again, that's getting ahead of ourselves, I think.

          Show
          Grant Ingersoll added a comment - but I wonder if that shouldn't be an additional process to what you've already made? Totally agree. b) We should leave it up to the user to create a storage system that hooks into the callback. Yep, that would work too. Heck, it would even allow one to materialize stuff from a DB or Cassandra, etc. We'd probably want to be able to do bulk operations if that is the case. But again, that's getting ahead of ourselves, I think.
          Hide
          Grant Ingersoll added a comment -

          The biggest problem here, is the ResponseWriters are a tangled web and it is not always clear they have access to the SolrQueryResponse at the right place, but maybe I'm not understanding them fully yet.

          Show
          Grant Ingersoll added a comment - The biggest problem here, is the ResponseWriters are a tangled web and it is not always clear they have access to the SolrQueryResponse at the right place, but maybe I'm not understanding them fully yet.
          Hide
          Chris Male added a comment -

          Sounds like a good time to untangle the web.

          Show
          Chris Male added a comment - Sounds like a good time to untangle the web.
          Hide
          Grant Ingersoll added a comment -

          Hmm, so digging in a bit more here and looking at Noble's patch, I think what I have, Noble has and what Chris has are very similar. The big diff is Chris and Noble hook into the SolrIndexSearcher, whereas what I was intending to do was to hook into the response writers. As mentioned in the prev. comment, this is a bit tedious, such that it likely does make sense to hook into the SolrIndexSearcher.

          Then again, the SolrQueryRequest already has a context object on it (which is what Chris' patch uses, I think), so maybe we should just use that along w/ the Resp. Writers. Argh.

          Show
          Grant Ingersoll added a comment - Hmm, so digging in a bit more here and looking at Noble's patch, I think what I have, Noble has and what Chris has are very similar. The big diff is Chris and Noble hook into the SolrIndexSearcher, whereas what I was intending to do was to hook into the response writers. As mentioned in the prev. comment, this is a bit tedious, such that it likely does make sense to hook into the SolrIndexSearcher. Then again, the SolrQueryRequest already has a context object on it (which is what Chris' patch uses, I think), so maybe we should just use that along w/ the Resp. Writers. Argh.
          Hide
          Grant Ingersoll added a comment -

          Here's a half-baked patch that I'm putting up b/c Ryan indicated to me he might have some time to work on it. It doesn't have the call back capabilities I think we need (that can then hook into Functions) but it does have:

          1. Per request storage (although I think we should just re-use the SolrQueryRequest.context Map)
          2. ResponseWriter integration (i.e. no SolrIndexSearcher integration)

          Show
          Grant Ingersoll added a comment - Here's a half-baked patch that I'm putting up b/c Ryan indicated to me he might have some time to work on it. It doesn't have the call back capabilities I think we need (that can then hook into Functions) but it does have: 1. Per request storage (although I think we should just re-use the SolrQueryRequest.context Map) 2. ResponseWriter integration (i.e. no SolrIndexSearcher integration)
          Hide
          Ryan McKinley added a comment -

          attaching some stuff that really does not work, but at least sketches a direction till it hit the 'how do we actually write the fields' road block...

          posting here just for reference.

          This takes the approach of adding an 'augmenter' calss:

          public abstract class DocAugmenter 
          {
          	/**
          	 * Call this before you start writing out fields
          	 * @param rsp 
          	 * @param req 
          	 * 
          	 * @return true if it has any fields to augment
          	 */
          	public abstract boolean setup(SolrQueryRequest req);
          	
          	/**
          	 * Make sure 'setup' has been called first!
          	 */
          	public abstract void augment( int docId, Document doc );
          }
          

          the 'setup' call give the augmeter a chance to bulk load all the values if that is what it wants to do. perhaps we would want a shutdown also

          Show
          Ryan McKinley added a comment - attaching some stuff that really does not work, but at least sketches a direction till it hit the 'how do we actually write the fields' road block... posting here just for reference. This takes the approach of adding an 'augmenter' calss: public abstract class DocAugmenter { /** * Call this before you start writing out fields * @param rsp * @param req * * @ return true if it has any fields to augment */ public abstract boolean setup(SolrQueryRequest req); /** * Make sure 'setup' has been called first! */ public abstract void augment( int docId, Document doc ); } the 'setup' call give the augmeter a chance to bulk load all the values if that is what it wants to do. perhaps we would want a shutdown also
          Hide
          Grant Ingersoll added a comment -

          I think we all have generally worked around the same issues here, between this and SOLR-1298. I guess we just need to pick some names and work it out.

          One thing about this last patch (and mine, I think) is that perhaps we should just put the augmenter on the Request. That way, you don't have to add the response in a bunch of places. Besides, in my mind anyway, you are requesting augmentation via the Augmenter provided.

          Also, I'm not sure why StdAugmenter is instantiated in SolrCore. Wouldn't we want to allow for that to be driven by some user implementations?

          Perhaps, since there are a few of us w/ eyes on this, we should first try to tackle the ResponseWriter mess.

          Show
          Grant Ingersoll added a comment - I think we all have generally worked around the same issues here, between this and SOLR-1298 . I guess we just need to pick some names and work it out. One thing about this last patch (and mine, I think) is that perhaps we should just put the augmenter on the Request. That way, you don't have to add the response in a bunch of places. Besides, in my mind anyway, you are requesting augmentation via the Augmenter provided. Also, I'm not sure why StdAugmenter is instantiated in SolrCore. Wouldn't we want to allow for that to be driven by some user implementations? Perhaps, since there are a few of us w/ eyes on this, we should first try to tackle the ResponseWriter mess.
          Hide
          Ryan McKinley added a comment -

          updating to /trunk and moving back to the SolrDocument approach even though it does not work (it is more clear)

          Show
          Ryan McKinley added a comment - updating to /trunk and moving back to the SolrDocument approach even though it does not work (it is more clear)
          Hide
          Bill Bell added a comment -

          OK. Do we have something that works? Near works? Are we going with this or SOLR-1298 ? We need a decision ?

          If not defined in schema.xml we could just add a parameter: fi=distance:hsin(6500, score, 39, -109)

          FI = Field Injection (1 or more).

          It could be handled in the responseWriter. Later adding it to schema.xml can be tackled?

          Thanks.

          Show
          Bill Bell added a comment - OK. Do we have something that works? Near works? Are we going with this or SOLR-1298 ? We need a decision ? If not defined in schema.xml we could just add a parameter: fi=distance:hsin(6500, score, 39, -109) FI = Field Injection (1 or more). It could be handled in the responseWriter. Later adding it to schema.xml can be tackled? Thanks.
          Hide
          Ryan McKinley added a comment -

          updated patch to work with trunk

          note, this does not do anything yet, but points towards a direction

          Show
          Ryan McKinley added a comment - updated patch to work with trunk note, this does not do anything yet, but points towards a direction
          Hide
          Ryan McKinley added a comment -

          updating to trunk... still not really working (or doing anyting)

          Show
          Ryan McKinley added a comment - updating to trunk... still not really working (or doing anyting)
          Hide
          Ryan McKinley added a comment -

          Updated path that actually works!

          Show
          Ryan McKinley added a comment - Updated path that actually works!
          Hide
          Ryan McKinley added a comment -

          I just added a path that almost works. Rather then bang my head on it some more, i think some feedback would be great.

          Originally, I hoped to clean up the ResponeWriter mess, and make a single place that would 'augment' docs – but there are so many micro optimizations for each format that approach seems difficult.

          Instead I went for an approach the moves Set<String> returnFields to a class that knows more then just a list of strings. This holds an augmenter that can add to SolrDocument or write to the response.

          Now you can make a request like:

          http://localhost:8983/solr/select?q=*:*&fl=id,score,_docid_,_shard

          and get back a response

            "response":{"numFound":17,"start":0,"maxScore":1.0,"docs":[
                {
                  "id":"GB18030TEST",
                  "score":1.0,
                  "_docid_":0,
                  "_shard_":"getshardid???"},
                {
                  "id":"SP2514N",
                  "score":1.0,
                  "_docid_":1,
                  "_shard_":"getshardid???"},
                {
                  "id":"6H500F0",
                  "features":[
                    "SATA 3.0Gb/s, NCQ",
                    "8.5ms seek",
                    "16MB cache"],
                  "score":1.0,
                  "_docid_":2,
                  "_shard_":"getshardid???"},
          ...
          
          

          right now, docid just returns the lucene docid, and shard returns a constant string saying "getshardid???"

          The distributed tests (BasicDistributedZkTest, and TestDistributedSearch) don't pass, but everything else does. I will wait for general feedback before trying to track that down.

          Also looking at SOLR-1298, I would love some feedback on how we could read function queries and ideally used ones that are already defined elsewhere.

          feedback welcome!

          Show
          Ryan McKinley added a comment - I just added a path that almost works. Rather then bang my head on it some more, i think some feedback would be great. Originally, I hoped to clean up the ResponeWriter mess, and make a single place that would 'augment' docs – but there are so many micro optimizations for each format that approach seems difficult. Instead I went for an approach the moves Set<String> returnFields to a class that knows more then just a list of strings. This holds an augmenter that can add to SolrDocument or write to the response. Now you can make a request like: http://localhost:8983/solr/select?q=*:*&fl=id,score,_docid_,_shard and get back a response "response" :{ "numFound" :17, "start" :0, "maxScore" :1.0, "docs" :[ { "id" : "GB18030TEST" , "score" :1.0, "_docid_" :0, "_shard_" : "getshardid???" }, { "id" : "SP2514N" , "score" :1.0, "_docid_" :1, "_shard_" : "getshardid???" }, { "id" : "6H500F0" , "features" :[ "SATA 3.0Gb/s, NCQ" , "8.5ms seek" , "16MB cache" ], "score" :1.0, "_docid_" :2, "_shard_" : "getshardid???" }, ... right now, docid just returns the lucene docid, and shard returns a constant string saying "getshardid???" The distributed tests (BasicDistributedZkTest, and TestDistributedSearch) don't pass, but everything else does. I will wait for general feedback before trying to track that down. Also looking at SOLR-1298 , I would love some feedback on how we could read function queries and ideally used ones that are already defined elsewhere. feedback welcome!
          Hide
          Yonik Seeley added a comment -

          Hey Ryan, I've also been working on this recently, trying to keep all of the various use-cases in mind (it's difficult!). It's all just been about brainstorming on the back-end (the chain of things that can modify documents), but I also have code that parses multiple "fl" params, including field globbing and function queries. It doesn't implement those yet, just allows for their specification.

          Show
          Yonik Seeley added a comment - Hey Ryan, I've also been working on this recently, trying to keep all of the various use-cases in mind (it's difficult!). It's all just been about brainstorming on the back-end (the chain of things that can modify documents), but I also have code that parses multiple "fl" params, including field globbing and function queries. It doesn't implement those yet, just allows for their specification.
          Hide
          Bill Bell added a comment -

          Here is one use case:

          • Return geodist() as a field in the results.
          Show
          Bill Bell added a comment - Here is one use case: Return geodist() as a field in the results.
          Hide
          Ryan McKinley added a comment -

          This takes a different approach – it forces all the writers to use SolrDocument rather then DocList

          Show
          Ryan McKinley added a comment - This takes a different approach – it forces all the writers to use SolrDocument rather then DocList
          Hide
          Ryan McKinley added a comment -

          Here is a different approach that allows the whole Document to be transformed rather then just letting you add fields to the response. The basic interface is now:

          public abstract class DocTransformer
          {
            public abstract void transform(SolrDocument doc, int docid, Float score);
          }
          

          Some notes about the patch – many tests fail because this does not support the old xml style where multivalued fields show up as a single item.

          Show
          Ryan McKinley added a comment - Here is a different approach that allows the whole Document to be transformed rather then just letting you add fields to the response. The basic interface is now: public abstract class DocTransformer { public abstract void transform(SolrDocument doc, int docid, Float score); } Some notes about the patch – many tests fail because this does not support the old xml style where multivalued fields show up as a single item.
          Hide
          Ryan McKinley added a comment -

          Updated patch. This changes the the API to:

          public abstract class DocTransformer
          {
            public void setContext( TransformContext context ) {}
            public abstract void transform(SolrDocument doc, int docid);
          }
          

          This will let the TransformContext hold objects that may be useful for filling up the SolrDocument later.

          For example, the DocIterator is included in TransformContext and that is used to fill in the score (rather then passing Float score to every transformer). Another example would be if we have the Query object and want to add 'explain' info directly to the document.

          Bill – re geodist(), yes that is my real motivation for this! see SOLR-1298

          Something is still weird with the Distributed search stuff, but figure I should get feedback on general approach before worrying about that.

          Show
          Ryan McKinley added a comment - Updated patch. This changes the the API to: public abstract class DocTransformer { public void setContext( TransformContext context ) {} public abstract void transform(SolrDocument doc, int docid); } This will let the TransformContext hold objects that may be useful for filling up the SolrDocument later. For example, the DocIterator is included in TransformContext and that is used to fill in the score (rather then passing Float score to every transformer). Another example would be if we have the Query object and want to add 'explain' info directly to the document. Bill – re geodist(), yes that is my real motivation for this! see SOLR-1298 Something is still weird with the Distributed search stuff, but figure I should get feedback on general approach before worrying about that.
          Hide
          Ryan McKinley added a comment -

          updating this patch with a non-trivial transformation. This adds explain info directly to the results SOLR-2417

          Show
          Ryan McKinley added a comment - updating this patch with a non-trivial transformation. This adds explain info directly to the results SOLR-2417
          Hide
          Ryan McKinley added a comment -

          updated patch for trunk.

          still need to fix test failure in distributed test case.

          I think the basic approach is sound – i'd like to commit soon so we have something to iterate against. I think yoniks thoughts are about parsing the the request query. I hope what i have is OK, and could be replaced/improved when he has code to do that.

          Show
          Ryan McKinley added a comment - updated patch for trunk. still need to fix test failure in distributed test case. I think the basic approach is sound – i'd like to commit soon so we have something to iterate against. I think yoniks thoughts are about parsing the the request query. I hope what i have is OK, and could be replaced/improved when he has code to do that.
          Hide
          Ryan McKinley added a comment -

          updated patch – distributed tests still fail

          Show
          Ryan McKinley added a comment - updated patch – distributed tests still fail
          Hide
          Yonik Seeley added a comment -

          Here's my prototype parsing patch that adds support for multiple "fl" params, field globbing and function queries.

          Show
          Yonik Seeley added a comment - Here's my prototype parsing patch that adds support for multiple "fl" params, field globbing and function queries.
          Hide
          Ryan McKinley added a comment -

          all tests pass.

          I think this is ready to go — and we can iterate from here

          Show
          Ryan McKinley added a comment - all tests pass. I think this is ready to go — and we can iterate from here
          Hide
          Ryan McKinley added a comment -

          Yonik SOLR-1566_parsing.patch looks good – should be able to replace the 'ReturnFields' class in my patch when it is ready to go

          Show
          Ryan McKinley added a comment - Yonik SOLR-1566 _parsing.patch looks good – should be able to replace the 'ReturnFields' class in my patch when it is ready to go
          Hide
          Yonik Seeley added a comment -

          I've tried to go over my brainstorming notes and pull out some stuff that might make sense (as opposed to the stuff that no one will understand w/o a lot more context, like why approach X or approach Y won't work).

          • Know when not to load any stored fields at all (fields could all come from pseudofields, including external fields)
            • example: fl= {!func key=id}id // use the fieldcache to load an id (important option for distributed search)
              ** example: fl={!func key=id}

              ,mul(popularity,editorial_score)&rows=1000000000 // stream the entire thing

          • General transformer class: something that adds, changes, or removes fields based on an external/database lookup
            • may need to work in batches of external ids (a getNextDocument() iterator pattern could allow
              buffering and then an external request with a bunch of ids)
            • probably wants to work in batches of SolrDocument, not docids
            • may need to inspect the FieldsSpec to see what fields were requested
              • ask: did the request ask for field "x" that I am responsible for?
              • ask: did the request specifically ask for field "x"
          • function query that needs to work in-order for good efficiency (query/scorer is one example)
            • need to work in potentially big blocks (100 docids at a time), sort the ids then retrieve values in order
            • don't want to instantiate all SolrDocuments at once... just keep a float[100] or Object[100]
            • if multiple function queries, don't make each sort it's own list of docids, share somehow?
              • the sharing mechanism could just be a private implementation to this type of transformer
          • transformers that need the Index... don't make each one look up the right AtomicReaderContext, share it somehow
          • external fields should work with highlighting (eventually)
            • needs to be possible for highlighting to work on SolrDocuments and only run the transformers once... replacing the original DocList with something else
          • different DocLists need different transformer lists
            • example: multiple DocLists as the result of grouping
            • example: allow overriding "fl" per grouping command
          • need more than one level of pseudo-fields...
            • example: user may specify fl=add(a,b)... that's "global" unless overridden
              but the grouping code may want to add the group pseudo-field when using
              the flattened/simple format
          Show
          Yonik Seeley added a comment - I've tried to go over my brainstorming notes and pull out some stuff that might make sense (as opposed to the stuff that no one will understand w/o a lot more context, like why approach X or approach Y won't work). Know when not to load any stored fields at all (fields could all come from pseudofields, including external fields) example: fl= {!func key=id}id // use the fieldcache to load an id (important option for distributed search) ** example: fl={!func key=id} ,mul(popularity,editorial_score)&rows=1000000000 // stream the entire thing General transformer class: something that adds, changes, or removes fields based on an external/database lookup may need to work in batches of external ids (a getNextDocument() iterator pattern could allow buffering and then an external request with a bunch of ids) probably wants to work in batches of SolrDocument, not docids may need to inspect the FieldsSpec to see what fields were requested ask: did the request ask for field "x" that I am responsible for? ask: did the request specifically ask for field "x" function query that needs to work in-order for good efficiency (query/scorer is one example) need to work in potentially big blocks (100 docids at a time), sort the ids then retrieve values in order don't want to instantiate all SolrDocuments at once... just keep a float [100] or Object [100] if multiple function queries, don't make each sort it's own list of docids, share somehow? the sharing mechanism could just be a private implementation to this type of transformer transformers that need the Index... don't make each one look up the right AtomicReaderContext, share it somehow external fields should work with highlighting (eventually) needs to be possible for highlighting to work on SolrDocuments and only run the transformers once... replacing the original DocList with something else different DocLists need different transformer lists example: multiple DocLists as the result of grouping example: allow overriding "fl" per grouping command need more than one level of pseudo-fields... example: user may specify fl=add(a,b)... that's "global" unless overridden but the grouping code may want to add the group pseudo-field when using the flattened/simple format
          Hide
          Ryan McKinley added a comment -

          Yonik, i took your patch and appled to trunk in: SOLR-2444

          I think it makes sense to start a new issue for how the params are parsed since it looks like a lot is going on there

          Show
          Ryan McKinley added a comment - Yonik, i took your patch and appled to trunk in: SOLR-2444 I think it makes sense to start a new issue for how the params are parsed since it looks like a lot is going on there
          Hide
          Ryan McKinley added a comment -

          This is a large issue with many smaller threads, lets close this one and open smaller issues to sort of the details – In particular:

          • SOLR-2444 will track what we do with the fl parameer
          • SOLR-1298 will track getting function queries working
          Show
          Ryan McKinley added a comment - This is a large issue with many smaller threads, lets close this one and open smaller issues to sort of the details – In particular: SOLR-2444 will track what we do with the fl parameer SOLR-1298 will track getting function queries working
          Hide
          Yonik Seeley added a comment -

          Somewhere we lost some formatting goodness.

          before the patch we had

            "response":{"numFound":22,"start":0,"docs":[
                {
                  "id":"GB18030TEST",
                  "name":"Test with some GB18030 encoded characters"},
                {
                  "id":"SP2514N",
                  "name":"Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133"},
                {
                  "id":"6H500F0",
                  "name":"Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300"},
                {
          

          And now we have

            "response":{"numFound":19,"start":0,"docs":[{
                  "id":"GB18030TEST",
                  "name":"Test with some GB18030 encoded characters"},{
                  "id":"SP2514N",
                  "name":"Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133"},{
                  "id":"6H500F0",
                  "name":"Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300"},{
          

          Which makes it visually tough to find the start of a new document (well... not so hard w/ just 2 fields, but pretty hard with many larger fields)

          Show
          Yonik Seeley added a comment - Somewhere we lost some formatting goodness. before the patch we had "response" :{ "numFound" :22, "start" :0, "docs" :[ { "id" : "GB18030TEST" , "name" : "Test with some GB18030 encoded characters" }, { "id" : "SP2514N" , "name" : "Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133" }, { "id" : "6H500F0" , "name" : "Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300" }, { And now we have "response" :{ "numFound" :19, "start" :0, "docs" :[{ "id" : "GB18030TEST" , "name" : "Test with some GB18030 encoded characters" },{ "id" : "SP2514N" , "name" : "Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133" },{ "id" : "6H500F0" , "name" : "Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300" },{ Which makes it visually tough to find the start of a new document (well... not so hard w/ just 2 fields, but pretty hard with many larger fields)
          Hide
          Yonik Seeley added a comment -

          Hmmm, I think I just hit another issue:

          http://localhost:8983/solr/browse

          java.lang.ClassCastException: org.apache.solr.response.ResultContext cannot be cast to org.apache.solr.common.SolrDocumentList
          	at org.apache.solr.response.PageTool.<init>(PageTool.java:46)
          	at org.apache.solr.response.VelocityResponseWriter.write(VelocityResponseWriter.java:62)
          
          Show
          Yonik Seeley added a comment - Hmmm, I think I just hit another issue: http://localhost:8983/solr/browse java.lang.ClassCastException: org.apache.solr.response.ResultContext cannot be cast to org.apache.solr.common.SolrDocumentList at org.apache.solr.response.PageTool.<init>(PageTool.java:46) at org.apache.solr.response.VelocityResponseWriter.write(VelocityResponseWriter.java:62)
          Hide
          Jan Høydahl added a comment -

          Also encountered the Velocity bug. Fixed it by patching PageTool (attached).

          Show
          Jan Høydahl added a comment - Also encountered the Velocity bug. Fixed it by patching PageTool (attached).
          Hide
          Yonik Seeley added a comment -

          Somewhere we lost some formatting goodness.

          I just committed a small fix for this.

          Show
          Yonik Seeley added a comment - Somewhere we lost some formatting goodness. I just committed a small fix for this.
          Hide
          Ryan McKinley added a comment -

          This has been in trunk for a while – any new problems should get their own JIRA issue

          Show
          Ryan McKinley added a comment - This has been in trunk for a while – any new problems should get their own JIRA issue
          Hide
          Grant Ingersoll added a comment -

          Ryan,

          I see http://wiki.apache.org/solr/DocTransformers has some pretty thin details on this stuff. How does one actually invoke it at Query Time? I see how to configure it.

          Is it something like fl=id,score,[myTransformer ... ] ? That's what it seems like based on the ReturnFields parsing code.

          Show
          Grant Ingersoll added a comment - Ryan, I see http://wiki.apache.org/solr/DocTransformers has some pretty thin details on this stuff. How does one actually invoke it at Query Time? I see how to configure it. Is it something like fl=id,score, [myTransformer ... ] ? That's what it seems like based on the ReturnFields parsing code.
          Hide
          Grant Ingersoll added a comment -

          Debugging a bit shows that this is indeed how it is done. I will add it to the docs.

          Show
          Grant Ingersoll added a comment - Debugging a bit shows that this is indeed how it is done. I will add it to the docs.
          Hide
          Ryan McKinley added a comment -

          I fixed it up a little bit (at least using the right syntax) and added a link to:
          http://wiki.apache.org/solr/CommonQueryParameters#Transformers

          Show
          Ryan McKinley added a comment - I fixed it up a little bit (at least using the right syntax) and added a link to: http://wiki.apache.org/solr/CommonQueryParameters#Transformers
          Hide
          Antony Stubbs added a comment -

          Are there any tests writing for any of the doctransformer architecture? I can't find any. I'm looking for something to base tests for a new regex transformer and highlighting augmenting transformer..

          Show
          Antony Stubbs added a comment - Are there any tests writing for any of the doctransformer architecture? I can't find any. I'm looking for something to base tests for a new regex transformer and highlighting augmenting transformer..
          Hide
          Hoss Man added a comment -

          Antony: I don't see any specific "unit" tests of the internals, but if you look at all the svn commits associated with this issue there is at least one example of doing functional tests to ensure that the transformers are called at the output is included in the response documents...

          https://svn.apache.org/viewvc/lucene/dev/trunk/solr/src/test/org/apache/solr/client/solrj/SolrExampleTests.java?r1=1085450&r2=1085449&view=diff&pathrev=1085450

          Show
          Hoss Man added a comment - Antony: I don't see any specific "unit" tests of the internals, but if you look at all the svn commits associated with this issue there is at least one example of doing functional tests to ensure that the transformers are called at the output is included in the response documents... https://svn.apache.org/viewvc/lucene/dev/trunk/solr/src/test/org/apache/solr/client/solrj/SolrExampleTests.java?r1=1085450&r2=1085449&view=diff&pathrev=1085450
          Hide
          Grant Ingersoll added a comment -

          The QueryElevationComponentTest has some examples as well. I thought there were tests in other places.

          Show
          Grant Ingersoll added a comment - The QueryElevationComponentTest has some examples as well. I thought there were tests in other places.
          Show
          Ryan McKinley added a comment - See Also: http://wiki.apache.org/solr/DocTransformers http://wiki.apache.org/solr/CommonQueryParameters#fl
          Hide
          Ryan McKinley added a comment -

          There are some tests in: SolrExampleTests.java#testAugmentFields()

          and TestPseudoReturnFields.java

          Show
          Ryan McKinley added a comment - There are some tests in: SolrExampleTests.java #testAugmentFields() and TestPseudoReturnFields.java
          Hide
          Antony Stubbs added a comment -

          Thanks, that's a start. I'm specifically looking to access the highlighting results, but I can't see how I can. The highlighting results are added to the ResponseBuilder.Response directly in HighlightingComponent: rb.rsp.add("highlighting", sumData);. I want to be able to extract that highlighting component and do some processing on it in my DocTransformer..

          Show
          Antony Stubbs added a comment - Thanks, that's a start. I'm specifically looking to access the highlighting results, but I can't see how I can. The highlighting results are added to the ResponseBuilder.Response directly in HighlightingComponent: rb.rsp.add("highlighting", sumData);. I want to be able to extract that highlighting component and do some processing on it in my DocTransformer..
          Hide
          Ryan McKinley added a comment -

          To do highlighting in a Transformer, you may not want/need to use the Highlighting component at all. You could do it directly in the Transformer – this way you have direct access to the output and can avoid building the Map<String,Highlighting Results> that gets added to the response.

          Show
          Ryan McKinley added a comment - To do highlighting in a Transformer, you may not want/need to use the Highlighting component at all. You could do it directly in the Transformer – this way you have direct access to the output and can avoid building the Map<String,Highlighting Results> that gets added to the response.
          Hide
          Antony Stubbs added a comment -

          Thanks Ryan.. I was trying to get access to the highlighting results, rather than run it from the transformer - doesn't seem right. But it could be a start.. I think I saw an access point for that...

          Show
          Antony Stubbs added a comment - Thanks Ryan.. I was trying to get access to the highlighting results, rather than run it from the transformer - doesn't seem right. But it could be a start.. I think I saw an access point for that...
          Hide
          Ryo Hideno added a comment -

          Please do not forget QuerySenderListener.

          Show
          Ryo Hideno added a comment - Please do not forget QuerySenderListener.

            People

            • Assignee:
              Ryan McKinley
              Reporter:
              Noble Paul
            • Votes:
              8 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development