Solr
  1. Solr
  2. SOLR-486

Support binary formats for QueryresponseWriter

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: clients - java, search
    • Labels:
      None

      Description

      QueryResponse writer only allows text data to be written.

      So it is not possible to implement a binary protocol . Create another interface which has a method
      write(OutputStream os, SolrQueryRequest request, SolrQueryResponse response)

      1. SOLR-486-iterator.patch
        3 kB
        Noble Paul
      2. SOLR-486-iterator.patch
        4 kB
        Noble Paul
      3. SOLR-486.patch
        2 kB
        Noble Paul
      4. SOLR-486.patch
        3 kB
        Noble Paul
      5. SOLR-486.patch
        29 kB
        Noble Paul
      6. SOLR-486.patch
        30 kB
        Noble Paul
      7. SOLR-486.patch
        32 kB
        Yonik Seeley
      8. SOLR-486.patch
        31 kB
        Noble Paul
      9. SOLR-486.patch
        61 kB
        Yonik Seeley
      10. SOLR-486.patch
        69 kB
        Yonik Seeley
      11. SOLR-486.patch
        7 kB
        Noble Paul
      12. solr-486.patch
        6 kB
        Noble Paul
      13. optimizemap.patch
        0.7 kB
        Noble Paul

        Activity

        Hide
        Noble Paul added a comment -

        Just the way NamedList keys can be externalized, Map keys can also be externalized.And this is backward compatible.

        Maps are not used very commonly in SOLR. but SOLR-561 uses maps for master-slave communication

        Show
        Noble Paul added a comment - Just the way NamedList keys can be externalized, Map keys can also be externalized.And this is backward compatible. Maps are not used very commonly in SOLR. but SOLR-561 uses maps for master-slave communication
        Hide
        Yonik Seeley added a comment -

        OK, I committed efficient Iterator/Iterable support. They return lists when unserialized.

        Show
        Yonik Seeley added a comment - OK, I committed efficient Iterator/Iterable support. They return lists when unserialized.
        Hide
        Yonik Seeley added a comment -

        I'm working on a fix.

        Show
        Yonik Seeley added a comment - I'm working on a fix.
        Hide
        Yonik Seeley added a comment -

        Just because an Iterator was used for writing a value does not mean that it should be reconstituted on the other side as an Iterator. It seems like the type should still be an array, but that if a length isn't available then there should be an alternate way to serialize.

        Show
        Yonik Seeley added a comment - Just because an Iterator was used for writing a value does not mean that it should be reconstituted on the other side as an Iterator. It seems like the type should still be an array, but that if a length isn't available then there should be an alternate way to serialize.
        Hide
        Shalin Shekhar Mangar added a comment -

        Yonik, can we commit Noble's latest patch?

        Show
        Shalin Shekhar Mangar added a comment - Yonik, can we commit Noble's latest patch?
        Hide
        Noble Paul added a comment -

        name in namedlist written as extern string

        Show
        Noble Paul added a comment - name in namedlist written as extern string
        Hide
        Noble Paul added a comment - - edited

        That's a good idea - but do we actually use any iterator or Iterable objects?

        We do not have it now . But NamedListCodec has a public method called writeIterator() . If we leave it like that somebody may actually use it. Iterators are used to stream objects one by one and the current method implementation does not allow it.

        Show
        Noble Paul added a comment - - edited That's a good idea - but do we actually use any iterator or Iterable objects? We do not have it now . But NamedListCodec has a public method called writeIterator() . If we leave it like that somebody may actually use it. Iterators are used to stream objects one by one and the current method implementation does not allow it.
        Hide
        Yonik Seeley added a comment -

        So if I understand the reasoning behind the "iterator" type, it doesn't have a length up-front and instead has a terminator.... the idea being that you don't have to iterate over the entire list and buffer it, just to figure out what the length is, right?

        That's a good idea - but do we actually use any iterator or Iterable objects?

        Show
        Yonik Seeley added a comment - So if I understand the reasoning behind the "iterator" type, it doesn't have a length up-front and instead has a terminator.... the idea being that you don't have to iterate over the entire list and buffer it, just to figure out what the length is, right? That's a good idea - but do we actually use any iterator or Iterable objects?
        Hide
        Noble Paul added a comment -

        a better one.

        (I guess the approach is better even for the other collection types also. )

        Show
        Noble Paul added a comment - a better one. (I guess the approach is better even for the other collection types also. )
        Hide
        Noble Paul added a comment -

        Iterator was being treated like an array which is not optimal

        On a different note .should we use writeExternString() to write all names in (NamedList, Map)

        Show
        Noble Paul added a comment - Iterator was being treated like an array which is not optimal On a different note .should we use writeExternString() to write all names in (NamedList, Map)
        Hide
        Yonik Seeley added a comment -

        Fixed (committed) the constructor inconsistency.

        Show
        Yonik Seeley added a comment - Fixed (committed) the constructor inconsistency.
        Hide
        Noble Paul added a comment -

        These two constructors are inconsistent

        CommonsHttpSolrServer.java
        public CommonsHttpSolrServer(URL baseURL) 
          {
            this(baseURL, null, new BinaryResponseParser());
          }
        
          public CommonsHttpSolrServer(URL baseURL, HttpClient client){
            this(baseURL, client, new XMLResponseParser());
          }
        
        Show
        Noble Paul added a comment - These two constructors are inconsistent CommonsHttpSolrServer.java public CommonsHttpSolrServer(URL baseURL) { this (baseURL, null , new BinaryResponseParser()); } public CommonsHttpSolrServer(URL baseURL, HttpClient client){ this (baseURL, client, new XMLResponseParser()); }
        Hide
        Yonik Seeley added a comment -

        OK, I just committed the latest changes.

        Show
        Yonik Seeley added a comment - OK, I just committed the latest changes.
        Hide
        Noble Paul added a comment - - edited

        Another level of efficiency can be brought in by preloading the string table with well known strings like responseHeader , QTime etc . That did not look very elegant to me.

        The sliding window approach is also good . But we do not have too many repeated strings unless we use highlighting etc .

        I guess this can go into trunk and give it enough time to 'settle' before the release.

        Show
        Noble Paul added a comment - - edited Another level of efficiency can be brought in by preloading the string table with well known strings like responseHeader , QTime etc . That did not look very elegant to me. The sliding window approach is also good . But we do not have too many repeated strings unless we use highlighting etc . I guess this can go into trunk and give it enough time to 'settle' before the release.
        Hide
        Yonik Seeley added a comment -

        Thanks Noble, this looks pretty good.

        I had previously considered caching strings via some kind of sliding window... keep track of the last 100 or so string values written under some certain length, and then if you see a string again in that window, write a reference (an index that says how many values ago it was seen).

        For Solr responses in general, it seems like the main duplication will be in field names (which you have taken care of). The only other duplication I can think of would be the "id" field values (used as a key in other maps such as highlighting), and any duplication that is custom to the collection (such as string values for a type field, etc).

        Thoughts? I'd be happy to commit this version, or give you time to try out an alternative if you think it might be worth it (but I don't currently have time myself to implement the alternative).

        Show
        Yonik Seeley added a comment - Thanks Noble, this looks pretty good. I had previously considered caching strings via some kind of sliding window... keep track of the last 100 or so string values written under some certain length, and then if you see a string again in that window, write a reference (an index that says how many values ago it was seen). For Solr responses in general, it seems like the main duplication will be in field names (which you have taken care of). The only other duplication I can think of would be the "id" field values (used as a key in other maps such as highlighting), and any duplication that is custom to the collection (such as string values for a type field, etc). Thoughts? I'd be happy to commit this version, or give you time to try out an alternative if you think it might be worth it (but I don't currently have time myself to implement the alternative).
        Hide
        Noble Paul added a comment -

        This include changes for making Binary format the default for SolrJ and the changes for optimized write of Field names in Documents. So , for a response with 5 fields and 10 records only 5 names are written instead of 50. there is an overhead of an extra byte per unique string (total 5 bytes in this case)

        Show
        Noble Paul added a comment - This include changes for making Binary format the default for SolrJ and the changes for optimized write of Field names in Documents. So , for a response with 5 fields and 10 records only 5 names are written instead of 50. there is an overhead of an extra byte per unique string (total 5 bytes in this case)
        Hide
        Noble Paul added a comment - - edited

        If we take a look at the data that is written down by NamedListCodec there are a lot of "names" which are repeated. If we could avoid the repetitions we can achieve better optimization.
        Can we have another type EXTERN_STRING
        The NamedListCodec maintains a Map<String,Integer> of EXTERN_STRING vs index as it is written out. When the same string is written it checks up in the List whether it already has a reference.

        While decoding all the EXTERN_STRING values are copied into a List <String>. When an EXTERN_STRING with an index comes it is copied from the List.

        NamedListCodec.java
          private int stringsCount  =  0;
          private Map<String,Integer> stringsMap;
          private List<String > stringsList;
          public void writeExternString(String s) throws IOException {
            if(s == null) {
              writeTag(NULL) ;
              return;
            }
            Integer idx = stringsMap == null ? null : stringsMap.get(s);
            if(idx == null) idx =0;
            writeTag(EXTERN_STRING,idx);
            if(idx == 0){
              writeStr(s);
              if(stringsMap == null) stringsMap = new HashMap<String, Integer>();
              stringsMap.put(s,++stringsCount);
            }
        
          }
          public String  readExternString(FastInputStream fis) throws IOException {
            int idx = readSize(fis);
            if (idx != 0) {// idx != 0 is the index of the extern string
              return stringsList.get(idx-1);
            } else {// idx == 0 means it has a string value
              String s = (String) readVal(fis);
              if(stringsList == null ) stringsList = new ArrayList<String>();
              stringsList.add(s);
              return s;
            }
          }
        
        Show
        Noble Paul added a comment - - edited If we take a look at the data that is written down by NamedListCodec there are a lot of "names" which are repeated. If we could avoid the repetitions we can achieve better optimization. Can we have another type EXTERN_STRING The NamedListCodec maintains a Map<String,Integer> of EXTERN_STRING vs index as it is written out. When the same string is written it checks up in the List whether it already has a reference. While decoding all the EXTERN_STRING values are copied into a List <String>. When an EXTERN_STRING with an index comes it is copied from the List. NamedListCodec.java private int stringsCount = 0; private Map< String , Integer > stringsMap; private List< String > stringsList; public void writeExternString( String s) throws IOException { if (s == null ) { writeTag(NULL) ; return ; } Integer idx = stringsMap == null ? null : stringsMap.get(s); if (idx == null ) idx =0; writeTag(EXTERN_STRING,idx); if (idx == 0){ writeStr(s); if (stringsMap == null ) stringsMap = new HashMap< String , Integer >(); stringsMap.put(s,++stringsCount); } } public String readExternString(FastInputStream fis) throws IOException { int idx = readSize(fis); if (idx != 0) { // idx != 0 is the index of the extern string return stringsList.get(idx-1); } else { // idx == 0 means it has a string value String s = ( String ) readVal(fis); if (stringsList == null ) stringsList = new ArrayList< String >(); stringsList.add(s); return s; } }
        Hide
        Noble Paul added a comment -

        changes required to make binary format default in solrJ

        Show
        Noble Paul added a comment - changes required to make binary format default in solrJ
        Hide
        Noble Paul added a comment -

        I debugged through the problems

        yonik: NamedlistCodec is agnostic of the request params. So wherever it has to write SolrDocumentList or SolrDocument it writes down whatever is available.

        Cases where the maxScore is not asked for and the value is present in SolrDocumentList it is writtten down.

        May be we should ignore it if 'maxScore' is available without being asked for

        'id' are also written down in the same fashion.

        Should it be handed over to BinaryResponseWriter to write the values so that the context is available?

        Show
        Noble Paul added a comment - I debugged through the problems yonik: NamedlistCodec is agnostic of the request params. So wherever it has to write SolrDocumentList or SolrDocument it writes down whatever is available. Cases where the maxScore is not asked for and the value is present in SolrDocumentList it is writtten down. May be we should ignore it if 'maxScore' is available without being asked for 'id' are also written down in the same fashion. Should it be handed over to BinaryResponseWriter to write the values so that the context is available?
        Hide
        Noble Paul added a comment -

        I don't see how this would help. If the ResponseParser for CommonsHttpSolrServer is the binary one, it will send the parameter: wt=javabin – in a 1.2 setup, that should give an error.

        if the wt= javabin is absent the server deafults to 'xml' so the response format will be xml for Solr 1.2. However ,it is important that we make the default format to binary before the 1.3 release.

        Documeting this behavior is good enough.

        Show
        Noble Paul added a comment - I don't see how this would help. If the ResponseParser for CommonsHttpSolrServer is the binary one, it will send the parameter: wt=javabin – in a 1.2 setup, that should give an error. if the wt= javabin is absent the server deafults to 'xml' so the response format will be xml for Solr 1.2. However ,it is important that we make the default format to binary before the 1.3 release. Documeting this behavior is good enough.
        Hide
        Ryan McKinley added a comment -

        or we will have to modify the CommonsHttpSolrServer to fallback to XMLResponseParser if content type header is text/xml. (This is more elegant solution)

        I don't see how this would help. If the ResponseParser for CommonsHttpSolrServer is the binary one, it will send the parameter: wt=javabin – in a 1.2 setup, that should give an error.

        I think documenting that to use this for 1.2 you should set the XMLResponseParser in the constructor is good enough.

        Show
        Ryan McKinley added a comment - or we will have to modify the CommonsHttpSolrServer to fallback to XMLResponseParser if content type header is text/xml. (This is more elegant solution) I don't see how this would help. If the ResponseParser for CommonsHttpSolrServer is the binary one, it will send the parameter: wt=javabin – in a 1.2 setup, that should give an error. I think documenting that to use this for 1.2 you should set the XMLResponseParser in the constructor is good enough.
        Hide
        Noble Paul added a comment -

        one test failed in distributed search results when I changed the the default format to binary

        <testcase classname="org.apache.solr.TestDistributedSearch" name="testDistribSearch" time="5.406">
            <failure message=".response.maxScore:4321.0!=null" type="junit.framework.AssertionFailedError">junit.framework.AssertionFailedError: .response.maxScore:4321.0!=null
        	at org.apache.solr.TestDistributedSearch.compareResponses(TestDistributedSearch.java:416)
        	at org.apache.solr.TestDistributedSearch.query(TestDistributedSearch.java:192)
        	at org.apache.solr.TestDistributedSearch.doTest(TestDistributedSearch.java:456)
        	at org.apache.solr.TestDistributedSearch.testDistribSearch(TestDistributedSearch.java:427)
        </failure>
          </testcase>
        

        I could not figure out the reason

        Show
        Noble Paul added a comment - one test failed in distributed search results when I changed the the default format to binary <testcase classname= "org.apache.solr.TestDistributedSearch" name= "testDistribSearch" time= "5.406" > <failure message= ".response.maxScore:4321.0!=null" type= "junit.framework.AssertionFailedError" > junit.framework.AssertionFailedError: .response.maxScore:4321.0!=null at org.apache.solr.TestDistributedSearch.compareResponses(TestDistributedSearch.java:416) at org.apache.solr.TestDistributedSearch.query(TestDistributedSearch.java:192) at org.apache.solr.TestDistributedSearch.doTest(TestDistributedSearch.java:456) at org.apache.solr.TestDistributedSearch.testDistribSearch(TestDistributedSearch.java:427) </failure> </testcase> I could not figure out the reason
        Hide
        Noble Paul added a comment -

        +1
        It is OK to make it default.
        But we will have to document properly that users with Solr1.2 must explicitly use XMLResponseParser or they will see failure.

        or we will have to modify the CommonsHttpSolrServer to fallback to XMLResponseParser if content type header is text/xml. (This is more elegant solution)

        Show
        Noble Paul added a comment - +1 It is OK to make it default. But we will have to document properly that users with Solr1.2 must explicitly use XMLResponseParser or they will see failure. or we will have to modify the CommonsHttpSolrServer to fallback to XMLResponseParser if content type header is text/xml. (This is more elegant solution)
        Hide
        Erik Hatcher added a comment -

        +1

        Show
        Erik Hatcher added a comment - +1
        Hide
        Ryan McKinley added a comment -

        Should we make this the default parser for CommonsHttpSolrServer?
        I think so...

        Show
        Ryan McKinley added a comment - Should we make this the default parser for CommonsHttpSolrServer? I think so...
        Hide
        Yonik Seeley added a comment -

        Committed. Thanks Noble!

        Show
        Yonik Seeley added a comment - Committed. Thanks Noble!
        Hide
        Yonik Seeley added a comment -

        OK, latest changes.

        • SimpleOrderedMap type preserved (and I fixed some other recent code that accidentally used NamedList instead of SimpleOrderedMap).
        • changed resolver interface to pass codec

        > I guess one test is failing in TestDistributedSearch

        Are you still seeing this?

        Show
        Yonik Seeley added a comment - OK, latest changes. SimpleOrderedMap type preserved (and I fixed some other recent code that accidentally used NamedList instead of SimpleOrderedMap). changed resolver interface to pass codec > I guess one test is failing in TestDistributedSearch Are you still seeing this?
        Hide
        Yonik Seeley added a comment -

        Ian: you can sort of figure out what the total overhead would be given your max query rate.
        For example, XML had an encode rate (for this particular test) of 627 messages per second (on my 3GHz P4).
        If you wanted 100qps, XML encoding would be taking up 16% (100/627)

        Note that the overhead is particularly important for distributed search, given the number of messages that must be sent+received, and also the limited network bandwidth it must be done in.

        Show
        Yonik Seeley added a comment - Ian: you can sort of figure out what the total overhead would be given your max query rate. For example, XML had an encode rate (for this particular test) of 627 messages per second (on my 3GHz P4). If you wanted 100qps, XML encoding would be taking up 16% (100/627) Note that the overhead is particularly important for distributed search, given the number of messages that must be sent+received, and also the limited network bandwidth it must be done in.
        Hide
        Noble Paul added a comment -

        Yonik: it would be nice the change the interface ObjectResolver as follows

        public static interface ObjectResolver

        { public Object resolve(Object o, NamedListCodec codec) throws IOException; }
        Show
        Noble Paul added a comment - Yonik: it would be nice the change the interface ObjectResolver as follows public static interface ObjectResolver { public Object resolve(Object o, NamedListCodec codec) throws IOException; }
        Hide
        Noble Paul added a comment -

        If the response not served from cached there is some extra processing , in which case the share of marshalling/unmarshalling is lower. In cases where the response is given out of a cache it is probably the only CPU intensive activity.

        Overall performance also involves the network latency . It is harder to count that because it is a very subjective number

        Show
        Noble Paul added a comment - If the response not served from cached there is some extra processing , in which case the share of marshalling/unmarshalling is lower. In cases where the response is given out of a cache it is probably the only CPU intensive activity. Overall performance also involves the network latency . It is harder to count that because it is a very subjective number
        Hide
        Ian Holsman added a comment -

        i'm a bit clueless here.

        you mention that it is 9 times faster. how big a component is encoding/decoding in the general scheme of things? (ie compared to actually getting the results)

        Show
        Ian Holsman added a comment - i'm a bit clueless here. you mention that it is 9 times faster. how big a component is encoding/decoding in the general scheme of things? (ie compared to actually getting the results)
        Hide
        Noble Paul added a comment -

        The changes look good , infact very good. it is highly optimized.

        You already incorporated the Document changes which i added

        The writeUTF bug was a very critical one which I noticed after cutting the patch.Anyway you fixed it.

        I guess one test is failing in TestDistributedSearch
        unit.framework.AssertionFailedError: .response.maxScore:4321.0!=null
        at org.apache.solr.TestDistributedSearch.compareResponses(TestDistributedSearch.java:416)
        .....

        need to take a look

        I vote vote for preserving SimpleOrderedMap.(anyway there is are two reserved types one can be taken up)

        Makee sense to use this in SolrJ . But there are people using SolrJ with Solr1.2 and older versions of Solr1.3 in production. We must take care of that in the code.

        send wt=javabin
        read the response header, if the content type is application/octet-steram use BinaryResponseParser or use XMLResponseparser

        Show
        Noble Paul added a comment - The changes look good , infact very good. it is highly optimized. You already incorporated the Document changes which i added The writeUTF bug was a very critical one which I noticed after cutting the patch.Anyway you fixed it. I guess one test is failing in TestDistributedSearch unit.framework.AssertionFailedError: .response.maxScore:4321.0!=null at org.apache.solr.TestDistributedSearch.compareResponses(TestDistributedSearch.java:416) ..... need to take a look I vote vote for preserving SimpleOrderedMap.(anyway there is are two reserved types one can be taken up) Makee sense to use this in SolrJ . But there are people using SolrJ with Solr1.2 and older versions of Solr1.3 in production. We must take care of that in the code. send wt=javabin read the response header, if the content type is application/octet-steram use BinaryResponseParser or use XMLResponseparser
        Hide
        Yonik Seeley added a comment -

        OK, latest version attached.

        The bad:

        • more code, a more complex codec, harder to debug with a hex editor

        The good:

        • 2/3 the size of the previous binary codec (50% improvement and 1/3 the size of XML)
        • certain common types like named list, array, and string pack a size right in the tag byte
        • variable length int and long values.. saves lots of space for facet counts, etc (first 5
          bits of vint are packed in the tag byte)
        • performance results: (9 times faster than xml, encoding + decoding)
          writer xml, size=6885, encodeRate=627, decodeRate=883
          writer json, size=3556, encodeRate=3084, decodeRate=N/A
          writer javabin-v1, size=3170
          writer javabin-current, size=2153, encodeRate=6095, decodeRate=8695
        • performance changes:
        • FastInputStream, FastOutputStream: unsynchronized buffered streams
        • toObject() changes to sortable numerics - bypass string and go straight to numeric type
        • tons of other little things, including everything that was used to decrease the size
        • string values can now be > 64K (Java writeUTF is limited to that size)
        • memory use changes: only "explode" one doc at a time in a doclist
        • testing changes: more thorough random testing... (good thing too - found a long encoding bug)

        Noble: I started from your second-to-last patch (we overlapped in the Document fixes...) I diffed your patches and I believe I saw everything.

        TODO: preserve SimpleOrderedMap vs NamedList?
        TODO: make SolrJ default?

        Show
        Yonik Seeley added a comment - OK, latest version attached. The bad: more code, a more complex codec, harder to debug with a hex editor The good: 2/3 the size of the previous binary codec (50% improvement and 1/3 the size of XML) certain common types like named list, array, and string pack a size right in the tag byte variable length int and long values.. saves lots of space for facet counts, etc (first 5 bits of vint are packed in the tag byte) performance results: (9 times faster than xml, encoding + decoding) writer xml, size=6885, encodeRate=627, decodeRate=883 writer json, size=3556, encodeRate=3084, decodeRate=N/A writer javabin-v1, size=3170 writer javabin-current, size=2153, encodeRate=6095, decodeRate=8695 performance changes: FastInputStream, FastOutputStream: unsynchronized buffered streams toObject() changes to sortable numerics - bypass string and go straight to numeric type tons of other little things, including everything that was used to decrease the size string values can now be > 64K (Java writeUTF is limited to that size) memory use changes: only "explode" one doc at a time in a doclist testing changes: more thorough random testing... (good thing too - found a long encoding bug) Noble: I started from your second-to-last patch (we overlapped in the Document fixes...) I diffed your patches and I believe I saw everything. TODO: preserve SimpleOrderedMap vs NamedList? TODO: make SolrJ default?
        Hide
        Yonik Seeley added a comment -

        Yep, those are about the numbers I'm seeing for text-heavy documents + faceting on 2 text fields and 1 integer field + highlighting (response was only generated once, but continually reused). I think I've got some good improvements in the queue that I can attach soon.

        Show
        Yonik Seeley added a comment - Yep, those are about the numbers I'm seeing for text-heavy documents + faceting on 2 text fields and 1 integer field + highlighting (response was only generated once, but continually reused). I think I've got some good improvements in the queue that I can attach soon.
        Hide
        Noble Paul added a comment -

        performance numbers
        =====================
        bin vs. xml read performance : ~3 times faster
        bin vs. xml write performance : ~2.5 times faster
        datasize : 35% less
        ======================
        The test:
        The same operation is performed 100 times and an average is taken. Before the execution, a warmup run is performed.

        The output of a faceted query is taken out (xml).The size of the xml is 15.7KB. It is loaded into memory and unmarshalled using XMLResponseParser . Then the same object is written to xml using XMLWriter

        The same dataobject is marshalled to byte[](binary) using NamedListCodec. The datasize is 10.47KB . Then it is unmarshalled using NamedListCodec

        Show
        Noble Paul added a comment - performance numbers ===================== bin vs. xml read performance : ~3 times faster bin vs. xml write performance : ~2.5 times faster datasize : 35% less ====================== The test: The same operation is performed 100 times and an average is taken. Before the execution, a warmup run is performed. The output of a faceted query is taken out (xml).The size of the xml is 15.7KB. It is loaded into memory and unmarshalled using XMLResponseParser . Then the same object is written to xml using XMLWriter The same dataobject is marshalled to byte[](binary) using NamedListCodec. The datasize is 10.47KB . Then it is unmarshalled using NamedListCodec
        Hide
        Yonik Seeley added a comment -

        I'm hacking on this now... I like how you separated the dependency on lucene-specific stuff via the resolver. The problem is that we lose streaming ability for doc lists... if someone requests 1000 documents or whatever, everything is blown up in memory which could cause an OOM. I'm adding the ability for the resolver to call back to the codec... not as nicely separated, but better results.

        Show
        Yonik Seeley added a comment - I'm hacking on this now... I like how you separated the dependency on lucene-specific stuff via the resolver. The problem is that we lose streaming ability for doc lists... if someone requests 1000 documents or whatever, everything is blown up in memory which could cause an OOM. I'm adding the ability for the resolver to call back to the codec... not as nicely separated, but better results.
        Hide
        Noble Paul added a comment -

        There were bugs in the previous patch and it was not writing the SolrDocumentList properly. TestDistributedSearch was failing.
        With this patch those are fixed.

        Show
        Noble Paul added a comment - There were bugs in the previous patch and it was not writing the SolrDocumentList properly. TestDistributedSearch was failing. With this patch those are fixed.
        Hide
        Yonik Seeley added a comment -

        Revised patch that switches distributed search to use the binary format.
        Currently fails the distributed search tests though.

        Show
        Yonik Seeley added a comment - Revised patch that switches distributed search to use the binary format. Currently fails the distributed search tests though.
        Hide
        Noble Paul added a comment -

        Added a version byte in the beginning.
        Types, Byte and Short were missing

        Show
        Noble Paul added a comment - Added a version byte in the beginning. Types, Byte and Short were missing
        Hide
        Noble Paul added a comment -

        This contains a very simple binary format implementation using DataOutputStream/DataInputStream

        The class that implements this functionality is NamedListCodec which mustbe used on both end of the pipe

        It can marshal/unmarshal a NamedList to/from a stream.

        It supports all the primitives and List/Map/NamedList/SolrDocument/SolrDocumentList

        There is a class BinaryResponseWriter implements BinaryQueryResponseWriter

        which converts the lucene DocList to SolrDocumentList on demand

        A class BinaryResponseParser extends ResponseParser

        is also added to solrj
        which can be used with the CommonsHttpSolrServer

        Show
        Noble Paul added a comment - This contains a very simple binary format implementation using DataOutputStream/DataInputStream The class that implements this functionality is NamedListCodec which mustbe used on both end of the pipe It can marshal/unmarshal a NamedList to/from a stream. It supports all the primitives and List/Map/NamedList/SolrDocument/SolrDocumentList There is a class BinaryResponseWriter implements BinaryQueryResponseWriter which converts the lucene DocList to SolrDocumentList on demand A class BinaryResponseParser extends ResponseParser is also added to solrj which can be used with the CommonsHttpSolrServer
        Hide
        Yonik Seeley added a comment -

        I accidentally committed this in conjunction with another patch.
        I think I'll just leave it committed though, since it looks fine.
        We can continue to use this JIRA issue for the actual binary protocol implementation though.

        Show
        Yonik Seeley added a comment - I accidentally committed this in conjunction with another patch. I think I'll just leave it committed though, since it looks fine. We can continue to use this JIRA issue for the actual binary protocol implementation though.
        Hide
        Yonik Seeley added a comment -

        Great! If we can get it to handle everything that the current XML handler can handle, then we could use it by default for distributed search.

        Show
        Yonik Seeley added a comment - Great! If we can get it to handle everything that the current XML handler can handle, then we could use it by default for distributed search.
        Hide
        Noble Paul added a comment -

        I can do that.
        I have a java class which can serialize/deserialize a NamedList using
        which I used to send response and deserialize it back
        I can post it as well
        --Noble


        --Noble Paul

        Show
        Noble Paul added a comment - I can do that. I have a java class which can serialize/deserialize a NamedList using which I used to send response and deserialize it back I can post it as well --Noble – --Noble Paul
        Hide
        Yonik Seeley added a comment -

        This patch is small enough, perhaps just combine it with the patch that implements a specific binary format.

        Show
        Yonik Seeley added a comment - This patch is small enough, perhaps just combine it with the patch that implements a specific binary format.
        Hide
        Noble Paul added a comment -

        No changes . Just synchronizing with other code changes.

        This is a very useful option for users who wish to implement a binary format to improve the performance. (I plan to contribute one as soon as this is committed)

        Currently the java clients go though the xml response format which can eat up some time in unmarshalling . It can be quite significant if the document size is large (take the case of facet requests)

        I have an xml reponse which took around 30ms for unmarshalling . Binary format would have taken less than 5 ms.

        Show
        Noble Paul added a comment - No changes . Just synchronizing with other code changes. This is a very useful option for users who wish to implement a binary format to improve the performance. (I plan to contribute one as soon as this is committed) Currently the java clients go though the xml response format which can eat up some time in unmarshalling . It can be quite significant if the document size is large (take the case of facet requests) I have an xml reponse which took around 30ms for unmarshalling . Binary format would have taken less than 5 ms.
        Hide
        Noble Paul added a comment -

        This patch can add support for binary formats.

        Show
        Noble Paul added a comment - This patch can add support for binary formats.
        Hide
        Noble Paul added a comment -

        Without breaking the existing stuff we can add another interface

        BinaryQueryResponse extends QueryResponseWriter{
        public void write(OutputStream out, SolrQueryRequest request,SolrQueryResponse response) throws IOException;

        and in the SolrDispatchFilter add the following linesQueryResponseWriter responseWriter = core.getQueryResponseWriter(solrReq);

        if (responseWriter instanceof BinaryQueryResponse )

        { BinaryQueryResponse binaryResp = (Object) responseWriter; binaryResp.write(response.getOutputStream(), solrReq, solrRsp); }

        else

        { responseWriter.write(response.getWriter(), solrReq, solrRsp); }
        Show
        Noble Paul added a comment - Without breaking the existing stuff we can add another interface BinaryQueryResponse extends QueryResponseWriter{ public void write(OutputStream out, SolrQueryRequest request,SolrQueryResponse response) throws IOException; and in the SolrDispatchFilter add the following linesQueryResponseWriter responseWriter = core.getQueryResponseWriter(solrReq); if (responseWriter instanceof BinaryQueryResponse ) { BinaryQueryResponse binaryResp = (Object) responseWriter; binaryResp.write(response.getOutputStream(), solrReq, solrRsp); } else { responseWriter.write(response.getWriter(), solrReq, solrRsp); }

          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Noble Paul
          • Votes:
            2 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development