Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: replication (java)
    • Labels:
      None

      Description

      From a discussion on the mailing list solr-user, it would be useful to have an option to compress the files sent between servers for replication purposes.

      The files sent between indexes can be compressed by a large margin, allowing for easier replication between sites.

      As noted by Noble Paul:

      We will use gzip on both ends of the pipe. On the slave side you can say <str name="zip">true</str> as an extra option to compress the data sent from the server.
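
      For illustration, a minimal sketch of "gzip on both ends of the pipe" using the JDK's built-in streams (the class and method names are illustrative, not the patch's actual code):

          import java.io.*;
          import java.util.zip.GZIPInputStream;
          import java.util.zip.GZIPOutputStream;

          class GzipPipe {
              // Master side: wrap the raw network stream so index files are gzipped on the way out.
              static OutputStream compress(OutputStream networkOut) throws IOException {
                  return new GZIPOutputStream(networkOut);
              }

              // Slave side: wrap the incoming stream so files are gunzipped before being written to disk.
              static InputStream decompress(InputStream networkIn) throws IOException {
                  return new GZIPInputStream(networkIn);
              }
          }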

      Other thoughts on the issue:

      Do keep in mind that compression is a CPU-intensive process, so it is a trade-off between CPU utilization and network bandwidth. I have seen cases where compressing the data before a network transfer ended up being slower than without compression, because the cost of compression and decompression was more than the gain in network transfer.

      Why invent something when compression is standard in HTTP? --wunder

      Attachments

      1. solr-829.patch
        18 kB
        Shalin Shekhar Mangar
      2. solr-829.patch
        7 kB
        Akshay K. Ukey
      3. solr-829.patch
        5 kB
        Noble Paul
      4. solr-829.patch
        3 kB
        Akshay K. Ukey
      5. email discussion.txt
        6 kB
        Simon Collins


          Activity

          Simon Collins added a comment -

          email thread discussing the issue
          Noble Paul added a comment -

          email thread: http://markmail.org/message/rmxywrgdlnz4vbwe
          Akshay K. Ukey added a comment -

          Patch with the following changes: a zip configuration parameter in the ReplicationHandler (on the slave):

          <str name="zip">true</str>
          

          I have tested it with replication across two data centres and an index size of 1.1G.
          Replication with gzipping took 1012 seconds (about 17 minutes), compared to 1250 seconds (about 21 minutes) without gzipping.

          Shalin Shekhar Mangar added a comment - edited

          Thanks, Akshay!

          Hoss, do you know of a GzipServlet we can borrow? Until that is in place, we can probably go ahead with this patch. Either way, the configuration will not change; only the internal implementation needs to change (the client sending Accept-Encoding in place of zip=true).
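
          As a hedged illustration, the Accept-Encoding variant might look like this with commons-httpclient 3.x (the class and the masterUrl parameter are assumptions for the example, not patch code):

              import org.apache.commons.httpclient.methods.GetMethod;

              class AcceptEncodingExample {
                  // Advertise gzip support via standard HTTP instead of a zip=true parameter.
                  static GetMethod buildRequest(String masterUrl) {
                      GetMethod get = new GetMethod(masterUrl);
                      get.setRequestHeader("Accept-Encoding", "gzip");
                      return get;
                  }
              }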

          Noble Paul added a comment -

          Do we really need zipping for any other response? Assuming that Solr clients mostly make requests from the same LAN, spending CPU to save bandwidth looks like overkill to me (we probably need a performance comparison). If wt==javabin, we may achieve very little compression.

          Ryan McKinley added a comment -

          I have used the GzipFilter from ehcache for a few years and never had any troubles...

          There may be something smaller out there though.

          Re "Do we really need a zipping for any other response?"... The point of using standards based approach is that the client can decide. Essentially we could enable gzip for the whole web application and let each request say if the response should be gzipped.

          Noble Paul added a comment -

          The server (ReplicationHandler) is agnostic of compression. The client (SnapPuller) sets the appropriate header before sending the request. Use an appropriate filter, or front the master with an Apache instance to handle compression.

          Noble Paul added a comment - edited

          After a lot of discussion on SOLR-856, I realize that it is not straightforward to provide a 'container independent' means of compression. We have to document different ways for different containers to ensure that it works properly.

          • How important is it to use HTTP standards to achieve this? Consider the fact that nothing else in the whole solution complies with any standard.
          • For this feature, compression is critical. It can mean huge differences in replication time.
          • I am not very comfortable with complex configuration documentation saying do this if you use Jetty, do this if you use Resin, this for Glassfish, etc.
          • How about giving users both options and letting them choose? This also gives them the flexibility of doing compression only for replication.
          • Power users can use their own favorite configuration to do the compression.
            Something like:
            <lst name="slave">
              <!-- values can be internal|external -->
              <str name="compression">internal</str>
            </lst>
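
            A hedged sketch of how the slave might act on this proposed option (the helper and the way the internal flag reaches the master are assumptions, not committed code):

                import org.apache.commons.httpclient.methods.GetMethod;

                class CompressionMode {
                    // compression = value of the <str name="compression"> slave config option.
                    static void apply(GetMethod request, String compression) {
                        if ("external".equals(compression)) {
                            // Delegate to the container (or an Apache front-end) via standard HTTP.
                            request.setRequestHeader("Accept-Encoding", "gzip,deflate");
                        } else if ("internal".equals(compression)) {
                            // Ask the master to deflate the stream itself, container-independent.
                            request.setQueryString("compression=internal"); // simplified: replaces any existing query string
                        }
                    }
                }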
            
          Hoss Man added a comment -

          Let's keep this issue focused on one thing: making it possible to configure a "slave" Solr instance so that it indicates it can "Accept-Encoding" compressed responses during replication (discussion of what the "master" does with that information is a separate matter).

          From my (naive) reading of the current patch, a few things jump out at me...

          1) The "FastOutputStream" changes in ReplicationHandler look like an unintentional part of the patch.
          2) Why does setting the ZIP option to true disable checksums? I'm not sure when/how checksums are currently computed/compared, but if it can be done with raw I/O streams right now, it can be done with GZIP I/O streams if the response is compressed.
          3) The behavior of checkCompressed doesn't seem right. A Content-Encoding header is used to indicate that the original content has been compressed in order to transfer over HTTP, but the Content-Type header is used to identify the true type of the payload. We shouldn't silently uncompress files just because they happen to have a mime type of "application/x-gzip-compressed". We might be able to get away with it in dealing with replication, but we shouldn't need it (and unless I'm severely mistaken, this will break in the event that gzip content is sent with an additional gzip Content-Encoding).
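
          To make point 3 concrete, a sketch of header-driven decompression on the slave side (commons-httpclient 3.x; the helper class is illustrative, not the patch's code):

              import java.io.IOException;
              import java.io.InputStream;
              import java.util.zip.GZIPInputStream;
              import java.util.zip.InflaterInputStream;
              import org.apache.commons.httpclient.Header;
              import org.apache.commons.httpclient.methods.GetMethod;

              class ResponseStreams {
                  static InputStream getResponseStream(GetMethod method) throws IOException {
                      InputStream body = method.getResponseBodyAsStream();
                      Header encoding = method.getResponseHeader("Content-Encoding");
                      if (encoding == null) return body; // not transfer-compressed
                      String value = encoding.getValue();
                      if ("gzip".equalsIgnoreCase(value)) {
                          return new GZIPInputStream(body); // undo transfer compression only
                      } else if ("deflate".equalsIgnoreCase(value)) {
                          return new InflaterInputStream(body);
                      }
                      // A gzip *file* (Content-Type application/x-gzip) is left untouched.
                      return body;
                  }
              }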

          Shalin Shekhar Mangar added a comment -
          1. Yes, the constructor call is moved to a different line, that's all.
          2. We disable the checksum because GZIP does checksums internally, so we do not need to do it again. However, deflate does not use checksums, so when we use InflaterInputStream we should do checksums ourselves; this is not in the patch right now (a sketch follows below).
          3. That code is copied verbatim from CommonsHttpSolrServer. In this case, if we are getting a compressed stream from the master, it should be decompressed and written to the filesystem as-is. We do not need to worry about the type of the response; this patch is only for this particular use-case.
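
          As a hedged sketch of the missing deflate checksum (the class name and the choice of Adler-32 are assumptions for illustration):

              import java.io.IOException;
              import java.io.InputStream;
              import java.util.zip.Adler32;
              import java.util.zip.CheckedInputStream;
              import java.util.zip.InflaterInputStream;

              class ChecksummedInflate {
                  // Inflate the stream while checksumming the decompressed bytes.
                  static long inflateAndChecksum(InputStream compressed) throws IOException {
                      CheckedInputStream in = new CheckedInputStream(
                          new InflaterInputStream(compressed), new Adler32());
                      byte[] buf = new byte[4096];
                      while (in.read(buf) != -1) {
                          // a real caller would write the bytes to the index file here
                      }
                      return in.getChecksum().getValue(); // compare with the master's value
                  }
              }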

          I don't think this patch is in sync with Noble's latest proposal. A new one will be needed.

          Akshay K. Ukey added a comment -

          Patch with additional configuration in the ReplicationHandler, as suggested by Noble:

          <lst name="slave">
            <!-- values can be internal|external. --> 
            <str name="compression">internal</str>
          </lst>
          

          If internal compression is used, the InflaterInputStream and DeflaterOutputStream Java APIs are used for the data transfer from master to slaves.

          If external compression is used, the Accept-Encoding header value is set to "gzip,deflate" before making the request to the master, and the container has to be configured appropriately. For example, in the case of Tomcat, the following settings in the Connector element enable Tomcat's compression mechanism:

          <Connector .... compression="on"
          compressableMimeType="application/octet-stream,text/html,text/xml,text/plain"
          compressionMinSize="somePreferredValue"/>
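
          A minimal sketch of the internal mode described above, under the assumption that master and slave simply wrap their streams (the helper names are illustrative):

              import java.io.*;
              import java.util.zip.DeflaterOutputStream;
              import java.util.zip.InflaterInputStream;

              class InternalCompression {
                  // Master side: deflate an index file into the HTTP response stream.
                  static void send(File indexFile, OutputStream httpOut) throws IOException {
                      InputStream in = new FileInputStream(indexFile);
                      DeflaterOutputStream out = new DeflaterOutputStream(httpOut);
                      byte[] buf = new byte[4096];
                      int n;
                      while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
                      out.finish(); // write the deflate trailer without closing the socket
                      in.close();
                  }

                  // Slave side: inflate the response stream into a local file.
                  static void receive(InputStream httpIn, File localFile) throws IOException {
                      InputStream in = new InflaterInputStream(httpIn);
                      OutputStream out = new FileOutputStream(localFile);
                      byte[] buf = new byte[4096];
                      int n;
                      while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
                      out.close();
                  }
              }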
          
          Shalin Shekhar Mangar added a comment -

          Changes:

          1. Fixes a possible connection leak in the FileFetcher#getStream method.
          2. A single HttpClient is created with MultiThreadedHttpConnectionManager in SnapPuller and is re-used for every operation (see the sketch after this list).
          3. Idle connections are closed in the SnapPuller#destroy method.
          4. Releasing the connection and closing the stream are no longer done in a separate thread.
          5. ReplicationHandler runs the snappull command in a new thread so that an API call for this command is not kept waiting until the operation completes. The admin JSP, which used to call this method in another thread, is also changed to remove the thread creation.
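
          A hedged sketch of change #2 (commons-httpclient 3.x; the wrapper class is illustrative, not the committed SnapPuller code):

              import org.apache.commons.httpclient.HttpClient;
              import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;

              class SharedClient {
                  // One thread-safe connection manager and client, re-used for every pull.
                  private static final MultiThreadedHttpConnectionManager manager =
                      new MultiThreadedHttpConnectionManager();
                  static final HttpClient CLIENT = new HttpClient(manager);

                  // Roughly what a destroy() hook would do: drop idle connections.
                  static void shutdown() {
                      manager.closeIdleConnections(0);
                      manager.shutdown();
                  }
              }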

          I'll commit this in a day or two if there are no problems.

          Shalin Shekhar Mangar added a comment -

          Committed revision 720502.

          Thanks Simon, Noble, Hoss and Akshay!


            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Simon Collins
            • Votes:
              0
              Watchers:
              1
