Kafka / KAFKA-595

Decouple producer side compression from server-side compression.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      In 0.7, Kafka always appended messages to the log using whatever compression codec the client used. In 0.8, after the KAFKA-506 patch, the master always recompresses the message before appending it to the log in order to assign ids. Currently the server uses a funky heuristic to choose a compression codec based on the codecs the producer used. This doesn't actually make much sense. It would be better for the server to have its own compression setting (a global default plus a per-topic override) that specifies the compression codec, and have the server always recompress with this codec regardless of the original codec.

      Compression currently happens in kafka.log.Log.assignOffsets (which perhaps should be renamed if it takes on compression as an official responsibility instead of a side effect).
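      The "global default plus per-topic override" this description asks for is the shape Kafka eventually shipped (in KAFKA-1499) as the `compression.type` configuration. A minimal sketch of the broker-side setting, as a server.properties fragment (values here are illustrative, not from this ticket):

      ```properties
      # Broker-wide default in server.properties.
      # "producer" retains whatever codec the producer used;
      # naming a codec (gzip, snappy, lz4, ...) forces the broker
      # to recompress every incoming message set with that codec.
      compression.type=gzip
      ```

      The same `compression.type` key is also accepted as a topic-level config, which provides the per-topic override the description proposes.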

        Activity

        Neha Narkhede created issue -
        Neha Narkhede added a comment -

        >> I would recommend we instead add a log.compression.codec property (plus override map) that controls the compression on the broker.

        Yes, when I said move the compression to the server side, that is what I meant. Also, I filed this JIRA to keep track of an optimization I came across while performance testing; I don't think we should push this into 0.8.

        Neha Narkhede made changes -
        Issue Type: Bug → Improvement
        Jay Kreps made changes -
        Summary: "Producer side compression is unnecessary" → "Decouple producer side compression from server-side compression."
        Description (original value):

        Compression can be used to store something in less space (less I/O) and/or to transfer it less expensively (better use of network bandwidth). Often the two go hand in hand, such as when compressed data is written to a disk: the disk I/O takes less time, since fewer bits are transferred, and the data occupies less storage on the disk afterwards. Unfortunately, the time to compress the data can exceed the savings gained from transferring less data, resulting in overall degradation.

        After KAFKA-506, the network usage gains we used to get by compressing data at the producers are exceeded by the cost of decompressing and re-compressing data on the server side. Compression to save on network costs makes sense either to reduce contention in a wide-area network due to multiple point-to-point connections, or to transfer data efficiently over low-bandwidth networks (cross-DC). For producer-server connections, neither is typically true, which means we might not benefit from producer-side compression at all in most production deployments of Kafka. On the contrary, it might actually hurt performance, since most production deployments turn on compression for all topics.

        The main benefit of compressing data in Kafka is to transfer data efficiently across DCs when setting up mirrored Kafka clusters. The performance benefit also applies to real-time consumers, especially when there are multiple groups of consumers consuming the same topic. If data is compressed on the server side instead, which we do anyway, we get the I/O savings as well as efficient network transfer on the server-consumer links.

        I don't have numbers to quantify the performance impact of re-compression yet, since other changes need to be made before this can be tested correctly.

        Thoughts?
        Description (new value): the current description, as above.
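        The tradeoff the original description reasons about (CPU time spent compressing vs. bytes saved) can be sketched with a toy measurement using the standard-library gzip module; the payload here is synthetic and the numbers will vary by machine:

        ```python
        import gzip
        import time

        # Synthetic, highly repetitive payload: compresses well,
        # so this illustrates the favorable end of the tradeoff.
        payload = b"some log line that repeats often\n" * 10_000

        start = time.perf_counter()
        compressed = gzip.compress(payload, compresslevel=6)
        elapsed = time.perf_counter() - start

        ratio = len(payload) / len(compressed)
        print(f"original={len(payload)}B compressed={len(compressed)}B "
              f"ratio={ratio:.1f}x time={elapsed * 1000:.2f}ms")
        # Compression pays off only if the transfer/storage time saved
        # by the smaller size exceeds this CPU time -- which is exactly
        # why recompressing on the broker can erase producer-side gains.
        ```
        
        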
        Labels: feature features → feature
        Jay Kreps made changes -
        Comment [ I think saying it is unnecessary is perhaps overstating it. It depends on what you are trying to optimize. Compression trades client CPU for network bandwidth. For our own use case, I don't know whether that trade is worth it. It depends on the CPU usage of compression, the compression ratio, and the relative availability of network bandwidth. The CPU usage isn't necessarily fixed: a cheaper compression algorithm than GZIP, plus a little work on the compression code to avoid recopies and deep iteration, could significantly reduce the CPU cost on the broker.

        I would instead rephrase this as a feature request: "Decouple producer compression from broker compression." Since we are going to recompress anyway, this is super easy to implement. Basically, right now we have a kind of odd heuristic which says "if there is at least one compressed message in a given message set, recompress the entire message set using the last compression codec that appears in the message set".

        I would recommend we instead add a log.compression.codec property (plus an override map) that controls the compression on the broker. This could be set the same as the producer's codec, or not. I don't think we necessarily need to support the current behavior of retaining whatever the producer uses; this behavior is actually kind of bad, since it means consumers must support EVERY codec ANY producer happens to send. The broker would always apply the configured compression codec to incoming messages regardless of the source compression format. ]
        Jay Kreps made changes -
        Comment [ I also think this should be a post 0.8 feature. ]
        Manikumar Reddy added a comment -

        Neha Narkhede: support for broker-side compression was added in KAFKA-1499. Since this issue asks for the same thing, I think we can close it.

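        For reference, KAFKA-1499 exposed this as the topic-level `compression.type` config. A per-topic override can be set along these lines (the topic name is illustrative, and this uses the ZooKeeper-era kafka-configs syntax contemporary with this ticket's resolution; it requires a running cluster):

        ```
        bin/kafka-configs.sh --zookeeper localhost:2181 \
          --entity-type topics --entity-name my-topic \
          --alter --add-config compression.type=snappy
        ```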
        Joel Koshy added a comment -

        Yes I think we can close this.

        Joel Koshy made changes -
        Status: Open → Resolved
        Assignee: Manikumar Reddy [omkreddy]
        Resolution: Implemented
        Transition: Open → Resolved
        Time in source status: 806d 22h 39m
        Execution times: 1
        Last executer: Joel Koshy
        Last execution date: 15/Jan/15 16:26

          People

          • Assignee:
            Manikumar Reddy
          • Reporter:
            Neha Narkhede
          • Votes:
            0
          • Watchers:
            4

            Dates

            • Created:
              Updated:
              Resolved:

              Development