Thanks for pointing us to Snappy. I took a brief look at the Snappy benchmarks, and it does look promising to me. As Jay mentioned, GZIP buys us increased throughput and better utilization of network bandwidth, thanks to its relatively high compression ratio. However, its decompression cost, in terms of both TPS and CPU usage, is far from negligible.

According to preliminary Kafka compression performance benchmarks, with a fetch size of 1MB, consumer throughput doubled when consuming a GZIP-compressed topic, but once the consumer is fully caught up, its CPU usage is ~45%, compared to ~12% for the same consumer consuming uncompressed data. On the producer side, for a batch size of 200 and a message size of 200, throughput when producing compressed data is half the throughput when producing uncompressed data. That is the cost of compression with GZIP.

While this cost is tolerable for inter-DC replication, we could do better for more real-time applications that care more about TPS than about compression ratio. I see Snappy fitting well there (http://ning.github.com/jvm-compressor-benchmark/results/canterbury-roundtrip-2011-07-28/index.html).
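As a rough starting point, here is a minimal round-trip micro-benchmark sketch, assuming the snappy-java binding (org.xerial.snappy.Snappy, one of the codecs covered by the linked benchmark page). The class name and sample payload are hypothetical stand-ins for our tracking data, and the timing is naive (no JIT warm-up), so treat the numbers as indicative only:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

import org.xerial.snappy.Snappy; // snappy-java binding (assumption)

public class CodecRoundTrip {

    // Payload loosely resembling a producer batch of 200 small tracking messages.
    static byte[] samplePayload() {
        Random rnd = new Random(42);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("page_view\tmember_id=").append(rnd.nextInt(100000))
              .append("\turl=/profile/view\tbrowser=Mozilla/5.0\tts=")
              .append(1311800000000L + i).append('\n');
        }
        return sb.toString().getBytes();
    }

    static byte[] gzip(byte[] in) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(in);
        gz.close();
        return bos.toByteArray();
    }

    static byte[] gunzip(byte[] in) throws Exception {
        GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(in));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = gz.read(buf)) > 0; ) bos.write(buf, 0, n);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] raw = samplePayload();
        int iters = 5000; // naive timing; use a real harness for publishable numbers

        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) gunzip(gzip(raw));
        long gzipNanos = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) Snappy.uncompress(Snappy.compress(raw));
        long snappyNanos = System.nanoTime() - t0;

        System.out.printf("gzip   round-trip: %.1f us/batch%n", gzipNanos / 1000.0 / iters);
        System.out.printf("snappy round-trip: %.1f us/batch%n", snappyNanos / 1000.0 / iters);
    }
}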
The compression ratio we see for GZIP (with a producer batch size of 200) is about 3x on our typical tracking data set. I wonder how much lower this will be for Snappy; it would be good to check.
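Since the ratio depends heavily on how many messages get compressed together (a larger batch gives the codec more redundancy to exploit), a quick sanity check across batch sizes might look like the sketch below; the data generator is again a hypothetical stand-in for a tracking event, so real numbers would have to come from our actual data:

import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

import org.xerial.snappy.Snappy; // snappy-java binding (assumption)

public class RatioByBatchSize {

    // Hypothetical stand-in for a batch of ~100-byte tracking events.
    static byte[] batch(int numMessages) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < numMessages; i++) {
            sb.append(String.format(
                "page_view\tmember_id=%08d\turl=/profile/view?id=%08d\tbrowser=Mozilla/5.0\tts=%013d%n",
                (i * 7919) % 100000000, i, 1311800000000L + i));
        }
        return sb.toString().getBytes();
    }

    static int gzipSize(byte[] in) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(in);
        gz.close();
        return bos.size();
    }

    public static void main(String[] args) throws Exception {
        // Ratio should improve with batch size as the codec sees more redundancy per block.
        for (int n : new int[] {1, 10, 50, 200}) {
            byte[] raw = batch(n);
            System.out.printf("batch=%3d  gzip=%.2fx  snappy=%.2fx%n",
                n,
                (double) raw.length / gzipSize(raw),
                (double) raw.length / Snappy.compress(raw).length);
        }
    }
}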
It would be great to see a Snappy integration patch along with some Kafka performance benchmarks that measure compression/decompression overhead, compression ratio, and the effect on producer/consumer throughput.