KAFKA-374: Move to Java CRC32 implementation

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: core
    • Labels: None

      Description

      We keep a per-record CRC32. This is a fairly cheap algorithm, but the Java implementation uses JNI, and that seems to be a bit expensive for small records. I have seen this before in Kafka profiles, and I noticed it in another application I was working on. Basically, with small records the native implementation can only checksum at < 100 MB/sec. Hadoop has done some analysis of this and replaced it with a Java implementation that is 2x faster for large values and 5-10x faster for small values. Details are in HADOOP-6148.

      We should do a quick read/write benchmark on log and message set iteration and see if this improves things.
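      For reference, the core of a table-driven pure-JVM CRC-32 looks roughly like the sketch below. This is a minimal single-table illustration of the technique, not the Hadoop code: the implementation from HADOOP-6148 unrolls this loop over multiple lookup tables and processes several bytes per iteration, which is where most of the extra speed comes from.

      {code:java}
      // Minimal single-table CRC-32 sketch (reflected polynomial 0xEDB88320).
      // Illustrative only; the real Hadoop/Kafka implementation is heavily
      // unrolled and table-sliced for speed.
      public class PureJavaCrc32Sketch {
          private static final int[] TABLE = new int[256];
          static {
              for (int i = 0; i < 256; i++) {
                  int c = i;
                  for (int k = 0; k < 8; k++) {
                      c = ((c & 1) != 0) ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
                  }
                  TABLE[i] = c;
              }
          }

          private int crc = 0xFFFFFFFF; // same initial value as java.util.zip.CRC32

          public void update(byte[] buf, int off, int len) {
              int c = crc;
              for (int i = off; i < off + len; i++) {
                  // One table lookup per byte instead of eight shift/XOR steps.
                  c = TABLE[(c ^ buf[i]) & 0xFF] ^ (c >>> 8);
              }
              crc = c;
          }

          public long getValue() {
              return (~crc) & 0xFFFFFFFFL; // final XOR, returned as unsigned 32 bits
          }
      }
      {code}

      The table lookup trades a small in-memory table for far fewer operations per byte than the naive bit-at-a-time loop, and it avoids the JNI call overhead that dominates for small records.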

      Attachments

      1. KAFKA-374-draft.patch (33 kB) - Jay Kreps
      2. KAFKA-374.patch (33 kB) - David Arthur

        Activity

        Jay Kreps added a comment -

        The following is a draft patch that transliterates the Hadoop CRC32 implementation into Scala and swaps it in for our own CRC32 implementation. I ran a sanity check on a few hundred million values to verify that I got the same CRCs.

        The CRC code is essentially unreadable, so it is not something you can really edit.

        I wrote a simple test that just sequentially creates messages (which requires computing the checksum). On this test I see about a 30-40% improvement, roughly across the board on Linux. I will include full results when it completes.

        I suspect this really only helps very trivial cases: benchmarks, mirroring, or other applications which do no real processing. It should speed up the broker, the producer, and the consumer.
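        The sanity check described above can be as simple as the sketch below: feed identical random inputs to java.util.zip.CRC32 (the JNI-backed implementation) and to the candidate pure-JVM implementation, and assert that the results agree. PureJavaCrc32Sketch here is the illustrative class from the description above, not the class in the patch, and the trial count is far smaller than the few hundred million values mentioned.

        {code:java}
        import java.util.Random;
        import java.util.zip.CRC32;

        public class CrcSanityCheck {
            public static void main(String[] args) {
                Random rand = new Random(42);
                for (int trial = 0; trial < 1_000_000; trial++) {
                    byte[] data = new byte[1 + rand.nextInt(512)];
                    rand.nextBytes(data);

                    CRC32 jni = new CRC32(); // JNI-backed reference implementation
                    jni.update(data, 0, data.length);

                    PureJavaCrc32Sketch pure = new PureJavaCrc32Sketch();
                    pure.update(data, 0, data.length);

                    if (jni.getValue() != pure.getValue()) {
                        throw new AssertionError("CRC mismatch at trial " + trial);
                    }
                }
                System.out.println("All checksums agree.");
            }
        }
        {code}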

        Jay Kreps added a comment -

        Here are the full performance results.

        The size is in bytes and the value for native/java is the nanoseconds per message averaged over a large number of messages:

        size (bytes)    native (ns/msg)    java (ns/msg)    improvement
              16             149.47            108.11           27.7%
              32             197.80            149.78           24.3%
              64             291.01            219.89           24.4%
             128             487.36            357.64           26.6%
             256             892.78            631.15           29.3%
             512            1774.22           1251.40           29.5%
            1024            3412.79           2470.58           27.6%
            2048            6594.28           4421.38           33.0%
            4096           13121.85           8751.19           33.3%
            8192           25689.03          18173.61           29.3%
           16384           51258.21          36278.30           29.2%
           32768          103584.61          73240.50           29.3%
           65536          207569.05         146748.51           29.3%
          131072          415893.86         292083.12           29.8%

        I suspect there is still some scala numeric boxing magic happening here that would be good to get rid of.
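        For context, a timing loop in the spirit of these numbers might look like the sketch below: checksum many fixed-size messages and report the average nanoseconds per message. This is not the test code from the patch, and a serious benchmark would use a harness that controls for JIT warm-up and dead-code elimination.

        {code:java}
        import java.util.zip.CRC32;

        public class CrcBench {
            public static void main(String[] args) {
                int[] sizes = {16, 64, 256, 1024, 4096};
                int iterations = 200_000;
                for (int size : sizes) {
                    byte[] data = new byte[size];
                    long sink = 0;
                    // Warm-up pass so the JIT compiles the hot loop before timing.
                    for (int i = 0; i < iterations; i++) sink += crc(data);
                    long start = System.nanoTime();
                    for (int i = 0; i < iterations; i++) sink += crc(data);
                    long elapsed = System.nanoTime() - start;
                    // Print the sink so the JIT cannot eliminate the loop as dead code.
                    System.out.printf("size %6d: %8.2f ns/msg (sink=%d)%n",
                                      size, (double) elapsed / iterations, sink);
                }
            }

            private static long crc(byte[] data) {
                CRC32 c = new CRC32(); // swap in the pure-JVM implementation to compare
                c.update(data, 0, data.length);
                return c.getValue();
            }
        }
        {code}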

        Jun Rao added a comment -

        Should the native and the java columns be reversed? As it reads now, java is faster than native.

        Jay Kreps added a comment -

        Yeah, that's a little unclear. The "java" column is the "pure JVM" Scala implementation in the patch, and the "native" column is java.util.zip.CRC32, which uses JNI to call a C CRC library.

        David Arthur added a comment -

        I pulled in the pure-Java implementation from Hadoop for comparison:

        https://docs.google.com/spreadsheet/pub?key=0AksaPvYfWJQFdG5fZFNyWnpOUzZfZEtnVl9YZ21FWUE&output=html

        Pure Java has a slight advantage over pure Scala; both show a good speedup over JNI. This was done against 0.8 using the same test code as in the patch.

        Jay Kreps added a comment -

        Ooo very nice.

        David Arthur added a comment - edited

        Not sure how you guys feel about having Java in the source tree, but I attached a patch with the pure-Java implementation (and the other stuff from Jay Kreps' original patch).

        Jun Rao added a comment -

        Should this be a post 0.8 item?

        Jay Kreps added a comment -

        Yes, this should go on trunk.

        Scott Carey added a comment -

        Awesome. I was just profiling some Kafka 0.7.1 stuff and noticed the CRC eating up time, and since I co-authored the pure-Java Hadoop one, I was just about to file a JIRA here...

        On a related note, it would be nice to offload the CRC and decompression to a different thread than the user's thread. Does 0.8's client do all of this work on the user thread like 0.7.1? Our 0.7.1 consumers are often throughput-bound on CPU, including in Kafka code. If the client is currently single-threaded, I can file a JIRA with some ideas.

        Jay Kreps added a comment -

        Hey Scott, currently the consumer is implemented with a background thread that fetches data and populates a queue of data chunks, which are handed off to the user's iterators. A single consumer client can feed many iterators in (potentially) many threads. The goal of this design was specifically to support a large thread pool of processors while still maintaining ordered consumption on a per-partition basis. So although a background thread would be one way to increase parallelism, I think it is actually the harder one to reason about, since now there are multiple thread pools to tune. I think just increasing the number of user threads/iterators is probably the best approach.
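        Structurally, that design amounts to something like the sketch below: one fetcher thread fills a bounded queue of chunks, and each user thread drains it in its own loop. The names here (FetchedChunk, fetchFromBroker) are hypothetical stand-ins rather than Kafka's actual classes; the point is that CRC validation and decompression happen on the user's side of the queue.

        {code:java}
        import java.util.Collections;
        import java.util.List;
        import java.util.concurrent.ArrayBlockingQueue;
        import java.util.concurrent.BlockingQueue;

        public class ConsumerSketch {
            static final class FetchedChunk {
                final List<byte[]> messages;
                FetchedChunk(List<byte[]> messages) { this.messages = messages; }
            }

            // Bounded queue: the fetcher blocks when user threads fall behind.
            private final BlockingQueue<FetchedChunk> chunks = new ArrayBlockingQueue<>(10);

            void startFetcher() {
                Thread fetcher = new Thread(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        try {
                            chunks.put(fetchFromBroker());
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                }, "consumer-fetcher");
                fetcher.setDaemon(true);
                fetcher.start();
            }

            // Each user thread runs a loop like this; checksum verification and
            // decompression happen here, on the user's thread.
            void consumeLoop() throws InterruptedException {
                while (true) {
                    FetchedChunk chunk = chunks.take();
                    for (byte[] message : chunk.messages) {
                        process(message);
                    }
                }
            }

            private FetchedChunk fetchFromBroker() {
                // Placeholder: a real fetcher reads a response from the broker socket.
                return new FetchedChunk(Collections.singletonList(new byte[0]));
            }

            private void process(byte[] message) { /* user logic */ }
        }
        {code}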

        Scott Carey added a comment -

        I was thinking of using Akka Actors, as they can provide the necessary ordering guarantees, do not require thread pool tuning, and should be easy to reason about. The performance would likely be better than producer/consumer thread chains too, since chained actors often run in the same thread and avoid processor cache thrashing.

        David Arthur added a comment -

        Akka seems a bit overkill for this (although it does have some nice properties). It would be interesting to refactor the threading in Kafka with Akka and see what kind of performance differences there are (certainly beyond the scope of this JIRA).

        As for the CRC implementation, is there consensus on what to do here - Java or Scala?

        I say +1 for Java since no one will need to modify this code and it doesn't really matter that it's not Scala.

        Jay Kreps added a comment -

        Checked in the java version.


  People

  • Assignee: Jay Kreps
  • Reporter: Jay Kreps
  • Votes: 0
  • Watchers: 5
