Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-5439

StringUtf8Coder is slower than expected

Details

    • Hide
       --- CUSTOM STRING CODER ---

      sbt:scio-jmh> jmh:run -f1 -wi 10 -i 20 com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode
      Running (fork) org.openjdk.jmh.Main -f1 -wi 10 -i 20 com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode
      # JMH version: 1.21
      # VM version: JDK 1.8.0_162, Java HotSpot(TM) 64-Bit Server VM, 25.162-b12
      # VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/bin/java
      # VM options: <none>
      # Warmup: 10 iterations, 10 s each
      # Measurement: 20 iterations, 10 s each
      # Timeout: 10 min per iteration
      # Threads: 1 thread, will synchronize iterations
      # Benchmark mode: Average time, time/op
      # Benchmark: com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode
      # Run progress: 0.00% complete, ETA 00:05:00
      # Fork: 1 of 1
      # Warmup Iteration 1: 143871.184 ns/op
      # Warmup Iteration 2: 146251.283 ns/op
      # Warmup Iteration 3: 142170.401 ns/op
      # Warmup Iteration 4: 142894.955 ns/op
      # Warmup Iteration 5: 118899.188 ns/op
      # Warmup Iteration 6: 114858.595 ns/op
      # Warmup Iteration 7: 122067.406 ns/op
      # Warmup Iteration 8: 126001.842 ns/op
      # Warmup Iteration 9: 114815.008 ns/op
      # Warmup Iteration 10: 114770.945 ns/op
      Iteration 1: 125273.345 ns/op
      Iteration 2: 138278.789 ns/op
      Iteration 3: 140561.592 ns/op
      Iteration 4: 144293.970 ns/op
      Iteration 5: 143969.709 ns/op
      Iteration 6: 143922.836 ns/op
      Iteration 7: 114001.757 ns/op
      Iteration 8: 127505.765 ns/op
      Iteration 9: 124477.706 ns/op
      Iteration 10: 119351.929 ns/op
      Iteration 11: 125093.176 ns/op
      Iteration 12: 124471.535 ns/op
      Iteration 13: 124345.109 ns/op
      Iteration 14: 125401.103 ns/op
      Iteration 15: 117102.858 ns/op
      Iteration 16: 124733.919 ns/op
      Iteration 17: 125247.740 ns/op
      Iteration 18: 124482.480 ns/op
      Iteration 19: 123607.365 ns/op
      Iteration 20: 123227.978 ns/op
      Result "com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode":
        127967.533 ±(99.9%) 7870.248 ns/op [Average]
        (min, avg, max) = (114001.757, 127967.533, 144293.970), stdev = 9063.390
        CI (99.9%): [120097.285, 135837.782] (assumes normal distribution)
      # Run complete. Total time: 00:05:05
      REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
      why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
      experiments, perform baseline and negative tests that provide experimental control, make sure
      the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
      Do not assume the numbers tell you what you want them to tell.
      Benchmark Mode Cnt Score Error Units
      KryoAtomicCoderBenchmark.derivedStringListDecode avgt 20 127967.533 ± 7870.248 ns/op


      --- StringUtf8Coder ---

      sbt:scio-jmh> jmh:run -f1 -wi 10 -i 20 com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode

      Done compiling.
      Packaging /Users/julient/Documents/Spotify/scio/scio-jmh/target/scala-2.12/scio-jmh_2.12-0.7.0-SNAPSHOT-jmh.jar ...
      Done packaging.
      Running (fork) org.openjdk.jmh.Main -f1 -wi 10 -i 20 com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode
      # JMH version: 1.21
      # VM version: JDK 1.8.0_162, Java HotSpot(TM) 64-Bit Server VM, 25.162-b12
      # VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/bin/java
      # VM options: <none>
      # Warmup: 10 iterations, 10 s each
      # Measurement: 20 iterations, 10 s each
      # Timeout: 10 min per iteration
      # Threads: 1 thread, will synchronize iterations
      # Benchmark mode: Average time, time/op
      # Benchmark: com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode
      # Run progress: 0.00% complete, ETA 00:05:00
      # Fork: 1 of 1
      # Warmup Iteration 1: 168293.193 ns/op
      # Warmup Iteration 2: 193340.092 ns/op
      # Warmup Iteration 3: 195244.767 ns/op
      # Warmup Iteration 4: 195866.848 ns/op
      # Warmup Iteration 5: 196383.206 ns/op
      # Warmup Iteration 6: 196160.991 ns/op
      # Warmup Iteration 7: 196214.871 ns/op
      # Warmup Iteration 8: 193079.657 ns/op
      # Warmup Iteration 9: 179553.851 ns/op
      # Warmup Iteration 10: 194834.527 ns/op
      Iteration 1: 195892.368 ns/op
      Iteration 2: 195410.418 ns/op
      Iteration 3: 196291.456 ns/op
      Iteration 4: 196202.431 ns/op
      Iteration 5: 151604.550 ns/op
      Iteration 6: 168512.092 ns/op
      Iteration 7: 171496.853 ns/op
      Iteration 8: 154729.598 ns/op
      Iteration 9: 169235.528 ns/op
      Iteration 10: 171641.574 ns/op
      Iteration 11: 195484.899 ns/op
      Iteration 12: 195493.509 ns/op
      Iteration 13: 170587.460 ns/op
      Iteration 14: 173990.184 ns/op
      Iteration 15: 179476.846 ns/op
      Iteration 16: 162233.480 ns/op
      Iteration 17: 171842.634 ns/op
      Iteration 18: 173480.442 ns/op
      Iteration 19: 170351.037 ns/op
      Iteration 20: 162195.772 ns/op
      Result "com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode":
        176307.657 ±(99.9%) 12630.618 ns/op [Average]
        (min, avg, max) = (151604.550, 176307.657, 196291.456), stdev = 14545.439
        CI (99.9%): [163677.038, 188938.275] (assumes normal distribution)
      # Run complete. Total time: 00:05:04
      REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
      why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
      experiments, perform baseline and negative tests that provide experimental control, make sure
      the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
      Do not assume the numbers tell you what you want them to tell.
      Benchmark Mode Cnt Score Error Units
      KryoAtomicCoderBenchmark.derivedStringListDecode avgt 20 176307.657 ± 12630.618 ns/op
      Show
       --- CUSTOM STRING CODER --- sbt:scio-jmh> jmh:run -f1 -wi 10 -i 20 com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode Running (fork) org.openjdk.jmh.Main -f1 -wi 10 -i 20 com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode # JMH version: 1.21 # VM version: JDK 1.8.0_162, Java HotSpot(TM) 64-Bit Server VM, 25.162-b12 # VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/bin/java # VM options: <none> # Warmup: 10 iterations, 10 s each # Measurement: 20 iterations, 10 s each # Timeout: 10 min per iteration # Threads: 1 thread, will synchronize iterations # Benchmark mode: Average time, time/op # Benchmark: com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode # Run progress: 0.00% complete, ETA 00:05:00 # Fork: 1 of 1 # Warmup Iteration 1: 143871.184 ns/op # Warmup Iteration 2: 146251.283 ns/op # Warmup Iteration 3: 142170.401 ns/op # Warmup Iteration 4: 142894.955 ns/op # Warmup Iteration 5: 118899.188 ns/op # Warmup Iteration 6: 114858.595 ns/op # Warmup Iteration 7: 122067.406 ns/op # Warmup Iteration 8: 126001.842 ns/op # Warmup Iteration 9: 114815.008 ns/op # Warmup Iteration 10: 114770.945 ns/op Iteration 1: 125273.345 ns/op Iteration 2: 138278.789 ns/op Iteration 3: 140561.592 ns/op Iteration 4: 144293.970 ns/op Iteration 5: 143969.709 ns/op Iteration 6: 143922.836 ns/op Iteration 7: 114001.757 ns/op Iteration 8: 127505.765 ns/op Iteration 9: 124477.706 ns/op Iteration 10: 119351.929 ns/op Iteration 11: 125093.176 ns/op Iteration 12: 124471.535 ns/op Iteration 13: 124345.109 ns/op Iteration 14: 125401.103 ns/op Iteration 15: 117102.858 ns/op Iteration 16: 124733.919 ns/op Iteration 17: 125247.740 ns/op Iteration 18: 124482.480 ns/op Iteration 19: 123607.365 ns/op Iteration 20: 123227.978 ns/op Result "com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode":   127967.533 ±(99.9%) 7870.248 ns/op [Average]   (min, avg, max) = (114001.757, 127967.533, 144293.970), stdev = 9063.390   CI (99.9%): [120097.285, 135837.782] (assumes normal distribution) # Run complete. Total time: 00:05:05 REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts. Do not assume the numbers tell you what you want them to tell. Benchmark Mode Cnt Score Error Units KryoAtomicCoderBenchmark.derivedStringListDecode avgt 20 127967.533 ± 7870.248 ns/op --- StringUtf8Coder --- sbt:scio-jmh> jmh:run -f1 -wi 10 -i 20 com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode Done compiling. Packaging /Users/julient/Documents/Spotify/scio/scio-jmh/target/scala-2.12/scio-jmh_2.12-0.7.0-SNAPSHOT-jmh.jar ... Done packaging. Running (fork) org.openjdk.jmh.Main -f1 -wi 10 -i 20 com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode # JMH version: 1.21 # VM version: JDK 1.8.0_162, Java HotSpot(TM) 64-Bit Server VM, 25.162-b12 # VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_162.jdk/Contents/Home/jre/bin/java # VM options: <none> # Warmup: 10 iterations, 10 s each # Measurement: 20 iterations, 10 s each # Timeout: 10 min per iteration # Threads: 1 thread, will synchronize iterations # Benchmark mode: Average time, time/op # Benchmark: com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode # Run progress: 0.00% complete, ETA 00:05:00 # Fork: 1 of 1 # Warmup Iteration 1: 168293.193 ns/op # Warmup Iteration 2: 193340.092 ns/op # Warmup Iteration 3: 195244.767 ns/op # Warmup Iteration 4: 195866.848 ns/op # Warmup Iteration 5: 196383.206 ns/op # Warmup Iteration 6: 196160.991 ns/op # Warmup Iteration 7: 196214.871 ns/op # Warmup Iteration 8: 193079.657 ns/op # Warmup Iteration 9: 179553.851 ns/op # Warmup Iteration 10: 194834.527 ns/op Iteration 1: 195892.368 ns/op Iteration 2: 195410.418 ns/op Iteration 3: 196291.456 ns/op Iteration 4: 196202.431 ns/op Iteration 5: 151604.550 ns/op Iteration 6: 168512.092 ns/op Iteration 7: 171496.853 ns/op Iteration 8: 154729.598 ns/op Iteration 9: 169235.528 ns/op Iteration 10: 171641.574 ns/op Iteration 11: 195484.899 ns/op Iteration 12: 195493.509 ns/op Iteration 13: 170587.460 ns/op Iteration 14: 173990.184 ns/op Iteration 15: 179476.846 ns/op Iteration 16: 162233.480 ns/op Iteration 17: 171842.634 ns/op Iteration 18: 173480.442 ns/op Iteration 19: 170351.037 ns/op Iteration 20: 162195.772 ns/op Result "com.spotify.scio.jmh.KryoAtomicCoderBenchmark.derivedStringListDecode":   176307.657 ±(99.9%) 12630.618 ns/op [Average]   (min, avg, max) = (151604.550, 176307.657, 196291.456), stdev = 14545.439   CI (99.9%): [163677.038, 188938.275] (assumes normal distribution) # Run complete. Total time: 00:05:04 REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial experiments, perform baseline and negative tests that provide experimental control, make sure the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts. Do not assume the numbers tell you what you want them to tell. Benchmark Mode Cnt Score Error Units KryoAtomicCoderBenchmark.derivedStringListDecode avgt 20 176307.657 ± 12630.618 ns/op

    Description

      While working on Scio's next version, I noticed that StringUtf8Coder is slower than expected.

      I wrote a small micro-benchmark using jmh that serialises a (scala) List of a 1000 Strings using a custom Coder[List[_]]. While profiling it, I noticed that a lot of time is spent in java.io.DataInputStream.<init>(java.io.InputStream).

      Looking into the code for 
      StringUtf8Coder, the readString method is directly reading bytes. It therefore does not seem that a DataInputStream is necessary.
      I replaced StringUtf8Coder with a Coder[String] implementation (in Scala), that is essentially the same as StringUtf8Coder but is not using DataInputStream.
       

      private final object ScioStringCoder extends AtomicCoder[String] {
        import org.apache.beam.sdk.util.VarInt
        import java.nio.charset.StandardCharsets
        import org.apache.beam.sdk.values.TypeDescriptor
        import com.google.common.base.Utf8
      
        def decode(dis: InputStream): String = {
          val len = VarInt.decodeInt(dis)
          if (len < 0) {
            throw new CoderException("Invalid encoded string length: " + len)
          }
          val bytes = new Array[Byte](len)
          dis.read(bytes)
          return new String(bytes, StandardCharsets.UTF_8)
        }
      
        def encode(value: String, outStream: OutputStream): Unit = {
          val bytes = value.getBytes(StandardCharsets.UTF_8)
          VarInt.encode(bytes.length, outStream)
          outStream.write(bytes)
        }
      
        override def verifyDeterministic() = ()
        override def consistentWithEquals() = true
        private val TYPE_DESCRIPTOR = new TypeDescriptor[String] {}
        override def getEncodedTypeDescriptor() = TYPE_DESCRIPTOR
        override def getEncodedElementByteSize(value: String) = {
          if (value == null) {
            throw new CoderException("cannot encode a null String")
          }
          val size = Utf8.encodedLength(value)
          VarInt.getLength(size) + size
        }
      }
      

       
      Using that Coder is about 27% faster than StringUtf8Coder. I've added the jmh output in "Docs Text"

      Is there any particular reason to use DataInputStream ?
      Do you think we can remove that to make StringUtf8Coder more efficient ?
       

      Attachments

        Issue Links

          Activity

            People

              jto Julien Tournay
              jto Julien Tournay
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m