Details
- Bug
- Status: Resolved
- P2
- Resolution: Fixed
- 2.6.0
Description
While working on Scio's next version, I noticed that StringUtf8Coder is slower than expected.
I wrote a small micro-benchmark using JMH that serialises a (Scala) List of 1000 Strings using a custom Coder[List[_]]. While profiling it, I noticed that a lot of time is spent in java.io.DataInputStream.<init>(java.io.InputStream).
Looking at the code of StringUtf8Coder, I see that the readString method reads bytes directly, so a DataInputStream does not seem to be necessary.
I replaced StringUtf8Coder with a Coder[String] implementation (in Scala) that is essentially the same as StringUtf8Coder but does not use DataInputStream:
private final object ScioStringCoder extends AtomicCoder[String] {
  import org.apache.beam.sdk.util.VarInt
  import java.nio.charset.StandardCharsets
  import org.apache.beam.sdk.values.TypeDescriptor
  import com.google.common.base.Utf8

  def decode(dis: InputStream): String = {
    val len = VarInt.decodeInt(dis)
    if (len < 0) {
      throw new CoderException("Invalid encoded string length: " + len)
    }
    val bytes = new Array[Byte](len)
    dis.read(bytes)
    return new String(bytes, StandardCharsets.UTF_8)
  }

  def encode(value: String, outStream: OutputStream): Unit = {
    val bytes = value.getBytes(StandardCharsets.UTF_8)
    VarInt.encode(bytes.length, outStream)
    outStream.write(bytes)
  }

  override def verifyDeterministic() = ()
  override def consistentWithEquals() = true

  private val TYPE_DESCRIPTOR = new TypeDescriptor[String] {}
  override def getEncodedTypeDescriptor() = TYPE_DESCRIPTOR

  override def getEncodedElementByteSize(value: String) = {
    if (value == null) {
      throw new CoderException("cannot encode a null String")
    }
    val size = Utf8.encodedLength(value)
    VarInt.getLength(size) + size
  }
}
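For illustration, here is a minimal Java sketch of the same idea: a length-prefixed UTF-8 string read straight from a plain InputStream, with no DataInputStream wrapper. It is not Beam's StringUtf8Coder. The class name PlainStreamStringCodec and its helpers are hypothetical, and a simple base-128 varint stands in for Beam's VarInt, so byte compatibility with Beam is not claimed. The decode loop keeps reading until the buffer is full, because a single InputStream.read call may return fewer bytes than requested.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-alone codec for illustration; not Beam's StringUtf8Coder. */
final class PlainStreamStringCodec {

    // Writes a non-negative int as a base-128 varint, low 7 bits first.
    // Assumption: this stands in for Beam's VarInt.encode.
    static void writeVarInt(int value, OutputStream out) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Reads a varint written by writeVarInt, one byte at a time.
    static int readVarInt(InputStream in) throws IOException {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.read();
            if (b < 0) {
                throw new EOFException("truncated length prefix");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("varint too long");
    }

    // Writes the UTF-8 byte length followed by the UTF-8 bytes.
    static void encode(String value, OutputStream out) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        writeVarInt(bytes.length, out);
        out.write(bytes);
    }

    // Reads the length prefix, then fills the buffer directly from the stream.
    static String decode(InputStream in) throws IOException {
        int len = readVarInt(in);
        if (len < 0) {
            throw new IOException("Invalid encoded string length: " + len);
        }
        byte[] bytes = new byte[len];
        int off = 0;
        while (off < len) {
            int n = in.read(bytes, off, len - off);
            if (n < 0) {
                throw new EOFException("stream ended after " + off + " of " + len + " bytes");
            }
            off += n;
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encode("hello", out);
        encode("", out);
        InputStream in = new ByteArrayInputStream(out.toByteArray());
        System.out.println(decode(in));           // first string round-trips
        System.out.println(decode(in).isEmpty()); // second string is empty
    }
}
```

Nothing in decode needs more than read() and read(byte[], int, int), which every InputStream provides, so no per-call wrapper object has to be allocated.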
In this benchmark, that Coder is about 27% faster than StringUtf8Coder. I've added the JMH output in "Docs Text".
Is there any particular reason to use DataInputStream?
Do you think we can remove it to make StringUtf8Coder more efficient?