Description
While trying out the newest Parquet version, we noticed that the changes to start using ByteBuffers (https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8 and https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8 — mostly Avro, but with a couple of ByteBuffer changes) caused our jobs to slow down measurably:
Read overhead: 4-6% (MB-millis)
Write overhead: 6-10% (MB-millis)
This appears to be due to the encoding/decoding of Strings in the Binary class (https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java): toStringUsingUTF8() for reads and encodeUTF8() for writes.
In those methods we use the NIO CharsetEncoder/CharsetDecoder to encode and decode:
private static ByteBuffer encodeUTF8(CharSequence value) {
  try {
    return ENCODER.get().encode(CharBuffer.wrap(value));
  } catch (CharacterCodingException e) {
    throw new ParquetEncodingException("UTF-8 not supported.", e);
  }
}

...

@Override
public String toStringUsingUTF8() {
  int limit = value.limit();
  value.limit(offset + length);
  int position = value.position();
  value.position(offset);
  // no corresponding interface to read a subset of a buffer; we would have to slice it,
  // which creates another ByteBuffer object, or do what is done here: adjust the
  // limit/position and set them back after
  String ret = UTF8.decode(value).toString();
  value.limit(limit);
  value.position(position);
  return ret;
}
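For reference, the CharsetDecoder path and the String-constructor path produce identical results for valid UTF-8 input; a minimal standalone sketch of the two decode paths (class and method names here are hypothetical, not part of Binary):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the two UTF-8 decode paths discussed above.
public class Utf8DecodePaths {

    // CharsetDecoder path, analogous to what toStringUsingUTF8() does today.
    static String viaDecoder(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // String-constructor path, analogous to the proposed change below.
    static String viaStringCtor(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = "héllo, wörld".getBytes(StandardCharsets.UTF_8);
        System.out.println(viaDecoder(bytes).equals(viaStringCtor(bytes))); // prints "true"
    }
}
```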
We tried out some micro- and macro-benchmarks, and switching those paths over to the String class's own encoding/decoding improves performance:
@Override
public String toStringUsingUTF8() {
  String ret;
  if (value.hasArray()) {
    try {
      ret = new String(value.array(), value.arrayOffset() + offset, length, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new ParquetDecodingException("UTF-8 not supported");
    }
  } else {
    int limit = value.limit();
    value.limit(offset + length);
    int position = value.position();
    value.position(offset);
    // no corresponding interface to read a subset of a buffer; we would have to slice it,
    // which creates another ByteBuffer object, or do what is done here: adjust the
    // limit/position and set them back after
    ret = UTF8.decode(value).toString();
    value.limit(limit);
    value.position(position);
  }
  return ret;
}

...

private static ByteBuffer encodeUTF8(String value) {
  try {
    return ByteBuffer.wrap(value.getBytes("UTF-8"));
  } catch (UnsupportedEncodingException e) {
    throw new ParquetEncodingException("UTF-8 not supported.", e);
  }
}
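A rough standalone micro-benchmark along these lines shows the gap between the two encode paths. This is not the exact benchmark we ran; the class name, sample string, and iteration count are arbitrary, and a harness like JMH would give more reliable numbers:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical micro-benchmark comparing CharsetEncoder.encode() against
// String.getBytes() for UTF-8. Timings are indicative only.
public class EncodeBench {
    static final int ITERS = 1_000_000;

    static long timeEncoder(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        long start = System.nanoTime();
        for (int i = 0; i < ITERS; i++) {
            // encode() resets the encoder and allocates a fresh output ByteBuffer
            ByteBuffer unused = encoder.encode(CharBuffer.wrap(s));
        }
        return System.nanoTime() - start;
    }

    static long timeGetBytes(String s) {
        long start = System.nanoTime();
        for (int i = 0; i < ITERS; i++) {
            ByteBuffer unused = ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws CharacterCodingException {
        String sample = "a typical short string value stored in a Parquet column";
        System.out.printf("CharsetEncoder: %d ms%n", timeEncoder(sample) / 1_000_000);
        System.out.printf("String.getBytes: %d ms%n", timeGetBytes(sample) / 1_000_000);
    }
}
```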