PARQUET-642: Improve performance of ByteBuffer based read / write paths

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9.0, 1.8.2
    • Component/s: None
    • Labels: None

    Description

      While trying out the newest Parquet version, we noticed that the changes to start using ByteBuffers (https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8 and https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8 - mostly Avro, but a couple of ByteBuffer changes) slowed our jobs down a bit:
      Read overhead: 4-6% (in MB_Millis)
      Write overhead: 6-10% (in MB_Millis)

      This seems to be due to the encoding / decoding of Strings in the Binary class (https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
      toStringUsingUTF8() - for reads
      encodeUTF8() - for writes
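
      For context, these methods sit on the per-value hot path for string columns: on the write side, strings are typically handed over as Binary.fromString(...), which is backed by encodeUTF8(), and on the read side the materialized Binary is converted back with toStringUsingUTF8(). A rough sketch of the round trip (the entry points named here are assumptions about the public API; exact call sites vary by binding):

          // Rough sketch of the round trip that exercises the two methods above
          // (Binary.fromString / toStringUsingUTF8 assumed as the public entry points).
          Binary binary = Binary.fromString("some value");     // write side -> encodeUTF8()
          String roundTripped = binary.toStringUsingUTF8();    // read side  -> UTF-8 decode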

      In those methods we're using the NIO Charset encoder / decoder for encode / decode:

          // ENCODER is a thread-local CharsetEncoder for UTF-8; every call wraps the
          // CharSequence in a CharBuffer and allocates a fresh ByteBuffer for the result.
          private static ByteBuffer encodeUTF8(CharSequence value) {
            try {
              return ENCODER.get().encode(CharBuffer.wrap(value));
            } catch (CharacterCodingException e) {
              throw new ParquetEncodingException("UTF-8 not supported.", e);
            }
          }
      ...
          @Override
          public String toStringUsingUTF8() {
            int limit = value.limit();
            value.limit(offset+length);
            int position = value.position();
            value.position(offset);
            // no corresponding interface to read a subset of a buffer, would have to slice it
            // which creates another ByteBuffer object or do what is done here to adjust the
            // limit/offset and set them back after
            String ret = UTF8.decode(value).toString();
            value.limit(limit);
            value.position(position);
            return ret;
          }
      

      We tried some micro / macro benchmarks, and switching these methods over to the String class for encoding / decoding improves performance (see also the benchmark sketch after the snippets):

          @Override
          public String toStringUsingUTF8() {
            String ret;
            if (value.hasArray()) {
              try {
                // fast path: decode straight from the backing array
                ret = new String(value.array(), value.arrayOffset() + offset, length, "UTF-8");
              } catch (UnsupportedEncodingException e) {
                throw new ParquetDecodingException("UTF-8 not supported.", e);
              }
            } else {
              int limit = value.limit();
              value.limit(offset + length);
              int position = value.position();
              value.position(offset);
              // no corresponding interface to read a subset of a buffer, would have to slice it
              // which creates another ByteBuffer object or do what is done here to adjust the
              // limit/offset and set them back after
              ret = UTF8.decode(value).toString();
              value.limit(limit);
              value.position(position);
            }
            return ret;
          }
      ...
          private static ByteBuffer encodeUTF8(String value) {
            try {
              return ByteBuffer.wrap(value.getBytes("UTF-8"));
            } catch (UnsupportedEncodingException e) {
              throw new ParquetEncodingException("UTF-8 not supported.", e);
            }
          }
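
      For anyone who wants to reproduce the comparison, a minimal JMH-style microbenchmark along the lines below contrasts the Charset.decode path with new String(byte[], ...) on a heap-backed buffer. This is an illustrative sketch only (class, method names, and test data are made up here), not the benchmark that produced the numbers above:

          import java.nio.ByteBuffer;
          import java.nio.charset.StandardCharsets;
          import org.openjdk.jmh.annotations.*;

          // Illustrative JMH benchmark comparing the two decode strategies discussed above.
          @State(Scope.Thread)
          public class Utf8DecodeBenchmark {
            private ByteBuffer value;
            private int offset;
            private int length;

            @Setup
            public void setup() {
              byte[] bytes = "a reasonably sized utf-8 string value".getBytes(StandardCharsets.UTF_8);
              value = ByteBuffer.wrap(bytes);   // heap-backed, so hasArray() is true
              offset = 0;
              length = bytes.length;
            }

            @Benchmark
            public String decodeWithCharset() {
              // Mirrors the original path: take a view of the buffer and Charset.decode it.
              ByteBuffer dup = value.duplicate();
              dup.limit(offset + length);
              dup.position(offset);
              return StandardCharsets.UTF_8.decode(dup).toString();
            }

            @Benchmark
            public String decodeWithString() {
              // Mirrors the proposed path: decode directly from the backing array.
              return new String(value.array(), value.arrayOffset() + offset, length, StandardCharsets.UTF_8);
            }
          }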
      

          People

            Assignee: Piyush Narang
            Reporter: Piyush Narang