Parquet / PARQUET-258

Binary statistics are not updated correctly if an underlying Binary array is modified in place

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

    Description

      The following test case shows the problem:

          byte[] bytes = new byte[] { 49 };
          BinaryStatistics reusableStats = new BinaryStatistics();
          reusableStats.updateStats(Binary.fromByteArray(bytes));
          // Reuse the same backing array for the next value, as a buffer-recycling reader would.
          bytes[0] = 50;
          reusableStats.updateStats(Binary.fromByteArray(bytes, 0, 1));

          // Expected: min = { 49 }, max = { 50 }.
          assertArrayEquals(new byte[] { 49 }, reusableStats.getMinBytes());
          assertArrayEquals(new byte[] { 50 }, reusableStats.getMaxBytes());
      

      I discovered the bug while converting an Avro file to a Parquet file, reading GenericRecords from the file with the DataFileStream.next(D reuse) method. The problem is that the underlying byte array of the Avro Utf8 object is passed to Parquet, which keeps a reference to it in BinaryStatistics; the same array is then modified in place on the next read.
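      The aliasing described above can be reproduced without Parquet or Avro. The following is only a sketch: MinMaxTracker is a hypothetical stand-in for a statistics holder, not the parquet-mr BinaryStatistics class, and the copyOnStore flag is an assumption added purely to contrast storing a reference with storing a copy.

```java
import java.util.Arrays;

class AliasingDemo {
    // Hypothetical min/max holder (not the parquet-mr class).
    static final class MinMaxTracker {
        private byte[] min;
        private byte[] max;
        private final boolean copyOnStore;

        MinMaxTracker(boolean copyOnStore) {
            this.copyOnStore = copyOnStore;
        }

        void update(byte[] value) {
            if (min == null || Arrays.compareUnsigned(value, min) < 0) {
                min = keep(value);
            }
            if (max == null || Arrays.compareUnsigned(value, max) > 0) {
                max = keep(value);
            }
        }

        // Either store the caller's array as-is (aliasing) or a private copy.
        private byte[] keep(byte[] value) {
            return copyOnStore ? value.clone() : value;
        }

        byte[] min() { return min.clone(); }
        byte[] max() { return max.clone(); }
    }

    // Feeds the values 49 and then 50 through one reused buffer.
    static MinMaxTracker feed(boolean copyOnStore) {
        byte[] bytes = {49};
        MinMaxTracker tracker = new MinMaxTracker(copyOnStore);
        tracker.update(bytes);
        bytes[0] = 50; // the caller overwrites the buffer in place
        tracker.update(bytes);
        return tracker;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(feed(false).min())); // [50]: the stored min was overwritten
        System.out.println(Arrays.toString(feed(true).min()));  // [49]: the private copy is unaffected
    }
}
```

      With a stored reference, the recorded minimum silently follows the caller's buffer; copying on store avoids that at the cost of one allocation each time a new min or max is recorded.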

      I am not sure where the problem should be fixed (in BinaryStatistics or in AvroWriteSupport).

      If the current BinaryStatistics implementation is intentional (for performance reasons), then this behavior should be documented and AvroWriteSupport.fromAvroString should be changed to copy the underlying Utf8 array.
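      A caller-side fix along these lines could look roughly like the sketch below. ReusableText and snapshot are hypothetical names used for illustration; ReusableText only imitates a text object that recycles one backing array and is not Avro's Utf8 class. The sketch assumes the backing array may be longer than the valid data, so only the valid prefix is copied.

```java
import java.util.Arrays;

class CallerSideCopyDemo {
    // Hypothetical reusable buffer that recycles its backing array across records.
    static final class ReusableText {
        private byte[] bytes = new byte[0];
        private int length;

        void set(byte[] src) {
            if (bytes.length < src.length) {
                bytes = new byte[src.length]; // grow only when needed
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length;
        }

        byte[] backingArray() { return bytes; }
        int byteLength() { return length; }
    }

    // Copies only the valid prefix, so later reuse of the buffer cannot change the result.
    static byte[] snapshot(ReusableText text) {
        return Arrays.copyOf(text.backingArray(), text.byteLength());
    }

    public static void main(String[] args) {
        ReusableText reuse = new ReusableText();
        reuse.set(new byte[] {49, 49});
        byte[] kept = snapshot(reuse);

        reuse.set(new byte[] {50}); // overwrites the first byte of the same backing array

        System.out.println(Arrays.toString(kept));            // [49, 49]
        System.out.println(Arrays.toString(snapshot(reuse))); // [50]
    }
}
```

      Copying at the caller keeps the statistics code allocation-free for writers that never reuse buffers, but every writer that does reuse them has to remember to copy.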

      I am happy to create a pull request once the preferred fix has been agreed on.

            People

              Assignee: Unassigned
              Reporter: Konstantin Shaposhnikov (k.shaposhnikov@gmail.com)
              Votes: 0
              Watchers: 3