  1. Apache Arrow
  2. ARROW-17573

[Go] Parquet ByteArray statistics cause memory leak


Details

    Description

      When a schema uses `arrow.BinaryTypes.String`, appending many strings and then writing the records out to Parquet makes the program's memory usage grow continuously. The same happens with the other `arrow.BinaryTypes`.

       

      I took a heap dump midway through the run, and the majority of allocations come from `StringBuilder.Append` and are never garbage collected. Memory usage approached 16GB of RAM before I terminated the program.

       

      I was not able to reproduce this behavior with only `arrow.PrimitiveTypes`. Another interesting point: if the records are created but never written with pqarrow, memory does not grow. In the program below, commenting out `w.Write(rec)` avoids the memory growth.

      Example program which causes memory to leak:

      package main
      
      import (
         "os"
      
         "github.com/apache/arrow/go/v9/arrow"
         "github.com/apache/arrow/go/v9/arrow/array"
         "github.com/apache/arrow/go/v9/arrow/memory"
         "github.com/apache/arrow/go/v9/parquet"
         "github.com/apache/arrow/go/v9/parquet/compress"
         "github.com/apache/arrow/go/v9/parquet/pqarrow"
      )
      
      func main() {
         f, _ := os.Create("/tmp/test.parquet")
      
         arrowProps := pqarrow.DefaultWriterProps()
         schema := arrow.NewSchema(
            []arrow.Field{
               {Name: "aString", Type: arrow.BinaryTypes.String},
            },
            nil,
         )
         w, _ := pqarrow.NewFileWriter(schema, f, parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy)), arrowProps)
      
         builder := array.NewRecordBuilder(memory.DefaultAllocator, schema)
         for i := 1; i < 5000000000; i++ {
            builder.Field(0).(*array.StringBuilder).Append("HelloWorld!")
            if i%2000000 == 0 {
               // Write a row group out every 2M rows
               rec := builder.NewRecord()
               w.Write(rec)
               rec.Release()
            }
         }
         w.Close()
      }

       

            People

              zeroshade Matthew Topol
              ssirovica Sasha Sirovica

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h