  Apache Arrow / ARROW-16028

Memory leak in `fragment.to_table`


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.0.1
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None

    Description

      This "pseudo" code ends with OOM.

       

      import fsspec
      import pyarrow
      import pyarrow.parquet as pq
      
      # fsspec S3 filesystem with read caching disabled
      fs = fsspec.filesystem(
          "s3",
          default_cache_type="none",
          default_fill_cache=False,
          **our_storage_options,
      )
      
      # partitioned Parquet dataset in the bucket (new, non-legacy API)
      dataset = pq.ParquetDataset(
          "path in bucket",
          filesystem=fs,
          filters=some_filters,
          use_legacy_dataset=False,
      )
      
      # this ends with OOM
      table = dataset.read(columns=columns_to_read)
      
      # and this too: reading fragment by fragment and concatenating
      tables = []
      for fragment in dataset.fragments:
          tables.append(fragment.to_table(columns=columns_to_read))
      all_data = pyarrow.concat_tables(tables)
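
      A batch-streaming variant keeps the peak memory lower in principle, although we have not verified whether it avoids the leak. This is only a sketch, not our real code, and the per-batch work is a placeholder:

      # Stream record batches from each fragment instead of materializing
      # whole fragments as tables, so only one batch is resident at a time.
      total_rows = 0
      for fragment in dataset.fragments:
          for batch in fragment.to_batches(columns=columns_to_read):
              total_rows += batch.num_rows  # placeholder for real per-batch work
      print("rows processed:", total_rows)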

      What is really strange is that if we put a breakpoint in the loop and load just one fragment, the fragment loads fine, but something keeps eating memory after the load until there is none left.
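
      For reference, this is roughly how we observe it at the breakpoint. The sketch below is illustrative rather than taken from our code, and psutil is used here only to monitor resident memory:

      import gc
      import time
      import psutil
      
      proc = psutil.Process()
      
      # Load a single fragment, as we do at the debug point.
      fragment = dataset.fragments[0]
      table = fragment.to_table(columns=columns_to_read)
      print("loaded", table.num_rows, "rows")
      
      # Even with no further reads, resident memory keeps climbing.
      for _ in range(10):
          gc.collect()
          print("rss MB:", proc.memory_info().rss // (1024 * 1024))
          time.sleep(5)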

      We are trying to read a Parquet table that has several files under the desired partitions. Each fragment has tens of columns and tens of millions of rows.
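
      To help isolate whether the fsspec layer is involved, an equivalent read through pyarrow's native S3 filesystem and dataset API would look roughly like this (the bucket path, region, credentials and filter handling are placeholders, not our real values):

      import pyarrow.dataset as pads
      from pyarrow import fs
      
      # Native S3 filesystem instead of fsspec; region/credentials are placeholders.
      s3 = fs.S3FileSystem(region="us-east-1")
      
      native_dataset = pads.dataset(
          "bucket/path",        # placeholder path in the bucket
          filesystem=s3,
          format="parquet",
          partitioning="hive",
      )
      
      # Note: to_table(filter=...) takes a dataset expression,
      # e.g. pads.field("partition_col") == "value", not the DNF tuple filters.
      table = native_dataset.to_table(columns=columns_to_read)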

       


          People

            Assignee: Unassigned
            Reporter: ondrej metelka (ometelka)
            Votes: 0
            Watchers: 3
