    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.14.0, 0.14.1
    • Fix Version/s: None
    • Component/s: Python
    • Labels:
      None
    • Environment:
      Ubuntu 18.04, 32GB ram, conda-forge installation

      Description

      Reading a Parquet file with large string columns leaks memory until the program crashes. This only seems to affect 0.14.x; it works fine for me in 0.13.0. It might be related to earlier, similar issues, e.g. https://github.com/apache/arrow/issues/2624

      Below is a reprex that works in earlier versions but crashes on read (writing is fine) in 0.14.x. In the real-life data, the strings are URLs.

      Weirdly, it crashes on my 32GB Ubuntu 18.04 machine, but runs (albeit with a very slow read) on my 16GB MacBook.

      Thanks so much for the excellent tools! 

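      import pandas as pd
      
      # 1e6 rows x 10 columns of 100-character random strings (about 1 GB of raw text)
      n_rows = int(1e6)
      n_cols = 10
      col_length = 100
      
      df = pd.DataFrame()
      
      # Fill each column with random fixed-length strings
      for i in range(n_cols):
          df[f'col_{i}'] = pd.util.testing.rands_array(col_length, n_rows)
      
      print('Generated df', df.shape)
      filename = 'tmp.parquet'
      
      # Writing succeeds on 0.14.x
      print('Writing parquet')
      df.to_parquet(filename)
      
      # Reading is the step that crashes on 0.14.x
      print('Reading parquet')
      pd.read_parquet(filename)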
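
For scale, here is a rough back-of-the-envelope estimate of how much memory the reprex data itself should need. This is only a sketch, not a measurement: the helper name `estimate_string_frame_bytes` is hypothetical, and it assumes CPython with each value held as a separate ASCII `str` object, ignoring any buffers that pandas or Arrow allocate on top.

```python
import struct
import sys


def estimate_string_frame_bytes(n_rows, n_cols, col_length):
    """Return (raw_bytes, object_bytes) for a frame of fixed-length ASCII strings.

    raw_bytes    -- the characters alone, one byte each
    object_bytes -- the same values as individual Python str objects,
                    plus one pointer per value in an object array
    """
    n_values = n_rows * n_cols
    raw_bytes = n_values * col_length

    # Size of one str object of this length on the running interpreter
    per_str = sys.getsizeof("a" * col_length)
    # Size of a native pointer (8 bytes on a 64-bit build)
    pointer = struct.calcsize("P")

    object_bytes = n_values * (per_str + pointer)
    return raw_bytes, object_bytes


# Same dimensions as the reprex above
raw_bytes, object_bytes = estimate_string_frame_bytes(int(1e6), 10, 100)
print(f"raw character data:    {raw_bytes / 1e9:.2f} GB")
print(f"as Python str objects: {object_bytes / 1e9:.2f} GB")
```

Under these assumptions the raw characters come to 1 GB and the Python-object form to a small multiple of that, so the data on its own should fit comfortably in 32GB. That suggests the crash comes from allocation during the read rather than from the dataset simply being too large.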
      


            People

            • Assignee:
              Unassigned
            • Reporter:
              dgmp George Prichard
            • Votes:
              0
            • Watchers:
              4
