Apache Arrow / ARROW-6059

[Python] Regression memory issue when calling pandas.read_parquet


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.14.0, 0.14.1
    • Fix Version/s: 0.15.0
    • Component/s: Python
    • Labels:
      None

      Description

      I have a ~3 MB Parquet file with the following schema:

      bag_stamp: timestamp[ns]
      transforms_[]_.header.seq: list<item: int64>
        child 0, item: int64
      transforms_[]_.header.stamp: list<item: timestamp[ns]>
        child 0, item: timestamp[ns]
      transforms_[]_.header.frame_id: list<item: string>
        child 0, item: string
      transforms_[]_.child_frame_id: list<item: string>
        child 0, item: string
      transforms_[]_.transform.translation.x: list<item: double>
        child 0, item: double
      transforms_[]_.transform.translation.y: list<item: double>
        child 0, item: double
      transforms_[]_.transform.translation.z: list<item: double>
        child 0, item: double
      transforms_[]_.transform.rotation.x: list<item: double>
        child 0, item: double
      transforms_[]_.transform.rotation.y: list<item: double>
        child 0, item: double
      transforms_[]_.transform.rotation.z: list<item: double>
        child 0, item: double
      transforms_[]_.transform.rotation.w: list<item: double>
        child 0, item: double
      

       If I read it with pandas.read_parquet() using pyarrow 0.13.0, everything is fine and it loads almost instantly. With 0.14.0 or 0.14.1, the same call takes a long time and uses ~10 GB of RAM; often, when there is not enough available memory, the process is simply killed by the OOM killer. However, if I use the following code snippet instead, it works perfectly with all versions:

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Workaround: read the file one row group at a time instead of in a
      # single call, then concatenate. input_file and columns are defined
      # elsewhere in my script.
      parquet_file = pq.ParquetFile(input_file)
      tables = []
      for row_group in range(parquet_file.num_row_groups):
          tables.append(parquet_file.read_row_group(row_group, columns=columns,
                                                    use_pandas_metadata=True))
      df = pa.concat_tables(tables).to_pandas()
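
      For reference, the failing path is a plain pandas.read_parquet() call. The sketch below is self-contained: it writes a tiny stand-in file with a nested list column resembling the schema above (the file name and data are placeholders; reproducing the actual memory blow-up requires the original ~3 MB file).

      ```python
      import os
      import tempfile

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq

      # Tiny stand-in table with a timestamp column and a list<double>
      # column, mirroring part of the schema in the report.
      table = pa.table({
          "bag_stamp": pa.array([0, 1], type=pa.timestamp("ns")),
          "transforms_[]_.transform.translation.x": pa.array(
              [[0.1, 0.2], [0.3]], type=pa.list_(pa.float64())),
      })
      path = os.path.join(tempfile.mkdtemp(), "example.parquet")
      pq.write_table(table, path)

      # The call that regresses on 0.14.x: a one-shot read through pandas
      # with the pyarrow engine.
      df = pd.read_parquet(path, engine="pyarrow")
      print(df.shape)  # (2, 2)
      ```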
      

        Attachments

        1. Memory_profile_0.13_rs.png
          37 kB
          Olivier Giboin
        2. Memory_profile_0.13.png
          105 kB
          Olivier Giboin
        3. Memory_profile_0.14.1_use_thread_false_rs.png
          35 kB
          Olivier Giboin
        4. Memory_profile_0.14.1_use_thread_FALSE.png
          98 kB
          Olivier Giboin
        5. Memory_profile_0.14.1_use_thread_true.png
          104 kB
          Olivier Giboin


              People

              • Assignee:
                Unassigned
                Reporter:
                FJ_Sanchez Francisco Sanchez
              • Votes: 1
              • Watchers: 4
