Apache Arrow / ARROW-5791

[Python] pyarrow.csv.read_csv hangs + eats all RAM

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.14.0, 0.14.1
    • Component/s: Python
    • Environment:
      Ubuntu Xenial, python 2.7

      Description

      I have a fairly sparse dataset in CSV format: a wide table with only a few rows but many (32k) columns. Total size is about 540 kB.

      When I read the dataset using `pyarrow.csv.read_csv`, it hangs, gradually consumes all available memory, and is eventually killed.

      More details on the conditions follow. The script to run and all the files mentioned are attached.

      1) `sample_32769_cols.csv` is the dataset that triggers the problem.

      2) `sample_32768_cols.csv` is the dataset that does NOT trigger the problem and is read in under 400 ms on my machine. It is the same dataset without ONE last column. That last column is no different from the others and contains only empty values.

      I have no idea why exactly this one column makes the difference between a successful read and a hang that looks like a memory leak.

      I have created a flame graph for case (1) to help resolve this issue (`graph.svg`).
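
For readers without access to the attachments, the two cases above could be approximated with synthetic data. The following is only a minimal sketch, not the attached `csvtest.py`: the helper names `make_wide_csv` and `read_with_pyarrow`, the column names, and the all-empty cells are assumptions made for illustration, and it is written for Python 3 rather than the Python 2.7 environment of the report. Note that 32768 is 2^15, so the two files sit on either side of a power-of-two column count. Whether an all-empty table reproduces the hang the way the attached sparse files do is not confirmed.

```python
import csv
import io


def make_wide_csv(num_cols, num_rows=3):
    """Return CSV text with `num_cols` columns whose data cells are all empty.

    Hypothetical stand-in for the attached sample files, which contain
    real (sparse) data.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Header row: c0, c1, ..., c<num_cols - 1>
    writer.writerow(["c%d" % i for i in range(num_cols)])
    for _ in range(num_rows):
        writer.writerow([""] * num_cols)
    return buf.getvalue()


def read_with_pyarrow(csv_text):
    """Parse CSV text with pyarrow.csv.read_csv and return the resulting Table."""
    # Imported lazily so the generator above works without pyarrow installed.
    import pyarrow as pa
    from pyarrow import csv as pa_csv

    return pa_csv.read_csv(pa.BufferReader(csv_text.encode("utf-8")))
```

Calling `read_with_pyarrow(make_wide_csv(32768))` and then `read_with_pyarrow(make_wide_csv(32769))` would compare the two column counts. On an affected version (0.13.0) the second call may exhaust memory, so it is safer to run it in a separate process with a memory limit.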

       

        Attachments

        1. sample_32768_cols.csv
          537 kB
          Bogdan Klichuk
        2. sample_32769_cols.csv
          537 kB
          Bogdan Klichuk
        3. graph.svg
          67 kB
          Bogdan Klichuk
        4. csvtest.py
          0.1 kB
          Bogdan Klichuk

              People

              • Assignee:
                emkornfield@gmail.com Micah Kornfield
                Reporter:
                klichukb Bogdan Klichuk

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 2.5h