  1. Apache Arrow
  2. ARROW-5791

[Python] pyarrow.csv.read_csv hangs + eats all RAM

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.14.0, 0.14.1
    • Component/s: Python
    • Environment: Ubuntu Xenial, python 2.7

    Description

      I have a fairly sparse dataset in CSV format: a wide table that has several rows but many (32k) columns. Total size is ~540 kB.

      When I read the dataset using `pyarrow.csv.read_csv`, the process hangs, gradually consumes all available memory, and is eventually killed.

      More details on the conditions follow. The script to run and all files mentioned are attached.

      1) `sample_32769_cols.csv` is the dataset that triggers the problem.

      2) `sample_32768_cols.csv` is the dataset that DOES NOT exhibit the problem and is read in under 400 ms on my machine. It is the same dataset without the ONE last column. That last column is no different from the others and contains only empty values.

      I have no idea why exactly this one column makes the difference between normal execution and a hang that looks like a memory leak.

      I have created a flame graph for case (1) to help resolve this issue (`graph.svg`).
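
      For readers without access to the attachments, the comparison between the two files can be approximated with synthetic data. The following is a minimal sketch, not the attached `csvtest.py` (whose contents are not shown here). The helper names `write_wide_csv` and `read_with_pyarrow`, the column names, and the `fill_every` parameter are all hypothetical, and the sketch assumes Python 3. Whether synthetic data triggers the same behaviour as the attached files is untested.

```python
import csv


def write_wide_csv(path, n_cols, n_rows=5, fill_every=1000):
    """Write a sparse CSV with n_cols columns; most cells are empty.

    Hypothetical helper: the layout of the real sample files is not
    known, so this only mimics "few rows, many mostly-empty columns".
    """
    header = ["col_%d" % i for i in range(n_cols)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for r in range(n_rows):
            # Fill only every `fill_every`-th cell so the table stays sparse.
            row = [str(r) if i % fill_every == 0 else "" for i in range(n_cols)]
            writer.writerow(row)
    return path


def read_with_pyarrow(path):
    """Read the file with pyarrow.csv.read_csv (requires pyarrow)."""
    # Imported lazily so the generator above works without pyarrow installed.
    from pyarrow import csv as pa_csv
    return pa_csv.read_csv(path)
```

      To compare the two cases, one could call `write_wide_csv("wide_32768.csv", 32768)` and `write_wide_csv("wide_32769.csv", 32769)`, then pass each path to `read_with_pyarrow`. On an affected version the second read may hang and exhaust memory, so it is safer to run it in a separate process with a memory limit.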

      Attachments

        1. csvtest.py (0.1 kB, Bogdan Klichuk)
        2. graph.svg (67 kB, Bogdan Klichuk)
        3. sample_32769_cols.csv (537 kB, Bogdan Klichuk)
        4. sample_32768_cols.csv (537 kB, Bogdan Klichuk)

            People

              Assignee: Micah Kornfield (emkornfield@gmail.com)
              Reporter: Bogdan Klichuk (klichukb)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2.5h