
ARROW-5791: [Python] pyarrow.csv.read_csv hangs + eats all RAM

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 0.13.0
    • Fix Versions: 0.14.0, 0.14.1
    • Component: Python
    • Environment: Ubuntu Xenial, python 2.7

    Description

      I have quite a sparse dataset in CSV format: a wide table with several rows but many (32k) columns. Total size is ~540 kB.

      When I read the dataset using `pyarrow.csv.read_csv` it hangs, gradually eats all memory and gets killed.

      More details on the conditions follow. The script to run and all mentioned files are under attachments.

      1) `sample_32769_cols.csv` is the dataset that suffers from the problem.

      2) `sample_32768_cols.csv` is the dataset that DOES NOT suffer from it and is read in under 400 ms on my machine. It's the same dataset without the ONE last column. That last column is no different from the others and contains only empty values.

      Why exactly this one column makes the difference between proper execution and a hanging failure that looks like a memory leak, I have no idea.

      I have created a flame graph for case (1) to support resolving this issue (`graph.svg`).
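
      For reference, a minimal sketch of the reproduction (the attached csvtest.py is not shown inline, and the generator below only approximates the attached files from the description: a header row plus a few rows of entirely empty values):

      import pyarrow.csv

      def make_wide_csv(path, num_cols, num_rows=3):
          # Hypothetical generator approximating the attached samples:
          # a header of num_cols column names, then sparse (all-empty) rows.
          with open(path, 'w') as f:
              f.write(','.join('c%d' % i for i in range(num_cols)) + '\n')
              for _ in range(num_rows):
                  f.write(',' * (num_cols - 1) + '\n')

      make_wide_csv('sample_32768_cols.csv', 32768)  # reads in well under a second
      make_wide_csv('sample_32769_cols.csv', 32769)  # hangs and eats all RAM on 0.13.0

      table = pyarrow.csv.read_csv('sample_32768_cols.csv')
      print(table.num_rows, table.num_columns)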

       

      Attachments

        1. sample_32768_cols.csv (537 kB, Bogdan Klichuk)
        2. sample_32769_cols.csv (537 kB, Bogdan Klichuk)
        3. graph.svg (67 kB, Bogdan Klichuk)
        4. csvtest.py (0.1 kB, Bogdan Klichuk)


          Activity

            rokm Rok Mihevc added a comment -

            This issue has been migrated to issue #22212 on GitHub. Please see the migration documentation for further details.

            klichukb Bogdan Klichuk added a comment -

            Thanks a lot! 

            wesm Wes McKinney added a comment -

            Issue resolved by pull request 4762
            https://github.com/apache/arrow/pull/4762

            emkornfield@gmail.com Micah Kornfield added a comment -

            Made a PR; this wasn't really an overflow. Also added a fixed cap at 1000*1024 columns, which should be enough for anyone.
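
            For illustration, a sketch of the kind of guard described above (hypothetical; this is not Arrow's actual implementation, only the constant comes from the comment):

            # Hypothetical sketch, not Arrow's real code: fail fast with a clear
            # error on very wide files instead of misbehaving.
            MAX_CSV_COLUMNS = 1000 * 1024

            def check_num_columns(num_cols):
                if num_cols > MAX_CSV_COLUMNS:
                    raise ValueError(
                        "CSV has %d columns; the maximum supported is %d"
                        % (num_cols, MAX_CSV_COLUMNS))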
            wesm Wes McKinney added a comment -

            I think we should support more columns

            emkornfield@gmail.com Micah Kornfield added a comment -

            I think I know where this is occurring, will try to patch tonight.

            Do we want to support more columns or throw an error?
            wesm Wes McKinney added a comment -

            Evidently there's an int16 overflow somewhere in the Arrow CSV codebase. An initial grep didn't turn up anything obvious.

            Comparisons with other CSV libraries (like pandas) probably are not relevant since there is no code overlap.

            klichukb Bogdan Klichuk added a comment -

            Just to note: I can successfully convert a dataframe (if I read it using pandas) to a pyarrow.Table directly.

            import pandas
            import pyarrow

            df = pandas.read_csv('...', ...)
            table = pyarrow.Table.from_pandas(df)
            klichukb Bogdan Klichuk added a comment -

            bhulette It's a shame I threw away the idea of "maybe this is just a power of 2" and didn't simply try. Great point.

            bhulette Brian Hulette added a comment -

            Thanks for the concise bug report! I haven't had a chance to dig into this very far, but I'm sure it's not a coincidence that 32768 == 2^15. 32767 is the max of a signed 16-bit integer, so if we're assigning a signed int16 index to each column somewhere, it would overflow once you get beyond 32768 columns (since one column gets index 0).

            I'm not sure where exactly that would be happening though. My first inclination was that it would be in the element count for the vector of fields, but according to the flatbuffers page vectors are prefixed by a 32-bit element count.
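
            For example, the suspected wrap-around is easy to see with a 16-bit integer type (illustrative only, using ctypes to emulate the column index):

            import ctypes

            # A signed int16 holds column indices 0..32767 (32768 columns in all);
            # the 32769th column (index 32768) wraps around to a negative value.
            print(ctypes.c_int16(32767).value)  # 32767
            print(ctypes.c_int16(32768).value)  # -32768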


            People

              Assignee: emkornfield@gmail.com Micah Kornfield
              Reporter: klichukb Bogdan Klichuk
              Votes: 0
              Watchers: 6


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2.5h